EDIT: Since writing this series of posts, I’ve learned not only that Web scraping violates Twitter’s terms of service but also that it’s really, really easy to use the Twitter API to collect the locations listed in users’ Twitter profiles (simply put, the API limits are much higher than I thought they were). I learned a lot from the Web scraping route, but I’d suggest using the Twitter API for your own location-retrieving needs!
Yesterday, I mentioned discovering the French hashtag #educattentats that was created in the wake of the 13 November terrorist attacks. As far as I can tell, I discovered the hashtag shortly after it was created, so it’s been interesting to see how use of the hashtag has grown in the hours, days, and weeks since.
Inspired by a project in the class I’m taking on Internet research methods, I decided to see if I could plot locations for all of the Twitter users who have either included this hashtag in one of their own tweets or retweeted a tweet including the hashtag. My long-term goal with this would be to split the tweets by units of time to see how use of the hashtag spread (and eventually shrank?) geographically over time. I haven’t gotten that far yet, but I am happy with what I’ve done so far (you can find the code here). I’d like to highlight a couple of parts of this process, since they represent tricks that I plan to use in the future and that may be useful to others as well.
The easiest way to find someone’s location on Twitter is by identifying geotagged tweets, as my advisor, Dr. Matt Koehler, has done to great effect on his blog. However, of the 6,000 tweets I’ve collected so far, only one has a geotag (and it was collected after I started exploring the data), so geotags are of no use to me at all.
Fortunately, most Twitter users list a location in their profile. There are some obvious validity issues with this (tweeters can easily specify as their location a town they don’t live in, or even claim to live on Mars), but for now, I’m choosing to assume that most Twitter users are honest and accurate when specifying a location. The twitteR package in R can easily retrieve a Twitter user’s location given a username, but it does so through the Twitter API, which means that you have to be careful not to send too many requests in too short a time. I’m impatient, and this is also a pretty straightforward task, so I went another route.
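For anyone who does want the API route that the edit above recommends, here’s a minimal sketch using twitteR’s lookupUsers(), which fetches profiles in batches of up to 100 per request and so stays well within the rate limits. The screen names below are hypothetical placeholders, and the code assumes you’ve already authenticated with setup_twitter_oauth() using your own credentials.

```r
library(twitteR)

# Assumes you have already run setup_twitter_oauth() with your own
# consumer key/secret and access token/secret.

# Hypothetical screen names -- in practice, these would come from
# your TAGS archive of the hashtag.
usernames <- c("someuser1", "someuser2")

# lookupUsers() retrieves up to 100 profiles per API call, so even a
# few thousand usernames only takes a few dozen requests.
users <- lookupUsers(usernames)

# Each returned user object carries the self-reported profile location.
locations <- sapply(users, function(u) u$location)
```

Batching is the key difference from calling getUser() once per username: it keeps the number of API calls (and thus rate-limit pressure) low.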
Each Twitter user has a distinct URL associated with her profile page, so if you have a list of usernames (which is pretty easy to get from a TAGS archiver), you can easily access each of those profile pages. So, using the XML and rvest packages, I fed my code the URL for the profile page of each of the users involved with this hashtag. I used the read_html() function to get the HTML code for each of those pages, then used the html_nodes() function and XPath to find the parts of the page where the location of the user is stored. With a little bit of cleaning, I soon had myself a list of locations.
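The extraction step looks roughly like this. The HTML snippet and the class name below are illustrative stand-ins for Twitter’s actual profile markup (which you’d confirm by inspecting a live profile page in your browser), so treat this as a sketch of the technique rather than a drop-in scraper.

```r
library(rvest)

# A toy stand-in for a downloaded profile page; in the real pipeline,
# you'd call read_html() on each profile URL instead.
profile_html <- '<div class="ProfileHeaderCard">
  <span class="ProfileHeaderCard-locationText">Paris, France</span>
</div>'

page <- read_html(profile_html)

# Use XPath to find the node holding the self-reported location.
# The class name here is an assumption about the page's markup.
location <- page %>%
  html_nodes(xpath = '//span[contains(@class, "ProfileHeaderCard-locationText")]') %>%
  html_text(trim = TRUE)
```

Looping this over each profile URL (with a polite pause between requests) yields one location string per user, which is where the cleaning step comes in.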
Obviously, that’s not the whole picture. All I had at this point was a list of character strings specifying some location, and that’s not enough to place them on a map unless I want to do it by hand. So, in tomorrow’s post, I’ll discuss converting those strings into mappable locations!