Plotting Twitter users’ locations in R (part 2: geotags vs. Web scraping vs. API)

EDIT: Since writing this series of posts, I’ve learned not only that Web scraping violates Twitter’s terms of service but also that it’s really, really easy to use the Twitter API to collect the locations listed in users’ Twitter profiles (simply put, the API limits are much higher than I thought they were). I learned a lot from the Web scraping route, but I’d suggest using the Twitter API for your own location-retrieving needs!

Yesterday, I mentioned discovering the French hashtag #educattentats that was created in the wake of the 13 November terrorist attacks. As far as I can tell, I discovered the hashtag shortly after it was created, so it’s been interesting to see how use of the hashtag has grown in the hours, days, and weeks since.

Inspired by a project in the class I’m taking on Internet research methods, I decided to see if I could plot locations for all of the Twitter users who have either included this hashtag in one of their own tweets or retweeted a tweet including the hashtag. My long-term goal with this would be to split the tweets by units of time to see how use of the hashtag spread (and eventually shrank?) geographically over time. I haven’t gotten that far yet, but I am happy with what I’ve done so far (you can find the code here). I’d like to highlight a couple of parts of this process, since they represent a couple of tricks that I plan to use in the future and that may be useful to others as well.

The easiest way to find someone’s location on Twitter is by identifying geotagged tweets, as my advisor, Dr. Matt Koehler, has done to great effect on his blog. However, of the 6000 tweets I’ve collected so far, there is just one that has a geotag (and it was collected after I started exploring the data), so geotags are of no use to me at all.

Fortunately, most Twitter users list a location in their profile. There are some obvious validity issues with this (tweeters can easily specify as their location a town they don’t live in or even say that they live on Mars), but for now, I’m choosing to assume that most Twitter users are honest and accurate when specifying a location. The twitteR package in R can easily retrieve a Twitter user’s location given a username, but it does so through the Twitter API, which means that you have to be careful not to send too many requests in too short a time. I’m impatient, and this is also a pretty straightforward task, so I went another route.
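For reference, here is a rough sketch of what the API route might look like. It is not the code I used for this project. It assumes that the twitteR package is installed and that you have already authenticated with setup_twitter_oauth() using your own credentials; the helper names (chunk_users, fetch_with_twitteR, get_profile_locations), the batch size of 100, and the one-second pause are my own illustrative choices, not anything twitteR requires.

```r
# Split a character vector of usernames into batches of at most `size`.
chunk_users <- function(users, size = 100) {
  if (length(users) == 0) return(list())
  unname(split(users, ceiling(seq_along(users) / size)))
}

# Default fetcher (assumption: twitteR is installed and authenticated).
# lookupUsers() returns a list of user objects; twListToDF() turns that
# list into a data frame that includes a `location` column.
fetch_with_twitteR <- function(batch) {
  found <- twitteR::lookupUsers(batch)
  if (length(found) == 0) {
    return(data.frame(screenName = character(),
                      location = character(),
                      stringsAsFactors = FALSE))
  }
  twitteR::twListToDF(found)[, c("screenName", "location")]
}

# Look up profile locations batch by batch, pausing between requests so
# that we do not send too many requests in too short a time.
get_profile_locations <- function(users,
                                  fetch_batch = fetch_with_twitteR,
                                  pause = 1) {
  batches <- chunk_users(unique(users))
  results <- lapply(seq_along(batches), function(i) {
    if (i > 1) Sys.sleep(pause)
    fetch_batch(batches[[i]])
  })
  do.call(rbind, results)
}
```

Passing the fetcher in as an argument means you can try the batching logic with a stand-in function before spending any real API requests.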

Each Twitter user has a distinct URL associated with her profile page, so if you have a list of usernames (which is pretty easy to get from a TAGS archiver), you can easily access each of those profile pages. So, using the XML and rvest packages, I fed my code the URL for the profile page of each of the users involved with this hashtag. I used the read_html() function to get the HTML code for each of those pages, then used the html_nodes() function and XPath to find the parts of the page where the location of the user is stored. With a little bit of cleaning, I soon had myself a list of locations.
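To give a sense of the read_html() / html_nodes() / XPath step, here is a minimal sketch that parses a made-up HTML snippet instead of fetching a live profile page. The markup, the "profile-location" class name, and the extract_location() helper are all hypothetical: Twitter's real page structure is different and changes over time, so the XPath would need to be adapted. The sketch only uses rvest.

```r
library(rvest)

# A made-up stand-in for a downloaded profile page (NOT Twitter's real markup).
profile_html <- '<html><body>
  <div class="profile">
    <h1 class="profile-name">Example User</h1>
    <span class="profile-location">  Lyon, France
    </span>
  </div>
</body></html>'

# Parse an HTML source and return the text of the first node matching the
# (hypothetical) location XPath, or NA if the profile lists no location.
extract_location <- function(html_source) {
  page <- read_html(html_source)
  nodes <- html_nodes(page, xpath = '//span[contains(@class, "profile-location")]')
  if (length(nodes) == 0) return(NA_character_)
  # trim = TRUE strips the leading and trailing whitespace around the text
  html_text(nodes[[1]], trim = TRUE)
}

location <- extract_location(profile_html)
```

Returning NA for profiles without a location keeps the result the same length as the list of usernames, which makes the later cleaning step easier.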

Obviously, that’s not the whole picture. All I had at this point was a list of character strings specifying some location, and that’s not enough to place them on a map unless I want to do it by hand. So, in tomorrow’s post, I’ll discuss converting those strings into mappable locations!

9 thoughts on “Plotting Twitter users’ locations in R (part 2: geotags vs. Web scraping vs. API)”

  1. hey, can u share a code, we are doing a similar project for insurgents in india, we have collected like 200 redlist hashtags along with 200+ antisocial twitter handles, how do we use this to geotag location using code?

    1. Hi, there’s a link in the post to the R code that I used. You’ll have to tweak it to provide it with the handles that you have in mind, but it should work from there. Please keep in mind, though, that this actually isn’t related to geotags… you’d have to find different code for that. This will just help plot points listed in Twitter profiles.

  2. I need to find the location of all the users whose tweets are in a CSV file, but I don’t have usernames in that file. Please help me find the location of all users. The file contains text, favourite, retweet, id, screen name, longitude, latitude. I need to get the location details of the tweets. Please provide me code for that.

    1. Could you tell me what you mean by “screen name”? Is that not the username that you’re looking for? Alternatively, the ID might be able to point you to individual tweets and you could scrape those pages to find the usernames.

    1. Hello,

      The CSV file is one that I built myself based on data that I had collected. You’ll want to replace it with a CSV of your own data, which might require tweaking the code.



  3. Hello sir,
    I’m facing some problems in extracting the tweets. Firstly, the same tweets get repeated several times when downloading from TAGS and also in R. Secondly, I tried adding the code (var advParams = {"lang": "en"};) to the script to keep only the tweets posted in English, but it didn’t work. I got only 83 locations from 200 tweets. So, is there any way to avoid repetition of tweets, and is there a more efficient way of filtering tweets by language?

    1. Hi Girish,

      Sorry for the delay! I haven’t checked my comments in a while!

      I’ve also run into trouble with tweets being duplicated in TAGS, but it’s pretty straightforward to take care of that in R once you’ve uploaded the data. One possible way to do that is with a snippet of code like this one (which I’ve borrowed from my colleague Josh Rosenberg): df <- df[!duplicated(df$id_str), ].
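Here is that de-duplication idea on a tiny made-up data frame, as a sketch. The column names mirror a TAGS export, but the values are invented. I keep id_str as character on purpose: tweet IDs are too large to store exactly as ordinary R numbers, so reading them as numeric can make different tweets look identical.

```r
# Invented example data with two repeated tweet IDs.
df <- data.frame(
  id_str = c("101", "102", "101", "103", "102"),
  text   = c("first", "second", "first", "third", "second"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an ID after its first appearance,
# so negating it keeps exactly one row per tweet.
df <- df[!duplicated(df$id_str), ]
```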

      As for filtering tweets by language, the "user_lang" column in a TAGS archive refers to the language that someone has their Twitter account set to, not necessarily the language that they're writing in. I'd suggest using the textcat() function from the textcat R package to identify the language of tweets.
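A minimal sketch of that suggestion, assuming the textcat package is installed. The example tweets and the detected_lang column name are made up, and I'm assuming that textcat's default profiles label English text as "english". Language guessing on very short texts like tweets can be unreliable, so it's worth spot-checking the results.

```r
library(textcat)

# Invented example tweets; in practice this would be the text column
# of your TAGS archive.
tweets <- data.frame(
  text = c(
    "This is a fairly long sentence written in the English language.",
    "Ceci est une phrase assez longue qui est écrite en français."
  ),
  stringsAsFactors = FALSE
)

# textcat() returns one language guess (or NA) per input string.
tweets$detected_lang <- textcat(tweets$text)

# Keep only the rows guessed to be English, dropping NA guesses.
english_only <- tweets[!is.na(tweets$detected_lang) &
                         tweets$detected_lang == "english", ]
```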

      Hope this helps!

