EDIT: Since writing this series of posts, I’ve learned not only that Web scraping violates Twitter’s terms of service but also that it’s really, really easy to use the Twitter API to collect the locations listed in users’ Twitter profiles (simply put, the API limits are much higher than I thought they were). I learned a lot from the Web scraping route, but I’d suggest using the Twitter API for your own location-retrieving needs!
In yesterday’s post, I described how I scraped Twitter users’ profiles to collect the locations (as text strings) that they list in those profiles. This was a fantastic leap forward for my eventual goal of indicating on a map the location of everyone who participated in the #educattentats hashtag, especially considering the lack of geotagging in the tweets that I collected. However, it’s still not quite enough. I need latitude and longitude coordinates for these different locations if I’m going to be able to accurately plot them on the map.
Wouldn’t it be nice if there were a way to feed a text string into a service and get back a location on a map? Well, that sounds an awful lot like Google Maps. The ggmap package in R has a function called geocode(), which sends a text string to either the Google Maps API or the Data Science Toolkit and returns a host of much more useful geographic data. I’d been telling people that I was getting this info from Google Maps, but after taking a closer look at my code, it turns out I was actually using the Data Science Toolkit. I’ve done some quick testing and discovered that the two services return different results for ambiguous text strings, but I haven’t looked into it systematically yet.
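As a minimal sketch of what this looks like (the place names here are just illustrative; note that newer versions of ggmap have dropped the Data Science Toolkit source and require a registered Google API key, so the `source = "dsk"` argument only works in older versions):

```r
library(ggmap)

# Default output is a lat/lon pair; in older ggmap versions, source
# could be "google" or "dsk" (the Data Science Toolkit).
geocode("Paris, France", source = "dsk")

# output = "more" returns additional fields (country, locality, etc.),
# which is handy for checking whether an ambiguous string resolved
# to the country you expected.
geocode("Paris, France", output = "more", source = "google")
```

Comparing the two sources on the same ambiguous string (say, a bare city name that exists in several countries) is a quick way to see the disagreements mentioned above.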
So, this is a really handy way of not-having-to-interpret-and-plot-all-of-these-written-locations-by-hand, but there are still some drawbacks to it. You’ll recall that I’m working with a French hashtag here, so although there is a fair amount of international attention, I’m mostly plotting the locations of French tweeters. France is divided into departments, and French places are identified by INSEE codes, which are like ZIP codes, but BETTER. Well, maybe not better, but I feel like they’re a bigger deal than ZIP codes. Each department has a two- or three-digit code, and each city within that department gets some additional digits tacked onto the end of the department code, so a city’s INSEE code tells you exactly which department it’s in. The department codes also appear on French license plates, so you can tell where a car is registered by checking its plates (I think the current plates also have a department or region logo on them, which spoils the fun of figuring this relationship out for yourself).
Anyway, INSEE codes are a very convenient and pretty informative shorthand for describing where you’re from. It should come as no surprise, then, that some French tweeters use a bare two-digit department code as the location in their Twitter profiles. That makes sense to a human (as long as they’re reasonably familiar with INSEE codes), but automated geocoders don’t handle these codes well and can place these tweeters somewhere else entirely. Now that I know I’m working with the Data Science Toolkit, I need to test whether Google Maps does any better at this, but I suspect it will always be a problem. There are also issues with people who list several locations and, of course, those who list somewhere fake. Fake locations are actually the easiest to deal with, since the function usually just returns a “does not compute” sort of response, so I’m more worried about people who live in France but get plotted in China because they used an INSEE code instead of a written address.
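One way to catch these cases before geocoding is to flag location strings that are nothing but a short number, since those are likely department codes rather than real place names. This is a hypothetical helper of my own, not something from ggmap:

```r
# Flag location strings that are just a bare one- to three-digit
# number (likely an INSEE department code), so they can be handled
# separately instead of being sent to the geocoder as-is.
looks_like_department_code <- function(location) {
  grepl("^[0-9]{1,3}$", trimws(location))
}

looks_like_department_code("75")            # TRUE  (75 = Paris)
looks_like_department_code("Lyon, France")  # FALSE
```

Strings caught this way could then be matched against a lookup table of department names before geocoding, rather than trusting the geocoder to interpret a bare “75” correctly.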
In my next post (the last on this subject), I’ll show you what I came up with!