EDIT: Since writing this series of posts, I’ve learned not only that Web scraping violates Twitter’s terms of service but also that it’s really, really easy to use the Twitter API to collect the locations listed in users’ Twitter profiles (simply put, the API limits are much higher than I thought they were). I learned a lot from the Web scraping route, but I’d suggest using the Twitter API for your own location-retrieving needs!
After a week of going through the ins and outs of my experience with the #educattentats hashtag, it’s time to see some results!
Yesterday, I discussed how I used the ggmap package to turn the text string “locations” in users’ Twitter profiles into latitude and longitude coordinates. Once you have those, plotting them on a map is pretty straightforward.
That said, I did run into one problem, which is neatly summed up in this tweet:
stringsAsFactors <- STUPID
— Spencer Greenhalgh (@spgreenhalgh) November 22, 2015
As one does when working with data that’s taken a long time to clean up, once I had my list of Twitter handles and geographic coordinates, I saved it as a CSV file so that I wouldn’t have to repeat the whole process if I wanted to come back to the end steps. However, when R reads in a CSV file, the default option (in versions of R before 4.0.0) is to interpret all text strings as factors. Essentially, R is assuming that text strings represent categorical variables that you want to eventually turn into dummy variables, so it counts up the different strings that you have in a CSV file and assigns them to integers (you can read more about that here). That’s helpful if you’re doing something like a regression analysis, but less helpful if you’re plotting things on a map.
I’m not entirely sure how the latitudes and longitudes got turned into strings (instead of numbers) in the first place, but between turning into strings and then turning into dummy code integers through stringsAsFactors, I was getting a screwy map. Using the maps package in R, I managed to get points on a map, but they all looked like this:
You’ll notice not only that the points are clearly not in France, where most of them should be, but also that they are all to the north of the equator and to the east of the prime meridian. This gives us a hint as to what’s going on. stringsAsFactors is assigning all of these strings integer values, and there’s no good reason to have negative numbers for integers representing dummy codes, so all of the latitude and longitude values are now positive, which means north of the equator and east of the prime meridian. The new integers are also essentially arbitrary, which makes them entirely inaccurate as actual geographical data.
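To make the problem concrete, here is a minimal sketch of what happens when coordinate strings become factors. The latitude values are made up for illustration (they are not from the #educattentats data), and the variable names are my own.

```r
# Hypothetical latitudes stored as text, including one in the southern hemisphere.
lat_strings <- c("48.8566", "46.8139", "14.6928", "-33.8688")

# This is what stringsAsFactors = TRUE does to a text column.
lat_factor <- factor(lat_strings)

# Converting the factor directly returns its internal integer codes
# (always positive, one per distinct string), not the original numbers.
codes <- as.numeric(lat_factor)
print(codes)

# Going through as.character() first recovers the real coordinates.
recovered <- as.numeric(as.character(lat_factor))
print(recovered)
```

The codes are all positive whole numbers between 1 and the number of distinct strings, which is why every point lands north of the equator and east of the prime meridian.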
Fortunately, the fix was easy enough. I set stringsAsFactors to FALSE when importing the CSV, and then applied the as.numeric function to the latitude and longitude columns when plotting the points on the map. That resulted in this map:
This makes a lot more sense. A lot of concentration in France (as is expected), some in Québec and Francophone West Africa, and some general attention to this hashtag from tweeters throughout the world (including a dot in Michigan, which is me!). Now, as mentioned yesterday, some of these points are definitely wrong because of problems with conversion from text strings to geographic coordinates, but this still isn’t a bad representation of what’s going on.
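Here is a rough sketch of that fix, not my exact script. The handles, approximate coordinates, and column names are invented for the example, and colClasses is only there to mimic coordinate columns that arrive as text; the plotting step assumes the maps package is installed.

```r
# A tiny stand-in for the saved CSV file (hypothetical handles and coordinates).
csv_text <- "handle,lat,lon
@user_a,48.8566,2.3522
@user_b,46.8139,-71.2080
@user_c,14.6928,-17.4467"

# Read the text columns as plain character strings rather than factors.
users <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  colClasses = "character"  # assumption: forces the coordinates to be strings
)

# Character strings convert cleanly to numbers, negative signs included.
users$lat <- as.numeric(users$lat)
users$lon <- as.numeric(users$lon)

# Plot on a world map if the maps package is available.
if (requireNamespace("maps", quietly = TRUE)) {
  maps::map("world")
  points(users$lon, users$lat, pch = 19, col = "red")
}
```

With a real file, the first argument to read.csv would be the file path instead of text =. In R 4.0.0 and later, stringsAsFactors already defaults to FALSE, so spelling it out mostly matters for older versions.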
Here’s a closer view of France, using the get_googlemap() and ggmap() functions from the ggmap package. The great thing about the get_googlemap() function is that I can really fine-tune this kind of map in a way that I can’t with the maps package, from the starting location (France, Michigan, etc.) to the level of zoom (city, country, entire continent, etc.).
Again, no surprises here. A lot of concentration in Paris, which is to be expected, since that’s where the attentats (terrorist attacks) in #educattentats took place and since, well, Paris is a big city. Some respectable clusters in other large French cities, and we also see some dots in neighboring Francophone (and even non-Francophone) countries.
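For anyone who wants to try the closer view, this is roughly how get_googlemap() and ggmap() fit together. It is a sketch under a few assumptions: the coordinates are invented, the center and zoom are my guesses at a reasonable view of France, and the Google Maps API key is assumed to live in an environment variable I have named GOOGLE_MAPS_KEY (current versions of ggmap need a key registered with register_google()). The plot only runs if ggmap is installed and the key is set.

```r
# Hypothetical coordinates standing in for the geocoded profile locations.
coords <- data.frame(
  lon = c(2.3522, 4.8357, 5.3698),
  lat = c(48.8566, 45.7640, 43.2965)
)

# Assumption: the API key is stored in this (hypothetically named) variable.
api_key <- Sys.getenv("GOOGLE_MAPS_KEY")

if (requireNamespace("ggmap", quietly = TRUE) && nzchar(api_key)) {
  ggmap::register_google(key = api_key)

  # center sets the starting location; zoom sets how far in the map goes.
  france <- ggmap::get_googlemap(center = c(lon = 2.2, lat = 46.6), zoom = 6)

  # Draw the base map, then layer the points on top of it.
  print(
    ggmap::ggmap(france) +
      ggplot2::geom_point(
        data = coords,
        ggplot2::aes(x = lon, y = lat),
        colour = "red", alpha = 0.6
      )
  )
}
```

Changing center and zoom is all it takes to move from a country-level view to a single city or an entire continent.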
So, we’ve got a start. I still haven’t gotten to my eventual goal of showing how these points change over time, but I feel like I’ve laid a good foundation and hope that what I’ve picked up is helpful for you, too.