Plotting Twitter users’ locations in R (part 4: the results)

EDIT: Since writing this series of posts, I’ve learned not only that Web scraping violates Twitter’s terms of service but also that it’s really, really easy to use the Twitter API to collect the locations listed in users’ Twitter profiles (simply put, the API limits are much higher than I thought they were). I learned a lot from the Web scraping route, but I’d suggest using the Twitter API for your own location-retrieving needs!

After a week of going through the ins and outs of my experience with the #educattentats hashtag, it’s time to see some results!

Yesterday, I discussed how I used the ggmap package to turn the text string “locations” in users’ Twitter profiles into latitude and longitude coordinates. Once you have those, plotting them on a map is pretty straightforward.

That said, I did run into one problem, which is neatly summed up in this tweet:

As one does when working with data that’s taken a long time to clean up, once I had my list of Twitter handles and geographic coordinates, I saved it as a CSV file so that I wouldn’t have to repeat the whole process if I wanted to come back to the end steps. However, when R reads in a CSV file, the default option is to interpret all text strings as factors. Essentially, R is assuming that text strings represent categorical variables that you want to eventually turn into dummy variables, so it counts up the different strings that you have in a CSV file and assigns them to integers (you can read more about that here). That’s helpful if you’re doing something like a regression analysis, but less helpful if you’re plotting things on a map.
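To make that concrete, here is a minimal sketch of the behavior. The handles, column names, and values are made up for illustration, and the CSV is read from an inline string rather than a file. Note that `stringsAsFactors = TRUE` is passed explicitly here because R 4.0.0 and later changed the default to `FALSE`.

```r
# Hypothetical data: handles, column names, and values are invented.
csv_text <- "handle,location
@user_a,Paris
@user_b,Lyon
@user_c,Paris"

# Mimic the old default (R versions before 4.0.0) by asking for factors.
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factors$location)       # "factor"
levels(as_factors$location)      # the distinct strings: "Lyon" "Paris"
as.integer(as_factors$location)  # the integer code for each row: 2 1 2

# Ask for plain character strings instead.
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_strings$location)       # "character"
```

The integer codes are just positions in the list of levels, which is what makes them useful for modeling and useless as coordinates.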

I’m not entirely sure how the latitudes and longitudes got turned into strings (instead of numbers) in the first place, but between turning into strings and then turning into integer factor codes through stringsAsFactors, I was getting a screwy map. Using the maps package in R, I managed to get points on a map, but they all looked like this:

[Image: Rplot04]

You’ll notice not only that the points are clearly not in France, where most of them should be, but also that they are all north of the equator and east of the prime meridian. This gives us a hint as to what’s going on. stringsAsFactors is assigning each of these strings an integer code, and those codes are never negative, so all of the latitude and longitude values are now positive, which means north of the equator and east of the prime meridian. The new integers are also essentially arbitrary, which makes them entirely inaccurate as actual geographical data.
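The effect is easy to reproduce. The sketch below uses made-up coordinates stored as text (these are assumptions for illustration, not values from my data) and shows that calling `as.numeric()` on a factor returns the level codes rather than the numbers the strings spell out.

```r
# Made-up coordinates stored as text; one latitude and one longitude are negative.
lat_text <- c("48.8566", "-33.8688", "45.5017")
lon_text <- c("2.3522", "151.2093", "-73.5673")

lat_factor <- factor(lat_text)
lon_factor <- factor(lon_text)

# as.numeric() on a factor gives the integer level codes, not the coordinates.
lat_codes <- as.numeric(lat_factor)
lon_codes <- as.numeric(lon_factor)
range(lat_codes)  # codes run from 1 up to the number of distinct strings
range(lon_codes)  # no negative values survive

# Going through as.character() first recovers the real numbers.
lat_values <- as.numeric(as.character(lat_factor))
lon_values <- as.numeric(as.character(lon_factor))
```

Every code is a positive whole number, so a point that belongs in the southern or western hemisphere ends up in the northeast quadrant of the map.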

Fortunately, the fix was easy enough. I set stringsAsFactors to FALSE when importing the CSV, and then applied the as.numeric function to the latitude and longitude values when plotting the points on the map. That resulted in this map:

[Image: worldMapCropped]
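A rough sketch of that fix follows. It is not the exact code behind the map above: the handles, column names, and coordinates are placeholders, and the CSV is read from an inline string so that the example is self-contained. The plotting step runs only if the maps package is installed.

```r
# Placeholder data standing in for the saved CSV of handles and coordinates.
csv_text <- "handle,lat,lon
@user_a,48.8566,2.3522
@user_b,45.5017,-73.5673
@user_c,14.6928,-17.4467"

# Keep text as text instead of converting it to factors.
users <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Make sure the coordinates are numeric before plotting.
users$lat <- as.numeric(users$lat)
users$lon <- as.numeric(users$lon)

# Draw a world map and add one point per user (x = longitude, y = latitude).
if (requireNamespace("maps", quietly = TRUE)) {
  maps::map("world", fill = TRUE, col = "grey90")
  points(users$lon, users$lat, col = "red", pch = 20)
}
```

With a real file, `text = csv_text` would be replaced by the path to the CSV.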

This makes a lot more sense. A lot of concentration in France (as is expected), some in Québec and Francophone West Africa, and some general attention to this hashtag from tweeters throughout the world (including a dot in Michigan, which is me!). Now, as mentioned yesterday, some of these points are definitely wrong because of problems with conversion from text strings to geographic coordinates, but this still isn’t a bad representation of what’s going on.

Here’s a closer view of France, using the get_googlemap() and ggmap() functions from the ggmap package. The great thing about the get_googlemap() function is that I can really fine-tune this kind of map in a way that I can’t with the maps package, from the starting location (France, Michigan, etc.) to the level of zoom (city, country, entire continent, etc.).

[Image: FrancePlot]
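A map along these lines could be sketched as follows. This is an illustration under several assumptions rather than the code behind the image above: the function name, the sample points, the rough center of France, the zoom level, and the `GOOGLE_MAPS_API_KEY` environment variable are all hypothetical. Recent versions of ggmap also expect a Google API key registered with `register_google()`, so the map is requested only if a key is available.

```r
# Hypothetical helper: fetch a Google base map and overlay points on it.
# points_df is assumed to have numeric columns named lon and lat.
plot_points_on_google_map <- function(points_df,
                                      center = c(lon = 2.35, lat = 46.6),  # roughly central France
                                      zoom = 5) {                          # smaller = wider view
  base_map <- ggmap::get_googlemap(center = center, zoom = zoom)
  ggmap::ggmap(base_map) +
    ggplot2::geom_point(data = points_df,
                        ggplot2::aes(x = lon, y = lat),
                        colour = "red", size = 2)
}

# Made-up points for illustration.
users <- data.frame(lon = c(2.3522, 4.8357, 5.3698),
                    lat = c(48.8566, 45.7640, 43.2965))

# Only contact Google if a key is set and ggmap is installed.
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")  # hypothetical variable name
if (nzchar(api_key) && requireNamespace("ggmap", quietly = TRUE)) {
  ggmap::register_google(key = api_key)
  print(plot_points_on_google_map(users))
}
```

Changing `center` and `zoom` is what moves the view from a whole country to a single city.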

Again, no surprises here. A lot of concentration in Paris, which is to be expected, since that’s where the attentats (terrorist attacks) in #educattentats took place and since, well, Paris is a big city. Some respectable clusters in other large French cities, and we also see some dots in neighboring Francophone (and even non-Francophone) countries.

So, we’ve got a start. I still haven’t gotten to my eventual goal of showing how these points change over time, but I feel like I’ve laid a good foundation and hope that what I’ve picked up is helpful for you, too.

7 thoughts on “Plotting Twitter users’ locations in R (part 4: the results)”

  1. Yes, StringsAsFactors has caught me more than once, too. Argh!

    I really enjoyed this little series. Keep them coming!

  2. Hello Spencer,
    Really cool tutorials! I was wondering if you have the R code example for parts 3 and 4 and if you’d be so kind as to share it here. This would help me a lot with my master’s dissertation.
    Best,
    Marcia

  3. Hello Spencer,
    Thanks for the code. I will experiment with it soon for my dissertation. I am attempting to examine scientists’ communication networks on Twitter in the lexicon of research performance and geographical proximity. I am a newbie to programming as my background is in the human geographies. So, I really appreciate that you share your knowledge with the world.

    Keep it up please :)
    All the best,
    Marcia

  4. Hello again Spencer,
    The R code you provided looks interesting! Do you know how reliable the location-data is that this code provides? Are we talking about 100% accuracy or something more like 50 – 80%?
    Best,
    Marcia

  5. Hi Spencer, I have a bit of a problem with my bachelor’s dissertation. I collected a lot of data from Twitter and then converted it into a CSV file. The problem I have now is that my data does not have the latitude and longitude for all of the locations. I was just wondering if you could help me in some way. That would be great. Thanks!

    1. Hi Sera,

      Not all tweets are geotagged, so if you’re collecting data from the Twitter API, you’re only going to get longitude and latitude for some of the tweets.

      Best,

      Spencer
