Lessons learned when Web scraping #GorafiESR tweets

I’ve posted in the past about Web scraping Twitter user profiles, but I took some time last week to tackle something else that I’ve been thinking about: Scraping the tweets themselves. Web scraping tweets is a nifty trick, but it doesn’t necessarily have an obvious application right off the bat.

I wound up doing it because another French hashtag caught my eye: #GorafiESR. Le Gorafi (which I’ve posted about before) is a French satirical news source, the equivalent of The Onion. Recently, they launched the Madame Gorafi spinoff, which inspired one French academic to propose a Higher Education and Research (Enseignement Supérieur et Recherche, or ESR) edition of the periodical instead. The hashtag went viral, with French academics weighing in to suggest increasingly preposterous headlines.

This was too good to pass up, so I’ve started working with my colleague Sarah Gretter to dive into these tweets and find out what people are saying. We have a Twitter Archiving Google Sheet (TAGS) set up to collect the tweets, which is pretty handy for gathering some basic information on them as well.

However, there are a lot of tweets here! Even after limiting our collection to the first 24 hours of the hashtag and filtering out retweets, we still had over 2,800 distinct tweets to look at. It’s not uncommon to work with larger collections of tweets, but that usually relies more on automated methods; since we’ll be reading over these tweets ourselves, we agreed it would be nice to have an automated way of identifying the most important tweets so that we could read those first.

One way to do that would be to judge how many likes and retweets each of these original posts got. Presumably, the tweets that got the most attention would be the ones worth looking at first; then, we could take a look at the rest. Our TAGS collector doesn’t track likes and retweets; Agarwal’s Twitter Archiver does, but only (as far as I know) at the moment the tweet is logged, which is awkward timing that risks undercounting both. I have no doubt you can get these numbers through the Twitter API, but the API limits how many requests you can make in a given window.

So, what do we do? Web scrape.

Here’s a link to the code that I used to Web scrape our tweets. The code skips any inaccessible tweets (which, I’ve learned, can happen in at least two ways: suspended accounts and deleted tweets), so there shouldn’t be any problems there. Plus, in addition to counting likes and retweets, I tweaked the code so that it would also grab (see the sketch after this list):

  • the text of the tweet,
  • the Twitter handle of the user who sent the tweet, and
  • the UNIX timestamp for the tweet.
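
For the curious, here’s a minimal sketch of the approach, assuming the HTML markup Twitter was serving when I wrote the scraper. Every CSS selector and data- attribute below is a guess at Twitter’s pages rather than a stable API, so expect it to break whenever Twitter changes its HTML:

```python
# A minimal sketch of the scraping approach, assuming the markup Twitter
# served at the time of writing. Every selector and data- attribute here
# is a guess at Twitter's HTML, not a stable API, and will break when
# Twitter changes its pages.
import requests
from bs4 import BeautifulSoup

def scrape_tweet(url):
    """Return a dict of metadata for one tweet permalink, or None if the
    tweet is inaccessible (suspended account or deleted tweet)."""
    response = requests.get(url)
    if response.status_code != 200:
        return None  # skip suspended accounts and deleted tweets

    soup = BeautifulSoup(response.text, "html.parser")
    tweet = soup.select_one("div.permalink-tweet")
    if tweet is None:
        return None

    def stat(action):
        # Like and retweet totals appeared in data-tweet-stat-count
        # attributes on the tweet's action buttons.
        node = tweet.select_one(
            "span.ProfileTweet-action--{} "
            "span.ProfileTweet-actionCount".format(action)
        )
        return int(node["data-tweet-stat-count"]) if node else 0

    text_node = tweet.select_one("p.tweet-text")
    time_node = tweet.select_one("span._timestamp")
    return {
        "text": text_node.get_text() if text_node else "",
        "handle": tweet.get("data-screen-name"),
        "unix_time": int(time_node["data-time"]) if time_node else None,
        "retweets": stat("retweet"),
        "likes": stat("favorite"),
    }
```

Note that failures return None instead of raising, which makes it easy to loop over a few thousand permalink URLs (with a polite pause between requests) without babysitting the script.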

Not all of this will be helpful in all situations, but I tried to grab most of the “low-hanging metadata.” I think the UNIX timestamp will be particularly helpful: The date and time that Twitter displays for a tweet varies depending on a few things, including the timezone you’re in, but grabbing UNIX time (which counts seconds from a fixed UTC reference point and so encodes both date and time) might help keep numbers straight if you’re collecting across time zones. It may be possible to grab some more advanced stuff, like replies, but that’s for another day.
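
As a quick illustration, converting one of these scraped timestamps into a readable UTC datetime takes just a couple of lines (the value below is an arbitrary example, not one of our tweets):

```python
# Convert a scraped UNIX timestamp to a human-readable UTC datetime,
# so everyone sees the same date and time regardless of time zone.
from datetime import datetime, timezone

unix_time = 1488203940  # arbitrary example value
print(datetime.fromtimestamp(unix_time, tz=timezone.utc))
# 2017-02-27 13:59:00+00:00
```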

Like I said in the beginning, applications of tweet-scraping aren’t as obvious as applications of profile-scraping. However, I think there are some potential uses for it. For example, why not download someone’s Twitter archive and find out which of their tweets have been the most popular? I’m still working out other possible applications and would love to hear any ideas for how to put this to work!
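
To make that archive idea concrete, here’s a hypothetical sketch. It assumes the archive ships a tweets.csv with a tweet_id column (the real export format may differ) and reuses the scrape_tweet() function from the sketch above:

```python
# A hypothetical follow-up: rank the tweets in a downloaded Twitter
# archive by scraped popularity. The tweets.csv filename and tweet_id
# column are assumptions about the export format, and scrape_tweet()
# is the sketch from earlier in this post.
import csv

def rank_archive(path):
    with open(path, newline="", encoding="utf-8") as f:
        ids = [row["tweet_id"] for row in csv.DictReader(f)]

    scraped = []
    for tweet_id in ids:
        # twitter.com/i/status/<id> redirects to the tweet's permalink
        data = scrape_tweet("https://twitter.com/i/status/{}".format(tweet_id))
        if data:  # skip deleted or otherwise inaccessible tweets
            scraped.append(data)

    # Most-liked-plus-retweeted first
    return sorted(scraped, key=lambda t: t["likes"] + t["retweets"],
                  reverse=True)

# Print a quick top-ten popularity leaderboard
for tweet in rank_archive("tweets.csv")[:10]:
    print(tweet["likes"] + tweet["retweets"], tweet["text"][:60])
```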
