Public data and digital research ethics

The Verge recently posted an article that highlights some of the ethical dilemmas involved in collecting publicly available data for research purposes. The article begins by describing a researcher’s work on facial recognition of people before and after hormone replacement therapy (HRT):

On YouTube, he found a treasure trove. Individuals undergoing HRT often document their progress and post the results online, sometimes keeping regular diaries, and sometimes making time-lapse videos of the entire process. “I shared my videos because I wanted other trans people to see my transition,” says Danielle, who posted her transition video on YouTube years ago. “These types of transition montages were helpful to me, so I wanted to pay it forward,” she tells The Verge.

At first glance, YouTube videos seem like a perfect dataset for this sort of thing. They’re freely shared and are generally available under a Creative Commons license. However, in the words of Dr. Ian Malcolm:


Again, from the article:

Danielle, who is featured in the dataset and whose transition pictures appear in scientific papers because of it, says she was never contacted about her inclusion. “I by no means ‘hide’ my identity,” she told The Verge using an online messaging service. “But this feels like a violation of privacy.” She said she was gratified to know that there are limits on the use of the dataset (especially that it wasn’t sold to companies), but said this sort of biometric collection had “all sorts of implications for the trans community.”

The idea of having one’s picture, especially a transition picture, appear in scientific papers without one’s consent seems highly problematic. And yet, I’m obviously in favor of using publicly available digital data for research; after all, I rely on public Twitter data for nearly all of my research. So, how can I continue to use this data without crossing any lines? I don’t claim to do this perfectly, but here’s how Josh Rosenberg, Leigh Graves Wolf, and I described our efforts in a recent article:

Twitter and other Internet data provide new ethical challenges for educational (and other) researchers. Inspired by medical research, the concept of human subjects research has long been the distinguishing factor in whether researchers are required to submit their work to institutional review boards (IRBs) for ethical review (Markham and Buchanan, 2012). However, data such as the collection of tweets described above frequently do not qualify as human subjects research; indeed, this study did not require review by an IRB according to the definitions set out by Michigan State University. However, Internet researchers are increasingly vocal in their arguments that existing ethical frameworks are not well suited to digital data (Markham and Buchanan, 2012) and that the limits established by the law are also inadequate for determining what constitutes ethical Internet research (Eynon et al., 2008).

In response to the absence of universal, clear guidelines for Internet data, we have taken explicit steps of our own to report our findings ethically. Most notably, we have tried to avoid the use of direct quotation throughout the paper, even when referring to particular tweets. Twitter’s search function is powerful enough that even a small but distinct quotation may be sufficient for identifying a particular tweet, and while tweets can be considered public documents, we feel that it is important to acknowledge that notions of publicity and privacy on the Internet are mediated by varying expectations, intentions, and contexts (Eynon et al., 2008; Markham and Buchanan, 2012) and that no one has provided explicit consent for their tweets to appear in this paper. When we have chosen to quote from tweets, we have made modifications such as excerpting tweets and removing URLs to personal blog posts in order to preserve as much anonymity as possible.
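As a rough illustration of the kind of modification described in that excerpt, the sketch below strips URLs from a tweet and shortens it before it is quoted. Everything in it is an assumption made for this example: the `excerpt_tweet` helper, the `max_words` cutoff, and the simplified URL pattern are hypothetical and are not the procedure we followed in the paper. Shortening a tweet also doesn’t guarantee that it can’t be found through search, so this is a starting point rather than a safeguard.

```python
import re

# Simplified pattern for http(s) links and bare "www." links.
# This is an assumption for illustration, not a complete URL grammar.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def excerpt_tweet(text: str, max_words: int = 8) -> str:
    """Return a shortened, URL-free excerpt of a tweet (hypothetical helper)."""
    # Remove links first so that personal blog URLs never reach the excerpt.
    without_urls = URL_PATTERN.sub("", text)
    # Splitting on whitespace also collapses the gaps left by removed links.
    words = without_urls.split()
    excerpt = " ".join(words[:max_words])
    # Mark the excerpt as truncated when words were dropped.
    if len(words) > max_words:
        excerpt += " [...]"
    return excerpt


if __name__ == "__main__":
    # Invented sample text, not a real tweet.
    sample = (
        "Just posted my reflections on this week's chat at "
        "https://example.com/my-blog/post so glad to be part of this community"
    )
    print(excerpt_tweet(sample))
```

A shorter `max_words` value makes a quotation harder to match in search but also less informative, so the cutoff is a judgment call for each study.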

There’s a lot more that could be said (and that I ought to write) on this subject, but I was glad to run into this article and get my mind working on it again.

