A couple of weeks ago, I posted about some of the ethical dilemmas involved in using public data for research, using the example of facial recognition researchers who used YouTube videos of people undergoing hormone replacement therapy to improve their algorithms’ ability to recognize the same face before and after transition.
Since reading that article, I’ve seen the occasional tweet about another facial recognition project that purports to be able to infer sexual orientation from facial analysis. Like the other project, this one uses public data and has some worrying implications. Over the weekend, though, I sat down to read an article about the project and learned a few things that I hadn’t picked up from tweetskimming. First, as with much research, the “success” of this project is much higher in a controlled lab setting than in the real world. Second, and more importantly, the researchers claim to have carried out this project precisely in order to raise the alarm bells that they did:
Dr Kosinski says he conducted the research as a demonstration, and to warn policymakers of the power of machine vision. It makes further erosion of privacy “inevitable”; the dangers must be understood, he adds. Spouses might seek to know what sexuality-inferring software says about their partner (the word “gay” is 10% more likely to complete searches that begin “Is my husband…” than the word “cheating”). In parts of the world where being gay is socially unacceptable, or illegal, such software could pose a serious threat to safety. Dr Kosinski is at pains to make clear that he has invented no new technology, merely bolted together software and data that are readily available to anyone with an internet connection. He has asked The Economist not to reveal the identity of the dating website he used, in order to discourage copycats.
This description reminds me of “white hat hacking,” hacking that tries to break a system’s security but with the purpose of calling attention to the identified weaknesses so that the system can be strengthened. Dr. Kosinski seems to be doing something similar: carrying out a project with troubling implications but with the purpose of calling our attention to those implications and helping us deal with them before someone else (perhaps a “black hat” with malicious intent) does the same thing.
When it comes to Twitter research, I’ll admit that I’ve done some of the same thing on a much smaller scale. It started at a colleague’s practice job talk, when he mentioned the steps that he took to anonymize the teenage tweeters that he was studying. I took that as a challenge, and the next time he quoted a tweet in a slide, I opened up Twitter, typed in a distinctive phrase, and managed to find out who at least one of the “anonymized” participants was. I’ve kept up this habit when reading papers or watching presentations that quote “anonymized” tweets, and I’ve usually had some success identifying one or more of the participants (of course, in one case, it was one of my own tweets that had been quoted by a fellow MSU researcher, so that one wasn’t terribly difficult). On one hand, doing this (i.e., intentionally breaking someone’s efforts at anonymizing research participants) kind of makes me a jerk. On the other hand, I’m trying in part to draw attention to how hard it is to truly anonymize Twitter data.
So, white hat research ethics violations… are they a thing? Do we need more of them? Are they really research ethics violations? Plenty of food for thought.