Chapter 3 Exploring the world of user-generated data
3.1 Video
3.2 Transcript (Jeremy Kidwell speaking)
You’ve probably heard by now of the company “Cambridge Analytica”, recently renamed to Emerdata. As several media outlets reported in 2017, this little-known firm surprised many by claiming that their “evolutionary approach to data-driven communication has played such an integral part in President-elect Trump’s extraordinary win.” As details emerged, it became clear that this was not mere bluster: the firm had managed to amass a trove of personal data about individuals (up to 5,000 pieces of data on each American citizen, as the Washington Post suggested) and had then sought to nudge or manipulate voting behaviours by creating highly targeted content, including ads on major social media platforms and so-called “fake news” stories.
Data ethics is always easier in hindsight, but I’d like to nonetheless look into the structure of this data collection to raise some issues about how data gets “out there” in the first place.
Facebook is a central character in this story about data, and this isn’t surprising given its dominance of internet communication in recent years. In some surveys, more people claim to be using Facebook than claim to be using the internet. While this is logically impossible (Facebook is merely a service which sits on top of the internet, at least for now), it speaks to the ubiquity of Facebook use. Given this centrality, it is sensible to begin our look here to see how things stand in terms of data. The story of privacy and data protection on Facebook is, to be generous, an evolving one. Much of the data that users put on Facebook was completely public until 2012, including the complete catalogue of your “likes”. For a company like Cambridge Analytica, this information was pure gold, enabling them to build up what psychologists call a “psychometric” profile of each user. If this information sat on the internet in plain sight, could any user have assumed that their activity on Facebook was private? Should they have?

Since likes were made private, Facebook has had a number of “gaffes” in which new features or bugs have forced this data back out into the public. Much of the reporting on the Cambridge Analytica scandal has referred to the firm’s access to data as a “breach”, implying that Facebook had been trying in good faith to keep the data users generated private, and that this company had found improper or possibly even illegal ways to harvest it. This is actually quite misleading. Companies like CA (and it is worth noting that there are probably hundreds of other similar operations harvesting similarly massive datasets) can put together millions of tiny pieces of information scattered across the internet: the number of contacts you have on a social network platform, or the number of profile pictures you’ve cycled through, hint at personality traits.
The controversial part, which some people are (in my opinion inaccurately) calling a breach, relates to another approach that CA took shortly after Facebook began to make its data privacy policies a bit less free-wheeling. They used Amazon’s Mechanical Turk platform, where companies can hire workers to do tiny tasks for small sums of money, sometimes just a single pound (or dollar in this case), to answer a personality survey. Over 200,000 people answered this survey, which had a hidden gem at the end: users were asked to share their Facebook profiles, with their (now private) likes and friends. Thousands complied unwittingly. Some people who took the survey complained to Amazon that this violated Amazon’s terms of service, but Amazon didn’t discontinue the surveys until more than a year later.
Is this kind of data collection ethical? Well, I’ll get into these kinds of questions from the perspective of a researcher a bit later, after we hear about the monkey selfie. For now, I want us to start thinking about ourselves as generators of data. This is a good ethical exercise: by placing ourselves in a situation and seeing how we feel, we develop some sensitivity to how complex things might be when we later turn this dynamic around and begin to think of ourselves as collectors (and not producers) of data.
For this session, we’d like to have you try a few exercises which will get you acquainted with the idea of “terms and conditions”. You’ve likely seen dozens of T&Cs, as they’re called, by now, but because they’re written in legalese and often run to dozens of pages, we hardly ever read them. In fact, the Guardian reported in 2011 that fewer than 7% of Britons ever read T&Cs, and that one in ten would rather read the whole phone book. Another, more recent study found that only one in four students take the time to read terms and conditions. Jonathan Obar at York University conducted a study which found that it would take the average user 40 minutes a day to actually read through the privacy and T&C documents in which they’re implicated. Yep, that’s 40 minutes out of every single day.
Whether this situation is deliberate, as some scholars have suggested, or merely an unfortunate accident, there’s a problem here relating to users’ literacy about data policies. So we’re going to ask you to actually read through one of these documents and then to debrief on how this knowledge changes your perspective on putting your data on social network platforms. We’re also going to ask you to do an informal study of a digital chat.
We hope you’ll find this exercise illuminating, and we look forward to telling you about that monkey selfie in our next session.