
Confidentiality, anonymity, consent, and privacy

iframe / video here

Transcript of the video:

Data is used in a huge variety of ways. In fact, with the rise of digital platforms, social media, and big data, as we highlighted in our second session, few aspects of our lives are "off limits". To be fair, it's not as if platforms like Google and Facebook invented the idea of "data"; for decades, scholarly researchers have been collecting and analysing data in ways they hope will improve the quality of life for countless individual people.

Let me give you one example of what I see as a benevolent kind of data collection. In recent years, DNA sequencing has become incredibly fast, efficient, and also inexpensive. Many issues with human health and disease have genetic components, and access to information about a range of human genomes can really help researchers to accelerate the pace of research and look for new areas of impact.

Though there are some movements encouraging people to share personal genetic data openly (especially given the way some of these genomic databases are privatised and expensive to access), such open sharing remains, as you might expect, rare. Instead, in most cases people who contribute their personal genomic data to research do so on the expectation that their contributions will be kept confidential or anonymous.

I'm sure that some of you will already be aware of research concepts like confidentiality, but it's helpful to pause and highlight three fundamental aspects of empirical research before we continue this discussion.

The first "core concept" in research ethics is

  • Informed consent: Regardless of privacy concerns, it is a fundamental right of any person participating in research that they consent to participation in a given study. Though we may imagine that consent can sometimes be implied, our ability to "read" people simply isn't that trustworthy, and consent cannot be assumed. This means that research subjects must be informed of the study, how it will work, and what the data will be used for - and then given the opportunity to choose whether to participate or not.

The research arm of Facebook conducted a study in 2012, the results of which were published in the journal PNAS. For a brief period, they deliberately skewed the news feeds of 689,003 people to emphasise either positive or negative emotional content. Researchers wanted to see if this had an impact on the way users used the platform. What they found was not surprising: it did have an impact. What did surprise the research team was the outcry from a wide range of academic researchers suggesting (rightly) that the study had not observed the principle of informed consent. You can read the eventual "Editorial Expression of Concern" by the PNAS editors to see their repentance for publishing this article.

Informed consent isn't the whole picture, however. In many cases, as I've already suggested above, users might not be willing to participate in a study if their responses were to be made public. So, one of the fundamental ways that researchers approach data ethics is to remove personally identifiable data. In some cases, identities are known to researchers but kept confidential, such that readers of a research publication cannot identify individual persons. In other, stricter cases, research subjects are anonymised, so that even the researchers' own notes do not preserve the specific identity of the people contributing data. They are just "subject 1", "subject 2", and so on.
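The distinction between confidential and anonymised data can be sketched in a few lines of code. This is a minimal illustration, not taken from any real study: the names, fields, and function are all invented. The key idea is that pseudonymisation replaces identities with subject IDs while keeping a separate key table; keeping that key under lock is confidentiality, while discarding it entirely is anonymisation.

```python
def pseudonymise(records):
    """Replace names with subject IDs; return stripped data plus a key table.

    All record fields here are hypothetical, for illustration only.
    """
    key = {}      # subject ID -> real identity (kept secret = confidentiality)
    output = []
    for i, record in enumerate(records, start=1):
        subject_id = f"subject {i}"
        key[subject_id] = record["name"]
        # Drop the directly identifying field, keep everything else.
        stripped = {k: v for k, v in record.items() if k != "name"}
        stripped["id"] = subject_id
        output.append(stripped)
    return output, key

records = [
    {"name": "Alice", "age": 34, "condition": "A"},
    {"name": "Bob", "age": 51, "condition": "B"},
]

confidential, key = pseudonymise(records)  # researchers retain `key`, readers never see it
anonymised, _ = pseudonymise(records)      # discarding the key is anonymisation
```

As the next example shows, though, discarding the key is not as final a step as it might appear.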

[lower thirds information with key terms here in videography]

In 2013 a team of scientists from MIT released a study with shocking results. Through some clever data analysis techniques, they had found a way to de-anonymise a major genomic database. You can read more about this later. There have also been subsequent studies by other research teams de-anonymising other anonymised research databases. Much like Cambridge Analytica, these researchers gathered millions of small fragments of data and then tested them for compatibility against these databases. In the case of the work by this team at MIT, they were able to de-anonymise 50 out of 1000 subjects in that study. In other cases the rate has been much worse.
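The general shape of such a linkage attack can be shown with a toy example. This sketch uses entirely invented data and is far simpler than the genomic techniques used by the MIT team; it just illustrates the principle that "anonymised" records can be re-identified by matching leftover quasi-identifiers (here, age and postcode) against a public dataset that still carries names.

```python
# Invented "anonymised" research data: names removed, but quasi-identifiers remain.
anonymised = [
    {"id": "subject 1", "age": 34, "postcode": "B15", "diagnosis": "X"},
    {"id": "subject 2", "age": 51, "postcode": "M1",  "diagnosis": "Y"},
]

# Invented public data, e.g. fragments scraped from social profiles.
public = [
    {"name": "Alice", "age": 34, "postcode": "B15"},
    {"name": "Carol", "age": 28, "postcode": "LS2"},
]

def link(anon, pub, keys=("age", "postcode")):
    """Re-identify anonymised records whose quasi-identifiers match exactly one public record."""
    matches = {}
    for a in anon:
        hits = [p for p in pub if all(p[k] == a[k] for k in keys)]
        if len(hits) == 1:  # a unique match re-identifies the subject
            matches[a["id"]] = hits[0]["name"]
    return matches

print(link(anonymised, public))  # {'subject 1': 'Alice'}
```

With only two quasi-identifiers and four records, one subject is already re-identified; real attacks work the same way at the scale of millions of data fragments.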

So what does this mean for research ethics? Simply put, in this new internet age, we can no longer just use "subject 1" and "subject 2". Our computers are powerful enough, and our lives public enough, that clever people can reverse engineer these labels. There are two possible ways to respond to this new context for digital data.

First, researchers can try to lock down their data sets, ensuring that they are stored in secure digital repositories. What makes a digital repository secure? A whole lot of things: good passwords (and not ones which need to be changed every six months), strong and consistent network security and data privacy policies, clear ways to classify the sensitivity level of a given dataset, well-trained staff, and so on. I've included some articles you can read here if you want to learn a bit more about how to set up a proper data vault.

But as the range of high-profile hacks in recent years indicates, organisations very rarely check all these boxes, and so even secured data can be at risk of breach.

This continued risk has led some researchers to take up another option: to pursue alternative research methods which emphasise not privacy and confidentiality but active participation by research subjects. To be fair, this will only work in certain contexts, and definitely not with highly sensitive data or in studies conducted with vulnerable people. But many researchers are achieving innovative and surprising results with participatory studies. We can involve research subjects in the design of a study, enable them to help us interpret the results, and make their voices known. The advantage here is that a study turns from a faceless mass into a specific group of unique people.

This is a brave new digital world that we're working in, to be sure, and in our excitement over new research horizons it can be easy to ignore the risks associated with data generation and use. Research is fun, exciting, and often empowering. I like to say that the task of a researcher is to help people tell their stories. We need to be sure that we do this in a way that empowers those individuals and not just ourselves. And this means that we need to make responsible choices about research design and data management.

In the tasks for this session, we're going to have you do a deep dive into some of the studies we've mentioned briefly here and then write up a brief reflection on what you'd like your own research design ethic to be. In the next session, we're going to wrap up and try to explore some of the ways that we can move forward as researchers in the midst of the very complicated scenarios that we've shared with you so far.

Resources: