Data sharing has been in the news a lot lately from the refusal of the authors of the PACE trial to share their data even though the journal expects it to the eventful story of the “Sadness impairs color perception” study. A blog post by Dorothy Bishop called “Who’s afraid of Open Data?” made the rounds. The post itself is actually a month old already but it was republished by the LSE blog which gave it some additional publicity. In it she makes a impassioned argument for open data sharing and discusses the fears and criticisms many researchers have voiced against data sharing.
I have long believed in making all data available (and please note that in the following I will always mean data and materials, so not just the results but also the methods). The way I see it this transparency is the first and most important remedy to the ills of scientific research. I have regular discussions with one of my close colleagues* about how to improve science – we don’t always agree on various points like preregistration, but if there is one thing where we are on the same page, it is open data sharing. By making data available anyone can reanalyse it and check if the results reproduce and it allows you to check the robustness of a finding for yourself, if you feel that you should. Moreover, by documenting and organising your data you not only make it easier for other researchers to use, but also for yourself and your lab colleagues. It also helps you with spotting errors. It is also a good argument that stops reviewer 2 from requesting a gazillion additional analyses – if they really think these analyses are necessary they can do them themselves and publish them. This aspect in fact overlaps greatly with the debate on Registered Reports (RR) and it is one of the reasons I like the RR concept. But the benefits of data sharing go well beyond this. Access to the data will allow others to reuse the data to answer scientific questions you may not even have thought of. They can also be used in meta-analyses. With the increasing popularity and feasibility of large-scale permutation/bootstrapping methods it also means that availability to the raw values will be particularly important. Access to the data allows you to take into account distributional anomalies, outliers, or perhaps estimate the uncertainty on individual data points.
But as Dorothy describes, many scientists nevertheless remain afraid of publishing their actual data alongside their studies. For several years many journals and funding agencies have had a policy that data should always be shared upon request – but a laughably small proportion of such requests are successful. This is why some have now adopted the policy that all data must be shared in repositories upon publication or even upon submission. And to encourage this process recently the Peer Reviewer Openness Initiative was launched by which signatories would refuse to conduct in-depth reviews of manuscripts unless the authors can give a reason why data and materials aren’t public.
My most memorable experience with fears about open data involve a case where the lab head refused to share data and materials with the graduate student* who actually created the methods and collected the data. The exact details aren’t important. Maybe one day I will talk more about this little horror story… For me this demonstrates how far we have come already. Nowadays that story would be baffling to most researchers but back then (and that’s only a few years ago – I’m not that old!) more than one person actually told me that the PI and university were perfectly justified in keeping the student’s results and the fruits of their intellectual labour under lock and key.
Clearly, people are still afraid of open data. Dorothy lists the following reasons:
- Lack of time to curate data; Data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task;
- Personal investment – sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders;
- Concerns about being scooped before the analysis is complete;
- Fear of errors being found in the data;
- Ethical concerns about confidentiality of personal data, especially in the context of clinical research;
- Possibility that others with a different agenda may misuse the data, e.g. perform selective analysis that misrepresented the findings;
In my view, points 1-4 are invalid arguments even if they seem understandable. I have a few comments about some of these:
The fear of being scooped
I honestly am puzzled by this one. How often does this actually happen? The fear of being scooped is widespread and it may occasionally be justified. Say, if you discuss some great idea you have or post a pilot result on social media perhaps you shouldn’t be surprised if someone else agrees that the idea is great and also does it. Some people wouldn’t be bothered by that but many would and that’s understandable. Less understandable to me is if you present research at a conference and then complain about others publishing similar work because they were inspired by you. That’s what conferences are for. If you don’t want that to happen, don’t go to conferences. Personally, I think science would be a lot better if we cared a lot less about who did what first and instead cared more about what is true and how we can work together…
But anyway, as far as I can see none of that applies to data sharing. By definition data you share is either already published or at least submitted for peer review. If someone reuses your data for something else they have to cite you and give you credit. In many situations they may even do it in collaboration with you which could lead to coauthorship. More importantly, if the scooped result is so easily obtained that somebody beats you to it despite your head start (it’s your data, regardless of how well documented it is you will always know it better than some stranger) then perhaps you should have thought about that sooner. You could have held back on your first publication and combined the analyses. Or, if it really makes more sense to publish the data in separate papers, then you could perhaps declare that the full data set will be shared after the second one is published. I don’t really think this is necessary but I would accept that argument.
Either way, I don’t believe being scooped by data sharing is very realistic and any cases of that happening must be extremely rare. But please share these stories if you have them to prove me wrong! If you prefer, you can post it anonymously on the Neuroscience Devils. That’s what I created that website for.
Fear of errors being discovered
I’m sure everyone can understand that fear. It can be embarrassing to have your errors (and we all make mistakes) being discovered – at least if they are errors with big consequences. Part of the problem is also that all too often the discovery of errors is associated with some malice. To err is human, to forgive divine. We really need to stop treating every time somebody’s mistakes are being revealed (or, for that matter, when somebody’s findings fail to replicate) as an implication of sloppy science or malpractice. Sometimes (usually?) mistakes are just mistakes.
Probably nobody wants to have all of their data combed by vengeful sleuths nitpicking every tiny detail. If that becomes excessive and the same person is targeted, it could border on harassment and that should be counteracted. In-depth scrutiny of all the data by a particular researcher should be a special case that only happens when there is a substantial reason, say, in a fraud investigation. I would hope though that these cases are also rare.
And surely nobody can seriously want the scientific record to be littered with false findings, artifacts, and coding errors. I am not happy if someone tells me I made a serious error but I would nonetheless be grateful to them for telling me! It has happened before when lab members or collaborators spotted mistakes I made. In turn I have spotted mistakes colleagues made. None of this would have been possible if we didn’t share our data and methods amongst each another. I am always surprised when I hear how uncommon this seems to be in some labs. Labs should be collaborative, and so should science as a whole. And as I already said, organising and documenting your data actually helps you to spot errors before the work is published. If anything, data sharing reduces mistakes.
Ethical issues with patient confidentiality
This is a big concern – and the only one that I have full sympathy with. But all of our ethics and data protection applications actually discuss this. The only data that is shared should be anonymised. Participants should only be identified by unique codes that only the researchers who collected the data have access to. For a lot of psychology or other behavioural experiments this shouldn’t be hard to achieve.
Neuroimaging or biological data are a different story. I have a strict rule for my own results. We do not upload the actual brain images of our fMRI experiments to public repositories. While under certain conditions I am willing to share such data upon request as long as the participant’s name has been removed, I don’t think it is safe to make those data permanently available to the entire internet. Participant confidentiality must trump the need for transparency. It simply is not possible to remove all identifying information from these files. Skull-stripping, which removes the head tissues from an MRI scan except for the brain, does not remove all identifying information. Brains are like finger-prints and they can easily be matched up, if you have the required data. As someone* recently said in a discussion of this issue, the undergrad you are scanning in your experiment now may be Prime Minister in 20-30 years time. They definitely didn’t consent to their brain scans being available to anyone. It may not take much to identify a person’s data using only their age, gender, handedness, and a basic model of their head shape derived from their brain scan. We must also keep in mind of what additional data mining may be possible in the coming decades that we simply have no idea about yet. Nobody can know what information could be gleaned from these data, say, about health risks or personality factors. Sharing this without very clear informed consent (that many people probably wouldn’t give) is in my view irresponsible.
I also don’t believe that for most purposes this is even necessary. Most neuroimaging studies involve group analyses. In those you first spatially normalise the images of each participant and the perform statistical analysis across participants. It is perfectly reasonable to make those group results available. For purpose of non-parametric permutation analyses (also in the news recently) you would want to share individual data points but even there you can probably share images after sufficient processing that not much incidental information is left (e.g. condition contrast images). In our own work, these considerations don’t apply. We conduct almost all our analyses in the participant’s native brain space. As such we decided to only share the participants’ data projected on a cortical reconstruction. These data contain the functional results for every relevant voxel after motion correction and signal filtering. No this isn’t raw data but it is sufficient to reproduce the results and it is also sufficient for applying different analyses. I’d wager that for almost all purposes this is more than enough. And again, if someone were to be interested in applying different motion correction or filtering methods, this would be a negotiable situation. But I don’t think we need to allow unrestricted permanent access for such highly unlikely purposes.
Basically, rather than sharing all raw data I think we need to treat each data set on a case-by-case basis and weigh the risks against benefits. What should be mandatory in my view is sharing all data after default processing that is needed to reproduce the published results.
People with agendas and freeloaders
Finally a few words about a combination of points 2 and 6 in Dorothy Bishop’s list. When it comes to controversial topics (e.g. climate change, chronic fatigue syndrome, to name a few examples where this apparently happened) there could perhaps be the danger that people with shady motivations will reanalyse and nitpick the data to find fault with them and discredit the researcher. More generally, people with limited expertise may conduct poor reanalysis. Since failed reanalysis (and again, the same applies to failed replications) often cause quite a stir and are frequently discussed as evidence that the original claims were false, this could indeed be a problem. Also some will perceive these cases as “data tourism”, using somebody else’s hard-won results for quick personal gain – say by making a name for themselves as a cunning data detective.
There can be some truth in that and for that reason I feel we really have to work harder to change the culture of scientific discourse. We must resist the bias to agree with the “accuser” in these situations. (Don’t pretend you don’t have this bias because we all do. Maybe not in all cases but in many cases…)
Of course skepticism is good. Scientists should be skeptical but the skepticism should apply to all claims (see also this post by Neuroskeptic on this issue). If somebody reanalyses somebody else’s data using a different method that does not automatically make them right and the original author wrong. If somebody fails to replicate a finding, that doesn’t mean that finding was false.
Science thrives on discussion and disagreement. The critical thing is that the discussion is transparent and public. Anyone who has an interest should have the opportunity to follow it. Anyone who is skeptical of the authors’ or the reanalysers’/replicators’ claims should be able to check for themselves.
And the only way to achieve this level of openness is Open Data.
* They will remain anonymous unless they want to join this debate.