Why wouldn’t you share data?

Data sharing has been in the news a lot lately, from the refusal of the authors of the PACE trial to share their data even though the journal expects it, to the eventful story of the “Sadness impairs color perception” study. A blog post by Dorothy Bishop called “Who’s afraid of Open Data?” made the rounds. The post itself is actually a month old already, but it was republished by the LSE blog, which gave it some additional publicity. In it she makes an impassioned argument for open data sharing and discusses the fears and criticisms many researchers have voiced against data sharing.

I have long believed in making all data available (and please note that in the following I will always mean data and materials, so not just the results but also the methods). The way I see it, this transparency is the first and most important remedy to the ills of scientific research. I have regular discussions with one of my close colleagues* about how to improve science – we don’t always agree on various points like preregistration, but if there is one thing where we are on the same page, it is open data sharing. By making data available, anyone can reanalyse it and check whether the results reproduce, and it allows you to check the robustness of a finding for yourself, if you feel that you should. Moreover, by documenting and organising your data you not only make it easier for other researchers to use, but also for yourself and your lab colleagues. It also helps you spot errors. And it is a good argument that stops reviewer 2 from requesting a gazillion additional analyses – if they really think these analyses are necessary, they can do them themselves and publish them. This aspect in fact overlaps greatly with the debate on Registered Reports (RR) and it is one of the reasons I like the RR concept.

But the benefits of data sharing go well beyond this. Access to the data will allow others to reuse it to answer scientific questions you may not even have thought of. The data can also be used in meta-analyses. And with the increasing popularity and feasibility of large-scale permutation/bootstrapping methods, availability of the raw values will be particularly important. Access to the raw data allows you to take into account distributional anomalies and outliers, or perhaps to estimate the uncertainty on individual data points.
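As a small illustration of why access to raw values matters here: a bootstrap confidence interval can only be computed by resampling the individual data points, so published summary statistics alone are not enough. A minimal sketch (the data values and sample size are invented for illustration):

```python
import random
import statistics

random.seed(42)

# Hypothetical raw data: one value per participant.
# A published mean and SD alone would not let anyone run this.
raw_scores = [52.1, 48.3, 61.0, 55.4, 47.9, 58.2, 50.5, 53.7, 49.8, 56.6]

def bootstrap_ci(data, n_resamples=10000, alpha=0.05):
    """Percentile bootstrap CI for the mean, resampling the raw values."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(raw_scores)
print(f"mean = {statistics.mean(raw_scores):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The same logic applies to outlier checks and per-point uncertainty estimates: all of them operate on the individual values, not on aggregates.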

But as Dorothy describes, many scientists nevertheless remain afraid of publishing their actual data alongside their studies. For several years many journals and funding agencies have had a policy that data should always be shared upon request – but a laughably small proportion of such requests are successful. This is why some have now adopted the policy that all data must be shared in repositories upon publication or even upon submission. To encourage this process, the Peer Reviewer Openness Initiative was recently launched, whose signatories refuse to conduct in-depth reviews of manuscripts unless the authors can give a reason why the data and materials aren’t public.

My most memorable experience with fears about open data involves a case where a lab head refused to share data and materials with the graduate student* who actually created the methods and collected the data. The exact details aren’t important. Maybe one day I will talk more about this little horror story… For me it demonstrates how far we have come already. Nowadays that story would be baffling to most researchers, but back then (and that’s only a few years ago – I’m not that old!) more than one person actually told me that the PI and university were perfectly justified in keeping the student’s results and the fruits of their intellectual labour under lock and key.

Clearly, people are still afraid of open data. Dorothy lists the following reasons:

  1. Lack of time to curate data: data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task;
  2. Personal investment – sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders;
  3. Concerns about being scooped before the analysis is complete;
  4. Fear of errors being found in the data;
  5. Ethical concerns about confidentiality of personal data, especially in the context of clinical research;
  6. Possibility that others with a different agenda may misuse the data, e.g. perform selective analyses that misrepresent the findings.

In my view, points 1-4 are invalid arguments even if they seem understandable. I have a few comments about some of these:

The fear of being scooped 

I honestly am puzzled by this one. How often does this actually happen? The fear of being scooped is widespread and it may occasionally be justified. Say, if you discuss some great idea you have, or post a pilot result on social media, perhaps you shouldn’t be surprised if someone else agrees that the idea is great and also does it. Some people wouldn’t be bothered by that but many would, and that’s understandable. Less understandable to me is presenting research at a conference and then complaining about others publishing similar work because they were inspired by you. That’s what conferences are for. If you don’t want that to happen, don’t go to conferences. Personally, I think science would be a lot better if we cared less about who did what first and more about what is true and how we can work together…

But anyway, as far as I can see none of that applies to data sharing. By definition, the data you share are either already published or at least submitted for peer review. If someone reuses your data for something else, they have to cite you and give you credit. In many situations they may even do it in collaboration with you, which could lead to coauthorship. More importantly, if the scooped result is so easily obtained that somebody beats you to it despite your head start (it’s your data; regardless of how well documented it is, you will always know it better than some stranger), then perhaps you should have thought about that sooner. You could have held back on your first publication and combined the analyses. Or, if it really makes more sense to publish the data in separate papers, then you could perhaps declare that the full data set will be shared after the second one is published. I don’t really think this is necessary but I would accept that argument.

Either way, I don’t believe being scooped by data sharing is very realistic and any cases of that happening must be extremely rare. But please share these stories if you have them to prove me wrong! If you prefer, you can post it anonymously on the Neuroscience Devils. That’s what I created that website for.

Fear of errors being discovered

I’m sure everyone can understand that fear. It can be embarrassing to have your errors discovered (and we all make mistakes) – at least if they are errors with big consequences. Part of the problem is that all too often the discovery of errors is associated with an accusation of malice. To err is human, to forgive divine. We really need to stop treating every case of somebody’s mistakes being revealed (or, for that matter, of somebody’s findings failing to replicate) as an implication of sloppy science or malpractice. Sometimes (usually?) mistakes are just mistakes.

Probably nobody wants to have all of their data combed by vengeful sleuths nitpicking every tiny detail. If that becomes excessive and the same person is targeted, it could border on harassment and that should be counteracted. In-depth scrutiny of all the data by a particular researcher should be a special case that only happens when there is a substantial reason, say, in a fraud investigation. I would hope though that these cases are also rare.

And surely nobody can seriously want the scientific record to be littered with false findings, artifacts, and coding errors. I am not happy when someone tells me I made a serious error, but I would nonetheless be grateful to them for telling me! It has happened before that lab members or collaborators spotted mistakes I made. In turn, I have spotted mistakes colleagues made. None of this would have been possible if we didn’t share our data and methods with one another. I am always surprised when I hear how uncommon this seems to be in some labs. Labs should be collaborative, and so should science as a whole. And as I already said, organising and documenting your data actually helps you spot errors before the work is published. If anything, data sharing reduces mistakes.

Ethical issues with patient confidentiality

This is a big concern – and the only one I have full sympathy with. But all of our ethics and data protection applications actually discuss this. Only anonymised data should be shared. Participants should be identified only by unique codes, and only the researchers who collected the data should have access to the key linking codes to identities. For a lot of psychology and other behavioural experiments this shouldn’t be hard to achieve.
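To make the idea concrete, here is a minimal sketch of that kind of pseudonymisation, with invented participant records and field names. The shared file contains only random codes; the key linking codes back to identities never leaves the lab:

```python
import csv
import io
import secrets

# Hypothetical participant records as collected in the lab.
participants = [
    {"name": "Alice Example", "age": 24, "score": 0.81},
    {"name": "Bob Example", "age": 31, "score": 0.67},
]

def pseudonymise(records):
    """Replace names with random codes; return (shared_rows, private_key)."""
    shared, key = [], {}
    for rec in records:
        code = "sub-" + secrets.token_hex(4)  # random, non-guessable code
        key[code] = rec["name"]               # stays with the researchers
        shared.append({"id": code, "age": rec["age"], "score": rec["score"]})
    return shared, key

shared_rows, private_key = pseudonymise(participants)

# Only this CSV would go into the public repository.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "age", "score"])
writer.writeheader()
writer.writerows(shared_rows)
print(buf.getvalue())
```

Of course, real anonymisation is harder than stripping names – rare combinations of demographic variables can still identify people – but for typical behavioural datasets this separation of shared data and private key is the core of it.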

Neuroimaging and other biological data are a different story. I have a strict rule for my own results: we do not upload the actual brain images from our fMRI experiments to public repositories. While under certain conditions I am willing to share such data upon request, as long as the participant’s name has been removed, I don’t think it is safe to make those data permanently available to the entire internet. Participant confidentiality must trump the need for transparency. It simply is not possible to remove all identifying information from these files. Skull-stripping, which removes all head tissues from an MRI scan except the brain, does not remove all identifying information. Brains are like fingerprints and they can easily be matched up if you have the required data. As someone* recently said in a discussion of this issue, the undergrad you are scanning in your experiment now may be Prime Minister in 20-30 years’ time. They definitely didn’t consent to their brain scans being available to anyone. It may not take much to identify a person’s data using only their age, gender, handedness, and a basic model of their head shape derived from their brain scan. We must also keep in mind what additional data mining may be possible in the coming decades that we simply have no idea about yet. Nobody can know what information could be gleaned from these data, say, about health risks or personality factors. Sharing this without very clear informed consent (which many people probably wouldn’t give) is in my view irresponsible.

I also don’t believe that for most purposes this is even necessary. Most neuroimaging studies involve group analyses. In those you first spatially normalise the images of each participant and then perform statistical analysis across participants. It is perfectly reasonable to make those group results available. For the purposes of non-parametric permutation analyses (also in the news recently) you would want to share individual data points, but even there you can probably share images after sufficient processing that not much incidental information is left (e.g. condition contrast images). In our own work, these considerations don’t apply: we conduct almost all our analyses in the participant’s native brain space. As such, we decided to share only the participants’ data projected onto a cortical reconstruction. These data contain the functional results for every relevant voxel after motion correction and signal filtering. No, this isn’t raw data, but it is sufficient to reproduce the results and also to apply different analyses. I’d wager that for almost all purposes this is more than enough. And again, if someone were interested in applying different motion correction or filtering methods, this would be negotiable. But I don’t think we need to allow unrestricted permanent access for such highly unlikely purposes.
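A sketch of why processed per-participant values suffice for such analyses: a sign-flipping permutation test on condition differences needs only one number per participant (the kind of value a shared contrast image provides), not the raw scans. The data below are invented for illustration:

```python
import itertools
import statistics

# Hypothetical per-participant contrast values (condition A minus B),
# i.e. the processed numbers a shared contrast image would provide.
diffs = [0.8, 1.2, -0.1, 0.9, 0.5, 1.1, 0.3, 0.7]

observed = statistics.mean(diffs)

# Exact sign-flipping permutation test: under the null hypothesis of no
# condition difference, each participant's sign is exchangeable, so we
# enumerate all 2^n sign assignments.
count = 0
n_perms = 0
for signs in itertools.product([1, -1], repeat=len(diffs)):
    flipped = statistics.mean(s * d for s, d in zip(signs, diffs))
    if abs(flipped) >= abs(observed):
        count += 1
    n_perms += 1

p_value = count / n_perms
print(f"observed mean diff = {observed:.3f}, p = {p_value:.4f}")
```

With more participants one would sample a random subset of sign flips rather than enumerate all of them, but either way the input is just the per-participant summary values.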

Basically, rather than sharing all raw data I think we need to treat each data set on a case-by-case basis and weigh the risks against benefits. What should be mandatory in my view is sharing all data after default processing that is needed to reproduce the published results.

People with agendas and freeloaders

Finally, a few words about a combination of points 2 and 6 in Dorothy Bishop’s list. When it comes to controversial topics (e.g. climate change or chronic fatigue syndrome, to name a few examples where this apparently happened) there could perhaps be the danger that people with shady motivations will reanalyse and nitpick the data to find fault with them and discredit the researcher. More generally, people with limited expertise may conduct poor reanalyses. Since failed reanalyses (and again, the same applies to failed replications) often cause quite a stir and are frequently discussed as evidence that the original claims were false, this could indeed be a problem. Some will also perceive these cases as “data tourism”: using somebody else’s hard-won results for quick personal gain, say by making a name for themselves as a cunning data detective.

There can be some truth in that and for that reason I feel we really have to work harder to change the culture of scientific discourse. We must resist the bias to agree with the “accuser” in these situations. (Don’t pretend you don’t have this bias because we all do. Maybe not in all cases but in many cases…)

Of course skepticism is good. Scientists should be skeptical but the skepticism should apply to all claims (see also this post by Neuroskeptic on this issue). If somebody reanalyses somebody else’s data using a different method that does not automatically make them right and the original author wrong. If somebody fails to replicate a finding, that doesn’t mean that finding was false.

Science thrives on discussion and disagreement. The critical thing is that the discussion is transparent and public. Anyone who has an interest should have the opportunity to follow it. Anyone who is skeptical of the authors’ or the reanalysers’/replicators’ claims should be able to check for themselves.

And the only way to achieve this level of openness is Open Data.

 

* They will remain anonymous unless they want to join this debate.

7 thoughts on “Why wouldn’t you share data?”

  1. Nice blogpost, NeuroNeurotic. Just one point I wanted to add on “persons with an agenda”. We should remember that the authors of the original paper almost definitely had “an agenda” too. This might have been a strong personal (sometimes lifelong) belief in the ideas being tested, a major career investment in those ideas and/or simply a desire to present the data in a way most likely to get published.

    These are all “agendas” too. Too often in science we overlook these.

    So let’s get away from the idea of “agendas” being a special case. Accept that agendas exist for virtually every participant in a debate. Better to think about whether a group’s agenda does or doesn’t translate into a different hope for the outcome than the original authors’. A different hope = best recipe for genuine critical analysis.

    Of course, the results of any secondary analysis need to be evaluated and criticised in the same way as the original. But remember also that an extremely poor analysis is unlikely to make it through peer review and reach our attention in the first place.


    1. Thanks for your comment. When I am talking about people with an agenda I mean people with a strong agenda. Yes, I think this is commonplace in science and this is a problem. The less scientists care about their pet theories and the more they care about the results the better. All I meant to say in this post is that you should be aware of the biases some scientists may have.

      We are all human, we all have biases, regardless how sceptical and rational we may think we are. That is precisely why we need open data. With open data nobody can ever accuse you of hiding anything.


  2. “When it comes to controversial topics (e.g. climate change, chronic fatigue syndrome, to name a few examples where this happened) there could perhaps be the danger that people with shady motivations will reanalyse and nitpick the data to find fault with them and discredit the researcher.”

    When has this happened with CFS? I’ve seen lots of complaints from ‘shady’ researchers that patients are only unhappy with their work because of a prejudice against psychiatry, or some such excuse. The way in which results were presented from the PACE trial has received a lot of criticism (starting with patients, but including, eventually, the recent AHRQ evidence review, which stated that the PACE trial’s post-hoc recovery criteria were so loose as to be “contradictory” http://effectivehealthcare.ahrq.gov/index.cfm/search-for-guides-reviews-and-reports/?pageaction=displayproduct&productid=1906 ), but generally these researchers have avoided having critics re-analyse their data and largely avoided engaging in any real discussion or debate about the merits of their work. It seems that this is now changing, and I don’t think that their work will hold up to outside scrutiny.

    A fair amount of information about the reason for requesting data from the PACE trial was provided by the patient who recently had the ICO rule in their favour:

    https://www.whatdotheyknow.com/request/selected_data_on_pace_trial_part

    This patient’s motivations had been questioned previously by QMUL, and it seems that their arguments about this quite largely rested upon the prejudices that surround a stigmatised health condition. Sadly, these prejudices seem quite widespread. The patient made all of his correspondence public so that others could judge for themselves:

    https://www.whatdotheyknow.com/request/timing_of_changes_to_pace_trial#comment-59096


    1. I am by no means an expert on chronic fatigue syndrome and to be frank I don’t want to be. I’m a neuroscientist who studies perception. My side interests are how perception differs in autism or schizophrenia and how it relates to personality, political opinions or the like. I’m also interested in how early experience affects visual development…

      I am definitely not a person to give any thoughtful opinion on CFS or even for commenting on this whole controversy. But what I do see with regard to CFS is a bloody shambles. Journals being caught between researchers, universities, and the people requesting data. Journals that – we might add – have it as a condition of publication that data are made available. Clearly, there is a major controversy going on with CFS and at least a large part of it doesn’t appear to be driven by science but by political concerns.


    2. Just to clarify further: I don’t really know the details of the backstory to this particular case. I know that the university’s/researchers’ reason for rejecting Coyne’s data sharing request has something to do with previous data requests and researchers being harassed. If this is true, it is a problem. But this is precisely why the data should be available, because this will permit scrutiny.

      My only point here is that any claims by people with a bias/agenda (whatever this may be) can be revealed if the data are transparent.

      Also, I added the word ‘apparently’ to the sentence you quote above. When I was writing this I was going by the gossip I had been told about the CFS case. The sentence now reflects this better.


  3. It seems to me that when patients question results published in a trial, it is easy to paint them as ‘having an agenda’ and therefore unworthy of seeing the actual data, even though it is their lives that are affected by the choice of treatment options.

    In the case of the PACE trial, a paper was published with significant changes to the protocol and some very strange claims of patients getting ‘back to normal’ based on some very strange thresholds. This led to press coverage claiming ‘cure’ rates of 30-40%, but data for the recovery protocol remain unpublished. The results were taken at face value by the medical community, even by those who would normally be sceptical of such protocol changes, post-hoc analysis, and unpublished data.

    Some very sick patients have had to ask questions, driven by the need to get an accurate picture of the treatments and the science. Yet they are told that they don’t deserve data because they have an ‘agenda’.

    Part of the opposition to the treatments proposed by PACE is that many have tried them. Some, particularly those who have tried GET, have deteriorated with treatment.


    1. Yes, and that’s why I feel the only way to deal with this controversy is to make the data available. Obviously only data without identifying information, but all the data needed to reproduce the results should be public in anonymised form.

