Category Archives: reproducibility

Why Gilbert et al. are missing the point

This (hopefully brief) post is yet again related to the replicability debate I discussed in my previous post. I just read a response by Gilbert et al. to the blog comments about their reply to the reply to the reply to the (in my view, misnamed) Reproducibility Project Psychology. I won’t go over all of this again. I also won’t discuss the minutia of the statistical issues as many others have already done so and will no doubt do so again. I just want to say briefly why I believe they are missing the point:

The main argument put forth by Gilbert et al. is that there is no evidence for a replicability crisis in psychology and that the “conclusions” of the RPP are thus unfounded. I don’t think that the RPP ever claimed anything of the kind one way or the other (in fact, I was impressed by the modesty of the claims made by the RPP study when I read it) but I’ll leave that aside. I appreciate what Gilbert et al. are trying to do. I have myself frequently argued a contrarian position in these discussions (albeit not always entirely seriously). I am trying to view this whole debate the same way any scientist should: by evaluating the evidence without any investment in the answer. For that reason, the debate they have raised seems worthwhile. They tried to estimate a baseline level of replicability one could expect from psychology studies. I don’t think they’ve done it correctly (for statistical reasons) but I appreciate that they are talking about this. This is certainly what we would want to do in any other situation.

Unfortunately, it isn’t that simple. Even if there were no problems with publication bias, analytical flexibility, and lacking statistical power (and we can probably agree that this is not a tenable assumption), it wouldn’t be a straightforward thing to estimate how many psychology studies should replicate by chance. In order to know this you would need to know how many of the hypotheses are true and we usually don’t. As Einstein said – or at least the internet tells me he did: “If we knew what it was we were doing, it would not be called research, would it?”

One of the main points they brought up is that some of the replications in the RPP may have used inappropriate procedures to test the original hypotheses – I agree this is a valid concern but it also completely invalidates the argument they are trying to make. Instead of quibbling about what measure of replication rates is evidence for a “crisis” (a completely subjective judgement) let’s look at the data:

f1-large

This scatter graph from the RPP plots effect sizes in the replications against the originally reported ones. Green (referred to as “blue” by the presumably colour-blind art editors) points are replications that turned out significant, red ones are those that were not significant and thus “failed to replicate.” The separation of the two data clouds is fairly obvious. Significant replication effects have a clear linear relationship with the original ones. Non-significant ones are uncorrelated with the original effect sizes.

We can argue until the cows come home what this means. The red points are presumably at least to a large part false positives. Yes, of course some – perhaps many – may be because of methodological differences or hidden moderators etc. There is no way to quantify this reliably. And conversely, a lot of the green dots probably don’t tell us about any cosmic truths. While they are replicating wonderfully, they may just be replicating the same errors and artifacts. All of these arguments are undoubtedly valid.

But that’s not the point. When we test the reliability of something we should aim for high fidelity. Of course, perfect reliability is impossible so there must be some scatter around the identity line. We also know that there will always be false positives so there should be some data points scattering around the x-axis. But do you honestly think it should be as many as in that scatter graph? Even if these are not all false positives in the original but rather false negatives in the replication, for instance because the replicators did a poor job or there were unknown factors we don’t yet understand, this ratio of green to red dots is not very encouraging.

Replicability encompasses all of the aforementioned explanations. When I read a scientific finding I don’t expect it to be “true.” Even if the underlying effects are real, the explanation for them can be utterly wrong. But we should expect a level of replicability from a field of research that at least maximises the trustworthiness of the reported findings. Any which way you look at it, this scatter graph is unsettling: if two thirds of the dots are red because low statistical power and publication bias in the original effects, this is a major problem. But if they are red because the replications are somehow defective this isn’t exactly a great argument either. What this shows is that the way psychology studies are currently done does not permit very reliable replication. Either way, if you give me a psychology study I should probably bet against it replicating. Does anyone think that’s an acceptable state of affairs?

I am sure both of these issues play a role but the encouraging thing is that probably it is the former, false positives, that is more dominant after all. In my opinion the best way anyone has looked at the RPP data so far is Alex Etz’s Bayesian reanalysis. This suggests that one of the main reasons the replicability in the RPP is so underwhelming is that the level of evidence for the original effects was weak to begin with. This speaks for false positives (due to low power, publication bias, QRPs) and against unknown moderators being behind most of the replication failures. Believe it or not, this is actually a good thing – because it is much easier to address the former problem than the latter.

A brave new world of research parasites

What a week! I have rarely seen the definition of irony being demonstrated more clearly in front of my eyes than during the days following the publication of this comment by Lewandowsky and Bishop in Nature. I mentioned this at the end of my previous post. The comment discusses the question how to deal with data requests and criticisms of scientific claims in the new world of open science. A lot of digital ink has already been spilled elsewhere debating what they did or didn’t say and what they meant to say with their article. I have no intention of rehashing that debate here. So while I typically welcome any meaningful and respectful comments under my posts, I’ll regard any comments on the specifics of the L&B article as off-topic and will not publish them. There are plenty of other channels for this.

I think the critics attack a strawman and the L&B discussion is a red herring. Irrespective of what they actually said, I want to get back to the discussion we should be having, which I already alluded to last time.  In order to do so, let’s get the premise crystal clear. I have said all this before in my various posts about data sharing but let me summarise the fundamental points:

  1. Data sharing: All data for scientific studies needed to reproduce the results should be made public in some independent repository at the point of publication. This must exclude data which would be unethical to share, e.g. unprocessed brain images from human participants. Such data fall in a grey area as to how much anonymisation is necessary and it is my policy to err on the side of caution there. We have no permission from our participants (except for some individual cases) to share their data with anyone outside the team if there is a chance that they could be identified from it so we don’t. For the overwhelming majority of purposes such data are not required and the pre-processed, anonymised data will suffice.
  2. Material sharing: When I talk about sharing data I implicitly also mean material so any custom analysis code, stimulus protocols, or other materials used for the study  should also be shared. This is not only good for reproducibility, i.e. getting the same results using the same data. It is also useful for replication efforts aiming to repeat the same experiment to collect new data.
  3. Useful documentation: Shared data are unlikely to be much use to anyone if there isn’t a minimum of documentation explaining what it contains. I don’t think this needs to be excessive, especially given the fact that most data will probably never be accessed by anyone. But there should at least be some basic guide how to use the data to return a result. It should be reasonably clear what data can be found where or how to run the experiment. Provided the uncompiled code is included and the methods section of the publication contains sufficient detail of what is being done, anyone looking at it should be able to work it out by themselves. More extensive documentation is certainly helpful and may also help the researchers themselves in organising their work – but I don’t think we should expect more than the basics.

Now with this out of the way I don’t want to hear no lamentations about how I am “defending” the restriction of data access to anyone or any such rubbish. Let’s simply work on the assumption that the world is how it should be and that the necessary data are available to anyone with an internet connection. So let’s talk about the worries and potential problems this may bring. Note that, as I already said, most data sets will probably not generate much interest. That is fine – they should be available for potential future use in any case. More importantly this doesn’t mean the following concerns aren’t valid:

Volume of criticism

In some cases the number of people reusing the shared data will be very large. This is particularly likely for research on controversial topics. This could be because the topic is a political battleground or that the research is being used to promote policy changes people are not happy with. Perhaps the research receives undeserved accolades from the mainstream media or maybe it’s just a very sensational claim (Psi research springs to mind again…). The criticisms of this research may or may not be justified. None of this matters and I don’t care to hear about the specifics about your particular pet peeve whether it’s climate change or some medical trial. All that matters in this context is that the topic is controversial.

As I said last time, it should be natural that sensational or controversial research attracts more attention and more scepticism. This is how it should be. Scientists should be sceptical. But individual scientists or small research teams are composed of normal human beings and they have a limit with how much criticism they can keep up with. This is a simple fact. Of course this statement will no doubt draw out the usual suspects who feel the need to explain to me that criticism and scepticism is necessary in science and that this is simply what one should expect.

396px-bookplate_of_the_royal_society_28great_britain29

So let me cut the heads off this inevitable hydra right away. First of all, this is exactly what I just said: Yes, science depends on scepticism. But it is also true that humans have limited capacity for answering questions and criticisms and limited ability to handle stress. Simply saying that they should be prepared for that and have no right to complain is unrealistic. If anything it will drive people away from doing research on controversial questions which cannot be a good thing.

Similar, it is unrealistic to say that they could just ignore criticisms if it gets too much for them. It is completely natural that a given scientist will want to respond to criticisms, especially if those criticisms are public. They will want to defend the conclusions they’ve drawn and they will also feel that they have a reputation to defend. I believe science would generally be better off if we all learned to become less invested in our pet theories and conducted our inferences in a less dogmatic way. I hope there are ways we can encourage such a change – but I don’t think you can take ego out of the question completely. Especially if a critic accuses a researcher of incompetence or worse, it shouldn’t surprise anyone if they react emotionally and have personal stakes in the debate.

So what can we expect? To me it seems entirely justified in this situation that a researcher would write a summary response that addresses the criticism collectively. In that they would most likely have to be selective and only address the more serious points and ignore the minutia. This may require some training. Even then it may be difficult because critics might insist that their subtle points are of fundamental importance. In that situation an adjudicating article by an independent party may be helpful (albeit probably not always feasible).

On a related note, it also seems justified to me that a researcher will require time to make a response. This pertains more to how we should assess a scientific disagreement as outside observers. Just because a researcher hasn’t responded to every little criticism within days of somebody criticising their work doesn’t mean that the criticism is valid. Scientists have lives too. They have other professional duties, mortgages to pay with their too-low salaries, children to feed, and – hard as it is to believe – they deserve some time off occasionally. As long as they declare their intention to respond in depth at some stage we should respect that. Of course if they never respond that may be a sign that they simply don’t have a good response to the criticism. But you need some patience, something we seem to have lost in the age of instant access social media.

Excessive criticism or harassment

This brings us to the next issue. Harassment of researchers is never okay. Which is really because harassment of anyone is never okay. So pelting a researcher with repeated criticisms, making the same points or asking the same questions over and over, is not acceptable. This certainly borders on harassment and may cross the line. This constant background noise can wear people out. It is also counterproductive because it slows them down in making their response. It may also paralyse their other research efforts which in turn will stress them out because they have grant obligations to fulfill etc. Above all, stress can make you sick. If you harassed somebody out of the ability to work, you’ll never get a response – this doesn’t make your criticism valid.

If the researchers declared their intention to respond to criticism we should leave it at that. If they don’t respond after a significant time it might be worth a reminder if they are still working on it. As I said above, if they never respond this may be a sign that they have no response. In that case, leave it at that.

It should require no explanation why any blatant harassment, abusive contact, or any form of interference in the researchers’ personal lives, is completely unacceptable. Depending on the severity of such cases they should be prosecuted to the full extent of the law. And if someone reports harassment, in the first instance you should believe them. It is a common tactic of harassers to downplay claims of abuse. Sure, it is also unethical to make false accusations but you should leave that for the authorities to judge, in particular if you don’t have any evidence one way or the other. Harassment is also subjective. What might not bother you may very well affect another person badly. Brushing this off as them being too sensitive demonstrates a serious lack of compassion, is disrespectful, and I think it also makes you seem untrustworthy.

Motive and bias

Speaking of untrustworthiness brings me to the next point. There has been much discussion about the motives of critics and in how far a criticism is to be taken in “good faith”. This is a complex and highly subjective judgement. In my view, your motive for reanalysing or critiquing a particular piece of research is not automatically a problem. All the data should be available, remember? Anyone can reanalyse it.

However, as all researchers should be honest so should all critics. Obviously this isn’t mandatory and it couldn’t be enforced even if it were. But this is how it should be and how good scientists should work. I have myself criticised and reanalysed research by others and I was not beating around the bush in either case – I believe I was pretty clear that I didn’t believe their hypothesis was valid. Hiding your prior notions is disrespectful to the authors and also misleads the neutral observers of the discussion. Even if you think that your public image already makes your views clear – say, because you ranted at great length on social media about how terribly flawed you think that study was – this isn’t enough. Even the Science Kardashians don’t have that large a social media following and probably only a fraction of that following will have read all your in-depth rants.

In addition to declaring your potential bias you should also state your intention. It is perfectly justified to dig into the data because you suspect it isn’t kosher. But this is an exploratory analysis and it comes with many of the same biases that uncontrolled, undeclared exploration always has. Of course you may find some big smoking gun that invalidates or undermines the original authors’ conclusions. But you are just as likely to find some spurious glitch or artifact in the data that doesn’t actually mean anything. In the latter case it would make more sense to conduct a follow up experiment that tests  your new alternative hypothesis to see if your suspicion holds up. If on the other hand you have a clear suspicion to start with you should declare it and then test it and report the findings no matter what. Preregistration may help to discriminate the exploratory fishing trips from the pointed critical reanalyses – however, it is logistically not very feasible to check whether this wasn’t just a preregistration after the fact because the data were already available.

So I think this judgement will always rely heavily on trust but that’s not a bad thing. I’m happy to trust a critic if they declare their prior opinion. I will simply take their views with some scepticism that their bias may have influenced them. A critic who didn’t declare their bias but is then shown to have a bias appears far less trustworthy. So it is actually in your interest to declare your bias.

Now before anyone inevitably reminds us that we should also worry about the motives and biases of the original authors – yes, of course. But this is a discussion we’ve already had for years and this is why data sharing and novel publication models like preregistration and registered reports are becoming more commonplace.

Lack of expertise

On to the final point. Reanalyses or criticism may come from people with limited expertise and knowledge of a research area to provide useful contributions. Such criticisms may obfuscate the discussion and that is never a good thing. Again preempting the inevitable comments: No, this does not mean that you have to prove your expertise to reanalyse the data. (Seriously guys, which part of “all data should be available to anyone” don’t you get?!). What it does mean is that I might not want to weight the criticism by someone who once took a biology class in high school the same way as that of a world expert. It also means that I will be more sceptical when someone is criticising something outside their own field.

There are many situations where this caveat doesn’t matter. Any scientist with some statistical training may be able to comment on some statistical issue. In fact, a statistician is presumably more qualified to comment on some statistical point than a non-statistician of whatever field. And even if you may not be an expert on some particular research topic you may still be an expert on the methods used by the researchers. Importantly, even a non-expert can reveal a fundamental flaw. The lack of a critic’s expertise shouldn’t be misused to discredit them. In the end, what really matters is that your argument is coherent and convincing. For that it doesn’t actually matter if you are an expert or not (an expert may however find it easier to communicate their criticism convincingly).

However, let’s assume that a large number of non-experts are descending on a data set picking little things they perceive as flaws that aren’t actually consequential or making glaring errors (to an expert) in their analysis. What should the researchers do in this situation? Not responding at all is not in their interest. This can easily be misinterpreted as a tacit acknowledgement that their research is flawed. On the other hand, responding to every single case is not in their interest either if they want to get on with their work (and their lives for that matter). As above, perhaps the best thing to do would be write a summary response collectively rebuking the most pertinent points, make a clear argument about the inconsequentialness of these criticisms, and then leave it at that.

Conclusion

In general, scientific criticisms are publications that should work like any other scientific publications. They should be subject to peer review (which, as readers of this blog will know, I believe should be post-publication and public). This doesn’t mean that criticism cannot start on social media, blogs, journal comment sections, or on PubPeer, and the boundaries may also blur at times. For some kinds of criticism, such as pointing out basic errors or misinterpretations some public comments may suffice and there have been cases where a publication was retracted simply because of the social media response. But for a criticism to be taken seriously by anyone, especially non-experts, it helps if it is properly vetted by independent experts – just how any study should be vetted. This may also help particularly with cases where the validity of the criticism is uncertain.

I think this is a very important discussion to have. We need to have this to bring about the research culture most of us seem to want. A brave new world of happy research parasites.

Parasites

(Note: I changed the final section somewhat after Neuroskeptic rightly pointed out that the conclusions were a bit too general. Tal Yarkoni independently replicated this sentiment. But he was only giving me a hard time.)