This weekend marked another great moment in the saga surrounding the discussion about open science – a worthy sequel to “angry birds” and “shameless little bullies”. This time it was an editorial about data sharing in the New England Journal of Medicine which contains the statement that:
There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”
Remarks like this from journal editors are just all kinds of stupid. Even though this was presented in the context of quotes by unnamed “front-line researchers” (whatever that means) they implicitly endorse the interpretation that re-using other people’s published data is parasitical. In fact, their endorsement is made clear later on in the editorial when the editors express the hope that data sharing “should happen symbiotically, not parasitically.”
It shouldn’t come as a surprise that this editorial was immediately greeted by wide-spread ridicule and the creation of all sorts of internet memes poking fun of the notion of “research parasites.” Even if some people believe this, hell, even if the claim were true (spoiler: it’s not), this is just a very idiotic thing to do. Like it or not, open access, transparency, and post-publication scrutiny of published scientific findings are becoming increasingly common and are already required in many places. We’re now less than a year away from the date when the Peer Reviewers Openness Initiative, whose function it is to encourage data sharing, will come into effect. Not only is the clock not turning back on this stuff – it is deeply counterproductive to liken the supporters of this movement to parasites. This is no way to start (or have) a reasonable conversation.
And there should be a conversation. If there is one thing I have learned from talking with colleagues, worries about data sharing and open science as a whole are far from rare. Misguided as it may be, the concern about others scooping your ideas and sifting through your data you spent blood, sweat, and tears collecting resonates with many people. This editorial didn’t just pop into existence from the quantum foam – it comes from a real place. The mocking and snide remarks about this editorial are fully deserved. This editorial is moronic and ass-backwards. But speaking more generally, snide and mocking are never a good way to convince people of the strength of your argument. All too often worries like this are met with disrespect and ridicule. Is it any surprise that a lot of people don’t dare to speak up against open science? Similarly, when someone discovers errors or problems in somebody else’s data, some are quick to make jokes or serious accusations about these researchers. Is this encouraging them to be open their lab books and file drawers? I think not.
Scientists are human beings and they tend to have normal human reactions when being accused of wrong-doing, incompetence, or sloppiness. Whether or not the accusations are correct is irrelevant. Even mentioning the dreaded “questionable research practices” sounds like a fierce accusation to the accused even though questionable research practices can occur quite naturally without conscious ill intent when people are wandering in the garden of forking paths. In my opinion we need to be mindful of that and try to be more considerate in the way we discuss these issues. Social media like Facebook and Twitter do not exactly seem to encourage respectful dialogue. I know this firsthand as I have myself said things about (in my view) questionable research that I subsequently regretted. Scepticism is good and essential to scientific progress – disrespect is not.
It seems to have been the intention of this misguided editorial to communicate a similar message. It encourages researchers using other people’s data to work with the original authors. So far so good. I am sure no sensible person would actually disagree with that notion. But where the editorial misses the point is that there is no plan for what happens if this “symbiotic” relationship isn’t forming, either because the original authors are not cooperating or because there is a conflict of interests between skeptics and proponents of a scientific claim. In fact, the editorial lays bare what I think is the heart of the problem in a statement that to me seems much worse than the “research parasites” label. They say that people…
…even use the data to try to disprove what the original investigators had posited.
It baffles me that anyone can write something like this whilst keeping a straight face. Isn’t this how science is supposed to work? Trying to disprove a hypothesis is just basic Popperian falsification. Not only should others do that, you should do that yourself with your own research claims. To be fair, the best way to do science in my opinion is to generate competing hypotheses and test them with as little emotional attachment to any of them as possible but this is more easily said than done… So ideally we should try to find the hypothesis that best explains the data rather than just seeking to disprove. Either way however, this sentence is clearly symptomatic of a much greater problem: Science should be about “finding better ways of being wrong.” The first step towards this is to acknowledge that anything we posited is never really going to be “true” and that it can always use a healthy dose of scientific scepticism and disproving.
I want to have this dialogue. I want to debate the ways to make science healthier, more efficient, and more flexible in overturning false ideas. As I outlined in a previous post, data sharing is the single most important improvement we can make to our research culture. I think even if there are downsides to it, the benefits outweigh them by far. But not everyone shares my enthusiasm for data sharing and many people seem worried but afraid to speak up. This is wrong and it must change. I strongly believe that most of the worries can be alleviated:
I think it’s delusional that data sharing will produce a “class” of “research parasites.” People will still need to generate their own science to be successful. Simply sitting around waiting for other people to generate data is not going to be a viable career strategy. If anything, large consortia like the Human Genome or Human Connectome Project will produce large data sets that a broad base of researchers can use. But this won’t allow them to test every possible hypothesis under the sun. In fact, most data sets are far too specific to be much use to many other people.
I’m willing to bet that the vast majority of publicly shared data sets won’t be downloaded, let alone analysed by anyone other than the original authors. This is irrelevant. The point is that the data are available because they could be potentially useful to future science.
Scooping other people’s research ideas by doing the experiment they wanted to do by using their published data is a pretty ineffective and risky strategy. In most cases, there is just no way that someone else would be faster than you publishing an experiment you wanted to do using your data. This doesn’t mean that it never happens but I’m still waiting for anyone to tell me of a case where this actually did happen… But if you are worried about it, preregister your intention so at least anyone can see that you planned it. Or even better, submit it as a Registered Report so you can guarantee that this work will be published in a journal regardless of what other people did with your data.
While we’re at it, upload the preprints of your manuscripts when you submit them to journals. I still dream of a publication system where we don’t submit to journals at all, or at least not until peer review took place and the robustness of the finding has been confirmed. But until we get there, preprints are the next best thing. With a public preprint the chronological precedence is clear for all to see.
Now that covers the “parasites” feeding on your research productivity. But what to do if someone else subjects your data to sceptical scrutiny in the attempt to disprove what you posited? Again, first of all I don’t think this is going to be that frequent. It is probably more frequent for controversial or surprising claims and it bloody well should be. This is how science progresses and shouldn’t be a concern. And if it actually turns out that the result or your interpretation of it is wrong, wouldn’t you want to know about it? If your answer to this question is No, then I honestly wonder why you do research.
I can however empathise with the fear that people, some of whom may lack the necessary expertise or who cherry pick the results, will actively seek to dismantle your findings. I am sure that this does happen and with more general data sharing this may certainly become more common. If the volume of such efforts becomes so large that it overwhelms an individual researcher and thus hinders their own progress unnecessarily, this would indeed be a concern. Perhaps we need to have a discussion on what safeguards could ensure that this doesn’t get out of hand or how one should deal with that situation. I think it’s a valid concern and worth some serious thought. (Update on 25 Jan 2016: In this context Stephan Lewandowsky and Dorothy Bishop wrote an interesting comment about this).
But I guarantee you, throwing the blame at data sharing is not the solution to this potential problem. The answer to scepticism and scrutiny cannot ever be to keep your data under lock and key. You may never convince a staunch sceptic but you also will not win the hearts and minds of the undecidedly doubtful by hiding in your ivory tower. In science, the only convincing argument is data, more data, better tests – and the willingness to change your mind if the evidence demands it.
Yesterday Neuroskeptic came to our Cognitive Drinks event in the Experimental Psychology department at UCL to talk about p-hacking. His entertaining talk (see Figure 1) was followed by a lively and fairly long debate about p-hacking and related questions about reproducibility, preregistration, and publication bias. During the course of this discussion a few interesting things came up. (I deliberately won’t name anyone as I think this complicates matters. People can comment and identify themselves if they feel that they should…)
It was suggested that a lot of the problems with science would be remedied effectively if only people were encouraged (or required?) to replicate their own findings before publication. Now that sounds generally like a good idea. I have previously suggested that this would work very well in combination with preregistration: you first do a (semi-)exploratory experiment to finalise the protocol, then submit a preregistration of your hypothesis and methods, and then do the whole thing again as a replication (or perhaps more than one if you want to test several boundary conditions or parameters). You then submit the final set of results for publication. Under the Registered Report format, your preregistered protocol would already undergo peer review. This would ensure that the final results are almost certain to be published provided you didn’t stray excessively from the preregistered design. So far, so good.
Should you publish unclear results?
Or is it? Someone suggested that it would be a problem if your self-replication didn’t show the same thing as the original experiment. What should one do in this case? Doesn’t publishing something incoherent like this, one significant finding and a failed replication, just add to the noise in the literature?
At first, this question simply baffled me, as I suspect it would many of the folks campaigning to improve science. (My evil twin sister called these people Crusaders for True Science but I’m not supposed to use derogatory terms like that anymore nor should I impersonate lady demons for that matter. Most people from both sides of this mudslinging contest “debate” never seemed to understand that I’m also a revolutionary – you might just say that I’m more Proudhon, Bakunin, or Henry David Thoreau rather than Marx, Lenin, or Che Guevara. But I digress…)
Surely, the attitude that unclear, incoherent findings, that is, those that are more likely to be null results, are not worth publishing must contribute to the prevailing publication bias in the scientific literature? Surely, this view is counterproductive to the aims of science to accumulate evidence and gradually get closer to some universal truths? We must know which hypotheses have been supported by experimental data and which haven’t. One of the most important lessons I learned from one of my long-term mentors was that all good experiments should be published regardless of what they show. This doesn’t mean you should publish every single pilot experiment you ever did that didn’t work. (We can talk about what that does and doesn’t mean another time. But you know how life is: sometimes you think you have some great idea only to realise that it makes no sense at all when you actually try it in practice. Or maybe that’s just me? :P). Even with completed experiments you probably shouldn’t bother publishing if you realise afterwards that it is all artifactual or the result of some error. Hopefully you don’t have a lot of data sets like that though. So provided you did an experiment of suitable quality I believe you should publish it rather than hiding it in the proverbial file drawer. All scientific knowledge should be part of the scientific record.
I naively assumed that this view was self-evident and shared by almost everyone – but this clearly is not the case. Yet instead of sneering at such alternative opinions I believe we should understand why people hold them. There are reasonable arguments why one might wish to not publish every unclear finding. The person making this suggestion at our discussion said that it is difficult to interpret a null result, especially an assumed null result like this. If your original experiment O showed a significant effect supporting your hypothesis, but your replication experiment R does not, you cannot naturally conclude that the effect really doesn’t exist. For one thing you need to be more specific than that. If O showed a significant positive effect but R shows a significant negative one, this would be more consistent with the null hypothesis than if O is highly significant (p<10-30) and R just barely misses the threshold (p=0.051).
So let’s assume that we are talking about the former scenario. Even then things aren’t as straightforward, especially if R isn’t as exact a replication of O as you might have liked. If there is any doubt (and usually there is) that something could have been different in R than in O, this could be one of the hidden factors people always like to talk about in these discussions. Now you hopefully know your data better than anyone. If experiment O was largely exploratory and you tried various things to see what works best (dare we say p-hacking again?), then the odds are probably quite good that a significant non-replication in the opposite direction shows that the effect was just a fluke. But this is not a natural law but a probabilistic one. Youcannot everknow whether the original effect was real or not, especially not from such a limited data set of two non-independent experiments.
This is precisely why you should publish all results!
In my view, it is inherently dangerous if researchers decide for themselves which findings are important and which are not. It is not only a question of publishing only significant results. It applies much more broadly to the situation when a researcher publishes only results that support their pet theory but ignores or hides those that do not. I’d like to believe that most scientists don’t engage in this sort of behaviour – but sadly it is probably not uncommon. A way to counteract this is to train researchers to think of ways that test alternative hypotheses that make opposite predictions. However, such so-called “strong inference” is not always feasible. And even when it is, the two alternatives are not always equally interesting, which in turn means that people may still become emotionally attached to one hypothesis.
The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments. If you have defensible doubts that the data are actually rubbish (say, an fMRI data set littered with spikes, distortions, and excessive motion artifacts, or a social psychology study where you discovered posthoc that all the participants were illiterate and couldn’t read the questionnaires) then by all means throw them in the bin. But unless you have a good reason, you should never do this and instead add the results to the scientific record.
Now the suggestion during our debate was that such inconclusive findings clog up the record with unnecessary noise. There is an enormous and constantly growing scientific literature. As it is, it is becoming increasingly harder to separate the wheat from the chaff. I can barely keep up with the continuous feed of new publications in my field and I am missing a lot. Total information overload. So from that point of view the notion makes sense that only those studies that meet a certain threshold for being conclusive are accepted as part of the scientific record.
I can certainly relate to this fear. For the same reason I am sceptical of proposals that papers should be published before review and all decisions about the quality and interest of some piece of research, including the whole peer review process, should be entirely post-publication. Some people even seem to think that the line between scientific publication and science blog should be blurred beyond recognition. I don’t agree with this. I don’t think that rating systems like those used on Amazon or IMDb are an ideal way to evaluate scientific research. It doesn’t sound wise to me to assess scientific discoveries and medical breakthroughs in the same way we rank our entertainment and retail products. And that is not even talking about unleashing the horror of internet comment sections onto peer review…
Solving the (false) dilemma
I think this discussion is creating a false dichotomy. These are not mutually exclusive options. The solution to a low signal-to-noise ratio in the scientific literature is not to maintain publication bias of significant results. Rather the solution is to improve our filtering mechanisms. As I just described, I don’t think it will be sufficient to employ online shopping and social network procedures to rank the scientific literature. Even in the best-case scenario this is likely to highlight the results of authors who are socially dominant or popular and probably also those who are particularly unpopular or controversial. It does not necessarily imply that the highest quality research floats to the top [cue obvious joke about what kind of things float to the top…].
No, a high quality filter requires some organisation. I am convinced the scientific community can organise itself very well to create these mechanisms without too much outside influence. (I told you I’m Thoreau and Proudhon, not some insane Chaos Worshipper :P). We need some form of acceptance to the record. As I outlined previously, we should reorganise the entire publication process so that the whole peer-review process is transparent and public. It should be completely separate from journals. The journals’ only job should be to select interesting manuscripts and to publish short summary versions of them in order to communicate particularly exciting results to the broader community. But this peer-review should still involve a “pre-publication” stage – in the sense that the initial manuscript should not generate an enormous amount of undue interest before it has been properly vetted. To reiterate (because people always misunderstand that): the “vetting” should be completely public. Everyone should be able to see all the reviews, all the editorial decisions, and the whole evolution of the manuscript. If anyone has any particular insight to share about the study, by all means they should be free to do so. But there should be some editorial process. Someone should chase potential reviewers to ensure the process takes off at all.
The good news about all this is that it benefits you. Instead of weeping bitterly and considering to quit science because yet again you didn’t find the result you hypothesised, this just means that you get to publish more research. Taking the focus off novel, controversial, special, cool or otherwise “important” results should also help make the peer review more about the quality and meticulousness of the methods. Peer review should be about ensuring that the science is sound. In current practice it instead often resembles a battle with authors defending to the death their claims about the significance of their findings against the reviewers’ scepticism. Scepticism is important in science but this kind of scepticism is completely unnecessary when people are not incentivised to overstate the importance of their results.
Practice what you preach
I honestly haven’t followed all of the suggestions I make here. Neither have many other people who talk about improving science. I know of vocal proponents of preregistration who have yet to preregister any study of their own. The reasons for this are complex. Of course, you should “be the change you wish to see in the world” (I’m told Gandhi said this). But it’s not always that simple.
On the whole though I think I have published almost all of the research I’ve done. While I currently have a lot of unpublished results there is very little in the file drawer as most of these experiments have either been submitted or are being written up for eventual publication. There are two exceptions. One is a student project that produced somewhat inconclusive results although I would say it is a conceptual replication of a published study by others. The main reason we haven’t tried to publish this yet is that the student isn’t here anymore and hasn’t been in contact and the data aren’t that exciting to us to bother with the hassle of publication (and it is a hassle!).
The other data set is perhaps ironic because it is a perfect example of the scenario I described earlier. A few years ago when I started a new postdoc I was asked to replicate an experiment a previous lab member had done. For simplicity, let’s just call this colleague Dr Toffee. Again, they can identify themselves if they wish. The main reason for this was that reviewers had asked Dr Toffee to collect eye-movement data. So I replicated the original experiment but added eye-tracking. My replication wasn’t an exact one in the strictest terms because I decided to code the experimental protocol from scratch (this was a lot easier). I also had to use a different stimulus setup than the previous experiment as that wouldn’t have worked with the eye-tracker. Still, I did my best to match the conditions in all other ways.
My results were a highly significant effect in the opposite direction than the original finding. We did all the necessary checks to ensure that this wasn’t just a coding error etc. It seemed to be real. Dr Toffee and I discussed what to do about it and we eventually decided that we wouldn’t bother to publish this set of experiments. The original experiment had been conducted several years before my replication. Dr Toffee had moved on with their life. I on the other hand had done this experiment as a courtesy because I was asked to. It was very peripheral to my own research interests. So, as in the other example, we both felt that going through the publication process would have been a fairly big hassle for very little gain.
Now this is bad. Perhaps there is some other poor researcher, a student perhaps, who will do a similar experiment again and waste a lot of time on testing the hypothesis that, at least according to our incoherent results, is unlikely to be true. And perhaps they will also not publish their failure to support this hypothesis. The circle of null results continues…
But you need to pick your battles. We are all just human beings and we do not have unlimited (research) energy. With both of these lacklustre or incoherent results I mentioned (and these are literally the only completed experiments we haven’t at least begun to write up), it seems like a daunting task to undergo the pain of submission->review->rejection->repeat that simply doesn’t seem worth it.
So what to do? Well, the solution is again what I described. The very reason the task of publishing these results isn’t worth our energy is everything that is wrong with the current publication process! In my dream world in which I can simply write up a manuscript formatted in a way that pleases me and then upload this to the pre-print peer-review site my life would be infinitely simpler. No more perusing dense journal websites for their guide to authors or hunting for the Zotero/Endnote/Whatever style to format the bibliography. No more submitting your files to one horribly designed, clunky journal website after another, checking the same stupid tick boxes, adding the same reviewer suggestions. No more rewriting your cover letters by changing the name of the journal. Certainly for my student’s project, it would not be hard to do as there is already a dissertation that can be used as a basis for the manuscript. Dr Toffee’s experiment and its contradictory replication might require a bit more work – but to be fair even there is already a previous manuscript. So all we’d need to add would be the modifications of the methods and the results of my replication. In a world where all you need to do is upload the manuscript and address some reviewers’ comments to ensure the quality of the science this should be fairly little effort. In turn it would ensure that the file drawer is empty and we are all much more productive.
This world isn’t here yet but there are journals that will allow something that isn’t too far off from that, namely F1000Research and PeerJ (and the Winnower also counts although the content there seems to be different and I don’t quite know how much review editing happens there). So, maybe I should email Dr Toffee now…
(* In case you didn’t get this from the previous 2700ish words: the answer to this question is unequivocally “No.”)
Because it’s so much more fun than the things I should really be doing (correcting student dissertations and responding to grant reviews) I read a long blog post entitled “Science isn’t broken” by Christie Aschwanden. In large parts this is a summary of the various controversies and “crises” that seem to have engulfed scientific research in recent years. The title is a direct response to an event I participated in recently at UCL. More importantly, I think it’s a really good read so I recommend it.
This post is a quick follow-up response to the general points raised there. As I tried to argue (probably not very coherently) at that event, I also don’t think science is broken. First of all, probably nobody seriously believes that the lofty concept of science, the scientific method (if there is one such thing), can even be broken. But even in more pragmatic terms, the human aspects of how science works are not broken either. My main point was that the very fact we are having these kinds of discussions, about how scientific research can be improved, is direct proof that science is in fact very healthy. This is what self-correction looks like.
If anything, the fact that there has been a recent surge of these kinds of debates shows that science has already improved a lot recently. After decades of complacency with the status quo there now seems to be real energy afoot to effect some changes. However, it is not the first time this happened (for example, the introduction of peer review would have been a similarly revolutionary time) and it will not be the last. Science will always need to be improved. If some day conventional wisdom were that our procedure is now perfect, that it cannot be improved anymore, that would be a tell-tale sign for me that I should do something else.
So instead of fretting over whether science is “broken” (No, it isn’t) or even whether it needs improvement (Yes, it does), what we should be talking about is specifically what really urgently needs improvement. Here is my short list. I am not proposing many solutions (except for point 1). I’d be happy to hear suggestions:
I. Publishing and peer review
The way we publish and review seriously needs to change. We are wasting far too much time on trivialities instead of the science. The trivialities range from reformatting manuscripts to fit journal guidelines and uploading files on the practical side to chasing impact factors and “novel” research on the more abstract side. Both hurt research productivity although in different ways. I recently proposed a solution that combines some of the ideas by Dorothy Bishop and Micah Allen (and no doubt many others).
II. Post-publication review
Related to this, the way we evaluate and discuss published science needs to change, too. We need to encourage more post-publication review. This currently still doesn’t happen as most studies never receive any post-pub review or get commented on at all. Sure, some (including some of my own) probably just don’t deserve any attention, but how will you know unless somebody tells you the study even exists? Many precious gems will be missed that way. This has of course always been the case in science but we should try to minimise that problem. Some believe post-publication review is all we will ever need but unless there are robust mechanisms to attract reviewers to new manuscripts besides the authors’ fame, (un-)popularity, and/or their social media presence – none of which are good scientific arguments – I can’t see how a post-pub only system can change this. On this note I should mention that Tal Yarkoni, with whom I’ve had some discussions about this issue, wrote an article about this which presents some suggestions. I am not entirely convinced of the arguments he makes for enhancing post-publication review but I need more time to respond to this in detail. So I will just point this out for now to any interested reader.
III. Research funding and hiring decisions
Above all, what seriously needs to change is how we allocate research funds and how we make hiring decisions. The solution to that probably goes hand in hand with solving the other two points, but I think it also requires direct action now in the absence of good solutions for the other issues. We must stop judging grant and job applicants based on impact factors or h-indeces. This is certainly more easily done for job applications than for grant decisions as in the latter the volume of applications is much greater – and the expertise of the panel members in judging the applications is lower. But it should be possible to reduce the reliance on metrics and ratings – even newer, more refined ones. Also grant applications shouldn’t be killed by a single off-hand critical review comment. Most importantly, grants shouldn’t all be written in a way that devalues exploratory research (by pretending to have strong hypotheses when you don’t) or – even worse – by pretending that the research you already conducted and are ready to publish is a “preliminary pilot data set.” For work that actually is hypothesis driven I quite like Dorothy Bishop’s idea that research funds could be obtained at the pre-registration stage when the theoretical background and experimental design have been established but before data collection commences. Realistically, this is probably more suitable for larger experimental programs than for every single study. But then again, encouraging larger, more thorough, projects may in fact be a good thing.
A few months ago a study from EJ Wagenmakers’ lab (Boekel et al. in Cortex) failed to replicate 17 structural brain-behaviour correlations reported in the published literature. The study was preregistered by uploading the study protocol to a blog and so was what Wagenmakers generally refers to as “purely confirmatory“. As Wagenmakers is also a vocal proponent of Bayesian inferential methods, they used a one-tailed Bayesian hypothesis tests to ask whether their replication evidence supported the original findings. A lot has already been written about the Boekel study and I was previously engaged in a discussion on it. Therefore, in the interest of brevity (and thus the time Alexander Etz’s needs to spend on reading it :P) I will not cover all the details again but cut right to the case (It is pretty long anyway despite by earlier promises…)
Ryota Kanai, author of several of the results Boekel et al. failed to replicate, has now published a response in which he reanalyses their replication data. He shows that at least one finding (a correlation between grey matter volume in the left SPL and a measure of distractibility as quantified by a questionnaire) replicates successfully if the same methods as his original study are used. In fact, while Kanai does not report these statistics, using the same Bayesian replication test for which Boekel reported “anecdotal” evidence* for the null hypothesis (r=0.22, BF10=0.8), Kanai’s reanalysis reveals “strong” evidence for the alternative hypothesis (r=0.48, BF10=28.1). This successful replication is further supported by a third study that replicated this finding in an independent sample (albeit with some of the same authors as the original study). Taken together this suggests that at least for this finding, the failure to replicate may be due to methodological differences rather than that the original result was spurious.
Now, disagreements between scientists are common and essential to scientific progress. Replication is essential for healthy science. However, I feel that these days, as a field, psychology and neuroscience researchers are going about it in the wrong way. To me this case is a perfect illustration of these problems. In my next post I will summarise this one in a positive light by presenting ten simple rules for a good replication effort (and – hand on my heart – that one will be short!)
1. No such thing as “direct replication”
Recent years have seen the rise of numerous replication attempts with a particular emphasis on “direct” replications, that is, the attempt to exactly reproduce the experimental conditions that generated the original results. This is in contrast to “conceptual” replications in which a new experiment follows the spirit of a previous one but the actual parameters may be very different. So for instance a finding that exposing people to a tiny picture of the US flag influences their voting behaviour months in the future could be interpreted as conceptually replicating the result that people walk more slowly when they were primed with words describing the elderly.
However, I believe this dichotomy is false. The “directness” of a replication attempt is not categorical but exists on a gradual spectrum. Sure, the examples of conceptual replications from the social priming literature are quite distinct from Boekel’s attempt to replicate the brain-behaviour correlations or all the other Many Labs projects currently being undertaken with the aim to test (or disprove?) the validity of social psychology research.
However, there is no such thing as a perfectly direct replication. The most direct replication would be an exact carbon copy of the original, with the same participants, tested at the same time in the same place under the exact same conditions. This is impossible and nobody actually wants that because it would be completely meaningless other than testing just how deterministic our universe really is. What people mean when they talk about direct replications is that they match the experimental conditions reasonably well but use an independent sample of participants and (ideally) independent experimenters. Just how “direct” the replication is depends on how closely matched the experimental parameters are. By that logic, I would call the replication attempt of Boekel et al. less direct than say Wagenmakers et al’s replication of Bem’s precognition experiments. Boekel’s experiments were not matched at least with those by Kanai on a number of methodological points. However, even for the precognition replication Bem challenged Wakenmakers** on the directness of their methods because his replication attempt did not use the same software and stimuli as the original experiment.
Controversies like this reveal several issues. While you can strive to match the conditions of an original experiment as closely as possible, there will always be discrepancies. Ideally the original authors and the “replicators”*** can reach a consensus as to whether or not the discrepancies should matter. However, even if they do, this does not mean that they are unimportant. Saying that “original authors agreed to the protocol” means that a priori they made the assumption that methodological differences are insignificant. This does not mean that this assumption is correct. In the end this is an empirical question.
Discussions about failed replications are often contaminated with talk about “hidden moderators”, that is, unknown factors and discrepancies between the original experiment and the replication effort. As I pointed out under the guise of my alter ego****, I have little patience for this argument. It is counter-productive because there are always unknown factors. Saying that unknown factors can explain failures to replicate is an unfalsifiable hypothesis and a truism. The only thing that should matter in this situation is empirical evidence for additional factors. If you cannot demonstrate that your result hinges on a particular factor, this argument is completely meaningless. In the case of Bem’s precognition experiments, this could have been done by conducting an explicit experiment that compares the use of his materials with those used by Wagenmakers, ideally in the same group of participants. However, in the case of these brain-behaviour correlations this is precisely what Kanai did in his reply: he reanalysed Boekel’s data using the methods he had used originally and he found a different result. Importantly, this does not necessarily prove that Kanai’s theory about these results is correct. However, it clearly demonstrates that the failure to replicate was due to another factor that Boekel et al. did not take into account.
2. Misleading dichotomy
I also think the dichotomy between direct and conceptual replication is misleading. When people conduct “conceptual” replications the aim is different but equally important: direct replications (in so far that they exist) can test whether specific effects are reproducible. Conceptual replications are designed to test theories. Taking again the elderly-walking-speed and voting-behaviour priming examples from above, whether or not you believe that such experiments constitute compelling evidence for this idea, they are both experiments aiming to test the idea that subtle (subconscious?) information can influence people’s behaviour.
There is also a gradual spectrum for conceptual replication but here it depends on how general the overarching theory is that the replication seeks to test. These social priming examples clearly test a pretty diffuse theory of subconscious processing. By the same logic one could say that all of the 17 results scrutinised by Boekel test the theory that brain structure shares some common variance with behaviour. This theory is not only vague but so generic that it is almost meaningless. If you honestly doubt that there are any structural links between brain and behaviour, may I recommend checking some textbooks on brain lesions or neurodegenerative illness in your local library?
A more meaningful conceptual replication would be to show that the same grey matter volume in the SPL not only correlates with a cognitive failure questionnaire but with other, independent measures of distractibility. You could even go a step further and show that this brain area is somehow causally related to distraction. In fact, this is precisely what Kanai’s original study did.
I agree that replicating actual effects (i.e. what is called “direct” replication) is important because it can validate the existence of previous findings and – as I described earlier – help us identify the factors that modulate it. You may however also consider ways to improve your methodology. A single replication with a demonstrably better method (say, better model fits, higher signal-to-noise ratios, or more precise parameter estimates) is worth a 100 direct replications from a Many Labs project. In any of these cases, the directness of your replication will vary.
In the long run, however, conceptual replication that tests a larger overarching theory is more important than showing that a specific effect exists. The distinction between these two is very blurred though. It is important to know what factors modulate specific findings to derive a meaningful theory. Still, if we focus too much on Many Labs direct replication efforts, science will slow down to a snail’s pace and waste an enormous amount of resources (and taxpayer money). I feel that these experiments are largely designed to deconstruct the social priming theory in general. And sure, if the majority of these findings fail to replicate in repeated independent attempts, perhaps we can draw the conclusion that current theory is simply wrong. This happens a lot in science – just look at the history of phrenology or plate tectonics or our model of the solar system.
However, wouldn’t it be better to replace subconscious processing theory with a better model that actually describes what is really going on than to invest years of research funds and effort to prove the null hypothesis? As far as I can tell, the current working theory about social priming by most replicators is that social priming is all about questionable research practices, p-hacking, and publication bias. I know King Ioannidis and his army of Spartans show that the situation is dire***** – but I am not sure it is realistically that dire.
3. A fallacious power fallacy
Another issue with the Boekel replications, which is also discussed in Kanai’s response, is that the sample size they used was very small. For the finding that Kanai reanalysed the sample size was only 36. Across the 17 results they failed to replicate, their sample size ranged between 31-36. This is in stark contrast with the majority of the original studies many of which used samples well above 100. Only for one of the replications, which was of one of their own findings, Boekel et al. used a sample that was actually larger (n=31) than that in the original study (n=9). It seems generally accepted that larger samples are better, especially for replication attempts. A recent article recommended a sample size for replications two and a half times larger than the original. This may be a mathematical rule of thumb but it is hardly realistic, especially for neuroimaging experiments.
Thus I can understand why Boekel et al. couldn’t possibly have done their experiment on hundreds of participants. However, at the very least you should think that a direct replication effort should at least try to match the sample of the original study not one that is four times smaller. In our online discussions Wagenmakers explained the small sample by saying that they “simply lacked the financial resources” to collect more data. I do not think this is a very compelling argument. Using the same logic I could build a lego version of the Large Hadron Collider in my living room but fail to find the Higgs boson – only to then claim that my inadequate methodology was due to the lack of several billion dollars on my bank account******.
I must admit I sympathise a little with Wagenmakers here because it isn’t like I never had to collect more data for an experiment than I had planned (usually this sort of optional stopping happens at the behest of anonymous peer reviewers). But surely you can’t just set out to replicate somebody’s research, using a preregistered protocol no less, with a wholly inadequate sample size? As a matter of fact,their preregistered protocol states the structural data for this project (which is the expensive part) had already been collected previously and that the maximum sample of 36 was pre-planned. While they left “open the possibility of testing additional participants” they opted not to do so even though the evidence for half of the 17 findings remained inconclusively low (more on this below). Presumably this was as they say because they ran “out of time, money, or patience.”
In the online discussion Wagenmakers further states that power is a pre-experimental concept and refers to another publications by him and others in which they describe a “power fallacy.” I hope I am piecing together their argument accurately in my own head. Essentially statistical power tells you how probable it is that you can detect evidence for a given effect with your planned sample size. It thus quantifies the probabilities across all possible outcomes given these parameters. I ran a simulation to do this for the aforementioned correlation between left SPL grey matter and cognitive failure questionnaire scores. So I drew 10,000 samples of 36 participants each from a bivariate Gaussian distribution with a correlation of rho=0.38 (i.e. the observed correlation coefficient in Kanai’s study). I also repeated this for the null hypothesis so I drew similar samples from an uncorrelated Gaussian distribution. The histograms in the figure below show the distributions of the 10,000 Bayes factors calculated using the same replication test used by Boekel et al. for the alternative hypothesis in red and the null hypothesis in blue.
Out of those 10,000 simulations in the red curve only about 62% would pass the criterion for “anecdotal” evidence of BF10=3. Thus even if the effect size originally reported by Kanai’s study had been a perfect estimate of the true population effect (which is highly improbable) only in somewhat less than two thirds of replicate experiments should you expect conclusive evidence supporting the alternative hypothesis. The peak of the red distribution is in fact very close to the anecdotal criterion. With the exception of the study by Xu et al. (which I am in no position to discuss) this result was in fact one of the most highly powered experiments in Boekel’s study: as I showed in the online discussion the peaks of expected Bayes factors of the other correlations were all below the anecdotal criterion. To me this suggests that the pre-planned power of these replication experiments was wholly insufficient to give the replication a fighting chance.
Now, Wagenmakers’ reasoning of the “power fallacy” however is that after the experiment is completed power is a meaningless concept. It doesn’t matter what potential effect sizes (and thus Bayesian evidence) one could have gotten if one repeated the experiment infinitely. What matters is the results and evidence they did find. It is certainly true that a low-powered experiment can produce conclusive evidence in favour of a hypothesis – for example the proportion of simulations at the far right end of the red curve would very compellingly support H1 while those simulations forming the peak of the blue curve would afford reasonable confidence that the null hypothesis is true. Conversely, a high-powered experiment can still fail to provide conclusive evidence. This essentially seems to be Wagenmakers’ argument of the power fallacy: just because an experiment had low power doesn’t necessarily mean that its results are uninterpretable.
However, in my opinion this argument serves to obfuscate the issue. I don’t believe that Wagenmakers is doing this on purpose but I think that he has himself fallen victim to a logical fallacy. It is a non-sequitur. While it is true that low-powered experiments can produce conclusive evidence, this does not make the evidence conclusive. In fact, it is the beauty of Bayesian inference that it allows quantification of the strength of evidence. The evidence Boekel et al. observed in was inconclusive (“anecdotal”) in 9 of the 17 replications. Only in 3 the evidence for the null hypothesis was anywhere close to “strong” (i.e. below 1/10 or very close to it).
Imagine you want to test if a coin is biased. You flip it once and it comes up heads. What can we conclude from this? Absolutely nothing. Even though the experiment has been completed it was obviously underpowered. The nice thing about Bayesian inference is that it reflects that fact.
4. Interpreting (replication) evidence
You can’t have it both ways. You either take Bayesian inference to the logical conclusion and interpret the evidence you get according to Bayesian theory or you shouldn’t use it. Bayes factor analysis has the potential to be a perfect tool for statistical inference. Had Boekel et al. observed a correlation coefficient near 0 in the replication of the distractibility correlation they would have been right to conclude (in the context of their test) that the evidence supports the null hypothesis with reasonable confidence.
Now a close reading of Boekel’s study shows that the authors were in fact very careful in how they worded the interpretation of their results. They say that they “were unable to successfully replicate any of these 17 correlations”. This is entirely correct in the context of their analysis. What they do not say, however, is that they were also unable to refute the previously reported effects even though this was the case for over half of their results.
Unfortunately, this sort of subtlety is entirely lost on most people. The reaction of many commenters on the aforementioned blog post, on social media, and in personal communications was to interpret this replication study as a demonstration that these structural brain-behaviour correlations have been conclusively disproved. This is in spite of the statement in the actual article that “a single replication cannot be conclusive in terms of confirmation or refutation of a finding.” On social media I heard people say that “this is precisely what we need more of.” And you can literally feel the unspoken, gleeful satisfaction of many commenters that yet more findings by some famous and successful researchers have been “debunked.”
Do we really need more low-powered replication attempts and more inconclusive evidence? As I described above, a solid replication attempt can actually inform us about the factors governing a particular effect, which in turn can help us formulate better theories. This is what we need more of. We need more studies that test assumptions but that also take all the available evidence into account. Many of these 17 brain-behaviour correlation results here originally came with internal replications in the original studies. As far as I can tell these were not incorporated in Boekel’s analysis (although they mentioned them). For some of the results independent replications – or at least related studies – had already been published and it seems odd that Boekel et al. didn’t discuss at least those that had already been published months earlier.
Also some results, like Kanai’s distractibility correlation, were accompanied in the original paper by additional tests of the causal link between the brain area and behaviour. In my mind, from a scientific perspective it is far more important to test those questions in detail rather than simply asking whether the original MRI results can be reproduced.
5. Communicating replication efforts
I think there is also a more general problem with how the results of replication efforts are communicated. Replication should be a natural component of scientific research. All too often failed replications result in mudslinging contests, heated debates, and sometimes in inappropriate comparisons of replication authors with video game characters. Some talk about how the reputation of the original authors is hurt by failed replication.
It shouldn’t have to be this way. Good scientists also produce non-replicable results and even geniuses can believe in erroneous theories. However, the way our publishing and funding system works as well as our general human emotions predispose us to having these unfortunate disagreements.
I don’t think you can solely place the blame for such arguments on the authors of the original studies. Because scientists are human beings the way you talk to them influences how they will respond. Personally I think that the reports of many high profile replication failures suffer from a lack of social awareness. In that sense the discussion surrounding the Boekel replications has actually been very amicable. There have been far worse cases where the whole research programs of some authors have been denigrated and ridiculed on social media, sometimes while the replication efforts were still on-going. I’m not going to delve into that. Perhaps one of the Neuroscience Devils wants to pick up that torch in the future.
However, even the Boekel study shows how this communication could have been done with more tact. The first sentences of the Boekel article read as follows:
“A recent ‘crisis of confidence’ has emerged in the empirical sciences. Several studies have suggested that questionable research practices (QRPs) such as optional stopping and selective publication may be relatively widespread. These QRPs can result in a high proportion of false-positive findings, decreasing the reliability and replicability of research output.”
I know what Boekel et al. are trying to say here. EJ Wagenmakers has a declared agenda to promote “purely confirmatory” research in which experimental protocols are preregistered. There is nothing wrong with this per se. However, surely the choice of language here is odd? Preregistration is not the most relevant part about the Boekel study. It could have been done without it. It is fine to argue for why it is necessary in the article, but to actually start the article with a discussion of the replication crisis in the context of questionable research practices is very easy to be (mis?-)interpreted as an accusation. Whatever the intentions may have been, starting the article in this manner immediately places a spark of doubt in the reader’s mind and primes them to consider the original studies as being of a dubious nature. In fact, in the online debate Wagenmakers went a step further to suggest (perhaps somewhat tongue-in-cheek) that:
“For this particular line of research (brain-behavior correlations) I’d like to suggest an exploration-safeguard principle (ESP): after collecting the imaging data, researchers are free to analyze these data however they see fit. Data cleaning, outlier rejection, noise reduction: this is all perfectly legitimate and even desirable. Crucially, however, the behavioral measures are not available until after completion of the neuroimaging data analysis. This can be ensured by collecting the behavioral data in a later session, or by collaborating with a second party that holds the behavioral data in reserve until the imaging analysis is complete. This kind of ESP is something I can believe in.”
This certainly sounds somewhat accusatory to me. Quite frankly this is a bit offensive. I am all in favour of scientific skepticism but this is not the same as baseless suspicion. Having been on the receiving end of a particularly bad case of reviewer 2 once who made similar unsubstantiated accusations (and in fact ignored evidence to the contrary) I can relate to people who would be angered by that. For one thing such procedures are common in many labs conducting experiments like this. Having worked with Ryota Kanai in the past I have a fairly good idea of the meticulousness of his research. I also have great respect for EJ Wagenmakers and I don’t think he actually meant to offend anyone. Still, I think it could easily happen with statements like this and I think it speaks for Kanai’s character that he didn’t take offense here.
There is a better way. This recently published failure to replicate a link between visually induced gamma oscillation frequency and resting occipital GABA concentration is a perfect example of a well-written replication failure. There is no paranoid language about replication crises and p-hacking but a simple, factual account of the research question and the results. In my opinion this exposition certainly facilitated the rather calm reaction to this publication.
6. Don’t hide behind preregistration
Of course, the question about optional stopping and outcome-dependent analysis (I think that term was coined by Tal Yarkoni) could be avoided by preregistering the experimental protocols (in fact at least some of these original experiments were almost certainly preregistered in departmental project reviews). As opposed to what some may think, I am not opposed to preregistration as such. In fact, I fully intend to try it.
However, there is a big problem with this, which Kanai also discusses in his response. As a peer reviewer, he actually recommended Boekel et al. to use the same analysis pipeline he employed now to test for the effects. The reason Boekel et al. did not do this is that these methods were not part of the preregistered protocol. However, this did not stop them from employing other non-registered methods, which they report as exploratory analyses. In fact, we are frequently told that pre-registration does not preclude exploration. So why not here?
Moreover, preregistration is in the first instance designed to help authors control the flexibility of their experimental procedure. It should not be used as a justification to refuse performing essential analyses when reviewers ask for them. In this case, a cynic might say that Boekel et al. in fact did these analyses and chose not to report them because the results were inconsistent with the message they wanted to argue. Now I do not believe this to be the case but it’s an example of how unfounded accusations can go both ways in these discussions.
If this is how preregistration is handled in the future, we are in danger of slowing down scientific progress substantially. If Boekel et al. had performed these additional analyses (which should have been part of the originally preregistered protocol in the first place), this would have saved Kanai the time to do them himself. Both he and Boekel et al. could have done something more productive with their time (and so could I, for that matter :P).
It doesn’t have to go this way but we must be careful. If we allow this line of reasoning with preregistration, we may be able to stop the Texas sharpshooter from bragging but we will also break his rifle. It will then take much longer and more ammunition to finally hit the bulls-eye than is necessary.
*) I actually dislike categorical labels for Bayesian evidence. I don’t think we need them.
**) This is a pre-print manuscript. It keeps changing with on-going peer review so this statement may no longer be true when you read this.
***) Replicators is a very stupid word but I can’t think of a better, more concise one.
****) Actually this post was my big slip-up as Devil’s Neuroscientist. In that one a lot of Sam’s personality shown through, especially in the last paragraph.
*****) I should add that I am merely talking about the armies of people pointing out the proneness of false positives. I am not implying that all these researchers I linked to here agree with one another.
******) To be fair, I probably wouldn’t be able to find the Higgs boson even if I had the LHC.
I decided to respond now before I get inundated with the next round of overdue work I need to do this week… I was going to wait until Chris’ response as I think you will probably overlap a bit but there are a lot of deadlines and things to do, so now is a better time. I also decided to write my reply as a post because it is a bit long for a comment and others may find it interesting.
I think most of your answers illustrate how we all miss each other’s points a little. I am not talking about what RR and prereg are like right now. Any evidence we have about it now is confounded by the fact that it is new and that the people trying it are probably for the most part proponents of the approach. Most of the points I raised (except perhaps the last one) are issues that really only come into play once the approach has become normalised, when it is commonplace at many journals, and it stopped being a measure to improve science but just how it works – so a bit like how standard peer review is now (and you know how much people complain about that).
DB: Nope. You have to give a comprehensive account of what you plan, thinking through every aspect of rationale, methods and analysis: Cortex really doesn’t want to publish anything flawed and so they screw you down on the details. DB: Why any more so than for other publication methods? I really find this concern quite an odd one.
I agree that detailed review is key but the same could be said about the standard system. I don’t buy that author reputation isn’t going to influence judgements there. Like I am sure most of us, I always try my best not to be influenced by it, but I think we’re kidding ourselves if we think we’re perfectly unbiased. If you get a somewhat lacklustre manuscript to review, you will almost inevitably respond better to that author with a proven track record in the field (who probably possesses great writing skills) compared to some nobodies you never heard of, especially if they’re failing to communicate their ideas well (e.g. because their native language isn’t English). Nevertheless the quality of their work could actual be equal.
Now I take your point that this is also an issue in the current system, but the difference is that RR stage 1 reviews are just about evaluating the idea and the design. I think you’re lacking some information that could actually help you make a more informed choice. And it would be very disturbing if we tell people what science they can or can’t do (in the good journals that have RRs) just because of factors like this.
DB: Well, for a start registered reports require you to have very high statistical power and so humungous great N. Most people who just happened to do a study won’t meet that criterion. Second, as Chris pointed out in his talk, if you submit a registered report, then it goes out to review, and the reviewers do what reviewers do, i.e. make suggestions for changes, new conditions, etc etc. They do also expect you to specify your analysis in advance: that is one of the important features of RR.
I think this isn’t really answering my question. It should be very easy to come up with a “highly powered experiment” if you already know the finally observed effect size :P. And as I said in my post, I think many outcome-dependent changes to the protocol are about the analysis not about the design. Again, my point is also that once RRs have become more normal and people have run a bit out of steam (so the review quality may suffer compared to now) it may be a fairly easy thing to do. I could also see there being hybrids (i.e. people have already collected a fair bit of “pilot” data and just add a bit more in the registered protocol.
But I agree that this is perhaps all a bit hypothetical. I was questioning the actual logic of the response to this criticism. In the end what matters though is how likely it is that people engage in that sort of behaviour. If pre-completed grant proposals are really as common as people claim I could see it happening – but that depends largely on how difficult it is compared to being honest. Perhaps you’re right and it’s just very unlikely.
DB: So you would be unlikely to get through the RR process even if you did decide to fake your time stamps (and let’s face it, if you’re going to do that, you are beyond redemption).
I’m sure we all agree on that but I wouldn’t put it past some people. I’ve seen cases where people threw away around a third of their data points because they didn’t like the results. I am not sure that fiddling with the time stamps (which may be easier than actively changing the date) is really all that much worse.
Of course, this brings us to another question in that nothing in RR or data sharing in general really stops people from excluding “bad” subjects. Again, of course this is not different from the status quo but my issue is that having preregistered and open experiments clearly bestows a certain value judgement for people (hell, the OSF actually operates a “badge” system!). So in a way a faked RR could end up being valued more than an honest well-done non-RR. That does bother me.
DB: Since you yourself don’t find this [people stealing my ideas from RRs] all that plausible, I won’t rehearse the reasons why it isn’t.
Again, I was mostly pointing out the holes in the logic here. And of course whether or not it is plausible, a lot of people are quite evidently afraid of what Chris called the “boogieman” of being scooped. My point was that to allay this fear pointing to Manuscript Received dates is not going to suffice. But we all seem to agree that scooping is an exaggerated problem. I think the best way to deal with this worry is to stop people from being afraid of the boogieman in the first place.
DB: Your view on this may be reinforced by PIs in your institution. However, be aware that there are some senior people who are more interested in whether your research is replicable than whether it is sexy. And who find the soundbite form of reporting required by Nature etc quite inadequate.
This seems a bit naive to me. It’s not just about what “some senior people” think. I can’t with all honesty say that these factors don’t play into grant and hiring decisions. I also think it is a bit hypocritical to advise junior researchers not to pursue a bit of high impact glory when our own careers are at least in part founded on that (although mine isn’t nearly as much as some other people’s ;)). I do advise people that just to chase high impact is a bad idea but that you should have a healthy selection of solid studies. But I can also tell from experience that a few high impact publications clearly open doors for you. Anyway, this is really a topic for a different day I guess.
My own view is that I would go for a registered report in cases where it is feasible, as it has three big benefits – 1) you get good peer review before doing the study, 2) it can be nice to have a guarantee of publication and 3) you don’t have to convince people that you didn’t make up the hypothesis after seeing the data. But where it’s not feasible, I’d go for a registered protocol on OSF which at least gives me (3).
I agree this is eminently sensible. I think the (almost) guaranteed publication is probably a quite convincing argument to many people. And by god I can say that I have in the past wished for (3) – oddly enough it’s usually the most hypothesis-driven research where (some) people don’t want to believe you weren’t HARKing…
I think this also underlines an important point. The whole prereg discussion far too often revolves around negative issues. The critics are probably partly to blame for it but I think in general you often hear it mentioned as a response to questionable research practices. But what this discussion suggests is that there are many positive aspects about prereg so rather than being a cure to an ailing scientific process, it can also be seen as a healthier way to do science.
Recently I participated in an event organised by PhD students in Experimental Psychology at UCL called “Is Science broken?”. It involved a lively panel discussion in which the panel members answered many questions from the audience about how we feel science can be improved. The opening of the event was a talk by Chris Chambers of Cardiff University about Registered Reports (RR), a new publishing format in which researchers preregister their introduction and methods sections before any data collection takes place. Peer review takes place in two stages: first, reviewers evaluate the appropriateness of the question and the proposed experimental design and analysis procedures, and then, after data collection and analysis have been completed and the results are known, peer review continues to finalise the study for publication. This approach is aimed to make scientific publishing independent from the pressure to get perfect results or changing one’s apparent hypothesis depending on the outcome of the experiments.
Chris’ talk was in large part a question and answer session for specific concerns with the RR approach that had been raised at other talks or in writing. Most of these questions he (and his coauthors) had already answered in a similar way in a published FAQ paper. However, it was nice to see him talk so passionately about this topic. Also speaking for myself at least I can say that seeing a person arguing their case is usually far more compelling than reading an article on it – even though the latter will in the end probably have a wider reach.
Here I want to raise some additional questions about the answers Chris (and others) have given to some of these specific concerns. The purpose in doing so is not to “bring about death by a thousand cuts” to the RR concept as Aidan Horner calls it. I completely agree with Aidan that many concerns people have with RR (and lighter forms of preregistration) are probably logistical. It may well be that some people just really want to oppose this idea and are looking for any little reason to use as an excuse. However, I think both sides of the debate about this issue have suffered from a focus on potentials rather than fact. We simply won’t know how much good or bad preregistration will do for science unless we try it. This seems to be a concept that everyone at the discussion was very much in agreement on and we all discussed ways in which we could actually assess the evidence for whether RRs improved science over the next few decades.
Therefore I want it to be clear that the points I raise are not an ideological opposition to preregistration. Rather they are some points where I found the answers Chris describe to be not entirely satisfying. I very much believe that preregistration must be tried but I want to provoke some thought about possible problems with it. The sooner we are aware of these issues, the sooner they can be fixed.
Wouldn’t reviewers rely even more on the authors’ reputation?
In the Stage 1 of an RR, when only the scientific question and experimental design are reviewed, reviewers have little to go on to evaluate the protocol. Provided that the logic of the question and the quality of the design are evident, they would hopefully be able to make some informed decisions about it. However, I think it is a bit naive to assume that the reputation of the authors isn’t going to influence the reviewers’ judgements. I have heard of many grant reviews asking questions as to whether the authors would be capable of pulling off some proposed research. There is an extensive research literature on how the evaluation of identical exam papers or job applications or whatnot can be influenced by factors like the subject’s gender or name. I don’t think simply saying “Author reputation is not among” the review criteria is enough of a safeguard.
I also don’t think that having a double-blind review system is necessarily a good way to protect against this. There have been wider discussions about the short-comings of double-blind reviews and this situation is no different. In many situations you could easily guess the authors’ identity by the experimental protocol alone. And double blind review suffers even more from one of the main problems with anonymous reviewers (which I generally support): when the reviewers guess the authors’ identities incorrectly that could have even worse consequences because their decision will be based on an incorrect assessment of the authors.
Can’t people preregister experiments they have already completed?
The general answer here is that this would constitute fraud. The RR format would also require time stamped data files and lab logs to guarantee that data were produced only after the protocol has been registered. Both of these points are undeniably true. However, while there may be such a thing as an absolute ethical ideal, in the end a lot of our ethics are probably governed by majority consensus. The fact that many questionable research practices are apparently so widespread is presumably just that: while most people deep down understand that these things are “not ideal”, they may nonetheless engage in them because they feel that “everybody else does it.”
For instance, I often hear that people submit grant proposals for experiments they have already completed although I have personally never seen this with any grant proposals. I have also heard that it is more common in the US perhaps but at least based on all the anecdotes it may generally be widespread. Surely this is also fraudulent but nevertheless people apparently do it?
Regarding time stamped data, I also don’t know if this is necessarily a sufficiently strong safeguard. For the most part, time stamps are pretty easy to “adjust”. Crucially, I don’t think many reviewers or post-publication commenters will go through the trouble of checking them. Faking time stamps is certainly deep into fraud territory but people’s ethical views are probably not all black and white. I could easily see some people bending the rules just that little, especially if preregistered studies become a new gold standard in the scientific community.
Now perhaps this is a bit too pessimistic a view of our colleagues. I agree that we probably should not exaggerate this concern. But given the concerns people have with questionable research practices now I am not entirely sure we can really just dismiss this possibility by saying that this would be fraud.
Finally, another answer to this concern is that preregistering your experiment after the fact would backfire because the authors could then not implement any changes suggested by reviewers in Stage 1. However, this only applies to changes in the experimental design, the stimuli or apparatus etc. The most confusing corners in the “garden of forking paths” are usually the analysis procedure, not the design. There are only so many ways to run a simple experiment and most minor changes suggested by a reviewer could easily be dismissed by authors. However, changes to the analysis approach could quite easily be implemented after the results are known.
Reviewers could steal my preregistered protocol and scoop me
I agree that this concern is not overly realistic. In fact, I don’t even believe the fear of being scooped is overly realistic. I’m sure it happens (and there are some historical examples) but it is far rarer than most people believe. Certainly it is rather unlikely for a reviewer to do this. For one thing, time is usually simply not on their side. There is a lot that could be written about the fear of getting scooped and I might do that at some future point. But this is outside the scope of this post.
Whatever its actual prevalence, the paranoia (or boogieman) of scooping is clearly widespread. Until we find a way to allay this fear I am not sure that it will be enough to tell people that the Manuscript Received date of a preregistered protocol would clearly show who had the idea sooner. First of all, the Received date doesn’t really tell you when somebody had an idea. The “scooper” could always argue that they had the idea before that date but only ended up submitting the study afterwards (and I am sure that actually happens fairly often without scooping).
More importantly though, one of the main reasons people are afraid of being scooped is the pressure to publish in high impact journals. Having a high impact publication has greater currency than the Received date of a RR in what is most likely a lower impact journal. I doubt many people would actually check the date unless you specifically point it out to them. We already now have a problem with people not reading the publications of job and grant applicants but relying on metrics like impact factors and h-indeces. I don’t easily see them looking through that information.
As a junior researcher I must publish in high impact journals
I think this is an interesting issue. I would love nothing more than if we could stop caring about who published what where. At the same time I think that there is a role for high impact journals like Nature, Science or Neuron (seriously folks, PNAS doesn’t belong in that list – even if you didn’t boycott it like me…). I would like the judgement of scientific quality and merit to be completely divorced from issues of sensationalism, novelty, and news that quite likely isn’t the whole story. I don’t know how to encourage that change though. Perhaps RRs can help with that but I am not sure they’re enough. Either way, it may be a foolish and irrational fear but I know that as a junior scientist I (and my postdocs and students) currently do seek to publish at least some (but not exclusively) “high impact” research to be successful. But I digress.
Chris et al. write that sooner or later high impact outlets will probably come on board with offering RRs. I don’t honestly see that happening, at least not without a much more wide-ranging change in culture. I think RRs are a great format for specialised journals to have. However, the top impact journals primarily exist for publishing exciting results (whatever that means). I don’t think they will be keen to open the floodgates for lots of submissions that aim to test exciting ideas but fail to deliver the results to match them. What I could see perhaps is a system in which a journal like Nature would review a protocol and accept to publish it in its flagship journal if the results are positive but in its lower-impact outlet (e.g. Nature Communications) if the results are negative. The problem with this idea is that it somehow goes against the egalitarian philosophy of the current RR proposals. The publication again would be dependent on the outcome of the experiments.
Registered Reports are incompatible with short student projects
After all the previous fairly negative points I think this one is actually about a positive aspect of science. For me this is actually one of the greatest concerns. In my mind this is a very valid worry and Chris and co acknowledge this also in their article. I think RRs would be a viable solution for experiments by a PhD student but for master students, who are typically around for a few months only, it is simply not very realistic to first submit a protocol and revising it over weeks and months of reviews before even collecting the first data.
A possible solution suggested for this problem is that you could design the experiments and have them approved by peer reviewers before the students commence. I think this is a terrible idea. For me perhaps the best part of supervising student projects in my lab is when we discuss the experimental design. The best students typically come with their own ideas and make critical suggestions and improvements to the procedure. Not only is this phase very enjoyable but I think designing good experiments is also one of the most critical skills for junior scientists to learn. By having the designs finalised before the students even step through the door of the lab would undermine that.
Perhaps for those cases it would make more sense to just use light preregistration, that is, uploading your protocol to a timestamped archive but without external review. But if in the future RR do become the new gold standard, I would worry that this denigrates the excellent research projects of many students.
As I said, these points are not meant to shoot down the concept of Registered Reports. Some of the points may not even be such enormous concerns at all. However, I hope that my questions provoke thought and that we can discuss ways to improve the concept further and find safeguards against these possible problems.
Sorry this post was very long as usual but there seems to be a lot to say. My next post though will be short, I promise! 😉