
Science is not broken – but these three things are

Because it’s so much more fun than the things I should really be doing (correcting student dissertations and responding to grant reviews), I read a long blog post entitled “Science isn’t broken” by Christie Aschwanden. In large part it is a summary of the various controversies and “crises” that seem to have engulfed scientific research in recent years. The title is a direct response to an event I participated in recently at UCL. More importantly, I think it’s a really good read, so I recommend it.

This post is a quick follow-up response to the general points raised there. As I tried to argue (probably not very coherently) at that event, I also don’t think science is broken. First of all, probably nobody seriously believes that the lofty concept of science, the scientific method (if there is one such thing), can even be broken. But even in more pragmatic terms, the human aspects of how science works are not broken either. My main point was that the very fact we are having these kinds of discussions, about how scientific research can be improved, is direct proof that science is in fact very healthy. This is what self-correction looks like.

If anything, the recent surge of these kinds of debates shows that science has already improved a lot. After decades of complacency with the status quo there now seems to be real energy afoot to effect some changes. However, it is not the first time this has happened (the introduction of peer review, for example, would have been a similarly revolutionary time) and it will not be the last. Science will always need to be improved. If some day conventional wisdom held that our procedure is now perfect, that it cannot be improved any more, that would be a tell-tale sign for me that I should do something else.

So instead of fretting over whether science is “broken” (No, it isn’t) or even whether it needs improvement (Yes, it does), what we should be talking about is specifically what really urgently needs improvement. Here is my short list. I am not proposing many solutions (except for point 1). I’d be happy to hear suggestions:

I. Publishing and peer review

The way we publish and review seriously needs to change. We are wasting far too much time on trivialities instead of the science. The trivialities range from reformatting manuscripts to fit journal guidelines and uploading files on the practical side to chasing impact factors and “novel” research on the more abstract side. Both hurt research productivity although in different ways. I recently proposed a solution that combines some of the ideas by Dorothy Bishop and Micah Allen (and no doubt many others).

II. Post-publication review

Related to this, the way we evaluate and discuss published science needs to change, too. We need to encourage more post-publication review. Currently this rarely happens: most studies never receive any post-publication review or comments at all. Sure, some (including some of my own) probably just don’t deserve any attention, but how will you know unless somebody tells you the study even exists? Many precious gems will be missed that way. This has of course always been the case in science, but we should try to minimise the problem. Some believe post-publication review is all we will ever need, but unless there are robust mechanisms to attract reviewers to new manuscripts besides the authors’ fame, (un-)popularity, and/or their social media presence – none of which are good scientific arguments – I can’t see how a post-pub-only system can change this. On this note I should mention that Tal Yarkoni, with whom I’ve had some discussions about this issue, wrote an article presenting some suggestions for enhancing post-publication review. I am not entirely convinced by his arguments, but I need more time to respond in detail, so for now I will simply point interested readers to it.

III. Research funding and hiring decisions

Above all, what seriously needs to change is how we allocate research funds and how we make hiring decisions. The solution probably goes hand in hand with solving the other two points, but I think it also requires direct action now, in the absence of good solutions for the other issues. We must stop judging grant and job applicants based on impact factors or h-indices. This is certainly easier for job applications than for grant decisions, as in the latter the volume of applications is much greater – and the expertise of the panel members in judging the applications is lower. But it should be possible to reduce the reliance on metrics and ratings – even newer, more refined ones. Grant applications also shouldn’t be killed by a single off-hand critical review comment. Most importantly, grants shouldn’t all be written in a way that devalues exploratory research (by pretending to have strong hypotheses when you don’t) or – even worse – by pretending that research you have already conducted and are ready to publish is a “preliminary pilot data set.” For work that actually is hypothesis-driven I quite like Dorothy Bishop’s idea that research funds could be awarded at the preregistration stage, when the theoretical background and experimental design have been established but before data collection commences. Realistically, this is probably more suitable for larger experimental programs than for every single study. But then again, encouraging larger, more thorough projects may in fact be a good thing.

Failed replication or flawed reasoning?

A few months ago a study from EJ Wagenmakers’ lab (Boekel et al. in Cortex) failed to replicate 17 structural brain-behaviour correlations reported in the published literature. The study was preregistered by uploading the study protocol to a blog and was thus what Wagenmakers generally refers to as “purely confirmatory”. As Wagenmakers is also a vocal proponent of Bayesian inferential methods, they used a one-tailed Bayesian hypothesis test to ask whether their replication evidence supported the original findings. A lot has already been written about the Boekel study and I was previously engaged in a discussion on it. Therefore, in the interest of brevity (and thus the time Alexander Etz needs to spend on reading it :P) I will not cover all the details again but cut right to the chase (it is pretty long anyway, despite my earlier promises…)

Ryota Kanai, author of several of the results Boekel et al. failed to replicate, has now published a response in which he reanalyses their replication data. He shows that at least one finding (a correlation between grey matter volume in the left SPL and a measure of distractibility quantified by a questionnaire) replicates successfully if the same methods as in his original study are used. In fact, while Kanai does not report these statistics, using the same Bayesian replication test for which Boekel reported “anecdotal” evidence* for the null hypothesis (r=0.22, BF10=0.8), Kanai’s reanalysis reveals “strong” evidence for the alternative hypothesis (r=0.48, BF10=28.1). This successful replication is further supported by a third study that replicated the finding in an independent sample (albeit with some of the same authors as the original study). Taken together, this suggests that at least for this finding the failure to replicate may be due to methodological differences rather than the original result being spurious.

Now, disagreements between scientists are common and essential to scientific progress. Replication is essential for healthy science. However, I feel that these days, as a field, psychology and neuroscience researchers are going about it in the wrong way. To me this case is a perfect illustration of these problems. In my next post I will summarise this one in a positive light by presenting ten simple rules for a good replication effort (and – hand on my heart – that one will be short!)

1. No such thing as “direct replication”

Recent years have seen the rise of numerous replication attempts with a particular emphasis on “direct” replications, that is, the attempt to exactly reproduce the experimental conditions that generated the original results. This is in contrast to “conceptual” replications in which a new experiment follows the spirit of a previous one but the actual parameters may be very different. So for instance a finding that exposing people to a tiny picture of the US flag influences their voting behaviour months in the future could be interpreted as conceptually replicating the result that people walk more slowly when they were primed with words describing the elderly.

However, I believe this dichotomy is false. The “directness” of a replication attempt is not categorical but lies on a continuum. Sure, the examples of conceptual replications from the social priming literature are quite distinct from Boekel’s attempt to replicate the brain-behaviour correlations, or from all the other Many Labs projects currently being undertaken with the aim to test (or disprove?) the validity of social psychology research.

However, there is no such thing as a perfectly direct replication. The most direct replication would be an exact carbon copy of the original, with the same participants, tested at the same time in the same place under the exact same conditions. This is impossible and nobody actually wants it, because it would be completely meaningless other than testing just how deterministic our universe really is. What people mean when they talk about direct replications is that they match the experimental conditions reasonably well but use an independent sample of participants and (ideally) independent experimenters. Just how “direct” the replication is depends on how closely matched the experimental parameters are. By that logic, I would call the replication attempt of Boekel et al. less direct than, say, Wagenmakers et al.’s replication of Bem’s precognition experiments. Boekel’s experiments did not match those of Kanai, at least, on a number of methodological points. However, even for the precognition replication Bem challenged Wagenmakers** on the directness of the methods, because the replication attempt did not use the same software and stimuli as the original experiment.

Controversies like this reveal several issues. While you can strive to match the conditions of an original experiment as closely as possible, there will always be discrepancies. Ideally the original authors and the “replicators”*** can reach a consensus as to whether or not the discrepancies should matter. However, even if they do reach one, this does not mean the discrepancies are unimportant. Saying that the original authors agreed to the protocol means that a priori they assumed the methodological differences were insignificant. It does not mean that this assumption is correct. In the end this is an empirical question.

Discussions about failed replications are often contaminated with talk about “hidden moderators”, that is, unknown factors and discrepancies between the original experiment and the replication effort. As I pointed out under the guise of my alter ego****, I have little patience for this argument. It is counter-productive because there are always unknown factors. Saying that unknown factors can explain failures to replicate is an unfalsifiable hypothesis and a truism. The only thing that should matter in this situation is empirical evidence for additional factors. If you cannot demonstrate that your result hinges on a particular factor, this argument is completely meaningless. In the case of Bem’s precognition experiments, this could have been done by conducting an explicit experiment that compares the use of his materials with those used by Wagenmakers, ideally in the same group of participants. However, in the case of these brain-behaviour correlations this is precisely what Kanai did in his reply: he reanalysed Boekel’s data using the methods he had used originally and he found a different result. Importantly, this does not necessarily prove that Kanai’s theory about these results is correct. However, it clearly demonstrates that the failure to replicate was due to another factor that Boekel et al. did not take into account.

2. Misleading dichotomy

I also think the dichotomy between direct and conceptual replication is misleading. When people conduct “conceptual” replications the aim is different but equally important: direct replications (insofar as they exist) can test whether specific effects are reproducible, while conceptual replications are designed to test theories. Taking again the elderly-walking-speed and voting-behaviour priming examples from above: whether or not you believe that such experiments constitute compelling evidence, both aim to test the idea that subtle (subconscious?) information can influence people’s behaviour.

There is also a spectrum for conceptual replication, but here the position depends on how general the overarching theory is that the replication seeks to test. The social priming examples clearly test a pretty diffuse theory of subconscious processing. By the same logic one could say that all of the 17 results scrutinised by Boekel test the theory that brain structure shares some common variance with behaviour. This theory is not only vague but so generic that it is almost meaningless. If you honestly doubt that there are any structural links between brain and behaviour, may I recommend checking some textbooks on brain lesions or neurodegenerative illness in your local library?

A more meaningful conceptual replication would be to show that the same grey matter volume in the SPL not only correlates with a cognitive failure questionnaire but with other, independent measures of distractibility. You could even go a step further and show that this brain area is somehow causally related to distraction. In fact, this is precisely what Kanai’s original study did.

I agree that replicating actual effects (i.e. what is called “direct” replication) is important because it can validate the existence of previous findings and – as I described earlier – help us identify the factors that modulate them. You may, however, also consider ways to improve your methodology. A single replication with a demonstrably better method (say, better model fits, higher signal-to-noise ratios, or more precise parameter estimates) is worth a hundred direct replications from a Many Labs project. In any of these cases, the directness of your replication will vary.

In the long run, however, conceptual replication that tests a larger overarching theory is more important than showing that a specific effect exists. The distinction between these two is very blurred though. It is important to know what factors modulate specific findings to derive a meaningful theory. Still, if we focus too much on Many Labs direct replication efforts, science will slow down to a snail’s pace and waste an enormous amount of resources (and taxpayer money). I feel that these experiments are largely designed to deconstruct the social priming theory in general. And sure, if the majority of these findings fail to replicate in repeated independent attempts, perhaps we can draw the conclusion that current theory is simply wrong. This happens a lot in science – just look at the history of phrenology or plate tectonics or our model of the solar system.

However, wouldn’t it be better to replace subconscious processing theory with a better model that actually describes what is really going on than to invest years of research funds and effort to prove the null hypothesis? As far as I can tell, the current working theory about social priming by most replicators is that social priming is all about questionable research practices, p-hacking, and publication bias. I know King Ioannidis and his army of Spartans show that the situation is dire***** – but I am not sure it is realistically that dire.

3. A fallacious power fallacy

Another issue with the Boekel replications, which is also discussed in Kanai’s response, is that their sample was very small. For the finding that Kanai reanalysed the sample size was only 36. Across the 17 results they failed to replicate, their sample sizes ranged from 31 to 36. This is in stark contrast with the majority of the original studies, many of which used samples well above 100. Only for one of the replications, which was of one of their own findings, did Boekel et al. use a sample that was actually larger (n=31) than that in the original study (n=9). It seems generally accepted that larger samples are better, especially for replication attempts. A recent article recommended a sample size for replications two and a half times larger than the original. This may be a sound mathematical rule of thumb but it is hardly realistic, especially for neuroimaging experiments.

Thus I can understand why Boekel et al. couldn’t possibly have done their experiment on hundreds of participants. However, you would at the very least expect a direct replication effort to try to match the sample of the original study, not use one that is four times smaller. In our online discussions Wagenmakers explained the small sample by saying that they “simply lacked the financial resources” to collect more data. I do not find this a very compelling argument. Using the same logic I could build a Lego version of the Large Hadron Collider in my living room and fail to find the Higgs boson – only to then claim that my inadequate methodology was due to the lack of several billion dollars in my bank account******.

I must admit I sympathise a little with Wagenmakers here, because it isn’t as if I never had to collect more data for an experiment than I had planned (usually this sort of optional stopping happens at the behest of anonymous peer reviewers). But surely you can’t just set out to replicate somebody’s research, using a preregistered protocol no less, with a wholly inadequate sample size? As a matter of fact, their preregistered protocol states that the structural data for this project (the expensive part) had already been collected previously and that the maximum sample of 36 was pre-planned. While they left “open the possibility of testing additional participants”, they opted not to do so, even though the evidence for half of the 17 findings remained inconclusive (more on this below). Presumably this was, as they say, because they ran “out of time, money, or patience.”

In the online discussion Wagenmakers further states that power is a pre-experimental concept and refers to another publication by him and others in which they describe a “power fallacy.” I hope I am piecing together their argument accurately in my own head. Essentially, statistical power tells you how probable it is that you can detect evidence for a given effect with your planned sample size. It thus quantifies the probabilities across all possible outcomes given these parameters. I ran a simulation to do this for the aforementioned correlation between left SPL grey matter and cognitive failure questionnaire scores: I drew 10,000 samples of 36 participants each from a bivariate Gaussian distribution with a correlation of rho=0.38 (i.e. the observed correlation coefficient in Kanai’s study). I then repeated this for the null hypothesis, drawing similar samples from an uncorrelated Gaussian distribution. The histograms in the figure below show the distributions of the 10,000 Bayes factors, calculated using the same replication test used by Boekel et al., for the alternative hypothesis in red and the null hypothesis in blue.

Histograms of Bayes factors in favour of alternative hypothesis (BF10) over 10,000 simulated samples of n=36 with rho=0.38 (red curve) and rho=0 (blue curve).
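For the curious, the gist of this simulation can be sketched in Python. To keep it self-contained I do not use the exact replication Bayes factor of Boekel et al.; as a simplified stand-in I compute a one-tailed correlation Bayes factor via the Fisher-z approximation with a uniform prior on rho (the function names and the prior are my own choices), so the precise numbers will differ somewhat from those in the figure:

```python
import numpy as np
from scipy.integrate import quad

def npdf(x, m, s):
    """Gaussian density (plain numpy, fast enough inside the integrator)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def bf10_correlation(r, n):
    """One-tailed Bayes factor for a positive correlation.
    Uses the Fisher-z approximation atanh(r) ~ N(atanh(rho), 1/(n-3)),
    with rho uniform over (0, 0.99) under H1. A simplified stand-in for
    the replication test used by Boekel et al."""
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    m1, _ = quad(lambda rho: npdf(z, np.arctanh(rho), se), 0.0, 0.99)
    m1 /= 0.99                      # normalise the uniform prior
    return m1 / npdf(z, 0.0, se)    # marginal under H1 / likelihood under H0

def simulate_bfs(rho, n=36, n_sims=2000, seed=1):
    """Bayes factors for repeated samples from a bivariate Gaussian with
    true correlation rho (the post used 10,000 simulations; fewer here
    to keep the sketch fast)."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    bfs = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        bfs[i] = bf10_correlation(np.corrcoef(x[:, 0], x[:, 1])[0, 1], n)
    return bfs

bf_h1 = simulate_bfs(0.38)  # "red curve": Kanai's reported effect is true
bf_h0 = simulate_bfs(0.0)   # "blue curve": the null is true
print(np.mean(bf_h1 > 3))   # proportion clearing the "anecdotal" criterion
```

With this simplified test, too, only roughly three in five simulated replications under the alternative clear the anecdotal criterion of BF10=3 at n=36, which is the essence of the argument below.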

Out of those 10,000 simulations in the red curve only about 62% would pass the criterion for “anecdotal” evidence of BF10=3. Thus, even if the effect size originally reported in Kanai’s study had been a perfect estimate of the true population effect (which is highly improbable), you should expect conclusive evidence supporting the alternative hypothesis in somewhat less than two-thirds of replication experiments. The peak of the red distribution is in fact very close to the anecdotal criterion. With the exception of the study by Xu et al. (which I am in no position to discuss), this result was one of the most highly powered experiments in Boekel’s study: as I showed in the online discussion, the peaks of the expected Bayes factors for the other correlations were all below the anecdotal criterion. To me this suggests that the pre-planned power of these replication experiments was wholly insufficient to give the replication a fighting chance.

Now, Wagenmakers’ reasoning behind the “power fallacy”, however, is that after the experiment is completed power is a meaningless concept. It doesn’t matter what potential effect sizes (and thus Bayesian evidence) one could have obtained if one repeated the experiment infinitely; what matters is the evidence one did find. It is certainly true that a low-powered experiment can produce conclusive evidence in favour of a hypothesis – for example, the proportion of simulations at the far right end of the red curve would very compellingly support H1, while the simulations forming the peak of the blue curve would afford reasonable confidence that the null hypothesis is true. Conversely, a high-powered experiment can still fail to provide conclusive evidence. This essentially seems to be Wagenmakers’ power-fallacy argument: just because an experiment had low power doesn’t necessarily mean that its results are uninterpretable.

However, in my opinion this argument serves to obfuscate the issue. I don’t believe that Wagenmakers is doing this on purpose, but I think that he has himself fallen victim to a logical fallacy: a non-sequitur. While it is true that low-powered experiments can produce conclusive evidence, this does not make the evidence they actually obtained conclusive. In fact, it is the beauty of Bayesian inference that it allows quantification of the strength of evidence. The evidence Boekel et al. observed was inconclusive (“anecdotal”) in 9 of the 17 replications. Only in 3 was the evidence for the null hypothesis anywhere close to “strong” (i.e. below 1/10 or very close to it).

Imagine you want to test if a coin is biased. You flip it once and it comes up heads. What can we conclude from this? Absolutely nothing. Even though the experiment has been completed it was obviously underpowered. The nice thing about Bayesian inference is that it reflects that fact.
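This can be made concrete with a toy calculation (my own illustration, not from the original discussion): compare a fair coin (H0: theta=0.5) against a coin of unknown bias (H1: theta uniform on [0,1]) after observing k heads in n flips.

```python
import math

def bf10_coin(k, n):
    """Bayes factor for k heads in n flips: H1 (bias theta uniform on
    [0,1]) versus H0 (fair coin, theta = 0.5). The binomial coefficient
    cancels between the two marginal likelihoods, leaving the Beta
    function B(k+1, n-k+1) under H1 against 0.5^n under H0."""
    m1 = math.exp(math.lgamma(k + 1) + math.lgamma(n - k + 1)
                  - math.lgamma(n + 2))
    m0 = 0.5 ** n
    return m1 / m0

print(bf10_coin(1, 1))     # 1.0 -- a single heads is perfectly uninformative
print(bf10_coin(60, 100))  # even 60/100 heads yields only weak evidence
```

A single flip gives a Bayes factor of exactly 1: the data favour neither hypothesis, which is precisely the “underpowered but completed” situation described above.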

4. Interpreting (replication) evidence

You can’t have it both ways. You either take Bayesian inference to the logical conclusion and interpret the evidence you get according to Bayesian theory or you shouldn’t use it. Bayes factor analysis has the potential to be a perfect tool for statistical inference. Had Boekel et al. observed a correlation coefficient near 0 in the replication of the distractibility correlation they would have been right to conclude (in the context of their test) that the evidence supports the null hypothesis with reasonable confidence.

Now a close reading of Boekel’s study shows that the authors were in fact very careful in how they worded the interpretation of their results. They say that they “were unable to successfully replicate any of these 17 correlations”. This is entirely correct in the context of their analysis. What they do not say, however, is that they were also unable to refute the previously reported effects even though this was the case for over half of their results.

Unfortunately, this sort of subtlety is entirely lost on most people. The reaction of many commenters on the aforementioned blog post, on social media, and in personal communications was to interpret this replication study as a demonstration that these structural brain-behaviour correlations have been conclusively disproved. This is in spite of the statement in the actual article that “a single replication cannot be conclusive in terms of confirmation or refutation of a finding.” On social media I heard people say that “this is precisely what we need more of.” And you can practically feel the unspoken, gleeful satisfaction of many commenters that yet more findings by famous and successful researchers have been “debunked.”

Do we really need more low-powered replication attempts and more inconclusive evidence? As I described above, a solid replication attempt can actually inform us about the factors governing a particular effect, which in turn can help us formulate better theories. This is what we need more of. We need more studies that test assumptions but also take all the available evidence into account. Many of the 17 brain-behaviour correlations came with internal replications in the original studies. As far as I can tell these were not incorporated in Boekel’s analysis (although they were mentioned). For some of the results independent replications – or at least related studies – had already been published, and it seems odd that Boekel et al. didn’t discuss at least those that had appeared months earlier.

Also some results, like Kanai’s distractibility correlation, were accompanied in the original paper by additional tests of the causal link between the brain area and behaviour. In my mind, from a scientific perspective it is far more important to test those questions in detail rather than simply asking whether the original MRI results can be reproduced.

5. Communicating replication efforts

I think there is also a more general problem with how the results of replication efforts are communicated. Replication should be a natural component of scientific research. All too often failed replications result in mudslinging contests, heated debates, and sometimes in inappropriate comparisons of replication authors with video game characters. Some talk about how the reputation of the original authors is hurt by failed replication.

It shouldn’t have to be this way. Good scientists also produce non-replicable results and even geniuses can believe in erroneous theories. However, the way our publishing and funding system works as well as our general human emotions predispose us to having these unfortunate disagreements.

I don’t think you can solely place the blame for such arguments on the authors of the original studies. Because scientists are human beings the way you talk to them influences how they will respond. Personally I think that the reports of many high profile replication failures suffer from a lack of social awareness. In that sense the discussion surrounding the Boekel replications has actually been very amicable. There have been far worse cases where the whole research programs of some authors have been denigrated and ridiculed on social media, sometimes while the replication efforts were still on-going. I’m not going to delve into that. Perhaps one of the Neuroscience Devils wants to pick up that torch in the future.

However, even the Boekel study shows how this communication could have been done with more tact. The first sentences of the Boekel article read as follows:

“A recent ‘crisis of confidence’ has emerged in the empirical sciences. Several studies have suggested that questionable research practices (QRPs) such as optional stopping and selective publication may be relatively widespread. These QRPs can result in a high proportion of false-positive findings, decreasing the reliability and replicability of research output.”

I know what Boekel et al. are trying to say here. EJ Wagenmakers has a declared agenda to promote “purely confirmatory” research in which experimental protocols are preregistered. There is nothing wrong with this per se. However, surely the choice of language here is odd? Preregistration is not the most relevant part of the Boekel study; it could have been done without it. It is fine to argue in the article for why it is necessary, but actually opening the article with a discussion of the replication crisis in the context of questionable research practices can all too easily be (mis?)interpreted as an accusation. Whatever the intentions may have been, starting the article in this manner immediately places a spark of doubt in the reader’s mind and primes them to consider the original studies as being of a dubious nature. In fact, in the online debate Wagenmakers went a step further to suggest (perhaps somewhat tongue-in-cheek) that:

“For this particular line of research (brain-behavior correlations) I’d like to suggest an exploration-safeguard principle (ESP): after collecting the imaging data, researchers are free to analyze these data however they see fit. Data cleaning, outlier rejection, noise reduction: this is all perfectly legitimate and even desirable. Crucially, however, the behavioral measures are not available until after completion of the neuroimaging data analysis. This can be ensured by collecting the behavioral data in a later session, or by collaborating with a second party that holds the behavioral data in reserve until the imaging analysis is complete. This kind of ESP is something I can believe in.”

This certainly sounds somewhat accusatory to me. Quite frankly, it is a bit offensive. I am all in favour of scientific skepticism but this is not the same as baseless suspicion. Having once been on the receiving end of a particularly bad reviewer 2 who made similar unsubstantiated accusations (and in fact ignored evidence to the contrary), I can relate to people who would be angered by that. For one thing, such procedures are common in many labs conducting experiments like this. Having worked with Ryota Kanai in the past, I have a fairly good idea of the meticulousness of his research. I also have great respect for EJ Wagenmakers and I don’t think he actually meant to offend anyone. Still, I think statements like this could easily cause offence, and I think it speaks for Kanai’s character that he didn’t take any here.

There is a better way. This recently published failure to replicate a link between visually induced gamma oscillation frequency and resting occipital GABA concentration is a perfect example of a well-written replication failure. There is no paranoid language about replication crises and p-hacking but a simple, factual account of the research question and the results. In my opinion this exposition certainly facilitated the rather calm reaction to this publication.

6. Don’t hide behind preregistration

Of course, the question about optional stopping and outcome-dependent analysis (I think that term was coined by Tal Yarkoni) could be avoided by preregistering the experimental protocols (in fact, at least some of these original experiments were almost certainly preregistered in departmental project reviews). Contrary to what some may think, I am not opposed to preregistration as such. In fact, I fully intend to try it.

However, there is a big problem with this, which Kanai also discusses in his response. As a peer reviewer, he actually recommended that Boekel et al. use the same analysis pipeline he has now employed to test for the effects. The reason Boekel et al. did not do this is that these methods were not part of the preregistered protocol. However, this did not stop them from employing other non-registered methods, which they report as exploratory analyses. In fact, we are frequently told that preregistration does not preclude exploration. So why not here?

Moreover, preregistration is in the first instance designed to help authors control the flexibility of their experimental procedure. It should not be used as a justification for refusing to perform essential analyses when reviewers ask for them. In this case, a cynic might say that Boekel et al. in fact did these analyses and chose not to report them because the results were inconsistent with the message they wanted to convey. Now, I do not believe this to be the case, but it is an example of how unfounded accusations can go both ways in these discussions.

If this is how preregistration is handled in the future, we are in danger of slowing down scientific progress substantially. If Boekel et al. had performed these additional analyses (which should have been part of the originally preregistered protocol in the first place), this would have saved Kanai the time to do them himself. Both he and Boekel et al. could have done something more productive with their time (and so could I, for that matter :P).

It doesn’t have to go this way but we must be careful. If we allow this line of reasoning with preregistration, we may be able to stop the Texas sharpshooter from bragging but we will also break his rifle. It will then take much longer and more ammunition than necessary to finally hit the bull’s-eye.

Simine Vazire-style footnotes:

*) I actually dislike categorical labels for Bayesian evidence. I don’t think we need them.

**) This is a preprint manuscript. It keeps changing with ongoing peer review, so this statement may no longer be true when you read this.

***) Replicators is a very stupid word but I can’t think of a better, more concise one.

****) Actually this post was my big slip-up as Devil’s Neuroscientist. In that one a lot of Sam’s personality shone through, especially in the last paragraph.

*****) I should add that I am merely talking about the armies of people pointing out how prone these approaches are to false positives. I am not implying that all the researchers I linked to here agree with one another.

******) To be fair, I probably wouldn’t be able to find the Higgs boson even if I had the LHC.

Replies to Dorothy Bishop about RRs

I decided to respond now before I get inundated with the next round of overdue work this week… I was going to wait for Chris’ response, as I think you will probably overlap a bit, but there are a lot of deadlines and things to do, so now is a better time. I also decided to write my reply as a post because it is a bit long for a comment and others may find it interesting.

I think most of your answers illustrate how we all miss each other’s points a little. I am not talking about what RRs and prereg are like right now. Any evidence we have about them now is confounded by the fact that the approach is new and that the people trying it are probably for the most part proponents. Most of the points I raised (except perhaps the last one) are issues that only really come into play once the approach has become normalised – when it is commonplace at many journals and has stopped being a measure to improve science and is just how science works – a bit like standard peer review now (and you know how much people complain about that).

DB: Nope. You have to give a comprehensive account of what you plan, thinking through every aspect of rationale, methods and analysis: Cortex really doesn’t want to publish anything flawed and so they screw you down on the details.
DB: Why any more so than for other publication methods? I really find this concern quite an odd one.

I agree that detailed review is key but the same could be said about the standard system. I don’t buy that author reputation isn’t going to influence judgements there. Like most of us, I’m sure, I always try my best not to be influenced by it, but I think we’re kidding ourselves if we think we’re perfectly unbiased. If you get a somewhat lacklustre manuscript to review, you will almost inevitably respond better to an author with a proven track record in the field (who probably possesses great writing skills) than to some nobodies you have never heard of, especially if they’re failing to communicate their ideas well (e.g. because their native language isn’t English). Nevertheless the quality of their work could actually be equal.

Now I take your point that this is also an issue in the current system, but the difference is that RR stage 1 reviews are just about evaluating the idea and the design. I think you’re lacking some information that could actually help you make a more informed choice. And it would be very disturbing if we tell people what science they can or can’t do (in the good journals that have RRs) just because of factors like this.

DB: Well, for a start registered reports require you to have very high statistical power and so humungous great N. Most people who just happened to do a study won’t meet that criterion. Second, as Chris pointed out in his talk, if you submit a registered report, then it goes out to review, and the reviewers do what reviewers do, i.e. make suggestions for changes, new conditions, etc etc. They do also expect you to specify your analysis in advance: that is one of the important features of RR.

I think this isn’t really answering my question. It should be very easy to come up with a “highly powered experiment” if you already know the observed effect size :P. And as I said in my post, I think many outcome-dependent changes to the protocol are about the analysis, not the design. Again, my point is also that once RRs have become more normal and people have run a bit out of steam (so the review quality may suffer compared to now) it may be a fairly easy thing to do. I could also see there being hybrids (i.e. people have already collected a fair bit of “pilot” data and just add a bit more in the registered protocol).
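To make the point about retrofitted power concrete, here is a minimal sketch (my own illustration, not anything from the RR guidelines) of a standard normal-approximation sample size calculation for a two-sample comparison. The `n_per_group` function is a hypothetical helper; the point is simply that the “required” N shrinks dramatically if you plug in an inflated effect size you have already observed:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate per-group N for a two-sample comparison at effect size d,
    using the normal approximation to the two-sided two-sample t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Planning honestly around a plausible medium effect (d = 0.5):
print(n_per_group(0.5))  # 63 per group

# "Preregistering" after peeking at an inflated observed effect (d = 0.8)
# makes a much smaller, already-collected sample look highly powered:
print(n_per_group(0.8))  # 25 per group
```

So a completed study of 25 subjects per group can be dressed up as an 80%-powered design simply by citing the effect size it happened to produce.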

But I agree that this is perhaps all a bit hypothetical. I was questioning the actual logic of the response to this criticism. In the end what matters though is how likely it is that people engage in that sort of behaviour. If pre-completed grant proposals are really as common as people claim I could see it happening – but that depends largely on how difficult it is compared to being honest. Perhaps you’re right and it’s just very unlikely.

DB: So you would be unlikely to get through the RR process even if you did decide to fake your time stamps (and let’s face it, if you’re going to do that, you are beyond redemption).

I’m sure we all agree on that but I wouldn’t put it past some people. I’ve seen cases where people threw away around a third of their data points because they didn’t like the results. I am not sure that fiddling with the time stamps (which may be easier than actively changing the date) is really all that much worse.

Of course, this brings us to another question in that nothing in RR or data sharing in general really stops people from excluding “bad” subjects. Again, this is no different from the status quo, but my issue is that preregistered and open experiments clearly carry a certain value judgement for people (hell, the OSF actually operates a “badge” system!). So in a way a faked RR could end up being valued more than an honest, well-done non-RR. That does bother me.

DB: Since you yourself don’t find this [people stealing my ideas from RRs] all that plausible, I won’t rehearse the reasons why it isn’t.

Again, I was mostly pointing out the holes in the logic here. And of course whether or not it is plausible, a lot of people are quite evidently afraid of what Chris called the “boogieman” of being scooped. My point was that to allay this fear pointing to Manuscript Received dates is not going to suffice. But we all seem to agree that scooping is an exaggerated problem. I think the best way to deal with this worry is to stop people from being afraid of the boogieman in the first place.

DB: Your view on this may be reinforced by PIs in your institution. However, be aware that there are some senior people who are more interested in whether your research is replicable than whether it is sexy. And who find the soundbite form of reporting required by Nature etc quite inadequate.

This seems a bit naive to me. It’s not just about what “some senior people” think. I can’t with all honesty say that these factors don’t play into grant and hiring decisions. I also think it is a bit hypocritical to advise junior researchers not to pursue a bit of high impact glory when our own careers are at least in part founded on that (although mine isn’t nearly as much as some other people’s ;)). I do advise people that just to chase high impact is a bad idea but that you should have a healthy selection of solid studies. But I can also tell from experience that a few high impact publications clearly open doors for you. Anyway, this is really a topic for a different day I guess.

DB: My own view is that I would go for a registered report in cases where it is feasible, as it has three big benefits – 1) you get good peer review before doing the study, 2) it can be nice to have a guarantee of publication and 3) you don’t have to convince people that you didn’t make up the hypothesis after seeing the data. But where it’s not feasible, I’d go for a registered protocol on OSF which at least gives me (3).

I agree this is eminently sensible. I think the (almost) guaranteed publication is probably a quite convincing argument to many people. And by god I can say that I have in the past wished for (3) – oddly enough it’s usually the most hypothesis-driven research where (some) people don’t want to believe you weren’t HARKing…

I think this also underlines an important point. The whole prereg discussion far too often revolves around negative issues. The critics are probably partly to blame for it but I think in general you often hear it mentioned as a response to questionable research practices. But what this discussion suggests is that there are many positive aspects about prereg so rather than being a cure to an ailing scientific process, it can also be seen as a healthier way to do science.

Some questions about Registered Reports

Recently I participated in an event organised by PhD students in Experimental Psychology at UCL called “Is Science broken?”. It involved a lively panel discussion in which the panel members answered many questions from the audience about how we feel science can be improved. The opening of the event was a talk by Chris Chambers of Cardiff University about Registered Reports (RR), a new publishing format in which researchers preregister their introduction and methods sections before any data collection takes place. Peer review takes place in two stages: first, reviewers evaluate the appropriateness of the question and the proposed experimental design and analysis procedures; then, after data collection and analysis have been completed and the results are known, peer review continues to finalise the study for publication. This approach aims to make scientific publishing independent of the pressure to obtain perfect results or to change one’s apparent hypothesis depending on the outcome of the experiments.

Chris’ talk was in large part a question and answer session addressing specific concerns with the RR approach that had been raised at other talks or in writing. Most of these questions he (and his coauthors) had already answered in a similar way in a published FAQ paper. However, it was nice to see him talk so passionately about this topic. Also, speaking for myself at least, I can say that seeing a person argue their case is usually far more compelling than reading an article on it – even though the latter will in the end probably have a wider reach.

Here I want to raise some additional questions about the answers Chris (and others) have given to some of these specific concerns. The purpose in doing so is not to “bring about death by a thousand cuts” to the RR concept, as Aidan Horner calls it. I completely agree with Aidan that many concerns people have with RRs (and lighter forms of preregistration) are probably logistical. It may well be that some people just really want to oppose this idea and are looking for any little reason to use as an excuse. However, I think both sides of the debate about this issue have suffered from a focus on potentials rather than facts. We simply won’t know how much good or bad preregistration will do for science unless we try it. This seems to be a concept that everyone at the discussion was very much in agreement on, and we all discussed ways in which we could actually assess the evidence for whether RRs improve science over the next few decades.

Therefore I want it to be clear that the points I raise are not an ideological opposition to preregistration. Rather, they are some points where I found the answers Chris describes not entirely satisfying. I very much believe that preregistration must be tried, but I want to provoke some thought about possible problems with it. The sooner we are aware of these issues, the sooner they can be fixed.

Wouldn’t reviewers rely even more on the authors’ reputation?

In Stage 1 of an RR, when only the scientific question and experimental design are reviewed, reviewers have little to go on to evaluate the protocol. Provided that the logic of the question and the quality of the design are evident, they would hopefully be able to make some informed decisions about it. However, I think it is a bit naive to assume that the reputation of the authors isn’t going to influence the reviewers’ judgements. I have heard of many grant reviews raising questions about whether the authors would be capable of pulling off the proposed research. There is an extensive research literature on how the evaluation of identical exam papers, job applications, and the like can be influenced by factors like the subject’s gender or name. I don’t think simply saying that “author reputation is not among” the review criteria is enough of a safeguard.

I also don’t think that having a double-blind review system is necessarily a good way to protect against this. There have been wider discussions about the shortcomings of double-blind review and this situation is no different. In many situations you could easily guess the authors’ identity from the experimental protocol alone. And double-blind review suffers even more from one of the main problems with anonymous reviewers (which I generally support): when reviewers guess the authors’ identities incorrectly, that could have even worse consequences because their decision will be based on an incorrect assessment of the authors.

Can’t people preregister experiments they have already completed?

The general answer here is that this would constitute fraud. The RR format would also require time stamped data files and lab logs to guarantee that data were produced only after the protocol has been registered. Both of these points are undeniably true. However, while there may be such a thing as an absolute ethical ideal, in the end a lot of our ethics are probably governed by majority consensus. The fact that many questionable research practices are apparently so widespread is presumably just that: while most people deep down understand that these things are “not ideal”, they may nonetheless engage in them because they feel that “everybody else does it.”

For instance, I often hear that people submit grant proposals for experiments they have already completed, although I have personally never seen this with any grant proposals. I have also heard that it is perhaps more common in the US, but judging from all the anecdotes it may be generally widespread. Surely this is also fraudulent, but nevertheless people apparently do it?

Regarding time stamped data, I also don’t know if this is necessarily a sufficiently strong safeguard. For the most part, time stamps are pretty easy to “adjust”. Crucially, I don’t think many reviewers or post-publication commenters will go through the trouble of checking them. Faking time stamps is certainly deep into fraud territory but people’s ethical views are probably not all black and white. I could easily see some people bending the rules just that little, especially if preregistered studies become a new gold standard in the scientific community.
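To illustrate why time stamps alone are weak evidence, here is a minimal sketch (using a throwaway temporary file, nothing to do with any real RR data) showing that a file’s modification time can be backdated with a single standard operating system call – no forensic sophistication required:

```python
import os
import tempfile
import time

# Create a file right now, then backdate its modification time by 30 days
# using the standard os.utime call.
fd, path = tempfile.mkstemp()
os.close(fd)

thirty_days = 30 * 24 * 3600
backdated = time.time() - thirty_days
os.utime(path, (backdated, backdated))  # set (atime, mtime) in the past

# The file now appears to predate its actual creation by a month.
looks_old = os.path.getmtime(path) < time.time() - 29 * 24 * 3600
print(looks_old)  # True

os.unlink(path)  # clean up the throwaway file
```

File-system metadata is simply not tamper-evident, which is why a time stamp on a data file proves very little on its own.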

Now perhaps this is a bit too pessimistic a view of our colleagues. I agree that we probably should not exaggerate this concern. But given the concerns people have with questionable research practices now I am not entirely sure we can really just dismiss this possibility by saying that this would be fraud.

Finally, another answer to this concern is that preregistering your experiment after the fact would backfire because the authors could then not implement any changes suggested by reviewers in Stage 1. However, this only applies to changes in the experimental design, the stimuli, the apparatus, and so on. The most confusing corners of the “garden of forking paths” usually concern the analysis procedure, not the design. There are only so many ways to run a simple experiment and most minor changes suggested by a reviewer could easily be dismissed by the authors. However, changes to the analysis approach could quite easily be implemented after the results are known.

Reviewers could steal my preregistered protocol and scoop me

I agree that this concern is not overly realistic. In fact, I don’t even believe the fear of being scooped is overly realistic. I’m sure it happens (and there are some historical examples) but it is far rarer than most people believe. Certainly it is rather unlikely for a reviewer to do this. For one thing, time is usually simply not on their side. There is a lot that could be written about the fear of getting scooped and I might do that at some future point. But this is outside the scope of this post.

Whatever its actual prevalence, the paranoia (or boogieman) of scooping is clearly widespread. Until we find a way to allay this fear I am not sure that it will be enough to tell people that the Manuscript Received date of a preregistered protocol would clearly show who had the idea sooner. First of all, the Received date doesn’t really tell you when somebody had an idea. The “scooper” could always argue that they had the idea before that date but only ended up submitting the study afterwards (and I am sure that actually happens fairly often without scooping).

More importantly though, one of the main reasons people are afraid of being scooped is the pressure to publish in high impact journals. Having a high impact publication has greater currency than the Received date of an RR in what is most likely a lower impact journal. I doubt many people would actually check the date unless you specifically point it out to them. We already have a problem with people not reading the publications of job and grant applicants and instead relying on metrics like impact factors and h-indices. I don’t easily see them looking through that information.

As a junior researcher I must publish in high impact journals

I think this is an interesting issue. I would love nothing more than if we could stop caring about who published what where. At the same time I think that there is a role for high impact journals like Nature, Science or Neuron (seriously folks, PNAS doesn’t belong in that list – even if you didn’t boycott it like me…). I would like the judgement of scientific quality and merit to be completely divorced from issues of sensationalism, novelty, and news that quite likely isn’t the whole story. I don’t know how to encourage that change though. Perhaps RRs can help with that but I am not sure they’re enough. Either way, it may be a foolish and irrational fear but I know that as a junior scientist I (and my postdocs and students) currently do seek to publish at least some (but not exclusively) “high impact” research to be successful. But I digress.

Chris et al. write that sooner or later high impact outlets will probably come on board with offering RRs. I don’t honestly see that happening, at least not without a much more wide-ranging change in culture. I think RRs are a great format for specialised journals to have. However, the top impact journals primarily exist for publishing exciting results (whatever that means). I don’t think they will be keen to open the floodgates for lots of submissions that aim to test exciting ideas but fail to deliver the results to match them. What I could see perhaps is a system in which a journal like Nature would review a protocol and accept to publish it in its flagship journal if the results are positive but in its lower-impact outlet (e.g. Nature Communications) if the results are negative. The problem with this idea is that it somehow goes against the egalitarian philosophy of the current RR proposals. The publication again would be dependent on the outcome of the experiments.

Registered Reports are incompatible with short student projects

After all the previous fairly negative points, I think this one is actually about a positive aspect of science. For me this is actually one of the greatest concerns. In my mind this is a very valid worry, and Chris and colleagues acknowledge it in their article too. I think RRs would be a viable solution for experiments by a PhD student, but for master’s students, who are typically around for only a few months, it is simply not very realistic to first submit a protocol and revise it over weeks and months of reviews before even collecting the first data.

A possible solution suggested for this problem is that you could design the experiments and have them approved by peer reviewers before the students commence. I think this is a terrible idea. For me perhaps the best part of supervising student projects in my lab is when we discuss the experimental design. The best students typically come with their own ideas and make critical suggestions and improvements to the procedure. Not only is this phase very enjoyable but I think designing good experiments is also one of the most critical skills for junior scientists to learn. Having the designs finalised before the students even step through the door of the lab would undermine that.

Perhaps for those cases it would make more sense to just use light preregistration, that is, uploading your protocol to a time-stamped archive without external review. But if RRs do become the new gold standard in the future, I would worry that this would devalue the excellent research projects of many students.

Wrapping up…

As I said, these points are not meant to shoot down the concept of Registered Reports. Some of the points may not even be such enormous concerns at all. However, I hope that my questions provoke thought and that we can discuss ways to improve the concept further and find safeguards against these possible problems.

Sorry this post was very long as usual but there seems to be a lot to say. My next post though will be short, I promise! 😉