Failed replication or flawed reasoning?

A few months ago a study from EJ Wagenmakers’ lab (Boekel et al. in Cortex) failed to replicate 17 structural brain-behaviour correlations reported in the published literature. The study was preregistered by uploading the study protocol to a blog and so was what Wagenmakers generally refers to as “purely confirmatory”. As Wagenmakers is also a vocal proponent of Bayesian inferential methods, they used one-tailed Bayesian hypothesis tests to ask whether their replication evidence supported the original findings. A lot has already been written about the Boekel study and I was previously engaged in a discussion on it. Therefore, in the interest of brevity (and thus the time Alexander Etz needs to spend on reading it :P) I will not cover all the details again but cut right to the chase (it is pretty long anyway, despite my earlier promises…)

Ryota Kanai, author of several of the results Boekel et al. failed to replicate, has now published a response in which he reanalyses their replication data. He shows that at least one finding (a correlation between grey matter volume in the left SPL and a measure of distractibility as quantified by a questionnaire) replicates successfully if the same methods as his original study are used. In fact, while Kanai does not report these statistics, using the same Bayesian replication test for which Boekel reported “anecdotal” evidence* for the null hypothesis (r=0.22, BF10=0.8), Kanai’s reanalysis reveals “strong” evidence for the alternative hypothesis (r=0.48, BF10=28.1). This successful replication is further supported by a third study that replicated the finding in an independent sample (albeit with some of the same authors as the original study). Taken together this suggests that at least for this finding, the failure to replicate may be due to methodological differences rather than the original result being spurious.
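As a rough illustration of how such Bayes factors behave, here is a minimal sketch of a one-sided correlation Bayes factor. It uses the Fisher z approximation with a uniform prior on the correlation under H1 — a simplified stand-in for, not the exact, replication test used by Boekel et al. — and takes n=36, the replication sample size reported later in this post:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bf10_onesided(r, n):
    """Approximate one-sided Bayes factor for a positive correlation.

    Uses the Fisher z approximation, atanh(r) ~ Normal(atanh(rho), 1/(n-3)),
    with a uniform prior on rho in [0, 1) under H1 and rho = 0 under H0.
    This is a sketch of a default test, not the exact replication test
    used in the papers discussed here.
    """
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    # Marginal likelihood under H1: integrate over zeta = atanh(rho);
    # the uniform prior on rho becomes a sech^2(zeta) weight on zeta.
    like_h1, _ = quad(
        lambda zeta: stats.norm.pdf(z, loc=zeta, scale=se) / np.cosh(zeta) ** 2,
        0.0, np.inf)
    return like_h1 / stats.norm.pdf(z, loc=0.0, scale=se)

print(bf10_onesided(0.22, 36))  # close to the reported BF10 = 0.8
print(bf10_onesided(0.48, 36))  # close to the reported BF10 = 28.1
```

Even this simplified test lands near the reported values, which shows how strongly the conclusion hinges on the correlation estimate itself — and hence on the analysis pipeline that produced it.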

Now, disagreements between scientists are common and essential to scientific progress. Replication is essential for healthy science. However, I feel that these days, as a field, psychology and neuroscience researchers are going about it in the wrong way. To me this case is a perfect illustration of these problems. In my next post I will follow this one up on a positive note by presenting ten simple rules for a good replication effort (and – hand on my heart – that one will be short!)

1. No such thing as “direct replication”

Recent years have seen the rise of numerous replication attempts with a particular emphasis on “direct” replications, that is, the attempt to exactly reproduce the experimental conditions that generated the original results. This is in contrast to “conceptual” replications in which a new experiment follows the spirit of a previous one but the actual parameters may be very different. So for instance a finding that exposing people to a tiny picture of the US flag influences their voting behaviour months in the future could be interpreted as conceptually replicating the result that people walk more slowly when they were primed with words describing the elderly.

However, I believe this dichotomy is false. The “directness” of a replication attempt is not categorical but lies on a continuous spectrum. Sure, the examples of conceptual replications from the social priming literature are quite distinct from Boekel’s attempt to replicate the brain-behaviour correlations or all the other Many Labs projects currently being undertaken with the aim to test (or disprove?) the validity of social psychology research.

However, there is no such thing as a perfectly direct replication. The most direct replication would be an exact carbon copy of the original, with the same participants, tested at the same time in the same place under the exact same conditions. This is impossible and nobody actually wants it because it would be completely meaningless other than testing just how deterministic our universe really is. What people mean when they talk about direct replications is that they match the experimental conditions reasonably well but use an independent sample of participants and (ideally) independent experimenters. Just how “direct” the replication is depends on how closely matched the experimental parameters are. By that logic, I would call the replication attempt of Boekel et al. less direct than, say, Wagenmakers et al.’s replication of Bem’s precognition experiments. Boekel’s experiments did not match those of Kanai on a number of methodological points. However, even for the precognition replication Bem challenged Wagenmakers** on the directness of the methods because the replication attempt did not use the same software and stimuli as the original experiment.

Controversies like this reveal several issues. While you can strive to match the conditions of an original experiment as closely as possible, there will always be discrepancies. Ideally the original authors and the “replicators”*** can reach a consensus as to whether or not the discrepancies should matter. However, even if they reach such a consensus, this does not mean the discrepancies are unimportant. Saying that “original authors agreed to the protocol” means that a priori they made the assumption that methodological differences are insignificant. It does not mean that this assumption is correct. In the end this is an empirical question.

Discussions about failed replications are often contaminated with talk about “hidden moderators”, that is, unknown factors and discrepancies between the original experiment and the replication effort. As I pointed out under the guise of my alter ego****, I have little patience for this argument. It is counter-productive because there are always unknown factors. Saying that unknown factors can explain failures to replicate is an unfalsifiable hypothesis and a truism. The only thing that should matter in this situation is empirical evidence for additional factors. If you cannot demonstrate that your result hinges on a particular factor, this argument is completely meaningless. In the case of Bem’s precognition experiments, this could have been done by conducting an explicit experiment that compares the use of his materials with those used by Wagenmakers, ideally in the same group of participants. However, in the case of these brain-behaviour correlations this is precisely what Kanai did in his reply: he reanalysed Boekel’s data using the methods he had used originally and he found a different result. Importantly, this does not necessarily prove that Kanai’s theory about these results is correct. However, it clearly demonstrates that the failure to replicate was due to another factor that Boekel et al. did not take into account.

2. Misleading dichotomy

I also think the dichotomy between direct and conceptual replication is misleading. When people conduct “conceptual” replications the aim is different but equally important: direct replications (in so far as they exist) can test whether specific effects are reproducible, while conceptual replications are designed to test theories. Taking again the elderly-walking-speed and voting-behaviour priming examples from above: whether or not you believe that such experiments constitute compelling evidence, both aim to test the idea that subtle (subconscious?) information can influence people’s behaviour.

Conceptual replications also lie on a spectrum, but here the position depends on how general the overarching theory is that the replication seeks to test. These social priming examples clearly test a pretty diffuse theory of subconscious processing. By the same logic one could say that all of the 17 results scrutinised by Boekel test the theory that brain structure shares some common variance with behaviour. This theory is not only vague but so generic that it is almost meaningless. If you honestly doubt that there are any structural links between brain and behaviour, may I recommend checking some textbooks on brain lesions or neurodegenerative illness in your local library?

A more meaningful conceptual replication would be to show that the same grey matter volume in the SPL not only correlates with a cognitive failure questionnaire but with other, independent measures of distractibility. You could even go a step further and show that this brain area is somehow causally related to distraction. In fact, this is precisely what Kanai’s original study did.

I agree that replicating actual effects (i.e. what is called “direct” replication) is important because it can validate the existence of previous findings and – as I described earlier – help us identify the factors that modulate them. You may however also consider ways to improve your methodology. A single replication with a demonstrably better method (say, better model fits, higher signal-to-noise ratios, or more precise parameter estimates) is worth a hundred direct replications from a Many Labs project. In any of these cases, the directness of your replication will vary.

In the long run, however, conceptual replication that tests a larger overarching theory is more important than showing that a specific effect exists. The distinction between these two is very blurred though. It is important to know what factors modulate specific findings to derive a meaningful theory. Still, if we focus too much on Many Labs direct replication efforts, science will slow down to a snail’s pace and waste an enormous amount of resources (and taxpayer money). I feel that these experiments are largely designed to deconstruct the social priming theory in general. And sure, if the majority of these findings fail to replicate in repeated independent attempts, perhaps we can draw the conclusion that current theory is simply wrong. This happens a lot in science – just look at the history of phrenology or plate tectonics or our model of the solar system.

However, wouldn’t it be better to replace subconscious processing theory with a better model that actually describes what is really going on than to invest years of research funds and effort to prove the null hypothesis? As far as I can tell, the current working theory about social priming by most replicators is that social priming is all about questionable research practices, p-hacking, and publication bias. I know King Ioannidis and his army of Spartans show that the situation is dire***** – but I am not sure it is realistically that dire.

3. A fallacious power fallacy

Another issue with the Boekel replications, which is also discussed in Kanai’s response, is that the sample size they used was very small. For the finding that Kanai reanalysed the sample size was only 36. Across the 17 results they failed to replicate, their sample sizes ranged from 31 to 36. This is in stark contrast with the majority of the original studies, many of which used samples well above 100. Only for one of the replications, a test of one of their own findings, did Boekel et al. use a sample (n=31) that was actually larger than the original study’s (n=9). It seems generally accepted that larger samples are better, especially for replication attempts. A recent article recommended that replications use a sample two and a half times larger than the original. This may be a mathematical rule of thumb but it is hardly realistic, especially for neuroimaging experiments.

Thus I can understand why Boekel et al. couldn’t possibly have done their experiment on hundreds of participants. However, at the very least you would think that a direct replication effort should try to match the sample of the original study, not use one that is four times smaller. In our online discussions Wagenmakers explained the small sample by saying that they “simply lacked the financial resources” to collect more data. I do not think this is a very compelling argument. Using the same logic I could build a lego version of the Large Hadron Collider in my living room but fail to find the Higgs boson – only to then claim that my inadequate methodology was due to the lack of several billion dollars in my bank account******.

I must admit I sympathise a little with Wagenmakers here because it isn’t like I never had to collect more data for an experiment than I had planned (usually this sort of optional stopping happens at the behest of anonymous peer reviewers). But surely you can’t just set out to replicate somebody’s research, using a preregistered protocol no less, with a wholly inadequate sample size? As a matter of fact, their preregistered protocol states that the structural data for this project (which is the expensive part) had already been collected previously and that the maximum sample of 36 was pre-planned. While they left “open the possibility of testing additional participants” they opted not to do so even though the evidence for half of the 17 findings remained inconclusively low (more on this below). Presumably this was, as they say, because they ran “out of time, money, or patience.”

In the online discussion Wagenmakers further states that power is a pre-experimental concept and refers to another publication by him and others in which they describe a “power fallacy.” I hope I am piecing together their argument accurately in my own head. Essentially statistical power tells you how probable it is that you can detect evidence for a given effect with your planned sample size. It thus quantifies the probabilities across all possible outcomes given these parameters. I ran a simulation to illustrate this for the aforementioned correlation between left SPL grey matter and cognitive failure questionnaire scores. I drew 10,000 samples of 36 participants each from a bivariate Gaussian distribution with a correlation of rho=0.38 (i.e. the observed correlation coefficient in Kanai’s study). I also repeated this for the null hypothesis, drawing similar samples from an uncorrelated Gaussian distribution. The histograms in the figure below show the distributions of the 10,000 Bayes factors calculated using the same replication test used by Boekel et al., for the alternative hypothesis in red and the null hypothesis in blue.

Histograms of Bayes factors in favour of alternative hypothesis (BF10) over 10,000 simulated samples of n=36 with rho=0.38 (red curve) and rho=0 (blue curve).

Out of those 10,000 simulations in the red curve only about 62% would pass the criterion for “anecdotal” evidence of BF10=3. Thus even if the effect size originally reported by Kanai’s study had been a perfect estimate of the true population effect (which is highly improbable) only in somewhat less than two thirds of replicate experiments should you expect conclusive evidence supporting the alternative hypothesis. The peak of the red distribution is in fact very close to the anecdotal criterion. With the exception of the study by Xu et al. (which I am in no position to discuss) this result was in fact one of the most highly powered experiments in Boekel’s study: as I showed in the online discussion the peaks of expected Bayes factors of the other correlations were all below the anecdotal criterion. To me this suggests that the pre-planned power of these replication experiments was wholly insufficient to give the replication a fighting chance.
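The simulation above can be sketched roughly as follows. For tractability the sketch substitutes a simple one-sided default test (Fisher z approximation, uniform prior on rho under H1) for the exact replication Bayes factor, so the pass rate only approximates the ~62% figure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)
n, rho, n_sims = 36, 0.38, 10_000

# Approximate one-sided Bayes factor via the Fisher z approximation:
# atanh(r) ~ Normal(atanh(rho), 1/(n-3)). H1 puts a uniform prior on
# rho in [0, 1), which becomes a sech^2 weight on zeta = atanh(rho);
# H0 fixes rho = 0. A stand-in for the exact replication test.
zeta = np.linspace(0.0, 6.0, 3001)
dzeta = zeta[1] - zeta[0]
prior = 1.0 / np.cosh(zeta) ** 2

def bf10_onesided(r, n):
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    like_h1 = np.sum(stats.norm.pdf(z, loc=zeta, scale=se) * prior) * dzeta
    return like_h1 / stats.norm.pdf(z, loc=0.0, scale=se)

# Draw correlated bivariate Gaussian samples and compute each sample r
x = rng.standard_normal((n_sims, n))
y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((n_sims, n))
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
rs = (xc * yc).sum(axis=1) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))

bfs = np.array([bf10_onesided(r, n) for r in rs])

# Fraction of simulated replications reaching the "anecdotal" criterion
print(np.mean(bfs > 3))  # around 0.6 with this approximate test
```

Even under this simplified test, a substantial fraction of replications of a true rho=0.38 effect at n=36 fails to clear BF10=3, which is the point of the power argument.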

Now, Wagenmakers’ reasoning behind the “power fallacy”, however, is that after the experiment is completed power is a meaningless concept. It doesn’t matter what potential effect sizes (and thus Bayesian evidence) one could have obtained if one repeated the experiment infinitely. What matters is the results and evidence they did find. It is certainly true that a low-powered experiment can produce conclusive evidence in favour of a hypothesis – for example the proportion of simulations at the far right end of the red curve would very compellingly support H1 while those simulations forming the peak of the blue curve would afford reasonable confidence that the null hypothesis is true. Conversely, a high-powered experiment can still fail to provide conclusive evidence. This essentially seems to be Wagenmakers’ argument of the power fallacy: just because an experiment had low power doesn’t necessarily mean that its results are uninterpretable.

However, in my opinion this argument serves to obfuscate the issue. I don’t believe that Wagenmakers is doing this on purpose but I think that he has himself fallen victim to a logical fallacy. It is a non-sequitur. While it is true that low-powered experiments can produce conclusive evidence, this does not mean that the evidence they actually produced was conclusive. In fact, it is the beauty of Bayesian inference that it allows quantification of the strength of evidence. The evidence Boekel et al. observed was inconclusive (“anecdotal”) in 9 of the 17 replications. Only in 3 was the evidence for the null hypothesis anywhere close to “strong” (i.e. below 1/10 or very close to it).

Imagine you want to test if a coin is biased. You flip it once and it comes up heads. What can we conclude from this? Absolutely nothing. Even though the experiment has been completed it was obviously underpowered. The nice thing about Bayesian inference is that it reflects that fact.
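This toy example is easy to verify. With a uniform prior on the coin’s bias under H1 and a fair coin under H0, a single heads yields a Bayes factor of exactly 1:

```python
from scipy.integrate import quad

# One flip comes up heads. H0: theta = 0.5 exactly; H1: theta ~ Uniform(0, 1).
like_h0 = 0.5                                      # P(heads | theta = 0.5)
like_h1, _ = quad(lambda theta: theta, 0.0, 1.0)   # marginal likelihood = 1/2
bf10 = like_h1 / like_h0
print(bf10)  # 1.0: a single flip carries no evidence either way
```

The Bayes factor is exactly 1 because a single heads is equally probable under both hypotheses: the completed but underpowered experiment yields no evidence at all, and the Bayesian analysis says so explicitly.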

4. Interpreting (replication) evidence

You can’t have it both ways. You either take Bayesian inference to the logical conclusion and interpret the evidence you get according to Bayesian theory or you shouldn’t use it. Bayes factor analysis has the potential to be a perfect tool for statistical inference. Had Boekel et al. observed a correlation coefficient near 0 in the replication of the distractibility correlation they would have been right to conclude (in the context of their test) that the evidence supports the null hypothesis with reasonable confidence.

Now a close reading of Boekel’s study shows that the authors were in fact very careful in how they worded the interpretation of their results. They say that they “were unable to successfully replicate any of these 17 correlations”. This is entirely correct in the context of their analysis. What they do not say, however, is that they were also unable to refute the previously reported effects even though this was the case for over half of their results.

Unfortunately, this sort of subtlety is entirely lost on most people. The reaction of many commenters on the aforementioned blog post, on social media, and in personal communications was to interpret this replication study as a demonstration that these structural brain-behaviour correlations have been conclusively disproved. This is in spite of the statement in the actual article that “a single replication cannot be conclusive in terms of confirmation or refutation of a finding.” On social media I heard people say that “this is precisely what we need more of.” And you can literally feel the unspoken, gleeful satisfaction of many commenters that yet more findings by some famous and successful researchers have been “debunked.”

Do we really need more low-powered replication attempts and more inconclusive evidence? As I described above, a solid replication attempt can actually inform us about the factors governing a particular effect, which in turn can help us formulate better theories. This is what we need more of. We need more studies that test assumptions but that also take all the available evidence into account. Many of these 17 brain-behaviour correlations came with internal replications in the original studies. As far as I can tell these were not incorporated in Boekel’s analysis (although they mentioned them). For some of the results independent replications – or at least related studies – had been published months earlier, and it seems odd that Boekel et al. didn’t discuss at least those.

Also some results, like Kanai’s distractibility correlation, were accompanied in the original paper by additional tests of the causal link between the brain area and behaviour. In my mind, from a scientific perspective it is far more important to test those questions in detail rather than simply asking whether the original MRI results can be reproduced.

5. Communicating replication efforts

I think there is also a more general problem with how the results of replication efforts are communicated. Replication should be a natural component of scientific research. All too often failed replications result in mudslinging contests, heated debates, and sometimes in inappropriate comparisons of replication authors with video game characters. Some talk about how the reputation of the original authors is hurt by failed replication.

It shouldn’t have to be this way. Good scientists also produce non-replicable results and even geniuses can believe in erroneous theories. However, the way our publishing and funding system works as well as our general human emotions predispose us to having these unfortunate disagreements.

I don’t think you can solely place the blame for such arguments on the authors of the original studies. Because scientists are human beings the way you talk to them influences how they will respond. Personally I think that the reports of many high profile replication failures suffer from a lack of social awareness. In that sense the discussion surrounding the Boekel replications has actually been very amicable. There have been far worse cases where the whole research programs of some authors have been denigrated and ridiculed on social media, sometimes while the replication efforts were still on-going. I’m not going to delve into that. Perhaps one of the Neuroscience Devils wants to pick up that torch in the future.

However, even the Boekel study shows how this communication could have been done with more tact. The first sentences of the Boekel article read as follows:

“A recent ‘crisis of confidence’ has emerged in the empirical sciences. Several studies have suggested that questionable research practices (QRPs) such as optional stopping and selective publication may be relatively widespread. These QRPs can result in a high proportion of false-positive findings, decreasing the reliability and replicability of research output.”

I know what Boekel et al. are trying to say here. EJ Wagenmakers has a declared agenda to promote “purely confirmatory” research in which experimental protocols are preregistered. There is nothing wrong with this per se. However, surely the choice of language here is odd? Preregistration is not the most relevant part about the Boekel study. It could have been done without it. It is fine to argue in the article for why preregistration is necessary, but actually starting the article with a discussion of the replication crisis in the context of questionable research practices can very easily be (mis?)interpreted as an accusation. Whatever the intentions may have been, starting the article in this manner immediately places a spark of doubt in the reader’s mind and primes them to consider the original studies as being of a dubious nature. In fact, in the online debate Wagenmakers went a step further to suggest (perhaps somewhat tongue-in-cheek) that:

“For this particular line of research (brain-behavior correlations) I’d like to suggest an exploration-safeguard principle (ESP): after collecting the imaging data, researchers are free to analyze these data however they see fit. Data cleaning, outlier rejection, noise reduction: this is all perfectly legitimate and even desirable. Crucially, however, the behavioral measures are not available until after completion of the neuroimaging data analysis. This can be ensured by collecting the behavioral data in a later session, or by collaborating with a second party that holds the behavioral data in reserve until the imaging analysis is complete. This kind of ESP is something I can believe in.”

This certainly sounds somewhat accusatory to me. Quite frankly it is a bit offensive. I am all in favour of scientific skepticism but this is not the same as baseless suspicion. Having once been on the receiving end of a particularly bad reviewer 2 who made similar unsubstantiated accusations (and in fact ignored evidence to the contrary), I can relate to people who would be angered by that. For one thing such procedures are common in many labs conducting experiments like this. Having worked with Ryota Kanai in the past I have a fairly good idea of the meticulousness of his research. I also have great respect for EJ Wagenmakers and I don’t think he actually meant to offend anyone. Still, I think it could easily happen with statements like this and I think it speaks for Kanai’s character that he didn’t take offense here.

There is a better way. This recently published failure to replicate a link between visually induced gamma oscillation frequency and resting occipital GABA concentration is a perfect example of a well-written replication failure. There is no paranoid language about replication crises and p-hacking but a simple, factual account of the research question and the results. In my opinion this exposition certainly facilitated the rather calm reaction to this publication.

6. Don’t hide behind preregistration

Of course, the question about optional stopping and outcome-dependent analysis (I think that term was coined by Tal Yarkoni) could be avoided by preregistering the experimental protocols (in fact at least some of these original experiments were almost certainly preregistered in departmental project reviews). Contrary to what some may think, I am not opposed to preregistration as such. In fact, I fully intend to try it.

However, there is a big problem with this, which Kanai also discusses in his response. As a peer reviewer, he actually recommended that Boekel et al. use the same analysis pipeline he has now employed to test for the effects. The reason Boekel et al. did not do this is that these methods were not part of the preregistered protocol. However, this did not stop them from employing other non-registered methods, which they report as exploratory analyses. In fact, we are frequently told that preregistration does not preclude exploration. So why not here?

Moreover, preregistration is in the first instance designed to help authors control the flexibility of their experimental procedure. It should not be used as a justification to refuse performing essential analyses when reviewers ask for them. In this case, a cynic might say that Boekel et al. in fact did these analyses and chose not to report them because the results were inconsistent with the message they wanted to argue. Now I do not believe this to be the case but it’s an example of how unfounded accusations can go both ways in these discussions.

If this is how preregistration is handled in the future, we are in danger of slowing down scientific progress substantially. If Boekel et al. had performed these additional analyses (which should have been part of the originally preregistered protocol in the first place), this would have saved Kanai the time to do them himself. Both he and Boekel et al. could have done something more productive with their time (and so could I, for that matter :P).

It doesn’t have to go this way but we must be careful. If we allow this line of reasoning with preregistration, we may be able to stop the Texas sharpshooter from bragging but we will also break his rifle. It will then take much longer and more ammunition to finally hit the bulls-eye than is necessary.

Simine Vazier-style footnotes:

*) I actually dislike categorical labels for Bayesian evidence. I don’t think we need them.

**) This is a pre-print manuscript. It keeps changing with on-going peer review so this statement may no longer be true when you read this.

***) Replicators is a very stupid word but I can’t think of a better, more concise one.

****) Actually this post was my big slip-up as Devil’s Neuroscientist. In that one a lot of Sam’s personality shone through, especially in the last paragraph.

*****) I should add that I am merely talking about the armies of people pointing out the proneness of false positives. I am not implying that all these researchers I linked to here agree with one another.

******) To be fair, I probably wouldn’t be able to find the Higgs boson even if I had the LHC.

14 thoughts on “Failed replication or flawed reasoning?”

  1. Hi Sam. We’ve discussed this issue at length before, and it is too bad that I didn’t manage to change your mind! Let me respond briefly: you steadfastly ignore our replication Bayes factors, or the fact that the large majority of the correlations are substantially lower than the original. It is also too bad that you cannot see the main pattern across all studies.

    You state that it is a common procedure for labs to obey the Exploration-Safeguard Principle. That is great to hear. Unfortunately, I must have missed this in the method section of the neuroscience papers I’ve had the pleasure of reading. Can we make a bet? (this is after all what Bayesians like to do :-)) If I take any issue of, say, Journal of Neuroscience, what proportion of papers do you think will mention this procedure?

    My general skepticism is confirmed by the intermediate results from large-scale replication attempts in experimental psychology and social psychology. The dangers of violating the distinction between exploration and confirmation are already discussed in Peirce (1878).

    The section where you discuss the power fallacy is a little confusing. You end up stating that the evidence is just the evidence. Indeed, this is exactly what we are claiming. When the evidence is inconclusive, the evidence is inconclusive, and this is what we say.

    This was an intense multi-year project that took quite something out of everybody involved. I don’t need any sympathy but the implicit accusation of laziness is grating. We reported the evidence that we found, honestly and completely. In some cases, the evidence was strong, and in other cases it was not. This project was never intended to be the final answer. It was intended to start a conversation. We mention this clearly in the paper, several times. If you don’t believe the overall pattern is worrisome and should result in more orientation towards replication, then I am, again, disappointed.

    Best of luck with your own preregistration attempt — I am sure it will be a success (but you’d better obtain compelling evidence, or else! ;-))



    1. Thanks for your comment, EJ.

      When you say I “steadfastly ignore” the replication BFs, I assume you are referring to the BFs based on the posterior of the original studies? These are obviously more conservative than the ones with the uniform prior – however they are also exploratory by your own definition, which would again make me wonder why this exploratory analysis was included but the analysis Ryota suggested (and now carried out) was not. Again what I said in the post applies: you can’t have it both ways.

      In general these posterior-based replication BFs are a good idea. I have admittedly not looked at them. I agree this would be worthwhile if we can be sure that the methodological concerns are addressed. If the methods you used are likely to underestimate the true correlation, then I don’t think it makes a lot of sense to use a more precise prior for the replication BF.

      Of course, this replication BF should incorporate any independent replications within the original studies, shouldn’t it? I can’t honestly tell if you did that.

      You also say that the “large majority of the correlations” is substantially lower than the original ones. Let’s look at that for a moment:
      I am not sure what you count as “substantially lower”. If we consider replications whose confidence intervals do not overlap the original point estimate, I count 7/17. Not my definition of majority.
      If we look at whether the point estimate in the replication is lower than that of the original by more than the width of the symbols you used on this plot, I count 13. Now that is a majority but I don’t think this is particularly informative.

I think it also doesn’t make that much sense to treat the 17 results as independent. In actual fact, it seems clear to me that the studies by Xu et al. and Forstmann et al. failed to replicate in your sample. However, for Westlye et al. and Kanai et al. 2012 the story is clearly different. Kanai et al. 2011 is more unclear but that one is already covered in much detail in my post.

      Unfortunately, I must have missed this in the method section of the neuroscience papers I’ve had the pleasure of reading.

      If your point is that the methods sections of most neuroscience papers are lacking in detail, we won’t have to argue. I have been saying this for years. However, for many experiments (certainly the ones Ryota did) I think it is actually inherent to the nature of the experiments. He had a MRI database and he decided to collect behavioural data on these subjects. In many such experiments the behavioural data acquisition is essentially automatic and free of experimenter influence. So there needs to be no specific “ESP” declaration in the methods section. But I suppose better safe than sorry – and preregistration can help here for sure.

My general skepticism is confirmed by the intermediate results from large-scale replication attempts in experimental psychology and social psychology.

      This is just my point. I think large-scale replication attempts seeking to disprove – sorry – “test” results in psychology are essentially meaningless. I literally am going to pass out from boredom if I have to hear about another failed social psychology replication (or even a psi one – and you know how I feel about those…). Please wake me up when anybody has a damn theory that goes beyond “I don’t believe any of this!”

      The dangers of violating the distinction between exploration and confirmation are already discussed in Peirce (1878).

      I totally agree that those two should be delineated and emphasised more clearly – without giving more preference over one than the other. Honestly, most of my research has been hypothesis-driven but I think exploration is actually important. We shouldn’t blur them together and I kind of agree that preregistration is a good way to ensure that we don’t.

      When the evidence is inconclusive, the evidence is inconclusive, and this is what we say.

      I think you don’t say that clearly enough. You certainly don’t say it clearly enough so that other people who aren’t reading it carefully will understand it. If I have to listen to another person telling me about phrenology or how this shows how all brain-behavioural findings have been debunked, I honestly don’t know what I’ll do…

      This was an intense multi-year project that took quite something out of everybody involved. I don’t need any sympathy but the implicit accusation of laziness is grating. We reported the evidence that we found, honestly and completely.

      I never said you or any of your coauthors were lazy. And as I hope to have made clear in my post, I believe you that you reported what you found. But given what you set out to do and the way you wanted to do it (using Bayesian inference) I do have to question whether you did a complete job. It would never cross my mind to replicate somebody’s finding with n=36 when the original data set was n=144. That’s just (well, almost) like the coin toss example in my post. It also seems odd to me that this is a multi-year project. That seems a lot longer than the substantially larger studies you sought to replicate took to conduct but there may be good reasons for that.

Of course, this discussion again detracts from the much bigger issue of using the appropriate methods. As I said, I have no patience for the “hidden moderator” defense. However, Ryota evidently showed that there are moderating factors, irrespective of whether his original findings mean anything or not. This is what this discussion should focus on and this was really my main point all along.


    1. Thank you for that link. This does indeed look very relevant and I will see if I can mention this in my upcoming post.


  2. Hi Sam. A few brief & select responses. In your main post, you argued that it is all about doing the correct analysis, but now you argue that the replication BF does not count because it is exploratory. Seems I’m not the only one who wants to have it both ways :-). With respect to the correlations, we can debate about what “substantially lower” means, but the replication BF is one way to judge this (not the only one, of course). But – bottom line, I think – I do agree with your assessment that the results for some studies are clear-cut, whereas for others the pattern is less clear.

    With respect to “ESP”, my issue is not with experimenter effects on data collection. It is about experimenter effects on data analysis (i.e., choices on pipelines, corrections for multiplicity, selection of ROIs, etc). But perhaps I am missing something.

    I am not sure why you feel large-scale replication projects (e.g., the special issue in Social Psychology or the OSF Reproducibility Project) are boring. The Social Psychology special issue focused on effects that were deemed important to the field, and the Reproducibility Project picked studies randomly. Perhaps you find theories tested with behavioral measures inherently less interesting — I’m just not sure. The point I was trying to make is that these effects (from random papers) do not replicate as well as one would hope or believe. I don’t see a good argument to presume that the situation is better in neuroscience: the data are more complex, the studies are more expensive, and these factors can only worsen a bad situation. But I hope I’m wrong.

    Anyway, I think in terms of evidence, the glass is half-full: as you indicate, for some studies the pattern is clear, for one it may be somewhat less clear, and for others it is not clear at all. We should highlight this even more prominently in our rejoinder. Anyway, in my opinion it makes sense to focus on what was learned, and not on what was *not* learned. Of course, with limited N you may not get answers to all questions. But we did get answers to some (so your coin analogy is not appropriate). And, as we stressed in the paper, the work is intended as the start of a conversation, not the end.

    Lastly, if you want to experience the effort required to replicate a study in neuroscience (in a confirmatory fashion, with a preregistration document, and sticking as closely as possible to the procedures from the original authors), I challenge you to carry one out yourself 🙂 I have some experience with other replications as well, and they just take much more effort than “original” research. In a way this is counter-intuitive, because “all you need to do” is carry out the same experiment again. Unfortunately, the fact that it was someone else’s experiment makes a big difference.

    We are writing up a rejoinder for Cortex and your post will be highly valuable.



    1. Thanks EJ, I’ll try to be brief too … but probably won’t 😉

      you argued that it is all about doing the correct analysis, but now you argue that the replication BF does not count because it is exploratory. Seems I’m not the only one who wants to have it both ways

      I don’t think I do. I promise to look into creating a Matlab version of your rep-BFs (I may need your help with that again 😛 – or I guess I *could* just install R again but I really don’t want to…). However, my point is that the rep-BFs are evidently more conservative unless you replicate a similar point estimate – which is unlikely given the methods you used. So if you use the ROI approach Ryota used, I will use the rep-BF!

There is also an easy way to understand why Ryota’s approach is better. In neuroimaging localisation experiments the spatial location is itself one of the dependent variables. The original ROI location is a point estimate of the true population location, so you have to account for the precision with which you can estimate it.

      With respect to “ESP”, my issue is not with experimenter effects on data collection. It is about experimenter effects on data analysis…

I get that. And I think what you’re missing is that I believe there isn’t really as much going on under the surface here as you think. But to be fair this is really for Ryota and his co-authors to say as I had no involvement in those studies. You might be right in more general terms and regarding the wider literature, and certainly a clearer methods section will help.

      Anyway, in my opinion it makes sense to focus on what was learned, and not on what was *not* learned.

Makes sense, but I am not sure we agree on how much we actually learned from this ;). My take-home message comes down to two things:

      1. Take into account positional variability when you do localisation studies (seems obvious when you say it that way…:P). While you’re at it also take into account all available evidence (within-study replications and converging evidence).

      2. Being “purely confirmatory” doesn’t stop you from “purely” using the incorrect method and has the potential to greatly skew the findings (in Ryota’s example using a different method more than doubled the correlation coefficient…)

      Of course we don’t know if Ryota’s new results (with your data) are closer to the truth than yours. This is why we need more research looking into this. Ryota’s results *could* theoretically reflect artifacts with his approach (using DARTEL etc). I think this is an important question for the field to answer – much more important than direct replications!

      I am not sure why you feel large-scale replication projects (e.g., the special issue in Social Psychology or the OSF Reproducibility Project) are boring.

I tried to answer this in my post but perhaps I wasn’t clear enough. I think a single study with a good (that is, better than status quo) theory, or at the very least an alternative hypothesis, is worth more than a thousand direct replications. I have used this example before, but in my mind the Doyen et al. replication of Bargh et al. was a great example: they tested an alternative hypothesis for what produced the original effect and found evidence for it. That’s how good science should work. There may have been other mishaps with how this study emerged on the scene but I think this part of it was spot-on at least.

      As long as we just have lots of people replicating happily away without actual novel scientific questions we will not get anywhere! With purely direct replications (even if “purely confirmatory”) you can never get away from the hidden moderator argument or the notion that the replicators just did something wrong.

      Perhaps you find theories tested with behavioral measures inherently less interesting

      Most certainly not! I do a lot of behavioural-only research myself. I am a neuroscientist so my research interests relate back to the brain rather than the mind, but I think a lot of pure psychology research can be very interesting.

      I think I already explained why I am bored by the replication movement though. I feel it’s not there to answer scientific questions and it slows down progress.

      But we did get answers to some (so your coin analogy is not appropriate)

      It was hyperbolic, no doubt. A better analogy would be that you flipped the coin 10 times and you have shown that it’s definitely not a heads-only coin but we still don’t really know if it may be biased – and in fact it may not even have been the same coin… 😉

      Lastly, if you want to experience the effort required to replicate a study in neuroscience (in a confirmatory fashion, with a preregistration document, and sticking as closely as possible to the procedures from the original authors), I challenge you to carry one out yourself…

As I said, I won’t replicate something unless I have a specific question to ask. Just asking, “Does this effect exist?” is insufficient for scientific progress. In that sense I have replicated many neuroscience findings (like this one). Admittedly this is not usually as confirmatory as you would like because there are many little factors that differ between studies (pulse sequence, TR, scanner model, field strength, stimuli, duration of experiment etc). But given the extensively repeated replication of retinotopy despite methodological variations across the literature I am confident it exists :P. Quite honestly, what would a direct “purely confirmatory” replication have told me? If I can get the same thing despite methodological differences, I think that actually tells us more.

      Now if I choose to replicate some neuroscience effect we may want to make it as direct as possible but as I said earlier I would feel that it needs something in addition. I don’t just want to know if some effect replicates but what its underlying cause is. This is also why I haven’t bothered to replicate Bem yet 😛

      However, I take your word (damn, I’m violating the Royal Society motto I advertised in my open science talk now!) that doing your kind of replication study is probably a lot of hard work. You have a lot of experience doing these kind of preregistered replications while I have none. So I believe you.


  3. Hey Sam,

    It would seem I came away from your bayes factor distributions graph more optimistic than you did. I am interpreting these distributions in terms of probabilities of being misled by the evidence, similar to the way Royall explains in his book. Correct my numbers if they are way off! I can only eyeball the graph.

    According to the null distribution in the graph, the probability of selecting a BF at random, and subsequently having that BF be above 3 (i.e., strong misleading evidence), is what, roughly 2%? So if it turns out the null is true we will be misled into supporting Ha only 2% of the time. Not too bad, in my opinion. It’s hard to tell from the graph what the probability of obtaining a weak BF (i.e., 1/3 < BF < 3) is, but under H1, according to the text, that means roughly 35% of the time we would get inconclusive BFs.

    So the bayes factors won’t mislead us very often, but they will leave us in the dark relatively frequently in either case. That jibes with my intuitions, and you can see it in the results from the study. Does it suck to spend a lot of money and get weak/uninformative evidence? Yes. But the beauty of bayes factors is we can call it as we see it: weak, uninformative evidence either way.

    So why am I optimistic? Your graph says to me that even with relatively small samples, these BFs show pretty damn good behavior! We almost never see strong evidence for the wrong hypothesis, which is something you would like to avoid when you’re evaluating a replication. That is, we want to avoid obtaining strong support for the wrong hypothesis in a replication because that could harm reputations and cause a ruckus. And your graph shows that even with n=36 that only happens 2 or 3 percent of the time! If we take “strong” to mean 1/10 or 10 instead of 1/3 or 3, then we will almost never be misled by these bayes factors!

    If one wants to avoid harming reputations, then in my opinion these bayes factors do a pretty awesome job.


    1. I didn’t actually keep the results of this simulation as they are really quite trivial and only take a few seconds to run. I just ran it again. Your general reading of the plot is correct of course. I actually have much more extensive simulations of this kind in my stats manuscript (which was editorially rejected last night by the way) where I look at the error rates (I know, how frequentist of me…) across different sample sizes.

      For this simulation now (identical parameters as in the graph) I get 1.8% false positives defined by BF10>3. This roughly matches the results from my manuscript as well.

The proportion under H0 for inconclusive BFs (1/3 to 3) is 27.6%. It follows that the proportion of BFs below 1/3 under H0 is actually about 70%. So in frequentist terms that is your "power" to confirm the null – I agree that this is actually surprisingly high given that small sample size.

As you say, in roughly 35% of H1 simulations we get inconclusive BFs. That's only a little larger than the proportion under H0. Thus in this case, if we pick a random inconclusive BF, it is essentially a coin toss whether it came from H1 or H0. That's even more so the case if you look at the BF the replication study actually got (0.8). That's pretty much the point where the two curves overlap. So I interpret this as low sensitivity.

I redid the simulations with the same parameters except changing the sample size to the original (n=144). In this case both the false positive rate and the proportion of inconclusive BFs under H1 are below 1%. Now that would have been sensitive!
      Even with half the original sample (n=72) the "power" would have been almost 90% and the false positives 1.5%.
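
For what it’s worth, here is a minimal sketch of this kind of simulation. This is my own illustrative reconstruction, not the original code: it assumes bivariate-normal data, uses Jeffreys’ two-sided approximation to the correlation Bayes factor rather than the exact one-tailed test Boekel et al. used, and takes the reanalysed r=0.48 as the assumed true effect.

```python
import numpy as np

def jeffreys_bf10(r, n):
    # Jeffreys' (1961) approximation to the two-sided correlation Bayes factor:
    # BF01 ~ sqrt((2n - 3) / pi) * (1 - r^2)^((n - 4) / 2); return BF10 = 1 / BF01.
    bf01 = np.sqrt((2 * n - 3) / np.pi) * (1 - r ** 2) ** ((n - 4) / 2)
    return 1.0 / bf01

def simulate_bfs(rho, n, n_sims=20000, seed=0):
    # Draw n_sims samples of size n with true correlation rho and
    # return the BF10 for each observed Pearson r.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, n))
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((n_sims, n))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    r = (xc * yc).sum(axis=1) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return jeffreys_bf10(r, n)

bf_h0 = simulate_bfs(rho=0.0, n=36)   # the null is true
bf_h1 = simulate_bfs(rho=0.48, n=36)  # assumed true effect (reanalysed estimate)

print("H0: P(BF10 > 3)     =", np.mean(bf_h0 > 3))  # strong misleading evidence
print("H0: P(1/3 < BF < 3) =", np.mean((bf_h0 > 1/3) & (bf_h0 < 3)))
print("H1: P(1/3 < BF < 3) =", np.mean((bf_h1 > 1/3) & (bf_h1 < 3)))
```

With these assumptions the rate of strong misleading evidence under H0 comes out at a few percent, while a substantial fraction of BFs under H1 remains inconclusive at n=36; increasing n shrinks the inconclusive zone rapidly.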

      Of course, true Bayesians don't like talk of error rates but I'm a Pragmatist. If our field has a problem with underpowered analyses and inflated false positives we should worry about this.


  4. When you say, “I interpret this as low sensitivity” you’ve basically just redefined that bayes factor. Of course it is not sensitive, the BF is close to 1!


  5. Sorry, I accidentally posted that one early! I’m on a bus 😛 I mean that you’ve just shown that these bayes factors are behaving correctly: a BF close to 1 means insensitive data, equally probable under either hypothesis. That doesn’t say anything about the frequency of errors nor the “power” of the test, since that’s just the definition of a bayes factor.

    If they had more n there’s a much better chance of getting sensitive data, as you say. But even if the sample size was smaller, sensitive data is sensitive data. If you want to keep in mind that a method obtains insensitive data fairly often, you of course can, but it doesn’t change the fact that the BF tells you how much you need to update your prior odds. These and other properties of the method can be important but they are interpreted separately from what the BF actually tells you.

    And looking at error rates is fine! Andrew Gelman is big on that and he’s a Big Bayesian. But the beauty of Bayesian inference is that it dissociates the strength of evidence from the probability of obtaining it.


    1. But isn’t the important point here that it’s not only equally likely under both hypotheses but that the overall proportion of inconclusive results is quite high?

      Regarding error rates, Jeff Rouder recently told me that this ain’t the way in the Bayesian world. I kind of almost understood it except that the pragmatist side still wins. In the end, if a BF is likely to mislead you into believing the wrong thing it still seems inadequate for solving our problems.

      Someone on that FB thread made a good point though which I think is also relevant to my method or any non-dichotomous approach. If the vast majority of findings produce inconclusive results a single weakly conclusive result doesn’t count that much either. This is in fact the issue with Bem’s data for me. All of his experiments are completely inconclusive except for his experiment 9 which is just borderline. I don’t think we should put too much faith into that.

      Anyway, I digress. I should stay away from this place for the rest of the weekend. I’ve been working on the follow-up post though. It definitely is quite short (by my standard 😉 and I hope it’s also positive 😛


  6. There is nothing wrong with caring about error rates in general. Is it right to say that obtaining inconclusive data is an error though?


    1. “In the end, if a BF is likely to mislead you into believing the wrong thing it still seems inadequate for solving our problems.”

      We’ve just agreed that these Bayes factors don’t mislead you! They leave you in the dark but don’t you agree that saying “I didn’t learn much” is not an error nor is it misleading?


    2. I’m replying to this one because I only allow two levels on my comment threads… 😉
      Also, sorry for the belated response but I need more time to focus to do so than I am willing to invest on a weekend.

I think we need to dissociate two separate issues here. I don’t think Bayes Factors are misleading in terms of what they are actually showing. However, they do mislead if they frequently result in inadequate interpretations. This is an important issue to me because my bootstrapping procedure also aims to quantify evidence for H1 or H0 and it can also be inconclusive. Whether or not my method makes sense (I’m no statistician so it probably doesn’t), as a field I really want us to get away from dichotomous significance testing. So any situations where non-dichotomous evidence is misinterpreted matter to me. It shows that it may be harder for us to change our thinking than we believe.

      Case in point, I had this reanalysis of Bem 2011 in my manuscript (I took this out now for unrelated reasons, but there is another precognition data set in there and the conclusions are the same). In my first draft I myself interpreted the inconclusive evidence as evidence for H0 until my colleagues reading the draft pointed out that error.

      I think this is also what is happening here. As I said in my post, Boekel et al. are actually quite careful in their wording but I think this subtlety is lost on most readers. Many people read “Famous proponent of Bayes Factors proves that yet another set of controversial findings don’t replicate!” This to me is potentially quite dangerous.

      I believe we need to be very cautious how we interpret and discuss inconclusive results. I think the same applies to those precognition data. I don’t believe that precognition exists either but the main conclusions we should draw from Bem and co’s findings is not that it doesn’t exist but that most experiments are largely insufficient to conclude anything.

I do agree with one thing someone said on Twitter though: inconclusive evidence should be interpreted against the backdrop of other results. If most results support H0 and a few are inconclusive, I’d be inclined to accept that H0 is probably generally true. Similarly, if a single result only barely supports H1 but the rest are inconclusive or even support H0 (this happened with Bem 2011), I wouldn’t put too much faith into that single result – it’s just a multiple comparisons issue I suppose. I think this doesn’t really apply in the Boekel situation though. I think you must treat each of the five studies separately here (although I could accept lumping together Kanai 2011 and 2012 as they are conceptually similar).

Anyway, the main misleading issue is not that BFs are inconclusive but that the experiment is underpowered. As we have seen, a third of BFs will be inconclusive even if the original result was a perfect point estimate of the population effect. And this is only for the effect for which power was highest (amongst the Kanai data anyway). It looks even worse for other results. I think you also cannot really expect original results to be a perfect point estimate. In my opinion we should in fact expect effect size distributions to be skewed towards weaker effects. Even without publication bias and QRPs there is what my alter ego called the Data Quality Decay Function. In many replications you should probably expect weaker effect size estimates than the original finding even if the experiments were executed perfectly. For instance, in the Boekel replications we could expect the sample of Dutch students to be more homogeneous than a subject pool in London. A narrower range of inter-individual variability results in weaker correlations. Considering that the lack of variability was a reason they dropped the political attitude experiment this doesn’t seem very far-fetched. Then again, Ryota’s reanalysis of the Boekel data showed a larger effect size than his original study.
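
The range-restriction point is easy to demonstrate with a quick simulation (the numbers here are purely illustrative and not tied to either sample):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.standard_normal(n)                                    # some trait measure
y = 0.5 * x + np.sqrt(1 - 0.5 ** 2) * rng.standard_normal(n)  # population r = 0.5

r_full = np.corrcoef(x, y)[0, 1]

# A more homogeneous subject pool: same underlying relationship,
# but only subjects within a narrower range of x are sampled.
keep = np.abs(x) < 1.0
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```

The restricted correlation drops markedly even though the underlying relationship is unchanged, which is exactly why a more homogeneous replication sample can yield a genuinely weaker correlation.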

      This brings me to the last thing that I think is misleading about all this. I think there is too much of a focus on statistics. I’ve been saying for months that this is a problem. Statistics are important but we shouldn’t lose sight of the true issues. Replications shouldn’t be about the behaviour of statistical tests but about whether previous experimental claims hold water. It seems very misleading to me to replicate the 17 findings here but not actually concentrate on the specific biological questions. In general, as far as I can tell the replication BFs do not incorporate any prior replication evidence. And in the case of Ryota’s VBM correlations for distractibility, the replication completely ignores the convergent evidence that he (and his coauthors) presented. They showed that the brain area localised in this analysis also correlated with different behavioural measures and that causally manipulating this region with TMS produces behavioural effects consistent with the theory. It seems to me it would have made more sense to pick one of the previous experiments and try to replicate it in its entirety rather than picking 17 statistical tests from the literature.

      But I want to make one thing clear: I don’t think that these issues mean that Boekel’s study should not have been published. We still have a problem with null results and/or inconclusive results ending up in the file-drawer. This is not conducive to healthy science. Whatever the reasons may be for why Boekel et al. didn’t collect more data that would have allowed more conclusive inference, it is good that it was published. No data should just vanish in the void of oblivion. I’m not saying that every pilot data set must be made available (I certainly haven’t done that) but if you do a proper study it should be published. That’s how I was trained and it’s a philosophy I always tried to follow.

      I think the fact that Ryota finds very different results from Boekel using the same data is reason enough to be glad that this was published. Yes it may indicate that Boekel’s interpretation is wrong but it could also indicate some serious problems with the pipeline. I seem to recall hearing about a recent study suggesting that VBM results from SPM are less reliable than those with FSL. To me such issues are much more important than whether or not these individual SBB findings are true. If similar discrepancies are found for some of the other 16 data sets (or at least the other 7 VBM results), then this clearly indicates that more research is required to validate the common assumptions people make about these methods.

      Sorry for the very long comment… 😛


Comments are closed.