Experimenter effects in replication efforts

I mentioned the issue of data quality before, but reading Richard Morey’s interesting post about standardised effect sizes the other day made me think about this again. Yesterday I gave a lecture discussing Bem’s infamous precognition study and the meta-analysis he recently published of the replication attempts. I hadn’t looked very closely at the meta-analysis data before, but for my lecture I produced the following figure:

[Figure: Bem-Meta]

This shows the standardised effect size for each of the 90 results in that meta-analysis, split into four categories. On the left, in red, we have the ten results by Bem himself (nine from his original study, plus one replication of one of those experiments by Bem himself). Next, in orange, we have what the meta-analysis calls ‘exact replications’, that is, replications that used his program/materials. In blue we have ‘non-exact replications’ – those that sought to replicate the paradigms but didn’t use his materials. Finally, on the right in black we have what I called ‘different’ experiments. These are at best conceptual replications because they also test whether precognition exists but use different experimental protocols. The hexagrams denote the means across all the experiments in each category (these are non-weighted means, but that’s not important for this post).

While the means for all categories are evidently greater than zero, the most notable feature is that Bem’s findings are dramatically different from the rest. While the mean effect sizes in the other categories are below or barely at 0.1, and there is considerable spread below zero in all of them, all ten of Bem’s results are above zero and, with one exception, above 0.1. This is certainly very unusual, and there are all sorts of reasons we could discuss for why this might be…

But let’s not. Instead, let’s assume for the sake of this post that there is indeed such a thing as precognition and that Daryl Bem simply knows how to get people to experience it. I doubt that this is a plausible explanation in this particular case – but I would argue that for many kinds of experiments such “experimenter effects” are probably notable. In an fMRI experiment, different labs may differ considerably in how well they control participants’ head motion, or even simply in the image quality of the MRI scans. In psychophysical experiments, different experimenters may differ in how well they explain the task to participants, or how meticulous they are in ensuring that participants really understood the instructions, and so on. In fact, the quality of the methods surely must matter in all experiments, whether they are in astronomy, microbiology, or social priming. Now, this argument has been made in many forms, most infamously perhaps in Jason Mitchell’s essay “On the emptiness of failed replications”, which drew much ire from many corners. You may disagree with Mitchell on many things, but not on the fact that good methods are crucial. What he gets wrong is laying the blame for failed replications solely at the feet of “replicators”. Who is to say that the original authors didn’t bungle something up themselves?

However, it is true that all good science should seek to reduce noise from irrelevant factors, to obtain observations of the effect of interest that are as clean as possible. Using Bem’s precognition experiments as an example again, we could hypothesise that he indeed had a way to relax participants to unlock their true precognitive potential that others seeking to replicate his findings did not. If that were true (I’m willing to bet a fair amount of money that it isn’t, but that’s not the point), this would indeed mean that most of the replications – failed or successful – in his meta-analysis are of low scientific value. All of these experiments are more contaminated by noise confounds than his experiments; thus only he provides clean measurements. Standardised effect sizes like Cohen’s d divide the absolute raw effect by a measure of uncertainty or dispersion in the data. The dispersion is a direct consequence of the noise factors involved. So it should be unsurprising that the effect size is greater for experimenters who are better at eliminating unnecessary noise.
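The point about standardised effect sizes can be demonstrated with a toy simulation. This is just an illustrative sketch with invented numbers: two hypothetical experimenters measure the exact same raw effect, but one collects noisier data, and the extra dispersion alone shrinks the standardised effect:

```python
import random
import statistics

random.seed(1)

def cohens_d(sample, mu0=0.0):
    """One-sample Cohen's d: mean difference divided by the sample SD."""
    return (statistics.mean(sample) - mu0) / statistics.stdev(sample)

RAW_EFFECT = 0.5  # the same "true" raw effect for both experimenters

# A meticulous experimenter with low measurement noise (SD = 1)
clean = [random.gauss(RAW_EFFECT, 1.0) for _ in range(10_000)]
# A sloppier experimenter adding extra noise (SD = 3)
noisy = [random.gauss(RAW_EFFECT, 3.0) for _ in range(10_000)]

print(cohens_d(clean))  # close to 0.5
print(cohens_d(noisy))  # close to 0.17 - same raw effect, smaller d
```

The raw effect never changes; only the denominator of Cohen’s d grows with the noise, which is exactly why a cleaner experimenter reports a larger standardised effect.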

Statistical inference seeks to estimate the population effect size from a limited sample. Thus, a meta-analytic effect size is an estimate of the “true” effect size from a set of replications. But since this population effect includes the noise contributed by all the different experimenters, does it actually reflect the true effect? The true effect is people’s inherent precognitive ability. The meta-analytic effect size estimate spoils that with all the rubbish others pile on with their sloppy Psi experimentation skills. Surely we want to know the former, not the latter? Again, for precognition most of us will probably agree that this is unlikely – it seems more trivially explained by some Bem-related artifact – but in many situations this is a very valid point: imagine one researcher manages to produce a cure for some debilitating disease but others fail to replicate it. I’d bet that most people wouldn’t run around shouting “Failed replication!”, “Publication bias!”, “P-hacking!” but would want to know what makes the original experiment – the one with the working drug – different from the rest.
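To make this concrete, here is a hypothetical sketch (all parameters invented) of how a pooled meta-analytic estimate gets dragged down when one low-noise lab is averaged together with many noisier labs, even though every lab measures the identical raw effect:

```python
import random
import statistics

random.seed(2)

RAW_EFFECT = 0.5   # identical raw effect in every lab
N_PER_LAB = 2_000  # observations per lab

def lab_d(noise_sd):
    """Standardised effect a lab observes, given its own measurement noise."""
    data = [random.gauss(RAW_EFFECT, noise_sd) for _ in range(N_PER_LAB)]
    return statistics.mean(data) / statistics.stdev(data)

clean_lab_d = lab_d(1.0)  # the one careful lab
# Twenty hypothetical replication labs with varying amounts of extra noise
noisy_lab_ds = [lab_d(random.uniform(2.0, 4.0)) for _ in range(20)]

# Unweighted meta-analytic mean across all 21 labs
meta_d = statistics.mean([clean_lab_d] + noisy_lab_ds)
print(clean_lab_d, meta_d)  # the pooled estimate sits well below the clean lab's d
```

The pooled estimate is a perfectly good summary of what the average lab measures, but it systematically understates what the one clean lab measures – which is the distinction the argument above turns on.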

The way I see it, meta-analysis of large-scale replications is not the right way to deal with this problem. Meta-analyses of one lab’s replications are worthwhile, especially as a way to summarise a set of conceptually related experiments – but then you need to take them with a grain of salt because they aren’t independent replications. Large-scale meta-analyses across different labs, however, don’t really tell us all that much. They simply don’t estimate the effect size that really matters. The same applies to replication efforts (and I know I’ve said this before). This is the point on which I have always sympathised with Jason Mitchell: you cannot conclude a lot from a failed replication. A successful replication that nonetheless demonstrates that the original claim is false is another story, but simply failing to replicate some effect only tells you that something is (probably) different between the original and the replication. It does not tell you what the difference is.

Sure, it’s hard to make that point when you have a large-scale project like Brian Nosek’s “Estimating the reproducibility of psychological science” (I believe this is a misnomer because they mean replicability, not reproducibility – but that’s another debate). Our methods sections are supposed to allow independent replication. The fact that so few of their attempts produced significant replications is a great cause for concern. It seems doubtful that all of the original authors knew what they were doing while so few of the “replicators” did. But in my view, there are many situations where this is not the case.

I’m not necessarily saying that large-scale meta-analysis is entirely worthless, but I am skeptical that we can draw many firm conclusions from it. In cases where there is reasonable doubt about differences in data quality or experimenter effects, you need to test these differences. I’ve repeatedly said that I have little patience for claims about “hidden moderators”. You can posit moderating effects all you want, but they are not helpful unless you test them. The same principle applies here. Rather than publishing one big meta-analysis after another, either showing that some effect is probably untrue or, as Psi researchers are wont to do, in an effort to prove that precognition, presentiment, clairvoyance, or whatever are real, I’d like to see more attempts to rule out these confounds.

In my opinion, the only way to do this is through adversarial collaboration. If an honest skeptic can observe Bem conduct his experiments, inspect his materials, and analyse the data for themselves, and Bem still manages to produce these findings, that would go a much longer way towards convincing me that these effects are real than any meta-analysis ever could.

10 thoughts on “Experimenter effects in replication efforts”

  1. Thanks for that Bem plot – really interesting! I should’ve gone to your lecture 😉

    Isn’t it a very interesting question how an effect fares in the real world and how fragile it is with regard to experimental ‘fine tuning’? Ideally, one would like to know both, I guess: The ‘ideal’ and ‘real world’ size of an effect. Big discrepancies between both immediately become interesting questions in their own right. For medical treatments, what are the compliance issues etc. at hand? Or think of the Pearl index. A condom’s ‘ideal’ effectiveness and that which you can typically expect for a given population are both valuable information.

    And I think this is true for basic science as well. Ultimately our interest should go beyond whether an effect exists, but in explaining it. And this will necessarily include all relevant factors. I once visited a lab doing experiments on the rubber hand illusion. It turned out there was some received wisdom on how to best perform these experiments that I never saw in any methods section (like dim ambient light). If such factors indeed have a significant influence, then there will be no true understanding of the effect without taking them into account.


    1. Yeah, perhaps I should have differentiated that better. It depends on the purpose of the study. If I want to know the applicability of a phenomenon to the real world, then estimating the population effect including all the noise, warts and all, is what you want. But surely for a lot of basic science, for “explaining it”, this is not the case.

      Hence I trust a result on modulation of the tilt illusion far more when it includes 3-5 trained psychophysics observers than 500 undergrads doing it for course credit.

      When it comes to the meta-analyses that I have read, the purpose essentially always was to ask whether the effect exists at all. The Psi folks are hell-bent on using meta-analyses to prove the existence of their effects. Most other meta-analyses seem to show that effects are far weaker than the original reports and almost always skewed by publication bias. This really shouldn’t be surprising. Even if an effect is real, it is almost certainly going to be muddled by large-scale independent replication.

      Now somewhere down the line I guess a meta-analysis of precognition effects could be useful to know just how strongly such a thing should manifest generally. But before that we should know if it is real at all which we can only really know through adversarial collaboration with different systematic tests: first, as I said in my post, let Bem do his experiments but have someone else conduct the analysis etc. Then, if that confirms his result, do it again but in another place. Perhaps it’s just about his subject pool? Or maybe his lab is on top of a geographic psi hotspot? If that still works, then try to carefully replicate his behaviour. But maybe he just has an aura that produces precognition?


  2. 1) Is this the Mitchell piece you’re looking for?:


    2) Here is an article that might be of interest to you / related to your post: “Meta-analyses are no substitute for registered replications: a skeptical perspective on religious priming”


    3) I think what Bem is doing is great. It makes us critically think about the way research is performed and evaluated. I don’t think he actually believes Psi exists, but is merely doing what a lot of other people have been doing for decades only with less-obviously controversial topics.

    Now what to make of it all? I am not that smart, so I am hoping smart people will come up with a solution. My thoughts at this moment are that, for me, it all comes down to thinking that performing more rigorous research in the first place might be the best way forward. That way meta-analyses can’t be skewed by “weird” original findings. That way it would also be clearer whether any resources should be spent on trying to find out what exactly happened in the original work, which under current practices remains unclear (e.g. perhaps the simplest explanation of differences between original and replication studies is p-hacking, low-powered studies, etc.).

    If Bem (2011) had performed only high-powered, pre-registered studies where the results would have been published no matter how they turned out I doubt we would be talking about it right now.


    1. Thanks for the links. I was already sent the link to Mitchell’s piece and updated the text. But the Frontiers paper is definitely interesting!

      Regarding Bem: I’ve heard that notion before that he doesn’t really believe in Psi and is just doing this to test scientific rigour and reveal the flaws in our methods etc. I know this idea seems very appealing, but in my view the evidence speaks against it. He has long been a Psi researcher, well before the 2011 precognition paper was published. For instance, he published a meta-analysis on remote viewing (I think – too lazy to look up what it was exactly) years ago, and he was long involved with a parapsychology society.

      None of that is a problem of course. I haven’t got any qualms with people studying parapsychological topics (in fact, many such phenomena are very interesting). I only have a problem with people presenting over-the-top conclusions about incredible findings which are inconsistent with most of what we currently know about nature. And I have a problem with the “Psi hypothesis” which by its very nature is not a real hypothesis at all in my book.

      I think you’re overly optimistic if you think that preregistration rules out p-hacking. I fully agree that it should reduce it substantially, but I think QRPs can still occur in registered studies. It largely depends on how honest researchers will be in terms of arbitrary subject exclusions etc. But yes, I would take a meta-analysis of 9 preregistered Bem replications over the 90 we currently have.

      “If Bem (2011) had performed only high-powered, pre-registered studies where the results would have been published no matter how they turned out I doubt we would be talking about it right now.”

      Actually, if he had done that and still found the results he did, I’m pretty sure we’d still be talking about it… 😉


  3. Adversarial collaborations of the kind you point to are a nice idea in principle, but the model is not scalable, and puts the onus of doing all the hard work on the wrong people. If you publish a paper claiming that you can observe ESP in your lab, and you take yourself to be doing science, it’s not my job to come to your lab and watch you reproduce everything in detail. It’s your job to determine what you’re doing that makes the effect work, and describe that in a way that others can reproduce independently.

    Now, it’s certainly true that if I can’t reproduce your result in my lab, I’m not entitled to say that “the effect doesn’t exist”. What I can say, however (assuming adequate power, etc.), is that I’m unable to observe the effect under similar conditions. And if enough other people are also unable to reproduce the effect, there is a point at which I should feel perfectly comfortable saying “it’s not worth my time to pursue this any further, even if it could in principle be reproduced under very narrow conditions” and move on. I don’t have to give you the benefit of the doubt; it’s your job to convince me that the effect you’re studying can be systematically elicited and is sufficiently robust to care about. If this bothers you, it should be *your* responsibility to come to my lab and see how *I* do my replications, and then maybe you can point out where I should do things differently.


    1. I don’t think anyone has any responsibility to do anything. If you want people to believe that you can produce an effect but I can’t, then I think it would be in your interest to find out what that difference is (assuming for the moment that I am not just obviously incompetent). Inviting me to an adversarial collaboration would be a way to do that. In the same vein, if you’re really skeptical of some finding, then I think it is in your interest to test it. Certainly, you would want to minimise the risks of QRPs being an issue, but assuming that you did, I think it would be a good idea to suggest an adversarial collaboration.

      It requires a change in research culture but it is a change I would like to see. And it need not be adversarial. It could simply be a question of more eyes on the data which is never a bad thing. It should happen more inside our labs as well as across labs.


  4. Interesting article. Are there any examples of adversarial collaboration being successfully implemented? Also, if an experimenter under observation was not able to replicate the original findings, couldn’t he simply claim that the change in outcome was due to how he (and maybe the participant) was being intrusively observed? Perhaps this could be minimized by a slightly less invasive (but certainly more deceptive) approach in which the lead researcher invites (skeptic) students who observe/inspect/analyze an experiment under the ruse that they are merely there to learn from the experimenter as part of their dissertation.


    1. There are examples but definitely not very many (I can look up some if you’re interested but can’t right now). A few in Psi research (unsurprising perhaps, because it will attract a lot of skepticism) but also in more conventional research. A feature that frequently seems to come up is that the two sides disagree on what the same result means… I don’t think this is a problem though. The reader can make up their mind based on the arguments from either party.

      I don’t think the intrusiveness is an issue. I’m not saying the skeptic must sit in the room the whole time the participant is doing the experiment – unless the experimenter also sits in the room the whole time. The skeptic should just do exactly what the experimenter does and – more importantly – should have complete access to all the materials and data.

