I mentioned the issue of data quality before but reading Richard Morey’s interesting post about standardised effect sizes the other day made me think about this again. Yesterday I gave a lecture discussing Bem’s infamous precognition study and the meta-analysis he recently published of the replication attempts. I hadn’t looked very closely at the meta-analysis data before but for my lecture I produced the following figure:
This shows the standardised effect size for each of the 90 results in that meta-analysis split into four categories. On the left in red we have the ten results by Bem himself (nine of which are his original study and one is a replication of one of them by himself). Next, in orange we have what they call ‘exact replications’ in the meta-analysis, that is, replications that used his program/materials. In blue we have ‘non-exact replications’ – those that sought to replicate the paradigms but didn’t use his materials. Finally, on the right in black we have what I called ‘different’ experiments. These are at best conceptual replications because they also test whether precognition exists but use different experiment protocols. The hexagrams denote the means across all the experiments in each category (these are non-weighted means but it’s not that important for this post).
While the means for all categories are evidently greater than zero, the most notable thing should be that Bem’s findings are dramatically different from the rest. While the mean effect size in the other categories are below or barely at 0.1 and there is considerable spread beyond zero in all of them, all ten of Bem’s results are above zero and, with one exception, above 0.1. This is certainly very unusual and there are all sorts of reasons we could discuss for why this might be…
But let’s not. Instead let’s assume for the sake of this post that there is indeed such a thing as precognition and that Daryl Bem simply knows how to get people to experience it. I doubt that this is a plausible explanation in this particular case – but I would argue that for many kinds of experiments such “experimenter effects” are probably notable. In an fMRI experiment different labs may differ considerably in how well they control participants’ head motion or even simply in terms of the image quality of the MRI scans. In psychophysical experiments different experimenters may differ in how well they explain the task to participants or how meticulous they are in ensuring that they really understood the instructions, etc. In fact, the quality of the methods surely must matter in all experiments, whether they are in astronomy, microbiology, or social priming. Now this argument has been made in many forms, most infamously perhaps in Jason Mitchell’s essay “On the emptiness of failed replications” that drew much ire from many corners. You may disagree with Mitchell on many things but not on the fact that good methods are crucial. What he gets wrong is laying the blame for failed replications solely at the feet of “replicators”. Who is to say that the original authors didn’t bungle something up?
However, it is true that all good science should seek to reduce noise from irrelevant factors to obtain as clean observations as possible of the effect of interest. Using again Bem’s precognition experiments as an example, we could hypothesise that he indeed had a way to relax participants to unlock their true precognitive potential that others seeking to replicate his findings did not. If that were true (I’m willing to bet a fair amount of money that it isn’t but that’s not the point), if true, this would indeed mean that most of the replications – failed or successful – in his meta-analysis are only of low scientific value. All of these experiments are more contaminated by noise confounds than his experiments; thus only he provides clean measurements. Standardised effect sizes like Cohen’s d divide the absolute raw effect by a measure of uncertainty or dispersion in the data. The dispersion is a direct consequence of the noise factors involved. So it should be unsurprising that the effect size is greater for experimenters that are better at eliminating unnecessary noise.
Statistical inference seeks to estimate the population effect size from a limited sample. Thus, a meta-analytic effect size is an estimate of the “true” effect size from a set of replications. But since this population effect includes the noise from all the different experimenters, it does not actually reflect the true effect? The true effect is people’s inherent precognitive ability. The meta-analytic effect size estimate is spoiling that with all the rubbish others pile on with their sloppy Psi experimentation skills. Surely we want to know the former not the latter? Again, for precognition most of us will probably agree that this is unlikely – it seems more trivially explained by some Bem-related artifact – but in many situations this is a very valid point: Imagine one researcher manages to produce a cure for some debilitating disease but others fail to replicate it. I’d bet that most people wouldn’t run around shouting “Failed replication!”, “Publication bias!”, “P-hacking!” but would want to know what makes the original experiment – the one with the working drug – different from the rest.
The way I see that, meta-analysis of large scale replications is not the right way to deal with this problem. Meta-analysis of one lab’s replications are worthwhile, especially as a way to summarise a set of conceptually related experiments – but then you need to take them with a grain of salt because they aren’t independent replications. But large-scale meta-analysis across different labs don’t really tell us all that much. They simply don’t estimate the effect size that really matters. The same applies to replication efforts (and I know I’ve said this before). This is the point on which I have always sympathised with Jason Mitchell: you cannot conclude a lot from a failed replication. A successful replication that nonetheless demonstrates that the original claim is false is another story but simply failing to replicate some effect only tells you that something is (probably) different between the original and the replication. It does not tell you what the difference is.
Sure, it’s hard to make that point when you have a large-scale project like Brian Nosek’s “Estimating the reproducibility of psychological science” (I believe this is a misnomer because they mean replicability not reproducibility – but that’s another debate). Our methods sections are supposed to allow independent replication. The fact that so few of their attempts produced significant replications is a great cause for concern. It seems doubtful that all of the original authors knew what they were doing and so few of the “replicators” did. But in my view, there are many situations where this is not the case.
I’m not necessarily saying that large-scale meta-analysis is entirely worthless but I am skeptical that we can draw many firm conclusions from it. In cases where there is reasonable doubt about differences in data quality or experimenter effects, you need to test these differences. I’ve repeatedly said that I have little patience for claims about “hidden moderators”. You can posit moderating effects all you want but they are not helpful unless you test them. The same principle applies here. Rather than publishing one big meta-analysis after another showing that some effect is probably untrue or, as Psi researchers are wont to do, in an effort to prove that precognition, presentiment, clairvoyance or whatever are real, I’d like to see more attempts to rule out these confounds.
In my opinion the only way to do this is through adversarial collaboration. If an honest skeptic can observe Bem conduct his experiments, inspect his materials, and analyse the data for themselves and yet he still manages to produce these findings, that would go a much longer way convincing me that these effects are real than any meta-analysis ever could.