So this post is just a brief follow up to my previous post on data peeking. I hope it will be easy to see why this is very related:
Today I read this long article about the RRR of the pen-in-mouth experiments – another in a growing list of failures to replicate classical psychology findings. I was quite taken aback by one comment in this: the assertion that these classical psychology experiments (in particular the social priming ones) had been “precisely calibrated to elicit tiny changes in behavior.” It is an often repeated argument to explain why findings fail to replicate – the “replicators” simply do not have the expertise and/or skill to redo these delicate experiments. And yes, I am entirely willing to believe that I’d be unable to replicate a lot of experiments outside my area, say, finding subatomic particles or even (to take an example from my general field) difficult studies on clinical populations.
But what does this statement really mean? How were these psychology experiments “calibrated” before they were run? What did the authors do to nail down the methods before they conducted the studies? It implies that extensive pilot experiments were conducted first. I am in no position to say that this is what the authors of these psychology studies did during their piloting stage but one possibility is that several small pilot experiments were run and the experimental parameters were tweaked until a significant result supporting the hypothesis was observed. Only then they continued the experiment and collected a full data set that included the pilot data. I have seen and heard of people who did precisely this sort of piloting until the “experiment worked.”
So, what actually happens when you “pilot” experiments to “precisely calibrate” them? I decided to simulate this and the results are in the graph below (each data point is based on 100,000 simulations). In this simulation, an intrepid researcher first runs a small number of pilot subjects per group (plotted on x-axis). If the pilot fails to produce significant results at p < 0.05, the experiment is abandoned and the results are thrown in the bin never to again see the light of day. However, if the results are significant, the eager researcher collects more data until the full sample in each group is n = 20, 40, or 80. On the y-axis I plotted the proportion of these continued experiments that produced significant results. Note that all simulated groups were drawn from a normal distribution with mean 0 and standard deviation 1. Therefore, any experiments that “worked” (that is, they were significant) are false positives. In a world where publication bias is still commonplace, these are the findings that make it into journals – the rest vanish in the file-drawer.
As you can see, such a scheme of piloting until the experiment “works,” can produce an enormous number of false positives in the completed experiments. Perhaps this is not really all that surprising – after all this is just another form of data peeking. Critically, I don’t think this is unrealistic. I’d wager this sort of thing is not at all uncommon. And doesn’t it seem harmless? After all we are only peeking once! If a pilot experiment “worked,” we continue sampling until the sample is complete.
Well, even under these seemingly benign conditions false positives can be inflated dramatically. The black curve is for the case when the final sample size, of the completed studies, is only 20. This is the worst case and it is perhaps unrealistic. If the pilot experiment consists of 10 subjects (that is, half the full sample) about a third of results will be flukes. But even in the other cases, when only a handful of pilot subjects are collected compared to the much larger full samples, false positives are well above 5%. In other words, whenever you pilot an experiment and decide that it’s “working” because it seems to support your hypothesis, you are already skewing the final outcome.
Of course, the true false positive rate, taken across the whole set of 100,000 pilots that were run, would be much lower (0.05 times the rates I plotted above to be precise, because we picked from the 5% of significant “pilots” in the first place). However, since we cannot know how much of this inappropriate piloting went on behind the scenes, knowing this isn’t particularly helpful.
More importantly, we aren’t only interested in the false positive rate. A lot of researchers will care about the effect size estimates of their experiments. Crucially, this form of piloting will substantially inflate these effect size estimates as well and this may have even worse consequences for the interpretation of these experiments. In the graph below, I plot the effect sizes (the mean absolute Cohen’s d) for the same simulations for which I showed you the false positive rates above. I use the absolute effect size because the sign is irrelevant – the whole point of this simulation exercise is to mimic a full-blown fishing expedition via inappropriate “piloting.” So our researcher will interpret a significant result as meaningful regardless of whether d is positive or negative.
Forgive the somewhat cluttered plot but it’s not that difficult to digest really. The color code is the same as for the previous figure. The open circles and solid lines show you the effect size of the experiments that “worked,” that is, the ones for which we completed data collection and which came out significant. The asterisks and dashed lines show the effect sizes for all of global false positives, that is, all the simulations with p < 0.05 after the pilot but using the full the data set, as if you had completed these experiments. Finally, the crosses and dotted lines show the effect sizes you get for all simulations (ignoring inferential statistics). This is just given as a reference.
Two things are notable about all this. First, effect size estimates increase with “pilot” sample size for the set of global false positives (asterisks) but not the other curves. This is because the “pilot” sample size determines how strongly the fluke pilot effect will contribute to the final effect size. More importantly, the effect size estimates for those experiments with significant pilots and which also “worked” after completion are massively exaggerated (open circles). The degree of exaggeration is a factor of the baseline effect (crosses). The absolute effect size estimate depends on the full sample size. At the smallest full sample size (n=20, black curve) the effect sizes are as high as d = 0.8. Critically, the degree of exaggeration does not depend on how large your pilot sample was. Whether your “pilot” had only 2 or 15 subjects, the average effect size estimate is around 0.8.
The reason for this is that the smaller the pilot experiment, the more underpowered it is. Since it is a condition for continuing the experiment that the pilot must be significant, the pilot effect size must be considerably larger for small pilots than larger pilots. Because the true effect size is always zero, this cancels out in the end so the final effect size estimate is constant regardless of the pilot sample size. But in any case, the effect size estimate you got from your
precisely calibrated inappropriately piloted experiments are enormously overrated. It shouldn’t be much of a surprise if these don’t replicate and that posthoc power calculations based on these effect sizes suggest low power (of course, you should never use posthoc power in that way but that’s another story…) .
So what should we do? Ideally you should just throw away the pilot data, preregister the design, and restart the experiment anew with the methods you piloted. In this case the results are independent and only the methods are shared. Importantly, there is nothing wrong with piloting in general. After all, I had a previous post praising pilot experiments. But piloting should be about ensuring that the methods are effective in producing clean data. There are many situations in which an experiment seems clever and elegant in theory but once you actually start it in practice you realize that it just can’t work. Perhaps the participants don’t use the task strategy you envisioned. Or they simply don’t perceive the stimuli the way they were intended. In fact, this happened to us recently and we may have stumbled over an interesting finding in its own right (but this must also be confirmed by a proper experiment!). In all these situations, however, the decision on the pilot results is unrelated to the hypothesis you are testing. If they are related, you must account for that.
MatLab code for these simulations is available. As always, let me know if you find errors. (To err is human, to have other people check your code divine?)