Unless you’ve been living under a rock, you have probably heard of data peeking – also known as “optional stopping”. It’s one of those nasty questionable research practices that could produce a body of scientific literature contaminated by widespread spurious findings and thus lead to poor replicability.
Data peeking is when you run a Frequentist statistical test every time you collect a new subject/observation (or after every few observations) and stop collecting data when the test comes out significant (say, at p < 0.05). Doing this clearly does not accord with good statistical practice because under the Frequentist framework you should plan your final sample size a priori based on power analysis, collect data until you have that sample size, and never look back (but see my comment below for more discussion of this…). What is worse, under the aforementioned data peeking scheme you can be theoretically certain to reject the null hypothesis eventually. Even if the null hypothesis is true, sooner or later you will hit a p-value smaller than the significance threshold.
Until recently, many researchers, at least in psychological and biological sciences, appeared to be unaware of this problem and it isn’t difficult to see that this could contribute to a prevalence of false positives in the literature. Even now, after numerous papers and blog posts have been written about this topic, this problem still persists. It is perhaps less common but I still occasionally overhear people (sometimes even in their own public seminar presentations) saying things like “This effect isn’t quite significant yet so we’ll see what happens after we tested a few more subjects.” So far so bad.
Ever since I heard about this issue (and I must admit that I was also unaware of the severity of this problem back in my younger, carefree days), I have felt somehow dissatisfied with how this issue has been described. While it is a nice illustration of a problem, the models of data peeking seem extremely simplistic to me. There are two primary aspects of this notion that in my opinion just aren’t realistic. First, the notion of indefinite data collection is obviously impossible, as this would imply having an infinite subject pool and other bottomless resources. However, even if you allow for a relatively manageable maximal sample size at which a researcher may finally stop data collection even when the test is not significant, the false positive rate is still massively inflated.
The second issue is therefore a bigger problem: the simple data peeking procedure described above seems grossly fraudulent to me. I would have thought that even if the researcher in question were unaware of the problems with data peeking, they probably would nonetheless feel that something is quite right with checking for significant results after every few subjects and continuing until they get them. As always, I may be wrong about this but I sincerely doubt this is what most “normal“ people do. Rather, I believe people would be more likely to peek at the data to look if the results are significant, and only if the p-value “looks promising” (say 0.05 < p < 0.1) they continue testing. This sampling plan sounds a lot more like what may actually happen. So I wanted to find out how this sort of sampling scheme would affect results. I have no idea if anyone already did something like this. If so, I’d be grateful if you could point me to that analysis.
So what I did is the following: I used Pearson’s correlation as the statistical test. In each iteration of the simulation I generated a data set of 150 subjects, each with two uncorrelated Gaussian variables, let’s just pretend it’s the height of some bump on the subjects’ foreheads and a behavioral score of how belligerent they are. 150 is thus the maximal sample size, assuming that our simulated phrenologist – let’s call him Dr Peek – would not want to test more than 150 subjects. However, Dr Peek actually starts with only 3 subjects and then runs the correlation test. In the simplistic version of data peeking, Dr Peek will stop collecting data if p < 0.05; otherwise he will collect another subject until p < 0.05 or 150 subjects are eventually reached. In addition, I simulated three other sampling schemes that feel more realistic to me. In these cases, Dr Peek will also stop data collection when p < 0.05 but he will also stop when p is either greater than 0.1, greater than 0.3, or greater than 0.5. I repeated each of these simulations 1000 times.
The results are in the graph below. The four sampling schemes are denoted by the different colors. On the y-axis I plotted the proportion of the 1000 simulations in which the final outcome (that is, whenever data collection was stopped) yielded p < 0.05. The scenario I described above is the leftmost set of data points in which the true effect size, the correlation between forehead bump height and belligerence, is zero. Confirming previous reports on data peeking, the simplistic case (blue curve) has an enormously inflated false positive rate of around 0.42. Nominally, the false positive rate should be 0.05. However, under the more “realistic” sampling schemes the false positive rates are far lower. In fact, for the case where data collection only continues while p-values are marginal (0.05 < p < 0.1), the false positive rate is 0.068, only barely above the nominal rate. For the other two schemes, the situation is slightly worse but not by that much. So does this mean that data peeking isn’t really as bad as we have been led to believe?
Hold on, not so fast. Let us now look what happens in the rest of the plot. I redid the same kind of simulation for a range of true effect sizes up to rho = 0.9. The x-axis shows the true correlation between forehead bump height and belligerence. Unlike for the above case when the true correlation is zero, now the y-axis shows statistical power, the proportion of simulations in which Dr Peek concluded correctly that there actual is a correlation. All four curves rise steadily as one might expect with stronger true effects. The blue curve showing the simplistic data peeking scheme rises very steeply and reaches maximal power at a true correlation of around 0.4. The slopes of the other curves are much more shallow and while the power at strong true correlations is reasonable at least for two of them, they don’t reach the lofty heights of the simplistic scheme.
This feels somehow counter-intuitive at first but it makes sense: when the true correlation is strong, the probability of high p-values is low. However, at the very small sample sizes we start out with even a strong correlation is not always detectable – the confidence interval of the estimated correlation is very wide. Thus there will be a relatively large proportion of p-values that pass that high cut-off and terminate data collection prematurely without rejecting the null hypothesis.
Critically, these two things, inflated false positive rates and reduced statistical power to detect true effects, dramatically reduce the sensitivity of any analysis that is performed under these realistic data peeking schemes. In the graph below, I plot the sensitivity (quantified as d’) of the analysis. Larger d’ means there is a more favorable ratio between the number of simulations in which Dr Peek correctly detected a true effect and how often he falsely concluded there was a correlation when there wasn’t one. Sensitivity for the simplistic sample scheme (blue curve) rises steeply until power is maximal. However, sensitivity for the other sampling schemes starts off close to zero (no sensitivity) and only rises fairly slowly.
For reference compare this to the situation under desired conditions, that is, without questionable research practices, with adequate statistical power of 0.8, and the nominal false positive rate of 0.05: in this case the sensitivity would be d’ = 2.49, so higher than any of the realistic sampling schemes ever get. Again, this is not really surprising because data collections will typically be terminated at sample sizes that give far less than 0.8 power. But in any case, this is bad news. Even though the more realistic forms of data peeking don’t inflate false positives as massively as the most pessimistic predictions, they impede the sensitivity of experiments dramatically and are thus very likely to only produce rubbish. It should come as no surprise that many findings fail to replicate.
Obviously, what I call here more realistic data peeking is not necessarily a perfect simulation of how data peeking may work in practice. For one thing, I don’t think Dr Peek would have a fixed cut-off of p > 0.1 or p > 0.5. Rather, such a cut-off might be determined on a case-by-case basis, dependent on the prior expectation Dr Peek has that the experiment should yield significant results. (Dr Peek may not use Bayesian statistics, but like all of us he clearly has Bayesian priors.) In some cases, he may be very confident that there should be an effect and he will continue testing for a while but then finally give up when the p-value is very high. For other hypotheses that he considered to be risky to begin with, he may not be very convinced even by marginal p-values and thus will terminate data collection when p > 0.1.
Moreover, it is probably also unrealistic that Dr Peek would start with a sample size of 3. Rather, it seems more likely that he would have a larger minimal sample size in mind, for example 20 and collect that first. While he may have been peeking at the data before he completed testing 20 subjects, there is nothing wrong with that provided he doesn’t stop early if the result becomes significant. Under these conditions the situation becomes somewhat better but the realistic data peeking schemes still have reduced sensitivity, at least for lower true effect sizes, which are presumably far more prevalent in real world situations. The only reason that sensitivity goes up fairly quickly to reasonable levels is that with the starting sample size of 20 subjects, the power to detect those stronger correlations is already fairly high – so in many cases data collection will be terminated as soon as the minimum sample is completed.
Finally, while I don’t think this plot is entirely necessary, I also show you the false positives / power rates for this latter case. The curves are such beautiful sigmoids that I just cannot help myself but to include them in this post…
So to sum up, leaving aside the fact that you shouldn’t really peek at your data and stop data collection prematurely in any case, if you do this you can shoot yourself seriously in the foot. While the inflation of false positives through data peeking may have contributed a considerable number of spurious, unreplicable findings to the literature, what is worse it may very well also have contributed a great number of false negatives to the proverbial file drawer: experiments that were run but failed to produce significant results after peeking a few times and which were then abandoned, never to be heard of again. When it comes to spurious findings in the literature, I suspect the biggest problem is not actually data peeking but other questionable practices from the Garden of Forking Paths, such as tweaking the parameters of an experiment or the analysis.
* Actually it may just be me…
Matlab code for these simulations. Please let me know if you discover the inevitable bugs in this analysis.