Boosting power with better experiments

Probably one of the main reasons for the low replicability of scientific studies is that many previous studies have been underpowered – or rather that they only provided inconclusive evidence for or against the hypotheses they sought to test. Alex Etz had a great blog post on this with regard to replicability in psychology (and he published an extension of this analysis that takes publication bias into account as a paper). So it is certainly true that as a whole researchers in psychology and neuroscience can do a lot better when it comes to the sensitivity of their experiments.

A common mantra is that we need larger sample sizes to boost sensitivity. Statistical power is a function of the sample size and the expected effect size. There is a lot of talk out there about what effect size one should use for power calculations. For instance, when planning a replication study, it has been suggested that you should more than double the sample size of the original study. This is supposed to take into account the fact that published effect sizes are probably skewed upwards due to publication bias and analytical flexibility, or even simply because the true effect happens to be weaker than originally reported.

However, what all these recommendations neglect to consider is that standardized effect sizes, like Cohen’s d or a correlation coefficient, are also dependent on the precision of your observations. By reducing measurement error or other noise factors, you can literally increase the effect size. A higher effect size means greater statistical power – so with the same sample size you can boost power by improving your experiment in other ways.

Here is a practical example. Imagine I want to correlate the height of individuals measured in centimeters and inches. This is a trivial case – theoretically the correlation should be perfect, that is, ρ = 1. However, measurement error will spoil this potential correlation somewhat. I have a sample size of 100 people. I first ask my auntie Angie to guess the height of each subject in centimeters. To determine their heights in inches, I then take them all down the pub and ask this dude called Nigel to also take a guess. Both Angie and Nigel will misestimate heights to some degree. For simplicity, let’s just say that their errors are on average the same. This nonetheless means their guesses will not always agree very well. If I then calculate the correlation between their guesses, it will obviously have to be lower than 1, even though this is the true correlation. I simulated this scenario below. On the x-axis I plot the amount of measurement error in cm (the standard deviation of Gaussian noise added to the actual body heights). On the y-axis I plot the median observed correlation and the shaded area is the 95% confidence interval over 10,000 simulations. As you can see, as measurement error increases, the observed correlation goes down and the confidence interval becomes wider.


Greater error leads to poorer correlations. So far, so obvious. But while I call this the observed correlation, it really is the maximally observable correlation. This means that in order to boost power, the first thing you could do is to reduce measurement error. In contrast, increasing your sample size can be highly inefficient and border on the infeasible.
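To make the scenario concrete, here is a minimal Python sketch of this kind of simulation. It is not the code behind the figure. The 170 cm mean and 15 cm standard deviation of true heights, the function names, and the reduced number of simulations are all assumptions made for illustration, and the "95% range" is simply the 2.5th to 97.5th percentile of the simulated correlations.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(noise_sd, n=100, height_sd=15.0, sims=500, seed=1):
    """Return (2.5th percentile, median, 97.5th percentile) of the observed r."""
    rng = random.Random(seed)
    rs = []
    for _ in range(sims):
        # True heights in cm; mean and SD are assumed values.
        heights = [rng.gauss(170.0, height_sd) for _ in range(n)]
        # Angie guesses in cm, Nigel in inches, with equally large errors.
        angie_cm = [h + rng.gauss(0.0, noise_sd) for h in heights]
        nigel_in = [(h + rng.gauss(0.0, noise_sd)) / 2.54 for h in heights]
        rs.append(pearson_r(angie_cm, nigel_in))
    rs.sort()
    return rs[int(0.025 * sims)], rs[sims // 2], rs[int(0.975 * sims)]

for noise_sd in (1, 5, 10, 15, 20):
    lo, med, hi = simulate(noise_sd)
    print(f"error {noise_sd:>2} cm: median r = {med:.2f}, 95% range {lo:.2f} to {hi:.2f}")
```

Because the unit conversion is just a rescaling, it has no influence on the correlation; only the noise matters.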

For a correlation of 0.35, hardly an unrealistically low effect in a biological or psychological scenario, you would need a sample size of 62 to achieve 80% power. Let’s assume this is the correlation found by a previous study and we want to replicate it. Following common recommendations you would plan to collect two-and-a-half times the sample size, so n = 155. Doing so may prove quite a challenge. Assume that each data point involves hours of data collection per participant and/or that it costs hundreds of dollars to acquire the data (neither is atypical in neuroimaging experiments). This may be a considerable additional expense few researchers are able to afford.

And it gets worse. It is quite possible that by collecting more data you further sacrifice data quality. When it comes to neuroimaging data, I have heard from more than one source that some of the large-scale imaging projects contain only mediocre data contaminated by motion and shimming artifacts. The often mentioned suggestion that sample sizes for expensive experiments could be increased by multi-site collaborations ignores that this quite likely introduces additional variability due to differences between sites. The data quality even from the same equipment may differ. The research staff at the two sites may not have the same level of skill or meticulous attention to detail. Behavioral measurements acquired online via a website may be more variable than under controlled lab conditions. So you may end up polluting your effect size even further by increasing sample size.

The alternative is to improve your measurements. In my example here, even going from a measurement error of 20 cm to 15 cm improves the observable effect size quite dramatically, moving from 0.35 to about 0.5. To achieve 80% power, you would only need a sample size of 29. If you kept the original sample size of 62, your power would be 99%. So the critical question is not really what the original effect size was that you want to replicate – rather it is how much you can improve your experiment by reducing noise. If your measurements are already pretty precise to begin with, then there is probably little room for improvement and you also don’t win all that much, as when going from a measurement error of 5 cm to 1 cm in my example. But when the original measurement was noisy, improving the experiment can help a hell of a lot.
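The arithmetic behind these numbers can be sketched with the standard Fisher z approximation for the power of a correlation test. This is a rough check rather than the exact method used for the figures quoted here, so it can differ from them by a participant or so. The function names and the 15 cm spread of true heights are assumptions for illustration.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def observable_r(true_sd, noise_sd):
    """Expected correlation when both measures carry independent noise of equal size."""
    return true_sd ** 2 / (true_sd ** 2 + noise_sd ** 2)

def n_for_power(r, power=0.80, alpha=0.05):
    """Approximate sample size for a two-tailed test of r (Fisher z approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

def power_for_n(r, n, alpha=0.05):
    """Approximate power of a two-tailed test of r with n participants."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(math.atanh(r) * math.sqrt(n - 3) - z_alpha)

# Assumed 15 cm standard deviation of true heights.
print(round(observable_r(15, 20), 2), round(observable_r(15, 15), 2))
print(n_for_power(0.35), n_for_power(0.5))
print(round(power_for_n(0.5, 62), 3))
```

The attenuation formula makes the point of the example explicit: the observable correlation depends only on the ratio of true variance to noise variance, which is exactly what a better measurement improves.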

There are many ways to make your measurements more reliable. It can mean ensuring that your subjects in the MRI scanner are padded in really well, that they are not prone to large head movements, that you do all in your power to maintain a constant viewing distance for each participant, and that they don’t fall asleep halfway through your experiment. It could mean scanning 10 subjects twice, instead of scanning 20 subjects once. It may be that you measure the speed at which participants walk down the hall to the lift with laser sensors instead of having a confederate sit there with a stopwatch. Perhaps you can change from a group comparison to a within-subject design? If your measure is an average across trials collected in each subject, you can enhance the effect size by increasing the number of trials. And it definitely means not giving a damn what Nigel from down the pub says and investing in a bloody tape measure instead.

I’m not saying that you shouldn’t collect larger samples. Obviously, if measurement reliability remains constant*, larger samples can improve sensitivity. But the first thought should always be how you can make your experiment a better test of your hypothesis. Sometimes the only thing you can do is to increase the sample but I bet usually it isn’t – and if you’re not careful, it can even make things worse. If your aim is to conclude something about the human brain/mind in general, a larger and broader sample would allow you to generalize better. However, for this purpose increasing your subject pool from 20 undergraduate students at your university to 100 isn’t really helping. And when it comes to the choice between an exact replication study with three times the sample size of the original experiment, and one with the same sample but objectively better methods, I know I’d always pick the latter.


(* In fact, it’s really a trade-off and in some cases a slight increase of measurement error may very well be outweighed by greater power due to a larger sample size. This probably happens for the kinds of experiments where slight differences in experimental parameters don’t matter much and you can collect hundreds of people fast, for example online or at a public event).

A few thoughts on stats checking

You may have heard of StatCheck, an R package developed by Michèle B. Nuijten. It allows you to search a paper (or manuscript) for common frequentist statistical tests. The program then compares whether the p-value reported in the test matches up with the reported test statistic and the degrees of freedom. It flags up cases where the p-value is inconsistent and, additionally, when the recalculated p-value would change the conclusions of the test. Now, recently this program was used to trawl through 50,000ish papers in psychology journals (it currently only recognizes statistics in APA style). The results on each paper are then automatically posted as comments on the post-publication discussion platform PubPeer, for example here. At the time of writing this, I still don’t know if this project has finished. I assume not because the (presumably) only one of my papers that has been included in this search has yet to receive its comment. I left a comment of my own there, which is somewhat satirical because 1) I don’t take the world as seriously as my grumpier colleagues and 2) I’m really just an asshole…
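The basic idea of such a consistency check can be sketched in a few lines of Python. This is a simplified illustration, not StatCheck's actual algorithm or interface: it handles only two-tailed t-tests, ignores rounding of the reported test statistic, and the function names and example values are hypothetical.

```python
import math

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for Student's t, via Simpson integration of the density."""
    t = abs(t)
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * central)

def check_t(t, df, reported_p, decimals=2, alpha=0.05):
    """Return (recomputed p, consistent with report?, would the decision change?)."""
    p = t_two_tailed_p(t, df)
    consistent = round(p, decimals) == round(reported_p, decimals)
    decision_error = (p < alpha) != (reported_p < alpha)
    return p, consistent, decision_error

# Hypothetical reported results, e.g. "t(28) = 2.20, p = .04"
for t, df, reported in [(2.20, 28, 0.04), (1.70, 28, 0.03)]:
    p, ok, flips = check_t(t, df, reported)
    print(f"t({df}) = {t:.2f}, reported p = {reported}: recomputed p = {p:.3f}, "
          f"consistent = {ok}, changes conclusion = {flips}")
```

Note that a flag from a check like this only says that the three reported numbers do not agree with each other; it cannot say which of them is wrong.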

While many have welcomed the arrival of our StatCheck Overlords, not everyone is happy. For instance, a commenter in this thread bemoans that this automatic stats checking is just “mindless application of stats unnecessarily causing grief, worry, and ostracism. Effectively, a witch hunt.” In a blog post, Dorothy Bishop discusses the case of her own StatCheck comments, one of which gives the paper a clean bill of health and the other discovered some potential errors that could change the significance and thus the conclusions of the study. My own immediate gut reaction to hearing about this was that this would cause a deluge of vacuous comments and that this diminishes the signal-to-noise ratio of PubPeer. Up until now discussions on there frequently focused on serious issues with published studies. If I see a comment on a paper I’ve been looking up (which is made very easy using the PubPeer plugin for Firefox), I would normally check it out. If in future most papers have a comment from StatCheck, I will certainly lose that instinct. Some are worried about the stigma that may be attached to papers when some errors are found although others have pointed out that to err is human and we shouldn’t be afraid of discovering errors.

Let me be crystal clear here. StatCheck is a fantastic tool and should prove immensely useful to researchers. Surely, we all want to reduce errors in our publications, which I am also sure all of us make some of the time. I have definitely noticed typos in my papers and also errors with statistics. That’s in spite of the fact that when I do the statistics myself I use Matlab code that outputs the statistics in the way they should look in the text so all I have to do is copy and paste them in. Some errors are introduced by the copy-editing stage after a manuscript is accepted. Anyway, using StatCheck on our own manuscripts can certainly help reduce such errors in future. It is also extremely useful for reviewing papers and marking student dissertations because I usually don’t have the time (or desire) to manually check every single test by hand. The real question is whether there is really much of a point in doing this post hoc for thousands of already published papers.

One argument for this is to enable people to meta-analyze previous results. Here it is important to know that a statistic is actually correct. However, I don’t entirely buy this argument because if you meta-analyze the literature you really should spend more time checking the results than looking at what a StatCheck auto-comment on PubPeer said. If anything, the countless comments saying that there are zero errors are probably more misleading than the ones that found minor problems. They may actually mislead you into thinking that there is probably nothing wrong with these statistics – and this is not necessarily true. In all fairness, StatCheck, both in its auto-comments and the original paper, is very explicit about the fact that its results aren’t definite and should be verified manually. But if there is one thing I’ve learned about people it is that they tend to ignore the small print. When was the last time you actually read an EULA before agreeing to it?

Another issue with the meta-analysis argument is that presently the search is of limited scope. While 50,000 is a large number, it is a small proportion of scientific papers, even within the field of psychology and neuroscience. I work at a psychology department and am (by some people’s definition) a psychologist but – as I said – to my knowledge only one of my own papers should have even been included in the search so far. So if I do a literature search for a meta-analysis StatCheck’s autopubpeering wouldn’t be much help to me. I’m told there are plans to widen the scope of StatCheck’s robotic efforts beyond psychology journals in the future. When it is more common this may indeed be more useful although the problem remains that the validity of its results is simply unknown.

The original paper includes a validity check in the Appendix. This suggests that error rates are reasonably low when comparing StatCheck’s results to previous checks. This is doubtless important for confirming that StatCheck works. But in the long run this is not really the error rate we are interested in. What this does not tell us is which proportion of papers contain actual errors that affect a study’s conclusions. Take Dorothy Bishop’s paper as an example. For that StatCheck detected two F-tests for which the recalculated p-value would change the statistical conclusions. However, closer inspection reveals that the test was simply misreported in the paper. Only one degree of freedom is reported and I’m told StatCheck misinterpreted what test this was (but I’m also told this has been fixed in the new version). If you substitute in the correct degrees of freedom, the reported p-value matches.

Now, nobody is denying that there is something wrong with how these particular stats were reported. An F-test should have two degrees of freedom. So StatCheck did reveal errors and this is certainly useful. But the PubPeer comment flags this up as a potential gross inconsistency that could theoretically change the study’s conclusions. However, we know that it doesn’t actually mean that. The statistical inference and conclusions are fine. There is merely a typographic error. The StatCheck report is clearly a false positive.

This distinction seems important to me. The initial report about this StatCheck mega-trawl was that “around half of psychology papers have at least one statistical error, and one in eight have mistakes that affect their statistical conclusions.” At least half of this sentence is blatantly untrue. I wouldn’t necessarily call a typo a “statistical error”. But as I already said, revealing these kinds of errors is certainly useful nonetheless. The second part of this statement is more troubling. I don’t think we can conclude 1 in 8 papers included in the search have mistakes that affect their conclusions. We simply do not know that. StatCheck is a clever program but it’s not a sentient AI. The only way to really determine if the statistical conclusions are correct is still to go and read each paper carefully and work out what’s going on. Note that the statement in the StatCheck paper is more circumspect and acknowledges that such firm conclusions cannot be drawn from its results. It’s a classical case of journalistic overreach where the RetractionWatch post simplifies what the researchers actually said. But these are still people who know what they’re doing. They aren’t writing flashy “science” articles for the tabloid press.

This is a problem. I do think we need to be mindful of how the public perceives scientific research. In a world in which it is fine for politicians to win referenda because “people have had enough of experts” and in which a narcissistic, science-denying madman is dangerously close to becoming US President we simply cannot afford to keep telling the public that science is rubbish. Note that worries about the reputation of science are no excuse not to help improve it. Quite to the contrary, it is a reason to ensure that it does improve. I have said many times, science is self-correcting but only if there are people who challenge dearly held ideas, who try to replicate previous results, who improve the methods, and who reveal errors in published research. This must be encouraged. However, if this effort does not go hand in hand with informing people about how science actually works, rather than just “fucking loving” it for its cool tech and flashy images, then we are doomed. I think it is grossly irresponsible to tell people that an eighth of published articles contain incorrect statistical conclusions when the true number is probably considerably smaller.

In the same vein, an anonymous commenter on my own PubPeer thread also suggested that we should “not forget that Statcheck wasn’t written ‘just because.'” There is again an underhanded message in this. Again, I think StatCheck is a great tool and it can reveal questionable results such as rounding down your p=0.054 to p=0.05 or the even more unforgivable p<0.05. It can also reveal other serious errors. However, until I see any compelling evidence that the proportion of such evils in the literature is as high as suggested by these statements I remain skeptical. A mass-scale StatCheck of the whole literature in order to weed out serious mistakes seems a bit like carpet-bombing a city just to assassinate one terrorist leader. Even putting questions of morality aside, it isn’t really very efficient. Because if we assume that some 13% of papers have grossly inconsistent statistics we still need to go and manually check them all. And, what is worse, we quite likely miss a lot of serious errors that this test simply can’t detect.

So what do I think about all this? I’ve come to the conclusion that there is no major problem per se with StatCheck posting on PubPeer. I do think it is useful to see these results, especially if it becomes more general. Seeing all of these comments may help us understand how common such errors are. It allows people to double check the results when they come across them. I can adjust my instinct. If I see one or two comments on PubPeer I may now suspect it’s probably about StatCheck. If I see 30, it is still likely to be about something potentially more serious. So all of this is fine by me. And hopefully, as StatCheck becomes more widely used, it will help reduce these errors in future literature.

But – and this is crucial – we must consider how we talk about this. We cannot treat every statistical error as something deeply shocking. We need to develop a fair tolerance to these errors as they are discovered. This may seem obvious to some but I get the feeling not everybody realizes that correcting errors is the driving force behind science. We need to communicate this to the public instead of just telling them that psychologists can’t do statistics. We can’t just say that some issue with our data analysis invalidates 45,000 studies and 15 years’ worth of fMRI research. In short, we should stop overselling our claims. If, like me, you believe it is damaging when people oversell their outlandish research claims about power poses and social priming, then it is also damaging if people oversell their doomsday stories about scientific errors. Yes, science makes errors – but the fact that we are actively trying to fix them is proof that it works.

Your friendly stats checking robot says hello

On the magic of independent piloting

TL;DR: Never simply decide to run a full experiment based on whether one of the small pilots in which you tweaked your paradigm supported the hypothesis. Use small pilots only to ensure the experiment produces high quality data, judged by criteria that are unrelated to your hypothesis.

Sorry for the bombardment with posts on data peeking and piloting. I felt this would have cluttered up the previous post so I wrote a separate one. After this one I will go back to doing actual work though, I promise! That grant proposal I should be writing has been neglected for too long…

In my previous post, I simulated what happens when you conduct inappropriate pilot experiments by running a small experiment and then continuing data collection if the pilot produces significant results. This is really data peeking and it shouldn’t come as much of a surprise that this inflates false positives and massively skews effect size estimates. I hope most people realize that this is a terrible thing to do because it makes your decision to keep collecting data entirely dependent on the outcome. Quite possibly, some people would have learned about this in their undergrad stats classes. As one of my colleagues put it, “if it ends up in the final analysis it is not a pilot.” Sadly, I don’t think this is as widely known as it should be. I was not kidding when I said that I have seen it happen before or overheard people discussing having done this type of inappropriate piloting.

But anyway, what is an appropriate pilot then? In my previous post, I suggested you should redo the same experiment but restart data collection. You now stick to the methods that gave you a significant pilot result. Now the data set used to test your hypothesis is completely independent, so it won’t be skewed by the pre-selected pilot data. Put another way, your exploratory pilot allows you to estimate a prior, and your full experiment seeks to confirm it. Surely there is nothing wrong with that, right?

I’m afraid there is and it is actually obvious why: your small pilot experiment is underpowered to detect real effects, especially small ones. So if you use inferential statistics to determine if a pilot experiment “worked,” this small pilot is biased towards detecting larger effect sizes. Importantly, this does not mean you bias your experiment towards larger effect sizes. If you only continue the experiment when the pilot was significant, you are ignoring all of the pilots that would have shown true effects but which – due to the large uncertainty (low power) of the pilot – failed to do so purely by chance. Naturally, the proportion of these false negatives becomes smaller the larger you make your pilot sample – but since pilots are by definition small, the error rate is pretty high in any case. For example, for a true effect size of δ = 0.3, the false negative rate at a pilot sample of 2 is 95%. With a pilot sample of 15, it is still as high as 88%. Just for illustration I show below the false negative rates (1-power) for three different true effect sizes. Even for quite decent effect sizes the sensitivity of a small pilot is abysmal:

False Negatives
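A rough way to check false negative rates like these is a small Monte Carlo simulation. The sketch below assumes a two-tailed, two-sample t-test with equal group sizes at α = 0.05; that test, the function names, and the number of simulations are assumptions for illustration, not the original analysis.

```python
import math
import random

def t_two_tailed_p(t, df, steps=1000):
    """Two-tailed p-value for Student's t, via Simpson integration of the density."""
    t = abs(t)
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1 - 2 * total * h / 3)

def t_crit(df, alpha=0.05):
    """Critical |t| for a two-tailed test, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_two_tailed_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = sum(a) / n_a, sum(b) / n_b
    ss = sum((x - m_a) ** 2 for x in a) + sum((x - m_b) ** 2 for x in b)
    return (m_b - m_a) / math.sqrt(ss / (n_a + n_b - 2) * (1 / n_a + 1 / n_b))

def false_negative_rate(n_per_group, delta=0.3, sims=10000, seed=1):
    """Share of simulated pilots that miss a true effect of size delta."""
    rng = random.Random(seed)
    crit = t_crit(2 * n_per_group - 2)
    misses = 0
    for _ in range(sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(delta, 1.0) for _ in range(n_per_group)]
        if abs(pooled_t(a, b)) < crit:
            misses += 1
    return misses / sims

for n in (2, 5, 15):
    print(f"pilot n = {n:>2} per group: false negative rate = {false_negative_rate(n):.2f}")
```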

Thus, if you only pick pilot experiments with significant results to do real experiments you are deluding yourself into thinking that the methods you piloted are somehow better (or “precisely calibrated”). Remember this is based on a theoretical scenario that the effect is real and of fixed strength. Every single pilot experiment you ran investigated the same underlying phenomenon and any difference in outcome is purely due to chance – the tweaking of your methods had no effect whatsoever. You waste all manner of resources piloting some methods you then want to test.

So frequentist inferential statistics on pilot experiments are generally nonsense. Pilots are by nature exploratory. You should only determine significance for confirmatory results. But what are these pilots good for? Perhaps we just want to have an idea of what effect size they can produce and then do our confirmatory experiments for those methods that produce a reasonably strong effect?

I’m afraid that won’t do either. I simulated this scenario in a similar manner as in my previous post. 100,000 times I generated two groups (with a full sample size of n = 80, although the full sample size isn’t critical for this). Both groups are drawn from a population with standard deviation 1 but one group has a mean of zero while the other’s mean is shifted by 0.3 – so we have a true effect size here (the actual magnitude of this true effect size is irrelevant for the conclusions). In each of the 100,000 simulations, the researcher runs a number of pilot subjects per group (plotted on x-axis). Only if the effect size estimate for this pilot exceeds a certain criterion level does the researcher run an independent, full experiment. The criterion is either 50%, 100%, or 200% of the true effect size. Obviously, the researcher cannot know the true effect size, however. I simply use these criteria as something that the researcher might be doing in a real world situation. (For the true effect size I used here, these criteria would be d = 0.15, d = 0.3, or d = 0.6, respectively).

The results are below. The graph on the left once again plots the false negative rates against the pilot sample size. A false negative here is not based on significance but on effect size, so any simulation for which d was below the criterion. When the criterion is equal to the true effect size, the false negative rate is constant at 50%. The reason for this is obvious: each simulation is drawn from a population centered on the true effect of 0.3, so half of these simulations will exceed that value. However, when the criterion is not equal to the true effect the false negative rates depend on the pilot sample size. If the criterion is lower than the true effect, false negatives decrease. If the criterion is strict, false negatives increase. Either way, the false negative rates are substantially greater than the 20% mark you would have with an adequately powered experiment. So you will still delude yourself a considerable number of times if you only conduct the full experiment when your pilot has a particular effect size. Even if your criterion is lax (and d = 0.15 for a pilot sounds pretty lax to me), you are missing a lot of true results. Again, remember that all of the pilot experiments here investigated a real effect of fixed size. Tweaking the method makes no difference. The difference between simulations is simply due to chance.

Finally, the graph on the right shows the mean effect sizes estimated by your completed experiments (but not the absolute values this time!). The criterion you used in the pilot makes no difference here (all colors are at the same level), which is reassuring. However, all is not necessarily rosy. The open circles plot the effect size you get under publication bias, that is, if you only publish the significant experiments with p < 0.05. This effect is clearly inflated compared to the true effect size of 0.3. The asterisks plot the effect size estimate if you take all of the experiments. This is the situation you would have (Chris Chambers will like this) if you did a Registered Report for your full experiment and publication of the results is guaranteed irrespective of whether or not they are significant. On average, this effect size is an accurate estimate of the true effect.

Simulation Results
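The pilot stage of a simulation like this one can be sketched as follows. This is a scaled-down Python illustration, not the original MatLab code: the function names, the use of Cohen's d with a pooled standard deviation, and the number of simulations are assumptions.

```python
import math
import random

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = sum(a) / n_a, sum(b) / n_b
    ss = sum((x - m_a) ** 2 for x in a) + sum((x - m_b) ** 2 for x in b)
    return (m_b - m_a) / math.sqrt(ss / (n_a + n_b - 2))

def miss_rate(n_per_group, criterion, delta=0.3, sims=10000, seed=1):
    """Share of pilots whose effect size estimate falls below the criterion,
    so that the full experiment is never run even though the effect is real."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(delta, 1.0) for _ in range(n_per_group)]
        if cohens_d(a, b) < criterion:
            misses += 1
    return misses / sims

for criterion in (0.15, 0.3, 0.6):  # 50%, 100%, and 200% of the true effect
    rates = ", ".join(f"n={n}: {miss_rate(n, criterion):.2f}" for n in (5, 10, 20))
    print(f"criterion d > {criterion}: {rates}")
```

Every simulated pilot here samples the same true effect, so whether a given pilot clears the criterion is down to sampling noise alone.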

Again, these are only the experiments that were lucky enough to go beyond the piloting stage. You already wasted a lot of time, effort, and money to get here. While the final outcome is solid if publication bias is minimized, you have thrown a considerable number of good experiments into the trash. You’ve also misled yourself into believing that you conducted a valid pilot experiment that honed the sensitivity of your methods when in truth all your pilot experiments were equally mediocre.

I have had a few comments from people saying that they are only interested in large effect sizes and surely that means they are fine? I’m afraid not. As I said earlier already, the principle here is not dependent on the true effect size. It is solely a factor of the low sensitivity of the pilot experiment. Even with a large true effect, your outcome-dependent pilot is a blind chicken that errs around in the dark until it is lucky enough to hit a true effect more or less by chance. For this to happen you must use a very low criterion to turn your pilot into a real experiment. This however also means that if the null hypothesis is true an unacceptable proportion of your pilots produce false positives. Again, remember that your piloting is completely meaningless – you’re simply chasing noise here. It means that your decision whether to go from pilot to full experiment is (almost) completely arbitrary, even when the true effect is large.

So for instance, when the true effect is a whopping δ = 1, and you are using d > 0.15 as a criterion in your pilot of 10 subjects (which is already large for pilots I typically hear about), your false negative rate is nice and low at ~3%. But critically, if the null hypothesis of δ = 0 is true, your false positive rate is ~37%. How often you will fool yourself by turning a pilot into a full experiment depends on the base rate. If you give this hypothesis a 50:50 chance of being true, almost one in three of your pilot experiments will lead you to chase a false positive. If these odds are lower (which they very well may be), the situation becomes increasingly worse.
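The dependence on the base rate can be sketched with a quick back-of-the-envelope calculation. The Python snippet below uses a normal approximation to the sampling distribution of d, with standard error sqrt(2/n) for n subjects per group; that approximation, the function names, and the reading of "fooling yourself" as the share of green-lit pilots that chase a null effect are assumptions, so the output is only a rough guide.

```python
import math
from statistics import NormalDist

def pilot_error_rates(n_per_group, criterion, delta):
    """Approximate false negative and false positive rates of a pilot that is
    green-lit whenever its estimated d exceeds the criterion."""
    se = math.sqrt(2 / n_per_group)  # rough standard error of d (assumption)
    z = NormalDist()
    false_neg = z.cdf((criterion - delta) / se)  # true effect delta, pilot says no
    false_pos = 1 - z.cdf(criterion / se)        # null is true, pilot says go
    return false_neg, false_pos

def share_of_false_go_decisions(false_neg, false_pos, prior_true):
    """Among green-lit pilots, the proportion that chase a null effect."""
    true_go = (1 - false_neg) * prior_true
    false_go = false_pos * (1 - prior_true)
    return false_go / (true_go + false_go)

fn, fp = pilot_error_rates(n_per_group=10, criterion=0.15, delta=1.0)
print(f"false negative rate: {fn:.3f}, false positive rate: {fp:.3f}")
for prior in (0.5, 0.2, 0.1):
    share = share_of_false_go_decisions(fn, fp, prior)
    print(f"prior P(effect is real) = {prior}: {share:.2f} of green-lit pilots are false leads")
```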

What should we do then? In my view, there are two options: either run a well-powered confirmatory experiment that tests your hypothesis based on an effect size you consider meaningful. This is the option I would choose if resources are a critical factor. Alternatively, if you can afford the investment of time, money, and effort, you could run an exploratory experiment with a reasonably large sample size (that is, more than a pilot). If you must, tweak the analysis at the end to figure out what hides in the data. Then, run a well-powered replication experiment to confirm the result. The power for this should be high enough to detect effects that are considerably weaker than the exploratory effect size. This exploratory experiment may sound like a pilot but it isn’t because it has decent sensitivity and the only resource you might be wasting is your time* during the exploratory analysis stage.

The take-home message here is: don’t make your experiments dependent on whether your pilot supported your hypothesis, even if you use independent data. It may seem like a good idea but it’s tantamount to magical thinking. Chances are that you did not refine your method at all. Again (and I apologize for the repetition but it deserves repeating): this does not mean all small piloting is bad. If your pilot is about ensuring that the task isn’t too difficult for subjects, that your analysis pipeline works, that the stimuli appear as you intended, that the subjects aren’t using a different strategy to perform the task, or quite simply to reduce the measurement noise, then it is perfectly valid to run a few people first and it can even be justified to include them in your final data set (although that last point depends on what you’re studying). The critical difference is that the criteria for green-lighting a pilot experiment are completely unrelated to the hypothesis you are testing.

(* Well, your time and the carbon footprint produced by your various analysis attempts. But if you cared about that, you probably wouldn’t waste resources on meaningless pilots in the first place, so this post is not for you…)

MatLab code for this simulation.

On the worthlessness of inappropriate piloting

So this post is just a brief follow-up to my previous post on data peeking. I hope it will be easy to see why the two are closely related:

Today I read this long article about the RRR of the pen-in-mouth experiments – another in a growing list of failures to replicate classical psychology findings. I was quite taken aback by one comment in this: the assertion that these classical psychology experiments (in particular the social priming ones) had been “precisely calibrated to elicit tiny changes in behavior.” It is an often repeated argument to explain why findings fail to replicate – the “replicators” simply do not have the expertise and/or skill to redo these delicate experiments. And yes, I am entirely willing to believe that I’d be unable to replicate a lot of experiments outside my area, say, finding subatomic particles or even (to take an example from my general field) difficult studies on clinical populations.

But what does this statement really mean? How were these psychology experiments “calibrated” before they were run? What did the authors do to nail down the methods before they conducted the studies? It implies that extensive pilot experiments were conducted first. I am in no position to say that this is what the authors of these psychology studies did during their piloting stage but one possibility is that several small pilot experiments were run and the experimental parameters were tweaked until a significant result supporting the hypothesis was observed. Only then did they continue the experiment and collect a full data set that included the pilot data. I have seen and heard of people who did precisely this sort of piloting until the “experiment worked.”

So, what actually happens when you “pilot” experiments to “precisely calibrate” them? I decided to simulate this and the results are in the graph below (each data point is based on 100,000 simulations). In this simulation, an intrepid researcher first runs a small number of pilot subjects per group (plotted on x-axis). If the pilot fails to produce significant results at p < 0.05, the experiment is abandoned and the results are thrown in the bin never to again see the light of day. However, if the results are significant, the eager researcher collects more data until the full sample in each group is n = 20, 40, or 80. On the y-axis I plotted the proportion of these continued experiments that produced significant results. Note that all simulated groups were drawn from a normal distribution with mean 0 and standard deviation 1. Therefore, any experiments that “worked” (that is, they were significant) are false positives. In a world where publication bias is still commonplace, these are the findings that make it into journals – the rest vanish in the file-drawer.
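For readers who want to try this themselves, here is a minimal Python sketch of the piloting scheme described above. It is not the MatLab code behind the figure. It assumes a two-sided z-test with known unit variance as a stand-in for whatever test the original simulation used, it runs fewer simulations, and the names `z_test_p` and `simulate_piloting` are invented for this illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(a, b):
    """Two-sided p-value for a difference in group means.

    Assumption: both groups have a known standard deviation of 1,
    so a z-test stands in for the usual t-test.
    """
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_piloting(n_pilot, n_full, n_sims=20000, alpha=0.05, seed=1):
    """Return (pilots that were continued, continued experiments that 'worked')."""
    rng = random.Random(seed)
    continued = worked = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution, so the null is true.
        a = [rng.gauss(0, 1) for _ in range(n_pilot)]
        b = [rng.gauss(0, 1) for _ in range(n_pilot)]
        if z_test_p(a, b) >= alpha:
            continue  # pilot not significant: experiment abandoned
        continued += 1
        # Pilot "worked": top up each group to the full sample size,
        # keeping the pilot data in the final data set.
        a += [rng.gauss(0, 1) for _ in range(n_full - n_pilot)]
        b += [rng.gauss(0, 1) for _ in range(n_full - n_pilot)]
        if z_test_p(a, b) < alpha:
            worked += 1
    return continued, worked


if __name__ == "__main__":
    for n_pilot in (2, 5, 10):
        continued, worked = simulate_piloting(n_pilot, n_full=20)
        print(n_pilot, continued, round(worked / continued, 3))
```

The ratio `worked / continued` is the quantity plotted on the y-axis: the proportion of continued experiments that end up significant even though no true effect exists. Exact values depend on the seed and on the z-test simplification.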


False Positives

As you can see, such a scheme of piloting until the experiment “works” can produce an enormous number of false positives in the completed experiments. Perhaps this is not really all that surprising – after all this is just another form of data peeking. Critically, I don’t think this is unrealistic. I’d wager this sort of thing is not at all uncommon. And doesn’t it seem harmless? After all we are only peeking once! If a pilot experiment “worked,” we continue sampling until the sample is complete.

Well, even under these seemingly benign conditions false positives can be inflated dramatically. The black curve is for the case when the final sample size, of the completed studies, is only 20. This is the worst case and it is perhaps unrealistic. If the pilot experiment consists of 10 subjects (that is, half the full sample) about a third of results will be flukes. But even in the other cases, when only a handful of pilot subjects are collected compared to the much larger full samples, false positives are well above 5%. In other words, whenever you pilot an experiment and decide that it’s “working” because it seems to support your hypothesis, you are already skewing the final outcome.

Of course, the true false positive rate, taken across the whole set of 100,000 pilots that were run, would be much lower (0.05 times the rates I plotted above to be precise, because we picked from the 5% of significant “pilots” in the first place). However, since we cannot know how much of this inappropriate piloting went on behind the scenes, knowing this isn’t particularly helpful.

More importantly, we aren’t only interested in the false positive rate. A lot of researchers will care about the effect size estimates of their experiments. Crucially, this form of piloting will substantially inflate these effect size estimates as well and this may have even worse consequences for the interpretation of these experiments. In the graph below, I plot the effect sizes (the mean absolute Cohen’s d) for the same simulations for which I showed you the false positive rates above. I use the absolute effect size because the sign is irrelevant – the whole point of this simulation exercise is to mimic a full-blown fishing expedition via inappropriate “piloting.” So our researcher will interpret a significant result as meaningful regardless of whether d is positive or negative.

Forgive the somewhat cluttered plot but it’s not that difficult to digest really. The color code is the same as for the previous figure. The open circles and solid lines show you the effect size of the experiments that “worked,” that is, the ones for which we completed data collection and which came out significant. The asterisks and dashed lines show the effect sizes for all of the global false positives, that is, all the simulations with p < 0.05 after the pilot but using the full data set, as if you had completed these experiments. Finally, the crosses and dotted lines show the effect sizes you get for all simulations (ignoring inferential statistics). This is just given as a reference.

Effect Sizes

Two things are notable about all this. First, effect size estimates increase with “pilot” sample size for the set of global false positives (asterisks) but not the other curves. This is because the “pilot” sample size determines how strongly the fluke pilot effect will contribute to the final effect size. More importantly, the effect size estimates for those experiments with significant pilots and which also “worked” after completion are massively exaggerated (open circles). The degree of exaggeration is a factor of the baseline effect (crosses). The absolute effect size estimate depends on the full sample size. At the smallest full sample size (n=20, black curve) the effect sizes are as high as d = 0.8. Critically, the degree of exaggeration does not depend on how large your pilot sample was. Whether your “pilot” had only 2 or 15 subjects, the average effect size estimate is around 0.8.

The reason for this is that the smaller the pilot experiment, the more underpowered it is. Since it is a condition for continuing the experiment that the pilot must be significant, the pilot effect size must be considerably larger for small pilots than larger pilots. Because the true effect size is always zero, this cancels out in the end so the final effect size estimate is constant regardless of the pilot sample size. But in any case, the effect size estimates you get from your precisely calibrated, inappropriately piloted experiments are enormously exaggerated. It shouldn’t be much of a surprise if these don’t replicate and if posthoc power calculations based on these effect sizes suggest low power (of course, you should never use posthoc power in that way but that’s another story…).
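The inflation of effect sizes can be checked with a short Python sketch along the same lines. Again this is only an illustration under stated assumptions: a z-test with known unit variance decides significance, Cohen’s d is computed with the pooled sample standard deviation, and the function names are hypothetical rather than taken from the original MatLab code.

```python
import random
from statistics import NormalDist, fmean, variance

STD_NORMAL = NormalDist()


def z_test_p(a, b):
    """Two-sided p-value, assuming a known standard deviation of 1 per group."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample variance."""
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (fmean(a) - fmean(b)) / pooled ** 0.5


def mean_abs_d_of_hits(n_pilot, n_full, n_sims=20000, alpha=0.05, seed=1):
    """Mean |d| of experiments with a significant pilot AND a significant final test."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n_pilot)]
        b = [rng.gauss(0, 1) for _ in range(n_pilot)]
        if z_test_p(a, b) >= alpha:
            continue  # pilot abandoned
        a += [rng.gauss(0, 1) for _ in range(n_full - n_pilot)]
        b += [rng.gauss(0, 1) for _ in range(n_full - n_pilot)]
        if z_test_p(a, b) < alpha:
            hits.append(abs(cohens_d(a, b)))
    return fmean(hits), len(hits)


if __name__ == "__main__":
    for n_pilot in (5, 10):
        mean_d, n_hits = mean_abs_d_of_hits(n_pilot, n_full=20)
        print(n_pilot, n_hits, round(mean_d, 2))
```

Since the true effect is zero in every simulated experiment, any mean |d| this far from zero is pure selection: only the flukes survive both filters.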

So what should we do? Ideally you should just throw away the pilot data, preregister the design, and restart the experiment anew with the methods you piloted. In this case the results are independent and only the methods are shared. Importantly, there is nothing wrong with piloting in general. After all, I had a previous post praising pilot experiments. But piloting should be about ensuring that the methods are effective in producing clean data. There are many situations in which an experiment seems clever and elegant in theory but once you actually start it in practice you realize that it just can’t work. Perhaps the participants don’t use the task strategy you envisioned. Or they simply don’t perceive the stimuli the way they were intended. In fact, this happened to us recently and we may have stumbled over an interesting finding in its own right (but this must also be confirmed by a proper experiment!). In all these situations, however, the decision on the pilot results is unrelated to the hypothesis you are testing. If they are related, you must account for that.
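To see why discarding the pilot data matters, one can change a single step of the simulated scheme: after a significant pilot, collect a completely fresh sample instead of topping up the pilot. The sketch below does this under the same simplifying assumptions as before (z-test with known unit variance, hypothetical function names). If the results really are independent of the pilot, the proportion of significant final experiments should sit near the nominal 5%.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(a, b):
    """Two-sided p-value, assuming a known standard deviation of 1 per group."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def fresh_sample_rate(n_pilot, n_full, n_sims=40000, alpha=0.05, seed=1):
    """False positive rate when a significant pilot is followed by fresh data only."""
    rng = random.Random(seed)
    continued = significant = 0
    for _ in range(n_sims):
        pilot_a = [rng.gauss(0, 1) for _ in range(n_pilot)]
        pilot_b = [rng.gauss(0, 1) for _ in range(n_pilot)]
        if z_test_p(pilot_a, pilot_b) >= alpha:
            continue
        continued += 1
        # The pilot data are thrown away; the main experiment starts from scratch.
        a = [rng.gauss(0, 1) for _ in range(n_full)]
        b = [rng.gauss(0, 1) for _ in range(n_full)]
        if z_test_p(a, b) < alpha:
            significant += 1
    return significant / continued


if __name__ == "__main__":
    print(round(fresh_sample_rate(n_pilot=10, n_full=20), 3))
```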

MatLab code for these simulations is available. As always, let me know if you find errors. (To err is human, to have other people check your code divine?)

Realistic data peeking isn’t as bad as you* thought – it’s worse

Unless you’ve been living under a rock, you have probably heard of data peeking – also known as “optional stopping”. It’s one of those nasty questionable research practices that could produce a body of scientific literature contaminated by widespread spurious findings and thus lead to poor replicability.

Data peeking is when you run a Frequentist statistical test every time you collect a new subject/observation (or after every few observations) and stop collecting data when the test comes out significant (say, at p < 0.05). Doing this clearly does not accord with good statistical practice because under the Frequentist framework you should plan your final sample size a priori based on power analysis, collect data until you have that sample size, and never look back (but see my comment below for more discussion of this…). What is worse, under the aforementioned data peeking scheme you can be theoretically certain to reject the null hypothesis eventually. Even if the null hypothesis is true, sooner or later you will hit a p-value smaller than the significance threshold.

Until recently, many researchers, at least in psychological and biological sciences, appeared to be unaware of this problem and it isn’t difficult to see that this could contribute to a prevalence of false positives in the literature. Even now, after numerous papers and blog posts have been written about this topic, this problem still persists. It is perhaps less common but I still occasionally overhear people (sometimes even in their own public seminar presentations) saying things like “This effect isn’t quite significant yet so we’ll see what happens after we tested a few more subjects.” So far so bad.

Ever since I heard about this issue (and I must admit that I was also unaware of the severity of this problem back in my younger, carefree days), I have felt somehow dissatisfied with how this issue has been described. While it is a nice illustration of a problem, the models of data peeking seem extremely simplistic to me. There are two primary aspects of this notion that in my opinion just aren’t realistic. First, the notion of indefinite data collection is obviously impossible, as this would imply having an infinite subject pool and other bottomless resources. However, even if you allow for a relatively manageable maximal sample size at which a researcher may finally stop data collection even when the test is not significant, the false positive rate is still massively inflated.

The second issue is therefore a bigger problem: the simple data peeking procedure described above seems grossly fraudulent to me. I would have thought that even if the researcher in question were unaware of the problems with data peeking, they probably would nonetheless feel that something isn’t quite right with checking for significant results after every few subjects and continuing until they get them. As always, I may be wrong about this but I sincerely doubt this is what most “normal” people do. Rather, I believe people would be more likely to peek at the data to see whether the results are significant, and only if the p-value “looks promising” (say 0.05 < p < 0.1) would they continue testing. This sampling plan sounds a lot more like what may actually happen. So I wanted to find out how this sort of sampling scheme would affect results. I have no idea if anyone already did something like this. If so, I’d be grateful if you could point me to that analysis.

So what I did is the following: I used Pearson’s correlation as the statistical test. In each iteration of the simulation I generated a data set of 150 subjects, each with two uncorrelated Gaussian variables, let’s just pretend it’s the height of some bump on the subjects’ foreheads and a behavioral score of how belligerent they are. 150 is thus the maximal sample size, assuming that our simulated phrenologist – let’s call him Dr Peek – would not want to test more than 150 subjects. However, Dr Peek actually starts with only 3 subjects and then runs the correlation test. In the simplistic version of data peeking, Dr Peek will stop collecting data if p < 0.05; otherwise he will collect another subject until p < 0.05 or 150 subjects are eventually reached. In addition, I simulated three other sampling schemes that feel more realistic to me. In these cases, Dr Peek will also stop data collection when p < 0.05 but he will also stop when p is either greater than 0.1, greater than 0.3, or greater than 0.5. I repeated each of these simulations 1000 times.
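The sampling schemes just described can be sketched in plain Python as follows. This is an illustrative reimplementation, not the Matlab code used for the figures. Subjects are generated one at a time rather than as a block of 150, which should be equivalent. The p-value of the correlation is obtained from the t statistic via a standard finite-series form of Student’s t distribution for integer degrees of freedom. All function names and the `give_up_above` parameter are made up for this sketch; setting `give_up_above=1.0` reproduces the simplistic scheme because p never exceeds 1.

```python
import math
import random


def t_two_sided_p(t, df):
    """Exact two-sided p-value of Student's t for integer df (finite series)."""
    theta = math.atan2(abs(t), math.sqrt(df))
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    cos2 = cos_t * cos_t
    series, term = 0.0, 1.0
    if df % 2:  # odd degrees of freedom
        for j in range((df - 1) // 2):
            series += term
            term *= cos2 * (2 * j + 2) / (2 * j + 3)
        inside = (2 / math.pi) * (theta + sin_t * cos_t * series)
    else:  # even degrees of freedom
        for j in range(df // 2):
            series += term
            term *= cos2 * (2 * j + 1) / (2 * j + 2)
        inside = sin_t * series
    return max(0.0, 1.0 - inside)


def correlation_p(n, sx, sy, sxx, syy, sxy):
    """p-value of Pearson's r computed from running sums."""
    cov = sxy - sx * sy / n
    r2 = cov * cov / ((sxx - sx * sx / n) * (syy - sy * sy / n))
    if r2 >= 1.0:
        return 0.0
    t = math.sqrt(r2 * (n - 2) / (1 - r2))
    return t_two_sided_p(t, n - 2)


def dr_peek(rho, give_up_above, rng, n_start=3, n_max=150, alpha=0.05):
    """Simulate one study; return True if it ends with p < alpha."""
    sx = sy = sxx = syy = sxy = 0.0
    for n in range(1, n_max + 1):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y
        if n < n_start:
            continue
        p = correlation_p(n, sx, sy, sxx, syy, sxy)
        if p < alpha:
            return True   # stop: "significant"
        if p > give_up_above:
            return False  # stop: p no longer looks promising
    return False          # maximal sample size reached


def significant_rate(rho, give_up_above, n_sims=1000, seed=1):
    rng = random.Random(seed)
    return sum(dr_peek(rho, give_up_above, rng) for _ in range(n_sims)) / n_sims


if __name__ == "__main__":
    for cutoff in (1.0, 0.5, 0.3, 0.1):
        print(cutoff, significant_rate(0.0, cutoff))
```

With `rho=0.0` the returned proportion is a false positive rate; with `rho > 0` it is statistical power. The `n_start` argument allows a larger minimal sample size to be tried as well.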

The results are in the graph below. The four sampling schemes are denoted by the different colors. On the y-axis I plotted the proportion of the 1000 simulations in which the final outcome (that is, whenever data collection was stopped) yielded p < 0.05. The scenario I described above is the leftmost set of data points in which the true effect size, the correlation between forehead bump height and belligerence, is zero. Confirming previous reports on data peeking, the simplistic case (blue curve) has an enormously inflated false positive rate of around 0.42. Nominally, the false positive rate should be 0.05. However, under the more “realistic” sampling schemes the false positive rates are far lower. In fact, for the case where data collection only continues while p-values are marginal (0.05 < p < 0.1), the false positive rate is 0.068, only barely above the nominal rate. For the other two schemes, the situation is slightly worse but not by that much. So does this mean that data peeking isn’t really as bad as we have been led to believe?


Hold on, not so fast. Let us now look at what happens in the rest of the plot. I redid the same kind of simulation for a range of true effect sizes up to rho = 0.9. The x-axis shows the true correlation between forehead bump height and belligerence. Unlike for the above case when the true correlation is zero, now the y-axis shows statistical power, the proportion of simulations in which Dr Peek concluded correctly that there actually is a correlation. All four curves rise steadily as one might expect with stronger true effects. The blue curve showing the simplistic data peeking scheme rises very steeply and reaches maximal power at a true correlation of around 0.4. The slopes of the other curves are much more shallow and while the power at strong true correlations is reasonable at least for two of them, they don’t reach the lofty heights of the simplistic scheme.

This feels somehow counter-intuitive at first but it makes sense: when the true correlation is strong, the probability of high p-values is low. However, at the very small sample sizes we start out with even a strong correlation is not always detectable – the confidence interval of the estimated correlation is very wide. Thus there will be a relatively large proportion of p-values that pass that high cut-off and terminate data collection prematurely without rejecting the null hypothesis.

Critically, these two things, inflated false positive rates and reduced statistical power to detect true effects, dramatically reduce the sensitivity of any analysis that is performed under these realistic data peeking schemes. In the graph below, I plot the sensitivity (quantified as d’) of the analysis. Larger d’ means there is a more favorable ratio between the number of simulations in which Dr Peek correctly detected a true effect and how often he falsely concluded there was a correlation when there wasn’t one. Sensitivity for the simplistic sample scheme (blue curve) rises steeply until power is maximal. However, sensitivity for the other sampling schemes starts off close to zero (no sensitivity) and only rises fairly slowly.


For reference compare this to the situation under desired conditions, that is, without questionable research practices, with adequate statistical power of 0.8, and the nominal false positive rate of 0.05: in this case the sensitivity would be d’ = 2.49, so higher than any of the realistic sampling schemes ever get. Again, this is not really surprising because data collections will typically be terminated at sample sizes that give far less than 0.8 power. But in any case, this is bad news. Even though the more realistic forms of data peeking don’t inflate false positives as massively as the most pessimistic predictions, they impede the sensitivity of experiments dramatically and are thus very likely to only produce rubbish. It should come as no surprise that many findings fail to replicate.
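The reference value can be reproduced with the standard signal detection formula, d’ = z(hit rate) − z(false alarm rate), where z is the inverse of the standard normal cumulative distribution. Treating power as the hit rate and the false positive rate as the false alarm rate is how the sensitivity index is used in this post; the function name below is just for illustration.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: difference of the z-transformed rates."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Power of 0.8 at the nominal false positive rate of 0.05
print(round(d_prime(0.80, 0.05), 2))  # 2.49
```

When power is no higher than the false positive rate, d’ is zero: the test then says nothing about whether an effect is present.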

Obviously, what I call here more realistic data peeking is not necessarily a perfect simulation of how data peeking may work in practice. For one thing, I don’t think Dr Peek would have a fixed cut-off of p > 0.1 or p > 0.5. Rather, such a cut-off might be determined on a case-by-case basis, dependent on the prior expectation Dr Peek has that the experiment should yield significant results. (Dr Peek may not use Bayesian statistics, but like all of us he clearly has Bayesian priors.) In some cases, he may be very confident that there should be an effect and he will continue testing for a while but then finally give up when the p-value is very high. For other hypotheses that he considered to be risky to begin with, he may not be very convinced even by marginal p-values and thus will terminate data collection when p > 0.1.

Moreover, it is probably also unrealistic that Dr Peek would start with a sample size of 3. Rather, it seems more likely that he would have a larger minimal sample size in mind, for example 20 and collect that first. While he may have been peeking at the data before he completed testing 20 subjects, there is nothing wrong with that provided he doesn’t stop early if the result becomes significant. Under these conditions the situation becomes somewhat better but the realistic data peeking schemes still have reduced sensitivity, at least for lower true effect sizes, which are presumably far more prevalent in real world situations. The only reason that sensitivity goes up fairly quickly to reasonable levels is that with the starting sample size of 20 subjects, the power to detect those stronger correlations is already fairly high – so in many cases data collection will be terminated as soon as the minimum sample is completed.


Finally, while I don’t think this plot is entirely necessary, I also show you the false positives / power rates for this latter case. The curves are such beautiful sigmoids that I just cannot help myself but to include them in this post…


So to sum up, leaving aside the fact that you shouldn’t really peek at your data and stop data collection prematurely in any case, if you do this you can shoot yourself seriously in the foot. While the inflation of false positives through data peeking may have contributed a considerable number of spurious, unreplicable findings to the literature, what is worse is that it may very well also have contributed a great number of false negatives to the proverbial file drawer: experiments that were run but failed to produce significant results after peeking a few times and which were then abandoned, never to be heard of again. When it comes to spurious findings in the literature, I suspect the biggest problem is not actually data peeking but other questionable practices from the Garden of Forking Paths, such as tweaking the parameters of an experiment or the analysis.

* Actually it may just be me…

Matlab code for these simulations. Please let me know if you discover the inevitable bugs in this analysis.

Experimenter effects in replication efforts

I mentioned the issue of data quality before but reading Richard Morey’s interesting post about standardised effect sizes the other day made me think about this again. Yesterday I gave a lecture discussing Bem’s infamous precognition study and the meta-analysis he recently published of the replication attempts. I hadn’t looked very closely at the meta-analysis data before but for my lecture I produced the following figure:

This shows the standardised effect size for each of the 90 results in that meta-analysis split into four categories. On the left in red we have the ten results by Bem himself (nine of which are his original study and one is a replication of one of them by himself). Next, in orange we have what they call ‘exact replications’ in the meta-analysis, that is, replications that used his program/materials. In blue we have ‘non-exact replications’ – those that sought to replicate the paradigms but didn’t use his materials. Finally, on the right in black we have what I called ‘different’ experiments. These are at best conceptual replications because they also test whether precognition exists but use different experiment protocols. The hexagrams denote the means across all the experiments in each category (these are non-weighted means but it’s not that important for this post).

While the means for all categories are evidently greater than zero, the most notable thing should be that Bem’s findings are dramatically different from the rest. While the mean effect sizes in the other categories are below or barely at 0.1 and there is considerable spread beyond zero in all of them, all ten of Bem’s results are above zero and, with one exception, above 0.1. This is certainly very unusual and there are all sorts of reasons we could discuss for why this might be…

But let’s not. Instead let’s assume for the sake of this post that there is indeed such a thing as precognition and that Daryl Bem simply knows how to get people to experience it. I doubt that this is a plausible explanation in this particular case – but I would argue that for many kinds of experiments such “experimenter effects” are probably notable. In an fMRI experiment different labs may differ considerably in how well they control participants’ head motion or even simply in terms of the image quality of the MRI scans. In psychophysical experiments different experimenters may differ in how well they explain the task to participants or how meticulous they are in ensuring that they really understood the instructions, etc. In fact, the quality of the methods surely must matter in all experiments, whether they are in astronomy, microbiology, or social priming. Now this argument has been made in many forms, most infamously perhaps in Jason Mitchell’s essay “On the emptiness of failed replications” that drew much ire from many corners. You may disagree with Mitchell on many things but not on the fact that good methods are crucial. What he gets wrong is laying the blame for failed replications solely at the feet of “replicators”. Who is to say that the original authors didn’t bungle something up?

However, it is true that all good science should seek to reduce noise from irrelevant factors to obtain as clean observations as possible of the effect of interest. Using again Bem’s precognition experiments as an example, we could hypothesise that he indeed had a way to relax participants to unlock their true precognitive potential that others seeking to replicate his findings did not. If that were true (I’m willing to bet a fair amount of money that it isn’t but that’s not the point), this would indeed mean that most of the replications – failed or successful – in his meta-analysis are only of low scientific value. All of these experiments are more contaminated by noise confounds than his experiments; thus only he provides clean measurements. Standardised effect sizes like Cohen’s d divide the absolute raw effect by a measure of uncertainty or dispersion in the data. The dispersion is a direct consequence of the noise factors involved. So it should be unsurprising that the effect size is greater for experimenters that are better at eliminating unnecessary noise.
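A tiny numerical sketch makes the dependence on noise explicit. It assumes the simplest possible model, which is an assumption of this example and not something taken from the meta-analysis: experimenter-related noise is independent of the true between-subject variability and simply adds to it, so the variances sum. The function name is hypothetical.

```python
import math


def observed_cohens_d(raw_effect, sd_between_subjects, sd_noise):
    """Standardised effect when independent noise adds to the true dispersion."""
    return raw_effect / math.sqrt(sd_between_subjects ** 2 + sd_noise ** 2)


# Same raw effect, increasingly noisy experimenters
for sd_noise in (0.0, 0.5, 1.0, 2.0):
    print(sd_noise, round(observed_cohens_d(0.5, 1.0, sd_noise), 3))
```

Under this model the raw effect never changes, yet the standardised effect size shrinks as the noise grows.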

Statistical inference seeks to estimate the population effect size from a limited sample. Thus, a meta-analytic effect size is an estimate of the “true” effect size from a set of replications. But since this population effect includes the noise from all the different experimenters, it does not actually reflect the true effect? The true effect is people’s inherent precognitive ability. The meta-analytic effect size estimate is spoiling that with all the rubbish others pile on with their sloppy Psi experimentation skills. Surely we want to know the former not the latter? Again, for precognition most of us will probably agree that this is unlikely – it seems more trivially explained by some Bem-related artifact – but in many situations this is a very valid point: Imagine one researcher manages to produce a cure for some debilitating disease but others fail to replicate it. I’d bet that most people wouldn’t run around shouting “Failed replication!”, “Publication bias!”, “P-hacking!” but would want to know what makes the original experiment – the one with the working drug – different from the rest.

The way I see it, meta-analysis of large-scale replications is not the right way to deal with this problem. Meta-analyses of one lab’s replications are worthwhile, especially as a way to summarise a set of conceptually related experiments – but then you need to take them with a grain of salt because they aren’t independent replications. But large-scale meta-analyses across different labs don’t really tell us all that much. They simply don’t estimate the effect size that really matters. The same applies to replication efforts (and I know I’ve said this before). This is the point on which I have always sympathised with Jason Mitchell: you cannot conclude a lot from a failed replication. A successful replication that nonetheless demonstrates that the original claim is false is another story but simply failing to replicate some effect only tells you that something is (probably) different between the original and the replication. It does not tell you what the difference is.

Sure, it’s hard to make that point when you have a large-scale project like Brian Nosek’s “Estimating the reproducibility of psychological science” (I believe this is a misnomer because they mean replicability not reproducibility – but that’s another debate). Our methods sections are supposed to allow independent replication. The fact that so few of their attempts produced significant replications is a great cause for concern. It seems doubtful that all of the original authors knew what they were doing and so few of the “replicators” did. But in my view, there are many situations where this is not the case.

I’m not necessarily saying that large-scale meta-analysis is entirely worthless but I am skeptical that we can draw many firm conclusions from it. In cases where there is reasonable doubt about differences in data quality or experimenter effects, you need to test these differences. I’ve repeatedly said that I have little patience for claims about “hidden moderators”. You can posit moderating effects all you want but they are not helpful unless you test them. The same principle applies here. Rather than publishing one big meta-analysis after another showing that some effect is probably untrue or, as Psi researchers are wont to do, in an effort to prove that precognition, presentiment, clairvoyance or whatever are real, I’d like to see more attempts to rule out these confounds.

In my opinion the only way to do this is through adversarial collaboration. If an honest skeptic can observe Bem conduct his experiments, inspect his materials, and analyse the data for themselves and yet he still manages to produce these findings, that would go a much longer way convincing me that these effects are real than any meta-analysis ever could.

Humans are dirty test tubes



What is selectivity?

TL;DR: Claiming that something is “selective” implies knowledge of the stimulus dimension to which it is tuned. It also does not apply to simple intensity codes because selectivity requires a tuning preference.

Unless you can tell me where on the x-axis “pain” is relative to “rejection” or whatever else dACC may respond to, you can’t really know that this brain area behaves according to either relationship.

This post is just a little rant about an age-old pet peeve of mine: neuroscience studies claiming they found that some brain area is selective for some stimulus/task/mental state etc. This issue recently resurfaced in my mind because of an interesting exchange between Tal Yarkoni on the one hand and Matt Lieberman and Naomi Eisenberger on the other. The latter recently published a study in PNAS suggesting that dACC is “selective for pain”. Yarkoni wrote a detailed rebuttal to their claims criticising the way they inferred selectivity. I recommend following this on-going discourse (response by Lieberman & Eisenberger and a response to their response by Yarkoni). I won’t go into any depth on it here. Rather I want to make a more general comment on why I feel the term selectivity is frequently misused.

Neuroimaging methods like fMRI allow researchers to localise brain regions that respond preferentially to particular stimuli or tasks. This “blobology” is becoming less common now that our field has matured and many experiments are more sophisticated. Nevertheless, localising brain regions that respond more to one experimental condition than another will probably remain a common sight in the neuroimaging literature for a long time to come, even if the most typical such use is probably functional localisers to limit the regions of interest in more complex experiments.

Anyway, in blobological studies, results are frequently reported as showing that your blob is “selective” for something or other: not only are we supposed to believe that the dACC is selective for pain, but also that the FFA is selective for images of faces, the LOC is selective for intact objects, and the MT+ complex is selective for motion. While some of these claims may be correct, they are not demonstrated by blobology using any kind of method. In fact, Lieberman and Eisenberger write in their response to Yarkoni:

“We’ve never seen a response to one of these papers that says they were wrong to make these claims because they didn’t test for the thousands of other things the region of interest might respond to.  Thus the weak form of selectivity, the version we were using, can be stated this way:

Selectivity_weak: The dACC is selective for pain, if pain is a more reliable source of dACC activation than the other terms of interest (executive, conflict, salience).”

Perhaps there has never been a response saying that these were wrong but I think there should have been. What is selectivity? While the dictionary defines “selective” as a synonym of

“discriminating, particular, discerning”

the term has long been established in neuroscience. Take the Nobel Prize winning research by Hubel and Wiesel in the 1960s for example. They discovered that neurons in the visual cortex are selective to the orientation of simple bar stimuli. So for instance a horizontal bar may drive a neuron to fire strongly while a vertical one would not. In the neurophysiology literature one would say that this particular neuron has a “preference” for horizontal bars.

This, however, does not mean that the neuron is “selective” for horizontal bars. The firing rate in response to oblique bars is likely somewhere between the rate for horizontal and vertical ones. The distinction between preference and selectivity may seem like semantic quibbling but it’s not. They are referring to very different concepts and, as I will try to argue, this distinction is important. But let’s not jump ahead.

“Strong” selectivity (to use Lieberman and Eisenberger’s terminology) implies that the neuron only responds to horizontal bars but not much else. In the following images, the contrast of each oriented grating denotes how much of a neuronal response it would produce.

“Strong” orientation selectivity

A “weakly” selective neuron might respond similarly across a wide range of orientations:

“Weak” orientation selectivity

An even less selective neuron would show very similar responses to all orientations and a completely non-selective neuron would respond at the same rate to any visual stimulus regardless of its orientation.

Almost no orientation selectivity

This means we can measure responses of the neuron across the whole range of orientations (or, more generally speaking, across a range of different stimulus values). Thus we can determine not only the preferred orientation but also how strongly the exact orientation modulates neuronal responses. In other words, we can quantify how discerning, how selective the neuron is for orientation. The key point here is that in all these examples the stimulus dimension for which the neuron is selective is the orientation of bar stimuli. The neuron prefers horizontal. It is selective for orientation.
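To make “quantifying selectivity” concrete, here is a minimal Python sketch. It is an illustration only: the Gaussian tuning curves, the function names and the choice of index (one minus the circular variance of the responses, a measure often used for orientation tuning) are assumptions made for this example, not something taken from the studies discussed here.

```python
import cmath
import math


def tuning_curve(pref_deg, width_deg, baseline=0.0, n_orient=12):
    """Responses of a hypothetical neuron with Gaussian orientation tuning."""
    thetas = [i * 180.0 / n_orient for i in range(n_orient)]
    rates = []
    for theta in thetas:
        # Orientation is circular with a period of 180 degrees
        dist = abs(theta - pref_deg) % 180.0
        dist = min(dist, 180.0 - dist)
        rates.append(baseline + math.exp(-0.5 * (dist / width_deg) ** 2))
    return thetas, rates


def _resultant(thetas, rates):
    # Angles are doubled so that 0 and 180 degrees map onto the same direction
    return sum(r * cmath.exp(2j * math.radians(t)) for t, r in zip(thetas, rates))


def preferred_orientation(thetas, rates):
    """Orientation (in degrees, 0 to 180) that the response vector points to."""
    return (math.degrees(cmath.phase(_resultant(thetas, rates))) / 2.0) % 180.0


def selectivity_index(thetas, rates):
    """Near 0 when all orientations drive the neuron equally, near 1 for sharp tuning."""
    return abs(_resultant(thetas, rates)) / sum(rates)


# Three made-up neurons that all prefer horizontal (0 degrees here)
examples = [
    ("strong", tuning_curve(0, 10)),
    ("weak", tuning_curve(0, 40)),
    ("almost none", tuning_curve(0, 40, baseline=5.0)),
]
for label, (thetas, rates) in examples:
    print(label, round(selectivity_index(thetas, rates), 2))
```

All three model neurons share the same preference, yet the index differs a lot between them. That is the distinction in a nutshell: preference is where the tuning curve peaks, selectivity is how much the response changes as you move along the stimulus dimension.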

So at least according to how neurophysiologists have defined selectivity over the past half century, the term refers to how much varying the stimulus along this dimension affects the neuronal response. I hope you will agree now that this does not apply to many of the claims of selectivity in the neuroimaging literature: an experiment that shows stronger responses in FFA to faces than houses shows that FFA prefers faces. It does not show that it doesn’t also prefer other stimuli, a point already discussed by Yarkoni and Lieberman and Eisenberger. More importantly, it does not demonstrate that FFA is selective for faces.

Rather the stimulus dimension for which there may be selectivity here could be loosely categorised as “visual objects” or even just “images”. Face-selectivity implies that the region is sensitive to changing the face identity (or possibly also some other attribute specific to faces). Now, I believe there is fairly good evidence that FFA does just that – however, simply comparing the response to faces to non-face images does not and cannot possibly demonstrate this. Only comparing responses (or response patterns) to different faces can achieve that.

Why does this matter? Is this really not mere semantics? No, because from all of this follows that to demonstrate selectivity requires systematic manipulation of the stimulus space. You must map out how changing the stimulus modulates responses by a neuron or brain region or whatever. You cannot make any claim about selectivity without any concept of how different stimuli relate to each other. For orientation that is simple but for more complex objects it is not. Comparing faces to houses or body parts or animals or tools or whatnot cannot achieve that, not unless you can tell me exactly why one category (say, houses?) should be more similar to faces than another (say, cars?). There are many possible models that could relate different categories. It could be based on low-level visual similarity, semantic similarity, conceptual similarity, etc. There are studies investigating exactly that, for example by using representational similarity analysis – but a discussion of this is outside the scope of this post. The point is that simply randomly comparing different object categories, no matter how many thousands, does not by itself tell you about selectivity.

Hopefully, by now I have convinced you why the claim that the dACC is selective for pain cannot possibly be correct, at least not if it is based only on a blobological method comparing responses to an arbitrary set of stimuli. I am not even sure that selectivity for pain is even conceptually possible. It would imply that there are mental states or tasks that are just not quite pain but not really something else yet either, and that there is a systematic relationship between that and dACC responses. Perhaps this is possible, I don’t know. Either way, no fMRI or NeuroSynth or other analysis comparing pain and rejection and conflict resolution or whatever can demonstrate this.

While we’re at it, showing that responses in dACC covary with the intensity of pain would not confirm selectivity for pain either. All that this shows is that dACC is responsive to pain. Stimulus selectivity and preference go hand in hand: selectivity for pain would imply that this region preferentially responds to a particular pain level but less so to pain that is stronger or weaker.

I added another paragraph because of a discussion I had about this last point on social media: selectivity implies that a neuron or brain area is tuned to a stimulus space and it can only exist if there is also a stimulus preference. An intensity code merely implies that responses increase as the stimulus quantity is increased. While the steepness of this increase can vary and tells you about sensitivity, such a neuron has no stimulus preference because the response either saturates (thus losing sensitivity beyond a certain level) or increases linearly (which is probably biologically implausible). This is mechanistically different from selectivity with different consequences on how the stimulus dimension is represented and how it affects behaviour. A brain area may very well be sensitive to contrast, to pain intensity, or confidence – but unless the code allows you to infer the exact stimulus level from the response it isn’t selective.

Does correlation imply prediction?

TL;DR: Leave-one-out cross-validation is a bad way for testing the predictive power of linear correlation/regression.

Correlation and regression analyses are popular tools in neuroscience and psychology research for analysing individual differences. They fit a model (most typically a linear relationship between two measures) to infer whether the variability in some measure is related to the variability in another measure. Revealing such relationships can help understand the underlying mechanisms. We and others used them in previous studies to test specific mechanistic hypotheses linking brain structure/function and behaviour. They also form the backbone of twin studies of heritability that in turn can implicate genetic and experiential factors in some trait. Most importantly, in my personal view individual differences are interesting because they acknowledge the fact that every human being is unique rather than simply treating variability as noise and averaging across large groups of people.

But typically every report of a correlational finding will be followed by someone zealously pointing out that “Correlation does not imply causation”. And doubtless it is very important to keep that in mind. A statistical association between two variables may simply reflect the fact that they are both related to a third, unknown factor or a correlation may just be a fluke.

Another problem is that the titles of studies using correlation analysis sometimes use what I like to call “smooth narrative” style. Saying that some behaviour is “predicted by” or “depends on” some brain measure makes for far more accessible and interesting reading than dryly talking about statistical correlations. However, it doesn’t sit well with a lot of people, in part because such language may imply a causal link that the results don’t actually support. Jack Gallant seems to regularly point out on Twitter that the term “prediction” should only ever be used when a predictive model is built on some data set but the validity is tested on an independent data set.

Recently I came across an interesting PubPeer thread debating this question. In it, one commenter pointed out that the title of the study under discussion, “V1 surface size predicts GABA concentration”, was unjustified because this relationship explains only about 7% of the variance when using a leave-one-out cross-validation procedure. In this procedure all data points except one are used to fit the regression and the final point is then used to evaluate the fit of the model. This procedure is then repeated n-fold using every data point as evaluation data once.
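The leave-one-out procedure just described can be sketched in a few lines of Python. This is a bare-bones illustration with made-up numbers and hypothetical function names, not the analysis from the PubPeer thread: each point in turn is held out, a straight line is fitted to the rest, and the held-out point is predicted from that line.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_predictions(x, y):
    """Predict every y[i] from a line fitted to all the other points."""
    predictions = []
    for i in range(len(x)):
        # Training set: everything except point i
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        # Evaluation: predict the single held-out point
        predictions.append(slope * x[i] + intercept)
    return predictions


# Made-up numbers purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]
for observed, predicted in zip(y, loo_predictions(x, y)):
    print(f"observed {observed:.2f}  predicted from the other points {predicted:.2f}")
```

The variance explained is then computed by comparing the list of held-out predictions with the observed values, and how exactly that comparison is done turns out to matter.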

Taken at face value this approach sounds very appealing because it uses independent data for making predictions and for testing them. Replication is a cornerstone of science and in some respects cross-validation is an internal replication. So surely this is a great idea? Naive as I am I have long had a strong affinity for this idea.

Cross-validation underestimates predictive power

But not so fast. These notions fail to address two important issues (both of which some commenters on that thread already pointed out): first, it is unclear what amount of variance a model should explain to be important. 7% is not very much but it can nevertheless be of substantial theoretical value. The amount of variance that can realistically be explained by any model is limited by the noise in the data that arises from measurement error or other distortions. So in fact many studies using cross-validation to estimate the variance explained by some models (often in the context of model comparison) instead report the amount of explainable variance accounted for by the model. To derive this one must first estimate the noise ceiling, that is, the realistic maximum of variance that can possibly be explained. This depends on the univariate variability of the measures themselves.

Second, the cross-validation approach is based on the assumption that the observed sample, which is then subdivided into model-fitting and evaluation sets, is a good representation of the population parameters the analysis is attempting to infer. As such, the cross-validation estimate also comes with an error (this issue is also discussed by another blog post mentioned in that discussion thread). What we are usually interested in when we conduct scientific studies is to make an inference about the whole population, say a conclusion that can be broadly generalised to any human brain, not just the handful of undergraduate students included in our experiments. This does not really fit the logic of cross-validation because the evaluation is by definition only based on the same sample we collected.

Because I am a filthy, theory-challenged experimentalist, I decided to simulate this (and I apologise to all my Bayesian friends for yet again conditioning on the truth here…). For a range of sample sizes between n=3 and n=300 I drew a sample from a population with a fixed correlation of rho=0.7 and performed leave-one-out cross-validation to quantify the variance explained by it (using the squared correlation between predicted and observed values). I also performed a standard regression analysis and quantified the variance explained by that. At each sample size I did this 1000 times and then calculated the mean variance explained for each approach. Here are the results:
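The original simulation code is not shown in this post, so the following is only a sketch of how such a simulation could be set up in Python. It assumes bivariate Gaussian data and uses far fewer repetitions and sample sizes than the 1000 runs per sample size described above, just to keep it quick; all function names are made up for the example.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def loo_predictions(x, y):
    """Leave-one-out predictions from a least-squares line."""
    predictions = []
    for i in range(len(x)):
        xt, yt = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        mx, my = sum(xt) / len(xt), sum(yt) / len(yt)
        slope = (sum((a - mx) * (b - my) for a, b in zip(xt, yt))
                 / sum((a - mx) ** 2 for a in xt))
        predictions.append(my + slope * (x[i] - mx))
    return predictions


def simulate(rho, n, n_sims=200, seed=1):
    """Mean variance explained: whole-sample r^2 versus cross-validated r^2."""
    rng = random.Random(seed)
    observed, cross_val = 0.0, 0.0
    for _ in range(n_sims):
        # Draw n pairs from a population with correlation rho
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
        observed += pearson(x, y) ** 2
        # Squared correlation between held-out predictions and observed values
        cross_val += pearson(loo_predictions(x, y), y) ** 2
    return observed / n_sims, cross_val / n_sims


for n in (5, 10, 20, 50):
    obs, cv = simulate(rho=0.7, n=n)
    print(f"n={n:3d}  observed r^2={obs:.3f}  cross-validated r^2={cv:.3f}")
```

The exact numbers will depend on the random seed and the number of repetitions, so treat the output as a qualitative check of the pattern rather than a reproduction of the figures.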


What is immediately clear is that the results strongly depend on the sample size. Let’s begin with the blue line. This represents the variance explained by the standard regression analysis on the whole observed sample. The dotted, black, horizontal line denotes the true effect size, that is, the variance explained by the population correlation (so R^2=49%). The blue line starts off well above the true effect but then converges on it. This means that at small sample sizes, especially below n=10, the observed sample inflates how much variance is explained by the fitted model.

Next look at the red line. This denotes the variance explained by the leave-one-out cross-validation procedure. This also starts off above the true population effect and follows the decline of the observed correlation. But then it actually undershoots and goes well below the true effect size. Only then does it gradually increase again and converge on the true effect. So at sample sizes that are most realistic in individual differences research, n=20-100ish, this cross-validation approach underestimates how much variance a regression model can explain and thus in fact undervalues the predictive power of the model.

The error bars in this plot denote +/- 1 standard deviation across the simulations at each sample size. So as one would expect, the variability across simulations is considerable when sample size is small, especially when n <= 10. These sample sizes are maybe unusually small but certainly not unrealistically small. I have seen publications calculating correlations on such small samples. The good news here is that even with such small samples on average the effect may not be inflated massively (let’s assume for the moment that publication bias or p-hacking etc are not an issue). However, cross-validation is not reliable under these conditions.

A correlation of rho=0.7 is unusually strong for most research. So I repeated this simulation analysis using a perhaps more realistic effect size of rho=0.3. Here is the plot:

Now we see a hint of something fascinating: the variance explained by the cross-validation approach actually subtly exceeds that of the observed sample correlation. They again converge on the true population level of 9% when the sample size reaches n=50. Actually there is again an undershoot but it is negligible. But at least for small samples with n <= 10 the cross-validation certainly doesn’t perform any better than the observed correlation. Both massively overestimate the effect size.

When the null hypothesis is true…

So if this is what is happening at a reasonably realistic rho=0.3, what about when the null hypothesis is true? This is what is shown here (I apologise for the error bars extending into the impossible negative range but I’m too lazy to add that contingency to the code…):


The problem we saw hinted at above for rho=0.3 is exacerbated here. As before, the variance explained for the observed sample correlation is considerably inflated when sample size is small. However, for the cross-validated result this situation is much worse. Even at a sample size of n=300 the variance explained by the cross-validation is greater than 10%. If you read the PubPeer discussion I mentioned, you’ll see that I discussed this issue. This result occurs because when the null hypothesis is true – or the true effect is very weak – the cross-validation will produce significant correlations between the inadequately fitted model predictions and the actual observed values. These correlations can be positive or negative (that is, the predictions systematically go in the wrong direction) but because the variance explained is calculated by squaring the correlation coefficient they turn into numbers substantially greater than 0%.
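Why would predictions systematically go in the wrong direction? A small Python sketch may help; it is an illustration under simplifying assumptions (Gaussian data, hypothetical function names), not the code behind the figure. The first part uses an even simpler model than regression, namely predicting each held-out value from the mean of all the others. Leaving out a high value lowers that mean and leaving out a low value raises it, so the predictions are perfectly negatively correlated with the observations. The second part counts how often the same thing happens for leave-one-out regression when the true correlation is zero.

```python
import math
import random


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def loo_mean_predictions(y):
    """Predict each value from the mean of all the other values."""
    total, n = sum(y), len(y)
    return [(total - yi) / (n - 1) for yi in y]


def loo_regression_predictions(x, y):
    """Leave-one-out predictions from a least-squares line."""
    predictions = []
    for i in range(len(x)):
        xt, yt = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        mx, my = sum(xt) / len(xt), sum(yt) / len(yt)
        slope = (sum((a - mx) * (b - my) for a, b in zip(xt, yt))
                 / sum((a - mx) ** 2 for a in xt))
        predictions.append(my + slope * (x[i] - mx))
    return predictions


def null_simulation(n, n_sims=300, seed=1):
    """Under rho=0: share of negative prediction correlations and mean r^2."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to x
        rs.append(pearson(loo_regression_predictions(x, y), y))
    frac_negative = sum(r < 0 for r in rs) / n_sims
    mean_r2 = sum(r * r for r in rs) / n_sims
    return frac_negative, mean_r2


y = [3.0, 1.0, 4.0, 1.5, 5.0]  # arbitrary made-up values
print("mean-only model, r =", round(pearson(loo_mean_predictions(y), y), 3))
frac, r2 = null_simulation(n=20)
print(f"null regression, n=20: {frac:.0%} negative, mean r^2 = {r2:.3f}")
```

Squaring throws away the sign, so a prediction that is reliably wrong counts just as much towards the variance explained as one that is reliably right.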

As I discussed in that thread, there is another way to calculate the variance explained by the cross-validation. I won’t go into detail on this but unlike the simpler approach I employed here this does not limit the variance explained to fall between 0-100%. While the estimates are numerically different, the pattern of results is qualitatively the same. At smaller sample sizes the variance explained by cross-validation systematically underestimates the true variance explained.
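Which alternative calculation is meant is not spelled out here. One common candidate, and this is an assumption rather than a description of what was done in that thread, is the coefficient of determination computed directly from the prediction errors, that is, one minus the ratio of the residual to the total sum of squares. Unlike a squared correlation it penalises predictions that go in the wrong direction and can drop below zero. A minimal Python sketch with made-up values:

```python
import math


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def r_squared(observed, predicted):
    """1 - SS_residual / SS_total; negative if worse than predicting the mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up data; each value is predicted by the mean of the other three
y = [1.0, 2.0, 3.0, 4.0]
predictions = [(sum(y) - yi) / (len(y) - 1) for yi in y]

print("squared correlation:", round(pearson(predictions, y) ** 2, 3))
print("1 - SS_res/SS_tot:  ", round(r_squared(y, predictions), 3))
```

For these backwards predictions the squared correlation looks perfect, while the error-based measure is negative, which is why the two approaches give numerically different estimates.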

When the interocular traumatic test is significant…

My last example is the opposite scenario. While we already looked at an unusually strong correlation, I decided to also simulate a case where the effect should be blatantly obvious. Here rho=0.9:


Unsurprisingly, the results are similar to those seen for rho=0.7 but now the observed correlation is already doing a pretty decent job at reaching the nominal level of 81% variance explained. Still, the cross-validation underperforms at small sample sizes. In this situation, this actually seems to be a problem. It is rare that one would observe a correlation of this magnitude in psychological or biological sciences but if so chances are good that the sample size is small in that case. Often the reason for this may be that correlation estimates are inflated at small sample sizes but that’s not the point here. The point is that leave-one-out cross-validation won’t tell you. It underestimates the association even if it is real.

Where does all this leave us?

It is not my intention to rule out cross-validation. It can be a good approach for testing models and is often used successfully in the context of model comparison or classification analysis. In fact, as the debate about circular inference in neuroscience a few years ago illustrated, there are situations where it is essential that independent data are used. Cross-validation is a great way to deal with overfitting. Just don’t let yourself be misled into believing it can tell you something it doesn’t. I know it is superficially appealing and I had played with it previously for just that reason – but this exercise has convinced me that it’s not as bullet-proof as one might think.

Obviously, validation of a model with independent data is a great idea. A good approach is to collect a whole independent replication sample but this is expensive and may not always be feasible. Also, if a direct replication is performed it seems better that this is acquired independently by different researchers. A collaborative project could do this in which each group uses the data acquired by the other group to test their predictive model. But that again is not something that is likely to become regular practice anytime soon.

In the meantime we can also remember that performing typical statistical inference is a good approach after all. Its whole point is to infer the properties of the whole population from a sample. When used properly it tends to do a good job at that. Obviously, we should take measures to improve its validity, such as increasing power by using larger samples and/or better measurements. I know I am baysed but Bayesian hypothesis tests seem better at ensuring validity than traditional significance testing. Registered Reports can probably also help and certainly should reduce the skew by publication bias and flexible analyses.

Wrapping up

So, does correlation imply prediction? I think so. Statistically this is precisely what it does. It uses one measure (or multiple measures) to make predictions of another measure. The key point is not whether calling it a prediction is valid but whether the prediction is sufficiently accurate to be important. The answer to this question actually depends considerably on what we are trying to do. A correlation explaining 10-20% of the variance in a small sample is not going to be a clear biomarker for anything. I sure as hell wouldn’t want any medical or judicial decisions to be based solely on such an association. But it may very well be very informative about mechanisms. It is a clearly detectable effect even with the naked eye.

In the context of these analyses, a better way than quantifying the variance explained is to calculate the root mean squared deviation (essentially the error bar) of the prediction. This provides an actually much more direct index of how accurately one variable predicts another. The next step – and I know I sound like a broken record – should be to confirm that these effects are actually scientifically plausible. This mantra is true for individual differences research as much as it is for Bem’s precognition and social priming experiments where I mentioned it before. Are the differences in neural transmission speed or neurotransmitter concentration implied by these correlation results realistic based on what we know about the brain? These are the kinds of predictions we should actually care about in these discussions.
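As a minimal sketch of the root mean squared deviation idea (made-up numbers, hypothetical function names, and leave-one-out predictions chosen here only as one possible source of predictions), the calculation could look like this in Python:

```python
import math


def loo_predictions(x, y):
    """Leave-one-out predictions from a least-squares line."""
    predictions = []
    for i in range(len(x)):
        xt, yt = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        mx, my = sum(xt) / len(xt), sum(yt) / len(yt)
        slope = (sum((a - mx) * (b - my) for a, b in zip(xt, yt))
                 / sum((a - mx) ** 2 for a in xt))
        predictions.append(my + slope * (x[i] - mx))
    return predictions


def rmsd(observed, predicted):
    """Root mean squared deviation, in the same units as the observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up numbers purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]
print("typical prediction error:", round(rmsd(y, loo_predictions(x, y)), 2))
```

Because the result is in the units of the predicted measure, it can be compared directly against what would count as a meaningful difference in that measure.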


Visualising group data

Recently I have been thinking a bit about what the best way is to represent group data. The most typical way this is done is by showing summary statistics (usually the mean) and error bars (usually standard errors) either in bar plots or in plots with lines and symbols. A lot of people seem to think this is not an appropriate way to visualise results because it obscures the data distribution and also whether outliers may influence the results. One reason prompting me to think about this is that in at least one of our MSc courses students are explicitly told by course tutors that they should be plotting individual subject data. It is certainly true that close inspection of your data is always important – but I am not convinced that it is the only and best way to represent all sorts of data. In particular, looking at the results from an experiment of a recent student of mine you wouldn’t make heads or tails from just plotting individual data. Part of the reason is that most of the studies we do use within-subject designs and standard ways of plotting individual data points can actually be misleading. There are probably better ones, and perhaps my next post will deal with that.

For now though I want to only consider group data which were actually derived from between-subject or at least mixed designs. A recently published study in Psychological Science reported that sad people are worse at discriminating colours along the blue-yellow colour axis but not along the red-green colour axis. This sparked a lot of discussion on Twitter and in the blogosphere, for example this post by Andrew Gelman and also this one by Daniel Lakeland. Publications like this tend to attract a lot of coverage by mainstream media and this was no exception. This then further fuels the rage of skeptical researchers :P. There are a lot of things to debate here, from the fact that the study authors interpret a difference between differences as significant without testing the interaction, the potential inadequacy of the general procedure for measuring perceptual differences (raw accuracy rather than a visual threshold measure), and also that outliers may contribute to the main result. I won’t go into this discussion but I thought this data set (which to the authors’ credit is publicly available) would be a good example for my musings.

So here I am representing the data from their first study by plotting it in four different ways. The first plot, in the upper left, is a bar plot showing the means and standard errors for the different experimental conditions. The main result in the article is that the difference between control and sadness is significant for discriminating colours along the blue-yellow axis (the two bars on the left).


And judging by the bar graph you could certainly be forgiven for thinking so (I am using the same truncated scale used in the original article). The error bars seem reasonably well separated and this comparison is in fact statistically significant at p=0.0147 on a parametric independent sample t-test or p=0.0096 on a Mann-Whitney U-test (let’s ignore the issue of the interaction for this example).

Now consider the plot in the upper right though. Here we have the individual data points for the different groups and conditions. To give an impression of how the data are distributed, I added a little Gaussian noise to the x-position of each point. The data are evidently quite discrete due to the relatively small number of trials used to calculate the accuracy for every subject. Looking at the data in this way does not seem to give a very clear impression that there is a substantial difference between the control and sadness groups in either colour condition. The most noticeable difference is that there is one subject in the sadness group whose accuracy is not matched with any counterpart in the control group, at 0.58 accuracy. Is this an outlier pulling the result?

Next I generated a box-and-whisker plot in the lower left panel. The boxes in these plots denote the inter-quartile range (IQR, i.e. between 25th and 75th percentile of the data), the red lines indicate the medians, the error bars denote a range of 1.5 times the IQR beyond the percentiles (although it is curtailed when there are no data points beyond that range as by the ceiling at 1), and the red crosses are outliers that fall outside this range. The triangular notches surrounding the medians are a way to represent uncertainty and if they do not overlap (as is the case for the blue-yellow data) this suggests a difference between medians at the 5% significance level. Clearly the data point at 0.58 accuracy in the sadness group is considered an outlier in this plot although it is not the only one.
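For readers who want to see what goes into such a plot, here is a rough Python sketch of the underlying numbers. It rests on several assumptions: the data are invented (they are not the study’s data), the quartiles use Python’s “inclusive” interpolation, which need not match Matlab’s percentile method exactly, and the notch half-width of 1.57 × IQR / √n is the convention I understand Matlab’s boxplot to follow.

```python
import math
import statistics


def box_stats(data):
    """Quartiles, whisker ends, outliers and notch limits for one group."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    half_notch = 1.57 * iqr / math.sqrt(len(data))
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme data points within the fences
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "notch": (median - half_notch, median + half_notch),
    }


# Invented accuracies in percent correct, with a ceiling at 100
group = [58, 80, 82, 84, 86, 88, 90, 92, 100]
for key, value in box_stats(group).items():
    print(key, value)
```

Comparing two groups then comes down to checking whether their notch intervals overlap, which is the visual test described above.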

Finally, I also wrote a Matlab function to create cat-eye plots (Wikipedia calls those violin plots – personally they look mostly like bottles, amphoras or vases to me – or, in this case, like balloons). This is shown in the lower right panel. These plots show the distribution of the data in each condition smoothed by a kernel density function. The filled circles indicate the median, the vertical lines the inter-quartile range, and the asterisk the mean. Plots like this seem to be becoming more popular lately. They do have the nice feature that they give a fairly direct impression of how the data are distributed. It seems fairly clear that these are not normal distributions, which probably has largely to do with the ceiling effect: as accuracy cannot be higher than 1 the distributions are truncated there. The critical data set, the blue-yellow discrimination for the sadness group, has a fairly thick tail towards the bottom which is at least partially due to that outlier. This all suggests that the traditional t-test was inappropriate here but then again we did see a significant difference on the U-test. And certainly, visual inspection still suggests that there may be a difference here.

Next I decided to see what happens if I remove this outlier at 0.58. For consistency, I also removed their data from the red-green data set. This change does not alter the statistical inference in a qualitative way even though the p-values increase slightly. The t-test is still significant at p=0.0259 and the U-test at p=0.014.


Again, the bar graph shows a fairly noticeable difference. The scatter plot of the individual data points on the other hand now hardly seems to show any difference. Both the whisker and the cat-eye plot seem to still show qualitatively similar results as when the outlier is included. There seems to be a difference in medians for the blue-yellow data set. The cat-eye plot makes it more apparent that the tail of the distribution for the sadness group is quite heavy, something that isn’t that clear in the whisker plot.

Finally, I decided to simulate a new data set with a similar pattern of results but in which I knew the ground truth. All four data sets contained 50 data points that were chosen from a Gaussian distribution with mean of 70 and standard deviation of 10 (I am a moron and therefore generated these on a scale of percent rather than proportion correct – and now I’m too lazy to replot all this just to correct it. It doesn’t matter really). For the control group in the blue-yellow condition I added an offset of 5 while in the sadness group I subtracted 5. This means that there is a significant difference (t-test: p=0.0017; U-test: p=0.0042).


Now all four types of plot fairly clearly reflect this difference between control and sadness groups. The bar graph in particular clearly reflects the true population means in each group. But even in the scatter plot the difference is clearly apparent even though the distributions overlap considerably. The difference seems a lot less obvious in the whisker and cat-eye plots however. The notches in the whisker plot do not overlap although they seem to be very close. The difference seems to be more visually striking for the cat-eye plot but it isn’t immediately apparent from the plot how much confidence this should instill in this result.

Conclusions & Confusions

My preliminary conclusion is that all of this is actually more confusing than I thought. I am inclined to agree that the bar graph (or a similar symbol and error bar plot) seems to overstate the strength of the evidence somewhat (although one should note that this is partly because of the truncated y-scales that such plots usually employ). On the other hand, showing the individual subject data does seem to understate the results considerably except when the effect is pretty strong. So perhaps things like whisker or cat-eye (violin/bottle/balloon) plots are the most suitable but in my view they also aren’t as intuitive as some people seem to suggest. Obviously, I am not the first person who has thought about these things nor have I spent an extraordinarily long time thinking about it. It might be useful to conduct an experiment/survey in which people have to judge the strength of effects based on different kinds of plot. Anyway, in general I would be very curious to hear other people’s thoughts.

The Matlab code and data file for these examples can be found here.