TL;DR: Leave-one-out cross-validation is a bad way for testing the predictive power of linear correlation/regression.
Correlation or regression analysis are popular tools in neuroscience and psychology research for analysing individual differences. It fits a model (most typically a linear relationship between two measures) to infer whether the variability in some measure is related to the variability in another measure. Revealing such relationships can help understand the underlying mechanisms. We and others used it in previous studies to test specific mechanistic hypotheses linking brain structure/function and behaviour. It also forms the backbone of twin studies of heritability that in turn can implicate genetic and experiential factors in some trait. Most importantly, in my personal view individual differences are interesting because they acknowledge the fact that every human being is unique rather than simply treating variability as noise and averaging across large groups people.
But typically every report of a correlational finding will be followed by someone zealously pointing out that “Correlation does not imply causation”. And doubtless it is very important to keep that in mind. A statistical association between two variables may simply reflect the fact that they are both related to a third, unknown factor or a correlation may just be a fluke.
Another problem is that the titles of studies using correlation analysis sometimes use what I like to call “smooth narrative” style. Saying that some behaviour is “predicted by” or “depends on” some brain measure makes for far more accessible and interesting reading that dryly talking about statistical correlations. However, it doesn’t sit well with a lot of people, in part because such language may imply a causal link that the results don’t actually support. Jack Gallant seems to regularly point out on Twitter that the term “prediction” should only ever be used when a predictive model is built on some data set but the validity is tested on an independent data set.
Recently I came across an interesting PubPeer thread debating this question. In this one commenter pointed out that the title of the study under discussion, “V1 surface size predicts GABA concentration“, was unjustified because this relationship explains only about 7% of the variance when using a leave-one-out cross-validation procedure. In this procedure all data points except one are used to fit the regression and the final point is then used to evaluate the fit of the model. This procedure is then repeated n-fold using every data point as evaluation data once.
Taken at face value this approach sounds very appealing because it uses independent data for making predictions and for testing them. Replication is a cornerstone of science and in some respects cross-validation is an internal replication. So surely this is a great idea? Naive as I am I have long had a strong affinity for this idea.
Cross-validation underestimates predictive power
But not so fast. These notions fail to address two important issues (both of which some commenters on that thread already pointed out): first, it is unclear what amount of variance a model should explain to be important. 7% is not very much but it can nevertheless be of substantial theoretical value. The amount of variance that can realistically be explained by any model is limited by the noise in the data that arises from measurement error or other distortions. So in fact many studies using cross-validation to estimate the variance explained by some models (often in the context of model comparison) instead report the amount of explainable variance accounted for by the model. To derive this one must first estimate the noise ceiling, that is, the realistic maximum of variance that can possibly be explained. This depends on the univariate variability of the measures themselves.
Second, the cross-validation approach is based on the assumption that the observed sample, which is then subdivided into model-fitting and evaluation sets, is a good representation of the population parameters the analysis is attempting to infer. As such, the cross-validation estimate also comes with an error (this issue is also discussed by another blog post mentioned in that discussion thread). What we are usually interested in when we conduct scientific studies is to make an inference about the whole population, say a conclusion that can be broadly generalised to any human brain, not just the handful of undergraduate students included in our experiments. This does not really fit the logic of cross-validation because the evaluation is by definition only based on the same sample we collected.
Because I am a filthy, theory-challenged experimentalist, I decided to simulate this (and I apologise to all my Bayesian friends for yet again conditioning on the truth here…). For a range of sample sizes between n=3 and n=300 I drew a sample with from a population with a fixed correlation of rho=0.7 and performed leave-one-out cross-validation to quantify the variance explained by it (using the squared correlation between predicted and observed values). I also performed a standard regression analysis and quantified the variance explained by that. At each sample size I did this 1000 times and then calculated the mean variance explained for each approach. Here are the results:
What is immediately clear is that the results strongly depend on the sample size. Let’s begin with the blue line. This represents the variance explained by the standard regression analysis on the whole observed sample. The dotted, black, horizontal line denotes the true effect size, that is, the variance explained by the population correlation (so R^2=49%). The blue line starts off well above the true effect but then converges on it. This means that at small sample sizes, especially below n=10, the observed sample inflates how much variance is explained by the fitted model.
Next look at the red line. This denotes the variance explained by the leave-one-out cross-validation procedure. This also starts off above the true population effect and follows the decline of the observed correlation. But then it actually undershoots and goes well below the true effect size. Only then it gradually increases again and converges on the true effect. So at sample sizes that are most realistic in individual differences research, n=20-100ish, this cross-validation approach underestimates how much variance a regression model can explain and thus in fact undervalues the predictive power of the model.
The error bars in this plot denote +/- 1 standard deviation across the simulations at each sample size. So as one would expect, the variability across simulations is considerable when sample size is small, especially when n <= 10. These sample sizes are maybe unusually small but certainly not unrealistically small. I have seen publications calculating correlations on such small samples. The good news here is that even with such small samples on average the effect may not be inflated massively (let’s assume for the moment that publication bias or p-hacking etc are not an issue). However, cross-validation is not reliable under these conditions.
A correlation of rho=0.7 is unusually strong for most research. So I repeated this simulation analysis using a perhaps more realistic effect size of rho=0.3. Here is the plot:
Now we see a hint of something fascinating: the variance explained by the cross-validation approach actually subtly exceeds that of the observed sample correlation. They again converge on the true population level of 9% when the sample size reaches n=50. Actually there is again an undershoot but it is negligible. But at least for small samples with n <= 10 the cross-validation certainly doesn’t perform any better than the observed correlation. Both massively overestimate the effect size.
When the null hypothesis is true…
So if this is what is happening at a reasonably realistic rho=0.3, what about when the null hypothesis is true? This is what is shown in here (I apologise for the error bars extending into the impossible negative range but I’m too lazy to add that contingency to the code…):
The problem we saw hinted at above for rho=0.3 is exacerbated here. As before, the variance explained for the observed sample correlation is considerably inflated when sample size is small. However, for the cross-validated result this situation is much worse. Even at a sample size of n=300 the variance explained by the cross-validation is greater than 10%. If you read the PubPeer discussion I mentioned, you’ll see that I discussed this issue. This result occurs because when the null hypothesis is true – or the true effect is very weak – the cross-validation will produce significant correlations between the inadequately fitted model predictions and the actual observed values. These correlations can be positive or negative (that is, the predictions systematically go in the wrong direction) but because the variance explained is calculated by squaring the correlation coefficient they turn into numbers substantially greater than 0%.
As I discussed in that thread, there is another way to calculate the variance explained by the cross-validation. I won’t go into detail on this but unlike the simpler approach I employed here this does not limit the variance explained to fall between 0-100%. While the estimates are numerically different, the pattern of results is qualitatively the same. At smaller sample sizes the variance explained by cross-validation systematically underestimates the true variance explained.
When the interocular traumatic test is significant…
My last example is the opposite scenario. While we already looked at an unusually strong correlation, I decided to also simulate a case where the effect should be blatantly obvious. Here rho=0.9:
Unsurprisingly, the results are similar as those seen for rho=0.7 but now the observed correlation is already doing a pretty decent job at reaching the nominal level of 81% variance explained. Still, the cross-validation underperforms at small sample sizes. In this situation, this actually seems to be a problem. It is rare that one would observe a correlation of this magnitude in psychological or biological sciences but if so chances are good that the sample size is small in that case. Often the reason for this may be that correlation estimates are inflated at small sample sizes but that’s not the point here. The point is that leave-one-out cross-validation won’t tell you. It underestimates the association even if it is real.
Where does all this leave us?
It is not my intention to rule out cross-validation. It can be a good approach for testing models and is often used successfully in the context of model comparison or classification analysis. In fact, as the debate about circular inference in neuroscience a few years ago illustrated, there are situations where it is essential that independent data are used. Cross-validation is a great way to deal with overfitting. Just don’t let yourself be misled into believing it can tell you something it doesn’t. I know it is superficially appealing and I had played with it previously for just that reason – but this exercise has convinced me that it’s not as bullet-proof is one might think.
Obviously, validation of a model with independent data is a great idea. A good approach is to collect a whole independent replication sample but this is expensive and may not always be feasible. Also, if a direct replication is performed it seems better that this is acquired independently by different researchers. A collaborative project could do this in which each group uses the data acquired by the other group to test their predictive model. But that again is not something that is likely to become regular practice anytime soon.
In the meantime we can also remember that performing typical statistical inference is a good approach after all. Its whole point is to infer the properties of the whole population from a sample. When used properly it tends to do a good job at that. Obviously, we should take measures to improve its validity, such as increasing power by using larger samples and/or better measurements. I know I am baysed but Bayesian hypothesis tests seem superior at ensuring validity than traditional significance testing. Registered Reports can probably also help and certainly should reduce the skew by publication bias and flexible analyses.
So, does correlation imply prediction? I think so. Statistically this is precisely what it does. It uses one measure (or multiple measures) to make predictions of another measure. The key point is not whether calling it a prediction is valid but whether the prediction is sufficiently accurate to be important. The answer to this question actually depends considerably on what we are trying to do. A correlation explaining 10-20% of the variance in a small sample is not going to be a clear biomarker for anything. I sure as hell wouldn’t want any medical or judicial decisions to be based solely on such an association. But it may very well be very informative about mechanisms. It is a clearly detectable effect even with the naked eye.
In the context of these analysis, a better way than quantifying the variance explained is to calculate the root mean squared deviation (essentially the error bar) of the prediction. This provides an actually much more direct index of how accurately one variable predicts another. The next step – and I know I sound like a broken record – should be to confirm that these effects are actually scientifically plausible. This mantra is true for individual differences research as much as it is for Bem’s precognition and social priming experiments where I mentioned it before. Are the differences in neural transmission speed or neurotransmitter concentration implied by these correlation results realistic based on what we know about the brain? These are the kinds of predictions we should actually care about in these discussions.