*TL;DR: Leave-one-out cross-validation is a bad way to test the predictive power of linear correlation/regression.*

Correlation and regression analyses are popular tools in neuroscience and psychology research for analysing individual differences. They fit a model (most typically a linear relationship between two measures) to infer whether the variability in one measure is related to the variability in another. Revealing such relationships can help us understand the underlying mechanisms. We and others have used them in previous studies to test specific mechanistic hypotheses linking brain structure/function and behaviour. They also form the backbone of twin studies of heritability, which in turn can implicate genetic and experiential factors in some trait. Most importantly, in my personal view individual differences are interesting because they acknowledge the fact that every human being is unique, rather than simply treating variability as noise and averaging across large groups of people.

But typically every report of a correlational finding will be followed by someone zealously pointing out that “Correlation does not imply causation”. And doubtless it is very important to keep that in mind. A statistical association between two variables may simply reflect the fact that both are related to a third, unknown factor, or the correlation may just be a fluke.

Another problem is that the titles of studies using correlation analysis sometimes use what I like to call “smooth narrative” style. Saying that some behaviour is “predicted by” or “depends on” some brain measure makes for far more accessible and interesting reading than dryly talking about statistical correlations. However, it doesn’t sit well with a lot of people, in part because such language may imply a causal link that the results don’t actually support. Jack Gallant seems to regularly point out on Twitter that the term “prediction” should only ever be used when a predictive model is built on one data set but its validity is tested on an independent data set.

Recently I came across an interesting PubPeer thread debating this question. In it, one commenter pointed out that the title of the study under discussion, “V1 surface size predicts GABA concentration”, was unjustified because this relationship explains only about 7% of the variance when using a leave-one-out cross-validation procedure. In this procedure, all data points except one are used to fit the regression and the left-out point is then used to evaluate the fit of the model. This is repeated n-fold so that every data point serves as evaluation data once.

Taken at face value this approach sounds very appealing because it uses independent data for making predictions and for testing them. Replication is a cornerstone of science, and in some respects cross-validation is an internal replication. So surely this is a great idea? Naive as I am, I have long had a strong affinity for it.

*Cross-validation underestimates predictive power*

But not so fast. These notions fail to address two important issues (both of which some commenters on that thread already pointed out). First, it is unclear how much variance a model should explain to be considered important. 7% is not very much, but it can nevertheless be of substantial theoretical value. The amount of variance that can realistically be explained by any model is limited by the noise in the data arising from measurement error or other distortions. So many studies using cross-validation to estimate the variance explained by some model (often in the context of model comparison) instead report the amount of *explainable* variance accounted for by the model. To derive this, one must first estimate the noise ceiling, that is, the realistic maximum of variance that can possibly be explained. This depends on the univariate variability of the measures themselves.

Second, the cross-validation approach assumes that the observed sample, which is then subdivided into model-fitting and evaluation sets, is a good representation of the population the analysis is attempting to make inferences about. As such, the cross-validation estimate also comes with an error (an issue also discussed by another blog post mentioned in that thread). What we are usually interested in when we conduct scientific studies is an inference about the whole population – say, a conclusion that can be broadly generalised to any human brain, not just the handful of undergraduate students included in our experiments. This does not really fit the logic of cross-validation, because the evaluation is by definition based only on the same sample we collected.

Because I am a filthy, theory-challenged experimentalist, I decided to simulate this (and I apologise to all my Bayesian friends for yet again conditioning on the truth here…). For a range of sample sizes between n=3 and n=300 I drew a sample from a population with a fixed correlation of rho=0.7 and performed leave-one-out cross-validation to quantify the variance explained (using the squared correlation between predicted and observed values). I also performed a standard regression analysis and quantified the variance explained by that. At each sample size I did this 1000 times and then calculated the mean variance explained for each approach. Here are the results:
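For readers who want to play with this themselves, here is a minimal sketch of the procedure in Python/NumPy. This is my own reconstruction, not the original simulation code; it assumes bivariate-normal data and fits the regression with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, rho, n_sims=1000):
    """Mean variance explained: full-sample regression vs leave-one-out CV."""
    r2_fit, r2_loo = [], []
    cov = [[1.0, rho], [rho, 1.0]]
    for _ in range(n_sims):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        # Standard analysis: R^2 of the regression fitted to the whole sample
        r2_fit.append(np.corrcoef(x, y)[0, 1] ** 2)
        # Leave-one-out: refit without point i, then predict y[i]
        preds = np.empty(n)
        for i in range(n):
            mask = np.arange(n) != i
            slope, intercept = np.polyfit(x[mask], y[mask], 1)
            preds[i] = intercept + slope * x[i]
        # Variance explained as the squared correlation of predicted vs observed
        r2_loo.append(np.corrcoef(preds, y)[0, 1] ** 2)
    return np.mean(r2_fit), np.mean(r2_loo)
```

Running `simulate(n, 0.7)` for a range of n and plotting the two means against the true R^2 of 49% should reproduce the qualitative pattern described in the text.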

What is immediately clear is that the results strongly depend on the sample size. Let’s begin with the blue line. This represents the variance explained by the standard regression analysis on the whole observed sample. The dotted black horizontal line denotes the true effect size, that is, the variance explained by the population correlation (so R^2=49%). The blue line starts off well above the true effect but then converges on it. This means that at small sample sizes, especially below n=10, the observed sample inflates how much variance the fitted model explains.

Next look at the red line. This denotes the variance explained by the leave-one-out cross-validation procedure. It also starts off above the true population effect and follows the decline of the observed correlation. But then it actually undershoots, dipping well below the true effect size. Only then does it gradually increase again and converge on the true effect. So at the sample sizes most realistic in individual differences research, n=20-100ish, this cross-validation approach underestimates how much variance a regression model can explain and thus in fact *undervalues* the predictive power of the model.

The error bars in this plot denote +/- 1 standard deviation across the simulations at each sample size. As one would expect, the variability across simulations is considerable when the sample size is small, especially when n <= 10. Such sample sizes are perhaps unusually small but certainly not unrealistically so. I have seen publications calculating correlations on samples this small. The good news here is that even with such small samples, on average the effect may not be inflated massively (let’s assume for the moment that publication bias, p-hacking etc. are not an issue). However, cross-validation is *not reliable* under these conditions.

A correlation of rho=0.7 is unusually strong for most research. So I repeated this simulation analysis using a perhaps more realistic effect size of rho=0.3. Here is the plot:

Now we see a hint of something fascinating: the variance explained by the cross-validation approach actually subtly exceeds that of the observed sample correlation. Both again converge on the true population level of 9% when the sample size reaches about n=50. There is again an undershoot, but it is negligible. At least for small samples with n <= 10, however, the cross-validation certainly doesn’t perform any better than the observed correlation. Both massively overestimate the effect size.

*When the null hypothesis is true…*

So if this is what happens at a reasonably realistic rho=0.3, what about when the null hypothesis is true? This is what is shown here (I apologise for the error bars extending into the impossible negative range, but I’m too lazy to add that contingency to the code…):

The problem we saw hinted at above for rho=0.3 is exacerbated here. As before, the variance explained by the observed sample correlation is considerably inflated when the sample size is small. However, for the cross-validated result the situation is much worse. Even at a sample size of n=300 the variance explained by the cross-validation is greater than 10%. If you read the PubPeer discussion I mentioned, you’ll see that I raised this issue there. This result occurs because when the null hypothesis is true – or the true effect is very weak – the cross-validation produces spurious correlations between the inadequately fitted model predictions and the actual observed values. These correlations can be positive or negative (that is, the predictions systematically go in the wrong direction), but because the variance explained is calculated by squaring the correlation coefficient, they turn into numbers substantially greater than 0%.
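This squaring artifact is easy to demonstrate directly. A small sketch (again my own illustration, using the same `np.polyfit`-based leave-one-out scheme) simulates the null case:

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_sims = 30, 500
rs, r2s = [], []
for _ in range(n_sims):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)  # null hypothesis: x and y are unrelated
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        preds[i] = intercept + slope * x[i]
    r = np.corrcoef(preds, y)[0, 1]
    rs.append(r)        # spreads over both signs across simulations
    r2s.append(r ** 2)  # squaring makes a fluke of either sign count as variance explained

print(np.mean(rs), np.mean(r2s))
```

Despite rho=0, the mean squared correlation stays well above zero, because flukes in either direction contribute positively once squared.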

As I discussed in that thread, there is another way to calculate the variance explained by the cross-validation. I won’t go into detail on this but unlike the simpler approach I employed here this does not limit the variance explained to fall between 0-100%. While the estimates are numerically different, the pattern of results is qualitatively the same. At smaller sample sizes the variance explained by cross-validation systematically underestimates the true variance explained.

*When the interocular traumatic test is significant…*

My last example is the opposite scenario. While we already looked at an unusually strong correlation, I decided to also simulate a case where the effect should be blatantly obvious. Here rho=0.9:

Unsurprisingly, the results are similar to those seen for rho=0.7, but now the observed correlation already does a pretty decent job of reaching the nominal level of 81% variance explained. Still, the cross-validation underperforms at small sample sizes, and here this actually seems to be a real problem. It is rare to observe a correlation of this magnitude in the psychological or biological sciences, but when one does, chances are good that the sample size is small. Often the reason may be that correlation estimates are inflated at small sample sizes, but that’s not the point here. The point is that leave-one-out cross-validation won’t tell you. It underestimates the association even when it is real.

*Where does all this leave us?*

It is not my intention to rule out cross-validation. It can be a good approach for testing models and is often used successfully in the context of model comparison or classification analysis. In fact, as the debate about circular inference in neuroscience a few years ago illustrated, there are situations where it is essential that independent data are used. Cross-validation is a great way to deal with overfitting. Just don’t let yourself be misled into believing it can tell you something it doesn’t. I know it is superficially appealing, and I had played with it previously for just that reason – but this exercise has convinced me that it’s not as bullet-proof as one might think.

Obviously, validating a model with independent data is a great idea. A good approach is to collect a whole independent replication sample, but this is expensive and may not always be feasible. Also, if a direct replication is performed, it seems better for the data to be acquired independently by different researchers. A collaborative project could do this, with each group using the data acquired by the other to test their predictive model. But that again is not something likely to become regular practice anytime soon.

In the meantime we can also remember that performing typical statistical inference is a good approach after all. Its whole point is to infer the properties of the whole population from a sample. When used properly it tends to do a good job at that. Obviously, we should take measures to improve its validity, such as increasing power by using larger samples and/or better measurements. I know I am baysed, but Bayesian hypothesis tests seem superior to traditional significance testing at ensuring validity. Registered Reports can probably also help and should certainly reduce the skew introduced by publication bias and flexible analyses.

*Wrapping up*

So, does correlation imply prediction? I think so. Statistically this is precisely what it does: it uses one measure (or multiple measures) to predict another. The key point is not whether calling it a prediction is valid, but whether the prediction is sufficiently accurate to be important. The answer actually depends considerably on what we are trying to do. A correlation explaining 10-20% of the variance in a small sample is not going to be a clear biomarker for anything. I sure as hell wouldn’t want any medical or judicial decisions to be based solely on such an association. But it may very well be very informative about mechanisms. It *is* a clearly detectable effect, even with the naked eye.

In the context of these analyses, a better approach than quantifying the variance explained is to calculate the root mean squared deviation (essentially the error bar) of the prediction. This provides a much more direct index of how accurately one variable predicts another. The next step – and I know I sound like a broken record – should be to confirm that these effects are actually *scientifically plausible*. This mantra is as true for individual differences research as it is for Bem’s precognition and social priming experiments, where I have mentioned it before. Are the differences in neural transmission speed or neurotransmitter concentration implied by these correlation results realistic, based on what we know about the brain? These are the kinds of predictions we should actually care about in these discussions.

A well motivated post, but I think you miss the elephant in the room: correlation is the wrong tool to assess a predictive model. Why do we care about prediction? Because we want to make a statement about *one* individual. What does correlation tell us about? The behaviour of (a pair of) variables in a population. Think of it this way: if you were predicting a binary variable, how would you measure the error? On each individual you mark the prediction “Correct” or “Incorrect”. How do you get a single, per-subject contribution to a correlation coefficient? You can’t!

Another reason why correlation is a bad way to measure your predictive model is that you could be completely off in the absolute value or scale and still get a good correlation. (E.g. what if I told you I had a great IQ predictive model with correlation 0.8, but, btw, the predicted IQs range from 1000 to 1010?)

For prediction of continuous variables you need some error measure that has a per-subject contribution, so you can make some specification of “When I see *one* new subject, how good will my prediction be?” The natural measure is the squared prediction error, (y-yhat)^2. If you add up all the squared prediction errors you get Prediction Error Sum of Squares (PRESS); divide by the number of samples and take a square root and you have root-mean-PRESS, or let’s just call it rMSE. That quantity is useful! For one subject, rMSE tells you how far you’ll be from the true value, on average. If you were so bold as to make a Gaussian assumption, you could say 95% of the predictions would be within +/- 2 rMSE’s.

Finally, if you must have a “r^2” like measure, you can use PRESS to do this:

Prediction R^2 = 1 - PRESS / SStot

When your predictor is good, PRESS will be small (relative to the total sum of squares, SStot = sum((y-ybar)^2)), and Prediction R^2 big. But note, if your predictor is bad (which happens all the time), PRESS will be large, and in fact can be *larger* than SStot, and then you’ll get a *negative* Prediction R^2. This is fine… it is simply telling you that your predictive model is worse than the sample mean (ybar) at predicting the data… a very useful thing to know!
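To make these quantities concrete, here is a minimal sketch (my own illustration, not the commenter’s code, assuming simple linear regression fitted with `np.polyfit` and leave-one-out predictions):

```python
import numpy as np

def prediction_metrics(x, y):
    """Leave-one-out PRESS, rMSE and Prediction R^2 for simple linear regression."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        preds[i] = intercept + slope * x[i]
    press = np.sum((y - preds) ** 2)        # Prediction Error Sum of Squares
    rmse = np.sqrt(press / n)               # average per-subject prediction error
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares around ybar
    pred_r2 = 1 - press / ss_tot            # negative when worse than predicting ybar
    return press, rmse, pred_r2
```

A negative Prediction R^2 from this function simply says the model predicts new points worse than the sample mean would.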

Thanks for your comment. I agree with your point about prediction error, and you’ll notice I discuss this briefly at the end of the post. I normally use the 1-PRESS/SStot method to quantify R^2. As I said in my post, the qualitative pattern of results in the simulations is very similar, except that this method does not inflate R^2 when the effect size is weak; in all other respects it still underperforms.

I disagree about the elephant in the room, and I tried to discuss this too but perhaps wasn’t detailed enough: highly accurate prediction about one individual can be *one* goal of correlation analysis. But that is not always the case, and I would say it usually is not. Most science isn’t that applied. Correlation *can* be useful for testing hypotheses, and its scale invariance can be a virtue in many situations. That said, I agree with you that estimating the regression coefficient is useful for most applications – although for regressions in our field there is error on both variables, which means that orthogonal (or more precisely Deming) regression is the appropriate way to estimate it.

Note that leave-one-out CV is particularly problematic for regression: http://www.russpoldrack.org/2012/12/the-perils-of-leave-one-out.html and http://not2hastie.tumblr.com/post/56630997146/response-to-perils-of-loo-crossvalidation

I would also recommend this classic paper (http://www.jstor.org/stable/2345402), which makes clear that assessment of fit using the same data that were used to train the model (which I think is what you are proposing) is using the data twice, which results in inflated estimates of predictive accuracy on new data (which is generally what we mean by “prediction”).

Thanks Russ, that sounds interesting. Looking forward to reading those. I had briefly looked at other cross-validation approaches, but it was getting too complex for one post. The results weren’t overly compelling in any case. A split model seems to perform no better than the standard regression as far as I can tell.

Anyway, I think you misinterpret the point of my post. In statistical inference you always infer population parameters by fitting a model to sample data. The model can be a difference in means, a correlation or whatever. If this is all you are doing, it is not “using the data twice”, it’s showing the results. I completely agree that when you are interested in an actually predictive model, say as in classification analysis, you must use independent data to assess performance. The question is whether this is implied by saying that variable A predicts variable B.

Sorry for missing the final paragraph and the shout-out to prediction error… I’ve seen correlation applied with predictive models so often it set me off writing before I digested the whole thing 🙂

And I agree correlation (or regression slopes) are the answer to many questions about population relationships, but in those instances the best results come from (as you point out) the usual model fits and not jackknife/LOO-CV hold-outs.

No worries – my posts are usually a bit on the long side… (In fact this one is uncharacteristically short :P)

This is a nice post that highlights some of the known limitations of cross-validation, and particularly LOO-CV. But I think you’re overlooking the primary reason for using cross-validation in most real-world settings, which is to keep one’s analysis (or someone else’s analysis) honest and adjust for overfitting induced by model and/or researcher degrees of freedom. For what it’s worth, I’m not sure I’ve ever seen anyone cross-validate a simple correlation coefficient in a situation where there’s no analytical flexibility involved (I mean, I’m sure someone somewhere has done it, but it’s certainly not a common practice), and you’re right that there would be no point in doing so. Cross-validation is almost invariably applied in situations where there are multiple variables and/or feature selection/transformation steps. In such cases (which is to say, a huge proportion of all analyses), the estimate that falls out of one’s model will completely fail to account for researcher DFs at all, and will only account for estimation-related overfitting to the degree that there are analytical corrections available (e.g., the adjusted R^2 works great for linear regression, but good luck analytically adjusting the results of a random forest). By contrast, you can easily cross-validate *any* kind of model or pipeline, including any kind of model selection/hyperparameter estimation/feature transformation.

Another way to put it would be to say that *simple* correlation, where we assume no degrees of freedom in the parameterization or selection of a model, may indeed entail prediction (subject to error), provided we assume no bias. But in any realistic scenario where choices have to be made, or you’re using any estimator for which we don’t have an analytical correction for overfitting (which is most of them), all bets are off, and it’s an excellent idea to have some kind of cross-validation strategy in place. It won’t solve all your problems, of course, and you will typically pay a small penalty in underfitting, but it will certainly make it easier for other people to trust your results.

Thanks for your comment. I mostly agree with you. It certainly is not commonplace to use cross-validation in linear correlation/regression. But what I am commenting on is the *calls* for using cross-validation in this context (such as in that PubPeer thread) and the more general attitude that in order to say ‘A predicts B’ one must use cross-validation. I would argue this is not appropriate.

Now I completely agree that with more complex models it may be a better idea to do so, and it is essential to use a cross-validation approach when you actually want a predictive algorithm, say, to predict the risk of an illness. Essentially it depends on the cost/benefit. For basic science the loss in sensitivity does not seem worth it.

Thank you for this post. I think there are grave misunderstandings within the neuroscience community about how bulletproof cross-validation is. Note that the machine learning community learned the limitations of CV years ago and thus relies on “hold-out” data as the final proof of an algorithm. These lessons have not yet percolated through our field, but it is important that they do, and quickly, since the number of classification-based approaches to neural data analysis is growing rapidly. Some people I have talked to are aware of these limitations, but some are convinced that independence of training/test sets is sufficient protection against overfitting.

Yes, I think the main take-home message of all this for me is a reminder that you must be skeptical of any approach until you have thorough evidence of how it works. Hold-out samples sound logically compelling, but that is an assumption, and as always, assumptions may not turn out to be correct.

Well I think that in the case of machine learning experts, they have some substantial experience in testing whether hold-out sets really protect against overfitting, and they generally do.

Just to make sure we’re on the same page (because I can’t tell from what you wrote there if we are), hold-out data is not the same thing as independent training/testing sets. Rather, hold-out data is the equivalent of new data, because you’ve set it aside at the start of your analysis procedure and you don’t touch it until you’re finished fine-tuning your classifier. It’s used in what is called nested cross-validation, which is a step beyond cross-validation. Apologies if this is all review for you, but perhaps it’s not for some of your readers.

No, I wasn’t aware of that distinction, so thanks for clarifying. I know about nested cross-validation procedures but have very limited experience with them (I did a bit of this a few years ago).

Sorry for the late comment, but Nature had an interesting piece on hold-out analyses a while ago, which might be of interest: http://www.nature.com/news/blind-analysis-hide-results-to-seek-the-truth-1.18510?WT.ec_id

and thanks for the nice post!
