Category Archives: reliability

Boosting power with better experiments

Probably one of the main reasons for the low replicability of scientific studies is that many previous studies have been underpowered – or rather that they only provided inconclusive evidence for or against the hypotheses they sought to test. Alex Etz had a great blog post on this with regard to replicability in psychology (and he published an extension of this analysis, taking publication bias into account, as a paper). So it is certainly true that researchers in psychology and neuroscience can, as a whole, do a lot better when it comes to the sensitivity of their experiments.

A common mantra is that we need larger sample sizes to boost sensitivity. Statistical power is a function of the sample size, the expected effect size, and the significance criterion. There is a lot of talk out there about what effect size one should use for power calculations. For instance, when planning a replication study, it has been suggested that you should more than double the sample size of the original study. This is supposed to account for the fact that published effect sizes are probably skewed upwards due to publication bias and analytical flexibility, or simply because the true effect happens to be weaker than originally reported.

However, what all these recommendations neglect to consider is that standardized effect sizes, like Cohen’s d or a correlation coefficient, also depend on the precision of your observations. By reducing measurement error or other noise factors, you can literally increase the effect size. A larger effect size means greater statistical power – so with the same sample size you can boost power by improving your experiment in other ways.

Here is a practical example. Imagine I want to correlate the height of individuals measured in centimeters and inches. This is a trivial case – theoretically the correlation should be perfect, that is, ρ = 1. However, measurement error will spoil this potential correlation somewhat. I have a sample size of 100 people. I first ask my auntie Angie to guess the height of each subject in centimeters. To determine their heights in inches, I then take them all down the pub and ask this dude called Nigel to also take a guess. Both Angie and Nigel will misestimate heights to some degree. For simplicity, let’s just say that their errors are on average the same. This nonetheless means their guesses will not always agree very well. If I then calculate the correlation between their guesses, it will obviously have to be lower than 1, even though the true correlation is 1. I simulated this scenario below. On the x-axis I plot the amount of measurement error in cm (the standard deviation of Gaussian noise added to the actual body heights). On the y-axis I plot the median observed correlation, and the shaded area is the 95% confidence interval over 10,000 simulations. As you can see, as measurement error increases, the observed correlation goes down and the confidence interval becomes wider.

[Figure: corr_vs_error – median observed correlation (with 95% confidence interval) as a function of measurement error]
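The original simulation was done in MatLab; here is a minimal Python re-sketch of the same idea. The true-height SD of 15 cm is my own assumption (not stated in the post), chosen because it roughly reproduces the correlations quoted in the following paragraphs:

```python
import numpy as np

rng = np.random.default_rng(1)

def median_observed_r(noise_sd, n=100, n_sims=2000):
    """Median correlation between two independently noisy estimates
    of the same true heights (the true correlation is 1)."""
    rs = np.empty(n_sims)
    for i in range(n_sims):
        heights = rng.normal(170, 15, n)                        # true heights in cm
        angie = heights + rng.normal(0, noise_sd, n)            # guess in cm
        nigel = (heights + rng.normal(0, noise_sd, n)) / 2.54   # guess in inches
        rs[i] = np.corrcoef(angie, nigel)[0, 1]
    return np.median(rs)

# more measurement error -> weaker maximally observable correlation
for err in (1, 5, 15, 20):
    print(err, round(median_observed_r(err), 2))
```

With these assumptions the attenuation follows the classic formula r ≈ σ²/(σ² + e²), so an error of 20 cm yields roughly the r ≈ 0.35 discussed below.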

Greater error leads to poorer correlations. So far, so obvious. But while I call this the observed correlation, it really is the maximally observable correlation. This means that in order to boost power, the first thing you could do is to reduce measurement error. In contrast, increasing your sample size can be highly inefficient and border on the infeasible.

For a correlation of 0.35, hardly an unrealistically low effect in a biological or psychological scenario, you would need a sample size of 62 to achieve 80% power. Let’s assume this is the correlation found by a previous study and we want to replicate it. Following common recommendations, you would plan to collect two and a half times the original sample size, so n = 155. Doing so may prove quite a challenge. Assume that each data point involves hours of data collection per participant and/or that it costs hundreds of dollars to acquire (neither is atypical in neuroimaging experiments). This may be a considerable additional expense that few researchers can afford.

And it gets worse. It is quite possible that by collecting more data you further sacrifice data quality. When it comes to neuroimaging data, I have heard from more than one source that some of the large-scale imaging projects contain only mediocre data contaminated by motion and shimming artifacts. The often-mentioned suggestion that sample sizes for expensive experiments could be increased by multi-site collaborations ignores the additional variability that differences between sites quite likely introduce. Data quality may differ even between identical equipment. The research staff at the two sites may not have the same level of skill or meticulous attention to detail. Behavioral measurements acquired online via a website may be more variable than those collected under controlled lab conditions. So you may end up polluting your effect size even further by increasing sample size.

The alternative is to improve your measurements. In my example here, even going from a measurement error of 20 cm to 15 cm improves the observable effect size quite dramatically, moving from 0.35 to about 0.5. To achieve 80% power, you would then only need a sample size of 29. If you kept the original sample size of 62, your power would be 99%. So the critical question is not really what the original effect size was that you want to replicate – rather it is how much you can improve your experiment by reducing noise. If your measurements are already pretty precise to begin with, there is probably little room for improvement and you also don’t gain all that much, as in going from a measurement error of 5 cm to 1 cm in my example. But when the original measurement was noisy, improving the experiment can help a hell of a lot.
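The power figures in the last two paragraphs can be checked with the standard Fisher z approximation for correlation power. This helper is my own Python sketch (the post itself used MatLab), so treat it as illustrative:

```python
from math import atanh, sqrt
from statistics import NormalDist

def corr_power(r, n, alpha=0.05):
    """Approximate two-sided power to detect a true correlation r
    with sample size n, via the Fisher z transform."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = atanh(r) * sqrt(n - 3)     # noncentrality of the z statistic
    return 1 - NormalDist().cdf(z_crit - ncp)

print(round(corr_power(0.35, 62), 2))  # ≈ 0.80
print(round(corr_power(0.50, 29), 2))  # ≈ 0.80
print(round(corr_power(0.50, 62), 2))  # ≈ 0.99
```

The approximation reproduces all three numbers: n = 62 for r = 0.35 at 80% power, n = 29 for r = 0.5, and 99% power if you keep n = 62 with the improved measurement.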

There are many ways to make your measurements more reliable. It can mean ensuring that your subjects in the MRI scanner are padded in really well, that they are not prone to large head movements, that you did all in your power to maintain a constant viewing distance for each participant, and that they don’t fall asleep halfway through your experiment. It could mean scanning 10 subjects twice instead of scanning 20 subjects once. It may be that you measure the speed at which participants walk down the hall to the lift with laser sensors instead of having a confederate sit there with a stopwatch. Perhaps you can change from a group comparison to a within-subject design? If your measure is an average across trials collected in each subject, you can enhance the effect size by increasing the number of trials. And it definitely means not giving a damn what Nigel from down the pub says and investing in a bloody tape measure instead.
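On the trial-averaging point, a toy calculation makes the mechanism concrete. All the numbers here are my own made-up assumptions (between-subject SD of 1, trial-to-trial SD of 2, true group difference of d = 0.5), not figures from the post:

```python
import math

def attenuated_d(true_d, n_trials, subject_sd=1.0, trial_sd=2.0):
    """Observable group effect size when each subject's score is an
    average over n_trials noisy trials: averaging shrinks the trial
    noise by a factor of sqrt(n_trials)."""
    effective_sd = math.sqrt(subject_sd**2 + trial_sd**2 / n_trials)
    return true_d * subject_sd / effective_sd

for k in (1, 4, 16, 64):
    print(k, round(attenuated_d(0.5, k), 2))  # d grows towards 0.5 with more trials
```

With one trial per subject the observable d is attenuated to about 0.22; with 16 trials it is already about 0.45, close to the true 0.5.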

I’m not saying that you shouldn’t collect larger samples. Obviously, if measurement reliability remains constant*, larger samples can improve sensitivity. But the first thought should always be how you can make your experiment a better test of your hypothesis. Sometimes the only thing you can do is increase the sample, but I bet usually it isn’t – and if you’re not careful, it can even make things worse. If your aim is to conclude something about the human brain/mind in general, a larger and broader sample would allow you to generalize better. However, for this purpose increasing your subject pool from 20 undergraduate students at your university to 100 isn’t really helping. And when it comes to the choice between an exact replication study with three times the sample size of the original experiment and one with the same sample size but objectively better methods, I know I’d always pick the latter.

 

(* In fact, it’s really a trade-off, and in some cases a slight increase in measurement error may very well be outweighed by the greater power afforded by a larger sample. This probably happens for the kinds of experiments where slight differences in experimental parameters don’t matter much and you can collect hundreds of people fast, for example online or at a public event.)

On the magic of independent piloting

TL;DR: Never decide to run a full experiment simply because one of the small pilots in which you tweaked your paradigm supported the hypothesis. Use small pilots only to ensure the experiment produces high-quality data, judged by criteria that are unrelated to your hypothesis.

Sorry for the bombardment with posts on data peeking and piloting. I felt this would have cluttered up the previous post so I wrote a separate one. After this one I will go back to doing actual work though, I promise! That grant proposal I should be writing has been neglected for too long…

In my previous post, I simulated what happens when you conduct inappropriate pilot experiments by running a small experiment and then continuing data collection if the pilot produces significant results. This is really data peeking and it shouldn’t come as much of a surprise that this inflates false positives and massively skews effect size estimates. I hope most people realize that this is a terrible thing to do because it makes your final results dependent on the outcome of the pilot. Quite possibly, some people would have learned about this in their undergrad stats classes. As one of my colleagues put it, “if it ends up in the final analysis it is not a pilot.” Sadly, I don’t think this is as widely known as it should be. I was not kidding when I said that I have seen it happen before or overheard people discussing having done this type of inappropriate piloting.

But anyway, what is an appropriate pilot then? In my previous post, I suggested you should redo the same experiment but restart data collection, sticking to the methods that gave you a significant pilot result. The data set used to test your hypothesis is then completely independent, so it won’t be skewed by the pre-selected pilot data. Put another way, your exploratory pilot allows you to estimate a prior, and your full experiment seeks to confirm it. Surely there is nothing wrong with that, right?

I’m afraid there is, and it is actually obvious why: your small pilot experiment is underpowered to detect real effects, especially small ones. So if you use inferential statistics to determine whether a pilot experiment “worked,” this small pilot is biased towards detecting larger effect sizes. Importantly, this does not mean you bias your experiment towards larger effect sizes. If you only continue the experiment when the pilot was significant, you are ignoring all of the pilots that would have shown true effects but which – due to the large uncertainty (low power) of the pilot – failed to do so purely by chance. Naturally, the proportion of these false negatives becomes smaller the larger you make your pilot sample – but since pilots are by definition small, the error rate is pretty high in any case. For example, for a true effect size of δ = 0.3, the false negative rate at a pilot sample of 2 is 95%. With a pilot sample of 15, it is still as high as 88%. Just for illustration, I show below the false negative rates (1 − power) for three different true effect sizes. Even for quite decent effect sizes, the sensitivity of a small pilot is abysmal:

[Figure: False Negatives – false negative rate (1 − power) as a function of pilot sample size, for three true effect sizes]
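The original simulations were in MatLab; the two false negative rates quoted above can be checked with a small Monte Carlo in Python. This is my own re-sketch: I assume the pilot sample is per group and hard-code the two-sided 5% critical t values for the relevant degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(7)

def false_negative_rate(d, n, tcrit, n_sims=20000):
    """Monte Carlo rate of non-significant two-sample t-tests
    (n per group) when a true effect of size d is always present."""
    a = rng.normal(0, 1, (n_sims, n))
    b = rng.normal(d, 1, (n_sims, n))
    sp = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    t = (b.mean(axis=1) - a.mean(axis=1)) / (sp * np.sqrt(2 / n))
    return np.mean(np.abs(t) < tcrit)

# two-sided 5% critical values: df = 2 -> 4.303, df = 28 -> 2.048
print(false_negative_rate(0.3, 2, 4.303))   # ≈ 0.95
print(false_negative_rate(0.3, 15, 2.048))  # ≈ 0.88
```

Both figures from the text come out of the simulation: a pilot of 2 per group misses a true δ = 0.3 about 95% of the time, a pilot of 15 still about 88% of the time.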

Thus, if you only pick pilot experiments with significant results to turn into real experiments, you are deluding yourself into thinking that the methods you piloted are somehow better (or “precisely calibrated”). Remember, this is based on a theoretical scenario in which the effect is real and of fixed strength. Every single pilot experiment you ran investigated the same underlying phenomenon, and any difference in outcome is purely due to chance – the tweaking of your methods had no effect whatsoever. You waste all manner of resources piloting methods whose apparent success or failure was determined by chance alone.

So frequentist inferential statistics on pilot experiments are generally nonsense. Pilots are by nature exploratory. You should only determine significance for confirmatory results. But what are these pilots good for? Perhaps we just want an idea of what effect size they can produce, and then run confirmatory experiments only for those methods that produced a reasonably strong effect?

I’m afraid that won’t do either. I simulated this scenario in a similar manner as in my previous post. 100,000 times I generated two groups (with a full sample size of n = 80, although the full sample size isn’t critical here). Both groups are drawn from a population with standard deviation 1, but one group has a mean of zero while the other’s mean is shifted by 0.3 – so we have a true effect size of δ = 0.3 (the actual magnitude of this true effect is irrelevant for the conclusions). In each of the 100,000 simulations, the researcher runs a number of pilot subjects per group (plotted on the x-axis). Only if the effect size estimate from this pilot exceeds a certain criterion does the researcher run an independent, full experiment. The criterion is either 50%, 100%, or 200% of the true effect size. Obviously, a real researcher could not know the true effect size; I simply use these criteria as stand-ins for decisions a researcher might make in a real-world situation. (For the true effect size used here, these criteria correspond to d = 0.15, d = 0.3, and d = 0.6, respectively.)

The results are below. The graph on the left once again plots the false negative rates against the pilot sample size. A false negative here is not based on significance but on effect size: any simulation in which d fell below the criterion. When the criterion equals the true effect size, the false negative rate is constant at 50%. The reason for this is obvious: each simulation is drawn from a sampling distribution centered on the true effect of 0.3, so half of these simulations will exceed that value. However, when the criterion is not equal to the true effect, the false negative rates depend on the pilot sample size. If the criterion is lower than the true effect, false negatives decrease; if the criterion is stricter than the true effect, false negatives increase. Either way, the false negative rates are substantially greater than the 20% mark you would have with an adequately powered experiment. So you will still delude yourself a considerable number of times if you only conduct the full experiment when your pilot shows a particular effect size. Even if your criterion is lax (and d = 0.15 for a pilot sounds pretty lax to me), you are missing a lot of true results. Again, remember that all of the pilot experiments here investigated a real effect of fixed size. Tweaking the method makes no difference; the difference between simulations is purely due to chance.

Finally, the graph on the right shows the mean effect sizes estimated by your completed experiments (but not the absolute values this time!). The criterion you used in the pilot makes no difference here (all colors are at the same level), which is reassuring. However, all is not necessarily rosy. The open circles plot the effect size you get under publication bias, that is, if you only publish the significant experiments with p < 0.05. This estimate is clearly inflated compared to the true effect size of 0.3. The asterisks plot the effect size estimate if you take all of the experiments. This is the situation you would have (Chris Chambers will like this) if you did a Registered Report for your full experiment and publication of the results is guaranteed irrespective of whether or not they are significant. On average, this effect size is an accurate estimate of the true effect.

[Figure: Simulation Results – left: false negative rates vs. pilot sample size for the three criteria; right: mean effect size estimates of the completed experiments, with (open circles) and without (asterisks) publication bias]
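Both panels can be approximated with a compact Python simulation. This is my own re-sketch, not the original MatLab code: I read the post as n = 80 per group, use a pilot of 10 per group with the 100% criterion as an example, and hard-code the two-sided 5% cutoff (tcrit ≈ 1.975 for df = 158):

```python
import numpy as np

rng = np.random.default_rng(42)

def pilot_then_experiment(true_d=0.3, n_pilot=10, n_full=80,
                          criterion=0.3, n_sims=20000, tcrit=1.975):
    """Each simulation: run a small pilot; if its observed Cohen's d
    exceeds the criterion, run an independent full experiment.
    Returns the pilot-stage miss rate and the mean d of the completed
    experiments, without and with publication bias."""
    def cohens_d(n, size):
        a = rng.normal(0, 1, (size, n))
        b = rng.normal(true_d, 1, (size, n))
        sp = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
        return (b.mean(axis=1) - a.mean(axis=1)) / sp

    d_pilot = cohens_d(n_pilot, n_sims)
    go = d_pilot > criterion               # pilots that get followed up
    d_full = cohens_d(n_full, go.sum())    # independent full experiments
    sig = np.abs(d_full) * np.sqrt(n_full / 2) > tcrit
    return 1 - go.mean(), d_full.mean(), d_full[sig].mean()

miss, d_all, d_published = pilot_then_experiment()
print(round(miss, 2), round(d_all, 2), round(d_published, 2))
```

With the criterion equal to the true effect, roughly half the pilots are abandoned; across the completed experiments the mean d is close to the true 0.3, while the mean d of only the significant ones is clearly inflated.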

Again, these are only the experiments that were lucky enough to get beyond the piloting stage. You have already wasted a lot of time, effort, and money to get here. While the final outcome is solid if publication bias is minimized, you have thrown a considerable number of good experiments into the trash. You have also misled yourself into believing that you conducted a valid pilot experiment that honed the sensitivity of your methods, when in truth all your pilot experiments were equally mediocre.

I have had a few comments from people saying that they are only interested in large effect sizes and surely that means they are fine? I’m afraid not. As I said earlier, the principle here does not depend on the true effect size; it is solely a consequence of the low sensitivity of the pilot experiment. Even with a large true effect, your outcome-dependent pilot is a blind chicken that stumbles around in the dark until it is lucky enough to hit a true effect more or less by chance. For this to happen you must use a very low criterion to turn your pilot into a real experiment. This however also means that if the null hypothesis is true, an unacceptable proportion of your pilots produce false positives. Again, remember that your piloting is completely meaningless – you’re simply chasing noise here. It means that your decision whether to go from pilot to full experiment is (almost) completely arbitrary, even when the true effect is large.

So for instance, when the true effect is a whopping δ = 1, and you are using d > 0.15 as the criterion in your pilot of 10 subjects per group (which is already large for the pilots I typically hear about), your false negative rate is nice and low at ~3%. But critically, if the null hypothesis of δ = 0 is true, your false positive rate is ~37%. How often you will fool yourself by turning a pilot into a full experiment depends on the base rate. If you give this hypothesis a 50:50 chance of being true, almost one in three of your pursued pilot experiments will lead you to chase a false positive. If these odds are lower (which they very well may be), the situation becomes increasingly worse.
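To see where the ~3% and ~37% figures come from, here is a small Python simulation (my own sketch; like the simulations above, it assumes 10 pilot subjects per group):

```python
import numpy as np

rng = np.random.default_rng(11)

def pilot_pass_rate(true_d, criterion=0.15, n_pilot=10, n_sims=50000):
    """Fraction of pilots whose observed Cohen's d clears the
    go/no-go criterion, for a given true effect size."""
    a = rng.normal(0, 1, (n_sims, n_pilot))
    b = rng.normal(true_d, 1, (n_sims, n_pilot))
    sp = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    d_obs = (b.mean(axis=1) - a.mean(axis=1)) / sp
    return np.mean(d_obs > criterion)

print(round(1 - pilot_pass_rate(1.0), 2))  # false negatives with a huge true effect
print(round(pilot_pass_rate(0.0), 2))      # false positives when the null is true
```

Under the null, the observed d simply follows a scaled central t distribution, so clearing even a lax criterion like d > 0.15 happens disturbingly often.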

What should we do then? In my view, there are two options. Either run a well-powered confirmatory experiment that tests your hypothesis based on an effect size you consider meaningful. This is the option I would choose if resources are a critical factor. Alternatively, if you can afford the investment of time, money, and effort, you could run an exploratory experiment with a reasonably large sample size (that is, more than a pilot). If you must, tweak the analysis at the end to figure out what hides in the data. Then, run a well-powered replication experiment to confirm the result. The power for this should be high enough to detect effects that are considerably weaker than the exploratory effect size. This exploratory experiment may sound like a pilot but it isn’t, because it has decent sensitivity and the only resource you might be wasting is your time* during the exploratory analysis stage.

The take-home message here is: don’t make your experiments dependent on whether your pilot supported your hypothesis, even if you use independent data. It may seem like a good idea but it’s tantamount to magical thinking. Chances are that you did not refine your method at all. Again (and I apologize for the repetition but it deserves repeating): this does not mean all small piloting is bad. If your pilot is about assuring that the task isn’t too difficult for subjects, that your analysis pipeline works, that the stimuli appear as you intended, that the subjects aren’t using a different strategy to perform the task, or quite simply to reduce the measurement noise, then it is perfectly valid to run a few people first and it can even be justified to include them in your final data set (although that last point depends on what you’re studying). The critical difference is that the criteria for green-lighting a pilot experiment are completely unrelated to the hypothesis you are testing.

(* Well, your time and the carbon footprint produced by your various analysis attempts. But if you cared about that, you probably wouldn’t waste resources on meaningless pilots in the first place, so this post is not for you…)

MatLab code for this simulation.

Why wouldn’t you share data?

Data sharing has been in the news a lot lately, from the refusal of the PACE trial authors to share their data even though the journal expects it, to the eventful story of the “Sadness impairs color perception” study. A blog post by Dorothy Bishop called “Who’s afraid of Open Data?” made the rounds. The post itself is actually a month old already, but it was republished by the LSE blog, which gave it some additional publicity. In it she makes an impassioned argument for open data sharing and discusses the fears and criticisms many researchers have voiced against data sharing.

I have long believed in making all data available (and please note that in the following I will always mean data and materials, so not just the results but also the methods). The way I see it, this transparency is the first and most important remedy to the ills of scientific research. I have regular discussions with one of my close colleagues* about how to improve science – we don’t always agree on various points like preregistration, but if there is one thing where we are on the same page, it is open data sharing. By making data available, anyone can reanalyse it and check whether the results reproduce, and you can check the robustness of a finding for yourself, if you feel that you should. Moreover, by documenting and organising your data you not only make it easier for other researchers to use, but also for yourself and your lab colleagues. It also helps you with spotting errors. It is also a good argument that stops reviewer 2 from requesting a gazillion additional analyses – if they really think these analyses are necessary they can do them themselves and publish them. This aspect in fact overlaps greatly with the debate on Registered Reports (RR) and it is one of the reasons I like the RR concept. But the benefits of data sharing go well beyond this. Access to the data will allow others to reuse it to answer scientific questions you may not even have thought of. The data can also be used in meta-analyses. With the increasing popularity and feasibility of large-scale permutation/bootstrapping methods, access to the raw values will be particularly important: it allows you to take into account distributional anomalies and outliers, or perhaps to estimate the uncertainty on individual data points.

But as Dorothy describes, many scientists nevertheless remain afraid of publishing their actual data alongside their studies. For several years, many journals and funding agencies have had a policy that data should always be shared upon request – but a laughably small proportion of such requests are successful. This is why some have now adopted the policy that all data must be shared in repositories upon publication or even upon submission. And to encourage this process, the Peer Reviewer Openness Initiative was recently launched, whose signatories refuse to conduct in-depth reviews of manuscripts unless the authors can give a reason why data and materials aren’t public.

My most memorable experience with fears about open data involves a case where the lab head refused to share data and materials with the graduate student* who actually created the methods and collected the data. The exact details aren’t important. Maybe one day I will talk more about this little horror story… For me this demonstrates how far we have come already. Nowadays that story would be baffling to most researchers, but back then (and that’s only a few years ago – I’m not that old!) more than one person actually told me that the PI and university were perfectly justified in keeping the student’s results and the fruits of their intellectual labour under lock and key.

Clearly, people are still afraid of open data. Dorothy lists the following reasons:

  1. Lack of time to curate data; data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task;
  2. Personal investment – sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders;
  3. Concerns about being scooped before the analysis is complete;
  4. Fear of errors being found in the data;
  5. Ethical concerns about confidentiality of personal data, especially in the context of clinical research;
  6. Possibility that others with a different agenda may misuse the data, e.g. perform selective analyses that misrepresent the findings.

In my view, points 1-4 are invalid arguments even if they seem understandable. I have a few comments about some of these:

The fear of being scooped 

I honestly am puzzled by this one. How often does this actually happen? The fear of being scooped is widespread and it may occasionally be justified. Say you discuss some great idea or post a pilot result on social media; perhaps you shouldn’t be surprised if someone else agrees that the idea is great and also pursues it. Some people wouldn’t be bothered by that but many would, and that’s understandable. Less understandable to me is presenting research at a conference and then complaining about others publishing similar work because they were inspired by you. That’s what conferences are for. If you don’t want that to happen, don’t go to conferences. Personally, I think science would be a lot better if we cared a lot less about who did what first and instead cared more about what is true and how we can work together…

But anyway, as far as I can see none of that applies to data sharing. By definition, data you share are either already published or at least submitted for peer review. If someone reuses your data for something else they have to cite you and give you credit. In many situations they may even do it in collaboration with you, which could lead to coauthorship. More importantly, if the scooped result is so easily obtained that somebody beats you to it despite your head start (it’s your data; regardless of how well documented it is, you will always know it better than some stranger), then perhaps you should have thought about that sooner. You could have held back on your first publication and combined the analyses. Or, if it really makes more sense to publish the data in separate papers, you could perhaps declare that the full data set will be shared after the second one is published. I don’t really think this is necessary, but I would accept that argument.

Either way, I don’t believe being scooped via data sharing is very realistic, and any cases of it happening must be extremely rare. But please share these stories if you have them to prove me wrong! If you prefer, you can post them anonymously on the Neuroscience Devils. That’s what I created that website for.

Fear of errors being discovered

I’m sure everyone can understand that fear. It can be embarrassing to have your errors (and we all make mistakes) being discovered – at least if they are errors with big consequences. Part of the problem is also that all too often the discovery of errors is associated with some malice. To err is human, to forgive divine. We really need to stop treating every time somebody’s mistakes are being revealed (or, for that matter, when somebody’s findings fail to replicate) as an implication of sloppy science or malpractice. Sometimes (usually?) mistakes are just mistakes.

Probably nobody wants to have all of their data combed by vengeful sleuths nitpicking every tiny detail. If that becomes excessive and the same person is targeted, it could border on harassment and that should be counteracted. In-depth scrutiny of all the data by a particular researcher should be a special case that only happens when there is a substantial reason, say, in a fraud investigation. I would hope though that these cases are also rare.

And surely nobody can seriously want the scientific record to be littered with false findings, artifacts, and coding errors. I am not happy if someone tells me I made a serious error, but I would nonetheless be grateful to them for telling me! It has happened before when lab members or collaborators spotted mistakes I made. In turn, I have spotted mistakes colleagues made. None of this would have been possible if we didn’t share our data and methods amongst one another. I am always surprised when I hear how uncommon this seems to be in some labs. Labs should be collaborative, and so should science as a whole. And as I already said, organising and documenting your data actually helps you to spot errors before the work is published. If anything, data sharing reduces mistakes.

Ethical issues with patient confidentiality

This is a big concern – and the only one that I have full sympathy with. But all of our ethics and data protection applications actually discuss this. The only data that is shared should be anonymised. Participants should only be identified by unique codes that only the researchers who collected the data have access to. For a lot of psychology or other behavioural experiments this shouldn’t be hard to achieve.

Neuroimaging or biological data are a different story. I have a strict rule for my own results. We do not upload the actual brain images of our fMRI experiments to public repositories. While under certain conditions I am willing to share such data upon request as long as the participant’s name has been removed, I don’t think it is safe to make those data permanently available to the entire internet. Participant confidentiality must trump the need for transparency. It simply is not possible to remove all identifying information from these files. Skull-stripping, which removes the head tissues from an MRI scan except for the brain, does not remove all identifying information. Brains are like fingerprints and they can easily be matched up, if you have the required data. As someone* recently said in a discussion of this issue, the undergrad you are scanning in your experiment now may be Prime Minister in 20–30 years’ time. They definitely didn’t consent to their brain scans being available to anyone. It may not take much to identify a person’s data using only their age, gender, handedness, and a basic model of their head shape derived from their brain scan. We must also keep in mind what additional data mining may be possible in the coming decades that we simply have no idea about yet. Nobody can know what information could be gleaned from these data, say, about health risks or personality factors. Sharing this without very clear informed consent (which many people probably wouldn’t give) is in my view irresponsible.

I also don’t believe that for most purposes this is even necessary. Most neuroimaging studies involve group analyses. In those, you first spatially normalise each participant’s images and then perform statistical analyses across participants. It is perfectly reasonable to make those group results available. For the purposes of non-parametric permutation analyses (also in the news recently) you would want to share individual data points, but even there you can probably share images after sufficient processing that not much incidental information is left (e.g. condition contrast images). In our own work, these considerations don’t apply: we conduct almost all our analyses in the participant’s native brain space. As such, we decided to only share the participants’ data projected onto a cortical reconstruction. These data contain the functional results for every relevant voxel after motion correction and signal filtering. No, this isn’t raw data, but it is sufficient to reproduce the results and it is also sufficient for applying different analyses. I’d wager that for almost all purposes this is more than enough. And again, if someone were interested in applying different motion correction or filtering methods, this would be a negotiable situation. But I don’t think we need to allow unrestricted permanent access for such highly unlikely purposes.

Basically, rather than sharing all raw data, I think we need to treat each data set on a case-by-case basis and weigh the risks against the benefits. What should be mandatory, in my view, is sharing all data after the default processing needed to reproduce the published results.

People with agendas and freeloaders

Finally, a few words about a combination of points 2 and 6 in Dorothy Bishop’s list. When it comes to controversial topics (e.g. climate change or chronic fatigue syndrome, to name a few examples where this apparently happened) there is perhaps the danger that people with shady motivations will reanalyse and nitpick the data to find fault with them and discredit the researcher. More generally, people with limited expertise may conduct poor reanalyses. Since failed reanalyses (and again, the same applies to failed replications) often cause quite a stir and are frequently discussed as evidence that the original claims were false, this could indeed be a problem. Also, some will perceive these cases as “data tourism”, using somebody else’s hard-won results for quick personal gain – say, by making a name for themselves as a cunning data detective.

There can be some truth in that and for that reason I feel we really have to work harder to change the culture of scientific discourse. We must resist the bias to agree with the “accuser” in these situations. (Don’t pretend you don’t have this bias because we all do. Maybe not in all cases but in many cases…)

Of course skepticism is good. Scientists should be skeptical but the skepticism should apply to all claims (see also this post by Neuroskeptic on this issue). If somebody reanalyses somebody else’s data using a different method that does not automatically make them right and the original author wrong. If somebody fails to replicate a finding, that doesn’t mean that finding was false.

Science thrives on discussion and disagreement. The critical thing is that the discussion is transparent and public. Anyone who has an interest should have the opportunity to follow it. Anyone who is skeptical of the authors’ or the reanalysers’/replicators’ claims should be able to check for themselves.

And the only way to achieve this level of openness is Open Data.

 

* They will remain anonymous unless they want to join this debate.

Does correlation imply prediction?

TL;DR: Leave-one-out cross-validation is a bad way to test the predictive power of linear correlation/regression.

Correlation and regression analyses are popular tools in neuroscience and psychology research for analysing individual differences. They fit a model (most typically a linear relationship between two measures) to infer whether the variability in one measure is related to the variability in another. Revealing such relationships can help us understand the underlying mechanisms. We and others have used this approach in previous studies to test specific mechanistic hypotheses linking brain structure/function and behaviour. It also forms the backbone of twin studies of heritability that in turn can implicate genetic and experiential factors in some trait. Most importantly, in my personal view individual differences are interesting because they acknowledge the fact that every human being is unique, rather than simply treating variability as noise and averaging across large groups of people.

But typically every report of a correlational finding will be followed by someone zealously pointing out that “Correlation does not imply causation”. And doubtless it is very important to keep that in mind. A statistical association between two variables may simply reflect the fact that they are both related to a third, unknown factor or a correlation may just be a fluke.

Another problem is that the titles of studies using correlation analysis sometimes use what I like to call “smooth narrative” style. Saying that some behaviour is “predicted by” or “depends on” some brain measure makes for far more accessible and interesting reading than dryly talking about statistical correlations. However, it doesn’t sit well with a lot of people, in part because such language may imply a causal link that the results don’t actually support. Jack Gallant seems to regularly point out on Twitter that the term “prediction” should only ever be used when a predictive model is built on one data set but its validity is tested on an independent data set.

Recently I came across an interesting PubPeer thread debating this question. In it, one commenter pointed out that the title of the study under discussion, “V1 surface size predicts GABA concentration”, was unjustified because this relationship explains only about 7% of the variance when using a leave-one-out cross-validation procedure. In this procedure all data points except one are used to fit the regression, and the left-out point is then used to evaluate the fit of the model. This is repeated n-fold so that every data point serves as evaluation data once.
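
To make the procedure concrete, here is a minimal sketch of leave-one-out cross-validation for a simple linear regression (in Python with NumPy; the variable names and data are hypothetical, not taken from the study under discussion):

```python
import numpy as np

def loo_cv_r2(x, y):
    """Leave-one-out cross-validation for simple linear regression:
    each point is predicted from a line fitted to all other points,
    and 'variance explained' is the squared correlation between the
    held-out observations and their predictions."""
    n = len(x)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                     # drop the i-th point
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        preds[i] = slope * x[i] + intercept          # predict the held-out point
    return np.corrcoef(preds, y)[0, 1] ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=0.7, size=50)         # a genuine linear relationship
print(loo_cv_r2(x, y))
```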

Taken at face value this approach sounds very appealing because it uses independent data for making predictions and for testing them. Replication is a cornerstone of science and in some respects cross-validation is an internal replication. So surely this is a great idea? Naive as I am, I have long had a strong affinity for it.

Cross-validation underestimates predictive power

But not so fast. These notions fail to address two important issues (both of which some commenters on that thread already pointed out): first, it is unclear what amount of variance a model should explain to be important. 7% is not very much but it can nevertheless be of substantial theoretical value. The amount of variance that can realistically be explained by any model is limited by the noise in the data that arises from measurement error or other distortions. So in fact many studies using cross-validation to estimate the variance explained by some models (often in the context of model comparison) instead report the amount of explainable variance accounted for by the model. To derive this one must first estimate the noise ceiling, that is, the realistic maximum of variance that can possibly be explained. This depends on the univariate variability of the measures themselves.
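
One way to put a number on such a ceiling is the classical attenuation formula from psychometrics: the maximum observable correlation between two measures is bounded by the geometric mean of their reliabilities. A minimal sketch with hypothetical reliability values:

```python
import math

def noise_ceiling(rel_x, rel_y):
    """Maximum observable correlation between two measures given their
    test-retest reliabilities (Spearman's attenuation formula)."""
    return math.sqrt(rel_x * rel_y)

# Hypothetical test-retest reliabilities of the two measures:
r_max = noise_ceiling(0.8, 0.7)
print(r_max ** 2)  # the most variance any model could hope to explain (~56%)
```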

Second, the cross-validation approach is based on the assumption that the observed sample, which is then subdivided into model-fitting and evaluation sets, is a good representation of the population parameters the analysis is attempting to infer. As such, the cross-validation estimate also comes with an error (this issue is also discussed by another blog post mentioned in that discussion thread). What we are usually interested in when we conduct scientific studies is to make an inference about the whole population, say a conclusion that can be broadly generalised to any human brain, not just the handful of undergraduate students included in our experiments. This does not really fit the logic of cross-validation because the evaluation is by definition only based on the same sample we collected.

Because I am a filthy, theory-challenged experimentalist, I decided to simulate this (and I apologise to all my Bayesian friends for yet again conditioning on the truth here…). For a range of sample sizes between n=3 and n=300 I drew a sample from a population with a fixed correlation of rho=0.7 and performed leave-one-out cross-validation to quantify the variance explained by it (using the squared correlation between predicted and observed values). I also performed a standard regression analysis and quantified the variance explained by that. At each sample size I did this 1000 times and then calculated the mean variance explained for each approach. Here are the results:
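
A scaled-down version of this simulation (a Python sketch with fewer sample sizes and repetitions than the 1000 used here, but the same logic) might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7                                   # true population correlation (R^2 = 49%)

def draw_sample(n):
    """Draw n pairs from a bivariate normal with correlation rho."""
    return rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

def observed_r2(x, y):
    """Variance explained by regression on the whole observed sample."""
    return np.corrcoef(x, y)[0, 1] ** 2

def loo_r2(x, y):
    """Variance explained by leave-one-out cross-validated predictions."""
    n = len(x)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        preds[i] = slope * x[i] + intercept
    return np.corrcoef(preds, y)[0, 1] ** 2

for n in (10, 30, 100):
    samples = [draw_sample(n) for _ in range(200)]
    obs = np.mean([observed_r2(x, y) for x, y in samples])
    cv = np.mean([loo_r2(x, y) for x, y in samples])
    print(n, round(obs, 2), round(cv, 2))
```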

[Figure: mean variance explained as a function of sample size for rho = 0.7]

What is immediately clear is that the results strongly depend on the sample size. Let’s begin with the blue line. This represents the variance explained by the standard regression analysis on the whole observed sample. The dotted, black, horizontal line denotes the true effect size, that is, the variance explained by the population correlation (so R^2=49%). The blue line starts off well above the true effect but then converges on it. This means that at small sample sizes, especially below n=10, the observed sample inflates how much variance is explained by the fitted model.

Next look at the red line. This denotes the variance explained by the leave-one-out cross-validation procedure. This also starts off above the true population effect and follows the decline of the observed correlation. But then it actually undershoots and goes well below the true effect size. Only then does it gradually increase again and converge on the true effect. So at sample sizes that are most realistic in individual differences research, n=20-100ish, this cross-validation approach underestimates how much variance a regression model can explain and thus in fact undervalues the predictive power of the model.

The error bars in this plot denote +/- 1 standard deviation across the simulations at each sample size. So as one would expect, the variability across simulations is considerable when sample size is small, especially when n <= 10. These sample sizes are maybe unusually small but certainly not unrealistically small. I have seen publications calculating correlations on such small samples. The good news here is that even with such small samples on average the effect may not be inflated massively (let’s assume for the moment that publication bias or p-hacking etc are not an issue). However, cross-validation is not reliable under these conditions.

A correlation of rho=0.7 is unusually strong for most research. So I repeated this simulation analysis using a perhaps more realistic effect size of rho=0.3. Here is the plot:

[Figure: mean variance explained as a function of sample size for rho = 0.3]

Now we see a hint of something fascinating: the variance explained by the cross-validation approach actually subtly exceeds that of the observed sample correlation. They again converge on the true population level of 9% when the sample size reaches n=50. Actually there is again an undershoot but it is negligible. But at least for small samples with n <= 10 the cross-validation certainly doesn’t perform any better than the observed correlation. Both massively overestimate the effect size.

When the null hypothesis is true…

So if this is what is happening at a reasonably realistic rho=0.3, what about when the null hypothesis is true? This is what is shown here (I apologise for the error bars extending into the impossible negative range but I’m too lazy to add that contingency to the code…):

[Figure: mean variance explained as a function of sample size for rho = 0]

The problem we saw hinted at above for rho=0.3 is exacerbated here. As before, the variance explained for the observed sample correlation is considerably inflated when sample size is small. However, for the cross-validated result the situation is much worse. Even at a sample size of n=300 the variance explained by the cross-validation is greater than 10%. If you read the PubPeer discussion I mentioned, you’ll see that I discussed this issue there. This result occurs because when the null hypothesis is true – or the true effect is very weak – the cross-validation will produce significant correlations between the inadequately fitted model predictions and the actual observed values. These correlations can be positive or negative (that is, the predictions systematically go in the wrong direction) but because the variance explained is calculated by squaring the correlation coefficient they turn into numbers substantially greater than 0%.
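
The mechanism is easy to demonstrate directly: even when the two variables are completely unrelated, squaring the correlation between held-out observations and their cross-validated predictions can only produce non-negative numbers. A sketch (not the code behind the figures):

```python
import numpy as np

rng = np.random.default_rng(2)

def loo_squared_r(x, y):
    """Squared correlation between held-out observations and their
    leave-one-out regression predictions."""
    n = len(x)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        preds[i] = slope * x[i] + intercept
    return np.corrcoef(preds, y)[0, 1] ** 2

# Two completely independent variables: the true variance explained is 0%,
# yet the mean cross-validated estimate comes out strictly positive
# because every simulated value is a square.
vals = [loo_squared_r(rng.normal(size=20), rng.normal(size=20))
        for _ in range(500)]
print(np.mean(vals))
```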

As I discussed in that thread, there is another way to calculate the variance explained by the cross-validation. I won’t go into detail on this but unlike the simpler approach I employed here this does not limit the variance explained to fall between 0-100%. While the estimates are numerically different, the pattern of results is qualitatively the same. At smaller sample sizes the variance explained by cross-validation systematically underestimates the true variance explained.
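
I understand the alternative to be something along the lines of the coefficient of determination computed from the prediction errors rather than from a correlation. A sketch with my own made-up numbers:

```python
import numpy as np

def cv_r2_from_errors(y_true, y_pred):
    """R^2 computed as 1 - SS_res/SS_tot. Unlike a squared correlation
    this is not bounded below by zero: predictions that are worse than
    simply guessing the mean yield negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of guessing the mean
    return 1.0 - ss_res / ss_tot

# Predictions that systematically point the wrong way score below zero:
y = [1.0, 2.0, 3.0, 4.0]
backwards = [4.0, 3.0, 2.0, 1.0]
print(cv_r2_from_errors(y, backwards))  # -3.0
```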

When the interocular traumatic test is significant…

My last example is the opposite scenario. While we already looked at an unusually strong correlation, I decided to also simulate a case where the effect should be blatantly obvious. Here rho=0.9:

[Figure: mean variance explained as a function of sample size for rho = 0.9]

Unsurprisingly, the results are similar to those seen for rho=0.7, but now the observed correlation is already doing a pretty decent job of reaching the nominal level of 81% variance explained. Still, the cross-validation underperforms at small sample sizes. In this situation, that actually seems to be a problem. It is rare that one would observe a correlation of this magnitude in the psychological or biological sciences, but if so, chances are good that the sample size is small. Often the reason for this may be that correlation estimates are inflated at small sample sizes, but that’s not the point here. The point is that leave-one-out cross-validation won’t tell you. It underestimates the association even if it is real.

Where does all this leave us?

It is not my intention to rule out cross-validation. It can be a good approach for testing models and is often used successfully in the context of model comparison or classification analysis. In fact, as the debate about circular inference in neuroscience a few years ago illustrated, there are situations where it is essential that independent data are used. Cross-validation is a great way to deal with overfitting. Just don’t let yourself be misled into believing it can tell you something it doesn’t. I know it is superficially appealing and I had played with it previously for just that reason – but this exercise has convinced me that it’s not as bullet-proof as one might think.

Obviously, validation of a model with independent data is a great idea. A good approach is to collect a whole independent replication sample but this is expensive and may not always be feasible. Also, if a direct replication is performed it seems better that this is acquired independently by different researchers. A collaborative project could do this in which each group uses the data acquired by the other group to test their predictive model. But that again is not something that is likely to become regular practice anytime soon.

In the meantime we can also remember that performing typical statistical inference is a good approach after all. Its whole point is to infer the properties of the whole population from a sample. When used properly it tends to do a good job at that. Obviously, we should take measures to improve its validity, such as increasing power by using larger samples and/or better measurements. I know I am baysed, but Bayesian hypothesis tests seem superior to traditional significance testing at ensuring validity. Registered Reports can probably also help and certainly should reduce the skew caused by publication bias and flexible analyses.

Wrapping up

So, does correlation imply prediction? I think so. Statistically this is precisely what it does. It uses one measure (or multiple measures) to make predictions of another measure. The key point is not whether calling it a prediction is valid but whether the prediction is sufficiently accurate to be important. The answer to this question actually depends considerably on what we are trying to do. A correlation explaining 10-20% of the variance in a small sample is not going to be a clear biomarker for anything. I sure as hell wouldn’t want any medical or judicial decisions to be based solely on such an association. But it may very well be very informative about mechanisms. It is a clearly detectable effect even with the naked eye.

In the context of these analyses, a better way than quantifying the variance explained is to calculate the root mean squared deviation (essentially the error bar) of the prediction. This provides a much more direct index of how accurately one variable predicts another. The next step – and I know I sound like a broken record – should be to confirm that these effects are actually scientifically plausible. This mantra is as true for individual differences research as it is for Bem’s precognition and social priming experiments, where I have mentioned it before. Are the differences in neural transmission speed or neurotransmitter concentration implied by these correlation results realistic based on what we know about the brain? These are the kinds of predictions we should actually care about in these discussions.
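
For completeness, a minimal sketch of that root-mean-squared-deviation measure (with made-up numbers; the point is that the result is in the units of the predicted measure itself):

```python
import numpy as np

def rmsd(y_true, y_pred):
    """Root mean squared deviation of the predictions: expressed in the
    same units as the predicted measure, so it reads directly as the
    typical size of a prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up example: predicted vs. measured concentrations (arbitrary units)
measured = [1.2, 1.5, 1.1, 1.8]
predicted = [1.3, 1.4, 1.2, 1.6]
print(rmsd(measured, predicted))
```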

 

Reliability and Generalisability

So I know I said I won’t write about replications any more. I want to stay true to my word for once… But then a thought occurred to me last night. Sometimes it takes minimal sleep followed by a hectic, crazy day to crystallise an idea in your mind. So before I go on a much-needed break from research, social media, and (most importantly) bureaucracy, here is a post through the loophole (actually the previous one was too – I decided to split them up). It’s not about replication as such but about what “replicability” can tell us. Also it’s unusually short!

While replicability now seems widely understood to mean that a finding replicates on repeated testing, I have come to prefer the term reliability. If a finding is so fluky that it often cannot be reproduced, it is likely it was spurious in the first place. Hence it is unreliable. Most of the direct replication attempts underway now are seeking to establish the reliability of previous findings. That is fair enough. However, any divergence from the original experimental conditions will make the replication less direct.

This brings us to another important aspect of a finding – its generalisability (I have actually written about this whole issue before, although in a more specific context). A finding may be highly reliable but still fail to generalise, like the coin flipping example in my previous post. In my opinion science must seek to establish both the reliability and the generalisability of hypotheses. Just like Sinatra said, “you can’t have one without the other.”

This is where I think most (not all!) currently debated replication efforts fall short. They seek to establish only reliability, which you cannot really do in isolation. A reliable finding that is so specific that it only occurs under very precise conditions could still lead to important theoretical advances – just like Fluke’s magnetic sand led to the invention of holographic television and hover-cars. Or, if you prefer real examples, just ask any microbiologist or single-cell electrophysiologist. Some very real effects can be very precarious.

However, it is almost certainly true that a lot of findings (especially in our field right now) are simply unreliable and thus probably unreal. My main issue with the “direct” replication effort is that by definition it cannot ever distinguish reliability from generalisability. One could argue (and some people clearly do) that the theories underlying things like social priming are just so implausible that we don’t need to ask if they generalise. I think that is wrong. It is perfectly fine to argue that some hypothesis is implausible – I have done so myself. However, I think we should always test reliability and generalisability at the same time. If you only seek to establish reliability, you may be looking in the wrong place. If you only ask if the hypothesis generalises, you may end up chasing a mirage. Either way, you invest a lot of effort and resources but you may not actually advance scientific knowledge very much.

And this, my dear friends, Romans, and country(wo)men, will really be my final post on replication. When I’m back from my much-needed science break I will want to post about another topic inspired by a recent Neuroskeptic post.