Probably one of the main reasons for the low replicability of scientific studies is that many previous studies have been underpowered – or rather that they only provided inconclusive evidence for or against the hypotheses they sought to test. Alex Etz had a great blog post on this with regard to replicability in psychology (and he published an extension of this analysis that takes publication bias into account as a paper). So it is certainly true that as a whole researchers in psychology and neuroscience can do a lot better when it comes to the sensitivity of their experiments.

A common mantra is that we need larger sample sizes to boost sensitivity. Statistical power is a function of the sample size and the expected effect size. There is a lot of talk out there about what effect size one should use for power calculations. For instance, when planning a replication study, it has been suggested that you should more than double the sample size of the original study. This is supposed to take into account the fact that published effect sizes are probably skewed upwards due to publication bias and analytical flexibility, or even simply because the true effect happens to be weaker than originally reported.

However, what all these recommendations neglect to consider is that standardized effect sizes, like Cohen’s *d* or a correlation coefficient, are also dependent on the precision of your observations. By reducing measurement error or other noise factors, you can literally increase the effect size. A higher effect size means greater statistical power – so with the *same* sample size you can boost power by improving your experiment in other ways.

Here is a practical example. Imagine I want to correlate the height of individuals measured in centimeters and inches. This is a trivial case – theoretically the correlation should be perfect, that is, ρ = 1. However, measurement error will spoil this potential correlation somewhat. I have a sample size of 100 people. I first ask my auntie Angie to guess the height of each subject in centimeters. To determine their heights in inches, I then take them all down the pub and ask this dude called Nigel to also take a guess. Both Angie and Nigel will misestimate heights to some degree. For simplicity, let’s just say that their errors are on average the same. This nonetheless means their guesses will not always agree very well. If I then calculate the correlation between their guesses, it will obviously have to be lower than 1, even though this is the true correlation. I simulated this scenario below. On the x-axis I plot the amount of measurement error in cm (the standard deviation of Gaussian noise added to the actual body heights). On the y-axis I plot the median observed correlation and the shaded area is the 95% confidence interval over 10,000 simulations. As you can see, as measurement error increases, the observed correlation goes down and the confidence interval becomes wider.

Greater error leads to poorer correlations. So far, so obvious. But while I call this the observed correlation, it really is the *maximally observable correlation*. This means that in order to boost power, the first thing you could do is to reduce measurement error. In contrast, increasing your sample size can be highly inefficient and border on the infeasible.

For a correlation of 0.35, hardly an unrealistically low effect in a biological or psychological scenario, you would need a sample size of 62 to achieve 80% power. Let’s assume this is the correlation found by a previous study and we want to replicate it. Following common recommendations you would plan to collect two-and-a-half the sample size, so n = 155. Doing so may prove quite a challenge. Assume that each data point involves hours of data collection per participant and/or that it costs 100s of dollars to acquire the data (neither are atypical in neuroimaging experiments). This may be a considerable additional expense few researchers are able to afford.

And it gets worse. It is quite possible that by collecting more data you further *sacrifice* data quality. When it comes to neuroimaging data, I have heard from more than one source that some of the large-scale imaging projects contain only mediocre data contaminated by motion and shimming artifacts. The often mentioned suggestion that sample sizes for expensive experiments could be increased by multi-site collaborations ignores that this quite likely introduces additional variability due to differences between sites. The data quality even from the same equipment may differ. The research staff at the two sites may not have the same level of skill or meticulous attention to detail. Behavioral measurements acquired online via a website may be more variable than under controlled lab conditions. So you may end up polluting your effect size even further by increasing sample size.

The alternative is to improve your measurements. In my example here, even going from a measurement error of 20 cm to 15 cm improves the observable effect size quite dramatically, moving from 0.35 to about 0.5. To achieve 80% power, you would only need a sample size of 29. If you kept the original sample size of 62, your power would be 99%. So the critical question is not really what the original effect size was that you want to replicate – rather it is how much you can improve your experiment by reducing noise. If your measurements are already pretty precise to begin with, then there is probably little room for improvement and you also don’t win all that much, as going from measurement error 5 cm to 1 cm in my example. But when the original measurement was noisy, improving the experiment can help a hell of a lot.

There are many ways to make your measurements more reliable. It can mean ensuring that your subjects in the MRI scanner are padded in really well, that they are not prone to large head movements, that you did all in your power to maintain a constant viewing distance for each participant, and that they don’t fall asleep halfway through your experiment. It could mean scanning 10 subjects twice, instead of scanning 20 subjects once. It may be that you measure the speed that participants walk down the hall to the lift with laser sensors instead of having a confederate sit there with a stopwatch. Perhaps you can change from a group comparison to a within-subject design? If your measure is an average across trials collected in each subject, you can enhance the effect size by increasing the number of trials. And it definitely means not giving a damn what Nigel from down the pub says and investing in a bloody tape measure instead.

I’m not saying that you shouldn’t collect larger samples. Obviously, if measurement reliability remains constant*, larger samples can improve sensitivity. But the first thought should always be *how you can make your experiment a better test of your hypothesis*. Sometimes the only thing you can do is to increase the sample but I bet usually it isn’t – and if you’re not careful, it can even make things worse. If your aim is to conclude something about the human brain/mind in general, a larger and broader sample would allow you to generalize better. However, for this purpose increasing your subject pool from 20 undergraduate students at your university to 100 isn’t really helping. And when it comes to the choice between an exact replication study with three times the sample size than the original experiment, and one with the same sample but objectively better methods, I know I’d always pick the latter.

*(* In fact, it’s really a trade-off and in some cases a slight increase of measurement error may very well be outweighed by greater power due to a larger sample size. This probably happens for the kinds of experiments where slight difference in experimental parameters don’t matter much and you can collect 100s of people fast, for example online or at a public event).*

I really enjoyed reading this thoughtful post. Thanks for writing such a great piece that goes beyond only increasing sample size to boost power.

In the interest of applying these principles to replications, can you suggest a way to estimate a new, adjusted effect size (or power) upon switching to a method of collecting multiple samples from each individual? As a very simple example, consider a one-sample design examining performance on a task where each individual is tested once. If a researcher wanted to replicate the study but test each individual 30 times, what would be a good way to estimate the expected effect size for a power analysis? Could you use the original study’s SEM to compute an expected standard deviation for the new design?

LikeLike

Thanks, this is a really important question. It’s not that straightforward unfortunately and will depend very much on the specifics of each case. I don’t think (although I may of course be wrong) there is no simple way to determine the SD of the new design from the statistics of the original study alone. At the very least you’d have to know how much within-subject variance there was in the original design and you presumably do not have that because it’s a single-trial design.

But a simple rule of thumb would just be to use adequate power in the replication for the originally observed effect size (or perhaps a slightly lower one if you suspect it was inflated) but then improve your methods so you increase power beyond adequate.

Personally, I would most likely conduct simulations to quantify what power will be like in different scenarios. This requires making some assumptions about the within-subject variance which can be tricky. But you could perform this over a whole range of parameters as a robustness check.

LikeLike