Is d>10 a plausible effect size?

TL;DR: You may get a very large relative effect size (like Cohen's d) if the main source of variability in your sample is the measurement reliability of each observation and you made each measurement as precise as feasible. Such a large d is not trivial, but in that case talking about d misses the point.

In discussions of scientific findings you will often hear talk about relative effect sizes, like the ubiquitous Cohen's d. Essentially, such effect sizes quantify the mean difference between groups/treatments/conditions relative to the variability across subjects/observations. The situation is actually a lot more complicated, because even for a seemingly simple result like the difference between conditions you will find that there are several ways of calculating the effect size. You can read a nice summary by Jake Westfall here. There are also other effect sizes, such as correlation coefficients, and what I write here applies to those, too. I will however stick to the difference-type effect size because it is arguably the most common.
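To make that concrete, here is a minimal Python sketch of the difference-type effect size (the function is just the textbook two-group version with a pooled standard deviation, nothing specific to any study discussed here):

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```

The denominator is the crux of everything that follows: it is the variability across observations, not the size of the effect itself.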

One thing that has irked me about those discussions for some years is that they ignore a very substantial issue: the between-subject variance of your sample depends on the within-subject variance. The more unreliable the measurement of each subject, the greater the variability of your sample. Thus the reliability of individual measurements limits the relative effect size you can possibly achieve in your experiment given a particular experimental design. In most of science – especially the biological and psychological sciences – the reliability of individual observations is strongly limited by measurement error and/or the quality of your experiment.
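To see why, note that if each subject's score is the average of k noisy trials, the variance you observe across subjects is approximately the true between-subject variance plus the within-subject variance divided by k. A quick simulation sketch (the variances here are made up purely for illustration) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, sigma_between, sigma_within = 10_000, 1.0, 3.0

for k in (1, 10, 100):  # number of trials averaged per subject
    true_scores = rng.normal(0.0, sigma_between, n_subjects)
    # A subject's measured score is the mean of k noisy trials,
    # so its error SD shrinks with the square root of k
    measured = true_scores + rng.normal(0.0, sigma_within / np.sqrt(k), n_subjects)
    observed_var = measured.var(ddof=1)
    predicted_var = sigma_between**2 + sigma_within**2 / k
    print(f"k={k:4d}: observed {observed_var:5.2f}, predicted {predicted_var:5.2f}")
```

With one trial per subject the sample variance is dominated by measurement noise; with a hundred trials it approaches the true between-subject variance.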

There are some standard examples that are sometimes used to illustrate what a given effect size means. I stole a common one from this blog post about the average height difference between men and women, which apparently was d=1.482 in 1980 Spain. I have no idea whether that figure is exact, but it should be in the right ballpark. I assume most people will agree that men are on average taller than women but that there is nevertheless substantial overlap in the distributions – so you will relatively frequently find a woman who is taller than many men. That is an effect size we might consider strong.

The height difference between men and women is a good reference for an effect size because it is largely limited by the between-subject variance, the variability in actual heights across the population. Obviously, the reliability of each observation also plays a role. There will definitely be a degree of measurement error. However, I suspect that this error is small, probably on the order of a few millimeters. Even if you're quite bad at this measurement I doubt you will typically err by more than 1-2 cm, and you can probably still replicate this effect in a fairly small sample. However, in psychology experiments your measurement is rarely that accurate.

Now, in some experiments you can increase the reliability of your individual measurement by increasing the number of trials (at this point I'd like to again refer to Richard Morey's related post on this topic). In psychophysics, collecting hundreds or thousands of trials on one individual subject is not at all uncommon. Let's take a very simple case. Contour integration refers to the ability of the visual system to detect "snake" contours better than "ladder" contours or those defined by other orientations (we like to call those "ropes"):

[Figure: two arrays of grating patches – left, a "snake" contour of collinear patches embedded in randomly oriented gratings; right, a "rope" contour of patches oriented at 45 degrees to the path.]

In the left image you should hopefully see a circle defined by 16 grating patches embedded in a background of randomly oriented gratings. This "snake" contour pops out from the background because the visual system readily groups orientations along a collinear (or cocircular) path into a coherent object. In contrast, when the contour is defined by patches of other orientations – for example the "rope" contour in the right image, which is defined by patches at 45 degrees relative to the path – it is much harder to detect the presence of the contour. This isn't a vision science post so I won't go into any debates on what this means. The take-home message here is that if healthy subjects with normal vision are asked to determine the presence or absence of a contour like this, especially with limited viewing time, they will perform very well for the "snake" contours but only barely above chance for the "rope" contours.

This is a very robust effect, and I'd argue it is quite representative of many psychophysical findings. A psychophysicist probably wouldn't simply measure the accuracy but would conduct a broader study of how it depends on particular stimulus parameters – but that's not really important here.

What is the size of this effect? 

If I study this in a group of subjects, the relative effect size at the group level will depend on how accurately I measure the performance of each individual. If I have 50 subjects (which is 10-25 times larger than your typical psychophysics study…) and each performs just one trial, then the sample variance will be much larger than if each of them does 100 trials, or 1000 trials. As a result, the Cohen's d of the group will be considerably different. A d>10 should be entirely feasible if we collect enough trials per person.
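Here is a rough simulation of that scenario (the true accuracies and their spread across subjects are invented numbers; only the qualitative pattern matters):

```python
import numpy as np

rng = np.random.default_rng(7)
n_subjects = 50
p_snake, p_rope, spread = 0.95, 0.55, 0.03  # hypothetical true accuracies

for k in (1, 100, 1000):  # trials per condition per subject
    true_snake = np.clip(rng.normal(p_snake, spread, n_subjects), 0, 1)
    true_rope = np.clip(rng.normal(p_rope, spread, n_subjects), 0, 1)
    # Measured accuracy is the proportion correct over k Bernoulli trials
    acc_snake = rng.binomial(k, true_snake) / k
    acc_rope = rng.binomial(k, true_rope) / k
    diff = acc_snake - acc_rope
    d = diff.mean() / diff.std(ddof=1)  # within-subject Cohen's d
    print(f"{k:5d} trials per condition: d = {d:.1f}")
```

With one trial per condition the difference scores can only be -1, 0, or 1, so their variance is enormous and d stays modest; with a thousand trials each measured accuracy hugs that subject's true accuracy and d balloons.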

People will sometimes say that large effects (d>>2, perhaps) are trivial. But there is nothing trivial about this. In this particular example you may see the difference quite easily for yourself (so you are a single-subject, single-trial replication). But we might want to know just how much better we are at detecting the snake than the rope contours. Or, as I already mentioned, a psychophysicist might measure the sensitivity of subjects to various stimulus parameters in this experiment (e.g., the distance between patches, the amount of orientation noise we can tolerate, etc.) and this could tell us something about how vision works. The Cohen's d would be pretty large for all of these. That does not make it trivial, but in my view it makes it useless:

Depending on my design choices, the estimated effect size may be a very poor reflection of the true effect size. As mentioned earlier, the relative effect size is directly dependent on the between-subject variance – but that in turn depends on the reliability of individual measurements. If each subject only does one trial, the effect of just one attentional lapse or accidental button press in the task is much more detrimental than when they perform 1000 trials, even if the overall rate of lapses/accidents is the same*.
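The lapse point is easy to see in a sketch (again with invented numbers: a 90% true accuracy and a 2% lapse rate, where a lapsing subject responds at chance):

```python
import numpy as np

rng = np.random.default_rng(3)
p_correct, lapse_rate, n_subjects = 0.9, 0.02, 10_000  # made-up values

# On a lapse the subject guesses, so the effective per-trial accuracy is
p_eff = (1 - lapse_rate) * p_correct + lapse_rate * 0.5

for k in (1, 1000):  # trials per subject
    acc = rng.binomial(k, p_eff, n_subjects) / k
    print(f"{k:4d} trials: mean accuracy {acc.mean():.3f}, "
          f"SD across subjects {acc.std(ddof=1):.3f}")
```

The mean accuracy barely moves, but the spread of measured scores across subjects shrinks roughly by the square root of the trial count.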

Why does this matter?

In many experiments, the estimate of between-subject variance will be swamped by the within-subject variability. Returning to the example of gender height differences, this is essentially what would happen if you chose to eyeball each person's height instead of using a tape measure. I suspect this is the case for many experiments in social or personality psychology, where each measurement is essentially a single quantity (say, timing the speed with which someone walks out of the lab in a priming experiment) rather than being based on hundreds or thousands of trials as in psychophysics. Notoriously noisy measurements are doubtless also the major limiting factor in most neuroimaging experiments. On the other hand, I assume a lot of questionnaire-type results in psychology (such as IQ or the Big Five personality factors) actually have pretty high test-retest reliability, so there you probably do get mostly the between-subject variance.

The problem is that it is often very difficult to determine which scenario we are in. In psychophysics, we are often so dominated by measurement reliability that knowledge of the "true" population effect size is actually completely irrelevant. This is a critical issue because you cannot use such an effect size for power analysis: if I take an experiment someone did and base my power analysis on the effect size they reported, I am not really powering my experiment to detect a similar effect but to replicate a similar design. (This is particularly useless if I then decide to use a different design…)
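To see why, plug a few design-dependent d values into the usual normal-approximation sample-size formula for a paired design (the d values below are invented, roughly spanning the low-trial and high-trial simulations above):

```python
import numpy as np
from scipy.stats import norm

def n_for_power(d, alpha=0.05, power=0.8):
    """Normal-approximation sample size, two-sided paired test on difference scores."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil((z / d) ** 2))

# The same underlying effect yields very different d depending on trials per subject,
# so the "required" sample size tracks the original design, not the effect itself
for d in (0.7, 5.8, 10.0):  # illustrative low-, mid-, and high-trial designs
    print(f"d = {d:4.1f} -> n ≈ {n_for_power(d)}")
```

A d inflated by thousands of trials per subject tells you that one or two subjects "suffice" – which is only true if you also copy the trial-heavy design that produced that d.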

So next time you see an unusually large Cohen's d (d>10, or even just d>3), ask yourself not simply whether this is a plausible effect but whether the experiment can plausibly estimate the true population effect. If the result is based on a single observation per subject with a highly variable measurement (say, how often Breton drivers stop for female hitchhikers wearing red clothing…), even a d=1 seems incredibly large.

But if it is for a measurement that could have been made more reliable by doubling the amount of data collected in each subject (say, a change in psychometric thresholds), then a very high Cohen's d is entirely plausible – but it is also pretty meaningless. In this situation, what we should really care about is the absolute effect size (How much does the threshold change? How much does the accuracy drop? etc.).

And I must say, I remain unsure whether absolute effect sizes aren’t more useful in general, including for experiments on complex human behaviour, neuroimaging, or drug effects.

* Actually the lapse rate probably increases with a greater number of trials due to subject fatigue, drop in motivation, or out of pure spite. But even that increase is unlikely to be as detrimental as having too few trials.
