or How innocent assumptions can lead to wrong conclusions
I promised you a (neuro)science post. Don’t let the title mislead you into thinking we’re talking about world affairs and societal ills again. While pigeonholing is directly related to polarised politics or social media, for once this is not what this post is about. Rather, it is about a common error in data analysis. There have been numerous expositions of similar issues over the decades but, as we’ve learned the hard way, it is a surprisingly easy mistake to make. A lay summary and some wider musings on the scientific process were published by Benjamin de Haas. A scientific article by Susanne Stoll laying out this problem in more detail is currently available as a preprint.
In science you often end up with large data sets, with hundreds or thousands of individual observations subject to considerable variance. For instance, in my own field of retinotopic population receptive field (pRF) mapping, a given visual brain area may have a few thousand recording sites, and each has a receptive field position. There are many other scenarios of course. It could be neural firing, or galvanic skin responses, or eye positions recorded at different time points. Or it could be hundreds or thousands of trials in a psychophysics experiment etc. I will talk about pRF mapping because this is where we recently encountered the problem and I am going to describe how it has affected our own findings – however, you may come across the same issue in many guises.
Imagine that we want to test how pRFs move around when you attend to a particular visual field location. I deliberately use this example because it is precisely what a bunch of published pRF studies did, including one of ours. There is some evidence that selective attention shifts the position of neuronal receptive fields, so it is not far-fetched that it might shift pRFs in fMRI experiments also. Our study for instance investigated whether pRFs shift when participants are engaged in a demanding (“high load”) task at fixation, compared to a baseline condition where they only need to detect a simple colour change of the fixation target (“low load”). Indeed, we found that across many visual areas pRFs shifted outwards (i.e. away from fixation). This suggested to us that the retinotopic map reorganises to reflect a kind of tunnel vision when participants are focussed on the central task.
What would be a good way to quantify such map reorganisation? One simple way might be to plot each pRF in the visual field with a vector showing how it is shifted under the attentional manipulation. In the graph below, each dot shows a pRF location under the attentional condition, and the line shows how it has moved away from baseline. Since there is a large number of pRFs, many of which are affected by measurement noise or other errors, these plots can be cluttered and confusing:
Clearly, we need to do something to tidy up this mess. So we take the data from the baseline condition (in pRF studies, this would normally be attending to a simple colour change at fixation) and divide the visual field up into a number of smaller segments, each of which contains some pRFs. We then calculate the mean position of the pRFs from each segment under the attentional manipulation. Effectively, we summarise the shift from baseline for each segment:
This produces a much clearer plot that suggests some interesting, systematic changes in the visual field representation under attention. Surely, this is compelling evidence that pRFs are affected by this manipulation?
Unfortunately, it is not¹. The mistake here is to assume that there is no noise in the baseline measure that was used to divide up the data in the first place. If our baseline pRF map were a perfect measure of the visual field representation, then this would have been fine. However, like most data, pRF estimates are variable and subject to many sources of error. The misestimation is also unlikely to be perfectly symmetric – for example, there are several reasons why it is more likely that a pRF will be estimated closer to central vision than in the periphery. This means there could be complex and non-linear error patterns that are very difficult to predict.
The data I showed in these figures are in fact not from an attentional manipulation at all. Rather, they come from a replication experiment where we simply measured a person’s pRF maps twice over the course of several months. One thing we do know is that pRF measurements are quite robust, stable over time, and even similar between scanners with different magnetic field strengths. What this means is that any shifts we found are most likely due to noise. They are completely artifactual.
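The artifact can be reproduced with a toy simulation. The following is a minimal sketch, not the analysis used in any of the studies mentioned here. It assumes that true pRF centres are Gaussian-distributed around fixation, that each session adds independent Gaussian noise, and that nothing changes between sessions. The function names and all parameter values (`true_sd`, `noise_sd`, `segment_size`) are hypothetical choices for illustration.

```python
import math
import random
from collections import defaultdict


def simulate_sessions(n_sessions, n_prfs=5000, true_sd=4.0, noise_sd=1.5, seed=1):
    """Noisy maps of the same, unchanging pRF centres (assumed Gaussian model)."""
    rng = random.Random(seed)
    truth = [(rng.gauss(0, true_sd), rng.gauss(0, true_sd)) for _ in range(n_prfs)]
    return [
        [(x + rng.gauss(0, noise_sd), y + rng.gauss(0, noise_sd)) for x, y in truth]
        for _ in range(n_sessions)
    ]


def segment_shifts(binning_map, start_map, end_map, segment_size=2.0):
    """Mean start and end position of the pRFs that binning_map puts in each square segment."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0, 0])
    for (bx, by), (sx, sy), (ex, ey) in zip(binning_map, start_map, end_map):
        acc = sums[(math.floor(bx / segment_size), math.floor(by / segment_size))]
        acc[0] += sx
        acc[1] += sy
        acc[2] += ex
        acc[3] += ey
        acc[4] += 1
    return {
        key: ((a[0] / a[4], a[1] / a[4]), (a[2] / a[4], a[3] / a[4]), a[4])
        for key, a in sums.items()
    }


def mean_radial_shift(segments, min_count=20):
    """Count-weighted radial component of the segment shift vectors (negative = towards fixation)."""
    total, weight = 0.0, 0
    for (sx, sy), (ex, ey), count in segments.values():
        radius = math.hypot(sx, sy)
        if count < min_count or radius == 0.0:
            continue  # skip sparsely populated segments
        total += count * ((ex - sx) * sx + (ey - sy) * sy) / radius
        weight += count
    return total / weight


baseline, repeat = simulate_sessions(2)
# The flawed step: the baseline map defines the segments AND is the start of each vector.
spurious_shift = mean_radial_shift(segment_shifts(baseline, baseline, repeat))
print(f"mean radial shift with no true change: {spurious_shift:.2f} deg")
```

Under these assumptions the segment vectors should point systematically towards fixation even though no pRF moved. With real data the noise is unlikely to be this well behaved, so the direction and size of the artifact may look quite different.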
When you think about it, this error is really quite obvious: sorting observations into clear categories can only be valid if you can be confident in the continuous measure on which you base these categories. Pigeonholing can only work if you can be sure into which hole each pigeon belongs. This error is also hardly new. It has been described in numerous forms as regression to the mean and it rears its ugly head every few years in different fields. It is also related to circular inference, which caused a stir in cognitive and social neuroscience a few years ago. Perhaps it keeps recurring because it is a damn easy mistake to make – but that doesn’t make the face-palming moment any less frustrating.
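For the simplest idealised case, the link to regression to the mean can be written down directly. The model below is an assumption for illustration only: a true position $t \sim \mathcal{N}(\mu, \sigma_t^2)$ is measured twice, as a baseline $b = t + \varepsilon_1$ and a second measurement $c = t + \varepsilon_2$, with independent noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ and no true change between the two.

$$
\mathbb{E}[\,t \mid b\,] = \mu + \frac{\sigma_t^2}{\sigma_t^2 + \sigma_\varepsilon^2}\,(b - \mu)
\quad\Longrightarrow\quad
\mathbb{E}[\,c - b \mid b\,] = -\frac{\sigma_\varepsilon^2}{\sigma_t^2 + \sigma_\varepsilon^2}\,(b - \mu)
$$

So any bin selected on $b$ shows an apparent shift back towards $\mu$ that grows with the noise variance and with the bin's distance from $\mu$. If the bins are instead defined by a separate measurement $m = t + \varepsilon_3$ with independent noise, then $\mathbb{E}[\,c \mid m\,] = \mathbb{E}[\,b \mid m\,]$ and the expected shift is zero. Real pRF errors are unlikely to be Gaussian or symmetric, so this only shows the mechanism, not the pattern to expect in actual data.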
It is not difficult to correct this error. In the plot below, I used an independent map from yet another, third pRF mapping session to divide up the visual field. Then I calculated how the pRFs in each visual field segment shifted on average between the two experimental sessions. While some shift vectors remain, they are considerably smaller than in the earlier graph. Again, keep in mind that these are simple replication data and we would not really expect any systematic shifts. There certainly does not seem to be a very obvious pattern here – perhaps there is a bit of a clockwise shift in the right visual hemifield but that breaks down in the left. Either way, this analysis gives us an estimate of how much variability there may be in this measurement.
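The effect of binning on an independent map can be sketched in the same kind of toy simulation, here using eccentricity bins. This is a hedged illustration under assumed Gaussian truth and noise with no real change between sessions. It is not the procedure behind the plots described here, and the function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def simulate_sessions(n_sessions, n_prfs=6000, true_sd=4.0, noise_sd=1.5, seed=2):
    """Noisy maps of the same, unchanging pRF centres (assumed Gaussian model)."""
    rng = random.Random(seed)
    truth = [(rng.gauss(0, true_sd), rng.gauss(0, true_sd)) for _ in range(n_prfs)]
    return [
        [(x + rng.gauss(0, noise_sd), y + rng.gauss(0, noise_sd)) for x, y in truth]
        for _ in range(n_sessions)
    ]


def eccentricity(point):
    """Distance of a pRF centre from fixation."""
    return math.hypot(point[0], point[1])


def eccentricity_change_by_bin(binning_map, start_map, end_map, bin_width=2.0, n_bins=6):
    """Mean eccentricity change (end - start), grouping pRFs by their eccentricity in binning_map."""
    changes = [[] for _ in range(n_bins)]
    for b, s, e in zip(binning_map, start_map, end_map):
        index = int(eccentricity(b) // bin_width)
        if index < n_bins:  # ignore the sparse far periphery
            changes[index].append(eccentricity(e) - eccentricity(s))
    return [statistics.fmean(c) for c in changes if c]


session_1, session_2, session_3 = simulate_sessions(3)

# Circular: session 1 defines the bins and is also one side of the comparison.
circular = eccentricity_change_by_bin(session_1, session_1, session_2)
# Independent: session 3 defines the bins, sessions 1 and 2 are compared.
independent = eccentricity_change_by_bin(session_3, session_1, session_2)

mean_abs_circular = statistics.fmean([abs(c) for c in circular])
mean_abs_independent = statistics.fmean([abs(c) for c in independent])
print(f"circular binning:    {mean_abs_circular:.2f} deg mean |change| per bin")
print(f"independent binning: {mean_abs_independent:.2f} deg mean |change| per bin")
```

In this sketch the change per bin under independent binning should be much smaller than under circular binning, and what remains reflects sampling variability rather than a systematic shift.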
This approach of using a third, independent map loses some information because the vectors only tell you the direction and magnitude of the shifts, not exactly where the pRFs started from and where they end up. Often the magnitude and direction of the shift are all we really need to know. However, when the exact position is crucial we could use other approaches. We will explore this in greater depth in upcoming publications.
On the bright side, the example I picked here is probably extreme because I didn’t restrict these plots to a particular region of interest but used all supra-threshold voxels in the occipital cortex. A more restricted analysis would remove some of that noise – but the problem nevertheless remains. How much it skews the findings depends very much on how noisy the data are. Data tend to be less noisy in early visual cortex than in higher-level brain regions, which is where people usually find the most dramatic pRF shifts…
Correcting the literature
It is so easy to make this mistake that you can find it all over the pRF literature. Clearly, neither authors nor reviewers have given it much thought. It is definitely not confined to studies of visual attention, although this is how we stumbled across it. It could be a comparison between different analysis methods or stimulus protocols. It could be studies measuring the plasticity of retinotopic maps after visual field loss. Ironically, it could even be studies that investigate the potential artifacts when mapping such plasticity incorrectly. It is not restricted to the kinds of plots I showed here but should affect any form of binning, including the binning into eccentricity bins that is most common in the literature. We suspect the problem is also pervasive in many other fields or in studies using other techniques. Only a few years ago a similar issue was described by David Shanks in the context of studying unconscious processing. It is also related to warnings you may occasionally hear about using median splits – really just a simpler version of the same approach.
I cannot tell you if the findings from other studies that made this error are spurious. To know that, we would need access to the data and would have to reanalyse these studies. Many of them were published before data and code sharing was relatively common². Moreover, you really need to have a validation dataset, like the replication data in my example figures here. The diversity of analysis pipelines and experimental designs makes this very complex – no two of these studies are alike. The error distributions may also vary between different studies, so ideally we need replication datasets for each study.
In any case, as far as our attentional load study is concerned, after reanalysing these data with unbiased methods, we found little evidence of the effects we published originally. While there is still a hint of pRF shifts, these are no longer statistically significant. As painful as this is, we therefore retracted that finding from the scientific record. There is a great stigma associated with retraction, because of the shady circumstances under which it often happens. But to err is human – and this is part of the scientific method. As I have said many times before, science is self-correcting but that is not some magical process. Science doesn’t just happen, it requires actual scientists to do the work. While it can be painful to realise that your interpretation of your data was wrong, this does not diminish the value of this original work³ – if anything this work served an important purpose by revealing the problem to us.
We mostly stumbled across this problem by accident. Susanne Stoll and Elisa Infanti conducted a more complex pRF experiment on attention and found that the purported pRF shifts in all experimental conditions were suspiciously similar (you can see this in an early conference poster here). It took us many months of digging, running endless simulations, complex reanalyses, and sometimes heated arguments before we cracked that particular nut. The problem may seem really obvious now – it sure as hell wasn’t before all that.
This is why this erroneous practice appears to be widespread in this literature and may have skewed the findings of many other published studies. This does not mean that all these findings are false but it should serve as a warning. Ideally, other researchers will also revisit their own findings but whether or not they do so is frankly up to them. Reviewers will hopefully be more aware of the issue in future. People might question the validity of some of these findings in the absence of any reanalysis. But in the end, it doesn’t matter all that much which individual findings hold up and which don’t⁴.
Check your assumptions
I am personally more interested in taking this whole field forward. This issue is not confined to the scenario I described here. pRF analysis is often quite complex. So are many other studies in cognitive neuroscience and, of course, in many other fields as well. Flexibility in study designs and analysis approaches is not a bad thing – in fact, being able to adapt our experimental designs is essential for addressing scientific questions.
But what this story shows very clearly is the importance of checking our assumptions. This is all the more important when using the complex methods that are ubiquitous in our field. As cognitive neuroscience matures, it is critical that we adopt good practices in ensuring the validity of our methods. In the computational and software development sectors, it is to my knowledge commonplace to test algorithms on conditions where the ground truth is known, such as random and/or simulated data.
This idea is probably not even new to most people and it certainly isn’t to me. During my PhD there was a researcher in the lab who had concocted a pretty complicated analysis of single-cell electrophysiology recordings. It involved lots of summarising and recentering of neuronal tuning functions to produce the final outputs. Neither I nor our supervisor really followed every step of this procedure based only on our colleague’s description – it was just too complex. But eventually we suspected that something might be off and so we fed random numbers to the algorithm – lo and behold, the results were a picture-perfect reproduction of the purported “experimental” results. Since then, I have simulated the results of my analyses a few other times – for example, when I first started with pRF modelling or when I developed new techniques for measuring psychophysical quantities.
This latest episode taught me that we must do this much more systematically. For any new design, we should conduct control analyses to check how it behaves with data for which the ground truth is known. Such checks can reveal statistical artifacts that might hide inside the algorithm but also help you determine the method’s sensitivity and thus allow you to conduct power calculations. Ideally, we would do that for every new experiment even if it uses a standard design. I realise that this may not always be feasible – but in that case there should be a justification why it is unnecessary.
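Such a control analysis can be very small. The sketch below is one hypothetical way to set it up, not a prescription. It runs a candidate analysis many times on simulated data with a known effect (including zero) and compares what comes out with what went in. The data model, the median-split analysis used as the example, and all names and parameter values are assumptions made for illustration.

```python
import random
import statistics


def simulate(n, true_effect, rng, signal_sd=1.0, noise_sd=1.0):
    """Two noisy measurements of the same units; true_effect is added to the second one."""
    truth = [rng.gauss(0, signal_sd) for _ in range(n)]
    baseline = [t + rng.gauss(0, noise_sd) for t in truth]
    condition = [t + true_effect + rng.gauss(0, noise_sd) for t in truth]
    return baseline, condition


def median_split_change(baseline, condition):
    """Candidate analysis: mean change in the units above the baseline median."""
    cut = statistics.median(baseline)
    return statistics.fmean([c - b for b, c in zip(baseline, condition) if b > cut])


def overall_change(baseline, condition):
    """Reference analysis: mean change over all units, with no selection on a noisy measure."""
    return statistics.fmean([c - b for b, c in zip(baseline, condition)])


def ground_truth_check(analysis, true_effect, n_runs=200, n=400, seed=0):
    """Mean and SD of the analysis output across simulated data sets with a known effect."""
    rng = random.Random(seed)
    estimates = [analysis(*simulate(n, true_effect, rng)) for _ in range(n_runs)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


results = {}
for name, analysis in [("median split", median_split_change), ("overall", overall_change)]:
    for effect in (0.0, 0.5):
        results[(name, effect)] = ground_truth_check(analysis, effect)
        mean, sd = results[(name, effect)]
        print(f"{name:12s} true effect {effect:.1f}: estimate {mean:+.2f} (SD {sd:.2f})")
```

An analysis that returns a clearly nonzero estimate when the true effect is zero has a built-in artifact. The spread of the estimates across runs gives a rough idea of the method's sensitivity, which is the starting point for a power calculation.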
Because what this really boils down to is simply good science. When you use a method without checking that it works as intended, you are effectively doing a study without a control condition – quite possibly the original sin of science.
In conclusion, I quickly want to thank several people: First of all, Susanne Stoll deserves major credit for tirelessly pursuing this issue in great detail over the past two years with countless reanalyses and simulations. Many of these won’t ever see the light of day but helped us wrap our heads around what is going on here. I want to thank Elisa Infanti for her input and in particular the suggestion of running the analysis on random data – without this we might never have realised how deep this rabbit hole goes. I also want to acknowledge the patience and understanding of our co-authors on the attentional load study, Geraint Rees and Elaine Anderson, for helping us deal with all the stages of grief associated with this. Lastly, I want to thank Benjamin de Haas, the first author of that study, for honourably doing the right thing. A lesser man would have simply booked a press conference at Current Biology Total Landscaping instead to say it’s all fake news and announce a legal challenge⁵.
1. The sheer magnitude of some of these shifts may also be scientifically implausible, an issue I’ve repeatedly discussed on this blog already. Similar shifts have however been reported in the literature – another clue that perhaps something is awry in these studies…
2. Not that data sharing is enormously common even now.
3. It is also a solid data set with a fairly large number of participants. We’ve based our canonical hemodynamic response function on the data collected for this study – there is no reason to stop using this irrespective of whether the main claims are correct or not.
4. Although it sure would be nice to know, wouldn’t it?
5. Did you really think I’d make it through a blog post without making any comment like this?