
When the hole changes the pigeon

or How innocent assumptions can lead to wrong conclusions

I promised you a (neuro)science post. Don’t let the title mislead you into thinking we’re talking about world affairs and societal ills again. While pigeonholing is certainly relevant to polarised politics and social media, for once that is not what this post is about. Rather, it is about a common error in data analysis. Similar issues have been written about many times over the decades, yet, as we’ve learned the hard way, it is a surprisingly easy mistake to make. A scientific article by Susanne Stoll laying out this problem in more detail is currently available as a preprint.

Pigeonholing (Source: https://commons.wikimedia.org/wiki/File:TooManyPigeons.jpg)

Data binning

In science you often end up with large data sets, with hundreds or thousands of individual observations subject to considerable variance. For instance, in my own field of retinotopic population receptive field (pRF) mapping, a given visual brain area may have a few thousand recording sites, and each has a receptive field position. There are many other scenarios of course. It could be neural firing, or galvanic skin responses, or eye positions recorded at different time points. Or it could be hundreds or thousands of trials in a psychophysics experiment etc. I will talk about pRF mapping because this is where we recently encountered the problem and I am going to describe how it has affected our own findings – however, you may come across the same issue in many guises.

Imagine that we want to test how pRFs move around when you attend to a particular visual field location. I deliberately use this example because it is precisely what a bunch of published pRF studies did, including one of ours. There is some evidence that selective attention shifts the position of neuronal receptive fields, so it is not far-fetched that it might also shift pRFs in fMRI experiments. Our study, for instance, investigated whether pRFs shift when participants are engaged in a demanding (“high load”) task at fixation, compared to a baseline condition where they only need to detect a simple colour change of the fixation target (“low load”). Indeed, we found that across many visual areas pRFs shifted outwards (i.e. away from fixation). This suggested to us that the retinotopic map reorganises to reflect a kind of tunnel vision when participants are focussed on the central task.

What would be a good way to quantify such map reorganisation? One simple way might be to plot each pRF in the visual field with a vector showing how it is shifted under the attentional manipulation. In the graph below, each dot shows a pRF location under the attentional condition, and the line shows how it has moved away from baseline. Since there is a large number of pRFs, many of which are affected by measurement noise or other errors, these plots can be cluttered and confusing:

Plotting shift of each pRF in the attention condition relative to baseline. Each dot shows where a pRF landed under the attentional manipulation, and the line shows how it has shifted away from baseline. This plot is a hellishly confusing mess.

Clearly, we need to do something to tidy up this mess. So we take the data from the baseline condition (in pRF studies, this would normally be attending to a simple colour change at fixation) and divide the visual field up into a number of smaller segments, each of which contains some pRFs. We then calculate the mean position of the pRFs from each segment under the attentional manipulation. Effectively, we summarise the shift from baseline for each segment:

We divide the visual field into segments based on the pRF data from the baseline condition and then plot the mean shift in the experimental condition for each segment. A much clearer graph that suggests some very substantial shifts…

This produces a much clearer plot that suggests some interesting, systematic changes in the visual field representation under attention. Surely, this is compelling evidence that pRFs are affected by this manipulation?

False assumptions

Unfortunately it is not[1]. The mistake here is to assume that there is no noise in the baseline measure that was used to divide up the data in the first place. If our baseline pRF map were a perfect measure of the visual field representation, then this would have been fine. However, like most data, pRF estimates are variable and subject to many sources of error. The misestimation is also unlikely to be perfectly symmetric – for example, there are several reasons why it is more likely that a pRF will be estimated closer to central vision than in the periphery. This means there could be complex and non-linear error patterns that are very difficult to predict.

The data I showed in these figures are in fact not from an attentional manipulation at all. Rather, they come from a replication experiment where we simply measured a person’s pRF maps twice over the course of several months. One thing we do know is that pRF measurements are quite robust, stable over time, and even similar between scanners with different magnetic field strengths. What this means is that any shifts we found are most likely due to noise. They are completely artifactual.

When you think about it, this error is really quite obvious: sorting observations into clear categories can only be valid if you can be confident in the continuous measure on which you base these categories. Pigeonholing can only work if you can be sure into which hole each pigeon belongs. This error is also hardly new. It has been described in numerous forms as regression to the mean, and it rears its ugly head every few years in different fields. It is also related to circular inference, which caused quite a stir in cognitive and social neuroscience a few years ago. Perhaps the reason for this is that it is a damn easy mistake to make – but that doesn’t make the face-palming moment any less frustrating.
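To make the mechanism concrete, here is a minimal Matlab sketch with made-up numbers (it is not our actual data or analysis pipeline, and all the parameters, like the noise level and the bin edges, are arbitrary choices for illustration). We “measure” the same true eccentricities twice with independent noise, pigeonhole the pRFs by the noisy first measurement, and then look at the mean shift per bin:

% Two sessions measure the SAME true pRF eccentricities with independent noise,
% and we pigeonhole the pRFs by the noisy session-1 estimates.
rng(1);
n       = 5000;                          % number of pRFs
trueEcc = max(5 + 2*randn(n,1), 0);      % ground-truth eccentricities in degrees
sess1   = trueEcc + 1.5*randn(n,1);      % "baseline" measurement
sess2   = trueEcc + 1.5*randn(n,1);      % "experimental" measurement, same truth

edges  = 0:2:10;                         % eccentricity bins
binIdx = discretize(sess1, edges);       % bins defined on the noisy baseline
for b = 1:numel(edges)-1
    inBin = (binIdx == b);
    fprintf('Bin %g-%g deg (n = %4d): mean apparent shift = %+.2f deg\n', ...
        edges(b), edges(b+1), sum(inBin), mean(sess2(inBin) - sess1(inBin)));
end
% Nothing changed between the two sessions, yet every off-centre bin shows a
% systematic "shift" towards the mean eccentricity, simply because the bins
% were defined on a noisy measure.

The size and even the direction of such spurious shifts depend on the true distribution of positions and on the error structure, so in real data they can easily masquerade as a meaningful pattern.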

It is not difficult to correct this error. In the plot below, I used an independent map from yet another, third pRF mapping session to divide up the visual field. Then I calculated how the pRFs in each visual field segment shifted on average between the two experimental sessions. While some shift vectors remain, they are considerably smaller than in the earlier graph. Again, keep in mind that these are simple replication data and we would not really expect any systematic shifts. There certainly does not seem to be a very obvious pattern here – perhaps there is a bit of a clockwise shift in the right visual hemifield but that breaks down in the left. Either way, this analysis gives us an estimate of how much variability there may be in this measurement.

We use an independent map to divide the visual field into segments. Then we calculate the mean position for each segment in the baseline and the experimental condition, and work out the shift vector between them. For each segment, this plot shows that vector. This plot loses some information, but it shows how much and into which direction pRFs in each segment shifted on average.

This approach of using a third, independent map loses some information because the vectors only tell you the direction and magnitude of the shifts, not exactly where the pRFs started from and where they ended up. Often, the magnitude and direction of the shift are all we really need to know. However, when the exact position is crucial we could use other approaches. We will explore this in greater depth in upcoming publications.
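Continuing the same toy sketch (same made-up numbers as above), defining the bins on a third, independent measurement removes this bias, because its noise is uncorrelated with both of the sessions being compared:

sess3  = trueEcc + 1.5*randn(n,1);       % independent "localiser" session
binIdx = discretize(sess3, edges);       % bins now defined on the independent map
for b = 1:numel(edges)-1
    inBin = (binIdx == b);
    fprintf('Bin %g-%g deg (n = %4d): mean shift = %+.2f deg\n', ...
        edges(b), edges(b+1), sum(inBin), mean(sess2(inBin) - sess1(inBin)));
end
% The per-bin shifts now hover around zero, as they should for replication data.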

On the bright side, the example I picked here is probably extreme because I didn’t restrict these plots to a particular region of interest but used all supra-threshold voxels in the occipital cortex. A more restricted analysis would remove some of that noise – but the problem nevertheless remains. How much it skews the findings depends very much on how noisy the data are. Data tend to be less noisy in early visual cortex than in higher-level brain regions, which is where people usually find the most dramatic pRF shifts…

Correcting the literature

It is so easy to make this mistake that you can find it all over the pRF literature. Clearly, neither authors nor reviewers have given it much thought. It is definitely not confined to studies of visual attention, although this is how we stumbled across it. It could be a comparison between different analysis methods or stimulus protocols. It could be studies measuring the plasticity of retinotopic maps after visual field loss. Ironically, it could even be studies investigating the artifacts that can arise when such plasticity is mapped incorrectly. It is not restricted to the kinds of plots I showed here but should affect any form of binning, including the eccentricity binning that is most common in the literature. We suspect the problem is also pervasive in many other fields and in studies using other techniques. Only a few years ago a similar issue was described by David Shanks in the context of studying unconscious processing. It is also related to warnings you may occasionally hear about using median splits – really just a simpler version of the same approach.

I cannot tell you if the findings from other studies that made this error are spurious. To know that, we would need access to the data and to reanalyse these studies. Many of them were published before data and code sharing was relatively common[2]. Moreover, you really need to have a validation dataset, like the replication data in my example figures here. The diversity of analysis pipelines and experimental designs makes this very complex – no two of these studies are alike. The error distributions may also vary between different studies, so ideally we need replication datasets for each study.

In any case, as far as our attentional load study is concerned, after reanalysing these data with unbiased methods, we found little evidence of the effects we published originally. While there is still a hint of pRF shifts, these are no longer statistically significant. As painful as this is, we therefore retracted that finding from the scientific record. There is a great stigma associated with retraction, because of the shady circumstances under which it often happens. But to err is human – and this is part of the scientific method. As I have said many times before, science is self-correcting, but that is not some magical process. Science doesn’t just happen; it requires actual scientists to do the work. While it can be painful to realise that your interpretation of your data was wrong, this does not diminish the value of this original work[3] – if anything, this work served an important purpose by revealing the problem to us.

We mostly stumbled across this problem by accident. Susanne Stoll and Elisa Infanti conducted a more complex pRF experiment on attention and found that the purported pRF shifts in all experimental conditions were suspiciously similar (you can see this in an early conference poster here). It took us many months of digging, running endless simulations, complex reanalyses, and sometimes heated arguments before we cracked that particular nut. The problem may seem really obvious now – it sure as hell wasn’t before all that.

This is why this erroneous practice appears to be widespread in this literature and may have skewed the findings of many other published studies. This does not mean that all these findings are false, but it should serve as a warning. Ideally, other researchers will also revisit their own findings, but whether or not they do so is frankly up to them. Reviewers will hopefully be more aware of the issue in future. People might question the validity of some of these findings in the absence of any reanalysis. But in the end, it doesn’t matter all that much which individual findings hold up and which don’t[4].

Check your assumptions

I am personally more interested in taking this whole field forward. This issue is not confined to the scenario I described here. pRF analysis is often quite complex. So are many other studies in cognitive neuroscience and, of course, in many other fields as well. Flexibility in study designs and analysis approaches is not a bad thing – being able to adapt our experimental designs to the scientific question at hand is in fact essential.

But what this story shows very clearly is the importance of checking our assumptions. This is all the more important when using the complex methods that are ubiquitous in our field. As cognitive neuroscience matures, it is critical that we adopt good practices in ensuring the validity of our methods. In the computational and software development sectors, it is to my knowledge commonplace to test algorithms on conditions where the ground truth is known, such as random and/or simulated data.

This idea is probably not even new to most people and it certainly isn’t to me. During my PhD there was a researcher in the lab who had concocted a pretty complicated analysis of single-cell electrophysiology recordings. It involved lots of summarising and recentering of neuronal tuning functions to produce the final outputs. Neither I nor our supervisor really followed every step of this procedure based only on our colleague’s description – it was just too complex. But eventually we suspected that something might be off, so we fed random numbers to the algorithm – lo and behold, the results were a picture-perfect reproduction of the purported “experimental” results. Since then, I have simulated the results of my analyses a few other times – for example, when I first started with pRF modelling or when I developed new techniques for measuring psychophysical quantities.

This latest episode taught me that we must do this much more systematically. For any new design, we should conduct control analyses to check how it behaves with data for which the ground truth is known. It can reveal statistical artifacts that might hide inside the algorithm but also help you determine the method’s sensitivity and thus allow you to conduct power calculations. Ideally, we would do that for every new experiment even if it uses a standard design. I realise that this may not always be feasible – but in that case there should be a justification why it is unnecessary.
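One generic way of doing this, sketched here with the toy binned-shift analysis from above standing in for a real pipeline (the sample size, noise level and number of simulations are again arbitrary), is to run the whole analysis many times on simulated null data and record the effect size it reports each time:

nSim     = 1000;                                   % number of null simulations
n        = 5000;
edges    = 0:2:10;
maxShift = zeros(nSim,1);
for i = 1:nSim
    ecc = max(5 + 2*randn(n,1), 0);                % ground truth
    s1  = ecc + 1.5*randn(n,1);                    % two null "sessions":
    s2  = ecc + 1.5*randn(n,1);                    % no true change at all
    idx = discretize(s1, edges);                   % biased binning, as before
    shifts = arrayfun(@(b) mean(s2(idx == b) - s1(idx == b)), 1:numel(edges)-1);
    maxShift(i) = max(abs(shifts));
end
sorted = sort(maxShift);
fprintf('Largest apparent shift under the null: median %.2f deg, 95th percentile %.2f deg\n', ...
    median(maxShift), sorted(ceil(0.95*nSim)));
% If an analysis reports "effects" of this size when fed null data, real results
% of a similar magnitude should be treated with suspicion; adding a simulated
% effect of known size to the generator likewise gives a handle on sensitivity.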

Because what this really boils down to is simply good science. When you use a method without checking that it works as intended, you are effectively doing a study without a control condition – quite possibly the original sin of science.

Acknowledgements

In conclusion, I quickly want to thank several people: First of all, Susanne Stoll deserves major credit for tirelessly pursuing this issue in great detail over the past two years with countless reanalyses and simulations. Many of these won’t ever see the light of day but they helped us wrap our heads around what is going on here. I want to thank Elisa Infanti for her input and in particular the suggestion of running the analysis on random data – without this we might never have realised how deep this rabbit hole goes. I also want to acknowledge the patience and understanding of our co-authors on the attentional load study, Geraint Rees and Elaine Anderson, for helping us deal with all the stages of grief associated with this. Lastly, I want to thank Benjamin de Haas, the first author of that study, for honourably doing the right thing. A lesser man would have simply booked a press conference at Current Biology Total Landscaping instead to say it’s all fake news and announce a legal challenge[5].

Footnotes:

  1. The sheer magnitude of some of these shifts may also be scientifically implausible, an issue I’ve repeatedly discussed on this blog already. Similar shifts have however been reported in the literature – another clue that perhaps something is awry in these studies…
  2. Not that data sharing is enormously common even now.
  3. It is also a solid data set with a fairly large number of participants. We’ve based our canonical hemodynamic response function on the data collected for this study – there is no reason to stop using this irrespective of whether the main claims are correct or not.
  4. Although it sure would be nice to know, wouldn’t it?
  5. Did you really think I’d make it through a blog post without making any comment like this?

Visualising group data

Recently I have been thinking a bit about the best way to represent group data. The most typical way this is done is by showing summary statistics (usually the mean) and error bars (usually standard errors), either in bar plots or in plots with lines and symbols. A lot of people seem to think this is not an appropriate way to visualise results because it obscures the data distribution and whether outliers may be influencing the results. One reason prompting me to think about this is that in at least one of our MSc courses students are explicitly told by course tutors that they should be plotting individual subject data. It is certainly true that close inspection of your data is always important – but I am not convinced that this is the only and best way to represent all sorts of data. In particular, looking at the results from a recent student’s experiment, you couldn’t make heads or tails of them just by plotting the individual data. Part of the reason is that most of the studies we do use within-subject designs, and standard ways of plotting individual data points can actually be misleading. There are probably better ways, and perhaps my next post will deal with that.

For now, though, I want to consider only group data that were actually derived from between-subject or at least mixed designs. A recently published study in Psychological Science reported that sad people are worse at discriminating colours along the blue-yellow colour axis but not along the red-green colour axis. This sparked a lot of discussion on Twitter and in the blogosphere, for example this post by Andrew Gelman and also this one by Daniel Lakeland. Publications like this tend to attract a lot of coverage by mainstream media and this was no exception. This then further fuels the rage of skeptical researchers :P. There are a lot of things to debate here, from the fact that the study authors treat a difference between differences as significant without actually testing the interaction, to the potential inadequacy of the general procedure for measuring perceptual differences (raw accuracy rather than a visual threshold measure), to the possibility that outliers may contribute to the main result. I won’t go into this discussion but I thought this data set (which to the authors’ credit is publicly available) would be a good example for my musings.

So here I am representing the data from their first study by plotting them in four different ways. The first plot, in the upper left, is a bar plot showing the means and standard errors for the different experimental conditions. The main result in the article is that the difference between the control and sadness groups is significant for discriminating colours along the blue-yellow axis (the two bars on the left).

The four plots of the data from the first study: bar graph (upper left), individual data points (upper right), box-and-whisker plot (lower left), and cat-eye plot (lower right).

And judging by the bar graph you could certainly be forgiven for thinking so (I am using the same truncated scale used in the original article). The error bars seem reasonably well separated and this comparison is in fact statistically significant at p=0.0147 on a parametric independent sample t-test or p=0.0096 on a Mann-Whitney U-test (let’s ignore the issue of the interaction for this example).

Now consider the plot in the upper right though. Here we have the individual data points for the different groups and conditions. To give an impression of how the data are distributed, I added a little Gaussian noise to the x-position of each point. The data are evidently quite discrete due to the relatively small number of trials used to calculate the accuracy for every subject. Looking at the data in this way does not seem to give a very clear impression that there is a substantial difference between the control and sadness groups in either colour condition. The most noticeable difference is that there is one subject in the sadness group, at 0.58 accuracy, whose accuracy is not matched by any counterpart in the control group. Is this an outlier pulling the result?

Next I generated a box-and-whisker plot in the lower left panel. The boxes in these plots denote the inter-quartile range (IQR, i.e. between the 25th and 75th percentiles of the data), the red lines indicate the medians, the whiskers denote a range of 1.5 times the IQR beyond those percentiles (although they are curtailed when there are no data points beyond that range, as happens here at the ceiling of 1), and the red crosses are outliers that fall outside this range. The triangular notches surrounding the medians represent uncertainty, and if they do not overlap (as is the case for the blue-yellow data) this suggests a difference between the medians at the 5% significance level. Clearly the data point at 0.58 accuracy in the sadness group is considered an outlier in this plot, although it is not the only one.

Finally, I also wrote a Matlab function to create cat-eye plots (Wikipedia calls those violin plots – personally they look mostly like bottles, amphoras or vases to me – or, in this case, like balloons). This is shown in the lower right panel. These plots show the distribution of the data in each condition smoothed by a kernel density estimate. The filled circles indicate the medians, the vertical lines the inter-quartile ranges, and the asterisks the means. Plots like this seem to be becoming more popular lately. They do have the nice feature that they give a fairly direct impression of how the data are distributed. It seems fairly clear that these are not normal distributions, which probably has largely to do with the ceiling effect: as accuracy cannot be higher than 1, the distributions are truncated there. The critical data set, the blue-yellow discrimination for the sadness group, has a fairly thick tail towards the bottom which is at least partially due to that outlier. This all suggests that the traditional t-test was inappropriate here, but then again we did see a significant difference on the U-test. And certainly, visual inspection still suggests that there may be a difference here.
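For anyone who wants to tinker with these formats, here is a rough Matlab sketch of the four panel types. It is not the actual code linked at the end of this post, and it uses simulated numbers in the spirit of the final example below rather than the real data; the group labels and scaling factors are arbitrary, and boxplot, prctile and ksdensity require the Statistics Toolbox:

% Simulated numbers in the spirit of the final example (not the study's data)
rng(1);
grp  = {'Ctrl BY','Sad BY','Ctrl RG','Sad RG'};          % hypothetical labels
data = 70 + 10*randn(50,4) + repmat([5 -5 0 0], 50, 1);  % 50 "subjects" x 4 cells

figure;
% 1) Bar graph of means with standard errors
subplot(2,2,1); hold on;
m  = mean(data);
se = std(data) / sqrt(size(data,1));
bar(m);
errorbar(1:4, m, se, 'k.');
set(gca, 'XTick', 1:4, 'XTickLabel', grp); title('Mean \pm SEM');

% 2) Individual data points with horizontal jitter
subplot(2,2,2); hold on;
for c = 1:4
    scatter(c + 0.08*randn(50,1), data(:,c), 20, 'filled');
end
set(gca, 'XTick', 1:4, 'XTickLabel', grp); title('Individual data');

% 3) Notched box-and-whisker plot
subplot(2,2,3);
boxplot(data, 'Notch', 'on', 'Labels', grp); title('Box & whiskers');

% 4) Cat-eye (violin) plot built from a kernel density estimate
subplot(2,2,4); hold on;
for c = 1:4
    [f, xi] = ksdensity(data(:,c));
    f  = 0.4 * f(:)' / max(f);                           % scale the half-width
    xi = xi(:)';
    fill([c - f, fliplr(c + f)], [xi, fliplr(xi)], [0.7 0.8 1]);
    q = prctile(data(:,c), [25 75]);                     % inter-quartile range
    plot([c c], q, 'k-', 'LineWidth', 2);
    plot(c, median(data(:,c)), 'ko', 'MarkerFaceColor', 'k');  % median
    plot(c, mean(data(:,c)), 'k*');                             % mean
end
set(gca, 'XTick', 1:4, 'XTickLabel', grp); title('Cat-eye plots');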

Next I decided to see what happens if I remove this outlier at 0.58. For consistency, I also removed this subject’s data from the red-green data set. This change does not alter the statistical inference in a qualitative way, even though the p-values increase slightly. The t-test is still significant at p=0.0259 and the U-test at p=0.014.

The same four plots after removing the outlier at 0.58 accuracy.

Again, the bar graph shows a fairly noticeable difference. The scatter plot of the individual data points, on the other hand, now hardly seems to show any difference. Both the whisker and the cat-eye plots still show qualitatively similar results to when the outlier was included. There seems to be a difference in medians for the blue-yellow data set. The cat-eye plot makes it more apparent that the tail of the distribution for the sadness group is quite heavy, something that isn’t as clear in the whisker plot.

Finally, I decided to simulate a new data set with a similar pattern of results but in which I knew the ground truth. All four data sets contained 50 data points that were drawn from a Gaussian distribution with a mean of 70 and a standard deviation of 10 (I am a moron and therefore generated these on a scale of percent rather than proportion correct – and now I’m too lazy to replot all this just to correct it. It doesn’t matter really). For the control group in the blue-yellow condition I added an offset of 5, while in the sadness group I subtracted 5. This means that there is a significant difference (t-test: p=0.0017; U-test: p=0.0042).
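A minimal sketch of how such a simulated data set could be generated and tested (the means, offsets and sample size are taken from the description above; the exact p-values will depend on the random seed, and ttest2 and ranksum require the Statistics Toolbox):

rng(1);
ctrlBY = 75 + 10*randn(50,1);       % control, blue-yellow: true mean 70 + 5
sadBY  = 65 + 10*randn(50,1);       % sadness, blue-yellow: true mean 70 - 5
ctrlRG = 70 + 10*randn(50,1);       % control, red-green
sadRG  = 70 + 10*randn(50,1);       % sadness, red-green

[~, pT] = ttest2(ctrlBY, sadBY);    % independent-samples t-test
pU      = ranksum(ctrlBY, sadBY);   % Mann-Whitney U-test
fprintf('Blue-yellow: t-test p = %.4f, U-test p = %.4f\n', pT, pU);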

The same four plots for the simulated data set with a known group difference.

Now all four types of plot reflect the difference between the control and sadness groups fairly clearly. The bar graph in particular closely reflects the true population means in each group. Even in the scatter plot the difference is apparent, despite the considerable overlap between the distributions. The difference seems a lot less obvious in the whisker and cat-eye plots, however. The notches in the whisker plot do not overlap, although they come very close. The difference is visually more striking in the cat-eye plot, but it isn’t immediately apparent from the plot how much confidence this should instill in the result.

Conclusions & Confusions

My preliminary conclusion is that all of this is actually more confusing than I thought. I am inclined to agree that the bar graph (or a similar symbol-and-error-bar plot) seems to overstate the strength of the evidence somewhat (although one should note that this is partly because of the truncated y-scales that such plots usually employ). On the other hand, showing the individual subject data does seem to understate the results considerably, except when the effect is pretty strong. So perhaps things like whisker or cat-eye (violin/bottle/balloon) plots are the most suitable, but in my view they also aren’t as intuitive as some people seem to suggest. Obviously, I am not the first person who has thought about these things, nor have I spent an extraordinarily long time thinking about it. It might be useful to conduct an experiment/survey in which people have to judge the strength of effects based on different kinds of plot. Anyway, in general I would be very curious to hear other people’s thoughts.

The Matlab code and data file for these examples can be found here.