If you’re on cog neuro twitter, you may have already come across an ongoing debate about a Verification Report by Chalkia et al. published in Cortex in 2020. A Verification Report is a new format at Cortex in which researchers can publish their attempts at reproducing the results of previously published research, in this case an influential study by Schiller et al., published in Nature in 2010. The report sparked a heated debate between its authors, the original authors, and the handling editors at Cortex, which is still ongoing. While I am an editor at Cortex and work closely with the editors handling this particular case, I was not involved in that process. The research in question is outside my immediate expertise and there are a lot of gory details that I am ill-prepared to discuss. There is also room for yet another debate on how such scientific debates should be conducted – I don’t want to get into any of that here either.
However, this case reminded me of considerations I’ve had for quite some time. Much of the debate revolves around the criteria by which data from participants were excluded in the original study. A student of mine has also been struggling in the past few months with the issue of data cleaning – specifically, removing outlier data that clearly result from artifacts, which would contaminate the results and lead us to draw undeniably false conclusions.
Data cleaning and artifact exclusion are important. You wouldn’t draw a star chart using a dirty telescope – otherwise that celestial object might just be a speck of dust on the lens. Good science involves checking that your tools work and removing data when things go wrong (and go wrong they inevitably will, even with the best efforts to maintain your equipment and ensure high data quality). In visual fMRI studies, some of the major measurement artifacts result from excessive head motion or poor fixation compliance. In psychophysical experiments, a lot depends on participants reliably doing the tasks (some of which can be quite arduous), maintaining stable eye gaze, and even introspecting accurately about their perceptual experience. In EEG experiments, poor conductance in some electrodes may produce artifacts, and so on.
So it is obvious that sometimes data must be removed from a data set before we can make any inferences about our research question. The real question is the right way to go about doing that. An obvious answer is that exclusion criteria must be defined a priori, before any data collection takes place. In theory, this is where piloting is justified: it can inform the range of parameters to be used in the actual experiments. A Registered Report allows researchers to prespecify these criteria, vetted by independent reviewers. However, even long before anybody was talking about preregistration, researchers knew that data exclusion criteria should be defined upfront (it was literally one of the first things my PhD supervisor taught me).
Unfortunately, this is often not realistic. Your perfectly thought-out criteria may simply prove inappropriate in the face of real data. Your pilot experiment is likely “cleaner” than the final data set because you probably have a limited pool of participants to work with. No level of scrutiny by reviewers and editors of a Registered Report submission can foresee all eventualities; in fact, we sometimes fail miserably at predicting what will happen (I wonder to what extent this speaks to the debates I’ve had in the past about the possibility of precognition :P).
An example from psychophysics
So what do you do then? In some cases the decision could be obvious. To borrow an example from a previous post (which I’m too lazy to dig up as it’s deep in the history of this blog), imagine I am running a psychophysics experiment. Each participant’s resulting plot should look something like this:
The x-axis is the stimulus level we presented to the participant, and the y-axis is their behavioural choice. But don’t worry about the details. The point is that the curve should have a sigmoidal shape. It might vary in terms of slope and where the 50% threshold (red dotted line) is but it generally should look similar to this plot. Now imagine one participant comes out like this:
This is obviously rubbish. Even if the choices that are “obviously test/reference” are not really as obvious to the average participant as they are to the experimenter (and this happens often…), the curve should nonetheless have a general sigmoid shape, ranging from close to 0% on the left to close to 100% on the right. It most certainly should not start at 50% on the left and rise from there. I don’t know what happened in this particular case (in fact I don’t even recall which experiment I took this from), but the most generous explanation is that the participant misunderstood the task instructions. Perhaps a more likely scenario is that they were simply bored, didn’t do the task at all, and pressed response buttons at random. While this doesn’t explain why the curve still rises to the right, that could be because they did do the task to begin with – performing adequately on a few trials where “obviously test” was the correct response – but then gave up on it after a while.
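To make the shape argument concrete, here is a minimal numpy sketch of the two curves (all parameter values are made up for illustration). A crude check on the left and right asymptotes is enough to separate a well-behaved observer from one whose curve starts at 50%:

```python
import numpy as np

def psychometric(x, threshold, slope):
    """Logistic psychometric function: P(choosing 'test') at stimulus level x."""
    return 1.0 / (1.0 + np.exp(-slope * (x - threshold)))

levels = np.linspace(-3, 3, 13)                   # stimulus levels (arbitrary units)
good = psychometric(levels, 0.0, 2.0)             # well-behaved observer
bad = 0.5 + 0.5 * psychometric(levels, 0.0, 2.0)  # curve rising from 50% -- the artifact

# Crude sanity check: the curve should start well below 50% and end well above it
def asymptotes_ok(p, lo=0.25, hi=0.75):
    return p[0] < lo and p[-1] > hi

print(asymptotes_ok(good))  # True
print(asymptotes_ok(bad))   # False: the left asymptote sits near 50%
```

This is of course a sketch, not a recipe: real psychometric fits have lapse and guess rates, and where exactly to put the asymptote cut-offs is itself a judgment call.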
Outlier removal and robust statistics
It doesn’t matter. Whatever the reason, this data set is clearly unusable. Simple visual inspection shows this, but such inspection is obviously subjective. This is a pretty clear example, but where should we draw the line between a bad data set and a good one? Interestingly, a more objective, quantitative way to address this – the goodness of the sigmoidal curve fit – would be entirely inappropriate: the fit here may not be perfect, but the sigmoid is still a pretty close description of the data, so a goodness-of-fit criterion would never flag it. No, our decision to reject this data set must be contingent on the outcome variables, the parameter fits of the sigmoid (the slope and threshold). But outcome-contingent decisions are precisely one of those cardinal sins of analytic flexibility that we are supposed to avoid.
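A small simulation illustrates why goodness-of-fit cannot do the job here (hypothetical numbers, numpy only, with a crude grid search standing in for a proper optimiser): a sigmoid with a free lower asymptote fits the “bad” participant’s data almost perfectly, so R² looks excellent – only the fitted parameter gives the game away.

```python
import numpy as np

rng = np.random.default_rng(0)
levels = np.linspace(-3, 3, 13)

def sigmoid(x, lo, threshold, slope):
    """Sigmoid with a free lower asymptote `lo` and upper asymptote 1."""
    return lo + (1.0 - lo) / (1.0 + np.exp(-slope * (x - threshold)))

# Simulated "bad" participant whose curve rises from 50% instead of 0%:
bad = sigmoid(levels, 0.5, 0.0, 2.0) + rng.normal(0, 0.02, levels.size)

# Crude grid search for the best-fitting parameters:
best = min(
    ((lo, th, sl)
     for lo in np.linspace(0, 0.6, 13)
     for th in np.linspace(-1, 1, 11)
     for sl in np.linspace(0.5, 4.0, 8)),
    key=lambda p: np.sum((bad - sigmoid(levels, *p)) ** 2),
)
resid = np.sum((bad - sigmoid(levels, *best)) ** 2)
r2 = 1.0 - resid / np.sum((bad - bad.mean()) ** 2)

# Goodness of fit looks excellent; only the parameter reveals the problem:
print(round(r2, 3))  # close to 1
print(best[0])       # fitted lower asymptote near 0.5, not ~0
```

The point of the sketch is exactly the asymmetry in the text: the decision to reject has to look at what the fit *says* (the parameters), not at how well it fits.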
There are quantitative ways to clean such parameters posthoc by defining a criterion for outliers. Often this will take the form of using the dispersion of the data (e.g., standard deviation, median absolute deviation, or some such measure) to reject values that fall a certain number of dispersion units from the group mean/median – for instance, people might remove all participants whose values fall ±2 SDs from the mean. This can be reasonable in some situations because it effectively trims the distribution of your values. However, there is ample flexibility in where you set this criterion, and looking through the literature you will probably find anything from 1 to 3 SDs being used without any justification. Moreover, because your measure of dispersion is inherently related to how the data are distributed, it is obviously affected by the outliers themselves. A few very extreme and influential outliers can therefore mean that some obviously artifactual data are not flagged as outliers at all. To address this issue, several years ago we proposed to use bootstrapping to estimate the dispersion using a measure we called Shepherd’s Pi (the snark was strong with me then…).
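To see how extreme outliers can sabotage an SD-based criterion, here is a toy example (the threshold values are invented for illustration) comparing it with a criterion based on the median absolute deviation:

```python
import numpy as np

# Hypothetical thresholds from 9 participants; the last two are artifactual.
thresholds = np.array([1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.1, 50.0, 48.0])

# Classic criterion: flag values beyond mean +/- 2 SD.
m, s = thresholds.mean(), thresholds.std()
sd_flags = np.abs(thresholds - m) > 2 * s
print(sd_flags.sum())  # 0 -- the outliers inflate the SD and escape detection

# Robust alternative: distance from the median in MAD units.
med = np.median(thresholds)
mad = np.median(np.abs(thresholds - med))
robust_z = np.abs(thresholds - med) / (1.4826 * mad)  # 1.4826 scales MAD to ~SD
mad_flags = robust_z > 3
print(mad_flags.sum())  # 2 -- both artifactual values are flagged
```

The two extreme values drag the mean to ~11.7 and the SD to ~20, so neither falls outside ±2 SDs; the MAD, being based on medians, is barely moved by them.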
In general, there are of course robust and/or non-parametric statistical tests that can help you make inferences in cases like this. While the details differ, what they have in common is that they remain robust in the face of certain kinds of outliers or unusual data distributions. Some of them are so fierce that they will remove the majority of your observations – which is clearly absurd and suggests that the test is ill-suited for that particular situation, at least. There is a whole statistical literature on the development of these tests and comparisons of their performance. To anyone but a stats aficionado, it is very tedious and full of names and Greek letters (hence our snark…). What these kinds of robust statistics (and more arbitrary data cleaning methods like a dispersion criterion) are good for is robustness checks, a type of multiverse analysis where you compare the outcomes of a range of tests to see if your results depend strongly on the particular test used. It seems sensible to include such checks, provided they are done transparently. There are obviously also situations where you might want to prespecify a particular test in a Registered Report: for instance, if you expect a lot of outliers in a correlation, you might preregister upfront that you will use Spearman’s correlation instead of the standard Pearson’s. You might even go so far as to adopt a particular robust test a priori, even though it means sacrificing statistical power for improved robustness.
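The Spearman example can be sketched quickly with simulated data (numpy only; Spearman computed by hand as Pearson on the ranks, which is valid here because the values are continuous and untied):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = x + rng.normal(0, 1, 20)   # a genuine, strong positive relationship...
y[-1] = -100.0                 # ...except for one artifactual value

# Pearson is dragged around by the single outlier:
pearson = np.corrcoef(x, y)[0, 1]

# Spearman is Pearson computed on ranks, so it barely notices:
rx = x.argsort().argsort()
ry = y.argsort().argsort()
spearman = np.corrcoef(rx, ry)[0, 1]

print(round(pearson, 2), round(spearman, 2))  # Pearson low, Spearman still high
```

One wild value is enough to wipe out (or even flip the sign of) the Pearson correlation, while the rank-based version only loses a little – which is precisely the power-for-robustness trade-off mentioned above.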
(Im-)practicalities and a way forward?
But the real issue is that this is clearly not a solution for the actual problem. In our example above, the blue data set is crap. Whatever your statistical criteria or robust statistical tests tell you, the blue data set should never enter our inference in the first place. Again, this is still a rather extreme example. It isn’t hard to justify posthoc why the curve shouldn’t start at 50% on the left – it violates the entire concept of the experiment. But most real situations are not so obvious. Sometimes we can argue that if participants perform considerably below ceiling (or above floor) for the “obvious” stimulus levels (the left and right sides of the plot in my example), that constitutes valid grounds for exclusion. We had this in a student project a few years ago, where some participants performed at levels that – if they had done the task properly – could really only mean they were legally blind (which they obviously weren’t). It seems justified to exclude those participants, but it probably also underlines issues with the experiment: either participants had a tendency to misunderstand the task instructions, the task was too difficult for them, or the experimenter didn’t ensure that they did the experiment properly. If there are a lot of such participants, this flags a problem that you probably want to fix in future experiments. In the worst case, it casts serious doubt on the whole experiment: if there are such problems for a lot of people, how confident can we be that they aren’t affecting the “good” data sets too, albeit in more subtle ways?
Perhaps the best way to deal with such problems is in the context of a Registered Report. This may seem counterintuitive. There have been numerous debates on how preregistration may stifle the flexibility needed to deal with real data like this (I have been guilty of perpetuating these discussions myself for a time). This view still prevails in some corners. But actually the exact opposite is true. From my own (admittedly still limited) experience editing Registered Reports, I can say that it is not uncommon for authors to request alterations after they have started data collection. In most cases, these are such minor changes that I wouldn’t bother the original reviewers with them. They can simply be addressed by a footnote in the final submission. But of course in cases like our example here, where your dream of the preregistered experiment crashes against the rocky shores of real data, it may be necessary to send an alteration back to the reviewers who approved the preregistered design. There are valid scientific debates about which data should be included and excluded, and there will inevitably be differences in opinion. That is completely fine. The point is that in this scenario you have a transparent record of the changes and why they were made, and you have independent experts weighing these arguments before the final outcome is known. Of course, a more practical solution could be to simply include the new outlier criterion as an additional, exploratory analysis in your final manuscript alongside the one you originally preregistered. However, in a situation like my example here this seems ill-advised: if the preregistered approach includes data that obviously shouldn’t be included, it isn’t worth very much. Some RR editors may take a stricter view on this, but I’d rather have an adapted design where the reasons for changes are clearly justified and transparent, and anyone can check the effects of these changes for themselves with the published data sets.
Transparency is really key. A Registered Report will provide some safety nets in terms of ensuring people declare the changes that were made and that they were reviewed by independent experts. But even in a classical (or exploratory) research article, the solution to this problem is simply transparent reporting. If data must be removed for justifiable posthoc reasons, then this should be declared explicitly. The excluded data should be made available so others can inspect it for themselves. Your mileage may vary with regard to some choices, but as long as it is clear why certain decisions were made this is simply a matter for debate. The big problem with all this is that the incentives in scientific publishing still work against this. As far as many high impact journals are concerned, data cleanliness is next to godliness. However justified the reasons may be, a study where some data were excluded posthoc inevitably looks messier than one where those exclusions are simply hidden.
Some have argued for rather Orwellian measures of publishing lab books with detailed records alongside research studies. I haven’t used a “lab book” since the mid-noughties. We have digital records of the experiments we conducted, which are less prone to human error. Insofar as this is possible, such records are shared as part of published data sets. I have actually heard of some people who upload each experimental data set automatically as soon as it is collected. For one thing, this is only realistic for relatively small data (e.g. psychophysics or some cognitive psychology experiments perhaps). It doesn’t seem particularly feasible for a lot of research producing big files. It also just seems over-the-top to me. More importantly, this might even be illegal in some jurisdictions (data sharing rules where I live are rather strict – in fact, I sometimes wonder whether some of my open-science-oriented colleagues are breaking the law).
In the end, such ideas seem to treat the symptom and not the cause. We need to change the climate to make research more transparent. For this it is imperative that editors and reviewers remember that real data are messy. Obviously, if a result only emerges with extensive posthoc exclusions and the problems with the data set are mounting, there can be good reasons to question the finding. But it is crucial that researchers feel free to do the right thing.