
What is the right way to clean your data?

If you’re on cog neuro twitter, you may have already come across an ongoing debate about a Verification Report by Chalkia et al. published in Cortex in 2020. A Verification Report is a new format at Cortex in which researchers can publish their attempts at reproducing the results of previously published research, in this case an influential study by Schiller et al., published in 2010 in Nature. The report sparked a heated debate between its authors, the original authors, and the handling editors at Cortex, which is still ongoing. While I am an editor at Cortex and work closely with the editors handling this particular case, I was not involved in that process. The particular research in question is outside my immediate expertise and there are a lot of gory details that I am ill-prepared to discuss. There is also room for yet another debate on how such scientific debates should be conducted – I don’t want to get into any of that here either.

However, this case reminded me of considerations I’ve had for quite some time. Much of the debate on this case revolves around the criteria by which data from participants were excluded in the original study. A student of mine has also been struggling in the past few months with the issue of data cleaning – specifically, removing outlier data that clearly result from artifacts, which would undeniably contaminate the results and lead us to draw false conclusions.

Data cleaning and artifact exclusion are important. You wouldn’t draw a star chart using a dirty telescope – otherwise that celestial object might just be a speck of dust on the lens. Good science involves checking that your tools work and removing data when things go wrong (and go wrong they inevitably will, even with the best efforts to maintain your equipment and ensure high data quality). In visual fMRI studies, some of the major measurement artifacts result from excessive head motion or poor fixation compliance. In psychophysical experiments, a lot depends on how reliably participants do the tasks (some of which can be quite arduous), maintain stable eye gaze, and even introspect about their perceptual experience. In EEG experiments, poor conductance in some electrodes may produce artifacts, and so on.

So it is obvious that sometimes data must be removed from a data set before we can make any inferences about our research question. The real question is what is the right way to go about doing that. An obvious answer is that these criteria must be defined a priori, before any data collection takes place. In theory, this is where piloting is justified and could inform the range of parameters to be used in the actual experiments. A Registered Report would allow researchers to prespecify these criteria, vetted by independent reviewers. However, even long before anybody was talking about preregistration, researchers knew that data exclusion criteria should be defined upfront (it was literally one of the first things my PhD supervisor taught me).

Unfortunately, in truth this is not realistic. Your perfectly thought-out criteria may simply be inappropriate in the face of real data. Your pilot experiment is likely “cleaner” than the final data set because you probably have a limited pool of participants to work with. No level of scrutiny by reviewers and editors of a Registered Report submission can foresee all eventualities, and in fact such scrutiny sometimes fails miserably at predicting what will happen (I wonder to what extent this speaks to the debates I’ve had in the past about the possibility of precognition :P).

An example from psychophysics

So what do you do then? In some cases the decision could be obvious. To borrow an example from a previous post (that I’m too lazy to dig up as it’s deep in the history of this blog), imagine I am running a psychophysics experiment. The resulting plot for each participant should look something like this:

The x-axis is the stimulus level we presented to the participant, and the y-axis is their behavioural choice. But don’t worry about the details. The point is that the curve should have a sigmoidal shape. It might vary in terms of slope and where the 50% threshold (red dotted line) is but it generally should look similar to this plot. Now imagine one participant comes out like this:

This is obviously rubbish. Even if the choices of what is “obviously test/reference” are not really as obvious to the random participant as they are to the experimenter (and this happens often…), the curve should nonetheless have a general sigmoid shape and range from close to 0% on the left to close to 100% on the right. It most certainly should not start on the left at 50% and rise from there. I don’t know what happened in this particular case (in fact I don’t even recall which experiment I took this from), but the most generous explanation is that the participant misunderstood the task instructions. Perhaps a more likely scenario is that they were simply bored, didn’t do the task at all, and pressed response buttons at random. That doesn’t explain why the curve still rises to the right, but perhaps they did do the task to begin with – performing adequately on a few trials where “obviously test” was the correct response – and then gave up on it after a while.

Outlier removal and robust statistics

It doesn’t matter. Whatever the reason, this data set is clearly unusable. A simple visual inspection shows this, but visual inspection is obviously subjective. This is a pretty clear-cut example, but where should we draw the line between a bad data set and a good one? Interestingly, a seemingly more objective, quantitative criterion, the goodness of the sigmoidal curve fit, would be entirely inappropriate here. The fit to these data may not be perfect, but it is clearly still a pretty close description of them. No, our decision to reject this data set must be contingent on the outcome variables, the parameter fits of the sigmoid (the slope and threshold). But outcome-contingent decisions are precisely one of those cardinal sins of analytic flexibility that we are supposed to avoid.
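To make this concrete, here is a minimal sketch of such a fit (in Python with NumPy/SciPy and entirely made-up numbers; nothing here depends on the particular software, so treat this as an illustration rather than our actual analysis code). It returns the two outcome variables, threshold and slope, and also computes the goodness-of-fit which, as argued above, is a poor exclusion criterion on its own.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, threshold, slope):
    """Cumulative logistic psychometric function: P(choosing 'test') at stimulus level x."""
    return 1.0 / (1.0 + np.exp(-slope * (x - threshold)))

# made-up data: stimulus levels and the proportion of 'test' choices at each level
levels = np.linspace(-3, 3, 9)
p_test = np.array([0.02, 0.05, 0.10, 0.30, 0.55, 0.75, 0.90, 0.97, 1.00])

# the fit yields the two outcome variables discussed above
(threshold, slope), _ = curve_fit(logistic, levels, p_test, p0=[0.0, 1.0])

# goodness of fit (R squared): note that even a rubbish participant can
# produce a respectable value here, which is why this alone cannot decide
# whether a data set should be excluded
residuals = p_test - logistic(levels, threshold, slope)
r_squared = 1 - np.sum(residuals**2) / np.sum((p_test - np.mean(p_test))**2)
print(f"threshold = {threshold:.2f}, slope = {slope:.2f}, R^2 = {r_squared:.3f}")
```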

There are quantitative ways to clean such parameters posthoc by defining a criterion for outliers. Often this will take the form of using the dispersion of the data (e.g., standard deviation, median absolute deviation, or some such measure) to reject values that fall a certain number of dispersion units from the group mean/median – for instance, people might remove all participants whose values fall more than 2 SDs above or below the mean. This can be reasonable in some situations because it effectively trims the distribution of your values. However, there is ample flexibility in where you set this criterion, and looking through the literature you will probably find anything from 1 to 3 SDs being used without any justification. Moreover, because your measure of dispersion is inherently related to how the data are distributed, it is obviously affected by the outliers themselves. A few very extreme and influential outliers can therefore mean that some obviously artifactual data are not flagged as outliers. To address this issue, several years ago we proposed to use bootstrapping to estimate the dispersion using a measure we called Shepherd’s Pi (the snark was strong with me then…).
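As a rough illustration, here is a hedged sketch of such a dispersion criterion with invented threshold values (this is not the Shepherd’s Pi procedure itself, just the basic idea of swapping in a robust dispersion estimate). Note how the same 2-unit cut-off behaves differently depending on whether the dispersion measure is itself inflated by the outliers:

```python
import numpy as np

def dispersion_outliers(values, n_units=2.0, robust=False):
    """Flag values further than n_units dispersion units from the centre.
    robust=False uses mean and SD; robust=True uses median and (scaled) MAD."""
    values = np.asarray(values, dtype=float)
    if robust:
        centre = np.median(values)
        spread = 1.4826 * np.median(np.abs(values - centre))  # MAD scaled to ~SD
    else:
        centre = np.mean(values)
        spread = np.std(values, ddof=1)
    return np.abs(values - centre) > n_units * spread

# invented thresholds from 8 participants; the last two are clearly artifactual
thresholds = np.array([1.1, 0.9, 1.3, 1.0, 1.2, 0.8, 5.6, 6.0])

print(dispersion_outliers(thresholds))               # SD-based: the two outliers inflate
                                                     # the SD so much that neither is flagged
print(dispersion_outliers(thresholds, robust=True))  # MAD-based: both are flagged
```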

In general, there are of course robust and/or non-parametric statistical tests that can help you make inferences in cases like this. While the details differ, what they have in common is that they are more robust in the face of certain kinds of outliers or unusual data distributions. Some of them are so fierce that they will remove the majority of your observations – which is clearly absurd and suggests that the test is ill-suited to that particular situation, at least. There is a whole statistical literature on the development of these tests and comparisons of their performance. To anyone but a stats aficionado, this is very tedious and full of names and Greek letters (hence our snark…). What these kinds of robust statistics (and more arbitrary data cleaning methods like a dispersion criterion) are good for is robustness checks, a type of multiverse analysis where you compare the outcomes of a range of tests to see if your results depend strongly on the particular test used. It seems sensible to include these checks, provided they are done transparently. There are obviously also situations where you might want to prespecify a particular test in a Registered Report: for example, if you expect a lot of outliers in a correlation, you might preregister upfront that you will use Spearman’s correlation instead of the standard Pearson’s. You might even go so far as to adopt a particular robust test a priori, even though it means you are sacrificing statistical power for improved robustness.
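Here is a minimal example of such a robustness check, again with simulated numbers and one deliberately planted artifact; a real multiverse analysis would of course compare more than two tests and, ideally, report all of them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(scale=0.5, size=30)  # a genuine moderate correlation
x[0], y[0] = 6.0, -6.0                        # plus one extreme artifactual observation

# compare the standard test with a rank-based, more outlier-resistant alternative
r, p = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson r = {r:.2f} (p = {p:.4f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.4f})")
# if the two disagree wildly, the result hinges on that single observation
```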

(Im-)practicalities and a way forward?

But the real issue is that this is clearly not a solution to the actual problem. In our example above, the blue data set is crap. Whatever your statistical criteria or robust statistical tests tell you, the blue data set should never enter our inference in the first place. Again, this is still a rather extreme example. It isn’t hard to justify posthoc why the curve shouldn’t start at 50% on the left – it violates the entire concept of the experiment. But most real situations are not so obvious. Sometimes we can make the argument that if participants perform considerably short of ceiling/floor level for the “obvious” stimulus levels (left and right side of the plot in my example), that constitutes valid grounds for exclusion. We had this in a student project a few years ago where some participants performed at levels that – if they had done the task properly – really could only be because they were legally blind (which they obviously weren’t). It seems justified to exclude those participants, but it probably also underlines issues with the experiment: participants either had a tendency to misunderstand the task instructions, the task might have been too difficult for them, or the experimenter didn’t ensure that they did the experiment properly. If there are a lot of such participants, this flags up a problem that you probably want to fix in future experiments. In the worst case, it casts serious doubt on the whole experiment, because if there are such problems for a lot of people, how confident can we be that they aren’t affecting the “good” data sets, albeit in more subtle ways?

Perhaps the best way to deal with such problems is in the context of a Registered Report. This may seem counterintuitive. There have been numerous debates on how preregistration may stifle the necessary flexibility to deal with real data like this (I have been guilty of perpetuating these discussions myself for a time). This view still prevails in some corners. But actually the exact opposite is true. From my own (admittedly still limited) experience editing Registered Reports, I can say that it is not uncommon for authors to request alterations after they have started data collection. In most cases, these are such minor changes that I wouldn’t bother the original reviewers with them. They can simply be addressed by a footnote in the final submission. But of course in cases like our example here, where your dream of the preregistered experiment crashes against the rocky shores of real data, it may be necessary to send an alteration back to the reviewers who approved the preregistered design. There are valid scientific debates about which data should be included and excluded, and there will inevitably be differences in opinion. That is completely fine. The point is that in this scenario you have a transparent record of the changes and why they were made, and you have independent experts weighing these arguments before the final outcome is known. Of course, a more practical solution could be to simply include the new outlier criterion as an additional, exploratory analysis in your final manuscript alongside the one you originally preregistered. However, in a situation like my example here this seems ill-advised: if the preregistered approach contains data that obviously shouldn’t be included, it isn’t worth very much. Some RR editors may take a stricter view on this, but I’d rather have an adapted design where the reasons for changes are clearly justified and transparent, and anyone can check the effects of these changes for themselves with the published data sets.

Transparency is really key. A Registered Report will provide some safety nets in terms of ensuring people declare the changes that were made and that they were reviewed by independent experts. But even in a classical (or exploratory) research article, the solution to this problem is simply transparent reporting. If data must be removed for justifiable posthoc reasons, then this should be declared explicitly. The excluded data should be made available so others can inspect it for themselves. Your mileage may vary with regard to some choices, but as long as it is clear why certain decisions were made this is simply a matter for debate. The big problem with all this is that the incentives in scientific publishing still work against this. As far as many high impact journals are concerned, data cleanliness is next to godliness. However justified the reasons may be, a study where some data were excluded posthoc inevitably looks messier than one where those exclusions are simply hidden.

Some have argued for rather Orwellian measures of publishing lab books with detailed records alongside research studies. I haven’t used a “lab book” since the mid-noughties. We have digital records of experiments we conducted, which are less prone to human error. Insofar as this is possible, such records are shared as part of published data sets. I have actually heard of some people who upload each experimental data set automatically as soon as it is collected. For one thing, this is only realistic for relatively small data (e.g. psychophysics or some cognitive psychology experiments perhaps). It doesn’t seem particularly feasible for a lot of research producing big files. It also just seems over-the-top to me. More importantly, this might be illegal in some jurisdictions (data sharing rules where I live are rather strict, in fact I sometimes wonder if any of my open science oriented colleagues aren’t breaking the law).

In the end, such ideas seem to treat the symptom and not the cause. We need to change the climate to make research more transparent. For this it is imperative that editors and reviewers remember that real data are messy. Obviously, if a result only emerges with extensive posthoc exclusions and the problems with the data set are mounting, there can be good reasons to question the finding. But it is crucial that researchers feel free to do the right thing.

When the hole changes the pigeon

or How innocent assumptions can lead to wrong conclusions

I promised you a (neuro)science post. Don’t let the title mislead you into thinking we’re talking about world affairs and societal ills again. While pigeonholing is directly related to polarised politics or social media, for once this is not what this post is about. Rather, it is about a common error in data analysis. While there have been numerous expositions of similar issues over the decades, it is a surprisingly easy mistake to make, as we’ve learned the hard way. A lay summary and some wider musings on the scientific process were published by Benjamin de Haas. A scientific article by Susanne Stoll laying out this problem in more detail is currently available as a preprint.

Pigeonholing (Source: https://commons.wikimedia.org/wiki/File:TooManyPigeons.jpg)

Data binning

In science you often end up with large data sets, with hundreds or thousands of individual observations subject to considerable variance. For instance, in my own field of retinotopic population receptive field (pRF) mapping, a given visual brain area may have a few thousand recording sites, each of which has a receptive field position. There are many other scenarios of course. It could be neural firing, or galvanic skin responses, or eye positions recorded at different time points. Or it could be hundreds or thousands of trials in a psychophysics experiment, etc. I will talk about pRF mapping because this is where we recently encountered the problem and I am going to describe how it has affected our own findings – however, you may come across the same issue in many guises.

Imagine that we want to test how pRFs move around when you attend to a particular visual field location. I deliberately use this example because it is precisely what a bunch of published pRF studies did, including one of ours. There is some evidence that selective attention shifts the position of neuronal receptive fields, so it is not far-fetched that it might shift pRFs in fMRI experiments also. Our study for instance investigated whether pRFs shift when participants are engaged in a demanding (“high load”) task at fixation, compared to a baseline condition where they only need to detect a simple colour change of the fixation target (“low load”). Indeed, we found that across many visual areas pRFs shifted outwards (i.e. away from fixation). This suggested to us that the retinotopic map reorganises to reflect a kind of tunnel vision when participants are focussed on the central task.

What would be a good way to quantify such map reorganisation? One simple way might be to plot each pRF in the visual field with a vector showing how it is shifted under the attentional manipulation. In the graph below, each dot shows a pRF location under the attentional condition, and the line shows how it has moved away from baseline. Since there is a large number of pRFs, many of which are affected by measurement noise or other errors, these plots can be cluttered and confusing:

Plotting shift of each pRF in the attention condition relative to baseline. Each dot shows where a pRF landed under the attentional manipulation, and the line shows how it has shifted away from baseline. This plot is a hellishly confusing mess.

Clearly, we need to do something to tidy up this mess. So we take the data from the baseline condition (in pRF studies, this would normally be attending to a simple colour change at fixation) and divide the visual field up into a number of smaller segments, each of which contains some pRFs. We then calculate the mean position of the pRFs from each segment under the attentional manipulation. Effectively, we summarise the shift from baseline for each segment:

We divide the visual field into segments based on the pRF data from the baseline condition and then plot the mean shift in the experimental condition for each segment. A much clearer graph that suggests some very substantial shifts…

This produces a much clearer plot that suggests some interesting, systematic changes in the visual field representation under attention. Surely, this is compelling evidence that pRFs are affected by this manipulation?

False assumptions

Unfortunately, it is not [1]. The mistake here is to assume that there is no noise in the baseline measure that was used to divide up the data in the first place. If our baseline pRF map were a perfect measure of the visual field representation, this would be fine. However, like most data, pRF estimates are variable and subject to many sources of error. The misestimation is also unlikely to be perfectly symmetric – for example, there are several reasons why it is more likely that a pRF will be estimated closer to central vision than in the periphery. This means there could be complex and non-linear error patterns that are very difficult to predict.

The data I showed in these figures are in fact not from an attentional manipulation at all. Rather, they come from a replication experiment where we simply measured a person’s pRF maps twice over the course of several months. One thing we do know is that pRF measurements are quite robust, stable over time, and even similar between scanners with different magnetic field strengths. What this means is that any shifts we found are most likely due to noise. They are completely artifactual.

When you think about it, this error is really quite obvious: sorting observations into clear categories can only be valid if you can be confident in the continuous measure on which you base these categories. Pigeonholing can only work if you can be sure which hole each pigeon belongs in. This error is also hardly new. It has been described in numerous forms as regression to the mean, and it rears its ugly head every few years in different fields. It is also related to circular inference, which caused quite a stir in cognitive and social neuroscience a few years ago. Perhaps the reason for this is that it is a damn easy mistake to make – but that doesn’t make the face-palming moment any less frustrating.
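Because the artifact is so counterintuitive, a simulation makes it easiest to see. The sketch below is a deliberately simplified, one-dimensional (eccentricity only) toy version of the problem with invented noise levels, not our actual analysis pipeline: the two noisy “sessions” share identical underlying pRF positions, yet binning by the noisy baseline still produces apparent shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 5000

# ground truth: pRF eccentricities are identical in both sessions (no real shift)
true_ecc  = rng.uniform(0.5, 9.5, n_voxels)
baseline  = true_ecc + rng.normal(scale=1.0, size=n_voxels)  # noisy "baseline" estimates
condition = true_ecc + rng.normal(scale=1.0, size=n_voxels)  # equally noisy "attention" estimates

# biased analysis: bin voxels by their *noisy* baseline eccentricity, then
# average the apparent shift (condition minus baseline) within each bin
edges = np.arange(0, 11, 2)
bin_idx = np.digitize(baseline, edges)
for b in range(1, len(edges)):
    in_bin = bin_idx == b
    shift = np.mean(condition[in_bin] - baseline[in_bin])
    print(f"{edges[b-1]:>2}-{edges[b]:<2} deg: mean apparent shift = {shift:+.2f} deg")

# despite zero true shift, the outer bins show systematic apparent shifts
# towards the middle of the eccentricity range: pure regression to the mean
```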

It is not difficult to correct this error. In the plot below, I used an independent map from yet another, third pRF mapping session to divide up the visual field. Then I calculated how the pRFs in each visual field segment shifted on average between the two experimental sessions. While some shift vectors remain, they are considerably smaller than in the earlier graph. Again, keep in mind that these are simple replication data and we would not really expect any systematic shifts. There certainly does not seem to be a very obvious pattern here – perhaps there is a bit of a clockwise shift in the right visual hemifield but that breaks down in the left. Either way, this analysis gives us an estimate of how much variability there may be in this measurement.

We use an independent map to divide the visual field into segments. Then we calculate the mean position for each segment in the baseline and the experimental condition, and work out the shift vector between them. For each segment, this plot shows that vector. This plot loses some information, but it shows how much and into which direction pRFs in each segment shifted on average.

This approach of using a third, independent map loses some information because the vectors only tell you the direction and magnitude of the shifts, not exactly where the pRFs started from and where they ended up. Often the magnitude and direction of the shift are all we really need to know. However, when the exact position is crucial, we could use other approaches. We will explore this in greater depth in upcoming publications.
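Continuing the same toy simulation (repeated here so the snippet runs on its own), the unbiased version simply defines the bins from an independent third measurement, in the spirit of the analysis described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 5000

# same toy setup as before: no true shift between the two sessions of interest
true_ecc  = rng.uniform(0.5, 9.5, n_voxels)
baseline  = true_ecc + rng.normal(scale=1.0, size=n_voxels)
condition = true_ecc + rng.normal(scale=1.0, size=n_voxels)
localiser = true_ecc + rng.normal(scale=1.0, size=n_voxels)  # independent third map

# unbiased analysis: define the bins from the independent map, then average
# the shift between the two sessions of interest within each bin
edges = np.arange(0, 11, 2)
bin_idx = np.digitize(localiser, edges)
for b in range(1, len(edges)):
    in_bin = bin_idx == b
    shift = np.mean(condition[in_bin] - baseline[in_bin])
    print(f"{edges[b-1]:>2}-{edges[b]:<2} deg: mean shift = {shift:+.2f} deg")

# because the binning noise is independent of both sessions, the spurious shifts
# largely vanish; what remains is an estimate of the measurement variability
```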

On the bright side, the example I picked here is probably extreme because I didn’t restrict these plots to a particular region of interest but used all supra-threshold voxels in the occipital cortex. A more restricted analysis would remove some of that noise – but the problem nevertheless remains. How much it skews the findings depends very much on how noisy the data are. Data tend to be less noisy in early visual cortex than in higher-level brain regions, which is where people usually find the most dramatic pRF shifts…

Correcting the literature

It is so easy to make this mistake that you can find it all over the pRF literature. Clearly, neither authors nor reviewers have given it much thought. It is definitely not confined to studies of visual attention, although this is how we stumbled across it. It could be a comparison between different analysis methods or stimulus protocols. It could be studies measuring the plasticity of retinotopic maps after visual field loss. Ironically, it could even be studies that investigate the potential artifacts of mapping such plasticity incorrectly. It is not restricted to the kinds of plots I showed here but should affect any form of binning, including the binning by eccentricity that is most common in the literature. We suspect the problem is also pervasive in many other fields or in studies using other techniques. Only a few years ago a similar issue was described by David Shanks in the context of studying unconscious processing. It is also related to warnings you may occasionally hear about using median splits – really just a simpler version of the same approach.

I cannot tell you if the findings of other studies that made this error are spurious. To know that, we would need access to the data and to reanalyse these studies. Many of them were published before data and code sharing became relatively common [2]. Moreover, you really need a validation dataset, like the replication data in my example figures here. The diversity of analysis pipelines and experimental designs makes this very complex – no two of these studies are alike. The error distributions may also vary between studies, so ideally we would need replication datasets for each study.

In any case, as far as our attentional load study is concerned, after reanalysing these data with unbiased methods we found little evidence of the effects we published originally. While there is still a hint of pRF shifts, these are no longer statistically significant. As painful as this is, we therefore retracted that finding from the scientific record. There is a great stigma associated with retraction because of the shady circumstances under which it often happens. But to err is human – and this is part of the scientific method. As I have said many times before, science is self-correcting, but that is not some magical process. Science doesn’t just happen; it requires actual scientists to do the work. While it can be painful to realise that your interpretation of your data was wrong, this does not diminish the value of the original work [3] – if anything, this work served an important purpose by revealing the problem to us.

We mostly stumbled across this problem by accident. Susanne Stoll and Elisa Infanti conducted a more complex pRF experiment on attention and found that the purported pRF shifts in all experimental conditions were suspiciously similar (you can see this in an early conference poster here). It took us many months of digging, running endless simulations, complex reanalyses, and sometimes heated arguments before we cracked that particular nut. The problem may seem really obvious now – it sure as hell wasn’t before all that.

This is why this erroneous practice appears to be widespread in this literature and may have skewed the findings of many other published studies. This does not mean that all these findings are false, but it should serve as a warning. Ideally, other researchers will also revisit their own findings, but whether or not they do so is frankly up to them. Reviewers will hopefully be more aware of the issue in future. People might question the validity of some of these findings in the absence of any reanalysis. But in the end, it doesn’t matter all that much which individual findings hold up and which don’t [4].

Check your assumptions

I am personally more interested in taking this whole field forward. This issue is not confined to the scenario I described here. pRF analysis is often quite complex. So are many other studies in cognitive neuroscience and, of course, in many other fields as well. Flexibility in study designs and analysis approaches is not a bad thing – it is in fact essential for addressing scientific questions that we can adapt our experimental designs.

But what this story shows very clearly is the importance of checking our assumptions. This is all the more important when using the complex methods that are ubiquitous in our field. As cognitive neuroscience matures, it is critical that we adopt good practices in ensuring the validity of our methods. In the computational and software development sectors, it is to my knowledge commonplace to test algorithms on conditions where the ground truth is known, such as random and/or simulated data.

This idea is probably not even new to most people, and it certainly isn’t to me. During my PhD there was a researcher in the lab who had concocted a pretty complicated analysis of single-cell electrophysiology recordings. It involved lots of summarising and recentering of neuronal tuning functions to produce the final outputs. Neither I nor our supervisor really followed every step of this procedure based only on our colleague’s description – it was just too complex. But eventually we suspected that something might be off, so we fed random numbers to the algorithm – and lo and behold, the results were a picture-perfect reproduction of the purported “experimental” results. Since then, I have simulated the results of my analyses a few other times – for example, when I first started with pRF modelling or when I developed new techniques for measuring psychophysical quantities.

This latest episode taught me that we must do this much more systematically. For any new design, we should conduct control analyses to check how it behaves with data for which the ground truth is known. It can reveal statistical artifacts that might hide inside the algorithm but also help you determine the method’s sensitivity and thus allow you to conduct power calculations. Ideally, we would do that for every new experiment even if it uses a standard design. I realise that this may not always be feasible – but in that case there should be a justification why it is unnecessary.
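In practice this can be as simple as the sketch below, where analysis_pipeline is a hypothetical stand-in for whatever your real analysis does: run the whole thing on pure noise many times and look at what comes out the other end.

```python
import numpy as np

def analysis_pipeline(data):
    """Hypothetical stand-in for a real analysis: average across trials,
    then pick the condition with the largest mean response."""
    return np.mean(data, axis=0).max()

rng = np.random.default_rng(42)
n_simulations, n_trials, n_conditions = 1000, 40, 8

# run the full pipeline on pure noise: with no signal in the input, any
# consistent "effect" in the output is produced by the analysis itself
null_outcomes = [analysis_pipeline(rng.normal(size=(n_trials, n_conditions)))
                 for _ in range(n_simulations)]

print(f"mean outcome under the null: {np.mean(null_outcomes):.3f}")
print(f"95th percentile (a crude detection threshold): {np.percentile(null_outcomes, 95):.3f}")
# even this toy pipeline is biased above zero, because taking the maximum
# across conditions builds a selection effect into the summary statistic
```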

Because what this really boils down to is simply good science. When you use a method without checking that it works as intended, you are effectively doing a study without a control condition – quite possibly the original sin of science.

Acknowledgements

In conclusion, I quickly want to thank several people: First of all, Susanne Stoll deserves major credit for tirelessly pursuing this issue in great detail over the past two years with countless reanalyses and simulations. Many of these won’t ever see the light of day but helped us wrap our heads around what is going on here. I want to thank Elisa Infanti for her input and in particular the suggestion of running the analysis on random data – without this we might never have realised how deep this rabbit hole goes. I also want to acknowledge the patience and understanding of our co-authors on the attentional load study, Geraint Rees and Elaine Anderson, for helping us deal with all the stages of grief associated with this. Lastly, I want to thank Benjamin de Haas, the first author of that study, for honourably doing the right thing. A lesser man would have simply booked a press conference at Current Biology Total Landscaping instead to say it’s all fake news and announce a legal challenge [5].

Footnotes:

  1. The sheer magnitude of some of these shifts may also be scientifically implausible, an issue I’ve repeatedly discussed on this blog already. Similar shifts have however been reported in the literature – another clue that perhaps something is awry in these studies…
  2. Not that data sharing is enormously common even now.
  3. It is also a solid data set with a fairly large number of participants. We’ve based our canonical hemodynamic response function on the data collected for this study – there is no reason to stop using this irrespective of whether the main claims are correct or not.
  4. Although it sure would be nice to know, wouldn’t it?
  5. Did you really think I’d make it through a blog post without making any comment like this?

Why Gilbert et al. are missing the point

This (hopefully brief) post is yet again related to the replicability debate I discussed in my previous post. I just read a response by Gilbert et al. to the blog comments about their reply to the reply to the reply to the (in my view, misnamed) Reproducibility Project Psychology. I won’t go over all of this again. I also won’t discuss the minutiae of the statistical issues, as many others have already done so and will no doubt do so again. I just want to say briefly why I believe they are missing the point:

The main argument put forth by Gilbert et al. is that there is no evidence for a replicability crisis in psychology and that the “conclusions” of the RPP are thus unfounded. I don’t think that the RPP ever claimed anything of the kind one way or the other (in fact, I was impressed by the modesty of the claims made by the RPP study when I read it) but I’ll leave that aside. I appreciate what Gilbert et al. are trying to do. I have myself frequently argued a contrarian position in these discussions (albeit not always entirely seriously). I am trying to view this whole debate the same way any scientist should: by evaluating the evidence without any investment in the answer. For that reason, the debate they have raised seems worthwhile. They tried to estimate a baseline level of replicability one could expect from psychology studies. I don’t think they’ve done it correctly (for statistical reasons) but I appreciate that they are talking about this. This is certainly what we would want to do in any other situation.

Unfortunately, it isn’t that simple. Even if there were no problems with publication bias, analytical flexibility, and lacking statistical power (and we can probably agree that this is not a tenable assumption), it wouldn’t be a straightforward thing to estimate how many psychology studies should replicate by chance. In order to know this you would need to know how many of the hypotheses are true and we usually don’t. As Einstein said – or at least the internet tells me he did: “If we knew what it was we were doing, it would not be called research, would it?”

One of the main points they brought up is that some of the replications in the RPP may have used inappropriate procedures to test the original hypotheses – I agree this is a valid concern but it also completely invalidates the argument they are trying to make. Instead of quibbling about what measure of replication rates is evidence for a “crisis” (a completely subjective judgement) let’s look at the data:

Figure: effect sizes of the RPP replications plotted against the originally reported effect sizes.

This scatter graph from the RPP plots effect sizes in the replications against the originally reported ones. Green (referred to as “blue” by the presumably colour-blind art editors) points are replications that turned out significant, red ones are those that were not significant and thus “failed to replicate.” The separation of the two data clouds is fairly obvious. Significant replication effects have a clear linear relationship with the original ones. Non-significant ones are uncorrelated with the original effect sizes.

We can argue until the cows come home what this means. The red points are presumably, at least in large part, false positives. Yes, of course some – perhaps many – may be due to methodological differences or hidden moderators etc. There is no way to quantify this reliably. And conversely, a lot of the green dots probably don’t tell us about any cosmic truths. While they replicate wonderfully, they may just be replicating the same errors and artifacts. All of these arguments are undoubtedly valid.

But that’s not the point. When we test the reliability of something we should aim for high fidelity. Of course, perfect reliability is impossible so there must be some scatter around the identity line. We also know that there will always be false positives so there should be some data points scattering around the x-axis. But do you honestly think it should be as many as in that scatter graph? Even if these are not all false positives in the original but rather false negatives in the replication, for instance because the replicators did a poor job or there were unknown factors we don’t yet understand, this ratio of green to red dots is not very encouraging.

Replicability encompasses all of the aforementioned explanations. When I read a scientific finding I don’t expect it to be “true.” Even if the underlying effects are real, the explanation for them can be utterly wrong. But we should expect a level of replicability from a field of research that at least maximises the trustworthiness of the reported findings. Any which way you look at it, this scatter graph is unsettling: if two thirds of the dots are red because of low statistical power and publication bias in the original effects, this is a major problem. But if they are red because the replications are somehow defective, that isn’t exactly a great argument either. What this shows is that the way psychology studies are currently done does not permit very reliable replication. Either way, if you give me a psychology study I should probably bet against it replicating. Does anyone think that’s an acceptable state of affairs?

I am sure both of these issues play a role, but the encouraging thing is that it is probably the former, false positives, that dominates after all. In my opinion, the best way anyone has looked at the RPP data so far is Alex Etz’s Bayesian reanalysis. This suggests that one of the main reasons the replicability in the RPP is so underwhelming is that the level of evidence for the original effects was weak to begin with. This speaks for false positives (due to low power, publication bias, QRPs) and against unknown moderators being behind most of the replication failures. Believe it or not, this is actually a good thing – because it is much easier to address the former problem than the latter.

A brave new world of research parasites

What a week! I have rarely seen the definition of irony demonstrated more clearly in front of my eyes than during the days following the publication of this comment by Lewandowsky and Bishop in Nature. I mentioned this at the end of my previous post. The comment discusses the question of how to deal with data requests and criticisms of scientific claims in the new world of open science. A lot of digital ink has already been spilled elsewhere debating what they did or didn’t say and what they meant to say with their article. I have no intention of rehashing that debate here. So while I typically welcome any meaningful and respectful comments under my posts, I’ll regard any comments on the specifics of the L&B article as off-topic and will not publish them. There are plenty of other channels for this.

I think the critics attack a strawman and the L&B discussion is a red herring. Irrespective of what they actually said, I want to get back to the discussion we should be having, which I already alluded to last time. In order to do so, let’s get the premise crystal clear. I have said all this before in my various posts about data sharing, but let me summarise the fundamental points:

  1. Data sharing: All data for scientific studies needed to reproduce the results should be made public in some independent repository at the point of publication. This must exclude data which would be unethical to share, e.g. unprocessed brain images from human participants. Such data fall in a grey area as to how much anonymisation is necessary and it is my policy to err on the side of caution there. We have no permission from our participants (except for some individual cases) to share their data with anyone outside the team if there is a chance that they could be identified from it so we don’t. For the overwhelming majority of purposes such data are not required and the pre-processed, anonymised data will suffice.
  2. Material sharing: When I talk about sharing data, I implicitly also mean materials, so any custom analysis code, stimulus protocols, or other materials used for the study should also be shared. This is not only good for reproducibility, i.e. getting the same results using the same data. It is also useful for replication efforts aiming to repeat the same experiment to collect new data.
  3. Useful documentation: Shared data are unlikely to be of much use to anyone if there isn’t a minimum of documentation explaining what they contain. I don’t think this needs to be excessive, especially given that most data will probably never be accessed by anyone. But there should at least be some basic guide on how to use the data to return a result. It should be reasonably clear what data can be found where or how to run the experiment. Provided the uncompiled code is included and the methods section of the publication contains sufficient detail of what is being done, anyone looking at it should be able to work it out by themselves. More extensive documentation is certainly helpful and may also help the researchers themselves in organising their work – but I don’t think we should expect more than the basics.

Now with this out of the way I don’t want to hear no lamentations about how I am “defending” the restriction of data access to anyone or any such rubbish. Let’s simply work on the assumption that the world is how it should be and that the necessary data are available to anyone with an internet connection. So let’s talk about the worries and potential problems this may bring. Note that, as I already said, most data sets will probably not generate much interest. That is fine – they should be available for potential future use in any case. More importantly this doesn’t mean the following concerns aren’t valid:

Volume of criticism

In some cases the number of people reusing the shared data will be very large. This is particularly likely for research on controversial topics. This could be because the topic is a political battleground or because the research is being used to promote policy changes people are not happy with. Perhaps the research receives undeserved accolades from the mainstream media, or maybe it’s just a very sensational claim (Psi research springs to mind again…). The criticisms of this research may or may not be justified. None of this matters and I don’t care to hear about the specifics of your particular pet peeve, whether it’s climate change or some medical trial. All that matters in this context is that the topic is controversial.

As I said last time, it should be natural that sensational or controversial research attracts more attention and more scepticism. This is how it should be. Scientists should be sceptical. But individual scientists or small research teams are composed of normal human beings, and there is a limit to how much criticism they can keep up with. This is a simple fact. Of course this statement will no doubt draw out the usual suspects who feel the need to explain to me that criticism and scepticism are necessary in science and that this is simply what one should expect.

Bookplate of the Royal Society (Great Britain)

So let me cut the heads off this inevitable hydra right away. First of all, this is exactly what I just said: Yes, science depends on scepticism. But it is also true that humans have limited capacity for answering questions and criticisms and limited ability to handle stress. Simply saying that they should be prepared for that and have no right to complain is unrealistic. If anything it will drive people away from doing research on controversial questions which cannot be a good thing.

Similarly, it is unrealistic to say that they could just ignore criticism if it gets too much for them. It is completely natural that a given scientist will want to respond to criticisms, especially if those criticisms are public. They will want to defend the conclusions they’ve drawn and they will also feel that they have a reputation to defend. I believe science would generally be better off if we all learned to become less invested in our pet theories and conducted our inferences in a less dogmatic way. I hope there are ways we can encourage such a change – but I don’t think you can take ego out of the equation completely. Especially if a critic accuses a researcher of incompetence or worse, it shouldn’t surprise anyone if they react emotionally and have personal stakes in the debate.

So what can we expect? To me it seems entirely justified in this situation for a researcher to write a summary response that addresses the criticism collectively. In doing so they would most likely have to be selective and only address the more serious points while ignoring the minutiae. This may require some training. Even then it may be difficult because critics might insist that their subtle points are of fundamental importance. In that situation an adjudicating article by an independent party may be helpful (albeit probably not always feasible).

On a related note, it also seems justified to me that a researcher will require time to make a response. This pertains more to how we should assess a scientific disagreement as outside observers. Just because a researcher hasn’t responded to every little criticism within days of somebody criticising their work doesn’t mean that the criticism is valid. Scientists have lives too. They have other professional duties, mortgages to pay with their too-low salaries, children to feed, and – hard as it is to believe – they deserve some time off occasionally. As long as they declare their intention to respond in depth at some stage we should respect that. Of course if they never respond that may be a sign that they simply don’t have a good response to the criticism. But you need some patience, something we seem to have lost in the age of instant access social media.

Excessive criticism or harassment

This brings us to the next issue. Harassment of researchers is never okay. Which is really because harassment of anyone is never okay. So pelting a researcher with repeated criticisms, making the same points or asking the same questions over and over, is not acceptable. This certainly borders on harassment and may cross the line. This constant background noise can wear people out. It is also counterproductive because it slows them down in making their response. It may also paralyse their other research efforts, which in turn will stress them out because they have grant obligations to fulfil, etc. Above all, stress can make you sick. If you have harassed somebody out of the ability to work, you’ll never get a response – and this doesn’t make your criticism valid.

If the researchers declared their intention to respond to criticism we should leave it at that. If they don’t respond after a significant time it might be worth a reminder if they are still working on it. As I said above, if they never respond this may be a sign that they have no response. In that case, leave it at that.

It should require no explanation why any blatant harassment, abusive contact, or any form of interference in the researchers’ personal lives, is completely unacceptable. Depending on the severity of such cases they should be prosecuted to the full extent of the law. And if someone reports harassment, in the first instance you should believe them. It is a common tactic of harassers to downplay claims of abuse. Sure, it is also unethical to make false accusations but you should leave that for the authorities to judge, in particular if you don’t have any evidence one way or the other. Harassment is also subjective. What might not bother you may very well affect another person badly. Brushing this off as them being too sensitive demonstrates a serious lack of compassion, is disrespectful, and I think it also makes you seem untrustworthy.

Motive and bias

Speaking of untrustworthiness brings me to the next point. There has been much discussion about the motives of critics and to what extent a criticism is to be taken in “good faith”. This is a complex and highly subjective judgement. In my view, your motive for reanalysing or critiquing a particular piece of research is not automatically a problem. All the data should be available, remember? Anyone can reanalyse it.

However, just as all researchers should be honest, so should all critics. Obviously this isn’t mandatory and it couldn’t be enforced even if it were. But this is how it should be and how good scientists should work. I have myself criticised and reanalysed research by others, and I was not beating around the bush in either case – I believe I was pretty clear that I didn’t believe their hypothesis was valid. Hiding your prior notions is disrespectful to the authors and also misleads the neutral observers of the discussion. Even if you think that your public image already makes your views clear – say, because you ranted at great length on social media about how terribly flawed you think that study was – this isn’t enough. Even the Science Kardashians don’t have that large a social media following, and probably only a fraction of that following will have read all your in-depth rants.

In addition to declaring your potential bias you should also state your intention. It is perfectly justified to dig into the data because you suspect it isn’t kosher. But this is an exploratory analysis and it comes with many of the same biases that uncontrolled, undeclared exploration always has. Of course you may find some big smoking gun that invalidates or undermines the original authors’ conclusions. But you are just as likely to find some spurious glitch or artifact in the data that doesn’t actually mean anything. In the latter case it would make more sense to conduct a follow-up experiment that tests your new alternative hypothesis to see if your suspicion holds up. If on the other hand you have a clear suspicion to start with, you should declare it, then test it and report the findings no matter what. Preregistration may help to discriminate the exploratory fishing trips from the pointed critical reanalyses – however, it is logistically not very feasible to check that this wasn’t just a preregistration after the fact, given that the data were already available.

So I think this judgement will always rely heavily on trust but that’s not a bad thing. I’m happy to trust a critic if they declare their prior opinion. I will simply take their views with some scepticism that their bias may have influenced them. A critic who didn’t declare their bias but is then shown to have a bias appears far less trustworthy. So it is actually in your interest to declare your bias.

Now before anyone inevitably reminds us that we should also worry about the motives and biases of the original authors – yes, of course. But this is a discussion we’ve already had for years and this is why data sharing and novel publication models like preregistration and registered reports are becoming more commonplace.

Lack of expertise

On to the final point. Reanalyses or criticism may come from people with too little expertise and knowledge of a research area to provide useful contributions. Such criticisms may obfuscate the discussion and that is never a good thing. Again, preempting the inevitable comments: No, this does not mean that you have to prove your expertise to reanalyse the data. (Seriously guys, which part of “all data should be available to anyone” don’t you get?!). What it does mean is that I might not want to weight the criticism of someone who once took a biology class in high school the same way as that of a world expert. It also means that I will be more sceptical when someone is criticising something outside their own field.

There are many situations where this caveat doesn’t matter. Any scientist with some statistical training may be able to comment on some statistical issue. In fact, a statistician is presumably more qualified to comment on some statistical point than a non-statistician of whatever field. And even if you may not be an expert on some particular research topic you may still be an expert on the methods used by the researchers. Importantly, even a non-expert can reveal a fundamental flaw. The lack of a critic’s expertise shouldn’t be misused to discredit them. In the end, what really matters is that your argument is coherent and convincing. For that it doesn’t actually matter if you are an expert or not (an expert may however find it easier to communicate their criticism convincingly).

However, let’s assume that a large number of non-experts are descending on a data set, picking at little things they perceive as flaws that aren’t actually consequential or making errors in their analysis that are glaring (to an expert). What should the researchers do in this situation? Not responding at all is not in their interest. This can easily be misinterpreted as a tacit acknowledgement that their research is flawed. On the other hand, responding to every single case is not in their interest either if they want to get on with their work (and their lives for that matter). As above, perhaps the best thing to do would be to write a summary response collectively rebutting the most pertinent points, make a clear argument about why these criticisms are inconsequential, and then leave it at that.

Conclusion

In general, scientific criticisms are publications that should work like any other scientific publication. They should be subject to peer review (which, as readers of this blog will know, I believe should be post-publication and public). This doesn’t mean that criticism cannot start on social media, blogs, journal comment sections, or on PubPeer, and the boundaries may also blur at times. For some kinds of criticism, such as pointing out basic errors or misinterpretations, some public comments may suffice, and there have been cases where a publication was retracted simply because of the social media response. But for a criticism to be taken seriously by anyone, especially non-experts, it helps if it is properly vetted by independent experts – just as any study should be vetted. This may also help particularly in cases where the validity of the criticism is uncertain.

I think this is a very important discussion to have. We need to have this to bring about the research culture most of us seem to want. A brave new world of happy research parasites.

Parasites

(Note: I changed the final section somewhat after Neuroskeptic rightly pointed out that the conclusions were a bit too general. Tal Yarkoni independently replicated this sentiment. But he was only giving me a hard time.)