If you don’t believe science self-corrects, then you probably shouldn’t believe that evolution by natural selection occurs either – it’s basically the same thing.
I have said it many times before, both under the guise of my satirical alter ego and later – more seriously – on this blog. I am getting very tired of repeating it so I wrote this final post about it that I will simply link to next time this inevitably comes up…
My latest outburst about this was triggered by this blog post by Keith Laws entitled “Science is ‘Other-Correcting’”. I have no qualms with the actual content of this post. It gives an interesting account of the attempt to correct an error in the publication record. The people behind this effort are great researchers for whom I have the utmost respect. The story they tell is shocking and important. In particular, the email they received by accident from a journal editor is disturbing and serves as a reminder of all the things that are wrong with the way scientific research and publishing currently operates.
My issue is with the seemingly ubiquitous doubts about the self-correcting nature of science. To quote from the first paragraph of that post:
“I have never been convinced by the ubiquitous phrase ‘Science is self-correcting’. Much evidence points to science being conservative and looking less self-correcting and more ego-protecting. It is also not clear why ‘self’ is the correct description – most change occurs because of the ‘other’ – Science is other correcting.”
In my view this and similar criticisms of self-correction completely miss the point. The suffix ‘self-‘ refers to science, not to scientists. In fact, the very same paragraph contains the key: “Science is a process.” Science is an iterative approach by which we gradually broaden our knowledge and understanding of the world. You can debate whether or not there is such a thing as the “scientific method” – perhaps it’s more of a collection of methods. However, in my view above all else science is a way of thinking.
Scientific thinking is being inquisitive, skeptical, and taking nothing for granted. Prestige, fame, success are irrelevant. Perfect theories are irrelevant. The smallest piece of contradictory evidence can refute your grand unifying theory. And science encompasses all that. It is an emergent concept. And this is what is self-correcting.
Scientists, on the other hand, are not self-correcting. Some are more so than others but none are perfect. Scientists are people and thus inherently fallible. They are subject to ego, pride, greed, and all of life’s pressures, such as the need to pay a mortgage, feed their children, and build a career. In the common vernacular, “science” is often conflated with the scientific enterprise, the way scientists go about doing science. This involves all those human factors and more and, fair enough, it is anything but self-correcting. But to argue that this means science isn’t self-correcting is attacking a straw man, because few people seriously argue that the scientific enterprise couldn’t be better.
We should always strive to improve the way we do science because, due to our human failings, it will never be perfect. However, in this context we also shouldn’t forget how much we have already improved it. In Newton’s time, science in Europe (the hub of science then) was largely done only by white men from a very limited socioeconomic background. Even decades later, most women or people of non-European origin need not even have bothered trying (although this uphill struggle makes the achievements of scientists like Marie Curie or Henrietta Swan Leavitt all the more impressive). And publishing your research findings was not subject to formal peer review but depended largely on the ego of some society presidents and on whether they liked you. None of these problems have been wiped off the face of the Earth, but I would hope most people agree that things are better than they were 100 years ago.
Like all human beings, scientists are flawed. Nevertheless I am actually optimistic about us as a group. I do believe that on the whole scientists are actually interested in learning the truth and widening their understanding of nature. Sure, there are black sheep and even the best of us will succumb to human failings. At some point or other our dogma and affinity to our pet hypotheses can blind us to the cold facts. But on average I’d like to think we do better than most of our fellow humans. (Then again, I’m probably biased…).
We will continue to make the scientific enterprise better. We will change the way we publish and evaluate scientific findings. We will improve the way we interpret evidence and communicate scientific discoveries. The scientific enterprise will become more democratic, less dependent on publishers getting rich on our free labour. Already within the decade I have been a practicing scientist, we have begun to tear down the widespread illusion that when a piece of research is published it must therefore be true. When I did my PhD, the only place we could critically discuss new publications was in a small journal club, and the conclusions of these discussions were almost never shared with the world. Nowadays every new study is immediately discussed online by an international audience. We have taken leaps towards making scientific findings, data, and materials available to anyone, anywhere, provided they have internet access. I am very optimistic that this is only the beginning of much more fundamental changes.
Last year I participated in a workshop called “Is Science Broken?” that was organised solely by graduate students in my department. The growing number of replication attempts in the literature and all these post-publication discussions we are having are perfect examples of science correcting itself. It seems deeply ironic to me when a post like Keith Laws’s, which describes an active effort to rectify errors, argues against the self-correcting nature of the scientific process.
Of course, self-correction is not guaranteed. It can easily be stifled. There is always a danger that we drift back into the 19th century or the dark ages. But the greater academic freedom (and generous funding) scientists are given, the more science will be allowed to correct itself.
Update (19 Jan 2016): I just read this nice post about the role of priors in Bayesian statistics. The author actually says Bayesian analysis is “self-correcting” and this epitomises my point here about science. I would say science is essentially Bayesian. We start with prior hypotheses and theories but by accumulating evidence we update our prior beliefs to posterior beliefs. It may take a long time but assuming we continue to collect data our assumptions will self-correct. It may take a reevaluation of what the evidence is (which in this analogy would be a change to the likelihood function). Thus the discussion about how we know how close to the truth we are is in my view missing the point. Self-correction describes the process.
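To make the analogy concrete, here is a minimal sketch of Bayesian updating using a Beta-Binomial model. All numbers are invented purely for illustration: a deliberately overconfident prior gets dragged towards the truth as evidence accumulates, which is the sense of “self-correcting” I mean.

```python
# A minimal sketch of Bayesian updating with a Beta-Binomial model.
# All numbers are invented for illustration.

# Start with a (deliberately overconfident) prior: Beta(8, 2) puts most
# of its mass on rates near 0.8.
alpha, beta = 8.0, 2.0

# Suppose the evidence disagrees: out of 100 accumulated studies, only
# 30 "replicate". The posterior is pulled towards the data regardless
# of the prior.
successes, failures = 30, 70

posterior_alpha = alpha + successes
posterior_beta = beta + failures

prior_mean = alpha / (alpha + beta)
posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta)

print(f"prior mean:     {prior_mean:.2f}")      # 0.80
print(f"posterior mean: {posterior_mean:.2f}")  # 0.35
```

With more data the posterior converges to the true rate whatever the prior was, which is precisely the self-correcting property of the process.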
Update (21 Jan 2016): I added a sentence from my comment in the discussion section to the top. It makes for a good summary of my post. The analogy may not be perfect – but even if not I’d say it’s close. If you disagree, please leave a comment below.
Data sharing has been in the news a lot lately, from the refusal of the authors of the PACE trial to share their data even though the journal expects it, to the eventful story of the “Sadness impairs color perception” study. A blog post by Dorothy Bishop called “Who’s afraid of Open Data?” made the rounds. The post itself is actually a month old already but it was republished by the LSE blog, which gave it some additional publicity. In it she makes an impassioned argument for open data sharing and discusses the fears and criticisms many researchers have voiced against data sharing.
I have long believed in making all data available (and please note that in the following I will always mean data and materials, so not just the results but also the methods). The way I see it, this transparency is the first and most important remedy for the ills of scientific research. I have regular discussions with one of my close colleagues* about how to improve science – we don’t always agree on various points like preregistration, but if there is one thing where we are on the same page, it is open data sharing. By making data available, anyone can reanalyse it and check whether the results reproduce, and you can check the robustness of a finding for yourself, if you feel that you should. Moreover, by documenting and organising your data you not only make it easier for other researchers to use, but also for yourself and your lab colleagues. It also helps you spot errors. It is also a good argument that stops reviewer 2 from requesting a gazillion additional analyses – if they really think these analyses are necessary they can do them themselves and publish them. This aspect in fact overlaps greatly with the debate on Registered Reports (RR) and it is one of the reasons I like the RR concept. But the benefits of data sharing go well beyond this. Access to the data will allow others to reuse it to answer scientific questions you may not even have thought of. The data can also be used in meta-analyses. With the increasing popularity and feasibility of large-scale permutation/bootstrapping methods, access to the raw values will be particularly important: it allows you to take into account distributional anomalies and outliers, or perhaps to estimate the uncertainty of individual data points.
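To illustrate why the raw values matter for such resampling methods, here is a toy percentile bootstrap for the uncertainty of a mean. The data are invented, and a real analysis would use a proper library routine rather than this sketch; the point is simply that none of it is possible from summary statistics alone.

```python
import random

# A toy percentile bootstrap for the uncertainty of a mean. This needs
# every raw data point, not just the summary statistics reported in a
# paper. All values below are invented.
random.seed(1)

data = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 3.9, 5.2]

def bootstrap_ci(values, n_boot=10000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    means = []
    for _ in range(n_boot):
        # Resample the raw values with replacement and record the mean.
        resample = [random.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci(data)
print(f"mean = {sum(data) / len(data):.2f}, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")
```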
But as Dorothy describes, many scientists nevertheless remain afraid of publishing their actual data alongside their studies. For several years many journals and funding agencies have had a policy that data should always be shared upon request – but a laughably small proportion of such requests are successful. This is why some have now adopted the policy that all data must be shared in repositories upon publication or even upon submission. And to encourage this process recently the Peer Reviewer Openness Initiative was launched by which signatories would refuse to conduct in-depth reviews of manuscripts unless the authors can give a reason why data and materials aren’t public.
My most memorable experience with fears about open data involves a case where the lab head refused to share data and materials with the graduate student* who actually created the methods and collected the data. The exact details aren’t important. Maybe one day I will talk more about this little horror story… For me this demonstrates how far we have come already. Nowadays that story would be baffling to most researchers but back then (and that’s only a few years ago – I’m not that old!) more than one person actually told me that the PI and university were perfectly justified in keeping the student’s results and the fruits of their intellectual labour under lock and key.
Clearly, people are still afraid of open data. Dorothy lists the following reasons:
1. Lack of time to curate data; data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task.
2. Personal investment – a sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders.
3. Concerns about being scooped before the analysis is complete.
4. Fear of errors being found in the data.
5. Ethical concerns about confidentiality of personal data, especially in the context of clinical research.
6. Possibility that others with a different agenda may misuse the data, e.g. perform selective analyses that misrepresent the findings.
In my view, points 1-4 are invalid arguments even if they seem understandable. I have a few comments about some of these:
The fear of being scooped
I honestly am puzzled by this one. How often does this actually happen? The fear of being scooped is widespread and it may occasionally be justified. Say, if you discuss some great idea you have, or post a pilot result on social media, perhaps you shouldn’t be surprised if someone else agrees that the idea is great and also does it. Some people wouldn’t be bothered by that but many would and that’s understandable. Less understandable to me is if you present research at a conference and then complain about others publishing similar work because they were inspired by you. That’s what conferences are for. If you don’t want that to happen, don’t go to conferences. Personally, I think science would be a lot better if we cared a lot less about who did what first and instead cared more about what is true and how we can work together…
But anyway, as far as I can see none of that applies to data sharing. By definition data you share is either already published or at least submitted for peer review. If someone reuses your data for something else they have to cite you and give you credit. In many situations they may even do it in collaboration with you which could lead to coauthorship. More importantly, if the scooped result is so easily obtained that somebody beats you to it despite your head start (it’s your data, regardless of how well documented it is you will always know it better than some stranger) then perhaps you should have thought about that sooner. You could have held back on your first publication and combined the analyses. Or, if it really makes more sense to publish the data in separate papers, then you could perhaps declare that the full data set will be shared after the second one is published. I don’t really think this is necessary but I would accept that argument.
Either way, I don’t believe being scooped via data sharing is very realistic and any cases of that happening must be extremely rare. But please share these stories if you have them to prove me wrong! If you prefer, you can post them anonymously on the Neuroscience Devils. That’s what I created that website for.
Fear of errors being discovered
I’m sure everyone can understand that fear. It can be embarrassing to have your errors (and we all make mistakes) being discovered – at least if they are errors with big consequences. Part of the problem is also that all too often the discovery of errors is associated with some malice. To err is human, to forgive divine. We really need to stop treating every time somebody’s mistakes are being revealed (or, for that matter, when somebody’s findings fail to replicate) as an implication of sloppy science or malpractice. Sometimes (usually?) mistakes are just mistakes.
Probably nobody wants to have all of their data combed by vengeful sleuths nitpicking every tiny detail. If that becomes excessive and the same person is targeted, it could border on harassment and that should be counteracted. In-depth scrutiny of all the data by a particular researcher should be a special case that only happens when there is a substantial reason, say, in a fraud investigation. I would hope though that these cases are also rare.
And surely nobody can seriously want the scientific record to be littered with false findings, artifacts, and coding errors. I am not happy if someone tells me I made a serious error but I would nonetheless be grateful to them for telling me! It has happened before when lab members or collaborators spotted mistakes I made. In turn I have spotted mistakes colleagues made. None of this would have been possible if we didn’t share our data and methods with one another. I am always surprised when I hear how uncommon this seems to be in some labs. Labs should be collaborative, and so should science as a whole. And as I already said, organising and documenting your data actually helps you to spot errors before the work is published. If anything, data sharing reduces mistakes.
Ethical issues with patient confidentiality
This is a big concern – and the only one that I have full sympathy with. But all of our ethics and data protection applications actually discuss this. The only data that is shared should be anonymised. Participants should only be identified by unique codes that only the researchers who collected the data have access to. For a lot of psychology or other behavioural experiments this shouldn’t be hard to achieve.
Neuroimaging or biological data are a different story. I have a strict rule for my own results. We do not upload the actual brain images of our fMRI experiments to public repositories. While under certain conditions I am willing to share such data upon request as long as the participant’s name has been removed, I don’t think it is safe to make those data permanently available to the entire internet. Participant confidentiality must trump the need for transparency. It simply is not possible to remove all identifying information from these files. Skull-stripping, which removes the head tissues from an MRI scan except for the brain, does not remove all identifying information. Brains are like fingerprints and they can easily be matched up if you have the required data. As someone* recently said in a discussion of this issue, the undergrad you are scanning in your experiment now may be Prime Minister in 20-30 years’ time. They definitely didn’t consent to their brain scans being available to anyone. It may not take much to identify a person’s data using only their age, gender, handedness, and a basic model of their head shape derived from their brain scan. We must also keep in mind what additional data mining may be possible in the coming decades that we simply have no idea about yet. Nobody can know what information could be gleaned from these data, say, about health risks or personality factors. Sharing this without very clear informed consent (which many people probably wouldn’t give) is in my view irresponsible.
I also don’t believe that for most purposes this is even necessary. Most neuroimaging studies involve group analyses. In those you first spatially normalise the images of each participant and then perform statistical analysis across participants. It is perfectly reasonable to make those group results available. For the purposes of non-parametric permutation analyses (also in the news recently) you would want to share individual data points, but even there you can probably share images after sufficient processing so that not much incidental information is left (e.g. condition contrast images). In our own work, these considerations don’t apply. We conduct almost all our analyses in the participant’s native brain space. As such we decided to only share the participants’ data projected on a cortical reconstruction. These data contain the functional results for every relevant voxel after motion correction and signal filtering. No, this isn’t raw data, but it is sufficient to reproduce the results and it is also sufficient for applying different analyses. I’d wager that for almost all purposes this is more than enough. And again, if someone were to be interested in applying different motion correction or filtering methods, this would be a negotiable situation. But I don’t think we need to allow unrestricted permanent access for such highly unlikely purposes.
Basically, rather than sharing all raw data I think we need to treat each data set on a case-by-case basis and weigh the risks against benefits. What should be mandatory in my view is sharing all data after default processing that is needed to reproduce the published results.
People with agendas and freeloaders
Finally, a few words about a combination of points 2 and 6 in Dorothy Bishop’s list. When it comes to controversial topics (e.g. climate change or chronic fatigue syndrome, to name a few examples where this apparently happened) there could perhaps be a danger that people with shady motivations will reanalyse and nitpick the data to find fault with them and discredit the researcher. More generally, people with limited expertise may conduct poor reanalyses. Since failed reanalyses (and again, the same applies to failed replications) often cause quite a stir and are frequently discussed as evidence that the original claims were false, this could indeed be a problem. Also, some will perceive these cases as “data tourism”, using somebody else’s hard-won results for quick personal gain – say, by making a name for themselves as a cunning data detective.
There can be some truth in that and for that reason I feel we really have to work harder to change the culture of scientific discourse. We must resist the bias to agree with the “accuser” in these situations. (Don’t pretend you don’t have this bias because we all do. Maybe not in all cases but in many cases…)
Of course skepticism is good. Scientists should be skeptical but the skepticism should apply to all claims (see also this post by Neuroskeptic on this issue). If somebody reanalyses somebody else’s data using a different method that does not automatically make them right and the original author wrong. If somebody fails to replicate a finding, that doesn’t mean that finding was false.
Science thrives on discussion and disagreement. The critical thing is that the discussion is transparent and public. Anyone who has an interest should have the opportunity to follow it. Anyone who is skeptical of the authors’ or the reanalysers’/replicators’ claims should be able to check for themselves.
And the only way to achieve this level of openness is Open Data.
* They will remain anonymous unless they want to join this debate.
In my previous post, I talked about why I think all properly conducted research should be published. Null results are important. The larger scientific community needs to know whether or not a particular hypothesis has been tested before. Otherwise you may end up wasting somebody’s time because they repeatedly try in vain to answer the same question. What is worse, we may also propagate false positives through the scientific record because failed replications are often still not published. All of this contributes to poor replicability of scientific findings.
However, the emphasis here is on ‘properly conducted research’. I already discussed this briefly in my post but it also became the topic of an exchange between (for the most part) Brad Wyble, Daniël Lakens, and myself. In some fields, for example psychophysics, extensive piloting and “fine-tuning” of experiments are not only very common but probably also necessary. To me it doesn’t seem sensible to make the results of all of these attempts publicly available. This would inevitably flood the scientific record with garbage. Most likely nobody would look at it. Even if you are a master at documenting your work, nobody but you (and after a few months maybe not even you) will understand what is in your archive.
Most importantly, it can actually be extremely misleading for others who are less familiar with the experiment to see all of the tests you ran to ensure that the task was actually doable, that the monitors were at the correct distance from the participant, that your stereoscope was properly aligned, that the luminance of the stimuli was correct, that the masking procedure was effective, etc. Often you may only realise during piloting that the beautiful stimulus you designed after much theoretical deliberation doesn’t really work in practice. For example, you may inadvertently induce an illusory percept that alters how participants respond in the task. This in fact happened recently with an experiment a collaborator of mine piloted. And more often than not, after having tested a particular task on myself at great length, I then discover that it is far too difficult for anyone else (let’s talk about overtrained psychophysicists another time…).
Such pilot results are not very meaningful
It most certainly would not be justified to include them in a meta-analysis to quantify the effect – because they presumably don’t even measure the same effect (or at least not very reliably). A standardised effect size, like Cohen’s d, is a signal-to-noise ratio as it compares an effect (e.g. difference in group means) to the variability of the sample. The variability is inevitably larger if a lot of noisy, artifactual, and quite likely erroneous data are included. While some degree of this can be accounted for in meta-analysis by using a random-effects model, it simply doesn’t make sense to include bad data. We are not interested in the meta-effect, that is, the average result over all possible experimental designs we can dream up, no matter how inadequate.
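As a toy illustration of Cohen's d being a signal-to-noise ratio, the sketch below computes d for two invented datasets with exactly the same mean difference. The noisier one yields a much smaller d, which is why folding garbage pilot data into a meta-analysis distorts the estimate.

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled

# Invented numbers: both comparisons have a mean difference of exactly 1.0.
clean_a = [10.1, 10.3, 9.9, 10.2, 10.0]
clean_b = [9.1, 9.3, 8.9, 9.2, 9.0]
noisy_a = [10.1, 12.5, 7.2, 10.2, 13.0]  # same signal, far more spread
noisy_b = [9.1, 11.5, 6.2, 9.2, 12.0]

print(f"clean d = {cohens_d(clean_a, clean_b):.2f}")
print(f"noisy d = {cohens_d(noisy_a, noisy_b):.2f}")
```

The same "signal" (a difference of 1.0) produces a large d with precise measurements and a small d with noisy ones; averaging over both tells you little about the biological effect.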
What we are actually interested in is some biological effect, and we should ensure that we take as precise a measurement as possible. Once you have a procedure that you are confident will yield precise measurements, by all means, carry out a confirmatory experiment. Replicate it several times, especially if it’s not an obvious effect. Pre-register your design if you feel you should. Maximise statistical power by testing many subjects if necessary (although often significance is tested on a subject-by-subject basis, so massive sample sizes are really overkill as you can treat each participant as a replication – I’ll talk about replication in a future post so I’ll leave it at this for now). But before you do all this you usually have to fine-tune an experiment, at least if it is a novel problem.
Isn’t this contributing to the problem?
Colleagues in social/personality psychology often seem to be puzzled and even concerned by this. The opacity of what has or hasn’t been tried is part of the problems that plague the field and lead to publication bias. There is now a whole industry meta-analysing results in the literature to quantify ‘excess significance’ or a ‘replication index’. This aims to reveal whether some additional results, especially null results, may have been suppressed or if p-hacking was employed. Don’t these pilot experiments count as suppressed studies or p-hacking?
No, at least not if this is done properly. The criteria you use to design your study must of course be orthogonal to and independent from your hypothesis. Publication bias, p-hacking, and other questionable practices are all actually sub-forms of circular reasoning: You must never use the results of your experiment to inform the design as you may end up chasing (overfitting) ghosts in your data. Of course, you must not run 2-3 subjects on an experiment, look at the results and say ‘The hypothesis wasn’t confirmed. Let’s tweak a parameter and start over.’ This would indeed be p-hacking (or rather ‘result hacking’ – there are usually no p-values at this stage).
A real example
I can mainly speak from my own experience but typically the criteria used to set up psychophysics experiments are sanity/quality checks. Look for example at the figure below, which shows a psychometric curve of one participant. The experiment was a 2AFC task using the method of constant stimuli: In each trial the participant made a perceptual judgement on two stimuli, one of which (the ‘test’) could vary physically while the other remained constant (the ‘reference’). The x-axis plots how different the two stimuli were, so 0 (the dashed grey vertical line) means they were identical. To the left or right of this line the correct choice would be the reference or test stimulus, respectively. The y-axis plots the percentage of trials on which the participant chose the test stimulus. By fitting a curve to these data we can estimate the ability of the participant to tell apart the stimuli – quantified by how steep the curve is – and also their bias, that is, at what level of x the two stimuli appeared identical to them (dotted red vertical line):
As you can tell, this subject was quite proficient at discriminating the stimuli because the curve is rather steep. At many stimulus levels the performance is close to perfect (that is, either near 0 or 100%). There is a point where performance is at chance (dashed grey horizontal line). But once you move to the left or the right of this point performance becomes good very fast. The curve is however also shifted considerably to the right of zero, indicating that the participant indeed had a perceptual bias. We quantify this horizontal shift to infer the bias. This does not necessarily tell us the source of this bias (there is a lot of literature dealing with that question) but that’s beside the point – it clearly measures something reliably. Now look at this psychometric curve instead:
The general conventions here are the same but these results are from a completely different experiment that clearly had problems. This participant did not make correct choices very often as the curve only barely goes below the chance line – they chose the test stimulus far too often. There could be numerous reasons for this. Maybe they didn’t pay attention and simply made the same choice most of the time. For that, the trend is a bit too clean though. Perhaps the task was too hard for them, maybe because the stimulus presentation was too brief. This is possible although it is very unlikely that a healthy, young adult with normal vision would not be able to tell apart the more extreme stimulus levels with high accuracy. Most likely, the participant did not really understand the task instructions or perhaps the stimuli created some unforeseen effect (like the illusion I mentioned before) that actually altered what percept they were judging. Whatever the reason, there is no correct way to extrapolate the psychometric parameters here. The horizontal shift and the slope are completely unusable. We see an implausibly poor discrimination performance and an extremely large perceptual bias. If their vision really worked this way, they should be severely impaired…
So these data are garbage. It makes no sense to meta-analyse biologically implausible parameter estimates. We have no idea what the participant was doing here and thus we can also have no idea what effect we are measuring. Now this particular example is actually a participant a student ran as part of their project. If you did this pilot experiment on yourself (or a colleague) you might have worked out what the reason for the poor performance was.
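For the curious, here is a rough sketch of how the bias (point of subjective equality) and slope of a clean psychometric curve like the first one can be extracted from 2AFC data. The data are invented, and the crude grid search stands in for the maximum-likelihood fits one would actually use (e.g. with a toolbox like psignifit).

```python
import math

# Invented 2AFC data: stimulus difference (x) and proportion of trials on
# which the participant chose the test stimulus. The curve is shifted to
# the right of zero, i.e. the participant has a perceptual bias.
x = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
p = [0.01, 0.02, 0.05, 0.12, 0.30, 0.60, 0.85, 0.95, 0.99]

def logistic(xi, pse, slope):
    """Logistic psychometric function centred on the PSE (the bias)."""
    return 1.0 / (1.0 + math.exp(-slope * (xi - pse)))

def fit(xs, ps):
    """Crude grid search for the PSE and slope minimising squared error."""
    best_err, best_pse, best_slope = float("inf"), None, None
    for pse in (i / 20 for i in range(-40, 41)):       # -2.0 .. 2.0
        for slope in (j / 20 for j in range(2, 101)):  # 0.1 .. 5.0
            err = sum((logistic(xi, pse, slope) - pi) ** 2
                      for xi, pi in zip(xs, ps))
            if err < best_err:
                best_err, best_pse, best_slope = err, pse, slope
    return best_pse, best_slope

pse, slope = fit(x, p)
print(f"bias (PSE) ≈ {pse:.2f}, slope ≈ {slope:.2f}")
```

With data like the second curve, no parameters in a plausible range fit well, which is exactly why such pilot runs fail the sanity check and are excluded.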
What can we do about it?
In my view, it is entirely justified to exclude such data from our publicly shared data repositories. It would be a major hassle to document all these iterations. And what is worse, it would obfuscate the results for anyone looking at the archive. If I look at a data set and see a whole string of brief attempts from a handful of subjects (usually just the main author), I could be forgiven for thinking that something dubious is going on here. However, in most cases this would be unjustified and a complete waste of everybody’s time.
At the same time, however, I also believe in transparency. Unfortunately, some people do engage in result-hacking and iteratively enhance their findings by making the experimental design contingent on the results. In most such cases this is probably not done deliberately and with malicious intent – but that doesn’t make it any less questionable. All too often people like to fiddle with their experimental design while the actual data collection is already underway. In my experience this tendency is particularly severe among psychophysicists who moved into neuroimaging where this is a really terrible (and costly) idea.
How can we reconcile these issues? In my mind, the best way is perhaps to document briefly what you did to refine the experimental design. We honestly don’t need or want to see all the failed attempts at setting up an experiment but it could certainly be useful to have an account of how the design was chosen. What experimental parameters were varied? How and why were they chosen? How many pilot participants were there? This last point is particularly telling. When I pilot something, there usually is one subject: Sam. Possibly I will have also tested one or two others, usually lab members, to see if my familiarity with the design influences my results. Only if the design passes quality assurance, say by producing clear psychometric curves or by showing to-be-expected results in a sanity check (e.g., the expected response on catch trials), would I dare to actually subject “real” people to a novel design. Having some record, even if as part of the documentation of your data set, is certainly a good idea though.
The number of participants and pilot experiments can also help you judge the quality of the design. Such “fine-tuning” and tweaking of parameters isn’t always necessary – in fact most designs we use are actually straight-up replications of previous ones (perhaps with an added condition). I would say, though, that in my field this is a very normal thing to do when setting up a new design at least. However, I have also heard of extreme cases that I find fishy. (I will spare you the details and refrain from naming anyone.) For example, in one study the experimenters ran over 100 pilot participants – tweaking the design all along the way – to identify those that showed a particular perceptual effect, and then used literally a handful of these for an fMRI study that claimed to be about “normal” human brain function. Clearly, this isn’t alright. But this also cannot possibly count as piloting anymore. The way I see it, a pilot experiment can’t have an order of magnitude more data than the actual experiment…
How does this relate to the wider debate?
I don’t know how applicable these points are to social psychology research. I am not a social psychologist, and my main knowledge of their experiments comes from reading particularly controversial studies or the discussions about them on social media. I guess that some of these issues do apply, but they are probably far less common. An equivalent situation to what I describe here would be redesigning your questionnaire because people always score at the maximum – and by ‘people’ I mean the lead author :P. I don’t think this is a realistic situation in social psychology, but it is exactly how psychophysical experiments work. Basically, what we do in piloting is what a chemist would do when calibrating their scales or cleaning their test tubes.
Or here’s another analogy using a famous controversial social psychology finding we discussed previously: Assume you want to test whether some stimulus makes people walk more slowly as they leave the lab. What I do in my pilot experiments is to ensure that the measurement I take of their walking speed is robust. This could involve measuring the walking time for a number of people before actually doing any experiment. It could also involve setting up sensors to automate this measurement (more automation is always good to remove human bias but of course this procedure needs to be tested too!). I assume – or I certainly hope so at least – that the authors of these social psychology studies did such pre-experiment testing that was not reported in their publications.
As I said before, humans are dirty test tubes. But you should ensure that you get them as clean as you can before you pour in your hypothesis. Perhaps a lot of this falls under methods we don’t report. I’m all for reducing this. Methods sections frequently lack necessary detail. But to some extent, I think some unreported methods and tests are unavoidable.
Yesterday Neuroskeptic came to our Cognitive Drinks event in the Experimental Psychology department at UCL to talk about p-hacking. His entertaining talk (see Figure 1) was followed by a lively and fairly long debate about p-hacking and related questions about reproducibility, preregistration, and publication bias. During the course of this discussion a few interesting things came up. (I deliberately won’t name anyone as I think this complicates matters. People can comment and identify themselves if they feel that they should…)
It was suggested that a lot of the problems with science would be remedied effectively if only people were encouraged (or required?) to replicate their own findings before publication. Now that sounds generally like a good idea. I have previously suggested that this would work very well in combination with preregistration: you first do a (semi-)exploratory experiment to finalise the protocol, then submit a preregistration of your hypothesis and methods, and then do the whole thing again as a replication (or perhaps more than one if you want to test several boundary conditions or parameters). You then submit the final set of results for publication. Under the Registered Report format, your preregistered protocol would already undergo peer review. This would ensure that the final results are almost certain to be published provided you didn’t stray excessively from the preregistered design. So far, so good.
Should you publish unclear results?
Or is it? Someone suggested that it would be a problem if your self-replication didn’t show the same thing as the original experiment. What should one do in this case? Doesn’t publishing something incoherent like this, one significant finding and a failed replication, just add to the noise in the literature?
At first, this question simply baffled me, as I suspect it would many of the folks campaigning to improve science. (My evil twin sister called these people Crusaders for True Science but I’m not supposed to use derogatory terms like that anymore, nor should I impersonate lady demons for that matter. Most people from both sides of this mudslinging contest “debate” never seemed to understand that I’m also a revolutionary – you might just say that I’m more Proudhon, Bakunin, or Henry David Thoreau than Marx, Lenin, or Che Guevara. But I digress…)
Surely, the attitude that unclear, incoherent findings, that is, those that are more likely to be null results, are not worth publishing must contribute to the prevailing publication bias in the scientific literature? Surely, this view is counterproductive to the aims of science to accumulate evidence and gradually get closer to some universal truths? We must know which hypotheses have been supported by experimental data and which haven’t. One of the most important lessons I learned from one of my long-term mentors was that all good experiments should be published regardless of what they show. This doesn’t mean you should publish every single pilot experiment you ever did that didn’t work. (We can talk about what that does and doesn’t mean another time. But you know how life is: sometimes you think you have some great idea only to realise that it makes no sense at all when you actually try it in practice. Or maybe that’s just me? :P). Even with completed experiments you probably shouldn’t bother publishing if you realise afterwards that it is all artifactual or the result of some error. Hopefully you don’t have a lot of data sets like that though. So provided you did an experiment of suitable quality I believe you should publish it rather than hiding it in the proverbial file drawer. All scientific knowledge should be part of the scientific record.
I naively assumed that this view was self-evident and shared by almost everyone – but this clearly is not the case. Yet instead of sneering at such alternative opinions I believe we should understand why people hold them. There are reasonable arguments why one might wish not to publish every unclear finding. The person making this suggestion at our discussion said that it is difficult to interpret a null result, especially an assumed null result like this. If your original experiment O showed a significant effect supporting your hypothesis but your replication experiment R does not, you cannot simply conclude that the effect doesn’t exist. For one thing, you need to be more specific than that. If O showed a significant positive effect but R showed a significant negative one, this would be more consistent with the null hypothesis than if O were highly significant (p < 10⁻³⁰) and R just barely missed the threshold (p = 0.051).
So let’s assume that we are talking about the former scenario. Even then things aren’t as straightforward, especially if R isn’t as exact a replication of O as you might have liked. If there is any doubt (and usually there is) that something could have been different in R than in O, this could be one of the hidden factors people always like to talk about in these discussions. Now, you hopefully know your data better than anyone. If experiment O was largely exploratory and you tried various things to see what works best (dare we say p-hacking again?), then the odds are probably quite good that a significant non-replication in the opposite direction shows the effect was just a fluke. But this is a probabilistic statement, not a natural law. You can never know for certain whether the original effect was real, especially not from such a limited data set of two non-independent experiments.
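This intuition can be checked with a quick simulation sketch (a hypothetical illustration with simulated data, not any real study): under the null hypothesis, a spuriously significant original experiment O is about as likely to be followed by a significant replication R in the opposite direction as in the same direction, whereas with a genuine effect, same-direction replications vastly outnumber sign flips.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sims = 20, 0.05, 5000

def replication_outcomes(true_effect):
    """Simulate pairs of experiments O and R (one-sample t-tests, n=20).
    Among pairs where O is significant, return the proportions of Rs that
    are significant in the same vs the opposite direction."""
    same = opposite = sig_o = 0
    for _ in range(n_sims):
        o = rng.normal(true_effect, 1, n)
        r = rng.normal(true_effect, 1, n)
        t_o, p_o = stats.ttest_1samp(o, 0)
        t_r, p_r = stats.ttest_1samp(r, 0)
        if p_o < alpha:
            sig_o += 1
            if p_r < alpha:
                if np.sign(t_r) == np.sign(t_o):
                    same += 1
                else:
                    opposite += 1
    return same / sig_o, opposite / sig_o

# No true effect: same- and opposite-direction "replications" of a
# spuriously significant O are both rare (around alpha/2 each).
same_null, opp_null = replication_outcomes(0.0)

# Medium true effect (d = 0.5): same-direction replications dominate
# and sign flips essentially never happen.
same_real, opp_real = replication_outcomes(0.5)
```

The point of the sketch is simply that two experiments cannot settle the question: even under the null, O being significant tells you little, and the pattern of R only shifts the probabilities.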
This is precisely why you should publish all results!
In my view, it is inherently dangerous if researchers decide for themselves which findings are important and which are not. It is not only a question of publishing only significant results. It applies much more broadly to the situation when a researcher publishes only results that support their pet theory but ignores or hides those that do not. I’d like to believe that most scientists don’t engage in this sort of behaviour – but sadly it is probably not uncommon. A way to counteract this is to train researchers to think of ways that test alternative hypotheses that make opposite predictions. However, such so-called “strong inference” is not always feasible. And even when it is, the two alternatives are not always equally interesting, which in turn means that people may still become emotionally attached to one hypothesis.
The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments. If you have defensible doubts that the data are actually rubbish (say, an fMRI data set littered with spikes, distortions, and excessive motion artifacts, or a social psychology study where you discovered post hoc that all the participants were illiterate and couldn’t read the questionnaires) then by all means throw them in the bin. But unless you have such a good reason, you should never do this and should instead add the results to the scientific record.
Now the suggestion during our debate was that such inconclusive findings clog up the record with unnecessary noise. There is an enormous and constantly growing scientific literature. As it is, it is becoming increasingly hard to separate the wheat from the chaff. I can barely keep up with the continuous feed of new publications in my field, and I am missing a lot. Total information overload. So from that point of view the notion makes sense that only those studies that meet a certain threshold for being conclusive should be accepted as part of the scientific record.
I can certainly relate to this fear. For the same reason I am sceptical of proposals that papers should be published before review and all decisions about the quality and interest of some piece of research, including the whole peer review process, should be entirely post-publication. Some people even seem to think that the line between scientific publication and science blog should be blurred beyond recognition. I don’t agree with this. I don’t think that rating systems like those used on Amazon or IMDb are an ideal way to evaluate scientific research. It doesn’t sound wise to me to assess scientific discoveries and medical breakthroughs in the same way we rank our entertainment and retail products. And that is not even talking about unleashing the horror of internet comment sections onto peer review…
Solving the (false) dilemma
I think this discussion is creating a false dichotomy. These are not mutually exclusive options. The solution to a low signal-to-noise ratio in the scientific literature is not to maintain a publication bias towards significant results. Rather, the solution is to improve our filtering mechanisms. As I just described, I don’t think it will be sufficient to employ online shopping and social network procedures to rank the scientific literature. Even in the best-case scenario this is likely to highlight the results of authors who are socially dominant or popular, and probably also those who are particularly unpopular or controversial. It does not necessarily mean that the highest quality research floats to the top [cue obvious joke about what kind of things float to the top…].
No, a high quality filter requires some organisation. I am convinced the scientific community can organise itself very well to create these mechanisms without too much outside influence. (I told you I’m Thoreau and Proudhon, not some insane Chaos Worshipper :P). We need some form of acceptance to the record. As I outlined previously, we should reorganise the entire publication process so that the whole peer-review process is transparent and public. It should be completely separate from journals. The journals’ only job should be to select interesting manuscripts and to publish short summary versions of them in order to communicate particularly exciting results to the broader community. But this peer-review should still involve a “pre-publication” stage – in the sense that the initial manuscript should not generate an enormous amount of undue interest before it has been properly vetted. To reiterate (because people always misunderstand that): the “vetting” should be completely public. Everyone should be able to see all the reviews, all the editorial decisions, and the whole evolution of the manuscript. If anyone has any particular insight to share about the study, by all means they should be free to do so. But there should be some editorial process. Someone should chase potential reviewers to ensure the process takes off at all.
The good news about all this is that it benefits you. Instead of weeping bitterly and considering quitting science because yet again you didn’t find the result you hypothesised, this just means that you get to publish more research. Taking the focus off novel, controversial, special, cool or otherwise “important” results should also help make peer review more about the quality and meticulousness of the methods. Peer review should be about ensuring that the science is sound. In current practice it instead often resembles a battle, with authors defending to the death their claims about the significance of their findings against the reviewers’ scepticism. Scepticism is important in science, but this kind of scepticism is completely unnecessary when people are not incentivised to overstate the importance of their results.
Practice what you preach
I honestly haven’t followed all of the suggestions I make here. Neither have many other people who talk about improving science. I know of vocal proponents of preregistration who have yet to preregister any study of their own. The reasons for this are complex. Of course, you should “be the change you wish to see in the world” (I’m told Gandhi said this). But it’s not always that simple.
On the whole though I think I have published almost all of the research I’ve done. While I currently have a lot of unpublished results, there is very little in the file drawer as most of these experiments have either been submitted or are being written up for eventual publication. There are two exceptions. One is a student project that produced somewhat inconclusive results, although I would say it is a conceptual replication of a published study by others. The main reason we haven’t tried to publish this yet is that the student isn’t here anymore and hasn’t been in contact, and the data aren’t exciting enough for us to bother with the hassle of publication (and it is a hassle!).
The other data set is perhaps ironic because it is a perfect example of the scenario I described earlier. A few years ago when I started a new postdoc I was asked to replicate an experiment a previous lab member had done. For simplicity, let’s just call this colleague Dr Toffee. Again, they can identify themselves if they wish. The main reason for this was that reviewers had asked Dr Toffee to collect eye-movement data. So I replicated the original experiment but added eye-tracking. My replication wasn’t an exact one in the strictest terms because I decided to code the experimental protocol from scratch (this was a lot easier). I also had to use a different stimulus setup than the previous experiment as that wouldn’t have worked with the eye-tracker. Still, I did my best to match the conditions in all other ways.
My results were a highly significant effect in the opposite direction than the original finding. We did all the necessary checks to ensure that this wasn’t just a coding error etc. It seemed to be real. Dr Toffee and I discussed what to do about it and we eventually decided that we wouldn’t bother to publish this set of experiments. The original experiment had been conducted several years before my replication. Dr Toffee had moved on with their life. I on the other hand had done this experiment as a courtesy because I was asked to. It was very peripheral to my own research interests. So, as in the other example, we both felt that going through the publication process would have been a fairly big hassle for very little gain.
Now this is bad. Perhaps there is some other poor researcher, a student perhaps, who will do a similar experiment again and waste a lot of time on testing the hypothesis that, at least according to our incoherent results, is unlikely to be true. And perhaps they will also not publish their failure to support this hypothesis. The circle of null results continues…
But you need to pick your battles. We are all just human beings and we do not have unlimited (research) energy. With both of these lacklustre or incoherent results I mentioned (and these are literally the only completed experiments we haven’t at least begun to write up), undergoing the pain of submission → review → rejection → repeat is a daunting task that simply doesn’t seem worth it.
So what to do? Well, the solution is again what I described. The very reason the task of publishing these results isn’t worth our energy is everything that is wrong with the current publication process! In my dream world, in which I can simply write up a manuscript formatted in a way that pleases me and then upload this to the pre-print peer-review site, my life would be infinitely simpler. No more perusing dense journal websites for their guide to authors or hunting for the Zotero/Endnote/Whatever style to format the bibliography. No more submitting your files to one horribly designed, clunky journal website after another, checking the same stupid tick boxes, adding the same reviewer suggestions. No more rewriting your cover letters by changing the name of the journal. Certainly for my student’s project it would not be hard to do, as there is already a dissertation that can be used as a basis for the manuscript. Dr Toffee’s experiment and its contradictory replication might require a bit more work – but to be fair, even there a previous manuscript already exists. So all we’d need to add would be the modifications of the methods and the results of my replication. In a world where all you need to do is upload the manuscript and address some reviewers’ comments to ensure the quality of the science, this should be fairly little effort. In turn it would ensure that the file drawer is empty and we are all much more productive.
This world isn’t here yet, but there are journals that allow something that isn’t too far off from that, namely F1000Research and PeerJ (and The Winnower also counts, although the content there seems to be different and I don’t quite know how much review editing happens there). So, maybe I should email Dr Toffee now…
(* In case you didn’t get this from the previous 2700ish words: the answer to this question is unequivocally “No.”)
Because it’s so much more fun than the things I should really be doing (correcting student dissertations and responding to grant reviews) I read a long blog post entitled “Science isn’t broken” by Christie Aschwanden. In large parts this is a summary of the various controversies and “crises” that seem to have engulfed scientific research in recent years. The title is a direct response to an event I participated in recently at UCL. More importantly, I think it’s a really good read so I recommend it.
This post is a quick follow-up response to the general points raised there. As I tried to argue (probably not very coherently) at that event, I also don’t think science is broken. First of all, probably nobody seriously believes that the lofty concept of science, the scientific method (if there is one such thing), can even be broken. But even in more pragmatic terms, the human aspects of how science works are not broken either. My main point was that the very fact we are having these kinds of discussions, about how scientific research can be improved, is direct proof that science is in fact very healthy. This is what self-correction looks like.
If anything, the fact that there has been a recent surge of these kinds of debates shows that science has already improved a lot recently. After decades of complacency with the status quo there now seems to be real energy afoot to effect some changes. However, it is not the first time this happened (for example, the introduction of peer review would have been a similarly revolutionary time) and it will not be the last. Science will always need to be improved. If some day conventional wisdom were that our procedure is now perfect, that it cannot be improved anymore, that would be a tell-tale sign for me that I should do something else.
So instead of fretting over whether science is “broken” (No, it isn’t) or even whether it needs improvement (Yes, it does), what we should be talking about is specifically what really urgently needs improvement. Here is my short list. I am not proposing many solutions (except for point 1). I’d be happy to hear suggestions:
I. Publishing and peer review
The way we publish and review seriously needs to change. We are wasting far too much time on trivialities instead of the science. The trivialities range from reformatting manuscripts to fit journal guidelines and uploading files on the practical side to chasing impact factors and “novel” research on the more abstract side. Both hurt research productivity although in different ways. I recently proposed a solution that combines some of the ideas by Dorothy Bishop and Micah Allen (and no doubt many others).
II. Post-publication review
Related to this, the way we evaluate and discuss published science needs to change, too. We need to encourage more post-publication review. This currently still doesn’t happen as most studies never receive any post-pub review or get commented on at all. Sure, some (including some of my own) probably just don’t deserve any attention, but how will you know unless somebody tells you the study even exists? Many precious gems will be missed that way. This has of course always been the case in science but we should try to minimise that problem. Some believe post-publication review is all we will ever need but unless there are robust mechanisms to attract reviewers to new manuscripts besides the authors’ fame, (un-)popularity, and/or their social media presence – none of which are good scientific arguments – I can’t see how a post-pub only system can change this. On this note I should mention that Tal Yarkoni, with whom I’ve had some discussions about this issue, wrote an article about this which presents some suggestions. I am not entirely convinced of the arguments he makes for enhancing post-publication review but I need more time to respond to this in detail. So I will just point this out for now to any interested reader.
III. Research funding and hiring decisions
Above all, what seriously needs to change is how we allocate research funds and how we make hiring decisions. The solution to that probably goes hand in hand with solving the other two points, but I think it also requires direct action now in the absence of good solutions for the other issues. We must stop judging grant and job applicants based on impact factors or h-indices. This is certainly more easily done for job applications than for grant decisions, as in the latter the volume of applications is much greater – and the expertise of the panel members in judging the applications is lower. But it should be possible to reduce the reliance on metrics and ratings – even newer, more refined ones. Also, grant applications shouldn’t be killed by a single off-hand critical review comment. Most importantly, grants shouldn’t all be written in a way that devalues exploratory research (by pretending to have strong hypotheses when you don’t) or – even worse – by pretending that the research you already conducted and are ready to publish is a “preliminary pilot data set.” For work that actually is hypothesis driven I quite like Dorothy Bishop’s idea that research funds could be obtained at the pre-registration stage, when the theoretical background and experimental design have been established but before data collection commences. Realistically, this is probably more suitable for larger experimental programs than for every single study. But then again, encouraging larger, more thorough projects may in fact be a good thing.
Enough of this political squabble and twitter war (twar?) and back to the “real” world of neuroneuroticism. Last year I sadly agreed (for various reasons) to act as corresponding author on one of our studies. I also have this statistics pet project that I want to try to publish as a single-author paper. Both of these experiences reminded me of something I have long known:
I seriously hate submitting manuscripts and the whole peer review and publication process.
The way publishing currently works, authors are encouraged to climb down the rungs of the impact factor ladder, starting at whatever journal they feel is sufficiently general interest and high impact to take their manuscript and then gradually working their way down through editorial and/or scientific rejections until it is eventually accepted by, as the rejection letters from high impact journals like to put it, a “more specialised journal.” At each step one battles with an online submission system that competes for the least user-friendly webpage of the year award, and you repeat the same actions: uploading your manuscript files, suggesting reviewers, and checking that the PDF conversion worked. Before you can do these you of course need to format your manuscript into the style the journal expects, with the right kind of citations and the various sections in the correct place. You also modify the cover letter to the editors, in which you hype up the importance of the work rather than letting the research speak for itself, to adjust it to the particular journal you are submitting to. All of this takes precious time and has very little to do with research.
Because I absolutely loathe having to do this sort of mindless work, I have long tried to outsource this to my postdocs and students as much as I can. I don’t need to be corresponding author on all my research. Of course, this doesn’t absolve you from being involved with the somewhat more important decisions, such as rewriting the manuscripts and drafting the cover letters. More importantly, while this may help my own peace of mind, it just makes somebody else suffer. It is not a real solution.
The truth is that this wasted time and effort would be far better used for doing science and ensuring that the study is of the best possible quality. I have long felt that the entire publication process should be remodelled so that these things are no longer a drain on researchers’ time and sanity. I am by far not the first person to suggest a publication model like this. For instance, Micah Allen mentioned very similar ideas on his blog and, more recently, Dorothy Bishop made a passionate proposal to get rid of journals altogether. Both touched on many of the same points and partly inspired my own thoughts on this.
Centralised review platform
Some people think that all peer review could be post-publication. I don’t believe this is a good idea – depending on what you regard as publication. I think we need some sort of fundamental vetting procedure before a scientific study is indexed and regarded as part of the scientific record. I fear that without some expert scrutiny we will become swamped with poor quality outputs that make it impossible to separate the wheat from the chaff. Post-publication peer review alone is not enough to find the needles in the haystack. If there is so much material out there that most studies never get read, let alone reviewed or even receive comments, this isn’t going to work. By having some traditional review prior to “acceptance” in which experts are invited to review the manuscript – and reminded to do so – we can at least ensure that every manuscript will be read by someone. Nobody is stopping you from turning blog posts into actual publications. Daniël Lakens has a whole series of blog posts that have turned into peer reviewed publications.
A key feature of this pre-publication peer review though should be that it all takes place in a central place completely divorced from any of the traditional journals. All the judgements on the scientific quality of a study requires expert reviewers and editors but there should be no evaluation of the novelty or “impact” of the research. It should be all about the scientific details to ensure that the work is robust. The manuscript should be as detailed as necessary to replicate a study (and the hypotheses and protocols can be pre-registered – a peer review system of pre-registered protocols is certainly an option in this system).
Ideally this review should involve access to the data and materials so that reviewers can try to replicate the findings presented in the study. Most expert reviewers rarely reanalyse data even when they are available; many people simply do not have the time to get that deeply involved in a review. An interesting possible solution to this dilemma was suggested to me recently by Lee de-Wit: there could be reviewers whose primary role is to check the data and try to reproduce the analysed results based on the documentation. These data reviewers would likely be junior researchers, that is, PhD students and perhaps junior postdocs. It would provide an opportunity to learn about reviewing and also to become known to editors. There is presently huge variability in the stage of their career at which researchers start reviewing manuscripts: while some people begin reviewing even as graduate students, others hardly seem to review even after several years of postdoc experience. This idea could help close that gap.
Another in my mind essential key aspect should be that reviews are transparent. That is, all the review comments should be public and the various revisions of the manuscript should be accessible. Ideally, the platform allows easy navigation between the changes so that it is straightforward to simply look at the current/final product and filter out the tracked changes – but equally easy to blend in the comments.
It remains a very controversial and polarising issue whether reviewers’ names should be public as well. I haven’t come to a final conclusion on that. There are certainly arguments for both. One of the reasons many people dislike the idea of mandatory signed reviews is that it could put junior researchers at a disadvantage. It may discourage them from writing critical reviews of the work of senior colleagues, the people who make hiring and funding decisions. Reviewer anonymity can protect against that, but it can also lead to biased, overly harsh, and sometimes outright nasty reviews. It also has the odd effect of creating a reviewer guessing game. People often display a surprising level of confidence in who they “know” their anonymous reviewers were – and I would bet they are often wrong. In fact, I know of at least one case where this sort of false belief resulted in years of animosity directed at the presumed reviewer and even their students. Publishing reviewer names would put an end to this sort of nonsense. It also encourages people to be more polite. Editors at F1000Research (a journal with a completely transparent review process) told me that they frequently ask reviewers to check whether they are prepared to publish the review in the state they submitted it, because it will be associated with their name – and many then decide to edit their comments to tone down the hostility.
However, I think even anonymous reviews could go a long way, provided that reviewer comments are public. Since the content of the review is subject to public scrutiny, it is in the reviewer’s, and even more so the editor’s, interest to ensure reviews are fair and of suitable quality. Reviews of poor quality or with potential political motivation could easily be flagged up and result in public discussion. I believe it was Chris Chambers who recently suggested a compromise in which tenured scientists must sign their reviews while junior researchers, who still live at the mercy of senior colleagues, have the option to remain anonymous. I think this idea has merit, although even tenured researchers can suffer from political and personal biases, so I am not sure it really protects against those problems.
One argument sometimes made against anonymous reviews is that they prevent people from taking credit for their reviewing work. I don’t think this is true. Anonymous reviews are nevertheless associated with a reviewer’s digital account, and ratings of review quality, reliability, and so on could easily be quantified that way. (In fact, this is precisely what websites like Publons are already doing.)
Novelty, impact, and traditional journals
So what happens next? Let’s assume a manuscript passes this initial peer review. It then enters the official scientific record and is indexed on PubMed and Google Scholar. Perhaps it could follow the example of F1000Research in that the title of the study itself contains an indication that it has been accepted/approved by peer review.
This is where it gets complicated. A lot of the ideas I discussed are already implemented to some extent by journals like F1000Research, PeerJ, or the Frontiers brand. What these implementations lack is a single, centralised platform for reviews. And although I think a single platform would be preferable to avoid confusion and splintering, even a handful of venues for scientific review could probably work.
However, what these systems currently do not provide is the role still played by the high-impact, traditional publishers: filtering the enormous volume of scientific work to select ground-breaking, timely, and important research findings. There is a lot of hostility towards this aspect of scientific publishing. It often seems completely arbitrary, obsessed with temporary fads and shallow buzzwords. I think for many researchers the implicit or even explicit pressure to publish as much “high impact” work as possible to sustain their careers contributes to this. It isn’t entirely clear to me how much of this pressure is real and how much is an illusion. Certainly some grant applications still require you to list impact factors and citation numbers (which are directly linked to impact factors) to support your case.
Whatever you may think about this (and I personally agree that it has many negative effects and can be extremely annoying), the filtering and sorting by high-impact journals also has its benefits. The short-format publications, brief communications, and perspective articles in these journals make work much more accessible to wider audiences, and I think there is some point in highlighting new, creative, surprising, and/or controversial findings over incremental follow-up research. While published research should provide detailed methods and well-scrutinised results, there are different audiences. When I read about findings in astronomy or particle physics, or even many studies from the biological sciences outside my area, I don’t typically read all the in-depth methods (nor would I understand them). An easily accessible article that appeals to a general scientific audience is certainly a nice way to communicate scientific findings. In the present system this is typically achieved by separating a general main text from Supplementary/Online sections that contain methods, additional results, and possibly even in-depth discussion.
This is where I think we should implement an explicit tier system. The initial research is published, after scientific peer review as discussed above, in the centralised repository of new manuscripts. These publications are written as traditional journal articles, complete with detailed methods and results. Novelty and impact play no role up to this stage. However, now the more conventional publishers come into play. Authors may want to write cover letters competing for the attention of higher-impact journals. Conversely, journal editors may want to contact authors of particularly interesting studies to ask them to submit a short-form article to their journal. There are several mechanisms by which new publications may come to the attention of journal editors. They could simply generate a strong social media buzz and lots of views, downloads, and citations. This in fact seems to be the basis of the Frontiers tier system. I think this is far from optimal because it highlights not necessarily the scientifically most valuable but the most sensational studies – which can happen for all sorts of reasons, such as making extraordinary claims or having titles that contain curse words. Rather, it would be ideal to highlight research that attracts a lot of post-publication review and discussion – though of course this still poses the question of how to encourage that.
In either case, the decision as to what constitutes novel, general-interest research remains up to editorial discretion, making it easier for traditional journals to accept this change. How these articles are accepted is still up to each journal. Some may not require any further peer review and simply ask for a copy-edited summary article. Others may want some additional peer review to keep the interpretation of these summaries in check. It is likely that these high-impact articles would be heavy on implications and wider interpretation while the original scientific publication has only brief discussion sections detailing the basic interpretation and elaborating on the limitations. Some peer review may help keep the authors honest at this stage. Importantly, instead of having endless online methods sections and (sometimes barely reviewed) supplementary materials, the full scientific detail of any study would be available within its original publication. The high-impact short-form article simply contains a link to that detailed publication.
One important aim this system would achieve is to ensure that research published as high impact will typically meet high thresholds of scientific quality. Our current publishing model still incentivises shoddy research because it emphasises novelty and speed of publication over quality. In the new system, every study would first have to pass a quality threshold. Novelty judgements should be entirely secondary to that.
How can we make this happen?
The biggest problem with all of these grand ideas we are kicking around is that it remains mostly unclear to me how we can actually effect this change. The notion that we can do away with traditional journals altogether sounds like a pipe dream to me, as it is diametrically opposed to the self-interest of traditional publishers and our current funding structures. While some great upheavals have already happened in scientific publishing, such as the now widespread adoption of open access publishing, I feel that many of these changes have simply occurred because traditional publishers realised that they can make considerable profit from open access charges.
I do hope that eventually the kinds of journals publishing short-form, general-interest articles to filter ground-breaking research from incremental, specialised work will not be for-profit publishers. There are already a few examples of traditional journals that are more community-driven, such as the Journal of Neuroscience and the Journal of Vision, and also eLife (not so much community-driven as driven by a research funder rather than a for-profit publishing house). I hope to see more of that in the future. Since many scientists seem to be idealists at heart, I think there is hope for that.
But in the meantime it seems necessary to work together with traditional publishing houses rather than antagonising them. I would think it shouldn’t be that difficult to convince some publishers that what now forms the supplementary materials and online methods in many high-impact journals could be proper publications in their own right. Journals that already have a system like the one I envision, e.g. F1000Research or PeerJ, could perhaps negotiate such deals with traditional journals. This need not be mutually exclusive but could simply apply to some articles published in these journals.
The main obstacle to do away with here is the, to my mind, obsolete notion that none of the results may have been published elsewhere. This is already no longer true in most cases anyway. Much research will have been published prior to journal publication in conference proceedings, and many traditional journals nowadays tolerate manuscripts uploaded to pre-print servers. The new aspect of the system I described is that there would actually be an assurance that the pre-published work has been peer reviewed properly, thus guaranteeing a certain level of quality.
I know there are probably many issues still to resolve with these ideas and I would love to hear them. However, I think this vision is not a dream but a distinct possibility. Let’s make it come true.
In recent months I have written a lot (and thought a lot more) about the replication crisis and the proliferation of direct replication attempts. I admit I haven’t bothered to quantify this, but my impression is that most of these attempts fail to reproduce the findings they try to replicate. I can understand why this is unsettling to many people. However, as I have argued before, I find the current replication movement somewhat misguided.
A big gaping hole where your theory should be
Over the past year I have also written far too much about Psi research. Most recently, I summarised my views in an uncharacteristically short post (by my standards) in reply to Jacob Jolij. But only very recently did I realise that my views on all of this actually converge on the same fundamental issue. On that note I would like to thank Malte Elson, with whom I discussed some of these issues at that Open Science event at UCL recently. Our conversation played a significant role in clarifying my thoughts.
My main problem with Psi research is that it has no firm theoretical basis and that the use of labels like “Psi” or “anomalous” or whatnot reveals that this line of research is simply about stating the obvious. There will always be unexplained data but that doesn’t prove any theory. It has now dawned on me that my discomfort with the current replication movement stems from the same problem: failed direct replications do not explain anything. They don’t provide any theoretical advance to our knowledge about the world.
I am certainly not the first person to say this. Jason Mitchell’s treatise about failed replications covered many of the same points. In my opinion it is unfortunate that these issues have been largely ignored by commenters. Instead his post has been widely maligned and ridiculed. In my mind, this reaction was not only uncivil but really quite counter-productive to the whole debate.
Why most published research findings are probably not waterfowl
A major problem with his argument was pointed out by Neuroskeptic: Mitchell seems to hold replication attempts to a different standard than original research. While I often wonder if it is easier to incompetently fail to replicate a result than to incompetently p-hack it into existence, I agree that it is not really feasible to take that into account. I believe science should err on the side of open-minded skepticism. Thus even though it is very easy to fail to replicate a finding, the only truly balanced view is to use the same standards for original and replication evidence alike.
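Incidentally, how easy it is to p-hack a null effect into "existence" can be demonstrated in a few lines. The sketch below simulates one well-known questionable practice, optional stopping (peeking at the data and testing repeatedly until significance is reached). It is my own toy illustration with arbitrarily chosen parameters, not an analysis from any of the studies discussed here:

```python
import numpy as np
from scipy import stats

def false_positive_rate(n_sims=2000, n_max=100, step=10, alpha=0.05,
                        peek=True, seed=0):
    """Proportion of null simulations that ever reach p < alpha.
    With peek=True a one-sample t-test is run after every `step`
    subjects (optional stopping); with peek=False it is run only
    once, at the final sample size n_max.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        data = rng.standard_normal(n_max)  # the true effect is zero
        checkpoints = range(step, n_max + 1, step) if peek else [n_max]
        for n in checkpoints:
            if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
                hits += 1
                break  # stop at the first "significant" peek
    return hits / n_sims

print(false_positive_rate(peek=False))  # close to the nominal 5%
print(false_positive_rate(peek=True))   # substantially inflated
```

With ten interim peeks the nominal 5% false positive rate roughly triples or quadruples. A replicator who commits to a fixed, preregistered sample size never gets this "boost", which is one concrete sense in which the two kinds of incompetence are not symmetric.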
Mitchell describes the problems with direct replications with a famous analogy: if you want to prove the existence of black swans, all it takes is to show one example. No matter how many white swans you may produce afterwards, they can never refute the original reports. However, in my mind this analogy is flawed. Most of the effects we study in psychology or neuroscience research are not black swans. A significant social priming effect or a structural brain-behaviour correlation is not irrefutable evidence that the effect is real.
Imagine that there really were no black swans. It is conceivable that someone might parade around a black swan, but maybe it’s all an elaborate hoax. Perhaps somebody just painted a white swan? Frauds of such a sensational nature are not unheard of in science, but most of us trust that they are nonetheless rare. More likely, the evidence could somehow be faulty. Perhaps the swan was spotted in poor lighting conditions making it appear black. Considering how many people can disagree about whether a photo depicts a blue-and-black or a white-and-gold dress, this possibility seems entirely conceivable. Thus simply showing a black swan is insufficient evidence.
On the other hand, Mitchell is entirely correct that parading a whole flock of white swans is also insufficient evidence against the existence of black swans. The same principle applies here: the evidence could also be faulty. If we only looked at swans native to Europe we would have a severe sampling bias. In the worst case, people might be photographing black swans under conditions that make them appear white.
On the wizardry of cooking social psychologists
This brings us to another oft-repeated argument about direct replications: perhaps the “replicators” are just incompetent or lacking in skill. Mitchell also has an analogy for this (which I unintentionally used in my previous post as well). Replicators may just be bad cooks who follow the recipes but nonetheless fail to produce meals that match the beautiful photographs in the cookbooks. In contrast, Neuroskeptic referred to this tongue-in-cheek as the Harry Potter Theory: only those blessed with magical powers are able to replicate. Inept “muggles” failing to replicate a social priming effect should just be ignored.
In my opinion both of these analogies are partly right. The cooking analogy correctly points out that simply following the recipe in a cookbook does not make you a master chef. However, it also ignores the fact that the beautiful photographs in a cookbook are frequently not entirely genuine. To my knowledge, many cookbook photos are actually of cold food, to circumvent problems like steam on the camera lens. Most likely the photos will have been doctored in some way and they will almost certainly be the best pick out of several cooking attempts and numerous photos. So while it is true that the cook was an expert and you probably aren’t, the photo does not necessarily depict a representative meal.
The jocular wizardry argument implies that anyone with a modicum of expertise in a research area should be able to replicate a research finding. As students we are taught that the methods sections of our research publications should allow anyone to replicate our experiments. But this is certainly not feasible: some level of expertise and background knowledge must be expected for a successful replication. I don’t think I could replicate any findings in radio astronomy, regardless of how well established they may be.
One frustration that many authors of non-replicated results have expressed to me (and elsewhere) is the implicit assumption by many “replicators” that social psychology research is easy. I am not a social psychologist. I have no idea how easy these experiments are, but I am willing to give people the benefit of the doubt here. It is possible that some replication attempts overlook critical aspects of the original experiments.
However, I think one of the key points of Neuroskeptic’s Harry Potter argument applies here: the validity of a “replicator’s” expertise, that is their ability to cast spells, cannot be contingent on their ability to produce these effects in the first place. This sort of reasoning seems circular and, appropriately enough, sounds like magical thinking.
How to fix our replicator malfunction
The way I see it both arguments carry some weight. I believe that muggles – sorry, replicators – should have to demonstrate their ability to do this kind of research properly in order for us to have any confidence in their failed wizardry. When it comes to the recent failure to replicate nearly half a dozen studies reporting structural brain-behaviour correlations, Ryota Kanai suggested that the replicators should have analysed the age dependence of grey matter density to confirm that their methods were sensitive enough to detect such well-established effects. Similarly, all the large-scale replication attempts in social psychology should contain such sanity checks. On a positive note, the Many Labs 3 project included a replication of the Stroop effect and similar objective tests that fulfil such a role.
However, while such clear-cut baselines are great, they are probably insufficient, in particular if the effect size of the “sanity check” is substantially greater than the effect of interest. Ideally, any replication attempt should have a theoretical basis: an alternative hypothesis to be tested that could explain the original findings. As I said previously, it is the absence of such theoretical considerations that makes most failed replications so unsatisfying to me.
The problem is that for a lot of replication attempts, whether of brain-behaviour correlations, social priming, or Bem’s precognition effects, the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking, and/or questionable research practices. This is mostly unfalsifiable. Perhaps these replication studies could incorporate control conditions/analyses to quantify the severity of p-hacking required to produce the original effects, but this is probably infeasible in practice because the parameter space of questionable research practices is so vast that it is impossible to derive a sufficiently accurate measure of them. In a sense, methods for detecting publication bias in meta-analysis are a way to estimate this, but the evidence they provide is only probabilistic, not experimental.
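For what it's worth, the meta-analytic bias checks mentioned above can be made concrete. One standard tool is Egger's regression test for funnel-plot asymmetry: regress each study's standardized effect on its precision; an intercept far from zero suggests that small, imprecise studies report systematically inflated effects. The sketch below is my own minimal implementation of that idea, not code from any particular meta-analysis package:

```python
import numpy as np
from scipy import stats

def egger_test(effects, standard_errors):
    """Egger's regression test for funnel-plot asymmetry.
    Regresses standardized effects (effect / SE) on precision (1 / SE);
    an intercept far from zero is a sign of small-study (publication)
    bias. Returns the intercept and its two-sided p-value.
    """
    effects = np.asarray(effects, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    z = effects / se          # standardized effects
    precision = 1.0 / se
    res = stats.linregress(precision, z)
    # linregress reports a p-value for the slope; compute the
    # t-statistic for the intercept by hand from the residuals
    n = len(z)
    resid = z - (res.intercept + res.slope * precision)
    s2 = np.sum(resid**2) / (n - 2)
    se_intercept = np.sqrt(s2 * (1 / n + precision.mean()**2 /
                                 np.sum((precision - precision.mean())**2)))
    t = res.intercept / se_intercept
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return res.intercept, p
```

As noted, such a test provides only probabilistic, not experimental, evidence: a significant intercept is consistent with publication bias but does not prove that any particular result was p-hacked.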
Of course this doesn’t mean that we cannot have replication attempts in the absence of a good alternative hypothesis. My mentors instilled in me the view that any properly conducted experiment should be published. It shouldn’t matter whether the results are positive, negative, or inconclusive. Publication bias is perhaps the most pervasive problem scientific research faces and we should seek to reduce it, not amplify it by restricting what should and shouldn’t be published.
Rather, I believe we must change the philosophy underlying our attempts to improve science. If you disbelieve the claims of many social priming studies (and honestly, I don’t blame you!) it would be far more convincing to test a hypothesis on why the entire theory is false than to show that some specific findings fail to replicate. It would also free up a lot of resources, currently spent dismantling implausible ideas, to actually advance scientific knowledge.
There is a reason why I haven’t tried to replicate “presentiment” experiments even though I have written about them. Well, to be honest the biggest reason is that my grant is actually quite specific as to what research I should be doing. However, if I were to replicate these findings I would want to test a reasonable hypothesis as to how they come about. I actually have some ideas how to do that, but in all honesty I simply find these effects so implausible that I don’t really feel like investing a lot of my time in testing them. Still, if I were to try a replication it would have to test an alternative theory, because a direct replication is simply insufficient. If my replication failed, it would confirm my prior beliefs but not explain anything. If it succeeded, I probably still wouldn’t believe the claims. In other words, we wouldn’t have learned very much either way.
All too often debates on best scientific practices descend into a chaotic mire of accusations, hurt feelings, and offended egos. As I mentioned in my previous post, I therefore decided to write a list of guidelines that I believe could help improve scientific discourse. It applies as much to replication attempts as to other scientific disagreements, say, when reanalysing someone else’s data or spotting a mistake they made.
Note that these are general points and should not be interpreted as relating to any particular case. Furthermore, I know I am no angel. I have broken many of these rules before and, fallible as I am, I may end up doing so again. I hope that I learn from them, but I hope they will also be useful for others. If you can think of additional rules please leave a comment!
1. Talk to the original authors
Don’t just send them a brusque email requesting data or with some basic questions about their paradigm (if you want to replicate their research). I mean actually talk and discuss with them, ideally in person at a conference or during a visit perhaps. Obviously this won’t always be possible. Either way, be open and honest about what you’re doing, why you want the data, why you want to replicate and how. Don’t be afraid to say that you find the results implausible, especially if you can name objective reasons for it.
One of my best experiences at a conference was when a man (whose name I unfortunately can’t remember) waited for me at my poster as I arrived at the beginning of my session. He said “I’m very skeptical of your results”. We had a very productive, friendly discussion and I believe it greatly improved my follow-up research.
2. Involve the original authors in your efforts
Suggest a collaboration to the original authors. It doesn’t need to be “adversarial” in the formal sense, although it could be if you clearly disagree. But either way, everyone will be better off if you work together instead of against each other. Publications are a currency in our field and I don’t see that changing anytime soon. I think you may be surprised how much more amenable many researchers will be to your replication/reanalysis effort if you both get a publication out of it.
3. Always try to be nice
If someone makes an error, point it out in an objective but friendly manner. Don’t let a submitted or published manuscript in which you ridicule their life’s work be the first time they hear about it. Avoid emotional language and snarky comments.
I know I have a sarcastic streak so I have been no saint when it comes to that. I may have learned from a true master…
4. Don’t accuse people without hard evidence
Give people the benefit of the doubt. Don’t just flatly insinuate that the original authors used questionable research practices, no matter how widespread such practices apparently are. In the end you don’t know that they engaged in bad practices unless you saw it yourself or some trustworthy whistle-blower told you. Statistics may indicate it but they don’t prove it.
5. Apologise even if (you think) you did no wrong
You might know this one if you’re married… We all make mistakes and slip up sometimes. We say things that come across as more offensive than intended. Sometimes we can find it very hard to understand what made the other person angry. Usually if you try you can empathise. It also doesn’t matter if you’re right. You should still say sorry because that is the right thing to do.
I am very sorry if I offended Eric-Jan Wagenmakers or any of his co-authors with my last post. Perhaps I spoke too harshly. I have the utmost respect for EJ and his colleagues and I do appreciate that they initiated the discussion we are having!
6. Admit that you could be wrong
We are all wrong, frequently. There are few things in life that I am very confident about. I believe scientists should be skeptical, including of their own beliefs. I remain somewhat surprised by the fervent conviction with which many scientists argue. I’d expect this from religious icons or political leaders. To me the best thing about being a scientist is the knowledge that we just don’t really know all that much.
I find it very hard to fathom that people can foretell the future or that their voting behaviour changes months later just because they saw a tiny flag in the corner of the screen. I just don’t think that’s plausible. But I don’t know for sure. It is possible that I’m wrong about this. It has happened before.
7. Don’t hound individuals
Replicating findings in the general field is fair enough. It is even fair enough if your replication is motivated by “I just cannot believe this!”, even though this somewhat biases your expectations, which I think is problematic. But if you replicate or reanalyse findings, try not to target the same person all the time. This looks petty, like a vendetta. And take a look around you: even if you haven’t tried to replicate any of Researcher X’s findings, there is a chance a lot of other people already have. Don’t pester that one person and force them into a multi-front war they can’t possibly win with their physical and mental health intact.
This is one of the reasons I removed my reanalysis of Bem 2011’s precognition data from my bootstrapping manuscript.
8. Think of the children!
In the world we currently live in, researchers’ livelihoods depend on their reputation, their publication record and citations, their grant income, and so on. Yes, I would love for grant and hiring committees to only value truth-seekers who do creative and rigorous research. But in many cases that’s not a reality (yet?) and so it shouldn’t be surprising when some people react with panic and anger when they are criticised. It’s an instinct. Try to understand that. Saying “everyone makes mistakes” or “replication is necessary” isn’t really enough. Giving them a way to keep their job and their honour might be (see rule 2).
9. Science should seek to explain
In my opinion the purpose of scientific research is to explain how the universe works (or even just that tiny part of the universe between your ears). This is what should motivate all our research, including the research that scrutinises and/or corrects previous claims. That’s why I am so skeptical of Many Labs and most direct replication projects in general. They do not explain anything. They are designed to prove the null hypothesis, which is conceptually impossible. It is fine to disbelieve big claims; in fact, that’s what scientists should do. But if you don’t believe some finding, think about why you don’t believe it and then seek to disprove it by showing how simpler explanations could have produced the same finding. Showing that you, following the same recipe as the original authors, failed to reproduce the same result is pretty weak evidence – it could just mean that you are a bad cook. Sometimes we cannot trust our own motivations. If you really disbelieve some finding, try to think what kind of evidence could possibly convince you that it is true. Then go and test it.
A few months ago a study from EJ Wagenmakers’ lab (Boekel et al. in Cortex) failed to replicate 17 structural brain-behaviour correlations reported in the published literature. The study was preregistered by uploading the study protocol to a blog and was thus what Wagenmakers generally refers to as “purely confirmatory“. As Wagenmakers is also a vocal proponent of Bayesian inferential methods, they used a one-tailed Bayesian hypothesis test to ask whether their replication evidence supported the original findings. A lot has already been written about the Boekel study and I was previously engaged in a discussion on it. Therefore, in the interest of brevity (and thus the time Alexander Etz needs to spend reading it :P) I will not cover all the details again but cut right to the chase. (It is pretty long anyway, despite my earlier promises…)
Ryota Kanai, author of several of the results Boekel et al. failed to replicate, has now published a response in which he reanalyses their replication data. He shows that at least one finding (a correlation between grey matter volume in the left SPL and a measure of distractibility as quantified by a questionnaire) replicates successfully if the same methods as his original study are used. In fact, while Kanai does not report these statistics, using the same Bayesian replication test for which Boekel reported “anecdotal” evidence* for the null hypothesis (r=0.22, BF10=0.8), Kanai’s reanalysis reveals “strong” evidence for the alternative hypothesis (r=0.48, BF10=28.1). This successful replication is further supported by a third study that replicated the finding in an independent sample (albeit with some of the same authors as the original study). Taken together, this suggests that at least for this finding the failure to replicate may be due to methodological differences rather than the original result being spurious.
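For readers unfamiliar with these numbers: a Bayes factor BF10 quantifies how much more likely the data are under the alternative hypothesis (a real correlation) than under the null. Boekel et al. used a one-tailed Jeffreys-style replication test; the sketch below instead uses the much cruder BIC approximation to the Bayes factor (after Wagenmakers, 2007), just to convey the logic. It is my own illustration, not the analysis from either paper:

```python
import numpy as np

def bf10_correlation_bic(x, y):
    """Approximate Bayes factor (BF10) for a correlation via the BIC
    approximation: compare a linear model of y on x against an
    intercept-only null model. BF10 > 1 favours a real correlation.
    This is NOT the exact replication test used by Boekel et al.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    # residual sum of squares under the null (intercept only)
    rss0 = np.sum((y - y.mean())**2)
    # residual sum of squares under the linear model
    slope, intercept = np.polyfit(x, y, 1)
    rss1 = np.sum((y - (intercept + slope * x))**2)
    # BIC = n*ln(RSS/n) + k*ln(n); terms shared by both models cancel
    bic0 = n * np.log(rss0 / n) + 1 * np.log(n)
    bic1 = n * np.log(rss1 / n) + 2 * np.log(n)
    return np.exp((bic0 - bic1) / 2)
```

On this scale a BF10 of about 0.8 (as Boekel et al. reported) barely favours the null, whereas a BF10 of about 28 (Kanai's reanalysis) is strong evidence for the alternative, which is why the two analyses lead to such different conclusions from the same data.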
Now, disagreements between scientists are common and essential to scientific progress. Replication is essential for healthy science. However, I feel that these days, as a field, psychology and neuroscience researchers are going about it in the wrong way. To me this case is a perfect illustration of these problems. In my next post I will summarise this one in a positive light by presenting ten simple rules for a good replication effort (and – hand on my heart – that one will be short!)
1. No such thing as “direct replication”
Recent years have seen the rise of numerous replication attempts with a particular emphasis on “direct” replications, that is, the attempt to exactly reproduce the experimental conditions that generated the original results. This is in contrast to “conceptual” replications in which a new experiment follows the spirit of a previous one but the actual parameters may be very different. So for instance a finding that exposing people to a tiny picture of the US flag influences their voting behaviour months in the future could be interpreted as conceptually replicating the result that people walk more slowly when they were primed with words describing the elderly.
However, I believe this dichotomy is false. The “directness” of a replication attempt is not categorical but exists on a spectrum. Sure, the examples of conceptual replications from the social priming literature are quite distinct from Boekel’s attempt to replicate the brain-behaviour correlations or the various Many Labs projects currently being undertaken with the aim of testing (or disproving?) the validity of social psychology research.
However, there is no such thing as a perfectly direct replication. The most direct replication would be an exact carbon copy of the original, with the same participants, tested at the same time in the same place under the exact same conditions. This is impossible, and nobody actually wants it, because it would be completely meaningless other than testing just how deterministic our universe really is. What people mean when they talk about direct replications is that they match the experimental conditions reasonably well but use an independent sample of participants and (ideally) independent experimenters. Just how “direct” the replication is depends on how closely matched the experimental parameters are. By that logic, I would call the replication attempt of Boekel et al. less direct than, say, Wagenmakers et al.’s replication of Bem’s precognition experiments: Boekel’s experiments did not match those of Kanai, at least, on a number of methodological points. However, even for the precognition replication Bem challenged Wagenmakers** on the directness of their methods because the replication attempt did not use the same software and stimuli as the original experiment.
Controversies like this reveal several issues. While you can strive to match the conditions of an original experiment as closely as possible, there will always be discrepancies. Ideally the original authors and the “replicators”*** can reach a consensus as to whether or not the discrepancies should matter. However, even if they do, this does not mean that they are unimportant. Saying that “original authors agreed to the protocol” means that a priori they made the assumption that methodological differences are insignificant. This does not mean that this assumption is correct. In the end this is an empirical question.
Discussions about failed replications are often contaminated with talk about “hidden moderators”, that is, unknown factors and discrepancies between the original experiment and the replication effort. As I pointed out under the guise of my alter ego****, I have little patience for this argument. It is counter-productive because there are always unknown factors. Saying that unknown factors can explain failures to replicate is an unfalsifiable hypothesis and a truism. The only thing that should matter in this situation is empirical evidence for additional factors. If you cannot demonstrate that your result hinges on a particular factor, this argument is completely meaningless. In the case of Bem’s precognition experiments, this could have been done by conducting an explicit experiment that compares the use of his materials with those used by Wagenmakers, ideally in the same group of participants. However, in the case of these brain-behaviour correlations this is precisely what Kanai did in his reply: he reanalysed Boekel’s data using the methods he had used originally and he found a different result. Importantly, this does not necessarily prove that Kanai’s theory about these results is correct. However, it clearly demonstrates that the failure to replicate was due to another factor that Boekel et al. did not take into account.
2. Misleading dichotomy
I also think the dichotomy between direct and conceptual replication is misleading. When people conduct “conceptual” replications the aim is different but equally important: direct replications (insofar as they exist) can test whether specific effects are reproducible, whereas conceptual replications are designed to test theories. Taking again the elderly-walking-speed and voting-behaviour priming examples from above, whether or not you believe that such experiments constitute compelling evidence for this idea, they are both experiments aiming to test the idea that subtle (subconscious?) information can influence people’s behaviour.
There is also a gradual spectrum for conceptual replication but here it depends on how general the overarching theory is that the replication seeks to test. These social priming examples clearly test a pretty diffuse theory of subconscious processing. By the same logic one could say that all of the 17 results scrutinised by Boekel test the theory that brain structure shares some common variance with behaviour. This theory is not only vague but so generic that it is almost meaningless. If you honestly doubt that there are any structural links between brain and behaviour, may I recommend checking some textbooks on brain lesions or neurodegenerative illness in your local library?
A more meaningful conceptual replication would be to show that the same grey matter volume in the SPL not only correlates with a cognitive failure questionnaire but with other, independent measures of distractibility. You could even go a step further and show that this brain area is somehow causally related to distraction. In fact, this is precisely what Kanai’s original study did.
I agree that replicating actual effects (i.e. what is called “direct” replication) is important because it can validate the existence of previous findings and – as I described earlier – help us identify the factors that modulate them. You may however also consider ways to improve your methodology. A single replication with a demonstrably better method (say, better model fits, higher signal-to-noise ratios, or more precise parameter estimates) is worth a hundred direct replications from a Many Labs project. In any of these cases, the directness of your replication will vary.
In the long run, however, conceptual replication that tests a larger overarching theory is more important than showing that a specific effect exists. The distinction between these two is very blurred though. It is important to know what factors modulate specific findings to derive a meaningful theory. Still, if we focus too much on Many Labs direct replication efforts, science will slow down to a snail’s pace and waste an enormous amount of resources (and taxpayer money). I feel that these experiments are largely designed to deconstruct the social priming theory in general. And sure, if the majority of these findings fail to replicate in repeated independent attempts, perhaps we can draw the conclusion that current theory is simply wrong. This happens a lot in science – just look at the history of phrenology or plate tectonics or our model of the solar system.
However, wouldn’t it be better to replace subconscious processing theory with a better model that actually describes what is really going on than to invest years of research funds and effort to prove the null hypothesis? As far as I can tell, the current working theory about social priming by most replicators is that social priming is all about questionable research practices, p-hacking, and publication bias. I know King Ioannidis and his army of Spartans show that the situation is dire***** – but I am not sure it is realistically that dire.
3. A fallacious power fallacy
Another issue with the Boekel replications, which is also discussed in Kanai’s response, is that the sample size they used was very small. For the finding that Kanai reanalysed the sample size was only 36. Across the 17 results they failed to replicate, their sample sizes ranged between 31 and 36. This is in stark contrast with the majority of the original studies, many of which used samples well above 100. Only for one of the replications, which was of one of their own findings, did Boekel et al. use a sample (n=31) that was actually larger than that of the original study (n=9). It seems generally accepted that larger samples are better, especially for replication attempts. A recent article recommended a sample size for replications two and a half times larger than the original. This may be a mathematical rule of thumb but it is hardly realistic, especially for neuroimaging experiments.
Thus I can understand why Boekel et al. couldn’t possibly have done their experiment on hundreds of participants. However, you would at the very least expect a direct replication effort to try to match the sample of the original study, not use one that is four times smaller. In our online discussions Wagenmakers explained the small sample by saying that they “simply lacked the financial resources” to collect more data. I do not think this is a very compelling argument. Using the same logic I could build a lego version of the Large Hadron Collider in my living room, fail to find the Higgs boson, and then claim that my inadequate methodology was due to the lack of several billion dollars in my bank account******.
I must admit I sympathise a little with Wagenmakers here because it isn’t as if I never had to collect more data for an experiment than I had planned (usually this sort of optional stopping happens at the behest of anonymous peer reviewers). But surely you can’t just set out to replicate somebody’s research, using a preregistered protocol no less, with a wholly inadequate sample size? As a matter of fact, their preregistered protocol states that the structural data for this project (which is the expensive part) had already been collected previously and that the maximum sample of 36 was pre-planned. While they left “open the possibility of testing additional participants” they opted not to do so, even though the evidence for half of the 17 findings remained inconclusive (more on this below). Presumably this was, as they say, because they ran “out of time, money, or patience.”
In the online discussion Wagenmakers further states that power is a pre-experimental concept and refers to another publication by him and others in which they describe a “power fallacy.” I hope I am piecing together their argument accurately in my own head. Essentially, statistical power tells you how probable it is that you can detect evidence for a given effect with your planned sample size. It thus quantifies the probabilities across all possible outcomes given these parameters. I ran a simulation to do this for the aforementioned correlation between left SPL grey matter and cognitive failure questionnaire scores: I drew 10,000 samples of 36 participants each from a bivariate Gaussian distribution with a correlation of rho=0.38 (i.e. the observed correlation coefficient in Kanai’s study). I also repeated this for the null hypothesis, drawing similar samples from an uncorrelated Gaussian distribution. The histograms in the figure below show the distributions of the 10,000 Bayes factors calculated using the same replication test used by Boekel et al., for the alternative hypothesis in red and the null hypothesis in blue.
Out of those 10,000 simulations in the red curve only about 62% would pass the criterion for “anecdotal” evidence of BF10=3. Thus even if the effect size originally reported by Kanai’s study had been a perfect estimate of the true population effect (which is highly improbable) only in somewhat less than two thirds of replicate experiments should you expect conclusive evidence supporting the alternative hypothesis. The peak of the red distribution is in fact very close to the anecdotal criterion. With the exception of the study by Xu et al. (which I am in no position to discuss) this result was in fact one of the most highly powered experiments in Boekel’s study: as I showed in the online discussion the peaks of expected Bayes factors of the other correlations were all below the anecdotal criterion. To me this suggests that the pre-planned power of these replication experiments was wholly insufficient to give the replication a fighting chance.
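For readers who want to play with this themselves, here is a rough sketch of such a simulation in Python. Note this is not the replication Bayes factor test Boekel et al. actually used; as a stand-in it uses Jeffreys’ classic approximation to the default Bayes factor for a Pearson correlation, so the exact percentages will differ somewhat from those reported above:

```python
import numpy as np

rng = np.random.default_rng(1)

def bf10_approx(r, n):
    # Jeffreys' approximation to the default Bayes factor for a Pearson
    # correlation: BF01 ~ sqrt((2n-3)/pi) * (1-r^2)^((n-4)/2)
    bf01 = np.sqrt((2 * n - 3) / np.pi) * (1 - r ** 2) ** ((n - 4) / 2)
    return 1.0 / bf01

def simulate_bfs(rho, n=36, n_sims=10_000):
    # Repeatedly sample n participants from a bivariate Gaussian with
    # population correlation rho, then compute the approximate Bayes
    # factor for each observed sample correlation
    cov = [[1.0, rho], [rho, 1.0]]
    samples = rng.multivariate_normal([0.0, 0.0], cov, size=(n_sims, n))
    rs = np.array([np.corrcoef(s[:, 0], s[:, 1])[0, 1] for s in samples])
    return bf10_approx(rs, n)

bfs_h1 = simulate_bfs(0.38)  # alternative: the originally reported effect size
bfs_h0 = simulate_bfs(0.0)   # null: no correlation in the population

# Proportion of simulated replications clearing the "anecdotal" criterion BF10=3
print("H1:", (bfs_h1 > 3).mean(), "H0:", (bfs_h0 > 3).mean())
```

Even this crude version shows the point: with n=36 a substantial fraction of replications of a true rho=0.38 effect fail to produce even anecdotal evidence for it.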
Wagenmakers’ reasoning behind the “power fallacy”, however, is that after the experiment is completed power is a meaningless concept. It doesn’t matter what potential effect sizes (and thus Bayesian evidence) one could have obtained if one repeated the experiment infinitely; what matters are the results and the evidence one actually did find. It is certainly true that a low-powered experiment can produce conclusive evidence in favour of a hypothesis – for example the proportion of simulations at the far right end of the red curve would very compellingly support H1, while those simulations forming the peak of the blue curve would afford reasonable confidence that the null hypothesis is true. Conversely, a high-powered experiment can still fail to provide conclusive evidence. This essentially seems to be Wagenmakers’ argument of the power fallacy: just because an experiment had low power doesn’t necessarily mean that its results are uninterpretable.
However, in my opinion this argument serves to obfuscate the issue. I don’t believe that Wagenmakers is doing this on purpose but I think that he has himself fallen victim to a logical fallacy. It is a non-sequitur: while it is true that low-powered experiments can produce conclusive evidence, this does not mean that the evidence they actually produced is conclusive. In fact, it is the beauty of Bayesian inference that it allows quantification of the strength of evidence. The evidence Boekel et al. observed was inconclusive (“anecdotal”) in 9 of the 17 replications. Only in 3 was the evidence for the null hypothesis anywhere close to “strong” (i.e. below 1/10 or very close to it).
Imagine you want to test if a coin is biased. You flip it once and it comes up heads. What can we conclude from this? Absolutely nothing. Even though the experiment has been completed it was obviously underpowered. The nice thing about Bayesian inference is that it reflects that fact.
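The single coin flip makes this easy to verify: under the fair-coin hypothesis the probability of heads is 0.5, and under a biased-coin hypothesis with a uniform prior on the bias, the marginal probability of heads is also 0.5, so the Bayes factor is exactly 1. A minimal sketch:

```python
from scipy.integrate import quad

# H0: the coin is fair, so P(heads) = 0.5
p_heads_h0 = 0.5

# H1: unknown bias theta with a uniform prior on [0, 1];
# the marginal likelihood of one "heads" is the integral of theta
p_heads_h1, _ = quad(lambda theta: theta, 0.0, 1.0)

bf01 = p_heads_h0 / p_heads_h1
print(bf01)  # 1.0: a single flip carries no evidence either way
```

A Bayes factor of 1 is the formal statement of “absolutely nothing”: the data are exactly as probable under either hypothesis.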
4. Interpreting (replication) evidence
You can’t have it both ways. You either take Bayesian inference to the logical conclusion and interpret the evidence you get according to Bayesian theory or you shouldn’t use it. Bayes factor analysis has the potential to be a perfect tool for statistical inference. Had Boekel et al. observed a correlation coefficient near 0 in the replication of the distractibility correlation they would have been right to conclude (in the context of their test) that the evidence supports the null hypothesis with reasonable confidence.
Now a close reading of Boekel’s study shows that the authors were in fact very careful in how they worded the interpretation of their results. They say that they “were unable to successfully replicate any of these 17 correlations”. This is entirely correct in the context of their analysis. What they do not say, however, is that they were also unable to refute the previously reported effects even though this was the case for over half of their results.
Unfortunately, this sort of subtlety is entirely lost on most people. The reaction of many commenters on the aforementioned blog post, on social media, and in personal communications was to interpret this replication study as a demonstration that these structural brain-behaviour correlations have been conclusively disproved. This is in spite of the statement in the actual article that “a single replication cannot be conclusive in terms of confirmation or refutation of a finding.” On social media I heard people say that “this is precisely what we need more of.” And you can almost feel the unspoken, gleeful satisfaction of many commenters that yet more findings by some famous and successful researchers have been “debunked.”
Do we really need more low-powered replication attempts and more inconclusive evidence? As I described above, a solid replication attempt can actually inform us about the factors governing a particular effect, which in turn can help us formulate better theories. This is what we need more of. We need more studies that test assumptions but that also take all the available evidence into account. Many of these 17 brain-behaviour correlations originally came with internal replications in the original studies. As far as I can tell these were not incorporated in Boekel’s analysis (although they were mentioned). For some of the results independent replications – or at least related studies – had already been published, and it seems odd that Boekel et al. didn’t discuss at least those that had appeared months before their own study.
Also some results, like Kanai’s distractibility correlation, were accompanied in the original paper by additional tests of the causal link between the brain area and behaviour. In my mind, from a scientific perspective it is far more important to test those questions in detail rather than simply asking whether the original MRI results can be reproduced.
5. Communicating replication efforts
I think there is also a more general problem with how the results of replication efforts are communicated. Replication should be a natural component of scientific research. All too often failed replications result in mudslinging contests, heated debates, and sometimes in inappropriate comparisons of replication authors with video game characters. Some talk about how the reputation of the original authors is hurt by failed replication.
It shouldn’t have to be this way. Good scientists also produce non-replicable results and even geniuses can believe in erroneous theories. However, the way our publishing and funding system works as well as our general human emotions predispose us to having these unfortunate disagreements.
I don’t think you can solely place the blame for such arguments on the authors of the original studies. Because scientists are human beings, the way you talk to them influences how they will respond. Personally I think that the reports of many high profile replication failures suffer from a lack of social awareness. In that sense the discussion surrounding the Boekel replications has actually been very amicable. There have been far worse cases where the whole research programmes of some authors were denigrated and ridiculed on social media, sometimes while the replication efforts were still on-going. I’m not going to delve into that. Perhaps one of the Neuroscience Devils wants to pick up that torch in the future.
However, even the Boekel study shows how this communication could have been done with more tact. The first sentences of the Boekel article read as follows:
“A recent ‘crisis of confidence’ has emerged in the empirical sciences. Several studies have suggested that questionable research practices (QRPs) such as optional stopping and selective publication may be relatively widespread. These QRPs can result in a high proportion of false-positive findings, decreasing the reliability and replicability of research output.”
I know what Boekel et al. are trying to say here. EJ Wagenmakers has a declared agenda to promote “purely confirmatory” research in which experimental protocols are preregistered. There is nothing wrong with this per se. However, surely the choice of language here is odd? Preregistration is not the most relevant part of the Boekel study; it could have been done without it. It is fine to argue in the article for why it is necessary, but to actually open the article with a discussion of the replication crisis in the context of questionable research practices can very easily be (mis?)interpreted as an accusation. Whatever the intentions may have been, starting the article in this manner immediately places a spark of doubt in the reader’s mind and primes them to consider the original studies as being of a dubious nature. In fact, in the online debate Wagenmakers went a step further to suggest (perhaps somewhat tongue-in-cheek) that:
“For this particular line of research (brain-behavior correlations) I’d like to suggest an exploration-safeguard principle (ESP): after collecting the imaging data, researchers are free to analyze these data however they see fit. Data cleaning, outlier rejection, noise reduction: this is all perfectly legitimate and even desirable. Crucially, however, the behavioral measures are not available until after completion of the neuroimaging data analysis. This can be ensured by collecting the behavioral data in a later session, or by collaborating with a second party that holds the behavioral data in reserve until the imaging analysis is complete. This kind of ESP is something I can believe in.”
This certainly sounds somewhat accusatory to me. Quite frankly, it is a bit offensive. I am all in favour of scientific skepticism but this is not the same as baseless suspicion. Having once been on the receiving end of a particularly bad reviewer 2 who made similar unsubstantiated accusations (and in fact ignored evidence to the contrary), I can relate to people who would be angered by that. For one thing, such procedures are common in many labs conducting experiments like this. Having worked with Ryota Kanai in the past, I have a fairly good idea of the meticulousness of his research. I also have great respect for EJ Wagenmakers and I don’t think he actually meant to offend anyone. Still, I think it could easily happen with statements like this and I think it speaks for Kanai’s character that he didn’t take offense here.
There is a better way. This recently published failure to replicate a link between visually induced gamma oscillation frequency and resting occipital GABA concentration is a perfect example of a well-written replication failure. There is no paranoid language about replication crises and p-hacking but a simple, factual account of the research question and the results. In my opinion this exposition certainly facilitated the rather calm reaction to this publication.
6. Don’t hide behind preregistration
Of course, the question about optional stopping and outcome-dependent analysis (I think that term was coined by Tal Yarkoni) could be avoided by preregistering the experimental protocols (in fact, at least some of these original experiments were almost certainly preregistered in departmental project reviews). Contrary to what some may think, I am not opposed to preregistration as such. In fact, I fully intend to try it.
However, there is a big problem with this, which Kanai also discusses in his response. As a peer reviewer, he actually recommended that Boekel et al. use the same analysis pipeline he has now employed to test for the effects. The reason Boekel et al. did not do this is that these methods were not part of the preregistered protocol. However, this did not stop them from employing other non-registered methods, which they report as exploratory analyses. In fact, we are frequently told that preregistration does not preclude exploration. So why not here?
Moreover, preregistration is in the first instance designed to help authors control the flexibility of their experimental procedure. It should not be used as a justification to refuse performing essential analyses when reviewers ask for them. In this case, a cynic might say that Boekel et al. in fact did these analyses and chose not to report them because the results were inconsistent with the message they wanted to argue. Now I do not believe this to be the case but it’s an example of how unfounded accusations can go both ways in these discussions.
If this is how preregistration is handled in the future, we are in danger of slowing down scientific progress substantially. If Boekel et al. had performed these additional analyses (which should have been part of the originally preregistered protocol in the first place), this would have saved Kanai the time to do them himself. Both he and Boekel et al. could have done something more productive with their time (and so could I, for that matter :P).
It doesn’t have to go this way but we must be careful. If we allow this line of reasoning with preregistration, we may be able to stop the Texas sharpshooter from bragging but we will also break his rifle. It will then take much longer and more ammunition to finally hit the bull’s-eye than necessary.
*) I actually dislike categorical labels for Bayesian evidence. I don’t think we need them.
**) This is a pre-print manuscript. It keeps changing with on-going peer review so this statement may no longer be true when you read this.
***) Replicators is a very stupid word but I can’t think of a better, more concise one.
****) Actually this post was my big slip-up as Devil’s Neuroscientist. In that one a lot of Sam’s personality shone through, especially in the last paragraph.
*****) I should add that I am merely talking about the armies of people pointing out the proneness of false positives. I am not implying that all these researchers I linked to here agree with one another.
******) To be fair, I probably wouldn’t be able to find the Higgs boson even if I had the LHC.
Recently I participated in an event organised by PhD students in Experimental Psychology at UCL called “Is Science broken?”. It involved a lively panel discussion in which the panel members answered many questions from the audience about how we feel science can be improved. The opening of the event was a talk by Chris Chambers of Cardiff University about Registered Reports (RR), a new publishing format in which researchers preregister their introduction and methods sections before any data collection takes place. Peer review takes place in two stages: first, reviewers evaluate the appropriateness of the question and the proposed experimental design and analysis procedures, and then, after data collection and analysis have been completed and the results are known, peer review continues to finalise the study for publication. This approach aims to make scientific publishing independent of the pressure to get perfect results or to change one’s apparent hypothesis depending on the outcome of the experiments.
Chris’ talk was in large part a question and answer session addressing specific concerns with the RR approach that had been raised at other talks or in writing. Most of these questions he (and his coauthors) had already answered in a similar way in a published FAQ paper. However, it was nice to see him talk so passionately about this topic. Speaking for myself, at least, I can say that seeing a person argue their case is usually far more compelling than reading an article on it – even though the latter will in the end probably have a wider reach.
Here I want to raise some additional questions about the answers Chris (and others) have given to some of these specific concerns. The purpose in doing so is not to “bring about death by a thousand cuts” to the RR concept as Aidan Horner calls it. I completely agree with Aidan that many concerns people have with RR (and lighter forms of preregistration) are probably logistical. It may well be that some people just really want to oppose this idea and are looking for any little reason to use as an excuse. However, I think both sides of the debate about this issue have suffered from a focus on potentials rather than fact. We simply won’t know how much good or bad preregistration will do for science unless we try it. This seems to be a concept that everyone at the discussion was very much in agreement on and we all discussed ways in which we could actually assess the evidence for whether RRs improved science over the next few decades.
Therefore I want it to be clear that the points I raise are not an ideological opposition to preregistration. Rather they are points where I found the answers Chris describes not entirely satisfying. I very much believe that preregistration must be tried, but I want to provoke some thought about possible problems with it. The sooner we are aware of these issues, the sooner they can be fixed.
Wouldn’t reviewers rely even more on the authors’ reputation?
In Stage 1 of an RR, when only the scientific question and experimental design are reviewed, reviewers have little to go on to evaluate the protocol. Provided that the logic of the question and the quality of the design are evident, they would hopefully be able to make some informed decisions about it. However, I think it is a bit naive to assume that the reputation of the authors isn’t going to influence the reviewers’ judgements. I have heard of many grant reviews asking whether the authors would be capable of pulling off the proposed research. There is an extensive research literature on how the evaluation of identical exam papers or job applications or whatnot can be influenced by factors like the subject’s gender or name. I don’t think simply saying “Author reputation is not among” the review criteria is enough of a safeguard.
I also don’t think that having a double-blind review system is necessarily a good way to protect against this. There have been wider discussions about the shortcomings of double-blind review and this situation is no different. In many situations you could easily guess the authors’ identity from the experimental protocol alone. And double-blind review suffers even more from one of the main problems with anonymous reviewers (which I generally support): when reviewers guess the authors’ identities incorrectly, that could have even worse consequences because their decision will be based on an incorrect assessment of the authors.
Can’t people preregister experiments they have already completed?
The general answer here is that this would constitute fraud. The RR format would also require time stamped data files and lab logs to guarantee that data were produced only after the protocol has been registered. Both of these points are undeniably true. However, while there may be such a thing as an absolute ethical ideal, in the end a lot of our ethics are probably governed by majority consensus. The fact that many questionable research practices are apparently so widespread is presumably just that: while most people deep down understand that these things are “not ideal”, they may nonetheless engage in them because they feel that “everybody else does it.”
For instance, I often hear that people submit grant proposals for experiments they have already completed, although I have personally never seen this with any grant proposal. I have also heard that it is perhaps more common in the US, but based on all these anecdotes it may be generally widespread. Surely this is also fraudulent, but nevertheless people apparently do it?
Regarding time stamped data, I also don’t know if this is necessarily a sufficiently strong safeguard. For the most part, time stamps are pretty easy to “adjust”. Crucially, I don’t think many reviewers or post-publication commenters will go through the trouble of checking them. Faking time stamps is certainly deep into fraud territory but people’s ethical views are probably not all black and white. I could easily see some people bending the rules just that little, especially if preregistered studies become a new gold standard in the scientific community.
Now perhaps this is a bit too pessimistic a view of our colleagues. I agree that we probably should not exaggerate this concern. But given the concerns people have with questionable research practices now I am not entirely sure we can really just dismiss this possibility by saying that this would be fraud.
Finally, another answer to this concern is that preregistering your experiment after the fact would backfire because the authors could then not implement any changes suggested by reviewers in Stage 1. However, this only applies to changes in the experimental design, the stimuli or apparatus etc. The most confusing corners in the “garden of forking paths” are usually the analysis procedure, not the design. There are only so many ways to run a simple experiment and most minor changes suggested by a reviewer could easily be dismissed by authors. However, changes to the analysis approach could quite easily be implemented after the results are known.
Reviewers could steal my preregistered protocol and scoop me
I agree that this concern is not overly realistic. In fact, I don’t even believe the fear of being scooped is overly realistic. I’m sure it happens (and there are some historical examples) but it is far rarer than most people believe. Certainly it is rather unlikely for a reviewer to do this. For one thing, time is usually simply not on their side. There is a lot that could be written about the fear of getting scooped and I might do that at some future point. But this is outside the scope of this post.
Whatever its actual prevalence, the paranoia (or boogieman) of scooping is clearly widespread. Until we find a way to allay this fear I am not sure that it will be enough to tell people that the Manuscript Received date of a preregistered protocol would clearly show who had the idea sooner. First of all, the Received date doesn’t really tell you when somebody had an idea. The “scooper” could always argue that they had the idea before that date but only ended up submitting the study afterwards (and I am sure that actually happens fairly often without scooping).
More importantly though, one of the main reasons people are afraid of being scooped is the pressure to publish in high impact journals. Having a high impact publication has greater currency than the Received date of an RR in what is most likely a lower impact journal. I doubt many people would actually check the date unless you specifically point it out to them. We already have a problem with people not reading the publications of job and grant applicants but relying instead on metrics like impact factors and h-indices. I don’t easily see them looking through that information.
As a junior researcher I must publish in high impact journals
I think this is an interesting issue. I would love nothing more than for us to stop caring about who published what where. At the same time I think there is a role for high impact journals like Nature, Science or Neuron (seriously folks, PNAS doesn’t belong in that list – even if you don’t boycott it like me…). I would like the judgement of scientific quality and merit to be completely divorced from sensationalism, novelty, and news that quite likely isn’t the whole story. I don’t know how to encourage that change, though. Perhaps RRs can help with it, but I am not sure they’re enough. Either way, it may be a foolish and irrational fear, but I know that as a junior scientist I (and my postdocs and students) currently do seek to publish at least some (but not exclusively) “high impact” research to be successful. But I digress.
Chris et al. write that sooner or later high impact outlets will probably come on board with offering RRs. I honestly don’t see that happening, at least not without a much more wide-ranging change in culture. I think RRs are a great format for specialised journals to have. However, the top impact journals primarily exist for publishing exciting results (whatever that means). I don’t think they will be keen to open the floodgates to submissions that aim to test exciting ideas but fail to deliver the results to match them. What I could perhaps see is a system in which a journal like Nature would review a protocol and agree to publish it in its flagship journal if the results are positive, but in a lower-impact outlet (e.g. Nature Communications) if the results are negative. The problem with this idea is that it goes against the egalitarian philosophy of the current RR proposals: publication would again depend on the outcome of the experiments.
Registered Reports are incompatible with short student projects
After all the previous fairly negative points, this one is actually about a positive aspect of science. For me it is one of the greatest concerns. In my mind this is a very valid worry, and Chris and colleagues acknowledge it in their article as well. I think RRs would be a viable solution for experiments by a PhD student, but for master’s students, who are typically around for only a few months, it is simply not realistic to first submit a protocol and revise it over weeks or months of review before even collecting the first data.
A possible solution suggested for this problem is that you could design the experiments and have them approved by peer reviewers before the students commence. I think this is a terrible idea. For me, perhaps the best part of supervising student projects in my lab is when we discuss the experimental design. The best students typically come with their own ideas and make critical suggestions and improvements to the procedure. Not only is this phase very enjoyable, but I think designing good experiments is one of the most critical skills for junior scientists to learn. Having the designs finalised before the students even step through the door of the lab would undermine that.
Perhaps for those cases it would make more sense to use light preregistration, that is, uploading your protocol to a timestamped archive without external review. But if RRs do become the new gold standard in the future, I would worry that this devalues the excellent research projects of many students.
As I said, these points are not meant to shoot down the concept of Registered Reports. Some of the points may not even be such enormous concerns at all. However, I hope that my questions provoke thought and that we can discuss ways to improve the concept further and find safeguards against these possible problems.
Sorry, this post was very long as usual, but there seems to be a lot to say. My next post will be short, though, I promise! 😉