In June 2016, the United Kingdom carried out a little study to test the hypothesis that it is the “will of the people” that the country should leave the European Union. The result favoured the Leave hypothesis, albeit with a really small effect size (1.89%). This finding came as a surprise to many, but, as so often, it is the most surprising results that have the most impact.
Accusations of p-hacking soon emerged. Not only was there a clear sampling bias but data thugs suggested that the results might have even been obtained by fraud. Nevertheless, the original publication was never retracted. What’s wrong with inflating the results a bit? Massaging data to fit a theory is not the worst sin! The history of science is rich with errors. Such studies can be of value if they offer new clarity in looking at phenomena.
In fact, the 2016 study did offer a lot of new ways to look at the situation. There was a fair amount of HARKing about what the result of the 2016 study actually meant. Prior to conducting the study, at conferences and in seminars, the proponents of the Leave hypothesis kept talking about the UK having a relationship with the EU like Norway and Switzerland. Yet somehow in the eventual publication of the 2016 findings, the authors had changed their tune. Now they professed that their hypothesis was obviously always that the UK should leave the EU without any deal whatsoever.
Sceptics of the Leave hypothesis pointed out various problems with this idea. For one thing, leaving the EU without a deal wasn’t a very plausible hypothesis. There were thousands of little factors to be considered and it seemed unlikely that this was really the will of the people. And of course, the nitpickers also said that “barely more than half” could never be considered the “will of the people”.
Almost immediately, there were calls for a replication to confirm that the “will of the people” really was what believers in the Leave-without-a-deal hypothesis claimed. At first, these voices came only from a ragtag band of second stringers – but as time went on and more and more people realised just how implausible the Leave hypothesis really was, their numbers grew.
Leavers, however, firmly disagreed. To them, a direct replication was meaningless. That was odd, for some of them had openly admitted they wanted to p-hack the hell out of this thing until they got the result they wanted. But now they claimed that there had by now been several conceptual replications of the 2016 results, first in the United States and later also in Brazil – and, some might argue, even in Italy, Hungary, and Poland. Similar results were also found in several other European countries, albeit not statistically significant ones. Based on all this evidence, surely a meta-analysis supported the general hypothesis?
But the replicators weren’t dissuaded. The more radical among these methodological terrorists posited that any study in which the experimental design isn’t clearly defined and preregistered prior to data collection is inherently exploratory, and cannot be used to test any hypotheses. They instead called for a preregistered replication, ideally a Registered Report where the methods are peer-reviewed and the manuscript is in principle accepted for publication before data collection even commences. The fact that the 2016 study didn’t do this was just one of its many problems. But people still cite it simply because of its novelty. The replicators also pointed to other research fields, like Switzerland and Ireland, where this approach has long been used very successfully.
As an added twist, it turns out that nobody actually read the background literature. The 2016 study was already a replication attempt of previous findings from 1975. Sure, some people had vaguely heard about this earlier study. Everybody who has ever been to a conference knows that there is always one white-haired emeritus professor in the audience who will shout out “But I already did this four decades ago!”. But nobody really bothered to read this original study until now. It found an enormous result in the opposite direction, 17.23% in favour of remaining in Europe. As some commentators suggested, the population at large may have changed over the past four decades, or there may have been subtle but important differences in the methodology. What if leaving Europe then meant something different from what it means now? But if that were the case, couldn’t leaving Europe in 2016 also have meant something different than it does in 2019?
But the Leave proponents wouldn’t have any of that. They had already invested too much money and effort and spent all this time giving TED talks about their shiny little theory to give up now. They were in fact desperately afraid of a direct replication because they knew that, as with most replications, it would probably end in a null result and their beautiful theoretical construct would collapse like a house of cards. Deep inside, most of these people already knew they were chasing a phantom but they couldn’t ever admit it. People like Professor BoJo, Dr Moggy, and Micky “The Class Clown” Gove had built their whole careers on this Leave idea and so they defended the “will of the people” with religious zeal. The last straw they clutched at was to warn that all these failures to replicate would cause irreparable damage to the public’s faith in science.
Only Nigel Farage, unaffiliated garden gnome and self-styled “irreverent citizen scientist”, relented somewhat. Naturally, he claimed he would be doing all that just for science and the pursuit of the truth and that the result of this replication would be even clearer than the 2016 finding. But in truth, he smelled the danger on the wind. He knew that should the Leave hypothesis be finally accepted by consensus, he would be reduced to a complete irrelevance. What was more, he would not get that hefty paycheck.
As of today, the situation remains unresolved. The preregistered replication attempt is still stuck in editorial triage and hasn’t even been sent out for peer review yet. But meanwhile, people in the corridors of power in Westminster and Brussels and Tokyo and wherever else are already basing their decisions on the really weak and poor and quite possibly fraudulent data from the flawed 2016 study. But then, it’s all about the flair, isn’t it?
I have stayed out of the Wansink saga for the most part. If you don’t know what this is about, I suggest reading about this case on Retraction Watch. I had a few private conversations about this with Nick Brown, who has been one of the people instrumental in bringing about a whole series of retractions of Wansink’s publications. I have a marginal interest in some of Wansink’s famous research, specifically whether the size of plates can influence how much a person eats, because I have a broader interest in the interplay between perception and behaviour.
But none of that is particularly important. The short story is that considerable irregularities have been discovered in a string of Wansink’s publications, many of which have since been retracted. The whole affair first kicked off with a fundamental own-goal of a blog post (now removed, so I’m linking to Gelman’s coverage instead) he wrote in which he essentially seemed to promote p-hacking. Since then the problems that came to light ranged from irregularities (or outright impossibility) in some of the data he reported and evidence of questionable research practices, such as cherry-picking or excluding data, to widespread self-plagiarism. Arguably, not all of these issues are equally damning and for some the evidence is more tenuous than for others – but the sheer quantity of problems is egregious. The resulting retractions seem entirely justified.
Today I read an article on Times Higher Education entitled “Massaging data to fit a theory is not the worst research sin” by Martin Cohen, which discusses Wansink’s research sins in a broader context of the philosophy of science. The argument seems pretty muddled to me, so I am not entirely sure what the author’s point is – but the effective gist seems to be to shrug off concerns about questionable research practices and to suggest that Wansink’s research is still a meaningful contribution to science. In my mind, Cohen’s article reflects a fundamental misunderstanding of how science works and in places sounds positively post-Truthian. In the following, I will discuss some of the more curious claims made by this article.
“Massaging data to fit a theory is not the worst research sin”
I don’t know about the “worst” sin. I don’t even know if science can have “sins”, although this view has been popularised by Chris Chambers’s book and Neuroskeptic’s Circles of Scientific Hell. Note that “inventing data”, a.k.a. going Full-Stapel, is considered the worst affront to the scientific method in the latter worldview. “Massaging data” is perhaps not the same as outright making it up, but on the spectrum of data fabrication it is certainly trending in that direction.
Science is about seeking the truth. In Cohen’s words, “science should above all be about explanation”. It is about finding regularities, relationships, links, and eventually – if we’re lucky – laws of nature that help us make sense of a chaotic, complex world. Altering, cherry-picking, or “shoe-horning” data to fit your favourite interpretation is the exact opposite of that.
Now, the truth is that p-hacking, the garden of forking paths, and flexible outcome-contingent analyses all fall under this category. Such QRPs are extremely widespread and to some degree pervade most of the scientific literature. But just because it is common doesn’t mean it isn’t bad. Massaging data inevitably produces a scientific literature of skewed results. The only robust way to minimise these biases is through preregistration of experimental designs and confirmatory replications. We are working towards that becoming more commonplace – but in the absence of that it is still possible to do good and honest science.
In contrast, prolifically engaging in such dubious practices, as Wansink appears to have done, fundamentally undermines the validity of scientific research. It is not a minor misdemeanour.
“We forget too easily that the history of science is rich with errors”
I sympathise with the notion that science has always made errors. One of my favourite quotes about the scientific method is that it is about “finding better ways of being wrong.” But we need to be careful not to conflate some very different things here.
First of all, a better way of being wrong is an acknowledgement that science is never a done deal. We don’t just figure out the truth but constantly seek to home in on it. Our hypotheses and theories are constantly refined, hopefully by gradually becoming more correct, but there will also be occasional missteps down a blind alley.
But these “errors” are not at all the same thing as the practices Wansink appears to have engaged in. These were not mere mistakes. While the problems with many QRPs (like optional stopping) have long been underappreciated by many, a lot of the problems in Wansink’s retracted articles are quite deliberate distortions of scientific facts. For most, he could have and should have known better. This isn’t the same as simply getting things wrong.
The examples Cohen offers for the “rich errors” in past research are also not applicable. Miscalculating the age of the earth or presenting an incorrect equation are genuine mistakes. They might be based on incomplete or distorted knowledge. Publishing an incorrect hypothesis (e.g., that DNA is a triple helix) is not the same as mining data to confirm a hypothesis. It is perfectly valid to derive new hypotheses, even if they turn out to be completely false. For example, I might posit that gremlins cause the outdoor socket on my deck to fail. Sooner or later, a thorough empirical investigation will disprove this hypothesis and the evidence will support an alternative, such as that the wiring is faulty. The gremlin hypothesis may be false – and it is also highly implausible – but nothing stops me from formulating it. Wansink’s problem wasn’t with his hypotheses (some of which may indeed turn out to be true) but with the irregularities in the data he used to support them.
“Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them”
Ahm, no. Forming hypotheses before collecting data is how it’s supposed to work. Using Cohen’s “generous perspective”, this is indeed how hypothetico-deductive research works. How far this relates to Wansink’s “research sin” depends on what exactly is meant here by “searching for data to support” your hypotheses. If this implies you are deliberately looking for data that confirms your prior belief while ignoring or rejecting observations that contradict it, then that is not merely a questionable research practice, but antithetical to the whole scientific endeavour itself. It is also a perfect definition of confirmation bias, something that afflicts all human beings to some extent, scientists included. Scientists must find ways to protect themselves from fooling themselves in this way, and that entails constant vigilance and scepticism towards our own pet theories. In stark contrast, engaging in this behaviour actively and deliberately is not science but pure story-telling.
The critics are not merely “indulging themselves in a myth of neutral observers uncovering ‘facts'”. Quite to the contrary, I think Wansink’s critics are well aware of the human fallibility of scientists. People are rarely perfectly neutral when it pertains to hypotheses. Even when you are not emotionally invested in which one of multiple explanations for a phenomenon might be correct, they are frequently not equal in terms of how exciting it might be to confirm them. Finding gremlins under my deck would certainly be more interesting (and scary?) than evidence of faulty wiring.
But in the end, facts are facts. There are no “alternative facts”. Results are results. We can differ on how to interpret them but that doesn’t change the underlying data. Of course, some data are plainly wrong because they come from incorrect measurements, artifacts, or statistical flukes. These results are wrong. They aren’t facts even if we think of them as facts at the moment. Sooner or later, they will be refuted. That’s normal. But this is a far cry from deliberately misreporting or distorting facts.
“…studies like Wansink’s can be of value if they offer new clarity in looking at phenomena…”
This seems to be the crux of Cohen’s argument. Somehow, despite all the dubious and possibly fraudulent nature of his research, Wansink still makes a useful contribution to science. How exactly? What “new clarity” do we gain from cherry-picked results?
I can see, though, that Wansink may “stimulate ideas for future investigations”. There is no denying that he is a charismatic presenter and that some of his ideas were ingenious. I like the concept of self-filling soup bowls. I do think we must ask some critical questions about this experimental design, such as whether people can be truly unaware that the soup level doesn’t go down as they spoon it up. But the idea is neat and there is certainly scope for future research.
But don’t present this as some kind of virtue. By all means, give credit to him for developing a particular idea or a new experimental method. But please, let’s not pretend that this excuses the dubious and deliberate distortion of the scientific record. It does not justify the amount of money that has quite possibly been wasted on changing how people eat, or the advice given to schools based on false research. Deliberately telling untruths is not an error, it is called a lie.
The other day, my twitter feed got embroiled in another discussion about whether or not p-hacking is deliberate and if it constitutes fraud. Fortunately, I then immediately left for a trip abroad and away from my computer, so there was no danger of me being drawn into this debate too deeply and running the risk of owing Richard Morey another drink. However, now that I am back I wanted to elaborate a bit more on why I think the way our field has often approached p-hacking is both wrong and harmful.
What the hell is p-hacking anyway? When I google it I get this Wikipedia article, which uses it as a synonym for “data dredging”. There we already have a term that seems to me more appropriate. P-hacking refers to when you massage your data and analysis methods until your result reaches a statistically significant p-value. I will put it to you that in practice most p-hacking is not necessarily about hacking p’s but about dredging your data until your results fit a particular pattern. That pattern may be something you predicted but didn’t find, or it could even just be some chance finding that looked interesting and is amplified this way. The p-value is usually secondary to the act here. The end result may very well be the same in that you continue abusing the data until a finding becomes significant, but I would bet that in most cases what matters to people is not the p-value but the result. Moreover, while null-hypothesis significance testing with p-values is still by far the most widespread way to make inferences about results, it is not the only way. All this fussing about p-hacking glosses over the fact that the same analytic flexibility or data dredging can be applied to any inference, whether it is based on p-values, confidence intervals, Bayes factors, posterior probabilities, or simple summary statistics. By talking of p-hacking we create a caricature that this is somehow a problem specific to p-values. Whether or not NHST is the best approach for making statistical inferences is a (much bigger) debate for another day – but it has little to do with p-hacking.
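To make the dredging concrete, here is a minimal simulation of one common flavour of it: peeking at the data after every batch and stopping as soon as the test comes out significant. Everything here is invented for illustration (the batch sizes, the number of looks, a simple z-test with known variance); the null hypothesis is true by construction, so every “significant” result is a false positive:

```python
import math
import random

random.seed(1)

def z_test_p(sample):
    """Two-sided p-value for mean = 0, assuming a known SD of 1 (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def one_run(peek):
    """Collect up to 10 batches of 10 null observations.
    With peek=True, stop and declare victory as soon as p < .05."""
    data = []
    for _ in range(10):
        data += [random.gauss(0, 1) for _ in range(10)]
        if peek and z_test_p(data) < 0.05:
            return True  # 'significant' -- stop collecting and publish
    return z_test_p(data) < 0.05

runs = 5000
print(sum(one_run(True) for _ in range(runs)) / runs)   # inflated to roughly 0.19
print(sum(one_run(False) for _ in range(runs)) / runs)  # close to the nominal 0.05
```

The peeking strategy never fabricates a single data point, yet it roughly quadruples the false-positive rate – which is why optional stopping counts as dredging even though every observation is real.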
What is more, not only is p-hacking not really about p’s but it is also not really about hacking. Here is the dictionary entry for the term ‘hacking‘. I think we can safely assume that when people say p-hacking they don’t mean that peas are physically being chopped or cut or damaged in any way. I’d also hazard a guess that it’s not meant in the sense of “to deal or cope with” p-values. In fact, the only meaning of the term that seems to come even remotely close is this:
Obviously, what is being modified in p-hacking is the significance or impressiveness of a result, rather than a computer program or electronic device, but we can let this slide. I’d also suggest that it isn’t always done in a skillful or clever way either, but perhaps we can also ignore this. However, the verb ‘hacking’ to me implies that this is done in a very deliberate way. It may even, as with computer hacking, carry the connotation of fraud, of criminal intent. I believe neither of these things are true about p-hacking.
That is not to say that p-hacking isn’t deliberate. I believe in many situations it likely is. People no doubt make conscious decisions when they dig through their data. But the overwhelming majority of p-hacking is not deliberately done to create spurious results that the researcher knows to be false. Anyone who does so would be committing actual fraud. Rather, most p-hacking is the result of confirmation bias combined with analytical flexibility. This leads people to sleepwalk into creating false positives or – as Richard Feynman would have called it – fooling themselves. Simine Vazire already wrote an excellent post about this a few years ago (and you may see a former incarnation of yours truly in the comment section arguing against the point I’m making here… I’d like to claim that it’s because I have grown as a person but in truth I only exorcised this personality :P). I’d also guess that a lot of p-hacking happens out of ignorance, although that excuse really shouldn’t fly as easily in 2017 as it may have done in 2007. Nevertheless, I am pretty sure people do not normally p-hack because they want to publish false results.
Some may say that it doesn’t matter whether or not p-hacking is fraud – the outcome is the same: many published results are false. But in my view it’s not so simple. First, the solution to these two problems surely isn’t the same. Preregistration and transparency may very well solve the problem of analytical flexibility and data dredging – but it is not going to stop deliberate fraud, nor is it meant to. Second, actively conflating fraud and data dredging implicitly accuses researchers of being deliberately misleading and thus automatically puts them on the defensive. This is hardly a way to have a productive discussion and convince people to do something about p-hacking. You don’t have to look very far for examples of that playing out. Several protracted discussions on a certain Psychology Methods Facebook group come to mind…
Methodological flexibility is a real problem. We definitely should do something about it and new moves towards preregistration and data transparency are at least theoretically effective solutions to improve things. The really pernicious thing about p-hacking is that people are usually entirely unaware of the fact that they are doing it. Until you have tried to do a preregistered study, you don’t appreciate just how many forks in the road you passed along the way (I may blog about my own experiences with that at some point). So implying, however unintentionally, that people are fraudsters is not helping matters.
Preregistration and data sharing have gathered a lot of momentum over the past few years. Perhaps the opinions of some old tenured folks opposed to such approaches no longer carry so much weight now, regardless of how powerful they may be. But I’m not convinced that this is true. Just because there has been momentum now does not mean that these ideas will prevail. It is just as likely that they fizzle out due to a lack of enthusiasm or because people begin to feel that the effort isn’t worth it. It seems to me that “open science” very much exists in a bubble and I have bemoaned that before. To change scientific practices we need to open the hearts and minds of sceptics to why p-hacking is so pervasive. I don’t believe we will achieve that by preaching to them. Everybody p-hacks if left to their own devices. Preregistration and open data can help protect yourself against your mind’s natural tendency to perceive patterns in noise. A scientist’s training is all about developing techniques to counteract this tendency, and so open practices are just another tool for achieving that purpose.
Imagine you are a radio astronomer and you suddenly stumble across a signal from outer space that appears to be evidence of an extra-terrestrial intelligence. Let’s also assume you already confidently ruled out any trivial artifactual explanation to do with naturally occurring phenomena or defective measurements. How could you confirm that this signal isn’t simply a random fluke?
This is actually the premise of the novel Contact by Carl Sagan, which happens to be one of my favorite books (I never watched the movie but only caught the end which is nothing like the book so I wouldn’t recommend it…). The solution to this problem proposed in the book is that one should quantify how likely the observed putative extraterrestrial signal would be under the assumption that it is the product of random background radiation.
This is basically what a p-value in frequentist null hypothesis significance testing represents. Using frequentist inference requires that you have a pre-specified hypothesis and a pre-specified design. You should have an effect size in mind, determine how many measurements you need to achieve a particular statistical power, and then you must carry out this experiment precisely as planned. This is rarely how real science works and it is often put forth as one of the main arguments why we should preregister our experimental designs. Any analysis that wasn’t planned a priori is by definition exploratory. The most extreme form of this argument posits that any experiment that hasn’t been preregistered is exploratory. While I still find it hard to agree with this extremist position, it is certainly true that analytical flexibility distorts the inferences we can make about an observation.
This proposed frequentist solution is therefore inappropriate for confirming our extraterrestrial signal. Because the researcher stumbled across the signal, the analysis is by definition exploratory. Moreover, you must also beware of the base-rate fallacy: even an event that is extremely unlikely under the null hypothesis is not necessarily evidence against the null hypothesis. Even if p=0.00001, a true extraterrestrial signal may be even less likely a priori, say, p=10⁻¹⁰⁰. Even if extra-terrestrial signals are quite common, given the small amount of space, time, and EM bands we have studied thus far, how probable is it that we would just stumble across a meaningful signal?
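Bayes’ theorem makes the base-rate point concrete. The three numbers below are pure invention for illustration; the only point is that a tiny p-value cannot by itself overcome a tiny prior:

```python
# All three numbers are made up purely for illustration.
prior_signal = 1e-8        # assumed prior probability of a real ET signal in this scan
p_data_given_noise = 1e-5  # the 'p-value': chance of such a pattern from noise alone
p_data_given_signal = 0.9  # chance of seeing the pattern if the signal is real

# Bayes' theorem: P(signal | data)
posterior = (p_data_given_signal * prior_signal) / (
    p_data_given_signal * prior_signal
    + p_data_given_noise * (1 - prior_signal)
)
print(posterior)  # ~0.0009 -- despite p = 0.00001, it is almost certainly still noise
```

With these (admittedly extreme) priors, the “once-in-a-hundred-thousand” pattern still has less than a one-in-a-thousand chance of being a genuine signal.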
None of that means that exploratory results aren’t important. I think you’d agree that finding credible evidence of an extra-terrestrial intelligence capable of sending radio transmissions would be a major discovery. The other day I met up with Rob McIntosh, one of the editors for Registered Reports at Cortex, to discuss the distinction between exploratory and confirmatory research. A lot of the criticism of preregistration focuses on whether it puts too much emphasis on hypothesis-driven research and whether it in turn devalues or marginalizes exploratory studies. I have spent a lot of time thinking about this issue and (encouraged by discussions with many proponents of preregistration) I have come to the conclusion that the opposite is true: by emphasizing which parts of your research are confirmatory I believe exploration is actually valued more. The way scientific publishing conventionally works, many studies are written up in a way that pretends to be hypothesis-driven when in truth they weren’t. Probably for a lot of published research the truth lies somewhere in the middle.
So preregistration just keeps you honest with yourself and if anything it allows you to be more honest about how you explored the data. Nobody is saying that you can’t explore, and in fact I would argue you should always include some exploration. Whether it is an initial exploratory experiment that you did that you then replicate or test further in a registered experiment, or whether it is a posthoc robustness test you do to ensure that your registered result isn’t just an unforeseen artifact, some exploration is almost always necessary. “If we knew what we were doing, it would not be called research, would it?” (a quote by Albert Einstein, apparently).
One idea I discussed with Rob is whether there should be a publication format that specifically caters to exploration (Chris Chambers has also mentioned this idea previously). Such Exploratory Reports would allow researchers to publish interesting and surprising findings without first registering a hypothesis. You may think this sounds much like what many present-day high impact papers are like already. The key difference is that these Exploratory Reports would contain no inferential statistics and critically they are explicit about the fact that the research is exploratory – something that is rarely the case in conventional studies. However, this idea poses a critical challenge: on the one hand you want to ensure that the results presented in such a format are trustworthy. But how do you ensure this without inferential statistics?
Proponents of the New Statistics (which aren’t actually “new” and it is also questionable whether you should call them “statistics”) will tell you that you could just report the means/medians and confidence intervals, or perhaps the whole distributions of data. But that isn’t really helping. Inspecting confidence intervals and how far they are from zero (or another value of no interest) is effectively the same thing as a significance test. Even merely showing the distribution of observations isn’t really helping. If a result is so blatantly obvious that it convinces you by visual inspection (the “inter-ocular trauma test”), then formal statistical testing would be unnecessary anyway. If the results are even just a little subtler, it can be very difficult to decide whether the finding is interesting. So the way I see it, we either need a way to estimate statistical evidence, or you need to follow up the finding with a registered, confirmatory experiment that specifically seeks to replicate and/or further test the original exploratory finding.
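The equivalence is easy to verify. In this sketch (hypothetical numbers, normal approximation with a known standard deviation), a 95% confidence interval excludes zero exactly when the two-sided p-value drops below 0.05:

```python
import math

def ci95(mean, sd, n):
    """95% confidence interval for the mean (normal approximation)."""
    half = 1.96 * sd / math.sqrt(n)
    return (mean - half, mean + half)

def p_two_sided(mean, sd, n):
    """Two-sided p-value against a true mean of zero (z-test)."""
    z = mean / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# One 'significant' and one 'null' hypothetical sample of 50 observations.
for m in (0.30, 0.20):
    lo, hi = ci95(m, 1.0, 50)
    excludes_zero = lo > 0 or hi < 0
    significant = p_two_sided(m, 1.0, 50) < 0.05
    print(excludes_zero, significant)  # the two verdicts always agree
```

Reporting the interval instead of the p-value changes the presentation, not the inference.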
In the case of our extra-terrestrial signal you may plan a new measurement. You know the location in the sky where the signal came from, so part of your preregistered methods is to point your radio telescope at the same point. You also have an idea of the signal strength, which allows you to determine the number of measurements needed to have adequate statistical power. Then you carry out this experiment, sticking meticulously to your planned recipe. Finally, you report your result and the associated p-value.
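That power calculation can be sketched in a few lines. The effect size and error levels below are invented for illustration, and the normal approximation stands in for the t-test (and dedicated software) a real analysis would use:

```python
import math

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2))

def normal_quantile(p):
    """Invert the normal CDF by bisection (the stdlib has no inverse)."""
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def n_for_power(effect_size, alpha=0.05, power=0.8):
    """Measurements needed for a two-sided one-sample z-test,
    where effect_size is the expected signal strength in SD units."""
    z_a = normal_quantile(1 - alpha / 2)  # ~1.96 for alpha = .05
    z_b = normal_quantile(power)          # ~0.84 for 80% power
    return math.ceil(((z_a + z_b) / effect_size) ** 2)

print(n_for_power(0.5))  # 32 measurements for a medium-sized hypothetical signal
```

The stronger you expect the signal to be, the fewer measurements you need – which is exactly why the estimate of signal strength from the first observation matters for planning the replication.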
Sounds good in theory. In practice, however, this is not how science typically works. Maybe the signal isn’t continuous. There could be all sorts of reasons why the signal may only be intermittent, be it some interstellar dust clouds blocking the line of transmission, the transmitter pointing away from Earth due to the rotation of the aliens’ home planet, or even simply the fact that the aliens are operating their transmitter on a random schedule. We know nothing about what an alien species, let alone their civilization, may be like. Who is to say that they don’t just fall into random sleeping periods in irregular intervals?
So some exploratory, flexible analysis is almost always necessary. If you are too rigid in your approach, you are very likely to miss important discoveries. At the same time, you must be careful not to fool yourself. If we are really going down the route of Exploratory Reports without any statistical inference we need to come up with a good way to ensure that such exploratory findings aren’t mostly garbage. I think in the long run the only way to do so is to replicate and test results in confirmatory studies. But this could already be done as part of a Registered Report in which your design is preregistered. Experiment 1 would be exploratory without any statistical inference but simply reporting the basic pattern of results. Experiment 2 would then be preregistered and replicate or test the finding further.
However, Registered Reports can take a long time to publish. This may in fact be one of the weak points about this format that may stop the scientific community from becoming more enthusiastic about them. As long as there is no real incentive to doing slow science, the idea that you may take two or three years to publish one study is not going to appeal to many people. It will stop early career researchers from getting jobs and research funding. It also puts small labs in poorer universities at a considerable disadvantage compared to researchers with big grants, big data, and legions of research assistants.
The whole point of Exploratory Reports would be to quickly push out interesting observations. In some ways, this is then exactly what brief communications in high impact journals are currently for. I don’t think it will serve us well to replace snappy (and likely untrue) high impact findings based on inappropriate statistical inferences with snappy (and likely untrue) exploratory findings without any statistical inference. If the purpose of Exploratory Reports is solely to provide an outlet for quick publication of interesting results, we still have the same kind of skewed incentive structure as now. Also, while removing statistical inference from our exploratory findings may be better statistical practice, I am not convinced that it is better scientific practice unless we have other ways of ensuring that these exploratory results are kosher.
The way I see it, the only way around this dilemma is to finally stop treating publications as individual units. Science is by nature a lengthy, incremental process. Yes, we need exciting discoveries to drive science forward. At the same time, the replicability and robustness of our discoveries are critical. In order to combine these two needs I believe research findings should not be seen as separate morsels but as a web of interconnected results. A single Exploratory Report (or even a bunch of them) could serve as the starting point. But unless they are followed up by Registered Reports replicating or scrutinizing these findings further, they are not all that meaningful. Only once replications and follow-up experiments have been performed does the whole body of a finding take shape. A search on PubMed or Google Scholar would not merely spit out the original paper but a whole tree of linked experiments.
The perceived impact and value of a finding would thus be related to how much of an interconnected body of evidence it has generated rather than whether it was published in Nature or Science. Critically, this would allow people to quickly publish their exciting findings and thus avoid being deadlocked by endless review processes and disadvantaged compared to other people who can afford to do more open science. At the same time, they would be incentivized to conduct follow-up studies. And because the whole body of related literature is linked, others would also have an incentive to conduct replications or follow-up experiments on your exploratory finding.
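The "web of interconnected results" idea above can be sketched as a simple graph structure. This is purely an illustration of the proposal, not an existing system: all names (`EvidenceWeb`, the study labels) are invented, and a real implementation would live in the databases behind PubMed or Google Scholar.

```python
# Hypothetical sketch: findings are nodes, replications and follow-ups
# are edges, and a finding's weight is the size of the linked body of
# evidence rather than the venue it appeared in.

from collections import defaultdict

class EvidenceWeb:
    def __init__(self):
        # Maps a finding to the studies that replicate or extend it.
        self.follow_ups = defaultdict(list)

    def link(self, original, follow_up):
        """Register a replication or follow-up of an earlier finding."""
        self.follow_ups[original].append(follow_up)

    def body_size(self, finding):
        """Count all studies transitively linked to a finding."""
        total = 0
        for child in self.follow_ups[finding]:
            total += 1 + self.body_size(child)
        return total

web = EvidenceWeb()
web.link("exploratory_report_2016", "registered_replication_2017")
web.link("exploratory_report_2016", "registered_extension_2018")
web.link("registered_replication_2017", "follow_up_2019")

# The exploratory report's "value" grows with its evidence tree.
print(web.body_size("exploratory_report_2016"))  # → 3
```

A search engine built on such links could then rank the original exploratory finding by the size (and outcomes) of its tree rather than by journal prestige.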
There are obviously logistic and technical challenges with this idea. The current publication infrastructure still does not really allow for this to work. This is not a major problem however. It seems entirely feasible to implement such a system. The bigger challenge is how to convince the broader community and publishers and funders to take this on board.
A lot of young researchers are worried about being “scooped”. No, this is not about something unpleasantly kinky but about when some other lab publishes an experiment that is very similar to yours before you do. Sometimes this is even more than just a worry and it actually happens. I know that this could be depressing. You’ve invested months or years of work and sleepless nights in this project and then somebody else comes along and publishes something similar and – poof – all the novelty is gone. Your science career is over. You will never publish this high impact now. You won’t ever get a grant. Immeasurable effort down the drain. Might as well give up, sell your soul to the Devil, and get a slave job in the pharmaceutical industry and get rich.
Except that this is total crap. There is no such thing as being scooped in this way, or at least if there is, it is not the end of your scientific career. In this post I want to briefly explain why I think so. This won’t be a lecture on the merits of open science, on replications, on how we should care more about the truth than about novelty and “sexiness”. All of these things are undoubtedly true in my mind and they are things we as a community should be actively working to change – but this is no help to young scientists who are still trying to make a name for themselves in a system that continues to reward high impact publications over substance.
No. Here I will talk about this issue with respect to the status quo. Even in the current system, imperfect as it may be, this fear is in my view irrational and unfounded. It is essential to dispel these myths about impact and novelty, about how precedence is tied to your career prospects. Early career scientists are the future of science. How can we ever hope to change science for the better if we allow this sort of madness to live on in the next generation of scientists? I say ‘live on’ for good reason – I, too, used to suffer from this madness when I was a graduate student and postdoc.
Why did I have this madness? Honestly I couldn’t say. Perhaps it’s a natural evolution of young researchers, at least in our current system. People like to point the finger at the lab PIs pressuring you into this sort of crazy behaviour. But that wasn’t it for me. For most of my postdoc I worked with Geraint Rees at UCL and perhaps the best thing he ever told me was to fucking chill. He taught me – more by example than words – that while having a successful career was useful, what is much more important is to remember why you’re doing it: The point of having a (reasonably successful) science career is to be able to pay the rent/mortgage and take some enjoyment out of this life you’ve been given. The reason I do science, rather than making a mint in the pharma industry, is that I am genuinely curious and want to figure shit out.
Guess what? Neither of these things depends on whether somebody else publishes a similar (or even very similar) experiment while you’re still running it. We all know that novelty still matters to a lot of journals. Some have been very reluctant to publish replication attempts. I agree that publishing high impact papers does help wedge your foot in the door (that is, get you short-listed) in grant and job applications. But even if this were all that matters to be a successful scientist (and it really isn’t), here’s why you shouldn’t care too much about that anyway:
No paper was ever rejected because it was scooped
While journal editors will reject papers because they aren’t “novel,” I have never seen any paper rejected because somebody else published something similar a few months earlier. Most editors and reviewers will not even be aware of the scooping study. You may find this hard to believe because you think your own research is so timely and important, but statistically it is true. Of course, some reviewers will know of the work. But most reviewers are not actually bad people and will not say “Something like this was published three months ago already and therefore this is not interesting.” Again, you may find this hard to believe because we’ve all heard too many stories of Reviewer 2 being an asshole. But in the end, most people aren’t that big of an asshole. It happens quite frequently that I suggest in reviews that the authors cite some recently published work (usually not my own, in case you were wondering) that is very similar to theirs. In my experience this has never led to a rejection; I merely ask them to put their results in the context of similar findings in the literature. You know, the way a Discussion section should be.
No two scooped studies are the same
You may think that the scooper’s experiment was very similar, but unless they actually stole your idea (a whole different story I also don’t believe but I have no time for this now…) and essentially pre-replicated (preclicated?) your design, I’d bet that there are still significant differences. Your study has not lost any of its value because of this. And it’s certainly no reason to quit and/or be depressed.
It’s actually a compliment
Not 100% sure about this one. Scientific curiosity shouldn’t have anything to do with a popularity contest if you ask me. Study whatever the hell you want to (within ethical limits, that is). But I admit, it feels reassuring to me when other people agree that the research questions I am interested in are also interesting to them. For one thing, this means that they will appreciate you working and (eventually) publishing on it, which again from a pragmatic point of view means that you can pay those rents/mortgages. And from a simple vanity perspective it is also reassuring that you’re not completely mad for pursuing a particular research question.
It has little to do with publishing high impact
Honestly, from what I can tell neither precedence nor the popularity of your topic is the critical factor in getting your work into high impact journals. The novelty of your techniques, how surprising and/or reassuringly expected your results are, and the simplicity of the narrative are the major factors. Moreover, the place you work, the co-authors with whom you write your papers, and the accessibility of the writing (in particular your cover letter to the editors!) also matter a great deal (and these are not independent of the first points either…). It is quite possible that your “rival” will publish first, but that doesn’t mean you won’t publish similar work in a higher impact journal. Journal review outcomes are pretty stochastic and not very predictable.
Actual decisions are not based on this
We all hear the horror stories of impact factors and h-indexes determining your success with grant applications and hiring decisions. Even if this were true (and I actually have my doubts that it is as black and white as this), a CV with lots of high impact publications may get your foot in the door – but it does not absolve the panel from making a hiring/funding decision. You need to do the work on that one yourself and even then luck may be against you (the odds certainly are). It also simply is not true that most people are looking for the person with the most Nature papers. Instead I bet you they are looking for people who can string together a coherent argument, communicate their ideas, and who have the drive and intellect to be a good researcher. Applicants with a long list of high impact papers may still come up with awful grant proposals or do terribly in job interviews while people with less stellar publication records can demonstrate their excellence in other ways. You may already have made a name for yourself in your field anyway, through conferences, social media, public engagement etc. This may matter far more than any high impact paper could.
There are more important things
And now we’re coming back to the work-life balance and why you’re doing this in the first place. Honestly, who the hell cares whether someone else published this a few months earlier? Is being the first to do this the reason you’re doing science? I can see the excitement of discovery but let’s face it, most of our research is not like the work of Einstein or Newton, nor are we discovering extraterrestrial life. Your discovery is no doubt exciting to you, it is hopefully exciting to some other scientists in your little bubble and it may even be exciting to some journalist who will write a distorting, simplifying article about it for the mainstream news. But seriously, it’s not so groundbreaking that it is worth sacrificing your mental and physical health over. Live your life. Spend time with your family. Be good to your fellow creatures on this planet. By all means, don’t be complacent, ensure you make a living but don’t pressure yourself into believing that publishing ultra-high impact papers is the meaning of life.
A positive suggestion for next time…
Now if you’re really worried about this sort of thing, why not preregister your experiment? I know I said I wouldn’t talk about open science here but bear with me just this once because this is a practical point you can implement today. As I keep saying, the whole discussion about preregistration is dominated by talking about “questionable research practices”, HARKing, and all that junk. Not that these aren’t worthwhile concerns but this is a lot of negativity. There are plenty of positive reasons why preregistration can help and the (fallacious) fear of being scooped is one of them. Preregistration does not stop anyone else from publishing the same experiment before you but it does allow you to demonstrate that you had thought of the idea before they published it. With Registered Reports it becomes irrelevant if someone else published before you because your publication is guaranteed after the method has been reviewed. And I believe it will also make it far clearer to everyone how much who published what first where actually matters in the big scheme of things.
 Actually there are a lot of old and experienced researchers who worry about this too. And that is far worse than when early career researchers do it because they should really know better and they shouldn’t feel the same career pressures.
 It may sound appealing now, but thinking about it I wouldn’t trade my current professional life for anything. Except for grant admin bureaucracy perhaps. I would happily give that up at any price…
 He didn’t quite say it in those terms.
 This doesn’t actually happen. If you want to make a mint you need to go into scientific publishing but the whole open science movement is screwing up that opportunity now as well so you may be out of luck!
 Don’t bombard me with “Reviewer 2 held up my paper to publish theirs first” stories. Unless Reviewer 2 signed their review or told you specifically that it was them I don’t take such stories at face value.
 The sooner we stop thinking of other scientists in those terms the better for all of us.
Since last night the internet has been all atwitter about a commentary* by Dan Gilbert and colleagues about the recent and (in my view) misnamed Reproducibility Project: Psychology. In this commentary, Gilbert et al. criticise the RPP for a number of technical reasons, asserting that the sampling was non-random and biased and that the conclusion of a replicability crisis in psychology, particularly as drawn in the coverage by science media and the blogosphere, is essentially unfounded. Some of their points are rather questionable to say the least and some, like their interpretation of confidence intervals, are statistically simply wrong. But I won’t talk about this here.
One point they raise is the oft repeated argument that the replications differed in some way from the original research. We’ve discussed this already ad nauseam in the past and there is little point going over it again. Exact replication of the methods and conditions of an original experiment can test the replicability of a finding. Indirect replications loosely testing similar hypotheses instead inform us about the generalisability of the idea, which in turn tells us about the robustness of the processes we posited. Everybody (hopefully) knows this. Both are important aspects of scientific progress.
The main problem is that most debates about replicability go down that same road with people arguing about whether the replication was of sufficient quality to yield interpretable results. One example by Gilbert and co is that one of the replications in the RPP used the same video stimuli used by the original study, even though the original study was conducted in the US while the replication was carried out in the Netherlands, and the dependent variable was related to something that had no relevance to the participants in the replication (race relations and affirmative action). Other examples like this were brought up in previous debates about replication studies. A similar argument has also been made about the differences in language context between the original Bargh social priming studies and the replications. In my view, some of these points have merit and the example raised by Gilbert et al. is certainly worth a facepalm or two. It does seem mind-boggling how anyone could have thought it valid to replicate a result about a US-specific issue in a liberal European country whilst using the original stimuli in English.
But what this example illustrates is a much larger problem. In my mind that is actually the crux of the matter: Psychology, or at least most of the more traditional forms of psychology, does not lend itself very well to replication. As I am wont to point out, I am not a psychologist but a neuroscientist. I do work in a psychology department, however, and my field obviously has considerable overlap with traditional psychology. I also think many subfields of experimental psychology work in much the same way as other so-called “harder” sciences. This is not to say that neuroscience, psychophysics, or other fields do not also have problems with replicability, publication bias, and other concerns that plague science as a whole. We know they do. But the social sciences, the more lofty sides of psychology dealing with vague concepts of the mind and psyche, in my view have an additional problem: They lack the lawful regularity of effects that scientific discovery requires.
For example, we are currently conducting an fMRI experiment in which we replicate a previous finding. We are using the approach I have long advocated: in order to replicate, you should design experiments that both replicate a previous result and seek to address a novel question. The details of the experiment are not very important. (If we ever complete this experiment and publish it you can read about it then…) What matters is that we very closely replicate the methods of a study from 2012, and that study closely replicated the methods of one from 2008. The results are pretty consistent across all three instances of the experiment. The 2012 study provided a somewhat alternative interpretation of the findings of the 2008 one. Our experiment now adds more spatially sensitive methods to yet again paint a somewhat different picture. Since we’re not finished with it I can’t tell you how interesting this difference is. It is however already blatantly obvious that the general finding is the same. Had we analysed our experiment in the same way as the 2008 study, we would have reached the same conclusions they did.
The whole idea of science is to find regularities in our complex observations of the world, to uncover lawfulness in the chaos. The entire empirical approach is based on the idea that I can perform an experiment with particular parameters and repeat it with the same results, blurred somewhat by random chance. Estimating the generalisability allows me to understand how tweaking the parameters can affect the results and thus to determine the laws that govern the whole system.
And this right there is where much of psychology has a big problem. I agree with Gilbert et al. that repeating a social effect in US participants with identical methods in Dutch participants is not a direct replication. But what would be? They discuss how the same experiment was then repeated in the US and found results weakly consistent with the original findings. But this isn’t a direct replication either. It does not suffer from the same cultural and language differences as the replication in the Netherlands did but it has other contextual discrepancies. Even repeating exactly the same experiment in the original Stanford(?) population would not necessarily be equivalent because of the time that has passed and the way cultural factors have changed. A truly direct replication is simply not possible.
For all the failings that all fields of science have, this is a problem my research area does not suffer from (and to clarify: “my field” is not all of cognitive neuroscience, much of which is essentially straight-up psychology with the brain tagged on, and also while I don’t see myself as a psychologist, I certainly acknowledge that my research also involves psychology). Our experiment is done on people living in London. The 2012 study was presumably done mainly on Belgians in Belgium. As far as I know the 2008 study was run in the mid-western US. We are asking a question that deals with a fairly fundamental aspect of human brain function. This does not mean that there aren’t any population differences but our prior for such things affecting the results in a very substantial way is pretty small. Similarly, the methods can certainly modulate the results somewhat but I would expect the effects to be fairly robust to minor methodological changes. In fact, whenever we see that small changes in the method (say, the stimulus duration or the particular scanning sequence used) seem to obliterate a result completely, my first instinct is usually that such a finding is non-robust and thus unlikely to be meaningful.
From where I’m standing, social and other forms of traditional psychology can’t say the same. Small contextual or methodological differences can quite likely skew the results because the mind is a damn complex thing. For that reason alone, we should expect psychology to have low replicability and the effect sizes should be pretty small (i.e. smaller than what is common in the literature) because they will always be diluted by a multitude of independent factors. Perhaps more than any other field, psychology can benefit from preregistering experimental protocols to delineate the exploratory garden-path from hypothesis-driven confirmatory results.
I agree that a direct replication of a contextually dependent effect in a different country and at a different time makes little sense but that is no excuse. If you just say that the effects are so context-specific it is difficult to replicate them, you are bound to end up chasing lots of phantoms. And that isn’t science – not even a “soft” one.
* At first I thought the commentary was due to be published by Science on 4th March and embargoed until that date. However, it turns out to be more complicated than that because the commentary I am discussing here is not the Science article but Gilbert et al.’s reply to Nosek et al.’s reply to Gilbert et al.’s reply to the RPP (Confused yet?). It appeared on a website and then swiftly vanished again. I don’t know how I would feel posting it because the authors evidently didn’t want it to be public. I don’t think actually having that article is central to understanding my post so I feel it’s not important.
What a week! I have rarely seen the definition of irony demonstrated more clearly in front of my eyes than during the days following the publication of this comment by Lewandowsky and Bishop in Nature. I mentioned this at the end of my previous post. The comment discusses the question of how to deal with data requests and criticisms of scientific claims in the new world of open science. A lot of digital ink has already been spilled elsewhere debating what they did or didn’t say and what they meant to say with their article. I have no intention of rehashing that debate here. So while I typically welcome any meaningful and respectful comments under my posts, I’ll regard any comments on the specifics of the L&B article as off-topic and will not publish them. There are plenty of other channels for this.
I think the critics attack a strawman and the L&B discussion is a red herring. Irrespective of what they actually said, I want to get back to the discussion we should be having, which I already alluded to last time. In order to do so, let’s get the premise crystal clear. I have said all this before in my various posts about data sharing but let me summarise the fundamental points:
Data sharing: All data for scientific studies needed to reproduce the results should be made public in some independent repository at the point of publication. This must exclude data which would be unethical to share, e.g. unprocessed brain images from human participants. Such data fall in a grey area as to how much anonymisation is necessary and it is my policy to err on the side of caution there. We have no permission from our participants (except for some individual cases) to share their data with anyone outside the team if there is a chance that they could be identified from it so we don’t. For the overwhelming majority of purposes such data are not required and the pre-processed, anonymised data will suffice.
Material sharing: When I talk about sharing data I implicitly also mean material so any custom analysis code, stimulus protocols, or other materials used for the study should also be shared. This is not only good for reproducibility, i.e. getting the same results using the same data. It is also useful for replication efforts aiming to repeat the same experiment to collect new data.
Useful documentation: Shared data are unlikely to be of much use to anyone if there isn’t a minimum of documentation explaining what they contain. I don’t think this needs to be excessive, especially given the fact that most data will probably never be accessed by anyone. But there should at least be some basic guide on how to use the data to return a result. It should be reasonably clear what data can be found where or how to run the experiment. Provided the uncompiled code is included and the methods section of the publication contains sufficient detail of what is being done, anyone looking at it should be able to work it out by themselves. More extensive documentation is certainly helpful and may also help the researchers themselves in organising their work – but I don’t think we should expect more than the basics.
Now with this out of the way I don’t want to hear no lamentations about how I am “defending” the restriction of data access to anyone or any such rubbish. Let’s simply work on the assumption that the world is how it should be and that the necessary data are available to anyone with an internet connection. So let’s talk about the worries and potential problems this may bring. Note that, as I already said, most data sets will probably not generate much interest. That is fine – they should be available for potential future use in any case. More importantly this doesn’t mean the following concerns aren’t valid:
Volume of criticism
In some cases the number of people reusing the shared data will be very large. This is particularly likely for research on controversial topics. This could be because the topic is a political battleground or that the research is being used to promote policy changes people are not happy with. Perhaps the research receives undeserved accolades from the mainstream media or maybe it’s just a very sensational claim (Psi research springs to mind again…). The criticisms of this research may or may not be justified. None of this matters and I don’t care to hear about the specifics about your particular pet peeve whether it’s climate change or some medical trial. All that matters in this context is that the topic is controversial.
As I said last time, it should be natural that sensational or controversial research attracts more attention and more scepticism. This is how it should be. Scientists should be sceptical. But individual scientists or small research teams are composed of normal human beings and there is a limit to how much criticism they can keep up with. This is a simple fact. Of course this statement will no doubt draw out the usual suspects who feel the need to explain to me that criticism and scepticism are necessary in science and that this is simply what one should expect.
So let me cut the heads off this inevitable hydra right away. First of all, this is exactly what I just said: Yes, science depends on scepticism. But it is also true that humans have limited capacity for answering questions and criticisms and limited ability to handle stress. Simply saying that they should be prepared for that and have no right to complain is unrealistic. If anything it will drive people away from doing research on controversial questions which cannot be a good thing.
Similarly, it is unrealistic to say that they could just ignore criticisms if it gets too much for them. It is completely natural that a given scientist will want to respond to criticisms, especially if those criticisms are public. They will want to defend the conclusions they’ve drawn and they will also feel that they have a reputation to defend. I believe science would generally be better off if we all learned to become less invested in our pet theories and conducted our inferences in a less dogmatic way. I hope there are ways we can encourage such a change – but I don’t think you can take ego out of the question completely. Especially if a critic accuses a researcher of incompetence or worse, it shouldn’t surprise anyone if they react emotionally and have personal stakes in the debate.
So what can we expect? To me it seems entirely justified in this situation that a researcher would write a summary response that addresses the criticism collectively. In that they would most likely have to be selective and only address the more serious points and ignore the minutia. This may require some training. Even then it may be difficult because critics might insist that their subtle points are of fundamental importance. In that situation an adjudicating article by an independent party may be helpful (albeit probably not always feasible).
On a related note, it also seems justified to me that a researcher will require time to make a response. This pertains more to how we should assess a scientific disagreement as outside observers. Just because a researcher hasn’t responded to every little criticism within days of somebody criticising their work doesn’t mean that the criticism is valid. Scientists have lives too. They have other professional duties, mortgages to pay with their too-low salaries, children to feed, and – hard as it is to believe – they deserve some time off occasionally. As long as they declare their intention to respond in depth at some stage we should respect that. Of course if they never respond that may be a sign that they simply don’t have a good response to the criticism. But you need some patience, something we seem to have lost in the age of instant access social media.
Excessive criticism or harassment
This brings us to the next issue. Harassment of researchers is never okay. Which is really because harassment of anyone is never okay. So pelting a researcher with repeated criticisms, making the same points or asking the same questions over and over, is not acceptable. This certainly borders on harassment and may cross the line. This constant background noise can wear people out. It is also counterproductive because it slows them down in making their response. It may also paralyse their other research efforts which in turn will stress them out because they have grant obligations to fulfill etc. Above all, stress can make you sick. If you harassed somebody out of the ability to work, you’ll never get a response – this doesn’t make your criticism valid.
If the researchers declared their intention to respond to criticism we should leave it at that. If they don’t respond after a significant time it might be worth a reminder if they are still working on it. As I said above, if they never respond this may be a sign that they have no response. In that case, leave it at that.
It should require no explanation why any blatant harassment, abusive contact, or any form of interference in the researchers’ personal lives, is completely unacceptable. Depending on the severity of such cases they should be prosecuted to the full extent of the law. And if someone reports harassment, in the first instance you should believe them. It is a common tactic of harassers to downplay claims of abuse. Sure, it is also unethical to make false accusations but you should leave that for the authorities to judge, in particular if you don’t have any evidence one way or the other. Harassment is also subjective. What might not bother you may very well affect another person badly. Brushing this off as them being too sensitive demonstrates a serious lack of compassion, is disrespectful, and I think it also makes you seem untrustworthy.
Motive and bias
Speaking of untrustworthiness brings me to the next point. There has been much discussion about the motives of critics and in how far a criticism is to be taken in “good faith”. This is a complex and highly subjective judgement. In my view, your motive for reanalysing or critiquing a particular piece of research is not automatically a problem. All the data should be available, remember? Anyone can reanalyse it.
However, as all researchers should be honest so should all critics. Obviously this isn’t mandatory and it couldn’t be enforced even if it were. But this is how it should be and how good scientists should work. I have myself criticised and reanalysed research by others and I was not beating around the bush in either case – I believe I was pretty clear that I didn’t believe their hypothesis was valid. Hiding your prior notions is disrespectful to the authors and also misleads the neutral observers of the discussion. Even if you think that your public image already makes your views clear – say, because you ranted at great length on social media about how terribly flawed you think that study was – this isn’t enough. Even the Science Kardashians don’t have that large a social media following and probably only a fraction of that following will have read all your in-depth rants.
In addition to declaring your potential bias you should also state your intention. It is perfectly justified to dig into the data because you suspect it isn’t kosher. But this is an exploratory analysis and it comes with many of the same biases that uncontrolled, undeclared exploration always has. Of course you may find some big smoking gun that invalidates or undermines the original authors’ conclusions. But you are just as likely to find some spurious glitch or artifact in the data that doesn’t actually mean anything. In the latter case it would make more sense to conduct a follow-up experiment that tests your new alternative hypothesis to see if your suspicion holds up. If on the other hand you have a clear suspicion to start with you should declare it and then test it and report the findings no matter what. Preregistration may help to discriminate the exploratory fishing trips from the pointed critical reanalyses – however, it is logistically difficult to verify that such a preregistration wasn’t made after the fact, since the data were already available.
So I think this judgement will always rely heavily on trust but that’s not a bad thing. I’m happy to trust a critic if they declare their prior opinion. I will simply take their views with some scepticism that their bias may have influenced them. A critic who didn’t declare their bias but is then shown to have a bias appears far less trustworthy. So it is actually in your interest to declare your bias.
Now before anyone inevitably reminds us that we should also worry about the motives and biases of the original authors – yes, of course. But this is a discussion we’ve already had for years and this is why data sharing and novel publication models like preregistration and registered reports are becoming more commonplace.
Lack of expertise
On to the final point. Reanalyses or criticism may come from people with too little expertise and knowledge of a research area to provide useful contributions. Such criticisms may obfuscate the discussion and that is never a good thing. Again preempting the inevitable comments: No, this does not mean that you have to prove your expertise to reanalyse the data. (Seriously guys, which part of “all data should be available to anyone” don’t you get?!). What it does mean is that I might not want to weigh the criticism of someone who once took a biology class in high school the same way as that of a world expert. It also means that I will be more sceptical when someone is criticising something outside their own field.
There are many situations where this caveat doesn’t matter. Any scientist with some statistical training may be able to comment on some statistical issue. In fact, a statistician is presumably more qualified to comment on some statistical point than a non-statistician of whatever field. And even if you may not be an expert on some particular research topic you may still be an expert on the methods used by the researchers. Importantly, even a non-expert can reveal a fundamental flaw. The lack of a critic’s expertise shouldn’t be misused to discredit them. In the end, what really matters is that your argument is coherent and convincing. For that it doesn’t actually matter if you are an expert or not (an expert may however find it easier to communicate their criticism convincingly).
However, let’s assume that a large number of non-experts are descending on a data set, picking little things they perceive as flaws that aren’t actually consequential or making errors in their analysis that are glaring to an expert. What should the researchers do in this situation? Not responding at all is not in their interest. This can easily be misinterpreted as a tacit acknowledgement that their research is flawed. On the other hand, responding to every single case is not in their interest either if they want to get on with their work (and their lives for that matter). As above, perhaps the best thing to do would be to write a summary response collectively rebutting the most pertinent points, make a clear argument about the inconsequentiality of these criticisms, and then leave it at that.
In general, scientific criticisms are publications that should work like any other scientific publications. They should be subject to peer review (which, as readers of this blog will know, I believe should be post-publication and public). This doesn’t mean that criticism cannot start on social media, blogs, journal comment sections, or on PubPeer, and the boundaries may also blur at times. For some kinds of criticism, such as pointing out basic errors or misinterpretations some public comments may suffice and there have been cases where a publication was retracted simply because of the social media response. But for a criticism to be taken seriously by anyone, especially non-experts, it helps if it is properly vetted by independent experts – just how any study should be vetted. This may also help particularly with cases where the validity of the criticism is uncertain.
I think this is a very important discussion to have. We need to have this to bring about the research culture most of us seem to want. A brave new world of happy research parasites.
(Note: I changed the final section somewhat after Neuroskeptic rightly pointed out that the conclusions were a bit too general. Tal Yarkoni independently replicated this sentiment. But he was only giving me a hard time.)
This weekend marked another great moment in the saga surrounding the discussion about open science – a worthy sequel to “angry birds” and “shameless little bullies”. This time it was an editorial about data sharing in the New England Journal of Medicine which contains the statement that:
There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”
Remarks like this from journal editors are just all kinds of stupid. Even though this was presented in the context of quotes by unnamed “front-line researchers” (whatever that means) they implicitly endorse the interpretation that re-using other people’s published data is parasitical. In fact, their endorsement is made clear later on in the editorial when the editors express the hope that data sharing “should happen symbiotically, not parasitically.”
It shouldn’t come as a surprise that this editorial was immediately greeted by wide-spread ridicule and the creation of all sorts of internet memes poking fun at the notion of “research parasites.” Even if some people believe this, hell, even if the claim were true (spoiler: it’s not), this is just a very idiotic thing to do. Like it or not, open access, transparency, and post-publication scrutiny of published scientific findings are becoming increasingly common and are already required in many places. We’re now less than a year away from the date when the Peer Reviewers Openness Initiative, whose function it is to encourage data sharing, will come into effect. Not only is the clock not turning back on this stuff – it is deeply counterproductive to liken the supporters of this movement to parasites. This is no way to start (or have) a reasonable conversation.
And there should be a conversation. If there is one thing I have learned from talking with colleagues, worries about data sharing and open science as a whole are far from rare. Misguided as it may be, the concern about others scooping your ideas and sifting through data that you spent blood, sweat, and tears collecting resonates with many people. This editorial didn’t just pop into existence from the quantum foam – it comes from a real place. The mocking and snide remarks about this editorial are fully deserved. This editorial is moronic and ass-backwards. But speaking more generally, snide and mocking are never a good way to convince people of the strength of your argument. All too often worries like this are met with disrespect and ridicule. Is it any surprise that a lot of people don’t dare to speak up against open science? Similarly, when someone discovers errors or problems in somebody else’s data, some are quick to make jokes or serious accusations about these researchers. Is this encouraging them to open up their lab books and file drawers? I think not.
Scientists are human beings and they tend to have normal human reactions when being accused of wrong-doing, incompetence, or sloppiness. Whether or not the accusations are correct is irrelevant. Even mentioning the dreaded “questionable research practices” sounds like a fierce accusation to the accused even though questionable research practices can occur quite naturally without conscious ill intent when people are wandering in the garden of forking paths. In my opinion we need to be mindful of that and try to be more considerate in the way we discuss these issues. Social media like Facebook and Twitter do not exactly seem to encourage respectful dialogue. I know this firsthand as I have myself said things about (in my view) questionable research that I subsequently regretted. Scepticism is good and essential to scientific progress – disrespect is not.
It seems to have been the intention of this misguided editorial to communicate a similar message. It encourages researchers using other people’s data to work with the original authors. So far so good. I am sure no sensible person would actually disagree with that notion. But where the editorial misses the point is that there is no plan for what happens if this “symbiotic” relationship isn’t forming, either because the original authors are not cooperating or because there is a conflict of interests between skeptics and proponents of a scientific claim. In fact, the editorial lays bare what I think is the heart of the problem in a statement that to me seems much worse than the “research parasites” label. They say that people…
…even use the data to try to disprove what the original investigators had posited.
It baffles me that anyone can write something like this whilst keeping a straight face. Isn’t this how science is supposed to work? Trying to disprove a hypothesis is just basic Popperian falsification. Not only should others do that, you should do that yourself with your own research claims. To be fair, the best way to do science in my opinion is to generate competing hypotheses and test them with as little emotional attachment to any of them as possible but this is more easily said than done… So ideally we should try to find the hypothesis that best explains the data rather than just seeking to disprove. Either way however, this sentence is clearly symptomatic of a much greater problem: Science should be about “finding better ways of being wrong.” The first step towards this is to acknowledge that anything we posited is never really going to be “true” and that it can always use a healthy dose of scientific scepticism and disproving.
I want to have this dialogue. I want to debate the ways to make science healthier, more efficient, and more flexible in overturning false ideas. As I outlined in a previous post, data sharing is the single most important improvement we can make to our research culture. I think even if there are downsides to it, the benefits outweigh them by far. But not everyone shares my enthusiasm for data sharing and many people seem worried but afraid to speak up. This is wrong and it must change. I strongly believe that most of the worries can be alleviated:
I think it’s delusional to believe that data sharing will produce a “class” of “research parasites.” People will still need to generate their own science to be successful. Simply sitting around waiting for other people to generate data is not going to be a viable career strategy. If anything, large consortia like the Human Genome or Human Connectome Project will produce large data sets that a broad base of researchers can use. But this won’t allow them to test every possible hypothesis under the sun. In fact, most data sets are far too specific to be of much use to many other people.
I’m willing to bet that the vast majority of publicly shared data sets won’t be downloaded, let alone analysed by anyone other than the original authors. This is irrelevant. The point is that the data are available because they could be potentially useful to future science.
Scooping other people’s research ideas by doing the experiment they wanted to do using their published data is a pretty ineffective and risky strategy. In most cases, there is just no way that someone else could beat you to publishing an experiment you planned to do with your own data. This doesn’t mean that it never happens but I’m still waiting for anyone to tell me of a case where this actually did happen… But if you are worried about it, preregister your intention so at least anyone can see that you planned it. Or even better, submit it as a Registered Report so you can guarantee that this work will be published in a journal regardless of what other people did with your data.
While we’re at it, upload the preprints of your manuscripts when you submit them to journals. I still dream of a publication system where we don’t submit to journals at all, or at least not until peer review has taken place and the robustness of the finding has been confirmed. But until we get there, preprints are the next best thing. With a public preprint the chronological precedence is clear for all to see.
Now that covers the “parasites” feeding on your research productivity. But what to do if someone else subjects your data to sceptical scrutiny in the attempt to disprove what you posited? Again, first of all I don’t think this is going to be that frequent. It is probably more frequent for controversial or surprising claims and it bloody well should be. This is how science progresses and shouldn’t be a concern. And if it actually turns out that the result or your interpretation of it is wrong, wouldn’t you want to know about it? If your answer to this question is No, then I honestly wonder why you do research.
I can however empathise with the fear that people, some of whom may lack the necessary expertise or who cherry pick the results, will actively seek to dismantle your findings. I am sure that this does happen and with more general data sharing this may certainly become more common. If the volume of such efforts becomes so large that it overwhelms an individual researcher and thus hinders their own progress unnecessarily, this would indeed be a concern. Perhaps we need to have a discussion on what safeguards could ensure that this doesn’t get out of hand or how one should deal with that situation. I think it’s a valid concern and worth some serious thought. (Update on 25 Jan 2016: In this context Stephan Lewandowsky and Dorothy Bishop wrote an interesting comment about this).
But I guarantee you, throwing the blame at data sharing is not the solution to this potential problem. The answer to scepticism and scrutiny cannot ever be to keep your data under lock and key. You may never convince a staunch sceptic but you also will not win the hearts and minds of the undecidedly doubtful by hiding in your ivory tower. In science, the only convincing argument is data, more data, better tests – and the willingness to change your mind if the evidence demands it.
Yesterday Neuroskeptic came to our Cognitive Drinks event in the Experimental Psychology department at UCL to talk about p-hacking. His entertaining talk (see Figure 1) was followed by a lively and fairly long debate about p-hacking and related questions about reproducibility, preregistration, and publication bias. During the course of this discussion a few interesting things came up. (I deliberately won’t name anyone as I think this complicates matters. People can comment and identify themselves if they feel that they should…)
It was suggested that a lot of the problems with science would be remedied effectively if only people were encouraged (or required?) to replicate their own findings before publication. Now that sounds generally like a good idea. I have previously suggested that this would work very well in combination with preregistration: you first do a (semi-)exploratory experiment to finalise the protocol, then submit a preregistration of your hypothesis and methods, and then do the whole thing again as a replication (or perhaps more than one if you want to test several boundary conditions or parameters). You then submit the final set of results for publication. Under the Registered Report format, your preregistered protocol would already undergo peer review. This would ensure that the final results are almost certain to be published provided you didn’t stray excessively from the preregistered design. So far, so good.
Should you publish unclear results?
Or is it? Someone suggested that it would be a problem if your self-replication didn’t show the same thing as the original experiment. What should one do in this case? Doesn’t publishing something incoherent like this, one significant finding and a failed replication, just add to the noise in the literature?
At first, this question simply baffled me, as I suspect it would many of the folks campaigning to improve science. (My evil twin sister called these people Crusaders for True Science but I’m not supposed to use derogatory terms like that anymore nor should I impersonate lady demons for that matter. Most people from both sides of this mudslinging contest “debate” never seemed to understand that I’m also a revolutionary – you might just say that I’m more Proudhon, Bakunin, or Henry David Thoreau than Marx, Lenin, or Che Guevara. But I digress…)
Surely, the attitude that unclear, incoherent findings, that is, those that are more likely to be null results, are not worth publishing must contribute to the prevailing publication bias in the scientific literature? Surely, this view is counterproductive to the aims of science to accumulate evidence and gradually get closer to some universal truths? We must know which hypotheses have been supported by experimental data and which haven’t. One of the most important lessons I learned from one of my long-term mentors was that all good experiments should be published regardless of what they show. This doesn’t mean you should publish every single pilot experiment you ever did that didn’t work. (We can talk about what that does and doesn’t mean another time. But you know how life is: sometimes you think you have some great idea only to realise that it makes no sense at all when you actually try it in practice. Or maybe that’s just me? :P). Even with completed experiments you probably shouldn’t bother publishing if you realise afterwards that it is all artifactual or the result of some error. Hopefully you don’t have a lot of data sets like that though. So provided you did an experiment of suitable quality I believe you should publish it rather than hiding it in the proverbial file drawer. All scientific knowledge should be part of the scientific record.
I naively assumed that this view was self-evident and shared by almost everyone – but this clearly is not the case. Yet instead of sneering at such alternative opinions I believe we should understand why people hold them. There are reasonable arguments why one might wish to not publish every unclear finding. The person making this suggestion at our discussion said that it is difficult to interpret a null result, especially an assumed null result like this. If your original experiment O showed a significant effect supporting your hypothesis, but your replication experiment R does not, you cannot naturally conclude that the effect really doesn’t exist. For one thing you need to be more specific than that. If O showed a significant positive effect but R shows a significant negative one, this would be more consistent with the null hypothesis than if O is highly significant (p < 10⁻³⁰) and R just barely misses the threshold (p = 0.051).
So let’s assume that we are talking about the former scenario. Even then things aren’t as straightforward, especially if R isn’t as exact a replication of O as you might have liked. If there is any doubt (and usually there is) that something could have been different in R than in O, this could be one of the hidden factors people always like to talk about in these discussions. Now you hopefully know your data better than anyone. If experiment O was largely exploratory and you tried various things to see what works best (dare we say p-hacking again?), then the odds are probably quite good that a significant non-replication in the opposite direction shows that the effect was just a fluke. But this is not a natural law but a probabilistic one. You cannot ever know whether the original effect was real or not, especially not from such a limited data set of two non-independent experiments.
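To make the probabilistic point concrete, here is a minimal Python simulation. All the numbers in it (effect size, sample size, number of simulated pairs, random seed) are made-up assumptions for illustration, not drawn from any real study: even when a true effect exists, a significant original followed by a non-significant replication is a common pattern at modest statistical power.

```python
import math
import random
from statistics import NormalDist

def p_value(sample_mean, n, sigma=1.0):
    """Two-sided p-value from a one-sample z-test against mu = 0
    (known sigma, to keep the example dependency-free)."""
    z = sample_mean * math.sqrt(n) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(true_effect=0.3, n=20, n_pairs=10_000, alpha=0.05, seed=1):
    """Fraction of experiment pairs where the original is significant
    but the replication is not, even though the effect is real."""
    rng = random.Random(seed)
    mixed = 0
    for _ in range(n_pairs):
        # Each "experiment" is n noisy observations of the same true effect.
        original = sum(rng.gauss(true_effect, 1) for _ in range(n)) / n
        replication = sum(rng.gauss(true_effect, 1) for _ in range(n)) / n
        if p_value(original, n) < alpha and p_value(replication, n) >= alpha:
            mixed += 1
    return mixed / n_pairs

print(simulate())  # roughly 0.2 with these (made-up) parameters
```

In other words, under these assumptions about one in five honest replication attempts of a real effect would produce exactly the “incoherent” pattern discussed above, which is why a single failed self-replication cannot settle whether the original effect was real.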
This is precisely why you should publish all results!
In my view, it is inherently dangerous if researchers decide for themselves which findings are important and which are not. It is not only a question of publishing only significant results. It applies much more broadly to the situation when a researcher publishes only results that support their pet theory but ignores or hides those that do not. I’d like to believe that most scientists don’t engage in this sort of behaviour – but sadly it is probably not uncommon. A way to counteract this is to train researchers to think of ways that test alternative hypotheses that make opposite predictions. However, such so-called “strong inference” is not always feasible. And even when it is, the two alternatives are not always equally interesting, which in turn means that people may still become emotionally attached to one hypothesis.
The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments. If you have defensible doubts that the data are actually rubbish (say, an fMRI data set littered with spikes, distortions, and excessive motion artifacts, or a social psychology study where you discovered post hoc that all the participants were illiterate and couldn’t read the questionnaires) then by all means throw them in the bin. But unless you have a good reason, you should never do this and instead add the results to the scientific record.
Now the suggestion during our debate was that such inconclusive findings clog up the record with unnecessary noise. There is an enormous and constantly growing scientific literature. As it is, it is becoming increasingly hard to separate the wheat from the chaff. I can barely keep up with the continuous feed of new publications in my field and I am missing a lot. Total information overload. So from that point of view the notion makes sense that only those studies that meet a certain threshold for being conclusive are accepted as part of the scientific record.
I can certainly relate to this fear. For the same reason I am sceptical of proposals that papers should be published before review and all decisions about the quality and interest of some piece of research, including the whole peer review process, should be entirely post-publication. Some people even seem to think that the line between scientific publication and science blog should be blurred beyond recognition. I don’t agree with this. I don’t think that rating systems like those used on Amazon or IMDb are an ideal way to evaluate scientific research. It doesn’t sound wise to me to assess scientific discoveries and medical breakthroughs in the same way we rank our entertainment and retail products. And that is not even talking about unleashing the horror of internet comment sections onto peer review…
Solving the (false) dilemma
I think this discussion is creating a false dichotomy. These are not mutually exclusive options. The solution to a low signal-to-noise ratio in the scientific literature is not to maintain publication bias of significant results. Rather the solution is to improve our filtering mechanisms. As I just described, I don’t think it will be sufficient to employ online shopping and social network procedures to rank the scientific literature. Even in the best-case scenario this is likely to highlight the results of authors who are socially dominant or popular and probably also those who are particularly unpopular or controversial. It does not necessarily imply that the highest quality research floats to the top [cue obvious joke about what kind of things float to the top…].
No, a high quality filter requires some organisation. I am convinced the scientific community can organise itself very well to create these mechanisms without too much outside influence. (I told you I’m Thoreau and Proudhon, not some insane Chaos Worshipper :P). We need some form of acceptance to the record. As I outlined previously, we should reorganise the entire publication process so that the whole peer-review process is transparent and public. It should be completely separate from journals. The journals’ only job should be to select interesting manuscripts and to publish short summary versions of them in order to communicate particularly exciting results to the broader community. But this peer-review should still involve a “pre-publication” stage – in the sense that the initial manuscript should not generate an enormous amount of undue interest before it has been properly vetted. To reiterate (because people always misunderstand that): the “vetting” should be completely public. Everyone should be able to see all the reviews, all the editorial decisions, and the whole evolution of the manuscript. If anyone has any particular insight to share about the study, by all means they should be free to do so. But there should be some editorial process. Someone should chase potential reviewers to ensure the process takes off at all.
The good news about all this is that it benefits you. Instead of weeping bitterly and considering quitting science because yet again you didn’t find the result you hypothesised, this just means that you get to publish more research. Taking the focus off novel, controversial, special, cool or otherwise “important” results should also help make the peer review more about the quality and meticulousness of the methods. Peer review should be about ensuring that the science is sound. In current practice it instead often resembles a battle with authors defending to the death their claims about the significance of their findings against the reviewers’ scepticism. Scepticism is important in science but this kind of scepticism is completely unnecessary when people are not incentivised to overstate the importance of their results.
Practice what you preach
I honestly haven’t followed all of the suggestions I make here. Neither have many other people who talk about improving science. I know of vocal proponents of preregistration who have yet to preregister any study of their own. The reasons for this are complex. Of course, you should “be the change you wish to see in the world” (I’m told Gandhi said this). But it’s not always that simple.
On the whole though I think I have published almost all of the research I’ve done. While I currently have a lot of unpublished results there is very little in the file drawer as most of these experiments have either been submitted or are being written up for eventual publication. There are two exceptions. One is a student project that produced somewhat inconclusive results although I would say it is a conceptual replication of a published study by others. The main reason we haven’t tried to publish this yet is that the student isn’t here anymore and hasn’t been in contact, and the data aren’t exciting enough for us to bother with the hassle of publication (and it is a hassle!).
The other data set is perhaps ironic because it is a perfect example of the scenario I described earlier. A few years ago when I started a new postdoc I was asked to replicate an experiment a previous lab member had done. For simplicity, let’s just call this colleague Dr Toffee. Again, they can identify themselves if they wish. The main reason for this was that reviewers had asked Dr Toffee to collect eye-movement data. So I replicated the original experiment but added eye-tracking. My replication wasn’t an exact one in the strictest terms because I decided to code the experimental protocol from scratch (this was a lot easier). I also had to use a different stimulus setup than the previous experiment as that wouldn’t have worked with the eye-tracker. Still, I did my best to match the conditions in all other ways.
My results were a highly significant effect in the opposite direction than the original finding. We did all the necessary checks to ensure that this wasn’t just a coding error etc. It seemed to be real. Dr Toffee and I discussed what to do about it and we eventually decided that we wouldn’t bother to publish this set of experiments. The original experiment had been conducted several years before my replication. Dr Toffee had moved on with their life. I on the other hand had done this experiment as a courtesy because I was asked to. It was very peripheral to my own research interests. So, as in the other example, we both felt that going through the publication process would have been a fairly big hassle for very little gain.
Now this is bad. Perhaps there is some other poor researcher, a student perhaps, who will do a similar experiment again and waste a lot of time on testing the hypothesis that, at least according to our incoherent results, is unlikely to be true. And perhaps they will also not publish their failure to support this hypothesis. The circle of null results continues…
But you need to pick your battles. We are all just human beings and we do not have unlimited (research) energy. With both of these lacklustre or incoherent results I mentioned (and these are literally the only completed experiments we haven’t at least begun to write up), the pain of submission->review->rejection->repeat seems a daunting task that simply isn’t worth it.
So what to do? Well, the solution is again what I described. The very reason the task of publishing these results isn’t worth our energy is everything that is wrong with the current publication process! In my dream world in which I can simply write up a manuscript formatted in a way that pleases me and then upload this to the pre-print peer-review site my life would be infinitely simpler. No more perusing dense journal websites for their guide to authors or hunting for the Zotero/Endnote/Whatever style to format the bibliography. No more submitting your files to one horribly designed, clunky journal website after another, checking the same stupid tick boxes, adding the same reviewer suggestions. No more rewriting your cover letters by changing the name of the journal. Certainly for my student’s project, it would not be hard to do as there is already a dissertation that can be used as a basis for the manuscript. Dr Toffee’s experiment and its contradictory replication might require a bit more work – but to be fair, even there a previous manuscript already exists. So all we’d need to add would be the modifications of the methods and the results of my replication. In a world where all you need to do is upload the manuscript and address some reviewers’ comments to ensure the quality of the science this should be fairly little effort. In turn it would ensure that the file drawer is empty and we are all much more productive.
This world isn’t here yet but there are journals that will allow something that isn’t too far off from that, namely F1000Research and PeerJ (and the Winnower also counts although the content there seems to be different and I don’t quite know how much review editing happens there). So, maybe I should email Dr Toffee now…
(* In case you didn’t get this from the previous 2700ish words: the answer to this question is unequivocally “No.”)
Because it’s so much more fun than the things I should really be doing (correcting student dissertations and responding to grant reviews) I read a long blog post entitled “Science isn’t broken” by Christie Aschwanden. In large parts this is a summary of the various controversies and “crises” that seem to have engulfed scientific research in recent years. The title is a direct response to an event I participated in recently at UCL. More importantly, I think it’s a really good read so I recommend it.
This post is a quick follow-up to the general points raised there. As I tried to argue (probably not very coherently) at that event, I also don’t think science is broken. First of all, probably nobody seriously believes that the lofty concept of science, the scientific method (if there is such a thing), can even be broken. But even in more pragmatic terms, the human aspects of how science works are not broken either. My main point was that the very fact we are having these discussions about how scientific research can be improved is direct proof that science is in fact very healthy. This is what self-correction looks like.
If anything, the recent surge of these debates shows that science has already improved a lot. After decades of complacency with the status quo, there now seems to be real energy afoot to effect some changes. However, this is not the first time it has happened (the introduction of peer review, for example, would have been a similarly revolutionary moment) and it will not be the last. Science will always need to be improved. If some day conventional wisdom held that our procedure is now perfect, that it cannot be improved any more, that would be a tell-tale sign for me that I should do something else.
So instead of fretting over whether science is “broken” (no, it isn’t) or even whether it needs improvement (yes, it does), what we should be talking about is what specifically and urgently needs improvement. Here is my short list. I am not proposing many solutions (except for point I), so I’d be happy to hear suggestions:
I. Publishing and peer review
The way we publish and review seriously needs to change. We are wasting far too much time on trivialities instead of on the science. The trivialities range from reformatting manuscripts to fit journal guidelines and uploading files, on the practical side, to chasing impact factors and “novel” research, on the more abstract side. Both hurt research productivity, albeit in different ways. I recently proposed a solution that combines some of the ideas of Dorothy Bishop and Micah Allen (and no doubt many others).
II. Post-publication review
Related to this, the way we evaluate and discuss published science needs to change, too. We need to encourage more post-publication review. This currently barely happens: most studies never receive any post-pub review or comments at all. Sure, some (including some of my own) probably just don’t deserve any attention, but how will you know unless somebody tells you the study even exists? Many precious gems will be missed that way. This has of course always been the case in science, but we should try to minimise the problem. Some believe post-publication review is all we will ever need, but unless there are robust mechanisms to attract reviewers to new manuscripts beyond the authors’ fame, (un-)popularity, and/or social media presence – none of which are good scientific arguments – I can’t see how a post-pub-only system would change this. On this note I should mention that Tal Yarkoni, with whom I’ve had some discussions about this issue, wrote an article presenting some suggestions. I am not entirely convinced by his arguments for enhancing post-publication review, but I need more time to respond to them in detail, so for now I will just point the article out to any interested reader.
III. Research funding and hiring decisions
Above all, what seriously needs to change is how we allocate research funds and how we make hiring decisions. The solution probably goes hand in hand with solving the other two points, but I think it also requires direct action now, in the absence of good solutions for those issues. We must stop judging grant and job applicants based on impact factors or h-indices. This is certainly easier for job applications than for grant decisions, as in the latter the volume of applications is much greater – and the expertise of the panel members in judging the applications is lower. But it should be possible to reduce the reliance on metrics and ratings – even newer, more refined ones. Grant applications also shouldn’t be killed by a single off-hand critical review comment. Most importantly, grants shouldn’t all be written in a way that devalues exploratory research (by pretending to have strong hypotheses when you don’t) or – even worse – by pretending that the research you already conducted and are ready to publish is a “preliminary pilot data set.” For work that actually is hypothesis-driven, I quite like Dorothy Bishop’s idea that research funds could be awarded at the pre-registration stage, once the theoretical background and experimental design have been established but before data collection commences. Realistically, this is probably more suitable for larger experimental programmes than for every single study. But then again, encouraging larger, more thorough projects may in fact be a good thing.