Is publication bias actually a good thing?*

Yesterday Neuroskeptic came to our Cognitive Drinks event in the Experimental Psychology department at UCL to talk about p-hacking. His entertaining talk (see Figure 1) was followed by a lively and fairly long debate about p-hacking and related questions about reproducibility, preregistration, and publication bias. During the course of this discussion a few interesting things came up. (I deliberately won’t name anyone as I think this complicates matters. People can comment and identify themselves if they feel that they should…)

Figure 1. Using this super-high-tech interactive fMRI results simulator Neuroskeptic clearly demonstrated a significant blob of activation in the pre-SMA (I think?) in stressed compared to relaxed people. This result made perfect sense.

It was suggested that a lot of the problems with science would be remedied effectively if only people were encouraged (or required?) to replicate their own findings before publication. Now that sounds generally like a good idea. I have previously suggested that this would work very well in combination with preregistration: you first do a (semi-)exploratory experiment to finalise the protocol, then submit a preregistration of your hypothesis and methods, and then do the whole thing again as a replication (or perhaps more than one if you want to test several boundary conditions or parameters). You then submit the final set of results for publication. Under the Registered Report format, your preregistered protocol would already undergo peer review. This would ensure that the final results are almost certain to be published provided you didn’t stray excessively from the preregistered design. So far, so good.

Should you publish unclear results?

Or is it? Someone suggested that it would be a problem if your self-replication didn’t show the same thing as the original experiment. What should one do in this case? Doesn’t publishing something incoherent like this, one significant finding and a failed replication, just add to the noise in the literature?

At first, this question simply baffled me, as I suspect it would many of the folks campaigning to improve science. (My evil twin sister called these people Crusaders for True Science but I’m not supposed to use derogatory terms like that anymore, nor should I impersonate lady demons for that matter. Most people on both sides of this mudslinging contest “debate” never seemed to understand that I’m also a revolutionary – you might just say that I’m more Proudhon, Bakunin, or Henry David Thoreau than Marx, Lenin, or Che Guevara. But I digress…)

Surely, the attitude that unclear, incoherent findings (that is, those that are more likely to be null results) are not worth publishing must contribute to the prevailing publication bias in the scientific literature? Surely, this view is counterproductive to the aim of science to accumulate evidence and gradually get closer to some universal truths? We must know which hypotheses have been supported by experimental data and which haven’t. One of the most important lessons I learned from one of my long-term mentors was that all good experiments should be published regardless of what they show. This doesn’t mean you should publish every single pilot experiment you ever did that didn’t work. (We can talk about what that does and doesn’t mean another time. But you know how life is: sometimes you think you have some great idea only to realise that it makes no sense at all when you actually try it in practice. Or maybe that’s just me? :P) Even with completed experiments you probably shouldn’t bother publishing if you realise afterwards that the results are artifactual or stem from some error. Hopefully you don’t have a lot of data sets like that, though. But provided you did an experiment of suitable quality, I believe you should publish it rather than hiding it in the proverbial file drawer. All scientific knowledge should be part of the scientific record.

I naively assumed that this view was self-evident and shared by almost everyone – but this clearly is not the case. Yet instead of sneering at such alternative opinions I believe we should understand why people hold them. There are reasonable arguments for why one might not wish to publish every unclear finding. The person making this suggestion at our discussion said that it is difficult to interpret a null result, especially an assumed null result like this. If your original experiment O showed a significant effect supporting your hypothesis but your replication experiment R does not, you cannot simply conclude that the effect doesn’t exist. For one thing, you need to be more specific than that. If O shows a significant positive effect but R shows a significant negative one, this would be more consistent with the null hypothesis than if O is highly significant (p < 10⁻³⁰) and R just barely misses the threshold (p = 0.051).

So let’s assume that we are talking about the former scenario. Even then things aren’t as straightforward, especially if R isn’t as exact a replication of O as you might have liked. If there is any possibility (and usually there is) that something was different in R than in O, this could be one of those hidden factors people always like to bring up in these discussions. Now, you hopefully know your data better than anyone. If experiment O was largely exploratory and you tried various things to see what works best (dare we say p-hacking again?), then the odds are probably quite good that a significant non-replication in the opposite direction means the original effect was just a fluke. But this is a probabilistic statement, not a law of nature. You cannot ever know for certain whether the original effect was real, especially not from such a limited data set of two non-independent experiments.
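To put a rough number on that probabilistic intuition, here is a minimal simulation sketch (my own toy illustration, nothing we actually ran at the event; the sample size, effect size, and threshold are arbitrary assumptions). It simply counts how often the pattern “original significant and positive, replication significant and negative” turns up when the true effect is zero versus when it is modest but real:

```python
# Toy simulation: frequency of "O significantly positive, R significantly negative"
# under a true null versus a genuine effect. All numbers are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, alpha = 20, 0.05

def sign_reversal_rate(true_d, n_sims=100_000):
    """Proportion of O/R pairs showing a significant sign reversal."""
    def run_batch():
        # Each column is one simulated two-group experiment.
        treatment = rng.normal(true_d, 1.0, size=(n_per_group, n_sims))
        control = rng.normal(0.0, 1.0, size=(n_per_group, n_sims))
        return stats.ttest_ind(treatment, control, axis=0)

    res_o = run_batch()  # the "original" experiments O
    res_r = run_batch()  # the "replication" attempts R
    reversal = ((res_o.pvalue < alpha) & (res_o.statistic > 0)
                & (res_r.pvalue < alpha) & (res_r.statistic < 0))
    return reversal.mean()

print(sign_reversal_rate(0.0))  # ~0.0006 (0.025 * 0.025): what a pure fluke looks like
print(sign_reversal_rate(0.5))  # roughly an order of magnitude rarer when the effect is real
```

The reversal pattern is rare either way, but it is considerably more probable under the null than under a genuine effect, which is why it nudges you towards “fluke” without ever proving anything.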

This is precisely why you should publish all results!

In my view, it is inherently dangerous if researchers decide for themselves which findings are important and which are not. This is not only a question of publishing only significant results. It applies much more broadly to the situation in which a researcher publishes only the results that support their pet theory and ignores or hides those that do not. I’d like to believe that most scientists don’t engage in this sort of behaviour – but sadly it is probably not uncommon. One way to counteract this is to train researchers to design experiments that test alternative hypotheses making opposite predictions. However, such so-called “strong inference” is not always feasible. And even when it is, the two alternatives are not always equally interesting, which in turn means that people may still become emotionally attached to one hypothesis.

The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments. If you have defensible reasons to believe that the data are actually rubbish (say, an fMRI data set littered with spikes, distortions, and excessive motion artifacts, or a social psychology study where you discovered post hoc that all the participants were illiterate and couldn’t read the questionnaires) then by all means throw them in the bin. But unless you have a good reason like that, you should not discard the results; add them to the scientific record instead.

Now, the suggestion during our debate was that such inconclusive findings clog up the record with unnecessary noise. There is an enormous and constantly growing scientific literature. As it is, it is becoming ever harder to separate the wheat from the chaff. I can barely keep up with the continuous feed of new publications in my field and I am missing a lot. Total information overload. So from that point of view it makes sense that only studies meeting a certain threshold of conclusiveness should be accepted into the scientific record.

I can certainly relate to this fear. For the same reason I am sceptical of proposals that papers should be published before review, with all decisions about the quality and interest of a piece of research, including the whole peer review process, happening entirely post-publication. Some people even seem to think that the line between scientific publication and science blog should be blurred beyond recognition. I don’t agree with this. I don’t think that rating systems like those used on Amazon or IMDb are an ideal way to evaluate scientific research. It doesn’t sound wise to me to assess scientific discoveries and medical breakthroughs in the same way we rank our entertainment and retail products. And that is before we even talk about unleashing the horror of internet comment sections onto peer review…

Solving the (false) dilemma

I think this discussion creates a false dichotomy. These are not mutually exclusive options. The solution to a low signal-to-noise ratio in the scientific literature is not to maintain a publication bias towards significant results. Rather, the solution is to improve our filtering mechanisms. As I just described, I don’t think it will be sufficient to rank the scientific literature with the procedures of online shopping and social networks. Even in the best-case scenario this is likely to highlight the results of authors who are socially dominant or popular, and probably also those who are particularly unpopular or controversial. It would not ensure that the highest quality research floats to the top [cue obvious joke about what kind of things float to the top…].

No, a high quality filter requires some organisation. I am convinced the scientific community can organise itself very well to create these mechanisms without too much outside influence. (I told you I’m Thoreau and Proudhon, not some insane Chaos Worshipper :P). We need some form of acceptance into the record. As I outlined previously, we should reorganise the entire publication process so that the whole peer review process is transparent and public. It should be completely separate from journals. The journals’ only job should be to select interesting manuscripts and to publish short summary versions of them in order to communicate particularly exciting results to the broader community. But this peer review should still involve a “pre-publication” stage – in the sense that the initial manuscript should not generate an enormous amount of undue interest before it has been properly vetted. To reiterate (because people always misunderstand this): the “vetting” should be completely public. Everyone should be able to see all the reviews, all the editorial decisions, and the whole evolution of the manuscript. If anyone has any particular insight to share about the study, by all means they should be free to do so. But there should be some editorial process. Someone should chase potential reviewers to ensure the process gets off the ground at all.

The good news about all this is that it benefits you. Instead of weeping bitterly and considering quitting science because yet again you didn’t find the result you hypothesised, you simply get to publish more research. Taking the focus off novel, controversial, special, cool or otherwise “important” results should also help make peer review more about the quality and meticulousness of the methods. Peer review should be about ensuring that the science is sound. In current practice it often resembles a battle instead, with authors defending their claims about the significance of their findings to the death against the reviewers’ scepticism. Scepticism is important in science, but this kind of scepticism is completely unnecessary when people are not incentivised to overstate the importance of their results.

Practice what you preach

I honestly haven’t followed all of the suggestions I make here. Neither have many other people who talk about improving science. I know of vocal proponents of preregistration who have yet to preregister any study of their own. The reasons for this are complex. Of course, you should “be the change you wish to see in the world” (I’m told Gandhi said this). But it’s not always that simple.

On the whole, though, I think I have published almost all of the research I’ve done. While I currently have a lot of unpublished results, there is very little in the file drawer, as most of these experiments have either been submitted or are being written up for eventual publication. There are two exceptions. One is a student project that produced somewhat inconclusive results, although I would say it is a conceptual replication of a published study by others. The main reason we haven’t tried to publish this yet is that the student has left and hasn’t been in contact, and the data aren’t exciting enough for us to bother with the hassle of publication (and it is a hassle!).

The other data set is perhaps ironic because it is a perfect example of the scenario I described earlier. A few years ago, when I started a new postdoc, I was asked to replicate an experiment a previous lab member had done. For simplicity, let’s just call this colleague Dr Toffee. Again, they can identify themselves if they wish. The main reason for the replication was that reviewers had asked Dr Toffee to collect eye-movement data. So I replicated the original experiment but added eye-tracking. My replication wasn’t an exact one in the strictest terms because I decided to code the experimental protocol from scratch (this was a lot easier). I also had to use a different stimulus setup from the previous experiment, as the original setup wouldn’t have worked with the eye-tracker. Still, I did my best to match the conditions in all other ways.

My results showed a highly significant effect in the opposite direction to the original finding. We did all the necessary checks to ensure that this wasn’t just a coding error or the like. It seemed to be real. Dr Toffee and I discussed what to do about it and we eventually decided that we wouldn’t bother to publish this set of experiments. The original experiment had been conducted several years before my replication. Dr Toffee had moved on with their life. I, on the other hand, had done this experiment as a courtesy because I was asked to; it was very peripheral to my own research interests. So, as in the other example, we both felt that going through the publication process would have been a fairly big hassle for very little gain.

Now, this is bad. Perhaps there is some poor researcher out there, a student maybe, who will run a similar experiment and waste a lot of time testing a hypothesis that, at least according to our incoherent results, is unlikely to be true. And perhaps they will also not publish their failure to support this hypothesis. The circle of null results continues… :/

But you need to pick your battles. We are all just human beings and we do not have unlimited (research) energy. For both of these lacklustre or incoherent results I mentioned (and these are literally the only completed experiments we haven’t at least begun to write up), undergoing the pain of submission->review->rejection->repeat seems like a daunting task that simply isn’t worth it.

So what to do? Well, the solution is again what I described above. The very reason that publishing these results isn’t worth our energy is everything that is wrong with the current publication process! In my dream world, in which I can simply write up a manuscript formatted in a way that pleases me and upload it to a preprint peer-review site, my life would be infinitely simpler. No more perusing dense journal websites for their guides to authors or hunting for the Zotero/Endnote/Whatever style to format the bibliography. No more submitting your files to one horribly designed, clunky journal website after another, checking the same stupid tick boxes, adding the same reviewer suggestions. No more rewriting your cover letters by changing the name of the journal. Certainly for my student’s project it would not be hard to do, as there is already a dissertation that can be used as a basis for the manuscript. Dr Toffee’s experiment and its contradictory replication might require a bit more work – but to be fair, even there a previous manuscript already exists. So all we’d need to add would be the modifications to the methods and the results of my replication. In a world where all you need to do is upload the manuscript and address some reviewers’ comments to ensure the quality of the science, this should be fairly little effort. In turn, it would ensure that the file drawer stays empty and that we are all much more productive.

This world isn’t here yet, but there are journals that already allow something not too far off from it, namely F1000Research and PeerJ (The Winnower also counts, although the content there seems to be different and I don’t quite know how much editing in response to review happens there). So, maybe I should email Dr Toffee now…

(* In case you didn’t get this from the previous 2700ish words: the answer to this question is unequivocally “No.”)

7 thoughts on “Is publication bias actually a good thing?*”

  1. Excellent post, Sam. I think you’ve really nailed the conflict between pragmatism and idealism — “being the change you want to see” is all well and good, but pushing against the tide in a competitive resource-stretched system could well lead to pushing oneself out of the profession entirely. This is the conundrum faced by junior researchers and highlights the need for incentives to align with open science. We gotta change the tide itself!

    One thing that occurs to me reading this post is that there is a feature of Registered Reports that could help add some clarity to cases where an exact pre-registered replication produces very different findings to the original experiment (whether null or just confusing) – and that is the sequential registration option. Were an author at Cortex to find themselves in this scenario, we would be happy for them to say “We don’t really know what’s happening here but we’d like to register a third study to try and understand why the effect doesn’t replicate” and then go ahead and do it.

    On the other hand, we’d also be happy for the final outcome to be uncertain – because sometimes that’s just how science rolls. We all need to get more comfortable with uncertainty, though it is a bitter pill to swallow for many scientists who – to put it mildly – have developed a sweet tooth.

    1. Yes sequential registration is a good point. In a way I tried to allude to that already but wasn’t very clear. You could have more than one replication attempt as part of the same study. Some of them could be more indirect to test possible moderators. Or (even better) they would contain a direct replication condition but also test additional factors. You could certainly register these in sequence – in fact that would be a very sensible way to go about it. I’d like to believe this would even be interesting to reviewers as they would be more interactively involved in the evolution of the study.

      In general I agree that we need to embrace uncertainty more. To me, science is all about admitting what we don’t yet know. This is why we’re doing it, isn’t it? We want to reduce uncertainty, but the whole reason for that is that there remain unanswered questions. You’d never guess it from reading most papers out there, though, as they usually sound very confident in their (post hoc) hypotheses.

  2. “The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments.”

    I agree. However (and this is what I was trying to argue in the discussion yesterday) I can see why people (or labs) might find it very difficult to publish two sets of data that seemingly contradict each other.

    Your example of the replication that gave the opposite result from the original is a good instance of this.

    People worry that if they were to publish such “messy” results, they would be seen as sloppy (or at least as a research team with some sloppy members).

    It’s a tough thing to expect people to do.

    So maybe in an ideal world, people wouldn’t replicate their own work. They would replicate other people’s work, which would avoid the possibility of self-contradiction that, it seems to me, puts scientists in difficult positions.

    Also, a result that’s been replicated 5 times by 5 different people is more reliable than a result that’s been replicated 10 times by the same person (I would say).

    1. You’re right that it is difficult to expect people to do this, but I think that this has more to do with unfounded fears than with reality (like many of the concerns scientists have, e.g. the fear of being scooped or the idea that people might preregister work they’ve already done). It will no doubt depend on the individual case, but I’d bet that in most scenarios where you find contradictory results in your replication people will actually applaud you for publishing them. At least now they would. If this became common practice in the future, nobody would bat an eyelid at it anymore.

      Also note that, as discussed above with Chris, this could be combined with further experiments that try to dig deeper and find out why the results are so inconsistent. The more replications you do, the more confident you can be in the end and the more conclusively you can actually answer your research question. Combined with guaranteed publication under a Registered Report, this sounds like a good plan to me.

      You are of course right that independent replications are ideal. But that doesn’t mean you shouldn’t try to replicate your own work too. At the very least, whenever feasible you should build in replications of your previous findings in new experiments you are doing.

      And while I like your idea of asking someone else to replicate your result, I think this is more difficult to implement than asking people to replicate their own findings. Still, it’s a neat idea. We should try it – next time we have a result we think is cool, we should ask someone else to try to replicate it and thus join the author list. On a voluntary basis like this it could be fairly straightforward to make it happen.

    2. I assume you’ll be talking more about this in your upcoming post, so we can discuss this more when that’s done 🙂

  3. I’m the person who advocated self-replication as a way around the problem of useless clutter in the psychology literature, and I stand by that comment. As you say, the current incentives are perverse – people’s contributions to science are counted on a per-paper basis rather than a per-chunk-of-knowledge basis, so there is no incentive to really thoroughly pursue a line of enquiry with multiple experiments until you have figured out what is going on, and *then* publish one big story. The result is that the field becomes clogged with multiple small underpowered studies that just squeaked past that magic p = 0.05 threshold.

    Physics doesn’t seem to have this problem – but then standards in physics seem to be a bit higher. The Higgs boson, for example, took p=0.0000003 to be accepted. Where does our miserable 0.05 come from? It’s not well founded in anything meaningful, and is more of a convenience than anything else (enough of our effort crosses the 0.05 boundary that we feel rewarded enough to carry on).

    The reason I advocate self-replication is that no matter what p-hacking you resort to in order to get one of your studies past the threshold, the other one only has a 1-in-20 chance of falsely agreeing (though of course you could p-hack both together – but that’s a bit harder!). The original scientists stand to benefit more from self-replication than anyone else would benefit from replicating their study, and anyway if they do it properly the replication becomes part of the sequence of studies that pushes them towards their ultimate conclusion.

    A variant of self-replication is just to keep doing the experiment until p=0.05 becomes p=0.01. Whatever, I think we need a culture where we don’t publish papers where the whole claim rests on a single p=0.05 result.

    Something else I suggested at the meeting is that when a finding is replicated, by self or other, the original study should be marked in PubMed with a star to show that it has found subsequent support in the literature. Pre-registration and self-replication would also earn a star. Then, we would know to treat un-starred papers as preliminary findings, and not build elaborate stories around them.

    As for publishing everything? We’ll have to agree to disagree on that. I think that negative findings can be useful but only rarely, and they tend to add to the clutter. Better to stick things in your file drawer (or on http://psychfiledrawer.org/) and carry on experimenting until you have something positive to say. We need to figure out how to reward that approach though.

    1. Thanks for your comment, Kate, and hats off for identifying yourself (I don’t think this is needed, but I also don’t think it should matter that you did!). Please note that I have nothing bad to say about self-replication. It is a good idea. So is having other people replicate your work. Independent replications instill more confidence, but as you (and I above) point out, it is also harder to make them happen. It is probably more likely that you can convince people to do this for really sensational findings (and it is probably also more needed in those cases), but it is unlikely to ever happen for the bulk of incremental research.

      I also think (and have stated many times before) that we should build replications into most experiments we do. I certainly try to do that (I am doing it more now than I did in the past, except perhaps for the most basic things like replicating the presence of human visual areas). This is a form of replication that doesn’t waste anybody’s time and effort. Rather, I’d argue that it is essential to the novel research you do to show that the previous findings on which your hypothesis is based actually hold.

      The idea of a star or flag indicating how often and how widely an effect in the literature has been replicated is also something I have discussed in the past. In my vision, something like this would appear on PubMed and it would visualise the straight-up direct replications, the ones that were built into ongoing research, as well as the more conceptual ones that show whether an effect generalises. PsychFileDrawer and CurateScience (whose site I should finally check out after they gave me a login ages ago) already provide this in a way, but it’s not yet as general as I would like.

      The bit where we just entirely disagree is the idea of “negative” and “positive” findings. If a piece of research is executed well, it shouldn’t matter whether the results are positive or negative. I don’t even think these adjectives really mean what you seem to imply. A “negative” finding can very well be a significant difference in a critical condition. In fact, in my view the best research is designed in this way. The recent Firestone & Scholl article discussed a nice example of this:
      http://www.ncbi.nlm.nih.gov/pubmed/26189677

      I certainly agree that we shouldn’t clog up the literature with more underpowered, poorly done rubbish. This is why I think opening the floodgates to any old manuscripts and counting them as “publications” is unwise. We need proper quality assurance before we count a study as sufficiently trustworthy. You may be right that a badly done, incoherent, and contradictory set of results isn’t worth that much. But I’d rather take a study with 2-3 contradictory findings that clearly had ample power to detect the hypothesised effects over the notion that we should only publish “positive” findings. In the latter case you never know whether nobody ever tested your hypothesis before or whether they did but had inconclusive results. This leads to a lot of wasted effort.

      Finally, I think lowering significance thresholds is not the answer either, or at least not the only one. Instead I think we need more hypothesis-driven predictions that allow people to power their studies adequately (see the back-of-the-envelope sketch at the end of this comment). This is practically similar to lowering the significance threshold, but conceptually I think it is very different. When lowering the threshold you simply say that you want to be more precise in your wild stab in the dark. Personally I think it would be much better for science to be clearer about what effects we expect and then test them (there is definitely a Bayesian perspective on this too, but I’ll leave that out for now).

      It is also not really realistic to expect the significance thresholds used in particle physics in biology or psychology. If you aimed for them with the noisy effects we study, you would probably need to use up the entire UK science budget to do just one experiment. That’s neither feasible nor sensible. As Chris said above, we need to embrace some uncertainty. I think it’s part of the deal. Certainty will increase gradually with converging evidence and independent replications – including the negative ones.
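      To make the “adequately powered” point above a little more concrete, here is a rough back-of-the-envelope sketch (my own toy numbers, using the standard normal approximation for a two-sample comparison, not anything from this thread). It shows how strongly the required sample size depends on the effect size you are prepared to predict, and what a physics-style threshold would do to it:

```python
# Rough a priori power sketch via the normal approximation for a two-sample
# t-test; the effect sizes and thresholds below are illustrative assumptions.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate participants per group to detect a standardised effect d."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_power = norm.ppf(power)          # desired statistical power
    return 2 * ((z_alpha + z_power) / d) ** 2

print(round(n_per_group(0.5)))              # ~63 per group for a medium effect
print(round(n_per_group(0.2)))              # ~392 per group for a small effect
print(round(n_per_group(0.2, alpha=3e-7)))  # ~1781 per group at a 5-sigma-style threshold
```

      So the prediction you are willing to commit to up front does most of the work; simply ratcheting down alpha multiplies the cost without telling you what effect you actually expect.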
