Category Archives: scientific publishing

Chris Chambers is a space alien

Imagine you are a radio astronomer and you suddenly stumble across a signal from outer space that appears to be evidence of an extra-terrestrial intelligence. Let’s also assume you already confidently ruled out any trivial artifactual explanation to do with naturally occurring phenomena or defective measurements. How could you confirm that this signal isn’t simply a random fluke?

This is actually the premise of the novel Contact by Carl Sagan, which happens to be one of my favorite books (I never watched the movie but only caught the end which is nothing like the book so I wouldn’t recommend it…). The solution to this problem proposed in the book is that one should quantify how likely the observed putative extraterrestrial signal would be under the assumption that it is the product of random background radiation.
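As a minimal sketch of that calculation, suppose (purely for illustration) that the background radiation behaves like Gaussian noise with a known mean and standard deviation. The function names and the numbers below are my own assumptions, not anything from the book:

```python
import math

def tail_probability(z):
    """One-sided probability that standard Gaussian noise yields a value >= z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_value_for_signal(observed, noise_mean, noise_sd):
    """Probability of a reading at least as strong as `observed` under pure noise."""
    z = (observed - noise_mean) / noise_sd
    return tail_probability(z)

# Hypothetical reading: five standard deviations above the background level.
p = p_value_for_signal(observed=5.0, noise_mean=0.0, noise_sd=1.0)
print(f"p = {p:.2e}")
```

For a reading five standard deviations above the noise, this comes out at roughly 3 in 10 million.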

This is basically what a p-value in frequentist null hypothesis significance testing represents. Using frequentist inference requires that you have a pre-specified hypothesis and a pre-specified design. You should have an effect size in mind, determine how many measurements you need to achieve a particular statistical power, and then you must carry out this experiment precisely as planned. This is rarely how real science works and it is often put forth as one of the main arguments why we should preregister our experimental designs. Any analysis that wasn’t planned a priori is by definition exploratory. The most extreme form of this argument posits that any experiment that hasn’t been preregistered is exploratory. While I still find it hard to agree with this extremist position, it is certainly true that analytical flexibility distorts the inferences we can make about an observation.

This proposed frequentist solution is therefore inappropriate for confirming our extraterrestrial signal. Because the researcher stumbled across the signal, the analysis is by definition exploratory. Moreover, you must also beware of the base-rate fallacy: even an event extremely unlikely under the null hypothesis is not necessarily evidence against the null hypothesis. Even if p=0.00001, a true extraterrestrial signal may be even less likely, say, p=10^-100. Even if extra-terrestrial signals are quite common, given the small amount of space, time, and EM bands we have studied thus far, how probable is it we would just stumble across a meaningful signal?
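The base-rate argument can be sketched with Bayes' rule. This is only an illustration: I am treating the p-value as the probability of the data under noise and assuming a genuine signal would produce the data with certainty, both of which are simplifications. The function name is hypothetical.

```python
def posterior_signal(prior, p_data_given_signal, p_data_given_noise):
    """Posterior probability of a real signal, via Bayes' rule."""
    numerator = prior * p_data_given_signal
    evidence = numerator + (1.0 - prior) * p_data_given_noise
    return numerator / evidence

# The numbers from the text: p = 0.00001 under noise, prior of 10^-100.
tiny_prior = posterior_signal(prior=1e-100, p_data_given_signal=1.0,
                              p_data_given_noise=1e-5)
# Same evidence, but with a far more generous (assumed) prior of 1 in 1000.
generous_prior = posterior_signal(prior=1e-3, p_data_given_signal=1.0,
                                  p_data_given_noise=1e-5)
print(f"tiny prior: {tiny_prior:.1e}, generous prior: {generous_prior:.3f}")
```

With the tiny prior the posterior stays astronomically small despite the impressive p-value, whereas the same evidence combined with a generous prior makes the signal very probable.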

None of that means that exploratory results aren’t important. I think you’d agree that finding credible evidence of an extra-terrestrial intelligence capable of sending radio transmissions would be a major discovery. The other day I met up with Rob McIntosh, one of the editors for Registered Reports at Cortex, to discuss the distinction between exploratory and confirmatory research. A lot of the criticism of preregistration focuses on whether it puts too much emphasis on hypothesis-driven research and whether it in turn devalues or marginalizes exploratory studies. I have spent a lot of time thinking about this issue and (encouraged by discussions with many proponents of preregistration) I have come to the conclusion that the opposite is true: by emphasizing which parts of your research are confirmatory I believe exploration is actually valued more. The way scientific publishing conventionally works, many studies are written up in a way that pretends they were hypothesis-driven when in truth they weren’t. Probably for a lot of published research the truth lies somewhere in the middle.

So preregistration just keeps you honest with yourself and if anything it allows you to be more honest about how you explored the data. Nobody is saying that you can’t explore, and in fact I would argue you should always include some exploration. Whether it is an initial exploratory experiment that you did that you then replicate or test further in a registered experiment, or whether it is a posthoc robustness test you do to ensure that your registered result isn’t just an unforeseen artifact, some exploration is almost always necessary. “If we knew what we were doing, it would not be called research, would it?” (a quote by Albert Einstein, apparently).

One idea I discussed with Rob is whether there should be a publication format that specifically caters to exploration (Chris Chambers has also mentioned this idea previously). Such Exploratory Reports would allow researchers to publish interesting and surprising findings without first registering a hypothesis. You may think this sounds a lot like what a lot of present day high impact papers are like already. The key difference is that these Exploratory Reports would contain no inferential statistics and critically they are explicit about the fact that the research is exploratory – something that is rarely the case in conventional studies. However, this idea poses a critical challenge: on the one hand you want to ensure that the results presented in such a format are trustworthy. But how do you ensure this without inferential statistics?

Proponents of the New Statistics (which aren’t actually “new” and it is also questionable whether you should call them “statistics”) will tell you that you could just report the means/medians and confidence intervals, or perhaps the whole distributions of data. But that isn’t really helping. Inspecting confidence intervals and how far they are from zero (or another value of no interest) is effectively the same thing as a significance test. Even merely showing the distribution of observations isn’t really helping. If a result is so blatantly obvious that it convinces you by visual inspection (the “inter-ocular trauma test”), then formal statistical testing would be unnecessary anyway. If the results are even just a little subtler, it can be very difficult to decide whether the finding is interesting. So the way I see it, we either need a way to estimate statistical evidence, or you need to follow up the finding with a registered, confirmatory experiment that specifically seeks to replicate and/or further test the original exploratory finding.
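To illustrate why inspecting a confidence interval amounts to a significance test, here is a minimal sketch using a normal approximation (a z-test rather than a t-test, which is my simplifying assumption). The function name and the simulated data are hypothetical.

```python
import math
import random
import statistics

def z_interval_and_p(sample, alpha=0.05):
    """Return the (1 - alpha) confidence interval and two-sided p-value vs. zero."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    std_normal = statistics.NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (mean - z_crit * se, mean + z_crit * se)
    p = 2 * (1 - std_normal.cdf(abs(mean / se)))
    return ci, p

random.seed(1)
for true_effect in (0.0, 0.2, 0.5):
    # Simulated observations with unit noise around an assumed true effect.
    sample = [random.gauss(true_effect, 1.0) for _ in range(50)]
    (low, high), p = z_interval_and_p(sample)
    excludes_zero = low > 0 or high < 0
    print(f"effect={true_effect}: CI=({low:.2f}, {high:.2f}), p={p:.3f}, "
          f"excludes zero={excludes_zero}, significant={p < 0.05}")
```

Whenever the 95% interval excludes zero, the p-value falls below 0.05 and vice versa, because both are derived from the same estimate and standard error.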

In the case of our extra-terrestrial signal you may plan a new measurement. You know the location in the sky where the signal came from, so part of your preregistered methods is to point your radio telescope at the same point. You also have an idea of the signal strength, which allows you to determine the number of measurements needed to have adequate statistical power. Then you carry out this experiment, sticking meticulously to your planned recipe. Finally, you report your result and the associated p-value.
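The power calculation in that plan could look roughly like this. It is a sketch assuming a two-sided one-sample z-test and a standardized effect size (signal strength divided by noise standard deviation); the function name and the default values are my own choices.

```python
import math
import statistics

def required_measurements(effect_size, alpha=0.05, power=0.8):
    """Measurements needed for a two-sided one-sample z-test (normal approximation)."""
    std_normal = statistics.NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: {required_measurements(d)} measurements")
```

Under these assumptions a medium effect of 0.5 needs about 32 measurements for 80% power, while a weak effect of 0.2 needs nearly 200. A test based on the t distribution would ask for slightly more.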

Sounds good in theory. In practice, however, this is not how science typically works. Maybe the signal isn’t continuous. There could be all sorts of reasons why the signal may only be intermittent, be it some interstellar dust clouds blocking the line of transmission, the transmitter pointing away from Earth due to the rotation of the aliens’ home planet, or even simply the fact that the aliens are operating their transmitter on a random schedule. We know nothing about what an alien species, let alone their civilization, may be like. Who is to say that they don’t just fall into random sleeping periods in irregular intervals?

So some exploratory, flexible analysis is almost always necessary. If you are too rigid in your approach, you are very likely to miss important discoveries. At the same time, you must be careful not to fool yourself. If we are really going down the route of Exploratory Reports without any statistical inference we need to come up with a good way to ensure that such exploratory findings aren’t mostly garbage. I think in the long run the only way to do so is to replicate and test results in confirmatory studies. But this could already be done as part of a Registered Report in which your design is preregistered. Experiment 1 would be exploratory without any statistical inference but simply reporting the basic pattern of results. Experiment 2 would then be preregistered and replicate or test the finding further.

However, Registered Reports can take a long time to publish. This may in fact be one of the weak points about this format that may stop the scientific community from becoming more enthusiastic about them. As long as there is no real incentive to doing slow science, the idea that you may take two or three years to publish one study is not going to appeal to many people. It will stop early career researchers from getting jobs and research funding. It also puts small labs in poorer universities at a considerable disadvantage compared to researchers with big grants, big data, and legions of research assistants.

The whole point of Exploratory Reports would be to quickly push out interesting observations. In some ways, this is then exactly what brief communications in high impact journals are currently for. I don’t think it will serve us well to replace snappy (and likely untrue) high impact findings that come with inappropriate statistical inferences by snappy (and likely untrue) exploratory findings that come with no statistical inference at all. If the purpose of Exploratory Reports is solely to provide an outlet for quick publication of interesting results, we still have the same kind of skewed incentive structure as now. Also, while removing statistical inference from our exploratory findings may be better statistical practice, I am not convinced that it is better scientific practice unless we have other ways of ensuring that these exploratory results are kosher.

The way I see it, the only way around this dilemma is to finally stop treating publications as individual units. Science is by nature a lengthy, incremental process. Yes, we need exciting discoveries to drive science forward. At the same time, replicability and robustness of our discoveries is critical. In order to combine these two needs I believe research findings should not be seen as separate morsels but as a web of interconnected results. A single Exploratory Report (or even a bunch of them) could serve as the starting point. But unless they are followed up by Registered Reports replicating or scrutinizing these findings further, they are not all that meaningful. Only once replications and follow-up experiments have been performed does the whole body of a finding take shape. A search on PubMed or Google Scholar would not merely spit out the original paper but a whole tree of linked experiments.

The perceived impact and value of a finding thus would be related to how much of an interconnected body of evidence it has generated rather than whether it was published in Nature or Science. Critically, this would allow people to quickly publish their exciting finding and thus avoid being deadlocked by endless review processes and disadvantaged compared to other people who can afford to do more open science. At the same time, they would be incentivized to conduct follow-up studies. Because a whole body of related literature is linked, it would also be an incentive for others to conduct replications or follow-up experiments on your exploratory finding.

There are obviously logistic and technical challenges with this idea. The current publication infrastructure still does not really allow for this to work. This is not a major problem however. It seems entirely feasible to implement such a system. The bigger challenge is how to convince the broader community and publishers and funders to take this on board.


Strolling through the Garden of Forking Paths

The other day I got into another Twitter argument – for which I owe Richard Morey another drink – about preregistration of experimental designs before data collection. Now, as you may know, I have in the past had long debates with proponents of preregistration. Not really because I was against it per se but because I am a natural skeptic. It is still far too early to tell if the evidence supports the claim that preregistration improves the replicability and validity of published research. I also have an innate tendency to view any revolutionary proposals with suspicion. However, these long discussions have eased my worries and led me to revise my views on this issue. As Russ Poldrack put it nicely, preregistration no longer makes me nervous. I believe the theoretical case for preregistration is compelling. While solid empirical evidence for the positive and negative consequences of preregistration will only emerge over the course of the coming decades, this is not actually all that important. I seriously doubt that preregistration actually hurts scientific progress. At worst it has not much of an effect at all – but I am fairly confident that it will prove to be a positive development.

Curiously, largely due to the heroic efforts by one Christopher Chambers, a Sith Lord at my alma mater Cardiff University, I am now strongly in favor of the more radical form of preregistration, registered reports (RRs), where the hypothesis and design is first subject to peer review, data collection only commences when the design has been accepted, and eventual publication is guaranteed if the registered plan was followed. In departmental discussions, a colleague of mine repeatedly voiced his doubts that RRs could ever become mainstream, because they are such a major effort. It is obvious that RRs are not ideal for all kinds of research and to my knowledge nobody claims otherwise. RRs are a lot of work that I wouldn’t invest in something like a short student project, in particular a psychophysics experiment. But I do think they should become the standard operating procedure for many larger, more expensive projects. We already have project presentations at our imaging facility where we discuss new projects and make suggestions on the proposed design. RRs are simply a way to take this concept into the 21st century and the age of transparent research. It can also improve the detail or quality of the feedback: most people at our project presentations will not be experts on the proposed research while peer reviewers at least are supposed to be. And, perhaps most important, RRs ensure that someone actually compares the proposed design to what was carried out eventually.

When RRs are infeasible or impractical, there is always the option of using light preregistration, in which you only state your hypothesis and experimental plans and upload this to OSF or a similar repository. I have done so twice now (although one is still in the draft stage and therefore not yet public). I would strongly encourage people to at least give that a try. If a detailed preregistration document is too much effort (it can be a lot of work although it should save you work when writing up your methods later on), there is even the option for very basic registration. The best format invariably depends on your particular research question. Such basic preregistrations can add transparency to the distinction between exploratory and confirmatory results because you have a public record of your prior predictions. Primarily, I think they are extremely useful to you, the researcher, as it allows you to check how directly you navigated the Garden of Forking Paths. Nobody stops you from taking a turn here or there. Maybe this is my OCD speaking, but I think you should always peek down some of the paths at least, simply as a robustness check. But the preregistration makes it less likely that you fool yourself. It is surprisingly easy to start believing that you took a straight path and forget about all the dead ends along the way.

This for me is really the main point of preregistration and RRs. I think a lot of the early discussion of this concept, and a lot of the opposition to it, stems from the implicit or even explicit accusation that nobody can be trusted. I can totally understand why this fails to win the hearts and minds of many people. However, it’s also clear that questionable research practices and deliberate p-hacking have been rampant. Moreover, unconscious p-hacking due to analytical flexibility almost certainly affects many findings. There are a lot of variables here and so I’d wager that most of the scientific literature is actually only mildly skewed by that. But that is not the point. Rather, as scientists, especially those of us who study cognitive and mental processes of all things, shouldn’t we want to minimize our own cognitive biases and human errors that could lead us astray? Instead of the rather negative “data police” narrative that you often hear, this is exactly what preregistration is about. And so I think a basic preregistration is first and foremost for yourself.
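To get a feeling for how much analytical flexibility can matter, here is a small simulation sketch. It assumes a deliberately crude form of flexibility: there is no true effect, the researcher tests several independent outcome measures with a z-test (noise standard deviation known to be 1), and only the smallest p-value gets reported. All names and parameters are hypothetical.

```python
import math
import random
import statistics

def smallest_p_value(n_measures, n_samples, rng):
    """Smallest two-sided p-value across independent null outcome measures."""
    std_normal = statistics.NormalDist()
    p_values = []
    for _ in range(n_measures):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        # z-statistic for the sample mean, given a known noise sd of 1
        z = (sum(sample) / n_samples) * math.sqrt(n_samples)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return min(p_values)

def false_positive_rate(n_measures, n_experiments=2000, n_samples=20,
                        alpha=0.05, seed=0):
    """Fraction of null experiments in which the best p-value beats alpha."""
    rng = random.Random(seed)
    hits = sum(smallest_p_value(n_measures, n_samples, rng) < alpha
               for _ in range(n_experiments))
    return hits / n_experiments

for k in (1, 5, 10):
    print(f"{k} paths: simulated {false_positive_rate(k):.3f}, "
          f"expected {1 - 0.95 ** k:.3f}")
```

With a single planned analysis the false positive rate sits near the nominal 5%, but with five or ten paths to choose from it climbs towards 1 - 0.95^k. Real forking paths are rarely independent, so this only sketches the direction of the problem, not its actual size.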

When I say such a basic preregistration is for yourself, this does not necessarily mean it cannot also be interesting to others. But I do believe their usefulness to other people is limited and should not be overstated. As with many of the changes brought on by open science, we must remain skeptical of any unproven claims of their benefits and keep in mind potential dangers. The way I see it, most (all?) public proponents of either form of preregistration are fully aware of this. I think the danger really concerns the wider community. I occasionally see anonymous or sock-puppet accounts popping up in online comment sections espousing a very radical view that only preregistered research can be trusted. Here is why this is disturbing me:

1. “I’ll just get some fresh air in the garden …”

Preregistered methods can only be as good as the detail they provide. A preregistration can be so vague that you cannot make heads or tails of it. The basic OSF-style registrations (e.g. the AsPredicted format) may be particularly prone to this problem but it could even be the case when you wrote a long design document. In essence, this is just saying you’ll take a stroll in the hedge maze without giving any indication whatsoever which paths you will take.

2. “I don’t care if the exit is right there!”

Preregistration doesn’t mean that your predictions make any sense or that there isn’t a better way to answer the research question. Often such things will only be revealed once the experiment is under way or completed and I’d actually hazard the guess that this is usually the case. Part of the beauty of preregistration is that it demonstrates to everyone (including yourself!) how many things you probably didn’t think of before starting the study. But it should never be used as an excuse not to try something unregistered when there are good scientific reasons to do so. This would be the equivalent of taking one predetermined path through the maze and then getting stuck in a dead end – in plain sight of the exit.

3. “Since I didn’t watch you, you must have chosen forking paths!”

Just because someone didn’t preregister their experiment does not mean their experiment was not confirmatory. Exploratory research is actually undervalued in the current system. A lot of research is written up as if it were confirmatory even if it wasn’t. Ironically, critics of preregistration often suggest that it devalues exploratory research but it actually places greater value on it because you are no longer incentivized to hide it. But nevertheless, confirmatory research does happen even without preregistration. It doesn’t become any less confirmatory because the authors didn’t tell you about it. I’m all in favor of constructive skepticism. If a result seems so surprising or implausible that you find it hard to swallow, by all means scrutinize it closely and/or carry out an (ideally preregistered) attempt to replicate it. But astoundingly, even people who don’t believe in open science sometimes do good science. When a tree falls in the garden and nobody is there to hear it, it still makes a sound.

Late September when the forks are in bloom

Obviously, RRs are not completely immune to these problems either. Present day peer review frequently fails to spot even glaring errors, so it is inevitable that it will also make mistakes in the RR situation. Moreover, there are additional problems with RRs, such as the fact that they require an observant and dedicated editor. This may not be so problematic while RR editors are strong proponents of RRs but if this concept becomes more widespread this will not always be the case. It remains to be seen how that works out. However, I think on the whole the RR concept is a reasonably good guarantee that hypotheses and designs are scrutinized, and that results are published, independent of the final outcome. The way I see it, both of these are fundamental improvements over the way we have been doing science so far.

But I’d definitely be very careful not to over-interpret the fact that a study is preregistered, especially when it isn’t an RR. Those badges they put on Psych Science articles may be a good incentive for people to embrace open science practices but I’m very skeptical of anyone who implies that a study is more trustworthy just because it was preregistered, or because it shares data and materials. Because it simply isn’t. It lulls you into a false sense of security and I thought the intention here was not to fool ourselves so much any more. A recent case of data being manipulated after it was uploaded demonstrates how misleading an open data badge can be. In the same vein, just because an experiment is preregistered does not mean the authors didn’t lead us (and themselves) down the garden path. There have also been cases of preregistered studies that then did not actually report the outcomes of their intended analyses.

So, preregistration only means that you can read what the authors said they would do and then check for yourself how this compares to what they did do. That’s great because it’s transparent. But unless you actually do this check, you should treat the findings with the same skepticism (and the authors with the same courtesy and respect) as you would those of any other, non-registered study.

Sometimes it is really not that hard to find your way through the garden…

3 scoops of vanilla science in a low impact waffle please

A lot of young[1] researchers are worried about being “scooped”. No, this is not about something unpleasantly kinky but about when some other lab publishes an experiment that is very similar to yours before you do. Sometimes this is even more than just a worry and it actually happens. I know that this could be depressing. You’ve invested months or years of work and sleepless nights in this project and then somebody else comes along and publishes something similar and – poof – all the novelty is gone. Your science career is over. You will never publish this high impact now. You won’t ever get a grant. Immeasurable effort down the drain. Might as well give up, sell your soul to the Devil, and get a slave job in the pharmaceutical industry and get rich[2].

Except that this is total crap. There is no such thing as being scooped in this way, or at least if there is, it is not the end of your scientific career. In this post I want to briefly explain why I think so. This won’t be a lecture on the merits of open science, on replications, on how we should care more about the truth than about novelty and “sexiness”. All of these things are undoubtedly true in my mind and they are things we as a community should be actively working to change – but this is no help to young scientists who are still trying to make a name for themselves in a system that continues to reward high impact publications over substance.

No. Here I will talk about this issue with respect to the status quo. Even in the current system, imperfect as it may be, I think this irrational fear is unfounded. It is essential to dispel these myths about impact and novelty, about how precedence is tied to your career prospects. Early career scientists are the future of science. How can we ever hope to change science for the better if we allow this sort of madness to live on in the next generation of scientists? I say ‘live on’ for good reason – I, too, used to suffer from this madness when I was a graduate student and postdoc.

Why did I have this madness? Honestly I couldn’t say. Perhaps it’s a natural evolution of young researchers, at least in our current system. People like to point the finger at the lab PIs pressuring you into this sort of crazy behaviour. But that wasn’t it for me. For most of my postdoc I worked with Geraint Rees at UCL and perhaps the best thing he ever told me was to fucking chill[3]. He taught me – more by example than words – that while having a successful career was useful, what is much more important is to remember why you’re doing it: The point of having a (reasonably successful) science career is to be able to pay the rent/mortgage and take some enjoyment out of this life you’ve been given. The reason I do science, rather than making a mint in the pharma industry[4], is that I am genuinely curious and want to figure shit out.

Guess what? Neither of these things depends on whether somebody else publishes a similar (or even very similar) experiment while you’re still running it. We all know that novelty still matters to a lot of journals. Some have been very reluctant to publish replication attempts. I agree that publishing high impact papers does help wedge your foot in the door (that is, get you short-listed) in grant and job applications. But even if this were all that matters to be a successful scientist (and it really isn’t), here’s why you shouldn’t care too much about that anyway:

No paper was ever rejected because it was scooped

While journal editors will reject papers because they aren’t “novel,” I have never seen any paper being rejected because somebody else published something similar a few months earlier. Most editors and reviewers will not even be aware of the scooping study. You may find this hard to believe because you think your own research is so timely and important, but statistically it is true. Of course, some reviewers will know of the work. But most reviewers are not actually bad people and will not say “Something like this was published three months ago already and therefore this is not interesting.” Again, you may find this hard to believe because we’ve all heard too many stories of Reviewer 2 being an asshole. But in the end, most people aren’t that big of an asshole[5]. It happens quite frequently that I suggest in reviews that the authors cite some recently published work (usually not my own, in case you were wondering) that is very similar to theirs. In my experience this has never led to a rejection. I simply ask them to put their results in the context of similar findings in the literature. You know, the way a Discussion section should be.

No two scooped studies are the same

You may think that the scooper’s experiment was very similar, but unless they actually stole your idea (a whole different story I also don’t believe but I have no time for this now…) and essentially pre-replicated (preclicated?) your design, I’d bet that there are still significant differences. Your study has not lost any of its value because of this. And it’s certainly no reason to quit and/or be depressed.

It’s actually a compliment

Not 100% sure about this one. Scientific curiosity shouldn’t have anything to do with a popularity contest if you ask me. Study whatever the hell you want to (within ethical limits, that is). But I admit, it feels reassuring to me when other people agree that the research questions I am interested in are also interesting to them. For one thing, this means that they will appreciate you working and (eventually) publishing on it, which again from a pragmatic point of view means that you can pay those rents/mortgages. And from a simple vanity perspective it is also reassuring that you’re not completely mad for pursuing a particular research question.

It has little to do with publishing high impact

Honestly, from what I can tell neither precedence nor the popularity of your topic is the critical factor in getting your work into high impact journals. The novelty of your techniques, how surprising and/or reassuringly expected your results are, and the simplicity of the narrative are actually major factors. Moreover, the place you work, the co-authors with whom you write your papers, and the accessibility of the writing (in particular your cover letter to the editors!) definitely matter a great deal also (and these are not independent of the first points either…). It is quite possible that your “rival”[6] will publish first, but that doesn’t mean you won’t publish similar work in a higher impact journal. Journal review outcome is pretty stochastic and not really very predictable.

Actual decisions are not based on this

We all hear the horror stories of impact factors and h-indexes determining your success with grant applications and hiring decisions. Even if this were true (and I actually have my doubts that it is as black and white as this), a CV with lots of high impact publications may get your foot in the door – but it does not absolve the panel from making a hiring/funding decision. You need to do the work on that one yourself and even then luck may be against you (the odds certainly are). It also simply is not true that most people are looking for the person with the most Nature papers. Instead I bet you they are looking for people who can string together a coherent argument, communicate their ideas, and who have the drive and intellect to be a good researcher. Applicants with a long list of high impact papers may still come up with awful grant proposals or do terribly in job interviews while people with less stellar publication records can demonstrate their excellence in other ways. You may already have made a name for yourself in your field anyway, through conferences, social media, public engagement etc. This may matter far more than any high impact paper could.

There are more important things

And now we’re coming back to the work-life balance and why you’re doing this in the first place. Honestly, who the hell cares whether someone else published this a few months earlier? Is being the first to do this the reason you’re doing science? I can see the excitement of discovery but let’s face it, most of our research is not like the work of Einstein or Newton, nor are we discovering extraterrestrial life. Your discovery is no doubt exciting to you, it is hopefully exciting to some other scientists in your little bubble and it may even be exciting to some journalist who will write a distorting, simplifying article about it for the mainstream news. But seriously, it’s not so groundbreaking that it is worth sacrificing your mental and physical health over it. Live your life. Spend time with your family. Be good to your fellow creatures on this planet. By all means, don’t be complacent, ensure you make a living but don’t pressure yourself into believing that publishing ultra-high impact papers is the meaning of life.

A positive suggestion for next time…

Now if you’re really worried about this sort of thing, why not preregister your experiment? I know I said I wouldn’t talk about open science here but bear with me just this once because this is a practical point you can implement today. As I keep saying, the whole discussion about preregistration is dominated by talking about “questionable research practices”, HARKing, and all that junk. Not that these aren’t worthwhile concerns but this is a lot of negativity. There are plenty of positive reasons why preregistration can help and the (fallacious) fear of being scooped is one of them. Preregistration does not stop anyone else from publishing the same experiment before you but it does allow you to demonstrate that you had thought of the idea before they published it. With Registered Reports it becomes irrelevant if someone else published before you because your publication is guaranteed after the method has been reviewed. And I believe it will also make it far clearer to everyone how much who published what first where actually matters in the big scheme of things.

[1] Actually there are a lot of old and experienced researchers who worry about this too. And that is far worse than when early career researchers do it because they should really know better and they shouldn’t feel the same career pressures.
[2] It may sound appealing now, but thinking about it I wouldn’t trade my current professional life for anything. Except for grant admin bureaucracy perhaps. I would happily give that up at any price… :/
[3] He didn’t quite say it in those terms.
[4] This doesn’t actually happen. If you want to make a mint you need to go into scientific publishing but the whole open science movement is screwing up that opportunity now as well so you may be out of luck!
[5] Don’t bombard me with “Reviewer 2 held up my paper to publish theirs first” stories. Unless Reviewer 2 signed their review or told you specifically that it was them I don’t take such stories at face value.
[6] The sooner we stop thinking of other scientists in those terms the better for all of us.


Yes, science is self-correcting

If you don’t believe science self-corrects, then you probably shouldn’t believe that evolution by natural selection occurs either – it’s basically the same thing.

I have said it many times before, both under the guise of my satirical alter ego and later – more seriously – on this blog. I am getting very tired of repeating it so I wrote this final post about it that I will simply link to next time this inevitably comes up…

My latest outburst about this was triggered by this blog post by Keith Laws entitled “Science is ‘Other-Correcting‘”. I have no qualms with the actual content of this post. It gives an interesting account of the attempt to correct an error in the publication record. The people behind this effort are great researchers for whom I have the utmost respect. The story they tell is shocking and important. In particular, the email they received by accident from a journal editor is disturbing and serves as a reminder of all the things that are wrong with the way scientific research and publishing currently operates.

My issue is with the (in my view seemingly) ubiquitous doubts about the self-correcting nature of science. To quote from the first paragraph in that post:

“I have never been convinced by the ubiquitous phrase ‘Science is self-correcting’. Much evidence points to science being conservative and looking less self-correcting and more ego-protecting. It is also not clear why ‘self’ is the correct description – most change occurs because of the ‘other’ – Science is other correcting.”

In my view this and similar criticisms of self-correction completely miss the point. The prefix ‘self-‘ refers to science, not to scientists. In fact, the very same paragraph contains the key: “Science is a process.” Science is an iterative approach by which we gradually broaden our knowledge and understanding of the world. You can debate whether or not there is such a thing as the “scientific method” – perhaps it’s more of a collection of methods. However, in my view above all else science is a way of thinking.

Scientific thinking is being inquisitive, skeptical, and taking nothing for granted. Prestige, fame, success are irrelevant. Perfect theories are irrelevant. The smallest piece of contradictory evidence can refute your grand unifying theory. And science encompasses all that. It is an emergent concept. And this is what is self-correcting.

Scientists, on the other hand, are not self-correcting. Some are more so than others but none are perfect. Scientists are people and thus inherently fallible. They are subject to ego, pride, greed, and all of life’s pressures, such as the need to pay a mortgage, feed their children, and have a career. In the common vernacular “science” is often conflated with the scientific enterprise, the way scientists go about doing science. This involves all those human factors and more and, fair enough, it is anything but self-correcting. But to argue that this means science isn’t self-correcting is attacking a straw man because few people are seriously arguing that the scientific enterprise couldn’t be better.

We should always strive to improve the way we do science because due to our human failings it will never be perfect. However, in this context we also shouldn’t forget how much we have already improved it. In the times of Newton, in Europe (the hub of science then) science was largely done only by white men from a very limited socioeconomic background. Even decades later, most women or people of non-European origin didn’t even need to bother trying (although this uphill struggle makes the achievements of scientists like Marie Curie or Henrietta Swan Leavitt all the more impressive). And publishing your research findings was not subject to formal peer review but largely dependent on the ego of some society presidents and on whether they liked you. None of these problems have been wiped off the face of the Earth but I would hope most people agree that things are better than they were 100 years ago.

Like all human beings, scientists are flawed. Nevertheless I am actually optimistic about us as a group. I do believe that on the whole scientists are actually interested in learning the truth and widening their understanding of nature. Sure, there are black sheep and even the best of us will succumb to human failings. At some point or other our dogma and affinity to our pet hypotheses can blind us to the cold facts. But on average I’d like to think we do better than most of our fellow humans. (Then again, I’m probably biased…).

We will continue to make the scientific enterprise better. We will change the way we publish and evaluate scientific findings. We will improve the way we interpret evidence and communicate scientific discoveries. The scientific enterprise will become more democratic, less dependent on publishers getting rich on our free labour. Already within the decade I have been a practicing scientist we have begun to tear down the widespread illusion that when a piece of research is published it must therefore be true. When I did my PhD, the only place we could critically discuss new publications was in a small journal club and the conclusions of these discussions were almost never shared with the world. Nowadays every new study is immediately discussed online by an international audience. We have taken leaps towards scientific findings, data, and materials being available to anyone, anywhere, provided they have internet access. I am very optimistic that this is only the beginning of much more fundamental changes.

Last year I participated in a workshop called “Is Science Broken?” that was solely organised by graduate students in my department. The growing number of replication attempts in the literature and all these post-publication discussions we are having are perfect examples of science correcting itself. It seems deeply ironic to me when posts like Keith Laws’, which describes an active effort to rectify errors, argue against the self-correcting nature of the scientific process.

Of course, self-correction is not guaranteed. It can easily be stifled. There is always a danger that we drift back into the 19th century or the dark ages. But the greater academic freedom (and generous funding) scientists are given, the more science will be allowed to correct itself.

Science is like a calm lagoon in the sunset… Or whatever. There is no real reason why this picture is here.

Update (19 Jan 2016): I just read this nice post about the role of priors in Bayesian statistics. The author actually says Bayesian analysis is “self-correcting” and this epitomises my point here about science. I would say science is essentially Bayesian. We start with prior hypotheses and theories but by accumulating evidence we update our prior beliefs to posterior beliefs. It may take a long time but assuming we continue to collect data our assumptions will self-correct. It may take a reevaluation of what the evidence is (which in this analogy would be a change to the likelihood function). Thus the discussion about how we know how close to the truth we are is in my view missing the point. Self-correction describes the process.
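To make the Bayesian analogy concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: two hypothetical researchers start from opposite Beta priors about some rate, and both see the same (invented) batches of evidence. The conjugate Beta-Binomial update is standard, but the priors and the counts are my assumptions, not data from any real study.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate Beta-Binomial update: returns the posterior (alpha, beta)."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Two hypothetical researchers with opposite prior beliefs about a rate.
sceptic = (1.0, 9.0)   # prior mean 0.1
believer = (9.0, 1.0)  # prior mean 0.9

# Invented evidence arriving in growing batches: (successes, failures).
batches = [(5, 5), (50, 50), (500, 500)]

gaps = []
for successes, failures in batches:
    sceptic = update_beta(*sceptic, successes, failures)
    believer = update_beta(*believer, successes, failures)
    # Distance between the two posterior means after this batch.
    gaps.append(abs(beta_mean(*sceptic) - beta_mean(*believer)))

print(gaps)  # the disagreement shrinks with every batch of evidence
```

Neither researcher has to be "self-correcting" as a person for this to work. The process of updating on shared evidence pulls both of them towards the same answer, which is all I mean when I say that science self-corrects.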

Update (21 Jan 2016): I added a sentence from my comment in the discussion section to the top. It makes for a good summary of my post. The analogy may not be perfect – but even if not I’d say it’s close. If you disagree, please leave a comment below.

Why wouldn’t you share data?

Data sharing has been in the news a lot lately, from the refusal of the authors of the PACE trial to share their data even though the journal expects it to the eventful story of the “Sadness impairs color perception” study. A blog post by Dorothy Bishop called “Who’s afraid of Open Data?” made the rounds. The post itself is actually a month old already but it was republished by the LSE blog which gave it some additional publicity. In it she makes an impassioned argument for open data sharing and discusses the fears and criticisms many researchers have voiced against data sharing.

I have long believed in making all data available (and please note that in the following I will always mean data and materials, so not just the results but also the methods). The way I see it this transparency is the first and most important remedy to the ills of scientific research. I have regular discussions with one of my close colleagues* about how to improve science – we don’t always agree on various points like preregistration, but if there is one thing where we are on the same page, it is open data sharing. By making data available anyone can reanalyse it and check if the results reproduce and it allows you to check the robustness of a finding for yourself, if you feel that you should. Moreover, by documenting and organising your data you not only make it easier for other researchers to use, but also for yourself and your lab colleagues. It also helps you with spotting errors. It is also a good argument that stops reviewer 2 from requesting a gazillion additional analyses – if they really think these analyses are necessary they can do them themselves and publish them. This aspect in fact overlaps greatly with the debate on Registered Reports (RR) and it is one of the reasons I like the RR concept. But the benefits of data sharing go well beyond this. Access to the data will allow others to reuse the data to answer scientific questions you may not even have thought of. They can also be used in meta-analyses. With the increasing popularity and feasibility of large-scale permutation/bootstrapping methods, the availability of the raw values will be particularly important. Access to the data allows you to take into account distributional anomalies, outliers, or perhaps estimate the uncertainty on individual data points.
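To illustrate why resampling methods need the raw values rather than just a reported mean, here is a small sketch of a percentile bootstrap in Python. The data are invented (note the single outlier), and the function name and its defaults are my own choices for illustration, not anybody’s published pipeline.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Resamples the raw values with replacement, so it cannot be
    computed from summary statistics alone.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    means = sorted(
        statistics.mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented raw values from ten hypothetical participants, one clear outlier.
raw_values = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.0, 0.9, 1.1, 4.5]

lower, upper = bootstrap_ci(raw_values)
print(lower, statistics.mean(raw_values), upper)
```

With only the published mean and standard deviation you could not run this, and you would never see that a single data point is responsible for most of the width of the interval.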

But as Dorothy describes, many scientists nevertheless remain afraid of publishing their actual data alongside their studies. For several years many journals and funding agencies have had a policy that data should always be shared upon request – but a laughably small proportion of such requests are successful. This is why some have now adopted the policy that all data must be shared in repositories upon publication or even upon submission. And to encourage this process recently the Peer Reviewer Openness Initiative was launched by which signatories would refuse to conduct in-depth reviews of manuscripts unless the authors can give a reason why data and materials aren’t public.

My most memorable experience with fears about open data involves a case where the lab head refused to share data and materials with the graduate student* who actually created the methods and collected the data. The exact details aren’t important. Maybe one day I will talk more about this little horror story… For me this demonstrates how far we have come already. Nowadays that story would be baffling to most researchers but back then (and that’s only a few years ago – I’m not that old!) more than one person actually told me that the PI and university were perfectly justified in keeping the student’s results and the fruits of their intellectual labour under lock and key.

Clearly, people are still afraid of open data. Dorothy lists the following reasons:

  1. Lack of time to curate data;  Data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task;
  2. Personal investment – sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders;
  3. Concerns about being scooped before the analysis is complete;
  4. Fear of errors being found in the data;
  5. Ethical concerns about confidentiality of personal data, especially in the context of clinical research;
  6. Possibility that others with a different agenda may misuse the data, e.g. perform selective analysis that misrepresented the findings;

In my view, points 1-4 are invalid arguments even if they seem understandable. I have a few comments about some of these:

The fear of being scooped 

I honestly am puzzled by this one. How often does this actually happen? The fear of being scooped is widespread and it may occasionally be justified. Say, if you discuss some great idea you have or post a pilot result on social media perhaps you shouldn’t be surprised if someone else agrees that the idea is great and also does it. Some people wouldn’t be bothered by that but many would and that’s understandable. Less understandable to me is if you present research at a conference and then complain about others publishing similar work because they were inspired by you. That’s what conferences are for. If you don’t want that to happen, don’t go to conferences. Personally, I think science would be a lot better if we cared a lot less about who did what first and instead cared more about what is true and how we can work together…

But anyway, as far as I can see none of that applies to data sharing. By definition data you share is either already published or at least submitted for peer review. If someone reuses your data for something else they have to cite you and give you credit. In many situations they may even do it in collaboration with you which could lead to coauthorship. More importantly, if the scooped result is so easily obtained that somebody beats you to it despite your head start (it’s your data, regardless of how well documented it is you will always know it better than some stranger) then perhaps you should have thought about that sooner. You could have held back on your first publication and combined the analyses. Or, if it really makes more sense to publish the data in separate papers, then you could perhaps declare that the full data set will be shared after the second one is published. I don’t really think this is necessary but I would accept that argument.

Either way, I don’t believe being scooped by data sharing is very realistic and any cases of that happening must be extremely rare. But please share these stories if you have them to prove me wrong! If you prefer, you can post it anonymously on the Neuroscience Devils. That’s what I created that website for.

Fear of errors being discovered

I’m sure everyone can understand that fear. It can be embarrassing to have your errors (and we all make mistakes) discovered – at least if they are errors with big consequences. Part of the problem is also that all too often the discovery of errors is associated with some malice. To err is human, to forgive divine. We really need to stop treating every revelation of somebody’s mistakes (or, for that matter, every failure to replicate somebody’s findings) as an indication of sloppy science or malpractice. Sometimes (usually?) mistakes are just mistakes.

Probably nobody wants to have all of their data combed by vengeful sleuths nitpicking every tiny detail. If that becomes excessive and the same person is targeted, it could border on harassment and that should be counteracted. In-depth scrutiny of all the data by a particular researcher should be a special case that only happens when there is a substantial reason, say, in a fraud investigation. I would hope though that these cases are also rare.

And surely nobody can seriously want the scientific record to be littered with false findings, artifacts, and coding errors. I am not happy if someone tells me I made a serious error but I would nonetheless be grateful to them for telling me! It has happened before when lab members or collaborators spotted mistakes I made. In turn I have spotted mistakes colleagues made. None of this would have been possible if we didn’t share our data and methods amongst one another. I am always surprised when I hear how uncommon this seems to be in some labs. Labs should be collaborative, and so should science as a whole. And as I already said, organising and documenting your data actually helps you to spot errors before the work is published. If anything, data sharing reduces mistakes.

Ethical issues with patient confidentiality

This is a big concern – and the only one that I have full sympathy with. But all of our ethics and data protection applications actually discuss this. The only data that is shared should be anonymised. Participants should only be identified by unique codes that only the researchers who collected the data have access to. For a lot of psychology or other behavioural experiments this shouldn’t be hard to achieve.
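For simple behavioural data, the coding scheme described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the field names, the code format, and the example records are hypothetical, and a real study would follow whatever its ethics approval prescribes.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace the identifying field of each record with a random code.

    Returns (shareable_records, key). The key maps each code back to the
    original identity and must stay with the researchers who collected
    the data; only shareable_records would go into a public repository.
    """
    key = {}
    shareable = []
    for record in records:
        code = "P" + secrets.token_hex(4)
        while code in key:  # regenerate in the unlikely case of a collision
            code = "P" + secrets.token_hex(4)
        key[code] = record[id_field]
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["participant"] = code
        shareable.append(cleaned)
    return shareable, key


# Hypothetical records from a behavioural experiment.
records = [
    {"name": "Alice Example", "threshold": 0.42},
    {"name": "Bob Example", "threshold": 0.57},
]

shareable, key = pseudonymise(records)
print(shareable)  # codes and measurements only, no names
```

Note that this only removes the direct identifier. Any other fields that could single out a person in combination still need to be considered before the data are shared.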

Neuroimaging or biological data are a different story. I have a strict rule for my own results. We do not upload the actual brain images of our fMRI experiments to public repositories. While under certain conditions I am willing to share such data upon request as long as the participant’s name has been removed, I don’t think it is safe to make those data permanently available to the entire internet. Participant confidentiality must trump the need for transparency. It simply is not possible to remove all identifying information from these files. Skull-stripping, which removes the head tissues from an MRI scan except for the brain, does not remove all identifying information. Brains are like finger-prints and they can easily be matched up, if you have the required data. As someone* recently said in a discussion of this issue, the undergrad you are scanning in your experiment now may be Prime Minister in 20-30 years’ time. They definitely didn’t consent to their brain scans being available to anyone. It may not take much to identify a person’s data using only their age, gender, handedness, and a basic model of their head shape derived from their brain scan. We must also keep in mind what additional data mining may be possible in the coming decades that we simply have no idea about yet. Nobody can know what information could be gleaned from these data, say, about health risks or personality factors. Sharing this without very clear informed consent (that many people probably wouldn’t give) is in my view irresponsible.

I also don’t believe that for most purposes this is even necessary. Most neuroimaging studies involve group analyses. In those you first spatially normalise the images of each participant and then perform statistical analysis across participants. It is perfectly reasonable to make those group results available. For the purpose of non-parametric permutation analyses (also in the news recently) you would want to share individual data points but even there you can probably share images after sufficient processing that not much incidental information is left (e.g. condition contrast images). In our own work, these considerations don’t apply. We conduct almost all our analyses in the participant’s native brain space. As such we decided to only share the participants’ data projected on a cortical reconstruction. These data contain the functional results for every relevant voxel after motion correction and signal filtering. No, this isn’t raw data but it is sufficient to reproduce the results and it is also sufficient for applying different analyses. I’d wager that for almost all purposes this is more than enough. And again, if someone were to be interested in applying different motion correction or filtering methods, this would be a negotiable situation. But I don’t think we need to allow unrestricted permanent access for such highly unlikely purposes.

Basically, rather than sharing all raw data I think we need to treat each data set on a case-by-case basis and weigh the risks against benefits. What should be mandatory in my view is sharing all data after default processing that is needed to reproduce the published results.

People with agendas and freeloaders

Finally a few words about a combination of points 2 and 6 in Dorothy Bishop’s list. When it comes to controversial topics (e.g. climate change, chronic fatigue syndrome, to name a few examples where this apparently happened) there could perhaps be the danger that people with shady motivations will reanalyse and nitpick the data to find fault with them and discredit the researcher. More generally, people with limited expertise may conduct poor reanalyses. Since failed reanalyses (and again, the same applies to failed replications) often cause quite a stir and are frequently discussed as evidence that the original claims were false, this could indeed be a problem. Also some will perceive these cases as “data tourism”, using somebody else’s hard-won results for quick personal gain – say by making a name for themselves as a cunning data detective.

There can be some truth in that and for that reason I feel we really have to work harder to change the culture of scientific discourse. We must resist the bias to agree with the “accuser” in these situations. (Don’t pretend you don’t have this bias because we all do. Maybe not in all cases but in many cases…)

Of course skepticism is good. Scientists should be skeptical but the skepticism should apply to all claims (see also this post by Neuroskeptic on this issue). If somebody reanalyses somebody else’s data using a different method that does not automatically make them right and the original author wrong. If somebody fails to replicate a finding, that doesn’t mean that finding was false.

Science thrives on discussion and disagreement. The critical thing is that the discussion is transparent and public. Anyone who has an interest should have the opportunity to follow it. Anyone who is skeptical of the authors’ or the reanalysers’/replicators’ claims should be able to check for themselves.

And the only way to achieve this level of openness is Open Data.


* They will remain anonymous unless they want to join this debate.

On the value of unrecorded piloting

In my previous post, I talked about why I think all properly conducted research should be published. Null results are important. The larger scientific community needs to know whether or not a particular hypothesis has been tested before. Otherwise you may end up wasting somebody’s time because they repeatedly try in vain to answer the same question. What is worse, we may also propagate false positives through the scientific record because failed replications are often still not published. All of this contributes to poor replicability of scientific findings.

However, the emphasis here is on ‘properly conducted research‘. I already discussed this briefly in my post but it also became the topic of an exchange between (for the most part) Brad Wyble, Daniël Lakens, and myself. In some fields, for example psychophysics, extensive piloting and “fine-tuning” of experiments are not only very common but probably also necessary. To me it doesn’t seem sensible to make the results of all of these attempts publicly available. This would inevitably flood the scientific record with garbage. Most likely nobody will look at it. Even if you are a master at documenting your work, nobody but you (and after a few months maybe not even you) will understand what is in your archive.

Most importantly, it can actually be extremely misleading for others who are less familiar with the experiment to see all of the tests you did ensuring the task was actually doable, that monitors were at the correct distance from the participant, your stereoscope was properly aligned, the luminance of the stimuli was correct, that the masking procedure was effective, etc. Often you may only realise during your piloting that the beautiful stimulus you designed after much theoretical deliberation doesn’t really work in practice. For example, you may inadvertently induce an illusory percept that alters how participants respond in the task. This in fact happened recently with an experiment a collaborator of mine piloted. And more often than not, after having tested a particular task on myself at great length I then discover that it is far too difficult for anyone else (let’s talk about overtrained psychophysicists another time…).

Such pilot results are not very meaningful

It most certainly would not be justified to include them in a meta-analysis to quantify the effect – because they presumably don’t even measure the same effect (or at least not very reliably). A standardised effect size, like Cohen’s d, is a signal-to-noise ratio as it compares an effect (e.g. difference in group means) to the variability of the sample. The variability is inevitably larger if a lot of noisy, artifactual, and quite likely erroneous data are included. While some degree of this can be accounted for in meta-analysis by using a random-effects model, it simply doesn’t make sense to include bad data. We are not interested in the meta-effect, that is, the average result over all possible experimental designs we can dream up, no matter how inadequate.
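Here is a quick numerical illustration of that signal-to-noise point, sketched in Python. All numbers are invented: the “clean” groups stand for data from a working procedure and the “pilot” values stand for noisy runs where the task didn’t work. The formula is the usual Cohen’s d with a pooled standard deviation.

```python
import statistics


def cohens_d(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Invented measurements from a procedure that works.
clean_a = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0]
clean_b = [0.5, 0.7, 0.4, 0.6, 0.8, 0.5]

# Invented pilot runs from a version of the task that did not work.
pilot_a = [3.0, -1.5, 0.2, 2.5]
pilot_b = [-2.0, 2.8, 0.1, 1.9]

d_clean = cohens_d(clean_a, clean_b)
d_mixed = cohens_d(clean_a + pilot_a, clean_b + pilot_b)
print(d_clean, d_mixed)  # pooling in the pilot runs shrinks d dramatically
```

The difference between the group means barely changes when the pilot runs are pooled in, but the variability balloons, so the standardised effect collapses. That collapse tells you about the bad pilot design, not about the effect you actually care about.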

What we are actually interested in is some biological effect and we should ensure that we take the most precise measurement possible. Once you have a procedure that you are confident will yield precise measurements, by all means, carry out a confirmatory experiment. Replicate it several times, especially if it’s not an obvious effect. Pre-register your design if you feel you should. Maximise statistical power by testing many subjects if necessary (although often significance is tested on a subject-by-subject basis, so massive sample sizes are really overkill as you can treat each participant as a replication – I’ll talk about replication in a future post so I’ll leave it at this for now). But before you do all this you usually have to fine-tune an experiment, at least if it is a novel problem.

Isn’t this contributing to the problem?

Colleagues in social/personality psychology often seem to be puzzled and even concerned by this. The opacity of what has or hasn’t been tried is part of the problems that plague the field and lead to publication bias. There is now a whole industry meta-analysing results in the literature to quantify ‘excess significance’ or a ‘replication index’. This aims to reveal whether some additional results, especially null results, may have been suppressed or if p-hacking was employed. Don’t these pilot experiments count as suppressed studies or p-hacking?

No, at least not if this is done properly. The criteria you use to design your study must of course be orthogonal to and independent from your hypothesis. Publication bias, p-hacking, and other questionable practices are all actually sub-forms of circular reasoning: You must never use the results of your experiment to inform the design as you may end up chasing (overfitting) ghosts in your data. Of course, you must not run 2-3 subjects on an experiment, look at the results and say ‘The hypothesis wasn’t confirmed. Let’s tweak a parameter and start over.’ This would indeed be p-hacking (or rather ‘result hacking’ – there are usually no p-values at this stage).

A real example

I can mainly speak from my own experience but typically the criteria used to set up psychophysics experiments are sanity/quality checks. Look for example at the figure below, which shows a psychometric curve of one participant. The experiment was a 2AFC task using the method of constant stimuli: In each trial the participant made a perceptual judgement on two stimuli, one of which (the ‘test’) could vary physically while the other remained constant (the ‘reference’). The x-axis plots how different the two stimuli were, so 0 (the dashed grey vertical line) means they were identical. To the left or right of this line the correct choice would be the reference or test stimulus, respectively. The y-axis plots the percentage of trials the participant chose the test stimulus. By fitting a curve to these data we can extrapolate the ability of the participant to tell apart the stimuli – quantified by how steep the curve is – and also their bias, that is at what level of x the two stimuli appeared identical to them (dotted red vertical line):

As you can tell, this subject was quite proficient at discriminating the stimuli because the curve is rather steep. At many stimulus levels the performance is close to perfect (that is, either near 0 or 100%). There is a point where performance is at chance (dashed grey horizontal line). But once you move to the left or the right of this point performance becomes good very fast. The curve is however also shifted considerably to the right of zero, indicating that the participant indeed had a perceptual bias. We quantify this horizontal shift to infer the bias. This does not necessarily tell us the source of this bias (there is a lot of literature dealing with that question) but that’s beside the point – it clearly measures something reliably. Now look at this psychometric curve instead:


The general conventions here are the same but these results are from a completely different experiment that clearly had problems. This participant did not make correct choices very often as the curve only barely goes below the chance line – they chose the test stimulus far too often. There could be numerous reasons for this. Maybe they didn’t pay attention and simply made the same choice most of the time. For that the trend is a bit too clean though. Perhaps the task was too hard for them, maybe because the stimulus presentation was too brief. This is possible although it is very unlikely that a healthy, young adult with normal vision would not be able to tell apart the more extreme stimulus levels with high accuracy. Most likely, the participant did not really understand the task instructions or perhaps the stimuli created some unforeseen effect (like the illusion I mentioned before) that actually altered what percept they were judging. Whatever the reason, there is no correct way to extrapolate the psychometric parameters here. The horizontal shift and the slope are completely unusable. We see an implausibly poor discrimination performance and extremely large perceptual bias. If their vision really worked this way, they should be severely impaired…

So these data are garbage. It makes no sense to meta-analyse biologically implausible parameter estimates. We have no idea what the participant was doing here and thus we can also have no idea what effect we are measuring. Now this particular example is actually a participant a student ran as part of their project. If you did this pilot experiment on yourself (or a colleague) you might have worked out what the reason for the poor performance was.
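For readers who have never fitted a psychometric curve, the two cases in this example can be sketched roughly as follows in Python. To be clear about the assumptions: the trial counts below are invented to resemble the two situations described, not the actual data from the figures. The cumulative Gaussian is just one common choice of curve, and the brute-force grid search stands in for whatever fitting routine you prefer. Here mu is the bias (the stimulus level at which the two stimuli look identical) and sigma is inversely related to the slope.

```python
import math


def cumulative_gaussian(x, mu, sigma):
    """Probability of choosing the test stimulus at stimulus difference x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def neg_log_likelihood(mu, sigma, levels, n_test, n_trials):
    """Binomial negative log-likelihood of the observed choices."""
    total = 0.0
    for x, k, n in zip(levels, n_test, n_trials):
        p = cumulative_gaussian(x, mu, sigma)
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep log() finite
        total -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total


def fit_psychometric(levels, n_test, n_trials):
    """Coarse maximum-likelihood grid search over bias (mu) and spread (sigma)."""
    best_mu, best_sigma, best_nll = None, None, float("inf")
    for i in range(-60, 61):        # mu from -3.0 to 3.0 in steps of 0.05
        mu = i * 0.05
        for j in range(1, 81):      # sigma from 0.05 to 4.0 in steps of 0.05
            sigma = j * 0.05
            nll = neg_log_likelihood(mu, sigma, levels, n_test, n_trials)
            if nll < best_nll:
                best_mu, best_sigma, best_nll = mu, sigma, nll
    return best_mu, best_sigma


# Invented data: stimulus differences and 20 trials per level.
levels = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
n_trials = [20] * 7

# Number of "test" choices per level for a proficient but biased observer...
good_counts = [0, 0, 1, 4, 11, 18, 20]
# ...and for an observer whose curve barely dips below chance.
bad_counts = [8, 9, 11, 13, 15, 17, 18]

mu_good, sigma_good = fit_psychometric(levels, good_counts, n_trials)
mu_bad, sigma_bad = fit_psychometric(levels, bad_counts, n_trials)
print(mu_good, sigma_good)  # steep curve, shifted to the right of zero
print(mu_bad, sigma_bad)    # much shallower curve
```

The fit will happily return numbers for the second data set too. That is exactly the danger: the estimates exist, but they describe a participant who wasn’t doing the task, so treating them as measurements of the effect would be meaningless.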

What can we do about it?

In my view, it is entirely justified to exclude such data from our publicly shared data repositories. It would be a major hassle to document all these iterations. And what is worse, it would obfuscate the results for anyone looking at the archive. If I look at a data set and see a whole string of brief attempts from a handful of subjects (usually just the main author), I could be forgiven for thinking that something dubious is going on here. However, in most cases this would be unjustified and a complete waste of everybody’s time.

At the same time, however, I also believe in transparency. Unfortunately, some people do engage in result-hacking and iteratively enhance their findings by making the experimental design contingent on the results. In most such cases this is probably not done deliberately and with malicious intent – but that doesn’t make it any less questionable. All too often people like to fiddle with their experimental design while the actual data collection is already underway. In my experience this tendency is particularly severe among psychophysicists who moved into neuroimaging where this is a really terrible (and costly) idea.

How can we reconcile these issues? In my mind, the best way is perhaps to document briefly what you did to refine the experimental design. We honestly don’t need or want to see all the failed attempts at setting up an experiment but it could certainly be useful to have an account of how the design was chosen. What experimental parameters were varied? How and why were they chosen? How many pilot participants were there? This last point is particularly telling. When I pilot something, there usually is one subject: Sam. Possibly I will have also tested one or two others, usually lab members, to see if my familiarity with the design influences my results. Only if the design passes quality assurance, say by producing clear psychometric curves or by showing to-be-expected results in a sanity check (e.g., the expected response on catch trials), would I dare to actually subject “real” people to a novel design. Having some record, even if only as part of the documentation of your data set, is certainly a good idea though.

The number of participants and pilot experiments can also help you judge the quality of the design. Such “fine-tuning” and tweaking of parameters isn’t always necessary – in fact most designs we use are actually straight-up replications of previous ones (perhaps with an added condition). I would say though that in my field this is a very normal thing to do, at least when setting up a new design. However, I have also heard of extreme cases that I find fishy. (I will spare you the details and will refrain from naming anyone). For example, in one study the experimenters ran over 100 pilot participants – tweaking the design all along the way – to identify those that showed a particular perceptual effect and then used literally a handful of these for an fMRI study that claims to have been about “normal” human brain function. Clearly, this isn’t alright. But this also cannot possibly count as piloting anymore. The way I see it, a pilot experiment can’t have an order of magnitude more data than the actual experiment…

How does this relate to the wider debate?

I don’t know how applicable these points are to social psychology research. I am not a social psychologist and my main knowledge about their experiments comes from reading particularly controversial studies or the discussions about them on social media. I guess that some of these issues do apply but that it is far less common. An equivalent situation to what I describe here would be that you redesign your questionnaire because people always score at maximum – and by ‘people’ I mean the lead author :P. I don’t think this is a realistic situation in social psychology, but it is exactly how psychophysical experiments work. Basically, what we do in piloting is what a chemist would do when they are calibrating their scales or cleaning their test tubes.

Or here’s another analogy using a famous controversial social psychology finding we discussed previously: Assume you want to test whether some stimulus makes people walk more slowly as they leave the lab. What I do in my pilot experiments is to ensure that the measurement I take of their walking speed is robust. This could involve measuring the walking time for a number of people before actually doing any experiment. It could also involve setting up sensors to automate this measurement (more automation is always good to remove human bias but of course this procedure needs to be tested too!). I assume – or I certainly hope so at least – that the authors of these social psychology studies did such pre-experiment testing that was not reported in their publications.

As I said before, humans are dirty test tubes. But you should ensure that you get them as clean as you can before you pour in your hypothesis. Perhaps a lot of this falls under methods we don’t report. I’m all for reducing this. Methods sections frequently lack necessary detail. But to some extent, I think some unreported methods and tests are unavoidable.

Humans apparently also glow with unnatural light

Is publication bias actually a good thing?*

Yesterday Neuroskeptic came to our Cognitive Drinks event in the Experimental Psychology department at UCL to talk about p-hacking. His entertaining talk (see Figure 1) was followed by a lively and fairly long debate about p-hacking and related questions about reproducibility, preregistration, and publication bias. During the course of this discussion a few interesting things came up. (I deliberately won’t name anyone as I think this complicates matters. People can comment and identify themselves if they feel that they should…)

Figure 1. Using this super-high-tech interactive fMRI results simulator Neuroskeptic clearly demonstrated a significant blob of activation in the pre-SMA (I think?) in stressed compared to relaxed people. This result made perfect sense.

It was suggested that a lot of the problems with science would be remedied effectively if only people were encouraged (or required?) to replicate their own findings before publication. Now that sounds generally like a good idea. I have previously suggested that this would work very well in combination with preregistration: you first do a (semi-)exploratory experiment to finalise the protocol, then submit a preregistration of your hypothesis and methods, and then do the whole thing again as a replication (or perhaps more than one if you want to test several boundary conditions or parameters). You then submit the final set of results for publication. Under the Registered Report format, your preregistered protocol would already undergo peer review. This would ensure that the final results are almost certain to be published provided you didn’t stray excessively from the preregistered design. So far, so good.

Should you publish unclear results?

Or is it? Someone suggested that it would be a problem if your self-replication didn’t show the same thing as the original experiment. What should one do in this case? Doesn’t publishing something incoherent like this, one significant finding and a failed replication, just add to the noise in the literature?

At first, this question simply baffled me, as I suspect it would many of the folks campaigning to improve science. (My evil twin sister called these people Crusaders for True Science but I’m not supposed to use derogatory terms like that anymore nor should I impersonate lady demons for that matter. Most people from both sides of this mudslinging contest “debate” never seemed to understand that I’m also a revolutionary – you might just say that I’m more Proudhon, Bakunin, or Henry David Thoreau rather than Marx, Lenin, or Che Guevara. But I digress…)

Surely, the attitude that unclear, incoherent findings, that is, those that are more likely to be null results, are not worth publishing must contribute to the prevailing publication bias in the scientific literature? Surely, this view is counterproductive to the aims of science to accumulate evidence and gradually get closer to some universal truths? We must know which hypotheses have been supported by experimental data and which haven’t. One of the most important lessons I learned from one of my long-term mentors was that all good experiments should be published regardless of what they show. This doesn’t mean you should publish every single pilot experiment you ever did that didn’t work. (We can talk about what that does and doesn’t mean another time. But you know how life is: sometimes you think you have some great idea only to realise that it makes no sense at all when you actually try it in practice. Or maybe that’s just me? :P). Even with completed experiments you probably shouldn’t bother publishing if you realise afterwards that it is all artifactual or the result of some error. Hopefully you don’t have a lot of data sets like that though. So provided you did an experiment of suitable quality I believe you should publish it rather than hiding it in the proverbial file drawer. All scientific knowledge should be part of the scientific record.
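To illustrate why the file drawer matters, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (a small true effect of 0.2, 20 observations per study, a two-sided z-test with known standard deviation, and the function name `simulate_file_drawer`); none of it comes from a real data set. It compares the average effect estimate across all simulated studies with the average across only those that reached significance.

```python
import random
from statistics import NormalDist


def simulate_file_drawer(true_effect=0.2, n=20, n_studies=5000, seed=1):
    """Simulate many small studies and 'publish' only the significant ones.

    All parameter values are hypothetical and chosen purely for illustration.
    Returns (mean estimate over all studies,
             mean estimate over significant studies,
             proportion of studies that were significant).
    """
    rng = random.Random(seed)
    standard_error = 1.0 / n ** 0.5            # known SD of 1 is assumed
    z_crit = NormalDist().inv_cdf(0.975)       # two-sided alpha = 0.05
    all_estimates = []
    published = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        estimate = sum(sample) / n
        all_estimates.append(estimate)
        # The 'file drawer': keep only estimates that reach significance.
        if abs(estimate / standard_error) > z_crit:
            published.append(estimate)
    mean_all = sum(all_estimates) / len(all_estimates)
    mean_published = sum(published) / len(published)
    return mean_all, mean_published, len(published) / n_studies


mean_all, mean_published, prop_significant = simulate_file_drawer()
print(f"mean of all studies:         {mean_all:.3f}")
print(f"mean of 'published' studies: {mean_published:.3f}")
print(f"proportion significant:      {prop_significant:.3f}")
```

Under these assumptions the full set of studies recovers the true effect on average, whereas the significant subset overstates it considerably, simply because a small study can only cross the threshold when its estimate happens to be large.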

I naively assumed that this view was self-evident and shared by almost everyone – but this clearly is not the case. Yet instead of sneering at such alternative opinions I believe we should understand why people hold them. There are reasonable arguments why one might wish to not publish every unclear finding. The person making this suggestion at our discussion said that it is difficult to interpret a null result, especially an assumed null result like this. If your original experiment O showed a significant effect supporting your hypothesis, but your replication experiment R does not, you cannot naturally conclude that the effect really doesn’t exist. For one thing you need to be more specific than that. If O showed a significant positive effect but R shows a significant negative one, this would be more consistent with the null hypothesis than if O is highly significant (p<10^-30) and R just barely misses the threshold (p=0.051).
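To make the difference between these two scenarios concrete, here is a rough sketch that pools the two experiments using Stouffer's method, one simple way of combining z-scores. This is my illustration rather than anything proposed in the discussion. It assumes O and R are independent and equally weighted (which may well not hold for a self-replication), and the p-value of 0.01 used for the opposite-direction scenario is a made-up number. The helper names `signed_z` and `stouffer` are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def signed_z(p_two_sided, direction):
    """Convert a two-sided p-value plus effect direction (+1 or -1) to a z-score."""
    # Use the lower tail so that very small p-values do not round to 1.0.
    z_magnitude = -_STD_NORMAL.inv_cdf(p_two_sided / 2.0)
    return direction * z_magnitude


def stouffer(z_scores):
    """Combine z-scores with equal weights (assumes independent experiments)."""
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    # Two-sided p-value of the combined z-score.
    p_combined = math.erfc(abs(z_combined) / math.sqrt(2.0))
    return z_combined, p_combined


# Scenario 1 (hypothetical numbers): O positive, R negative, both p = 0.01.
z_opposite, p_opposite = stouffer([signed_z(0.01, +1), signed_z(0.01, -1)])

# Scenario 2: O is p = 1e-30 (positive), R just misses threshold at p = 0.051.
z_same, p_same = stouffer([signed_z(1e-30, +1), signed_z(0.051, +1)])

print(f"opposite directions: z = {z_opposite:.2f}, p = {p_opposite:.3g}")
print(f"same direction:      z = {z_same:.2f}, p = {p_same:.3g}")
```

In the first scenario the two z-scores cancel exactly, so the pooled evidence sits right at the null. In the second, the "failed" replication barely dents the overwhelming evidence from O. Note that this treats both experiments as equally trustworthy, which is exactly what is in doubt when O was exploratory.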

So let’s assume that we are talking about the former scenario. Even then things aren’t as straightforward, especially if R isn’t as exact a replication of O as you might have liked. If there is any doubt (and usually there is) that something could have been different in R than in O, this could be one of the hidden factors people always like to talk about in these discussions. Now you hopefully know your data better than anyone. If experiment O was largely exploratory and you tried various things to see what works best (dare we say p-hacking again?), then the odds are probably quite good that a significant non-replication in the opposite direction shows that the effect was just a fluke. But this is not a natural law but a probabilistic one. You cannot ever know whether the original effect was real or not, especially not from such a limited data set of two non-independent experiments.

This is precisely why you should publish all results!

In my view, it is inherently dangerous if researchers decide for themselves which findings are important and which are not. It is not only a question of publishing only significant results. It applies much more broadly to the situation when a researcher publishes only results that support their pet theory but ignores or hides those that do not. I’d like to believe that most scientists don’t engage in this sort of behaviour – but sadly it is probably not uncommon. A way to counteract this is to train researchers to think of ways that test alternative hypotheses that make opposite predictions. However, such so-called “strong inference” is not always feasible. And even when it is, the two alternatives are not always equally interesting, which in turn means that people may still become emotionally attached to one hypothesis.

The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments. If you have defensible doubts that the data are actually rubbish (say, an fMRI data set littered with spikes, distortions, and excessive motion artifacts, or a social psychology study where you discovered posthoc that all the participants were illiterate and couldn’t read the questionnaires) then by all means throw them in the bin. But unless you have a good reason, you should never do this and instead add the results to the scientific record.

Now the suggestion during our debate was that such inconclusive findings clog up the record with unnecessary noise. There is an enormous and constantly growing scientific literature. As it is, it is becoming increasingly hard to separate the wheat from the chaff. I can barely keep up with the continuous feed of new publications in my field and I am missing a lot. Total information overload. So from that point of view the notion makes sense that only those studies that meet a certain threshold for being conclusive are accepted as part of the scientific record.

I can certainly relate to this fear. For the same reason I am sceptical of proposals that papers should be published before review and all decisions about the quality and interest of some piece of research, including the whole peer review process, should be entirely post-publication. Some people even seem to think that the line between scientific publication and science blog should be blurred beyond recognition. I don’t agree with this. I don’t think that rating systems like those used on Amazon or IMDb are an ideal way to evaluate scientific research. It doesn’t sound wise to me to assess scientific discoveries and medical breakthroughs in the same way we rank our entertainment and retail products. And that is not even talking about unleashing the horror of internet comment sections onto peer review…

Solving the (false) dilemma

I think this discussion is creating a false dichotomy. These are not mutually exclusive options. The solution to a low signal-to-noise ratio in the scientific literature is not to maintain publication bias of significant results. Rather the solution is to improve our filtering mechanisms. As I just described, I don’t think it will be sufficient to employ online shopping and social network procedures to rank the scientific literature. Even in the best-case scenario this is likely to highlight the results of authors who are socially dominant or popular and probably also those who are particularly unpopular or controversial. It does not necessarily imply that the highest quality research floats to the top [cue obvious joke about what kind of things float to the top…].

No, a high quality filter requires some organisation. I am convinced the scientific community can organise itself very well to create these mechanisms without too much outside influence. (I told you I’m Thoreau and Proudhon, not some insane Chaos Worshipper :P). We need some form of acceptance to the record. As I outlined previously, we should reorganise the entire publication process so that the whole peer-review process is transparent and public. It should be completely separate from journals. The journals’ only job should be to select interesting manuscripts and to publish short summary versions of them in order to communicate particularly exciting results to the broader community. But this peer-review should still involve a “pre-publication” stage – in the sense that the initial manuscript should not generate an enormous amount of undue interest before it has been properly vetted. To reiterate (because people always misunderstand that): the “vetting” should be completely public. Everyone should be able to see all the reviews, all the editorial decisions, and the whole evolution of the manuscript. If anyone has any particular insight to share about the study, by all means they should be free to do so. But there should be some editorial process. Someone should chase potential reviewers to ensure the process takes off at all.

The good news about all this is that it benefits you. Instead of weeping bitterly and considering quitting science because yet again you didn’t find the result you hypothesised, this just means that you get to publish more research. Taking the focus off novel, controversial, special, cool or otherwise “important” results should also help make the peer review more about the quality and meticulousness of the methods. Peer review should be about ensuring that the science is sound. In current practice it instead often resembles a battle with authors defending to the death their claims about the significance of their findings against the reviewers’ scepticism. Scepticism is important in science but this kind of scepticism is completely unnecessary when people are not incentivised to overstate the importance of their results.

Practice what you preach

I honestly haven’t followed all of the suggestions I make here. Neither have many other people who talk about improving science. I know of vocal proponents of preregistration who have yet to preregister any study of their own. The reasons for this are complex. Of course, you should “be the change you wish to see in the world” (I’m told Gandhi said this). But it’s not always that simple.

On the whole though I think I have published almost all of the research I’ve done. While I currently have a lot of unpublished results there is very little in the file drawer as most of these experiments have either been submitted or are being written up for eventual publication. There are two exceptions. One is a student project that produced somewhat inconclusive results although I would say it is a conceptual replication of a published study by others. The main reason we haven’t tried to publish this yet is that the student isn’t here anymore and hasn’t been in contact and the data aren’t that exciting to us to bother with the hassle of publication (and it is a hassle!).

The other data set is perhaps ironic because it is a perfect example of the scenario I described earlier. A few years ago when I started a new postdoc I was asked to replicate an experiment a previous lab member had done. For simplicity, let’s just call this colleague Dr Toffee. Again, they can identify themselves if they wish. The main reason for this was that reviewers had asked Dr Toffee to collect eye-movement data. So I replicated the original experiment but added eye-tracking. My replication wasn’t an exact one in the strictest terms because I decided to code the experimental protocol from scratch (this was a lot easier). I also had to use a different stimulus setup than the previous experiment as that wouldn’t have worked with the eye-tracker. Still, I did my best to match the conditions in all other ways.

My results showed a highly significant effect in the opposite direction to the original finding. We did all the necessary checks to ensure that this wasn’t just a coding error etc. It seemed to be real. Dr Toffee and I discussed what to do about it and we eventually decided that we wouldn’t bother to publish this set of experiments. The original experiment had been conducted several years before my replication. Dr Toffee had moved on with their life. I on the other hand had done this experiment as a courtesy because I was asked to. It was very peripheral to my own research interests. So, as in the other example, we both felt that going through the publication process would have been a fairly big hassle for very little gain.

Now this is bad. Perhaps there is some other poor researcher, a student perhaps, who will do a similar experiment again and waste a lot of time on testing the hypothesis that, at least according to our incoherent results, is unlikely to be true. And perhaps they will also not publish their failure to support this hypothesis. The circle of null results continues… :/

But you need to pick your battles. We are all just human beings and we do not have unlimited (research) energy. With both of these lacklustre or incoherent results I mentioned (and these are literally the only completed experiments we haven’t at least begun to write up), it seems like a daunting task to undergo the pain of submission->review->rejection->repeat that simply doesn’t seem worth it.

So what to do? Well, the solution is again what I described. The very reason the task of publishing these results isn’t worth our energy is everything that is wrong with the current publication process! In my dream world, in which I can simply write up a manuscript formatted in a way that pleases me and then upload this to the pre-print peer-review site, my life would be infinitely simpler. No more perusing dense journal websites for their guide to authors or hunting for the Zotero/Endnote/Whatever style to format the bibliography. No more submitting your files to one horribly designed, clunky journal website after another, checking the same stupid tick boxes, adding the same reviewer suggestions. No more rewriting your cover letters by changing the name of the journal. Certainly for my student’s project, it would not be hard to do as there is already a dissertation that can be used as a basis for the manuscript. Dr Toffee’s experiment and its contradictory replication might require a bit more work – but to be fair, even there a previous manuscript already exists. So all we’d need to add would be the modifications of the methods and the results of my replication. In a world where all you need to do is upload the manuscript and address some reviewers’ comments to ensure the quality of the science this should be fairly little effort. In turn it would ensure that the file drawer is empty and we are all much more productive.

This world isn’t here yet but there are journals that will allow something that isn’t too far off from that, namely F1000Research and PeerJ (and the Winnower also counts although the content there seems to be different and I don’t quite know how much review editing happens there). So, maybe I should email Dr Toffee now…

(* In case you didn’t get this from the previous 2700ish words: the answer to this question is unequivocally “No.”)