Category Archives: scientific publishing

On the value of unrecorded piloting

In my previous post, I talked about why I think all properly conducted research should be published. Null results are important. The larger scientific community needs to know whether or not a particular hypothesis has been tested before. Otherwise you may end up wasting somebody’s time because they repeatedly try in vain to answer the same question. What is worse, we may also propagate false positives through the scientific record because failed replications are often still not published. All of this contributes to poor replicability of scientific findings.

However, the emphasis here is on ‘properly conducted research’. I already discussed this briefly in my post but it also became the topic of an exchange between (for the most part) Brad Wyble, Daniël Lakens, and myself. In some fields, for example psychophysics, extensive piloting and “fine-tuning” of experiments is not only very common but probably also necessary. To me it doesn’t seem sensible to make the results of all of these attempts publicly available. This would inevitably flood the scientific record with garbage. Most likely nobody would look at it. Even if you are a master at documenting your work, nobody but you (and after a few months maybe not even you) will understand what is in your archive.

Most importantly, it can actually be extremely misleading for others who are less familiar with the experiment to see all of the tests you did to ensure that the task was actually doable, that monitors were at the correct distance from the participant, that your stereoscope was properly aligned, that the luminance of the stimuli was correct, that the masking procedure was effective, etc. Often you may only realise during piloting that the beautiful stimulus you designed after much theoretical deliberation doesn’t really work in practice. For example, you may inadvertently induce an illusory percept that alters how participants respond in the task. This in fact happened recently with an experiment a collaborator of mine piloted. And more often than not, after having tested a particular task on myself at great length, I then discover that it is far too difficult for anyone else (let’s talk about overtrained psychophysicists another time…).

Such pilot results are not very meaningful

It most certainly would not be justified to include them in a meta-analysis to quantify the effect – because they presumably don’t even measure the same effect (or at least not very reliably). A standardised effect size, like Cohen’s d, is a signal-to-noise ratio as it compares an effect (e.g. difference in group means) to the variability of the sample. The variability is inevitably larger if a lot of noisy, artifactual, and quite likely erroneous data are included. While some degree of this can be accounted for in meta-analysis by using a random-effects model, it simply doesn’t make sense to include bad data. We are not interested in the meta-effect, that is, the average result over all possible experimental designs we can dream up, no matter how inadequate.
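To see why, here is a minimal sketch in Python (all numbers are made up for illustration) of how folding noisy pilot data into a sample deflates a standardised effect size even though the underlying mean difference is unchanged:

```python
import numpy as np

# Hedged sketch (made-up numbers): Cohen's d compares a mean difference
# to the pooled standard deviation. Mixing in noisy, artifactual "pilot"
# data leaves the mean difference intact but inflates the variability,
# shrinking the standardised effect size.
def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
clean_a = rng.normal(1.0, 1.0, 50)   # group A, true mean difference = 1
clean_b = rng.normal(0.0, 1.0, 50)

# Same true means, but half the data now come from a sloppy pilot (SD = 5)
noisy_a = np.concatenate([clean_a, rng.normal(1.0, 5.0, 50)])
noisy_b = np.concatenate([clean_b, rng.normal(0.0, 5.0, 50)])

print(f"clean d = {cohens_d(clean_a, clean_b):.2f}, "
      f"with pilot noise d = {cohens_d(noisy_a, noisy_b):.2f}")
```

The “signal” (the mean difference of 1) is the same in both cases; only the denominator changes, which is exactly why averaging over inadequate designs tells us nothing about the biological effect.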

What we are actually interested in is some biological effect and we should ensure that we take as precise a measurement as possible. Once you have a procedure that you are confident will yield precise measurements, by all means, carry out a confirmatory experiment. Replicate it several times, especially if it’s not an obvious effect. Pre-register your design if you feel you should. Maximise statistical power by testing many subjects if necessary (although significance is often tested on a subject-by-subject basis, so massive sample sizes are really overkill as you can treat each participant as a replication – I’ll talk about replication in a future post so I’ll leave it at this for now). But before you do all this you usually have to fine-tune an experiment, at least if it tackles a novel problem.
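To illustrate the subject-by-subject logic from the aside above, here is a hedged Python sketch; the subject labels and trial counts are invented. Each participant’s accuracy is tested against chance individually, so each one serves as a replication in their own right:

```python
from scipy.stats import binomtest

# Illustrative example with made-up counts: each subject completed
# 200 two-alternative forced-choice (2AFC) trials; chance is 50%.
n_trials = 200
correct = {"S1": 132, "S2": 141, "S3": 128}  # hypothetical hit counts

for subject, k in correct.items():
    # One-sided exact binomial test against chance performance
    res = binomtest(k, n_trials, p=0.5, alternative="greater")
    print(subject, f"{k}/{n_trials} correct, p = {res.pvalue:.2g}")
```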

Isn’t this contributing to the problem?

Colleagues in social/personality psychology often seem to be puzzled and even concerned by this. The opacity of what has or hasn’t been tried is part of the problems that plague the field and lead to publication bias. There is now a whole industry meta-analysing results in the literature to quantify ‘excess significance’ or a ‘replication index’. This aims to reveal whether some additional results, especially null results, may have been suppressed or if p-hacking was employed. Don’t these pilot experiments count as suppressed studies or p-hacking?

No, at least not if this is done properly. The criteria you use to design your study must of course be orthogonal to and independent from your hypothesis. Publication bias, p-hacking, and other questionable practices are all actually sub-forms of circular reasoning: You must never use the results of your experiment to inform the design as you may end up chasing (overfitting) ghosts in your data. Of course, you must not run 2-3 subjects on an experiment, look at the results and say ‘The hypothesis wasn’t confirmed. Let’s tweak a parameter and start over.’ This would indeed be p-hacking (or rather ‘result hacking’ – there are usually no p-values at this stage).

A real example

I can mainly speak from my own experience but typically the criteria used to set up psychophysics experiments are sanity/quality checks. Look for example at the figure below, which shows a psychometric curve of one participant. The experiment was a 2AFC task using the method of constant stimuli: In each trial the participant made a perceptual judgement on two stimuli, one of which (the ‘test’) could vary physically while the other remained constant (the ‘reference’). The x-axis plots how different the two stimuli were, so 0 (the dashed grey vertical line) means they were identical. To the left or right of this line the correct choice would be the reference or test stimulus, respectively. The y-axis plots the percentage of trials the participant chose the test stimulus. By fitting a curve to these data we can extrapolate the ability of the participant to tell apart the stimuli – quantified by how steep the curve is – and also their bias, that is at what level of x the two stimuli appeared identical to them (dotted red vertical line):

As you can tell, this subject was quite proficient at discriminating the stimuli because the curve is rather steep. At many stimulus levels the performance is close to perfect (that is, either near 0 or 100%). There is a point where performance is at chance (dashed grey horizontal line). But once you move to the left or the right of this point performance becomes good very fast. The curve is however also shifted considerably to the right of zero, indicating that the participant indeed had a perceptual bias. We quantify this horizontal shift to infer the bias. This does not necessarily tell us the source of this bias (there is a lot of literature dealing with that question) but that’s beside the point – it clearly measures something reliably. Now look at this psychometric curve instead:
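For the curious, the mechanics of extracting the slope and the bias can be sketched in a few lines of Python. The logistic form, the parameter names, and the data below are my own illustrative assumptions, not the actual analysis behind these figures:

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(x, pse, slope):
    """Proportion of 'test' choices as a function of stimulus difference x.
    pse:   point of subjective equality (the horizontal shift, i.e. the bias)
    slope: steepness of the curve (discrimination ability)"""
    return 1.0 / (1.0 + np.exp(-slope * (x - pse)))

# Stimulus levels as in a method-of-constant-stimuli design (made up)
levels = np.linspace(-3.0, 3.0, 9)
# Noise-free choice proportions generated from assumed "true" parameters
p_test = psychometric(levels, 0.8, 2.0)

# Recover bias and slope by least-squares curve fitting
(pse_hat, slope_hat), _ = curve_fit(psychometric, levels, p_test, p0=[0.0, 1.0])
print(f"bias (PSE) = {pse_hat:.2f}, slope = {slope_hat:.2f}")
```

With real data the proportions are noisy binomial estimates, so one would normally use a maximum-likelihood fit (often with lapse-rate parameters) rather than plain least squares, but the logic of reading off a bias and a slope is the same.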


The general conventions here are the same but these results are from a completely different experiment that clearly had problems. This participant did not make correct choices very often as the curve only barely goes below the chance line – they chose the test stimulus far too often. There could be numerous reasons for this. Maybe they didn’t pay attention and simply made the same choice most of the time. For that, though, the trend is a bit too clean. Perhaps the task was too hard for them, maybe because the stimulus presentation was too brief. This is possible although it is very unlikely that a healthy, young adult with normal vision would not be able to tell apart the more extreme stimulus levels with high accuracy. Most likely, the participant did not really understand the task instructions or perhaps the stimuli created some unforeseen effect (like the illusion I mentioned before) that actually altered what percept they were judging. Whatever the reason, there is no correct way to extrapolate the psychometric parameters here. The horizontal shift and the slope are completely unusable. We see an implausibly poor discrimination performance and an extremely large perceptual bias. If their vision really worked this way, they should be severely impaired…

So these data are garbage. It makes no sense to meta-analyse biologically implausible parameter estimates. We have no idea what the participant was doing here and thus we can also have no idea what effect we are measuring. Now this particular example is actually from a participant a student ran as part of their project. If you did this pilot experiment on yourself (or a colleague) you might have worked out what the reason for the poor performance was.

What can we do about it?

In my view, it is entirely justified to exclude such data from our publicly shared data repositories. It would be a major hassle to document all these iterations. And what is worse, it would obfuscate the results for anyone looking at the archive. If I look at a data set and see a whole string of brief attempts from a handful of subjects (usually just the main author), I could be forgiven for thinking that something dubious is going on here. However, in most cases this would be unjustified and a complete waste of everybody’s time.

At the same time, however, I also believe in transparency. Unfortunately, some people do engage in result-hacking and iteratively enhance their findings by making the experimental design contingent on the results. In most such cases this is probably not done deliberately and with malicious intent – but that doesn’t make it any less questionable. All too often people like to fiddle with their experimental design while the actual data collection is already underway. In my experience this tendency is particularly severe among psychophysicists who moved into neuroimaging where this is a really terrible (and costly) idea.

How can we reconcile these issues? In my mind, the best way is perhaps to document briefly what you did to refine the experimental design. We honestly don’t need or want to see all the failed attempts at setting up an experiment but it could certainly be useful to have an account of how the design was chosen. What experimental parameters were varied? How and why were they chosen? How many pilot participants were there? This last point is particularly telling. When I pilot something, there usually is one subject: Sam. Possibly I will have also tested one or two others, usually lab members, to see if my familiarity with the design influences my results. Only if the design passes quality assurance, say by producing clear psychometric curves or by showing to-be-expected results in a sanity check (e.g., the expected response on catch trials), would I dare to actually subject “real” people to a novel design. Having some record, even if only as part of the documentation of your data set, is certainly a good idea though.

The number of participants and pilot experiments can also help you judge the quality of the design. Such “fine-tuning” and tweaking of parameters isn’t always necessary – in fact most designs we use are actually straight-up replications of previous ones (perhaps with an added condition). I would say, though, that in my field this is a very normal thing to do, at least when setting up a new design. However, I have also heard of extreme cases that I find fishy. (I will spare you the details and will refrain from naming anyone.) For example, in one study the experimenters ran over 100 pilot participants – tweaking the design all along the way – to identify those that showed a particular perceptual effect and then used literally a handful of these for an fMRI study that claims to have been about “normal” human brain function. Clearly, this isn’t alright. But this also cannot possibly count as piloting anymore. The way I see it, a pilot experiment can’t have an order of magnitude more data than the actual experiment…

How does this relate to the wider debate?

I don’t know how applicable these points are to social psychology research. I am not a social psychologist and my main knowledge about their experiments comes from reading particularly controversial studies or the discussions about them on social media. I guess that some of these issues do apply but that it is far less common. An equivalent situation to what I describe here would be that you redesign your questionnaire because people always score at maximum – and by ‘people’ I mean the lead author :P. I don’t think this is a realistic situation in social psychology, but it is exactly how psychophysical experiments work. Basically, what we do in piloting is what a chemist would do when they are calibrating their scales or cleaning their test tubes.

Or here’s another analogy using a famous controversial social psychology finding we discussed previously: Assume you want to test whether some stimulus makes people walk more slowly as they leave the lab. What I do in my pilot experiments is to ensure that the measurement I take of their walking speed is robust. This could involve measuring the walking time for a number of people before actually doing any experiment. It could also involve setting up sensors to automate this measurement (more automation is always good to remove human bias but of course this procedure needs to be tested too!). I assume – or I certainly hope so at least – that the authors of these social psychology studies did such pre-experiment testing that was not reported in their publications.

As I said before, humans are dirty test tubes. But you should ensure that you get them as clean as you can before you pour in your hypothesis. Perhaps a lot of this falls under methods we don’t report. I’m all for reducing this. Methods sections frequently lack necessary detail. But to some extent, I think some unreported methods and tests are unavoidable.

Humans apparently also glow with unnatural light

Is publication bias actually a good thing?*

Yesterday Neuroskeptic came to our Cognitive Drinks event in the Experimental Psychology department at UCL to talk about p-hacking. His entertaining talk (see Figure 1) was followed by a lively and fairly long debate about p-hacking and related questions about reproducibility, preregistration, and publication bias. During the course of this discussion a few interesting things came up. (I deliberately won’t name anyone as I think this complicates matters. People can comment and identify themselves if they feel that they should…)

Figure 1. Using this super-high-tech interactive fMRI results simulator Neuroskeptic clearly demonstrated a significant blob of activation in the pre-SMA (I think?) in stressed compared to relaxed people. This result made perfect sense.

It was suggested that a lot of the problems with science would be remedied effectively if only people were encouraged (or required?) to replicate their own findings before publication. Now that sounds generally like a good idea. I have previously suggested that this would work very well in combination with preregistration: you first do a (semi-)exploratory experiment to finalise the protocol, then submit a preregistration of your hypothesis and methods, and then do the whole thing again as a replication (or perhaps more than one if you want to test several boundary conditions or parameters). You then submit the final set of results for publication. Under the Registered Report format, your preregistered protocol would already undergo peer review. This would ensure that the final results are almost certain to be published provided you didn’t stray excessively from the preregistered design. So far, so good.

Should you publish unclear results?

Or is it? Someone suggested that it would be a problem if your self-replication didn’t show the same thing as the original experiment. What should one do in this case? Doesn’t publishing something incoherent like this, one significant finding and a failed replication, just add to the noise in the literature?

At first, this question simply baffled me, as I suspect it would many of the folks campaigning to improve science. (My evil twin sister called these people Crusaders for True Science but I’m not supposed to use derogatory terms like that anymore nor should I impersonate lady demons for that matter. Most people from both sides of this mudslinging contest “debate” never seemed to understand that I’m also a revolutionary – you might just say that I’m more Proudhon, Bakunin, or Henry David Thoreau rather than Marx, Lenin, or Che Guevara. But I digress…)

Surely, the attitude that unclear, incoherent findings, that is, those that are more likely to be null results, are not worth publishing must contribute to the prevailing publication bias in the scientific literature? Surely, this view is counterproductive to the aims of science to accumulate evidence and gradually get closer to some universal truths? We must know which hypotheses have been supported by experimental data and which haven’t. One of the most important lessons I learned from one of my long-term mentors was that all good experiments should be published regardless of what they show. This doesn’t mean you should publish every single pilot experiment you ever did that didn’t work. (We can talk about what that does and doesn’t mean another time. But you know how life is: sometimes you think you have some great idea only to realise that it makes no sense at all when you actually try it in practice. Or maybe that’s just me? :P). Even with completed experiments you probably shouldn’t bother publishing if you realise afterwards that it is all artifactual or the result of some error. Hopefully you don’t have a lot of data sets like that though. So provided you did an experiment of suitable quality I believe you should publish it rather than hiding it in the proverbial file drawer. All scientific knowledge should be part of the scientific record.

I naively assumed that this view was self-evident and shared by almost everyone – but this clearly is not the case. Yet instead of sneering at such alternative opinions I believe we should understand why people hold them. There are reasonable arguments why one might wish to not publish every unclear finding. The person making this suggestion at our discussion said that it is difficult to interpret a null result, especially an assumed null result like this. If your original experiment O showed a significant effect supporting your hypothesis, but your replication experiment R does not, you cannot naturally conclude that the effect really doesn’t exist. For one thing you need to be more specific than that. If O showed a significant positive effect but R showed a significant negative one, this would be more consistent with the null hypothesis than if O is highly significant (p < 10^-30) and R just barely misses the threshold (p = 0.051).

So let’s assume that we are talking about the former scenario. Even then things aren’t as straightforward, especially if R isn’t as exact a replication of O as you might have liked. If there is any doubt (and usually there is) that something could have been different in R than in O, this could be one of the hidden factors people always like to talk about in these discussions. Now you hopefully know your data better than anyone. If experiment O was largely exploratory and you tried various things to see what works best (dare we say p-hacking again?), then the odds are probably quite good that a significant non-replication in the opposite direction shows that the effect was just a fluke. But this is not a natural law but a probabilistic one. You cannot ever know whether the original effect was real or not, especially not from such a limited data set of two non-independent experiments.

This is precisely why you should publish all results!

In my view, it is inherently dangerous if researchers decide for themselves which findings are important and which are not. It is not only a question of publishing only significant results. It applies much more broadly to the situation when a researcher publishes only results that support their pet theory but ignores or hides those that do not. I’d like to believe that most scientists don’t engage in this sort of behaviour – but sadly it is probably not uncommon. A way to counteract this is to train researchers to think of ways that test alternative hypotheses that make opposite predictions. However, such so-called “strong inference” is not always feasible. And even when it is, the two alternatives are not always equally interesting, which in turn means that people may still become emotionally attached to one hypothesis.

The decision whether a result is meaningful should be left to posterity. You should publish all your properly conducted experiments. If you have defensible doubts that the data are actually rubbish (say, an fMRI data set littered with spikes, distortions, and excessive motion artifacts, or a social psychology study where you discovered posthoc that all the participants were illiterate and couldn’t read the questionnaires) then by all means throw them in the bin. But unless you have a good reason, you should never do this and instead add the results to the scientific record.

Now the suggestion during our debate was that such inconclusive findings clog up the record with unnecessary noise. There is an enormous and constantly growing scientific literature. As it is, it is becoming increasingly hard to separate the wheat from the chaff. I can barely keep up with the continuous feed of new publications in my field and I am missing a lot. Total information overload. So from that point of view the notion makes sense that only those studies that meet a certain threshold for being conclusive are accepted as part of the scientific record.

I can certainly relate to this fear. For the same reason I am sceptical of proposals that papers should be published before review and all decisions about the quality and interest of some piece of research, including the whole peer review process, should be entirely post-publication. Some people even seem to think that the line between scientific publication and science blog should be blurred beyond recognition. I don’t agree with this. I don’t think that rating systems like those used on Amazon or IMDb are an ideal way to evaluate scientific research. It doesn’t sound wise to me to assess scientific discoveries and medical breakthroughs in the same way we rank our entertainment and retail products. And that is not even talking about unleashing the horror of internet comment sections onto peer review…

Solving the (false) dilemma

I think this discussion is creating a false dichotomy. These are not mutually exclusive options. The solution to a low signal-to-noise ratio in the scientific literature is not to maintain publication bias of significant results. Rather the solution is to improve our filtering mechanisms. As I just described, I don’t think it will be sufficient to employ online shopping and social network procedures to rank the scientific literature. Even in the best-case scenario this is likely to highlight the results of authors who are socially dominant or popular and probably also those who are particularly unpopular or controversial. It does not necessarily imply that the highest quality research floats to the top [cue obvious joke about what kind of things float to the top…].

No, a high quality filter requires some organisation. I am convinced the scientific community can organise itself very well to create these mechanisms without too much outside influence. (I told you I’m Thoreau and Proudhon, not some insane Chaos Worshipper :P). We need some form of acceptance to the record. As I outlined previously, we should reorganise the entire publication process so that the whole peer-review process is transparent and public. It should be completely separate from journals. The journals’ only job should be to select interesting manuscripts and to publish short summary versions of them in order to communicate particularly exciting results to the broader community. But this peer-review should still involve a “pre-publication” stage – in the sense that the initial manuscript should not generate an enormous amount of undue interest before it has been properly vetted. To reiterate (because people always misunderstand that): the “vetting” should be completely public. Everyone should be able to see all the reviews, all the editorial decisions, and the whole evolution of the manuscript. If anyone has any particular insight to share about the study, by all means they should be free to do so. But there should be some editorial process. Someone should chase potential reviewers to ensure the process takes off at all.

The good news about all this is that it benefits you. Instead of weeping bitterly and considering quitting science because yet again you didn’t find the result you hypothesised, this just means that you get to publish more research. Taking the focus off novel, controversial, special, cool or otherwise “important” results should also help make peer review more about the quality and meticulousness of the methods. Peer review should be about ensuring that the science is sound. In current practice it instead often resembles a battle with authors defending to the death their claims about the significance of their findings against the reviewers’ scepticism. Scepticism is important in science but this kind of scepticism is completely unnecessary when people are not incentivised to overstate the importance of their results.

Practice what you preach

I honestly haven’t followed all of the suggestions I make here. Neither have many other people who talk about improving science. I know of vocal proponents of preregistration who have yet to preregister any study of their own. The reasons for this are complex. Of course, you should “be the change you wish to see in the world” (I’m told Gandhi said this). But it’s not always that simple.

On the whole though I think I have published almost all of the research I’ve done. While I currently have a lot of unpublished results, there is very little in the file drawer as most of these experiments have either been submitted or are being written up for eventual publication. There are two exceptions. One is a student project that produced somewhat inconclusive results, although I would say it is a conceptual replication of a published study by others. The main reason we haven’t tried to publish this yet is that the student isn’t here anymore and hasn’t been in contact, and the data aren’t exciting enough for us to bother with the hassle of publication (and it is a hassle!).

The other data set is perhaps ironic because it is a perfect example of the scenario I described earlier. A few years ago when I started a new postdoc I was asked to replicate an experiment a previous lab member had done. For simplicity, let’s just call this colleague Dr Toffee. Again, they can identify themselves if they wish. The main reason for this was that reviewers had asked Dr Toffee to collect eye-movement data. So I replicated the original experiment but added eye-tracking. My replication wasn’t an exact one in the strictest terms because I decided to code the experimental protocol from scratch (this was a lot easier). I also had to use a different stimulus setup than the previous experiment as that wouldn’t have worked with the eye-tracker. Still, I did my best to match the conditions in all other ways.

My results were a highly significant effect in the opposite direction than the original finding. We did all the necessary checks to ensure that this wasn’t just a coding error etc. It seemed to be real. Dr Toffee and I discussed what to do about it and we eventually decided that we wouldn’t bother to publish this set of experiments. The original experiment had been conducted several years before my replication. Dr Toffee had moved on with their life. I on the other hand had done this experiment as a courtesy because I was asked to. It was very peripheral to my own research interests. So, as in the other example, we both felt that going through the publication process would have been a fairly big hassle for very little gain.

Now this is bad. Perhaps there is some other poor researcher, a student perhaps, who will do a similar experiment again and waste a lot of time on testing the hypothesis that, at least according to our incoherent results, is unlikely to be true. And perhaps they will also not publish their failure to support this hypothesis. The circle of null results continues… :/

But you need to pick your battles. We are all just human beings and we do not have unlimited (research) energy. For both of these lacklustre or incoherent results I mentioned (and these are literally the only completed experiments we haven’t at least begun to write up), the pain of submission → review → rejection → repeat seems like a daunting task that simply isn’t worth it.

So what to do? Well, the solution is again what I described. The very reason the task of publishing these results isn’t worth our energy is everything that is wrong with the current publication process! In my dream world, in which I can simply write up a manuscript formatted in a way that pleases me and then upload it to the pre-print peer-review site, my life would be infinitely simpler. No more perusing dense journal websites for their guide to authors or hunting for the Zotero/Endnote/Whatever style to format the bibliography. No more submitting your files to one horribly designed, clunky journal website after another, checking the same stupid tick boxes, adding the same reviewer suggestions. No more rewriting your cover letters by changing the name of the journal. Certainly for my student’s project, it would not be hard to do as there is already a dissertation that can be used as a basis for the manuscript. Dr Toffee’s experiment and its contradictory replication might require a bit more work – but to be fair, even there a previous manuscript already exists. So all we’d need to add would be the modifications of the methods and the results of my replication. In a world where all you need to do is upload the manuscript and address some reviewers’ comments to ensure the quality of the science, this should be fairly little effort. In turn it would ensure that the file drawer is empty and we are all much more productive.

This world isn’t here yet but there are journals that will allow something that isn’t too far off from that, namely F1000Research and PeerJ (and the Winnower also counts although the content there seems to be different and I don’t quite know how much review editing happens there). So, maybe I should email Dr Toffee now…

(* In case you didn’t get this from the previous 2700ish words: the answer to this question is unequivocally “No.”)

Revolutionise the publication process

Enough of this political squabble and twitter war (twar?) and back to the “real” world of neuroneuroticism. Last year I sadly agreed (for various reasons) to act as corresponding author on one of our studies. I also have this statistics pet project that I want to try to publish as a single-author paper. Both of these experiences reminded me of something I have long known:

I seriously hate submitting manuscripts and the whole peer review and publication process.

The way publishing currently works, authors are encouraged to climb down the rungs of the impact factor ladder, starting at whatever journal they feel is sufficiently general interest and high impact to take their manuscript and then gradually working their way down through editorial and/or scientific rejections until it is eventually accepted by, as the rejection letters from high impact journals suggest, a “more specialised journal.” At each step you battle with an online submission system that competes for the least user-friendly webpage of the year award and repeat the same actions: uploading your manuscript files, suggesting reviewers, and checking that the PDF conversion worked. Before you can do any of this you of course need to format your manuscript into the style the journal expects, with the right kind of citations and the various sections in the correct place. You also modify the cover letter to the editors, in which you hype up the importance of the work rather than letting the research speak for itself, to adjust it to the particular journal you are submitting to. All of this takes precious time and has very little to do with research.

Because I absolutely loathe this sort of mindless work, I have long tried to outsource it to my postdocs and students as much as I can. I don’t need to be corresponding author on all my research. Of course, this doesn’t absolve me from being involved in the somewhat more important decisions, such as rewriting the manuscripts and drafting the cover letters. More importantly, while this may help my own peace of mind, it just makes somebody else suffer. It is not a real solution.

The truth is that this wasted time and effort would be far better spent doing science and ensuring that the study is of the best possible quality. I have long felt that the entire publication process should be remodelled so that these things are no longer a drain on researchers’ time and sanity. I am far from the first person to suggest a publication model like this. For instance, Micah Allen mentioned very similar ideas on his blog and, more recently, Dorothy Bishop made a passionate proposal to get rid of journals altogether. Both touched on many of the same points and partly inspired my own thoughts on this.

Centralised review platform

Some people think that all peer review could be post-publication. I don’t believe this is a good idea – depending on what you regard as publication. I think we need some sort of fundamental vetting procedure before a scientific study is indexed and regarded as part of the scientific record. I fear that without some expert scrutiny we will become swamped with poor quality outputs that make it impossible to separate the wheat from the chaff. Post-publication peer review alone is not enough to find the needles in the haystack. If there is so much material out there that most studies never get read, let alone reviewed or commented on, this isn’t going to work. By having some traditional review prior to “acceptance”, in which experts are invited to review the manuscript – and reminded to do so – we can at least ensure that every manuscript will be read by someone. That said, nobody is stopping you from turning blog posts into actual publications: Daniël Lakens has a whole series of blog posts that have turned into peer-reviewed publications.

A key feature of this pre-publication peer review, though, should be that it all takes place in a central place completely divorced from any of the traditional journals. Judging the scientific quality of a study requires expert reviewers and editors, but there should be no evaluation of the novelty or “impact” of the research. It should be all about the scientific details, to ensure that the work is robust. The manuscript should be as detailed as necessary to replicate the study (and the hypotheses and protocols can be pre-registered – a peer review system for pre-registered protocols would certainly fit within this model).

Ideally this review should involve access to the data and materials so that reviewers can try to reproduce the findings presented in the study. In practice, expert reviewers rarely reanalyse data even when they are available; most people do not have the time to get that deeply involved in a review. An interesting possible solution to this dilemma was suggested to me recently by Lee de-Wit: there could be reviewers whose primary role is to check the data and try to reproduce the analysed results from the documentation. These data reviewers would likely be junior researchers, perhaps PhD students and early postdocs. It would give them an opportunity to learn about reviewing and also to become known to editors. There is presently huge variability in the career stage at which researchers start reviewing manuscripts: some begin even as graduate students, while others seem to review only rarely after several years of postdoc experience. This idea could help close that gap.

Transparent reviewing

Another aspect that I consider essential is that reviews should be transparent. That is, all the review comments should be public and the various revisions of the manuscript should be accessible. Ideally, the platform would allow easy navigation between versions, so that it is straightforward to look only at the current/final product with the tracked changes hidden – but equally easy to blend the comments back in.

It remains a very controversial and polarising issue whether reviewers’ names should be public as well. I haven’t come to a final conclusion on that; there are certainly arguments for both sides. One reason many people dislike the idea of mandatory signed reviews is that it could put junior researchers at a disadvantage: it may discourage them from writing critical reviews of work by senior colleagues – the people who make hiring and funding decisions. Reviewer anonymity can protect against that, but it can also lead to biased, overly harsh, and sometimes outright nasty reviews. It also has the odd effect of creating a reviewer guessing game. People often display a surprising level of confidence in who they “know” their anonymous reviewers were – and I would bet they are often wrong. In fact, I know of at least one case where this sort of false belief resulted in years of animosity directed at the presumed reviewer and even their students. Publishing reviewer names would put an end to this sort of nonsense. It would also encourage people to be more polite. Editors at F1000Research (a journal with a completely transparent review process) told me that they frequently ask reviewers to check whether they are prepared to publish the review in the state they submitted it, given that it will be associated with their name – and many then decide to edit their comments to tone down the hostility.

However, I think even anonymous reviews could go a long way, provided that the comments are public. Since the content of the review is subject to public scrutiny, it is in the reviewer’s, and even more so the editor’s, interest to ensure reviews are fair and of suitable quality. Reviews of poor quality or with potential political motivation could easily be flagged up and result in public discussion. I believe it was Chris Chambers who recently suggested a compromise in which tenured scientists must sign their reviews while junior researchers, who still live at the mercy of senior colleagues, have the option to remain anonymous. This idea has merit, although even tenured researchers can suffer from political and personal biases, so I am not sure it really protects against those problems.

One argument sometimes made against anonymous reviews is that they prevent people from taking credit for their reviewing work. I don’t think this is true. Anonymous reviews would nevertheless be associated with a reviewer’s digital account, and ratings of review quality, reliability, and so on could easily be quantified that way. (In fact, this is precisely what websites like Publons are already doing.)

Novelty, impact, and traditional journals

So what happens next? Let’s assume a manuscript passes this initial peer review. It then enters the official scientific record and is indexed on PubMed and Google Scholar. Perhaps it could follow the example of F1000Research, where the title of the study itself indicates that it has been accepted/approved by peer review.

This is where it gets complicated. A lot of the ideas I discussed are already implemented to some extent by journals like F1000Research, PeerJ, or the Frontiers brand. The only aspect these implementations lack is a single, centralised platform for reviews. And although a single platform would be ideal to avoid confusion and splintering, even a handful of venues for scientific review could probably work.

However, what these systems currently do not provide is the role still played by the high-impact, traditional publishers: filtering the enormous volume of scientific work to select ground-breaking, timely, and important research findings. There is a lot of hostility towards this aspect of scientific publishing; it often seems completely arbitrary, obsessed with temporary fads and shallow buzzwords. I think the implicit or even explicit pressure many researchers feel to publish as much “high impact” work as possible to sustain their careers contributes to this. It isn’t entirely clear to me how much of this pressure is real and how much is illusion. Certainly, some grant applications still require you to list impact factors and citation numbers (which are directly linked to impact factors) to support your case.

Whatever you may think about this (and I personally agree that it has many negative effects and can be extremely annoying), the filtering and sorting done by high impact journals also has its benefits. The short-format publications, brief communications, and perspective articles in these journals make work much more accessible to wider audiences, and I think there is some point in highlighting new, creative, surprising, and/or controversial findings over incremental follow-up research. While published research should provide detailed methods and well-scrutinised results, there are different audiences. When I read about findings in astronomy or particle physics, or even many studies from the biological sciences outside my own area, I don’t typically read all the in-depth methods (nor would I understand them). An easily accessible article that appeals to a general scientific audience is certainly a nice way to communicate scientific findings. In the present system this is typically achieved by separating a general main text from supplementary/online sections that contain methods, additional results, and possibly even in-depth discussion.

This is where I think we should implement an explicit tier system. The initial research is published, after the scientific peer review discussed above, in the centralised repository of new manuscripts. These publications are written as traditional journal articles, complete with detailed methods and results; novelty and impact play no role up to this stage. Now, however, the more conventional publishers come into play. Authors may write cover letters competing for the attention of higher impact journals. Conversely, journal editors may contact the authors of particularly interesting studies to ask them to submit a short-form article to their journal. There are several mechanisms by which new publications might come to the attention of journal editors. They could simply generate a strong social media buzz and lots of views, downloads, and citations. This in fact seems to be the basis of the Frontiers tier system. I think this is far from optimal because it highlights not the scientifically most valuable studies but the most sensational ones – which can happen for all sorts of reasons, such as making extraordinary claims or having curse words in the title. It would be better to highlight research that attracts a lot of post-publication review and discussion – though of course this still poses the question of how to encourage that.

In either case, the decision as to what constitutes novel, general-interest research remains a matter of editorial discretion, making it easier for traditional journals to accept this change. How these articles are accepted is still up to each journal. Some may not require any further peer review and simply ask for a copy-edited summary article. Others may want additional peer review to keep the interpretation of these summaries in check. These high impact articles would likely be heavy on implications and wider interpretation, while the original scientific publication would have only brief discussion sections covering the basic interpretation and elaborating on the limitations. Some peer review may help keep the authors honest at this stage. Importantly, instead of endless online methods sections and (sometimes barely reviewed) supplementary materials, the full scientific detail of any study would be available within its original publication. The high impact short-form article would simply contain a link to that detailed publication.

One important aim this system would achieve is to ensure that research published as high impact typically meets high thresholds of scientific quality. Our current publishing model still incentivises shoddy research because it emphasises novelty and speed of publication over quality. In the new system, every study would first have to pass a quality threshold; novelty judgements would be entirely secondary to that.

How can we make this happen?

The biggest problem with all of these grand ideas we are kicking around is that it remains mostly unclear to me how we can actually effect this change. The notion that we can do away with traditional journals altogether sounds like a pipe dream to me, as it is diametrically opposed to the self-interest of traditional publishers and to our current funding structures. While some great upheavals have already happened in scientific publishing, such as the now widespread availability of open access papers, I feel that many of these changes occurred simply because traditional publishers realised they could make considerable profit from open access charges.

I do hope that eventually the journals publishing short-form, general-interest articles – filtering ground-breaking research from incremental, specialised work – will not be for-profit publishers. There are already a few examples of traditional journals that are more community driven, such as the Journal of Neuroscience, the Journal of Vision, and also eLife (not so much community-driven as driven by a research funder rather than a for-profit publishing house). I hope to see more of that in the future. Since many scientists seem to be idealists at heart, I think there is hope for that.

But in the meantime it seems necessary to work with traditional publishing houses rather than antagonising them. It shouldn’t be that difficult to convince some publishers that what now forms the supplementary materials and online methods in many high impact journals could be proper publications in their own right. Journals that already have a system like the one I envision, e.g. F1000Research or PeerJ, could perhaps negotiate such deals with traditional journals. This need not be mutually exclusive; it could simply apply to some of the articles published in these journals.

The main obstacle to do away with here is the, in my mind obsolete, notion that none of the results can have been published elsewhere. This is already no longer true in most cases anyway: much research will have appeared in conference proceedings prior to journal publication, and many traditional journals nowadays tolerate manuscripts uploaded to pre-print servers. The new aspect of the system I describe is that there would be an assurance that the pre-published work has been properly peer reviewed, guaranteeing a certain level of quality.

I know there are probably many issues to still resolve with these ideas and I would love to hear them. However, I think this vision is not a dream but a distinct possibility. Let’s make it come true.