All posts by Sam Schwarzkopf

About Sam Schwarzkopf

Neuroscientist studying perception at UCL

Would the TEF have passed peer review?

Today we have the first NeuroNeurotic guest post! The following short rant was written by my colleague Lee de Wit. In it, he talks about the recently published “Teaching Excellence Framework” (TEF), in which UK universities are ranked based on the quality of their undergraduate teaching… If you are also a psychologist and would like to write and sign a more formal letter/editorial to the Times Higher Education outlining these points, email l.de-wit@ucl.ac.uk

As psychologists, when we create a new way of measuring something complex (like a personality trait), we have to go to rigorous lengths to demonstrate that the measures we use are valid and reliable, and that they classify people meaningfully.

When it comes to measuring teaching in higher education, however, it seems we can just lower the standards. Apparently the TEF is meant to help students make meaningful choices, yet I can see no evidence that it is a valid measure, no evidence that it is reliable, and no evidence that it meaningfully clusters Universities.

Validity – One of the key measures used in the TEF is student satisfaction – yet we already know that satisfaction scores are not a valid measure of teaching quality. In fact, there are meta-analyses demonstrating that high student satisfaction scores don’t even correlate with learning outcomes.

Reliability – Apparently it is fine to have a panel of 27 people make some subjective judgements about the quantitative data in order to classify Universities. No need to have two panels rate them and then check that they come to similar judgements.

Clustering – In terms of the underlying distribution of the data, no need to seriously think about whether there are meaningful clusters or more continuous variability. Gold, Silver and Bronze – job done.

If you are an academic tweeting today about your University’s strong result, I would seriously call into question the excellence with which you can teach critical thinking to your students.

The one lesson I would take from this for UK Universities is that we are clearly failing to educate politicians and policy makers to think carefully about evidence-based policy. Presumably most of the key players in designing and implementing the TEF went to UK Universities. So I’m worried about what they learnt that made them think this was a good idea.

Marking myself annoyed

First of all, let me apologise for the very long delay since my last blog post. As you all know, the world is going through a lot of turmoil right now. I was also busy and travelling a lot, so I’ve had neither the time nor the energy to blog. But anyway, I’m back and have a number of posts in mind for the next few weeks.

Before I begin, let me say this: My heart goes out to the victims of the horrific terrorist attack at Westminster Bridge and the Houses of Parliament the other day. All those whose loved ones were injured or killed in this senseless act of violence are in my thoughts. I admire the efficiency and bravery of the emergency services and the bystanders who rushed to help. There is never an excuse to commit such vile crimes in the pursuit of some political goal. In the case of this brand of Islamic terrorism (if this is indeed confirmed to be the case), the actual political goal is also pretty obscure. Either way, it is a meaningless and evil act. We should stand united in the face of such evil. Don’t be cowed into giving up liberty and justice and never give in to hate and fear.

Having said this, let me get to the point. For several years now Facebook has had this feature where people “mark themselves safe” when a terror attack strikes. I presume it may also be used for natural disasters but if so I haven’t seen that yet. From the first time I saw it, during the terror attack in Paris, I found this feature rather tasteless and far from helpful.

Back then, many people criticised Facebook as the feature was heavily biased towards white, western countries. Around the same time as the Paris attacks there were several other attacks in Turkey and the Middle East. Nobody got to “mark themselves safe” during those attacks. And in certain parts of the world terror attacks are a weekly occurrence. So the outrage over Facebook starting this feature for attacks in Europe is understandable. But I think it is misplaced: Facebook has always rolled out new features in a geographically limited way and they typically start in the western world where they are based. There is also a related discussion to be had about in-groups and out-groups. And about our habituation to bad news: sad as it may be, even after this string of terror attacks in European cities they remain more newsworthy than those in Baghdad or Kabul where this seems to happen all the time. Since then, Facebook have expanded this feature to non-western countries. Whether this was because of people’s complaints or because they always planned it, I do not know. But either way, it is no longer limited to the West.

What annoys me about this Facebook feature is something else, however. To me it seems demeaning and callous. I don’t think the emotional engagement we should have with such events and the concern we should feel for our fellow human beings should be condensed down to a button press and notification on social media. Perhaps I’m just an old fart who doesn’t comprehend the way the modern world works. I certainly don’t really understand dating via Tinder and a lot of the social media communication millennials get up to these days (snapstagram or chat roulette or whatever they’re called). And don’t get me started on the excessive hash tagging.

But there is a big difference: most of those other things are trivial or affectations. I have no problem with people looking for casual sex or even seeking a life partner via modern social media if this is what works for them. I may not understand the excessive selfie craze and glammed up pictures some people post of themselves emulating the growing ranks of celebrities who are only famous for being famous. But I don’t have a problem with that. It’s up to each and every one how they want to spend their spare time and what they do in the pursuit of happiness. And of course I use social media too. I like using Facebook actually and use it often (some of my friends probably think too often, although they vastly overestimate how much time it actually takes from my life). Facebook is a great way to stay in touch with friends and family. I even got back in touch with some really old friends who I would not otherwise have any contact with now. So I don’t even feel that all of our social media contact is trivial. I have some very meaningful human contact that way and rekindled old friendships.

In contrast, this marking safe business seems deeply inappropriate to me. It trivialises the gravity of such situations. In my view, our emotional reaction to a situation like this should go beyond an emoji or clicking a “sad” button. You might say, to each their own. You don’t have to use this thing and can turn off notifications about this. But it’s not that simple. That’s not how social media work. The whole feature is designed around the idea that people mark themselves safe, thus spreading the word, and also ask their friends if they are safe. It creates a kind of peer pressure that coerces people into marking themselves “safe”, causing a chain reaction that makes the whole thing spiral out of control.

You might also say that it is a good and social thing to get in touch with your friends and loved ones. As I said, I use social media too. I am not Susan Greenfield, or any one of those people who think that staring into your phone or having social contact via the internet withers away our interhuman contact. Quite to the contrary in fact. I remember seeing this excellent cartoon about how smart phones are all about interhuman contact but sadly my google skills are too poor to find it. I most certainly disagree with this article – it is nonsense in so many ways.

But again, there is a difference: getting in touch with your loved ones is not the same as seeing a notification (or even a request) that they “mark themselves safe”. It seems so cold, so removed from humanity. Of course, you worry about your loved ones. The clue why is in the word. You see on the news that some tragedy occurred and you want to know your friends and family are all right. Well then, pick up that smart phone of yours and send them a message or give them a call! The best way to find out if they are okay and to let them know you care about them is to speak to them. Several friends and family members got in touch with me via phone or email or instant message asking if we were okay. And I certainly did the same. I have friends and family in Paris and in Berlin and I contacted them when the terror attacks there happened. On the day of the 7/7 bombings I contacted all of my London friends at the time. Even though I realise that the odds of any of them being caught up in these events are low, you also want them to know you think of them, find out how they feel, and give them some consolation and support. By all means, use social media for that purpose – it’s very good for that. But to me, reducing this down to one tap of your finger on the phone is sorely insufficient. It hardly says “I care” and in some ways it even seems to disrespect the victims and the plight of those people who actually grieve for their loved ones.

And then there is the practical side of this. The blunt nature of the algorithms behind this feature and the fact that people (quite rightly) don’t actually share all the details of their lives on Facebook cause some really stupid artifacts. Not only is Egham (home of Royal Holloway “University of London”) really, really safe, my department in actual London was also pretty safe from this terror attack (ironically enough, my department is right next to several of the sites of the 7/7 bombings, in particular the bus bombing at Tavistock Square). While I have walked across Westminster Bridge and past Parliament many times, believe me, it’s not where I spend most of my work days. And while of course it was possible that the terrorist didn’t act alone and other attacks might be happening (a common feature of IS and Al-Qaeda attacks), there were no reports of anything else happening at the time. But what if there had been other attacks? What if your friend marks themselves “safe” after the first one and then gets caught up in the second? Is there a way to “unmark” yourself again? And would that really be your first priority in that situation?

An even more bizarre artifact of Facebook’s indiscriminate scatter approach is of course that it wants us to check not only on people in Egham but also on those in galaxies far, far away. On the mark yourself safe page I saw several people who haven’t lived in London for years but are in the United States and other places thousands of miles away. Not everyone changes their personal details every time they move because that really isn’t always the most important thing in their lives. And of course, some people may have been in London at the time even though according to their “official Facebook records” they live somewhere else. They will fall through the cracks completely.

A much more severe side effect, however, is the distorted picture of reality this sort of thing produces. A tweet by Hanna Isotalus starts a thread elaborating on this problem. This whole business of marking yourself safe actually has the consequence of making everyone feel less safe than they are. While of course horrible and tragic for everyone who was involved, as I already said this attack was a pretty isolated event. By drawing this much attention to it, by frantically requesting everyone who has anything to do with London to mark themselves “safe”, we actually vastly exaggerate its effects. The same can probably also be said about the intense news coverage of such events.

The casualties of terrorism in the western world have clearly declined considerably over the past decades. Admittedly, there have been some spikes in recent years and most of those are related to jihadist terrorism. However, the actual reach of these attacks in Europe or the US is very small compared to the extent of fear-mongering and political agonising they cause. Also, not that it should matter, but a very large proportion of Islamist terror happens in predominantly Muslim countries and most certainly a large proportion of the victims are Muslims.

This stands in stark contrast to the number of people injured and killed all the time by car accidents or – in the US anyways – by guns. It stands in contrast to the risks we are subjected to every day. Nobody seems to think to mark themselves safe every time they take a car or cross a road as if they’d unlocked some achievement in a computer game. I have yet to see a notification on Facebook from one of my many daredevil colleagues telling me “I rode my bike to work and managed to survive for yet another day”.

So as Hanna points out, you are safe. Marking yourself safe doesn’t make you safe. Take a step back (but omit the deep breaths – in London that is actually dangerous). Think about what this really achieves. By all means, contact your loved ones to let them know you care. While statistically they are not at risk, there is one distinct difference between accidents and terrorism. An accident happens by misfortune or neglect. Crime and terrorism are deliberate acts of evil. Talking to your friends and family who happen to be close to such things shows your support. And of course, please pay your respects to the victims, console the ones close to them, and honour the heroes who saved people’s lives and those who will bring the perpetrators to justice.

But don’t buy into this callous scheme of “marking yourself safe”. You’re just playing into the terrorists’ hands. You spread the fear they want to cause and the hatred and divisions they want to incite, and you contribute to the continued erosion of our liberties and way of life. It strengthens the forces who want to undermine our freedom and respect for one another. All those far-right politicians may not know it but they are bedfellows of these Islamist murderers. Sorry for the cliché but it’s true: If we buy into this crap, the terrorists win.

Chris Chambers is a space alien

Imagine you are a radio astronomer and you suddenly stumble across a signal from outer space that appears to be evidence of an extra-terrestrial intelligence. Let’s also assume you already confidently ruled out any trivial artifactual explanation to do with naturally occurring phenomena or defective measurements. How could you confirm that this signal isn’t simply a random fluke?

This is actually the premise of the novel Contact by Carl Sagan, which happens to be one of my favorite books (I never watched the whole movie but only caught the end, which is nothing like the book, so I wouldn’t recommend it…). The solution to this problem proposed in the book is to quantify how likely the observed putative extraterrestrial signal would be under the assumption that it is the product of random background radiation.

This is basically what a p-value in frequentist null hypothesis significance testing represents. Using frequentist inference requires that you have a pre-specified hypothesis and a pre-specified design. You should have an effect size in mind, determine how many measurements you need to achieve a particular statistical power, and then you must carry out this experiment precisely as planned. This is rarely how real science works, and this mismatch is often put forth as one of the main arguments for why we should preregister our experimental designs. Any analysis that wasn’t planned a priori is by definition exploratory. The most extreme form of this argument posits that any experiment that hasn’t been preregistered is exploratory. While I still find it hard to agree with this extremist position, it is certainly true that analytical flexibility distorts the inferences we can make about an observation.

This proposed frequentist solution is therefore inappropriate for confirming our extraterrestrial signal. Because the researcher stumbled across the signal, the analysis is by definition exploratory. Moreover, you must also beware of the base-rate fallacy: even an event extremely unlikely under the null hypothesis is not necessarily evidence against the null hypothesis. Even if p = 0.00001, a true extraterrestrial signal may be even less likely a priori, say, on the order of 10^-100. Even if extra-terrestrial signals are quite common, given the small amount of space, time, and EM bands we have studied thus far, how probable is it that we would just stumble across a meaningful signal?
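To make that base-rate point concrete, here is a minimal sketch of the Bayesian arithmetic in Python. All the numbers are invented purely for illustration – the point is only that a tiny p-value under the null can coexist with a tiny posterior probability of a genuine signal if the prior is small enough.

```python
# Minimal Bayes-rule arithmetic for the base-rate fallacy. All numbers are
# made up for illustration and are not estimates of anything real.
p_data_given_noise = 1e-5   # "p-value": data this extreme under random background
p_data_given_alien = 0.5    # assumed probability of such data given a real transmitter
prior_alien = 1e-9          # assumed prior probability that this scan caught a real signal

posterior_alien = (p_data_given_alien * prior_alien) / (
    p_data_given_alien * prior_alien + p_data_given_noise * (1 - prior_alien)
)
print(f"P(alien | data) = {posterior_alien:.1e}")  # ~5e-5: still overwhelmingly likely to be noise
```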

None of that means that exploratory results aren’t important. I think you’d agree that finding credible evidence of an extra-terrestrial intelligence capable of sending radio transmissions would be a major discovery. The other day I met up with Rob McIntosh, one of the editors for Registered Reports at Cortex, to discuss the distinction between exploratory and confirmatory research. A lot of the criticism of preregistration focuses on whether it puts too much emphasis on hypothesis-driven research and whether it in turn devalues or marginalizes exploratory studies. I have spent a lot of time thinking about this issue and (encouraged by discussions with many proponents of preregistration) I have come to the conclusion that the opposite is true: by emphasizing which parts of your research are confirmatory, I believe exploration is actually valued more. The way scientific publishing conventionally works, many studies are written up in a way that pretends they were hypothesis-driven when in truth they weren’t. Probably for a lot of published research the truth lies somewhere in the middle.

So preregistration just keeps you honest with yourself and if anything it allows you to be more honest about how you explored the data. Nobody is saying that you can’t explore, and in fact I would argue you should always include some exploration. Whether it is an initial exploratory experiment that you did that you then replicate or test further in a registered experiment, or whether it is a posthoc robustness test you do to ensure that your registered result isn’t just an unforeseen artifact, some exploration is almost always necessary. “If we knew what we were doing, it would not be called research, would it?” (a quote by Albert Einstein, apparently).

One idea I discussed with Rob is whether there should be a publication format that specifically caters to exploration (Chris Chambers has also mentioned this idea previously). Such Exploratory Reports would allow researchers to publish interesting and surprising findings without first registering a hypothesis. You may think this sounds a lot like what many present-day high-impact papers already are. The key difference is that these Exploratory Reports would contain no inferential statistics and, critically, they would be explicit about the fact that the research is exploratory – something that is rarely the case in conventional studies. However, this idea poses a critical challenge: on the one hand you want to ensure that the results presented in such a format are trustworthy. But how do you ensure this without inferential statistics?

Proponents of the New Statistics (which aren’t actually “new” and it is also questionable whether you should call them “statistics”) will tell you that you could just report the means/medians and confidence intervals, or perhaps the whole distributions of data. But that isn’t really helping. Inspecting confidence intervals and how far they are from zero (or another value of no interest) is effectively the same thing as a significance test. Even merely showing the distribution of observations isn’t really helping. If a result is so blatantly obvious that it convinces you by visual inspection (the “inter-ocular trauma test”), then formal statistical testing would be unnecessary anyway. If the results are even just a little subtler, it can be very difficult to decide whether the finding is interesting. So the way I see it, we either need a way to estimate statistical evidence, or you need to follow up the finding with a registered, confirmatory experiment that specifically seeks to replicate and/or further test the original exploratory finding.
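To illustrate why reading a confidence interval against a null value is just a significance test in different clothes, here is a minimal Python sketch for the one-sample case (simulated data; the equivalence holds for this standard kind of test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=30)      # one simulated sample

t, p = stats.ttest_1samp(x, 0.0)                 # two-sided test against zero
ci = stats.t.interval(0.95, df=len(x) - 1,
                      loc=np.mean(x), scale=stats.sem(x))

# The 95% CI excludes zero exactly when p < .05.
print(f"p = {p:.4f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
print("CI excludes zero:", ci[0] > 0 or ci[1] < 0, "| p < .05:", p < 0.05)
```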

In the case of our extra-terrestrial signal you may plan a new measurement. You know the location in the sky where the signal came from, so part of your preregistered methods is to point your radio telescope at the same point. You also have an idea of the signal strength, which allows you to determine the number of measurements needed to have adequate statistical power. Then you carry out this experiment, sticking meticulously to your planned recipe. Finally, you report your result and the associated p-value.

Sounds good in theory. In practice, however, this is not how science typically works. Maybe the signal isn’t continuous. There could be all sorts of reasons why the signal may only be intermittent, be it some interstellar dust clouds blocking the line of transmission, the transmitter pointing away from Earth due to the rotation of the aliens’ home planet, or even simply the fact that the aliens are operating their transmitter on a random schedule. We know nothing about what an alien species, let alone their civilization, may be like. Who is to say that they don’t just fall into random sleeping periods in irregular intervals?

So some exploratory, flexible analysis is almost always necessary. If you are too rigid in your approach, you are very likely to miss important discoveries. At the same time, you must be careful not to fool yourself. If we are really going down the route of Exploratory Reports without any statistical inference we need to come up with a good way to ensure that such exploratory findings aren’t mostly garbage. I think in the long run the only way to do so is to replicate and test results in confirmatory studies. But this could already be done as part of a Registered Report in which your design is preregistered. Experiment 1 would be exploratory without any statistical inference but simply reporting the basic pattern of results. Experiment 2 would then be preregistered and replicate or test the finding further.

However, Registered Reports can take a long time to publish. This may in fact be one of the weak points of this format that may stop the scientific community from becoming more enthusiastic about it. As long as there is no real incentive for doing slow science, the idea that you may take two or three years to publish one study is not going to appeal to many people. It will stop early career researchers from getting jobs and research funding. It also puts small labs in poorer universities at a considerable disadvantage compared to researchers with big grants, big data, and legions of research assistants.

The whole point of Exploratory Reports would be to quickly push out interesting observations. In some ways, this is then exactly what brief communications in high impact journals are currently for. I don’t think it will serve us well to replace snappy (and likely untrue) high impact findings backed by inappropriate statistical inferences with snappy (and likely untrue) exploratory findings without statistical inference. If the purpose of Exploratory Reports is solely to provide an outlet for quick publication of interesting results, we still have the same kind of skewed incentive structure as now. Also, while removing statistical inference from our exploratory findings may be better statistical practice, I am not convinced that it is better scientific practice unless we have other ways of ensuring that these exploratory results are kosher.

The way I see it, the only way around this dilemma is to finally stop treating publications as individual units. Science is by nature a lengthy, incremental process. Yes, we need exciting discoveries to drive science forward. At the same time, the replicability and robustness of our discoveries is critical. In order to combine these two needs I believe research findings should not be seen as separate morsels but as a web of interconnected results. A single Exploratory Report (or even a bunch of them) could serve as the starting point. But unless they are followed up by Registered Reports replicating or scrutinizing these findings further, they are not all that meaningful. Only once replications and follow-up experiments have been performed does the whole body of a finding take shape. A search on PubMed or Google Scholar would not merely spit out the original paper but a whole tree of linked experiments.

The perceived impact and value of a finding would thus be related to how much of an interconnected body of evidence it has generated rather than whether it was published in Nature or Science. Critically, this would allow people to quickly publish their exciting finding and thus avoid being deadlocked by endless review processes and disadvantaged compared to other people who can afford to do more open science. At the same time, they would be incentivized to conduct follow-up studies. Because the whole body of related literature is linked, it would also give others an incentive to conduct replications or follow-up experiments on your exploratory finding.

There are obviously logistic and technical challenges with this idea. The current publication infrastructure still does not really allow for this to work. This is not a major problem however. It seems entirely feasible to implement such a system. The bigger challenge is how to convince the broader community and publishers and funders to take this on board.

[Image: the Arecibo message]

Strolling through the Garden of Forking Paths

The other day I got into another Twitter argument – for which I owe Richard Morey another drink – about preregistration of experimental designs before data collection. Now, as you may know, I have in the past had long debates with proponents of preregistration. Not really because I was against it per se but because I am a natural skeptic. It is still far too early to tell if the evidence supports the claim that preregistration improves the replicability and validity of published research. I also have an innate tendency to view any revolutionary proposals with suspicion. However, these long discussions have eased my worries and led me to revise my views on this issue. As Russ Poldrack put it nicely, preregistration no longer makes me nervous. I believe the theoretical case for preregistration is compelling. While solid empirical evidence for the positive and negative consequences of preregistration will only emerge over the course of the coming decades, this is not actually all that important. I seriously doubt that preregistration actually hurts scientific progress. At worst it has not much of an effect at all – but I am fairly confident that it will prove to be a positive development.

Curiously, largely due to the heroic efforts by one Christopher Chambers, a Sith Lord at my alma mater Cardiff University, I am now strongly in favor of the more radical form of preregistration, registered reports (RRs), where the hypothesis and design are first subject to peer review, data collection only commences when the design has been accepted, and eventual publication is guaranteed if the registered plan was followed. In departmental discussions, a colleague of mine repeatedly voiced his doubts that RRs could ever become mainstream, because they are such a major effort. It is obvious that RRs are not ideal for all kinds of research and to my knowledge nobody claims otherwise. RRs are a lot of work that I wouldn’t invest in something like a short student project, in particular a psychophysics experiment. But I do think they should become the standard operating procedure for many larger, more expensive projects. We already have project presentations at our imaging facility where we discuss new projects and make suggestions on the proposed design. RRs are simply a way to take this concept into the 21st century and the age of transparent research. They can also improve the detail and quality of the feedback: most people at our project presentations will not be experts on the proposed research while peer reviewers at least are supposed to be. And, perhaps most important, RRs ensure that someone actually compares the proposed design to what was carried out eventually.

When RRs are infeasible or impractical, there is always the option of using light preregistration, in which you only state your hypothesis and experimental plans and upload this to OSF or a similar repository. I have done so twice now (although one is still in the draft stage and therefore not yet public). I would strongly encourage people to at least give that a try. If a detailed preregistration document is too much effort (it can be a lot of work although it should save you work when writing up your methods later on), there is even the option for very basic registration. The best format invariably depends on your particular research question. Such basic preregistrations can add transparency to the distinction between exploratory and confirmatory results because you have a public record of your prior predictions. Primarily, I think they are extremely useful to you, the researcher, as it allows you to check how directly you navigated the Garden of Forking Paths. Nobody stops you from taking a turn here or there. Maybe this is my OCD speaking, but I think you should always peek down some of the paths at least, simply as a robustness check. But the preregistration makes it less likely that you fool yourself. It is surprisingly easy to start believing that you took a straight path and forget about all the dead ends along the way.

This for me is really the main point of preregistration and RRs. I think a lot of the early discussion of this concept, and a lot of the opposition to it, stems from the implicit or even explicit accusation that nobody can be trusted. I can totally understand why this fails to win the hearts and minds of many people. However, it’s also clear that questionable research practices and deliberate p-hacking have been rampant. Moreover, unconscious p-hacking due to analytical flexibility almost certainly affects many findings. There are a lot of variables here and so I’d wager that most of the scientific literature is actually only mildly skewed by that. But that is not the point. Rather, as scientists – especially ones who study cognitive and mental processes of all things – shouldn’t we want to minimize our own cognitive biases and the human errors that could lead us astray? Instead of the rather negative “data police” narrative that you often hear, this is exactly what preregistration is about. And so I think first and foremost a basic preregistration is only for yourself.

When I say such a basic preregistration is for yourself, this does not necessarily mean it cannot also be interesting to others. But I do believe its usefulness to other people is limited and should not be overstated. As with many of the changes brought on by open science, we must remain skeptical of any unproven claims of their benefits and keep in mind potential dangers. The way I see it, most (all?) public proponents of either form of preregistration are fully aware of this. I think the danger really concerns the wider community. I occasionally see anonymous or sock-puppet accounts popping up in online comment sections espousing a very radical view that only preregistered research can be trusted. Here is why this disturbs me:

1. “I’ll just get some fresh air in the garden …”

Preregistered methods can only be as good as the detail they provide. A preregistration can be so vague that you cannot make heads or tails of it. The basic OSF-style registrations (e.g. the AsPredicted format) may be particularly prone to this problem, but it could even be the case when you have written a long design document. In essence, this is just saying you’ll take a stroll in the hedge maze without giving any indication whatsoever which paths you will take.

2. “I don’t care if the exit is right there!”

Preregistration doesn’t mean that your predictions make any sense or that there isn’t a better way to answer the research question. Often such things will only be revealed once the experiment is under way or completed, and I’d actually hazard a guess that this is usually the case. Part of the beauty of preregistration is that it demonstrates to everyone (including yourself!) how many things you probably didn’t think of before starting the study. But it should never be used as an excuse not to try something unregistered when there are good scientific reasons to do so. This would be the equivalent of taking one predetermined path through the maze and then getting stuck in a dead end – in plain sight of the exit.

3. “Since I didn’t watch you, you must have chosen forking paths!”

Just because someone didn’t preregister their experiment does not mean their experiment was not confirmatory. Exploratory research is actually undervalued in the current system. A lot of research is written up as if it were confirmatory even if it wasn’t. Ironically, critics of preregistration often suggest that it devalues exploratory research but it actually places greater value on it because you are no longer incentivized to hide it. But nevertheless, confirmatory research does happen even without preregistration. It doesn’t become any less confirmatory because the authors didn’t tell you about it. I’m all in favor of constructive skepticism. If a result seems so surprising or implausible that you find it hard to swallow, by all means scrutinize it closely and/or carry out an (ideally preregistered) attempt to replicate it. But astoundingly, even people who don’t believe in open science sometimes do good science. When a tree falls in the garden and nobody is there to hear it, it still makes a sound.

Late September when the forks are in bloom

Obviously, RRs are not completely immune to these problems either. Present day peer review frequently fails to spot even glaring errors, so it is inevitable that it will also make mistakes in the RR situation. Moreover, there are additional problems with RRs, such as the fact that they require an observant and dedicated editor. This may not be so problematic while RR editors are strong proponents of RRs but if this concept becomes more widespread this will not always be the case. It remains to be seen how that works out. However, I think on the whole the RR concept is a reasonably good guarantee that hypotheses and designs are scrutinized, and that results are published, independent of the final outcome. The way I see it, both of these are fundamental improvements over the way we have been doing science so far.

But I’d definitely be very careful not to over-interpret the fact that a study is preregistered, especially when it isn’t a RR. Those badges they put on Psych Science articles may be a good incentive for people to embrace open science practices but I’m very skeptical of anyone who implies that just because a study was preregistered, or because it shares data and materials, it is somehow more trustworthy. Because it simply isn’t. It lulls you into a false sense of security and I thought the intention here was not to fool ourselves so much any more. A recent case of data being manipulated after it was uploaded demonstrates how misleading an open data badge can be. In the same vein, just because an experiment is preregistered does not mean the authors didn’t lead us (and themselves) down the garden path. There have also been cases of preregistered studies that then did not actually report the outcomes of their intended analyses.

So, preregistration only means that you can read what the authors said they would do and then check for yourself how this compares to what they did do. That’s great because it’s transparent. But unless you actually do this check, you should treat the findings with the same skepticism (and the authors with the same courtesy and respect) as you would those of any other, non-registered study.

[Image: a hedge maze]
Sometimes it is really not that hard to find your way through the garden…

The Day the Palm hit the Face

[Image: triple facepalm]


Scientists are human beings. I get it. I really do because – all contrary reports and demonic possessions aside – I’m a human being, too. So I have all manner of sympathy for people’s hurt feelings. It can hurt when somebody criticizes you. It may also be true that the tone of criticism isn’t always as it should be to avoid hurt.

In this post, I want to discuss ways to answer scientific criticism. I haven’t always followed this advice myself because, as I said, I’m human. But I am at least trying. The post was sparked by an as-yet unpublished editorial by a certain ex-president of the APS. I don’t want to discuss specifically the rather inflammatory statements in that article as doing so will serve no good. Since it isn’t officially published, it may still change anyway. And the last time I blogged about an unpublished editorial I received a cease and desist letter forcing me to embargo my post for two full hours…

I believe most people would agree that science is an endeavor of truth seeking. It attempts to identify regularities in our chaotic observations of the world that can help us understand the underlying laws that govern it. So when multiple people are unable to replicate a previous claim, this casts doubt on the claim’s validity as a regularity of nature.

The currency of science should be evidence. Without any evidence, a claim is worthless. So if someone says “I don’t think this effect is real” but offers no evidence for that statement, be it a failed replication or a reanalysis of the same data showing the conclusions are erroneous, then you have every right to ignore them. But if they do offer evidence, this cannot be ignored. It is simply not good enough to talk about “hidden moderators” or complain about the replicators’ incompetence. Without evidence, these statements are hollow.

Whether you agree with it in principle or not, preregistration of experimental designs has become something of a standard in replication studies (and is becoming increasingly common in general). So when faced with a replication failure and the fact that people of that ilk are evidently worried about analytical flexibility and publication bias, surely it shouldn’t be very surprising that they won’t just be convinced by your rants about untested moderators or Google searches of ancient conceptual replications, let alone by your accusations of “shameless bullying” or “methodological terrorism”. Instead, what might possibly convince them is a preregistered and vetted replication attempt in which you do right all of the things that these incompetent buffoons did wrong. This proposal has already been outlined very well recently by Brent W Roberts. Speaking more generally, it is the ground-breaking, revolutionary concept that scientific debates should be fought using equivalent evidence instead of childish playground tactics and special pleading.

Granted, some might not be convinced even by that. And that’s fine, too. Skepticism is part of science. Also, some people are not convinced by any evidence you show them. It is actually not your job as a scientist to convince all your critics. It is your job to test theories and evaluate the evidence dispassionately. If your evidence is solid, the scientific community will come around eventually. If your evidence consists only of shouting about hidden moderators and nightmare stories of people fearing tenure committees because someone failed to replicate your finding, then I doubt it will pass the test of time.

And maybe, just maybe, science is also about changing your mind when you realize that the evidence simply doesn’t support your previous thinking. I don’t think any single failed replication is enough to do that but a set of failed replications should certainly at least push you in that direction. As far as I can see, nobody who ever published a replication failure has even suggested that people should be refused tenure or lose their research program or whatever. I can’t speak for others, but if someone applied for a job with me and openly discussed the fact that a result of theirs failed to replicate and/or that they had to revise their theories, this would work strongly in their favor compared to the candidate with overbrimming confidence who only published Impact Factor > 30 papers, none of which have been challenged. And, in a story I think I have told before, one of my scientific heroes was a man who admitted, without me bringing it up, that the results of his Science paper had been disproven.

Seriously, people, get a grip. I am sympathetic to the idea that criticism hurts, that we should perhaps be more mindful of just how snarky and frivolous we are with our criticism, and that there is a level of glee associated with how replication failures are publicized. But there is also a lot of glee with which high impact papers are being publicized and presented in TED talks. If you want the former to stop, perhaps we should also finally put an end to the bullshitting altogether. Anyway, I will conclude with a quote by another of my heroes and let my unbridled optimism flow, in spite of it all:

In science it often happens that scientists say, ‘You know that’s a really good argument; my position is mistaken,’ and then they would actually change their minds and you never hear that old view from them again. They really do it. It doesn’t happen as often as it should, because scientists are human and change is sometimes painful. But it happens every day. I cannot recall the last time something like that happened in politics or religion.
– Carl Sagan

Boosting power with better experiments

Probably one of the main reasons for the low replicability of scientific studies is that many previous studies have been underpowered – or rather that they only provided inconclusive evidence for or against the hypotheses they sought to test. Alex Etz had a great blog post on this with regard to replicability in psychology (and he published a paper extending this analysis to take publication bias into account). So it is certainly true that, as a whole, researchers in psychology and neuroscience can do a lot better when it comes to the sensitivity of their experiments.

A common mantra is that we need larger sample sizes to boost sensitivity. For a given significance criterion, statistical power is a function of the sample size and the expected effect size. There is a lot of talk out there about what effect size one should use for power calculations. For instance, when planning a replication study, it has been suggested that you should more than double the sample size of the original study. This is supposed to take into account the fact that published effect sizes are probably skewed upwards due to publication bias and analytical flexibility, or even simply because the true effect happens to be weaker than originally reported.

However, what all these recommendations neglect to consider is that standardized effect sizes, like Cohen’s d or a correlation coefficient, are also dependent on the precision of your observations. By reducing measurement error or other noise factors, you can literally increase the effect size. A larger effect size means greater statistical power – so with the same sample size you can boost power by improving your experiment in other ways.

Here is a practical example. Imagine I want to correlate the height of individuals measured in centimeters and inches. This is a trivial case – theoretically the correlation should be perfect, that is, ρ = 1. However, measurement error will spoil this potential correlation somewhat. I have a sample size of 100 people. I first ask my auntie Angie to guess the height of each subject in centimeters. To determine their heights in inches, I then take them all down the pub and ask this dude called Nigel to also take a guess. Both Angie and Nigel will misestimate heights to some degree. For simplicity, let’s just say that their errors are on average the same. This nonetheless means their guesses will not always agree very well. If I then calculate the correlation between their guesses, it will obviously have to be lower than 1, even though the true correlation is 1. I simulated this scenario below. On the x-axis I plot the amount of measurement error in cm (the standard deviation of Gaussian noise added to the actual body heights). On the y-axis I plot the median observed correlation and the shaded area is the 95% confidence interval over 10,000 simulations. As you can see, as measurement error increases, the observed correlation goes down and the confidence interval becomes wider.

[Figure: median observed correlation (with 95% confidence interval) as a function of measurement error]
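For anyone who wants to play with this, here is a minimal Python sketch that re-creates the simulation. It is not the code used for the figure above, and the spread of true heights (15 cm here) is an assumption chosen because it roughly reproduces the numbers quoted below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_sims = 100, 10000   # reduce n_sims for a quicker run
true_sd = 15                      # assumed spread of true heights in cm

for error_sd in (1, 5, 10, 15, 20):               # measurement error in cm
    rs = np.empty(n_sims)
    for i in range(n_sims):
        heights = rng.normal(170, true_sd, n_subjects)
        angie = heights + rng.normal(0, error_sd, n_subjects)  # guesses "in cm"
        nigel = heights + rng.normal(0, error_sd, n_subjects)  # guesses "in inches" (same error)
        rs[i] = np.corrcoef(angie, nigel)[0, 1]
    lo, hi = np.percentile(rs, [2.5, 97.5])
    print(f"error {error_sd:2d} cm: median r = {np.median(rs):.2f} [{lo:.2f}, {hi:.2f}]")
```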

Greater error leads to poorer correlations. So far, so obvious. But while I call this the observed correlation, it really is the maximally observable correlation. This means that in order to boost power, the first thing you could do is to reduce measurement error. In contrast, increasing your sample size can be highly inefficient and border on the infeasible.

For a correlation of 0.35, hardly an unrealistically low effect in a biological or psychological scenario, you would need a sample size of 62 to achieve 80% power. Let’s assume this is the correlation found by a previous study and we want to replicate it. Following common recommendations you would plan to collect two-and-a-half times the sample size, so n = 155. Doing so may prove quite a challenge. Assume that each data point involves hours of data collection per participant and/or that it costs hundreds of dollars to acquire the data (neither is atypical in neuroimaging experiments). This may be a considerable additional expense few researchers are able to afford.

And it gets worse. It is quite possible that by collecting more data you further sacrifice data quality. When it comes to neuroimaging data, I have heard from more than one source that some of the large-scale imaging projects contain only mediocre data contaminated by motion and shimming artifacts. The often mentioned suggestion that sample sizes for expensive experiments could be increased by multi-site collaborations ignores that this quite likely introduces additional variability due to differences between sites. The data quality even from the same equipment may differ. The research staff at the two sites may not have the same level of skill or meticulous attention to detail. Behavioral measurements acquired online via a website may be more variable than under controlled lab conditions. So you may end up polluting your effect size even further by increasing sample size.

The alternative is to improve your measurements. In my example here, even going from a measurement error of 20 cm to 15 cm improves the observable effect size quite dramatically, moving from 0.35 to about 0.5. To achieve 80% power, you would only need a sample size of 29. If you kept the original sample size of 62, your power would be 99%. So the critical question is not really what the original effect size was that you want to replicate – rather it is how much you can improve your experiment by reducing noise. If your measurements are already pretty precise to begin with, then there is probably little room for improvement and you also don’t win all that much, as when going from a measurement error of 5 cm to 1 cm in my example. But when the original measurement was noisy, improving the experiment can help a hell of a lot.
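For the record, the sample-size arithmetic behind these numbers can be sketched with the standard Fisher z approximation for a two-sided correlation test (other power routines may differ by a subject or two):

```python
import numpy as np
from scipy import stats

def n_for_power(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect correlation r (two-sided test)."""
    z_a, z_b = stats.norm.ppf(1 - alpha / 2), stats.norm.ppf(power)
    return ((z_a + z_b) / np.arctanh(r)) ** 2 + 3

def power_at_n(r, n, alpha=0.05):
    """Approximate power for detecting correlation r with n subjects."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(np.arctanh(r) * np.sqrt(n - 3) - z_a)

print(round(n_for_power(0.35)))        # ~62 subjects for r = .35 at 80% power
print(round(n_for_power(0.50)))        # ~29 subjects for r = .50 at 80% power
print(round(power_at_n(0.50, 62), 2))  # ~0.99 power for r = .50 with n = 62
```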

There are many ways to make your measurements more reliable. It can mean ensuring that your subjects in the MRI scanner are padded in really well, that they are not prone to large head movements, that you did all in your power to maintain a constant viewing distance for each participant, and that they don’t fall asleep halfway through your experiment. It could mean scanning 10 subjects twice, instead of scanning 20 subjects once. It may be that you measure the speed that participants walk down the hall to the lift with laser sensors instead of having a confederate sit there with a stopwatch. Perhaps you can change from a group comparison to a within-subject design? If your measure is an average across trials collected in each subject, you can enhance the effect size by increasing the number of trials. And it definitely means not giving a damn what Nigel from down the pub says and investing in a bloody tape measure instead.

I’m not saying that you shouldn’t collect larger samples. Obviously, if measurement reliability remains constant*, larger samples can improve sensitivity. But the first thought should always be how you can make your experiment a better test of your hypothesis. Sometimes the only thing you can do is to increase the sample but I bet usually it isn’t – and if you’re not careful, it can even make things worse. If your aim is to conclude something about the human brain/mind in general, a larger and broader sample would allow you to generalize better. However, for this purpose increasing your subject pool from 20 undergraduate students at your university to 100 isn’t really helping. And when it comes to the choice between an exact replication study with three times the sample size of the original experiment, and one with the same sample size but objectively better methods, I know I’d always pick the latter.


(* In fact, it’s really a trade-off and in some cases a slight increase of measurement error may very well be outweighed by greater power due to a larger sample size. This probably happens for the kinds of experiments where slight differences in experimental parameters don’t matter much and you can collect hundreds of people fast, for example online or at a public event.)

A few thoughts on stats checking

You may have heard of StatCheck, an R package developed by Michèle B. Nuijten. It allows you to search a paper (or manuscript) for common frequentist statistical tests. The program then compares whether the p-value reported in the test matches up with the reported test statistic and the degrees of freedom. It flags up cases where the p-value is inconsistent and, additionally, when the recalculated p-value would change the conclusions of the test. Now, recently this program was used to trawl through 50,000ish papers in psychology journals (it currently only recognizes statistics in APA style). The results on each paper are then automatically posted as comments on the post-publication discussion platform PubPeer, for example here. At the time of writing this, I still don’t know if this project has finished. I assume not because the (presumably) only one of my papers that has been included in this search has yet to receive its comment. I left a comment of my own there, which is somewhat satirical because 1) I don’t take the world as seriously as my grumpier colleagues and 2) I’m really just an asshole…
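To give a flavour of what such a check involves, here is a minimal Python sketch that recomputes the p-value from a reported F statistic and its degrees of freedom and compares it to the reported value. This is not StatCheck itself (the real package is written in R, also parses APA-formatted results out of the text, and has more nuanced rules about rounding and one-tailed tests); the numbers below are invented.

```python
from scipy import stats

def check_f_test(f_value, df1, df2, reported_p, decimals=3):
    """Recompute the p-value of an F-test and flag mismatches with the reported value."""
    recomputed_p = stats.f.sf(f_value, df1, df2)
    consistent = round(recomputed_p, decimals) == round(reported_p, decimals)
    return recomputed_p, consistent

# e.g. a (made-up) paper reports F(1, 28) = 4.50, p = .043
recomputed, ok = check_f_test(4.50, 1, 28, 0.043)
print(f"recomputed p = {recomputed:.3f}, consistent with reported value: {ok}")
```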

While many have welcomed the arrival of our StatCheck Overlords, not everyone is happy. For instance, a commenter in this thread bemoans that this automatic stats checking is just “mindless application of stats unnecessarily causing grief, worry, and ostracism. Effectively, a witch hunt.” In a blog post, Dorothy Bishop discusses the case of her own StatCheck comments, one of which gives the paper a clean bill of health while the other discovered some potential errors that could change the significance and thus the conclusions of the study. My own immediate gut reaction to hearing about this was that it would cause a deluge of vacuous comments and diminish the signal-to-noise ratio of PubPeer. Up until now, discussions there frequently focused on serious issues with published studies. If I see a comment on a paper I’ve been looking up (which is made very easy using the PubPeer plugin for Firefox), I would normally check it out. If in future most papers have a comment from StatCheck, I will certainly lose that instinct. Some are worried about the stigma that may be attached to papers when some errors are found, although others have pointed out that to err is human and we shouldn’t be afraid of discovering errors.

Let me be crystal clear here. StatCheck is a fantastic tool and should prove immensely useful to researchers. Surely, we all want to reduce errors in our publications, which I am also sure all of us make some of the time. I have definitely noticed typos in my papers and also errors with statistics. That’s in spite of the fact that when I do the statistics myself I use Matlab code that outputs the statistics in the way they should look in the text, so all I have to do is copy and paste them in. Some errors are introduced at the copy-editing stage after a manuscript is accepted. Anyway, using StatCheck on our own manuscripts can certainly help reduce such errors in future. It is also extremely useful for reviewing papers and marking student dissertations because I usually don’t have the time (or desire) to manually check every single test by hand. The real question is whether there is really much of a point in doing this post hoc for thousands of already published papers.

One argument for this is to enable people to meta-analyze previous results. Here it is important to know that a statistic is actually correct. However, I don’t entirely buy this argument because if you meta-analyze the literature you really should spend more time on checking the results than on looking at what the StatCheck auto-comment on PubPeer said. If anything, the countless comments saying that there are zero errors are probably more misleading than the ones that found minor problems. They may actually mislead you into thinking that there is probably nothing wrong with these statistics – and this is not necessarily true. In all fairness, StatCheck, both in its auto-comments and in the original paper, is very explicit about the fact that its results aren’t definite and should be verified manually. But if there is one thing I’ve learned about people it is that they tend to ignore the small print. When is the last time you actually read an EULA before agreeing to it?

Another issue with the meta-analysis argument is that presently the search is of limited scope. While 50,000 is a large number, it is a small proportion of scientific papers, even within the field of psychology and neuroscience. I work at a psychology department and am (by some people’s definition) a psychologist but – as I said – to my knowledge only one of my own papers should have even been included in the search so far. So if I do a literature search for a meta-analysis StatCheck’s autopubpeering wouldn’t be much help to me. I’m told there are plans to widen the scope of StatCheck’s robotic efforts beyond psychology journals in the future. When it is more common this may indeed be more useful although the problem remains that the validity of its results is simply unknown.

The original paper includes a validity check in the Appendix. This suggests that error rates are reasonably low when comparing StatCheck’s results to previous checks. This is doubtless important for confirming that StatCheck works. But in the long run this is not really the error rate we are interested in. What this does not tell us is which proportion of papers contain errors that actually affect a study’s conclusions. Take Dorothy Bishop‘s paper as an example. For that paper, StatCheck detected two F-tests for which the recalculated p-value would change the statistical conclusions. However, closer inspection reveals that the test was simply misreported in the paper. Only one degree of freedom was given, and I’m told StatCheck misinterpreted which test this was (but I’m also told this has been fixed in the new version). If you substitute in the correct degrees of freedom, the reported p-value matches.

Now, nobody is denying that there is something wrong with how these particular stats were reported. An F-test should have two degrees of freedom. So StatCheck did reveal errors and this is certainly useful. But the PubPeer comment flags this up as a potential gross inconsistency that could theoretically change the study’s conclusions, and we know that this is not actually the case. The statistical inference and conclusions are fine. There is merely a typographical error. The StatCheck report is clearly a false positive.

This distinction seems important to me. The initial reports about this StatCheck mega-trawl were that “around half of psychology papers have at least one statistical error, and one in eight have mistakes that affect their statistical conclusions.” At least half of this sentence is blatantly untrue. I wouldn’t necessarily call a typo a “statistical error”. But as I already said, revealing these kinds of errors is certainly useful nonetheless. The second part of this statement is more troubling. I don’t think we can conclude that 1 in 8 papers included in the search have mistakes that affect their conclusions. We simply do not know that. StatCheck is a clever program but it’s not a sentient AI. The only way to really determine if the statistical conclusions are correct is still to go and read each paper carefully and work out what’s going on. Note that the statement in the StatCheck paper is more circumspect and acknowledges that such firm conclusions cannot be drawn from its results. It’s a classic case of journalistic overreach where the RetractionWatch post simplifies what the researchers actually said. But these are still people who know what they’re doing. They aren’t writing flashy “science” articles for the tabloid press.

This is a problem. I do think we need to be mindful of how the public perceives scientific research. In a world in which it is fine for politicians to win referenda because “people have had enough of experts”, and in which a narcissistic, science-denying madman is dangerously close to becoming US President, we simply cannot afford to keep telling the public that science is rubbish. Note that worries about the reputation of science are no excuse not to help improve it. Quite to the contrary, it is a reason to ensure that it does improve. I have said many times, science is self-correcting but only if there are people who challenge dearly held ideas, who try to replicate previous results, who improve the methods, and who reveal errors in published research. This must be encouraged. However, if this effort does not go hand in hand with informing people about how science actually works, rather than just “fucking loving” it for its cool tech and flashy images, then we are doomed. I think it is grossly irresponsible to tell people that an eighth of published articles contain incorrect statistical conclusions when the true number is probably considerably smaller.

In the same vein, an anonymous commenter on my own PubPeer thread also suggested that we should “not forget that Statcheck wasn’t written ‘just because’.” There is again an underhanded message in this. Again, I think StatCheck is a great tool and it can reveal questionable results such as rounding down your p=0.054 to p=0.05 or the even more unforgivable p<0.05. It can also reveal other serious errors. However, until I see any compelling evidence that the proportion of such evils in the literature is as high as suggested by these statements I remain skeptical. A mass-scale StatCheck of the whole literature in order to weed out serious mistakes seems a bit like carpet-bombing a city just to assassinate one terrorist leader. Even putting questions of morality aside, it isn’t really very efficient. Because if we assume that some 13% of papers have grossly inconsistent statistics, we still need to go and manually check them all. And, what is worse, we quite likely miss a lot of serious errors that this test simply can’t detect.

So what do I think about all this? I’ve come to the conclusion that there is no major problem per se with StatCheck posting on PubPeer. I do think it is useful to see these results, especially if it becomes more general. Seeing all of these comments may help us understand how common such errors are. It allows people to double check the results when they come across them. I can adjust my instinct. If I see one or two comments on PubPeer I may now suspect it’s probably about StatCheck. If I see 30, it is still likely to be about something potentially more serious. So all of this is fine by me. And hopefully, as StatCheck becomes more widely used, it will help reduce these errors in future literature.

But – and this is crucial – we must consider how we talk about this. We cannot treat every statistical error as something deeply shocking. We need to develop a fair tolerance for these errors as they are discovered. This may seem obvious to some but I get the feeling not everybody realizes that correcting errors is the driving force behind science. We need to communicate this to the public instead of just telling them that psychologists can’t do statistics. We can’t just say that some issue with our data analysis invalidates 45,000 papers and 15 years’ worth of fMRI studies. In short, we should stop overselling our claims. If, like me, you believe it is damaging when people oversell their outlandish research claims about power poses and social priming, then it is also damaging if people oversell their doomsday stories about scientific errors. Yes, science makes errors – but the fact that we are actively trying to fix them is proof that it works.

[Image: a menacing-looking Terminator T-800]
Your friendly stats checking robot says hello