Black coins & swan flipping

My previous post sparked a number of responses from various corners, including some exchanges I had on Twitter as well as two blog posts, one by Simine Vazire and another one following on from that. In addition, there has also been another post which discussed (and, in my opinion, misrepresented) similar things I said recently at the UCL “Is Science Broken” panel discussion.

I don’t blame people for misunderstanding what I’m saying. The blame must lie largely with my own inability to communicate my thoughts clearly. Maybe I am just crazy. As they say, you can’t really judge your own sanity. However, I am a bit worried about our field. To me the issues I am trying to raise are self-evident and fundamental. The fact that they apparently aren’t to others makes me wonder if Science isn’t in fact broken after all…

Either way, I want to post a brief (even by Alexetz’ standards?) rejoinder to that. They will just be brief answers to frequently asked questions (or rather, the often constructed strawmen):

Why do you hate replications?

I don’t. I am saying replications are central to good science. This means that all studies (or close to all) should contain replications as part of their design. It should be a daisy chain. Each experiment should contain some replications, some sanity checks and control conditions. This serves two purposes: it shows that your experiment was done properly and it helps to accumulate evidence on whether or not the previous findings are reliable. Thus we must stop distinguishing between “replicators” and “original authors”. All scientists should be replicators all the bloody time!

Why should replicators have to show why they failed to replicate?

They shouldn’t. But, as I said in the previous point, they should be expected to provide evidence that they did a proper experiment. And of course the original study should be held to the same standard. This could in fact be a sanity check: if you show that the method used couldn’t possibly reveal reliable data this speaks volumes about the original effect.

It’s not the replicator’s fault if the original study didn’t contain a sanity check!

That is true. It isn’t your fault if the previous study was badly designed. But it is your fault if you are aware of that defect and nonetheless don’t try to do better. And it’s not really that black and white. What was good design yesterday can be bad design today and indefensibly terrible tomorrow. We can always do better. That’s called progress.

But… but… fluke… *gasp* type 1 error… Randomness!?!

Almost every time I discuss this topic someone will righteously point out that I am ignoring the null hypothesis. I am not. Of course the original finding may be a total fluke but you simply can’t know for sure. Under certain provisions you can test predictions that the null hypothesis makes (with Bayesian inference anyway). But that isn’t the same. There are a billion reasons between heaven and earth why you may fail a replication. You don’t even need to do it poorly. It may just be bad luck. Brain-behaviour correlations observed in London will not necessarily be detectable in Amsterdam* because the heterogeneity, and thus the inter-individual variance, in the latter sample is likely to be smaller. This means that for the very same effect size resulting from the same underlying biological process you may need more statistical power. Of course, it could also be some methodological error. Or perhaps the original finding was just a false positive. You can never know.
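To put rough numbers on the variance point, here is a minimal sketch. It assumes a simple linear relationship (behaviour = slope × brain measure + noise) and uses the standard Fisher z approximation for the sample size needed to detect a correlation. The slope, the noise level and the two spreads are hypothetical values chosen only for illustration; they are not taken from any of the studies discussed.

```python
import math
from statistics import NormalDist


def expected_correlation(slope, sd_x, sd_noise):
    """Population correlation when y = slope * x + independent noise."""
    signal = slope * sd_x
    return signal / math.sqrt(signal ** 2 + sd_noise ** 2)


def required_n(r, alpha=0.05, power=0.8):
    """Approximate n to detect correlation r (two-sided test, Fisher z)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)


# Hypothetical values: the same underlying slope and noise in both samples,
# but the second sample has half the spread in the predictor.
slope, sd_noise = 0.5, 1.0
for label, sd_x in [("heterogeneous sample", 1.0), ("homogeneous sample", 0.5)]:
    r = expected_correlation(slope, sd_x, sd_noise)
    print(f"{label}: r = {r:.2f}, n needed ~ {required_n(r)}")
```

Under these assumptions the underlying process is identical in both samples, yet the narrower sample yields a smaller correlation and therefore needs considerably more participants for the same power.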

Confirming the original result was a fluke is new information!

That view is problematic for two reasons. First of all, it is impossible to prove the null (yes, even for Bayesians). Science isn’t math, it doesn’t prove anything. You just collect data that may or may not be consistent with theoretical predictions. Secondly, you should never put too much confidence in any glorious new finding – even if it was high powered (because you don’t really know that) and pre-registered (because that doesn’t prevent people from making mistakes). So your prior that the result is a fluke should be strong anyway. You don’t learn much that is new from that.
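The second reason can be made concrete with Bayes’ rule. The following is only a sketch under simplifying assumptions: there are just two possibilities (the effect is real or it is a fluke), a replication of a fluke “fails” with probability 1 − alpha, and a replication of a real effect fails with probability 1 − power. The function name and all numbers are hypothetical.

```python
def posterior_fluke(prior_fluke, power, alpha=0.05):
    """P(original was a fluke | replication failed), by Bayes' rule."""
    p_fail_given_fluke = 1 - alpha   # a null effect rarely "replicates"
    p_fail_given_real = 1 - power    # a real effect is missed this often
    numerator = prior_fluke * p_fail_given_fluke
    return numerator / (numerator + (1 - prior_fluke) * p_fail_given_real)


# Hypothetical priors: the stronger the prior that the finding is a fluke,
# the less a single failed replication can shift it.
for prior in (0.5, 0.8, 0.95):
    post = posterior_fluke(prior, power=0.8)
    print(f"prior {prior:.2f} -> posterior {post:.3f} (shift {post - prior:+.3f})")
```

The shift shrinks as the prior grows, and it shrinks further if the replication’s true power is lower than the value assumed here.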

What then would tell me new information?

A new experiment that tests the same theory – or perhaps even a better theory. It can be a close replication but it can also be a largely conceptual one. I think this dichotomy is false. There are no true direct replications and even if there were they would be pointless. The directness of replication exists on a spectrum (I’ve said this already in a previous post). I admit that the definition of “conceptual” replications in the social priming literature is sometimes a fairly large stretch. You are free to disagree with them. The point is though that if a theory is so flaky that modest changes completely obliterate it then the onus is on the proponent of the theory to show why. In fact, this could be the new, better experiment you’re doing. This is how a replication effort can generate new hypotheses.

Please leave the poor replicators alone!

If you honestly think replicators are the ones getting a hard time I don’t know what world you live in. But that’s a story for another day, perhaps one that will be told by one of the Neuroscience Devils? The invitation to post there remains open…

Science clearly isn’t self-correcting or it wouldn’t be broken!

Apart from being a circular argument, this is also demonstrably wrong. Science isn’t a perpetual motion machine. Science is what scientists do. The fact that we are having these debates is conclusive proof that science self-corrects. I don’t see any tyrannical overlord dictating that we do any of this.

So what do you think should be done?

As I have said many times before, I think we need to train our students (and ourselves) in scientific skepticism and strong inference. We should stop being wedded to our pet theories. We need to make it easier to seek the truth rather than fame. For all I care, pre-registration can probably help with that but it won’t be enough. We have to stop the pervasive idea that an experiment “worked” when it confirmed your hypothesis and failed when it didn’t. We should read Feynman’s Cargo Cult Science. And after thinking about all this negativity, wash it all down by reading (or watching) Carl Sagan to remember how many mysteries yet wait to be solved in this amazing universe we inhabit.


*) I promise I will stop talking about this study now. I really don’t want to offend anyone…

How replication should work

Of Psychic Black Swans & Science Wizards

In recent months I have written a lot (and thought a lot more) about the replication crisis and the proliferation of direct replication attempts. I admit I haven’t bothered to quantify this but I have an impression that most of these attempts fail to reproduce the findings they try to replicate. I can understand why this is unsettling to many people. However, as I have argued before, I find the current replication movement somewhat misguided.

A big gaping hole where your theory should be

Over the past year I have also written too much about Psi research. Most recently, I summarised my views on this in an uncharacteristically short post (by my standards) in reply to Jacob Jolij. But only very recently I realised that my views on all of this actually converge on the same fundamental issue. On that note I would like to thank Malte Elson with whom I discussed some of these issues at that Open Science event at UCL recently. Our conversation played a significant role in clarifying my thoughts on this.

My main problem with Psi research is that it has no firm theoretical basis and that the use of labels like “Psi” or “anomalous” or whatnot reveals that this line of research is simply about stating the obvious. There will always be unexplained data but that doesn’t prove any theory. It has now dawned on me that my discomfort with the current replication movement stems from the same problem: failed direct replications do not explain anything. They don’t provide any theoretical advance to our knowledge about the world.

I am certainly not the first person to say this. Jason Mitchell’s treatise about failed replications covered many of the same points. In my opinion it is unfortunate that these issues have been largely ignored by commenters. Instead his post has been widely maligned and ridiculed. In my mind, this reaction was not only uncivil but really quite counter-productive to the whole debate.

Why most published research findings are probably not waterfowl

A major problem with his argument was pointed out by Neuroskeptic: Mitchell seems to hold replication attempts to a different standard than original research. While I often wonder if it is easier to incompetently fail to replicate a result than to incompetently p-hack it into existence, I agree that it is not really feasible to take that into account. I believe science should err on the side of open-minded skepticism. Thus even though it is very easy to fail to replicate a finding, the only truly balanced view is to use the same standards for original and replication evidence alike.

Mitchell describes the problems with direct replications with a famous analogy: if you want to prove the existence of black swans, all it takes is to show one example. No matter how many white swans you may produce afterwards, they can never refute the original reports. However, in my mind this analogy is flawed. Most of the effects we study in psychology or neuroscience research are not black swans. A significant social priming effect or a structural brain-behaviour correlation is not irrefutable evidence that it is real.

Imagine that there really were no black swans. It is conceivable that someone might parade around a black swan but maybe it’s all an elaborate hoax. Perhaps somebody just painted a white swan? Frauds of such a sensational nature are not unheard of in science, but most of us trust that they are nonetheless rare. More likely, it could be that the evidence is somehow faulty. Perhaps the swan was spotted in poor lighting conditions making it appear black. Considering how many people can disagree about whether a photo depicts a black or a white dress this possibility seems entirely conceivable. Thus simply showing a black swan is insufficient evidence.

On the other hand, Mitchell is entirely correct that parading a whole swarm of white swans is also insufficient evidence against the existence of black swans. The same principle applies here. The evidence could also be faulty. If we only looked at swans native to Europe we would have a severe sampling bias. In the worst case, people might be photographing black swans under conditions that make them appear white.

Definitely white and gold! (Fir0002/Flagstaffotos)

On the wizardry of cooking social psychologists

This brings us to another oft repeated argument about direct replications. Perhaps the “replicators” are just incompetent or lacking in skill. Mitchell also has an analogy for this (which I unintentionally also used in my previous post). Replicators may just be bad cooks who follow the recipes but nonetheless fail to produce meals that match the beautiful photographs in the cookbooks. In contrast, Neuroskeptic referred to this tongue-in-cheek as the Harry Potter Theory: only those blessed with magical powers are able to replicate. Inept “muggles” failing to replicate a social priming effect should just be ignored.

In my opinion both of these analogies are partly right. The cooking analogy correctly points out that simply following the recipe in a cookbook does not make you a master chef. However, it also ignores the fact that the beautiful photographs in a cookbook are frequently not entirely genuine. To my knowledge, many cookbook photos are actually of cold food to circumvent problems like steam on the camera etc. Most likely the photos will have been doctored in some way and they will almost certainly be the best pick out of several cooking attempts and numerous photos. So while it is true that the cook was an expert while you probably aren’t, the photo does not necessarily depict a representative meal.

The jocular wizardry argument implies that anyone with a modicum of expertise in a research area should be able to replicate a research finding. As students we are taught that the methods sections of our research publications should allow anyone to replicate our experiments. But this is certainly not feasible: some level of expertise and background knowledge should be expected for a successful replication. I don’t think I could replicate any findings in radio astronomy regardless of how well established they may be.

One frustration many authors of results that have failed to replicate have expressed to me (and elsewhere) is the implicit assumption by many “replicators” that social psychology research is easy. I am not a social psychologist. I have no idea how easy these experiments are but I am willing to give people the benefit of the doubt here. It is possible that some replication attempts overlook critical aspects of the original experiments.

However, I think one of the key points of Neuroskeptic’s Harry Potter argument applies here: the validity of a “replicator’s” expertise, that is their ability to cast spells, cannot be contingent on their ability to produce these effects in the first place. This sort of reasoning seems circular and, appropriately enough, sounds like magical thinking.

Which one is Harry Potter again?

How to fix our replicator malfunction

The way I see it both arguments carry some weight here. I believe that replicators should have to demonstrate their ability to do this kind of research properly in order for us to have any confidence in their failed wizardry. When it comes to the recent failure to replicate nearly half a dozen studies reporting structural brain-behaviour correlations, Ryota Kanai suggested that the replicators should have analysed the age dependence of grey matter density to confirm that their methods were sensitive enough to detect such well-established effects. Similarly, all the large-scale replication attempts in social psychology should contain such sanity checks. On a positive note, the Many Labs 3 project included a replication of the Stroop effect and similar objective tests that fulfill such a role.

However, while such clear-cut baselines are great they are probably insufficient, in particular if the effect size of the “sanity check” is substantially greater than the effect of interest. Ideally, any replication attempt should contain a theoretical basis, an alternative hypothesis to be tested that could explain the original findings. As I said previously, it is the absence of such theoretical considerations that makes most failed replications so unsatisfying to me.
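A quick power calculation shows why a large-effect sanity check says little about sensitivity to a small effect. This sketch assumes a simple two-group comparison and the usual normal approximation to the power of a two-sided test; the effect sizes and group size are hypothetical and are not estimates from Many Labs 3 or any other study.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test for effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = d * math.sqrt(n_per_group / 2)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return z.cdf(noncentrality - z_crit)


# Hypothetical effect sizes: a large "sanity check" effect and a small effect of interest.
n = 30
print(f"sanity check (d = 1.0): power ~ {approx_power(1.0, n):.2f}")
print(f"effect of interest (d = 0.2): power ~ {approx_power(0.2, n):.2f}")
```

With these numbers the design would almost always pass the sanity check and still miss the effect of interest most of the time, so passing the check alone does not show that the methods were sensitive enough.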

The problem is that for a lot of the replication attempts, whether they are of brain-behaviour correlations, social priming, or Bem’s precognition effects, the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking and/or questionable research practices. This seems mostly unfalsifiable. Perhaps these replication studies could incorporate control conditions/analyses to quantify the severity of p-hacking required to produce the original effects. But this is presumably unfeasible in practice because the parameter space of questionable research practices is so vast that it is impossible to derive a sufficiently accurate measure of them. In a sense, methods for detecting publication bias in meta-analysis are a way to estimate this but the evidence they provide is only probabilistic, not experimental.

Of course this doesn’t mean that we cannot have replication attempts in the absence of a good alternative hypothesis. My mentors instilled in me the view that any properly conducted experiment should be published. It shouldn’t matter whether the results are positive, negative, or inconclusive. Publication bias is perhaps the most pervasive problem scientific research faces and we should seek to reduce it, not amplify it by restricting what should and shouldn’t be published.

Rather I believe we must change the philosophy underlying our attempts to improve science. If you disbelieve the claims of many social priming studies (and honestly, I don’t blame you!) it would be far more convincing to test a hypothesis on why the entire theory is false than showing that some specific findings fail to replicate. It would also free up a lot of resources to actually advance scientific knowledge that are currently used on dismantling implausible ideas.

There is a reason why I haven’t tried to replicate “presentiment” experiments even though I have written about them. Well, to be honest the biggest reason is that my grant is actually quite specific as to what research I should be doing. However, if I were to replicate these findings I would want to test a reasonable hypothesis as to how they come about. I actually have some ideas how to do that but in all honesty I simply find these effects so implausible that I don’t really feel like investing a lot of my time into testing them. Still, if I were to try a replication it would have to be to test an alternative theory because a direct replication is simply insufficient. If my replication failed, it would confirm my prior beliefs but not explain anything. However, if it succeeded, I probably still wouldn’t believe the claims. In other words, we wouldn’t have learned very much either way.

Those pesky replicators always fail…