Of Psychic Black Swans & Science Wizards

In recent months I have written a lot (and thought a lot more) about the replication crisis and the proliferation of direct replication attempts. I admit I haven’t bothered to quantify this but I have an impression that most of these attempts fail to reproduce the findings they try to replicate. I can understand why this is unsettling to many people. However, as I have argued before, I find the current replication movement somewhat misguided.

A big gaping hole where your theory should be

Over the past year I have also written a lot (arguably too much) about Psi research. Most recently, I summarised my views on this in an uncharacteristically short post (by my standards) in reply to Jacob Jolij. But only very recently I realised that my views on all of this actually converge on the same fundamental issue. On that note I would like to thank Malte Elson, with whom I discussed some of these issues at that Open Science event at UCL recently. Our conversation played a significant role in clarifying my thoughts on this.

My main problem with Psi research is that it has no firm theoretical basis and that the use of labels like “Psi” or “anomalous” or whatnot reveals that this line of research is simply about stating the obvious. There will always be unexplained data but that doesn’t prove any theory. It has now dawned on me that my discomfort with the current replication movement stems from the same problem: failed direct replications do not explain anything. They don’t provide any theoretical advance to our knowledge about the world.

I am certainly not the first person to say this. Jason Mitchell’s treatise about failed replications covered many of the same points. In my opinion it is unfortunate that these issues have been largely ignored by commenters. Instead his post has been widely maligned and ridiculed. In my mind, this reaction was not only uncivil but really quite counter-productive to the whole debate.

Why most published research findings are probably not waterfowl

A major problem with his argument was pointed out by Neuroskeptic: Mitchell seems to hold replication attempts to a different standard than original research. While I often wonder if it is easier to incompetently fail to replicate a result than to incompetently p-hack it into existence, I agree that it is not really feasible to take that into account. I believe science should err on the side of open-minded skepticism. Thus even though it is very easy to fail to replicate a finding, the only truly balanced view is to use the same standards for original and replication evidence alike.

Mitchell describes the problems with direct replications with a famous analogy: if you want to prove the existence of black swans, all it takes is to show one example. No matter how many white swans you may produce afterwards, they can never refute the original reports. However, in my mind this analogy is flawed. Most of the effects we study in psychology or neuroscience research are not black swans. A significant social priming effect or a structural brain-behaviour correlation is not irrefutable evidence that the effect is real.

Imagine that there really were no black swans. It is conceivable that someone might parade around a black swan but maybe it’s all an elaborate hoax. Perhaps somebody just painted a white swan? Frauds of such a sensational nature are not unheard of in science, but most of us trust that they are nonetheless rare. More likely, it could be that the evidence is somehow faulty. Perhaps the swan was spotted in poor lighting conditions making it appear black. Considering how many people can disagree about whether a photo depicts a black or a white dress this possibility seems entirely conceivable. Thus simply showing a black swan is insufficient evidence.

On the other hand, Mitchell is entirely correct that parading a whole flock of white swans is also insufficient evidence against the existence of black swans. The same principle applies here. The evidence could also be faulty. If we only looked at swans native to Europe we would have a severe sampling bias. In the worst case, people might be photographing black swans under conditions that make them appear white.

Definitely white and gold! (Fir0002/Flagstaffotos)

On the wizardry of cooking social psychologists

This brings us to another oft-repeated argument about direct replications. Perhaps the “replicators” are just incompetent or lacking in skill. Mitchell also has an analogy for this (which I unintentionally also used in my previous post). Replicators may just be bad cooks who follow the recipes but nonetheless fail to produce meals that match the beautiful photographs in the cookbooks. In contrast, Neuroskeptic referred to this tongue-in-cheek as the Harry Potter Theory: only those blessed with magical powers are able to replicate. Inept “muggles” failing to replicate a social priming effect should just be ignored.

In my opinion both of these analogies are partly right. The cooking analogy correctly points out that simply following the recipe in a cookbook does not make you a master chef. However, it also ignores the fact that the beautiful photographs in a cookbook are frequently not entirely genuine. To my knowledge, many cookbook photos actually show cold food to circumvent problems like steam fogging up the camera. Most likely the photos will have been doctored in some way, and they will almost certainly be the best pick out of several cooking attempts and numerous photos. So while it is true that the cook was an expert and you probably aren’t, the photo does not necessarily depict a representative meal.

The jocular wizardry argument implies that anyone with a modicum of expertise in a research area should be able to replicate a research finding. As students we are taught that the methods sections of our research publications should allow anyone to replicate our experiments. But this is certainly not feasible: some level of expertise and background knowledge should be expected for a successful replication. I don’t think I could replicate any findings in radio astronomy, regardless of how well established they may be.

One frustration many authors of results that have failed to replicate have expressed to me (and elsewhere) is the implicit assumption by many “replicators” that social psychology research is easy. I am not a social psychologist. I have no idea how easy these experiments are but I am willing to give people the benefit of the doubt here. It is possible that some replication attempts overlook critical aspects of the original experiments.

However, I think one of the key points of Neuroskeptic’s Harry Potter argument applies here: the validity of a “replicator’s” expertise, that is their ability to cast spells, cannot be contingent on their ability to produce these effects in the first place. This sort of reasoning seems circular and, appropriately enough, sounds like magical thinking.

Which one is Harry Potter again? (lotr.wikia.com/wiki/Wizards)

How to fix our replicator malfunction

The way I see it both arguments carry some weight here. I believe that muggle replicators should have to demonstrate their ability to do this kind of research properly in order for us to have any confidence in their failed wizardry. When it comes to the recent failure to replicate nearly half a dozen studies reporting structural brain-behaviour correlations, Ryota Kanai suggested that the replicators should have analysed the age dependence of grey matter density to confirm that their methods were sensitive enough to detect such well-established effects. Similarly, all the large-scale replication attempts in social psychology should contain such sanity checks. On a positive note, the Many Labs 3 project included a replication of the Stroop effect and similar objective tests that fulfil such a role.

However, while such clear-cut baselines are great they are probably insufficient, in particular if the effect size of the “sanity check” is substantially greater than the effect of interest. Ideally, any replication attempt should contain a theoretical basis, an alternative hypothesis to be tested that could explain the original findings. As I said previously, it is the absence of such theoretical considerations that makes most failed replications so unsatisfying to me.
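To put numbers on this worry, here is a rough sketch (my own illustrative figures, not drawn from any of the studies discussed) of how a sanity check with a large effect size can pass almost every time while the same study remains badly underpowered for the smaller effect of interest:

```python
import math

def power_two_sample(d, n_per_group):
    """Approximate power of a two-sided two-sample z-test at alpha = .05
    for a standardised effect size d (normal approximation)."""
    ncp = d * math.sqrt(n_per_group / 2)                     # noncentrality of the test statistic
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))   # standard normal CDF
    return 1 - phi(1.96 - ncp)

# With 25 participants per group, a Stroop-sized check (d ~ 0.8) passes easily...
print(round(power_two_sample(0.8, 25), 2))   # ≈ 0.81
# ...while power for a more typical effect of interest (d ~ 0.2) is dismal
print(round(power_two_sample(0.2, 25), 2))   # ≈ 0.11
```

In other words, succeeding on the sanity check tells you very little about sensitivity to the effect under scrutiny when the two effect sizes are far apart.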

The problem is that for a lot of the replication attempts, whether they are of brain-behaviour correlations, social priming, or Bem’s precognition effects, the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking and/or questionable research practices. This seems mostly unfalsifiable. Perhaps these replication studies could incorporate control conditions/analyses to quantify the severity of p-hacking required to produce the original effects. But this is presumably unfeasible in practice because the parameter space of questionable research practices is so vast that it is impossible to derive a sufficiently accurate measure of them. In a sense, methods for detecting publication bias in meta-analysis are a way to estimate this but the evidence they provide is only probabilistic, not experimental.
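To illustrate why the spurious-findings account is nonetheless plausible, here is a toy simulation (entirely made-up parameters, not a model of any real literature) of what publication bias alone does to a field in which the true effect is exactly zero:

```python
import math, random

random.seed(1)

def one_study(n_per_group, true_d=0.0):
    """Simulate one two-group study; return the observed effect size and p-value.
    Normal approximation: observed d ~ Normal(true_d, sqrt(2/n))."""
    se = math.sqrt(2.0 / n_per_group)
    d = random.gauss(true_d, se)
    z = abs(d) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return d, p

studies = [one_study(20) for _ in range(10_000)]   # the true effect is zero in every study
published = [d for d, p in studies if p < 0.05]    # only "significant" results get published

print(len(published) / len(studies))                    # ≈ 0.05: the false-positive rate
print(sum(abs(d) for d in published) / len(published))  # ≈ 0.74: a sizeable effect from pure noise
```

The published record thus shows a respectable average effect even though nothing is there; methods like p-curve or the R-Index work backwards from exactly this kind of distortion in the reported results, which is what makes their evidence probabilistic rather than experimental.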

Of course this doesn’t mean that we cannot have replication attempts in the absence of a good alternative hypothesis. My mentors instilled in me the view that any properly conducted experiment should be published. It shouldn’t matter whether the results are positive, negative, or inconclusive. Publication bias is perhaps the most pervasive problem scientific research faces and we should seek to reduce it, not amplify it by restricting what should and shouldn’t be published.

Rather I believe we must change the philosophy underlying our attempts to improve science. If you disbelieve the claims of many social priming studies (and honestly, I don’t blame you!) it would be far more convincing to test a hypothesis on why the entire theory is false than showing that some specific findings fail to replicate. It would also free up a lot of resources to actually advance scientific knowledge that are currently used on dismantling implausible ideas.

There is a reason why I haven’t tried to replicate “presentiment” experiments even though I have written about them. Well, to be honest the biggest reason is that my grant is actually quite specific as to what research I should be doing. However, if I were to replicate these findings I would want to test a reasonable hypothesis as to how they come about. I actually have some ideas how to do that but in all honesty I simply find these effects so implausible that I don’t really feel like investing a lot of my time into testing them. Still, if I were to try a replication it would have to test an alternative theory because a direct replication is simply insufficient. If my replication failed, it would confirm my prior beliefs but not explain anything. However, if it succeeded, I probably still wouldn’t believe the claims. In other words, we wouldn’t have learned very much either way.

Those pesky replicators always fail… (en.memory-alpha.org/wiki/Replicator)

12 thoughts on “Of Psychic Black Swans & Science Wizards”

  1. If a theory fails the test of replication, where the experimental conditions are at least broadly identical to the ones in the original setup, how can it possibly possess external validity and be applicable to everyday life, where conditions are always messy and volatile?


    1. I don’t disagree with that to be honest. If an effect repeatedly fails to replicate under reasonably matched conditions, this adds evidence suggesting it is false. If it is not robust to subtle changes in the parameters this is also evidence against it. The problem with this, however, is that it’s almost inevitable that people will debate whether these subtle differences aren’t in fact critical. In some situations such failed replications may reveal other issues. My previous post on those SBB correlations shows a good example of that. If the failed replication reveals a methodological flaw, this is in my mind far more important than the question of whether the original findings were true or not.


    2. Actually I should add that this all depends on a very subjective judgement as to what constitutes “broadly identical” conditions. Many real effects are very subtle and demand careful experimental control. This certainly seems to apply to particle physics or microbiology and from personal experience I can attest that it is true for neurophysiology as well. I don’t really see why people seem to think this argument is suddenly invalid when it comes to psychology. In fact, something as messy and noisy as human behaviour would seem even more precarious.

      As I said in my post, I don’t think this “muggle defense” is valid but we need to be careful here. Different people clearly disagree as to which parameters are critical in these experiments. So in order to resolve these arguments we need to do experiments that test these parameters explicitly – which is the point I’m trying to make.


  2. quote: “methods for detecting publication bias in meta-analysis are a way to estimate this but the evidence they provide is only probabilistic, not experimental.”

    Experiments also use probabilistic information to make inferences. If the means in a sample show the predicted pattern, some statistics are used to claim that the observed covariance was not just a chance finding.

    Your quote may falsely imply that methods for detecting publication bias rely on correlational evidence. However, this is not the case for all methods. R-Index, TES, p-curve, and p-uniform are not based on covariations. They simply use the information reported in the original article and correct for biases that may exist in the set of studies.

    If the existing evidence does not survive a bias-correction, the existing studies provide no empirical evidence for a hypothesis.

    Moreover, these methods can also make accurate predictions about the chances of replicating reported findings. If reported findings are biased, they are unlikely to replicate. So far, bias in the original studies provides a simple and parsimonious explanation for the outcomes of replication studies. After all, Many Labs 1 replicated many findings because the original studies had high power.


    1. Just to be clear, my point in the post was that a probabilistic/statistical argument is not a bad way to make inferences, just that a robust experimental result and/or a cogent theoretical argument to explain the original finding is still more convincing.

      One great example in my mind is the Doyen replication of Bargh’s elderly priming study. They actually tested an alternative explanation that seems to give a more parsimonious explanation of the original findings.


    2. I still do not understand what you mean by a robust experimental result. We would not have a replication crisis if experiments produced robust results.

      The advantage of experiments is only that a robust result leads to a simple inference about cause and effect. The problem is that results are often not robust. Whether results are robust depends on many factors, being an experiment is not one of them. Correlational results can be robust, but may not allow for a simple explanation in terms of cause and effect.

      Your reference to Doyen’s replication study of Bargh is also not a solution to the problem. First, Doyen’s results have not been replicated. More important, you miss the main point that often a simple explanation for failed replications is that the original studies capitalized on chance/sampling error. There is no moderator to be found. It is just random noise and you cannot replicate random noise. There is simply no empirical, experimental solution to find out why a particular study at one particular moment in history produced a particular result.

      However, for a set of studies it is possible to examine whether the data are consistent with random sampling error. QRPs often violate simple rules of random sampling error, which makes it possible to make claims about original studies BASED ON THE RESULTS REPORTED in the original studies.


    3. Dr. R: I agree that the more labs fail at a direct replication, the more likely it seems that the finding really is a true null result. However, it could still be argued that there is a muggle factor you aren’t taking into account. The more people duplicate the same error, the more you will skew the final result. The same obviously also applies to confirmations of a finding. That’s why we need better experiments instead of lots of direct replications. Your publication bias tests cannot possibly reveal that problem.

      Contrary to your suggestion I’m fully aware of the fact that many (in some fields, most) results are false positives. However, to show that you must design experiments that go beyond merely trying to confirm the null because that is impossible. It is unfalsifiable. There are better ways to test ideas.

      I used the Doyen study as an example. It is completely irrelevant if it is replicated or not. Like any other result it should be scrutinised and replaced by more conclusive evidence, if necessary. My point is that unlike most replication attempts this one was well-designed. They replicated the original finding but tested an alternative explanation for it.

      Another kind of robust experiment is one that improves the methodology. If you can enhance the signal-to-noise ratio of your effect estimate, or perhaps even measure it more directly, this is much better evidence than repeating the original experiment a hundred times, especially if there is any doubt that some of the replicators are doing it right.

      Now, as I said I think all research should be published (provided it isn’t completely erroneous and incompetent). I have absolutely nothing against the publication of failed replications. I am merely saying they should come in the context of an experiment that seeks to do more and which contains appropriate sanity checks to confirm that it had the necessary sensitivity. Unlike the post-hoc probabilities that underlie a lot of the publication bias analyses, this is actually a valid test of power. If you can show that your experiment could detect effect X but nevertheless failed to replicate the theoretically equivalent effect Y, this provides fairly compelling evidence that effect Y doesn’t exist.


  3. “enhance the signal-to-noise ratio of your effect estimate”

    the effect is the signal, to increase the signal-to-noise-ratio means to decrease sampling error. This is what good replication studies do. They redo an original study that had small samples and large sampling error using large samples and small sampling error. In other words, they just run a better study. If this study does not show a significant result, it undermines the validity of the evidence from the smaller study. If you want to believe that the replicators made a mistake, the original authors can redo the study and show that they can get the effect again, but they had better do this with a bigger sample, because another significant result with a small sample will be flagged as evidence that QRPs were used.

    In short, studies with BIGGER samples are BETTER studies. So, if you are asking for better experiments, you have to explain why an exact replication study WITH A LARGER SAMPLE is not a better experiment than the original study.
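The sampling-error arithmetic behind this point is uncontroversial and easy to demonstrate: the standard error of a sample mean shrinks as 1/√n, so quadrupling the sample halves the error. A toy simulation with arbitrary numbers:

```python
import math, random

random.seed(0)

def se_of_mean(n, sd=1.0, reps=5000):
    """Empirical standard error of the sample mean for samples of size n,
    estimated by simulating `reps` studies."""
    means = [sum(random.gauss(0.0, sd) for _ in range(n)) / n for _ in range(reps)]
    mu = sum(means) / reps
    return math.sqrt(sum((m - mu) ** 2 for m in means) / reps)

for n in (25, 100, 400):
    print(n, round(se_of_mean(n), 3))   # roughly 0.2, 0.1, 0.05: quadrupling n halves the error
```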


    1. the effect is the signal, to increase the signal-to-noise-ratio means to decrease sampling error

      That’s one way but not the only way and very often it’s not the best way. You can also enhance SNR by enhancing the signal (very difficult in practice) or reducing the noise (such as measurement error). If my memory doesn’t fail me the Doyen study also did that as they included an automated measurement of walking speed using laser sensors rather than just relying on people with stopwatches. Regarding the SBB correlation replications it may have been possible to acquire quantitative structural images or use scans with higher resolution and better tissue contrast, although you would need to confirm that the SNR is in fact increased – which again means it has to contain a sanity check of some sort.
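      The attenuation at stake here can be sketched in a few lines (a toy simulation with invented numbers, not a model of the Doyen study): adding measurement error to one variable drags the observed correlation below its true value, and that is exactly the noise a more precise instrument removes.

```python
import math, random

random.seed(2)

def observed_r(true_r, noise_sd, n=20_000):
    """Correlation between x and a noisy measurement of y, where corr(x, y) = true_r."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        y = true_r * x + math.sqrt(1 - true_r ** 2) * random.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y + random.gauss(0.0, noise_sd))   # measurement error on y only
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    return cov / math.sqrt(vx * vy)

print(round(observed_r(0.5, 1.0), 2))   # noisy stopwatch-style measure: r attenuated to ≈ 0.35
print(round(observed_r(0.5, 0.2), 2))   # precise laser-style measure: r ≈ 0.49, close to the true 0.5
```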

      you have to explain why an exact replication study WITH A LARGER SAMPLE is not a better experiment than the original study

      It’s not a better experiment. It’s the same experiment with a larger sample. Maybe you believe bigger is always better but I don’t. Critically such a direct replication doesn’t explain anything. If it fails it could be because you artifactually reduced the effect size by doing it badly. If the muggle factor reduces the effect size you can possibly measure by 75%, your larger sample size isn’t going to help you very much.

      As I discussed in my post, the muggle factor isn’t always credible. Also, as I said above in reply to Rolf, I can agree that if an effect is so flaky that very tiny parameter changes destroy it, then it is probably not real. In fact, I once replicated an experiment for a former lab member and it failed gloriously. Since we changed some factors it could have been due to those, but we felt it was sufficiently damning to suggest that the hypothesis was probably wrong. This is incidentally the only experiment I have in my file drawer (not counting a few as-yet-unpublished studies). I might blog about this at some point in the future.

      So what makes a better experiment for me? Could be a combination of these but it has to have at least one:

      1. It improves the method in some way that either reduces noise or boosts the signal.

      2. Seeks to test an alternative hypothesis – it’s fine if the replication itself fails but at least this shows that the authors weren’t blindly biased to disbelieve the original result.

      3. Includes a sanity check confirming that the experiment is adequate to detect the effect (statistical power is not the same thing).

      4. If the original finding was correlational, the new experiment uses a causal manipulation. This can be part of a replication attempt.

      There may be others but any of these four would already be a good start.
      Finally, may I kindly request that you refrain from caps-locking your comments? It’s too early in the morning for me to feel like I’m being screamed at. You can use HTML tags or symbols if you wish to emphasise something. Thanks!


  4. Thank you for the honorable mention — like I said on Twitter, I doubt that my presence at UCL/Is Science Broken? played any role in the intellectual development of your sophisticated argument. That being said, I take this as an invitation to comment briefly; maybe after reading this it will become obvious to anyone why I cannot possibly be associated with your excellent analysis.

    I believe I said something similar at the event itself: I do not think the “replication crisis” is a crisis of methods, but a crisis of theory. The series of failed replications, even of psychology classics, is the logical consequence of the repeated use of bad methods to find support for bad theories.

    In the course of the cognitive revolution, psychology started to indulge in “cute” theories that belong more in the philosophical than the empirical realm: that metaphors are represented physically in the brain and metaphorical brain circuitry can affect behavior; that motivation is an engine running on fuel; that the human mind is an associationistic network of implicit & explicit, internal & external, conscious & unconscious, and short & long-term compartments; or, even more basic, that we are actually able to observe internal processes often classified as “thinking” and “feeling” using non-invasive procedures that are supposedly able to inform us about theorized underlying processes.

    The reason why these theories were able to survive for such a long time is that, coincidentally, psychologists found the “appropriate” methods, from (ad-hoc) questionnaires to lexical procedures allegedly tapping into implicit processes or associations, allowing them to claim to have found scientific evidence for the theories describing human “internals”. Not physiological internals, mind you, but some kind of semantic structure hidden deep in a biological shell, the brain. The fetishizing of NHST with small samples certainly did not make things better, just to mention that as well.

    Every contradicting finding is followed by a discussion of moderators that are as hidden as the variables experimenters were originally interested in, because if everything is internal, implicit, or unconscious, there is an unlimited number of potential, equally invisible influences.

    We are presented with the results of this toxic combination today: almost nothing replicates, psychological theories/models/mechanisms become more fragmented the longer they are kept under scrutiny, and the bigger part of my textbooks has become a collection of historical documents.

    That being said, I hope that the replication crisis will lead to the testing of more reasonable (read: testable) hypotheses. It’s just possible that we have to go on deconstructing “contemporary textbook knowledge” for a couple of years.


    1. Our conversation really helped me focus my thoughts (which, as you can tell from my wordiness, is desperately needed). So the credit is well deserved.

      I am a neuroscientist (neurophysiologist by training) so I can’t really comment in depth on theories in social psychology. It looks to me like we may be at a watershed moment for social psychology. It is normal for popular theories to get dismantled. It has happened before; just look at phrenology or social Darwinism which, we might add, were truly awful. It would be great if we could somehow safeguard science from ever falling into traps like this again, but I think it is impossible to prevent this completely. However, I also believe we are doing better than our colleagues in the late 1800s did, and the efforts under way now will hopefully improve things even further.

      My main interest in this debate relates more to my own subfield, which is why I have talked so much about those brain-behaviour correlations. It is a different story than the failures to replicate social priming results (in fact, from a theoretical standpoint some might say they are polar opposites…) But these things get lumped together in the context of QRPs, p-hacking etc. There is a point to that but I think it detracts from the real problem which is the lack of a solid theoretical basis.

