Category Archives: replication

Tale of the Coin Flippers

Recently I have spent a lot of time writing about replication and why I feel current “direct” replication efforts are often missing the point. For some reason it is proving much harder than it should be to get my point across. My argument is misconstrued at every step and various straw man versions of it are debated instead. Whatever the reasons for this may be, I want to try again one last time before I go on a break. Perhaps I can communicate my thoughts more clearly by means of a parable…

Mystical coin from the South Seas

The magical coin

Professor Fluke returns from a journey to the tropical islands of the South Pacific. On a beach there he found the coin depicted above. One side shows a Polynesian deity. The other side bears the likeness of an ancient queen. Prof Fluke flips the strange coin 10 times and it lands on tails, the side with the fierce Polynesian god, every time. He is surprised, so he does it again. This time it lands on tails 6 times. Seeing that overall 80% of the flips landed on tails, which is clearly beyond the traditional significance threshold of p<0.05, Fluke publishes a brief communication in a high impact journal to report that the coin is biased. He admits he doesn’t have a good theory for what is happening. The discovery is widely reported in the news, partly due to the somewhat overhyped press release written by Fluke’s university. “Scientists discover magical coin”, the headlines read. A disturbingly successful tabloid writes that the coin will cure cancer.

An earnest replication

Dr Earnest, a vocal proponent of Bayesian inference and a prolific replicator, doesn’t believe Prof Fluke’s sensationalist claims. She decides to replicate Fluke’s results. Unfortunately, she lacks the funds to fly to the South Seas, so she crafts a replica of the coin closely based on Prof Fluke’s description. Despite the hard work of preparing the experiment, she only flips the coin five times. It lands on tails three times. While that is above chance level, under the Bayesian framework this result actually weakly favours the null hypothesis (BF10=0.53). Even though these results aren’t very conclusive, Dr Earnest publishes this as a failure to replicate Fluke’s magical coin. The finding spreads like wildfire all over social media. People say that the “controversial magical coin was debunked!” and that “we need more replications like this!” It doesn’t take long for numerous anonymous commenters – who know nothing about coins, let alone about coin flipping – to declare on internet forums that Prof Fluke is just a “bad scientist”. Some even accuse him of cheating.
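(As a quick sanity check of the arithmetic in the story so far, here is a minimal Python sketch. The uniform prior on the coin’s bias is my own assumption rather than anything Dr Earnest specified, but it happens to reproduce the BF10 ≈ 0.53 quoted above, and it also confirms that Fluke’s 16 tails out of 20 pass the traditional significance threshold.)

```python
from math import comb
from scipy.special import beta as beta_fn
from scipy.stats import binomtest

# Prof Fluke's frequentist claim: 16 tails out of 20 flips versus a fair coin.
print(binomtest(16, 20, 0.5).pvalue)   # ~0.012, i.e. below the traditional 0.05 threshold

def coin_bf10(tails, flips):
    """Bayes factor for a biased coin (H1: uniform prior on the tails probability)
    against a fair coin (H0: p = 0.5). The binomial coefficient cancels out but is
    kept for clarity."""
    m1 = comb(flips, tails) * beta_fn(tails + 1, flips - tails + 1)  # integral of p^t (1-p)^(f-t) dp
    m0 = comb(flips, tails) * 0.5 ** flips
    return m1 / m0

print(coin_bf10(3, 5))    # ~0.53: Dr Earnest's five flips weakly favour the fair coin
print(coin_bf10(16, 20))  # ~10: under this prior Fluke's own data favour a biased coin
```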

Are you flipping crazy?

Another group of 10 researchers is understandably skeptical of Fluke’s magical coin. They each decide to flip a coin 20 times so that there will be many more trials than ever before, giving the replication much greater statistical power. Even though they all formally agree to do the same experiment, they don’t: eight members of the consortium craft replicas of the coin just like Dr Earnest did. One of them, Dr Cook, travels to the South Seas, where a local gives him a coin that looks just like Prof Fluke’s. Finally, one replicator, Dr Friendly, talks directly to Prof Fluke, who agrees to an adversarial collaboration using the actual coin he found.

All 10 of them start tossing coins. Overall the data suggest nothing much is going on. Out of the 200 coin tosses, 99 land on tails – almost perfectly at chance, with the effect if anything going in the opposite direction. However, Dr Friendly, who actually used Fluke’s coin, observes 14 tails out of 20. While this isn’t very strong evidence, it is not inconsistent with Fluke’s earlier findings. The consortium publishes a meta-analysis of all 200 coin flips stating that the evidence clearly shows that such coins are fair.

Prof Fluke and Dr Friendly, however, also publish their own results separately. As with most adversarial collaborations, in the discussion section they starkly disagree in their interpretation of the very same finding. Dr Friendly states that the coin is most likely fair. Fluke disagrees and also discloses a methodological detail that was missing from his earlier publication. He left it out because of the strict word limits imposed by the high impact journal and also because he didn’t think at the time that it should matter: his original 20 coin flips were all performed on the tropical beach right where he found the coin. All of the replications were done elsewhere.

The coin tossing crisis

Nobody takes Fluke’s arguments seriously. All over social media, and even in formal publications, this is discussed as a clear failure to replicate and his findings are branded as probably p-hacked. “It’s obvious,” some commenters say, “Fluke just did a few hundred coin flips but only reported the significant ones.” Some scientists take another coin that depicts a salmon and flip it twice. It lands on fish-heads both times. They present a humorous poster at a conference to illustrate the problems with underpowered coin flipping experiments. Countless direct replication efforts are underway to test previous coin tossing results. To increase statistical power some researchers decide to run their experiments online where they can quickly reach larger sample sizes. Most people ignore the problem that tossing bitcoins might not be theoretically equivalent to doing it in the flesh.

To make matters worse, a few high profile cases of fraudulent coin flippers are revealed. Popular science news outlets write damning editorials about the “reproducibility crisis.” A group of statisticians led by Professor Eager reanalyses all the coin flips reported in the coin flipping literature to reveal that most studies probably did not report all their non-significant findings. Advocates of Bayesian methods counter those claims by saying that you can’t make claims about probabilities after the fact. Unfortunately, nobody really understands what they’re saying, so the findings by Eager et al. are still cited widely.

The mainstream news media now continuously report on this “crisis.” Someone hacks into the email server of Prof Fluke’s university and digs out a statement that, when taken wildly out of context, sounds like all researchers are part of a global coin tossing conspiracy. The disturbingly successful tabloid publishes an article saying that the magical coin causes cancer. Public faith in science is undermined. In parts of the US, school curricula are amended so that children must learn that “coin tossing is just a theory.” People stop vaccinating their children and regulations and treaties put in place to counteract climate change are dismantled. Soon thousands die from preventable diseases while millions get sick from polluted air and water…

From a documentary on Wikipedia about coin tossing experiments

The next generation

A few years later Prof Fluke dies of the flu. The epidemic caused by anti-vaxxers is only partly to blame; his immune system was simply weakened by all the stress caused by the replication debate. His name has become a byword for false positives. People chuckle and joke about him whenever topics like p-hacking and questionable research practices are discussed. After the coin tossing debacle he could no longer get research grants and he failed to get tenure – impoverished and shunned by the scientific community, he couldn’t afford any medicine.

Despite his mother’s warnings, Prof Fluke’s son decides to become a scientist. For obvious reasons, he takes his husband’s name when he gets married, so he is now Dr Curious. Partly driven by an interest in the truth but also by a desire to exonerate his father’s name, Dr Curious takes the coin and travels to the South Pacific. He goes to the very beach where his father found the fateful coin and flips it. Ten out of ten tails! He does it again and observes the same result.

However, even though this could prove his father right, he thinks it’s all too good to be true. He knows he will need extraordinary evidence to support extraordinary claims. So he tries a third time, now flipping the coin 30 times. A gust of wind interferes and he only gets 20 tails out of 30 coin tosses. It tends to be windy on Pacific beaches. This makes the temperature pleasant but it is not conducive to running good coin flipping studies.

A well-controlled experiment

To reduce measurement error in future experiments, he decides that any toss coinciding with a gust of wind will be excluded. Dr Curious also vaguely remembers something an insane blogger wrote many years ago and includes some control conditions in his experiment. He has brought along Dr Earnest’s replica coin. He also obtains an identical looking coin from one of the locals on the island and, last but not least, he has brought a different coin from home. Dr Curious decides to do 100 coin flips per coin. Finally, because he fears people might not believe him otherwise, he preregisters this experimental protocol by means of a carrier albatross (internet connections on the island are too slow and too expensive).

The results of his coin flipping experiment are clear. After removing any trials with wind, the “magical” coin falls on tails almost all the time (52 tails out of 55 flips). On the three occasions it lands on heads, he may simply not have flipped it well (this can really happen…). Strikingly, however, he observes very similar results for the other local coin, and those are even more extreme (60 tails out of 61 flips). Neither the replica coin nor the standard coin from home behaves this way; both produce results that are entirely consistent with random chance.

Dr Curious is very pleased with his findings. He returns home and runs one more control experiment: an exact replication of his experiment, but this time in his lab. He again preregisters the protocol (this time via the internet). All four coins produce results that are not significantly different from chance levels. He publishes his results arguing that both the place and the right type of coin are necessary to replicate his father’s finding.

The Fluke Effect

Our heroic scientist is however naturally curious (indeed that is his full name) so he is not satisfied with that outcome. He hands the coins over to his collaborator, who will subject them to a full metallurgical analysis. In the meantime, Dr Curious flies back to the tropical island. He quickly confirms that he still gets similar results on the beach when using a local coin but not with one of the replicas.

Another thought crosses his mind. He goes into the jungle on the island, far from the beach, and repeats his coin tosses. The finding does not replicate. All coin flips are consistent with chance expectations. Mystified, he returns to the beach. He takes a bucket full of sand from the beach into the jungle and tries again. Now the local coin falls on tails every time. “Eureka!” shouts Dr Curious, like no other scientist before him. “It’s all about the sand!”

He takes some of the sand home with him. His colleague has since discovered that the local coins are subtly magnetic. Now they also establish that the sand is somewhat magnetic. Whenever the coin is flipped over the sand it tends to fall on tails. The coin clearly isn’t magical; in fact, it isn’t even special. It is just like all the other coins on the island. Dr Curious and his colleague have yet to figure out why the individual grains of sand don’t stick to the coin when you pick it up, but they are confident that science will find the answer eventually. It is clearly a hitherto unknown form of magnetism. In honour of his father the effect is called Fluke’s Attraction.

Years later, Dr Curious watches a documentary about this on holographic television presented by Nellie deGrasse Tyson, who inherited both the down-to-earth charm and the natural good looks of her great-grandfather. She explains that while Prof Fluke’s interpretation of his original findings was wrong because he lacked some necessary control conditions, he nonetheless deserves credit for the discovery of a new physical phenomenon that brought about many scientific advances, like holographic television and hover-cars. The story of Fluke’s Attraction is but one example of why persistence and inquisitiveness are essential to scientific progress. It shows that many experiments can be flawed yet nonetheless lead to breakthroughs eventually. Happily, Dr Curious falls asleep on the couch…

An alternate ending?

He dreams he is back on the tropical beach. His experiment with the four different coins fails to replicate his father’s finding. All the coins perform at chance levels even when there is no wind. He tries it over and over but the results are the same. He is forced to conclude that the original findings were completely spurious. There is no Fluke’s Attraction. The islanders’ coins behave just like any other coins…

Drenched in sweat and with a pounding heart Curious awakes from his nightmare. It takes him a moment to realise it was just a dream. Fluke’s Attraction is real. His father’s name has been exonerated and appears in all science textbooks.

But after taking a few deep breaths Curious realises that in the big picture it doesn’t matter. Just then Nellie says on the holo-vision:

“Flukes happen all the time. The most important lesson is not that the effect turned out to be real but that Curious went back to the island and ran a well-controlled experiment to test a new hypothesis. Of course he could have failed to replicate his father’s findings. But nonetheless he would have learned something new about the world: that it doesn’t matter which coins you use or whether you flip them on the beach.

“An infinite number of replications with replica coins – or even with the real coin – could not have done that. Yet all it took to reveal another piece of the truth was one inquisitive researcher who asked ‘What if…?‘”

I hope that next time I’m here I won’t be thinking about failed replications…

For my own sanity’s sake I hope this will be my last post on replication. In the meantime, you may also enjoy this very short post by Lenny Teytelman about how the replication crisis isn’t a real crisis.

Black coins & swan flipping

My previous post sparked a number of responses from various corners, including some exchanges I had on Twitter as well as two blog posts, one by Simine Vazire and another one following on from that. There has also been another post which discussed (and, in my opinion, misrepresented) similar things I said recently at the UCL “Is Science Broken” panel discussion.

I don’t blame people for misunderstanding what I’m saying. The blame must lie largely with my own inability to communicate my thoughts clearly. Maybe I am just crazy. As they say, you can’t really judge your own sanity. However, I am a bit worried about our field. To me the issues I am trying to raise are self-evident and fundamental. The fact that they apparently aren’t to others makes me wonder if Science isn’t in fact broken after all…

Either way, I want to post a short (even by Alexetz’ standards?) rejoinder to all that: brief answers to frequently asked questions (or rather, to the oft-constructed straw men):

Why do you hate replications?

I don’t. I am saying replications are central to good science. This means that all (or nearly all) studies should contain replications as part of their design. It should be a daisy chain: each experiment should contain some replications, some sanity checks and control conditions. This serves two purposes: it shows that your experiment was done properly and it helps to accumulate evidence on whether or not the previous findings are reliable. Thus we must stop distinguishing between “replicators” and “original authors”. All scientists should be replicators all the bloody time!

Why should replicators have to show why they failed to replicate?

They shouldn’t. But, as I said in the previous point, they should be expected to provide evidence that they did a proper experiment. And of course the original study should be held to the same standard. This could in fact be a sanity check: if you show that the method used couldn’t possibly reveal reliable data this speaks volumes about the original effect.

It’s not the replicator’s fault if the original study didn’t contain a sanity check!

That is true. It isn’t your fault if the previous study was badly designed. But it is your fault if you are aware of that defect and nonetheless don’t try to do better. And it’s not really that black and white. What was good design yesterday can be bad design today and indefensibly terrible tomorrow. We can always do better. That’s called progress.

But… but… fluke… *gasp* type 1 error… Randomness!?!

Almost every time I discuss this topic someone will righteously point out that I am ignoring the null hypothesis. I am not. Of course the original finding may be a total fluke but you simply can’t know for sure. Under certain conditions you can test predictions that the null hypothesis makes (with Bayesian inference anyway). But that isn’t the same. There are a billion reasons between heaven and earth why a replication may fail. You don’t even need to do it poorly. It may just be bad luck. Brain-behaviour correlations observed in London will not necessarily be detectable in Amsterdam* because the heterogeneity, and thus the inter-individual variance, in the latter sample is likely to be smaller. This means that for the very same effect size, resulting from the same underlying biological process, you may need more statistical power to detect it. Of course, it could also be some methodological error. Or perhaps the original finding was just a false positive. You can never know.
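(A toy simulation may make this concrete. It is just my own sketch with made-up numbers, not anything from the actual studies: the underlying slope is identical in both samples, but the sample with the smaller spread of the predictor yields a smaller observed correlation and thus lower power at the same sample size.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate(sd_x, n=36, slope=0.4, sd_noise=1.0, n_sims=5000):
    """Same underlying 'biological' slope in both cases; only the spread of the
    predictor (i.e. the heterogeneity of the sample) differs."""
    rs, ps = np.empty(n_sims), np.empty(n_sims)
    for i in range(n_sims):
        x = rng.normal(0.0, sd_x, n)
        y = slope * x + rng.normal(0.0, sd_noise, n)
        rs[i], ps[i] = stats.pearsonr(x, y)
    return rs.mean(), np.mean(ps < 0.05)

for sd_x, label in [(1.0, "heterogeneous sample"), (0.5, "homogeneous sample")]:
    mean_r, power = simulate(sd_x)
    print(f"{label}: mean observed r = {mean_r:.2f}, power at n=36 = {power:.2f}")
```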

Confirming the original result was a fluke is new information!

That view is problematic for two reasons. First of all, it is impossible to prove the null (yes, even for Bayesians). Science isn’t math; it doesn’t prove anything. You just collect data that may or may not be consistent with theoretical predictions. Secondly, you should never put too much confidence in any glorious new finding – even if it was highly powered (because you don’t really know that) and pre-registered (because that doesn’t prevent people from making mistakes). So your prior that the result is a fluke should be strong anyway. You don’t learn very much new from that.

What would give me new information then?

A new experiment that tests the same theory – or perhaps even a better theory. It can be a close replication but it can also be a largely conceptual one. I think this dichotomy is false. There are no true direct replications and even if there were they would be pointless. The directness of replication exists on a spectrum (I’ve said this already in a previous post). I admit that the definition of “conceptual” replications in the social priming literature is sometimes a fairly large stretch. You are free to disagree with them. The point is though that if a theory is so flaky that modest changes completely obliterate it then the onus is on the proponent of the theory to show why. In fact, this could be the new, better experiment you’re doing. This is how a replication effort can generate new hypotheses.

Please leave the poor replicators alone!

If you honestly think replicators are the ones getting a hard time I don’t know what world you live in. But that’s a story for another day, perhaps one that will be told by one of the Neuroscience Devils? The invitation to post there remains open…

Science clearly isn’t self-correcting or it wouldn’t be broken!

Apart from being a circular argument, this is also demonstrably wrong. Science isn’t a perpetual motion machine. Science is what scientists do. The fact that we are having these debates is conclusive proof that science self-corrects. I don’t see any tyrannical overlord forcing us to do any of this.

So what do you think should be done?

As I have said many times before, I think we need to train our students (and ourselves) in scientific skepticism and strong inference. We should stop being wedded to our pet theories. We need to make it easier to seek the truth rather than fame. For what it’s worth, pre-registration can probably help with that but it won’t be enough. We have to stop the pervasive idea that an experiment “worked” when it confirmed your hypothesis and failed when it didn’t. We should read Feynman’s Cargo Cult Science. And after thinking about all this negativity, wash it all down by reading (or watching) Carl Sagan to remember how many mysteries still wait to be solved in this amazing universe we inhabit.

Footnotes

*) I promise I will stop talking about this study now. I really don’t want to offend anyone…

How replication should work

Of Psychic Black Swans & Science Wizards

In recent months I have written a lot (and thought a lot more) about the replication crisis and the proliferation of direct replication attempts. I admit I haven’t bothered to quantify this, but I have the impression that most of these attempts fail to reproduce the findings they try to replicate. I can understand why this is unsettling to many people. However, as I have argued before, I find the current replication movement somewhat misguided.

A big gaping hole where your theory should be

Over the past year I have also written a lot – arguably too much – about Psi research. Most recently, I summarised my views on this in an uncharacteristically short post (by my standards) in reply to Jacob Jolij. But only very recently did I realise that my views on all of this actually converge on the same fundamental issue. On that note I would like to thank Malte Elson, with whom I discussed some of these issues at that Open Science event at UCL recently. Our conversation played a significant role in clarifying my thoughts on this.

My main problem with Psi research is that it has no firm theoretical basis and that the use of labels like “Psi” or “anomalous” or whatnot reveals that this line of research is simply about stating the obvious. There will always be unexplained data but that doesn’t prove any theory. It has now dawned on me that my discomfort with the current replication movement stems from the same problem: failed direct replications do not explain anything. They provide no theoretical advance in our knowledge about the world.

I am certainly not the first person to say this. Jason Mitchell’s treatise about failed replications covered many of the same points. In my opinion it is unfortunate that these issues have been largely ignored by commenters. Instead his post has been widely maligned and ridiculed. In my mind, this reaction was not only uncivil but really quite counter-productive to the whole debate.

Why most published research findings are probably not waterfowl

A major problem with his argument was pointed out by Neuroskeptic: Mitchell seems to hold replication attempts to a different standard than original research. While I often wonder if it is easier to incompetently fail to replicate a result than to incompetently p-hack it into existence, I agree that it is not really feasible to take that into account. I believe science should err on the side of open-minded skepticism. Thus even though it is very easy to fail to replicate a finding, the only truly balanced view is to use the same standards for original and replication evidence alike.

Mitchell describes the problems with direct replications with a famous analogy: if you want to prove the existence of black swans, all it takes is to show one example. No matter how many white swans you may produce afterwards, they can never refute the original reports. However, in my mind this analogy is flawed. Most of the effects we study in psychology or neuroscience research are not black swans. A single significant social priming effect or structural brain-behaviour correlation is not irrefutable evidence that the effect is real.

Imagine that there really were no black swans. It is conceivable that someone might parade around an apparently black swan, but maybe it’s all an elaborate hoax. Perhaps somebody just painted a white swan? Frauds of such a sensational nature are not unheard of in science, but most of us trust that they are nonetheless rare. More likely, it could be that the evidence is somehow faulty. Perhaps the swan was spotted in poor lighting conditions making it appear black. Considering how many people can disagree about whether a photo depicts a blue-and-black or a white-and-gold dress, this possibility seems entirely conceivable. Thus simply showing a black swan is insufficient evidence.

On the other hand, Mitchell is entirely correct that parading a whole flock of white swans is also insufficient evidence against the existence of black swans. The same principle applies here. The evidence could also be faulty. If we only looked at swans native to Europe we would have a severe sampling bias. In the worst case, people might be photographing black swans under conditions that make them appear white.

Definitely white and gold! (Fir0002/Flagstaffotos)

On the wizardry of cooking social psychologists

This brings us to another oft-repeated argument about direct replications: perhaps the “replicators” are just incompetent or lacking in skill. Mitchell also has an analogy for this (which I unintentionally also used in my previous post): replicators may just be bad cooks who follow the recipes but nonetheless fail to produce meals that match the beautiful photographs in the cookbooks. In contrast, Neuroskeptic referred to this tongue-in-cheek as the Harry Potter Theory: only those blessed with magical powers are able to replicate. Inept “muggles” failing to replicate a social priming effect should just be ignored.

In my opinion both of these analogies are partly right. The cooking analogy correctly points out that simply following the recipe in a cookbook does not make you a master chef. However, it also ignores the fact that the beautiful photographs in a cookbook are frequently not entirely genuine. To my knowledge, many cookbook photos are actually of cold food to circumvent problems like steam on the camera lens. Most likely the photos will have been doctored in some way and they will almost certainly be the best pick out of several cooking attempts and numerous photos. So while it is true that the cook was an expert and you probably aren’t, the photo does not necessarily depict a representative meal.

The jocular wizardry argument implies that anyone with a modicum of expertise in a research area should be able to replicate a research finding. As students we are taught that the methods sections of our research publications should allow anyone to replicate our experiments. But this is certainly not feasible: some level of expertise and background knowledge should be expected for a successful replication. I don’t think I could replicate any findings in radio astronomy, regardless of how well established they may be.

One frustration that many authors of non-replicated results have expressed to me (and elsewhere) is the implicit assumption by many “replicators” that social psychology research is easy. I am not a social psychologist. I have no idea how easy these experiments are but I am willing to give people the benefit of the doubt here. It is possible that some replication attempts overlook critical aspects of the original experiments.

However, I think one of the key points of Neuroskeptic’s Harry Potter argument applies here: the validity of a “replicator’s” expertise, that is their ability to cast spells, cannot be contingent on their ability to produce these effects in the first place. This sort of reasoning seems circular and, appropriately enough, sounds like magical thinking.

Which one is Harry Potter again? (lotr.wikia.com/wiki/Wizards)

How to fix our replicator malfunction

The way I see it both arguments carry some weight here. I believe that muggle replicators should have to demonstrate their ability to do this kind of research properly in order for us to have any confidence in their failed wizardry. When it comes to the recent failure to replicate nearly half a dozen studies reporting structural brain-behaviour correlations, Ryota Kanai suggested that the replicators should have analysed the age dependence of grey matter density to confirm that their methods were sensitive enough to detect such well-established effects. Similarly, all the large-scale replication attempts in social psychology should contain such sanity checks. On a positive note, the Many Labs 3 project included a replication of the Stroop effect and similar objective tests that fulfil such a role.

However, while such clear-cut baselines are great they are probably insufficient, in particular if the effect size of the “sanity check” is substantially greater than the effect of interest. Ideally, any replication attempt should contain a theoretical basis, an alternative hypothesis to be tested that could explain the original findings. As I said previously, it is the absence of such theoretical considerations that makes most failed replications so unsatisfying to me.

The problem is that for a lot of the replication attempts, whether they are of brain-behaviour correlations, social priming, or Bem’s precognition effects, the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking and/or questionable research practices. This seems mostly unfalsifiable. Perhaps these replication studies could incorporate control conditions/analyses to quantify the severity of p-hacking required to produce the original effects. But this is presumably unfeasible in practice because the parameter space of questionable research practices is so vast that it is impossible to derive a sufficiently accurate measure of them. In a sense, methods for detecting publication bias in meta-analysis are a way to estimate this but the evidence they provide is only probabilistic, not experimental.

Of course this doesn’t mean that we cannot have replication attempts in the absence of a good alternative hypothesis. My mentors instilled in me the view that any properly conducted experiment should be published. It shouldn’t matter whether the results are positive, negative, or inconclusive. Publication bias is perhaps the most pervasive problem scientific research faces and we should seek to reduce it, not amplify it by restricting what should and shouldn’t be published.

Rather, I believe we must change the philosophy underlying our attempts to improve science. If you disbelieve the claims of many social priming studies (and honestly, I don’t blame you!) it would be far more convincing to test a hypothesis about why the entire theory is false than to show that some specific findings fail to replicate. It would also free up a lot of resources that are currently spent on dismantling implausible ideas to actually advance scientific knowledge.

There is a reason why I haven’t tried to replicate “presentiment” experiments even though I have written about them. Well, to be honest the biggest reason is that my grant is actually quite specific as to what research I should be doing. However, if I were to replicate these findings I would want to test a reasonable hypothesis as to how they come about. I actually have some ideas about how to do that but in all honesty I simply find these effects so implausible that I don’t really feel like investing a lot of my time into testing them. Still, if I were to try a replication it would have to test an alternative theory because a direct replication is simply insufficient. If my replication failed, it would confirm my prior beliefs but not explain anything. However, if it succeeded, I probably still wouldn’t believe the claims. In other words, we wouldn’t have learned very much either way.

Those pesky replicators always fail… (en.memory-alpha.org/wiki/Replicator)

Ten-ish simple rules for scientific disagreements

All too often debates on best scientific practices descend into a chaotic mire of accusations, hurt feelings, and offended egos. As I mentioned in my previous post, I therefore decided to write a list of guidelines that I believe could help improve scientific discourse. It applies as much to replication attempts as to other scientific disagreements, say, when reanalysing someone else’s data or spotting a mistake they made.

Note that these are general points and should not be interpreted as relating to any particular case. Furthermore, I know I am no angel. I have broken many of these rules before and, fallible as I am, I may end up doing so again. I hope that I, at least, learn from them, but I hope they will also be useful for others. If you can think of additional rules please leave a comment!

1. Talk to the original authors

Don’t just send them a brusque email requesting data or asking some basic questions about their paradigm (if you want to replicate their research). I mean actually talk to them and discuss the work, ideally in person at a conference or perhaps during a visit. Obviously this won’t always be possible. Either way, be open and honest about what you’re doing, why you want the data, why you want to replicate and how. Don’t be afraid to say that you find the results implausible, especially if you can name objective reasons for it.

One of my best experiences at a conference was when a man (whose name I unfortunately can’t remember) waited for me at my poster as I arrived at the beginning of my session. He said “I’m very skeptical of your results”. We had a very productive, friendly discussion and I believe it greatly improved my follow-up research.

2. Involve the original authors in your efforts

Suggest a collaboration to the original authors. It doesn’t need to be “adversarial” in the formal sense, although it could be if you clearly disagree. But either way everyone will be better off if you work together instead of against each other. Publications are a currency in our field and I don’t see that changing anytime soon. I think you may be surprised how much more amenable many researchers will be to your replication/reanalysis effort if you both get a publication out of it.

3. Always try to be nice

If someone makes an error, point it out in an objective but friendly manner. Don’t let the manuscript you submitted to (or published in) a journal, in which you ridicule their life’s work, be the first time they hear about it. Avoid emotional language or snarky comments.

I know I have a sarcastic streak, so I have been no saint when it comes to that. I may have learned from a true master…

4. Don’t accuse people without hard evidence 

Give people the benefit of the doubt. Don’t just blithely insinuate that the original authors used questionable research practices, no matter how widespread such practices apparently are. In the end you don’t know that they engaged in bad practices unless you saw it yourself or some trustworthy whistle-blower told you. Statistics may indicate it but they don’t prove it.

5. Apologise even if (you think) you did no wrong

You might know this one if you’re married… We all make mistakes and slip up sometimes. We say things that come across as more offensive than intended. Sometimes we can find it very hard to understand what made the other person angry. Usually if you try you can empathise. It also doesn’t matter if you’re right. You should still say sorry because that is the right thing to do.

I am very sorry if I offended Eric-Jan Wagenmakers or any of his co-authors with my last post. Perhaps I spoke too harshly. I have the utmost respect for EJ and his colleagues and I do appreciate that they initiated the discussion we are having!

6. Admit that you could be wrong

We are all wrong, frequently. There are few things in life that I am very confident about. I believe scientists should be skeptical, including of their own beliefs. I remain somewhat surprised by just how much fervent conviction many scientists argue with. I’d expect this from religious icons or political leaders. To me the best thing about being a scientist is the knowledge that we just don’t really know all that much.

I find it very hard to fathom that people can foretell the future or that their voting behaviour changes months later just because they saw a tiny flag in the corner of the screen. I just don’t think that’s plausible. But I don’t know for sure. It is possible that I’m wrong about this. It has happened before.

7. Don’t hound individuals

Replicating findings in the general field is fair enough. It is even fair enough if your replication is motivated by “I just cannot believe this!”, even though this somewhat biases your expectations, which I think is problematic. But if you replicate or reanalyse some findings, try not to do it only to the same person all the time. This looks petty and like a vendetta. And take a look around you. Even if you haven’t tried to replicate any of Researcher X’s findings, there is a chance a lot of other people already have. Don’t pester that one person and force them into a multi-front war they can’t possibly win with their physical and mental health intact.

This is one of the reasons I removed my reanalysis of Bem’s (2011) precognition data from my bootstrapping manuscript.

8. Think of the children!

In the world we currently live in, researchers’ livelihoods depend on their reputation, their publication record and citations, their grant income etc. Yes, I would love for grant and hiring committees to only value truth-seekers who do creative and rigorous research. But in many cases that’s not a reality (yet?) and so it shouldn’t be surprising when some people react with panic and anger when they are criticised. It’s an instinct. Try to understand that. Saying “everyone makes mistakes” or “replication is necessary” isn’t really enough. Giving them a way to keep their job and their honour might be (see rule 2).

9. Science should seek to explain

In my opinion the purpose of scientific research is to explain how the universe works (or even just that tiny part of the universe between your ears). This is what should motivate all our research, including the research that scrutinises and/or corrects previous claims. That’s why I am so skeptical of Many Labs and most direct replication projects in general. They do not explain anything. They are designed to prove the null hypothesis, which is conceptually impossible. It is fine to disbelieve big claims, in fact that’s what scientists should do. But if you don’t believe some finding, think about why you don’t believe it and then seek to disprove it by showing how simpler explanations could have produced the same finding. Showing that you, following the same recipe as the original authors, failed to reproduce the same result is pretty weak evidence – it could just mean that you are a bad cook. And sometimes we can’t trust our own motivations. If you really disbelieve some finding, try to think what kind of evidence could possibly convince you that it is true. Then, go and test it.

Failed replication or flawed reasoning?

A few months ago a study from EJ Wagenmakers’ lab (Boekel et al. in Cortex) failed to replicate 17 structural brain-behaviour correlations reported in the published literature. The study was preregistered by uploading the study protocol to a blog and so was what Wagenmakers generally refers to as “purely confirmatory”. As Wagenmakers is also a vocal proponent of Bayesian inferential methods, they used one-tailed Bayesian hypothesis tests to ask whether their replication evidence supported the original findings. A lot has already been written about the Boekel study and I was previously engaged in a discussion on it. Therefore, in the interest of brevity (and thus the time Alexander Etz needs to spend on reading it :P) I will not cover all the details again but cut right to the chase (it is pretty long anyway, despite my earlier promises…)

Ryota Kanai, author of several of the results Boekel et al. failed to replicate, has now published a response in which he reanalyses their replication data. He shows that at least one finding (a correlation between grey matter volume in the left SPL and a measure of distractibility as quantified by a questionnaire) replicates successfully if the same methods as in his original study are used. In fact, while Kanai does not report these statistics, using the same Bayesian replication test for which Boekel reported “anecdotal” evidence* for the null hypothesis (r=0.22, BF10=0.8), Kanai’s reanalysis reveals “strong” evidence for the alternative hypothesis (r=0.48, BF10=28.1). This successful replication is further supported by a third study that replicated this finding in an independent sample (albeit with some of the same authors as the original study). Taken together this suggests that, at least for this finding, the failure to replicate may be due to methodological differences rather than the original result being spurious.

Now, disagreements between scientists are common and necessary for scientific progress. Replication is essential for healthy science. However, I feel that these days, as a field, psychology and neuroscience researchers are going about it in the wrong way. To me this case is a perfect illustration of these problems. In my next post I will summarise this one in a positive light by presenting ten simple rules for a good replication effort (and – hand on my heart – that one will be short!)

1. No such thing as “direct replication”

Recent years have seen the rise of numerous replication attempts with a particular emphasis on “direct” replications, that is, the attempt to exactly reproduce the experimental conditions that generated the original results. This is in contrast to “conceptual” replications in which a new experiment follows the spirit of a previous one but the actual parameters may be very different. So, for instance, a finding that exposing people to a tiny picture of the US flag influences their voting behaviour months in the future could be interpreted as conceptually replicating the result that people walk more slowly after they were primed with words describing the elderly.

However, I believe this dichotomy is false. The “directness” of a replication attempt is not categorical but exists on a gradual spectrum. Sure, the examples of conceptual replications from the social priming literature are quite distinct from Boekel’s attempt to replicate the brain-behaviour correlations or all the other Many Labs projects currently being undertaken with the aim of testing (or disproving?) the validity of social psychology research.

However, there is no such thing as a perfectly direct replication. The most direct replication would be an exact carbon copy of the original, with the same participants, tested at the same time in the same place under the exact same conditions. This is impossible and nobody actually wants that because it would be completely meaningless other than testing just how deterministic our universe really is. What people mean when they talk about direct replications is that they match the experimental conditions reasonably well but use an independent sample of participants and (ideally) independent experimenters. Just how “direct” the replication is depends on how closely matched the experimental parameters are. By that logic, I would call the replication attempt of Boekel et al. less direct than, say, Wagenmakers et al.’s replication of Bem’s precognition experiments. Boekel’s experiments did not match those by Kanai, at least, on a number of methodological points. However, even for the precognition replication Bem challenged Wagenmakers** on the directness of their methods because the replication attempt did not use the same software and stimuli as the original experiment.

Controversies like this reveal several issues. While you can strive to match the conditions of an original experiment as closely as possible, there will always be discrepancies. Ideally the original authors and the “replicators”*** can reach a consensus as to whether or not the discrepancies should matter. However, even if they reach such a consensus, this does not mean that the discrepancies are unimportant. Saying that “the original authors agreed to the protocol” means that a priori they made the assumption that methodological differences are insignificant. It does not mean that this assumption is correct. In the end this is an empirical question.

Discussions about failed replications are often contaminated with talk about “hidden moderators”, that is, unknown factors and discrepancies between the original experiment and the replication effort. As I pointed out under the guise of my alter ego****, I have little patience for this argument. It is counter-productive because there are always unknown factors. Saying that unknown factors can explain failures to replicate is an unfalsifiable hypothesis and a truism. The only thing that should matter in this situation is empirical evidence for additional factors. If you cannot demonstrate that your result hinges on a particular factor, this argument is completely meaningless. In the case of Bem’s precognition experiments, this could have been done by conducting an explicit experiment that compares the use of his materials with those used by Wagenmakers, ideally in the same group of participants. However, in the case of these brain-behaviour correlations this is precisely what Kanai did in his reply: he reanalysed Boekel’s data using the methods he had used originally and he found a different result. Importantly, this does not necessarily prove that Kanai’s theory about these results is correct. However, it clearly demonstrates that the failure to replicate was due to another factor that Boekel et al. did not take into account.

2. Misleading dichotomy

I also think the dichotomy between direct and conceptual replication is misleading. When people conduct “conceptual” replications the aim is different but equally important: direct replications (in so far as they exist) can test whether specific effects are reproducible, whereas conceptual replications are designed to test theories. Taking again the elderly-walking-speed and voting-behaviour priming examples from above, whether or not you believe that such experiments constitute compelling evidence for this idea, they are both experiments aiming to test the idea that subtle (subconscious?) information can influence people’s behaviour.

There is also a gradual spectrum for conceptual replication but here it depends on how general the overarching theory is that the replication seeks to test. These social priming examples clearly test a pretty diffuse theory of subconscious processing. By the same logic one could say that all of the 17 results scrutinised by Boekel test the theory that brain structure shares some common variance with behaviour. This theory is not only vague but so generic that it is almost meaningless. If you honestly doubt that there are any structural links between brain and behaviour, may I recommend checking some textbooks on brain lesions or neurodegenerative illness in your local library?

A more meaningful conceptual replication would be to show that the same grey matter volume in the SPL not only correlates with a cognitive failure questionnaire but with other, independent measures of distractibility. You could even go a step further and show that this brain area is somehow causally related to distraction. In fact, this is precisely what Kanai’s original study did.

I agree that replicating actual effects (i.e. what is called “direct” replication) is important because it can validate the existence of previous findings and – as I described earlier – help us identify the factors that modulate them. You may however also consider ways to improve your methodology. A single replication with a demonstrably better method (say, better model fits, higher signal-to-noise ratios, or more precise parameter estimates) is worth a hundred direct replications from a Many Labs project. In any of these cases, the directness of your replication will vary.

In the long run, however, conceptual replication that tests a larger overarching theory is more important than showing that a specific effect exists. The distinction between these two is very blurred though. It is important to know what factors modulate specific findings to derive a meaningful theory. Still, if we focus too much on Many Labs direct replication efforts, science will slow down to a snail’s pace and waste an enormous amount of resources (and taxpayer money). I feel that these experiments are largely designed to deconstruct the social priming theory in general. And sure, if the majority of these findings fail to replicate in repeated independent attempts, perhaps we can draw the conclusion that current theory is simply wrong. This happens a lot in science – just look at the history of phrenology or plate tectonics or our model of the solar system.

However, wouldn’t it be better to replace subconscious processing theory with a better model that actually describes what is really going on than to invest years of research funds and effort to prove the null hypothesis? As far as I can tell, the current working theory about social priming by most replicators is that social priming is all about questionable research practices, p-hacking, and publication bias. I know King Ioannidis and his army of Spartans show that the situation is dire***** – but I am not sure it is realistically that dire.

3. A fallacious power fallacy

Another issue with the Boekel replications, which is also discussed in Kanai’s response, is that the sample size they used was very small. For the finding that Kanai reanalysed, the sample size was only 36. Across the 17 results they failed to replicate, their sample sizes ranged between 31 and 36. This is in stark contrast to the majority of the original studies, many of which used samples well above 100. Only for one of the replications, which was of one of their own findings, did Boekel et al. use a sample that was actually larger (n=31) than that in the original study (n=9). It seems generally accepted that larger samples are better, especially for replication attempts. A recent article recommended a sample size for replications two and a half times larger than the original. This may be a mathematical rule of thumb but it is hardly realistic, especially for neuroimaging experiments.

Thus I can understand why Boekel et al. couldn’t possibly have done their experiment on hundreds of participants. However, you would think that a direct replication effort should at least try to match the sample size of the original study, not use one that is four times smaller. In our online discussions Wagenmakers explained the small sample by saying that they “simply lacked the financial resources” to collect more data. I do not think this is a very compelling argument. Using the same logic I could build a lego version of the Large Hadron Collider in my living room but fail to find the Higgs boson – only to then claim that my inadequate methodology was due to the lack of several billion dollars in my bank account******

I must admit I sympathise a little with Wagenmakers here because it isn’t like I never had to collect more data for an experiment than I had planned (usually this sort of optional stopping happens at the behest of anonymous peer reviewers). But surely you can’t just set out to replicate somebody’s research, using a preregistered protocol no less, with a wholly inadequate sample size? As a matter of fact, their preregistered protocol states that the structural data for this project (which is the expensive part) had already been collected and that the maximum sample of 36 was pre-planned. While they left “open the possibility of testing additional participants” they opted not to do so, even though the evidence for half of the 17 findings remained inconclusive (more on this below). Presumably this was, as they say, because they ran “out of time, money, or patience.”

In the online discussion Wagenmakers further states that power is a pre-experimental concept and refers to another publication by him and others in which they describe a “power fallacy.” I hope I am piecing together their argument accurately in my own head. Essentially, statistical power tells you how probable it is that you can detect evidence for a given effect with your planned sample size. It thus quantifies the probabilities across all possible outcomes given these parameters. I ran a simulation to do this for the aforementioned correlation between left SPL grey matter and cognitive failure questionnaire scores. I drew 10,000 samples of 36 participants each from a bivariate Gaussian distribution with a correlation of rho=0.38 (i.e. the observed correlation coefficient in Kanai’s study). I also repeated this for the null hypothesis, drawing similar samples from an uncorrelated Gaussian distribution. The histograms in the figure below show the distributions of the 10,000 Bayes factors, calculated using the same replication test as Boekel et al., for the alternative hypothesis (red) and the null hypothesis (blue).
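(For the curious, here is a minimal Python sketch of this kind of simulation. I am not reproducing the exact replication Bayes factor test Boekel et al. used; the function below is a crude stand-in based on Fisher’s z-transform and normal approximations, and the original sample size is a placeholder I made up purely to set the width of the prior, so the resulting percentages will not match the figures quoted below exactly.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)

n_rep, rho_true, n_sims = 36, 0.38, 10_000   # settings described in the text
r_orig, n_orig = 0.38, 145                   # n_orig is a hypothetical placeholder, only used for the prior width

def approx_replication_bf(r_rep, n_rep, r_orig, n_orig):
    """Crude stand-in for a replication Bayes factor: the likelihood of the
    replication correlation under the (approximate) posterior of the original
    study versus under rho = 0, everything on Fisher's z scale."""
    z_rep, z_orig = np.arctanh(r_rep), np.arctanh(r_orig)
    se_rep, se_orig = 1 / np.sqrt(n_rep - 3), 1 / np.sqrt(n_orig - 3)
    m1 = stats.norm.pdf(z_rep, loc=z_orig, scale=np.hypot(se_rep, se_orig))  # H1: original posterior as prior
    m0 = stats.norm.pdf(z_rep, loc=0.0, scale=se_rep)                        # H0: rho = 0
    return m1 / m0

def simulate(rho, n, n_sims):
    cov = [[1.0, rho], [rho, 1.0]]
    bfs = np.empty(n_sims)
    for i in range(n_sims):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        bfs[i] = approx_replication_bf(np.corrcoef(x, y)[0, 1], n, r_orig, n_orig)
    return bfs

bf_alt, bf_null = simulate(rho_true, n_rep, n_sims), simulate(0.0, n_rep, n_sims)
print("P(BF10 > 3   | rho = 0.38):", np.mean(bf_alt > 3))
print("P(BF10 < 1/3 | rho = 0.00):", np.mean(bf_null < 1 / 3))
```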

Histograms of Bayes factors in favour of alternative hypothesis (BF10) over 10,000 simulated samples of n=36 with rho=0.38 (red curve) and rho=0 (blue curve).

Out of those 10,000 simulations in the red curve only about 62% would exceed the BF10=3 criterion for more than “anecdotal” evidence. Thus even if the effect size originally reported in Kanai’s study had been a perfect estimate of the true population effect (which is highly improbable), only in somewhat less than two thirds of replication experiments should you expect conclusive evidence supporting the alternative hypothesis. The peak of the red distribution is in fact very close to the anecdotal criterion. With the exception of the study by Xu et al. (which I am in no position to discuss) this result was also one of the most highly powered experiments in Boekel’s study: as I showed in the online discussion, the peaks of the expected Bayes factors for the other correlations were all below the anecdotal criterion. To me this suggests that the pre-planned power of these replication experiments was wholly insufficient to give the replication a fighting chance.

Now, Wagenmakers’ reasoning about the “power fallacy”, however, is that after the experiment is completed power is a meaningless concept. It doesn’t matter what potential effect sizes (and thus Bayesian evidence) one could have gotten if one repeated the experiment infinitely. What matters are the results and the evidence they did find. It is certainly true that a low-powered experiment can produce conclusive evidence in favour of a hypothesis – for example, the proportion of simulations at the far right end of the red curve would very compellingly support H1, while the simulations forming the peak of the blue curve would afford reasonable confidence that the null hypothesis is true. Conversely, a high-powered experiment can still fail to provide conclusive evidence. This essentially seems to be Wagenmakers’ argument about the power fallacy: just because an experiment had low power doesn’t necessarily mean that its results are uninterpretable.

However, in my opinion this argument serves to obfuscate the issue. I don’t believe that Wagenmakers is doing this on purpose but I think that he has himself fallen victim to a logical fallacy. It is a non-sequitur. While it is true that low-powered experiments can produce conclusive evidence, this does not mean that the evidence they actually produced is conclusive. In fact, it is the beauty of Bayesian inference that it allows quantification of the strength of evidence. The evidence Boekel et al. observed was inconclusive (“anecdotal”) in 9 of the 17 replications. Only in 3 was the evidence for the null hypothesis anywhere close to “strong” (i.e. a BF10 below 1/10 or very close to it).

Imagine you want to test if a coin is biased. You flip it once and it comes up heads. What can you conclude from this? Absolutely nothing. Even though the experiment has been completed, it was obviously underpowered. The nice thing about Bayesian inference is that it reflects that fact.
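(To make that concrete: with a uniform prior on the coin’s bias – again my own assumption – a single flip yields a Bayes factor of exactly 1, i.e. no evidence either way.)

```python
from scipy.special import beta as beta_fn

# One flip, one head: marginal likelihood under H1 (uniform prior on the bias)
m1 = beta_fn(2, 1)   # = integral of p dp from 0 to 1 = 1/2
m0 = 0.5             # likelihood of one head under H0 (fair coin)
print(m1 / m0)       # BF10 = 1.0 -- the single flip tells us nothing at all
```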

4. Interpreting (replication) evidence

You can’t have it both ways. Either you take Bayesian inference to its logical conclusion and interpret the evidence you get according to Bayesian theory, or you shouldn’t use it. Bayes factor analysis has the potential to be a perfect tool for statistical inference. Had Boekel et al. observed a correlation coefficient near 0 in the replication of the distractibility correlation, they would have been right to conclude (in the context of their test) that the evidence supports the null hypothesis with reasonable confidence.

Now a close reading of Boekel’s study shows that the authors were in fact very careful in how they worded the interpretation of their results. They say that they “were unable to successfully replicate any of these 17 correlations”. This is entirely correct in the context of their analysis. What they do not say, however, is that they were also unable to refute the previously reported effects even though this was the case for over half of their results.

Unfortunately, this sort of subtlety is entirely lost on most people. The reaction of many commenters on the aforementioned blog post, on social media, and in personal communications was to interpret this replication study as a demonstration that these structural brain-behaviour correlations have been conclusively disproved. This is in spite of the statement in the actual article that “a single replication cannot be conclusive in terms of confirmation or refutation of a finding.” On social media I heard people say that “this is precisely what we need more of,” and you can almost feel the unspoken, gleeful satisfaction of many commenters that yet more findings by famous and successful researchers have been “debunked.”

Do we really need more low-powered replication attempts and more inconclusive evidence? As I described above, a solid replication attempt can actually inform us about the factors governing a particular effect, which in turn can help us formulate better theories. This is what we need more of. We need more studies that test assumptions but that also take all the available evidence into account. Many of these 17 brain-behaviour correlations came with internal replications in the original studies. As far as I can tell, these were not incorporated in Boekel’s analysis (although they mentioned them). For some of the results, independent replications – or at least related studies – had been published months earlier, and it seems odd that Boekel et al. didn’t discuss at least those.
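
What “taking all the available evidence into account” could look like, in its simplest possible form, is a fixed-effect pooling of the original, internal, and independent replication correlations via Fisher’s z transform. This is a generic sketch, not Boekel et al.’s method, and the r and n values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical (r, n) pairs: original study, internal replication, independent replication
studies = [(0.38, 36), (0.30, 40), (0.10, 36)]
z = np.array([np.arctanh(r) for r, _ in studies])   # Fisher's z transform
w = np.array([n - 3 for _, n in studies])           # inverse-variance weights (var of z = 1/(n-3))
z_pooled = np.sum(w * z) / np.sum(w)
se = 1.0 / np.sqrt(np.sum(w))
print(f"pooled r = {np.tanh(z_pooled):.3f}, "
      f"95% CI = [{np.tanh(z_pooled - 1.96*se):.3f}, {np.tanh(z_pooled + 1.96*se):.3f}]")
```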

Moreover, some results, like Kanai’s distractibility correlation, were accompanied in the original paper by additional tests of the causal link between the brain area and behaviour. To my mind, it is scientifically far more important to test those questions in detail than simply to ask whether the original MRI results can be reproduced.

5. Communicating replication efforts

I think there is also a more general problem with how the results of replication efforts are communicated. Replication should be a natural component of scientific research. All too often, failed replications result in mudslinging contests, heated debates, and sometimes in inappropriate comparisons of replication authors with video game characters. Some talk about how the reputation of the original authors is hurt by failed replications.

It shouldn’t have to be this way. Good scientists also produce non-replicable results, and even geniuses can believe in erroneous theories. However, the way our publishing and funding system works, as well as our general human emotions, predisposes us to these unfortunate disagreements.

I don’t think you can place the blame for such arguments solely on the authors of the original studies. Because scientists are human beings, the way you talk to them influences how they respond. Personally, I think the reports of many high-profile replication failures suffer from a lack of social awareness. In that sense, the discussion surrounding the Boekel replications has actually been very amicable. There have been far worse cases in which the entire research programs of some authors were denigrated and ridiculed on social media, sometimes while the replication efforts were still ongoing. I’m not going to delve into that. Perhaps one of the Neuroscience Devils wants to pick up that torch in the future.

However, even the Boekel study shows how this communication could have been done with more tact. The first sentences of the Boekel article read as follows:

“A recent ‘crisis of confidence’ has emerged in the empirical sciences. Several studies have suggested that questionable research practices (QRPs) such as optional stopping and selective publication may be relatively widespread. These QRPs can result in a high proportion of false-positive findings, decreasing the reliability and replicability of research output.”

I know what Boekel et al. are trying to say here. EJ Wagenmakers has a declared agenda to promote “purely confirmatory” research in which experimental protocols are preregistered. There is nothing wrong with this per se. However, surely the choice of language here is odd? Preregistration is not the most relevant aspect of the Boekel study; it could have been done without it. It is fine to argue in the article for why it is necessary, but actually opening the article with a discussion of the replication crisis in the context of questionable research practices is all too easily (mis?)interpreted as an accusation. Whatever the intentions may have been, starting the article in this manner immediately places a spark of doubt in the reader’s mind and primes them to consider the original studies as being of a dubious nature. In fact, in the online debate Wagenmakers went a step further to suggest (perhaps somewhat tongue-in-cheek) that:

“For this particular line of research (brain-behavior correlations) I’d like to suggest an exploration-safeguard principle (ESP): after collecting the imaging data, researchers are free to analyze these data however they see fit. Data cleaning, outlier rejection, noise reduction: this is all perfectly legitimate and even desirable. Crucially, however, the behavioral measures are not available until after completion of the neuroimaging data analysis. This can be ensured by collecting the behavioral data in a later session, or by collaborating with a second party that holds the behavioral data in reserve until the imaging analysis is complete. This kind of ESP is something I can believe in.”

This certainly sounds somewhat accusatory to me, and quite frankly it is a bit offensive. I am all in favour of scientific skepticism, but that is not the same as baseless suspicion. Having once been on the receiving end of a particularly bad reviewer 2 who made similar unsubstantiated accusations (and in fact ignored evidence to the contrary), I can relate to people who would be angered by this. For one thing, such procedures are common in many labs conducting experiments like this. Having worked with Ryota Kanai in the past, I have a fairly good idea of the meticulousness of his research. I also have great respect for EJ Wagenmakers and I don’t think he actually meant to offend anyone. Still, I think statements like this could easily cause offence, and I think it speaks well of Kanai’s character that he didn’t take any here.

There is a better way. This recently published failure to replicate a link between visually induced gamma oscillation frequency and resting occipital GABA concentration is a perfect example of a well-written replication failure. There is no paranoid language about replication crises and p-hacking, just a simple, factual account of the research question and the results. In my opinion, this exposition certainly facilitated the rather calm reaction to the publication.

6. Don’t hide behind preregistration

Of course, the question about optional stopping and outcome-dependent analysis (I think that term was coined by Tal Yarkoni) could be avoided by preregistering the experimental protocols (indeed, at least some of these original experiments were almost certainly preregistered in departmental project reviews). Contrary to what some may think, I am not opposed to preregistration as such. In fact, I fully intend to try it.

However, there is a big problem with this, which Kanai also discusses in his response. As a peer reviewer, he actually recommended that Boekel et al. use the same analysis pipeline he has now employed to test for the effects. The reason Boekel et al. did not do this is that these methods were not part of the preregistered protocol. However, this did not stop them from employing other non-registered methods, which they report as exploratory analyses. In fact, we are frequently told that preregistration does not preclude exploration. So why not here?

Moreover, preregistration is in the first instance designed to help authors control the flexibility of their experimental procedure. It should not be used as a justification for refusing to perform essential analyses when reviewers ask for them. In this case, a cynic might say that Boekel et al. in fact did these analyses and chose not to report them because the results were inconsistent with the message they wanted to argue. I do not believe this to be the case, but it’s an example of how unfounded accusations can go both ways in these discussions.

If this is how preregistration is handled in the future, we are in danger of slowing down scientific progress substantially. Had Boekel et al. performed these additional analyses (which should have been part of the preregistered protocol in the first place), this would have saved Kanai the time of doing them himself. Both he and Boekel et al. could have done something more productive with their time (and so could I, for that matter :P).

It doesn’t have to go this way, but we must be careful. If we allow this line of reasoning about preregistration, we may be able to stop the Texas sharpshooter from bragging, but we will also break his rifle. It will then take much longer and much more ammunition than necessary to finally hit the bull’s-eye.

Simine Vazire-style footnotes:

*) I actually dislike categorical labels for Bayesian evidence. I don’t think we need them.

**) This is a pre-print manuscript. It keeps changing with on-going peer review so this statement may no longer be true when you read this.

***) Replicators is a very stupid word but I can’t think of a better, more concise one.

****) Actually, this post was my big slip-up as Devil’s Neuroscientist. In that one a lot of Sam’s personality shone through, especially in the last paragraph.

*****) I should add that I am merely talking about the armies of people pointing out the proneness to false positives. I am not implying that all the researchers I linked to here agree with one another.

******) To be fair, I probably wouldn’t be able to find the Higgs boson even if I had the LHC.