Recently I have spent a lot of time writing about replication and why I feel current “direct” replication efforts are often missing the point. For some reason it is a lot harder than it should be to get my point across. It is being misconstrued at every step and various straw man arguments are debated instead. Whatever the reasons for this may be, I want to try one last time before I go on a break. Perhaps I can communicate my thoughts more clearly by means of a parable…
The magical coin
Professor Fluke returns from a journey to the tropical islands of the South Pacific. On a beach there he found the coin depicted above. One side shows a Polynesian deity. The other side bears the likeness of an ancient queen. Prof Fluke flips the strange coin 10 times and it lands on tails, the side with the fierce Polynesian god, every time. He is surprised, so he does it again. This time it lands on tails 6 times. Seeing that overall 80% of the flips (16 out of 20) landed on tails, and that this is clearly beyond the traditional significance threshold of p<0.05, Fluke publishes a brief communication in a high impact journal to report that the coin is biased. He admits he doesn’t have a good theory for what is happening. The discovery is widely reported on the news, partly due to the somewhat overhyped press release written by Fluke’s university. “Scientists discover magical coin” the headlines read. A disturbingly successful tabloid writes that the coin will cure cancer.
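As an aside for readers who want to check Fluke's arithmetic: the parable doesn't say which test he ran, so the sketch below assumes an exact two-sided binomial test against a fair coin. The function name is made up for illustration.

```python
from math import comb

def binom_two_sided_p(tails: int, flips: int) -> float:
    """Exact two-sided p-value against a fair coin (assumed test, p = 0.5).

    For a fair coin the distribution is symmetric, so the two-sided
    p-value is twice the smaller tail probability, capped at 1.
    """
    upper = sum(comb(flips, k) for k in range(tails, flips + 1)) / 2**flips
    lower = sum(comb(flips, k) for k in range(0, tails + 1)) / 2**flips
    return min(1.0, 2 * min(upper, lower))

p_first_ten = binom_two_sided_p(10, 10)  # first run: 10 tails out of 10
p_pooled = binom_two_sided_p(16, 20)     # both runs pooled: 16 tails out of 20
print(f"first run: p = {p_first_ten:.4f}, pooled: p = {p_pooled:.4f}")
```

Under that assumption the pooled 16 out of 20 gives a p-value of roughly 0.012, which is indeed below 0.05.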
An earnest replication
Dr Earnest, a vocal proponent of Bayesian inference and a prolific replicator, doesn’t believe Prof Fluke’s sensationalist claims. She decides to replicate Fluke’s results. Unfortunately, she lacks the funds to fly to the south seas so she decides to craft a replica of the coin closely based on the description by Prof Fluke. Despite all the effort that went into preparing the experiment, she only flips the coin five times. It lands on tails three times. While this is above chance levels, under the Bayesian framework the result actually weakly favours the null hypothesis (BF10=0.53). Even though these results aren’t very conclusive, Dr Earnest publishes this as a failure to replicate Fluke’s magical coin. The finding spreads like wildfire all over social media. People say the “Controversial magical coin was debunked!” and that “We need more replications like this!” It doesn’t take long for numerous anonymous commenters – who know nothing about coins let alone about coin flipping – to declare on internet forums that Prof Fluke is just a “bad scientist”. Some even accuse him of cheating.
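The BF10 of 0.53 can be reproduced under one particular assumption, namely that the alternative hypothesis puts a uniform prior on the coin's bias. The story doesn't state which prior Dr Earnest used, so treat the following as a minimal sketch under that assumption; the function name is hypothetical.

```python
from math import comb

def bf10_uniform(tails: int, flips: int) -> float:
    """Bayes factor for H1 (biased coin, uniform prior) over H0 (fair coin)."""
    # Assumption: a uniform Beta(1, 1) prior on the probability of tails.
    # Under that prior every tails count from 0 to `flips` is equally
    # likely, so the marginal likelihood of the data is 1 / (flips + 1).
    marginal_h1 = 1 / (flips + 1)
    # Under H0 the data follow a binomial distribution with p = 0.5.
    likelihood_h0 = comb(flips, tails) * 0.5**flips
    return marginal_h1 / likelihood_h0

bf = bf10_uniform(3, 5)  # Dr Earnest: 3 tails out of 5 flips
print(round(bf, 2))
```

A value below 1 favours the fair coin, but a Bayes factor this close to 1 mostly says that five flips cannot tell the two hypotheses apart.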
Are you flipping crazy?
Another group of 10 researchers is understandably skeptical of Fluke’s magical coin. They each decide to flip a coin 20 times so that there will be many more trials than ever before and the replication will have much greater statistical power. Even though they all formally agree to do the same experiment, they don’t: eight members of the consortium craft replicas of the coin just like Dr Earnest did. Another, Dr Cook, travels to the south seas and a native gives him a coin that looks just like Prof Fluke’s. Finally, one replicator, Dr Friendly, directly talks to Prof Fluke, who agrees to an adversarial collaboration using the actual coin he found.
All 10 of them start tossing coins. Overall the data suggest nothing much is going on. Out of the 200 coin tosses, 99 land on tails – almost perfectly at chance and, if anything, the effect goes in the opposite direction. However, Dr Friendly, who actually used Fluke’s coin, observes 14 tails out of 20. While this isn’t very strong evidence, it is not inconsistent with Fluke’s earlier findings. The consortium publishes a meta-analysis of all 200 coin flips stating that the evidence clearly shows that such coins are fair.
Prof Fluke and Dr Friendly, however, also publish their own results separately. As with most adversarial collaborations, in the discussion section they starkly disagree in their interpretation of the very same finding. Dr Friendly states that the coin is most likely fair. Fluke disagrees and also discloses a methodological detail that was missing from his earlier publication. He left it out because of the strict word limits imposed by the high impact journal and also because he didn’t think then that it should matter: his original 20 coin flips were all performed on the tropical beach right where he found the coin. All of the replications were done someplace else.
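To see how pooling can bury the one coin that was actually Fluke's, here is a rough sketch that computes a Bayes factor for the pooled flips, for Dr Friendly's flips alone, and for the other nine replicators combined (99 - 14 = 85 tails out of 180 flips). It assumes a uniform prior on the coin's bias, which the story does not specify, and the function name is hypothetical.

```python
from math import comb

def bf10_uniform(tails: int, flips: int) -> float:
    """Bayes factor for H1 (biased coin, uniform prior) over H0 (fair coin)."""
    # Assumption: uniform Beta(1, 1) prior on the probability of tails,
    # which makes the marginal likelihood under H1 equal to 1 / (flips + 1).
    marginal_h1 = 1 / (flips + 1)
    likelihood_h0 = comb(flips, tails) * 0.5**flips
    return marginal_h1 / likelihood_h0

results = {
    "all ten replicators pooled": (99, 200),
    "Dr Friendly, Fluke's coin": (14, 20),
    "the other nine replicators": (99 - 14, 200 - 20),
}

for label, (tails, flips) in results.items():
    print(f"{label}: {tails}/{flips} tails, BF10 = {bf10_uniform(tails, flips):.2f}")
```

With these assumptions the pooled data favour a fair coin by a factor of more than ten (BF10 below 0.1), while Dr Friendly's 14 out of 20 gives a BF10 of about 1.3, which is barely evidence in either direction.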
The coin tossing crisis
Nobody takes Fluke’s arguments seriously. All over social media and even in formal publications this is discussed as a clear failure to replicate, and people suggest that his findings were probably p-hacked. “It’s obvious,” some commenters say, “Fluke just did a few hundred coin flips but only reported the significant ones.” Some scientists take another coin that depicts a salmon and flip it twice. It lands on fish-heads both times. They present a humorous poster at a conference to illustrate the problems with underpowered coin flipping experiments. Countless direct replication efforts are underway to test previous coin tossing results. To increase statistical power some researchers decide to run their experiments online where they can quickly reach larger sample sizes. Most people ignore the problem that tossing bitcoins might not be theoretically equivalent to doing it in the flesh.
To make matters worse, a few high profile cases of fraudulent coin flippers are revealed. Popular science news outlets write damning editorials about the “reproducibility crisis.” A group of statisticians led by Professor Eager reanalyses all the coin flips reported in the coin flipping literature to reveal that most studies probably did not report all their non-significant findings. Advocates of Bayesian methods counter those claims by saying that you can’t make claims about probabilities after the fact. Unfortunately, nobody really understands what they’re saying so the findings by Eager et al. are still cited widely.
The mainstream news media now continuously report on this “crisis.” Someone hacks into the email server of Prof Fluke’s university and digs out a statement that, when taken wildly out of context, sounds like all researchers are part of a global coin tossing conspiracy. The disturbingly successful tabloid publishes an article saying that the magical coin causes cancer. Public faith in science is undermined. In parts of the US, it is added to the school curriculum that children must learn that “Coin tossing is just a theory.” People stop vaccinating their children and regulations/treaties put in place to counteract climate change are dismantled. Soon thousands die from preventable diseases while millions get sick from polluted air and water…
The next generation
A few years later Prof Fluke dies of the flu. The epidemic caused by anti-vaxxers is only partly to blame. His immune system was simply weakened by all the stress caused by the replication debate. His name has become eponymous with false positives. People chuckle and joke about him whenever topics like p-hacking and questionable research practices are discussed. After the coin tossing debacle he could no longer get research grants and he failed to get tenure – impoverished and shunned by the scientific community, he couldn’t purchase any medicine.
Despite his mother’s warnings, Prof Fluke’s son decides to become a scientist. For obvious reasons, he decides to take his husband’s name when he gets married, so his name is now Dr Curious. Partly driven by an interest in the truth but also by a desire to exonerate his father’s name, Dr Curious takes the coin and travels to the South Pacific. He goes to the very beach where his father found the fateful coin and flips it. Ten out of ten tails! He does it again and observes the same result.
However, despite the possibility that this could prove his father right, he thinks it’s all too good to be true. He knows he will need extraordinary evidence to support extraordinary claims. He tries it a third time and this time he flips it 30 times. This time there is a gust of wind so he only gets 20 tails out of 30 coin tosses. It tends to be windy on Pacific beaches. This makes the temperature pleasant but it is not conducive to running good coin flipping studies.
A well-controlled experiment
To reduce measurement error in future experiments, he decides that any coin toss during which there is a gust of wind will be excluded. Dr Curious also vaguely remembers something an insane blogger wrote many years ago and includes some control conditions in his experiment. He brought along Dr Earnest’s replica coin. He also got an identical looking coin from one of the locals on the island and, last but not least, he brought a different coin from home. Dr Curious decides to do 100 coin flips per coin. Finally, because he fears people might not believe him otherwise, he preregisters this experimental protocol by means of a carrier albatross (internet connections on the island are too slow and too expensive).
The results of his coin flipping experiment are clear. After removing any trials with wind, the “magical” coin falls on tails almost all the time (52 tails out of 55 flips). On the three occasions it lands on heads, it could have been that he didn’t flip it well (this can really happen…). Strikingly, however, he observes very similar results for the other local coin, and the results are even more extreme (60 tails out of 61 flips). Neither the replica coin nor the standard coin from home performs this way; both show results that are very consistent with random chance.
Dr Curious is very pleased with his findings. He decides to return home and runs one more control experiment: it is an exact replication of his experiment but now he will do it in his lab. He again preregisters the protocol (this time via the internet). All four coins produce results that are not significantly different from chance levels. He publishes his findings arguing that both the place and the right type of coin are necessary to replicate his father’s findings.
The Fluke Effect
Our heroic scientist is however naturally curious (indeed that is his full name) so he is not satisfied with that outcome. He hands over the coins to his collaborator who will subject the coins to a full metallurgic analysis. In the meantime, Dr Curious flies back to the tropical island. He quickly confirms that he still gets similar results on the beach when using a local coin but not with one of the replicas.
Another thought crosses his mind. He goes into the jungle on the island, far from the beach, and repeats his coin tosses. The finding does not replicate. All coin flips are consistent with chance expectations. Mystified, he returns to the beach. He takes a bucket full of sand from the beach into the jungle and tries again. Now the local coin falls on tails every time. “Eureka!” shouts Dr Curious, like no other scientist before him. “It’s all about the sand!”
He takes some of the sand home with him. His colleague has since discovered that the local coins are subtly magnetic. Now they also establish that the sand is somewhat magnetic. Whenever the coin is flipped over the sand it tends to fall on tails. The coin clearly wasn’t magical, in fact it wasn’t even special. It was just like all the other coins on the island. Dr Curious and his colleague have yet to figure out why the individual grains of sand don’t stick to the coin when you pick it up but they are confident that science will find the answer eventually. It is clearly a hitherto unknown form of magnetism. In honour of his father the effect is called Fluke’s Attraction.
Years later, Dr Curious watches a documentary about this on holographic television presented by Nellie deGrasse Tyson, who inherited both the down-to-earth charm and the natural good looks of her great-grandfather. She explains that while Prof Fluke’s interpretation of his original findings was wrong because he lacked some necessary control conditions, he nonetheless deserves credit for the discovery of a new physical phenomenon that brought about many scientific advances, like holographic television and hover-cars. The story of Fluke’s Attraction is but one example of why persistence and inquisitiveness are essential to scientific progress. It shows that many experiments can be flawed yet nonetheless lead to breakthroughs eventually. Happily, Dr Curious falls asleep on the couch…
An alternate ending?
He dreams he is back on the tropical beach. His experiment with the four different coins fails to replicate his father’s finding. All the coins perform around chance even when there is no wind. He tries it over and over but the results are the same. He is forced to conclude that the original findings were completely spurious. There is no Fluke’s Attraction. The islanders’ coins behave just like any other coins…
Drenched in sweat and with a pounding heart Curious awakes from his nightmare. It takes him a moment to realise it was just a dream. Fluke’s Attraction is real. His father’s name has been exonerated and appears in all science textbooks.
But after taking a few deep breaths Curious realises that in the big picture it doesn’t matter. Just then Nellie says on the holo-vision:
“Flukes happen all the time. The most important lesson is not that the effect turned out to be real but that Curious went back to the island and ran a well-controlled experiment to test a new hypothesis. Of course he could have failed to replicate his father’s findings. But nonetheless he would have learned something new about the world: that it doesn’t matter which coins you use or whether you flip them on the beach.
“An infinite number of replications with replica coins – or even with the real coin – could not have done that. Yet all it took to reveal another piece of the truth was one inquisitive researcher who asked ‘What if…?’”
For my own sanity’s sake I hope this will be my last post on replication. In the meantime, you may also enjoy this very short post by Lenny Teytelman about how the replication crisis isn’t a real crisis.