As promised, here is a post about science stuff, finally back to a more cheerful and hopeful topic than the dreadful state the world outside science is in right now…
A Dutch research funding agency recently announced a new grant initiative that exclusively funds replication attempts. The idea is to support replication efforts of particularly momentous “cornerstone” research findings. It’s not entirely clear what this means but presumably such findings include highly cited findings, those with great media coverage and public policy impact etc. It isn’t clear who determines whether a finding falls under this.
You can read about this announcement here. In that article you can see some comments by me on how I think funders should encourage replications by requiring that new grant proposals should also contain some replication of previous work. Like most people I believe replication to be one of the pillars supporting science. Before we treat any discovery as important we must know that it is reliable and meaningful. We need to know how far it generalizes or whether it is fickle and subject to minor changes in experimental parameters. If you have read anything I have written about replication, you will probably already know my view on this: most good research is built on previous findings. This is how science advances. You take some previously observed results and use them to generate new hypotheses to be tested in a new experiment. In order to do so, you should include a replication and/or sanity check condition in your new experiment. This is precisely the suggestion Richard Feynman made in his famous Cargo Cult Science lecture.
Imagine somebody published a finding that people perceive the world as darker when they listen to sad classical music (let’s ignore for the moment the inherent difficulty in actually demonstrating such an effect…). You now want to ask if they also perceive the world as darker when they listen to dark metal. If you simply run the same experiment but replace the music, any result you find will be inconclusive. If you don’t find any perceptual effect, it could be that your participant sample simply isn’t affected by music. The only way to rule this out is to also include the sad classical music condition in your experiment to test whether this claim actually replicates. Importantly, even if you do find a perceptual effect of dark metal music, the same problem applies. While you could argue that this is a conceptual replication, if you don’t know whether you could actually replicate the original effect of classical music, it is impossible to know that you really found the same phenomenon.
My idea is that when applying for funding we should be far more explicit about how the proposal builds on past research and, insofar as this is feasible, build more replication attempts into the proposed experiments. Critically, if you fail to replicate those experiments, this would in itself be an important finding that should be added to the scientific record. The funding thus implicitly sets aside some resources for replication attempts to validate previous claims. However, this approach also supports the advance of science because every proposal is nevertheless designed to test novel hypotheses. This stands in clear contrast to pure replication efforts such as those this Dutch initiative advocates or the various large-scale replication efforts like the RPP and Many Labs project. While I think these efforts clearly have value, one major concern I have with them is that they seem to stall scientific progress. They highlighted a lack of replicability in the current literature and it is undoubtedly important to flag that up. But surely this cannot be the way we will continue to do science from now on. Should we have a new RPP every 10 years now? And who decides which findings should be replicated? I don’t think we should really care whether every single surprising claim is replicated. Only the ones that are in fact in need of validation because they have an impact on science and society probably need to be replicated. But determining what makes a cornerstone discovery is far from trivial.
That is not to say that such pure replication attempts should no longer happen or that they should receive no funding at all. If anyone is happy to give you money to replicate some result, by all means do so. However, my suggestion differs from these large-scale efforts and the Dutch initiative in that it treats replication the way it should be treated, as an essential part of all research, rather than as a special effort that is somehow separate from the rest. Most research would only be funded if it is explicit about which previous findings it builds on. This inherently also answers the question of which previous claims should be replicated: only those findings that are deemed important enough by other researchers to motivate new research are sufficiently important for replication attempts.
Perhaps most crucially, encouraging replication in this way will help to break down the perceived polarization between the replicators and original authors of high-impact research claims. While I doubt many scientists who published replications actually see themselves as a “replication police,” we continue to rehash these discussions. Many replication attempts are also being suspected to be motivated by mistrust in the original claim. Not that there is really anything wrong with that because surely healthy skepticism is important in science. However, whether justified or not, skepticism of previous claims can lead to the perception that the replicators were biased and the outcome of the replication was a self-fulfilling prophecy. My suggestion would mitigate this problem at least to a large degree because most grant proposals would at least seek to replicate results that have a fighting chance of being true.
In the Nature article about this Dutch initiative there are also comments from Dan Gilbert, a vocal critic of the large-scale replication efforts. He bemoans that such replication research is based on its “unoriginality” and suspects that we will learn more about the universe by spending money on “exploring important new ideas.” I think this betrays the same false dichotomy I described above. I certainly agree with Gilbert that the goal of science should be to advance our understanding of the world but originality is not really the only objective here. Scientific claims must also be valid and generalize beyond very specific experimental contexts and parameters. In my view, both are equally important for healthy science. As such, there is not a problem with the Dutch initiative but it seems rather gimmicky to me and I am unconvinced its effects will be lasting. Instead I believe the only way to encourage active and on-going replication efforts will be to overhaul the funding structure as I outlined here.
I mentioned the issue of data quality before but reading Richard Morey’s interesting post about standardised effect sizes the other day made me think about this again. Yesterday I gave a lecture discussing Bem’s infamous precognition study and the meta-analysis he recently published of the replication attempts. I hadn’t looked very closely at the meta-analysis data before but for my lecture I produced the following figure:
This shows the standardised effect size for each of the 90 results in that meta-analysis split into four categories. On the left in red we have the ten results by Bem himself (nine of which are his original study and one is a replication of one of them by himself). Next, in orange we have what they call ‘exact replications’ in the meta-analysis, that is, replications that used his program/materials. In blue we have ‘non-exact replications’ – those that sought to replicate the paradigms but didn’t use his materials. Finally, on the right in black we have what I called ‘different’ experiments. These are at best conceptual replications because they also test whether precognition exists but use different experiment protocols. The hexagrams denote the means across all the experiments in each category (these are non-weighted means but it’s not that important for this post).
While the means for all categories are evidently greater than zero, the most notable thing should be that Bem’s findings are dramatically different from the rest. While the mean effect sizes in the other categories are below or barely at 0.1 and there is considerable spread below zero in all of them, all ten of Bem’s results are above zero and, with one exception, above 0.1. This is certainly very unusual and there are all sorts of reasons we could discuss for why this might be…
But let’s not. Instead let’s assume for the sake of this post that there is indeed such a thing as precognition and that Daryl Bem simply knows how to get people to experience it. I doubt that this is a plausible explanation in this particular case – but I would argue that for many kinds of experiments such “experimenter effects” are probably notable. In an fMRI experiment different labs may differ considerably in how well they control participants’ head motion or even simply in terms of the image quality of the MRI scans. In psychophysical experiments different experimenters may differ in how well they explain the task to participants or how meticulous they are in ensuring that they really understood the instructions, etc. In fact, the quality of the methods surely must matter in all experiments, whether they are in astronomy, microbiology, or social priming. Now this argument has been made in many forms, most infamously perhaps in Jason Mitchell’s essay “On the emptiness of failed replications” that drew much ire from many corners. You may disagree with Mitchell on many things but not on the fact that good methods are crucial. What he gets wrong is laying the blame for failed replications solely at the feet of “replicators”. Who is to say that the original authors didn’t bungle something up?
However, it is true that all good science should seek to reduce noise from irrelevant factors to obtain as clean observations as possible of the effect of interest. Using again Bem’s precognition experiments as an example, we could hypothesise that he indeed had a way to relax participants to unlock their true precognitive potential that others seeking to replicate his findings did not. If that were true (I’m willing to bet a fair amount of money that it isn’t but that’s not the point), this would indeed mean that most of the replications – failed or successful – in his meta-analysis are only of low scientific value. All of these experiments would be more contaminated by noise and confounds than his experiments; thus only he would provide clean measurements. Standardised effect sizes like Cohen’s d divide the absolute raw effect by a measure of uncertainty or dispersion in the data. The dispersion is a direct consequence of the noise factors involved. So it should be unsurprising that the effect size is greater for experimenters who are better at eliminating unnecessary noise.
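To make the point about standardised effect sizes concrete, here is a minimal sketch. Everything in it is made up for illustration: the two "labs", the raw effect of 0.5 and the noise levels are hypothetical, and I use a one-sample Cohen's d (mean divided by standard deviation) purely for simplicity. Both labs measure exactly the same raw effect and differ only in how much irrelevant noise they add.

```python
import random
import statistics


def cohens_d(scores):
    """One-sample Cohen's d: the mean divided by the sample standard deviation."""
    return statistics.mean(scores) / statistics.stdev(scores)


def simulate_lab(raw_effect, noise_sd, n, rng):
    """Hypothetical lab: the same raw effect for everyone plus lab-specific noise."""
    return [raw_effect + rng.gauss(0.0, noise_sd) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Identical raw effect, but the "sloppy" lab adds three times as much noise.
careful_lab = simulate_lab(raw_effect=0.5, noise_sd=1.0, n=2000, rng=rng)
sloppy_lab = simulate_lab(raw_effect=0.5, noise_sd=3.0, n=2000, rng=rng)

d_careful = cohens_d(careful_lab)
d_sloppy = cohens_d(sloppy_lab)
print(f"careful lab: d = {d_careful:.2f}")
print(f"sloppy lab:  d = {d_sloppy:.2f}")
```

The raw effect is identical in both cases, so any difference in d here comes entirely from the denominator. Averaging the two values would describe neither lab's measurement particularly well.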
Statistical inference seeks to estimate the population effect size from a limited sample. Thus, a meta-analytic effect size is an estimate of the “true” effect size from a set of replications. But since this population effect includes the noise from all the different experimenters, it does not actually reflect the true effect? The true effect is people’s inherent precognitive ability. The meta-analytic effect size estimate is spoiling that with all the rubbish others pile on with their sloppy Psi experimentation skills. Surely we want to know the former not the latter? Again, for precognition most of us will probably agree that this is unlikely – it seems more trivially explained by some Bem-related artifact – but in many situations this is a very valid point: Imagine one researcher manages to produce a cure for some debilitating disease but others fail to replicate it. I’d bet that most people wouldn’t run around shouting “Failed replication!”, “Publication bias!”, “P-hacking!” but would want to know what makes the original experiment – the one with the working drug – different from the rest.
The way I see it, meta-analysis of large-scale replications is not the right way to deal with this problem. Meta-analyses of one lab’s replications are worthwhile, especially as a way to summarise a set of conceptually related experiments – but then you need to take them with a grain of salt because they aren’t independent replications. But large-scale meta-analyses across different labs don’t really tell us all that much. They simply don’t estimate the effect size that really matters. The same applies to replication efforts (and I know I’ve said this before). This is the point on which I have always sympathised with Jason Mitchell: you cannot conclude a lot from a failed replication. A successful replication that nonetheless demonstrates that the original claim is false is another story but simply failing to replicate some effect only tells you that something is (probably) different between the original and the replication. It does not tell you what the difference is.
Sure, it’s hard to make that point when you have a large-scale project like Brian Nosek’s “Estimating the reproducibility of psychological science” (I believe this is a misnomer because they mean replicability not reproducibility – but that’s another debate). Our methods sections are supposed to allow independent replication. The fact that so few of their attempts produced significant replications is a great cause for concern. It seems doubtful that all of the original authors knew what they were doing and so few of the “replicators” did. But in my view, there are many situations where this is not the case.
I’m not necessarily saying that large-scale meta-analysis is entirely worthless but I am skeptical that we can draw many firm conclusions from it. In cases where there is reasonable doubt about differences in data quality or experimenter effects, you need to test these differences. I’ve repeatedly said that I have little patience for claims about “hidden moderators”. You can posit moderating effects all you want but they are not helpful unless you test them. The same principle applies here. Rather than publishing one big meta-analysis after another showing that some effect is probably untrue or, as Psi researchers are wont to do, in an effort to prove that precognition, presentiment, clairvoyance or whatever are real, I’d like to see more attempts to rule out these confounds.
In my opinion the only way to do this is through adversarial collaboration. If an honest skeptic can observe Bem conduct his experiments, inspect his materials, and analyse the data for themselves and yet he still manages to produce these findings, that would go a much longer way convincing me that these effects are real than any meta-analysis ever could.
TL;DR: Leave-one-out cross-validation is a bad way for testing the predictive power of linear correlation/regression.
Correlation and regression analyses are popular tools in neuroscience and psychology research for analysing individual differences. They fit a model (most typically a linear relationship between two measures) to infer whether the variability in some measure is related to the variability in another measure. Revealing such relationships can help understand the underlying mechanisms. We and others used them in previous studies to test specific mechanistic hypotheses linking brain structure/function and behaviour. They also form the backbone of twin studies of heritability that in turn can implicate genetic and experiential factors in some trait. Most importantly, in my personal view individual differences are interesting because they acknowledge the fact that every human being is unique rather than simply treating variability as noise and averaging across large groups of people.
But typically every report of a correlational finding will be followed by someone zealously pointing out that “Correlation does not imply causation”. And doubtless it is very important to keep that in mind. A statistical association between two variables may simply reflect the fact that they are both related to a third, unknown factor or a correlation may just be a fluke.
Another problem is that the titles of studies using correlation analysis sometimes use what I like to call “smooth narrative” style. Saying that some behaviour is “predicted by” or “depends on” some brain measure makes for far more accessible and interesting reading than dryly talking about statistical correlations. However, it doesn’t sit well with a lot of people, in part because such language may imply a causal link that the results don’t actually support. Jack Gallant seems to regularly point out on Twitter that the term “prediction” should only ever be used when a predictive model is built on some data set but the validity is tested on an independent data set.
Recently I came across an interesting PubPeer thread debating this question. In it, one commenter pointed out that the title of the study under discussion, “V1 surface size predicts GABA concentration”, was unjustified because this relationship explains only about 7% of the variance when using a leave-one-out cross-validation procedure. In this procedure all data points except one are used to fit the regression and the left-out point is then used to evaluate the fit of the model. This procedure is then repeated n times so that every data point serves as evaluation data once.
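For readers who prefer code to words, the leave-one-out procedure just described can be sketched in a few lines. This is only my illustration, not the code used in the PubPeer thread or in the study; the function names and the toy data are made up.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = cross / ss_x
    return slope, mean_y - slope * mean_x


def loo_predictions(x, y):
    """Predict each y[i] from a regression fitted to all the other points."""
    predictions = []
    for i in range(len(x)):
        # Leave out data point i when fitting the model ...
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        # ... and use the fitted line to predict the left-out point.
        predictions.append(slope * x[i] + intercept)
    return predictions


# Made-up toy data, only to show the call.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]
for observed, predicted in zip(y, loo_predictions(x, y)):
    print(f"observed {observed:.1f}, predicted {predicted:.2f}")
```

The cross-validated variance explained is then computed from how well these predictions match the observed values.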
Taken at face value this approach sounds very appealing because it uses independent data for making predictions and for testing them. Replication is a cornerstone of science and in some respects cross-validation is an internal replication. So surely this is a great idea? Naive as I am I have long had a strong affinity for this idea.
Cross-validation underestimates predictive power
But not so fast. These notions fail to address two important issues (both of which some commenters on that thread already pointed out): first, it is unclear what amount of variance a model should explain to be important. 7% is not very much but it can nevertheless be of substantial theoretical value. The amount of variance that can realistically be explained by any model is limited by the noise in the data that arises from measurement error or other distortions. So in fact many studies using cross-validation to estimate the variance explained by some models (often in the context of model comparison) instead report the amount of explainable variance accounted for by the model. To derive this one must first estimate the noise ceiling, that is, the realistic maximum of variance that can possibly be explained. This depends on the univariate variability of the measures themselves.
Second, the cross-validation approach is based on the assumption that the observed sample, which is then subdivided into model-fitting and evaluation sets, is a good representation of the population parameters the analysis is attempting to infer. As such, the cross-validation estimate also comes with an error (this issue is also discussed by another blog post mentioned in that discussion thread). What we are usually interested in when we conduct scientific studies is to make an inference about the whole population, say a conclusion that can be broadly generalised to any human brain, not just the handful of undergraduate students included in our experiments. This does not really fit the logic of cross-validation because the evaluation is by definition only based on the same sample we collected.
Because I am a filthy, theory-challenged experimentalist, I decided to simulate this (and I apologise to all my Bayesian friends for yet again conditioning on the truth here…). For a range of sample sizes between n=3 and n=300 I drew a sample from a population with a fixed correlation of rho=0.7 and performed leave-one-out cross-validation to quantify the variance explained by it (using the squared correlation between predicted and observed values). I also performed a standard regression analysis and quantified the variance explained by that. At each sample size I did this 1000 times and then calculated the mean variance explained for each approach. Here are the results:
What is immediately clear is that the results strongly depend on the sample size. Let’s begin with the blue line. This represents the variance explained by the standard regression analysis on the whole observed sample. The dotted, black, horizontal line denotes the true effect size, that is, the variance explained by the population correlation (so R^2=49%). The blue line starts off well above the true effect but then converges on it. This means that at small sample sizes, especially below n=10, the observed sample inflates how much variance is explained by the fitted model.
Next look at the red line. This denotes the variance explained by the leave-one-out cross-validation procedure. This also starts off above the true population effect and follows the decline of the observed correlation. But then it actually undershoots and goes well below the true effect size. Only then it gradually increases again and converges on the true effect. So at sample sizes that are most realistic in individual differences research, n=20-100ish, this cross-validation approach underestimates how much variance a regression model can explain and thus in fact undervalues the predictive power of the model.
The error bars in this plot denote +/- 1 standard deviation across the simulations at each sample size. So as one would expect, the variability across simulations is considerable when sample size is small, especially when n <= 10. These sample sizes are maybe unusually small but certainly not unrealistically small. I have seen publications calculating correlations on such small samples. The good news here is that even with such small samples on average the effect may not be inflated massively (let’s assume for the moment that publication bias or p-hacking etc are not an issue). However, cross-validation is not reliable under these conditions.
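If you want to play with this yourself, the simulation described above can be sketched roughly as follows. This is a reconstruction from the description in this post, not the original code: the bivariate normal population, the function names, the seed, and the reduced number of sample sizes and repetitions (to keep pure Python quick) are all my assumptions.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    cross = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return cross / math.sqrt(ss_a * ss_b)


def draw_sample(n, rho, rng):
    """Draw n pairs from a bivariate normal population with correlation rho."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rho * xi + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for xi in x]
    return x, y


def loo_r_squared(x, y):
    """Squared correlation between leave-one-out predictions and observed y."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        m = len(x_train)
        mean_x, mean_y = sum(x_train) / m, sum(y_train) / m
        ss_x = sum((u - mean_x) ** 2 for u in x_train)
        cross = sum((u - mean_x) * (v - mean_y) for u, v in zip(x_train, y_train))
        slope = cross / ss_x
        predictions.append(mean_y + slope * (x[i] - mean_x))
    return pearson_r(predictions, y) ** 2


def simulate(n, rho, reps, seed=0):
    """Mean variance explained by (whole-sample regression, leave-one-out CV)."""
    rng = random.Random(seed)
    observed_total = cv_total = 0.0
    for _ in range(reps):
        x, y = draw_sample(n, rho, rng)
        observed_total += pearson_r(x, y) ** 2
        cv_total += loo_r_squared(x, y)
    return observed_total / reps, cv_total / reps


# A few sample sizes only; the post used n=3 to n=300 with 1000 repetitions each.
for n in (5, 20, 50):
    observed, cross_validated = simulate(n, rho=0.7, reps=200)
    print(f"n={n}: observed R^2={observed:.3f}, cross-validated R^2={cross_validated:.3f}")
```

Changing `rho` in the call to `simulate` gives the other scenarios discussed in this post.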
A correlation of rho=0.7 is unusually strong for most research. So I repeated this simulation analysis using a perhaps more realistic effect size of rho=0.3. Here is the plot:
Now we see a hint of something fascinating: the variance explained by the cross-validation approach actually subtly exceeds that of the observed sample correlation. They again converge on the true population level of 9% when the sample size reaches n=50. Actually there is again an undershoot but it is negligible. But at least for small samples with n <= 10 the cross-validation certainly doesn’t perform any better than the observed correlation. Both massively overestimate the effect size.
When the null hypothesis is true…
So if this is what is happening at a reasonably realistic rho=0.3, what about when the null hypothesis is true? This is what is shown here (I apologise for the error bars extending into the impossible negative range but I’m too lazy to add that contingency to the code…):
The problem we saw hinted at above for rho=0.3 is exacerbated here. As before, the variance explained for the observed sample correlation is considerably inflated when sample size is small. However, for the cross-validated result this situation is much worse. Even at a sample size of n=300 the variance explained by the cross-validation is greater than 10%. If you read the PubPeer discussion I mentioned, you’ll see that I discussed this issue. This result occurs because when the null hypothesis is true – or the true effect is very weak – the cross-validation will produce significant correlations between the inadequately fitted model predictions and the actual observed values. These correlations can be positive or negative (that is, the predictions systematically go in the wrong direction) but because the variance explained is calculated by squaring the correlation coefficient they turn into numbers substantially greater than 0%.
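The squaring issue can be checked with a small sketch of the null case. Again this is my own illustration rather than the original simulation code; the sample size, the number of repetitions and the seed are arbitrary choices. It reports the mean signed correlation between leave-one-out predictions and observed values alongside the mean squared correlations, so you can see the direction of the correlations before squaring hides it.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    cross = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return cross / math.sqrt(ss_a * ss_b)


def loo_correlation(x, y):
    """Signed correlation between leave-one-out predictions and observed y."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        m = len(x_train)
        mean_x, mean_y = sum(x_train) / m, sum(y_train) / m
        ss_x = sum((u - mean_x) ** 2 for u in x_train)
        cross = sum((u - mean_x) * (v - mean_y) for u, v in zip(x_train, y_train))
        predictions.append(mean_y + (cross / ss_x) * (x[i] - mean_x))
    return pearson_r(predictions, y)


def simulate_null(n, reps, seed=0):
    """Return mean observed r^2, mean signed LOO r and mean LOO r^2 for rho=0."""
    rng = random.Random(seed)
    observed_r2 = cv_r = cv_r2 = 0.0
    for _ in range(reps):
        # Under the null hypothesis x and y are drawn independently.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r_loo = loo_correlation(x, y)
        observed_r2 += pearson_r(x, y) ** 2
        cv_r += r_loo
        cv_r2 += r_loo ** 2
    return observed_r2 / reps, cv_r / reps, cv_r2 / reps


mean_observed_r2, mean_cv_r, mean_cv_r2 = simulate_null(n=20, reps=300)
print(f"observed R^2:        {mean_observed_r2:.3f}")
print(f"signed LOO r:        {mean_cv_r:.3f}")
print(f"cross-validated R^2: {mean_cv_r2:.3f}")
```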
As I discussed in that thread, there is another way to calculate the variance explained by the cross-validation. I won’t go into detail on this but unlike the simpler approach I employed here this does not limit the variance explained to fall between 0-100%. While the estimates are numerically different, the pattern of results is qualitatively the same. At smaller sample sizes the variance explained by cross-validation systematically underestimates the true variance explained.
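For completeness, here is a sketch of one such alternative. I am assuming here that the measure in question is the usual "one minus residual over total sum of squares" computed on the leave-one-out predictions; the function names and the example data are made up. Because the prediction errors can exceed the total variability of the data, this number is not bounded at zero.

```python
def loo_predictions(x, y):
    """Predict each y[i] from a straight line fitted to all the other points."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        m = len(x_train)
        mean_x, mean_y = sum(x_train) / m, sum(y_train) / m
        ss_x = sum((u - mean_x) ** 2 for u in x_train)
        cross = sum((u - mean_x) * (v - mean_y) for u, v in zip(x_train, y_train))
        predictions.append(mean_y + (cross / ss_x) * (x[i] - mean_x))
    return predictions


def predictive_r_squared(x, y):
    """1 - (squared LOO prediction error) / (total sum of squares of y)."""
    predictions = loo_predictions(x, y)
    mean_y = sum(y) / len(y)
    ss_error = sum((obs - pred) ** 2 for obs, pred in zip(y, predictions))
    ss_total = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_error / ss_total


# Made-up data in which x tells us next to nothing about y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(f"predictive R^2: {predictive_r_squared(x, y):.2f}")  # comes out negative
```

A negative value simply means that the cross-validated predictions do worse than just guessing the sample mean for everybody.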
When the interocular traumatic test is significant…
My last example is the opposite scenario. While we already looked at an unusually strong correlation, I decided to also simulate a case where the effect should be blatantly obvious. Here rho=0.9:
Unsurprisingly, the results are similar to those seen for rho=0.7 but now the observed correlation is already doing a pretty decent job at reaching the nominal level of 81% variance explained. Still, the cross-validation underperforms at small sample sizes. In this situation, this actually seems to be a problem. It is rare that one would observe a correlation of this magnitude in psychological or biological sciences but if so chances are good that the sample size is small in that case. Often the reason for this may be that correlation estimates are inflated at small sample sizes but that’s not the point here. The point is that leave-one-out cross-validation won’t tell you. It underestimates the association even if it is real.
Where does all this leave us?
It is not my intention to rule out cross-validation. It can be a good approach for testing models and is often used successfully in the context of model comparison or classification analysis. In fact, as the debate about circular inference in neuroscience a few years ago illustrated, there are situations where it is essential that independent data are used. Cross-validation is a great way to deal with overfitting. Just don’t let yourself be misled into believing it can tell you something it doesn’t. I know it is superficially appealing and I had played with it previously for just that reason – but this exercise has convinced me that it’s not as bullet-proof as one might think.
Obviously, validation of a model with independent data is a great idea. A good approach is to collect a whole independent replication sample but this is expensive and may not always be feasible. Also, if a direct replication is performed it seems better that this is acquired independently by different researchers. A collaborative project could do this in which each group uses the data acquired by the other group to test their predictive model. But that again is not something that is likely to become regular practice anytime soon.
In the meantime we can also remember that performing typical statistical inference is a good approach after all. Its whole point is to infer the properties of the whole population from a sample. When used properly it tends to do a good job at that. Obviously, we should take measures to improve its validity, such as increasing power by using larger samples and/or better measurements. I know I am baysed but Bayesian hypothesis tests seem superior to traditional significance testing at ensuring validity. Registered Reports can probably also help and certainly should reduce the skew caused by publication bias and flexible analyses.
So, does correlation imply prediction? I think so. Statistically this is precisely what it does. It uses one measure (or multiple measures) to make predictions of another measure. The key point is not whether calling it a prediction is valid but whether the prediction is sufficiently accurate to be important. The answer to this question actually depends considerably on what we are trying to do. A correlation explaining 10-20% of the variance in a small sample is not going to be a clear biomarker for anything. I sure as hell wouldn’t want any medical or judicial decisions to be based solely on such an association. But it may very well be very informative about mechanisms. It is a clearly detectable effect even with the naked eye.
In the context of these analyses, a better way than quantifying the variance explained is to calculate the root mean squared deviation (essentially the error bar) of the prediction. This actually provides a much more direct index of how accurately one variable predicts another. The next step – and I know I sound like a broken record – should be to confirm that these effects are actually scientifically plausible. This mantra is true for individual differences research as much as it is for Bem’s precognition and social priming experiments where I mentioned it before. Are the differences in neural transmission speed or neurotransmitter concentration implied by these correlation results realistic based on what we know about the brain? These are the kinds of predictions we should actually care about in these discussions.
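As a rough illustration of the root mean squared deviation, here is a minimal sketch. The function names and the numbers are made up, and for simplicity I compute it on the same sample the line was fitted to; the same formula could just as well be applied to predictions for held-out data.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = cross / ss_x
    return slope, mean_y - slope * mean_x


def rmsd_of_prediction(x, y):
    """Root mean squared deviation between observed y and the fitted line."""
    slope, intercept = fit_line(x, y)
    squared_errors = [(yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up measurements; the result is in the same units as y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]
print(f"typical prediction error: {rmsd_of_prediction(x, y):.2f} (in units of y)")
```

Because the result is expressed in the units of the predicted measure, it can be compared directly against what would count as a meaningful difference in that measure.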
(7th Aug 2015: I edited the section ‘Unsupported assumptions’ because I realised my earlier comments didn’t make sense)
Fate (or most likely coincidence) just had it that soon after my previous post, the first issue in my “Journal of Disbelief”, someone wrote a long post about social priming. This very long and detailed post by Michael Ramscar is well worth a read. In it he discusses the question of failed replications and goes into particular depth on an alternative explanation for why Bargh’s elderly priming experiments failed to replicate. He also criticises the replication attempt. Since I discuss both these studies in detail (in my last post and several others) and because it pertains to my general skepticism of social priming, I decided to respond. I’d have done it on his blog but that doesn’t seem to be open for comments – so here I go instead:
With regard to replication, he argues that a lot of effort is essentially wasted on repeating these priming studies that could be put to better use to advance scientific knowledge. I certainly agree with this notion to some extent and I have argued similar points in the past. He then carries out a seemingly (I say this because I have neither the time nor the expertise to verify it) impressive linguistic analysis suggesting that the elderly priming study was unlikely to be replicated so many years after the original study because the use of language has undoubtedly evolved since the late 80s/early 90s when the original study was conducted (in fact, as he points out, many of the participants in the replication study were not even born when the original one was done). His argument is essentially that the words to prime the “schema of old age” in Bargh’s original study could no longer be effective as primes.
Ramscar further points out that the replication attempt was conducted on French-speaking participants and goes to some lengths to show that French words are unlikely to exert the same effect. This difference could very well be of great importance and in fact there may be very many reasons why two populations from which we sample our participants are not comparable (problems that may be particularly serious when trying to generalise results from Western populations to people with very different lifestyles and environments, like native tribes in Sub-Saharan Africa or the Amazon etc.). This however ignores that there have been other replication attempts of this experiment that also failed to show this elderly priming effect. I am aware of at least one that was done on English-speaking participants. Although as this was also done only a few years ago the linguistic point presumably still stands.
How about No?
The first thought I had when reading Ramscar’s hypothesis about how elderly priming works and why we shouldn’t expect it to replicate in modern samples was that this sounds like the complicated explanatory handwaving that people all too often engage in when their “experiment didn’t work” (meaning that their hypothesis wasn’t confirmed). I often encounter this when marking student project reports but it would be grossly unfair to make this out to be a problem specific to students. Rather you can often see this even from very successful, tenured researchers. While some of this behaviour is probably natural (and thus it makes sense why many students write things like this) I think the main blame for this lies with how we train students to approach their results, both in words and action. The problem is nicely summarised by the title of James Alcock’s paper “Give the Null Hypothesis a Chance”. While he wrote this as a critique of Psi research (which I may or may not cover in a future issue of the Journal of Disbelief – I kind of feel I’ve written enough about Psi…), I think it would serve us all well to remember that sometimes our beautifully crafted hypotheses may simply not be correct. To me this is also the issue with social priming.
Now, I get the impression Michael Ramscar does not necessarily believe that this linguistic account is the only explanation for the failure to replicate Bargh’s finding. He may very well accept that the result may have been a fluke. I am also being vastly unjust to liken his detailed post to “handwaving”. Considering to what lengths he goes to produce data about word frequencies his post is anything but handwavy. But whether or not it is entirely serious, it is out there and deserves some additional thoughts.
The way I see it, the linguistic explanation is based on a whole host of assumptions that probably do not hold. As I said in my previous post, we need more Occam’s Razor. Often mischaracterised as “the simplest explanation is usually correct”, what it really states (at least in my interpretation) is that the explanation that requires the smallest number of assumptions whilst producing the maximal explanatory power is probably closest to the truth. The null hypothesis that there is no such thing as social priming (or if it exists, it is much, much weaker than these underpowered experiments could possibly hope to detect) seems to me far more likely than the complex explanation posited by Ramscar’s post.
Why should we expect the words most frequently associated with old age (which he argues is – or rather was – the word ‘old’) to produce the strongest age priming effect? Couldn’t that just as well lead to habituation? The most effective words for priming the old age schema may be more obscure ones that nevertheless strongly evoke thoughts about the elderly. I agree that even in the US ‘Florida’ and ‘bingo’ don’t necessarily cut it in that respect (and my guess is outside the US ‘Florida’ mainly evokes images of beaches and palm trees and possibly cheesy 80s cop dramas). Others, like ‘retired’ and ‘grey’, very well might though. And words like ‘rigid’ could very well evoke the concept of slowness. The idea that word frequency determines the strength of such priming effects is a mostly unsupported assumption.
Another possibility is that the aggregate of the 30 primes is highly non-linear. By this I mean that the combined effect may be a lot more than the average (or even the sum) of individual priming effects. To me it actually seems quite likely that any activation of the concept of old age would only gradually build up over the course of the experiment. So essentially, the word ‘old’ may have no effect on its own but in combination with all the other words it might clearly evoke the schema. Of course, on the other hand I find it quite hard to fathom that one little word in each of thirty sentences – sentences that by themselves may be completely unrelated to old age – will produce a very noticeable effect on an irrelevant behaviour after the experiment is over.
The discussion of semantic priming effects, such as that reading/hearing the word ‘doctor’ might make you more likely to think of ‘nurse’ than of ‘commissar’, is a perfect example of the reasons I was describing in my last post why I think social priming hypotheses (at least the more fanciful ones based on abstract ideas and folk sayings) are highly implausible. How strong are semantic priming effects? How likely do you think it is that a social priming effect like that shown by Bargh could be even half that strong? Surely there must be numerous additional factors that exert their tugs and pulls on your unsuspecting mind, many of which must be stronger than the effect of some words you form into a sentence. I realise that these noise factors must cancel out with a sufficiently large data set – but they add variability that will dilute the effect size we can possibly expect from such manipulations.
In my opinion the major problem with the whole theoretical idea behind social priming is that it just seems rather unfathomable to me that human beings and human society could function at all, if the effects were so strong, so long-lasting and so simple as claimed by much of this research. I don’t buy into the idea that these effects can only be produced reliably under laboratory conditions and only by skilled researchers. I know I can’t hope to replicate a particle physics experiment both for lack of lab equipment and lack of expertise. I can believe that conducting social priming (or, while I’m at it again, Psi) experiments may require some training and experience. However, at the same time, if these effects are so real and so obvious as these researchers claim, they should be easier to replicate than something that requires years of practical training, thorough knowledge of maths and theoretical physics, and a million-dollar hadron collider. Psychology labs may reduce some noise in human behaviour compared to, say, doing your experiments on street corners but they are not dust-free rooms or telescopes outside of the Earth’s atmosphere. The subjects that come in to your lab remain heavily contaminated with all the baggage and noisiness that only the human mind is capable of. If effects as impressive as those in the social priming literature were real, human beings should bounce around the world as if they were inside a pinball machine.
So … what about replications again?
Finally, as I said, I kind of agree with Ramscar about replication attempts. I think a lot of direct replications are valid to establish the robustness of some effects. I am not sure that it really makes sense to repeat the elder-priming experiments though. Not that I don’t appreciate Doyen et al.’s and Hal Pashler’s attempts to replicate this experiment. However, as I have tried to argue, the concepts in many social priming studies are simply so vague and complex that one probably can’t learn all that much. I entirely accept Ramscar’s point that different times, different populations, and (most critically) different languages might make an enormous difference here. The possible familiarity of research subjects with the original experiment may further complicate matters. And unless the experimenters are suitably blinded to the experimental condition of each participant (which I don’t think is always the case), there may be further problems with demand effects etc.
A lot of the rebuttals by original authors of failed social priming replications seem to revolve around the point that while specific experiments don’t replicate this does not mean the whole theory is invalid. There have been numerous findings of social priming in the literature. However, even I, having written extensively about what I think is misguided about the current wave of replication attempts, would say that the sheer number of failed social priming replications should pose a serious problem for advocates of that theory.
But this is where I think social psychologists need to do better. I think rather than more direct replications of social priming we need more conceptual replication attempts that try to directly address this question:
Can social priming effects of this magnitude be real?
I don’t believe that they can but perhaps I am too skeptical. I can only tell you that I won’t be convinced of the existence of social priming (or Psi) by yet more underpowered, possibly p-hacked studies by researchers who just know how to get these effects. Especially not if the effects are so large that they seem vastly incompatible with the way it appears our behaviour works. Maybe I am relying more on my gut here than on my brain but when faced with the choice between a complex theory based on numerous (typically posthoc) assumptions and the notion that these effects just don’t exist, I know which I’d choose…
Unless you have been living on the Moon in recent years, you will have heard about the replication crisis in science. Some will want to make you believe that this problem is specific to psychology and neuroscience but similar discussions also plague other research areas from drug development and stem cell research to high energy physics. However, since psychology deals with flaky, hard-to-define things like the human mind and the myriad of behaviours it can produce, it is unsurprising that reproducibility is perhaps a greater challenge in this field. As opposed to, say, an optics experiment (I assume, I may be wrong about that), there are often just too many factors at play to conduct a controlled experiment that produces clear results.
Science is based on the notion that the world, and the basic laws that govern it, are for the most part deterministic. If you have situation A under condition X, you should get a systematic change to situation B when you change the condition to Y. The difference may be small and there may be a lot of noise meaning that the distinction between situations A and B (or between conditions X and Y for that matter) isn’t always very clear-cut. Nevertheless, our underlying premise remains that there is a regularity that we can reveal provided we have sufficiently exact measurement tools and are able and willing to repeat the experiment a sufficiently large number of times. Without causality as we know it, that is the assumption that a manipulation will produce a certain effect, scientific inquiry just wouldn’t work.
There is absolutely nothing wrong with wanting to study complicated phenomena like human thought or behaviour. Some people seem to think that if you can’t study every minute detail of a complex system like the brain and don’t understand what every cell and ion channel is doing at any given time, you have no hope of revealing anything meaningful about the larger system. This is nonsense. Phenomena exist at multiple scales and different levels and a good understanding of all the details is not a prerequisite for understanding some of the broader aspects, which may in fact be more than the sum of their parts. So by all means we should be free to investigate lofty hypotheses about consciousness, about how cognition influences perception (and vice versa), or about whether seemingly complex attributes like political alignment or choice of clothing relate to simple behaviours and traits. But when we do we should always keep in mind the complexity of the system we study and whether the result is even plausible under our hypothesis.
Whatever you may feel about Bayesian statistics, the way I see it there should be a Bayesian approach to scientific reasoning. Start with an educated guess on what effects one might realistically expect under a range of different hypotheses. Then see under which hypothesis (or hypotheses) the observed results are most likely. In my view, a lot of scientific claims fall flat in that respect. Note that this doesn’t necessarily mean that these claims aren’t true – it’s just that the evidence for them so far has been far from convincing and I will try to explain why I feel that way. I also don’t claim to be immune to this delusion. Some of my own hypotheses are probably also scientifically implausible. The whole point of research is to work this out and arrive at the most parsimonious explanation for any effect. We need more Occam’s Razor instead of the Sherlock Holmes Principle.
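To make that reasoning a little more concrete, here is a rough sketch in Python of what weighing a result against several hypotheses could look like. Everything in it is an assumption for illustration: the function name, the candidate effect sizes, the priors, and the crude approximation that the standard error of Cohen’s d is about sqrt(2/n) for n participants per group. It is not an analysis of any real study.

```python
import math
from statistics import NormalDist


def posterior_over_hypotheses(observed_d, n_per_group, hypothesised_ds, priors):
    """Posterior probability of each hypothesised effect size given an observed d.

    Assumes the observed d is roughly normally distributed around the true d
    with standard error sqrt(2 / n_per_group) (a crude small-effect approximation).
    """
    se = math.sqrt(2 / n_per_group)
    likelihoods = [NormalDist(mu=d, sigma=se).pdf(observed_d) for d in hypothesised_ds]
    weighted = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Hypothetical numbers: "no effect", "tiny effect", "effect as large as reported".
    hypotheses = [0.0, 0.1, 0.6]
    # Hypothetical priors encoding scepticism about large effects.
    priors = [0.6, 0.35, 0.05]
    for label, p in zip(hypotheses, posterior_over_hypotheses(0.6, 15, hypotheses, priors)):
        print(f"d = {label}: posterior probability {p:.2f}")
```

The point of the toy example is only that a single small study shifts the balance between hypotheses much less than a headline effect size suggests, because with few participants the data are fairly likely under all of them.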
So my next few posts will be about scientific claims and ideas I just can’t believe. I was going to post them as a list but there turns out to be so much material that I think it’s better to expand this across several posts…
Part I: Social Priming
If you have read any of my blog posts before, you will know that I am somewhat critical of the current mainstream (or is it only a vocal minority?) of direct replication advocates. Note that this does not make me an opponent of replication attempts. Replication is a cornerstone of science and it should be encouraged. I am not sure if there are many people who actually disagree with that point and who truly regard the currently fashionable efforts as the work of a “replication police.” All I have been saying in the past is that the way we approach replication could be better. In my opinion a good replication attempt should come with a sanity check or some control condition/analysis that can provide evidence that the experiments were done properly and that the data are sound. Ideally, I want the distinction between replications and original research to be much blurrier than it currently is. The way I see it, both direct and indirect replications should be done regularly as part of most original research. I know that this is not always feasible and when you simply fail to replicate some previous finding and also don’t support your new hypothesis, you should obviously still be permitted to publish (certainly if you can demonstrate that the experiment worked and you didn’t just produce rubbish data).
Some reasonably credible claims
But this post isn’t about replication. Whether or not the replication attempts have been optimal, the sheer number of failed social priming (I may be using this term loosely) replications is really casting the shadow of serious doubt on that field. The reason for this, in my mind, is that it contains so many perfect examples of implausible results. This presumably does not apply to all such claims. I am quite happy to believe that some seemingly innocuous, social cues can influence human behaviour, even though I don’t know if they have actually been tested with scientific rigour. For example, there have been claims that putting little targets into urinals, such as a picture of a fly or a tiny football (or for my American readers “soccer”) goal with a swinging ball, reduces the amount of urine splattered all over the bathroom floor. I can see how this might work, if only from anecdotal self observation (not that I pee all over the bathroom floor without that, mind you). I can also believe that when an online sales website tells you “Only two left in stock!” it makes you more likely to buy then and there. Perhaps somewhat less plausible but still credible is the notion that playing classical music in London Underground stations reduces anti-social behaviour because some unsavoury characters don’t like that ambience.
An untrue (because fake) but potentially credible claim
While on the topic of train stations, the idea that people are more prone to racist stereotyping when they are in messy environments, does not seem entirely far-fetched to me. This is perhaps somewhat ironic because this is one of Diederik Stapel’s infamous fraudulent research claims. I don’t know if anyone has tried to carry out that research for real. It might very well not be a real effect at all but I could easily see how messy, graffiti-covered environments could trigger a cascade of all sorts of reactions and feelings that in turn influence your perception and behaviour to some degree. If it is real, the effect will probably be very small at the level of the general population because what one regards as an unpleasantly messy environment (and how prone one is to stereotyping) will probably differ substantially between people and between different contexts, such as the general location or the time. For instance, a train station in London is probably not perceived the same as one in the Netherlands (trust me on that one…), and a messy environment after carnival or another street party is probably not viewed in the same light as during other days. All these factors must contribute “noise” to the estimate of the population effect size.
The problem of asymmetric noise
However, this example already hints at the larger problem with determining whether or not the results from most social priming research are credible. It is possible that some effect size estimates will be stronger in the original test conditions than they will be in large-scale replication attempts. I would suspect this for most reported effects in the literature even under the best conditions and with pre-registered protocols. Usually your subject sample does not generalise perfectly to the world population or even the wider population of your country. It is perhaps impossible to completely abolish the self-selection bias induced by those people who choose to participate in research.
And sometimes the experimenter’s selection bias also makes sense as it helps to reduce noise and thus enhances the sensitivity of the experimental paradigm. For instance, in our own research using fMRI retinotopic mapping we are keen to scan people who we know are “good fMRI subjects”: people who can lie still inside the scanner and fixate perfectly for prolonged periods of time without falling asleep too much (actually even the best subjects suffer from that flaw…). If you scan someone who jitters and twitches all the time and who can’t keep their eyes open during the (admittedly dull) experiments, you can’t be surprised if your data turn out to be crap. This doesn’t tell you anything about the true effect size in the typical brain but only that it is much harder to obtain these measurements from the broader population. The same applies to psychophysics. A trained observer will produce beautifully smooth sigmoidal curves in whatever experiment you have them do. In contrast, randomers off the street will often give you a zigzag from which you can estimate thresholds only with the most generous imagination. It would be idiotic to regard the latter as a more precise measurement of human behaviour. The only thing we can say is that testing a broader sample can give you greater confidence that the result truly generalises.
Turning the hands of time
Could it perhaps be possible that some of the more bizarre social priming effects are “real” in the same sense? They may just be much weaker than the original reports because in the wider population these effects are increasingly diluted by asymmetric noise factors. However, I find it hard to see how this argument could apply when it comes to many social priming claims. What are the clean, controlled laboratory conditions that assure the most accurate measurement and strongest signal-to-noise ratio in such experiments? Take for instance the finding that people become more or less “open to new experiences” (but see this) depending on the direction they turn a crank/cylinder/turn-table. How likely is it that effects such as those reported (Cohen’s d mostly between 0.3-0.7ish) will be observed even if they are real? It seems to me that there are countless factors that will affect a person’s preference for familiar or novel experiences. Presumably if the effect of clockwise or anticlockwise (or for Americans: counterclockwise) rotation exists, it should manifest with a lot of individual differences because not everyone will be equally familiar with analogue clocks. I cannot rule out that performing or seeing clockwise rotation activates some representation of the future. This could influence people to think about novel things. It might just as well make them anxious: as someone who is perpetually late for appointments, the sight of a ticking clock certainly mainly primes in me the thought that I am running late again. I’d not be surprised if it increased my heart rate but I’d be pretty surprised if it made me desire unfamiliar items.
Walking slowly because you read the word “Florida”
The same goes for many other social priming claims, many of which have spectacularly failed to be replicated by other researchers. Take Bargh’s finding that priming people with words related to the elderly makes them walk more slowly. I can see how thinking about old age could make you behave more like your stereotype of old people although at the same time I don’t know why you should. It might as well have the opposite effect. More importantly, there should be countless other factors that probably have a much stronger influence on your walking speed, such as your general fitness level or the time of day and the activities you were doing before. Another factor influencing you will be the excitement about what to do next, for instance whether you are going to go to work or are about to head to a party. Or, most relevant to my life probably, whether or not you are running late for your next appointment.
Under statistical assumptions we regard all of these factors as noise, that is, random variation in the subject sample. If we test enough subjects the noise should presumably cancel out and the true effect of elderly priming, tiny as it may be, should crystallise. Fair enough. But that does not answer the question how strong an effect of priming the elderly we can realistically expect. I may very well be wrong, but it seems highly improbable to me that such a basic manipulation could produce a difference in average walking speed of one whole second (at least around an eighth of the time it took people on average to walk down the corridor). Even if the effect were really that large, it should be swamped by noise making it unlikely that it would be statistically significant with a sample size of 15 per group. Rather the explanation offered by one replication attempt (I’ve written about this several times before) seems more parsimonious: that there was an experimenter effect in that whoever was measuring the walking speed consistently misestimated the walking speed depending on what priming condition they believed the participant had been exposed to.
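For what it’s worth, this intuition is easy to play with in a small simulation. The sketch below is only that, a sketch: the 8-second average and the 1-second difference loosely follow the numbers mentioned above, but the standard deviations are pure guesses on my part (I have no norm data on corridor walking times) and the function names are made up for this example. It uses a plain pooled-variance t-test with the two-sided 5% critical value for 28 degrees of freedom, which only fits 15 participants per group.

```python
import math
import random
import statistics

N_PER_GROUP = 15      # sample size per group discussed in the text
T_CRIT_DF28 = 2.048   # two-sided 5% critical t value for 2 * 15 - 2 = 28 degrees of freedom
MEAN_WALK_S = 8.0     # assumed average walking time in seconds (illustrative)


def t_statistic(control, primed):
    """Pooled-variance two-sample t statistic for two equally sized groups."""
    n = len(control)
    pooled_var = (statistics.variance(control) + statistics.variance(primed)) / 2
    return (statistics.mean(primed) - statistics.mean(control)) / math.sqrt(2 * pooled_var / n)


def estimate_power(effect_s, sd_s, n_sims=5000, seed=1):
    """Proportion of simulated experiments in which |t| exceeds the critical value."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        control = [rng.gauss(MEAN_WALK_S, sd_s) for _ in range(N_PER_GROUP)]
        primed = [rng.gauss(MEAN_WALK_S + effect_s, sd_s) for _ in range(N_PER_GROUP)]
        if abs(t_statistic(control, primed)) > T_CRIT_DF28:
            significant += 1
    return significant / n_sims


if __name__ == "__main__":
    # The standard deviations below are hypothetical; swap in real norm data if you have it.
    for sd in (1.0, 2.0, 3.0):
        print(f"SD = {sd:.0f} s: estimated power = {estimate_power(1.0, sd):.2f}")
```

The more the walking times vary between people for reasons unrelated to priming, the less often a true 1-second difference would reach significance with 15 people per group.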
I should have professor-primed before all my exams
Even more incredible to me is the idea of “professor priming” in which exposing participants to things that remind them of professors makes them better at answering trivia questions than when they are primed with the concept of “soccer hooligans”, another finding that recently failed to be replicated. What mechanism could possibly explain such a cognitive and behavioural change? I can imagine how being primed to think about hooligans could generate all sorts of thoughts and feelings. They could provoke anxiety and stress responses. Perhaps that could make you perform worse on common knowledge questions. It’s the same mechanism that perhaps underlies stereotype threat effects (incidentally, how do those do with regard to replicability?). But this wouldn’t explain improvements in people’s performance when primed with professors.
I could see how being primed by hooligans or professors might produce some effects on your perception – perhaps judging somebody’s character by their faces etc. Perhaps you are more likely to think an average-looking person has above average intelligence when you’re already thinking about professors than when you think about hooligans (although there might just as well be contrast effects and I can’t really predict what should happen). But I find it very hard to fathom how thinking about professors should produce a measurable boost in trivia performance. Again, even if it were real, this effect should be swamped by all sorts of other factors all of which are likely to exert much greater influence on your ability to answer common knowledge questions. Presumably, common knowledge depends in the first instance on one’s common knowledge. Thinking of facts you do not have immediate access to may be helped by wakefulness and arousal. It may also help if you’re already “thinking outside the box” (I mean this figuratively – I have this vague suspicion that there is also a social priming study that claims being inside vs outside a physical box has some effect on creative thinking… (I was right – there is such a study)). You may be quicker/better at coming up with unconventional, rarely accessed information when you are already on a broad search than when you are monotonously carrying out a simple, repetitive task. But I don’t see how being primed by professors could activate such a process and produce anything but the tiniest of effects.
There was also a study that claimed that exposing subjects to a tiny American flag in the corner of the screen while they answered a survey affected their voting behaviour in the presidential election many months later. After all that I have written already, it should strike you as highly unlikely that such an effect could be established reliably. There are multitudes of factors that may influence a person’s voting behaviour, especially within the months between the critical flag experiment and election day. Surely the thousands of stars and stripes that any person residing in the United States would be exposed to on a daily basis should have some effect? I can believe that there are a great many hidden variables that govern where you make your cross on the ballot (or however they may be voting there) but I don’t think participation in a psychology experiment can produce a long-term shift in that behaviour by over 10%. If that were true, I think we would be well-advised to bring back absolutist monarchy. At least then you know who’s in charge.
Of fishy smells and ticking time bombs
One thing that many of these social priming studies have in common is that they take common folk sayings and turn them into psychology experiments. Similar claims that I just learned about this week (thanks to Alex Etz and Stuart Ritchie) are that being exposed to the smell of fish oil makes you more suspicious (because “it smells fishy”) and that the sound of a ticking clock pressures women from poor backgrounds into getting married and having children. I don’t know about you but if all of these findings are true, I feel seriously sorry for my brain. It’s constantly bombarded by conflicting cues telling it to change its perceptions and decisions on all sorts of things. It is a miracle we get anything done. Maybe it is because I am not a native English speaker but when I smell fish I think I may develop an appetite for fish. I don’t think it makes me more skeptical. Reading even more titles of studies turning proverbs into psychology studies just might though…
So what next?
I could go on as I am sure there are many more such studies but that’s beside the point. My main problem is that few research studies seem to ask whether the results they obtained are realistic. Ideally, we should start with some form of prediction of what kind of result we can even expect. To be honest, when it comes to social priming I don’t know how to go about doing this. I think it’s fine to start badly as long as someone is actually trying at all. Some thorough characterisation of the evidence to produce norm data may be useful. For instance, it would be useful to have data on general walking speeds of people leaving the lab from a larger sample so that you have a better estimate of the variability in walking speeds you could expect. If that is substantially larger than 1 second you should probably look to test a pretty large sample. Or characterise the various factors that can impact “openness to new experiences” more strongly than innocuous actions like turning a lever and then make an estimate as to how probable it is that your small social priming manipulation could exert a measurable effect with a typical sample size. Simulations could help with this. Last but definitely not least, think of more parsimonious hypotheses and either test them as part of your study or make sure that they are controlled – such as replacing the experimenter using a stopwatch with laser sensors at both ends of the corridor.
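As a back-of-the-envelope version of the norm data idea, the snippet below turns an assumed spread of walking times into a rough sample size per group. It uses the standard normal-approximation formula for comparing two means; the function name, the 80% power target and the example standard deviations are my own illustrative choices, not anything from the studies discussed here.

```python
import math
from statistics import NormalDist


def participants_per_group(sd_s, difference_s=1.0, alpha=0.05, power=0.8):
    """Approximate participants needed per group to detect a mean difference.

    Normal-approximation formula for a two-sided, two-sample comparison:
    n = 2 * ((z_(1 - alpha/2) + z_power) * sd / difference) ** 2
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd_s / difference_s) ** 2)


if __name__ == "__main__":
    # Hypothetical standard deviations of walking time in seconds.
    for sd in (1.0, 2.0, 3.0):
        print(f"SD = {sd:.0f} s: about {participants_per_group(sd)} participants per group")
```

Because the required sample grows with the square of the spread, doubling the variability in walking times roughly quadruples the number of people you would need to test.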
Of course, the issues I discussed here don’t apply only to social priming and future posts will deal with those topics. However, social priming has a particularly big problem. It is simply mechanistically very underdetermined. Sure, the general idea is that activating some representation or some idea can have an influence on behaviour. This essentially treats a human mind like a clean test tube just waiting for you to pour in your chemicals so you can watch the reaction. The problem is that in truth a human mind is more like a really, really filthy test tube, contaminated with blood and bacteria and dirty paw prints of all the people who fingered them before…
So I know I said I won’t write about replications any more. I want to stay true to my word for once… But then a thought occurred to me last night. Sometimes it takes minimal sleep followed by a hectic crazy day to crystallise an idea in your mind. So before I go on a much-needed break from research, social media, and (most importantly) bureaucracy, here is a post through the loophole (actually the previous one was too – I decided to split them up). It’s not about replication as such but about what “replicability” can tell us. Also it’s unusually short!
While replicability now seems widely understood to mean that a finding replicates on repeated testing, I have come to prefer the term reliability. If a finding is so fluky that it often cannot be reproduced, it is likely it was spurious in the first place. Hence it is unreliable. Most of the direct replication attempts underway now are seeking to establish the reliability of previous findings. That is fair enough. However, any divergence from the original experimental conditions will make the replication less direct.
This brings us to another important aspect of a finding – its generalisability (I have actually written about this whole issue before although in a more specific context). A finding may be highly reliable but still fail to generalise, like the coin flipping example in my previous post. In my opinion science must seek to establish both the reliability and the generalisability of hypotheses. Just like Sinatra said, “you can’t have one without the other.”
This is where I think most (not all!) currently debated replication efforts fall short. They seek to establish only reliability, which you can’t do on its own. A reliable finding that is so specific that it only occurs under very precise conditions could still lead to important theoretical advances – just like Fluke’s magnetic sand led to the invention of holographic television and hover-cars. Or, if you prefer real examples, just ask any microbiologist or single-cell electrophysiologist. Some very real effects can be very precarious.
However, it is almost certainly true that a lot of findings (especially in our field right now) are just simply unreliable and thus probably unreal. My main issue with the “direct” replication effort is that by definition it cannot ever distinguish reliability from generalisability. One could argue (and some people clearly do) that the theories underlying things like social priming are just so implausible that we don’t need to ask if they generalise. I think that is wrong. It is perfectly fine to argue that some hypothesis is implausible – I have done so myself. However, I think we should always test reliability and generalisability at the same time. If you only seek to establish reliability, you may be looking in the wrong place. If you only ask if the hypothesis generalises, you may end up chasing a mirage. Either way, you invest a lot of effort and resources but you may not actually advance scientific knowledge very much.
And this, my dear friends, Romans, and country(wo)men, will really be my final post on replication. When I’m back from my much-needed science break I will want to post about another topic inspired by a recent Neuroskeptic post.
I have been thinking about something I heard on Twitter yesterday (they can take credit for their statement if they wish but for me it’s not important to drag names into this):
…people should be rewarded for counterintuitive findings that replicate
I think that notion is misguided and potentially quite dangerous. Part of the reason why psychology is in such a mess right now is this focus on counterintuitive and/or surprising findings. It is natural that we get excited by novel discoveries and they are essential for scientific progress – so it will probably always be the case that novel or surprising findings can boost a scientist’s career. But that isn’t the same as rewarding them directly.
Rather I believe we should reward good science. By good I mean that experiments are well-designed with decent control conditions, appropriate randomisation of conditions etc, and meticulously executed (you can present evidence of that with sanity checks, analyses of residuals etc). However, there is another aspect which is that the experiments – not the findings – should be replicable. The dictionary definition of ‘replicable’ is that it should be ‘capable of replication’. In the context of the debates raging in our field this is usually taken to mean that findings replicate on repeated testing.
However, it can also be taken to mean (and originally I think this was the primary meaning) that it is possible to repeat the experiment. I think good science should come with methods sections that contain sufficient detail for someone with a reasonable background in the field to replicate them. That’s how we teach our students to write their methods sections. One of Jason Mitchell’s arguments was that all experiments contain tacit knowledge without which a replication is likely to fail. There will doubtless always be methods we don’t report. Mitchell uses the example that we don’t report that we instruct participants in neuroimaging experiments to keep still. Simine Vazire used the example that we don’t mention that experimenters usually wear clothes. However, things like this are pretty basic. Anything that isn’t just common sense should be reported in your methods – especially if you believe that it could make a difference. Of course it is possible you will only later realise a factor matters (like Prof Fluke did with the place where his coin flipping experiment was conducted). But we should seek to minimise these realisations by reporting methods in as much detail as possible.
While things have improved since the days when Science reported methods only as footnotes, the methods sections of many high-impact journals in our field still have very strict word limits. This makes it very difficult to report all the details of your methods. At Nature for instance the methods section should only be about 3000 words and “detailed descriptions of methods already published should be avoided.” While that may seem sensible it is actually quite painful to gather together methods details from previous studies, unless the procedures are well-established and widespread. Note that I am not too bothered by the fact that methods are often only available online. In this day and age that isn’t unreasonable. But at the very least the methods should be thorough, easily accessible, and all in one place.
(You may very well think that this post is about replication again – even though I said I wouldn’t write about this any more – but I couldn’t possibly comment…)
Recently I have spent a lot of time writing about replication and why I feel current “direct” replication efforts are often missing the point. For some reason it is a lot harder than it should be to get my point across. It is being misconstrued at every step and various straw man arguments are debated instead. Whatever the reasons for this may be, I want to try again one last time before I go on a break. Perhaps I can communicate my thoughts more clearly by means of a parable…
The magical coin
Professor Fluke returns from a journey to the tropical islands of the South Pacific. On a beach there he found the coin depicted above. One side shows a Polynesian deity. The other side bears the likeness of an ancient queen. Prof Fluke flips the strange coin 10 times and it lands on tails, the side with the fierce Polynesian god, every time. He is surprised, so he does it again. This time it lands on tails 6 times. Seeing that this means that overall 80% of the flips landed on tails and this is clearly beyond the traditional significance threshold of p<0.05, Fluke publishes a brief communication in a high impact journal to report that the coin is biased. He admits he doesn’t have a good theory for what is happening. The discovery is widely reported on the news partly due to the somewhat overhyped press release written by Fluke’s university. “Scientists discover magical coin” the headlines read. A disturbingly successful tabloid writes that the coin will cure cancer.
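For readers who want to check Prof Fluke’s arithmetic, here is a minimal sketch of an exact two-sided binomial test against a fair coin. The function name is mine and purely illustrative, and pooling a surprising first run with its follow-up (as Fluke does) is of course part of what the parable is poking at.

```python
from math import comb

def two_sided_p(k, n):
    """Exact two-sided binomial p-value against a fair coin:
    sum the probabilities of all outcomes no more likely than k tails."""
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k])

first_run = two_sided_p(10, 10)   # the first ten flips, all tails
second_run = two_sided_p(6, 10)   # the follow-up run on its own
pooled = two_sided_p(16, 20)      # both runs lumped together

print(f"10/10 tails: p = {first_run:.4f}")
print(f" 6/10 tails: p = {second_run:.4f}")
print(f"16/20 tails: p = {pooled:.4f}")
```

Pooled, the 16 tails out of 20 do come out below 0.05, while the second run of 6 out of 10 is unremarkable on its own.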
An earnest replication
Dr Earnest, a vocal proponent of Bayesian inference and a prolific replicator, doesn’t believe Prof Fluke’s sensationalist claims. She decides to replicate Fluke’s results. Unfortunately, she lacks the funds to fly to the south seas so she decides to craft a replica of the coin closely based on the description by Prof Fluke. Despite the hard effort in preparing the experiment, she only flips the coin five times. It lands on tails three times. While above chance levels, under the Bayesian framework this result actually weakly favours the null hypothesis (BF10=0.53). Even though these results aren’t very conclusive, Dr Earnest publishes this as a failure to replicate Fluke’s magical coin. The finding spreads like wildfire all over social media. People say the “Controversial magical coin was debunked!” and that “We need more replications like this!” It doesn’t take long for numerous anonymous commenters – who know nothing about coins let alone about coin flipping – to declare on internet forums that Prof Fluke is just a “bad scientist”. Some even accuse him of cheating.
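The parable doesn’t say which prior Dr Earnest used. As a sketch, assuming a uniform prior on the coin’s bias under the alternative hypothesis (my assumption, not something stated in the story), the Bayes factor for 3 tails out of 5 can be computed like this:

```python
from math import comb

def bf10_uniform(k, n):
    """Bayes factor for H1 (unknown bias, uniform prior on [0, 1])
    against H0 (fair coin) after observing k tails in n flips."""
    # Under a uniform prior every count from 0 to n is equally likely.
    marginal_h1 = 1 / (n + 1)
    # Under H0 the count follows a binomial distribution with p = 0.5.
    marginal_h0 = comb(n, k) / 2**n
    return marginal_h1 / marginal_h0

earnest = bf10_uniform(3, 5)
print(f"Dr Earnest, 3/5 tails: BF10 = {earnest:.2f}")
```

Under that assumption the result rounds to the BF10 of 0.53 quoted above, which is evidence so weak that it barely favours either hypothesis.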
Are you flipping crazy?
Another group of 10 researchers is understandably skeptical of Fluke’s magical coin. They all decide to flip coins 20 times so that there will be many more trials than ever before and thus the replication has much greater statistical power. Even though they all formally agree to do the same experiment, they don’t: eight of this consortium craft replicas of the coin just like Dr Earnest did. One of them, Dr Cook, travels to the south seas and a native gives him a coin that looks just like Prof Fluke’s. Finally, one replicator, Dr Friendly, directly talks to Prof Fluke who agrees to an adversarial collaboration using the actual coin he found.
All 10 of them start tossing coins. Overall the data suggest nothing much is going on. Out of the 200 coin tosses, it lands on tails 99 times – almost perfectly at chance and the effect goes in the opposite direction. However, Dr Friendly, who actually used Fluke’s coin, observes 14 tails out of 20. While this isn’t very strong evidence, it is not inconsistent with Fluke’s earlier findings. The consortium publishes a meta-analysis of the whole 200 coin flips stating that the evidence clearly shows that such coins are fair.
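To see how pooling can bury a single discrepant dataset, this sketch splits the consortium’s totals into Dr Friendly’s flips and everyone else’s, using only the counts given in the story. The helper function is illustrative, not anyone’s official method.

```python
from math import comb

def two_sided_p(k, n):
    """Exact two-sided binomial p-value against a fair coin."""
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k])

pooled = (99, 200)    # all ten replicators together
friendly = (14, 20)   # the only one who used Fluke's actual coin
others = (pooled[0] - friendly[0], pooled[1] - friendly[1])

for label, (k, n) in [("pooled", pooled),
                      ("Dr Friendly", friendly),
                      ("other nine", others)]:
    print(f"{label:12s} {k:3d}/{n:<3d} tails = {k / n:.3f}, "
          f"p = {two_sided_p(k, n):.3f}")
```

Neither subset is significant by itself, but the pooled proportion hides the fact that the one dataset collected with the original coin points in the same direction as Fluke’s result.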
Prof Fluke and Dr Friendly however also publish their own results separately. Like with most adversarial collaborations, in the discussion section they starkly disagree in their interpretation of the very same finding. Dr Friendly states that the coin is most likely fair. Fluke disagrees and also discloses a methodological detail that was missing from his earlier publication. He left it out because of the strict word limits imposed by the high impact journal and also because he didn’t think then that it should matter: his original 20 coin flips were all performed on the tropical beach right where he found the coin. All of the replications were done someplace else.
The coin tossing crisis
Nobody takes Fluke’s arguments seriously. All over social media and even in formal publications this is discussed as a clear failure to replicate and that his findings were probably p-hacked. “It’s obvious,” some commenters say, “Fluke just did a few hundred coin flips but only reported the significant ones.” Some scientists take another coin that depicts a salmon and flip it twice. It lands on fish-heads both times. They present a humorous poster at a conference to illustrate the problems with underpowered coin flipping experiments. Countless direct replication efforts are underway to test previous coin tossing results. To increase statistical power some researchers decide to run their experiments online where they can quickly reach larger sample sizes. Most people ignore the problem that tossing bitcoins might not be theoretically equivalent to doing it in the flesh.
To make matters worse, a few high profile cases of fraudulent coin flippers are revealed. Popular science news outlets write damning editorials about the “reproducibility crisis.” A group of statisticians led by Professor Eager reanalyses all the coin flips reported in the coin flipping literature to reveal that probably most studies did not report all their non-significant findings. Advocates of Bayesian methods counter those claims by saying that you can’t make claims about probabilities after the fact. Unfortunately, nobody really understands what they’re saying so the findings by Eager et al. are still cited widely.
The mainstream news media now continuously report on this “crisis.” Someone hacks into the email server of Prof Fluke’s university and digs out a statement that, when taken wildly out of context, sounds like all researchers are part of a global coin tossing conspiracy. The disturbingly successful tabloid publishes an article saying that the magical coin causes cancer. Public faith in science is undermined. In parts of the US, it is added to the school curriculum that children must learn that “Coin tossing is just a theory.” People stop vaccinating their children and regulations/treaties put in place to counteract climate change are dismantled. Soon thousands die from preventable diseases while millions get sick from polluted air and water…
The next generation
A few years later Prof Fluke dies of the flu. The epidemic caused by anti-vaxxers is only partly to blame. His immune system was simply weakened by all the stress caused by the replication debate. His name has become synonymous with false positives. People chuckle and joke about him whenever topics like p-hacking and questionable research practices are discussed. After the coin tossing debacle he could no longer get research grants and he failed to get tenure – impoverished and shunned by the scientific community, he couldn’t purchase any medicine.
Despite his mother’s warnings, Prof Fluke’s son decides to become a scientist. For obvious reasons, he decides to take his husband’s name when he gets married, so his name is now Dr Curious. Partly driven by an interest in the truth but also by a desire to exonerate his father’s name, Dr Curious takes the coin and travels to the South Pacific. He goes to the very beach where his father found the fateful coin and flips it. Ten out of ten tails! He does it again and observes the same result.
However, despite the possibility that this could prove his father right, he thinks it’s all too good to be true. He knows he will need extraordinary evidence to support extraordinary claims. He tries it a third time and this time he flips it 30 times. This time there is a gust of wind so he only gets 20 tails out of 30 coin tosses. It tends to be windy on Pacific beaches. This makes the temperature pleasant but it is not conducive to running good coin flipping studies.
A well-controlled experiment
To reduce measurement error in future experiments, if there is a gust of wind during any coin toss, this trial will be excluded. Dr Curious also vaguely remembers something an insane blogger wrote many years ago and includes some control conditions in his experiment. He brought along Dr Earnest’s replica coin. He also got an identical looking coin from one of the locals on the island and, last but not least, he brought a different coin from home. Dr Curious decides to do 100 coin flips per coin. Finally, because he fears people might not believe him otherwise, he preregisters this experimental protocol by means of a carrier albatross (internet connections on the island are too slow and too expensive).
The results of his coin flipping experiment are clear. After removing any trials with wind, the “magical” coin falls on tails almost all the time (52 tails out of 55 flips). During the three times it lands on heads, it could have been that he didn’t flip it well (this can really happen…). However, strikingly he observes very similar results for the other local coin and the results are even more extreme (60 tails out of 61 flips). Neither the replica coin nor the standard coin from home perform this way but they both show results that are very consistent with random chance.
Dr Curious is very pleased with his findings. He decides to return home and runs one more control experiment: it is an exact replication of his experiment but now he will do it in his lab. He again preregisters the protocol (this time via the internet). All four coins produce results that are not significantly different from chance levels. He publishes his findings arguing that both the place and the right type of coin are necessary to replicate his father’s findings.
The Fluke Effect
Our heroic scientist is however naturally curious (indeed that is his full name) so he is not satisfied with that outcome. He hands over the coins to his collaborator who will subject the coins to a full metallurgic analysis. In the meantime, Dr Curious flies back to the tropical island. He quickly confirms that he still gets similar results on the beach when using a local coin but not with one of the replicas.
Another thought crosses his mind. He goes into the jungle on the island, far from the beach, and repeats his coin tosses. The finding does not replicate. All coin flips are consistent with chance expectations. Mystified he returns to the beach. He takes a bucket full of sand from the beach into the jungle and tries again. Now the local coin falls on tails every time. “Eureka!” shouts Dr Curious, like no other scientist before him. “It’s all about the sand!”
He takes some of the sand home with him. His colleague has since discovered that the local coins are subtly magnetic. Now they also establish that the sand is somewhat magnetic. Whenever the coin is flipped over the sand it tends to fall on tails. The coin clearly wasn’t magical, in fact it wasn’t even special. It was just like all the other coins on the island. Dr Curious and his colleague have yet to figure out why the individual grains of sand don’t stick to the coin when you pick it up but they are confident that science will find the answer eventually. It is clearly a hitherto unknown form of magnetism. In honour of his father the effect is called Fluke’s Attraction.
Years later, Dr Curious watches a documentary about this on holographic television presented by Nellie deGrasse Tyson who inherited both the down-to-earth charm and the natural good looks of her great-grandfather. She explains that while Prof Fluke’s interpretation of his original findings was wrong because he lacked some necessary control conditions, he nonetheless deserves credit for the discovery of a new physical phenomenon that brought about many scientific advances, like holographic television and hover-cars. The story of Fluke’s Attraction is but one example of why persistence and inquisitiveness are essential to scientific progress. It shows that many experiments can be flawed yet nonetheless lead to breakthroughs eventually. Happily, Dr Curious falls asleep on the couch…
An alternate ending?
He dreams he is back on the tropical beach. His experiment with the four different coins fails to replicate his father’s finding. All the coins perform around chance even when there is no wind. He tries it over and over but the results are the same. He is forced to conclude that the original findings were completely spurious. There is no Fluke’s Attraction. The islanders’ coins behave just like any other coins…
Drenched in sweat and with a pounding heart Curious awakes from his nightmare. It takes him a moment to realise it was just a dream. Fluke’s Attraction is real. His father’s name has been exonerated and appears in all science textbooks.
But after taking a few deep breaths Curious realises that in the big picture it doesn’t matter. Just then Nellie says on the holo-vision:
“Flukes happen all the time. The most important lesson is not that the effect turned out to be real but that Curious went back to the island and ran a well-controlled experiment to test a new hypothesis. Of course he could have failed to replicate his father’s findings. But nonetheless he would have learned something new about the world: that it doesn’t matter which coins you use or whether you flip them on the beach.
“An infinite number of replications with replica coins – or even with the real coin – could not have done that. Yet all it took to reveal another piece of the truth was one inquisitive researcher who asked ‘What if…?‘”
For my own sanity’s sake I hope this will be my last post on replication. In the meantime, you may also enjoy this very short post by Lenny Teytelman about how the replication crisis isn’t a real crisis.
My previous post sparked a number of responses from various corners, including some exchanges I had on Twitter as well as two blog posts, one by Simine Vazire and another one following on from that. In addition, there has also been another post which discussed (and, in my opinion, misrepresented) similar things I said recently at the UCL “Is Science Broken” panel discussion.
I don’t blame people for misunderstanding what I’m saying. The blame must lie largely with my own inability to communicate my thoughts clearly. Maybe I am just crazy. As they say, you can’t really judge your own sanity. However, I am a bit worried about our field. To me the issues I am trying to raise are self-evident and fundamental. The fact that they apparently aren’t to others makes me wonder if Science isn’t in fact broken after all…
Either way, I want to post a brief (even by Alexetz’ standards?) rejoinder to that. They will just be brief answers to frequently asked questions (or rather, the often constructed strawmen):
Why do you hate replications?
I don’t. I am saying replications are central to good science. This means all studies (or close to it) should contain replications as part of their design. It should be a daisy chain. Each experiment should contain some replications, some sanity checks and control conditions. This serves two purposes: it shows that your experiment was done properly and it helps to accumulate evidence on whether or not the previous findings are reliable. Thus we must stop distinguishing between “replicators” and “original authors”. All scientists should be replicators all the bloody time!
Why should replicators have to show why they failed to replicate?
They shouldn’t. But, as I said in the previous point, they should be expected to provide evidence that they did a proper experiment. And of course the original study should be held to the same standard. This could in fact be a sanity check: if you show that the method used couldn’t possibly reveal reliable data this speaks volumes about the original effect.
It’s not the replicator’s fault if the original study didn’t contain a sanity check!
That is true. It isn’t your fault if the previous study was badly designed. But it is your fault if you are aware of that defect and nonetheless don’t try to do better. And it’s not really that black and white. What was good design yesterday can be bad design today and indefensibly terrible tomorrow. We can always do better. That’s called progress.
But… but… fluke… *gasp* type 1 error… Randomness!?!
Almost every time I discuss this topic someone will righteously point out that I am ignoring the null hypothesis. I am not. Of course the original finding may be a total fluke but you simply can’t know for sure. Under certain provisions you can test predictions that the null hypothesis makes (with Bayesian inference anyway). But that isn’t the same. There are a billion reasons between heaven and earth why you may fail a replication. You don’t even need to do it poorly. It may just be bad luck. Brain-behaviour correlations observed in London will not necessarily be detectable in Amsterdam* because the heterogeneity, and thus the inter-individual variance, in the latter sample is likely to be smaller. This means that for the very same effect size resulting from the same underlying biological process you may need more statistical power. Of course, it could also be some methodological error. Or perhaps the original finding was just a false positive. You can never know.
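The London versus Amsterdam point is essentially range restriction. Here is a rough sketch, assuming a simple linear model (y = slope × x + noise) in which the slope and the noise are identical in both samples and only the spread of x differs. All the numbers are made up for illustration, and the sample-size formula is the standard Fisher z approximation, not anything taken from the studies in question.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

def expected_r(slope, sd_x, sd_noise):
    """Population correlation implied by y = slope * x + noise."""
    return slope * sd_x / sqrt(slope**2 * sd_x**2 + sd_noise**2)

def n_required(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect correlation r
    (two-sided test), based on the Fisher z transformation."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / atanh(r)) ** 2 + 3)

# Hypothetical values: same slope and noise, different spread in x.
slope, sd_noise = 0.5, 1.0
r_wide = expected_r(slope, sd_x=1.0, sd_noise=sd_noise)    # heterogeneous sample
r_narrow = expected_r(slope, sd_x=0.6, sd_noise=sd_noise)  # homogeneous sample

print(f"wide:   r = {r_wide:.2f}, n needed ~ {n_required(r_wide)}")
print(f"narrow: r = {r_narrow:.2f}, n needed ~ {n_required(r_narrow)}")
```

With the same underlying relationship, the more homogeneous sample yields a smaller correlation and therefore needs more participants to reach the same power.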
Confirming the original result was a fluke is new information!
That view is problematic for two reasons. First of all, it is impossible to prove the null (yes, even for Bayesians). Science isn’t math, it doesn’t prove anything. You just collect data that may or may not be consistent with theoretical predictions. Secondly, you should never put too much confidence in any glorious new finding – even if it was high powered (because you don’t really know that) and pre-registered (because that doesn’t prevent people from making mistakes). So your prior that the result is a fluke should be strong anyway. You don’t learn very much new from that.
What then would tell me new information?
A new experiment that tests the same theory – or perhaps even a better theory. It can be a close replication but it can also be a largely conceptual one. I think this dichotomy is false. There are no true direct replications and even if there were they would be pointless. The directness of replication exists on a spectrum (I’ve said this already in a previous post). I admit that the definition of “conceptual” replications in the social priming literature is sometimes a fairly large stretch. You are free to disagree with them. The point is though that if a theory is so flaky that modest changes completely obliterate it then the onus is on the proponent of the theory to show why. In fact, this could be the new, better experiment you’re doing. This is how a replication effort can generate new hypotheses.
Please leave the poor replicators alone!
If you honestly think replicators are the ones getting a hard time I don’t know what world you live in. But that’s a story for another day, perhaps one that will be told by one of the Neuroscience Devils? The invitation to post there remains open…
Science clearly isn’t self-correcting or it wouldn’t be broken!
Apart from being a circular argument, this is also demonstrably wrong. Science isn’t a perpetual motion machine. Science is what scientists do. The fact that we are having these debates is conclusive proof that science self-corrects. I don’t see any tyrannical overlord ordering us to do any of this.
So what do you think should be done?
As I have said many times before, I think we need to train our students (and ourselves) in scientific skepticism and strong inference. We should stop being wedded to our pet theories. We need to make it easier to seek the truth rather than fame. For all I care, pre-registration can probably help with that but it won’t be enough. We have to stop the pervasive idea that an experiment “worked” when it confirmed your hypothesis and failed when it didn’t. We should read Feynman’s Cargo Cult Science. And after thinking about all this negativity, wash it all down by reading (or watching) Carl Sagan to remember how many mysteries yet wait to be solved in this amazing universe we inhabit.
*) I promise I will stop talking about this study now. I really don’t want to offend anyone…
In recent months I have written a lot (and thought a lot more) about the replication crisis and the proliferation of direct replication attempts. I admit I haven’t bothered to quantify this but I have an impression that most of these attempts fail to reproduce the findings they try to replicate. I can understand why this is unsettling to many people. However, as I have argued before, I find the current replication movement somewhat misguided.
A big gaping hole where your theory should be
Over the past year I have also written too much about Psi research. Most recently, I summarised my views on this in an uncharacteristically short post (by my standards) in reply to Jacob Jolij. But only very recently I realised that my views on all of this actually converge on the same fundamental issue. On that note I would like to thank Malte Elson with whom I discussed some of these issues at that Open Science event at UCL recently. Our conversation played a significant role in clarifying my thoughts on this.
My main problem with Psi research is that it has no firm theoretical basis and that the use of labels like “Psi” or “anomalous” or whatnot reveals that this line of research is simply about stating the obvious. There will always be unexplained data but that doesn’t prove any theory. It has now dawned on me that my discomfort with the current replication movement stems from the same problem: failed direct replications do not explain anything. They don’t provide any theoretical advance to our knowledge about the world.
I am certainly not the first person to say this. Jason Mitchell’s treatise about failed replications covered many of the same points. In my opinion it is unfortunate that these issues have been largely ignored by commenters. Instead his post has been widely maligned and ridiculed. In my mind, this reaction was not only uncivil but really quite counter-productive to the whole debate.
Why most published research findings are probably not waterfowl
A major problem with his argument was pointed out by Neuroskeptic: Mitchell seems to hold replication attempts to a different standard than original research. While I often wonder if it is easier to incompetently fail to replicate a result than to incompetently p-hack it into existence, I agree that it is not really feasible to take that into account. I believe science should err on the side of open-minded skepticism. Thus even though it is very easy to fail to replicate a finding, the only truly balanced view is to use the same standards for original and replication evidence alike.
Mitchell describes the problems with direct replications with a famous analogy: if you want to prove the existence of black swans, all it takes is to show one example. No matter how many white swans you may produce afterwards, they can never refute the original reports. However, in my mind this analogy is flawed. Most of the effects we study in psychology or neuroscience research are not black swans. A significant social priming effect or a structural brain-behaviour correlation is not irrefutable evidence that it is real.
Imagine that there really were no black swans. It is conceivable that someone might parade around a black swan but maybe it’s all an elaborate hoax. Perhaps somebody just painted a white swan? Frauds of such a sensational nature are not unheard of in science, but most of us trust that they are nonetheless rare. More likely, it could be that the evidence is somehow faulty. Perhaps the swan was spotted in poor lighting conditions making it appear black. Considering how many people can disagree about whether a photo depicts a black or a white dress this possibility seems entirely conceivable. Thus simply showing a black swan is insufficient evidence.
On the other hand, Mitchell is entirely correct that parading a whole swarm of white swans is also insufficient evidence against the existence of black swans. The same principle applies here. The evidence could also be faulty. If we only looked at swans native to Europe we would have a severe sampling bias. In the worst case, people might be photographing black swans under conditions that make them appear white.
On the wizardry of cooking social psychologists
This brings us to another oft repeated argument about direct replications. Perhaps the “replicators” are just incompetent or lacking in skill. Mitchell also has an analogy for this (which I unintentionally also used in my previous post). Replicators may just be bad cooks who follow the recipes but nonetheless fail to produce meals that match the beautiful photographs in the cookbooks. In contrast, Neuroskeptic referred to this tongue-in-cheek as the Harry Potter Theory: only those blessed with magical powers are able to replicate. Inept “muggles” failing to replicate a social priming effect should just be ignored.
In my opinion both of these analogies are partly right. The cooking analogy correctly points out that simply following the recipe in a cookbook does not make you a master chef. However, it also ignores the fact that the beautiful photographs in a cookbook are frequently not entirely genuine. To my knowledge, many cookbook photos are actually of cold food to circumvent problems like steam on the camera etc. Most likely the photos will have been doctored in some way and they will almost certainly be the best pick out of several cooking attempts and numerous photos. So while it is true that the cook was an expert while you probably aren’t, the photo does not necessarily depict a representative meal.
The jocular wizardry argument implies that anyone with a modicum of expertise in a research area should be able to replicate a research finding. As students we are taught that the methods sections of our research publications should allow anyone to replicate our experiments. But this is certainly not feasible: some level of expertise and background knowledge should be expected for a successful replication. I don’t think I could replicate any findings in radio astronomy regardless of how well established they may be.
One frustration many authors of results that have failed to replicate have expressed to me (and elsewhere) is the implicit assumption by many “replicators” that social psychology research is easy. I am not a social psychologist. I have no idea how easy these experiments are but I am willing to give people the benefit of the doubt here. It is possible that some replication attempts overlook critical aspects of the original experiments.
However, I think one of the key points of Neuroskeptic’s Harry Potter argument applies here: the validity of a “replicator’s” expertise, that is their ability to cast spells, cannot be contingent on their ability to produce these effects in the first place. This sort of reasoning seems circular and, appropriately enough, sounds like magical thinking.
How to fix our replicator malfunction
The way I see it both arguments carry some weight here. I believe that replicators should have to demonstrate their ability to do this kind of research properly in order for us to have any confidence in their failed wizardry. When it comes to the recent failure to replicate nearly half a dozen studies reporting structural brain-behaviour correlations, Ryota Kanai suggested that the replicators should have analysed the age dependence of grey matter density to confirm that their methods were sensitive enough to detect such well-established effects. Similarly, all the large-scale replication attempts in social psychology should contain such sanity checks. On a positive note, the Many Labs 3 project included a replication of the Stroop effect and similar objective tests that fulfill such a role.
However, while such clear-cut baselines are great they are probably insufficient, in particular if the effect size of the “sanity check” is substantially greater than the effect of interest. Ideally, any replication attempt should contain a theoretical basis, an alternative hypothesis to be tested that could explain the original findings. As I said previously, it is the absence of such theoretical considerations that makes most failed replications so unsatisfying to me.
The problem is that for a lot of the replication attempts, whether they are of brain-behaviour correlations, social priming, or Bem’s precognition effects, the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking and/or questionable research practices. This seems mostly unfalsifiable. Perhaps these replication studies could incorporate control conditions/analyses to quantify the severity of p-hacking required to produce the original effects. But this is presumably unfeasible in practice because the parameter space of questionable research practices is so vast that it is impossible to derive a sufficiently accurate measure of them. In a sense, methods for detecting publication bias in meta-analysis are a way to estimate this but the evidence they provide is only probabilistic, not experimental.
Of course this doesn’t mean that we cannot have replication attempts in the absence of a good alternative hypothesis. My mentors instilled in me the view that any properly conducted experiment should be published. It shouldn’t matter whether the results are positive, negative, or inconclusive. Publication bias is perhaps the most pervasive problem scientific research faces and we should seek to reduce it, not amplify it by restricting what should and shouldn’t be published.
Rather I believe we must change the philosophy underlying our attempts to improve science. If you disbelieve the claims of many social priming studies (and honestly, I don’t blame you!) it would be far more convincing to test a hypothesis on why the entire theory is false than to show that some specific findings fail to replicate. It would also free up a lot of resources, currently used on dismantling implausible ideas, to actually advance scientific knowledge.
There is a reason why I haven’t tried to replicate “presentiment” experiments even though I have written about it. Well, to be honest the biggest reason is that my grant is actually quite specific as to what research I should be doing. However, if I were to replicate these findings I would want to test a reasonable hypothesis as to how they come about. I actually have some ideas how to do that but in all honesty I simply find these effects so implausible that I don’t really feel like investing a lot of my time into testing them. Still, if I were to try a replication it would have to be to test an alternative theory because a direct replication is simply insufficient. If my replication failed, it would confirm my prior beliefs but not explain anything. However, if it succeeded, I probably still wouldn’t believe the claims. In other words, we wouldn’t have learned very much either way.