Since last night the internet has been all atwitter about a commentary* by Dan Gilbert and colleagues about the recent and (in my view) misnamed Reproducibility Project: Psychology. In this commentary, Gilbert et al. criticise the RPP for a number of technical reasons, asserting that the sampling was non-random and biased and that, essentially, the conclusion of a replicability crisis in psychology, in particular as covered by science media and the blogosphere, is unfounded. Some of their points are rather questionable to say the least and some, like their interpretation of confidence intervals, are statistically simply wrong. But I won’t talk about this here.
One point they raise is the oft repeated argument that replications differed in some way from the original research. We’ve discussed this already ad nauseam in the past and there is little point going over this again. Exact replication of the methods and conditions of an original experiment can test the replicability of a finding. Indirect replications loosely testing similar hypotheses instead inform about generalisability of the idea, which in turn tells us about the robustness of the purported processes we posited. Everybody (hopefully) knows this. Both are important aspects to scientific progress.
The main problem is that most debates about replicability go down that same road with people arguing about whether the replication was of sufficient quality to yield interpretable results. One example by Gilbert and co is that one of the replications in the RPP used the same video stimuli used by the original study, even though the original study was conducted in the US while the replication was carried out in the Netherlands, and the dependent variable was related to something that had no relevance to the participants in the replication (race relations and affirmative action). Other examples like this were brought up in previous debates about replication studies. A similar argument has also been made about the differences in language context between the original Bargh social priming studies and the replications. In my view, some of these points have merit and the example raised by Gilbert et al. is certainly worth a facepalm or two. It does seem mind-boggling how anyone could have thought that it is valid to replicate a result about a US-specific issue in a liberal European country whilst using the original stimuli in English.
But what this example illustrates is a much larger problem. In my mind that is actually the crux of the matter: Psychology, or at least most forms of more traditional psychology, do not lend themselves very well to replication. As I am wont to point out, I am not a psychologist but a neuroscientist. I do work in a psychology department, however, and my field obviously has considerable overlap with traditional psychology. I also think many subfields of experimental psychology work in much the same way as other so-called “harder” sciences. This is not to say that neuroscience, psychophysics, or other fields do not also have problems with replicability, publication bias, and other concerns that plague science as a whole. We know they do. But the social sciences, the more lofty sides of psychology dealing with vague concepts of the mind and psyche, in my view have an additional problem: They lack the lawful regularity of effects that scientific discovery requires.
For example, we are currently conducting an fMRI experiment in which we replicate a previous finding. We are using the approach I have long advocated: when you try to replicate, you should design experiments that both replicate a previous result and seek to address a novel question. The details of the experiment are not very important. (If we ever complete this experiment and publish it you can read about it then…) What matters is that we very closely replicate the methods of a study from 2012 and this study closely replicated the methods of one from 2008. The results are pretty consistent across all three instances of the experiment. The 2012 study provided a somewhat alternative interpretation of the findings of the 2008 one. Our experiment now adds more spatially sensitive methods to yet again paint a somewhat different picture. Since we’re not finished with it I can’t tell you how interesting this difference is. It is however already blatantly obvious that the general finding is the same. Had we analysed our experiment in the same way as the 2008 study, we would have reached the same conclusions they did.
The whole idea of science is to find regularities in our complex observations of the world, to uncover lawfulness in the chaos. The entire empirical approach is based on the idea that I can perform an experiment with particular parameters and repeat it with the same results, blurred somewhat by random chance. Estimating the generalisability allows me to understand how tweaking the parameters can affect the results and thus allows me to determine what the laws are that govern the whole system.
And this right there is where much of psychology has a big problem. I agree with Gilbert et al. that repeating a social effect in US participants with identical methods in Dutch participants is not a direct replication. But what would be? They discuss how the same experiment was then repeated in the US and found results weakly consistent with the original findings. But this isn’t a direct replication either. It does not suffer from the same cultural and language differences as the replication in the Netherlands did but it has other contextual discrepancies. Even repeating exactly the same experiment in the original Stanford(?) population would not necessarily be equivalent because of the time that has passed and the way cultural factors have changed. A replication is simply not possible.
For all the failings that all fields of science have, this is a problem my research area does not suffer from (and to clarify: “my field” is not all of cognitive neuroscience, much of which is essentially straight-up psychology with the brain tagged on, and also while I don’t see myself as a psychologist, I certainly acknowledge that my research also involves psychology). Our experiment is done on people living in London. The 2012 study was presumably done mainly on Belgians in Belgium. As far as I know the 2008 study was run in the mid-western US. We are asking a question that deals with a fairly fundamental aspect of human brain function. This does not mean that there aren’t any population differences but our prior for such things affecting the results in a very substantial way is pretty small. Similarly, the methods can certainly modulate the results somewhat but I would expect the effects to be fairly robust to minor methodological changes. In fact, whenever we see that small changes in the method (say, the stimulus duration or the particular scanning sequence used) seem to obliterate a result completely, my first instinct is usually that such a finding is non-robust and thus unlikely to be meaningful.
From where I’m standing, social and other forms of traditional psychology can’t say the same. Small contextual or methodological differences can quite likely skew the results because the mind is a damn complex thing. For that reason alone, we should expect psychology to have low replicability and the effect sizes should be pretty small (i.e. smaller than what is common in the literature) because they will always be diluted by a multitude of independent factors. Perhaps more than any other field, psychology can benefit from preregistering experimental protocols to delineate the exploratory garden-path from hypothesis-driven confirmatory results.
I agree that a direct replication of a contextually dependent effect in a different country and at a different time makes little sense but that is no excuse. If you just say that the effects are so context-specific it is difficult to replicate them, you are bound to end up chasing lots of phantoms. And that isn’t science – not even a “soft” one.
* At first I thought the commentary was due to be published by Science on 4th March and embargoed until that date. However, it turns out to be more complicated than that because the commentary I am discussing here is not the Science article but Gilbert et al.’s reply to Nosek et al.’s reply to Gilbert et al.’s reply to the RPP (Confused yet?). It appeared on a website and then swiftly vanished again. I don’t know how I would feel posting it because the authors evidently didn’t want it to be public. I don’t think actually having that article is central to understanding my post so I feel it’s not important.
TL;DR: Leave-one-out cross-validation is a bad way for testing the predictive power of linear correlation/regression.
Correlation and regression analyses are popular tools in neuroscience and psychology research for analysing individual differences. They fit a model (most typically a linear relationship between two measures) to infer whether the variability in some measure is related to the variability in another measure. Revealing such relationships can help understand the underlying mechanisms. We and others used them in previous studies to test specific mechanistic hypotheses linking brain structure/function and behaviour. They also form the backbone of twin studies of heritability that in turn can implicate genetic and experiential factors in some trait. Most importantly, in my personal view individual differences are interesting because they acknowledge the fact that every human being is unique rather than simply treating variability as noise and averaging across large groups of people.
But typically every report of a correlational finding will be followed by someone zealously pointing out that “Correlation does not imply causation”. And doubtless it is very important to keep that in mind. A statistical association between two variables may simply reflect the fact that they are both related to a third, unknown factor or a correlation may just be a fluke.
Another problem is that the titles of studies using correlation analysis sometimes use what I like to call “smooth narrative” style. Saying that some behaviour is “predicted by” or “depends on” some brain measure makes for far more accessible and interesting reading than dryly talking about statistical correlations. However, it doesn’t sit well with a lot of people, in part because such language may imply a causal link that the results don’t actually support. Jack Gallant seems to regularly point out on Twitter that the term “prediction” should only ever be used when a predictive model is built on some data set but the validity is tested on an independent data set.
Recently I came across an interesting PubPeer thread debating this question. In it, one commenter pointed out that the title of the study under discussion, “V1 surface size predicts GABA concentration”, was unjustified because this relationship explains only about 7% of the variance when using a leave-one-out cross-validation procedure. In this procedure all data points except one are used to fit the regression and the left-out point is then used to evaluate the fit of the model. This procedure is then repeated n times, using every data point as evaluation data once.
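For readers who have not met the procedure before, here is a minimal sketch of leave-one-out cross-validation for a simple linear regression. It is only an illustration: the function names `fit_line` and `loo_predictions` and the toy data are made up for this example and are not taken from the study or the PubPeer thread.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def loo_predictions(xs, ys):
    """Predict every point from a line fitted to all the *other* points."""
    predictions = []
    for i in range(len(xs)):
        # Training set: everything except observation i
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        # Evaluation: predict the single left-out observation
        predictions.append(intercept + slope * xs[i])
    return predictions


# Made-up data, purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
predicted = loo_predictions(x, y)
```

Each entry of `predicted` comes from a model that never saw the corresponding observation, which is what makes the procedure look like an internal replication.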
Taken at face value this approach sounds very appealing because it uses independent data for making predictions and for testing them. Replication is a cornerstone of science and in some respects cross-validation is an internal replication. So surely this is a great idea? Naive as I am I have long had a strong affinity for this idea.
Cross-validation underestimates predictive power
But not so fast. These notions fail to address two important issues (both of which some commenters on that thread already pointed out): first, it is unclear what amount of variance a model should explain to be important. 7% is not very much but it can nevertheless be of substantial theoretical value. The amount of variance that can realistically be explained by any model is limited by the noise in the data that arises from measurement error or other distortions. So in fact many studies using cross-validation to estimate the variance explained by some models (often in the context of model comparison) instead report the amount of explainable variance accounted for by the model. To derive this one must first estimate the noise ceiling, that is, the realistic maximum of variance that can possibly be explained. This depends on the univariate variability of the measures themselves.
Second, the cross-validation approach is based on the assumption that the observed sample, which is then subdivided into model-fitting and evaluation sets, is a good representation of the population parameters the analysis is attempting to infer. As such, the cross-validation estimate also comes with an error (this issue is also discussed by another blog post mentioned in that discussion thread). What we are usually interested in when we conduct scientific studies is to make an inference about the whole population, say a conclusion that can be broadly generalised to any human brain, not just the handful of undergraduate students included in our experiments. This does not really fit the logic of cross-validation because the evaluation is by definition only based on the same sample we collected.
Because I am a filthy, theory-challenged experimentalist, I decided to simulate this (and I apologise to all my Bayesian friends for yet again conditioning on the truth here…). For a range of sample sizes between n=3 and n=300 I drew a sample from a population with a fixed correlation of rho=0.7 and performed leave-one-out cross-validation to quantify the variance explained by it (using the squared correlation between predicted and observed values). I also performed a standard regression analysis and quantified the variance explained by that. At each sample size I did this 1000 times and then calculated the mean variance explained for each approach. Here are the results:
What is immediately clear is that the results strongly depend on the sample size. Let’s begin with the blue line. This represents the variance explained by the standard regression analysis on the whole observed sample. The dotted, black, horizontal line denotes the true effect size, that is, the variance explained by the population correlation (so R^2=49%). The blue line starts off well above the true effect but then converges on it. This means that at small sample sizes, especially below n=10, the observed sample inflates how much variance is explained by the fitted model.
Next look at the red line. This denotes the variance explained by the leave-one-out cross-validation procedure. This also starts off above the true population effect and follows the decline of the observed correlation. But then it actually undershoots and goes well below the true effect size. Only then it gradually increases again and converges on the true effect. So at sample sizes that are most realistic in individual differences research, n=20-100ish, this cross-validation approach underestimates how much variance a regression model can explain and thus in fact undervalues the predictive power of the model.
The error bars in this plot denote +/- 1 standard deviation across the simulations at each sample size. So as one would expect, the variability across simulations is considerable when sample size is small, especially when n <= 10. These sample sizes are maybe unusually small but certainly not unrealistically small. I have seen publications calculating correlations on such small samples. The good news here is that even with such small samples on average the effect may not be inflated massively (let’s assume for the moment that publication bias or p-hacking etc are not an issue). However, cross-validation is not reliable under these conditions.
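If you want to play with this yourself, the simulation described above can be sketched roughly as follows. This is not my original code but a reconstruction under stated assumptions: both measures are drawn from a standard bivariate normal distribution, and all names (`draw_sample`, `loo_predictions`, `simulate`) are invented for the sketch. The defaults use fewer repetitions and sample sizes than the 1000 runs per sample size described above so that it runs quickly in plain Python.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def draw_sample(n, rho, rng):
    """Draw n pairs from a bivariate normal with population correlation rho
    (assumption: standard normal marginals)."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
    return xs, ys


def loo_predictions(xs, ys):
    """Leave-one-out predictions for simple linear regression.
    Uses running sums so each left-out fit costs constant time."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = n - 1  # size of each training set
    predictions = []
    for x, y in zip(xs, ys):
        # Sums over the training set, i.e. without the left-out point
        tx, ty = sx - x, sy - y
        txx, txy = sxx - x * x, sxy - x * y
        slope = (m * txy - tx * ty) / (m * txx - tx * tx)
        intercept = (ty - slope * tx) / m
        predictions.append(intercept + slope * x)
    return predictions


def simulate(n, rho, reps=1000, seed=0):
    """Mean variance explained by the full-sample fit and by leave-one-out."""
    rng = random.Random(seed)
    full_total = 0.0
    loo_total = 0.0
    for _ in range(reps):
        xs, ys = draw_sample(n, rho, rng)
        full_total += pearson_r(xs, ys) ** 2
        loo_total += pearson_r(loo_predictions(xs, ys), ys) ** 2
    return full_total / reps, loo_total / reps


if __name__ == "__main__":
    for n in (5, 10, 20, 50, 100):
        full, loo = simulate(n, rho=0.7, reps=200)
        print(f"n={n:4d}  full-sample R^2={full:.3f}  leave-one-out R^2={loo:.3f}")
```

Changing `rho` in the call to `simulate` gives the other scenarios discussed in this post.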
A correlation of rho=0.7 is unusually strong for most research. So I repeated this simulation analysis using a perhaps more realistic effect size of rho=0.3. Here is the plot:
Now we see a hint of something fascinating: the variance explained by the cross-validation approach actually subtly exceeds that of the observed sample correlation. They again converge on the true population level of 9% when the sample size reaches n=50. Actually there is again an undershoot but it is negligible. But at least for small samples with n <= 10 the cross-validation certainly doesn’t perform any better than the observed correlation. Both massively overestimate the effect size.
When the null hypothesis is true…
So if this is what is happening at a reasonably realistic rho=0.3, what about when the null hypothesis is true? This is what is shown in here (I apologise for the error bars extending into the impossible negative range but I’m too lazy to add that contingency to the code…):
The problem we saw hinted at above for rho=0.3 is exacerbated here. As before, the variance explained for the observed sample correlation is considerably inflated when sample size is small. However, for the cross-validated result this situation is much worse. Even at a sample size of n=300 the variance explained by the cross-validation is greater than 10%. If you read the PubPeer discussion I mentioned, you’ll see that I discussed this issue. This result occurs because when the null hypothesis is true – or the true effect is very weak – the cross-validation will produce significant correlations between the inadequately fitted model predictions and the actual observed values. These correlations can be positive or negative (that is, the predictions systematically go in the wrong direction) but because the variance explained is calculated by squaring the correlation coefficient they turn into numbers substantially greater than 0%.
As I discussed in that thread, there is another way to calculate the variance explained by the cross-validation. I won’t go into detail on this but unlike the simpler approach I employed here this does not limit the variance explained to fall between 0-100%. While the estimates are numerically different, the pattern of results is qualitatively the same. At smaller sample sizes the variance explained by cross-validation systematically underestimates the true variance explained.
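I have not spelled out the alternative formula here, so take the following only as an assumption about what such a measure could look like: a common choice is to compute the variance explained from the prediction errors, as one minus the ratio of residual to total sum of squares. The sketch below contrasts it with the squared correlation; the function names and the toy numbers are invented for illustration.

```python
import math


def squared_correlation(observed, predicted):
    """Variance explained as the squared Pearson correlation (always 0 to 1)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    soo = sum((o - mean_o) ** 2 for o in observed)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    sop = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    return (sop / math.sqrt(soo * spp)) ** 2


def error_based_r2(observed, predicted):
    """Variance explained as 1 - SS_residual / SS_total (can be negative)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Toy case: predictions that go systematically in the wrong direction
observed = [1.0, 2.0, 3.0]
predicted = [3.0, 2.0, 1.0]
wrong_way_squared = squared_correlation(observed, predicted)
wrong_way_error_based = error_based_r2(observed, predicted)
```

In this toy case the squared correlation rewards the perfectly wrong predictions with the maximum score, whereas the error-based measure goes negative. That is the squaring problem described above for the null hypothesis case.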
When the interocular traumatic test is significant…
My last example is the opposite scenario. While we already looked at an unusually strong correlation, I decided to also simulate a case where the effect should be blatantly obvious. Here rho=0.9:
Unsurprisingly, the results are similar to those seen for rho=0.7 but now the observed correlation is already doing a pretty decent job at reaching the nominal level of 81% variance explained. Still, the cross-validation underperforms at small sample sizes. In this situation, this actually seems to be a problem. It is rare that one would observe a correlation of this magnitude in psychological or biological sciences but if so chances are good that the sample size is small in that case. Often the reason for this may be that correlation estimates are inflated at small sample sizes but that’s not the point here. The point is that leave-one-out cross-validation won’t tell you. It underestimates the association even if it is real.
Where does all this leave us?
It is not my intention to rule out cross-validation. It can be a good approach for testing models and is often used successfully in the context of model comparison or classification analysis. In fact, as the debate about circular inference in neuroscience a few years ago illustrated, there are situations where it is essential that independent data are used. Cross-validation is a great way to deal with overfitting. Just don’t let yourself be misled into believing it can tell you something it doesn’t. I know it is superficially appealing and I had played with it previously for just that reason – but this exercise has convinced me that it’s not as bullet-proof as one might think.
Obviously, validation of a model with independent data is a great idea. A good approach is to collect a whole independent replication sample but this is expensive and may not always be feasible. Also, if a direct replication is performed it seems better that this is acquired independently by different researchers. A collaborative project could do this in which each group uses the data acquired by the other group to test their predictive model. But that again is not something that is likely to become regular practice anytime soon.
In the meantime we can also remember that performing typical statistical inference is a good approach after all. Its whole point is to infer the properties of the whole population from a sample. When used properly it tends to do a good job at that. Obviously, we should take measures to improve its validity, such as increasing power by using larger samples and/or better measurements. I know I am baysed but Bayesian hypothesis tests seem superior to traditional significance testing at ensuring validity. Registered Reports can probably also help and certainly should reduce the skew by publication bias and flexible analyses.
So, does correlation imply prediction? I think so. Statistically this is precisely what it does. It uses one measure (or multiple measures) to make predictions of another measure. The key point is not whether calling it a prediction is valid but whether the prediction is sufficiently accurate to be important. The answer to this question actually depends considerably on what we are trying to do. A correlation explaining 10-20% of the variance in a small sample is not going to be a clear biomarker for anything. I sure as hell wouldn’t want any medical or judicial decisions to be based solely on such an association. But it may very well be very informative about mechanisms. It is a clearly detectable effect even with the naked eye.
In the context of these analyses, a better way than quantifying the variance explained is to calculate the root mean squared deviation (essentially the error bar) of the prediction. This actually provides a much more direct index of how accurately one variable predicts another. The next step – and I know I sound like a broken record – should be to confirm that these effects are actually scientifically plausible. This mantra is true for individual differences research as much as it is for Bem’s precognition and social priming experiments where I mentioned it before. Are the differences in neural transmission speed or neurotransmitter concentration implied by these correlation results realistic based on what we know about the brain? These are the kinds of predictions we should actually care about in these discussions.
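The root mean squared deviation mentioned above is simple to compute once you have predicted and observed values, for instance from a cross-validation. A minimal sketch, with an invented function name and made-up numbers:

```python
import math


def rmsd(observed, predicted):
    """Root mean squared deviation between observed and predicted values.
    The result is in the same units as the measure being predicted."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up numbers, purely for illustration
observed = [2.0, 3.5, 5.0, 4.0]
predicted = [2.5, 3.0, 4.5, 4.5]
typical_error = rmsd(observed, predicted)
```

Because the result is expressed in the units of the predicted measure, it can be compared directly against what would be a scientifically meaningful difference in that measure.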
Unless you have been living on the Moon in recent years, you will have heard about the replication crisis in science. Some will want to make you believe that this problem is specific to psychology and neuroscience but similar discussions also plague other research areas from drug development and stem cell research to high energy physics. However, since psychology deals with flaky, hard-to-define things like the human mind and the myriad of behaviours it can produce, it is unsurprising that reproducibility is perhaps a greater challenge in this field. As opposed to, say, an optics experiment (I assume, I may be wrong about that), there are just too many factors for you to be able to conduct a controlled experiment producing clear results.
Science is based on the notion that the world, and the basic laws that govern it, are for the most part deterministic. If you have situation A under condition X, you should get a systematic change to situation B when you change the condition to Y. The difference may be small and there may be a lot of noise meaning that the distinction between situations A and B (or between conditions X and Y for that matter) isn’t always very clear-cut. Nevertheless, our underlying premise remains that there is a regularity that we can reveal provided we have sufficiently exact measurement tools and are able and willing to repeat the experiment a sufficiently large number of times. Without causality as we know it, that is the assumption that a manipulation will produce a certain effect, scientific inquiry just wouldn’t work.
There is absolutely nothing wrong with wanting to study complicated phenomena like human thought or behaviour. Some people seem to think that if you can’t study every minute detail of a complex system like the brain and don’t understand what every cell and ion channel is doing at any given time, you have no hope of revealing anything meaningful about the larger system. This is nonsense. Phenomena exist at multiple scales and different levels and a good understanding of all the details is not a prerequisite for understanding some of the broader aspects, which may in fact be more than the sum of their parts. So by all means we should be free to investigate lofty hypotheses about consciousness, about how cognition influences perception (and vice versa), or about whether seemingly complex attributes like political alignment or choice of clothing relate to simple behaviours and traits. But when we do we should always keep in mind the complexity of the system we study and whether the result is even plausible under our hypothesis.
Whatever you may feel about Bayesian statistics, the way I see it there should be a Bayesian approach to scientific reasoning. Start with an educated guess on what effects one might realistically expect under a range of different hypotheses. Then see under which hypothesis (or hypotheses) the observed results are most likely. In my view, a lot of scientific claims fall flat in that respect. Note that this doesn’t necessarily mean that these claims aren’t true – it’s just that the evidence for them so far has been far from convincing and I will try to explain why I feel that way. I also don’t claim to be immune to this delusion either. Some of my own hypotheses are probably also scientifically implausible. The whole point of research is to work this out and arrive at the most parsimonious explanation for any effect. We need more Occam’s Razor instead of the Sherlock Holmes Principle.
So my next few posts will be about scientific claims and ideas I just can’t believe. I was going to post them as a list but there turns out to be so much material that I think it’s better to expand this across several posts…
Part I: Social Priming
If you read any of my blogs before, you will know that I am somewhat critical of the current mainstream (or is it only a vocal minority?) of direct replication advocates. Note that this does not make me an opponent of replication attempts. Replication is a cornerstone of science and it should be encouraged. I am not sure if there are many people who actually disagree with that point and who truly regard the currently fashionable efforts as the work of a “replication police.” All I have been saying in the past is that the way we approach replication could be better. In my opinion a good replication attempt should come with a sanity check or some control condition/analysis that can provide evidence that the experiments were done properly and that the data are sound. Ideally, I want the distinction between replications and original research to be much blurrier than it currently is. The way I see it, both direct and indirect replications should be done regularly as part of most original research. I know that this is not always feasible and when you simply fail to replicate some previous finding and also don’t support your new hypothesis, you should obviously still be permitted to publish (certainly if you can demonstrate that the experiment worked and you didn’t just produce rubbish data).
Some reasonably credible claims
But this post isn’t about replication. Whether or not the replication attempts have been optimal, the sheer number of failed social priming (I may be using this term loosely) replications is really casting the shadow of serious doubt on that field. The reason for this, in my mind, is that it contains so many perfect examples of implausible results. This presumably does not apply to all such claims. I am quite happy to believe that some seemingly innocuous, social cues can influence human behaviour, even though I don’t know if they have actually been tested with scientific rigour. For example, there have been claims that putting little targets into urinals, such as a picture of a fly or a tiny football (or for my American readers “soccer”) goal with a swinging ball, reduces the amount of urine splattered all over the bathroom floor. I can see how this might work, if only from anecdotal self observation (not that I pee all over the bathroom floor without that, mind you). I can also believe that when an online sales website tells you “Only two left in stock!” it makes you more likely to buy then and there. Perhaps somewhat less plausible but still credible is the notion that playing classical music in London Underground stations reduces anti-social behaviour because some unsavoury characters don’t like that ambience.
An untrue (because fake) but potentially credible claim
While on the topic of train stations, the idea that people are more prone to racist stereotyping when they are in messy environments, does not seem entirely far-fetched to me. This is perhaps somewhat ironic because this is one of Diederik Stapel’s infamous fraudulent research claims. I don’t know if anyone has tried to carry out that research for real. It might very well not be a real effect at all but I could easily see how messy, graffiti-covered environments could trigger a cascade of all sorts of reactions and feelings that in turn influence your perception and behaviour to some degree. If it is real, the effect will probably be very small at the level of the general population because what one regards as an unpleasantly messy environment (and how prone one is to stereotyping) will probably differ substantially between people and between different contexts, such as the general location or the time. For instance, a train station in London is probably not perceived the same as one in the Netherlands (trust me on that one…), and a messy environment after carnival or another street party is probably not viewed in the same light as during other days. All these factors must contribute “noise” to the estimate of the population effect size.
The problem of asymmetric noise
However, this example already hints at the larger problem with determining whether or not the results from most social priming research are credible. It is possible that some effect size estimates will be stronger in the original test conditions than they will be in large-scale replication attempts. I would suspect this for most reported effects in the literature even under the best conditions and with pre-registered protocols. Usually your subject sample does not generalise perfectly to the world population or even the wider population of your country. It is perhaps impossible to completely abolish the self-selection bias induced by those people who choose to participate in research.
And sometimes the experimenter’s selection bias also makes sense as it helps to reduce noise and thus enhances the sensitivity of the experimental paradigm. For instance, in our own research using fMRI retinotopic mapping we are keen to scan people who we know are “good fMRI subjects”: people who can lie still inside the scanner and fixate perfectly for prolonged periods of time without falling asleep too much (actually even the best subjects suffer from that flaw…). If you scan someone who jitters and twitches all the time and who can’t keep their eyes open during the (admittedly dull) experiments, you can’t be surprised if your data turns out to be crap. This doesn’t tell you anything about the true effect size in the typical brain but only that it is much harder to obtain these measurements from the broader population. The same applies to psychophysics. A trained observer will produce beautifully smooth sigmoidal curves in whatever experiment you have them do. In contrast, randomers off the street will often give you a zigzag from which you can estimate thresholds only with the most generous imagination. It would be idiotic to regard the latter as more precise measurement of human behaviour. The only thing we can say is that testing a broader sample can give you greater confidence that the result truly generalises.
Turning the hands of time
Could it perhaps be possible that some of the more bizarre social priming effects are “real” in the same sense? They may just be much weaker than the original reports because in the wider population these effects are increasingly diluted by asymmetric noise factors. However, I find it hard to see how this criticism could apply when it comes to many social priming claims. What are the clean, controlled laboratory conditions that assure the most accurate measurement and strongest signal-to-noise ratio in such experiments? Take for instance the finding that people become more or less “open to new experiences” (but see this) depending on the direction they turn a crank/cylinder/turn-table. How likely is it that effects such as those reported (Cohen’s d mostly between 0.3-0.7ish) will be observed even if they are real? It seems to me that there are countless factors that will affect a person’s preference for familiar or novel experiences. Presumably if the effect of clockwise or anticlockwise (or for Americans: counterclockwise) rotation exists, it should manifest with a lot of individual differences because not everyone will be equally familiar with analogue clocks. I cannot rule out that engaging or seeing clockwise rotation activates some representation of the future. This could influence people to think about novel things. It might just as equally make them anxious: as someone who is perpetually late for appointments the sight of a ticking clock certainly mainly primes in me the thought that I am running late again. I’d not be surprised if it increased my heart rate but I’d be pretty surprised if it made me desire unfamiliar items.
Walking slowly because you read the word “Florida”
The same goes for many other social priming claims, many of which have spectacularly failed to be replicated by other researchers. Take Bargh’s finding that priming people with words related to the elderly makes them walk more slowly. I can see how thinking about old age could make you behave more like your stereotype of old people although at the same time I don’t know why you should. It might as well have the opposite effect. More importantly, there should be countless other factors that probably have a much stronger influence on your walking speed, such as your general fitness level or the time of day and the activities you were doing before. Another factor influencing you will be the excitement about what to do next, for instance whether you are going to go to work or are about to head to a party. Or, most relevant to my life probably, whether or not you are running late for your next appointment.
Under statistical assumptions we regard all of these factors as noise, that is, random variation in the subject sample. If we test enough subjects the noise should presumably cancel out and the true effect of elderly priming, tiny as it may be, should crystallise. Fair enough. But that does not answer the question of how strong an effect of priming the elderly we can realistically expect. I may very well be wrong, but it seems highly improbable to me that such a basic manipulation could produce a difference in average walking speed of one whole second (at least around an eighth of the time it took people on average to walk down the corridor). Even if the effect were really that large, it should be swamped by noise, making it unlikely that it would be statistically significant with a sample size of 15 per group. Rather the explanation offered by one replication attempt (I’ve written about this several times before) seems more parsimonious: that there was an experimenter effect in that whoever was measuring the walking speed consistently misestimated the walking speed depending on what priming condition they believed the participant had been exposed to.
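To make the point about 15 subjects per group a little more concrete, here is a rough simulation sketch. It is not a reanalysis of any actual study: the standardised effect sizes I feed in are hypothetical, the data are assumed to be normally distributed with equal variance in both groups, and the function name is made up for illustration. It simply counts how often a two-sample t-test with 15 subjects per group comes out significant for a given true effect.

```python
import math
import random
import statistics

N_PER_GROUP = 15
# Two-sided 5% critical value of Student's t with 2 * 15 - 2 = 28 degrees
# of freedom. Hard-coded, so it is only valid for N_PER_GROUP = 15.
T_CRIT_DF28 = 2.048


def simulated_power(true_d, n_sims=2000, seed=1):
    """Fraction of simulated experiments with |t| above the critical value.

    true_d is an assumed (hypothetical) standardised effect size, i.e. the
    group difference in units of the within-group standard deviation.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        primed = [rng.gauss(true_d, 1.0) for _ in range(N_PER_GROUP)]
        # With equal group sizes the pooled standard error reduces to this.
        se = math.sqrt(
            (statistics.variance(control) + statistics.variance(primed))
            / N_PER_GROUP
        )
        t = (statistics.mean(primed) - statistics.mean(control)) / se
        if abs(t) > T_CRIT_DF28:
            significant += 1
    return significant / n_sims


if __name__ == "__main__":
    for d in (0.0, 0.3, 0.5, 0.8):  # hypothetical effect sizes
        print(f"d = {d}: significant in {simulated_power(d):.0%} of runs")
```

Running this for a few assumed effect sizes gives a feel for how often a real but modest effect would even be detected with a sample this small.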
I should have professor-primed before all my exams
Even more incredible to me is the idea of “professor priming” in which exposing participants to things that remind them of professors makes them better at answering trivia questions than when they are primed with the concept of “soccer hooligans”, another finding that recently failed to be replicated. What mechanism could possibly explain such a cognitive and behavioural change? I can imagine how being primed to think about hooligans could generate all sorts of thoughts and feelings. They could provoke anxiety and stress responses. Perhaps that could make you perform worse on common knowledge questions. It’s the same mechanism that perhaps underlies stereotype threat effects (incidentally, how do those do with regard to replicability?). But this wouldn’t explain improvements in people’s performance when primed with professors.
I could see how being primed by hooligans or professors might produce some effects on your perception – perhaps judging somebody’s character by their faces etc. Perhaps you are more likely to think an average-looking person has above average intelligence when you’re already thinking about professors than when you think about hooligans (although there might just as well be contrast effects and I can’t really predict what should happen). But I find it very hard to fathom how thinking about professors should produce a measurable boost in trivia performance. Again, even if it were real, this effect should be swamped by all sorts of other factors all of which are likely to exert much greater influence on your ability to answer common knowledge questions. Presumably, common knowledge depends in the first instance on one’s common knowledge. Thinking of facts you do not have immediate access to may be helped by wakefulness and arousal. It may also help if you’re already “thinking outside the box” (I mean this figuratively – I have this vague suspicion that there is also a social priming study that claims being inside vs outside a physical box has some effect on creative thinking… (I was right – there is such a study)). You may be quicker/better at coming up with unconventional, rarely accessed information when you are already on a broad search than when you are monotonously carrying out a simple, repetitive task. But I don’t see how being primed by professors could activate such a process and produce anything but the tiniest of effects.
There was also a study that claimed that exposing subjects to a tiny American flag in the corner of the screen while they answered a survey affected their voting behaviour in the presidential election many months later. After all that I have written already, it should strike you as highly unlikely that such an effect could be established reliably. There are multitudes of factors that may influence a person’s voting behaviour, especially within the months between the critical flag experiment and election day. Surely the thousands of stars and stripes that any person residing in the United States would be exposed to on a daily basis should have some effect? I can believe that there are a great many hidden variables that govern where you make your cross on the ballot (or however they may be voting there) but I don’t think participation in a psychology experiment can produce a long-term shift in that behaviour by over 10%. If that were true, I think we would be well-advised to bring back absolutist monarchy. At least then you know who’s in charge.
Of fishy smells and ticking time bombs
One thing that many of these social priming studies have in common is that they take common folk sayings and turn them into psychology experiments. Similar claims that I just learned about this week (thanks to Alex Etz and Stuart Ritchie) are that being exposed to the smell of fish oil makes you more suspicious (because “it smells fishy”) and that the sound of a ticking clock pressures women from poor backgrounds into getting married and having children. I don’t know about you but if all of these findings are true, I feel seriously sorry for my brain. It’s constantly bombarded by conflicting cues telling it to change its perceptions and decisions on all sorts of things. It is a miracle we get anything done. Maybe it is because I am not a native English speaker but when I smell fish I think I may develop an appetite for fish. I don’t think it makes me more skeptical. Reading even more titles of studies turning proverbs into psychology studies just might though…
So what next?
I could go on as I am sure there are many more such studies but that’s beside the point. My main problem is that few research studies seem to ask whether the results they obtained are realistic. Ideally, we should start with some form of prediction of what kind of result we can even expect. To be honest, when it comes to social priming I don’t know how to go about doing this. I think it’s fine to start badly as long as someone is actually trying at all. Some thorough characterisation of the evidence to produce norm data may be useful. For instance, it would be useful to have data on general walking speeds of people leaving the lab from a larger sample so that you have a better estimate of the variability in walking speeds you could expect. If that is substantially larger than 1 second you should probably look to test a pretty large sample. Or characterise the various factors that can impact “openness to new experiences” more strongly than innocuous actions like turning a lever and then make an estimate as to how probable it is that your small social priming manipulation could exert a measurable effect with a typical sample size. Simulations could help with this. Last but definitely not least, think of more parsimonious hypotheses and either test them as part of your study or make sure that they are controlled – such as replacing the experimenter using a stopwatch with laser sensors at both ends of the corridor.
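As a starting point for the kind of back-of-the-envelope planning I mean, here is a minimal sketch of a sample size estimate. It uses the standard normal approximation for comparing two group means, so it slightly underestimates what a proper t-test calculation would give. The standard deviations of walking time in the example are invented placeholders, since the whole point is that we lack good norm data, and the function name is mine.

```python
import math
from statistics import NormalDist


def n_per_group(raw_difference, sd, alpha=0.05, power=0.8):
    """Approximate subjects needed per group to detect raw_difference.

    raw_difference: assumed true difference between group means (seconds)
    sd: assumed within-group standard deviation (seconds)
    Uses n = 2 * ((z_alpha + z_power) * sd / raw_difference) ** 2,
    the normal approximation for a two-sided, two-sample comparison.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / raw_difference) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical standard deviations of corridor walking time, in seconds.
    for sd in (1.0, 2.0, 3.0):
        print(f"SD = {sd} s: about {n_per_group(1.0, sd)} subjects per group")
```

The required sample grows with the square of the ratio between the variability and the difference you hope to detect, which is why knowing the realistic spread of walking speeds matters so much before you decide how many people to test.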
Of course, the issues I discussed here don’t apply only to social priming and future posts will deal with those topics. However, social priming has a particularly big problem. It is simply mechanistically very underdetermined. Sure, the general idea is that activating some representation or some idea can have an influence on behaviour. This essentially treats a human mind like a clean test tube just waiting for you to pour in your chemicals so you can watch the reaction. The problem is that in truth a human mind is more like a really, really filthy test tube, contaminated with blood and bacteria and dirty paw prints of all the people who fingered it before…
So I know I said I won’t write about replications any more. I want to stay true to my word for once… But then a thought occurred to me last night. Sometimes it takes minimal sleep followed by a hectic crazy day to crystallise an idea in your mind. So before I go on a well-needed break from research, social media, and (most importantly) bureaucracy, here is a post through the loophole (actually the previous one was too – I decided to split them up). It’s not about replication as such but about what “replicability” can tell us. Also it’s unusually short!
While replicability now seems widely understood to mean that a finding replicates on repeated testing, I have come to prefer the term reliability. If a finding is so fluky that it often cannot be reproduced, it is likely it was spurious in the first place. Hence it is unreliable. Most of the direct replication attempts underway now are seeking to establish the reliability of previous findings. That is fair enough. However, any divergence from the original experimental conditions will make the replication less direct.
This brings us to another important aspect of a finding – its generalisability (I have actually written about this whole issue before although in a more specific context). A finding may be highly reliable but still fail to generalise, like the coin flipping example in my previous post. In my opinion science must seek to establish both the reliability and the generalisability of hypotheses. Just like Sinatra said, “you can’t have one without the other.”
This is where I think most (not all!) currently debated replication efforts fall short. They seek to establish only reliability, which you can’t do on its own. A reliable finding that is so specific that it only occurs under very precise conditions could still lead to important theoretical advances – just like Fluke’s magnetic sand led to the invention of holographic television and hover-cars. Or, if you prefer real examples, just ask any microbiologist or single-cell electrophysiologist. Some very real effects can be very precarious.
However, it is almost certainly true that a lot of findings (especially in our field right now) are just simply unreliable and thus probably unreal. My main issue with the “direct” replication effort is that by definition it cannot ever distinguish reliability from generalisability. One could argue (and some people clearly do) that the theories underlying things like social priming are just so implausible that we don’t need to ask if they generalise. I think that is wrong. It is perfectly fine to argue that some hypothesis is implausible – I have done so myself. However, I think we should always test reliability and generalisability at the same time. If you only seek to establish reliability, you may be looking in the wrong place. If you only ask if the hypothesis generalises, you may end up chasing a mirage. Either way, you invest a lot of effort and resources but you may not actually advance scientific knowledge very much.
And this, my dear friends, Romans, and country(wo)men, will really be my final post on replication. When I’m back from my well-needed science break I will want to post about another topic inspired by a recent Neuroskeptic post.