
Irish Times OpEds are just bloody awful at science (n=1)

TL-DR: No, men are not “better at science” than women.

Clickbaity enough for you? I cannot honestly say I have read a lot of OpEds in the Irish Times so the evidence for my titular claim is admittedly rather limited. But it is still more solidly grounded in actual data than this article published yesterday in the Irish Times. At least I have one data point.

The article in question, a prime example of Betteridge’s Law, is entitled “Are men just better at science than women?“. I don’t need to explain why such a title might be considered sensationalist and controversial. The article itself is an “Opinion” piece, thus allowing the publication to disavow any responsibility for its authorship whilst allowing it to rake in the views from this blatant clickbait. In it, the author discusses some new research reporting gender differences in systemising vs empathising behaviour and puts this in the context of some new government initiative to specifically hire female professors because apparently there is some irony here. He goes on a bit about something called “neurosexism” (is that a real word?) and talks about “hard-wired” brains*.

I cannot quite discern if the author thought he was being funny or if he is simply scientifically illiterate but that doesn’t really matter. I don’t usually spend much time commenting on stuff like that. I have no doubt that the Irish Times, and this author in particular, will be overloaded with outrage and complaints – or, to use the author’s own words, “beaten up” on Twitter. There are many egregious misrepresentations of scientific findings in the mainstream media (and often enough, scientists and/or the university press releases are the source of this). But this example of butchery is just so bad and infuriating in its abuse of scientific evidence that I cannot let it slip past.

The whole argument, if this is what the author attempted, is just riddled with logical fallacies and deliberate exaggerations. I have no time or desire to go through them all. Conveniently, the author already addresses a major point himself by admitting that the study in question does not actually speak to male brains being “hard-wired” for science, but that any gender differences could be arising due to cultural or environmental factors. Not only that, he also acknowledges that the study in question is about autism, not about who makes good professors. So I won’t dwell on these rather obvious points any further. There are much more fundamental problems with the illogical leaps and mental gymnastics in this OpEd:

What makes you “good at science”?

There is a long answer to this question. It most certainly depends somewhat on your field of research and the nature of your work. Some areas require more manual dexterity, whilst others may require programming skills, and others yet call for a talent for high-level maths. As far as we can generalise, in my view necessary traits of a good researcher are: intelligence, creativity, patience, meticulousness, and a dedication to seek the truth rather than confirming theories. That last one probably goes hand-in-hand with some scepticism, including a healthy dose of self-doubt.

There is also a short answer to this question. A good scientist is not measured by their Systemising Quotient (SQ), a self-report measure that quantifies “the drive to analyze or build a rule-based system”. Academia is obsessed with metrics like the h-index (see my previous post) but even pencil pushers and bean counters** in hiring or grant committees haven’t yet proposed to use SQ to evaluate candidates***.

I suspect it is true that many scientists score high on the SQ and also the related Autism-spectrum Quotient (AQ) which, among other things, quantifies a person’s self-reported attention to detail. Anecdotally, I can confirm that a lot of my colleagues score higher than the population average on AQ. More on this in the next section.

However, none of this implies that you need to have a high SQ or AQ to be “good at science”, whatever that means. That assertion is a logical fallacy called affirming the consequent. We may agree that “systemising” characterises a lot of the activities a typical scientist engages in, but there is no evidence that this is sufficient for being a good scientist. It could mean that systemising people are attracted to science and engineering jobs. It certainly does not mean that a non-systemising person cannot be a good scientist.

Small effect sizes

I know I rant a lot about relative effect sizes such as Cohen’s d, where the mean difference is normalised by the variability. I feel that in a lot of research contexts these are given undue weight because the variability itself isn’t sufficiently controlled. But for studies like this we can actually be fairly confident that they are meaningful. The scientific study had a pretty whopping sample size of 671,606 (although that includes all their groups) and also used validation data. The residual physiologist inside me retains his scepticism about self-report questionnaire type measures, but even I have come to admit that a lot of questionnaires can be pretty effective. I think it is safe to say that the Big 5 Personality Factors or the AQ tap into some meaningful real factors. Further, whatever latent variance there may be on these measures, that is probably outweighed by collecting such a massive sample. So the Cohen’s d this study reports is probably quite informative.

What does this say? Well, the difference in SQ between males and females was d=0.31. In other words, the distributions of SQ between sexes overlap quite considerably but the distribution for males is somewhat shifted towards higher values. Thus, while the average man has a subtly higher SQ than the average woman, a rather considerable number of women will have higher SQs than the average man. The study helpfully plots these distributions in Figure 1****:

Sex diffs SQ huge N
Distributions of SQ in control females (cyan), control males (magenta), autistic females (red), and autistic males (green).

The relevant curves here are the controls in cyan and magenta. Sorry, colour vision deficient people, the authors clearly don’t care about you (perhaps they are retinasexists?). You’ll notice that the modes of the female and male distributions are really not all that far apart. More noticeable is the skew of all these distributions with a long tail to the right: Low SQs are most common in all groups (including autism) but values across the sample are spread across the full range. So by picking out a random man and a random woman from a crowd, you can be fairly confident that their SQs are both on the lower end but I wouldn’t make any strong guesses about whether the man has a higher SQ than the woman.
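To put rough numbers on that overlap, here is a minimal sketch of what d=0.31 would imply if both groups were normally distributed with equal spread. That is my assumption, not the study's: the real SQ distributions are clearly skewed, so treat the output as a ballpark only. The function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Assumption: both groups are normal with equal SD. The real SQ
# distributions are right-skewed, so these are rough figures only.
D = 0.31  # Cohen's d for the male-female SQ difference, as quoted above


def prob_woman_above_average_man(d: float) -> float:
    """Share of women whose score lies above the male mean."""
    return 1.0 - NormalDist().cdf(d)


def prob_random_woman_beats_random_man(d: float) -> float:
    """P(woman's score > man's score) for one random draw from each group."""
    # The difference of two independent unit-variance normals has SD sqrt(2).
    return NormalDist().cdf(-d / sqrt(2))


if __name__ == "__main__":
    print(f"women above the average man: {prob_woman_above_average_man(D):.3f}")
    print(f"random woman > random man:   {prob_random_woman_beats_random_man(D):.3f}")
```

Under these assumptions, well over a third of women sit above the average man, and a randomly picked woman out-scores a randomly picked man roughly four times in ten.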

However, it gets even tastier because the authors of the study actually also conducted an analysis splitting their data from controls into people in Science, Technology, Engineering, or Maths (STEM) professions compared to controls who were not in STEM. The results (yes, I know the colour code is now weirdly inverted – not how I would have done it…) show that people in STEM, whether male or female, tend to have larger SQs than people outside of STEM. But again, the average difference here is actually small and most of it plays out in the rightward tail of the distributions. The difference between males and females in STEM is also much less distinct than for people outside STEM.

Sex & STEM diffs SQ
Distributions of SQ in STEM females (cyan), STEM males (magenta), control females (red), and control males (green).

So, as already discussed in the previous section, it seems to be the case that people in STEM professions tend to “systemise” a bit more. It also suggests that men systemise more than women but that difference probably decreases for people in STEM. None of this tells us anything about whether people’s brains are “hard-wired” for systemising, if it is about cultural and environmental differences between men and women, or indeed if being trained in a STEM profession might make people more systemising. It definitely does not tell you who is “good at science”.

What if it were true?

So far so bad for those who might want to make that interpretive leap. But let’s give them the benefit of the doubt and ignore everything I said up until now. What if it were true that systemisers are in fact better scientists? Would that invalidate government or funder initiatives to hire more female scientists? Would that be bad for science?

No. Even if there were a vast difference in systemising between men and women, and between STEM and non-STEM professions, respectively, all such a hiring policy will achieve is to increase the number of good female scientists – exactly what this policy is intended to do. Let me try an analogy.

Basketball players in the NBA tend to be pretty damn tall. Presumably it is easier to dunk when you measure 2 meters than when you’re Tyrion Lannister. Even if all other necessary skills here are equal there is a clear selection pressure for tall people to get into top basketball teams. Now let’s imagine a team decided they want to hire more shorter players. They declare they will hire 10 players who cannot be taller than 1.70m. The team will have try-outs and still seek to get the best players out of their pool of applicants. If they apply an objective criterion for what makes a good player, such as the ability to score consistently, they will only hire short players with excellent aim or who can jump really high. In fact, these shorties will be on average better at aiming and/or jumping than the giants they already have on their team. The team selects for the ability to score. Shorties and Tallies get there via different means but they both get there.

In this analogy, being a top scorer is being a systemiser, which in turn makes you a good scientist. Giants tend to score high because they find it easy to reach the basket. Shorties score high because they have other skills that compensate for their lack of height. Women can be good systemisers despite not being men.

The only scenario in which such a specific hiring policy could be counterproductive is if two conditions are met: 1) The difference between groups in the critical trait (i.e., systemising) is vast and 2) the policy mandates hiring from a particular group without any objective criteria. We have already established that the former condition isn’t fulfilled here – the difference in systemising between men and women is modest at best. The latter condition is really a moot point because this is simply not how hiring works in the real world. Hiring committees don’t usually just offer jobs to the relatively best person out of the pool but also consider the candidates’ objective abilities and achievements. This is even more pertinent here because all candidates in this case will already be eligible for a professorial position anyway. So all that will in fact happen is that we end up with more female professors who will also happen to be high in systemising.
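The selection argument can be sketched as a toy simulation. Everything in it is hypothetical: the trait is a made-up "scoring ability" in arbitrary units, both pools are assumed normal with the same spread, and the group shift defaults to the d=0.31 quoted earlier purely for illustration.

```python
import random


def simulate_hiring(d=0.31, pool_size=1000, n_hires=10, seed=1):
    """Hire the best n_hires applicants from the lower-mean group only."""
    rng = random.Random(seed)
    # Hypothetical ability scores: group A has mean 0, group B has mean d.
    group_a = [rng.gauss(0.0, 1.0) for _ in range(pool_size)]
    group_b = [rng.gauss(d, 1.0) for _ in range(pool_size)]
    # Objective criterion: rank group A applicants and take the top ones.
    hires = sorted(group_a, reverse=True)[:n_hires]
    return {
        "mean_hired_a": sum(hires) / n_hires,
        "weakest_hire_a": hires[-1],
        "mean_pool_b": sum(group_b) / pool_size,
    }


if __name__ == "__main__":
    for key, value in simulate_hiring().items():
        print(f"{key}: {value:.2f}")
```

As long as the hires are picked by an objective criterion, even the weakest of them sits far above the average of the other group. Restricting the pool changes who gets hired, not whether they are any good.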

Bad science reporting

Again, this previous section is based on the entirely imaginary and untenable assumption that systemisers are better scientists. I am not aware of any evidence of that – in part because we cannot actually quantify very well what makes a good scientist. The metrics academics actually (and sadly) use for hiring and funding decisions probably do not quantify that either but I am not even aware of any link between systemising and those metrics. Is there a correlation between h-indices (relative to career age) and SQ? I doubt it.

What we have here is a case of awful science reporting. Bad science journalism and the abuse of scientific data for nefarious political purposes are hardly a new phenomenon – and this won’t just disappear. But the price of freedom (to practice science) is eternal vigilance. I believe as scientists we have a responsibility to debunk such blatant misapprehensions by journalists who I suspect have never even set foot in an actual lab or spoken to any actual scientists.

Some people assert that improving the transparency and reliability of research will hurt the public’s faith in science. Far from it, I believe those things can show people how science really works. The true damage to how the public perceives science is done by garbage articles in the mainstream media like this one – even if it is merely offered as an “opinion”.

1280px-tyson_chandler
By Keith Allison

*) Brains are not actually hard-wired to do anything. Leaving the old Hebbian analogy aside, brains aren’t wired at all, period. They are soft, squishy, wet sponges containing lots of neuronal and glial tissue plus blood vessels. Neurons connect via synapses between axons and dendrites and this connectivity is constantly regulated and new connections grown while others are pruned. This adaptability is one of the main reasons why we even have brains, and lies at the heart of the intelligence, ingenuity, and versatility of our species.

**) I suspect a lot of the pencil pushers and bean counters behind metrics like impact factors or the h-index might well be Systemisers.

***) I hope none of them read this post. We don’t want to give these people any further ideas…

****) Isn’t open access under Creative Commons license great?

P-values are uniformly distributed when the null hypothesis is true

TL-DR: If the title of this blog post is unsurprising to you, I suggest you go play outside.

Many discussions in my science social media bubble circle around p-values (what an exciting life I lead…). Just a few days ago, there was a big kerfuffle about p-curving and whether p-values just below 0.05 are a sign of whatever. One of the main concepts behind p-curves is that under the assumption that the null hypothesis (H0) of no effect/difference is true, p-values should be uniformly distributed (at least as long as the test assumptions are met reasonably). This once again supported my suspicions that most people don’t actually know what p-values mean. Reports of people defining p-values incorrectly abound, sometimes even in stats textbooks. It also seems to me that people find p-values rather unintuitive. And I get the impression a lot of people vastly overestimate how widely known things like p-curve actually are.

A few weeks ago I got embroiled in a Facebook discussion. A friend of mine was running a permutation analysis to test something about his experiment and found something very odd: the distribution of p-values was skewed severely to the left – there were very few low p-values but the proportion was steadily increasing with most p-values being just below 1. He expected this distribution to be uniform because under the random permutations H0 should be true. A lot of commenters on his post seemed rather surprised and/or confused by the whole idea that p-values should be distributed uniformly when H0 is true. “Surely,” so the common intuition goes, “when there is actually no difference, most p-values should be high and close to 1?”

No, and the reason why not is the p-value itself. A p-value can be calculated/estimated in many different ways. Most people use parametric tests but essentially they all share one philosophy. If you have no underlying effect and randomly sample data ad infinitum you end up with a distribution of test statistics. In my example, I draw two variables each with n=100 from a normal distribution and calculate the Pearson correlation between them – and I repeat this 20,000 times. This produces a distribution of correlation coefficients like this:

Rs0

There is no correlation between two random variables (H0 is true) and so the distribution is centred on zero. The spread of the distribution depends on the sample size. Larger samples will produce narrower distributions. Critically, we can use this distribution to get a p-value. If we had observed a correlation of r=0.3 in our experiment, we could calculate the proportion of correlation coefficients in this distribution that are equal to or greater than 0.3. This would give us a one-tailed p-value. If you ignore the sign of the correlation, you get a two-tailed p-value.

In the plot above, I coloured the 5% most extreme correlation coefficients in blue (2.5% to the left and to the right, respectively). These regions are abutted by vertical red lines at just below +/-0.2 in this case. This reflects the critical effect size needed to get p<0.05 – only 5% of the correlation coefficients in this distribution are +/-0.19ish or even more extreme.
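For anyone who wants to try this at home, here is a rough sketch of that kind of simulation using only the Python standard library. It is not the code behind the plots: the function names are invented for illustration, and only the settings (n=100, 20,000 repeats) come from the description above.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def null_correlations(n=100, n_sims=20000, seed=0):
    """Correlations between pairs of independent normal samples (H0 true)."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson_r(x, y))
    return rs


def empirical_p(null_rs, r_obs, two_tailed=True):
    """Proportion of null correlations at least as extreme as r_obs."""
    if two_tailed:
        hits = sum(abs(r) >= abs(r_obs) for r in null_rs)
    else:
        hits = sum(r >= r_obs for r in null_rs)
    return hits / len(null_rs)


def critical_r(null_rs, alpha=0.05):
    """Approximate |r| exceeded by a fraction alpha of the null distribution."""
    abs_sorted = sorted(abs(r) for r in null_rs)
    index = min(len(abs_sorted) - 1, int((1 - alpha) * len(abs_sorted)))
    return abs_sorted[index]


if __name__ == "__main__":
    rs = null_correlations()  # takes a few seconds in pure Python
    print("critical |r| for p<0.05:", round(critical_r(rs), 3))
    print("two-tailed p for r=0.3: ", empirical_p(rs, 0.3))
    print("one-tailed p for r=0.3: ", empirical_p(rs, 0.3, two_tailed=False))
```

The critical value should land just below 0.2, in line with the red lines in the plot, although the exact figure will wobble a little from run to run.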

Now compare this to the region coloured in red. This region also makes up 5% of the whole distribution. However, the red region surrounds zero, that is, those correlation coefficients that are really close to the true correlation value. Random chance makes the distribution spread out (and that becomes more severe when your sample size is low) but most of the correlations will nevertheless be close to the true value of zero. Therefore, the range of values in this red region is much narrower because the values are much denser here.

But of course these nigh-zero correlation coefficients will have the largest p-values. Consider again what a p-value reflects. If your observed correlation is 0.006 and you again ignore the sign of the effects, almost all correlations in this null distribution would be equal to or greater than 0.006. So this proportion, the p-value, is almost 1. Put in other words, 5% of low p-values below 0.05 are from the long, thin tails of the null distribution, while 5% of really high p-values above 0.95 are from a really narrow sliver of the null distribution near zero:

Ps0

Visualised the same way, you have the blue region with p<0.05 on the left. Here correlations are large (greater than 0.19ish). On the right, you have the red region with p>0.95. Here correlations are really close to zero.

In other words, you can directly read off the p-value from the x-axis of this distribution of p-values. This is a direct consequence of what p-values represent. They are the proportion of values in the null distribution where correlations are equal to or more extreme than the observed correlation.
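A quick way to check the uniformity for yourself is to simulate lots of experiments and count how many p-values fall into each tenth of the range. The sketch below is an illustration under my own assumptions, not the analysis behind the figures: to stay within the standard library it gets p-values from the Fisher z-transform (a normal approximation that should be decent at n=100) rather than an exact test, and all function names are made up.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Approximate two-tailed p-value for r via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_p_values(rho=0.0, n=100, n_sims=5000, seed=0):
    """P-values from repeated experiments with true correlation rho."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - rho ** 2)
    ps = []
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # y shares a fraction rho of its signal with x; rho=0 means H0 is true
        y = [rho * xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]
        ps.append(fisher_p_value(pearson_r(x, y), n))
    return ps


def bin_proportions(ps, n_bins=10):
    """Share of p-values falling into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for p in ps:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [c / len(ps) for c in counts]


if __name__ == "__main__":
    print("H0 true:", [round(b, 2) for b in bin_proportions(simulate_p_values())])
```

With `rho=0.0` every bin should hold roughly a tenth of the p-values. The same function accepts a non-zero `rho`, in which case the lowest bin fills up at the expense of all the others.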

Of course, if the null hypothesis is false and there actually is a correlation between the two variables this distribution must become skewed. There should now be many more tests with low p-values than with large ones. This is exactly what happens and this is the pattern that analyses like p-curve seek to detect:

Ps1

Now, my friend’s p-distribution looked essentially like the mirror image of this. I still haven’t learned what could have possibly caused this. It would mean that more effect sizes were close to zero than there should be under H0. This could suggest some assumptions not being met but none of my own feeble simulations managed to reproduce the pattern he found. His analyses sounded quite complex so there is a possibility that there were some complex errors in it.

 

Is d>10 a plausible effect size?

TL;DR: You may get a very large relative effect size (like Cohen’s d), if the main source of the variability in your sample is the reliability of each observation and the measurement was made as exact as is feasible. Such a large d is not trivial, but in this case talking about d is missing the point.

In discussions of scientific findings you will often hear talk about relative effect sizes, like the ubiquitous Cohen’s d. Essentially, such effect sizes quantify the mean difference between groups/treatments/conditions relative to the variability across subjects/observations. The situation is actually a lot more complicated because even for a seemingly simple result like the difference between conditions you will find that there are several ways of calculating the effect size. You can read a nice summary by Jake Westfall here. There are also other effect sizes, such as correlation coefficients, and what I write here applies to those, too. I will however stick to the difference-type effect size because it is arguably the most common.

One thing that has irked me about those discussions for some years is that this ignores a very substantial issue: the between-subject variance of your sample depends on the within-subject variance. The more unreliable the measurement of each subject, the greater is the variability of your sample. Thus the reliability of individual measurements limits the relative effect size you can possibly achieve in your experiment given a particular experimental design. In most of science – especially biological and psychological sciences – the reliability of individual observations is strongly limited by the measurement error and/or the quality of your experiment.
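One way to make this dependence explicit is a simple additive model, which is my own simplifying assumption here: each subject's score is the average of $n$ independent trials, $\sigma^2_{\text{between}}$ is the variance of the true scores across subjects, $\sigma^2_{\text{within}}$ is the trial-to-trial variance within a subject, and $\Delta$ is the mean difference. The symbols are mine, chosen for illustration.

$$
\sigma^2_{\text{obs}} = \sigma^2_{\text{between}} + \frac{\sigma^2_{\text{within}}}{n},
\qquad
d_{\text{obs}} = \frac{\Delta}{\sqrt{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}/n}}
$$

With few trials the second term dominates and drags $d_{\text{obs}}$ down. As $n$ grows, $d_{\text{obs}}$ approaches $\Delta / \sigma_{\text{between}}$, which can be very large when subjects genuinely differ little from one another.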

There are some standard examples that are sometimes used to illustrate what a given effect size means. I stole a common one from this blog post about the average height difference between men and women, which apparently was d=1.482 in 1980 Spain. I have no idea if this is true exactly but that figure should be in the right ballpark. I assume most people will agree that men are on average taller than women but that there is nevertheless substantial overlap in the distributions – so that relatively frequently you will find a woman who is taller than many men. That is an effect size we might consider strong.

The height difference between men and women is a good reference for an effect size because it is largely limited by the between-subject variance, the variability in actual heights across the population. Obviously, the reliability of each observation also plays a role. There will definitely be a degree of measurement error. However, I suspect that this error is small, probably on the order of a few millimeters. Even if you’re quite bad at this measurement I doubt you will typically err by more than 1-2 cm and you can probably still replicate this effect in a fairly small sample. However, in psychology experiments your measurement rarely is that accurate.

Now, in some experiments you can increase the reliability of your individual measurement by increasing the number of trials (at this point I’d like to again refer to Richard Morey’s related post on this topic). In psychophysics, collecting hundreds or thousands of trials on one individual subject is not at all uncommon. Let’s take a very simple case. Contour integration refers to the ability of the visual system to detect “snake” contours better than “ladder” contours or those defined by other orientations (we like to call those “ropes”):

 

In the left image you should hopefully see a circle defined by 16 grating patches embedded in a background of randomly oriented gratings. This “snake” contour pops out from the background because the visual system readily groups orientations along a collinear (or cocircular) path into a coherent object. In contrast, when the contour is defined by patches of other orientations, for example the “rope” contour in the right image which is defined by patches at 45 degrees relative to the path, then it is much harder to detect the presence of this contour. This isn’t a vision science post so I won’t go into any debates on what this means. The take-home message here is that if healthy subjects with normal vision are asked to determine the presence or absence of a contour like this, especially with limited viewing time, they will perform very well for the “snake” contours but only barely above chance levels for the “rope” contours.

This is a very robust effect and I’d argue this is quite representative of many psychophysical findings. A psychophysicist probably wouldn’t simply measure the accuracy but conduct a broader study of how this depends on particular stimulus parameters – but that’s not really important here. It is still pretty representative.

What is the size of this effect? 

If I study this in a group of subjects, the relative effect size at the group level will depend on how accurately I measure the performance in each individual. If I have 50 subjects (which is 10-25 times larger than your typical psychophysics study…) and each performs just one trial, then the sample variance will be much larger than if each of them does 100 trials or 1000 trials. As a result, the Cohen’s d of the group will be considerably different. A d>10 should be entirely feasible if we collect enough trials per person.

People will sometimes say that large effects (d>>2 perhaps) are trivial. But there is nothing trivial about this. In this particular example you may see the difference quite easily for yourself (so you are a single-subject and single-trial replication). But we might want to know just how much better we are at detecting the snake than the rope contours. Or, as I already mentioned, a psychophysicist might measure the sensitivity of subjects to various stimulus parameters in this experiment (e.g., the distance between patches, the amount of noise in the orientations we can tolerate, etc) and this could tell us something about how vision works. The Cohen’s d would be pretty large for all of these. That does not make it trivial but in my view it makes it useless:

Depending on my design choices the estimated effect size may be a very poor reflection of the true effect size. As mentioned earlier, the relative effect size is directly dependent on the between-subject variance – but that in turn depends on the reliability of individual measurements. If each subject only does one trial, the effect of just one attentional lapse or accidental button press in the task is much more detrimental than when they perform 1000 trials, even if the overall rate of lapses/accidents is the same*.
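Here is a toy simulation of that point. All the numbers are invented for illustration: I assume 50 subjects whose true accuracy is around 95% for snakes and 55% for ropes with only a small spread between people, and I compute a paired Cohen's d (mean difference divided by the SD of the differences), which is just one of the several possible definitions.

```python
import random
from statistics import mean, stdev


def clip(value, low, high):
    """Keep a probability inside a sensible range."""
    return max(low, min(high, value))


def simulated_cohens_d(n_trials, n_subjects=50, seed=0):
    """Paired Cohen's d for snake vs rope accuracy with n_trials per condition."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_subjects):
        # Hypothetical true accuracies; subjects differ only slightly (SD 0.02)
        p_snake = clip(rng.gauss(0.95, 0.02), 0.5, 0.99)
        p_rope = clip(rng.gauss(0.55, 0.02), 0.5, 0.99)
        # Observed accuracy is the proportion correct over n_trials
        acc_snake = sum(rng.random() < p_snake for _ in range(n_trials)) / n_trials
        acc_rope = sum(rng.random() < p_rope for _ in range(n_trials)) / n_trials
        diffs.append(acc_snake - acc_rope)
    return mean(diffs) / stdev(diffs)


if __name__ == "__main__":
    for n_trials in (1, 10, 100, 1000):
        print(f"{n_trials:>5} trials per condition: d = {simulated_cohens_d(n_trials):.2f}")
```

The underlying difference in accuracy is identical in every run. Only the number of trials changes, yet under these assumptions d climbs from below 1 with a single trial into double digits with a thousand.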

Why does this matter?

In many experiments, the estimate of between-subject variance will be swamped by the within-subject variability. Returning to the example of gender height differences, this is essentially what would happen if you chose to eyeball each person’s height instead of using a tape measure. I’d suspect that is the case for many experiments in social or personality psychology where each measurement is essentially a single quantity (say, timing the speed with which someone walks out of the lab in a priming experiment) rather than being based on hundreds or thousands of trials as in psychophysics. Notoriously noisy measurements are also doubtless the major limiting factor in most neuroimaging experiments. On the other hand, I assume a lot of questionnaire-type results you might have in psychology (such as IQ or the Big Five personality factors) have actually pretty high test-retest reliability and so you probably do get mostly the between-subject variance.

The problem is that often it is very difficult to determine which scenario we are in. In psychophysics, we are often so extremely dominated by the measurement reliability that a knowledge of the “true” population effect size is actually completely irrelevant. This is a critical issue because you cannot use such an effect size for power analysis: If I take an experiment someone did and base my power analysis on the effect size they reported, I am not really powering my experiment to detect a similar effect but a similar design. (This is particularly useless if I then decide to use a different design…)

So next time you see an unusually large Cohen’s d (d>10 or even d>3) ask yourself not simply whether this is a plausible effect but whether this experiment can plausibly estimate the true population effect. If this result is based on a single observation per subject with a highly variable measurement (say, how often Breton drivers stop for female hitchhikers wearing red clothing…), even a d=1 seems incredibly large.

But if it is for a measurement that could have been made more reliable by doubling the amount of data collected in each subject (say, a change in psychometric thresholds), then a very high Cohen’s d is entirely plausible – but it is also pretty meaningless. In this situation, what we should really care about is the absolute effect size (How much does the threshold change? How much does the accuracy drop? etc).

And I must say, I remain unsure whether absolute effect sizes aren’t more useful in general, including for experiments on complex human behaviour, neuroimaging, or drug effects.

* Actually the lapse rate probably increases with a greater number of trials due to subject fatigue, drop in motivation, or out of pure spite. But even that increase is unlikely to be as detrimental as having too few trials.

Of hacked peas and crooked teas

The other day, my twitter feed got embroiled in another discussion about whether or not p-hacking is deliberate and if it constitutes fraud. Fortunately, I then immediately left for a trip abroad and away from my computer, so there was no danger of me being drawn into this debate too deeply and running the risk of owing Richard Morey another drink. However, now that I am back I wanted to elaborate a bit more on why I think the way our field has often approached p-hacking is both wrong and harmful.

What the hell is p-hacking anyway? When I google it I get this Wikipedia article, which uses it as a synonym for “data dredging”. There we already have a term that seems to me more appropriate. P-hacking refers to when you massage your data and analysis methods until your result reaches a statistically significant p-value. I will put it to you that in practice most p-hacking is not necessarily about hacking p-s but about dredging your data until your results fit a particular pattern. That may be something you predicted but didn’t find or could even just be some chance finding that looked interesting and is amplified this way. However, the p-value is usually probably secondary to the act here. The end result may very well be the same in that you continue abusing the data until a finding becomes significant, but I would bet that in most cases what matters to people is not the p-value but the result. Moreover, while null-hypothesis significance testing with p-values is still by far the most widespread way to make inferences about results, it is not the only way. All this fussing about p-hacking glosses over the fact that the same analytic flexibility or data dredging can be applied to any inference, whether it is based on p-values, confidence intervals, Bayes factors, posterior probabilities, or simple summary statistics. By talking of p-hacking we create a caricature that this is somehow a problem specific to p-values. Whether or not NHST is the best approach for making statistical inferences is a (much bigger) debate for another day – but it has little to do with p-hacking.
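To see how quickly this kind of flexibility inflates false positives, here is a toy simulation of one particular fork: measuring several outcomes and reporting whichever one "worked". The setup is deliberately artificial and entirely my own: two groups with no true difference on any outcome, independent outcomes, and a z-test with known SD so that the standard library suffices.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(n_outcomes, n_per_group=20, n_sims=2000, alpha=0.05, seed=0):
    """Share of null experiments in which at least one outcome reaches p < alpha."""
    rng = random.Random(seed)
    norm = NormalDist()
    # Standard error of a difference between two group means (known SD of 1)
    se = math.sqrt(2.0 / n_per_group)
    hits = 0
    for _ in range(n_sims):
        best_p = 1.0
        for _ in range(n_outcomes):
            # Both groups come from the same distribution, so H0 is true
            mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
            mean_b = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
            z = (mean_a - mean_b) / se
            p = 2.0 * (1.0 - norm.cdf(abs(z)))
            best_p = min(best_p, p)  # keep only the most impressive outcome
        hits += best_p < alpha
    return hits / n_sims


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(f"{k:>2} outcomes to choose from: false positive rate = {false_positive_rate(k):.3f}")
```

With a single outcome the rate should hover around the nominal 5%. With five independent outcomes the expected rate is 1 - 0.95^5, or about 23%. Nothing here requires p-values as such: picking the most impressive of several results would inflate any other summary of the evidence in the same way.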

What is more, not only is p-hacking not really about p’s but it is also not really about hacking. Here is the dictionary entry for the term ‘hacking‘. I think we can safely assume that when people say p-hacking they don’t mean that peas are physically being chopped or cut or damaged in any way. I’d also hazard a guess that it’s not meant in the sense of “to deal or cope with” p-values. In fact, the only meaning of the term that seems to come even remotely close is this:

“to modify a computer program or electronic device in a skillful or clever way”

Obviously, what is being modified in p-hacking is the significance or impressiveness of a result, rather than a computer program or electronic device, but we can let this slide. I’d also suggest that it isn’t always done in a skillful or clever way either, but perhaps we can also ignore this. However, the verb ‘hacking’ to me implies that this is done in a very deliberate way. It may even, as with computer hacking, carry the connotation of fraud, of criminal intent. I believe neither of these things is true about p-hacking.

That is not to say that p-hacking isn’t deliberate. I believe in many situations it likely is. People no doubt make conscious decisions when they dig through their data. But the overwhelming majority of p-hacking is not deliberately done to create spurious results that the researcher knows to be false. Anyone who does so would be committing actual fraud. Rather, most p-hacking is the result of confirmation bias combined with analytical flexibility. This leads people to sleepwalk into creating false positives or – as Richard Feynman would have called it – fooling themselves. Simine Vazire already wrote an excellent post about this a few years ago (and you may see a former incarnation of yours truly in the comment section arguing against the point I’m making here… I’d like to claim that it’s because I have grown as a person but in truth I only exorcised this personality :P). I’d also guess that a lot of p-hacking happens out of ignorance, although that excuse really shouldn’t fly as easily in 2017 as it may have done in 2007. Nevertheless, I am pretty sure people do not normally p-hack because they want to publish false results.

Some may say that it doesn’t matter whether or not p-hacking is fraud – the outcome is the same: many published results are false. But in my view it’s not so simple. First, the solution to these two problems surely isn’t the same. Preregistration and transparency may very well solve the problem of analytical flexibility and data dredging – but they are not going to stop deliberate fraud, nor are they meant to. Second, actively conflating fraud and data dredging implicitly accuses researchers of being deliberately misleading and thus automatically puts them on the defensive. This is hardly a way to have a productive discussion and convince people to do something about p-hacking. You don’t have to look very far for examples of that playing out. Several protracted discussions on a certain Psychology Methods Facebook group come to mind…

Methodological flexibility is a real problem. We definitely should do something about it and new moves towards preregistration and data transparency are at least theoretically effective solutions to improve things. The really pernicious thing about p-hacking is that people are usually entirely unaware of the fact that they are doing it. Until you have tried to do a preregistered study, you don’t appreciate just how many forks in the road you passed along the way (I may blog about my own experiences with that at some point). So implying, however unintentionally, that people are fraudsters is not helping matters.

Preregistration and data sharing have gathered a lot of momentum over the past few years. Perhaps the opinions of some old tenured folks opposed to such approaches no longer carry so much weight now, regardless of how powerful they may be. But I’m not convinced that this is true. Just because there has been momentum now does not mean that these ideas will prevail. It is just as likely that they fizzle out for lack of enthusiasm or because people begin to feel that the effort isn’t worth it. It seems to me that “open science” very much exists in a bubble and I have bemoaned that before. To change scientific practices we need to open the hearts and minds of sceptics to why p-hacking is so pervasive. I don’t believe we will achieve that by preaching to them. Everybody p-hacks if left to their own devices. Preregistration and open data can help protect you against your mind’s natural tendency to perceive patterns in noise. A scientist’s training is all about developing techniques to counteract this tendency, and so open practices are just another tool for achieving that purpose.

[Image: fish, chips and mushy peas]
There is something fishy about those pea values…

 

Chris Chambers is a space alien

Imagine you are a radio astronomer and you suddenly stumble across a signal from outer space that appears to be evidence of an extra-terrestrial intelligence. Let’s also assume you already confidently ruled out any trivial artifactual explanation to do with naturally occurring phenomena or defective measurements. How could you confirm that this signal isn’t simply a random fluke?

This is actually the premise of the novel Contact by Carl Sagan, which happens to be one of my favorite books (I never watched the movie but only caught the end which is nothing like the book so I wouldn’t recommend it…). The solution to this problem proposed in the book is that one should quantify how likely the observed putative extraterrestrial signal would be under the assumption that it is the product of random background radiation.

This is basically what a p-value in frequentist null hypothesis significance testing represents. Using frequentist inference requires that you have a pre-specified hypothesis and a pre-specified design. You should have an effect size in mind, determine how many measurements you need to achieve a particular statistical power, and then you must carry out this experiment precisely as planned. This is rarely how real science works and it is often put forth as one of the main arguments why we should preregister our experimental designs. Any analysis that wasn’t planned a priori is by definition exploratory. The most extreme form of this argument posits that any experiment that hasn’t been preregistered is exploratory. While I still find it hard to agree with this extremist position, it is certainly true that analytical flexibility distorts the inferences we can make about an observation.

This proposed frequentist solution is therefore inappropriate for confirming our extraterrestrial signal. Because the researcher stumbled across the signal, the analysis is by definition exploratory. Moreover, you must also beware of the base-rate fallacy: even an event extremely unlikely under the null hypothesis is not necessarily evidence against the null hypothesis. Even if p=0.00001, a true extraterrestrial signal may be even less likely, say, p=10^-100. Even if extra-terrestrial signals are quite common, given the small amount of space, time, and EM bands we have studied thus far, how probable is it we would just stumble across a meaningful signal?
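To make the base-rate point concrete, here is a minimal sketch of Bayes’ rule applied to the alien signal. It loosely treats the p-value as the probability of the observation under the null, which is a simplification, and it assumes that a genuine transmission would always produce the observation. The function name and both likelihood values are my own illustrative assumptions; only the two probabilities 0.00001 and 10^-100 come from the text.

```python
def posterior_probability(prior, p_data_given_signal, p_data_given_noise):
    """Bayes' rule for two competing explanations of the same observation."""
    evidence_for = prior * p_data_given_signal
    evidence_against = (1.0 - prior) * p_data_given_noise
    return evidence_for / (evidence_for + evidence_against)


# Assumption for illustration: a real transmission always yields this
# observation, while background noise yields it with probability 1e-5.
likelihood_signal = 1.0
likelihood_noise = 1e-5

# 1e-100 is the prior mentioned in the text; 0.5 is a contrast case in
# which aliens are considered as plausible as noise before looking.
for prior in (1e-100, 0.5):
    post = posterior_probability(prior, likelihood_signal, likelihood_noise)
    print(f"prior = {prior:g} -> posterior = {post:.3g}")
```

With the tiny prior the posterior stays astronomically small despite the impressive p-value, whereas with an even prior the very same observation would be close to conclusive. The data are identical in both cases; only the base rate differs.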

None of that means that exploratory results aren’t important. I think you’d agree that finding credible evidence of an extra-terrestrial intelligence capable of sending radio transmissions would be a major discovery. The other day I met up with Rob McIntosh, one of the editors for Registered Reports at Cortex, to discuss the distinction between exploratory and confirmatory research. A lot of the criticism of preregistration focuses on whether it puts too much emphasis on hypothesis-driven research and whether it in turn devalues or marginalizes exploratory studies. I have spent a lot of time thinking about this issue and (encouraged by discussions with many proponents of preregistration) I have come to the conclusion that the opposite is true: by emphasizing which parts of your research are confirmatory I believe exploration is actually valued more. The way scientific publishing conventionally works, many studies are written up as if they were hypothesis-driven when in truth they weren’t. Probably for a lot of published research the truth lies somewhere in the middle.

So preregistration just keeps you honest with yourself and if anything it allows you to be more honest about how you explored the data. Nobody is saying that you can’t explore, and in fact I would argue you should always include some exploration. Whether it is an initial exploratory experiment that you did that you then replicate or test further in a registered experiment, or whether it is a posthoc robustness test you do to ensure that your registered result isn’t just an unforeseen artifact, some exploration is almost always necessary. “If we knew what we were doing, it would not be called research, would it?” (a quote by Albert Einstein, apparently).

One idea I discussed with Rob is whether there should be a publication format that specifically caters to exploration (Chris Chambers has also mentioned this idea previously). Such Exploratory Reports would allow researchers to publish interesting and surprising findings without first registering a hypothesis. You may think this sounds a lot like what a lot of present day high impact papers are like already. The key difference is that these Exploratory Reports would contain no inferential statistics and critically they are explicit about the fact that the research is exploratory – something that is rarely the case in conventional studies. However, this idea poses a critical challenge: you want to ensure that the results presented in such a format are trustworthy, but how do you ensure this without inferential statistics?

Proponents of the New Statistics (which aren’t actually “new” and it is also questionable whether you should call them “statistics”) will tell you that you could just report the means/medians and confidence intervals, or perhaps the whole distributions of data. But that isn’t really helping. Inspecting confidence intervals and how far they are from zero (or another value of no interest) is effectively the same thing as a significance test. Even merely showing the distribution of observations isn’t really helping. If a result is so blatantly obvious that it convinces you by visual inspection (the “inter-ocular trauma test”), then formal statistical testing would be unnecessary anyway. If the results are even just a little subtler, it can be very difficult to decide whether the finding is interesting. So the way I see it, we either need a way to estimate statistical evidence, or we need to follow up the finding with a registered, confirmatory experiment that specifically seeks to replicate and/or further test the original exploratory finding.

In the case of our extra-terrestrial signal you may plan a new measurement. You know the location in the sky where the signal came from, so part of your preregistered methods is to point your radio telescope at the same point. You also have an idea of the signal strength, which allows you to determine the number of measurements needed to have adequate statistical power. Then you carry out this experiment, sticking meticulously to your planned recipe. Finally, you report your result and the associated p-value.

Sounds good in theory. In practice, however, this is not how science typically works. Maybe the signal isn’t continuous. There could be all sorts of reasons why the signal may only be intermittent, be it some interstellar dust clouds blocking the line of transmission, the transmitter pointing away from Earth due to the rotation of the aliens’ home planet, or even simply the fact that the aliens are operating their transmitter on a random schedule. We know nothing about what an alien species, let alone their civilization, may be like. Who is to say that they don’t just fall into random sleeping periods in irregular intervals?

So some exploratory, flexible analysis is almost always necessary. If you are too rigid in your approach, you are very likely to miss important discoveries. At the same time, you must be careful not to fool yourself. If we are really going down the route of Exploratory Reports without any statistical inference we need to come up with a good way to ensure that such exploratory findings aren’t mostly garbage. I think in the long run the only way to do so is to replicate and test results in confirmatory studies. But this could already be done as part of a Registered Report in which your design is preregistered. Experiment 1 would be exploratory without any statistical inference but simply reporting the basic pattern of results. Experiment 2 would then be preregistered and replicate or test the finding further.

However, Registered Reports can take a long time to publish. This may in fact be one of the weak points of this format that may stop the scientific community from becoming more enthusiastic about it. As long as there is no real incentive for doing slow science, the idea that you may take two or three years to publish one study is not going to appeal to many people. It will stop early career researchers from getting jobs and research funding. It also puts small labs in poorer universities at a considerable disadvantage compared to researchers with big grants, big data, and legions of research assistants.

The whole point of Exploratory Reports would be to quickly push out interesting observations. In some ways, this is then exactly what brief communications in high impact journals are currently for. I don’t think it will serve us well to swap snappy (and likely untrue) high impact findings based on inappropriate statistical inferences for snappy (and likely untrue) exploratory findings without statistical inference. If the purpose of Exploratory Reports is solely to provide an outlet for quick publication of interesting results, we still have the same kind of skewed incentive structure as now. Also, while removing statistical inference from our exploratory findings may be better statistical practice, I am not convinced that it is better scientific practice unless we have other ways of ensuring that these exploratory results are kosher.

The way I see it, the only way around this dilemma is to finally stop treating publications as individual units. Science is by nature a lengthy, incremental process. Yes, we need exciting discoveries to drive science forward. At the same time, replicability and robustness of our discoveries are critical. In order to combine these two needs I believe research findings should not be seen as separate morsels but as a web of interconnected results. A single Exploratory Report (or even a bunch of them) could serve as the starting point. But unless they are followed up by Registered Reports replicating or scrutinizing these findings further, they are not all that meaningful. Only once replications and follow-up experiments have been performed does the whole body of a finding take shape. A search on PubMed or Google Scholar would not merely spit out the original paper but a whole tree of linked experiments.

The perceived impact and value of a finding thus would be related to how much of an interconnected body of evidence it has generated rather than whether it was published in Nature or Science. Critically, this would allow people to quickly publish their exciting finding and thus avoid being deadlocked by endless review processes and disadvantaged compared to other people who can afford to do more open science. At the same time, they would be incentivized to conduct follow-up studies. Because a whole body of related literature is linked, it would also be an incentive for others to conduct replications or follow-up experiments on your exploratory finding.

There are obviously logistical and technical challenges with this idea. The current publication infrastructure still does not really allow for this to work. This is not a major problem, however. It seems entirely feasible to implement such a system. The bigger challenge is how to convince the broader community and publishers and funders to take this on board.

[Image: the Arecibo message]

Boosting power with better experiments

Probably one of the main reasons for the low replicability of scientific studies is that many previous studies have been underpowered – or rather that they only provided inconclusive evidence for or against the hypotheses they sought to test. Alex Etz had a great blog post on this with regard to replicability in psychology (and he published an extension of this analysis that takes publication bias into account as a paper). So it is certainly true that as a whole researchers in psychology and neuroscience can do a lot better when it comes to the sensitivity of their experiments.

A common mantra is that we need larger sample sizes to boost sensitivity. Statistical power is a function of the sample size and the expected effect size. There is a lot of talk out there about what effect size one should use for power calculations. For instance, when planning a replication study, it has been suggested that you should more than double the sample size of the original study. This is supposed to take into account the fact that published effect sizes are probably skewed upwards due to publication bias and analytical flexibility, or even simply because the true effect happens to be weaker than originally reported.

However, what all these recommendations neglect to consider is that standardized effect sizes, like Cohen’s d or a correlation coefficient, are also dependent on the precision of your observations. By reducing measurement error or other noise factors, you can literally increase the effect size. A higher effect size means greater statistical power – so with the same sample size you can boost power by improving your experiment in other ways.

Here is a practical example. Imagine I want to correlate the height of individuals measured in centimeters and inches. This is a trivial case – theoretically the correlation should be perfect, that is, ρ = 1. However, measurement error will spoil this potential correlation somewhat. I have a sample size of 100 people. I first ask my auntie Angie to guess the height of each subject in centimeters. To determine their heights in inches, I then take them all down the pub and ask this dude called Nigel to also take a guess. Both Angie and Nigel will misestimate heights to some degree. For simplicity, let’s just say that their errors are on average the same. This nonetheless means their guesses will not always agree very well. If I then calculate the correlation between their guesses, it will obviously have to be lower than 1, even though this is the true correlation. I simulated this scenario below. On the x-axis I plot the amount of measurement error in cm (the standard deviation of Gaussian noise added to the actual body heights). On the y-axis I plot the median observed correlation and the shaded area is the 95% confidence interval over 10,000 simulations. As you can see, as measurement error increases, the observed correlation goes down and the confidence interval becomes wider.

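[Figure: median observed correlation (y-axis) against measurement error in cm (x-axis), with the shaded 95% interval over 10,000 simulations]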

Greater error leads to poorer correlations. So far, so obvious. But while I call this the observed correlation, it really is the maximally observable correlation. This means that in order to boost power, the first thing you could do is to reduce measurement error. In contrast, increasing your sample size can be highly inefficient and border on the infeasible.
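For anyone who wants to play with this, the simulation described above can be sketched roughly as follows. This is not the original code. The height distribution (mean 170 cm, SD 15 cm) is an assumption I picked because it roughly reproduces the correlations quoted in this post, the "95% interval" is taken to be the central 95% of simulated correlations, and all function names are made up for illustration. It runs 1,000 simulations per error level by default to keep it quick.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(error_sd, n=100, height_mean=170.0, height_sd=15.0,
             n_sims=1000, seed=1):
    """Return (median, lower, upper) of the observed correlation.

    height_mean and height_sd are assumed values, not taken from the post.
    """
    rng = random.Random(seed)
    rs = []
    for _ in range(n_sims):
        heights = [rng.gauss(height_mean, height_sd) for _ in range(n)]
        # Angie guesses in centimeters, Nigel in inches, both with the
        # same amount of Gaussian guessing error (in cm).
        angie_cm = [h + rng.gauss(0.0, error_sd) for h in heights]
        nigel_in = [(h + rng.gauss(0.0, error_sd)) / 2.54 for h in heights]
        rs.append(pearson_r(angie_cm, nigel_in))
    rs.sort()
    lower = rs[int(0.025 * n_sims)]
    upper = rs[int(0.975 * n_sims) - 1]
    return statistics.median(rs), lower, upper


def expected_r(error_sd, height_sd=15.0):
    """Correlation expected when both guesses carry independent error."""
    return height_sd ** 2 / (height_sd ** 2 + error_sd ** 2)


for error_sd in (1, 5, 10, 15, 20):
    med, lower, upper = simulate(error_sd)
    print(f"error {error_sd:>2} cm: median r = {med:.2f} "
          f"[{lower:.2f}, {upper:.2f}], expected {expected_r(error_sd):.2f}")
```

The `expected_r` helper is the textbook attenuation result for this setup: both guesses share the true height variance but each adds its own error variance, so the ceiling on the correlation is the ratio of true variance to total variance.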

For a correlation of 0.35, hardly an unrealistically low effect in a biological or psychological scenario, you would need a sample size of 62 to achieve 80% power. Let’s assume this is the correlation found by a previous study and we want to replicate it. Following common recommendations you would plan to collect two-and-a-half times the sample size, so n = 155. Doing so may prove quite a challenge. Assume that each data point involves hours of data collection per participant and/or that it costs 100s of dollars to acquire the data (neither is atypical in neuroimaging experiments). This may be a considerable additional expense few researchers are able to afford.

And it gets worse. It is quite possible that by collecting more data you further sacrifice data quality. When it comes to neuroimaging data, I have heard from more than one source that some of the large-scale imaging projects contain only mediocre data contaminated by motion and shimming artifacts. The often mentioned suggestion that sample sizes for expensive experiments could be increased by multi-site collaborations ignores that this quite likely introduces additional variability due to differences between sites. The data quality even from the same equipment may differ. The research staff at the two sites may not have the same level of skill or meticulous attention to detail. Behavioral measurements acquired online via a website may be more variable than under controlled lab conditions. So you may end up polluting your effect size even further by increasing sample size.

The alternative is to improve your measurements. In my example here, even going from a measurement error of 20 cm to 15 cm improves the observable effect size quite dramatically, moving from 0.35 to about 0.5. To achieve 80% power, you would only need a sample size of 29. If you kept the original sample size of 62, your power would be 99%. So the critical question is not really what the original effect size was that you want to replicate – rather it is how much you can improve your experiment by reducing noise. If your measurements are already pretty precise to begin with, then there is probably little room for improvement and you also don’t win all that much, as when going from a measurement error of 5 cm to 1 cm in my example. But when the original measurement was noisy, improving the experiment can help a hell of a lot.
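The sample sizes and power values above can be approximated with the standard Fisher z-transformation. The sketch below is one common approximation and not necessarily how the numbers in this post were computed. It assumes a two-sided test at alpha = 0.05, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_for_correlation(r, n, alpha=0.05):
    """Approximate two-sided power to detect a true correlation r with n subjects."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Fisher z of r has a standard error of about 1 / sqrt(n - 3).
    shift = math.atanh(r) * math.sqrt(n - 3)
    upper_tail = 1.0 - _STD_NORMAL.cdf(z_crit - shift)
    lower_tail = _STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def sample_size_for_correlation(r, power=0.80, alpha=0.05):
    """Smallest n (rounded up) reaching the requested power under the approximation."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / math.atanh(r)) ** 2 + 3)


print(sample_size_for_correlation(0.35))
print(sample_size_for_correlation(0.50))
print(round(power_for_correlation(0.50, 62), 3))
```

For r = 0.35 this approximation gives the 62 participants quoted above. For r = 0.5 it lands within one participant of the 29 quoted (the exact figure depends on the method and on how you round), and the power at n = 62 comes out at roughly 0.99.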

There are many ways to make your measurements more reliable. It can mean ensuring that your subjects in the MRI scanner are padded in really well, that they are not prone to large head movements, that you did all in your power to maintain a constant viewing distance for each participant, and that they don’t fall asleep halfway through your experiment. It could mean scanning 10 subjects twice, instead of scanning 20 subjects once. It may be that you measure the speed at which participants walk down the hall to the lift with laser sensors instead of having a confederate sit there with a stopwatch. Perhaps you can change from a group comparison to a within-subject design? If your measure is an average across trials collected in each subject, you can enhance the effect size by increasing the number of trials. And it definitely means not giving a damn what Nigel from down the pub says and investing in a bloody tape measure instead.

I’m not saying that you shouldn’t collect larger samples. Obviously, if measurement reliability remains constant*, larger samples can improve sensitivity. But the first thought should always be how you can make your experiment a better test of your hypothesis. Sometimes the only thing you can do is to increase the sample but I bet usually it isn’t – and if you’re not careful, it can even make things worse. If your aim is to conclude something about the human brain/mind in general, a larger and broader sample would allow you to generalize better. However, for this purpose increasing your subject pool from 20 undergraduate students at your university to 100 isn’t really helping. And when it comes to the choice between an exact replication study with three times the sample size of the original experiment, and one with the same sample but objectively better methods, I know I’d always pick the latter.

 

(* In fact, it’s really a trade-off and in some cases a slight increase of measurement error may very well be outweighed by greater power due to a larger sample size. This probably happens for the kinds of experiments where slight differences in experimental parameters don’t matter much and you can collect 100s of people fast, for example online or at a public event).

A few thoughts on stats checking

You may have heard of StatCheck, an R package developed by Michèle B. Nuijten. It allows you to search a paper (or manuscript) for common frequentist statistical tests. The program then compares whether the p-value reported in the test matches up with the reported test statistic and the degrees of freedom. It flags up cases where the p-value is inconsistent and, additionally, when the recalculated p-value would change the conclusions of the test. Now, recently this program was used to trawl through 50,000ish papers in psychology journals (it currently only recognizes statistics in APA style). The results on each paper are then automatically posted as comments on the post-publication discussion platform PubPeer, for example here. At the time of writing this, I still don’t know if this project has finished. I assume not because the (presumably) only one of my papers that has been included in this search has yet to receive its comment. I left a comment of my own there, which is somewhat satirical because 1) I don’t take the world as seriously as my grumpier colleagues and 2) I’m really just an asshole…
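To give a flavor of what such a consistency check involves, here is a deliberately simplified sketch. It is not the StatCheck package or its API (that is an R package covering t, F, chi-square, r and z tests and handling rounding more carefully). This toy version only handles a two-tailed z-test, the function names are hypothetical, and the example numbers are invented.

```python
from statistics import NormalDist


def recompute_two_tailed_p(z):
    """Two-tailed p-value implied by a reported z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def check_reported_z_test(z, reported_p, decimals=2, alpha=0.05):
    """Compare a reported p-value against the one implied by z.

    'consistent' means the recomputed p rounds to the reported value.
    'changes_conclusion' means the mismatch also flips significance at alpha.
    """
    recomputed = recompute_two_tailed_p(z)
    consistent = abs(round(recomputed, decimals) - reported_p) < 10 ** -(decimals + 3)
    flips = (reported_p < alpha) != (recomputed < alpha)
    return {
        "recomputed_p": recomputed,
        "consistent": consistent,
        "changes_conclusion": (not consistent) and flips,
    }


# Invented examples: (z, reported p)
for z, p in [(2.10, 0.04), (2.10, 0.03), (1.80, 0.03)]:
    print(z, p, check_reported_z_test(z, p))
```

The first example is consistent, the second is inconsistent but harmless, and the third is the kind of mismatch that would change the statistical conclusion. Note that the check only knows what is written in the text, so a mistyped statistic or degree of freedom looks exactly like a wrong p-value.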

While many have welcomed the arrival of our StatCheck Overlords, not everyone is happy. For instance, a commenter in this thread bemoans that this automatic stats checking is just “mindless application of stats unnecessarily causing grief, worry, and ostracism. Effectively, a witch hunt.” In a blog post, Dorothy Bishop discusses the case of her own StatCheck comments, one of which gives the paper a clean bill of health and the other discovered some potential errors that could change the significance and thus the conclusions of the study. My own immediate gut reaction to hearing about this was that this would cause a deluge of vacuous comments and that this diminishes the signal-to-noise ratio of PubPeer. Up until now discussions on there frequently focused on serious issues with published studies. If I see a comment on a paper I’ve been looking up (which is made very easy using the PubPeer plugin for Firefox), I would normally check it out. If in future most papers have a comment from StatCheck, I will certainly lose that instinct. Some are worried about the stigma that may be attached to papers when some errors are found although others have pointed out that to err is human and we shouldn’t be afraid of discovering errors.

Let me be crystal clear here. StatCheck is a fantastic tool and should prove immensely useful to researchers. Surely, we all want to reduce errors in our publications, which I am also sure all of us make some of the time. I have definitely noticed typos in my papers and also errors with statistics. That’s in spite of the fact that when I do the statistics myself I use Matlab code that outputs the statistics in the way they should look in the text so all I have to do is copy and paste them in. Some errors are introduced by the copy-editing stage after a manuscript is accepted. Anyway, using StatCheck on our own manuscripts can certainly help reduce such errors in future. It is also extremely useful for reviewing papers and marking student dissertations because I usually don’t have the time (or desire) to check every single test by hand. The real question is whether there is really much of a point in doing this posthoc for thousands of already published papers.

One argument for this is to enable people to meta-analyze previous results. Here it is important to know that a statistic is actually correct. However, I don’t entirely buy this argument because if you meta-analyze literature you really should spend more time on checking the results than looking at what the StatCheck auto-comment on PubPeer said. If anything, the countless comments saying that there are zero errors are probably more misleading than the ones that found minor problems. They may actually mislead you into thinking that there is probably nothing wrong with these statistics – and this is not necessarily true. In all fairness, StatCheck, both in its auto-comments and the original paper, is very explicit about the fact that its results aren’t definite and should be verified manually. But if there is one thing I’ve learned about people it is that they tend to ignore the small print. When is the last time you actually read an EULA before agreeing to it?

Another issue with the meta-analysis argument is that presently the search is of limited scope. While 50,000 is a large number, it is a small proportion of scientific papers, even within the field of psychology and neuroscience. I work at a psychology department and am (by some people’s definition) a psychologist but – as I said – to my knowledge only one of my own papers should have even been included in the search so far. So if I do a literature search for a meta-analysis StatCheck’s autopubpeering wouldn’t be much help to me. I’m told there are plans to widen the scope of StatCheck’s robotic efforts beyond psychology journals in the future. When it is more common this may indeed be more useful although the problem remains that the validity of its results is simply unknown.

The original paper includes a validity check in the Appendix. This suggests that error rates are reasonably low when comparing StatCheck’s results to previous checks. This is doubtless important for confirming that StatCheck works. But in the long run this is not really the error rate we are interested in. What this does not tell us is which proportion of papers contain actual errors that affect a study’s conclusions. Take Dorothy Bishop‘s paper as an example. For that StatCheck detected two F-tests for which the recalculated p-value would change the statistical conclusions. However, closer inspection reveals that the test was simply misreported in the paper. There is only one degree of freedom and I’m told StatCheck misinterpreted what test this was (but I’m also told this has been fixed in the new version). If you substitute in the correct degrees of freedom, the reported p-value matches.

Now, nobody is denying that there is something wrong with how these particular stats were reported. An F-test should have two degrees of freedom. So StatCheck did reveal errors and this is certainly useful. But the PubPeer comment flags this up as a potential gross inconsistency that could theoretically change the study’s conclusions. However, we know that it doesn’t actually mean that. The statistical inference and conclusions are fine. There is merely a typographic error. The StatCheck report is clearly a false positive.

This distinction seems important to me. The initial reports about this StatCheck mega-trawl claimed that “around half of psychology papers have at least one statistical error, and one in eight have mistakes that affect their statistical conclusions.” At least half of this sentence is blatantly untrue. I wouldn’t necessarily call a typo a “statistical error”. But as I already said, revealing these kinds of errors is certainly useful nonetheless. The second part of this statement is more troubling. I don’t think we can conclude 1 in 8 papers included in the search have mistakes that affect their conclusions. We simply do not know that. StatCheck is a clever program but it’s not a sentient AI. The only way to really determine if the statistical conclusions are correct is still to go and read each paper carefully and work out what’s going on. Note that the statement in the StatCheck paper is more circumspect and acknowledges that such firm conclusions cannot be drawn from its results. It’s a classic case of journalistic overreach where the RetractionWatch post simplifies what the researchers actually said. But these are still people who know what they’re doing. They aren’t writing flashy “science” articles for the tabloid press.

This is a problem. I do think we need to be mindful of how the public perceives scientific research. In a world in which it is fine for politicians to win referenda because “people have had enough of experts” and in which a narcissistic, science-denying madman is dangerously close to becoming US President we simply cannot afford to keep telling the public that science is rubbish. Note that worries about the reputation of science are no excuse not to help improve it. Quite to the contrary, it is a reason to ensure that it does improve. I have said many times, science is self-correcting but only if there are people who challenge dearly held ideas, who try to replicate previous results, who improve the methods, and who reveal errors in published research. This must be encouraged. However, if this effort does not go hand in hand with informing people about how science actually works, rather than just “fucking loving” it for its cool tech and flashy images, then we are doomed. I think it is grossly irresponsible to tell people that an eighth of published articles contain incorrect statistical conclusions when the true number is probably considerably smaller.

In the same vein, an anonymous commenter on my own PubPeer thread also suggested that we should “not forget that Statcheck wasn’t written ‘just because.'” There is again an underhanded message in this. Again, I think StatCheck is a great tool and it can reveal questionable results such as rounding down your p=0.054 to p=0.05 or the even more unforgivable p<0.05. It can also reveal other serious errors. However, until I see any compelling evidence that the proportion of such evils in the literature is as high as suggested by these statements I remain skeptical. A mass-scale StatCheck of the whole literature in order to weed out serious mistakes seems a bit like carpet-bombing a city just to assassinate one terrorist leader. Even putting questions of morality aside, it isn’t really very efficient. Because if we assume that some 13% of papers have grossly inconsistent statistics we still need to go and manually check them all. And, what is worse, we quite likely miss a lot of serious errors that this test simply can’t detect.

So what do I think about all this? I’ve come to the conclusion that there is no major problem per se with StatCheck posting on PubPeer. I do think it is useful to see these results, especially if it becomes more general. Seeing all of these comments may help us understand how common such errors are. It allows people to double check the results when they come across them. I can adjust my instinct. If I see one or two comments on PubPeer I may now suspect it’s probably about StatCheck. If I see 30, it is still likely to be about something potentially more serious. So all of this is fine by me. And hopefully, as StatCheck becomes more widely used, it will help reduce these errors in future literature.

But – and this is crucial – we must consider how we talk about this. We cannot treat every statistical error as something deeply shocking. We need to develop a fair tolerance to these errors as they are discovered. This may seem obvious to some but I get the feeling not everybody realizes that correcting errors is the driving force behind science. We need to communicate this to the public instead of just telling them that psychologists can’t do statistics. We can’t just say that some issue with our data analysis invalidates 45,000 and 15 years worth of fMRI studies. In short, we should stop overselling our claims. If, like me, you believe it is damaging when people oversell their outlandish research claims about power poses and social priming, then it is also damaging if people oversell their doomsday stories about scientific errors. Yes, science makes errors – but the fact that we are actively trying to fix them is proof that it works.

Your friendly stats checking robot says hello

On the magic of independent piloting

TL,DR: Never simply decide to run a full experiment based on whether one of the small pilots in which you tweaked your paradigm supported the hypothesis. Use small pilots only to ensure the experiment produces high quality data, judged by criteria that are unrelated to your hypothesis.

Sorry for the bombardment with posts on data peeking and piloting. I felt this would have cluttered up the previous post so I wrote a separate one. After this one I will go back to doing actual work though, I promise! That grant proposal I should be writing has been neglected for too long…

In my previous post, I simulated what happens when you conduct inappropriate pilot experiments by running a small experiment and then continuing data collection if the pilot produces significant results. This is really data peeking and it shouldn’t come as much of a surprise that this inflates false positives and massively skews effect size estimates. I hope most people realize that this is a terrible thing to do because it makes your results entirely dependent on the outcome. Quite possibly, some people would have learned about this in their undergrad stats classes. As one of my colleagues put it, “if it ends up in the final analysis it is not a pilot.” Sadly, I don’t think this is as widely known as it should be. I was not kidding when I said that I have seen it happen before or overheard people discussing having done this type of inappropriate piloting.

But anyway, what is an appropriate pilot then? In my previous post, I suggested you should redo the same experiment but restart data collection. You now stick to the methods that gave you a significant pilot result. Now the data set used to test your hypothesis is completely independent, so it won’t be skewed by the pre-selected pilot data. Put another way, your exploratory pilot allows you to estimate a prior, and your full experiment seeks to confirm it. Surely there is nothing wrong with that, right?

I’m afraid there is and it is actually obvious why: your small pilot experiment is underpowered to detect real effects, especially small ones. So if you use inferential statistics to determine if a pilot experiment “worked,” this small pilot is biased towards detecting larger effect sizes. Importantly, this does not mean you bias your experiment towards larger effect sizes. If you only continue the experiment when the pilot was significant, you are ignoring all of the pilots that would have shown true effects but which – due to the large uncertainty (low power) of the pilot – failed to do so purely by chance. Naturally, the proportion of these false negatives becomes smaller the larger you make your pilot sample – but since pilots are by definition small, the error rate is pretty high in any case. For example, for a true effect size of δ = 0.3, the false negative rate at a pilot sample of 2 is 95%. With a pilot sample of 15, it is still as high as 88%. Just for illustration I show below the false negative rates (1-power) for three different true effect sizes. Even for quite decent effect sizes the sensitivity of a small pilot is abysmal:

False Negatives
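For readers who want to check the two numbers quoted above without the original MatLab code, here is a rough Monte Carlo sketch in Python. It is not the code behind the figure. The function name `pilot_false_negative_rate` is made up for illustration, the test is assumed to be a two-sided, pooled-variance t-test at p < 0.05, and the critical t values are hardcoded from standard tables rather than computed.

```python
import math
import random

# Two-sided 5% critical values of Student's t, hardcoded from standard tables
# (assumption: df = 2n - 2 for two groups of n subjects each).
T_CRIT = {2: 4.3027, 28: 2.0484}


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance of a list of numbers."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def pilot_false_negative_rate(n, delta=0.3, n_sims=20000, seed=1):
    """Fraction of pilots (n per group) that miss a true effect of size delta."""
    rng = random.Random(seed)
    t_crit = T_CRIT[2 * n - 2]
    misses = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(delta, 1.0) for _ in range(n)]
        ma, va = mean_and_variance(a)
        mb, vb = mean_and_variance(b)
        # Pooled-variance t statistic for two equally sized groups.
        t = (mb - ma) / math.sqrt((va + vb) / n)
        if abs(t) < t_crit:
            misses += 1
    return misses / n_sims


for n in (2, 15):
    print(n, round(pilot_false_negative_rate(n), 3))
```

With these assumptions the rates should land near the 95% and 88% mentioned in the text, give or take simulation noise.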

Thus, if you only pick pilot experiments with significant results to do real experiments you are deluding yourself into thinking that the methods you piloted are somehow better (or “precisely calibrated”). Remember this is based on a theoretical scenario that the effect is real and of fixed strength. Every single pilot experiment you ran investigated the same underlying phenomenon and any difference in outcome is purely due to chance – the tweaking of your methods had no effect whatsoever. You waste all manner of resources piloting some methods you then want to test.

So frequentist inferential statistics on pilot experiments are generally nonsense. Pilots are by nature exploratory. You should only determine significance for confirmatory results. But what are these pilots good for? Perhaps we just want to have an idea of what effect size they can produce and then do our confirmatory experiments for those methods that produce a reasonably strong effect?

I’m afraid that won’t do either. I simulated this scenario in a similar manner as in my previous post. 100,000 times I generated two groups (with a full sample size of n = 80, although the full sample size isn’t critical for this). Both groups are drawn from a population with standard deviation 1 but one group has a mean of zero while the other’s mean is shifted by 0.3 – so we have a true effect size here (the actual magnitude of this true effect size is irrelevant for the conclusions). In each of the 100,000 simulations, the researcher runs a number of pilot subjects per group (plotted on x-axis). Only if the effect size estimate for this pilot exceeds a certain criterion level does the researcher run an independent, full experiment. The criterion is either 50%, 100%, or 200% of the true effect size. Obviously, the researcher cannot know the true effect size. I simply use these criteria as something that the researcher might be doing in a real world situation. (For the true effect size I used here, these criteria would be d = 0.15, d = 0.3, or d = 0.6, respectively).

The results are below. The graph on the left once again plots the false negative rates against the pilot sample size. A false negative here is not based on significance but on effect size, that is, any simulation for which d was below the criterion. When the criterion is equal to the true effect size, the false negative rate is constant at 50%. The reason for this is obvious: each simulation is drawn from a population centered on the true effect of 0.3, so half of these simulations will exceed that value. However, when the criterion is not equal to the true effect the false negative rates depend on the pilot sample size. If the criterion is lower than the true effect, false negatives decrease as the pilot sample grows. If the criterion is strict, false negatives increase as the pilot sample grows. Either way, the false negative rates are substantially greater than the 20% mark you would have with an adequately powered experiment. So you will still delude yourself a considerable number of times if you only conduct the full experiment when your pilot has a particular effect size. Even if your criterion is lax (and d = 0.15 for a pilot sounds pretty lax to me), you are missing a lot of true results. Again, remember that all of the pilot experiments here investigated a real effect of fixed size. Tweaking the method makes no difference. The difference between simulations is simply due to chance.
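The false negative part of this simulation can be sketched in a few lines of Python. Again, this is an illustrative reimplementation under my reading of the description above, not the original MatLab code. The names `cohens_d` and `missed_pilots` are hypothetical, and I use fewer simulations than the 100,000 mentioned in the text to keep it quick.

```python
import math
import random


def cohens_d(a, b):
    """Cohen's d for two equally sized groups, using the pooled standard deviation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    return (mb - ma) / math.sqrt((va + vb) / 2.0)


def missed_pilots(n_pilot, criterion, delta=0.3, n_sims=10000, seed=1):
    """Fraction of pilots whose observed d falls below the go/no-go criterion,
    even though the true effect (delta) is real."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_pilot)]
        b = [rng.gauss(delta, 1.0) for _ in range(n_pilot)]
        if cohens_d(a, b) < criterion:
            misses += 1
    return misses / n_sims


# Criteria at 50%, 100% and 200% of the true effect size of 0.3.
for criterion in (0.15, 0.3, 0.6):
    rates = [round(missed_pilots(n, criterion), 2) for n in (5, 10, 15)]
    print(criterion, rates)
```

The pattern to look for is the one described above: roughly 50% misses when the criterion equals the true effect, fewer misses with larger pilots for the lax criterion, and more misses with larger pilots for the strict one.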

Finally, the graph on the right shows the mean effect sizes estimated by your completed experiments (but not the absolute values this time!). The criterion you used in the pilot makes no difference here (all colors are at the same level), which is reassuring. However, all is not necessarily rosy. The open circles plot the effect size you get under publication bias, that is, if you only publish the significant experiments with p < 0.05. This effect is clearly inflated compared to the true effect size of 0.3. The asterisks plot the effect size estimate if you take all of the experiments. This is the situation you would have (Chris Chambers will like this) if you did a Registered Report for your full experiment and publication of the results is guaranteed irrespective of whether or not they are significant. On average, this effect size is an accurate estimate of the true effect.

Simulation Results

Again, these are only the experiments that were lucky enough to go beyond the piloting stage. You already wasted a lot of time, effort, and money to get here. While the final outcome is solid if publication bias is minimized, you have thrown a considerable number of good experiments into the trash. You’ve also misled yourself into believing that you conducted a valid pilot experiment that honed the sensitivity of your methods when in truth all your pilot experiments were equally mediocre.

I have had a few comments from people saying that they are only interested in large effect sizes and surely that means they are fine? I’m afraid not. As I said earlier already, the principle here is not dependent on the true effect size. It is solely a factor of the low sensitivity of the pilot experiment. Even with a large true effect, your outcome-dependent pilot is a blind chicken that errs around in the dark until it is lucky enough to hit a true effect more or less by chance. For this to happen you must use a very low criterion to turn your pilot into a real experiment. This however also means that if the null hypothesis is true an unacceptable proportion of your pilots produce false positives. Again, remember that your piloting is completely meaningless – you’re simply chasing noise here. It means that your decision whether to go from pilot to full experiment is (almost) completely arbitrary, even when the true effect is large.

So for instance, when the true effect is a whopping δ = 1, and you are using d > 0.15 as a criterion in your pilot of 10 subjects (which is already large for pilots I typically hear about), your false negative rate is nice and low at ~3%. But critically, if the null hypothesis of δ = 0 is true, your false positive rate is ~37%. How often you will fool yourself by turning a pilot into a full experiment depends on the base rate. If you give this hypothesis a 50:50 chance of being true, almost one in three of your pilot experiments will lead you to chase a false positive. If these odds are lower (which they very well may be), the situation becomes increasingly worse.
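These numbers can be roughly checked on the back of an envelope. The sketch below assumes that the observed d of a pilot with n subjects per group is approximately normal around the true effect with standard error sqrt(2/n). That is a crude approximation (it ignores the uncertainty in the standard deviation estimate) and not how the simulation was done. The function name `go_probability` is made up, and the last line reflects one reading of “one in three”, namely the share of green-lit pilots that are false alarms.

```python
from math import sqrt
from statistics import NormalDist


def go_probability(delta, criterion, n):
    """Chance that a pilot's observed d exceeds the criterion.

    Assumption: observed d ~ Normal(delta, sqrt(2 / n)) for n subjects per group.
    """
    return 1.0 - NormalDist(delta, sqrt(2.0 / n)).cdf(criterion)


n, criterion = 10, 0.15
false_negative = 1.0 - go_probability(1.0, criterion, n)  # true effect of 1 is missed
false_positive = go_probability(0.0, criterion, n)        # null is true, pilot says go

# Share of green-lit pilots that chase a null effect, for a 50:50 base rate.
prior_true = 0.5
go_null = (1.0 - prior_true) * false_positive
go_real = prior_true * (1.0 - false_negative)
share_false_alarms = go_null / (go_null + go_real)

print(round(false_negative, 3), round(false_positive, 3), round(share_false_alarms, 3))
```

Lowering `prior_true` shows how quickly the share of wasted full experiments grows when the hypothesis was a long shot to begin with.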

What should we do then? In my view, there are two options: either run a well-powered confirmatory experiment that tests your hypothesis based on an effect size you consider meaningful. This is the option I would choose if resources are a critical factor. Alternatively, if you can afford the investment of time, money, and effort, you could run an exploratory experiment with a reasonably large sample size (that is, more than a pilot). If you must, tweak the analysis at the end to figure out what hides in the data. Then, run a well-powered replication experiment to confirm the result. The power for this should be high enough to detect effects that are considerably weaker than the exploratory effect size. This exploratory experiment may sound like a pilot but it isn’t because it has decent sensitivity and the only resource you might be wasting is your time* during the exploratory analysis stage.

The take-home message here is: don’t make your experiments dependent on whether your pilot supported your hypothesis, even if you use independent data. It may seem like a good idea but it’s tantamount to magical thinking. Chances are that you did not refine your method at all. Again (and I apologize for the repetition but it deserves repeating): this does not mean all small piloting is bad. If your pilot is about assuring that the task isn’t too difficult for subjects, that your analysis pipeline works, that the stimuli appear as you intended, that the subjects aren’t using a different strategy to perform the task, or quite simply to reduce the measurement noise, then it is perfectly valid to run a few people first and it can even be justified to include them in your final data set (although that last point depends on what you’re studying). The critical difference is that the criteria for green-lighting a pilot experiment are completely unrelated to the hypothesis you are testing.

(* Well, your time and the carbon footprint produced by your various analysis attempts. But if you cared about that, you probably wouldn’t waste resources on meaningless pilots in the first place, so this post is not for you…)

MatLab code for this simulation.

On the worthlessness of inappropriate piloting

So this post is just a brief follow up to my previous post on data peeking. I hope it will be easy to see why this is very related:

Today I read this long article about the RRR of the pen-in-mouth experiments – another in a growing list of failures to replicate classical psychology findings. I was quite taken aback by one comment in this: the assertion that these classical psychology experiments (in particular the social priming ones) had been “precisely calibrated to elicit tiny changes in behavior.” It is an often repeated argument to explain why findings fail to replicate – the “replicators” simply do not have the expertise and/or skill to redo these delicate experiments. And yes, I am entirely willing to believe that I’d be unable to replicate a lot of experiments outside my area, say, finding subatomic particles or even (to take an example from my general field) difficult studies on clinical populations.

But what does this statement really mean? How were these psychology experiments “calibrated” before they were run? What did the authors do to nail down the methods before they conducted the studies? It implies that extensive pilot experiments were conducted first. I am in no position to say that this is what the authors of these psychology studies did during their piloting stage but one possibility is that several small pilot experiments were run and the experimental parameters were tweaked until a significant result supporting the hypothesis was observed. Only then did they continue the experiment and collect a full data set that included the pilot data. I have seen and heard of people who did precisely this sort of piloting until the “experiment worked.”

So, what actually happens when you “pilot” experiments to “precisely calibrate” them? I decided to simulate this and the results are in the graph below (each data point is based on 100,000 simulations). In this simulation, an intrepid researcher first runs a small number of pilot subjects per group (plotted on x-axis). If the pilot fails to produce significant results at p < 0.05, the experiment is abandoned and the results are thrown in the bin never to again see the light of day. However, if the results are significant, the eager researcher collects more data until the full sample in each group is n = 20, 40, or 80. On the y-axis I plotted the proportion of these continued experiments that produced significant results. Note that all simulated groups were drawn from a normal distribution with mean 0 and standard deviation 1. Therefore, any experiments that “worked” (that is, they were significant) are false positives. In a world where publication bias is still commonplace, these are the findings that make it into journals – the rest vanish in the file-drawer.

 

False Positives

As you can see, such a scheme of piloting until the experiment “works,” can produce an enormous number of false positives in the completed experiments. Perhaps this is not really all that surprising – after all this is just another form of data peeking. Critically, I don’t think this is unrealistic. I’d wager this sort of thing is not at all uncommon. And doesn’t it seem harmless? After all we are only peeking once! If a pilot experiment “worked,” we continue sampling until the sample is complete.

Well, even under these seemingly benign conditions false positives can be inflated dramatically. The black curve is for the case when the final sample size, of the completed studies, is only 20. This is the worst case and it is perhaps unrealistic. If the pilot experiment consists of 10 subjects (that is, half the full sample) about a third of results will be flukes. But even in the other cases, when only a handful of pilot subjects are collected compared to the much larger full samples, false positives are well above 5%. In other words, whenever you pilot an experiment and decide that it’s “working” because it seems to support your hypothesis, you are already skewing the final outcome.
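To make the worst case concrete, here is a small Python sketch of the scheme for a pilot of 10 and a full sample of 20 per group. It is my own illustrative reimplementation and not the MatLab code linked at the end of the post. The function names are hypothetical, the test is assumed to be a two-sided pooled-variance t-test, and the critical values are hardcoded from standard t tables.

```python
import math
import random

# Two-sided 5% critical t values from standard tables (df = 2n - 2).
T_CRIT = {18: 2.1009, 38: 2.0244}


def t_statistic(a, b):
    """Pooled-variance t statistic for two equally sized groups."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    return (mb - ma) / math.sqrt((va + vb) / n)


def continued_false_positive_rate(n_pilot=10, n_full=20, n_sims=40000, seed=1):
    """Returns (share of continued experiments that end up significant,
    share of pilots that were continued). The null hypothesis is always true."""
    rng = random.Random(seed)
    continued = flukes = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_pilot)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_pilot)]
        if abs(t_statistic(a, b)) <= T_CRIT[2 * n_pilot - 2]:
            continue  # pilot "didn't work": abandoned, never to be seen again
        continued += 1
        # Pilot "worked": keep its data and top up to the full sample.
        a += [rng.gauss(0.0, 1.0) for _ in range(n_full - n_pilot)]
        b += [rng.gauss(0.0, 1.0) for _ in range(n_full - n_pilot)]
        if abs(t_statistic(a, b)) > T_CRIT[2 * n_full - 2]:
            flukes += 1
    return flukes / continued, continued / n_sims


rate, share_continued = continued_false_positive_rate()
print(round(rate, 3), round(share_continued, 3))
```

About 5% of pilots get continued, as they should under the null, and of those roughly a third should come out significant at the end.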

Of course, the true false positive rate, taken across the whole set of 100,000 pilots that were run, would be much lower (0.05 times the rates I plotted above to be precise, because we picked from the 5% of significant “pilots” in the first place). However, since we cannot know how much of this inappropriate piloting went on behind the scenes, knowing this isn’t particularly helpful.

More importantly, we aren’t only interested in the false positive rate. A lot of researchers will care about the effect size estimates of their experiments. Crucially, this form of piloting will substantially inflate these effect size estimates as well and this may have even worse consequences for the interpretation of these experiments. In the graph below, I plot the effect sizes (the mean absolute Cohen’s d) for the same simulations for which I showed you the false positive rates above. I use the absolute effect size because the sign is irrelevant – the whole point of this simulation exercise is to mimic a full-blown fishing expedition via inappropriate “piloting.” So our researcher will interpret a significant result as meaningful regardless of whether d is positive or negative.

Forgive the somewhat cluttered plot but it’s not that difficult to digest really. The color code is the same as for the previous figure. The open circles and solid lines show you the effect size of the experiments that “worked,” that is, the ones for which we completed data collection and which came out significant. The asterisks and dashed lines show the effect sizes for all of the global false positives, that is, all the simulations with p < 0.05 after the pilot but using the full data set, as if you had completed these experiments. Finally, the crosses and dotted lines show the effect sizes you get for all simulations (ignoring inferential statistics). This is just given as a reference.

Effect Sizes

Two things are notable about all this. First, effect size estimates increase with “pilot” sample size for the set of global false positives (asterisks) but not the other curves. This is because the “pilot” sample size determines how strongly the fluke pilot effect will contribute to the final effect size. More importantly, the effect size estimates for those experiments with significant pilots and which also “worked” after completion are massively exaggerated (open circles). The degree of exaggeration is a factor of the baseline effect (crosses). The absolute effect size estimate depends on the full sample size. At the smallest full sample size (n=20, black curve) the effect sizes are as high as d = 0.8. Critically, the degree of exaggeration does not depend on how large your pilot sample was. Whether your “pilot” had only 2 or 15 subjects, the average effect size estimate is around 0.8.

The reason for this is that the smaller the pilot experiment, the more underpowered it is. Since it is a condition for continuing the experiment that the pilot must be significant, the pilot effect size must be considerably larger for small pilots than larger pilots. Because the true effect size is always zero, this cancels out in the end so the final effect size estimate is constant regardless of the pilot sample size. But in any case, the effect size estimates you got from your precisely calibrated inappropriately piloted experiments are enormously overrated. It shouldn’t be much of a surprise if these don’t replicate and that posthoc power calculations based on these effect sizes suggest low power (of course, you should never use posthoc power in that way but that’s another story…).

So what should we do? Ideally you should just throw away the pilot data, preregister the design, and restart the experiment anew with the methods you piloted. In this case the results are independent and only the methods are shared. Importantly, there is nothing wrong with piloting in general. After all, I had a previous post praising pilot experiments. But piloting should be about ensuring that the methods are effective in producing clean data. There are many situations in which an experiment seems clever and elegant in theory but once you actually start it in practice you realize that it just can’t work. Perhaps the participants don’t use the task strategy you envisioned. Or they simply don’t perceive the stimuli the way they were intended. In fact, this happened to us recently and we may have stumbled over an interesting finding in its own right (but this must also be confirmed by a proper experiment!). In all these situations, however, the decision on the pilot results is unrelated to the hypothesis you are testing. If they are related, you must account for that.

MatLab code for these simulations is available. As always, let me know if you find errors. (To err is human, to have other people check your code divine?)

Realistic data peeking isn’t as bad as you* thought – it’s worse

Unless you’ve been living under a rock, you have probably heard of data peeking – also known as “optional stopping”. It’s one of those nasty questionable research practices that could produce a body of scientific literature contaminated by widespread spurious findings and thus lead to poor replicability.

Data peeking is when you run a Frequentist statistical test every time you collect a new subject/observation (or after every few observations) and stop collecting data when the test comes out significant (say, at p < 0.05). Doing this clearly does not accord with good statistical practice because under the Frequentist framework you should plan your final sample size a priori based on power analysis, collect data until you have that sample size, and never look back (but see my comment below for more discussion of this…). What is worse, under the aforementioned data peeking scheme you can be theoretically certain to reject the null hypothesis eventually. Even if the null hypothesis is true, sooner or later you will hit a p-value smaller than the significance threshold.

Until recently, many researchers, at least in psychological and biological sciences, appeared to be unaware of this problem and it isn’t difficult to see that this could contribute to a prevalence of false positives in the literature. Even now, after numerous papers and blog posts have been written about this topic, this problem still persists. It is perhaps less common but I still occasionally overhear people (sometimes even in their own public seminar presentations) saying things like “This effect isn’t quite significant yet so we’ll see what happens after we tested a few more subjects.” So far so bad.

Ever since I heard about this issue (and I must admit that I was also unaware of the severity of this problem back in my younger, carefree days), I have felt somehow dissatisfied with how this issue has been described. While it is a nice illustration of a problem, the models of data peeking seem extremely simplistic to me. There are two primary aspects of this notion that in my opinion just aren’t realistic. First, the notion of indefinite data collection is obviously impossible, as this would imply having an infinite subject pool and other bottomless resources. However, even if you allow for a relatively manageable maximal sample size at which a researcher may finally stop data collection even when the test is not significant, the false positive rate is still massively inflated.

The second issue is therefore a bigger problem: the simple data peeking procedure described above seems grossly fraudulent to me. I would have thought that even if the researcher in question were unaware of the problems with data peeking, they probably would nonetheless feel that something isn’t quite right with checking for significant results after every few subjects and continuing until they get them. As always, I may be wrong about this but I sincerely doubt this is what most “normal” people do. Rather, I believe people would be more likely to peek at the data to look if the results are significant, and only if the p-value “looks promising” (say 0.05 < p < 0.1) they continue testing. This sampling plan sounds a lot more like what may actually happen. So I wanted to find out how this sort of sampling scheme would affect results. I have no idea if anyone already did something like this. If so, I’d be grateful if you could point me to that analysis.

So what I did is the following: I used Pearson’s correlation as the statistical test. In each iteration of the simulation I generated a data set of 150 subjects, each with two uncorrelated Gaussian variables, let’s just pretend it’s the height of some bump on the subjects’ foreheads and a behavioral score of how belligerent they are. 150 is thus the maximal sample size, assuming that our simulated phrenologist – let’s call him Dr Peek – would not want to test more than 150 subjects. However, Dr Peek actually starts with only 3 subjects and then runs the correlation test. In the simplistic version of data peeking, Dr Peek will stop collecting data if p < 0.05; otherwise he will collect another subject until p < 0.05 or 150 subjects are eventually reached. In addition, I simulated three other sampling schemes that feel more realistic to me. In these cases, Dr Peek will also stop data collection when p < 0.05 but he will also stop when p is either greater than 0.1, greater than 0.3, or greater than 0.5. I repeated each of these simulations 1000 times.
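For anyone who wants to play with Dr Peek, the procedure just described can be sketched in plain Python. This is an illustrative reimplementation, not the Matlab code linked at the end of the post, and all function names are my own. To avoid needing a statistics library, the p-value of Pearson’s r is computed from the null distribution of r for Gaussian data (density proportional to (1 - r²)^((n-4)/2), integrated after substituting r = sin θ), which is equivalent to the usual t-test on r. The stopping rules are then applied as thresholds on |r|.

```python
import functools
import math
import random


def tail_integral(k, a):
    """Integral of cos(theta)**k from a to pi/2, via the standard reduction formula."""
    s, c = math.sin(a), math.cos(a)
    if k % 2 == 0:
        j, first = math.pi / 2 - a, 2
    else:
        j, first = 1.0 - s, 3
    for m in range(first, k + 1, 2):
        j = ((m - 1) * j - c ** (m - 1) * s) / m
    return j


def p_value(r, n):
    """Two-sided p-value of Pearson's r for n observations of Gaussian data."""
    a = math.asin(min(abs(r), 1.0))
    return tail_integral(n - 3, a) / tail_integral(n - 3, 0.0)


@functools.lru_cache(maxsize=None)
def critical_r(n, alpha):
    """|r| at which the two-sided p-value equals alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2
        if p_value(mid, n) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def run_dr_peek(quit_alpha=None, rho=0.0, n_min=3, n_max=150, n_sims=1000, seed=1):
    """Proportion of simulated studies that end with p < 0.05.

    quit_alpha=None is the simplistic scheme (test until p < 0.05 or n_max).
    Otherwise Dr Peek also gives up as soon as p exceeds quit_alpha.
    """
    rng = random.Random(seed)
    noise = math.sqrt(1.0 - rho ** 2)
    hits = 0
    for _ in range(n_sims):
        sx = sy = sxx = syy = sxy = 0.0
        for n in range(1, n_max + 1):
            x = rng.gauss(0.0, 1.0)
            y = rho * x + noise * rng.gauss(0.0, 1.0)
            sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y
            if n < n_min:
                continue
            r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
            if abs(r) > critical_r(n, 0.05):   # p < 0.05: stop and celebrate
                hits += 1
                break
            if quit_alpha is not None and abs(r) < critical_r(n, quit_alpha):
                break                          # p too high: give up
    return hits / n_sims


for scheme in (None, 0.1, 0.3, 0.5):
    print(scheme, run_dr_peek(quit_alpha=scheme))
```

Setting `rho` above zero turns the same function into a power simulation, and `n_min` can be raised to mimic a larger minimal sample size.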

The results are in the graph below. The four sampling schemes are denoted by the different colors. On the y-axis I plotted the proportion of the 1000 simulations in which the final outcome (that is, whenever data collection was stopped) yielded p < 0.05. The scenario I described above is the leftmost set of data points in which the true effect size, the correlation between forehead bump height and belligerence, is zero. Confirming previous reports on data peeking, the simplistic case (blue curve) has an enormously inflated false positive rate of around 0.42. Nominally, the false positive rate should be 0.05. However, under the more “realistic” sampling schemes the false positive rates are far lower. In fact, for the case where data collection only continues while p-values are marginal (0.05 < p < 0.1), the false positive rate is 0.068, only barely above the nominal rate. For the other two schemes, the situation is slightly worse but not by that much. So does this mean that data peeking isn’t really as bad as we have been led to believe?

Rates

Hold on, not so fast. Let us now look at what happens in the rest of the plot. I redid the same kind of simulation for a range of true effect sizes up to rho = 0.9. The x-axis shows the true correlation between forehead bump height and belligerence. Unlike for the above case when the true correlation is zero, now the y-axis shows statistical power, the proportion of simulations in which Dr Peek concluded correctly that there actually is a correlation. All four curves rise steadily as one might expect with stronger true effects. The blue curve showing the simplistic data peeking scheme rises very steeply and reaches maximal power at a true correlation of around 0.4. The slopes of the other curves are much more shallow and while the power at strong true correlations is reasonable at least for two of them, they don’t reach the lofty heights of the simplistic scheme.

This feels somehow counter-intuitive at first but it makes sense: when the true correlation is strong, the probability of high p-values is low. However, at the very small sample sizes we start out with even a strong correlation is not always detectable – the confidence interval of the estimated correlation is very wide. Thus there will be a relatively large proportion of p-values that pass that high cut-off and terminate data collection prematurely without rejecting the null hypothesis.

Critically, these two things, inflated false positive rates and reduced statistical power to detect true effects, dramatically reduce the sensitivity of any analysis that is performed under these realistic data peeking schemes. In the graph below, I plot the sensitivity (quantified as d’) of the analysis. Larger d’ means there is a more favorable ratio between the number of simulations in which Dr Peek correctly detected a true effect and how often he falsely concluded there was a correlation when there wasn’t one. Sensitivity for the simplistic sample scheme (blue curve) rises steeply until power is maximal. However, sensitivity for the other sampling schemes starts off close to zero (no sensitivity) and only rises fairly slowly.

Sensitivity

For reference compare this to the situation under desired conditions, that is, without questionable research practices, with adequate statistical power of 0.8, and the nominal false positive rate of 0.05: in this case the sensitivity would be d’ = 2.49, so higher than any of the realistic sampling schemes ever get. Again, this is not really surprising because data collections will typically be terminated at sample sizes that give far less than 0.8 power. But in any case, this is bad news. Even though the more realistic forms of data peeking don’t inflate false positives as massively as the most pessimistic predictions, they impede the sensitivity of experiments dramatically and are thus very likely to only produce rubbish. It should come as no surprise that many findings fail to replicate.
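The reference value is easy to verify. Assuming the standard signal detection definition, d’ is the z-transformed hit rate (here, power) minus the z-transformed false alarm rate (here, the false positive rate). The helper name `d_prime` is just for illustration.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Signal detection sensitivity: z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Power of 0.8 at the nominal false positive rate of 0.05.
print(round(d_prime(0.8, 0.05), 2))  # 2.49
```

Plugging in the power and false positive rate of any of the peeking schemes gives the corresponding point on the sensitivity curves.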

Obviously, what I call here more realistic data peeking is not necessarily a perfect simulation of how data peeking may work in practice. For one thing, I don’t think Dr Peek would have a fixed cut-off of p > 0.1 or p > 0.5. Rather, such a cut-off might be determined on a case-by-case basis, dependent on the prior expectation Dr Peek has that the experiment should yield significant results. (Dr Peek may not use Bayesian statistics, but like all of us he clearly has Bayesian priors.) In some cases, he may be very confident that there should be an effect and he will continue testing for a while but then finally give up when the p-value is very high. For other hypotheses that he considered to be risky to begin with, he may not be very convinced even by marginal p-values and thus will terminate data collection when p > 0.1.

Moreover, it is probably also unrealistic that Dr Peek would start with a sample size of 3. Rather, it seems more likely that he would have a larger minimal sample size in mind, for example 20 and collect that first. While he may have been peeking at the data before he completed testing 20 subjects, there is nothing wrong with that provided he doesn’t stop early if the result becomes significant. Under these conditions the situation becomes somewhat better but the realistic data peeking schemes still have reduced sensitivity, at least for lower true effect sizes, which are presumably far more prevalent in real world situations. The only reason that sensitivity goes up fairly quickly to reasonable levels is that with the starting sample size of 20 subjects, the power to detect those stronger correlations is already fairly high – so in many cases data collection will be terminated as soon as the minimum sample is completed.

Sensitivity

Finally, while I don’t think this plot is entirely necessary, I also show you the false positives / power rates for this latter case. The curves are such beautiful sigmoids that I just cannot help myself but to include them in this post…

Rates

So to sum up, leaving aside the fact that you shouldn’t really peek at your data and stop data collection prematurely in any case, if you do this you can shoot yourself seriously in the foot. While the inflation of false positives through data peeking may have contributed a considerable number of spurious, unreplicable findings to the literature, what is worse it may very well also have contributed a great number of false negatives to the proverbial file drawer: experiments that were run but failed to produce significant results after peeking a few times and which were then abandoned, never to be heard of again. When it comes to spurious findings in the literature, I suspect the biggest problem is not actually data peeking but other questionable practices from the Garden of Forking Paths, such as tweaking the parameters of an experiment or the analysis.

* Actually it may just be me…

Matlab code for these simulations. Please let me know if you discover the inevitable bugs in this analysis.