Irish Times OpEds are just bloody awful at science (n=1)

TL-DR: No, men are not “better at science” than women.

Clickbaity enough for you? I cannot honestly say I have read a lot of OpEds in the Irish Times so the evidence for my titular claim is admittedly rather limited. But it is still more solidly grounded in actual data than this article published yesterday in the Irish Times. At least I have one data point.

The article in question, a prime example of Betteridge’s Law, is entitled “Are men just better at science than women?”. I don’t need to explain why such a title might be considered sensationalist and controversial. The article itself is an “Opinion” piece, thus allowing the publication to disavow any responsibility for its authorship whilst allowing it to rake in the views from this blatant clickbait. In it, the author discusses some new research reporting gender differences in systemising vs empathising behaviour and puts this in the context of some new government initiative to specifically hire female professors because apparently there is some irony here. He goes on a bit about something called “neurosexism” (is that a real word?) and talks about “hard-wired” brains*.

I cannot quite discern if the author thought he was being funny or if he is simply scientifically illiterate but that doesn’t really matter. I don’t usually spend much time commenting on stuff like that. I have no doubt that the Irish Times, and this author in particular, will be overloaded with outrage and complaints – or, to use the author’s own words, “beaten up” on Twitter. There are many egregious misrepresentations of scientific findings in the mainstream media (and often enough, scientists and/or the university press releases are the source of this). But this example of butchery is just so bad and infuriating in its abuse of scientific evidence that I cannot let it slip past.

The whole argument, if this is what the author attempted, is just riddled with logical fallacies and deliberate exaggerations. I have no time or desire to go through them all. Conveniently, the author already addresses a major point himself by admitting that the study in question does not actually speak to male brains being “hard-wired” for science, but that any gender differences could be arising due to cultural or environmental factors. Not only that, he also acknowledges that the study in question is about autism, not about who makes good professors. So I won’t dwell on these rather obvious points any further. There are much more fundamental problems with the illogical leaps and mental gymnastics in this OpEd:

What makes you “good at science”?

There is a long answer to this question. It most certainly depends somewhat on your field of research and the nature of your work. Some areas require more manual dexterity, whilst others may require programming skills, and others yet call for a talent for high-level maths. As far as we can generalise, in my view necessary traits of a good researcher are: intelligence, creativity, patience, meticulousness, and a dedication to seek the truth rather than confirming theories. That last one probably goes hand-in-hand with some scepticism, including a healthy dose of self-doubt.

There is also a short answer to this question. A good scientist is not measured by their Systemising Quotient (SQ), a self-report measure that quantifies “the drive to analyze or build a rule-based system”. Academia is obsessed with metrics like the h-index (see my previous post) but even pencil pushers and bean counters** in hiring or grant committees haven’t yet proposed to use SQ to evaluate candidates***.

I suspect it is true that many scientists score high on the SQ and also the related Autism-spectrum Quotient (AQ) which, among other things, quantifies a person’s self-reported attention to detail. Anecdotally, I can confirm that a lot of my colleagues score higher than the population average on AQ. More on this in the next section.

However, none of this implies that you need to have a high SQ or AQ to be “good at science”, whatever that means. That assertion is a logical fallacy called affirming the consequent. We may agree that “systemising” characterises a lot of the activities a typical scientist engages in, but there is no evidence that this is sufficient for being a good scientist. It could mean that systemising people are attracted to science and engineering jobs. It certainly does not mean that a non-systemising person cannot be a good scientist.

Small effect sizes

I know I rant a lot about relative effect sizes such as Cohen’s d, where the mean difference is normalised by the variability. I feel that in a lot of research contexts these are given undue weight because the variability itself isn’t sufficiently controlled. But for studies like this we can actually be fairly confident that they are meaningful. The scientific study had a pretty whopping sample size of 671,606 (although that includes all their groups) and also used validation data. The residual physiologist inside me retains his scepticism about self-report questionnaire type measures, but even I have come to admit that a lot of questionnaires can be pretty effective. I think it is safe to say that the Big 5 Personality Factors or the AQ tap into some meaningful real factors. Further, whatever latent variance there may be on these measures, that is probably outweighed by collecting such a massive sample. So the Cohen’s d this study reports is probably quite informative.

What does this say? Well, the difference in SQ between males and females was d=0.31. In other words, the distributions of SQ between sexes overlap quite considerably but the distribution for males is somewhat shifted towards higher values. Thus, while the average man has a subtly higher SQ than the average woman, a rather considerable number of women will have higher SQs than the average man. The study helpfully plots these distributions in Figure 1****:

Distributions of SQ in control females (cyan), control males (magenta), autistic females (red), and autistic males (green).

The relevant curves here are the controls in cyan and magenta. Sorry, colour vision deficient people, the authors clearly don’t care about you (perhaps they are retinasexists?). You’ll notice that the modes of the female and male distributions are really not all that far apart. More noticeable is the skew of all these distributions with a long tail to the right: Low SQs are most common in all groups (including autism) but values across the sample are spread across the full range. So by picking out a random man and a random woman from a crowd, you can be fairly confident that their SQs are both on the lower end but I wouldn’t make any strong guesses about whether the man has a higher SQ than the woman.
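To put a rough number on that overlap, here is a minimal sketch. It assumes both groups are normally distributed with equal variance, which the skewed curves above clearly violate, so treat the output as a ballpark only. The only figure taken from the study is d=0.31; everything else is my own illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


d = 0.31  # standardised male-female difference in SQ reported by the study

# ASSUMPTION: both groups are normal with equal variance. The real SQ
# distributions are skewed, so these numbers are only a rough guide.
# The difference between two random draws has SD sqrt(2), hence d / sqrt(2).
p_man_higher = normal_cdf(d / math.sqrt(2.0))
# Share of women whose SQ lies above the mean of the male distribution.
p_woman_above_avg_man = 1.0 - normal_cdf(d)

print(f"P(random man has higher SQ than random woman): {p_man_higher:.2f}")
print(f"Share of women above the average man: {p_woman_above_avg_man:.2f}")
```

Under these assumptions the random man beats the random woman only about 59% of the time, and roughly 38% of women sit above the average man. Hardly the stuff of strong guesses.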

However, it gets even tastier because the authors of the study actually also conducted an analysis splitting their data from controls into people in Science, Technology, Engineering, or Maths (STEM) professions compared to controls who were not in STEM. The results (yes, I know the colour code is now weirdly inverted – not how I would have done it…) show that people in STEM, whether male or female, tend to have larger SQs than people outside of STEM. But again, the average difference here is actually small and most of it plays out in the rightward tail of the distributions. The difference between males and females in STEM is also much less distinct than for people outside STEM.

Distributions of SQ in STEM females (cyan), STEM males (magenta), control females (red), and control males (green).

So, as already discussed in the previous section, it seems to be the case that people in STEM professions tend to “systemise” a bit more. It also suggests that men systemise more than women but that difference probably decreases for people in STEM. None of this tells us anything about whether people’s brains are “hard-wired” for systemising, if it is about cultural and environmental differences between men and women, or indeed if being trained in a STEM profession might make people more systemising. It definitely does not tell you who is “good at science”.

What if it were true?

So far so bad for those who might want to make that interpretive leap. But let’s give them the benefit of the doubt and ignore everything I said up until now. What if it were true that systemisers are in fact better scientists? Would that invalidate government or funders’ initiatives to hire more female scientists? Would that be bad for science?

No. Even if there were a vast difference in systemising between men and women, and between STEM and non-STEM professions, respectively, all such a hiring policy will achieve is to increase the number of good female scientists – exactly what this policy is intended to do. Let me try an analogy.

Basketball players in the NBA tend to be pretty damn tall. Presumably it is easier to dunk when you measure 2 meters than when you’re Tyrion Lannister. Even if all other necessary skills here are equal there is a clear selection pressure for tall people to get into top basketball teams. Now let’s imagine a team decided they want to hire more shorter players. They declare they will hire 10 players who cannot be taller than 1.70m. The team will have try-outs and still seek to get the best players out of their pool of applicants. If they apply an objective criterion for what makes a good player, such as the ability to score consistently, they will only hire short players with excellent aim or who can jump really high. In fact, these shorties will be on average better at aiming and/or jumping than the giants they already have on their team. The team selects for the ability to score. Shorties and Tallies get there via different means but they both get there.

In this analogy, being a top scorer is being a systemiser, which in turn makes you a good scientist. Giants tend to score high because they find it easy to reach the basket. Shorties score high because they have other skills that compensate for their lack of height. Women can be good systemisers despite not being men.
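If you prefer simulations to analogies, here is a toy version of the basketball argument. All numbers are hypothetical and chosen by me purely for illustration: heights with a mean of 1.85m and SD of 0.10m, a "skill" variable in arbitrary z-units, a scoring ability that is simply standardised height plus skill, and a single scoring bar that everyone must clear regardless of height.

```python
import numpy as np

rng = np.random.default_rng(1)
n_players = 500_000

# ASSUMPTION: hypothetical numbers chosen purely for illustration.
height = rng.normal(1.85, 0.10, n_players)  # metres
skill = rng.normal(0.0, 1.0, n_players)     # aim/jumping, in z-units
# Scoring ability gets an equal push from standardised height and skill.
score = (height - 1.85) / 0.10 + skill

bar = 1.0           # the same objective scoring bar for everyone
hired = score > bar
short = height < 1.70
tall = height > 2.00

skill_short_hires = skill[hired & short].mean()
skill_tall_hires = skill[hired & tall].mean()
print(f"Short hires: {np.sum(hired & short)}, mean skill {skill_short_hires:.2f}")
print(f"Tall hires:  {np.sum(hired & tall)}, mean skill {skill_tall_hires:.2f}")
```

Far fewer short players clear the bar, but those who do must have much more skill than the tall ones, because nothing else can get them over it. Note that this only works because the bar is an absolute criterion applied to everyone, which is the scenario the analogy describes.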

The only scenario in which such a specific hiring policy could be counterproductive is if two conditions are met: 1) The difference between groups in the critical trait (i.e., systemising) is vast and 2) the policy mandates hiring from a particular group without any objective criteria. We have already established that the former condition isn’t fulfilled here – the difference in systemising between men and women is modest at best. The latter condition is really a moot point because this is simply not how hiring works in the real world. Hiring committees don’t usually just offer jobs to the relatively best person out of the pool but also consider the candidates’ objective abilities and achievements. This is even more pertinent here because all candidates in this case will already be eligible for a professorial position anyway. So all that will in fact happen is that we end up with more female professors who will also happen to be high in systemising.

Bad science reporting

Again, this previous section is based on the entirely imaginary and untenable assumption that systemisers are better scientists. I am not aware of any evidence of that – in part because we cannot actually quantify very well what makes a good scientist. The metrics academics actually (and sadly) use for hiring and funding decisions probably do not quantify that either but I am not even aware of any link between systemising and those metrics. Is there a correlation between h-indices (relative to career age) and SQ? I doubt it.

What we have here is a case of awful science reporting. Bad science journalism and the abuse of scientific data for nefarious political purposes are hardly a new phenomenon – and this won’t just disappear. But the price of freedom (to practice science) is eternal vigilance. I believe as scientists we have a responsibility to debunk such blatant misapprehensions by journalists who I suspect have never even set foot in an actual lab or spoken to any actual scientists.

Some people assert that improving the transparency and reliability of research will hurt the public’s faith in science. Far from it, I believe those things can show people how science really works. The true damage to how the public perceives science is done by garbage articles in the mainstream media like this one – even if it is merely offered as an “opinion”.

By Keith Allison

*) Brains are not actually hard-wired to do anything. Leaving the old Hebbian analogy aside, brains aren’t wired at all, period. They are soft, squishy, wet sponges containing lots of neuronal and glial tissue plus blood vessels. Neurons connect via synapses between axons and dendrites and this connectivity is constantly regulated and new connections grown while others are pruned. This adaptability is one of the main reasons why we even have brains, and lies at the heart of the intelligence, ingenuity, and versatility of our species.

**) I suspect a lot of the pencil pushers and bean counters behind metrics like impact factors or the h-index might well be Systemisers.

***) I hope none of them read this post. We don’t want to give these people any further ideas…

****) Isn’t open access under Creative Commons license great?

Enough with the stupid heuristics already!

Today’s post is inspired by another nonsensical proposal that made the rounds and that reminded me why I invented the Devil’s Neuroscientist back in the day (Don’t worry, that old demon won’t make a comeback…). So apparently RetractionWatch created a database allowing you to search for an author’s name to list any retractions or corrections of their publications*. Something called the Ochsner Journal then declared they would use this to scan “every submitting author’s name to ensure that no author published in the Journal has had a paper retracted.” I don’t want to dwell on this abject nonsense – you can read about this in this Twitter thread. Instead I want to talk about the wider mentality that I believe underlies such ideas.

In my view, using retractions as a stigma to effectively excommunicate any researcher from “science” forever is just another manifestation of a rather pervasive and counter-productive tendency of trying to reduce everything in academia to simple metrics and heuristics. Good science should be trustworthy, robust, careful, transparent, and objective. You cannot measure these things with a number.

Perhaps it is unsurprising that quantitative scientists want to find ways to quantify such things. After all, science is the endeavour to reveal regularities in our observations to explain the variance of the natural world and thus reduce the complexity in our understanding of it. There is nothing wrong with meta-science and trying to derive models of how science – and scientists – work. But please don’t pretend that these models are anywhere near good enough to actually govern all of academia.

Few people you meet still believe that the Impact Factor of a journal tells you much about the quality of a given publication in it. Neither does an h-index or citation count tell us anything about the importance or “impact” of somebody’s research, certainly not without looking at this relative to the specific field of science they operate in. The rate with which someone’s findings replicate doesn’t tell you anything about how great a scientist they are. And you certainly won’t learn anything about the integrity and ability of a researcher – and their right to publish in your journal – when all you have to go on is that they were an author on one retracted study.

Reducing people’s careers and scientific knowledge to a few stats is lazy at best. But it is also downright dangerous. As long as such metrics are used to make consequential real-life decisions, people are incentivised to game them. Nowhere can this be seen better than with the dubious tricks some journals use to inflate their Impact Factor or the occasional dodgy self-citation scandals. Yes, in the most severe cases these are questionable, possibly even fraudulent, practices – but there is a much greater grey area here. What do you think would happen, if we adopted the policy that only researchers with high replicability ratings get put up for promotion? Do you honestly think this would encourage scientists to do better science rather than merely safer, boring science?

This argument is sometimes used as a defence of the status quo and a reason why we shouldn’t change the way science is done. Don’t be fooled by that. We should reward good and careful science. We totally should give credit to people who preregister their experiments, self-replicate their findings, test the robustness of their methods, and go the extra mile to ensure their studies are solid. We should appreciate hypotheses based on clever, creative, and/or unique insights. We should also give credit to people for admitting when they are wrong – otherwise why should anyone seek the truth?

The point is, you cannot do any of that with a simple number in your CV. Neither can you do that by looking at retractions or failures to replicate as a plague mark on someone’s career. I’m sorry to break it to you, but the only way to assess the quality of some piece of research, or to understand anything about the scientist behind it, is to read their work closely and interpret it in the appropriate context. That takes time and effort. Often it also necessitates talking to them because no matter how clever you think you are, you will not understand everything they wrote, just as not everybody will comprehend the gibberish you write. If you believe a method is inadequate, by all means criticise it. Look at the raw data and the analysis code. Challenge interpretations you disagree with. Take nobody’s word for granted and all that…

But you can shove your metrics where the sun don’t shine.

P-values are uniformly distributed when the null hypothesis is true

TL-DR: If the title of this blog post is unsurprising to you, I suggest you go play outside.

Many discussions in my science social media bubble circle around p-values (what an exciting life I lead…). Just a few days ago, there was a big kerfuffle about p-curving and whether p-values just below 0.05 are a sign of whatever. One of the main concepts behind p-curves is that under the assumption that the null hypothesis (H0) of no effect/difference is true, p-values should be uniformly distributed (at least as long as the test assumptions are met reasonably). This once again supported my suspicions that most people don’t actually know what p-values mean. Reports of people defining p-values incorrectly abound, sometimes even in stats textbooks. It also seems to me that people find p-values rather unintuitive. And I get the impression a lot of people vastly overestimate how widely known things like p-curve actually are.

A few weeks ago I got embroiled in a Facebook discussion. A friend of mine was running a permutation analysis to test something about his experiment and found something very odd: the distribution of p-values was skewed severely to the left – there were very few low p-values but the proportion was steadily increasing with most p-values being just below 1. He expected this distribution to be uniform because under the random permutations H0 should be true. A lot of commenters on his post seemed rather surprised and/or confused by the whole idea that p-values should be distributed uniformly when H0 is true. “Surely,” so the common intuition goes, “when there is actually no difference, most p-values should be high and close to 1?”

No, and the reason why not is the p-value itself. A p-value can be calculated/estimated in many different ways. Most people use parametric tests but essentially they all share one philosophy. If you have no underlying effect and randomly sample data ad infinitum you end up with a distribution of test statistics. In my example, I draw two variables each with n=100 from a normal distribution and calculate the Pearson correlation between them – and I repeat this 20,000 times. This produces a distribution of correlation coefficients like this:


There is no correlation between two random variables (H0 is true) and so the distribution is centred on zero. The spread of the distribution depends on the sample size. Larger samples will produce narrower distributions. Critically, we can use this distribution to get a p-value. If we had observed a correlation of r=0.3 in our experiment, we could calculate the proportion of correlation coefficients in this distribution that are equal to or greater than 0.3. This would give us a one-tailed p-value. If you ignore the sign of the correlation, that is, you compare absolute values, you get a two-tailed p-value.

In the plot above, I coloured the 5% most extreme correlation coefficients in blue (2.5% to the left and to the right, respectively). These regions are abutted by vertical red lines at just below +/-0.2 in this case. This reflects the critical effect size needed to get p<0.05 – only 5% of the correlation coefficients in this distribution are +/-0.19ish or even more extreme.
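For anyone who wants to play along at home, here is a minimal sketch of this simulation. It is not my original code; the function name, the seed, and the use of NumPy are choices made for this illustration. It draws the 20,000 pairs of n=100 normal variables, then reads off the one- and two-tailed p-values for r=0.3 and the critical correlation for p<0.05.

```python
import numpy as np


def null_correlations(n_sims=20_000, n=100, seed=0):
    """Pearson r between two independent normal samples, repeated n_sims times."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, n))
    y = rng.standard_normal((n_sims, n))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))


null_r = null_correlations()
r_obs = 0.3

# One-tailed: share of null correlations at least as large as the observed one.
p_one_tailed = np.mean(null_r >= r_obs)
# Two-tailed: same, but ignoring the sign of the correlation.
p_two_tailed = np.mean(np.abs(null_r) >= abs(r_obs))
# Critical |r|: only 5% of null correlations are this extreme or more.
critical_r = np.quantile(np.abs(null_r), 0.95)

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
print(f"critical |r| for p<0.05: {critical_r:.3f}")
```

The critical value should land just below 0.2, in line with the red lines described above.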

Now compare this to the region coloured in red. This region also makes up 5% of the whole distribution. However, the red region surrounds zero, that is, those correlation coefficients that are really close to the true correlation value. Random chance makes the distribution spread out (and that becomes more severe when your sample size is low) but most of the correlations will nevertheless be close to the true value of zero. Therefore, the range of values in this red region is much narrower because the values are much denser here.

But of course these nigh-zero correlation coefficients will have the largest p-values. Consider again what a p-value reflects. If your observed correlation is 0.006 and you again ignore the sign of the effects, almost all correlations in this null distribution would be equal to or greater than 0.006. So this proportion, the p-value, is almost 1. Put in other words, the 5% of p-values below 0.05 come from the long, thin tails of the null distribution, while the 5% of really high p-values above 0.95 come from a really narrow sliver of the null distribution near zero:


Visualised the same way, you have the blue region with p<0.05 on the left. Here correlations are large (greater than 0.19ish). On the right, you have the red region with p>0.95. Here correlations are really close to zero.

In other words, you can directly read off the p-value from the x-axis of this distribution of p-values. This is a direct consequence of what p-values represent. They are the proportion of values in the null distribution where correlations are equal to or more extreme than the observed correlation.
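Here is a rough sketch of how you could check the uniformity yourself. The set-up is my own for this illustration: one batch of simulated correlations serves as the reference null distribution, and a second, independent batch plays the role of 20,000 experiments in which H0 is true. Each experiment gets a two-tailed p-value by comparing its |r| against the reference distribution.

```python
import numpy as np


def simulate_r(n_sims, n, rng):
    """Pearson r between two independent normal samples, repeated n_sims times."""
    x = rng.standard_normal((n_sims, n))
    y = rng.standard_normal((n_sims, n))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))


rng = np.random.default_rng(7)
null_abs = np.sort(np.abs(simulate_r(20_000, 100, rng)))  # reference null distribution
observed = np.abs(simulate_r(20_000, 100, rng))           # fresh experiments, H0 true

# Two-tailed p-value: share of null |r| values at least as large as each observed |r|.
n_smaller = np.searchsorted(null_abs, observed, side="left")
p_values = 1.0 - n_smaller / null_abs.size

# Share of experiments falling into each of 20 equally wide p-value bins.
counts, _ = np.histogram(p_values, bins=20, range=(0.0, 1.0))
share = counts / counts.sum()
print(np.round(share, 3))
```

Every bin should hold roughly 5% of the experiments, whether it covers p<0.05 or p>0.95.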

Of course, if the null hypothesis is false and there actually is a correlation between the two variables this distribution must become skewed. There should now be many more tests with low p-values than with large ones. This is exactly what happens and this is the pattern that analyses like p-curve seek to detect:
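A sketch of that skew, using the same approach as for the null case. The true correlation of 0.2 is a hypothetical value I picked for this illustration, as are the function name and seed.

```python
import numpy as np


def simulate_r(n_sims, n, rho, rng):
    """Pearson r for n_sims pairs of normal samples with true correlation rho."""
    x = rng.standard_normal((n_sims, n))
    noise = rng.standard_normal((n_sims, n))
    y = rho * x + np.sqrt(1.0 - rho**2) * noise
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))


rng = np.random.default_rng(11)
null_abs = np.sort(np.abs(simulate_r(20_000, 100, 0.0, rng)))  # null distribution
# ASSUMPTION: a hypothetical true correlation of 0.2, so H0 is false.
observed = np.abs(simulate_r(20_000, 100, 0.2, rng))

# Two-tailed p-value of each experiment against the null distribution.
n_smaller = np.searchsorted(null_abs, observed, side="left")
p_values = 1.0 - n_smaller / null_abs.size

frac_low = np.mean(p_values < 0.05)
frac_high = np.mean(p_values > 0.95)
print(f"p<0.05: {frac_low:.3f}   p>0.95: {frac_high:.3f}")
```

With these settings a large share of the experiments should come out at p<0.05 while hardly any end up above 0.95, the opposite of a flat distribution.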


Now, my friend’s p-distribution looked essentially like the mirror image of this. I still haven’t learned what could have possibly caused this. It would mean that more effect sizes were close to zero than there should be under H0. This could suggest some assumptions not being met but none of my own feeble simulations managed to reproduce the pattern he found. His analyses sounded quite complex so there is a possibility that there were some complex errors in them.


Is d>10 a plausible effect size?

TL;DR: You may get a very large relative effect size (like Cohen’s d), if the main source of the variability in your sample is the reliability of each observation and the measurement was made as exact as is feasible. Such a large d is not trivial, but in this case talking about d is missing the point.

In discussions of scientific findings you will often hear talk about relative effect sizes, like the ubiquitous Cohen’s d. Essentially, such effect sizes quantify the mean difference between groups/treatments/conditions relative to the variability across subjects/observations. The situation is actually a lot more complicated because even for a seemingly simple result like the difference between conditions you will find that there are several ways of calculating the effect size. You can read a nice summary by Jake Westfall here. There are also other effect sizes, such as correlation coefficients, and what I write here applies to those, too. I will however stick to the difference-type effect size because it is arguably the most common.

One thing that has irked me about those discussions for some years is that this ignores a very substantial issue: the between-subject variance of your sample depends on the within-subject variance. The more unreliable the measurement of each subject, the greater is the variability of your sample. Thus the reliability of individual measurements limits the relative effect size you can possibly achieve in your experiment given a particular experimental design. In most of science – especially biological and psychological sciences – the reliability of individual observations is strongly limited by the measurement error and/or the quality of your experiment.

There are some standard examples that are sometimes used to illustrate what a given effect size means. I stole a common one from this blog post about the average height difference between men and women, which apparently was d=1.482 in 1980 Spain. I have no idea if this is true exactly but that figure should be in the right ballpark. I assume most people will agree that men are on average taller than women but that there is nevertheless substantial overlap in the distributions – so that relatively frequently you will find a woman who is taller than many men. That is an effect size we might consider strong.

The height difference between men and women is a good reference for an effect size because it is largely limited by the between-subject variance, the variability in actual heights across the population. Obviously, the reliability of each observation also plays a role. There will definitely be a degree of measurement error. However, I suspect that this error is small, probably on the order of a few millimeters. Even if you’re quite bad at this measurement I doubt you will typically err by more than 1-2 cm and you can probably still replicate this effect in a fairly small sample. However, in psychology experiments your measurement rarely is that accurate.

Now, in some experiments you can increase the reliability of your individual measurement by increasing the number of trials (at this point I’d like to again refer to Richard Morey’s related post on this topic). In psychophysics, collecting hundreds or thousands of trials on one individual subject is not at all uncommon. Let’s take a very simple case. Contour integration refers to the ability of the visual system to detect “snake” contours better than “ladder” contours or those defined by other orientations (we like to call those “ropes”):

In the left image you should hopefully see a circle defined by 16 grating patches embedded in a background of randomly oriented gratings. This “snake” contour pops out from the background because the visual system readily groups orientations along a collinear (or cocircular) path into a coherent object. In contrast, when the contour is defined by patches of other orientations, for example the “rope” contour in the right image which is defined by patches at 45 degrees relative to the path, then it is much harder to detect the presence of this contour. This isn’t a vision science post so I won’t go into any debates on what this means. The take-home message here is that if healthy subjects with normal vision are asked to determine the presence or absence of a contour like this, especially with limited viewing time, they will perform very well for the “snake” contours but only barely above chance levels for the “rope” contours.

This is a very robust effect and I’d argue this is quite representative of many psychophysical findings. A psychophysicist probably wouldn’t simply measure the accuracy but conduct a broader study of how this depends on particular stimulus parameters – but that’s not really important here. It is still pretty representative.

What is the size of this effect? 

If I study this in a group of subjects, the relative effect size at the group level will depend on how accurately I measure the performance in each individual. If I have 50 subjects (which is 10-25 times larger than your typical psychophysics study…) and each performs just one trial, then the sample variance will be much larger compared to if each of them does 100 trials or if they each do 1000 trials. As a result, the Cohen’s d of the group will be considerably different. A d>10 should be entirely feasible if we collect enough trials per person.
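To make this concrete, here is a toy simulation. The numbers are hypothetical and mine alone: I assume true accuracies of about 95% for snakes and 55% for ropes, with a between-subject SD of 1 percentage point, and I compute a paired Cohen’s d as the mean difference divided by the standard deviation of the differences. Real data will of course look different.

```python
import numpy as np


def paired_cohens_d(n_trials, n_subjects=50, seed=0):
    """Cohen's d of the snake-minus-rope accuracy difference across subjects."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: hypothetical true accuracies with little between-subject spread.
    p_snake = np.clip(rng.normal(0.95, 0.01, n_subjects), 0.0, 1.0)
    p_rope = np.clip(rng.normal(0.55, 0.01, n_subjects), 0.0, 1.0)
    # Observed accuracy is the proportion correct out of n_trials per condition.
    acc_snake = rng.binomial(n_trials, p_snake) / n_trials
    acc_rope = rng.binomial(n_trials, p_rope) / n_trials
    diff = acc_snake - acc_rope
    return diff.mean() / diff.std(ddof=1)


d_by_trials = {n_trials: paired_cohens_d(n_trials) for n_trials in (1, 100, 1000)}
for n_trials, d in d_by_trials.items():
    print(f"{n_trials:>4} trials per condition: d = {d:.2f}")
```

The underlying difference in accuracy is the same 40 percentage points in every case. Only the number of trials changes, yet under these assumptions d should climb from below 1 with a single trial to well above 10 with 1000 trials.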

People will sometimes say that large effects (d>>2 perhaps) are trivial. But there is nothing trivial about this. In this particular example you may see the difference quite easily for yourself (so you are a single-subject and single-trial replication). But we might want to know just how much better we are at detecting the snake than the rope contours. Or, as I already mentioned, a psychophysicist might measure the sensitivity of subjects to various stimulus parameters in this experiment (e.g., the distance between patches, the amount of noise in the orientations we can tolerate, etc) and this could tell us something about how vision works. The Cohen’s d would be pretty large for all of these. That does not make it trivial but in my view it makes it useless:

Depending on my design choices the estimated effect size may be a very poor reflection of the true effect size. As mentioned earlier, the relative effect size is directly dependent on the between-subject variance – but that in turn depends on the reliability of individual measurements. If each subject only does one trial, the effect of just one attentional lapse or accidental button press in the task is much more detrimental than when they perform 1000 trials, even if the overall rate of lapses/accidents is the same*.

Why does this matter?

In many experiments, the estimate of between-subject variance will be swamped by the within-subject variability. Returning to the example of gender height differences, this is essentially what would happen if you chose to eyeball each person’s height instead of using a tape measure. I’d suspect that is the case for many experiments in social or personality psychology where each measurement is essentially a single quantity (say, timing the speed with which someone walks out of the lab in a priming experiment) rather than being based on hundreds or thousands of trials as in psychophysics. Notoriously noisy measurements are also doubtless the major limiting factor in most neuroimaging experiments. On the other hand, I assume a lot of questionnaire-type results you might have in psychology (such as IQ or the Big Five personality factors) have actually pretty high test-retest reliability and so you probably do get mostly the between-subject variance.
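The eyeballing example can be sketched with a little arithmetic. If each observation carries independent measurement error, the observed variance is the sum of the true between-subject variance and the error variance, so the observed d shrinks by a factor of sqrt(1 + error variance / between-subject variance). The between-subject SD of 7 cm and the three error SDs below are hypothetical values of mine, and I treat d=1.482 as if it were free of measurement error.

```python
import math


def observed_d(true_d: float, sd_between: float, sd_error: float) -> float:
    """Cohen's d after adding independent measurement error to each observation."""
    return true_d / math.sqrt(1.0 + (sd_error / sd_between) ** 2)


true_d = 1.482     # height difference quoted above, treated as error-free
sd_between = 7.0   # ASSUMPTION: hypothetical between-subject SD of height in cm

# ASSUMPTION: hypothetical error SDs in cm for three ways of measuring height.
for label, sd_error in [("tape measure", 0.3), ("eyeballing", 5.0), ("wild guess", 15.0)]:
    print(f"{label:>12}: d = {observed_d(true_d, sd_between, sd_error):.2f}")
```

The population effect never changes here. Only the quality of the measurement does, and with it the d you would report.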

The problem is that often it is very difficult to determine which scenario we are in. In psychophysics, we are often so extremely dominated by the measurement reliability that a knowledge of the “true” population effect size is actually completely irrelevant. This is a critical issue because you cannot use such an effect size for power analysis: If I take an experiment someone did and base my power analysis on the effect size they reported, I am not really powering my experiment to detect a similar effect but a similar design. (This is particularly useless if I then decide to use a different design…)

So next time you see an unusually large Cohen’s d (d>10 or even d>3) ask yourself not simply whether this is a plausible effect but whether this experiment can plausibly estimate the true population effect. If this result is based on a single observation per subject with a highly variable measurement (say, how often Breton drivers stop for female hitchhikers wearing red clothing…), even a d=1 seems incredibly large.

But if it is for a measurement that could have been made more reliable by doubling the amount of data collected in each subject (say, a change in psychometric thresholds), then a very high Cohen’s d is entirely plausible – but it is also pretty meaningless. In this situation, what we should really care about is the absolute effect size (How much does the threshold change? How much does the accuracy drop? etc).
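
To make the dependence on trial count concrete, here is a minimal Python sketch. It assumes a simple between-group comparison in which each subject's score is the average of n_trials independent noisy trials, so that the observed variance is the between-subject variance plus the within-subject variance divided by n_trials. The function name and all the numbers are made up for illustration and do not come from any real experiment.

```python
import math


def expected_cohens_d(mean_diff, sd_between, sd_within, n_trials):
    """Expected Cohen's d when each subject's score averages n_trials noisy trials.

    Assumes independent trial noise, so averaging over n_trials shrinks the
    within-subject variance by a factor of n_trials.
    """
    observed_sd = math.sqrt(sd_between ** 2 + sd_within ** 2 / n_trials)
    return mean_diff / observed_sd


if __name__ == "__main__":
    # Hypothetical numbers: absolute difference 1.0, between-subject SD 0.5,
    # single-trial (within-subject) SD 5.0.
    for n_trials in (1, 10, 100, 1000):
        d = expected_cohens_d(1.0, 0.5, 5.0, n_trials)
        print(f"{n_trials:5d} trials per subject -> d = {d:.2f}")
```

Under these assumptions the standardised effect grows with every extra trial and approaches a ceiling of mean_diff / sd_between, while the absolute effect (here 1.0) never changes. That is the sense in which a huge d from a well-measured design says more about the design than about the population.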

And I must say, I remain unsure whether absolute effect sizes aren’t more useful in general, including for experiments on complex human behaviour, neuroimaging, or drug effects.

* Actually the lapse rate probably increases with a greater number of trials due to subject fatigue, drop in motivation, or out of pure spite. But even that increase is unlikely to be as detrimental as having too few trials.

Of hacked peas and crooked teas

The other day, my twitter feed got embroiled in another discussion about whether or not p-hacking is deliberate and if it constitutes fraud. Fortunately, I then immediately left for a trip abroad and away from my computer, so there was no danger of me being drawn into this debate too deeply and running the risk of owing Richard Morey another drink. However, now that I am back I wanted to elaborate a bit more on why I think the way our field has often approached p-hacking is both wrong and harmful.

What the hell is p-hacking anyway? When I google it I get this Wikipedia article, which uses it as a synonym for “data dredging”. There we already have a term that seems to me more appropriate. P-hacking refers to when you massage your data and analysis methods until your result reaches a statistically significant p-value. I will put it to you that in practice most p-hacking is not necessarily about hacking p-s but about dredging your data until your results fit a particular pattern. That may be something you predicted but didn’t find or could even just be some chance finding that looked interesting and is amplified this way. However, the p-value is usually probably secondary to the act here. The end result may very well be the same in that you continue abusing the data until a finding becomes significant, but I would bet that in most cases what matters to people is not the p-value but the result. Moreover, while null-hypothesis significance testing with p-values is still by far the most widespread way to make inferences about results, it is not the only way. All this fussing about p-hacking glosses over the fact that the same analytic flexibility or data dredging can be applied to any inference, whether it is based on p-values, confidence intervals, Bayes factors, posterior probabilities, or simple summary statistics. By talking of p-hacking we create a caricature that this is somehow a problem specific to p-values. Whether or not NHST is the best approach for making statistical inferences is a (much bigger) debate for another day – but it has little to do with p-hacking.

What is more, not only is p-hacking not really about p’s but it is also not really about hacking. Here is the dictionary entry for the term ‘hacking‘. I think we can safely assume that when people say p-hacking they don’t mean that peas are physically being chopped or cut or damaged in any way. I’d also hazard a guess that it’s not meant in the sense of “to deal or cope with” p-values. In fact, the only meaning of the term that seems to come even remotely close is this:

“to modify a computer program or electronic device in a skillful or clever way”

Obviously, what is being modified in p-hacking is the significance or impressiveness of a result, rather than a computer program or electronic device, but we can let this slide. I’d also suggest that it isn’t always done in a skillful or clever way either, but perhaps we can also ignore this. However, the verb ‘hacking’ to me implies that this is done in a very deliberate way. It may even, as with computer hacking, carry the connotation of fraud, of criminal intent. I believe neither of these things is true about p-hacking.

That is not to say that p-hacking isn’t deliberate. I believe in many situations it likely is. People no doubt make conscious decisions when they dig through their data. But the overwhelming majority of p-hacking is not deliberately done to create spurious results that the researcher knows to be false. Anyone who does so would be committing actual fraud. Rather, most p-hacking is the result of confirmation bias combined with analytical flexibility. This leads people to sleepwalk into creating false positives or – as Richard Feynman would have called it – fooling themselves. Simine Vazire already wrote an excellent post about this a few years ago (and you may see a former incarnation of yours truly in the comment section arguing against the point I’m making here… I’d like to claim that it’s because I have grown as a person but in truth I only exorcised this personality :P). I’d also guess that a lot of p-hacking happens out of ignorance, although that excuse really shouldn’t fly as easily in 2017 as it may have done in 2007. Nevertheless, I am pretty sure people do not normally p-hack because they want to publish false results.
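
To illustrate how analytical flexibility alone can produce false positives without anyone intending to deceive, here is a minimal simulation sketch in Python. It assumes one specific and simplified kind of flexibility: a researcher measures several independent outcomes in two groups that truly do not differ, and reports the study as a finding if any outcome reaches significance. The function names, sample sizes, and the use of a z-test with known SD are my illustrative assumptions, not a description of any particular study.

```python
import random
from statistics import NormalDist


def p_value_null_comparison(rng, n_per_group):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1)."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    mean_a = sum(group_a) / n_per_group
    mean_b = sum(group_b) / n_per_group
    # The SD is known to be 1 in this toy world, so a z-test is appropriate.
    z = (mean_a - mean_b) / (2.0 / n_per_group) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, n_per_group=20, n_studies=2000,
                        alpha=0.05, seed=1):
    """Share of null studies in which at least one outcome is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [p_value_null_comparison(rng, n_per_group)
                    for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for n_outcomes in (1, 5, 10):
        rate = false_positive_rate(n_outcomes)
        print(f"{n_outcomes:2d} outcomes tried -> false positive rate {rate:.3f}")
```

For independent outcomes the theoretical rate is 1 - (1 - alpha)^k, so roughly 5% with one outcome, 23% with five, and 40% with ten. The simulated rates should land near those values. Nothing in the sketch is specific to p-values: the same selection step could be applied to a Bayes factor or a confidence interval.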

Some may say that it doesn’t matter whether or not p-hacking is fraud – the outcome is the same: many published results are false. But in my view it’s not so simple. First, the solution to these two problems surely isn’t the same. Preregistration and transparency may very well solve the problem of analytical flexibility and data dredging – but they are not going to stop deliberate fraud, nor are they meant to. Second, actively conflating fraud and data dredging implicitly accuses researchers of being deliberately misleading and thus automatically puts them on the defensive. This is hardly a way to have a productive discussion and convince people to do something about p-hacking. You don’t have to look very far for examples of that playing out. Several protracted discussions on a certain Psychology Methods Facebook group come to mind…

Methodological flexibility is a real problem. We definitely should do something about it and new moves towards preregistration and data transparency are at least theoretically effective solutions to improve things. The really pernicious thing about p-hacking is that people are usually entirely unaware of the fact that they are doing it. Until you have tried to do a preregistered study, you don’t appreciate just how many forks in the road you passed along the way (I may blog about my own experiences with that at some point). So implying, however unintentionally, that people are fraudsters is not helping matters.

Preregistration and data sharing have gathered a lot of momentum over the past few years. Perhaps the opinions of some old tenured folks opposed to such approaches no longer carry so much weight now, regardless of how powerful they may be. But I’m not convinced that this is true. Just because there has been momentum now does not mean that these ideas will prevail. It is just as likely that they fizzle out due to lacking enthusiasm or because people begin to feel that the effort isn’t worth it. It seems to me that “open science” very much exists in a bubble and I have bemoaned that before. To change scientific practices we need to open the hearts and minds of sceptics to why p-hacking is so pervasive. I don’t believe we will achieve that by preaching to them. Everybody p-hacks if left to their own devices. Preregistration and open data can help protect yourself against your mind’s natural tendency to perceive patterns in noise. A scientist’s training is all about developing techniques to counteract this tendency, and so open practices are just another tool for achieving that purpose.

There is something fishy about those pea values…


A few pints in Brussels

Good day to you, my name is David but my frenemies call me Tory Dave. Last night, I went to the pub with some mates of mine for some pints. I say “mates” but to be honest I don’t really like most of these people very much. They are all foreigners and – while I know it’s not okay to say this out loud – I must admit I don’t really like foreigners, except perhaps when they are white and come from one of the former colonies.

The Italian guy is just so lazy, always on siesta, and the Spaniard is from some filthy place like Venice or some such and always complains when I put chorizo in my risotto. That German lass, Angie, has no sense of humour and they always beat us at the football. But the worst is of course the French guy, Michel. I’ve never liked the French. He always smells of garlic and looks so sour. This is why I always pre-drink before I even meet those guys in the pub and why I look so cheerful in the photos.

So anyway, the others bought a few rounds. But when it came to my turn I just thought “Nah, you know what, I’ll just go home.” For some reason all the others got really pissed off that I wasn’t going to buy a round for them. At first I wanted to tell them to go whistle but then I took a deep breath lest they kick me out on the kerb without my umbrella and bowler hat.  I told them I would carefully check every item on the bill and only pay for what I drank, but that they should keep in mind that I already paid loads many years ago even though nobody can really remember. They claim that the German, French, and Italian all put in more than me but of course that’s just because of that discount I’ve had in the pub for decades because I used to be a lot less well-off than I am now. I felt it was best not to bring that up though because that’s always been a sore spot with them. Instead I just shrugged and stumbled home. They are still asking me to pay them back now because apparently “we had all agreed on that” but I have no recollection of that at all. Perhaps I shouldn’t have had all that gin before joining them.

Anyway, I still haven’t paid and won’t give them a penny. All the same, I’m sure they will all be happy to see me again soon because they want to sell me their cars and prosecco. After all, they’ll know I’ve saved up for that because I didn’t waste my hard earned cash on buying them rounds in the pub. Besides I have such excellent jams made from strawberries in my garden. Jam production costs me next to nothing because I just invite the neighbours around to pick them for a pittance. Shame is those bastards always overstay their welcome so I have now told them they can come for two hours but then they have to go home. I don’t mean to sound ungrateful, some of the Poles actually came to fix my radiator once and the Dutch put a plaster on my scraped knee when I fell over drunk after the pub. And they’ve always put in a couple of coins in the piggy bank (I’m saving for a cool model train for my kids). So I totally think these guys are valued members of our community here. I just don’t want them in my house.

Obviously, the ones I already let come in can stay, they just need to carry a card at all times so we know who they are and that they can get a cup of tea when I put on the kettle while the newcomers don’t. This is only fair. By the way, the card is totally not an “ID card” – I hate those obviously. It’s simply a card you use to identify yourself with. Surely, the other people coming in later won’t mind. They’re all happy they can come to pick my strawberries and then piss off again when I’m tired of hearing them speak that foreign gibberish. They steadfastly refuse to spend the fortune they make working for me on proper English lessons. Seriously, sometimes I don’t hear a word of English before I make it upstairs to the bedroom – and even then it’s mostly because I’m mumbling to myself about sovereignty and all the Islamic extremists from devout Muslim countries like Austria and Poland.

Naturally, the ID ca… sorry not-an-ID-card will only be required for people who don’t already live at this address, by which I mean who were born here or who lived over six years at this address, took an exam on the history of my house that I couldn’t pass myself in a million years, and who spent between £1200-2000 for the honour to pledge allegiance to the Queen. Because you can totally tell the difference between these people and the ones that didn’t. They just immediately become part of the family and show it by wearing tweed jackets, going on illegal fox hunts, and losing the ability to speak any other language but English.

Obviously, if anyone living at this address wants to marry someone who doesn’t they can just get the hell out. Why should I let some random slit-eyed or brown person live in my house just because my son or daughter wants to be with their spouse? Wait, did I say that out loud? I meant some person from one of those wonderful countries that I would like to sell lots of stuff to. You know, things like my strawberry jam.

Anyway, I digress. If those bullies in the pub keep insisting that I pay for a round then I’ll just walk away. No round is better than a bad round, by which I mean a round that isn’t free. If they don’t want to spend time with me, then I am sure I’ll find someone else to have drinks with. Like Theresa, the vicar’s daughter from down the road, or Moggy, who looks and talks more and more like a vicar every day himself. Or perhaps Boris the Blond although he’s really a bit of a clown. And of course I can always go over to the golf course for some well-done steak with ketchup and chlorine chicken with Donny. Provided he doesn’t go to jail first.


Angels in our midst?

A little more on “tone” – but also some science

This post is somewhat related to the last one and will be my last words on the tone debate*. I am sorry if calling it the “tone debate” makes some people feel excluded from participating in scientific discourse. I thought my last post was crystal clear that science should be maximally inclusive, that everyone has the right to complain about things they believe to be wrong, and that unacceptable behaviour should be called out. And certainly, I believe that those with the most influence have a moral obligation to defend those who are in a weaker position (with great power comes great responsibility, etc…). It is how I have always tried to act. In fact, not so long ago I called out a particularly bullish but powerful individual because he repeatedly acts in my (and, for that matter, many other people’s) estimation grossly inappropriately in post-publication peer review. In response, I and others have taken a fair bit of abuse from said person. Speaking more generally, I also feel that as a PI I have a responsibility to support those junior to me. I think my students and postdocs can all stand up for themselves, and I would support them in doing so, but in any direct confrontation I’ll be their first line of defense. I don’t think many who have criticised the “tone debate” would disagree with this.

The problem with arguments about tone is that they are often very subjective. The case I mentioned above is a pretty clear cut case. Many other situations are much greyer. More importantly, all too often “tone” is put forth as a means to silence criticism. Quite to the contrary of the argument that this “excludes” underrepresented groups from participating in the debate, it is used to categorically dismiss any dissenting views. In my experience, the people making these arguments are almost always people in positions of power.

A recent example of the tone debate

One of the many events that recently brought the question of tone to my mind was this tweet by Tom Wallis. On PubPeer** a Lydia Maniatis has been posting comments on what seems to be just about every paper published on psychophysical vision science.

I find a lot of things to be wrong with Dr Maniatis’ comments. First and foremost, it remains a mystery to me what the actual point is she is trying to make. I confess I must first read some of the literature she cites to comprehend the fundamental problem with vision science she clearly believes to have identified. Who knows, she might have an important theoretical point but it eludes me. This may very well be due to my own deficiency but it would help if she spelled it out more clearly for unenlightened readers.

The second problem with her comments is that they are in many places clearly uninformed with regard to the subject matter. It is difficult to argue with someone about the choices and underlying assumptions for a particular model of the data when they seemingly misapprehend what these parameters are. This is not an insurmountable problem and it may also partly originate in the lack of clarity with which they are described in publications. Try as you might***, to some degree your method sections will always make tacit assumptions about the methodological knowledge of the reader. A related issue is that she picks seemingly random statements from papers and counters them with quotes from other papers that often do not really support her point.

The third problem is that there is just so much of Maniatis’ comments! I probably can’t talk as I am known to write verbose blogs myself – but conciseness is a virtue in communication. In my scientific writing in manuscripts or reviews I certainly aim for it. Yet her comments on this paper by my colleague John Greenwood are a perfect example: by my count she expends 5262 words on this before giving John a chance to respond! Now perhaps the problems with that paper are so gigantic that this is justified but somehow I doubt it. Maniatis’ concern seems to be with the general theoretical background of the field. It seems to me that a paper or even a continuous blog would be a far better way to communicate her concerns than targeting one particular paper with this deluge. Even if the paper were a perfect example of the fundamental problem, it is hard to see the forest for the trees here. Furthermore, it also considerably lowers the signal-to-noise ratio of the PubPeer thread. If someone had an actual specific concern, say because they identified a major statistical flaw, it would be very hard to see it in this sea of Maniatis. Fortunately most of her other comments on PubPeer aren’t as extensive but they are still long and the same issue applies.

Why am I talking about this? Well, a fourth problem that people have raised is that her “tone” is unacceptable (see for example here). I disagree. If there is one thing I don’t take issue with it is her tone. Don’t get me wrong: I do not like her tone. I also think that her criticisms are aggressive, hostile, and unnecessarily inflammatory. Does this mean we can just brush aside her comments and ignore her immediately? It most certainly doesn’t. Even if her comments were the kind of crude bullying some other unpleasant characters in the post-publication peer review sphere are guilty of (like that bullish person I mentioned above), we should at least try to extract the meaning. If someone continues to be nasty after being called out on it, I think it is best to ignore them. In particularly bad cases they should be banned from participating in the debate. No fruitful discussion will happen with someone who just showers you in ad hominems. However, none of that categorically invalidates the arguments they make underneath all that rubbish.

Maniatis’ comments are aggressive and uncalled for. I do however not think they are nasty. I would prefer it if she “toned it down” as they say but I can live with how she says what she says (but of course YMMV). The point is, the other three issues I described above are what concern me, not her tone. To address them I see these solutions: first of all, I need to read some of the literature her criticisms are based on to try to understand where she is coming from. Secondly, people in the field need to explain to her points of apparent misunderstanding. If she refuses to engage or acknowledge that, then it is best to ignore her. Third, the signal-to-noise ratio of PubPeer comments could be improved by better filtering, for instance by muting a commenter as you can on Twitter. If PubPeer doesn’t implement that, then perhaps it can be achieved with a browser plug-in.

You promised there would be some science!

Yes I did. I am sorry it took so long to get here but I will briefly discuss a quote from Maniatis’ latest comment on John’s paper:

Let’s suppose that the movement of heavenly bodies is due to pushing by angels, and that some of these angels are lazier than others. We may then measure the relative motions of these bodies, fit them to functions, infer the energy with which each angel is pushing his or her planet, and report our “angel energy” findings. We may ignore logical arguments against the angel hypothesis. When, in future measurements, changes in motion are observed that makes the fit to our functions less good, we can add assumptions, such as that angels sometimes take a break, causing a lapse in their performance. And we can report these inferences as well. If discrepancies can’t be managed with quantitative fixes, we can just “hush them up.”

I may disagree with (and fail to understand) most of her criticisms, but I really like this analogy. It actually reminds me of an example I used when commenting on Psi research and which I also use in my teaching about the scientific method. I used the difference between the heliocentric and geocentric models of planetary movements to illustrate Occam’s Razor, explanatory power, and the trade-off with model complexity. Maniatis’ angels are a perfect example for how we can update our models to account for new observations by increasing their complexity and overfitting the noise. The best possible model however should maximise explanatory power while minimising our assumptions. If we can account for planetary motion without assuming the existence of angels, we may be on the right track (as disappointing as that is).
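
The trade-off between explanatory power and complexity can be sketched with a toy simulation in Python. This has nothing to do with planets or vision science: the linear "true law", the noise level, and the two competing models (a two-parameter straight line versus an interpolator that passes through every training point) are all my illustrative assumptions, chosen only to show what overfitting the noise looks like.

```python
import bisect
import random


def mean_squared_error(model, xs, ys):
    """Average squared difference between model predictions and observations."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare_models(n_train=100, n_test=1000, noise_sd=1.0, seed=1):
    """Return {name: (training error, held-out error)} for both toy models."""
    rng = random.Random(seed)

    def truth(x):
        return 2.0 * x + 1.0  # hypothetical "real" law generating the data

    train_x = sorted(rng.uniform(0.0, 10.0) for _ in range(n_train))
    train_y = [truth(x) + rng.gauss(0.0, noise_sd) for x in train_x]
    # New observations, drawn within the range covered by the training data.
    test_x = [rng.uniform(train_x[0], train_x[-1]) for _ in range(n_test)]
    test_y = [truth(x) + rng.gauss(0.0, noise_sd) for x in test_x]

    # Simple model: ordinary least-squares line (two parameters).
    mean_x = sum(train_x) / n_train
    mean_y = sum(train_y) / n_train
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
             / sum((x - mean_x) ** 2 for x in train_x))
    intercept = mean_y - slope * mean_x

    def line(x):
        return slope * x + intercept

    # Complex model: connects every training point, so it also fits the noise.
    def interpolator(x):
        i = min(max(bisect.bisect_right(train_x, x), 1), n_train - 1)
        x0, x1 = train_x[i - 1], train_x[i]
        y0, y1 = train_y[i - 1], train_y[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return {
        name: (mean_squared_error(model, train_x, train_y),
               mean_squared_error(model, test_x, test_y))
        for name, model in (("line", line), ("interpolator", interpolator))
    }


if __name__ == "__main__":
    for name, (train_err, test_err) in compare_models().items():
        print(f"{name:12s} training error {train_err:.3f}, held-out error {test_err:.3f}")
```

The interpolator "explains" the training observations essentially perfectly because it has as many free parameters as there are data points, yet in this setup it predicts new observations worse than the humble line. That is the angel problem in miniature: a model that can absorb any discrepancy by adding assumptions has not thereby explained anything.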

It won’t surprise you when I say I don’t believe Maniatis’ criticism applies to vision science. Our angels are supported by a long list of converging scientific observations and I think that if we remove them from the model the explanatory power of the models goes down and the complexity increases. Or at least Maniatis hasn’t made it clear why that isn’t the case. However, leaving this specific case aside, I do like the analogy a lot. There you go, I actually discussed science for a change.

* I expect someone to hold me to this!
** She also commented on PubMed Central but apparently her account there has been blocked.
*** But this is no reason not to try harder.