Category Archives: science journalism

Everything* you ever wanted to know about perceived income but were afraid to ask

This is a follow-up to my previous post. For context, you may wish to read this first. In that post I discussed how a plot from a Guardian piece (based on a policy paper) made the claim that German earners tend to misjudge themselves as being closer to the mean or, in the authors’ own words, “everyone thinks they’re middle class“. Now last week, I simply looked at this in the simplest way possible. What I think this plot shows is simply the effect of transforming a normal-ish data distribution into a quantile scale. For reference, here is the original figure again:

The column on the left aren’t data. They simply label the deciles, 10% brackets of the income distribution. My point previously was that if you calculate the means of the actual data for each decile you get exactly this line-squeeze plot that is shown here. Obviously this depends on the range of the scale you use. I simply transformed (normalised) the income data into a 1-10 scale where the maximum earner gets a score of 10 and everyone else is below. The point really is that in this scenario this has absolutely nothing to do with perceiving your income at all. It is simply plotting the normalised income data and you produce a plot that is highly reminiscent of the real thing.

Does the question matter?

Obviously my example wasn’t really mimicking what happens with perceived income. By design, it wasn’t supposed to. However, this seems to have led to some confusion about what my “simulation” (if you can even call it that) was showing. A blog post by Dino Carpentras argues that what matters here is how perceived income was measured. Here I want to show why I believe this isn’t the case.

First of all, Dino suggested that if people indeed reported their decile then the plot should have perfectly horizontal lines. Dino’s post already includes some very nice illustrations of that so I won’t rehash that here and instead encourage you to read that post. A similar point was made to me on Twitter by Rob McCutcheon. Now, obviously if people actually reported their true deciles then this would indeed be the case. In this case we are simply plotting the decile against the decile – no surprises there. In fact, almost the same would happen if they estimated the exact quantile they fall in and we then average that (that’s what I think Rob’s tweet is showing but I admit my R is too rusty to get into this right now).

My previous post implicitly assumed that people are not actually doing that. When you ask people to rate themselves on a 1-10 scale in terms of where their income lies, I doubt people will think about deciles. But keep in mind that the actual survey asked the participants to rate exactly that. Yet even in this case, I doubt that people are naturally inclined to think about themselves in terms of quantiles. Humans are terrible at judging distributions and probability and this is no exception. However, this is an empirical question – there may well be a lot of research on this already that I’m unaware of and I’d be curious to know about it.

But I maintain that my previous point still stands. To illustrate, I first show what the data would look like in these different scenarios if people could indeed judge their income perfectly on either scale. The plot below is showing what I used in my example previously. This is a distribution of (simulated) actual incomes. The x-axis shows the income in fictitious dollars. All my previous simulation did was to normalise so the numbers/ticks on the x-axis are changed to be between 1-10 but all the relationships remain the same.

But now let us assume that people can judge their income quantile. This comes with a big assumption that all survey respondents even know what that means, which I’d doubt strongly. But let’s take that granted that any individual is able to report accurately what percentage of the population earns less than them. Below I plot that on the y-axis against the actual income on the x-axis. It gives you the characteristic sigmoid shape – it’s a function most psychophysicists will be very familiar with: the cumulative Gaussian.

If we averaged the y-values for each x-decile and plotted this the way the original graph did, we would get close to horizontal lines. That’s the example I believe Rob showed in his tweet above. However, Dino’s post goes further and assumes people can actually report their deciles (that is, answer the question the survey asked perfectly). That is effectively rounding the quantile reports into 10% brackets. Here is the plot of that. It still follows the vague sigmoid shape but becomes sharply edged.

If you now plotted the line squeeze diagram used in the original graph, you would get perfectly horizontal lines. As I said, I won’t replot this; there really is no need for it. But obviously this is not a realistic scenario. We are talking about self-ratings here. In my last post I already elaborated on a few psychological factors why self-rating measures will be noisy. This is by no means exhaustive. There will be error on any measure, starting from simple mistakes in self-report or whatever. While we should always seek to reduce the noise in our measurements, noisy measurements are at the heart of science.

So let’s simulate that. Sources of error will affect the “perceived income” construct at several levels. The simplest we can do to simulate it is an error on how much the individual thinks their actual income is – we take each person’s income and add a Gaussian error. I used a Gaussian with SD=$30,000. That may be excessive but we don’t really know that. There is likely error in how high people think their income is relative to their peers and general spending power. Even more likely, there must be error on how they rate themselves on the 1-10 decile scale. I suspect that when transformed back into actual income this will be disproportionally larger than the error on judging their own income in dollars. It doesn’t really matter in principle.

Each point here is a simulated person’s self-reported income quantile plotted against their actual income. As you can see while the data still follow the vague sigmoid shape, there is a lot of scatter in people’s “reported” quantiles compared to what it actually should be. For clarity, I added a colour code here which denotes the actual income decile each person belongs to. The darkest blue are the 10% lowest earners and the yellow bracket is the top earners.

Next I round people’s income to simulate their self-reported deciles. The point of this is to effectively transform the self-reports into the discrete 1-10 scale that we believe the actual survey respondents used (I still don’t know the methods and if people were allowed to score themselves a 5.5 for instance – but based on my reading of the paper the scale was discrete). I replot these self-reported deciles using the same format:

Obviously, the y-axis will now again cluster in these 10 discrete levels. But as you can see from the colour code, the “self-reported” decile is a poor reflection of the actual income bracket. While a relative majority (or plurality) of respondents scoring themselves 1 are indeed in the lowest decile, in this particular example some of them are actual top earners. The same applies to the other brackets. Respondents thinking of themselves as perfectly middle class in decile 5 actually come more or less equally from across the spectrum. Now, again this may be a bit excessive but bear with me for just a while longer…

What happens when we replot this with our now infamous line plots? Voilà, doesn’t this look hauntingly familiar?

The reason for this is that perceived income is a psychological measure. Or even just a real world measure. It is noisy. The take-home message here is: It does not matter what question you ask the participants. People aren’t computers. The corollary of that is that when data are noisy the line plot must necessarily produce this squeezing effect the original study reported.

Now you may rightly say, Sam, this noise simulation is excessive. That may well be. I’ll be the first to admit that there are probably not many billionaires who will rate themselves as belonging to the lowest decile. However, I suspect that people indeed have quite a few delusions about their actual income. This may be more likely to affect the people in the actual middle range perhaps. So I don’t think the example here is as extreme as it may appear at first glance. There are also many further complications, such as that these measures are probably heteroscedastic. The error by which individuals misjudge their actual income level in dollars is almost certainly greater for high earners. My example here is very simplistic in assuming the same amount of error across the whole population. This heteroscedasticity is likely to introduce further distortions – such as the stronger “underestimation” by top earners compared to the “overestimation” by low earners, i.e. what the original graph purports to show.

In any case, the amount of error you choose for the simulation doesn’t affect the qualitative pattern. If people are more accurate at judging their income decile, the amount of “squeezing” we see in these line plots will be less extreme. But it must be there. So any of these plots will necessarily contain a degree of this artifact and thus make it very difficult to ascertain if this misestimation claimed by the policy paper and the corresponding Guardian piece actually exists.

Finally, I want to reiterate this because it is important: What this shows is that people are bad at judging their income. There is error on this judgement, but crucially this is Gaussian (or semi-Gaussian) error. It is symmetric. Top earner Jeff may underestimate his own income because he has no real concept of how the other half** live. In contrast, billionaire Donny may overestimate his own wealth because of his fragile ego and he forgot how much money he wastes on fake tanning oil. The point is, every individual*** in our simulated population is equally likely to over- or under-estimate their income – however, even with such symmetric noise the final outcome of this binned line plot is that the bin averages trend towards the population mean.

*) Well, perhaps almost everything?

**) Or to be precise, how the other 99.999% live.

***) Actually because my simulation prevents negative incomes for the very lowest earners, the error must skew their perceived income upwards.

Matlab code for this simulation is available here.

It’s #$%&ing everywhere!

I can hear you singing in the distance
I can see you when I close my eyes
Once you were somewhere and now you’re everywhere

Superblood Wolfmoon – Pearl Jam

If you read my previous blog post you’ll know I have a particular relationship these days with regression to the mean – and binning artifacts in general. Our recent retraction of a study reminded me of this issue. Of course, I was generally aware of the concept, as I am sure are most quantitative scientists. But often the underlying issues are somewhat obscure, which is why I certainly didn’t immediately clock on to them in our past work. It took a collaborate group effort with serendipitous suggestions, much thinking and simulating and digging, and not least of all the tireless efforts of my PhD student Susanne Stoll to uncover the full extent of this issue in our published research. We also still maintain that this rabbit hole goes a lot deeper because there are numerous other studies that used similar analyses. They must by necessity contain the same error – hopefully the magnitude of the problem is less severe in most other studies so that their conclusions aren’t all completely spurious. However, we simply cannot know that until somebody investigates this empirically. There are several candidates out there where I think the problem is almost certainly big enough to invalidate the conclusions. I am not the data police and I am not going to run around arguing people’s conclusions are invalid without A) having concrete evidence and B) having talked to the authors personally first.

What I can do, however, is explain how to spot likely candidates of this problem. And you really don’t have far too look. We believe that this issue is ubiquitous in almost all pRF studies; specifically, it affects all pRF studies that use any kind of binning. There are cases where this is probably of no consequence – but people must at least be aware of the issue before it leads to false assumptions and thus erroneous conclusions. We hope to publish another article in the future that lays out this issue in some depth.

But it goes well beyond that. This isn’t a specific problem with pRF studies. Many years before that I had discussions with David Shanks about this subject when he was writing an article (also long since published) of how this artifact confounds many studies in the field of unconscious processing, something that certainly overlaps with my own research. Only last year there was an article arguing that the same artifact explains the Dunning-Kruger effect. And I am starting to see this issue literally everywhere1 now… Just the other day I saw this figure on one of my social media feeds:

This data visualisation makes a striking claim with very clear political implications: High income earners (and presumably very rich people in general) underestimate their wealth relative to society as a whole, while low income earners overestimate theirs. A great number of narratives can be spun about this depending on your own political inclinations. It doesn’t take much imagination to conjure up the ways this could be used to further a political agenda, be it a fierce progressive tax policy or a rabid pulling-yourself-up-by-your-own-bootstraps type of conservatism. I have no interest in getting into this discussion here. What interests me here is whether the claim is actually supported by the evidence.

There are a number of open questions here. I don’t know how “perceived income” is measured exactly2. It could theoretically be possible that some adjustments were made here to control for artifacts. However, taken at face value this looks almost like a textbook example of regression to the mean. Effectively, you have an independent variable, the individuals’ actual income levels. We can presumably regard this as a ground truth – an individual’s income is what it is. We then take a dependent variable, perceived income. It is probably safe to assume that this will correlate with actual income. However, this is not a perfect correlation because perfect correlations are generally meaningless (say correlating body height in inches and centimeters). Obviously, perceived income is a psychological measure that must depend on a whole number of extraneous factors. For one thing, people’s social networks aren’t completely random but we all live embedded in a social context. You will doubtless judge your wealth relative to the people you mostly interact with. Another source of misestimation could be how this perception is measured. I don’t know how that was done here in detail but people were apparently asked to self-rate their assumed income decile. We can expect psychological factors at play that make people unlikely to put themselves in the lowest or highest scores on such a scale. There are many other factors at play but that’s not really important. The point is that we can safely assume that people are relatively bad at judging their true income relative to the whole of society.

But to hell with it, let’s just disregard all that. Instead, let us assume that people are actually perfectly accurate at judging their own income relative to society. Let’s simulate this scenario3. First we draw 10,000 people a Gaussian distribution of actual incomes. This distribution has a mean of $60,000 and a standard deviation of $20,000 – all in fictitious dollars which we assume our fictitious country uses. We assume these are based on people’s paychecks so there is no error4 on this independent variable at all. I use the absolute values to ensure that there is no negative income. The figure below shows the actual objective income for each (simulated) person on the x-axis. The y-axis is just random scatter for visualisation – it has no other significance. The colour code denotes the income bracket (decile) each person belongs to.

Next I simulate perceived income deciles for these fictitious people. To do this we need to do some rescaling to get everyone on the scale 1-10, with 10 being highest top earner. However – and this is important – as per our (certainly false) assumption above, perceived income is perfectly correlated with actual income. It is a simple transformation to rescale it. Now, what happens when you average the perceived income in each of these decile brackets like that graph above did? I do that below, using the same formatting as the original graph:

I will leave it to you, gentle reader, to determine how this compares to the original figure. Why is this happening? It’s simple really when you think about it: Take the highest income bracket. This ranges widely from high-but-reasonable to filthy-more-money-than-you-could-ever-spend-in-a-lifetime rich. This is not a symmetric distribution. The summary statistics of these binned data will be heavily skewed. Its mean/median will be biased downward for the top income brackets and upwards for the low income brackets. Only the income decile near the centre will be approximately symmetric and thus produce an unbiased estimate. Or to put it in simpler terms: the left column simply labels the deciles brackets. The only data here is in the right column and all this plot really shows is that the incomes have a Gaussian-like distribution. This has nothing to do with perceptions of income whatsoever.

In discussions I’ve had this all still confuses some people. So I added another illustration. In the graph below I plot a normal distribution. The coloured bands denote the approximated deciles. The white dots on the X-axis show the mean for each decile. The distance between these dots is obviously not equal. They all trend to be closer to the population mean (zero) than to the middle of their respective bands. This bias is present for all deciles except perhaps the most central ones. However, it is most extreme for the outermost deciles because these have the most asymmetric distributions. This is exactly what the income plots above are showing. It doesn’t matter whether we are looking at actual or perceived income. It doesn’t matter at all if there is error on those measures or not. All that matters is the distribution of the data.

Now, as I already said, I haven’t seen the detailed methodology of that original survey. If the analysis made any attempt to mathematically correct for this problem then I’ll stand corrected5. However, even in that case, the general statistical issue is extremely wide-spread and this serves as a perfect example of how binning can result in widely erroneous conclusions. It also illustrates the importance of this issue. The same problem relates to pRF tuning widths and stimulus preferences and whatnot – but that is frankly of limited importance. But things like these income statistics could have considerable social implications. What this shows to me is two-fold: First, please be careful when you do data analysis. Whenever possible, feed some simulated data to your analysis to see if it behaves as you think it should. Second, binning sucks. I see it effing everywhere now and I feel like I haven’t slept in months6

Superbloodmoon eclipse
Photo by Dave O’Brien, May 2021
  1. A very similar thing happened when I first learned about heteroscedasticity. I kept seeing it in all plots then as well – and I still do…
  2. Many thanks to Susanne Stoll for digging up the source for these data. I didn’t see much in terms of actual methods details here but I also didn’t really look too hard. Via Twitter I also discovered the corresponding Guardian piece which contains the original graph.
  3. Matlab code for this example is available here. I still don’t really do R. Can’t teach an old dog new tricks or whatever…
  4. There may be some error with a self-report measure of people’s actual income although this error is perhaps low – either way we do not need to assume any error here at all.
  5. Somehow I doubt it but I’d be very happy to be wrong.
  6. There could however be other reasons for that…

If this post confused you, there is now a follow-up post to confuse you even more… 🙂

Massaging data to fit a theory is antithetical to science

I have stayed out of the Wansink saga for the most part. If you don’t know what this is about, I suggest reading about this case on Retraction Watch. I had a few private conversations about this with Nick Brown, who has been one of the people instrumental in bringing about a whole series of retractions of Wansink’s publications. I have a marginal interest in some of Wansink’s famous research, specifically whether the size of plates can influence how much a person eats, because I have a broader interest in the interplay between perception and behaviour.

But none of that is particularly important. The short story is that considerable irregularities have been discovered in a string of Wansink’s publications, many of which has since been retracted. The whole affair first kicked off with a fundamental own-goal of a blog post (now removed, so posting Gelman’s coverage instead) he wrote in which he essentially seemed to promote p-hacking. Since then the problems that came to light ranged from irregularities (or impossibility) of some of the data he reported, evidence of questionable research practices in terms of cherry-picking or excluding data, to widespread self-plagiarism. Arguably, not all of these issues are equally damning and for some the evidence is more tenuous than for others – but the sheer quantity of problems is egregious. The resulting retractions seem entirely justified.

Today I read an article on Times Higher Education entitled “Massaging data to fit a theory is not the worst research sin” by Martin Cohen, which discusses Wansink’s research sins in a broader context of the philosophy of science. The argument is pretty muddled to me so I am not entirely sure what the author’s point is – but the effective gist seems to shrug off concerns about questionable research practices and that Wansink’s research is still a meaningful contribution to science.  In my mind, Cohen’s article reflects a fundamental misunderstanding of how science works and in places sounds positively post-Truthian. In the following, I will discuss some of the more curious claims made by this article.

“Massaging data to fit a theory is not the worst research sin”

I don’t know about the “worst” sin. I don’t even know if science can have “sins” although this view has been popularised by Chris Chamber’s book and Neuroskeptic’s Circles of Scientific Hell. Note that “inventing data”, a.k.a. going Full-Stapel, is considered the worst affront to the scientific method in the latter worldview. “Massaging data” is perhaps not the same as outright making it up, but on the spectrum of data fabrication it is certainly trending in that direction.

Science is about seeking the truth. In Cohen’s words, “science should above all be about explanation”. It is about finding regularities, relationships, links, and eventually – if we’re lucky – laws of nature that help us make sense of a chaotic, complex world. Altering, cherry-picking, or “shoe-horning” data to fit your favourite interpretation is the exact opposite of that.

Now, the truth is that p-hacking,  the garden of forking paths, flexible outcome-contingent analyses fall under this category. Such QRPs are extremely widespread and to some degree pervade most of the scientific literature. But just because it is common, doesn’t mean that this isn’t bad. Massaging data inevitably produces a scientific literature of skewed results. The only robust way to minimise these biases is through preregistration of experimental designs and confirmatory replications. We are working towards that becoming more commonplace – but in the absence of that it is still possible to do good and honest science.

In contrast, prolifically engaging in such dubious practices, as Wansink appears to have done, fundamentally undermines the validity of scientific research. It is not a minor misdemeanour.

“We forget too easily that the history of science is rich with errors”

I sympathise with the notion that science has always made errors. One of my favourite quotes about the scientific method is that it is about “finding better ways of being wrong.” But we need to be careful not to conflate some very different things here.

First of all, a better way of being wrong is an acknowledgement that science is never a done deal. We don’t just figure out the truth but constantly seek to home in on it. Our hypotheses and theories are constantly refined, hopefully by gradually becoming more correct, but there will also be occasional missteps down a blind alley.

But these “errors” are not at all the same thing as the practices Wansink appears to have engaged in. These were not mere mistakes. While the problems with many QRPs (like optional stopping) have long been underappreciated by many, a lot of the problems in Wansink’s retracted articles are quite deliberate distortions of scientific facts. For most, he could have and should have known better. This isn’t the same as simply getting things wrong.

The examples Cohen offers for the “rich errors” in past research are also not applicable. Miscalculating the age of the earth or presenting an incorrect equation are genuine mistakes. They might be based on incomplete or distorted knowledge. Publishing an incorrect hypothesis (e.g., that DNA is a triple helix) is not the same as mining data to confirm a hypothesis. It is perfectly valid to derive new hypotheses, even if they turn out to be completely false. For example, I might posit that gremlins cause the outdoor socket on my deck to fail. Sooner or later, a thorough empirical investigation will disprove this hypothesis and the evidence will support an alternative, such as that the wiring is faulty. The gremlin hypothesis may be false – and it is also highly implausible – but nothing stops me from formulating it. Wansink’s problem wasn’t with his hypotheses (some of which may indeed turn out to be true) but with the irregularities in the data he used to support them.

“Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them”

Ahm, no. Forming hypotheses before collecting data is how it’s supposed to work. Using Cohen’s “generous perspective”, this is indeed how hypothetico-deductive research works. In how far this relates to Wansink’s “research sin” depends on what exactly is meant here by “searching for data to support” your hypotheses. If this implies you are deliberately looking for data that confirms your prior belief while ignoring or rejecting observations that contradict it, then that is not merely a questionable research practice, but antithetical to the whole scientific endeavour itself. It is also a perfect definition of confirmation bias, something that afflicts all human beings to some extent, scientists included. Scientists must find protections from fooling themselves in this way and that entails constant vigilance and scepticism of our own pet theories. In stark contrast, engaging in this behaviour actively and deliberately is not science but pure story-telling.

The critics are not merely “indulging themselves in a myth of neutral observers uncovering ‘facts'”. Quite to the contrary, I think Wansink’s critics are well aware of the human fallibility of scientists. People are rarely perfectly neutral when it pertains to hypotheses. Even when you are not emotionally invested in which one of multiple explanations for a phenomenon might be correct, they are frequently not equal in terms of how exciting it might be to confirm them. Finding gremlins under my deck would certainly be more interesting (and scary?) than evidence of faulty wiring.

But in the end, facts are facts. There are no “alternative facts”. Results are results. We can differ on how to interpret them but that doesn’t change the underlying data. Of course, some data are plainly wrong because they come from incorrect measurements, artifacts, or statistical flukes. These results are wrong. They aren’t facts even if we think of them as facts at the moment. Sooner or later, they will be refuted. That’s normal. But this is a long shot from deliberately misreporting or distorting facts.

“…studies like Wansink’s can be of value if they offer new clarity in looking at phenomena…”

This seems to be the crux of Cohen’s argument. Somehow, despite all the dubious and possibly fraudulent nature of his research, Wansink still makes a useful contribution to science. How exactly? What “new clarity” do we gain from cherry-picked results?

I can see though that Wansink may “stimulate ideas for future investigations”. There is no denying that he is a charismatic presenter and that some of his ideas were ingenuous. I like the concept of self-filling soup bowls. I do think we must ask some critical questions about this experimental design, such as whether people can be truly unaware that the soup level doesn’t go down as they spoon it up. But the idea is neat and there is certainly scope for future research.

But don’t present this as some kind of virtue. By all means, give credit to him for developing a particular idea or a new experimental method. But please, let’s not pretend that this excuses the dubious and deliberate distortion of the scientific record. It does not justify the amount of money that has quite possibly been wasted on changing how people eat, the advice given to schools based on false research. Deliberately telling untruths is not an error, it is called a lie.



Irish Times OpEds are just bloody awful at science (n=1)

TL-DR: No, men are not “better at science” than women.

Clickbaity enough for you? I cannot honestly say I have read a lot of OpEds in the Irish Times so the evidence for my titular claim is admittedly rather limited. But it is still more solidly grounded in actual data than this article published yesterday in the Irish Times. At least I have one data point.

The article in question, a prime example of Betteridge’s Law, is entitled “Are men just better at science than women?“. I don’t need to explain why such a title might be considered sensationalist and controversial. The article itself is an “Opinion” piece, thus allowing the publication to disavow any responsibility for its authorship whilst allowing it to rake in the views from this blatant clickbait. In it, the author discusses some new research reporting gender differences in systemising vs empathising behaviour and puts this in the context of some new government initiative to specifically hire female professors because apparently there is some irony here. He goes on a bit about something called “neurosexism” (is that a real word?) and talks about “hard-wired” brains*.

I cannot quite discern if the author thought he was being funny or if he is simply scientifically illiterate but that doesn’t really matter. I don’t usually spend much time commenting on stuff like that. I have no doubt that the Irish Times, and this author in particular, will be overloaded with outrage and complaints – or, to use the author’s own words, “beaten up” on Twitter. There are many egregious misrepresentations of scientific findings in the mainstream media (and often enough, scientists and/or the university press releases are the source of this). But this example of butchery is just so bad and infuriating in its abuse of scientific evidence that I cannot let it slip past.

The whole argument, if this is what the author attempted, is just riddled with logical fallacies and deliberate exaggerations. I have no time or desire to go through them all. Conveniently, the author already addresses a major point himself by admitting that the study in question does not actually speak to male brains being “hard-wired” for science, but that any gender differences could be arising due to cultural or environmental factors. Not only that, he also acknowledges that the study in question is about autism, not about who makes good professors. So I won’t dwell on these rather obvious points any further. There are much more fundamental problems with the illogical leaps and mental gymnastics in this OpEd:

What makes you “good at science”?

There is a long answer to this question. It most certainly depends somewhat on your field of research and the nature of your work. Some areas require more manual dexterity, whilst others may require programming skills, and others yet call for a talent for high-level maths. As far as we can generalise, in my view necessary traits of a good researcher are: intelligence, creativity, patience, meticulousness, and a dedication to seek the truth rather than confirming theories. That last one probably goes hand-in-hand with some scepticism, including a healthy dose of self-doubt.

There is also a short answer to this question. A good scientist is not measured by their Systemising Quotient (SQ), a self-report measure that quantifies “the drive to analyze or build a rule-based system”. Academia is obsessed with metrics like the h-index (see my previous post) but even pencil pushers and bean counters** in hiring or grant committees haven’t yet proposed to use SQ to evaluate candidates***.

I suspect it is true that many scientists score high on the SQ and also the related Autism-spectrum Quotient (AQ) which, among other things, quantifies a person’s self-reported attention to detail. Anecdotally, I can confirm that a lot of my colleagues score higher than the population average on AQ. More on this in the next section.

However, none of this implies that you need to have a high SQ or AQ to be “good at science”, whatever that means. That assertion is a logical fallacy called affirming the consequent. We may agree that “systemising” characterises a lot of the activities a typical scientist engages in, but there is no evidence that this is sufficient to being a good scientist. It could mean that systemising people are attracted to science and engineering jobs. It certainly does not mean that a non-systemising person cannot be a good scientist.

Small effect sizes

I know I rant a lot about relative effect sizes such as Cohen’s d, where the mean difference is normalised by the variability. I feel that in a lot of research contexts these are given undue weight because the variability itself isn’t sufficiently controlled. But for studies like this we can actually be fairly confident that they are meaningful. The scientific study had a pretty whopping sample size of 671,606 (although that includes all their groups) and also used validation data. The residual physiologist inside me retains his scepticism about self-report questionnaire type measures, but even I have come to admit that a lot of questionnaires can be pretty effective. I think it is safe to say that the Big 5 Personality Factors or the AQ tap into some meaningful real factors. Further, whatever latent variance there may be on these measures, that is probably outweighed by collecting such a massive sample. So the Cohen’s d this study reports is probably quite informative.

What does this say? Well, the difference in SQ between males and females was 0.31. In other words, the distributions of SQ between sexes overlap quite considerably but the distribution for males is somewhat shifted towards higher values. Thus, while the average man has a subtly higher SQ than the average woman, a rather considerable number of women will have higher SQs than the average man. The study helpfully plots these distributions in Figure 1****:

Sex diffs SQ huge N
Distributions of SQ in control females (cyan), control males (magenta), austistic females (red), and autistic males (green).

The relevant curves here are the controls in cyan and magenta. Sorry, colour vision deficient people, the authors clearly don’t care about you (perhaps they are retinasexists?). You’ll notice that the modes of the female and male distributions are really not all that far apart. More noticeable is the skew of all these distributions with a long tail to the right: Low SQs are most common in all groups (including autism) but values across the sample are spread across the full range. So by picking out a random man and a random woman from a crowd, you can be fairly confident that their SQs are both on the lower end but I wouldn’t make any strong guesses about whether the man has a higher SQ than the woman.

However, it gets even tastier because the authors of the study actually also conducted an analysis splitting their data from controls into people in Science, Technology, Engineering, or Maths (STEM) professions compared to controls who were not in STEM. The results (yes, I know the colour code is now weirdly inverted – not how I would have done it…) show that people in STEM, whether male or female, tend to have larger SQs than people outside of STEM. But again, the average difference here is actually small and most of it plays out in the rightward tail of the distributions. The difference between males and females in STEM is also much less distinct than for people outside STEM.

Sex & STEM diffs SQ
Distributions of SQ in STEM females (cyan), STEM males (magenta), control females (red), and control males (green).

So, as already discussed in the previous section, it seems to be the case that people in STEM professions tend to “systemise” a bit more. It also suggests that men systemise more then women but that difference probably decreases for people in STEM. None of this tells us anything about whether people’s brains are “hard-wired” for systemising, if it is about cultural and environmental differences between men and women, or indeed if  being trained in a STEM profession might make people more systemising. It definitely does not tell you who is “good at science”.

What if it were true?

So far so bad for those who might want to make that interpretive leap. But let’s give them the benefit of the doubt and ignore everything I said up until now. What if it were true that systemisers are in fact better scientists? Would that invalidate government or funders initiatives to hire more female scientists? Would that be bad for science?

No. Even if there were a vast difference in systemising between men and women, and between STEM and non-STEM professions, respectively, all such a hiring policy will achieve is to increase the number of good female scientists – exactly what this policy is intended to do. Let me try an analogy.

Basketball players in the NBA tend to be pretty damn tall. Presumably it is easier to dunk when you measure 2 meters than when you’re Tyrion Lannister. Even if all other necessary skills here are equal there is a clear selection pressure for tall people to get into top basketball teams. Now let’s imagine a team decided they want to hire more shorter players. They declare they will hire 10 players who cannot be taller than 1.70m. The team will have try-outs and still seek to get the best players out of their pool of applicants. If they apply an objective criterion for what makes a good player, such as the ability to score consistently, they will only hire short players with excellent aim or who can jump really high. In fact, these shorties will be on average better at aiming and/or jumping than the giants they already have on their team. The team selects for the ability to score. Shorties and Tallies get there via different means but they both get there.

In this analogy, being a top scorer is being a systemiser, which in turn makes you a good scientist. Giants tend to score high because they find it easy to reach the basket. Shorties score high because they have other skills that compensate for their lack of height. Women can be good systemisers despite not being men.

The only scenario in which such a specific hiring policy could be counterproductive is if two conditions are met: 1) The difference between groups in the critical trait (i.e., systemising) is vast and 2) the policy mandates hiring from a particular group without any objective criteria. We have already established that the former condition isn’t fulfilled here – the difference in systemising between men and women is modest at best. The latter condition is really a moot point because this is simply not how hiring works in the real world. Hiring committees don’t usually just offer jobs to the relatively best person out of the pool but also consider the candidates’ objective abilities and achievements. This is even more pertinent here because all candidates in this case will already be eligible for a professorial position anyway. So all that will in fact happen is that we end up with more female professors who will also happen to be high in systemising.

Bad science reporting

Again, this previous section is based on the entirely imaginary and untenable assumption that systemisers are better scientists. I am not aware of any evidence of that – in part because we cannot actually quantify very well what makes a good scientist. The metrics academics actually (and sadly) use for hiring and funding decisions probably do not quantify that either but I am not even aware of any link between systemising and those metrics. Is there a correlation between h-indeces (relative to career age) and SQ? I doubt it.

What we have here is a case of awful science reporting. Bad science journalism and the abuse of scientific data for nefarious political purposes are hardly a new phenomenon – and this won’t just disappear. But the price of freedom (to practice science) is eternal vigilance. I believe as scientists we have a responsibility to debunk such blatant misapprehensions by journalists who I suspect have never even set foot in an actual lab or spoken to any actual scientists.

Some people assert that improving the transparency and reliability of research will hurt the public’s faith in science. Far from it, I believe those things can show people how science really works. The true damage to how the public perceives science is done by garbage articles in the mainstream media like this one – even if it is merely offered as an “opinion”.

By Keith Allison

*) Brains are not actually hard-wired to do anything. Leaving the old Hebbian analogy aside, brains aren’t wired at all, period. They are soft, squishy, wet sponges containing lots of neuronal and glial tissue plus blood vessels. Neurons connect via synapses between axons and dendrites and this connectivity is constantly regulated and new connections grown while others are pruned. This adaptability is one of the main reasons why we even have brains, and lies at the heart of the intelligence, ingenuity, and versatility of our species.

**) I suspect a lot of the pencil pushers and bean counters behind metrics like impact factors or the h-index might well be Systemisers.

***) I hope none of them read this post. We don’t want to give these people any further ideas…

****) Isn’t open access under Creative Commons license great?