In June 2016, the United Kingdom carried out a little study to test the hypothesis that it is the “will of the people” that the country should leave the European Union. The result favoured the Leave hypothesis, albeit with a really small effect size (1.89%). This finding came as a surprise to many but as so often it is the most surprising results that have the most impact.
Accusations of p-hacking soon emerged. Not only was there a clear sampling bias but data thugs suggested that the results might have even been obtained by fraud. Nevertheless, the original publication was never retracted. What’s wrong with inflating the results a bit? Massaging data to fit a theory is not the worst sin! The history of science is rich with errors. Such studies can be of value if they offer new clarity in looking at phenomena.
In fact, the 2016 study did offer a lot of new ways to look at the situation. There was a fair amount of HARKing about what the result of the 2016 study actually meant. Prior to conducting the study, at conferences and in seminars the proponents of the Leave hypothesis kept talking about the UK having a relationship with the EU like Norway and Switzerland. Yet somehow in the eventual publication of the 2016 findings, the authors had changed their tune. Now they professed that their hypothesis was obviously always that the UK should leave the EU without any deal whatsoever.
Sceptics of the Leave hypothesis pointed out various problems with this idea. For one thing, leaving the EU without a deal wasn’t a very plausible hypothesis. There were thousands of little factors to be considered and it seemed unlikely that this was really the will of the people. And of course, the nitpickers also said that “barely more than half” could never be considered the “will of the people”.
Almost immediately, there were calls for a replication to confirm that the “will of the people” really was what believers in the Leave-without-a-deal hypothesis claimed. At first, these voices came only from a ragtag band of second stringers – but as time went on and more and more people realised just how implausible the Leave hypothesis really was, their numbers grew.
Leavers however firmly disagreed. To them, a direct replication was meaningless. That was odd for some of them had openly admitted they wanted to p-hack the hell out of this thing until they got the result they wanted. But now they claimed that there had by now been several conceptual replications of the 2016 results, first in the United States and then later also Brazil, and some might argue even in Italy, Hungary, and Poland. Also in several other European countries similar results were found, albeit not statistically significant. Based on all this evidence, a meta-analysis surely supported the general hypothesis?
But the replicators weren’t dissuaded. The more radical among these methodological terrorists posited that any study in which the experimental design isn’t clearly defined and preregistered prior to data collection is inherently exploratory, and cannot be used to test any hypotheses. They instead called for a preregistered replication, ideally a Registered Report where the methods are peer-reviewed and the manuscript is in principle accepted for publication before data collection even commences. The fact that the 2016 study didn’t do this was just one of its many problems. But people still cite it simply because of its novelty. The replicators also pointed to other research fields, like Switzerland and Ireland, where this approach has long been used very successfully.
As an added twist, it turns out that nobody actually read the background literature. The 2016 study was already a replication attempt of previous findings from 1975. Sure, some people had vaguely heard about this earlier study. Everybody who has ever been to a conference knows that there is always one white-haired emeritus professor in the audience who will shout out “But I already did this four decades ago!”. But nobody really bothered to read this original study until now. It found an enormous result in the opposite direction, 17.23% in favour of remaining in Europe. Some commentators suggested that the population at large may have changed over the past four decades or that there may have been subtle but important differences in the methodology. What if leaving Europe then meant something different to what it means now? But if that were the case, couldn’t leaving Europe in 2016 also have meant something different than in 2019?
But the Leave proponents wouldn’t have any of that. They had already invested too much money and effort and spent all this time giving TED talks about their shiny little theory to give up now. They were in fact desperately afraid of a direct replication because they knew that as with most replications it would probably end in a null result and their beautiful theoretical construct would collapse like a house of cards. Deep inside, most of these people already knew they were chasing a phantom but they couldn’t ever admit it. People like Professor BoJo, Dr Moggy, and Micky “The Class Clown” Gove had built their whole careers on this Leave idea and so they defended the “will of the people” with religious zeal. The last straw they clutched at was to warn that all these failures to replicate would cause irreparable damage to the public’s faith in science.
Only Nigel Farage, unaffiliated garden gnome and self-styled “irreverent citizen scientist”, relented somewhat. Naturally, he claimed he would be doing all that just for science and the pursuit of the truth and that the result of this replication would be even clearer than the 2016 finding. But in truth, he smelled the danger on the wind. He knew that should the Leave hypothesis be finally accepted by consensus, he would be reduced to a complete irrelevance. What was more, he would not get that hefty paycheck.
As of today, the situation remains unresolved. The preregistered replication attempt is still stuck in editorial triage and hasn’t even been sent out for peer review yet. But meanwhile, people in the corridors of power in Westminster and Brussels and Tokyo and wherever else are already basing their decisions on the really weak and poor and quite possibly fraudulent data from the flawed 2016 study. But then, it’s all about the flair, isn’t it?
I was originally thinking of writing a long blog post discussing this but it is hard to type verbose treatises like that from inside my gently swaying hammock. So y’all will be much relieved to hear that I will spare you that post. Instead I’ll just post the results of the recent Twitter poll I ran, which is obviously enormously representative of the 336 people who voted. Whatever we’re going to make of this, I think it is obvious that there remains great scepticism about treating preprints the same as publications. I am also puzzled by what the hell is wrong with the 8% who voted for the third option*. Do these people not put In press articles on their CVs?
*) Weirdly, I could have sworn that when the poll originally closed this percentage was 9%. Somehow it was corrected downwards afterward. Should this be possible?
Disclaimer: This is a follow-up to my previous post about the discussion between Niko Kriegeskorte and Brad Love. Here are my scientific views on the preprint by Bobadilla-Suarez, Ahlheim, Mehrotra, Panos, & Love and some of the issues raised by Kriegeskorte in his review/blog post. This is not a review and therefore not as complete as a review would be, and it contains some additional explanations and non-scientific points. Given my affiliation with Bobadilla-Suarez’s department, a formal review for a journal would constitute a conflict of interest anyway.
What’s the point of all this?
I was first attracted to Niko’s post because just the other day my PhD student and I discussed the possibility of running a new study using Representational Similarity Analysis (RSA). Given the title of his post, I jokingly asked him what was the TL;DR answer to the question “What’s the best measure of representational dissimilarity?”. At the time, I had no idea that this big controversy was brewing… I have used multivoxel pattern analyses in the past and am reasonably familiar with RSA but I have never used it in published work (although I am currently preparing a manuscript that contains one such analysis). The answer to this question is therefore pretty relevant to me.
RSA is a way to quantify the similarity of patterns of brain responses (usually measured as voxel response patterns with fMRI or the firing rates of a set of neurons etc) to a range of different stimuli. This produces a (dis)similarity matrix where each pairwise comparison is a cell that denotes how similar/confusable the response patterns to those stimuli are. In turn, the pattern of these similarities (the “representational similarity”) then allows researchers to draw inferences about how particular stimuli (or stimulus dimensions) are encoded in the brain. Here is an illustration:
The person called Warshort believes journal reviews, preprint comments, and blog posts to be more or less the same thing, public commentaries on published research. The logic of RSA is that somewhere in their brain the pattern of neural activity evoked by these three concepts is similar. Contrast this to person Liebe who regards reviews and preprint comments to be similar (but not as similar as Warshort would) but who considers personal blog posts to be diametrically opposed to reviews.
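To make the logic of the illustration concrete, here is a minimal sketch of how such a dissimilarity matrix could be computed. The response patterns below are invented numbers purely for illustration, and 1 minus Pearson's r is only one of several possible dissimilarity measures; the helper name `correlation_rdm` is my own.

```python
import numpy as np

def correlation_rdm(patterns):
    """Return a dissimilarity matrix (1 - Pearson r) for a conditions x units array."""
    patterns = np.asarray(patterns, dtype=float)
    # np.corrcoef treats each row as one variable, i.e. one condition's response pattern
    return 1.0 - np.corrcoef(patterns)

# Hypothetical response patterns (rows: concepts, columns: voxels or neurons)
labels = ["journal review", "preprint comment", "blog post"]

# Warshort: all three concepts evoke nearly the same pattern
warshort = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.9, 2.2, 3.1, 3.9],
])

# Liebe: reviews and preprint comments are similar, blog posts are the opposite
liebe = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 1.8, 3.4, 3.6],
    [4.0, 3.0, 2.0, 1.0],
])

rdm_warshort = correlation_rdm(warshort)
rdm_liebe = correlation_rdm(liebe)

print(labels)
print(np.round(rdm_warshort, 3))
print(np.round(rdm_liebe, 3))
```

Each cell of the resulting matrices is one pairwise comparison: values near 0 mean the two response patterns are nearly identical, values near 2 mean they are anticorrelated.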
What is the research question?
According to their introduction, Bobadilla-Suarez et al. set themselves the following goals:
“The first goal was to ascertain whether the similarity measures used by the brain differ across regions. The second goal was to investigate whether the preferred measures differ across tasks and stimulus conditions. Our broader aim was to elucidate the nature of neural similarity.”
In some sense, it is one of the overarching goals of cognitive neuroscience to answer that final question, so they certainly have their work cut out for them. But looking at this more specifically, the question of the best measure of comparing brain states across conditions and how this depends on where and what is being compared is an important one for the field.
Unfortunately, to me this question seems ill-posed in the context of this study. If the goal is to understand what similarity measures are “used by the brain” we immediately need to ask ourselves whether the techniques used to address this question are appropriate to answer it. This is largely a conceptual point, and the study’s first caveat for me. We could instead reinterpret this into a technical comparison of different methods, but therein lies another caveat and this seems to be the main concern Kriegeskorte raised in his review. I’ll elaborate on both these points in turn:
The conceptual issue
I am sure the authors are fully aware of the limitations of making inferences about neural representations from brain imaging data. Any such inferences can only be as good as the method for measuring brain responses. Most studies using RSA are based on fMRI data which measures a metabolic proxy of neuronal activity. While fMRI experiments have doubtless made important discoveries about how the brain is organised and functions, this is a caveat we need to take seriously: there may very well be information in brain activity that is not directly reflected in fMRI measures. It is almost certainly not the case that brain regions communicate with one another directly via reading out their respective metabolic activity patterns.
This issue is further complicated by the fact that RSA studies using fMRI are based on voxel activity patterns. Voxels are individual elements in a brain image, the equivalent to pixels in a digital image. How a brain scan is subdivided into voxels is completely arbitrary and depends on a lot of methodological choices and parameters. The logic of using voxel patterns for RSA is that individual voxels will usually exhibit biased responses depending on the stimulus – however, the nature of these response biases remains highly controversial and also quite likely depends on what brain states (visual stimuli, complex tasks, memories, etc) are being compared. Critically, voxel patterns cannot possibly be directly relevant to neural encoding. At best, they are indirectly correlated with the underlying patterns, and naturally, the voxel resolution may also matter. In theory, two stimuli could be encoded by completely non-overlapping and unconnected neuronal populations which are nevertheless mixed into the same voxels. Even if voxel responses were a direct measure of neuronal activity, they might not show any biased responses at all, and the voxel response pattern would therefore carry no information about the stimuli whatsoever.
But there is an even more fundamental issue here. This is also unaffected by what actual brain measure is used, be it voxel patterns or the firing rates of actual neurons. The authors’ stated goal is to reveal what measure the brain itself uses to establish the similarity of brain states. The measures they compare are statistical methods, e.g. the Pearson correlation coefficient or the Mahalanobis distance between two response patterns. But the brain is no statistician. At most, a statistical quantity like a Pearson’s r might be a good description for what some read-out neurons somewhere in the processing hierarchy do to categorise the response patterns in up-stream regions. This may sound like an unnecessarily pedantic semantic distinction, but I’d disagree: by only testing predefined statistical models of how pattern similarity could be quantified, we may impose an artificially biased set of models. The actual way this is implemented in neuronal circuits may very well be a hybrid or a completely different process altogether. Neural similarity might linearly correlate with Pearson’s r over some range, say between r=0.5-1, but then be more consistent with a magnitude code at the lower end of similarities. It might also come with built-in thresholding or rectifying mechanisms in which patterns below a certain criterion are automatically encoded as dissimilar. Of course, you have to start somewhere and the models the authors used are reasonable choices. However, this description should be more circumspect in my view because at best we could only say that the results suggest a mechanism that is well described by a given statistical model.
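To illustrate why the choice of statistical model matters, here is a small sketch contrasting two of the measures named above. The patterns and the noise covariance are made up for illustration, and the function names are my own. The point is only that the two measures can disagree about the very same pair of patterns: Pearson's r ignores response magnitude, whereas the Mahalanobis distance does not.

```python
import numpy as np

def pearson_dissimilarity(a, b):
    """1 - Pearson r: sensitive to the shape of a pattern, blind to its scale and offset."""
    return float(1.0 - np.corrcoef(a, b)[0, 1])

def mahalanobis_distance(a, b, noise_cov):
    """Distance between two patterns, weighted by the inverse of a noise covariance."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Solve the linear system instead of inverting the covariance explicitly
    return float(np.sqrt(diff @ np.linalg.solve(noise_cov, diff)))

# Hypothetical response patterns across four voxels (or neurons)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2.0 * a  # same pattern shape, doubled response magnitude

# Hypothetical noise covariance: the last two units are noisier than the first two
noise_cov = np.diag([1.0, 1.0, 4.0, 4.0])

d_pearson = pearson_dissimilarity(a, b)
d_mahalanobis = mahalanobis_distance(a, b, noise_cov)

print(f"1 - r       = {d_pearson:.3f}")
print(f"Mahalanobis = {d_mahalanobis:.3f}")
```

According to the correlation measure these two patterns are identical, while the Mahalanobis distance treats them as clearly different. Neither answer tells us which (if either) a read-out neuron would care about.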
Finally, the authors seem to make an implicit assumption that does not necessarily hold: there is actually no reason to accept up-front that the brain quantifies pattern similarity at all. I assume that it does, and it is certainly an important assumption to be tested empirically. But in theory it seems entirely possible that spatial patterns of neural activity in a particular brain region are an epi-phenomenon of how neurons in that region are organised. This does however not mean that downstream neurons necessarily use this pattern information. I’d wager this almost certainly also depends on the stimulus/task. For instance, a higher-level neuron whose job it is to determine whether a stimulus appeared on the left or the right presumably uses the spatial pattern of retinotopically-organised responses in the earlier visual regions. For other, more complex stimulus dimensions, this may not be the case.
The technical issue
This brings me to the other caveat I see with Bobadilla-Suarez et al.’s approach here. As I said, this is largely the same point made by Kriegeskorte in his review and since this takes up most of his post I’ll keep it brief. If we brush aside the conceptual points I made above and instead assume that the brain indeed determines the similarity of response patterns in up-stream areas, what is the best way to test how it does this? The authors used a machine learning classifier to perform pair-wise decoding of different stimuli and construct a confusability matrix. Conceptually, this is pretty much the same as the similarity matrix derived from the other measures they are testing (e.g. Pearson’s r) but it instead uses a classifier algorithm to determine the discriminability of the response patterns. The authors then compare these decoding matrices with those based on the similarity measures they tested.
As Kriegeskorte suggests, these decoding methods are just another method of determining neural similarity. Different kinds of decoders are also closely related to the various methods Bobadilla-Suarez et al. compared: the Mahalanobis distance isn’t conceptually very far from a linear discriminant decoder, and you can actually build a classifier using Pearson’s r (in fact, this is the classifier I mostly used in my own studies).
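For readers who have not seen one, this is roughly what a classifier built on Pearson's r can look like, used here for pair-wise decoding to fill a decoding matrix. This is a minimal sketch under my own assumptions (simulated data, leave-one-trial-out cross-validation, hypothetical function names). It is not the pipeline used in the preprint, whose classifier I could not identify.

```python
import numpy as np

def correlation_classify(pattern, templates):
    """Return the index of the template that correlates most strongly with the pattern."""
    r = [np.corrcoef(pattern, template)[0, 1] for template in templates]
    return int(np.argmax(r))

def pairwise_decoding_matrix(data):
    """Leave-one-trial-out pairwise decoding accuracy.

    data has shape (conditions, trials, voxels). The diagonal is left as NaN
    because decoding a condition from itself is undefined.
    """
    n_cond, n_trials, _ = data.shape
    acc = np.full((n_cond, n_cond), np.nan)
    for i in range(n_cond):
        for j in range(i + 1, n_cond):
            correct = 0
            for t in range(n_trials):
                # Training templates: mean pattern per condition without the held-out trial
                train = np.delete(data[[i, j]], t, axis=1)
                templates = train.mean(axis=1)
                # Test: classify the held-out trial of each of the two conditions
                for k, cond in enumerate((i, j)):
                    correct += correlation_classify(data[cond, t], templates) == k
            acc[i, j] = acc[j, i] = correct / (2 * n_trials)
    return acc

# Simulated data: three conditions with distinct prototype patterns plus bounded noise
rng = np.random.default_rng(0)
prototypes = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 5.0, 1.0, 5.0, 1.0],
])
data = prototypes[:, None, :] + rng.uniform(-0.2, 0.2, size=(3, 4, 5))

decoding = pairwise_decoding_matrix(data)
print(decoding)
```

Note that the classifier itself is nothing more than a similarity measure with a decision rule bolted on, which is exactly why treating decoding as a separate ground truth is problematic.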
The premise of Bobadilla-Suarez et al.’s study therefore seems circular. They treat decodability of neural activity patterns as the ground truth of neural similarity, and that assumption seems untenable to me. They discuss the confound that the choice of decoding algorithm would affect the results and therefore advocate using the best available algorithm, yet this doesn’t really address the underlying issue. The decoder establishes the statistical similarity between neural response patterns. It does not quantify the actual neural similarity code – as a matter of fact, it cannot possibly do so.
It is therefore also unsurprising if the similarity measure that best matches classifier performance is the method that is closest to what the given classifier algorithm is based on. I may have missed this, but I cannot discern from the manuscript which classifier was actually used for the final analyses, only that the best of three was chosen. The best classifier was determined separately for the two datasets the authors used, which could be one explanation for why their outcome results differ between them.
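The comparison step can be sketched as follows: take the off-diagonal cells of a decoding matrix and of each candidate similarity matrix, and ask how well their rank orders agree. All matrices below are invented numbers, the function names are hypothetical, and rank correlation is my assumption for the agreement measure rather than something taken from the manuscript.

```python
import numpy as np

def symmetric_from_upper(values, n):
    """Build a symmetric n x n matrix (zero diagonal) from its upper-triangle values."""
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = values
    return m + m.T

def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle cells of a square matrix as a vector."""
    m = np.asarray(matrix, dtype=float)
    return m[np.triu_indices(m.shape[0], k=1)]

def rank(values):
    """Ranks 0..n-1. Ties are broken by position, which is fine for this tie-free toy data."""
    return np.argsort(np.argsort(values))

def matrix_agreement(candidate, reference):
    """Spearman-style rank correlation between the off-diagonal cells of two matrices."""
    a = rank(upper_triangle(candidate))
    b = rank(upper_triangle(reference))
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical matrices for four conditions (six pairwise cells each)
decoding_acc = symmetric_from_upper([0.90, 0.60, 0.80, 0.70, 0.95, 0.55], 4)
measure_a = symmetric_from_upper([1.2, 0.3, 0.9, 0.5, 1.5, 0.1], 4)
measure_b = symmetric_from_upper([0.2, 0.9, 0.4, 0.8, 0.1, 0.5], 4)

agreement_a = matrix_agreement(measure_a, decoding_acc)
agreement_b = matrix_agreement(measure_b, decoding_acc)
print(agreement_a, agreement_b)
```

Whichever candidate scores highest here is simply the one that orders the stimulus pairs most like the chosen classifier does, so the outcome depends on the classifier as much as on the data.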
Bobadilla-Suarez et al. ask an interesting and important question but I don’t think the study as it is can actually address it. There is a conceptual issue in that the brain may not necessarily use any of the available statistical models to quantify neural similarity, and in fact it may not do so at all. Of course, it is perfectly valid to compare different models of how it achieves this feat and any answer to this question need not be final. It does however seem to me that this is more of a methodological comparison rather than an attempt to establish what the brain is actually doing.
To my understanding, the approach the authors used to establish which similarity measure is best cannot answer this question. In this I appear to concur with Kriegeskorte’s review. Perhaps I am wrong of course, as the authors have previously suggested that Kriegeskorte “missed the point”, in which case I would welcome further explanation of the authors’ rationale here. However, from where I’m currently standing, I would recommend that the authors revise their manuscript as a methodological comparison and be more circumspect with regard to claims about neural representations.
The results shown here are certainly not without merit. By comparing commonly used similarity measures to the best available decoding algorithm they may not establish which measure is closest to what the brain is doing, but they certainly do show how these measures compare to complex classification algorithms. This in itself is informative for practical reasons because decoding is computationally expensive. Any squabbling aside, the authors show that the most commonly used measure, Pearson’s correlation, clearly does not perform in the same way as a lot of other possible techniques. This finding should also be of interest to anyone conducting an RSA experiment.
Some final words
I hope the authors find this comment useful. Just because I agree with Kriegeskorte’s main point, I hope that doesn’t make me his “acolyte” (I have neither been trained by him nor would I say that we stem from the same theoretical camp). I may have “missed the point” too, in which case I would appreciate further insight.
I find it very unfortunate that instead of a decent discussion on science, this debate descended into something not far above a poo-slinging contest. I have deliberately avoided taking sides in that argument because of my relationship to either side. While I vehemently object to the manner with which Brad responded to Niko’s post, I think it should be obvious that not everybody is on the same wavelength when it comes to open reviewing. It is depressing and deeply unsettling how many people on either side of this divide appear to be unwilling to even try to understand the other point of view.
I have decided to turn off the comment functionality on this blog. I used to believe strongly that this would be the best place for any discussion to take place but this is clearly utopian. Most discussion about blog posts inevitably occurs on social media like Twitter and Facebook. At my advanced age I find it increasingly hard to keep track of all these multiple parallel streams and I predict soon I’ll find it even harder. Most of the comments here were often rehashing discussions I already had elsewhere as well while some of them were completely pointless. There was also the occasional joker who just took a dump on my lawn but didn’t bother to stick around for a chat. So now I am consolidating my resources. If you have a comment ping me a reply on Twitter (I always tweet out the link to a new post), respond via another blog post, or if you prefer a private conversation you can always email me.
An interesting little spat played out on science social media today. It began with a blog post by Niko Kriegeskorte, in which he posted a peer review he had conducted on a manuscript by Brad Love. The manuscript in question is publicly available as a preprint. I don’t want to go into too much detail here (you can read that all up for yourself) but Brad took issue with the fact and the manner in which Niko posted the review of their manuscript after it was rejected by a journal. A lot of related discussion also took place via Twitter (see links in Brad’s post) and on Facebook.
I must say, in the days of hyper-polarisation in everyday political and social discourse, I find this debate actually really refreshing. It is actually pretty easy to feel outrage and disagreement with a president putting children in cages or holding a whole bloody country hostage over a temper tantrum – although the fact that there are apparently still far too many people who do not feel outraged about these things is certainly a pretty damning indictment of the moral bankruptcy of the human race… Anyway, things are actually far more philosophically challenging when there is a genuine and somewhat acrimonious disagreement between two sides you respect equally. For the record, Brad is a former colleague of mine from my London days, whose work I have the utmost respect for. Niko has for years been a key player in multivariate and representational analysis and we have collaborated in the past.
Whatever my personal relationship to these people, I can certainly see both points of view in this argument. Brad seems to object mostly to the fact that Niko posted the reviews on his blog and without their consent. He regards this as a “self-serving” act. In contrast, Niko regards this as a substantial part of open review. His justification for posting this review publicly is that the manuscript is already public anyway, and that this invites public commentary. I don’t think that Brad particularly objects to public commentary, but he sees a conflict of interest in using a personal blog as a venue for this, especially since these were the peer reviews Niko wrote for a journal, not on the preprint server. Moreover, since these were reviews that led to the paper being rejected by the journal, he and his coauthors had no opportunity to reply to Niko’s reviews.
This is a tough nut to crack. But this is precisely the kind of discussion we need to have for making scientific publishing and peer review more transparent. For several years now I have argued that peer reviews should be public (even if the reviewers’ names are redacted). I believe reviewers’ comments and editorial decisions should be transparent. I’ve heard “How did this get accepted for publication?” in journal clubs just too many times. Show the world why! Not only is it generally more open but it will also make it fairer when there are challenges to the validity of an editorial decision, including dodgy decisions to retract studies.
That said, Brad certainly has a point that ethically this openness requires up-front consent from both parties. The way he sees it, he and his coauthors did not consent to publishing these journal reviews (which, in the present system, are still behind closed doors). Niko’s view is clearly that because the preprint is public, consent is implicit and this is fair game. Brad’s counterargument to this is that any comments on the preprint should be made directly on the preprint. This is separate from any journal review process and would allow the authors to consider the comments and decide if and how to respond to them. So, who is right here?
What this really comes down to is a philosophical worldview as to how openness should work and how open it should be. In a liberal society, the right to free expression certainly permits a person to post their opinions online, within certain constraints to protect people from libel, defamation, or threats to their safety. Some journals make reviewers sign a confidentiality agreement about reviews. If this was the case here, a post like this would constitute a violation of that agreement, although I am unaware of any case where this has ever been enforced. Besides, even if reviewers couldn’t publicly post their reviews and discuss the peer review of a manuscript, this would certainly not stop them from making similar comments at conferences, seminars – or on public preprints. In that regard, in my judgement Niko hasn’t done anything wrong here.
At the same time, I fully understand Brad’s frustration. I personally disagree with the somewhat vitriolic and accusatory tone of his response to Niko. This seems both unnecessary and unhelpful. But I agree with him that a personal blog is the wrong venue for posting peer reviews, regardless of whether they are from behind the closed doors of a journal review process or from the outside lawn of post-publication discussion. Obviously, nobody can stop anyone from blogging their opinion on a public piece of science (and a preprint is a public piece of science). Both science bloggers and mainstream journalists constantly write about published research, including preprints that haven’t been peer reviewed. Twitter is frequently ablaze with heated discussion about published research. And I must say that when I first skimmed Niko’s post, I didn’t actually realise that this was a peer review, let alone one he had submitted to a journal, but simply thought it was his musings about the preprint.
The way I see it, social media aren’t peer review but mere opinion chatter. Peer review requires some established process. Probably this should have some editorial moderation – but even without that, at the very least there should be a constant platform for the actual review. Had Niko posted his review as a comment on the preprint server, this would have been entirely acceptable. In an ideal world, he would have done that after writing it instead of waiting for the journal to formally reject the manuscript*. This isn’t to say that opinion chatter is wrong. We do it all the time and talking about a preprint on Twitter is not so different from discussing a presentation you saw at a conference or seminar. But if we treat any channel as equivalent for public peer review, we end up with a mess. I don’t want to constantly track down opinions, some of which are vastly ill-informed, all over the wild west of the internet.
In the end, this whole debacle just confirms my already firmly held belief (Did you expect anything else? 😉 ) that the peer review process should be independent from journals altogether. What we call preprints today should really be the platform where peer review happens. There should be an editor/moderator to ensure a decent and fair process and facilitate a final decision (because the concept of eternally updating studies is unrealistic and infeasible). However, all of this should happen in public. Importantly, journals only come into play at the end, to promote research they consider interesting and perhaps some nice editing and formatting.
The way I see it, this is the only way. Science should happen out in the open – including the review process. But what we have here is a clash between promoting openness in a world still partly dominated by the traditional way things have always been done. I think Niko’s heart was in the right place here but by posting his journal reviews on his personal blog he effectively went rogue, or took the law into his own hands, if you will. Perhaps this is the way the world changes but I don’t think this is a good approach. How about we all get together and remake the laws. They are for us scientists after all, to determine how science should work. It’s about time we start governing ourselves.
I want to add links to two further posts opining on this issue, both of which make important points. First, Sebastian Bobadilla-Suarez, the first author of the manuscript in question, wrote a blog post about his own experiences, especially from the perspective of an early career researcher. Not only are his views far more important, but I actually find his take far more professional and measured than Brad’s post.
Secondly, I want to mention another excellent blog post on this whole debacle by Edwin Dalmaijer which very eloquently summarises this situation. From what I can tell, we pretty much agree in general but Edwin makes a number of more concrete points compared to my utopian dreams of how I would hope things should work.
I have stayed out of the Wansink saga for the most part. If you don’t know what this is about, I suggest reading about this case on Retraction Watch. I had a few private conversations about this with Nick Brown, who has been one of the people instrumental in bringing about a whole series of retractions of Wansink’s publications. I have a marginal interest in some of Wansink’s famous research, specifically whether the size of plates can influence how much a person eats, because I have a broader interest in the interplay between perception and behaviour.
But none of that is particularly important. The short story is that considerable irregularities have been discovered in a string of Wansink’s publications, many of which have since been retracted. The whole affair first kicked off with a fundamental own-goal of a blog post (now removed, so posting Gelman’s coverage instead) he wrote in which he essentially seemed to promote p-hacking. Since then the problems that came to light ranged from irregularities (or impossibility) of some of the data he reported, evidence of questionable research practices in terms of cherry-picking or excluding data, to widespread self-plagiarism. Arguably, not all of these issues are equally damning and for some the evidence is more tenuous than for others – but the sheer quantity of problems is egregious. The resulting retractions seem entirely justified.
Today I read an article on Times Higher Education entitled “Massaging data to fit a theory is not the worst research sin” by Martin Cohen, which discusses Wansink’s research sins in a broader context of the philosophy of science. The argument is pretty muddled to me so I am not entirely sure what the author’s point is – but the effective gist seems to be to shrug off concerns about questionable research practices and to claim that Wansink’s research is still a meaningful contribution to science. In my mind, Cohen’s article reflects a fundamental misunderstanding of how science works and in places sounds positively post-Truthian. In the following, I will discuss some of the more curious claims made by this article.
“Massaging data to fit a theory is not the worst research sin”
I don’t know about the “worst” sin. I don’t even know if science can have “sins” although this view has been popularised by Chris Chambers’ book and Neuroskeptic’s Circles of Scientific Hell. Note that “inventing data”, a.k.a. going Full-Stapel, is considered the worst affront to the scientific method in the latter worldview. “Massaging data” is perhaps not the same as outright making it up, but on the spectrum of data fabrication it is certainly trending in that direction.
Science is about seeking the truth. In Cohen’s words, “science should above all be about explanation”. It is about finding regularities, relationships, links, and eventually – if we’re lucky – laws of nature that help us make sense of a chaotic, complex world. Altering, cherry-picking, or “shoe-horning” data to fit your favourite interpretation is the exact opposite of that.
Now, the truth is that p-hacking, the garden of forking paths, and flexible outcome-contingent analyses all fall under this category. Such questionable research practices (QRPs) are extremely widespread and to some degree pervade most of the scientific literature. But just because something is common doesn’t mean that it isn’t bad. Massaging data inevitably produces a scientific literature of skewed results. The only robust way to minimise these biases is through preregistration of experimental designs and confirmatory replications. We are working towards that becoming more commonplace – but in the absence of that it is still possible to do good and honest science.
In contrast, prolifically engaging in such dubious practices, as Wansink appears to have done, fundamentally undermines the validity of scientific research. It is not a minor misdemeanour.
“We forget too easily that the history of science is rich with errors”
I sympathise with the notion that science has always made errors. One of my favourite quotes about the scientific method is that it is about “finding better ways of being wrong.” But we need to be careful not to conflate some very different things here.
First of all, a better way of being wrong is an acknowledgement that science is never a done deal. We don’t just figure out the truth but constantly seek to home in on it. Our hypotheses and theories are constantly refined, hopefully by gradually becoming more correct, but there will also be occasional missteps down a blind alley.
But these “errors” are not at all the same thing as the practices Wansink appears to have engaged in. These were not mere mistakes. While the problems with many QRPs (like optional stopping) have long been underappreciated by many, a lot of the problems in Wansink’s retracted articles are quite deliberate distortions of scientific facts. For most, he could have and should have known better. This isn’t the same as simply getting things wrong.
The examples Cohen offers for the “rich errors” in past research are also not applicable. Miscalculating the age of the earth or presenting an incorrect equation are genuine mistakes. They might be based on incomplete or distorted knowledge. Publishing an incorrect hypothesis (e.g., that DNA is a triple helix) is not the same as mining data to confirm a hypothesis. It is perfectly valid to derive new hypotheses, even if they turn out to be completely false. For example, I might posit that gremlins cause the outdoor socket on my deck to fail. Sooner or later, a thorough empirical investigation will disprove this hypothesis and the evidence will support an alternative, such as that the wiring is faulty. The gremlin hypothesis may be false – and it is also highly implausible – but nothing stops me from formulating it. Wansink’s problem wasn’t with his hypotheses (some of which may indeed turn out to be true) but with the irregularities in the data he used to support them.
“Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them”
Ahm, no. Forming hypotheses before collecting data is how it’s supposed to work. Using Cohen’s “generous perspective”, this is indeed how hypothetico-deductive research works. How far this relates to Wansink’s “research sin” depends on what exactly is meant here by “searching for data to support” your hypotheses. If this implies you are deliberately looking for data that confirms your prior belief while ignoring or rejecting observations that contradict it, then that is not merely a questionable research practice, but antithetical to the whole scientific endeavour itself. It is also a perfect definition of confirmation bias, something that afflicts all human beings to some extent, scientists included. Scientists must find protections from fooling themselves in this way and that entails constant vigilance and scepticism of our own pet theories. In stark contrast, engaging in this behaviour actively and deliberately is not science but pure story-telling.
The critics are not merely “indulging themselves in a myth of neutral observers uncovering ‘facts'”. Quite to the contrary, I think Wansink’s critics are well aware of the human fallibility of scientists. People are rarely perfectly neutral when it pertains to hypotheses. Even when you are not emotionally invested in which one of multiple explanations for a phenomenon might be correct, they are frequently not equal in terms of how exciting it might be to confirm them. Finding gremlins under my deck would certainly be more interesting (and scary?) than evidence of faulty wiring.
But in the end, facts are facts. There are no “alternative facts”. Results are results. We can differ on how to interpret them but that doesn’t change the underlying data. Of course, some data are plainly wrong because they come from incorrect measurements, artifacts, or statistical flukes. These results are wrong. They aren’t facts even if we think of them as facts at the moment. Sooner or later, they will be refuted. That’s normal. But this is a far cry from deliberately misreporting or distorting facts.
“…studies like Wansink’s can be of value if they offer new clarity in looking at phenomena…”
This seems to be the crux of Cohen’s argument. Somehow, despite all the dubious and possibly fraudulent nature of his research, Wansink still makes a useful contribution to science. How exactly? What “new clarity” do we gain from cherry-picked results?
I can see though that Wansink may “stimulate ideas for future investigations”. There is no denying that he is a charismatic presenter and that some of his ideas were ingenious. I like the concept of self-filling soup bowls. I do think we must ask some critical questions about this experimental design, such as whether people can be truly unaware that the soup level doesn’t go down as they spoon it up. But the idea is neat and there is certainly scope for future research.
But don’t present this as some kind of virtue. By all means, give credit to him for developing a particular idea or a new experimental method. But please, let’s not pretend that this excuses the dubious and deliberate distortion of the scientific record. It does not justify the amount of money that has quite possibly been wasted on changing how people eat, or the advice given to schools based on false research. Deliberately telling untruths is not an error, it is called a lie.
TL-DR: No, men are not “better at science” than women.
Clickbaity enough for you? I cannot honestly say I have read a lot of OpEds in the Irish Times so the evidence for my titular claim is admittedly rather limited. But it is still more solidly grounded in actual data than this article published yesterday in the Irish Times. At least I have one data point.
The article in question, a prime example of Betteridge’s Law, is entitled “Are men just better at science than women?”. I don’t need to explain why such a title might be considered sensationalist and controversial. The article itself is an “Opinion” piece, thus allowing the publication to disavow any responsibility for its authorship whilst allowing it to rake in the views from this blatant clickbait. In it, the author discusses some new research reporting gender differences in systemising vs empathising behaviour and puts this in the context of some new government initiative to specifically hire female professors because apparently there is some irony here. He goes on a bit about something called “neurosexism” (is that a real word?) and talks about “hard-wired” brains*.
I cannot quite discern if the author thought he was being funny or if he is simply scientifically illiterate but that doesn’t really matter. I don’t usually spend much time commenting on stuff like that. I have no doubt that the Irish Times, and this author in particular, will be overloaded with outrage and complaints – or, to use the author’s own words, “beaten up” on Twitter. There are many egregious misrepresentations of scientific findings in the mainstream media (and often enough, scientists and/or the university press releases are the source of this). But this example of butchery is just so bad and infuriating in its abuse of scientific evidence that I cannot let it slip past.
The whole argument, if this is what the author attempted, is just riddled with logical fallacies and deliberate exaggerations. I have no time or desire to go through them all. Conveniently, the author already addresses a major point himself by admitting that the study in question does not actually speak to male brains being “hard-wired” for science, but that any gender differences could be arising due to cultural or environmental factors. Not only that, he also acknowledges that the study in question is about autism, not about who makes good professors. So I won’t dwell on these rather obvious points any further. There are much more fundamental problems with the illogical leaps and mental gymnastics in this OpEd:
What makes you “good at science”?
There is a long answer to this question. It most certainly depends somewhat on your field of research and the nature of your work. Some areas require more manual dexterity, whilst others may require programming skills, and others yet call for a talent for high-level maths. As far as we can generalise, in my view necessary traits of a good researcher are: intelligence, creativity, patience, meticulousness, and a dedication to seek the truth rather than confirming theories. That last one probably goes hand-in-hand with some scepticism, including a healthy dose of self-doubt.
There is also a short answer to this question. A good scientist is not measured by their Systemising Quotient (SQ), a self-report measure that quantifies “the drive to analyze or build a rule-based system”. Academia is obsessed with metrics like the h-index (see my previous post) but even pencil pushers and bean counters** in hiring or grant committees haven’t yet proposed to use SQ to evaluate candidates***.
I suspect it is true that many scientists score high on the SQ and also the related Autism-spectrum Quotient (AQ) which, among other things, quantifies a person’s self-reported attention to detail. Anecdotally, I can confirm that a lot of my colleagues score higher than the population average on AQ. More on this in the next section.
However, none of this implies that you need to have a high SQ or AQ to be “good at science”, whatever that means. That assertion is a logical fallacy called affirming the consequent. We may agree that “systemising” characterises a lot of the activities a typical scientist engages in, but there is no evidence that this is sufficient for being a good scientist. It could mean that systemising people are attracted to science and engineering jobs. It certainly does not mean that a non-systemising person cannot be a good scientist.
Small effect sizes
I know I rant a lot about relative effect sizes such as Cohen’s d, where the mean difference is normalised by the variability. I feel that in a lot of research contexts these are given undue weight because the variability itself isn’t sufficiently controlled. But for studies like this we can actually be fairly confident that they are meaningful. The scientific study had a pretty whopping sample size of 671,606 (although that includes all their groups) and also used validation data. The residual physiologist inside me retains his scepticism about self-report questionnaire type measures, but even I have come to admit that a lot of questionnaires can be pretty effective. I think it is safe to say that the Big 5 Personality Factors or the AQ tap into some meaningful real factors. Further, whatever latent variance there may be on these measures, that is probably outweighed by collecting such a massive sample. So the Cohen’s d this study reports is probably quite informative.
What does this say? Well, the effect size (Cohen’s d) for the difference in SQ between males and females was 0.31. In other words, the distributions of SQ between sexes overlap quite considerably but the distribution for males is somewhat shifted towards higher values. Thus, while the average man has a subtly higher SQ than the average woman, a rather considerable number of women will have higher SQs than the average man. The study helpfully plots these distributions in Figure 1****:
The relevant curves here are the controls in cyan and magenta. Sorry, colour vision deficient people, the authors clearly don’t care about you (perhaps they are retinasexists?). You’ll notice that the modes of the female and male distributions are really not all that far apart. More noticeable is the skew of all these distributions with a long tail to the right: Low SQs are most common in all groups (including autism) but values across the sample are spread across the full range. So by picking out a random man and a random woman from a crowd, you can be fairly confident that their SQs are both on the lower end but I wouldn’t make any strong guesses about whether the man has a higher SQ than the woman.
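To put rough numbers on that overlap, here is a minimal sketch. It assumes both groups are normally distributed with equal spread, which the skewed curves described above clearly violate, so treat the output as a ballpark illustration of what d = 0.31 means rather than a reanalysis of the study's data.

```python
from statistics import NormalDist

# Assumption: both groups are normal with equal standard deviation.
# The real SQ distributions are skewed, so this is only a rough guide.
d = 0.31  # standardised mean difference in SQ (males minus females)

std_normal = NormalDist()

# Share of women scoring above the average man: P(Z > d)
women_above_mean_man = 1 - std_normal.cdf(d)

# Chance that a randomly picked man scores higher than a randomly picked woman.
# The difference of two unit-variance normals has standard deviation sqrt(2).
prob_man_higher = std_normal.cdf(d / 2 ** 0.5)

# Shared area under the two density curves (overlapping coefficient)
overlap = 2 * std_normal.cdf(-d / 2)

print(f"Women above the average man:      {women_above_mean_man:.1%}")
print(f"P(random man > random woman):     {prob_man_higher:.1%}")
print(f"Overlap of the two distributions: {overlap:.1%}")
```

Under these assumptions roughly 38% of women score above the average man, a random man beats a random woman only about 59% of the time, and the two curves share about 88% of their area.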
However, it gets even tastier because the authors of the study actually also conducted an analysis splitting their data from controls into people in Science, Technology, Engineering, or Maths (STEM) professions compared to controls who were not in STEM. The results (yes, I know the colour code is now weirdly inverted – not how I would have done it…) show that people in STEM, whether male or female, tend to have larger SQs than people outside of STEM. But again, the average difference here is actually small and most of it plays out in the rightward tail of the distributions. The difference between males and females in STEM is also much less distinct than for people outside STEM.
So, as already discussed in the previous section, it seems to be the case that people in STEM professions tend to “systemise” a bit more. It also suggests that men systemise more than women but that difference probably decreases for people in STEM. None of this tells us anything about whether people’s brains are “hard-wired” for systemising, if it is about cultural and environmental differences between men and women, or indeed if being trained in a STEM profession might make people more systemising. It definitely does not tell you who is “good at science”.
What if it were true?
So far so bad for those who might want to make that interpretive leap. But let’s give them the benefit of the doubt and ignore everything I said up until now. What if it were true that systemisers are in fact better scientists? Would that invalidate government or funders’ initiatives to hire more female scientists? Would that be bad for science?
No. Even if there were a vast difference in systemising between men and women, and between STEM and non-STEM professions, respectively, all such a hiring policy will achieve is to increase the number of good female scientists – exactly what this policy is intended to do. Let me try an analogy.
Basketball players in the NBA tend to be pretty damn tall. Presumably it is easier to dunk when you measure 2 meters than when you’re Tyrion Lannister. Even if all other necessary skills here are equal there is a clear selection pressure for tall people to get into top basketball teams. Now let’s imagine a team decided they want to hire more shorter players. They declare they will hire 10 players who cannot be taller than 1.70m. The team will have try-outs and still seek to get the best players out of their pool of applicants. If they apply an objective criterion for what makes a good player, such as the ability to score consistently, they will only hire short players with excellent aim or who can jump really high. In fact, these shorties will be on average better at aiming and/or jumping than the giants they already have on their team. The team selects for the ability to score. Shorties and Tallies get there via different means but they both get there.
In this analogy, being a top scorer is being a systemiser, which in turn makes you a good scientist. Giants tend to score high because they find it easy to reach the basket. Shorties score high because they have other skills that compensate for their lack of height. Women can be good systemisers despite not being men.
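The analogy can be played through as a toy simulation. Everything in this sketch is invented for illustration: the height distribution, the weight that height contributes to scoring, and the hiring threshold are all hypothetical. The only point is that applying one objective score criterion to everyone automatically selects short players with more of the compensating skill.

```python
import random
from statistics import mean

random.seed(1)

# All numbers below are made up for illustration.
N_APPLICANTS = 200_000
SCORE_THRESHOLD = 2.0  # hypothetical objective hiring criterion

short_skill, tall_skill = [], []
for _ in range(N_APPLICANTS):
    height = random.gauss(185, 10)         # height in cm
    skill = random.gauss(0, 1)             # aiming/jumping, independent of height
    score = 0.05 * (height - 185) + skill  # being tall makes scoring easier
    if score < SCORE_THRESHOLD:
        continue  # does not meet the criterion, not hired
    if height < 170:
        short_skill.append(skill)
    elif height >= 200:
        tall_skill.append(skill)

print(f"Hired short players: {len(short_skill):5d}, mean skill {mean(short_skill):.2f}")
print(f"Hired tall players:  {len(tall_skill):5d}, mean skill {mean(tall_skill):.2f}")
```

Far fewer short players clear the bar, but those who do have a higher average skill than the tall players hired under the same criterion.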
The only scenario in which such a specific hiring policy could be counterproductive is if two conditions are met: 1) The difference between groups in the critical trait (i.e., systemising) is vast and 2) the policy mandates hiring from a particular group without any objective criteria. We have already established that the former condition isn’t fulfilled here – the difference in systemising between men and women is modest at best. The latter condition is really a moot point because this is simply not how hiring works in the real world. Hiring committees don’t usually just offer jobs to the relatively best person out of the pool but also consider the candidates’ objective abilities and achievements. This is even more pertinent here because all candidates in this case will already be eligible for a professorial position anyway. So all that will in fact happen is that we end up with more female professors who will also happen to be high in systemising.
Bad science reporting
Again, this previous section is based on the entirely imaginary and untenable assumption that systemisers are better scientists. I am not aware of any evidence of that – in part because we cannot actually quantify very well what makes a good scientist. The metrics academics actually (and sadly) use for hiring and funding decisions probably do not quantify that either but I am not even aware of any link between systemising and those metrics. Is there a correlation between h-indices (relative to career age) and SQ? I doubt it.
What we have here is a case of awful science reporting. Bad science journalism and the abuse of scientific data for nefarious political purposes are hardly a new phenomenon – and this won’t just disappear. But the price of freedom (to practice science) is eternal vigilance. I believe as scientists we have a responsibility to debunk such blatant misapprehensions by journalists who I suspect have never even set foot in an actual lab or spoken to any actual scientists.
Some people assert that improving the transparency and reliability of research will hurt the public’s faith in science. Far from it, I believe those things can show people how science really works. The true damage to how the public perceives science is done by garbage articles in the mainstream media like this one – even if it is merely offered as an “opinion”.
*) Brains are not actually hard-wired to do anything. Leaving the old Hebbian analogy aside, brains aren’t wired at all, period. They are soft, squishy, wet sponges containing lots of neuronal and glial tissue plus blood vessels. Neurons connect via synapses between axons and dendrites and this connectivity is constantly regulated and new connections grown while others are pruned. This adaptability is one of the main reasons why we even have brains, and lies at the heart of the intelligence, ingenuity, and versatility of our species.
**) I suspect a lot of the pencil pushers and bean counters behind metrics like impact factors or the h-index might well be Systemisers.
***) I hope none of them read this post. We don’t want to give these people any further ideas…
****) Isn’t open access under a Creative Commons license great?
Today’s post is inspired by another nonsensical proposal that made the rounds and that reminded me why I invented the Devil’s Neuroscientist back in the day (Don’t worry, that old demon won’t make a comeback…). So apparently RetractionWatch created a database allowing you to search for an author’s name to list any retractions or corrections of their publications*. Something called the Ochsner Journal then declared they would use this to scan “every submitting author’s name to ensure that no author published in the Journal has had a paper retracted.” I don’t want to dwell on this abject nonsense – you can read about this in this Twitter thread. Instead I want to talk about the wider mentality that I believe underlies such ideas.
In my view, using retractions as a stigma to effectively excommunicate any researcher from “science” forever is just another manifestation of a rather pervasive and counter-productive tendency of trying to reduce everything in academia to simple metrics and heuristics. Good science should be trustworthy, robust, careful, transparent, and objective. You cannot measure these things with a number.
Perhaps it is unsurprising that quantitative scientists want to find ways to quantify such things. After all, science is the endeavour to reveal regularities in our observations to explain the variance of the natural world and thus reduce the complexity in our understanding of it. There is nothing wrong with meta-science and trying to derive models of how science – and scientists – work. But please don’t pretend that these models are anywhere near good enough to actually govern all of academia.
Few people you meet still believe that the Impact Factor of a journal tells you much about the quality of a given publication in it. Neither does an h-index or citation count tell us anything about the importance or “impact” of somebody’s research, certainly not without looking at this relative to the specific field of science they operate in. The rate with which someone’s findings replicate doesn’t tell you anything about how great a scientist they are. And you certainly won’t learn anything about the integrity and ability of a researcher – and their right to publish in your journal – when all you have to go on is that they were an author on one retracted study.
Reducing people’s careers and scientific knowledge to a few stats is lazy at best. But it is also downright dangerous. As long as such metrics are used to make consequential real-life decisions, people are incentivised to game them. Nowhere can this be seen better than with the dubious tricks some journals use to inflate their Impact Factor or the occasional dodgy self-citation scandals. Yes, in the most severe cases these are questionable, possibly even fraudulent, practices – but there is a much greater grey area here. What do you think would happen, if we adopted the policy that only researchers with high replicability ratings get put up for promotion? Do you honestly think this would encourage scientists to do better science rather than merely safer, boring science?
This argument is sometimes used as a defence of the status quo and a reason why we shouldn’t change the way science is done. Don’t be fooled by that. We should reward good and careful science. We totally should give credit to people who preregister their experiments, self-replicate their findings, test the robustness of their methods, and go the extra mile to ensure their studies are solid. We should appreciate hypotheses based on clever, creative, and/or unique insights. We should also give credit to people for admitting when they are wrong – otherwise why should anyone seek the truth?
The point is, you cannot do any of that with a simple number in your CV. Neither can you do that by looking at retractions or failures to replicate as a plague mark on someone’s career. I’m sorry to break it to you, but the only way to assess the quality of some piece of research, or to understand anything about the scientist behind it, is to read their work closely and interpret it in the appropriate context. That takes time and effort. Often it also necessitates talking to them because no matter how clever you think you are, you will not understand everything they wrote, just as not everybody will comprehend the gibberish you write. If you believe a method is inadequate, by all means criticise it. Look at the raw data and the analysis code. Challenge interpretations you disagree with. Take nobody’s word for granted and all that…
But you can shove your metrics where the sun don’t shine.
TL-DR: If the title of this blog post is unsurprising to you, I suggest you go play outside.
Many discussions in my science social media bubble circle around p-values (what an exciting life I lead…). Just a few days ago, there was a big kerfuffle about p-curving and whether p-values just below 0.05 are a sign of whatever. One of the main concepts behind p-curves is that under the assumption that the null hypothesis (H0) of no effect/difference is true, p-values should be uniformly distributed (at least as long as the test assumptions are met reasonably). This once again supported my suspicions that most people don’t actually know what p-values mean. Reports of people defining p-values incorrectly abound, sometimes even in stats textbooks. It also seems to me that people find p-values rather unintuitive. And I get the impression a lot of people vastly overestimate how widely known things like p-curve actually are.
A few weeks ago I got embroiled in a Facebook discussion. A friend of mine was running a permutation analysis to test something about his experiment and found something very odd: the distribution of p-values was skewed severely to the left – there were very few low p-values but the proportion was steadily increasing with most p-values being just below 1. He expected this distribution to be uniform because under the random permutations H0 should be true. A lot of commenters on his post seemed rather surprised and/or confused by the whole idea that p-values should be distributed uniformly when H0 is true. “Surely,” so the common intuition goes, “when there is actually no difference, most p-values should be high and close to 1?”
No, and the reason why not is the p-value itself. A p-value can be calculated/estimated in many different ways. Most people use parametric tests but essentially they all share one philosophy. If you have no underlying effect and randomly sample data ad infinitum you end up with a distribution of test statistics. In my example, I draw two variables each with n=100 from a normal distribution and calculate the Pearson correlation between them – and I repeat this 20,000 times. This produces a distribution of correlation coefficients like this:
There is no correlation between two random variables (H0 is true) and so the distribution is centred on zero. The spread of the distribution depends on the sample size. Larger samples will produce narrower distributions. Critically, we can use this distribution to get a p-value. If we had observed a correlation of r=0.3 in our experiment, we could calculate the proportion of correlation coefficients in this distribution that are equal to or greater than 0.3. This would give us a one-tailed p-value. If you ignore the sign of the correlation, you get a two-tailed p-value.
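For anyone who wants to try this at home, here is a minimal sketch of that simulation in plain Python. It is not the code behind the plot; the helper names are my own and I use only 2,000 repetitions instead of 20,000 so that it runs quickly without any numerical libraries.

```python
import random

random.seed(0)

N = 100        # observations per variable, as in the text
N_SIMS = 2000  # the post used 20,000; fewer here to keep pure Python quick


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def null_correlations(n_sims, n):
    """Correlations between pairs of independent normal samples (H0 is true)."""
    rs = []
    for _ in range(n_sims):
        x = [random.gauss(0, 1) for _ in range(n)]
        y = [random.gauss(0, 1) for _ in range(n)]
        rs.append(pearson_r(x, y))
    return rs


null_rs = null_correlations(N_SIMS, N)

observed_r = 0.3
# One-tailed: share of null correlations at least as large as the observed one
p_one_tailed = sum(r >= observed_r for r in null_rs) / N_SIMS
# Two-tailed: ignore the sign and compare absolute values
p_two_tailed = sum(abs(r) >= abs(observed_r) for r in null_rs) / N_SIMS

print(f"mean of null distribution = {sum(null_rs) / N_SIMS:.3f}")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

With so few repetitions the estimated p-value for r=0.3 is coarse (it can only move in steps of 1/2000), which is one reason to use many more repetitions in practice.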
In the plot above, I coloured the 5% most extreme correlation coefficients in blue (2.5% to the left and to the right, respectively). These regions are bounded by vertical red lines at just below +/-0.2 in this case. This reflects the critical effect size needed to get p<0.05 – only 5% of the correlation coefficients in this distribution are +/-0.19ish or even more extreme.
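The critical value can also be approximated without simulating anything. The sketch below uses the Fisher z-transformation, which is my own substitution and not how the plot was made: under H0, atanh(r) is roughly normal with standard deviation 1/sqrt(n-3). The function name `critical_r` is hypothetical.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n, alpha=0.05):
    """Approximate two-tailed critical |r| via the Fisher z-transformation.

    Under H0, atanh(r) is roughly normal with standard deviation 1/sqrt(n - 3).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


for n in (20, 100, 1000):
    print(f"n = {n:4d}: |r| must exceed about {critical_r(n):.3f} for p < 0.05")
```

For n=100 this gives a threshold just below 0.2, in line with the red lines described above, and it also shows how the threshold shrinks as the sample grows.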
Now compare this to the region coloured in red. This region also makes up 5% of the whole distribution. However, the red region surrounds zero, that is, those correlation coefficients that are really close to the true correlation value. Random chance makes the distribution spread out (and that becomes more severe when your sample size is low) but most of the correlations will nevertheless be close to the true value of zero. Therefore, the range of values in this red region is much narrower because the values are much denser here.
But of course these nigh-zero correlation coefficients will have the largest p-values. Consider again what a p-value reflects. If your observed correlation is 0.006 and you again ignore the sign of the effects, almost all correlations in this null distribution would be equal to or greater than 0.006. So this proportion, the p-value, is almost 1. Put in other words, the 5% of p-values below 0.05 come from the long, thin tails of the null distribution, while the 5% of really high p-values above 0.95 come from a really narrow sliver of the null distribution near zero:
Visualised the same way, you have the blue region with p<0.05 on the left. Here correlations are large (greater than 0.19ish). On the right, you have the red region with p>0.95. Here correlations are really close to zero.
In other words, you can directly read off the p-value from the x-axis of this distribution of p-values. This is a direct consequence of what p-values represent. They are the proportion of values in the null distribution where correlations are equal to or more extreme than the observed correlation.
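To make that mapping concrete, the following sketch lists which range of |r| corresponds to each tenth of the p-value axis for n=100. It again relies on the Fisher z approximation (an assumption on my part, not the simulation used for the plots), and `abs_r_for_p` is a made-up helper name.

```python
from math import sqrt, tanh
from statistics import NormalDist

N = 100  # sample size used in the text
std_normal = NormalDist()


def abs_r_for_p(p, n):
    """Approximate |r| whose two-tailed p-value equals p (Fisher z approximation)."""
    z = std_normal.inv_cdf(1 - p / 2)
    return tanh(z / sqrt(n - 3))


# Ten equally wide p-value bins, each holding 10% of results under H0,
# and the very unequal ranges of |r| they correspond to
edges = [i / 10 for i in range(11)]
for lo, hi in zip(edges[:-1], edges[1:]):
    r_upper = abs_r_for_p(lo, N) if lo > 0 else 1.0  # p -> 0 means |r| -> 1
    r_lower = abs_r_for_p(hi, N)
    print(f"p in ({lo:.1f}, {hi:.1f}]: |r| from {r_lower:.3f} to {r_upper:.3f}")

print(f"p < 0.05 needs |r| above {abs_r_for_p(0.05, N):.3f}")
print(f"p > 0.95 needs |r| below {abs_r_for_p(0.95, N):.4f}")
```

Every bin contains the same share of results when H0 is true, but the bins near p=1 squeeze into a tiny range of correlations around zero while the lowest bin covers everything out in the tails.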
Of course, if the null hypothesis is false and there actually is a correlation between the two variables this distribution must become skewed. There should now be many more tests with low p-values than with large ones. This is exactly what happens and this is the pattern that analyses like p-curve seek to detect:
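A rough simulation shows both patterns side by side. This is a sketch under stated assumptions: the true correlation of 0.2 for the H1 case is an arbitrary value I picked, p-values come from the Fisher z approximation rather than from a simulated null distribution, and I run only 2,000 experiments per condition to keep it fast.

```python
import random
from math import atanh, sqrt
from statistics import NormalDist

random.seed(2)

N = 100         # observations per sample, as in the text
N_SIMS = 2000   # simulated experiments per condition (kept small for speed)
TRUE_RHO = 0.2  # hypothetical true correlation for the H1 case


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_p_values(rho, n_sims, n):
    """Two-tailed p-values (Fisher z approximation) when the true correlation is rho."""
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_sims):
        x = [random.gauss(0, 1) for _ in range(n)]
        # y is built so that its population correlation with x equals rho
        y = [rho * a + sqrt(1 - rho ** 2) * random.gauss(0, 1) for a in x]
        z = atanh(pearson_r(x, y)) * sqrt(n - 3)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def histogram(p_values, n_bins=10):
    """Share of p-values falling into each of n_bins equally wide bins."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [c / len(p_values) for c in counts]


h0_hist = histogram(simulate_p_values(0.0, N_SIMS, N))
h1_hist = histogram(simulate_p_values(TRUE_RHO, N_SIMS, N))

print("p bin      H0     H1")
for i, (a, b) in enumerate(zip(h0_hist, h1_hist)):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}  {a:.3f}  {b:.3f}")
```

Under H0 every bin should hold roughly a tenth of the p-values, whereas with a true correlation most of them pile up in the lowest bin.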
Now, my friend’s p-distribution looked essentially like the mirror image of this. I still haven’t learned what could have possibly caused this. It would mean that more effect sizes were close to zero than there should be under H0. This could suggest some assumptions not being met but none of my own feeble simulations managed to reproduce the pattern he found. His analyses sounded quite complex so there is a possibility that there were some complex errors in them.