Category Archives: improving science

Imaging

Inspired by my Twitter feed for the past few weeks, and in particular this tweet in which I suggested science should be about the science not about the scientist, the revelation that there is such a thing as “negotiated submissions“, the debate on whether or not people should sign their peer reviews, and last but not least my inability to spell simple words, I give you my most embarrassing blog post yet. To the tune of a famous John Lennon song:

Imaging there’s no tenure
It’s easy if you try
No self-promotion needed
No reviewer makes you cry

Imaging all researchers
Searching for the truth

Imaging there’s no Twitter
It isn’t hard to do
Nothing to kill or die for
And no arguments, too

Imaging all researchers
Living life in peace

You may say I’m a procrastinator
That I should apply for grants
I should be preparing lectures
And write no stupid rants

Imaging no more authors
Results published as they came
No need for impact factors
Nobody cares about fame

Imaging all researchers
Sharing all data

You may say that I’m a dreamer
But I’m not the only one
I hope someday you’ll join us
And science will again be fun

What is a “publication”?

I was originally thinking of writing a long blog post discussing this but it is hard to type verbose treatises like that from inside my gently swaying hammock. So y’all will be much relieved to hear that I spare you that post. Instead I’ll just post the results of the recent Twitter poll I ran, which is obviously enormously representative of the 336 people who voted. Whatever we’re going to make of this, I think it is obvious that there remains great scepticism about treating preprints the same as publications. I am also puzzled by what the hell is wrong with the 8% who voted for the third option*. Do these people not put In press articles on their CVs?

screenshot 2019-01-19 at 16.53.25

*) Weirdly, I could have sworn that when the poll originally closed this percentage was 9%. Somehow it was corrected downwards afterward. Should this be possible?

An open review of open reviewing

a.k.a. Love v Kriegeskorten

An interesting little spat played out on science social media today. It began with a blog post by Niko Kriegeskorte, in which he posted a peer review he had conducted on a manuscript by Brad Love. The manuscript in question is publicly available as a preprint. I don’t want to go into too much detail here (you can read that all up for yourself) but Brad took issue with the fact and the manner in which Niko posted the review of their manuscript after it was rejected by a journal. A lot of related discussion also took place via Twitter (see links in Brad’s post) and on Facebook.

I must say, in the days of hyper-polarisation in everyday political and social discourse, I find this debate actually really refreshing. It is actually pretty easy to feel outrage and disagreement with a president putting children in cages or holding a whole bloody country hostage over a temper tantrum – although the fact that there are apparently still far too many people who do not feel outraged about these things is certainly a pretty damning indictment of the moral bankruptcy of the human race… Anyway, things are actually far more philosophically challenging when there is a genuine and somewhat acrimonious disagreement between two sides you respect equally. For the record, Brad is a former colleague of mine from my London days, whose work I have the utmost respect for. Niko has for years been a key player in multivariate and representational analysis and we have collaborated in the past.

Whatever my personal relationship to these people, I can certainly see both points of view in this argument. Brad seems to object mostly to the fact that Niko posted the reviews on this blog and without their consent. He regards this as a “self-serving” act. In contrast, Niko regards this as a substantial part of open review. His justification for posting this review publicly is that the manuscript is already public anyway, and that this invites public commentary. I don’t think that Brad particularly objects to public commentary, but he sees a conflict of interest in using a personal blog as a venue for this, especially since these were the peer reviews Niko wrote for a journal, not on the preprint server. Moreover, since these were reviews that led to the paper being rejected by the journal, he and his coauthors had no opportunity to reply to Niko’s reviews.

This is a tough nut to crack. But this is precisely the kind of discussion we need to have for making scientific publishing and peer review more transparent. For several years now I have argued that peer reviews should be public (even if the reviewers’ names are redacted). I believe reviewers’ comments and editorial decisions should be transparent. I’ve heard “How did this get accepted for publication?” in journal clubs just too many times. Show the world why! Not only is it generally more open but it will also make it fairer when there are challenges to the validity of an editorial decision, including dodgy decisions to retract studies.

That said, Brad certainly has a point that ethically this openness requires up-front consent from both parties. The way he sees it, he and his coauthors did not consent to publishing these journal reviews (which, in the present system, are still behind closed doors). Niko’s view is clearly that because the preprint is public, consent is implicit and this is fair game. Brad’s counterargument to this is that any comments on the preprint should be made directly on the preprint. This is separate from any journal review process and would allow the authors to consider the comments and decide if and how to respond to them. So, who is right here?

What this really comes down to is a philosophical worldview as to how openness should work and how open it should be. In a liberal society, the right to free expression certainly permits a person to post their opinions online, within certain constraints to protect people from libel, defamation, or threats to their safety. Some journals make reviewers sign a confidentiality agreement about reviews. If this was the case here, a post like this would constitute a violation of that agreement, although I am unaware of any case where this has ever been enforced. Besides, even if reviewers couldn’t publicly post their reviews and discuss the peer review of a manuscript, this would certainly not stop them from making similar comments at conferences, seminars – or on public preprints. In that regard, in my judgement Niko hasn’t done anything wrong here.

At the same time, I fully understand Brad’s frustration. I personally disagree with the somewhat vitriolic and accusatory tone of his response to Niko. This seems both unnecessary and unhelpful. But I agree with him that a personal blog is the wrong venue for posting peer reviews, regardless of whether they are from behind the closed doors of a journal review process or from the outside lawn of post-publication discussion. Obviously, nobody can stop anyone from blogging their opinion on a public piece of science (and a preprint is a public piece of science). Both science bloggers and mainstream journalists constantly write about published research, including preprints that haven’t been peer reviewed. Twitter is frequently ablaze with heated discussion about published research. And I must say that when I first skimmed Niko’s post, I didn’t actually realise that this was a peer review, let along one he had submitted to a journal, but simply thought it was his musings about the preprint.

The way I see it, social media aren’t peer review but mere opinion chatter. Peer review requires some established process. Probably this should have some editorial moderation – but even without that, at the very least there should be a constant platform for the actual review. Had Niko posted his review as a comment on the preprint server, this would have been entirely acceptable. In an ideal world, he would have done that after writing it instead of waiting for the journal to formally reject the manuscript*. This isn’t to say that opinion chatter is wrong. We do it all the time and talking about a preprint on Twitter is not so different from discussing a presentation you saw at a conference or seminar. But if we treat any channel as equivalent for public peer review, we end up with a mess. I don’t want to constantly track down opinions, some of which are vastly ill-informed, all over the wild west of the internet.

In the end, this whole debacle just confirms my already firmly held belief (Did you expect anything else? 😉 ) that the peer review process should be independent from journals altogether. What we call preprints today should really be the platform where peer review happens. There should be an editor/moderator to ensure a decent and fair process and facilitate a final decision (because the concept of eternally updating studies is unrealistic and infeasible). However, all of this should happen in public. Importantly, journals only come into play at the end, to promote research they consider interesting and perhaps some nice editing and formatting.

The way I see it, this is the only way. Science should happen out in the open – including the review process. But what we have here is a clash between promoting openness in a world still partly dominated by the traditional way things have always been done. I think Niko’s heart was in the right place here but by posting his journal reviews on his personal blog he effectively went rogue, or took the law into his own hands, if you will. Perhaps this is the way the world changes but I don’t think this is a good approach. How about we all get together and remake the laws. They are for us scientists after all, to determine how science should work. It’s about time we start governing ourselves.

Addendum:
I want to add links to two further posts opining on this issue, both of which make important points. First, Sebastian Bobadilla-Suarez, the first author of the manuscript in question, wrote a blog post about his own experiences, especially from the perspective of an early career researcher. Not only are his views far more important, but I actually find his take far more professional and measured than Brad’s post.
Secondly, I want to mention another excellent blog post on this whole debacle by Edwin Dalmaijer which very eloquently summarises this situation. From what I can tell, we pretty much agree in general but Edwin makes a number of more concrete points compared to my utopian dreams of how I would hope things should work.

Massaging data to fit a theory is antithetical to science

I have stayed out of the Wansink saga for the most part. If you don’t know what this is about, I suggest reading about this case on Retraction Watch. I had a few private conversations about this with Nick Brown, who has been one of the people instrumental in bringing about a whole series of retractions of Wansink’s publications. I have a marginal interest in some of Wansink’s famous research, specifically whether the size of plates can influence how much a person eats, because I have a broader interest in the interplay between perception and behaviour.

But none of that is particularly important. The short story is that considerable irregularities have been discovered in a string of Wansink’s publications, many of which has since been retracted. The whole affair first kicked off with a fundamental own-goal of a blog post (now removed, so posting Gelman’s coverage instead) he wrote in which he essentially seemed to promote p-hacking. Since then the problems that came to light ranged from irregularities (or impossibility) of some of the data he reported, evidence of questionable research practices in terms of cherry-picking or excluding data, to widespread self-plagiarism. Arguably, not all of these issues are equally damning and for some the evidence is more tenuous than for others – but the sheer quantity of problems is egregious. The resulting retractions seem entirely justified.

Today I read an article on Times Higher Education entitled “Massaging data to fit a theory is not the worst research sin” by Martin Cohen, which discusses Wansink’s research sins in a broader context of the philosophy of science. The argument is pretty muddled to me so I am not entirely sure what the author’s point is – but the effective gist seems to shrug off concerns about questionable research practices and that Wansink’s research is still a meaningful contribution to science.  In my mind, Cohen’s article reflects a fundamental misunderstanding of how science works and in places sounds positively post-Truthian. In the following, I will discuss some of the more curious claims made by this article.

“Massaging data to fit a theory is not the worst research sin”

I don’t know about the “worst” sin. I don’t even know if science can have “sins” although this view has been popularised by Chris Chamber’s book and Neuroskeptic’s Circles of Scientific Hell. Note that “inventing data”, a.k.a. going Full-Stapel, is considered the worst affront to the scientific method in the latter worldview. “Massaging data” is perhaps not the same as outright making it up, but on the spectrum of data fabrication it is certainly trending in that direction.

Science is about seeking the truth. In Cohen’s words, “science should above all be about explanation”. It is about finding regularities, relationships, links, and eventually – if we’re lucky – laws of nature that help us make sense of a chaotic, complex world. Altering, cherry-picking, or “shoe-horning” data to fit your favourite interpretation is the exact opposite of that.

Now, the truth is that p-hacking,  the garden of forking paths, flexible outcome-contingent analyses fall under this category. Such QRPs are extremely widespread and to some degree pervade most of the scientific literature. But just because it is common, doesn’t mean that this isn’t bad. Massaging data inevitably produces a scientific literature of skewed results. The only robust way to minimise these biases is through preregistration of experimental designs and confirmatory replications. We are working towards that becoming more commonplace – but in the absence of that it is still possible to do good and honest science.

In contrast, prolifically engaging in such dubious practices, as Wansink appears to have done, fundamentally undermines the validity of scientific research. It is not a minor misdemeanour.

“We forget too easily that the history of science is rich with errors”

I sympathise with the notion that science has always made errors. One of my favourite quotes about the scientific method is that it is about “finding better ways of being wrong.” But we need to be careful not to conflate some very different things here.

First of all, a better way of being wrong is an acknowledgement that science is never a done deal. We don’t just figure out the truth but constantly seek to home in on it. Our hypotheses and theories are constantly refined, hopefully by gradually becoming more correct, but there will also be occasional missteps down a blind alley.

But these “errors” are not at all the same thing as the practices Wansink appears to have engaged in. These were not mere mistakes. While the problems with many QRPs (like optional stopping) have long been underappreciated by many, a lot of the problems in Wansink’s retracted articles are quite deliberate distortions of scientific facts. For most, he could have and should have known better. This isn’t the same as simply getting things wrong.

The examples Cohen offers for the “rich errors” in past research are also not applicable. Miscalculating the age of the earth or presenting an incorrect equation are genuine mistakes. They might be based on incomplete or distorted knowledge. Publishing an incorrect hypothesis (e.g., that DNA is a triple helix) is not the same as mining data to confirm a hypothesis. It is perfectly valid to derive new hypotheses, even if they turn out to be completely false. For example, I might posit that gremlins cause the outdoor socket on my deck to fail. Sooner or later, a thorough empirical investigation will disprove this hypothesis and the evidence will support an alternative, such as that the wiring is faulty. The gremlin hypothesis may be false – and it is also highly implausible – but nothing stops me from formulating it. Wansink’s problem wasn’t with his hypotheses (some of which may indeed turn out to be true) but with the irregularities in the data he used to support them.

“Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them”

Ahm, no. Forming hypotheses before collecting data is how it’s supposed to work. Using Cohen’s “generous perspective”, this is indeed how hypothetico-deductive research works. In how far this relates to Wansink’s “research sin” depends on what exactly is meant here by “searching for data to support” your hypotheses. If this implies you are deliberately looking for data that confirms your prior belief while ignoring or rejecting observations that contradict it, then that is not merely a questionable research practice, but antithetical to the whole scientific endeavour itself. It is also a perfect definition of confirmation bias, something that afflicts all human beings to some extent, scientists included. Scientists must find protections from fooling themselves in this way and that entails constant vigilance and scepticism of our own pet theories. In stark contrast, engaging in this behaviour actively and deliberately is not science but pure story-telling.

The critics are not merely “indulging themselves in a myth of neutral observers uncovering ‘facts'”. Quite to the contrary, I think Wansink’s critics are well aware of the human fallibility of scientists. People are rarely perfectly neutral when it pertains to hypotheses. Even when you are not emotionally invested in which one of multiple explanations for a phenomenon might be correct, they are frequently not equal in terms of how exciting it might be to confirm them. Finding gremlins under my deck would certainly be more interesting (and scary?) than evidence of faulty wiring.

But in the end, facts are facts. There are no “alternative facts”. Results are results. We can differ on how to interpret them but that doesn’t change the underlying data. Of course, some data are plainly wrong because they come from incorrect measurements, artifacts, or statistical flukes. These results are wrong. They aren’t facts even if we think of them as facts at the moment. Sooner or later, they will be refuted. That’s normal. But this is a long shot from deliberately misreporting or distorting facts.

“…studies like Wansink’s can be of value if they offer new clarity in looking at phenomena…”

This seems to be the crux of Cohen’s argument. Somehow, despite all the dubious and possibly fraudulent nature of his research, Wansink still makes a useful contribution to science. How exactly? What “new clarity” do we gain from cherry-picked results?

I can see though that Wansink may “stimulate ideas for future investigations”. There is no denying that he is a charismatic presenter and that some of his ideas were ingenuous. I like the concept of self-filling soup bowls. I do think we must ask some critical questions about this experimental design, such as whether people can be truly unaware that the soup level doesn’t go down as they spoon it up. But the idea is neat and there is certainly scope for future research.

But don’t present this as some kind of virtue. By all means, give credit to him for developing a particular idea or a new experimental method. But please, let’s not pretend that this excuses the dubious and deliberate distortion of the scientific record. It does not justify the amount of money that has quite possibly been wasted on changing how people eat, the advice given to schools based on false research. Deliberately telling untruths is not an error, it is called a lie.

1024px-gremlins_think_it27s_fun_to_hurt_you-_use_care_always-_back_up_our_battleskies5e_-_nara_-_535381

 

Enough with the stupid heuristics already!

Today’s post is inspired by another nonsensical proposal that made the rounds and that reminded me why I invented the Devil’s Neuroscientist back in the day (Don’t worry, that old demon won’t make a comeback…). So apparently RetractionWatch created a database allowing you to search for an author’s name to list any retractions or corrections of their publications*. Something called the Ochsner Journal then declared they would use this to scan “every submitting author’s name to ensure that no author published in the Journal has had a paper retracted.” I don’t want to dwell on this abject nonsense – you can read about this in this Twitter thread. Instead I want to talk about the wider mentality that I believe underlies such ideas.

In my view, using retractions as a stigma to effectively excommunicate any researcher from “science” forever is just another manifestation of a rather pervasive and counter-productive tendency of trying to reduce everything in academia to simple metrics and heuristics. Good science should be trustworthy, robust, careful, transparent, and objective. You cannot measure these things with a number.

Perhaps it is unsurprising that quantitative scientists want to find ways to quantify such things. After all, science is the endeavour to reveal regularities in our observations to explain the variance of the natural world and thus reduce the complexity in our understanding of it. There is nothing wrong with meta-science and trying to derive models of how science – and scientists – work. But please don’t pretend that these models are anywhere near good enough to actually govern all of academia.

Few people you meet still believe that the Impact Factor of a journal tells you much about the quality of a given publication in it. Neither does an h-index or citation count tell us anything about the importance or “impact” of somebody’s research, certainly not without looking at this relative to the specific field of science they operate in. The rate with which someone’s findings replicate doesn’t tell you anything about how great a scientist they are. And you certainly won’t learn anything about the integrity and ability of a researcher – and their right to publish in your journal – when all you have to go on is that they were an author on one retracted study.

Reducing people’s careers and scientific knowledge to a few stats is lazy at best. But it is also downright dangerous. As long as such metrics are used to make consequential real-life decisions, people are incentivised to game them. Nowhere can this be seen better than with the dubious tricks some journals use to inflate their Impact Factor or the occasional dodgy self-citation scandals. Yes, in the most severe cases these are questionable, possibly even fraudulent, practices – but there is a much greater grey area here. What do you think would happen, if we adopted the policy that only researchers with high replicability ratings get put up for promotion? Do you honestly think this would encourage scientists to do better science rather than merely safer, boring science?

This argument is sometimes used as a defence of the status quo and a reason why we shouldn’t change the way science is done. Don’t be fooled by that. We should reward good and careful science. We totally should give credit to people who preregister their experiments, self-replicate their findings, test the robustness of their methods, and go the extra mile to ensure their studies are solid. We should appreciate hypotheses based on clever, creative, and/or unique insights. We should also give credit to people for admitting when they are wrong – otherwise why should anyone seek the truth?

The point is, you cannot do any of that with a simple number in your CV. Neither can you do that by looking at retractions or failures to replicate as a plague mark on someone’s career. I’m sorry to break it to you, but the only way to assess the quality of some piece of research, or to understand anything about the scientist behind it, is to read their work closely and interpret it in the appropriate context. That takes time and effort. Often it also necessitates talking to them because no matter how clever you think you are, you will not understand everything they wrote, just as not everybody will comprehend the gibberish you write. If you believe a method is inadequate, by all means criticise it. Look at the raw data and the analysis code. Challenge interpretations you disagree with. Take nobody’s word for granted and all that…

But you can shove your metrics where the sun don’t shine.

Is d>10 a plausible effect size?

TL;DR: You may get a very large relative effect size (like Cohen’s d), if the main source of the variability in your sample is the reliability of each observation and the measurement was made as exact as is feasible. Such a large d is not trivial, but in this case talking about d is missing the point.

In discussions of scientific findings you will often hear talk about relative effect sizes, like the ubiquitous Cohen’s d. Essentially, such effect sizes quantify the mean difference between groups/treatments/conditions relative to the variability across subjects/observations. The situation is actually a lot more complicated because even for a seemingly simple results like the difference between conditions you will find that there are several ways of calculating the effect size. You can read a nice summary by Jake Westfall here. There are also other effect sizes, such as correlation coefficients, and what I write here applies to that, too. I will however stick to the difference-type effect size because it is arguably the most common.

One thing that has irked me about those discussions for some years is that this ignores a very substantial issue: the between-subject variance of your sample depends on the within-subject variance. The more unreliable the measurement of each subject, the greater is the variability of your sample. Thus the reliability of individual measurements limits the relative effect size you can possibly achieve in your experiment given a particular experimental design. In most of science – especially biological and psychological sciences – the reliability of individual observations is strongly limited by the measurement error and/or the quality of your experiment.

There are some standard examples that are sometimes used to illustrate what a given effect size means. I stole a common one from this blog post about the average height difference between men and women, which apparently was d=1.482 in 1980 Spain. I have no idea if this is true exactly but that figure should be in the right ballpark. I assume most people will agree that men are on average taller than women but that there is nevertheless substantial overlap in the distributions – so that relatively frequently you will find a woman who is taller than many men. That is an effect size we might consider strong.

The height difference between men and women is a good reference for an effect size because it is largely limited by the between-subject variance, the variability in actual heights across the population. Obviously, the reliability of each observation also plays a role. There will definitely be a degree of measurement error. However, I suspect that this error is small, probably on the order of a few millimeters. Even if you’re quite bad at this measurement I doubt you will typically err by more than 1-2 cm and you can probably still replicate this effect in a fairly small sample. However, in psychology experiments your measurement rarely is that accurate.

Now, in some experiments you can increase the reliability of your individual measurement by increasing the number of trials (at this point I’d like to again refer to Richard Morey’s related post on this topic). In psychophysics, collecting hundreds or thousands of trials on one individual subject is not at all uncommon. Let’s take a very simple case. Contour integration refers to the ability of the visual system to detect “snake” contours better than “ladder” contours or those defined by other orientations (we like to call those “ropes”):

 

In the left image you should hopefully see a circle defined by 16 grating patches embedded in a background or randomly oriented gratings. This “snake” contour pops out from the background because the visual system readily groups orientations along a collinear (or cocircular) path into a coherent object. In contrast, when the contour is defined by patches of other orientations, for example the “rope” contour in the right image which is defined by patches at 45 degrees relative to the path, then it is much harder to detect the presence of this contour. This isn’t a vision science post so I won’t go into any debates on what this means. The take-home message here is that if healthy subjects with normal vision are asked to determine the presence or absence of a contour like this, especially with limited viewing time, they will perform very well for the “snake” contours but only barely above chance levels for the “rope” contours.

This is a very robust effect and I’d argue this is quite representative of many psychophysical findings. A psychophysicist probably wouldn’t simply measure the accuracy but conduct a broader study of how this depends on particular stimulus parameters – but that’s not really important here. It is still pretty representative.

What is the size of this effect? 

If I study this in a group of subjects, the relative effect size at the group level will depend on how accurately I measure the performance in each individual. If I have 50 subjects (which is between 10-25 larger than your typical psychophysics study…) and each performs just one trial, then the sample variance will be much larger compared to if each of them does 100 trials or if they each do 1000 trials. As a result, the Cohen’s d of the group will be considerably different. A d>10 should be entirely feasible if we collect enough trials per person.

People will sometimes say that large effects (d>>2 perhaps) are trivial. But there is nothing trivial about this. In this particular example you may see the difference quite easily for yourself (so you are a single-subject and single-trial replication). But we might want to know just how much better we are at detecting the snake than the rope contours. Or, as I already mentioned, a psychophysicist might measure the sensitivity of subjects to various stimulus parameters in this experiment (e.g., the distance between patches, the amount of noise in the orientations we can tolerate, etc) and this could tell us something about how vision works. The Cohen’s d would be pretty large for all of these. That does not make it trivial but in my view it makes it useless:

Depending on my design choices the estimated effect size may be a very poor reflection of the true effect size. As mentioned earlier, the relative effect size is directly dependent on the between-subject variance – but that in turn depends on the reliability of individual measurements. If each subject only does one trial, the effect of just one attentional lapse or accidental button press in the task is much more detrimental than when they perform 1000 trials, even if the overall rate of lapses/accidents is the same*.

Why does this matter?

In many experiments, the estimate of between-subject variance will be swamped by the within-subject variability. Returning to the example of gender height differences, this is essentially what would happen if you chose to eyeball each person’s height instead of using a tape measure. I’d suspect that is the case for many experiments in social or personality psychology where each measurement is essentially a single quantity (say, timing the speed with which someone walks out of the lab in a priming experiment) rather than being based on hundreds or thousands of trials as in psychophysics. Notoriously noisy measurements are also doubtless the major limiting factor in most neuroimaging experiments. On the other hand, I assume a lot of questionnaire-type results you might have in psychology (such as IQ or the Big Five personality factors) have actually pretty high test-retest reliability and so you probably do get mostly the between-subject variance.

The problem is that often it is very difficult to determine which scenario we are in. In psychophysics, we are often so extremely dominated by the measurement reliability that a knowledge of the “true” population effect size is actually completely irrelevant. This is a critical issue because you cannot use such an effect size for power analysis: If I take an experiment someone did and base my power analysis on the effect size they reported, I am not really powering my experiment to detect a similar effect but a similar design. (This is particularly useless if I then decide to use a different design…)

So next time you see an unusually large Cohen’s (d>10 or even d>3) ask yourself not simply whether this is a plausible effect but whether this experiment can plausibly estimate the true population effect. If this result is based on a single observation per subject with a highly variable measurement (say, how often Breton drivers stop for female hitchhikers wearing red clothing…), even a d=1 seems incredibly large.

But if it is for a measurement that could have been made more reliable by doubling the amount of data collected in each subject (say, a change in psychometric thresholds), then a very high Cohen’s d is entirely plausible – but it is also pretty meaningless. In this situation, what we should really care about is the absolute effect size (How much does the threshold change? How much does the accuracy drop? etc).

And I must say, I remain unsure whether absolute effect sizes aren’t more useful in general, including for experiments on complex human behaviour, neuroimaging, or drug effects.

* Actually the lapse rate probably increases with a greater number of trials due to subject fatigue, drop in motivation, or out of pure spite. But even that increase is unlikely to be as detrimental as having too few trials.

Of hacked peas and crooked teas

The other day, my twitter feed got embroiled in another discussion about whether or not p-hacking is deliberate and if it constitutes fraud. Fortunately, I then immediately left for a trip abroad and away from my computer, so there was no danger of me being drawn into this debate too deeply and running the risk of owing Richard Morey another drink. However, now that I am back I wanted to elaborate a bit more on why I think the way our field has often approached p-hacking is both wrong and harmful.

What the hell is p-hacking anyway? When I google it I get this Wikipedia article, which uses it as a synonym for “data dredging”. There we already have a term that seems to me more appropriate. P-hacking refers to when you massage your data and analysis methods until your result reaches a statistically significant p-value. I will put it to you that in practice most p-hacking is not necessarily about hacking p-s but about dredging your data until your results fit a particular pattern. That may be something you predicted but didn’t find or could even just be some chance finding that looked interesting and is amplified this way. However, the p-value is usually probably secondary to the act here. The end result may very well be the same in that you continue abusing the data until a finding becomes significant, but I would bet that in most cases what matters to people is not the p-value but the result. Moreover, while null-hypothesis significance testing with p-values is still by far the most widespread way to make inferences about results, it is not the only way. All this fussing about p-hacking glosses over the fact that the same analytic flexibility or data dredging can be applied to any inference, whether it is based on p-values, confidence intervals, Bayes factors, posterior probabilities, or simple summary statistics. By talking of p-hacking we create a caricature that this is somehow a problem specific to p-values. Whether or not NHST is the best approach for making statistical inferences is a (much bigger) debate for another day – but it has little to do with p-hacking.

What is more, not only is p-hacking not really about p’s but it is also not really about hacking. Here is the dictionary entry for the term ‘hacking‘. I think we can safely assume that when people say p-hacking they don’t mean that peas are physically being chopped or cut or damaged in any way. I’d also hazard a guess that it’s not meant in the sense of “to deal or cope with” p-values. In fact, the only meaning of the term that seems to come even remotely close is this:

“to modify a computer program or electronic device in a skillful or clever way”

Obviously, what is being modified in p-hacking is the significance or impressiveness of a result, rather than a computer program or electronic device, but we can let this slide. I’d also suggest that it isn’t always done in a skillful or clever way either, but perhaps we can also ignore this. However, the verb ‘hacking’ to me implies that this is done in a very deliberate way. It may even, as with computer hacking, carry the connotation of fraud, of criminal intent. I believe neither of these things are true about p-hacking.

That is not to say that p-hacking isn’t deliberate. I believe in many situations it likely is. People no doubt make conscious decisions when they dig through their data. But the overwhelming majority of p-hacking is not deliberately done to create spurious results that the researcher knows to be false. Anyone who does so would be committing actual fraud. Rather, most p-hacking is the result of confirmation bias combined with analytical flexibility. This leads people to sleep walk into creating false positives or – as Richard Feynman would have called it – fooling themselves. Simine Vazire already wrote an excellent post about this a few years ago (and you may see a former incarnation of yours truly in the comment section arguing against the point I’m making here… I’d like to claim that it’s cause I have grown as a person but in truth I only exorcised this personality :P). I’d also guess that a lot of p-hacking happens out of ignorance, although that excuse really shouldn’t fly as easily in 2017 as it may have done in 2007. Nevertheless, I am pretty sure people do not normally p-hack because they want to publish false results.

Some may say that it doesn’t matter whether or not p-hacking is fraud – the outcome is the same: many published results are false. But in my view it’s not so simple. First, the solution to these two problems surely isn’t the same. Preregistration and transparency may very well solve the problem of analytical flexibility and data dredging – but it is not going to stop deliberate fraud, nor is it meant to. Second, actively conflating fraud and data dredging implicitly accuses researchers of being deliberately misleading and thus automatically puts them on the defensive. This is hardly a way to have a productive discussion and convince people to do something about p-hacking. You don’t have to look very far for examples of that playing out. Several protracted discussions on a certain Psychology Methods Facebook group come to mind…

Methodological flexibility is a real problem. We definitely should do something about it and new moves towards preregistration and data transparency are at least theoretically effective solutions to improve things. The really pernicious thing about p-hacking is that people are usually entirely unaware of the fact that they are doing it. Until you have tried to do a preregistered study, you don’t appreciate just how many forks in the road you passed along the way (I may blog about my own experiences with that at some point). So implying, however unintentionally, that people are fraudsters is not helping matters.

Preregistration and data sharing have gathered a lot of momentum over the past few years. Perhaps the opinions of some old tenured folks opposed to such approaches no longer carry so much weight now, regardless how powerful they may be. But I’m not convinced that this is true. Just because there has been momentum now does not mean that these ideas will prevail. It is just as likely that they fizzle out due to lacking enthusiasm or because people begin to feel that the effort isn’t worth it. I seems to me that “open science” very much exists in a bubble and I have bemoaned that before. To change scientific practices we need to open the hearts and minds of sceptics to why p-hacking is so pervasive. I don’t believe we will achieve that by preaching to them. Everybody p-hacks if left to their own devices. Preregistration and open data can help protect yourself against your mind’s natural tendency to perceive patterns in noise. A scientist’s training is all about developing techniques to counteract this tendency, and so open practices are just another tool for achieving that purpose.

1920px-fish2c_chips_and_mushy_peas
There is something fishy about those pea values…