A few thoughts on stats checking

You may have heard of StatCheck, an R package developed by Michèle B. Nuijten. It allows you to search a paper (or manuscript) for common frequentist statistical tests. The program then checks whether the reported p-value matches the reported test statistic and degrees of freedom. It flags up cases where the p-value is inconsistent and, additionally, where the recalculated p-value would change the conclusions of the test. Now, recently this program was used to trawl through 50,000-ish papers in psychology journals (it currently only recognizes statistics in APA style). The results for each paper were then automatically posted as comments on the post-publication discussion platform PubPeer, for example here. At the time of writing, I still don’t know if this project has finished. I assume not, because the (presumably) only one of my papers included in this search has yet to receive its comment. I left a comment of my own there, which is somewhat satirical because 1) I don’t take the world as seriously as my grumpier colleagues and 2) I’m really just an asshole…
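To make the core check concrete, here is a minimal sketch of the idea in Python (an illustration only – StatCheck itself is an R package and handles t, F, r, χ² and z tests): recompute the p-value from a reported test statistic and compare it with the reported p-value. The simplest case, a two-sided z-test, needs nothing beyond the standard library.

```python
# Toy sketch of a StatCheck-style consistency check (not the real package).
# Shown for a two-sided z-test only, using just the standard library.
from math import erf, sqrt

def p_from_z(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def consistent(z: float, reported_p: float, decimals: int = 3) -> bool:
    """True if the reported p matches the recomputed p at the reported precision."""
    return round(p_from_z(z), decimals) == round(reported_p, decimals)

print(consistent(1.96, 0.050))  # True: recomputed p ≈ 0.0500
print(consistent(2.58, 0.030))  # False: recomputed p ≈ 0.0099
```

A real checker additionally has to parse the statistics out of the text and handle one-tailed tests and rounding of the reported statistic, which is where most of the complexity (and the false positives) come from.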

While many have welcomed the arrival of our StatCheck Overlords, not everyone is happy. For instance, a commenter in this thread bemoans that this automatic stats checking is just “mindless application of stats unnecessarily causing grief, worry, and ostracism. Effectively, a witch hunt.” In a blog post, Dorothy Bishop discusses her own StatCheck comments, one of which gave the paper a clean bill of health while the other flagged some potential errors that could change the significance, and thus the conclusions, of the study. My own immediate gut reaction on hearing about this was that it would cause a deluge of vacuous comments and diminish the signal-to-noise ratio of PubPeer. Until now, discussions there have frequently focused on serious issues with published studies. If I see a comment on a paper I’ve been looking up (which is made very easy by the PubPeer plugin for Firefox), I will normally check it out. If in future most papers have a comment from StatCheck, I will certainly lose that instinct. Some are worried about the stigma that may be attached to papers when errors are found, although others have pointed out that to err is human and we shouldn’t be afraid of discovering errors.

Let me be crystal clear here. StatCheck is a fantastic tool and should prove immensely useful to researchers. Surely we all want to reduce errors in our publications, which I am sure all of us make some of the time. I have definitely noticed typos in my papers, and also errors in the statistics. That’s despite the fact that when I do the statistics myself I use Matlab code that outputs them in the format they should take in the text, so all I have to do is copy and paste them in. Some errors are introduced during copy-editing after a manuscript is accepted. Anyway, using StatCheck on our own manuscripts can certainly help reduce such errors in future. It is also extremely useful for reviewing papers and marking student dissertations, because I usually don’t have the time (or desire) to check every single test by hand. The real question is whether there is much point in doing this post hoc for thousands of already published papers.

One argument for this is that it enables people to meta-analyze previous results. Here it is important to know that a statistic is actually correct. However, I don’t entirely buy this argument, because if you meta-analyze the literature you really should spend more time checking the results than looking at what StatCheck’s auto-comment on PubPeer said. If anything, the countless comments saying that there are zero errors are probably more misleading than the ones that found minor problems. They may mislead you into thinking that there is probably nothing wrong with these statistics – and this is not necessarily true. In all fairness, StatCheck, both in its auto-comments and in the original paper, is very explicit about the fact that its results aren’t definitive and should be verified manually. But if there is one thing I’ve learned about people, it is that they tend to ignore the small print. When was the last time you actually read an EULA before agreeing to it?

Another issue with the meta-analysis argument is that the search is presently of limited scope. While 50,000 is a large number, it is a small proportion of scientific papers, even within the fields of psychology and neuroscience. I work at a psychology department and am (by some people’s definition) a psychologist but – as I said – to my knowledge only one of my own papers should even have been included in the search so far. So if I do a literature search for a meta-analysis, StatCheck’s autopubpeering wouldn’t be much help to me. I’m told there are plans to widen the scope of StatCheck’s robotic efforts beyond psychology journals in the future. Once coverage is broader this may indeed be more useful, although the problem remains that the validity of its results is simply unknown.

The original paper includes a validity check in the Appendix. This suggests that error rates are reasonably low when StatCheck’s results are compared to previous manual checks. This is doubtless important for confirming that StatCheck works. But in the long run this is not really the error rate we are interested in. What it does not tell us is what proportion of papers contain errors that actually affect a study’s conclusions. Take Dorothy Bishop’s paper as an example. There, StatCheck detected two F-tests for which the recalculated p-value would change the statistical conclusions. However, closer inspection reveals that the tests were simply misreported in the paper. Only one degree of freedom was given, and I’m told StatCheck misinterpreted what test this was (but I’m also told this has been fixed in the new version). If you substitute in the correct degrees of freedom, the reported p-value matches.

Now, nobody is denying that there is something wrong with how these particular stats were reported. An F-test should have two degrees of freedom. So StatCheck did reveal errors, and this is certainly useful. But the PubPeer comment flags this up as a potential gross inconsistency that could theoretically change the study’s conclusions. However, we know that this isn’t actually the case here. The statistical inference and conclusions are fine. There is merely a typographical error. The StatCheck report is clearly a false positive.

This distinction seems important to me. The initial reports about this StatCheck mega-trawl were that “around half of psychology papers have at least one statistical error, and one in eight have mistakes that affect their statistical conclusions.” At least half of this sentence is blatantly untrue. I wouldn’t necessarily call a typo a “statistical error”. But as I already said, revealing these kinds of errors is certainly useful nonetheless. The second part of the statement is more troubling. I don’t think we can conclude that 1 in 8 papers included in the search have mistakes that affect their conclusions. We simply do not know that. StatCheck is a clever program, but it’s not a sentient AI. The only way to really determine whether the statistical conclusions are correct is still to read each paper carefully and work out what’s going on. Note that the statement in the StatCheck paper is more circumspect and acknowledges that such firm conclusions cannot be drawn from its results. It’s a classic case of journalistic overreach, where the RetractionWatch post simplifies what the researchers actually said. But these are still people who know what they’re doing. They aren’t writing flashy “science” articles for the tabloid press.

This is a problem. I do think we need to be mindful of how the public perceives scientific research. In a world in which it is fine for politicians to win referenda because “people have had enough of experts” and in which a narcissistic, science-denying madman is dangerously close to becoming US President, we simply cannot afford to keep telling the public that science is rubbish. Note that worries about the reputation of science are no excuse not to help improve it. Quite the contrary: they are a reason to ensure that it does improve. As I have said many times, science is self-correcting, but only if there are people who challenge dearly held ideas, who try to replicate previous results, who improve the methods, and who reveal errors in published research. This must be encouraged. However, if this effort does not go hand in hand with informing people about how science actually works, rather than just “fucking loving” it for its cool tech and flashy images, then we are doomed. I think it is grossly irresponsible to tell people that an eighth of published articles contain incorrect statistical conclusions when the true number is probably considerably smaller.

In the same vein, an anonymous commenter on my own PubPeer thread also suggested that we should “not forget that Statcheck wasn’t written ‘just because.'” There is again an underhanded message in this. Again, I think StatCheck is a great tool, and it can reveal questionable results such as rounding down your p=0.054 to p=0.05 or the even more unforgivable p<0.05. It can also reveal other serious errors. However, until I see compelling evidence that the proportion of such evils in the literature is as high as these statements suggest, I remain skeptical. A mass-scale StatCheck of the whole literature to weed out serious mistakes seems a bit like carpet-bombing a city just to assassinate one terrorist leader. Even putting questions of morality aside, it isn’t very efficient. If we assume that some 13% of papers have grossly inconsistent statistics, we still need to go and check them all manually. And, what is worse, we quite likely miss a lot of serious errors that this test simply can’t detect.

So what do I think about all this? I’ve come to the conclusion that there is no major problem per se with StatCheck posting on PubPeer. I do think it is useful to see these results, especially if it becomes more general. Seeing all of these comments may help us understand how common such errors are. It allows people to double check the results when they come across them. I can adjust my instinct. If I see one or two comments on PubPeer I may now suspect it’s probably about StatCheck. If I see 30, it is still likely to be about something potentially more serious. So all of this is fine by me. And hopefully, as StatCheck becomes more widely used, it will help reduce these errors in future literature.

But – and this is crucial – we must consider how we talk about this. We cannot treat every statistical error as something deeply shocking. We need to develop a fair tolerance for these errors as they are discovered. This may seem obvious to some, but I get the feeling not everybody realizes that correcting errors is the driving force behind science. We need to communicate this to the public instead of just telling them that psychologists can’t do statistics. We can’t just say that some issue with our data analysis invalidates 45,000 papers and 15 years’ worth of fMRI studies. In short, we should stop overselling our claims. If, like me, you believe it is damaging when people oversell their outlandish research claims about power poses and social priming, then it is also damaging if people oversell their doomsday stories about scientific errors. Yes, science makes errors – but the fact that we are actively trying to fix them is proof that it works.

Your friendly stats checking robot says hello

22 thoughts on “A few thoughts on stats checking”

  1. Good post. Personally, from what I’ve seen on PubPeer so far, StatCheck is a fantastic idea, but there seem to be a lot of false positives – and as for false negatives, it would be very hard to notice those. So I think it needs more work. Possibly it was “released” too early, but my feeling is that you have to release at some point and then improve the product based on feedback. It will be interesting to see how it develops.


    1. Yes, maybe it was a bit premature. Then again, as I say in the post, I don’t think it’s a bad thing. It does give you an idea of what is going on. I hope lots of authors (or others) respond to the error reports. It would be great to quantify what the causes of the errors are.


    1. Did you publish much in the journals they searched? I don’t think I have, except that one Psych Science paper. There don’t seem to be many journals common in my area, really.


  2. Thanks for your post. I think on most points we agree.

    The one thing I try to repeat over and over is that statcheck is automated software that will never be as accurate as a manual check.

    As I see it, statcheck is most useful 1) to check your own manuscript before/during peer review to avoid publishing inconsistent results, and 2) to check large bodies of literature for a (rough) estimate of error rates. statcheck is noisy, but as far as we concluded not systematically biased, so in large samples the estimated error prevalence should not be far from the real prevalence.

    Using statcheck to check someone else’s paper really has to go hand in hand with a manual check before drawing strong conclusions. This is also stated in Chris’ PubPeer comments (hardly in fine print I’d say 😉 ), but I think this might be where a communication problem arose.

    I can imagine people can feel as if they are under attack when a program they may have never heard of is telling them they have mistakes in their paper. Especially when it turns out to be a false positive in statcheck! (Although in those cases, I so far have only received very friendly emails with requests for further explanation.) An added statement that these results are not definitive may not be enough to remove those feelings.

    At this point, I think my job remains 1) to keep improving statcheck, and 2) to keep repeating how its results are not definitive. Beyond that, I don’t think I can control how people interpret these results, but I hope that once it’s not so “new” anymore, people will treat it like they’d treat their spell checker: as a handy tool that sometimes says stupid things.


    1. Hi Michèle, thanks for your comment. Yes, I tried to be very clear that I don’t think you are deliberately spinning the results. Both the paper and the PubPeer comments as well as your repeated public statements make it clear that it is a bot and should be double checked. This is certainly how I’ll plan to use it in future, say, when I’m reviewing. If StatCheck finds errors in a manuscript I would look more closely at those tests and if it turns out to be an error I would mention that in the review. Other stats I may still check by hand if they look fishy but StatCheck should make it at least easier to have a quick pass through a manuscript.

      My only worry is really that many people may not interpret it in this way. The RetractionWatch post notwithstanding, I don’t think I’ve seen much of this happening yet for the StatCheck literature trawl, but it’s early days. If the dead salmon, voodoo correlations, and fMRI cluster inference papers are any indication, there could be major distortions of this in the media. You also see this with all the failed replications. It’s all fine and good that the scientific community is doing something to correct mistakes; it’s what it’s supposed to do. But we must communicate this better.

      Just to see what is at stake, the discoveries of faster-than-light particles and gravitational waves in physics were reported widely in the news only to then be refuted by failed replications. There is nothing wrong here. I think physicists have been doing it right for a long time. But the way this seems to be communicated to the media is nevertheless poor and can undermine public trust. If it is this bad for people doing it right, how bad is it going to be for a field that is only just waking up to the fact that it’s doing it wrong?

      Anyway, as Neuroskeptic said above, perhaps the PubPeer thing was too soon. I just received my own comment today. It found no errors, although as far as I can see one of the results may be a false negative? (Perhaps I’m missing something?) So I think there could’ve been more validity checking before rolling out this mass posting project.

      Then again, what is done is done. As I said, I hope lots of authors will comment, and then we can perhaps quantify what proportion of gross inconsistencies flagged up by StatCheck are actual statistical errors rather than typos. Obviously this relies on the authors’ responses, and without data sharing that’s the best we can do. So one needs to take this with a grain of salt, but I still live in the hope that most people wouldn’t deliberately lie about stuff like this.


    2. Hi Sam, see my reply on PubPeer: your results are actually not inconsistent. statcheck takes correct rounding of your test statistic into account 🙂
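      [The rounding point can be sketched in a few lines of Python (a toy illustration, not statcheck’s actual code, shown for a two-sided z-test): a statistic reported as z = 1.96 stands for any true value in [1.955, 1.965), so the recomputed p-value is a range, and a reported p should only be flagged if it falls outside that whole range.]

```python
# Toy sketch of a rounding-aware consistency check (not the real statcheck).
from math import erf, sqrt

def p_from_z(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def consistent_with_rounding(z_reported: float, p_reported: float,
                             z_decimals: int = 2) -> bool:
    """Flag only if the reported p lies outside the whole range of p-values
    implied by the rounding interval of the reported statistic."""
    half = 0.5 * 10 ** (-z_decimals)           # e.g. 0.005 for two decimals
    lo, hi = z_reported - half, z_reported + half
    # p decreases as |z| grows, so the interval endpoints bound p
    p_min, p_max = p_from_z(hi), p_from_z(lo)
    return p_min <= p_reported <= p_max

print(consistent_with_rounding(1.96, 0.0500))  # True: 0.0500 is in range
print(consistent_with_rounding(1.96, 0.0400))  # False: outside the range
```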

      As for media communication: I also think that there’s a lot to win there. Media love to create headlines saying that “HALF OF PSYCHOLOGY IS BULLSHIT”, whereas I said “in half of the papers that statcheck could detect results in, it flagged a possible inconsistency in the reported statistics, but these mistakes are mainly in the third decimal and not influential”.

      Do you think there is something we/I can do to avoid these exaggerations?


    3. By the way, did you see that PubPeer comment (obviously anonymous) on one of the reports saying that “The Dutch are obviously trying to pay for past sins.” Somewhat offensive but it did make me chuckle 😛

      Unfortunately I can’t find it any more. Maybe PubPeer removed it for obvious reasons.


    4. Thanks I left a comment now as well. I don’t think the first test can be explained by rounding error?


    5. Oh, and to answer your question: I would keep doing what you’re doing. Emphasize to journalists that these are not definitive results and that most of these mistakes are likely to be minor typos and the like – but they’re mistakes nonetheless, and technology allows us to do better than we did in the past when we had to typeset our papers on a Gutenberg press (at least this is what my professors and supervisors at uni tried to make me believe – I am still young enough that I did my PhD using a computer… 🙂 )

      I also don’t think the responsibility falls solely on you. This is something our whole community must do and we can do a lot better. It isn’t trivial or easy either. When I reiterate these points to journalists they often just ignore it in the final article. So just telling journalists isn’t really enough from what I can see.


  3. Update: the automatic report arrived on my paper now:

    Interestingly, this found no errors. The second one is obviously a rounding error and StatCheck takes this into account so that makes sense. The first one is more puzzling though. Maybe I’ve not had enough coffee but I don’t see how a rounding error could produce this mismatch. I think it’s more likely that I made a copy and paste error – but why did StatCheck not spot this?


    1. Hehe although with a bit more AI I think it should be able to do that one… 🙂

      Thanks for the explanation. I could’ve sworn I checked 0.495 but apparently I didn’t or didn’t see the p-value. Computers are definitely better at number crunching, I admit that!


  4. Thanks for your interesting piece. I don’t know anything about statistics so I’m not going to comment on that, but I do think your proposal for the best way of correcting errors in a field – by individual scientists working on replication and calling out mistakes – is problematic.

    The great thing about a piece of software is that it doesn’t have funding to lose or a senior to piss off. It has the potential to shorten the error correction cycle from career-length to just a few years.

    With the volume of research going on in the world right now, the sheer number of journals and the sheer number of papers, it seems to me unrealistic to do such a volume of checking in the first place; but even if some selfless funding body suddenly decided to throw money at that problem, the sociology of science would get in the way of a really impartial process.

    So from my point of view as someone who gets her science second hand, this kind of approach (even if not, in the long run, this particular tool) is vital and useful.


    1. Thanks for your comment. I think you misunderstood my point. While I think all scientists (as a community, not just some individuals) must work to make correction easier and replication efforts more commonplace, I certainly see a use for programs like StatCheck. It is extremely useful for reviewers, and in fact I believe some journals are already going to use it on new submissions. I think it’s a great idea.

      Anyway, it is not a question of whether an automated program like StatCheck is better at spotting serious errors. It isn’t and it can’t be – not with present day AI at least. A program may flag up inconsistencies in the reported statistics and help us fix them faster. But it can’t read papers and understand whether the error was a typo, a genuine calculation error, or – god forbid – some nefarious fraud. I’m afraid you will need people for that until computers become good enough to do science themselves, which I’m not sure is ever going to happen.


  5. Statcheck shares some characteristics with GRIM (now in press! – preprint here https://peerj.com/preprints/2064/), so I’m following this discussion with interest, even if the nature of the analyses means that it’s unlikely that GRIM will be automated any time soon.

    Both Statcheck and GRIM will likely detect a high percentage of “minor” (in terms of their impact on the results) problems, but also a few very serious ones. One of the things we found when investigating (apparent) errors detected with GRIM was that the underlying problems ranged from fairly trivial to “article needs major correction, possibly even retraction”, and that — at least in those cases where the authors shared their data with us — there was no obvious clue to where any article would lie on that continuum until we looked at the data.

    While I have some small doubts about the blanket approach of flagging all the Statcheck-detected errors on PubPeer in one big chunk — I’d be inclined to go for a phased approach, for a number of reasons — I think it’s legitimate to call out the small problems as well as the large ones. First, cf. the previous paragraph, one can’t always predict which is which. Second, this is *science*, dammit. If you are reporting F statistics and nobody in the entire process of submitting, reviewing, and proofreading the article noticed that you forgot one of the DFs, what does that tell us about the general level of attention to detail that’s going into the article? (In the case of GRIM, one author whose errors turned out to be fairly inconsequential was quite upset that s/he had made what s/he regarded as “sloppy” errors; I think that’s a great attitude. When we stop caring about reporting statistics 100.00% correctly, what else do we stop caring about?)
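    [The GRIM idea can be sketched in a few lines of Python (a toy illustration, not the published implementation): the mean of n integer-valued responses must equal k/n for some whole number k, so a mean reported to a given number of decimals is only possible if some k/n rounds to it.]

```python
# Toy illustration of the GRIM idea, not the authors' implementation.
def grim_consistent(mean_reported: float, n: int, decimals: int = 2) -> bool:
    """True if some integer sum k makes k/n round to the reported mean."""
    return any(round(k / n, decimals) == round(mean_reported, decimals)
               for k in range(0, n * 10 ** decimals + 1))  # generous k range

print(grim_consistent(3.48, 25))  # True:  87/25 = 3.48 exactly
print(grim_consistent(3.49, 25))  # False: means with n=25 step in 0.04s
```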


    1. Yes, I think GRIM isn’t going to be applied automatically anytime soon, for the same reason StatCheck isn’t going to distinguish between serious and trivial errors. You need a mind that can read and understand the paper to work out what the relevant statistics should be.

      I have a feeling that GRIM may reveal a lot grimmer (hah! :P) cases because it actually goes deeper into the data. I actually discussed this with a colleague yesterday. In my field the inferential stats are really the end point of the process. At least in neuroimaging experiments there is so much data processing before you even calculate your t-test, with plenty of potential for error (or worse) along the way.

      I totally agree with you about the sloppiness aspect. I don’t want to make those kinds of errors either and it does annoy me a lot when I discover them. My emotional reaction does depend a little on the severity of the error. Swapping a digit in my F-statistic feels like less of a reason to get mad at myself than missing out a degree of freedom. Putting down a genuinely incorrect p-value would be even worse. Something GRIM would find is utterly horrifying as that would suggest my code is wrong! 😉

      So as I said in the post, these tests are very useful and I do hope they can improve our publications. I think it’s delusional to expect we can ever get to 100% error free literature but we can and should bloody well try.


  6. “For that StatCheck detected two F-tests for which the recalculated p-value would change the statistical conclusions. However, closer inspection reveals that the test was simply misreported in the paper.”
    We’re encountering a similar problem, although the test was reported correctly; see the discussion here: https://www.pubpeer.com/publications/316EC952EF5728EE5ADA609D028620#fb107935

    Statcheck assumed the reporting was in APA format and posted the “error” (potentially incorrect statistical results, of which 2 may change statistical significance) on PubPeer for quite a number of papers. However, it wasn’t in APA format in the first place. I’d have hoped that Statcheck would refrain from posting in that case, since it should only read APA format.


    1. Thanks for your comment. Yes this is clearly inconsistent and must produce a large number of false positives. It shouldn’t really recognize these tests as normal t-tests in the first place so this is evidently an error. It looks like this has been fixed in the newer version (as this doesn’t seem to happen on the web interface) but I’d still contact Michèle about it directly.


    1. I entirely agree that data sharing is far more important than checking all the p-values match their t-stats but that doesn’t mean ensuring the stats are kosher is unimportant.

