Category Archives: improving science

Massaging data to fit a theory is antithetical to science

I have stayed out of the Wansink saga for the most part. If you don’t know what this is about, I suggest reading about this case on Retraction Watch. I had a few private conversations about this with Nick Brown, who has been one of the people instrumental in bringing about a whole series of retractions of Wansink’s publications. I have a marginal interest in some of Wansink’s famous research, specifically whether the size of plates can influence how much a person eats, because I have a broader interest in the interplay between perception and behaviour.

But none of that is particularly important. The short story is that considerable irregularities have been discovered in a string of Wansink’s publications, many of which have since been retracted. The whole affair first kicked off with a fundamental own-goal of a blog post (now removed, so posting Gelman’s coverage instead) he wrote in which he essentially seemed to promote p-hacking. Since then the problems that came to light have ranged from irregularities in (or the impossibility of) some of the data he reported, through evidence of questionable research practices in terms of cherry-picking or excluding data, to widespread self-plagiarism. Arguably, not all of these issues are equally damning and for some the evidence is more tenuous than for others – but the sheer quantity of problems is egregious. The resulting retractions seem entirely justified.

Today I read an article on Times Higher Education entitled “Massaging data to fit a theory is not the worst research sin” by Martin Cohen, which discusses Wansink’s research sins in a broader context of the philosophy of science. The argument is pretty muddled to me so I am not entirely sure what the author’s point is – but the effective gist seems to be to shrug off concerns about questionable research practices and to claim that Wansink’s research is still a meaningful contribution to science. In my mind, Cohen’s article reflects a fundamental misunderstanding of how science works and in places sounds positively post-Truthian. In the following, I will discuss some of the more curious claims made by this article.

“Massaging data to fit a theory is not the worst research sin”

I don’t know about the “worst” sin. I don’t even know if science can have “sins” although this view has been popularised by Chris Chambers’ book and Neuroskeptic’s Circles of Scientific Hell. Note that “inventing data”, a.k.a. going Full-Stapel, is considered the worst affront to the scientific method in the latter worldview. “Massaging data” is perhaps not the same as outright making it up, but on the spectrum of data fabrication it is certainly trending in that direction.

Science is about seeking the truth. In Cohen’s words, “science should above all be about explanation”. It is about finding regularities, relationships, links, and eventually – if we’re lucky – laws of nature that help us make sense of a chaotic, complex world. Altering, cherry-picking, or “shoe-horning” data to fit your favourite interpretation is the exact opposite of that.

Now, the truth is that p-hacking, the garden of forking paths, and flexible outcome-contingent analyses all fall under this category. Such questionable research practices (QRPs) are extremely widespread and to some degree pervade most of the scientific literature. But just because they are common doesn’t mean that they aren’t bad. Massaging data inevitably produces a scientific literature of skewed results. The only robust way to minimise these biases is through preregistration of experimental designs and confirmatory replications. We are working towards that becoming more commonplace – but in the absence of that it is still possible to do good and honest science.

In contrast, prolifically engaging in such dubious practices, as Wansink appears to have done, fundamentally undermines the validity of scientific research. It is not a minor misdemeanour.

“We forget too easily that the history of science is rich with errors”

I sympathise with the notion that science has always made errors. One of my favourite quotes about the scientific method is that it is about “finding better ways of being wrong.” But we need to be careful not to conflate some very different things here.

First of all, a better way of being wrong is an acknowledgement that science is never a done deal. We don’t just figure out the truth but constantly seek to home in on it. Our hypotheses and theories are constantly refined, hopefully by gradually becoming more correct, but there will also be occasional missteps down a blind alley.

But these “errors” are not at all the same thing as the practices Wansink appears to have engaged in. These were not mere mistakes. While the problems with many QRPs (like optional stopping) have long been underappreciated by many, a lot of the problems in Wansink’s retracted articles are quite deliberate distortions of scientific facts. For most, he could have and should have known better. This isn’t the same as simply getting things wrong.

The examples Cohen offers for the “rich errors” in past research are also not applicable. Miscalculating the age of the earth or presenting an incorrect equation are genuine mistakes. They might be based on incomplete or distorted knowledge. Publishing an incorrect hypothesis (e.g., that DNA is a triple helix) is not the same as mining data to confirm a hypothesis. It is perfectly valid to derive new hypotheses, even if they turn out to be completely false. For example, I might posit that gremlins cause the outdoor socket on my deck to fail. Sooner or later, a thorough empirical investigation will disprove this hypothesis and the evidence will support an alternative, such as that the wiring is faulty. The gremlin hypothesis may be false – and it is also highly implausible – but nothing stops me from formulating it. Wansink’s problem wasn’t with his hypotheses (some of which may indeed turn out to be true) but with the irregularities in the data he used to support them.

“Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them”

Ahm, no. Forming hypotheses before collecting data is how it’s supposed to work. Using Cohen’s “generous perspective”, this is indeed how hypothetico-deductive research works. How far this relates to Wansink’s “research sin” depends on what exactly is meant here by “searching for data to support” your hypotheses. If this implies you are deliberately looking for data that confirms your prior belief while ignoring or rejecting observations that contradict it, then that is not merely a questionable research practice, but antithetical to the whole scientific endeavour itself. It is also a perfect definition of confirmation bias, something that afflicts all human beings to some extent, scientists included. Scientists must find protections from fooling themselves in this way and that entails constant vigilance and scepticism of our own pet theories. In stark contrast, engaging in this behaviour actively and deliberately is not science but pure story-telling.

The critics are not merely “indulging themselves in a myth of neutral observers uncovering ‘facts'”. Quite to the contrary, I think Wansink’s critics are well aware of the human fallibility of scientists. People are rarely perfectly neutral when it pertains to hypotheses. Even when you are not emotionally invested in which one of multiple explanations for a phenomenon might be correct, they are frequently not equal in terms of how exciting it might be to confirm them. Finding gremlins under my deck would certainly be more interesting (and scary?) than evidence of faulty wiring.

But in the end, facts are facts. There are no “alternative facts”. Results are results. We can differ on how to interpret them but that doesn’t change the underlying data. Of course, some data are plainly wrong because they come from incorrect measurements, artifacts, or statistical flukes. These results are wrong. They aren’t facts even if we think of them as facts at the moment. Sooner or later, they will be refuted. That’s normal. But this is a long shot from deliberately misreporting or distorting facts.

“…studies like Wansink’s can be of value if they offer new clarity in looking at phenomena…”

This seems to be the crux of Cohen’s argument. Somehow, despite all the dubious and possibly fraudulent nature of his research, Wansink still makes a useful contribution to science. How exactly? What “new clarity” do we gain from cherry-picked results?

I can see though that Wansink may “stimulate ideas for future investigations”. There is no denying that he is a charismatic presenter and that some of his ideas were ingenious. I like the concept of self-filling soup bowls. I do think we must ask some critical questions about this experimental design, such as whether people can be truly unaware that the soup level doesn’t go down as they spoon it up. But the idea is neat and there is certainly scope for future research.

But don’t present this as some kind of virtue. By all means, give credit to him for developing a particular idea or a new experimental method. But please, let’s not pretend that this excuses the dubious and deliberate distortion of the scientific record. It does not justify the amount of money that has quite possibly been wasted on changing how people eat, or the advice given to schools based on false research. Deliberately telling untruths is not an error, it is called a lie.

Enough with the stupid heuristics already!

Today’s post is inspired by another nonsensical proposal that made the rounds and that reminded me why I invented the Devil’s Neuroscientist back in the day (Don’t worry, that old demon won’t make a comeback…). So apparently Retraction Watch created a database allowing you to search for an author’s name to list any retractions or corrections of their publications*. Something called the Ochsner Journal then declared they would use this to scan “every submitting author’s name to ensure that no author published in the Journal has had a paper retracted.” I don’t want to dwell on this abject nonsense – you can read about this in this Twitter thread. Instead I want to talk about the wider mentality that I believe underlies such ideas.

In my view, using retractions as a stigma to effectively excommunicate any researcher from “science” forever is just another manifestation of a rather pervasive and counter-productive tendency of trying to reduce everything in academia to simple metrics and heuristics. Good science should be trustworthy, robust, careful, transparent, and objective. You cannot measure these things with a number.

Perhaps it is unsurprising that quantitative scientists want to find ways to quantify such things. After all, science is the endeavour to reveal regularities in our observations to explain the variance of the natural world and thus reduce the complexity in our understanding of it. There is nothing wrong with meta-science and trying to derive models of how science – and scientists – work. But please don’t pretend that these models are anywhere near good enough to actually govern all of academia.

Few people you meet still believe that the Impact Factor of a journal tells you much about the quality of a given publication in it. Neither does an h-index or citation count tell us anything about the importance or “impact” of somebody’s research, certainly not without looking at this relative to the specific field of science they operate in. The rate with which someone’s findings replicate doesn’t tell you anything about how great a scientist they are. And you certainly won’t learn anything about the integrity and ability of a researcher – and their right to publish in your journal – when all you have to go on is that they were an author on one retracted study.

Reducing people’s careers and scientific knowledge to a few stats is lazy at best. But it is also downright dangerous. As long as such metrics are used to make consequential real-life decisions, people are incentivised to game them. Nowhere can this be seen better than with the dubious tricks some journals use to inflate their Impact Factor or the occasional dodgy self-citation scandals. Yes, in the most severe cases these are questionable, possibly even fraudulent, practices – but there is a much greater grey area here. What do you think would happen, if we adopted the policy that only researchers with high replicability ratings get put up for promotion? Do you honestly think this would encourage scientists to do better science rather than merely safer, boring science?

This argument is sometimes used as a defence of the status quo and a reason why we shouldn’t change the way science is done. Don’t be fooled by that. We should reward good and careful science. We totally should give credit to people who preregister their experiments, self-replicate their findings, test the robustness of their methods, and go the extra mile to ensure their studies are solid. We should appreciate hypotheses based on clever, creative, and/or unique insights. We should also give credit to people for admitting when they are wrong – otherwise why should anyone seek the truth?

The point is, you cannot do any of that with a simple number in your CV. Neither can you do that by looking at retractions or failures to replicate as a plague mark on someone’s career. I’m sorry to break it to you, but the only way to assess the quality of some piece of research, or to understand anything about the scientist behind it, is to read their work closely and interpret it in the appropriate context. That takes time and effort. Often it also necessitates talking to them because no matter how clever you think you are, you will not understand everything they wrote, just as not everybody will comprehend the gibberish you write. If you believe a method is inadequate, by all means criticise it. Look at the raw data and the analysis code. Challenge interpretations you disagree with. Take nobody’s word for granted and all that…

But you can shove your metrics where the sun don’t shine.

Is d>10 a plausible effect size?

TL;DR: You may get a very large relative effect size (like Cohen’s d), if the main source of the variability in your sample is the reliability of each observation and the measurement was made as exact as is feasible. Such a large d is not trivial, but in this case talking about d is missing the point.

In discussions of scientific findings you will often hear talk about relative effect sizes, like the ubiquitous Cohen’s d. Essentially, such effect sizes quantify the mean difference between groups/treatments/conditions relative to the variability across subjects/observations. The situation is actually a lot more complicated because even for a seemingly simple result like the difference between conditions you will find that there are several ways of calculating the effect size. You can read a nice summary by Jake Westfall here. There are also other effect sizes, such as correlation coefficients, and what I write here applies to those, too. I will however stick to the difference-type effect size because it is arguably the most common.
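To illustrate the point that one and the same data set can yield quite different values of “Cohen’s d” depending on how you calculate it, here is a minimal sketch of two common variants: the classical d standardised by the pooled standard deviation, and the paired version (often written d_z) standardised by the standard deviation of the within-subject differences. The function names and the toy numbers are made up for illustration and are not taken from any real study.

```python
from math import sqrt
from statistics import mean, stdev, variance


def cohens_d_pooled(a, b):
    """Mean difference divided by the pooled standard deviation.

    This treats the two conditions as independent groups.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def cohens_d_paired(a, b):
    """Mean of the paired differences divided by their standard deviation (d_z).

    This assumes a[i] and b[i] come from the same subject.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


# Hypothetical scores from five subjects measured in two conditions.
cond_a = [12.0, 15.0, 11.0, 14.0, 13.0]
cond_b = [10.0, 14.0, 10.0, 12.0, 11.0]

d_pooled = cohens_d_pooled(cond_a, cond_b)
d_paired = cohens_d_paired(cond_a, cond_b)
print(f"pooled-SD d = {d_pooled:.2f}, paired d_z = {d_paired:.2f}")
```

Because subjects who score high in one condition also tend to score high in the other, the paired version comes out much larger here, even though the mean difference is identical in both calculations.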

One thing that has irked me about those discussions for some years is that this ignores a very substantial issue: the between-subject variance of your sample depends on the within-subject variance. The more unreliable the measurement of each subject, the greater is the variability of your sample. Thus the reliability of individual measurements limits the relative effect size you can possibly achieve in your experiment given a particular experimental design. In most of science – especially biological and psychological sciences – the reliability of individual observations is strongly limited by the measurement error and/or the quality of your experiment.
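One way to make this dependence explicit is a simple additive noise model. This is a sketch under assumptions that are not spelled out in the post: each subject i has a true effect μ_i drawn from a population with mean Δ and standard deviation σ_b, each single trial adds independent noise with standard deviation σ_w, and each subject’s score is the average over n trials.

```latex
\bar{x}_i = \mu_i + \bar{\varepsilon}_i, \qquad
\mu_i \sim \mathcal{N}(\Delta, \sigma_b^2), \qquad
\bar{\varepsilon}_i \sim \mathcal{N}\!\left(0, \frac{\sigma_w^2}{n}\right)

\operatorname{Var}(\bar{x}_i) = \sigma_b^2 + \frac{\sigma_w^2}{n}
\quad\Longrightarrow\quad
d = \frac{\Delta}{\sqrt{\sigma_b^2 + \sigma_w^2 / n}}
\;\xrightarrow{\,n \to \infty\,}\; \frac{\Delta}{\sigma_b}
```

Under these assumptions the relative effect size is capped by Δ/σ_b, and how close you get to that cap depends entirely on how many trials you collect per subject.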

There are some standard examples that are sometimes used to illustrate what a given effect size means. I stole a common one from this blog post about the average height difference between men and women, which apparently was d=1.482 in 1980 Spain. I have no idea if this is true exactly but that figure should be in the right ballpark. I assume most people will agree that men are on average taller than women but that there is nevertheless substantial overlap in the distributions – so that relatively frequently you will find a woman who is taller than many men. That is an effect size we might consider strong.

The height difference between men and women is a good reference for an effect size because it is largely limited by the between-subject variance, the variability in actual heights across the population. Obviously, the reliability of each observation also plays a role. There will definitely be a degree of measurement error. However, I suspect that this error is small, probably on the order of a few millimeters. Even if you’re quite bad at this measurement I doubt you will typically err by more than 1-2 cm and you can probably still replicate this effect in a fairly small sample. However, in psychology experiments your measurement rarely is that accurate.

Now, in some experiments you can increase the reliability of your individual measurement by increasing the number of trials (at this point I’d like to again refer to Richard Morey’s related post on this topic). In psychophysics, collecting hundreds or thousands of trials on one individual subject is not at all uncommon. Let’s take a very simple case. Contour integration refers to the ability of the visual system to detect “snake” contours better than “ladder” contours or those defined by other orientations (we like to call those “ropes”):

In the left image you should hopefully see a circle defined by 16 grating patches embedded in a background of randomly oriented gratings. This “snake” contour pops out from the background because the visual system readily groups orientations along a collinear (or cocircular) path into a coherent object. In contrast, when the contour is defined by patches of other orientations, for example the “rope” contour in the right image which is defined by patches at 45 degrees relative to the path, then it is much harder to detect the presence of this contour. This isn’t a vision science post so I won’t go into any debates on what this means. The take-home message here is that if healthy subjects with normal vision are asked to determine the presence or absence of a contour like this, especially with limited viewing time, they will perform very well for the “snake” contours but only barely above chance levels for the “rope” contours.

This is a very robust effect and I’d argue this is quite representative of many psychophysical findings. A psychophysicist probably wouldn’t simply measure the accuracy but conduct a broader study of how this depends on particular stimulus parameters – but that’s not really important here. It is still pretty representative.

What is the size of this effect? 

If I study this in a group of subjects, the relative effect size at the group level will depend on how accurately I measure the performance in each individual. If I have 50 subjects (which is between 10-25 times larger than your typical psychophysics study…) and each performs just one trial, then the sample variance will be much larger compared to if each of them does 100 trials or if they each do 1000 trials. As a result, the Cohen’s d of the group will be considerably different. A d>10 should be entirely feasible if we collect enough trials per person.
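A small simulation can illustrate this. It is only a sketch: the accuracies (about 90% correct for snakes, about 55% for ropes) and the between-subject spread are invented numbers chosen to resemble the scenario described above, not data from a real contour integration experiment, and the function names are hypothetical. Each of 50 simulated subjects has fixed true accuracies, and the only thing that changes between runs is the number of trials per condition.

```python
import random
from statistics import mean, stdev


def clip01(p):
    """Keep a probability inside the valid range [0, 1]."""
    return min(1.0, max(0.0, p))


def observed_accuracy(p_correct, n_trials, rng):
    """Proportion correct over n_trials independent yes/no trials."""
    hits = sum(rng.random() < p_correct for _ in range(n_trials))
    return hits / n_trials


def simulate_group_d(n_subjects, n_trials, rng,
                     mean_rope=0.55, mean_diff=0.35,
                     sd_rope=0.02, sd_diff=0.025):
    """Paired Cohen's d (snake minus rope accuracy) for one simulated group.

    All default parameter values are assumptions made for illustration.
    """
    diffs = []
    for _ in range(n_subjects):
        # True (latent) accuracies of this subject.
        p_rope = clip01(rng.gauss(mean_rope, sd_rope))
        p_snake = clip01(p_rope + rng.gauss(mean_diff, sd_diff))
        # Measured accuracies, which get noisier the fewer trials we run.
        diffs.append(observed_accuracy(p_snake, n_trials, rng)
                     - observed_accuracy(p_rope, n_trials, rng))
    return mean(diffs) / stdev(diffs)


rng = random.Random(2017)
results = {n: simulate_group_d(50, n, rng) for n in (1, 100, 1000)}
for n, d in results.items():
    print(f"{n:>5} trials per condition: d = {d:.2f}")
```

The true difference between conditions is the same in all three runs. Only the measurement reliability changes, and with it the estimated d, which should climb from well below 1 with a single trial towards double digits with 1000 trials.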

People will sometimes say that large effects (d>>2 perhaps) are trivial. But there is nothing trivial about this. In this particular example you may see the difference quite easily for yourself (so you are a single-subject and single-trial replication). But we might want to know just how much better we are at detecting the snake than the rope contours. Or, as I already mentioned, a psychophysicist might measure the sensitivity of subjects to various stimulus parameters in this experiment (e.g., the distance between patches, the amount of noise in the orientations we can tolerate, etc) and this could tell us something about how vision works. The Cohen’s d would be pretty large for all of these. That does not make it trivial but in my view it makes it useless:

Depending on my design choices the estimated effect size may be a very poor reflection of the true effect size. As mentioned earlier, the relative effect size is directly dependent on the between-subject variance – but that in turn depends on the reliability of individual measurements. If each subject only does one trial, the effect of just one attentional lapse or accidental button press in the task is much more detrimental than when they perform 1000 trials, even if the overall rate of lapses/accidents is the same*.

Why does this matter?

In many experiments, the estimate of between-subject variance will be swamped by the within-subject variability. Returning to the example of gender height differences, this is essentially what would happen if you chose to eyeball each person’s height instead of using a tape measure. I’d suspect that is the case for many experiments in social or personality psychology where each measurement is essentially a single quantity (say, timing the speed with which someone walks out of the lab in a priming experiment) rather than being based on hundreds or thousands of trials as in psychophysics. Notoriously noisy measurements are also doubtless the major limiting factor in most neuroimaging experiments. On the other hand, I assume a lot of questionnaire-type results you might have in psychology (such as IQ or the Big Five personality factors) have actually pretty high test-retest reliability and so you probably do get mostly the between-subject variance.

The problem is that often it is very difficult to determine which scenario we are in. In psychophysics, we are often so extremely dominated by the measurement reliability that a knowledge of the “true” population effect size is actually completely irrelevant. This is a critical issue because you cannot use such an effect size for power analysis: If I take an experiment someone did and base my power analysis on the effect size they reported, I am not really powering my experiment to detect a similar effect but a similar design. (This is particularly useless if I then decide to use a different design…)
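To see why a reported d powers you for a design rather than for an effect, here is a rough sketch. It assumes the simple model in which the variance of a subject’s score is the between-subject variance plus the trial noise variance divided by the number of trials, and it uses the textbook normal approximation for the sample size of a one-sample (or paired) test. That approximation is too optimistic for very small samples, so treat the subject counts as ballpark figures. All parameter values and function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def expected_d(delta, sd_between, sd_within, n_trials):
    """Expected relative effect size when each score averages n_trials trials."""
    return delta / sqrt(sd_between ** 2 + sd_within ** 2 / n_trials)


def approx_n_subjects(d, alpha=0.05, power=0.8):
    """Approximate number of subjects for a two-sided one-sample test.

    Normal approximation: N = ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    """
    z = NormalDist()
    return ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)


# Same hypothetical population effect, two designs that differ only in trial count.
for n_trials in (10, 400):
    d = expected_d(delta=0.1, sd_between=0.05, sd_within=0.5, n_trials=n_trials)
    n_subj = approx_n_subjects(d)
    print(f"{n_trials:>3} trials/subject: d = {d:.2f}, about {n_subj} subjects")
```

If I took the d from the 400-trial design and ran only 10 trials per subject, my experiment would be badly underpowered even though the underlying effect is exactly the same.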

So next time you see an unusually large Cohen’s d (d>10 or even d>3) ask yourself not simply whether this is a plausible effect but whether this experiment can plausibly estimate the true population effect. If this result is based on a single observation per subject with a highly variable measurement (say, how often Breton drivers stop for female hitchhikers wearing red clothing…), even a d=1 seems incredibly large.

But if it is for a measurement that could have been made more reliable by doubling the amount of data collected in each subject (say, a change in psychometric thresholds), then a very high Cohen’s d is entirely plausible – but it is also pretty meaningless. In this situation, what we should really care about is the absolute effect size (How much does the threshold change? How much does the accuracy drop? etc).

And I must say, I remain unsure whether absolute effect sizes aren’t more useful in general, including for experiments on complex human behaviour, neuroimaging, or drug effects.

* Actually the lapse rate probably increases with a greater number of trials due to subject fatigue, drop in motivation, or out of pure spite. But even that increase is unlikely to be as detrimental as having too few trials.

Of hacked peas and crooked teas

The other day, my twitter feed got embroiled in another discussion about whether or not p-hacking is deliberate and if it constitutes fraud. Fortunately, I then immediately left for a trip abroad and away from my computer, so there was no danger of me being drawn into this debate too deeply and running the risk of owing Richard Morey another drink. However, now that I am back I wanted to elaborate a bit more on why I think the way our field has often approached p-hacking is both wrong and harmful.

What the hell is p-hacking anyway? When I google it I get this Wikipedia article, which uses it as a synonym for “data dredging”. There we already have a term that seems to me more appropriate. P-hacking refers to when you massage your data and analysis methods until your result reaches a statistically significant p-value. I will put it to you that in practice most p-hacking is not necessarily about hacking p-s but about dredging your data until your results fit a particular pattern. That may be something you predicted but didn’t find or could even just be some chance finding that looked interesting and is amplified this way. However, the p-value is usually probably secondary to the act here. The end result may very well be the same in that you continue abusing the data until a finding becomes significant, but I would bet that in most cases what matters to people is not the p-value but the result. Moreover, while null-hypothesis significance testing with p-values is still by far the most widespread way to make inferences about results, it is not the only way. All this fussing about p-hacking glosses over the fact that the same analytic flexibility or data dredging can be applied to any inference, whether it is based on p-values, confidence intervals, Bayes factors, posterior probabilities, or simple summary statistics. By talking of p-hacking we create a caricature that this is somehow a problem specific to p-values. Whether or not NHST is the best approach for making statistical inferences is a (much bigger) debate for another day – but it has little to do with p-hacking.
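A toy simulation of data dredging makes the inflation concrete. In this sketch there is no true effect at all, but each simulated study measures five independent outcomes and reports whichever one “worked”. For simplicity it uses a z-test with known unit variance and a p<0.05 threshold. That choice is an assumption made purely for convenience, and the same loop would inflate false positives for any other decision criterion you care to plug in. The function names and settings are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b):
    """Two-sided z-test p-value, assuming equal group sizes and known unit variance."""
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / sqrt(2.0 / n)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def null_experiment(n_per_group, n_outcomes, rng):
    """p-values for several outcome measures when no true effect exists."""
    p_values = []
    for _ in range(n_outcomes):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p(a, b))
    return p_values


def false_positive_rates(n_sims=2000, n_per_group=20, n_outcomes=5, seed=1):
    """Share of null studies declared 'significant' with and without dredging."""
    rng = random.Random(seed)
    honest = dredged = 0
    for _ in range(n_sims):
        p_values = null_experiment(n_per_group, n_outcomes, rng)
        honest += p_values[0] < 0.05      # only the outcome planned in advance
        dredged += min(p_values) < 0.05   # whichever outcome happened to work
    return honest / n_sims, dredged / n_sims


honest_rate, dredged_rate = false_positive_rates()
print(f"planned outcome only: {honest_rate:.3f}, best of five: {dredged_rate:.3f}")
```

With five independent looks you would expect roughly 1 - 0.95^5, or about 23%, of these null studies to produce a “finding”, compared with about 5% when only the planned outcome is tested.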

What is more, not only is p-hacking not really about p’s but it is also not really about hacking. Here is the dictionary entry for the term ‘hacking‘. I think we can safely assume that when people say p-hacking they don’t mean that peas are physically being chopped or cut or damaged in any way. I’d also hazard a guess that it’s not meant in the sense of “to deal or cope with” p-values. In fact, the only meaning of the term that seems to come even remotely close is this:

“to modify a computer program or electronic device in a skillful or clever way”

Obviously, what is being modified in p-hacking is the significance or impressiveness of a result, rather than a computer program or electronic device, but we can let this slide. I’d also suggest that it isn’t always done in a skillful or clever way either, but perhaps we can also ignore this. However, the verb ‘hacking’ to me implies that this is done in a very deliberate way. It may even, as with computer hacking, carry the connotation of fraud, of criminal intent. I believe neither of these things is true about p-hacking.

That is not to say that p-hacking isn’t deliberate. I believe in many situations it likely is. People no doubt make conscious decisions when they dig through their data. But the overwhelming majority of p-hacking is not deliberately done to create spurious results that the researcher knows to be false. Anyone who does so would be committing actual fraud. Rather, most p-hacking is the result of confirmation bias combined with analytical flexibility. This leads people to sleepwalk into creating false positives or – as Richard Feynman would have called it – fooling themselves. Simine Vazire already wrote an excellent post about this a few years ago (and you may see a former incarnation of yours truly in the comment section arguing against the point I’m making here… I’d like to claim that it’s ’cause I have grown as a person but in truth I only exorcised this personality :P). I’d also guess that a lot of p-hacking happens out of ignorance, although that excuse really shouldn’t fly as easily in 2017 as it may have done in 2007. Nevertheless, I am pretty sure people do not normally p-hack because they want to publish false results.

Some may say that it doesn’t matter whether or not p-hacking is fraud – the outcome is the same: many published results are false. But in my view it’s not so simple. First, the solution to these two problems surely isn’t the same. Preregistration and transparency may very well solve the problem of analytical flexibility and data dredging – but it is not going to stop deliberate fraud, nor is it meant to. Second, actively conflating fraud and data dredging implicitly accuses researchers of being deliberately misleading and thus automatically puts them on the defensive. This is hardly a way to have a productive discussion and convince people to do something about p-hacking. You don’t have to look very far for examples of that playing out. Several protracted discussions on a certain Psychology Methods Facebook group come to mind…

Methodological flexibility is a real problem. We definitely should do something about it and new moves towards preregistration and data transparency are at least theoretically effective solutions to improve things. The really pernicious thing about p-hacking is that people are usually entirely unaware of the fact that they are doing it. Until you have tried to do a preregistered study, you don’t appreciate just how many forks in the road you passed along the way (I may blog about my own experiences with that at some point). So implying, however unintentionally, that people are fraudsters is not helping matters.

Preregistration and data sharing have gathered a lot of momentum over the past few years. Perhaps the opinions of some old tenured folks opposed to such approaches no longer carry so much weight now, regardless of how powerful they may be. But I’m not convinced that this is true. Just because there has been momentum now does not mean that these ideas will prevail. It is just as likely that they fizzle out due to a lack of enthusiasm or because people begin to feel that the effort isn’t worth it. It seems to me that “open science” very much exists in a bubble and I have bemoaned that before. To change scientific practices we need to open the hearts and minds of sceptics to why p-hacking is so pervasive. I don’t believe we will achieve that by preaching to them. Everybody p-hacks if left to their own devices. Preregistration and open data can help protect yourself against your mind’s natural tendency to perceive patterns in noise. A scientist’s training is all about developing techniques to counteract this tendency, and so open practices are just another tool for achieving that purpose.

There is something fishy about those pea values…

 

Angels in our midst?

A little more on “tone” – but also some science

This post is somewhat related to the last one and will be my last words on the tone debate*. I am sorry if calling it the “tone debate” makes some people feel excluded from participating in scientific discourse. I thought my last post was crystal clear that science should be maximally inclusive, that everyone has the right to complain about things they believe to be wrong, and that unacceptable behaviour should be called out. And certainly, I believe that those with the most influence have a moral obligation to defend those who are in a weaker position (with great power comes great responsibility, etc…). It is how I have always tried to act. In fact, not so long ago I called out a particularly bullish but powerful individual because he repeatedly acts in my (and, for that matter, many other people’s) estimation grossly inappropriately in post-publication peer review. In response, I and others have taken a fair bit of abuse from said person. Speaking more generally, I also feel that as a PI I have a responsibility to support those junior to me. I think my students and postdocs can all stand up for themselves, and I would support them in doing so, but in any direct confrontation I’ll be their first line of defense. I don’t think many who have criticised the “tone debate” would disagree with this.

The problem with arguments about tone is that they are often very subjective. The case I mentioned above is pretty clear-cut. Many other situations are much greyer. More importantly, all too often “tone” is put forth as a means to silence criticism. Quite to the contrary of the argument that this “excludes” underrepresented groups from participating in the debate, it is used to categorically dismiss any dissenting views. In my experience, the people making these arguments are almost always people in positions of power.

A recent example of the tone debate

One of the many events that recently brought the question of tone to my mind was this tweet by Tom Wallis. On PubPeer** a Lydia Maniatis has been posting comments on what seems to be just about every paper published on psychophysical vision science.

I find a lot of things to be wrong with Dr Maniatis’ comments. First and foremost, it remains a mystery to me what the actual point is she is trying to make. I confess I must first read some of the literature she cites to comprehend the fundamental problem with vision science she clearly believes to have identified. Who knows, she might have an important theoretical point but it eludes me. This may very well be due to my own deficiency but it would help if she spelled it out more clearly for unenlightened readers.

The second problem with her comments is that they are in many places clearly uninformed with regard to the subject matter. It is difficult to argue with someone about the choices and underlying assumptions for a particular model of the data when they seemingly misapprehend what these parameters are. This is not an insurmountable problem and it may also partly originate in the lack of clarity with which they are described in publications. Try as you might***, to some degree your method sections will always make tacit assumptions about the methodological knowledge of the reader. A related issue is that she picks seemingly random statements from papers and counters them with quotes from other papers that often do not really support her point.

The third problem is that there is just so much of Maniatis’ comments! I probably can’t talk as I am known to write verbose blogs myself – but conciseness is a virtue in communication. In my scientific writing in manuscripts or reviews I certainly aim for it. Yet her comments on this paper by my colleague John Greenwood are a perfect example: by my count she expends 5262 words on this before giving John a chance to respond! Now perhaps the problems with that paper are so gigantic that this is justified but somehow I doubt it. Maniatis’ concern seems to be with the general theoretical background of the field. It seems to me that a paper or even a continuous blog would be a far better way to communicate her concerns than targeting one particular paper with this deluge. Even if the paper were a perfect example of the fundamental problem, it is hard to see the forest for the trees here. Furthermore, it also lowers the signal-to-noise ratio of the PubPeer thread considerably. If someone had an actual specific concern, say because they identified a major statistical flaw, it would be very hard to see it in this sea of Maniatis. Fortunately most of her other comments on PubPeer aren’t as extensive but they are still long and the same issue applies.

Why am I talking about this? Well, a fourth problem that people have raised is that her “tone” is unacceptable (see for example here). I disagree. If there is one thing I don’t take issue with it is her tone. Don’t get me wrong: I do not like her tone. I also think that her criticisms are aggressive, hostile, and unnecessarily inflammatory. Does this mean we can just brush aside her comments and ignore her immediately? It most certainly doesn’t. Even if her comments were the kind of crude bullying some other unpleasant characters in the post-publication peer review sphere are guilty of (like that bullish person I mentioned above), we should at least try to extract the meaning. If someone continues to be nasty after being called out on it, I think it is best to ignore them. In particularly bad cases they should be banned from participating in the debate. No fruitful discussion will happen with someone who just showers you in ad hominems. However, none of that categorically invalidates the arguments they make underneath all that rubbish.

Maniatis’ comments are aggressive and uncalled for. I do not, however, think they are nasty. I would prefer it if she “toned it down” as they say but I can live with how she says what she says (but of course YMMV). The point is, the other three issues I described above are what concern me, not her tone. To address them I see these solutions: first of all, I need to read some of the literature her criticisms are based on to try to understand where she is coming from. Secondly, people in the field need to explain to her the points of apparent misunderstanding. If she refuses to engage or acknowledge that, then it is best to ignore her. Third, the signal-to-noise ratio of PubPeer comments could be improved by better filtering, for example by muting a commenter as you can on Twitter. If PubPeer doesn’t implement that, then perhaps it can be achieved with a browser plug-in.

You promised there would be some science!

Yes I did. I am sorry it took so long to get here but I will briefly discuss a quote from Maniatis’ latest comment on John’s paper:

Let’s suppose that the movement of heavenly bodies is due to pushing by angels, and that some of these angels are lazier than others. We may then measure the relative motions of these bodies, fit them to functions, infer the energy with which each angel is pushing his or her planet, and report our “angel energy” findings. We may ignore logical arguments against the angel hypothesis. When, in future measurements, changes in motion are observed that makes the fit to our functions less good, we can add assumptions, such as that angels sometimes take a break, causing a lapse in their performance. And we can report these inferences as well. If discrepancies can’t be managed with quantitative fixes, we can just “hush them up.”

I may disagree with (and fail to understand) most of her criticisms, but I really like this analogy. It actually reminds me of an example I used when commenting on Psi research and which I also use in my teaching about the scientific method. I used the difference between the heliocentric and geocentric models of planetary movements to illustrate Occam’s Razor, explanatory power, and the trade-off with model complexity. Maniatis’ angels are a perfect example of how we can update our models to account for new observations by increasing their complexity and overfitting the noise. The best possible model however should maximise explanatory power while minimising our assumptions. If we can account for planetary motion without assuming the existence of angels, we may be on the right track (as disappointing as that is).
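
To make the trade-off between fit and complexity concrete, here is a minimal sketch in Python. It is purely illustrative and not taken from any of the papers discussed: the data are simulated noise with no real trend, the function names are made up for this example, and the Akaike information criterion (AIC) stands in as just one common way of penalising extra parameters. The more complex model (a line, playing the role of the “angel”) can never fit the observed data worse than the simpler one (a constant), which is exactly why goodness of fit alone cannot decide between them.

```python
import math
import random


def rss_constant(ys):
    """Residual sum of squares for the simplest model: y = mean."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def rss_line(xs, ys):
    """Residual sum of squares for a least-squares line: y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def aic(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors, up to a constant.

    k counts the fitted mean parameters. The noise variance adds the same
    amount to both models, so it does not change the comparison.
    """
    return n * math.log(rss / n) + 2 * k


# Hypothetical data: pure noise, so the "true" model is the constant one.
rng = random.Random(0)
xs = [float(i) for i in range(30)]
ys = [rng.gauss(0.0, 1.0) for _ in xs]

n = len(xs)
simple_rss = rss_constant(ys)
complex_rss = rss_line(xs, ys)

print("RSS constant:", simple_rss, "AIC:", aic(simple_rss, n, 1))
print("RSS line:    ", complex_rss, "AIC:", aic(complex_rss, n, 2))
```

Which model wins on AIC depends on the particular noise sample. The point is only that the extra parameter always improves the raw fit, so it has to buy enough improvement to pay for its penalty before we should believe in it.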

It won’t surprise you when I say I don’t believe Maniatis’ criticism applies to vision science. Our angels are supported by a long list of converging scientific observations and I think that if we remove them from the model the explanatory power of the models goes down and the complexity increases. Or at least Maniatis hasn’t made it clear why that isn’t the case. However, leaving this specific case aside, I do like the analogy a lot. There you go, I actually discussed science for a change.

* I expect someone to hold me to this!
** She also commented on PubMed Central but apparently her account there has been blocked.
*** But this is no reason not to try harder.


Is open science tone deaf?

The past week saw the latest installment of what Chris Chambers called the “arse-clenchingly awful ‘tone debate’ in psychology”. If you have no idea what he might be referring to, consider yourself lucky, leave this blog immediately, and move on with your life with the happy thought that sometimes ignorance is indeed bliss. If you think you know what he is referring to, you may or may not be right because there seem to have been lots of different things going on and “tone” seems to mean very different things to different people. It apparently involves questions such as these:

  1. What language is acceptable when engaging in critical post-publication peer review?
  2. Is it ever okay to call reanalysis and replication attempts “terrorism”?
  3. While on this topic, what should we do when somebody’s brain fart produces a terrible and tenuous analogy about something?
  4. Should you tag someone in a twitter discussion on a conference when they didn’t attend it?
  5. How should a new and unconventional conference be covered on social media?
  6. What is sarcasm and satire and are they ever okay?
  7. Also, if I don’t find your (bad?) joke or meme funny, does this mean you’re “excluding” me from the discussion?
  8. When should somebody be called a troll?
  9. Is open science tone deaf?

If you were hoping to find a concrete answer to any of these questions, I am sorry to disappoint you. We could write several volumes on each of these issues. But here I only want to address the final question, which is also the title of this post. In clear adherence to Betteridge’s Law the answer is No.

What has bothered me about this “tone debate” for quite some time, but which I only now managed to finally put my finger on, is that tone and science are completely orthogonal and independent of one another. I apologise to Chris as I’m probably rehashing this point from his arse-unclenching post. The point is also illustrated in this satirical post, which you may or may not find funny/clever/appropriate/gluten-free.

In fact, what also bothers me is this focus on open science as, to use Chris’ turn of phrase, an “evangelical movement”. If open science is an evangelical movement, is Brian Nosek its Pope? And does this make Daniël Lakens and Chris Chambers rabid crusaders, EJ Wagenmakers a p-value-bashing Lutheran, and Susan Fiske the Antichrist? I guess there is no doubt that Elsevier is the cult of Cthulhu.

Seriously, what the £$%@ is “open” science anyway? I have come to the conclusion that all this talk about open science is actually detrimental to the cause this “movement” seeks to advance. I hereby vow not to use the term “open science” ever again except to ridicule the concept. I think the use of this term undermines its goals and ironically produces all this garbage about exclusivity and tone that actually prevents more openness in science.

I have no illusions that I can effect a change in people’s use of the term. It is far too widespread and ingrained at this point. Perhaps you could change it if you could get Donald Trump to repeatedly tweet about it abusively and thus tarnish the term for good – just as he did with the Fake News moniker (I think “Sad” might be another victim). But at least I can stop using this exclusive and discriminatory term in my own life and thus help bring about a small but significant (p=0.0049) change in the way we do research.

There is no such thing as “open science”. There is good science and there is bad science (and lots of it). There are ways to conduct research that are open and transparent. I believe greater openness makes science better. As things stand right now, the larger part of the scientific community, at least in biological, social, and behavioural sciences, remains in the status quo and has not (yet) widely embraced many open practices. Slowly but surely, the field is however moving in the direction of more openness. And we have already made great strides, certainly within the decade or so that I have been a practicing scientist. Having recently had the displeasure of experiencing firsthand in my own life how the news media operate, I can tell you that we have made leaps in terms of transparency and accountability. In my view, the news media and politics would be well served to adopt more scientific practice by having easier access to source data, fighting plagiarism, and minimising unsubstantiated interpretation of data.

None of this makes “open science” special – it is really just science. Treating proponents of open practices as some sort of homogeneous army (“The Methodological Liberation Front”?) is doing all scientists a disservice. Yes, there are vocal proponents (who often vehemently disagree on smaller points, such as the best use of p-values) but in the end all scientists should have an interest in improving scientific practice. This artificial division into open science and the status quo (“closed science”?) is not helpful in convincing sceptics to adopt open practices. It is bad enough when some sceptics use their professional position to paint a large number of people with the same brush (e.g. “replicators”, “terrorists”, “parasites”, etc). The last thing people whose goal is to improve science should do is to encapsulate and separate themselves from the larger scientific community by calling themselves things like “open science”.

So what does any of this have to do with “tone”? Nothing whatsoever – that’s my point. Are there people whose language could be more refined when criticising published scientific studies? Yes, no doubt there are. One of my first experiences with data sharing was when somebody sent me a rude one-line email asking for our data and spiced it up with a link to the journal’s data sharing policy which added a level of threat to their lack of tact. It was annoying and certainly didn’t endear them to me but I shared the data anyway, neither because of the tone of the email nor the journal’s policy but because it is the right thing to do. We can avoid that entire problem in the future by regularly publishing data (as far as ethically and practically feasible) with the publication or (even better) when submitting the manuscript for review.

Wouldn’t it be better if everyone were just kind and polite to one another and left their emotions out of it? Yes, no doubt it would be but we aren’t machines. You can’t remove the emotion from the human beings who do the science. All of human communication is plagued by emotions, misunderstandings, and failures of diplomacy. I have a friend and colleague who regularly asks questions at conference talks that come across as rather hostile and accusatory. Knowing the man asking the question I’m confident this is due to adrenaline rather than spite. This does not mean you can’t call out people for offending you – but at least initially they also deserve to be given the benefit of the doubt (see Hanlon’s Razor and, for that matter, the Presumption of Innocence).

Bad “tone” is also not exactly a new thing. If memory serves, a few years before many of us were even involved in science social media, a journal deemed it acceptable to publish a paper by one of my colleagues calling his esteemed colleagues’ arguments “gobbledygook”. Go back a few decades or centuries and you’ll find scientists arguing in the most colourful words and making all manner of snide remarks about one another. And of course, the same is true outside the world of science. Questions about the appropriate tone are as old as our species.

By all means, complain about the tone people use if you feel it is inappropriate but be warned that this frequently backfires. The same emotions that lead you to take offense to somebody’s tone (which may or may not be justified) may also cause them to take offense to you using bad “tone” as a defense. In many situations it often seems wiser to simply ignore that individual by filtering them out. If they somehow continue to break into your bubble and pester you, you may have a case of abuse and harassment and that’s a whole different beast, one that deserves to be slain. But honestly, it’s a free world so nobody can or should stop you from complaining about it. Sometimes a complaint is fully justified.

It is also true that we people on social media or post-publication peer review platforms can probably take a good hard look in the mirror and consider our behaviour. I have several colleagues who told me they avoid science twitter “because of all the assholes”. Nobody can force anyone to stop being an asshole but it is true that you may get further with other people when you don’t act like a dick around them. I also think that post-publication review and science in general could be a bit more forgiving. Mistakes and lack of knowledge are human and common and we can do a lot better at appreciating this. Someone once described the posts on RetractionWatch as “gleeful” and I think there is some truth to that. If we want to improve science we need to make it easier and socially acceptable to admit when you’re wrong. There have been some laudable efforts in that direction but we’re far from where we should be.

Last but not least, you don’t have to like snarky remarks. Nobody can force you to find Dr Primestein funny or to be thrilled when he generalises about all research in a particular field or even insinuates that it’s fraudulent. But again, satire and snark are as old as humanity. They should be taken with a grain of salt. I don’t find every joke funny. For instance, I find it incredibly tedious when people link every mention of Germans back to the Nazis. It’s a tired old trope but to be honest I don’t even find it particularly offensive – I certainly don’t feel the need to complain about it every bloody time. But the question of hilarity aside, satire can reveal some underlying truths and in my view there is something in Primestein’s message that people should take to heart. However, if he pisses you off and you’d rather leave him, that’s your unalienable right.

Whatever you do, just for the love of god don’t pretend that this has anything to do with “open science”! Primestein isn’t the open science spokesperson. Neither does a racist who uses open data reflect badly on the “movement”. The price of liberty is eternal vigilance. Freedom of speech isn’t wrong because it enables some people to say unacceptable things. Neither is open data bad because somebody might abuse it for their nasty agenda. And the truth is, they could have easily done the same with closed science. If somebody does bad science, you should criticise them and prove them wrong, even more so when they do it with some bad ulterior motive. If somebody is abusive or exploitative or behaving unethically, call them out, report them, sue them, get them arrested, depending on the severity of the case. Open science doesn’t have a problem with inclusivity because open science doesn’t exist. However, science definitely does have a problem with inclusivity and I think we should all work hard to improve that. Making science more open, both in terms of access to results and methods as well as who can join its community, is making science better. But by treating “open science” as some exclusive club inside science you are inadvertently creating barriers that did not need to exist in the first place.

And honestly, why and how should the “tone” of some people turn you off from using open practices? Is data sharing only a good cause when people are nice? Does a pre-registration become useless when someone snarkily dismisses your field? Is post-publication review worthless simply because some people are assholes? I don’t think so. If anything, more people adopting such practices would further normalise them and thus help equilibrate the entire field. Openness is not the problem but the solution.

At the nightly editorial board meeting

 

Strolling through the Garden of Forking Paths

The other day I got into another Twitter argument – for which I owe Richard Morey another drink – about preregistration of experimental designs before data collection. Now, as you may know, I have in the past had long debates with proponents of preregistration. Not really because I was against it per se but because I am a natural skeptic. It is still far too early to tell if the evidence supports the claim that preregistration improves the replicability and validity of published research. I also have an innate tendency to view any revolutionary proposals with suspicion. However, these long discussions have eased my worries and led me to revise my views on this issue. As Russ Poldrack put it nicely, preregistration no longer makes me nervous. I believe the theoretical case for preregistration is compelling. While solid empirical evidence for the positive and negative consequences of preregistration will only emerge over the course of the coming decades, this is not actually all that important. I seriously doubt that preregistration actually hurts scientific progress. At worst it has not much of an effect at all – but I am fairly confident that it will prove to be a positive development.

Curiously, largely due to the heroic efforts by one Christopher Chambers, a Sith Lord at my alma mater Cardiff University, I am now strongly in favor of the more radical form of preregistration, registered reports (RRs), where the hypothesis and design are first subjected to peer review, data collection only commences when the design has been accepted, and eventual publication is guaranteed if the registered plan was followed. In departmental discussions, a colleague of mine repeatedly voiced his doubts that RRs could ever become mainstream, because they are such a major effort. It is obvious that RRs are not ideal for all kinds of research and to my knowledge nobody claims otherwise. RRs are a lot of work that I wouldn’t invest in something like a short student project, in particular a psychophysics experiment. But I do think they should become the standard operating procedure for many larger, more expensive projects. We already have project presentations at our imaging facility where we discuss new projects and make suggestions on the proposed design. RRs are simply a way to take this concept into the 21st century and the age of transparent research. They can also improve the detail or quality of the feedback: most people at our project presentations will not be experts on the proposed research while peer reviewers at least are supposed to be. And, perhaps most important, RRs ensure that someone actually compares the proposed design to what was carried out eventually.

When RRs are infeasible or impractical, there is always the option of using light preregistration, in which you only state your hypothesis and experimental plans and upload this to OSF or a similar repository. I have done so twice now (although one is still in the draft stage and therefore not yet public). I would strongly encourage people to at least give that a try. If a detailed preregistration document is too much effort (it can be a lot of work although it should save you work when writing up your methods later on), there is even the option for very basic registration. The best format invariably depends on your particular research question. Such basic preregistrations can add transparency to the distinction between exploratory and confirmatory results because you have a public record of your prior predictions. Primarily, I think they are extremely useful to you, the researcher, as they allow you to check how directly you navigated the Garden of Forking Paths. Nobody stops you from taking a turn here or there. Maybe this is my OCD speaking, but I think you should always peek down some of the paths at least, simply as a robustness check. But the preregistration makes it less likely that you fool yourself. It is surprisingly easy to start believing that you took a straight path and forget about all the dead ends along the way.

This for me is really the main point of preregistration and RRs. I think a lot of the early discussion of this concept, and a lot of the opposition to it, stems from the implicit or even explicit accusation that nobody can be trusted. I can totally understand why this fails to win the hearts and minds of many people. However, it’s also clear that questionable research practices and deliberate p-hacking have been rampant. Moreover, unconscious p-hacking due to analytical flexibility almost certainly affects many findings. There are a lot of variables here and so I’d wager that most of the scientific literature is actually only mildly skewed by that. But that is not the point. Rather, as scientists, especially those who study cognitive and mental processes of all things, shouldn’t we want to minimize our own cognitive biases and human errors that could lead us astray? Instead of the rather negative “data police” narrative that you often hear, this is exactly what preregistration is about. And so I think first and foremost a basic preregistration is only for yourself.
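
How much analytical flexibility can matter is easy to illustrate with a small simulation. What follows is a minimal sketch under simplifying assumptions that are not drawn from any real study: there is no true effect at all, each “path” is an independent outcome measure compared between two groups with a z-test (known unit variance), and the researcher reports whichever comparison gives the smallest p-value. The function names and parameter values are hypothetical.

```python
import math
import random


def two_sample_p(a, b):
    """Two-sided p-value of a z-test on the difference in means.

    Assumes both groups have a known standard deviation of 1.
    """
    n_a, n_b = len(a), len(b)
    diff = sum(a) / n_a - sum(b) / n_b
    se = math.sqrt(1.0 / n_a + 1.0 / n_b)
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(n_paths, n_per_group=20, n_sims=2000,
                        alpha=0.05, seed=1):
    """Fraction of null experiments where the best of n_paths tests is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_paths):
            # Both groups come from the same distribution: no real effect.
            a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            p_values.append(two_sample_p(a, b))
        # The forking-paths move: keep only the most favourable analysis.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims


for paths in (1, 3, 5):
    print(paths, "path(s):", false_positive_rate(paths))
```

With independent tests the expected rate is 1 − 0.95^k, so roughly 5% for a single planned analysis and about 23% for five paths. Real forking paths are usually correlated rather than independent, so the inflation in practice will often be smaller than in this idealised case, but the direction is the same. No ill intent is required, only the freedom to choose after seeing the data.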

When I say such a basic preregistration is for yourself, this does not necessarily mean it cannot also be interesting to others. But I do believe its usefulness to other people is limited and should not be overstated. As with many of the changes brought on by open science, we must remain skeptical of any unproven claims of their benefits and keep in mind potential dangers. The way I see it, most (all?) public proponents of either form of preregistration are fully aware of this. I think the danger really concerns the wider community. I occasionally see anonymous or sock-puppet accounts popping up in online comment sections espousing a very radical view that only preregistered research can be trusted. Here is why this disturbs me:

1. “I’ll just get some fresh air in the garden …”

Preregistered methods can only be as good as the detail they provide. A preregistration can be so vague that you cannot make heads or tails of it. The basic OSF-style registrations (e.g. the AsPredicted format) may be particularly prone to this problem but it could even be the case when you wrote a long design document. In essence, this is just saying you’ll take a stroll in the hedge maze without giving any indication whatsoever which paths you will take.

2. “I don’t care if the exit is right there!”

Preregistration doesn’t mean that your predictions make any sense or that there isn’t a better way to answer the research question. Often such things will only be revealed once the experiment is under way or completed and I’d actually hazard the guess that this is usually the case. Part of the beauty of preregistration is that it demonstrates to everyone (including yourself!) how many things you probably didn’t think of before starting the study. But it should never be used as an excuse not to try something unregistered when there are good scientific reasons to do so. This would be the equivalent of taking one predetermined path through the maze and then getting stuck in a dead end – in plain sight of the exit.

3. “Since I didn’t watch you, you must have chosen forking paths!”

Just because someone didn’t preregister their experiment does not mean their experiment was not confirmatory. Exploratory research is actually undervalued in the current system. A lot of research is written up as if it were confirmatory even if it wasn’t. Ironically, critics of preregistration often suggest that it devalues exploratory research but it actually places greater value on it because you are no longer incentivized to hide it. But nevertheless, confirmatory research does happen even without preregistration. It doesn’t become any less confirmatory because the authors didn’t tell you about it. I’m all in favor of constructive skepticism. If a result seems so surprising or implausible that you find it hard to swallow, by all means scrutinize it closely and/or carry out an (ideally preregistered) attempt to replicate it. But astoundingly, even people who don’t believe in open science sometimes do good science. When a tree falls in the garden and nobody is there to hear it, it still makes a sound.

Late September when the forks are in bloom

Obviously, RRs are not completely immune to these problems either. Present day peer review frequently fails to spot even glaring errors, so it is inevitable that it will also make mistakes in the RR situation. Moreover, there are additional problems with RRs, such as the fact that they require an observant and dedicated editor. This may not be so problematic while RR editors are strong proponents of RRs but if this concept becomes more widespread this will not always be the case. It remains to be seen how that works out. However, I think on the whole the RR concept is a reasonably good guarantee that hypotheses and designs are scrutinized, and that results are published, independent of the final outcome. The way I see it, both of these are fundamental improvements over the way we have been doing science so far.

But I’d definitely be very careful not to over-interpret the fact that a study is preregistered, especially when it isn’t an RR. Those badges they put on Psych Science articles may be a good incentive for people to embrace open science practices but I’m very skeptical of anyone who implies that a study is more trustworthy just because it was preregistered, or because it shares data and materials. Because it simply isn’t. It lulls you into a false sense of security and I thought the intention here was not to fool ourselves so much any more. A recent case of data being manipulated after it was uploaded demonstrates how misleading an open data badge can be. In the same vein, just because an experiment is preregistered does not mean the authors didn’t lead us (and themselves) down the garden path. There have also been cases of preregistered studies that then did not actually report the outcomes of their intended analyses.

So, preregistration only means that you can read what the authors said they would do and then check for yourself how this compares to what they did do. That’s great because it’s transparent. But unless you actually do this check, you should treat the findings with the same skepticism (and the authors with the same courtesy and respect) as you would those of any other, non-registered study.

Sometimes it is really not that hard to find your way through the garden…