In my previous post, I talked about why I think all properly conducted research should be published. Null results are important. The larger scientific community needs to know whether or not a particular hypothesis has been tested before. Otherwise you may end up wasting somebody’s time because they repeatedly try in vain to answer the same question. What is worse, we may also propagate false positives through the scientific record because failed replications are often still not published. All of this contributes to poor replicability of scientific findings.
However, the emphasis here is on ‘properly conducted research‘. I already discussed this briefly in my post but it also became the topic of an exchange between (for the most part) Brad Wyble, Daniël Lakens, and myself. In some fields, for example psychophysics, extensive piloting, and “fine-tuning” of experiments is not only very common but probably also necessary. To me it doesn’t seem sensible to make the results of all of these attempts publicly available. This inevitably floods the scientific record with garbage. Most likely nobody will look at it. Even if you are a master at documenting your work, nobody but you (and after a few months maybe not even you) will understand what is in your archive.
Most importantly, it can actually be extremely misleading for others who are less familiar with the experiment to see all of the tests you did ensuring the task was actually doable, that monitors were at the correct distance from the participant, your stereoscope was properly aligned, the luminance of the stimuli was correct, that the masking procedure was effective, etc. Often you may only realise during your piloting that the beautiful stimulus you designed after much theoretical deliberation doesn’t really work in practice. For example, you may inadvertently induce an illusory percept that alters how participants respond in the task. This in fact happened recently with an experiment a collaborator of mine piloted. And more often than not, after having tested a particular task on myself at great length I then discover that it is far too difficult for anyone else (let’s talk about overtrained psychophysicists another time…).
Such pilot results are not very meaningful
It most certainly would not be justified to include them in a meta-analysis to quantify the effect – because they presumably don’t even measure the same effect (or at least not very reliably). A standardised effect size, like Cohen’s d, is a signal-to-noise ratio as it compares an effect (e.g. difference in group means) to the variability of the sample. The variability is inevitably larger if a lot of noisy, artifactual, and quite likely erroneous data are included. While some degree of this can be accounted for in meta-analysis by using a random-effects model, it simply doesn’t make sense to include bad data. We are not interested in the meta-effect, that is, the average result over all possible experimental designs we can dream up, no matter how inadequate.
What we are actually interested in is some biological effect and we should ensure that we take the most precise measurement as possible. Once you have a procedure that you are confident will yield precise measurements, by all means, carry out a confirmatory experiment. Replicate it several times, especially if it’s not an obvious effect. Pre-register your design if you feel you should. Maximise statistical power by testing many subjects if necessary (although often significance is tested on a subject-by-subject basis, so massive sample sizes are really overkill as you can treat each participant as a replication – I’ll talk about replication in a future post so I’ll leave it at this for now). But before you do all this you usually have to fine-tune an experiment, at least if it is a novel problem.
Isn’t this contributing to the problem?
Colleagues in social/personality psychology often seem to be puzzled and even concerned by this. The opacity of what has or hasn’t been tried is part of the problems that plague the field and lead to publication bias. There is now a whole industry meta-analysing results in the literature to quantify ‘excess significance’ or a ‘replication index’. This aims to reveal whether some additional results, especially null results, may have been suppressed or if p-hacking was employed. Don’t these pilot experiments count as suppressed studies or p-hacking?
No, at least not if this is done properly. The criteria you use to design your study must of course be orthogonal to and independent from your hypothesis. Publication bias, p-hacking, and other questionable practices are all actually sub-forms of circular reasoning: You must never use the results of your experiment to inform the design as you may end up chasing (overfitting) ghosts in your data. Of course, you must not run 2-3 subjects on an experiment, look at the results and say ‘The hypothesis wasn’t confirmed. Let’s tweak a parameter and start over.’ This would indeed be p-hacking (or rather ‘result hacking’ – there are usually no p-values at this stage).
A real example
I can mainly speak from my own experience but typically the criteria used to set up psychophysics experiments are sanity/quality checks. Look for example at the figure below, which shows a psychometric curve of one participant. The experiment was a 2AFC task using the method of constant stimuli: In each trial the participant made a perceptual judgement on two stimuli, one of which (the ‘test’) could vary physically while the other remained constant (the ‘reference’). The x-axis plots how different the two stimuli were, so 0 (the dashed grey vertical line) means they were identical. To the left or right of this line the correct choice would be the reference or test stimulus, respectively. The y-axis plots the percentage of trials the participant chose the test stimulus. By fitting a curve to these data we can extrapolate the ability of the participant to tell apart the stimuli – quantified by how steep the curve is – and also their bias, that is at what level of x the two stimuli appeared identical to them (dotted red vertical line):
As you can tell, this subject was quite proficient at discriminating the stimuli because the curve is rather steep. At many stimulus levels the performance is close to perfect (that is, either near 0 or 100%). There is a point where performance is at chance (dashed grey horizontal line). But once you move to the left or the right of this point performance becomes good very fast. The curve is however also shifted considerably to the right of zero, indicating that the participant indeed had a perceptual bias. We quantify this horizontal shift to infer the bias. This does not necessarily tell us the source of this bias (there is a lot of literature dealing with that question) but that’s beside the point – it clearly measures something reliably. Now look at this psychometric curve instead:
The general conventions here are the same but these results are from a completely different experiment that clearly had problems. This participant did not make correct choices very often as the curve only barely goes below the chance line – they chose the test stimulus far too often. There could be numerous reasons for this. Maybe they didn’t pay attention and simply made the same choice most of the time. For that the trend is bit too clean though. Perhaps the task was too hard for them, maybe because the stimulus presentation was too brief. This is possible although it is very unlikely that a healthy, young adult with normal vision would not be able to tell apart the more extreme stimulus levels with high accuracy. Most likely, the participant did not really understand the task instructions or perhaps the stimuli created some unforeseen effect (like the illusion I mentioned before) that actually altered what percept they were judging. Whatever the reason, there is no correct way to extrapolate the psychometric parameters here. The horizontal shift and the slope are completely unusable. We see an implausibly poor discrimination performance and extremely large perceptual bias. If their vision really worked this way, they should be severely impaired…
So these data are garbage. It makes no sense to meta-analyse biologically implausible parameter estimates. We have no idea what the participant was doing here and thus we can also have no idea what effect we are measuring. Now this particular example is actually a participant a student ran as part of their project. If you did this pilot experiment on yourself (or a colleague) you might have worked out what the reason for the poor performance was.
What can we do about it?
In my view, it is entirely justified to exclude such data from our publicly shared data repositories. It would be a major hassle to document all these iterations. And what is worse, it would obfuscate the results for anyone looking at the archive. If I look at a data set and see a whole string of brief attempts from a handful of subjects (usually just the main author), I could be forgiven for thinking that something dubious is going on here. However, in most cases this would be unjustified and a complete waste of everybody’s time.
At the same time, however, I also believe in transparency. Unfortunately, some people do engage in result-hacking and iteratively enhance their findings by making the experimental design contingent on the results. In most such cases this is probably not done deliberately and with malicious intent – but that doesn’t make it any less questionable. All too often people like to fiddle with their experimental design while the actual data collection is already underway. In my experience this tendency is particularly severe among psychophysicists who moved into neuroimaging where this is a really terrible (and costly) idea.
How can we reconcile these issues? In my mind, the best way is perhaps to document briefly what you did to refine the experimental design. We honestly don’t need or want to see all the failed attempts at setting up an experiment but it could certainly be useful to have an account of how the design was chosen. What experimental parameters were varied? How and why were they chosen? How many pilot participants were there? This last point is particularly telling. When I pilot something, there usually is one subject: Sam. Possibly I will have also tested one or two others, usually lab members, to see if my familiarity with the design influences my results. Only if the design passes quality assurance, say by producing clear psychometric curves or by showing to-be-expected results in a sanity check (e.g., the expected response on catch trials), I would dare to actually subject “real” people to a novel design. Having some record, even if as part of the documentation of your data set, is certainly a good idea though.
The number of participants and pilot experiments can also help you judge the quality of the design. Such “fine-tuning” and tweaking of parameters isn’t always necessary – in fact most designs we use are actually straight-up replications of previous ones (perhaps with an added condition). I would say though that in my field this is a very normal thing to do when setting up a new design at least. However, I have also heard of extreme cases that I find fishy. (I will spare you the details and will refrain from naming anyone). For example in one study the experimenters ran over a 100 pilot participants – tweaking the design all along the way – to identify those that showed a particular perceptual effect and then used literally a handful of these for an fMRI study that claims to have been about “normal” human brain function. Clearly, this isn’t alright. But this also cannot possibly count as piloting anymore. The way I see it, a pilot experiment can’t have an order of magnitude more data than the actual experiment…
How does this relate to the wider debate?
I don’t know how applicable these points are to social psychology research. I am not a social psychologist and my main knowledge about their experiments are from reading particularly controversial studies or the discussions about them on social media. I guess that some of these issues do apply but that it is far less common. An equivalent situation to what I describe here would be that you redesign your questionnaire because it people always score at maximum – and by ‘people’ I mean the lead author :P. I don’t think this is a realistic situation in social psychology, but it is exactly how psychophysical experiments work. Basically, what we do in piloting is what a chemist would do when they are calibrating their scales or cleaning their test tubes.
Or here’s another analogy using a famous controversial social psychology finding we discussed previously: Assume you want to test whether some stimulus makes people walk more slowly as they leave the lab. What I do in my pilot experiments is to ensure that the measurement I take of their walking speed is robust. This could involve measuring the walking time for a number of people before actually doing any experiment. It could also involve setting up sensors to automate this measurement (more automation is always good to remove human bias but of course this procedure needs to be tested too!). I assume – or I certainly hope so at least – that the authors of these social psychology studies did such pre-experiment testing that was not reported in their publications.
As I said before, humans are dirty test tubes. But you should ensure that you get them as clean as you can before you pour in your hypothesis. Perhaps a lot of this falls under methods we don’t report. I’m all for reducing this. Methods sections frequently lack necessary detail. But to some extend, I think some unreported methods and tests are unavoidable.