Uncontrollable hidden variables in recruiting participants

Here is another post on experimental design and data analyses. I need to get this stuff out of my system before I let the domain on this blog expire at the end of the year (I had planned this for last year already but then decided to keep it going because of a certain difficult post I had to write then…)

Hidden group differences?

This (hopefully brief) post is inspired by a Twitter discussion I had last night. These people had a journal club about one of my lab’s recent publications. The details of that are not really important for this post – you can read their excellent questions about our work and my replies to them in the tweet thread. However, what this discussion reminded me of are the issues you can run into when dealing with human volunteer participants: issues that you have no control over and – what is worse – may not even be aware of.

In this particular study, we compared retinotopic maps from groups of identical (MZ) and fraternal (DZ) twin pairs. One very notable point when you read our article is that the sample sizes for the two groups are quite different, with more MZ twin pairs than DZ pairs. We had some major difficulties finding DZ twins to take part and what made matters worse is that we had to reclassify several purported DZ twins as MZ twins after genetic testing. Looking at the literature, this seems quite common. For example, we found that in the Human Connectome Project there is a similar imbalance in the sample sizes (see for instance this preprint that also looked at retinotopic maps in twins at a more macroscopic level). A colleague of ours working on another twin study experienced the same problem (I don’t think this study has been published yet). Finally, here is just one more example of a vision science study with a substantially larger sample for MZ than DZ twins.

There are clearly problems recruiting DZ twins. Undoubtedly MZ twins are more “special”, and so there are organisations through which they can be reached. While there are participant pools for twins that contain both zygosities, the people managing these can be rather protective. This is understandable because these databases are a valuable scientific resource and they don’t want to tire out their participants by allowing too many researchers to approach them with requests to participate. These pools of participants may also be imbalanced because MZ twins self-select into them because they have a strong interest in learning how similar they are. In contrast, DZ twins may have less interest in this question (although some obviously do). And even if you have a well-balanced pool of potential participants, there may be additional social factors at play. The MZ twins in those pools may be keener to take part than the DZ twins. Of course, the zygosity may also interact in hidden ways with this. MZ twins might have a closer relationship to one another, even if just by living geographically closer together, and that will doubtless affect how easy it is for them to participate in your study. All these issues are extremely difficult to know about, let alone to control.

Not about twins

As I said, the details of our study aren’t really important and this post isn’t about twins. Rather, this is clearly a broader issue. Similar concerns affect any comparison between groups of participants. Anyone studying patients with a particular condition is probably familiar with that issue. Many patients are keen to take part in studies because they have an interest in better understanding their condition or – for some disorders or illnesses – contributing to the development of treatments. In contrast, recruiting the “matched” control participants can be very difficult and you may go to great and unusual lengths to find them. This can result in your control group being quite unusual compared to the standard participant sample you might have when you do fundamental research, especially considering a lot of such research is done on young undergraduate students recruited on university campuses.

Let’s imagine for example that we want to understand visual processing by professional basketball players in the NBA. A quick Googling suggests the average body height of NBA players is 1.98 m, considerably taller than the average male. Any comparison that does not take this into account would confound body height with basketball skill. Obviously you can control aspects like this to some extent by using covariates (e.g. in multiple regression analyses) – but that is only for the variables you know about. More importantly, you’d be well-advised to recruit a matched control group that has a similar body height to your basketball players but without the athletic skills. That way you cancel out the effect of body height.
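To make the covariate idea concrete, here is a minimal sketch using only the Python standard library. All numbers are hypothetical assumptions (control height of 178 cm, a spread of 7 cm, the size of the height effect, 500 people per group), and the outcome is constructed to depend on height only, with no true group effect. The height-adjusted group coefficient is obtained by regressing out height from both the group label and the outcome, which gives the same coefficient as a multiple regression with height as covariate.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (model includes an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after regressing out x (with an intercept)."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

rng = random.Random(0)
n = 500  # hypothetical number of participants per group

# Hypothetical heights in cm: controls first, then players.
group = [0] * n + [1] * n
height = ([rng.gauss(178, 7) for _ in range(n)]
          + [rng.gauss(198, 7) for _ in range(n)])

# The outcome depends on height only: there is NO true group effect.
outcome = [0.5 * h + rng.gauss(0, 5) for h in height]

# Naive comparison: difference between the two group means.
naive = sum(outcome[n:]) / n - sum(outcome[:n]) / n

# Adjusted comparison: group coefficient with height as covariate.
adjusted = slope(residuals(height, group), residuals(height, outcome))

print(f"naive group difference:     {naive:.2f}")
print(f"height-adjusted difference: {adjusted:.2f}")
```

The naive difference is large even though skill plays no role in this toy world, while the adjusted estimate sits near zero. This only works because height was measured; a variable nobody recorded cannot be entered into the model.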

But how does this recruitment drive interact with your samples? For one thing, it will probably be difficult to find these tall controls. While most NBA players are very tall (even short NBA players are presumably above average height), really tall people in the general population are rare. So finding them may take a long time. But what is worse, the ones you do find may also differ in other respects from your average person. For body height, this may not be too problematic but you never know what issues a very tall person faces who doesn’t happen to have a multi-million dollar athletic contract.

These issues can be quite nefarious. For instance, I was involved in a study a few years ago where we had to recruit control participants matched to our main group of interest both in terms of demographic details and psychological measures. What we ended up with was a lot of exclusions of potential control participants due to drug use, tattoos or metal implants (a safety hazard), and in one case an undisclosed medical history we only discovered serendipitously. The rationale for selecting participants with particular matched traits from the general population is based on the assumption that these traits are random – however, this fails if there is some hidden association between that trait and other confounding factors. In essence, this is just another form of selection bias that I have written about recently…

The problem is there is simply no good way to control for that. You cannot use a variable as covariate when you don’t know it exists. This means that particular variable simply becomes part of the noise, the variance not explained by your model. It is entirely possible that this noise masquerades as a difference that doesn’t really exist (Type I error) or obscures true effects (Type II error). You can and should obviously check for potential caveats and thus establish the robustness of the findings but that can only go so far.

Small N designs

This brings me back to another one of my pet issues: small N designs, as are common in psychophysics. Some psychophysics experiments have as few as two participants, both of whom are also authors of the publication. It is debatable how valid this extreme might be – one of my personal heuristics is that you should always include some “naive” observers (or at least one?) to show that results do not crucially depend on knowledge of the hypothesis. But these designs can nevertheless be valid. Many experiments are actually difficult for the participant to influence through mere willpower alone. I’ve done some experiments on myself where I thought I was responding a certain way only to find the results didn’t reflect this intuition at all.

And there is definitely something to be said about having trained observers. I’ve covered this topic several times before so I won’t go into detail on this. But it doesn’t really make sense to contaminate your results with bad data. A lot of psychophysical experiments require steady eye gaze to ensure that stimuli are presented at the parafoveal and peripheral locations you want to test. It doesn’t make much sense to include participants who cannot maintain fixation. (On that note, it is interesting that some results can actually be surprisingly robust even in the presence of considerable eye movements – such as what we found in this study. This opens up a number of questions as to what those results mean but I have not yet figured out a good way to answer them…).

This is quite different from your typical psychology experiment. Imagine you want to test (bear with me here) how fast your participants walk down the corridor after leaving your lab cubicle where you had them do some task with words… While there may be some justified reasons for exclusion of participants (such as that they obviously didn’t comply with your task instructions or failed to understand the words or that they got an urgent phone call that caused them to sprint down the hall), there is no such thing as a “trained observer” here. You want to make an inference about how the average person reacts to your experimental manipulation. Therefore you need to use a statistical approach that tests the group average. We don’t want only people who are well trained at walking down corridors.

In contrast, in threshold psychophysics you don’t care about the “average person” but rather you want to know what the threshold performance is after all that other human noise – say inattention, hand-eye coordination, fixation instability, uncorrected refractive error, mind-wandering, etc – has been excluded. Your research question is what is the just noticeable difference in stimuli under optimal conditions, not what is the just noticeable difference when distracted by thoughts about dinner or your inability to press the right button at the right time. A related (and more insidious) issue is also introspection. One could make the argument that many trained observers are also better at judging the contents of their perceptual awareness than someone you recruited off the street (or your student participant pool). A trained observer may be quite adept at saying that the grating you showed appeared to them tilted slightly to the left – your Average Jo(sephine) may simply say that they noticed no difference. (This could in part be quantified by differences in response criterion but that is not guaranteed to work).

Taken together, the problem here is not with the small N approach – it is doubtless justified in many situations. Rather I wonder how to decide when it is justified. The cases described above seem fairly obvious but in many situations things can be more complicated. And to return to the main topic of this post, there could be insidious interactions between finding the right observers and your results. If I need trained observers for a particular experiment but I also want to find some who are naive to the purpose of the experiment, my inclusion criteria may bias the participants I end up with (this usually means your participants are all members of your department :P). For many purposes these biases may not matter. In some cases they probably do – for instance reports that visual illusions differ considerably in different populations. Ideally you want trained observers from all the groups you are comparing in this case.

What is the right way to clean your data?

If you’re on cog neuro twitter, you may have already come across an ongoing debate about a Verification Report by Chalkia et al. published in Cortex in 2020. A Verification Report is a new format at Cortex in which researchers can publish their attempts at reproducing the results of previously published research, in this case an influential study by Schiller et al., published in 2010 in Nature. This caused a heated debate between them, the original authors, and also the handling editors at Cortex, which is still ongoing. While I am an editor at Cortex and work closely with the editors handling this particular case, I was not involved with that process. The particular research in question is outside my immediate expertise and there are a lot of gory details that I am ill-prepared to discuss. There is also room for yet another debate on how such scientific debates should be conducted – I don’t want to get into any of that here either.

However, this case reminded me of considerations I’ve had for quite some time. Much of the debate on this case revolves around the criteria by which data from participants were excluded in the original study. A student of mine has also been struggling in the past few months with the issue of data cleaning – specifically removing outlier data that clearly result from artifacts which would undeniably contaminate the results and lead us to draw undeniably false conclusions.

Data cleaning and artifact exclusion are important. You wouldn’t draw a star chart using a dirty telescope – otherwise that celestial object might just be a speck of dust on the lens. Good science involves checking that your tools work and removing data when things go wrong (and go wrong they inevitably will, even with the best efforts to maintain your equipment and ensure high data quality). In visual fMRI studies, some of the major measurement artifacts result from excessive head motion or poor fixation compliance. In psychophysical experiments, a lot depends on the reliability of participants at doing the tasks (some of which can be quite arduous), also about maintaining stable eye gaze, and even at introspecting about their perceptual experience. In EEG experiments, poor conductance in some electrodes may produce artifacts, and so on.

So it is obvious that sometimes data must be removed from a data set before we can make any inferences about our research question. The real question is what is the right way to go about doing that. An obvious answer is that these criteria must be defined a priori, before any data collection took place. In theory, this is where piloting is justified and could inform the range of parameters to be used in the actual experiments. A Registered Report would allow researchers to prespecify these criteria, vetted by independent reviewers. However, even long before anybody was talking about preregistration, researchers knew that data exclusion criteria should be defined upfront (it was literally one of the first things my PhD supervisor taught me).

Unfortunately, in truth this is not realistic. Your perfectly thought out criteria may simply be inappropriate in the face of real data. Your pilot experiment is likely “cleaner” than the final data set because you probably have a limited pool of participants to work with. No level of scrutiny by reviewers and editors of a Registered Report submission can foresee all eventualities, and in fact reviewers and editors sometimes fail miserably at predicting what will happen (I wonder how far this speaks to the debates I’ve had in the past about the possibility of precognition :P).

An example from psychophysics

So what do you do then? In some cases the decision could be obvious. Using an example I used in a previous post (that I’m too lazy to dig up as it’s deep in the history of this blog), imagine I am running a psychophysics experiment. The resulting plots of participants should look something like this:

The x-axis is the stimulus level we presented to the participant, and the y-axis is their behavioural choice. But don’t worry about the details. The point is that the curve should have a sigmoidal shape. It might vary in terms of slope and where the 50% threshold (red dotted line) is but it generally should look similar to this plot. Now imagine one participant comes out like this:

This is obviously rubbish. Even if the choice of what is “obviously test/reference” is not really as obvious to the random participant as it is to the experimenter (and this happens often…), the curve should nonetheless have a general sigmoid shape and range from closer to 0% on the left to closer to 100% on the right. It most certainly should not start on the left at 50% and rise. I don’t know what happened in this particular case here (in fact I don’t even recall which experiment I took this from) but the most generous explanation is that the participant misunderstood the task instructions. Perhaps a more likely scenario is that they were simply bored and didn’t do the task at all and pressed response buttons at random. While this doesn’t explain why the curve is still rising to the right, that could perhaps be because they did the task to begin with – performing adequately on a few trials where “obviously test” was the correct response – but then gave up on it after a while.

Outlier removal and robust statistics

It doesn’t matter. Whatever the reason, this data set is clearly unusable. A simple visual inspection shows this but this is obviously subjective. This is a pretty clear example but where should we draw the line between a bad data set and a good one? Interestingly, a more objective, quantitative way to address this, the goodness of the sigmoidal curve fit, would be entirely inappropriate. The goodness-of-fit here may not be perfect but it’s clearly still a pretty close description of the data. No, our decision to reject this data set must be contingent on the outcome variables, the parameter fits of the sigmoid (the slope and threshold). But outcome contingent decisions are precisely one of those cardinal sins of analytic flexibility that we are supposed to avoid.

There are quantitative ways to clean such parameters posthoc by defining a criterion for outliers. Often this will take the form of using the dispersion of the data (e.g., standard deviation, median absolute deviation, or some such measure) to reject values that fall a certain number of dispersion units from the group mean/median – for instance, people might remove all participants whose values fall more than 2 SDs from the mean. This can be reasonable in some situations because it effectively trims the distribution of your values. However, there is ample flexibility where you set this criterion and looking through the literature you will probably find anything from 1-3 SDs being used without any justification. Moreover, because your measure of dispersion is inherently related to how the data are distributed it is obviously affected by the outlier. A few very extreme and influential outliers can therefore mean that some obviously artifactual data are not flagged as outliers. To address this issue, several years ago we proposed to use bootstrapping to estimate the dispersion using a measure we called Shepherd’s Pi (the snark was strong with me then…).
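The masking problem is easy to demonstrate with a toy data set. The sketch below is not the bootstrapped Shepherd’s Pi approach; it only contrasts a plain 2 SD criterion with a criterion based on the median absolute deviation (MAD). The values are invented for illustration, the function names are hypothetical, and the cutoff `k=2.0` is an arbitrary choice of exactly the kind criticised above.

```python
import statistics

def sd_outliers(values, k=2.0):
    """Flag values more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def mad_outliers(values, k=2.0):
    """Flag values more than k scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 makes the MAD comparable to an SD for normally distributed data.
    scaled_mad = 1.4826 * mad
    return [v for v in values if abs(v - med) > k * scaled_mad]

# Invented thresholds: ten plausible values, one moderate artifact (25.0)
# and one extreme artifact (500.0).
values = [9.1, 9.8, 10.2, 10.5, 9.6, 10.9, 10.0, 9.4, 10.7, 10.3, 25.0, 500.0]

print("SD criterion flags: ", sd_outliers(values))
print("MAD criterion flags:", mad_outliers(values))
```

With these numbers the extreme value inflates the mean and the SD so much that the moderate artifact slips through the SD criterion, whereas the median-based criterion flags both.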

In general, there are of course robust and/or non-parametric statistical tests that can help you make inferences on cases like this. While the details differ, what they have in common is that they are more robust in the face of certain kinds of outliers or in unusual data distributions. Some of them are so fierce that they will remove the majority of your observations – which is clearly absurd, suggesting that the test may be ill-suited for this particular situation at least. There is a whole statistical literature on the development of these tests and comparing their performance. To anyone but a stats aficionado, this is very tedious and full of names and Greek letters (hence our snark…). What these kinds of robust statistics (and more arbitrary data cleaning methods like a dispersion criterion) are good for is robustness checks, a type of multiverse analysis where you compare the outcome of a range of tests to see if your results depend strongly on the particular test used. It seems sensible to include these checks, provided they are done transparently. There are obviously also situations where you might want to prespecify a particular test in a Registered Report: when you expect a lot of outliers in a correlation, for example, you might preregister upfront that you will use Spearman’s correlation instead of the standard Pearson’s. You might even go so far as to adopt the use of a particular robust test a priori, even though it means you are sacrificing statistical power for improved robustness.
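As a small illustration of such a robustness check, the sketch below computes Pearson’s and Spearman’s correlation on the same invented data, in which nine unremarkable points are joined by one extreme observation. It uses only the standard library; the helper names are hypothetical and the rank function assigns average ranks to ties.

```python
import math

def pearson(x, y):
    """Pearson's product-moment correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented data: the last point is an extreme observation on both axes.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [5, 3, 6, 2, 7, 4, 6, 3, 5, 40]

print(f"Pearson r:    {pearson(x, y):.2f}")
print(f"Spearman rho: {spearman(x, y):.2f}")
```

If the two coefficients disagree as strongly as they do here, the result hinges on a handful of observations, which is worth reporting either way.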

(Im-)practicalities and a way forward?

But the real issue is that this is clearly not a solution for the actual problem. In our example above, the blue data set is crap. Whatever your statistical criteria or robust statistical tests tell you, the blue data set should never enter our inference in the first place. Again, this is still a rather extreme example. It isn’t hard to justify posthoc why the curve shouldn’t start at 50% on the left – it violates the entire concept of the experiment. But most real situations are not so obvious. Sometimes we can make the argument that if participants perform at considerably less than ceiling/floor level for the “obvious” stimulus levels (left and right side of the plot in my example), that constitutes valid grounds for exclusion. We had this in a student project a few years ago where some participants performed at levels that – if they had done the task properly – really could only be because they were legally blind (which they obviously weren’t). It seems justified to exclude those participants, but it probably also underlines issues with the experiment: participants either had a tendency to misunderstand the task instructions, the task might have been too difficult for them, or the experimenter didn’t ensure that they did the experiment properly. If there are a lot of such participants, this flags up a problem that you probably want to fix in future experiments. In the worst case, it casts serious doubt on the whole experiment because if there are such problems for a lot of people how confident can we be that it isn’t affecting the “good” data sets, albeit in more subtle ways?
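An exclusion rule of this kind can be written down before looking at the outcome parameters. The sketch below is one hypothetical way to do that: it checks only the two “obvious” stimulus levels at either end of the range. The function name, the cutoffs of 0.25 and 0.75 and both example participants are assumptions made up for illustration, not values from any of the experiments mentioned here.

```python
def passes_anchor_check(prop_test, floor_max=0.25, ceiling_min=0.75):
    """Check performance at the two easiest stimulus levels.

    prop_test holds the proportion of "test" choices at each stimulus
    level, ordered from the lowest to the highest level. The cutoffs
    are hypothetical and would need to be justified for a real task.
    """
    return prop_test[0] <= floor_max and prop_test[-1] >= ceiling_min

# Invented participants: one sigmoid-like, one starting at 50% on the left.
sigmoid_like = [0.02, 0.10, 0.35, 0.65, 0.90, 0.98]
starts_at_chance = [0.50, 0.52, 0.55, 0.60, 0.70, 0.80]

print("sigmoid-like participant kept:    ", passes_anchor_check(sigmoid_like))
print("starts-at-chance participant kept:", passes_anchor_check(starts_at_chance))
```

A rule like this does not depend on the fitted slope or threshold, so it avoids part of the outcome-contingency problem, but the cutoffs remain a judgment call that should be reported along with the number of participants they remove.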

Perhaps the best way to deal with such problems is in the context of a Registered Report. This may seem counterintuitive. There have been numerous debates on how preregistration may stifle the necessary flexibility to deal with real data like this (I have been guilty of perpetuating these discussions myself for a time). This view still prevails in some corners. But actually the exact opposite is true. From my own (admittedly still limited) experience editing Registered Reports, I can say that it is not uncommon for authors to request alterations after they started data collection. In most cases, these are such minor changes that I wouldn’t bother the original reviewers with them. They can simply be addressed by a footnote in the final submission. But of course in cases like our example here, where your dream of the preregistered experiment crashes against the rocky shores of real data, it may necessitate sending an alteration back to the reviewers who approved the preregistered design. There are valid scientific debates about which data should be included and excluded and there will inevitably be differences in opinion. That is completely fine. The point is that in this scenario you have a transparent record of the changes and why they were made, and you have independent experts weighing these arguments before the final outcome is known. Of course, a more practical solution could be to simply include the new outlier criterion as an additional, exploratory analysis in your final manuscript alongside the one you originally preregistered. However, in a situation like my example here this seems ill-advised: if the preregistered approach contains data that obviously shouldn’t be included, it isn’t worth very much. Some RR editors may take a stricter view on this, but I’d rather have an adapted design where the reasons for changes are clearly justified and transparent and anyone can check the effects of these changes for themselves with the published data sets.

Transparency is really key. A Registered Report will provide some safety nets in terms of ensuring people declare the changes that were made and that they were reviewed by independent experts. But even in a classical (or exploratory) research article, the solution to this problem is simply transparent reporting. If data must be removed for justifiable posthoc reasons, then this should be declared explicitly. The excluded data should be made available so others can inspect it for themselves. Your mileage may vary with regard to some choices, but as long as it is clear why certain decisions were made this is simply a matter for debate. The big problem with all this is that the incentives in scientific publishing still work against this. As far as many high impact journals are concerned, data cleanliness is next to godliness. However justified the reasons may be, a study where some data were excluded posthoc inevitably looks messier than one where those exclusions are simply hidden.

Some have argued for rather Orwellian measures of publishing lab books with detailed records alongside research studies. I haven’t used a “lab book” since the mid-noughties. We have digital records of experiments we conducted, which are less prone to human error. Insofar as this is possible, such records are shared as part of published data sets. I have actually heard of some people who upload each experimental data set automatically as soon as it is collected. For one thing, this is only realistic for relatively small data (e.g. psychophysics or some cognitive psychology experiments perhaps). It doesn’t seem particularly feasible for a lot of research producing big files. It also just seems over-the-top to me. More importantly, this might be illegal in some jurisdictions (data sharing rules where I live are rather strict, in fact I sometimes wonder if any of my open science oriented colleagues aren’t breaking the law).

In the end, such ideas seem to treat the symptom and not the cause. We need to change the climate to make research more transparent. For this it is imperative that editors and reviewers remember that real data are messy. Obviously, if a result only emerges with extensive posthoc exclusions and the problems with the data set are mounting, there can be good reasons to question the finding. But it is crucial that researchers feel free to do the right thing.

Everything* you ever wanted to know about perceived income but were afraid to ask

This is a follow-up to my previous post. For context, you may wish to read this first. In that post I discussed how a plot from a Guardian piece (based on a policy paper) made the claim that German earners tend to misjudge themselves as being closer to the mean or, in the authors’ own words, “everyone thinks they’re middle class“. Now last week, I simply looked at this in the simplest way possible. What I think this plot shows is simply the effect of transforming a normal-ish data distribution into a quantile scale. For reference, here is the original figure again:

The numbers in the column on the left aren’t data. They simply label the deciles, 10% brackets of the income distribution. My point previously was that if you calculate the means of the actual data for each decile you get exactly this line-squeeze plot that is shown here. Obviously this depends on the range of the scale you use. I simply transformed (normalised) the income data into a 1-10 scale where the maximum earner gets a score of 10 and everyone else is below. The point really is that in this scenario this has absolutely nothing to do with perceiving your income at all. It is simply plotting the normalised income data and you produce a plot that is highly reminiscent of the real thing.
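The normalisation described above can be sketched in a few lines. This is a reconstruction under assumptions rather than the original analysis: incomes are drawn from a Gaussian with a made-up mean of $50,000 and SD of $20,000 (which can even produce a few negative incomes), and the lowest earner is mapped to 1 while the highest is mapped to 10.

```python
import random

rng = random.Random(0)
n = 10000  # simulated earners (hypothetical sample size)

# Hypothetical "normal-ish" incomes in fictitious dollars, sorted so that
# consecutive blocks of n/10 people form the deciles.
incomes = sorted(rng.gauss(50000, 20000) for _ in range(n))

# Normalise to a 1-10 scale: lowest earner -> 1, highest earner -> 10.
lowest, highest = incomes[0], incomes[-1]
scores = [1 + 9 * (x - lowest) / (highest - lowest) for x in incomes]

# Mean normalised income within each decile of the income distribution.
size = n // 10
decile_means = [sum(scores[d * size:(d + 1) * size]) / size for d in range(10)]

for d, m in enumerate(decile_means, start=1):
    print(f"decile {d:2d} -> mean normalised income {m:.2f}")
```

The decile means cover a much narrower range than the labels 1 to 10, so connecting each decile label to its mean produces lines that squeeze towards the middle without anyone perceiving anything.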

Does the question matter?

Obviously my example wasn’t really mimicking what happens with perceived income. By design, it wasn’t supposed to. However, this seems to have led to some confusion about what my “simulation” (if you can even call it that) was showing. A blog post by Dino Carpentras argues that what matters here is how perceived income was measured. Here I want to show why I believe this isn’t the case.

First of all, Dino suggested that if people indeed reported their decile then the plot should have perfectly horizontal lines. Dino’s post already includes some very nice illustrations of that so I won’t rehash that here and instead encourage you to read that post. A similar point was made to me on Twitter by Rob McCutcheon. Now, obviously if people actually reported their true deciles then this would indeed be the case. In this case we are simply plotting the decile against the decile – no surprises there. In fact, almost the same would happen if they estimated the exact quantile they fall in and we then average that (that’s what I think Rob’s tweet is showing but I admit my R is too rusty to get into this right now).

My previous post implicitly assumed that people are not actually doing that. When you ask people to rate themselves on a 1-10 scale in terms of where their income lies, I doubt people will think about deciles. But keep in mind that the actual survey asked the participants to rate exactly that. Yet even in this case, I doubt that people are naturally inclined to think about themselves in terms of quantiles. Humans are terrible at judging distributions and probability and this is no exception. However, this is an empirical question – there may well be a lot of research on this already that I’m unaware of and I’d be curious to know about it.

But I maintain that my previous point still stands. To illustrate, I first show what the data would look like in these different scenarios if people could indeed judge their income perfectly on either scale. The plot below is showing what I used in my example previously. This is a distribution of (simulated) actual incomes. The x-axis shows the income in fictitious dollars. All my previous simulation did was to normalise so the numbers/ticks on the x-axis are changed to be between 1-10 but all the relationships remain the same.

But now let us assume that people can judge their income quantile. This comes with a big assumption that all survey respondents even know what that means, which I’d doubt strongly. But let’s take it for granted that any individual is able to report accurately what percentage of the population earns less than them. Below I plot that on the y-axis against the actual income on the x-axis. It gives you the characteristic sigmoid shape – it’s a function most psychophysicists will be very familiar with: the cumulative Gaussian.

If we averaged the y-values for each x-decile and plotted this the way the original graph did, we would get close to horizontal lines. That’s the example I believe Rob showed in his tweet above. However, Dino’s post goes further and assumes people can actually report their deciles (that is, answer the question the survey asked perfectly). That is effectively rounding the quantile reports into 10% brackets. Here is the plot of that. It still follows the vague sigmoid shape but becomes sharply edged.

If you now plotted the line squeeze diagram used in the original graph, you would get perfectly horizontal lines. As I said, I won’t replot this; there really is no need for it. But obviously this is not a realistic scenario. We are talking about self-ratings here. In my last post I already elaborated on a few psychological factors why self-rating measures will be noisy. This is by no means exhaustive. There will be error on any measure, starting from simple mistakes in self-report or whatever. While we should always seek to reduce the noise in our measurements, noisy measurements are at the heart of science.

So let’s simulate that. Sources of error will affect the “perceived income” construct at several levels. The simplest we can do to simulate it is an error on how much the individual thinks their actual income is – we take each person’s income and add a Gaussian error. I used a Gaussian with SD=$30,000. That may be excessive but we don’t really know that. There is likely error in how high people think their income is relative to their peers and general spending power. Even more likely, there must be error on how they rate themselves on the 1-10 decile scale. I suspect that when transformed back into actual income this will be disproportionally larger than the error on judging their own income in dollars. It doesn’t really matter in principle.

Each point here is a simulated person’s self-reported income quantile plotted against their actual income. As you can see while the data still follow the vague sigmoid shape, there is a lot of scatter in people’s “reported” quantiles compared to what it actually should be. For clarity, I added a colour code here which denotes the actual income decile each person belongs to. The darkest blue are the 10% lowest earners and the yellow bracket is the top earners.

Next I round people’s income to simulate their self-reported deciles. The point of this is to effectively transform the self-reports into the discrete 1-10 scale that we believe the actual survey respondents used (I still don’t know the methods and if people were allowed to score themselves a 5.5 for instance – but based on my reading of the paper the scale was discrete). I replot these self-reported deciles using the same format:

Obviously, the y-axis will now again cluster in these 10 discrete levels. But as you can see from the colour code, the “self-reported” decile is a poor reflection of the actual income bracket. While a relative majority (or plurality) of respondents scoring themselves 1 are indeed in the lowest decile, in this particular example some of them are actual top earners. The same applies to the other brackets. Respondents thinking of themselves as perfectly middle class in decile 5 actually come more or less equally from across the spectrum. Now, again this may be a bit excessive but bear with me for just a while longer…

What happens when we replot this with our now infamous line plots? Voilà, doesn’t this look hauntingly familiar?

The reason for this is that perceived income is a psychological measure. Or even just a real world measure. It is noisy. The take-home message here is: It does not matter what question you ask the participants. People aren’t computers. The corollary of that is that when data are noisy the line plot must necessarily produce this squeezing effect the original study reported.

Now you may rightly say, Sam, this noise simulation is excessive. That may well be. I’ll be the first to admit that there are probably not many billionaires who will rate themselves as belonging to the lowest decile. However, I suspect that people indeed have quite a few delusions about their actual income. This may be more likely to affect the people in the actual middle range perhaps. So I don’t think the example here is as extreme as it may appear at first glance. There are also many further complications, such as that these measures are probably heteroscedastic. The error by which individuals misjudge their actual income level in dollars is almost certainly greater for high earners. My example here is very simplistic in assuming the same amount of error across the whole population. This heteroscedasticity is likely to introduce further distortions – such as the stronger “underestimation” by top earners compared to the “overestimation” by low earners, i.e. what the original graph purports to show.

In any case, the amount of error you choose for the simulation doesn’t affect the qualitative pattern. If people are more accurate at judging their income decile, the amount of “squeezing” we see in these line plots will be less extreme. But it must be there. So any of these plots will necessarily contain a degree of this artifact and thus make it very difficult to ascertain if this misestimation claimed by the policy paper and the corresponding Guardian piece actually exists.
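For anyone who wants to poke at this claim, here is a minimal Python sketch of the noisy simulation described above. It is not the Matlab code linked in this post. The population figures (mean $60,000, SD $20,000) and the $30,000 error come from the text; the way a perceived income is turned into a 1-10 score (the cumulative Gaussian of the population, cut into 10% brackets) is my assumption about what "rounding into deciles" means, and the function name `mean_reported_decile` is made up for illustration.

```python
import random
from statistics import NormalDist


def mean_reported_decile(noise_sd, n=10_000, seed=1):
    """Average self-reported decile within each actual income decile."""
    rng = random.Random(seed)
    population = NormalDist(mu=60_000, sigma=20_000)  # figures from the post
    # Actual incomes; the absolute value avoids negative incomes.
    actual = sorted(abs(rng.gauss(60_000, 20_000)) for _ in range(n))
    sums = [0.0] * 10
    counts = [0] * 10
    for rank, income in enumerate(actual):
        actual_decile = rank * 10 // n  # 0..9, by rank within the sample
        # Symmetric Gaussian error on what people think they earn.
        perceived = abs(income + rng.gauss(0, noise_sd))
        # ASSUMPTION: people read off their quantile from the population CDF
        # and report it on a discrete 1-10 scale.
        quantile = population.cdf(perceived)
        reported = min(10, int(quantile * 10) + 1)
        sums[actual_decile] += reported
        counts[actual_decile] += 1
    return [s / c for s, c in zip(sums, counts)]


# Less noise gives less squeezing, but the pattern does not go away.
for sd in (5_000, 15_000, 30_000):
    means = mean_reported_decile(sd)
    print(f"noise SD {sd:>6}: " + " ".join(f"{m:4.1f}" for m in means))
```

Each printed row is what the line plot would connect: the ten actual deciles on one side and the mean reported decile on the other. Changing `noise_sd` changes how far the ends are pulled inwards, not whether they are.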

Finally, I want to reiterate this because it is important: What this shows is that people are bad at judging their income. There is error on this judgement, but crucially this is Gaussian (or semi-Gaussian) error. It is symmetric. Top earner Jeff may underestimate his own income because he has no real concept of how the other half** live. In contrast, billionaire Donny may overestimate his own wealth because of his fragile ego and he forgot how much money he wastes on fake tanning oil. The point is, every individual*** in our simulated population is equally likely to over- or under-estimate their income – however, even with such symmetric noise the final outcome of this binned line plot is that the bin averages trend towards the population mean.

*) Well, perhaps almost everything?

**) Or to be precise, how the other 99.999% live.

***) Actually because my simulation prevents negative incomes for the very lowest earners, the error must skew their perceived income upwards.

Matlab code for this simulation is available here.

It’s #$%&ing everywhere!

I can hear you singing in the distance
I can see you when I close my eyes
Once you were somewhere and now you’re everywhere


Superblood Wolfmoon – Pearl Jam

If you read my previous blog post you’ll know I have a particular relationship these days with regression to the mean – and binning artifacts in general. Our recent retraction of a study reminded me of this issue. Of course, I was generally aware of the concept, as I am sure are most quantitative scientists. But often the underlying issues are somewhat obscure, which is why I certainly didn’t immediately clock on to them in our past work. It took a collaborative group effort with serendipitous suggestions, much thinking and simulating and digging, and not least of all the tireless efforts of my PhD student Susanne Stoll to uncover the full extent of this issue in our published research. We also still maintain that this rabbit hole goes a lot deeper because there are numerous other studies that used similar analyses. They must by necessity contain the same error – hopefully the magnitude of the problem is less severe in most other studies so that their conclusions aren’t all completely spurious. However, we simply cannot know that until somebody investigates this empirically. There are several candidates out there where I think the problem is almost certainly big enough to invalidate the conclusions. I am not the data police and I am not going to run around arguing people’s conclusions are invalid without A) having concrete evidence and B) having talked to the authors personally first.

What I can do, however, is explain how to spot likely candidates of this problem. And you really don’t have far to look. We believe that this issue is ubiquitous in almost all pRF studies; specifically, it affects all pRF studies that use any kind of binning. There are cases where this is probably of no consequence – but people must at least be aware of the issue before it leads to false assumptions and thus erroneous conclusions. We hope to publish another article in the future that lays out this issue in some depth.

But it goes well beyond that. This isn’t a specific problem with pRF studies. Many years before that I had discussions with David Shanks about this subject when he was writing an article (also long since published) on how this artifact confounds many studies in the field of unconscious processing, something that certainly overlaps with my own research. Only last year there was an article arguing that the same artifact explains the Dunning-Kruger effect. And I am starting to see this issue literally everywhere¹ now… Just the other day I saw this figure on one of my social media feeds:

This data visualisation makes a striking claim with very clear political implications: High income earners (and presumably very rich people in general) underestimate their wealth relative to society as a whole, while low income earners overestimate theirs. A great number of narratives can be spun about this depending on your own political inclinations. It doesn’t take much imagination to conjure up the ways this could be used to further a political agenda, be it a fierce progressive tax policy or a rabid pulling-yourself-up-by-your-own-bootstraps type of conservatism. I have no interest in getting into this discussion here. What interests me here is whether the claim is actually supported by the evidence.

There are a number of open questions here. I don’t know how “perceived income” is measured exactly². It could theoretically be possible that some adjustments were made here to control for artifacts. However, taken at face value this looks almost like a textbook example of regression to the mean. Effectively, you have an independent variable, the individuals’ actual income levels. We can presumably regard this as a ground truth – an individual’s income is what it is. We then take a dependent variable, perceived income. It is probably safe to assume that this will correlate with actual income. However, this is not a perfect correlation because perfect correlations are generally meaningless (say correlating body height in inches and centimeters). Obviously, perceived income is a psychological measure that must depend on a whole number of extraneous factors. For one thing, people’s social networks aren’t completely random but we all live embedded in a social context. You will doubtless judge your wealth relative to the people you mostly interact with. Another source of misestimation could be how this perception is measured. I don’t know how that was done here in detail but people were apparently asked to self-rate their assumed income decile. We can expect psychological factors at play that make people unlikely to put themselves in the lowest or highest scores on such a scale. There are many other factors at play but that’s not really important. The point is that we can safely assume that people are relatively bad at judging their true income relative to the whole of society.

But to hell with it, let’s just disregard all that. Instead, let us assume that people are actually perfectly accurate at judging their own income relative to society. Let’s simulate this scenario³. First we draw 10,000 people from a Gaussian distribution of actual incomes. This distribution has a mean of $60,000 and a standard deviation of $20,000 – all in fictitious dollars which we assume our fictitious country uses. We assume these are based on people’s paychecks so there is no error⁴ on this independent variable at all. I use the absolute values to ensure that there is no negative income. The figure below shows the actual objective income for each (simulated) person on the x-axis. The y-axis is just random scatter for visualisation – it has no other significance. The colour code denotes the income bracket (decile) each person belongs to.

Next I simulate perceived income deciles for these fictitious people. To do this we need to do some rescaling to get everyone on the scale 1-10, with 10 being the top earners. However – and this is important – as per our (certainly false) assumption above, perceived income is perfectly correlated with actual income. It is a simple transformation to rescale it. Now, what happens when you average the perceived income in each of these decile brackets like that graph above did? I do that below, using the same formatting as the original graph:

I will leave it to you, gentle reader, to determine how this compares to the original figure. Why is this happening? It’s simple really when you think about it: Take the highest income bracket. This ranges widely from high-but-reasonable to filthy-more-money-than-you-could-ever-spend-in-a-lifetime rich. This is not a symmetric distribution. The summary statistics of these binned data will be heavily skewed. Its mean/median will be biased downward for the top income brackets and upwards for the low income brackets. Only the income decile near the centre will be approximately symmetric and thus produce an unbiased estimate. Or to put it in simpler terms: the left column simply labels the decile brackets. The only data here is in the right column and all this plot really shows is that the incomes have a Gaussian-like distribution. This has nothing to do with perceptions of income whatsoever.
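The error-free scenario above can be sketched in a few lines of Python (again, this is not the Matlab code referred to in the footnotes). The sample size and the income distribution are taken from the text. The post only says the incomes are "rescaled" to 1-10, so the linear min-to-max mapping below is my assumption, and the variable names are invented for illustration.

```python
import random

rng = random.Random(0)
n = 10_000

# Actual incomes ~ N(60k, 20k); absolute value so nobody has negative income.
income = sorted(abs(rng.gauss(60_000, 20_000)) for _ in range(n))

# ASSUMPTION: "rescaling" is a linear map sending the lowest income to 1
# and the highest to 10. Perceived income is then perfectly correlated
# with actual income, exactly as the (deliberately false) premise demands.
lowest, highest = income[0], income[-1]
perceived = [1 + 9 * (x - lowest) / (highest - lowest) for x in income]

# Average the "perceived" score within each actual income decile.
size = n // 10
bin_means = [sum(perceived[d * size:(d + 1) * size]) / size for d in range(10)]

for d, m in enumerate(bin_means, start=1):
    print(f"actual decile {d:2d} -> mean 'perceived' score {m:.2f}")
```

Even though no one in this toy population misjudges anything, the bottom bracket does not average 1 and the top bracket does not average 10, because a handful of extreme values define the ends of the scale while the bulk of each outer bracket sits much closer to the middle.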

In discussions I’ve had this all still confuses some people. So I added another illustration. In the graph below I plot a normal distribution. The coloured bands denote the approximated deciles. The white dots on the X-axis show the mean for each decile. The distance between these dots is obviously not equal. They all tend to be closer to the population mean (zero) than to the middle of their respective bands. This bias is present for all deciles except perhaps the most central ones. However, it is most extreme for the outermost deciles because these have the most asymmetric distributions. This is exactly what the income plots above are showing. It doesn’t matter whether we are looking at actual or perceived income. It doesn’t matter at all if there is error on those measures or not. All that matters is the distribution of the data.
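The white dots can be computed directly rather than simulated. The mean of a standard normal restricted to an interval (a, b) is (φ(a) − φ(b)) / (Φ(b) − Φ(a)), where φ is the density and Φ the cumulative distribution, and for a decile the denominator is simply 0.1. The Python sketch below uses this to compare each decile mean with the midpoint of its band. It only does so for the eight bounded deciles: the two outermost bands have no finite edge, so their "middle" depends on where a plot happens to cut off the axis, which I make no assumption about here.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# The nine boundaries between deciles; None marks an open (infinite) end.
edges = [z.inv_cdf(k / 10) for k in range(1, 10)]
bounds = [None] + edges + [None]


def density(x):
    """Normal density, treated as 0 at an infinite edge."""
    return 0.0 if x is None else z.pdf(x)


# Mean of each decile from the truncated-normal formula (each has mass 0.1).
decile_means = [
    (density(bounds[k]) - density(bounds[k + 1])) / 0.1 for k in range(10)
]

print(f"outermost decile means: {decile_means[0]:.3f}, {decile_means[9]:.3f}")
for k in range(1, 9):  # the eight deciles with two finite edges
    midpoint = (bounds[k] + bounds[k + 1]) / 2
    print(f"decile {k + 1}: mean {decile_means[k]:+.3f}, "
          f"band midpoint {midpoint:+.3f}")
```

For every bounded decile the mean lies on the zero side of the band midpoint, and the gap grows as you move outwards.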

Now, as I already said, I haven’t seen the detailed methodology of that original survey. If the analysis made any attempt to mathematically correct for this problem then I’ll stand corrected⁵. However, even in that case, the general statistical issue is extremely wide-spread and this serves as a perfect example of how binning can result in wildly erroneous conclusions. It also illustrates the importance of this issue. The same problem relates to pRF tuning widths and stimulus preferences and whatnot – but that is frankly of limited importance. But things like these income statistics could have considerable social implications. What this shows to me is two-fold: First, please be careful when you do data analysis. Whenever possible, feed some simulated data to your analysis to see if it behaves as you think it should. Second, binning sucks. I see it effing everywhere now and I feel like I haven’t slept in months.⁶

Superbloodmoon eclipse
Photo by Dave O’Brien, May 2021
  1. A very similar thing happened when I first learned about heteroscedasticity. I kept seeing it in all plots then as well – and I still do…
  2. Many thanks to Susanne Stoll for digging up the source for these data. I didn’t see much in terms of actual methods details here but I also didn’t really look too hard. Via Twitter I also discovered the corresponding Guardian piece which contains the original graph.
  3. Matlab code for this example is available here. I still don’t really do R. Can’t teach an old dog new tricks or whatever…
  4. There may be some error with a self-report measure of people’s actual income although this error is perhaps low – either way we do not need to assume any error here at all.
  5. Somehow I doubt it but I’d be very happy to be wrong.
  6. There could however be other reasons for that…

If this post confused you, there is now a follow-up post to confuse you even more… 🙂

When the hole changes the pigeon

or How innocent assumptions can lead to wrong conclusions

I promised you a (neuro)science post. Don’t let the title mislead you into thinking we’re talking about world affairs and societal ills again. While pigeonholing is directly related to polarised politics or social media, for once this is not what this post is about. Rather, it is about a common error in data analysis. While there have been numerous expositions about similar issues throughout the decades – as we’ve learned the hard way, it is a surprisingly easy mistake to make. A lay summary and some wider musings on the scientific process was published by Benjamin de Haas. A scientific article by Susanne Stoll laying out this problem in more detail is currently available as a preprint.

Pigeonholing (Source: https://commons.wikimedia.org/wiki/File:TooManyPigeons.jpg)

Data binning

In science you often end up with large data sets, with hundreds or thousands of individual observations subject to considerable variance. For instance, in my own field of retinotopic population receptive field (pRF) mapping, a given visual brain area may have a few thousand recording sites, and each has a receptive field position. There are many other scenarios of course. It could be neural firing, or galvanic skin responses, or eye positions recorded at different time points. Or it could be hundreds or thousands of trials in a psychophysics experiment etc. I will talk about pRF mapping because this is where we recently encountered the problem and I am going to describe how it has affected our own findings – however, you may come across the same issue in many guises.

Imagine that we want to test how pRFs move around when you attend to a particular visual field location. I deliberately use this example because it is precisely what a bunch of published pRF studies did, including one of ours. There is some evidence that selective attention shifts the position of neuronal receptive fields, so it is not far-fetched that it might shift pRFs in fMRI experiments also. Our study for instance investigated whether pRFs shift when participants are engaged in a demanding (“high load”) task at fixation, compared to a baseline condition where they only need to detect a simple colour change of the fixation target (“low load”). Indeed, we found that across many visual areas pRFs shifted outwards (i.e. away from fixation). This suggested to us that the retinotopic map reorganises to reflect a kind of tunnel vision when participants are focussed on the central task.

What would be a good way to quantify such map reorganisation? One simple way might be to plot each pRF in the visual field with a vector showing how it is shifted under the attentional manipulation. In the graph below, each dot shows a pRF location under the attentional condition, and the line shows how it has moved away from baseline. Since there is a large number of pRFs, many of which are affected by measurement noise or other errors, these plots can be cluttered and confusing:

Plotting shift of each pRF in the attention condition relative to baseline. Each dot shows where a pRF landed under the attentional manipulation, and the line shows how it has shifted away from baseline. This plot is a hellishly confusing mess.

Clearly, we need to do something to tidy up this mess. So we take the data from the baseline condition (in pRF studies, this would normally be attending to a simple colour change at fixation) and divide the visual field up into a number of smaller segments, each of which contains some pRFs. We then calculate the mean position of the pRFs from each segment under the attentional manipulation. Effectively, we summarise the shift from baseline for each segment:

We divide the visual field into segments based on the pRF data from the baseline condition and then plot the mean shift in the experimental condition for each segment. A much clearer graph that suggests some very substantial shifts…

This produces a much clearer plot that suggests some interesting, systematic changes in the visual field representation under attention. Surely, this is compelling evidence that pRFs are affected by this manipulation?

False assumptions

Unfortunately it is not¹. The mistake here is to assume that there is no noise in the baseline measure that was used to divide up the data in the first place. If our baseline pRF map were a perfect measure of the visual field representation, then this would have been fine. However, like most data, pRF estimates are variable and subject to many sources of error. The misestimation is also unlikely to be perfectly symmetric – for example, there are several reasons why it is more likely that a pRF will be estimated closer to central vision than in the periphery. This means there could be complex and non-linear error patterns that are very difficult to predict.

The data I showed in these figures are in fact not from an attentional manipulation at all. Rather, they come from a replication experiment where we simply measured a person’s pRF maps twice over the course of several months. One thing we do know is that pRF measurements are quite robust, stable over time, and even similar between scanners with different magnetic field strengths. What this means is that any shifts we found are most likely due to noise. They are completely artifactual.

When you think about it, this error is really quite obvious: sorting observations into clear categories can only be valid if you can be confident in the continuous measure on which you base these categories. Pigeonholing can only work if you can be sure into which hole each pigeon belongs. This error is also hardly new. It has been described in numerous forms as regression to the mean and it rears its ugly head every few years in different fields. It is also related to circular inference, which already caused a stir in cognitive and social neuroscience a few years ago. Perhaps the reason for this is that it is a damn easy mistake to make – but that doesn’t make the face-palming moment any less frustrating.

It is not difficult to correct this error. In the plot below, I used an independent map from yet another, third pRF mapping session to divide up the visual field. Then I calculated how the pRFs in each visual field segment shifted on average between the two experimental sessions. While some shift vectors remain, they are considerably smaller than in the earlier graph. Again, keep in mind that these are simple replication data and we would not really expect any systematic shifts. There certainly does not seem to be a very obvious pattern here – perhaps there is a bit of a clockwise shift in the right visual hemifield but that breaks down in the left. Either way, this analysis gives us an estimate of how much variability there may be in this measurement.

We use an independent map to divide the visual field into segments. Then we calculate the mean position for each segment in the baseline and the experimental condition, and work out the shift vector between them. For each segment, this plot shows that vector. This plot loses some information, but it shows how much and into which direction pRFs in each segment shifted on average.

This approach of using a third, independent map loses some information because the vectors only tell you the direction and magnitude of the shifts, not exactly where the pRFs started from and where they end up. Often the magnitude and direction of the shift is all we really need to know. However, when the exact position is crucial we could use other approaches. We will explore this in greater depth in upcoming publications.
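The difference between the two binning schemes is easy to reproduce with made-up numbers. The Python sketch below is a deliberately crude one-dimensional toy, not our actual pRF analysis: every "pRF" has a fixed true position, three sessions measure it with independent Gaussian noise, and nothing ever moves. The position range, noise level, bin width and all function names are assumptions chosen purely for illustration.

```python
import random

rng = random.Random(42)
n, noise_sd = 20_000, 2.0

# ASSUMPTION: true positions spread evenly along one axis, -10 to +10 "degrees".
truth = [rng.uniform(-10, 10) for _ in range(n)]


def measure():
    """One noisy mapping session of the same, unchanging positions."""
    return [t + rng.gauss(0, noise_sd) for t in truth]


session_a, session_b, session_c = measure(), measure(), measure()


def bin_index(x, width=4.0, n_bins=5):
    """Index of the 4-unit-wide bin containing x; outliers go to the end bins."""
    return min(n_bins - 1, max(0, int((x + 10) // width)))


def mean_shift_per_bin(binning_map):
    """Mean shift from session A to session B, binned by the given map."""
    sums, counts = [0.0] * 5, [0] * 5
    for ref, a, b in zip(binning_map, session_a, session_b):
        i = bin_index(ref)
        sums[i] += b - a
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


circular = mean_shift_per_bin(session_a)     # bins defined by the baseline itself
independent = mean_shift_per_bin(session_c)  # bins defined by a third session

print("binned by baseline:   ", [round(s, 2) for s in circular])
print("binned by independent:", [round(s, 2) for s in independent])
```

Binning by the baseline session produces apparent shifts in the outer bins that point back towards the middle of the range, whereas binning by the independent session leaves only small residual scatter around zero.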

On the bright side, the example I picked here is probably extreme because I didn’t restrict these plots to a particular region of interest but used all supra-threshold voxels in the occipital cortex. A more restricted analysis would remove some of that noise – but the problem nevertheless remains. How much it skews the findings depends very much on how noisy the data are. Data tend to be less noisy in early visual cortex than in higher-level brain regions, which is where people usually find the most dramatic pRF shifts…

Correcting the literature

It is so easy to make this mistake that you can find it all over the pRF literature. Clearly, neither authors nor reviewers have given it much thought. It is definitely not confined to studies of visual attention, although this is how we stumbled across it. It could be a comparison between different analysis methods or stimulus protocols. It could be studies measuring the plasticity of retinotopic maps after visual field loss. Ironically, it could even be studies that investigate the potential artifacts when mapping such plasticity incorrectly. It is not restricted to the kinds of plots I showed here but should affect any form of binning, including the binning into eccentricity bins that is most common in the literature. We suspect the problem is also pervasive in many other fields or in studies using other techniques. Only a few years ago a similar issue was described by David Shanks in the context of studying unconscious processing. It is also related to warnings you may occasionally hear about using median splits – really just a simpler version of the same approach.

I cannot tell you if the findings from other studies that made this error are spurious. To know that we would need access to the data and reanalyse these studies. Many of them were published before data and code sharing was relatively common². Moreover, you really need to have a validation dataset, like the replication data in my example figures here. The diversity of analysis pipelines and experimental designs makes this very complex – no two of these studies are alike. The error distributions may also vary between different studies, so ideally we need replication datasets for each study.

In any case, as far as our attentional load study is concerned, after reanalysing these data with unbiased methods, we found little evidence of the effects we published originally. While there is still a hint of pRF shifts, these are no longer statistically significant. As painful as this is, we therefore retracted that finding from the scientific record. There is a great stigma associated with retraction, because of the shady circumstances under which it often happens. But to err is human – and this is part of the scientific method. As I said many times before, science is self-correcting but that is not some magical process. Science doesn’t just happen, it requires actual scientists to do the work. While it can be painful to realise that your interpretation of your data was wrong, this does not diminish the value of this original work³ – if anything this work served an important purpose by revealing the problem to us.

We mostly stumbled across this problem by accident. Susanne Stoll and Elisa Infanti conducted a more complex pRF experiment on attention and found that the purported pRF shifts in all experimental conditions were suspiciously similar (you can see this in an early conference poster here). It took us many months of digging, running endless simulations, complex reanalyses, and sometimes heated arguments before we cracked that particular nut. The problem may seem really obvious now – it sure as hell wasn’t before all that.

This is why this erroneous practice appears to be widespread in this literature and may have skewed the findings of many other published studies. This does not mean that all these findings are false but it should serve as a warning. Ideally, other researchers will also revisit their own findings but whether or not they do so is frankly up to them. Reviewers will hopefully be more aware of the issue in future. People might question the validity of some of these findings in the absence of any reanalysis. But in the end, it doesn’t matter all that much which individual findings hold up and which don’t⁴.

Check your assumptions

I am personally more interested in taking this whole field forward. This issue is not confined to the scenario I described here. pRF analysis is often quite complex. So are many other studies in cognitive neuroscience and, of course, in many other fields as well. Flexibility in study designs and analysis approaches is not a bad thing – it is in fact essential for addressing scientific questions that we can adapt our experimental designs.

But what this story shows very clearly is the importance of checking our assumptions. This is all the more important when using the complex methods that are ubiquitous in our field. As cognitive neuroscience matures, it is critical that we adopt good practices in ensuring the validity of our methods. In the computational and software development sectors, it is to my knowledge commonplace to test algorithms on conditions where the ground truth is known, such as random and/or simulated data.

This idea is probably not even new to most people and it certainly isn’t to me. During my PhD there was a researcher in the lab who had concocted a pretty complicated analysis of single-cell electrophysiology recordings. It involved lots of summarising and recentering of neuronal tuning functions to produce the final outputs. Neither I nor our supervisor really followed every step of this procedure based only on our colleague’s description – it was just too complex. But eventually we suspected that something might be off and so we fed random numbers to the algorithm – lo and behold the results were a picture perfect reproduction of the purported “experimental” results. Since then, I have simulated the results of my analyses a few other times – for example, when I first started with pRF modelling or when I developed new techniques for measuring psychophysical quantities.

This latest episode taught me that we must do this much more systematically. For any new design, we should conduct control analyses to check how it behaves with data for which the ground truth is known. It can reveal statistical artifacts that might hide inside the algorithm but also help you determine the method’s sensitivity and thus allow you to conduct power calculations. Ideally, we would do that for every new experiment even if it uses a standard design. I realise that this may not always be feasible – but in that case there should be a justification why it is unnecessary.
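As a concrete illustration of such a control analysis, here is a small Python sketch that feeds null data to a median-split analysis. Everything in it is hypothetical: the simulated trait, the noise level and the function names are invented for the example, and it stands in for whatever analysis you actually want to validate. The simulated data contain no change at all between the first and second measurement, so any "effect" the analysis reports is an artifact.

```python
import random
from statistics import mean, median


def simulate_null(rng, n=500):
    """Three noisy measurements of the same trait; nothing truly changes."""
    trait = [rng.gauss(0, 1) for _ in range(n)]

    def noisy():
        return [t + rng.gauss(0, 1) for t in trait]

    return noisy(), noisy(), noisy()


def split_effect(split_on, before, after):
    """Change (after minus before) in the high group minus the low group."""
    cut = median(split_on)
    high = [a - b for s, b, a in zip(split_on, before, after) if s > cut]
    low = [a - b for s, b, a in zip(split_on, before, after) if s <= cut]
    return mean(high) - mean(low)


def null_check(use_independent_split, n_runs=200, seed=7):
    """Average 'effect' the analysis reports when the true effect is zero."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_runs):
        first, second, third = simulate_null(rng)
        split_on = third if use_independent_split else first
        effects.append(split_effect(split_on, first, second))
    return mean(effects)


print("split on the baseline itself:", round(null_check(False), 2))
print("split on an independent run: ", round(null_check(True), 2))
```

If the first number is far from zero while the second hovers around it, the analysis is manufacturing an effect out of measurement noise, and the spread of the values in `effects` would also give a rough idea of the method’s sensitivity.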

Because what this really boils down to is simply good science. When you use a method without checking that it works as intended, you are effectively doing a study without a control condition – quite possibly the original sin of science.

Acknowledgements

In conclusion, I quickly want to thank several people: First of all, Susanne Stoll deserves major credit for tirelessly pursuing this issue in great detail over the past two years with countless reanalyses and simulations. Many of these won’t ever see the light of day but helped us wrap our heads around what is going on here. I want to thank Elisa Infanti for her input and in particular the suggestion of running the analysis on random data – without this we might never have realised how deep this rabbit hole goes. I also want to acknowledge the patience and understanding of our co-authors on the attentional load study, Geraint Rees and Elaine Anderson, for helping us deal with all the stages of grief associated with this. Lastly, I want to thank Benjamin de Haas, the first author of that study for honourably doing the right thing. A lesser man would have simply booked a press conference at Current Biology Total Landscaping instead to say it’s all fake news and announce a legal challenge⁵.

Footnotes:

  1. The sheer magnitude of some of these shifts may also be scientifically implausible, an issue I’ve repeatedly discussed on this blog already. Similar shifts have however been reported in the literature – another clue that perhaps something is awry in these studies…
  2. Not that data sharing is enormously common even now.
  3. It is also a solid data set with a fairly large number of participants. We’ve based our canonical hemodynamic response function on the data collected for this study – there is no reason to stop using this irrespective of whether the main claims are correct or not.
  4. Although it sure would be nice to know, wouldn’t it?
  5. Did you really think I’d make it through a blog post without making any comment like this?

Irish Times OpEds are just bloody awful at science (n=1)

TL-DR: No, men are not “better at science” than women.

Clickbaity enough for you? I cannot honestly say I have read a lot of OpEds in the Irish Times so the evidence for my titular claim is admittedly rather limited. But it is still more solidly grounded in actual data than this article published yesterday in the Irish Times. At least I have one data point.

The article in question, a prime example of Betteridge’s Law, is entitled “Are men just better at science than women?“. I don’t need to explain why such a title might be considered sensationalist and controversial. The article itself is an “Opinion” piece, thus allowing the publication to disavow any responsibility for its authorship whilst allowing it to rake in the views from this blatant clickbait. In it, the author discusses some new research reporting gender differences in systemising vs empathising behaviour and puts this in the context of some new government initiative to specifically hire female professors because apparently there is some irony here. He goes on a bit about something called “neurosexism” (is that a real word?) and talks about “hard-wired” brains*.

I cannot quite discern if the author thought he was being funny or if he is simply scientifically illiterate but that doesn’t really matter. I don’t usually spend much time commenting on stuff like that. I have no doubt that the Irish Times, and this author in particular, will be overloaded with outrage and complaints – or, to use the author’s own words, “beaten up” on Twitter. There are many egregious misrepresentations of scientific findings in the mainstream media (and often enough, scientists and/or the university press releases are the source of this). But this example of butchery is just so bad and infuriating in its abuse of scientific evidence that I cannot let it slip past.

The whole argument, if this is what the author attempted, is just riddled with logical fallacies and deliberate exaggerations. I have no time or desire to go through them all. Conveniently, the author already addresses a major point himself by admitting that the study in question does not actually speak to male brains being “hard-wired” for science, but that any gender differences could be arising due to cultural or environmental factors. Not only that, he also acknowledges that the study in question is about autism, not about who makes good professors. So I won’t dwell on these rather obvious points any further. There are much more fundamental problems with the illogical leaps and mental gymnastics in this OpEd:

What makes you “good at science”?

There is a long answer to this question. It most certainly depends somewhat on your field of research and the nature of your work. Some areas require more manual dexterity, whilst others may require programming skills, and others yet call for a talent for high-level maths. As far as we can generalise, in my view necessary traits of a good researcher are: intelligence, creativity, patience, meticulousness, and a dedication to seek the truth rather than confirming theories. That last one probably goes hand-in-hand with some scepticism, including a healthy dose of self-doubt.

There is also a short answer to this question. A good scientist is not measured by their Systemising Quotient (SQ), a self-report measure that quantifies “the drive to analyze or build a rule-based system”. Academia is obsessed with metrics like the h-index (see my previous post) but even pencil pushers and bean counters** in hiring or grant committees haven’t yet proposed to use SQ to evaluate candidates***.

I suspect it is true that many scientists score high on the SQ and also the related Autism-spectrum Quotient (AQ) which, among other things, quantifies a person’s self-reported attention to detail. Anecdotally, I can confirm that a lot of my colleagues score higher than the population average on AQ. More on this in the next section.

However, none of this implies that you need to have a high SQ or AQ to be “good at science”, whatever that means. That assertion is a logical fallacy called affirming the consequent. We may agree that “systemising” characterises a lot of the activities a typical scientist engages in, but there is no evidence that this is sufficient for being a good scientist. It could mean that systemising people are attracted to science and engineering jobs. It certainly does not mean that a non-systemising person cannot be a good scientist.

Small effect sizes

I know I rant a lot about relative effect sizes such as Cohen’s d, where the mean difference is normalised by the variability. I feel that in a lot of research contexts these are given undue weight because the variability itself isn’t sufficiently controlled. But for studies like this we can actually be fairly confident that they are meaningful. The scientific study had a pretty whopping sample size of 671,606 (although that includes all their groups) and also used validation data. The residual physiologist inside me retains his scepticism about self-report questionnaire type measures, but even I have come to admit that a lot of questionnaires can be pretty effective. I think it is safe to say that the Big 5 Personality Factors or the AQ tap into some meaningful real factors. Further, whatever latent variance there may be on these measures, that is probably outweighed by collecting such a massive sample. So the Cohen’s d this study reports is probably quite informative.

What does this say? Well, the Cohen’s d for the difference in SQ between males and females was 0.31. In other words, the distributions of SQ between sexes overlap quite considerably but the distribution for males is somewhat shifted towards higher values. Thus, while the average man has a subtly higher SQ than the average woman, a rather considerable number of women will have higher SQs than the average man. The study helpfully plots these distributions in Figure 1****:

Sex diffs SQ huge N
Distributions of SQ in control females (cyan), control males (magenta), autistic females (red), and autistic males (green).

The relevant curves here are the controls in cyan and magenta. Sorry, colour vision deficient people, the authors clearly don’t care about you (perhaps they are retinasexists?). You’ll notice that the modes of the female and male distributions are really not all that far apart. More noticeable is the skew of all these distributions with a long tail to the right: Low SQs are most common in all groups (including autism) but values across the sample are spread across the full range. So by picking out a random man and a random woman from a crowd, you can be fairly confident that their SQs are both on the lower end but I wouldn’t make any strong guesses about whether the man has a higher SQ than the woman.
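To put rough numbers on that overlap, here is a minimal sketch. It assumes, purely for illustration, that both SQ distributions are normal with equal variance and differ by d=0.31. The real distributions are skewed, as just noted, so treat the output as a ballpark rather than anything the study reports.

```python
from math import sqrt
from statistics import NormalDist

d = 0.31  # Cohen's d for the male-female SQ difference quoted above

std_normal = NormalDist()  # mean 0, SD 1

# Assumption (not from the study): female SQ ~ N(0, 1), male SQ ~ N(d, 1).
# Share of women who score above the average man:
women_above_mean_man = 1 - std_normal.cdf(d)

# Probability that a randomly picked man outscores a randomly picked woman.
# The difference of two independent unit-variance normals has SD sqrt(2).
p_man_higher = std_normal.cdf(d / sqrt(2))

print(f"Women above the average man:  {women_above_mean_man:.1%}")
print(f"P(random man > random woman): {p_man_higher:.1%}")
```

Under these assumptions roughly 38% of women score above the average man, and a randomly chosen man beats a randomly chosen woman just under 59% of the time, which is not much better than a coin flip.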

However, it gets even tastier because the authors of the study actually also conducted an analysis splitting their data from controls into people in Science, Technology, Engineering, or Maths (STEM) professions compared to controls who were not in STEM. The results (yes, I know the colour code is now weirdly inverted – not how I would have done it…) show that people in STEM, whether male or female, tend to have larger SQs than people outside of STEM. But again, the average difference here is actually small and most of it plays out in the rightward tail of the distributions. The difference between males and females in STEM is also much less distinct than for people outside STEM.

Sex & STEM diffs SQ
Distributions of SQ in STEM females (cyan), STEM males (magenta), control females (red), and control males (green).

So, as already discussed in the previous section, it seems to be the case that people in STEM professions tend to “systemise” a bit more. It also suggests that men systemise more than women but that difference probably decreases for people in STEM. None of this tells us anything about whether people’s brains are “hard-wired” for systemising, if it is about cultural and environmental differences between men and women, or indeed if being trained in a STEM profession might make people more systemising. It definitely does not tell you who is “good at science”.

What if it were true?

So far so bad for those who might want to make that interpretive leap. But let’s give them the benefit of the doubt and ignore everything I said up until now. What if it were true that systemisers are in fact better scientists? Would that invalidate government or funders’ initiatives to hire more female scientists? Would that be bad for science?

No. Even if there were a vast difference in systemising between men and women, and between STEM and non-STEM professions, respectively, all such a hiring policy will achieve is to increase the number of good female scientists – exactly what this policy is intended to do. Let me try an analogy.

Basketball players in the NBA tend to be pretty damn tall. Presumably it is easier to dunk when you measure 2 meters than when you’re Tyrion Lannister. Even if all other necessary skills here are equal there is a clear selection pressure for tall people to get into top basketball teams. Now let’s imagine a team decided they want to hire more shorter players. They declare they will hire 10 players who cannot be taller than 1.70m. The team will have try-outs and still seek to get the best players out of their pool of applicants. If they apply an objective criterion for what makes a good player, such as the ability to score consistently, they will only hire short players with excellent aim or who can jump really high. In fact, these shorties will be on average better at aiming and/or jumping than the giants they already have on their team. The team selects for the ability to score. Shorties and Tallies get there via different means but they both get there.

In this analogy, being a top scorer is being a systemiser, which in turn makes you a good scientist. Giants tend to score high because they find it easy to reach the basket. Shorties score high because they have other skills that compensate for their lack of height. Women can be good systemisers despite not being men.

The only scenario in which such a specific hiring policy could be counterproductive is if two conditions are met: 1) The difference between groups in the critical trait (i.e., systemising) is vast and 2) the policy mandates hiring from a particular group without any objective criteria. We have already established that the former condition isn’t fulfilled here – the difference in systemising between men and women is modest at best. The latter condition is really a moot point because this is simply not how hiring works in the real world. Hiring committees don’t usually just offer jobs to the relatively best person out of the pool but also consider the candidates’ objective abilities and achievements. This is even more pertinent here because all candidates in this case will already be eligible for a professorial position anyway. So all that will in fact happen is that we end up with more female professors who will also happen to be high in systemising.

Bad science reporting

Again, this previous section is based on the entirely imaginary and untenable assumption that systemisers are better scientists. I am not aware of any evidence of that – in part because we cannot actually quantify very well what makes a good scientist. The metrics academics actually (and sadly) use for hiring and funding decisions probably do not quantify that either but I am not even aware of any link between systemising and those metrics. Is there a correlation between h-indices (relative to career age) and SQ? I doubt it.

What we have here is a case of awful science reporting. Bad science journalism and the abuse of scientific data for nefarious political purposes are hardly a new phenomenon – and this won’t just disappear. But the price of freedom (to practice science) is eternal vigilance. I believe as scientists we have a responsibility to debunk such blatant misapprehensions by journalists who I suspect have never even set foot in an actual lab or spoken to any actual scientists.

Some people assert that improving the transparency and reliability of research will hurt the public’s faith in science. Far from it, I believe those things can show people how science really works. The true damage to how the public perceives science is done by garbage articles in the mainstream media like this one – even if it is merely offered as an “opinion”.

1280px-tyson_chandler
By Keith Allison

*) Brains are not actually hard-wired to do anything. Leaving the old Hebbian analogy aside, brains aren’t wired at all, period. They are soft, squishy, wet sponges containing lots of neuronal and glial tissue plus blood vessels. Neurons connect via synapses between axons and dendrites and this connectivity is constantly regulated and new connections grown while others are pruned. This adaptability is one of the main reasons why we even have brains, and lies at the heart of the intelligence, ingenuity, and versatility of our species.

**) I suspect a lot of the pencil pushers and bean counters behind metrics like impact factors or the h-index might well be Systemisers.

***) I hope none of them read this post. We don’t want to give these people any further ideas…

****) Isn’t open access under Creative Commons license great?

P-values are uniformly distributed when the null hypothesis is true

TL-DR: If the title of this blog post is unsurprising to you, I suggest you go play outside.

Many discussions in my science social media bubble circle around p-values (what an exciting life I lead…). Just a few days ago, there was a big kerfuffle about p-curving and whether p-values just below 0.05 are a sign of whatever. One of the main concepts behind p-curves is that under the assumption that the null hypothesis (H0) of no effect/difference is true, p-values should be uniformly distributed (at least as long as the test assumptions are met reasonably). This once again supported my suspicions that most people don’t actually know what p-values mean. Reports of people defining p-values incorrectly abound, sometimes even in stats textbooks. It also seems to me that people find p-values rather unintuitive. And I get the impression a lot of people vastly overestimate how widely known things like p-curve actually are.

A few weeks ago I got embroiled in a Facebook discussion. A friend of mine was running a permutation analysis to test something about his experiment and found something very odd: the distribution of p-values was skewed severely to the left – there were very few low p-values but the proportion was steadily increasing with most p-values being just below 1. He expected this distribution to be uniform because under the random permutations H0 should be true. A lot of commenters on his post seemed rather surprised and/or confused by the whole idea that p-values should be distributed uniformly when H0 is true. “Surely,” so the common intuition goes, “when there is actually no difference, most p-values should be high and close to 1?”

No, and the reason why not is the p-value itself. A p-value can be calculated/estimated in many different ways. Most people use parametric tests but essentially they all share one philosophy. If you have no underlying effect and randomly sample data ad infinitum you end up with a distribution of test statistics. In my example, I draw two variables each with n=100 from a normal distribution and calculate the Pearson correlation between them – and I repeat this 20,000 times. This produces a distribution of correlation coefficients like this:

Rs0

There is no correlation between two random variables (H0 is true) and so the distribution is centred on zero. The spread of the distribution depends on the sample size. Larger samples will produce narrower distributions. Critically, we can use this distribution to get a p-value. If we had observed a correlation of r=0.3 in our experiment, we could calculate the proportion of correlation coefficients in this distribution that are equal to or greater than 0.3. This would give us a one-tailed p-value. If you ignore the sign of the correlation, you get a two-tailed p-value.

In the plot above, I coloured the 5% most extreme correlation coefficients in blue (2.5% to the left and to the right, respectively). These regions are abutted by vertical red lines at just below +/-0.2 in this case. This reflects the critical effect size needed to get p<0.05 – only 5% of the correlation coefficients in this distribution are +/-0.19ish or even more extreme.
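If you want to try this yourself, here is a minimal sketch of the simulation described so far, using only the Python standard library. The helper names (`pearson_r`, `null_correlations`) are made up for this illustration. It runs 5,000 repetitions rather than 20,000 to keep the pure-Python loop quick, so the estimates are a little noisier than the figure.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def null_correlations(n_obs=100, n_sims=5000, seed=1):
    """Correlations between pairs of independent normal samples (H0 is true)."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        rs.append(pearson_r(x, y))
    return rs

null_rs = null_correlations()
r_obs = 0.3  # the hypothetical observed correlation from the text

# One-tailed p: share of null correlations at least as large as r_obs.
p_one_tailed = sum(r >= r_obs for r in null_rs) / len(null_rs)
# Two-tailed p: the same proportion, but ignoring the sign.
p_two_tailed = sum(abs(r) >= abs(r_obs) for r in null_rs) / len(null_rs)

# Critical |r| for p<0.05: the value cutting off the 5% most extreme results.
abs_sorted = sorted(abs(r) for r in null_rs)
r_critical = abs_sorted[int(0.95 * len(abs_sorted))]

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
print(f"critical |r| = {r_critical:.3f}")
```

With n=100 the critical value should land just below 0.2, and an observed r=0.3 sits far out in the tail, so its p-value comes out very small.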

Now compare this to the region coloured in red. This region also makes up 5% of the whole distribution. However, the red region surrounds zero, that is, those correlation coefficients that are really close to the true correlation value. Random chance makes the distribution spread out (and that becomes more severe when your sample size is low) but most of the correlations will nevertheless be close to the true value of zero. Therefore, the range of values in this red region is much narrower because the values are much denser here.

But of course these nigh-zero correlation coefficients will have the largest p-values. Consider again what a p-value reflects. If your observed correlation is 0.006 and you again ignore the sign of the effects, almost all correlations in this null distribution would be equal to or greater than 0.006. So this proportion, the p-value, is almost 1. Put in other words, the 5% of p-values that fall below 0.05 come from the long, thin tails of the null distribution, while the 5% of p-values above 0.95 come from a really narrow sliver of the null distribution near zero:

Ps0

Visualised the same way, you have the blue region with p<0.05 on the left. Here correlations are large (greater than 0.19ish). On the right, you have the red region with p>0.95. Here correlations are really close to zero.

In other words, you can directly read off the p-value from the x-axis of this distribution of p-values. This is a direct consequence of what p-values represent. They are the proportion of values in the null distribution where correlations are equal to or more extreme than the observed correlation.
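The uniformity can be checked with a short simulation. The sketch below builds a reference null distribution, runs a fresh batch of "experiments" in which H0 is also true, and computes each p-value exactly as defined above: the proportion of null values at least as extreme as the observed one. The function names and the simulation sizes are my own choices for illustration.

```python
import random
from bisect import bisect_left
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate_abs_rs(rng, n_sims, n_obs=100):
    """|r| for n_sims pairs of independent normal samples (H0 is true)."""
    out = []
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        out.append(abs(pearson_r(x, y)))
    return out

rng = random.Random(2)
null_abs = sorted(simulate_abs_rs(rng, 4000))  # reference null distribution
experiments = simulate_abs_rs(rng, 2000)       # fresh studies, H0 still true

def two_tailed_p(abs_r):
    """Proportion of null |r| values that are >= the observed |r|."""
    n_smaller = bisect_left(null_abs, abs_r)  # null values strictly below abs_r
    return (len(null_abs) - n_smaller) / len(null_abs)

p_values = [two_tailed_p(r) for r in experiments]

# Histogram with 10 equal-width bins: each should hold about 10% of p-values.
n_bins = 10
counts = [0] * n_bins
for p in p_values:
    counts[min(int(p * n_bins), n_bins - 1)] += 1
shares = [c / len(p_values) for c in counts]
share_below_05 = sum(p < 0.05 for p in p_values) / len(p_values)

for i, s in enumerate(shares):
    print(f"p in [{i / n_bins:.1f}, {(i + 1) / n_bins:.1f}): {s:.1%}")
```

Every bin should hover around 10%, with no pile-up near 1, and about 5% of these null experiments should come out "significant" at p<0.05.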

Of course, if the null hypothesis is false and there actually is a correlation between the two variables this distribution must become skewed. There should now be many more tests with low p-values than with large ones. This is exactly what happens and this is the pattern that analyses like p-curve seek to detect:

Ps1
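The same contrast can be sketched in code. Here I assume a true correlation of rho=0.2, a value picked only for illustration. To stay within the standard library, the p-values use the Fisher z approximation rather than an exact test, so they will differ slightly from what a stats package reports.

```python
import random
from math import atanh, sqrt
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def p_two_tailed(r, n):
    """Two-tailed p for H0: rho=0 via the Fisher z approximation (an assumption)."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_p_values(rho, n_obs=100, n_sims=2000, seed=3):
    """p-values from repeated experiments with true correlation rho."""
    rng = random.Random(seed)
    ps = []
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        # y shares a fraction rho of its signal with x; the rest is noise.
        y = [rho * a + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
        ps.append(p_two_tailed(pearson_r(x, y), n_obs))
    return ps

p_null = simulate_p_values(rho=0.0)
p_effect = simulate_p_values(rho=0.2)  # hypothetical true effect

share_sig_null = sum(p < 0.05 for p in p_null) / len(p_null)
share_sig_effect = sum(p < 0.05 for p in p_effect) / len(p_effect)

print(f"p<0.05 when H0 is true:  {share_sig_null:.1%}")
print(f"p<0.05 when rho is 0.2:  {share_sig_effect:.1%}")
```

Under H0 about 5% of p-values fall below 0.05, as a uniform distribution demands. With a real effect that share is simply the statistical power of the test, so low p-values pile up on the left.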

Now, my friend’s p-distribution looked essentially like the mirror image of this. I still haven’t learned what could have possibly caused this. It would mean that more effect sizes were close to zero than there should be under H0. This could suggest some assumptions not being met but none of my own feeble simulations managed to reproduce the pattern he found. His analyses sounded quite complex so there is a possibility that there were some complex errors in them.

 

Is d>10 a plausible effect size?

TL;DR: You may get a very large relative effect size (like Cohen’s d), if the main source of the variability in your sample is the reliability of each observation and the measurement was made as exact as is feasible. Such a large d is not trivial, but in this case talking about d is missing the point.

In discussions of scientific findings you will often hear talk about relative effect sizes, like the ubiquitous Cohen’s d. Essentially, such effect sizes quantify the mean difference between groups/treatments/conditions relative to the variability across subjects/observations. The situation is actually a lot more complicated because even for a seemingly simple result like the difference between conditions you will find that there are several ways of calculating the effect size. You can read a nice summary by Jake Westfall here. There are also other effect sizes, such as correlation coefficients, and what I write here applies to those, too. I will however stick to the difference-type effect size because it is arguably the most common.

One thing that has irked me about those discussions for some years is that this ignores a very substantial issue: the between-subject variance of your sample depends on the within-subject variance. The more unreliable the measurement of each subject, the greater is the variability of your sample. Thus the reliability of individual measurements limits the relative effect size you can possibly achieve in your experiment given a particular experimental design. In most of science – especially biological and psychological sciences – the reliability of individual observations is strongly limited by the measurement error and/or the quality of your experiment.
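One way to make this dependence explicit is a simple additive model. This is my own assumption for illustration, and the symbols are introduced here rather than taken from any particular paper. Suppose subject i has a true effect δ_i with population mean δ and between-subject variance σ_b², and that the measured effect adds independent noise with per-trial variance σ_w², averaged over k trials:

```latex
\hat{\delta}_i = \delta_i + \varepsilon_i, \qquad
\operatorname{Var}(\delta_i) = \sigma_b^2, \qquad
\operatorname{Var}(\varepsilon_i) = \frac{\sigma_w^2}{k}

\operatorname{Var}(\hat{\delta}_i) = \sigma_b^2 + \frac{\sigma_w^2}{k}
\quad\Longrightarrow\quad
d = \frac{\delta}{\sqrt{\sigma_b^2 + \sigma_w^2 / k}}
\;\xrightarrow{\;k \to \infty\;}\; \frac{\delta}{\sigma_b}
```

Under this sketch the observed d is capped by δ/σ_b and only approaches that cap as each subject's measurement becomes more reliable. With few or noisy trials the σ_w²/k term dominates, and d says more about the design than about the population.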

There are some standard examples that are sometimes used to illustrate what a given effect size means. I stole a common one from this blog post about the average height difference between men and women, which apparently was d=1.482 in 1980 Spain. I have no idea if this is true exactly but that figure should be in the right ballpark. I assume most people will agree that men are on average taller than women but that there is nevertheless substantial overlap in the distributions – so that relatively frequently you will find a woman who is taller than many men. That is an effect size we might consider strong.

The height difference between men and women is a good reference for an effect size because it is largely limited by the between-subject variance, the variability in actual heights across the population. Obviously, the reliability of each observation also plays a role. There will definitely be a degree of measurement error. However, I suspect that this error is small, probably on the order of a few millimeters. Even if you’re quite bad at this measurement I doubt you will typically err by more than 1-2 cm and you can probably still replicate this effect in a fairly small sample. However, in psychology experiments your measurement rarely is that accurate.

Now, in some experiments you can increase the reliability of your individual measurement by increasing the number of trials (at this point I’d like to again refer to Richard Morey’s related post on this topic). In psychophysics, collecting hundreds or thousands of trials on one individual subject is not at all uncommon. Let’s take a very simple case. Contour integration refers to the ability of the visual system to detect “snake” contours better than “ladder” contours or those defined by other orientations (we like to call those “ropes”):

 

In the left image you should hopefully see a circle defined by 16 grating patches embedded in a background of randomly oriented gratings. This “snake” contour pops out from the background because the visual system readily groups orientations along a collinear (or cocircular) path into a coherent object. In contrast, when the contour is defined by patches of other orientations, for example the “rope” contour in the right image which is defined by patches at 45 degrees relative to the path, then it is much harder to detect the presence of this contour. This isn’t a vision science post so I won’t go into any debates on what this means. The take-home message here is that if healthy subjects with normal vision are asked to determine the presence or absence of a contour like this, especially with limited viewing time, they will perform very well for the “snake” contours but only barely above chance levels for the “rope” contours.

This is a very robust effect and I’d argue this is quite representative of many psychophysical findings. A psychophysicist probably wouldn’t simply measure the accuracy but conduct a broader study of how this depends on particular stimulus parameters – but that’s not really important here. It is still pretty representative.

What is the size of this effect? 

If I study this in a group of subjects, the relative effect size at the group level will depend on how accurately I measure the performance in each individual. If I have 50 subjects (which is between 10-25 times larger than your typical psychophysics study…) and each performs just one trial, then the sample variance will be much larger compared to if each of them does 100 trials or if they each do 1000 trials. As a result, the Cohen’s d of the group will be considerably different. A d>10 should be entirely feasible if we collect enough trials per person.
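Here is a minimal simulation of that scenario. All the numbers are invented for illustration: true accuracy of about 95% for snakes and 55% for ropes, with only small differences between subjects. The effect size is computed as the mean of the per-subject accuracy differences divided by their standard deviation, which is just one of the several possible flavours of d.

```python
import random
from statistics import mean, stdev

def simulate_d(n_trials, n_subjects=50, seed=4):
    """Paired Cohen's d (mean difference / SD of differences) for a
    hypothetical snake-vs-rope detection task."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_subjects):
        # Made-up true accuracies, varying only slightly across subjects.
        p_snake = min(max(rng.gauss(0.95, 0.01), 0.0), 1.0)
        p_rope = min(max(rng.gauss(0.55, 0.02), 0.0), 1.0)
        # Measured accuracy: proportion correct over n_trials per condition.
        acc_snake = sum(rng.random() < p_snake for _ in range(n_trials)) / n_trials
        acc_rope = sum(rng.random() < p_rope for _ in range(n_trials)) / n_trials
        diffs.append(acc_snake - acc_rope)
    return mean(diffs) / stdev(diffs)

d_by_trials = {k: simulate_d(k) for k in (1, 100, 1000)}
for k, d in d_by_trials.items():
    print(f"{k:>4} trials per condition: d = {d:.2f}")
```

The true difference in accuracy is the same 40 percentage points in every run. On expectation these made-up parameters give a d of roughly 0.7 with one trial, about 7 with 100 trials and about 14 with 1000 trials. Only the measurement reliability has changed.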

People will sometimes say that large effects (d>>2 perhaps) are trivial. But there is nothing trivial about this. In this particular example you may see the difference quite easily for yourself (so you are a single-subject and single-trial replication). But we might want to know just how much better we are at detecting the snake than the rope contours. Or, as I already mentioned, a psychophysicist might measure the sensitivity of subjects to various stimulus parameters in this experiment (e.g., the distance between patches, the amount of noise in the orientations we can tolerate, etc) and this could tell us something about how vision works. The Cohen’s d would be pretty large for all of these. That does not make it trivial but in my view it makes it useless:

Depending on my design choices the estimated effect size may be a very poor reflection of the true effect size. As mentioned earlier, the relative effect size is directly dependent on the between-subject variance – but that in turn depends on the reliability of individual measurements. If each subject only does one trial, the effect of just one attentional lapse or accidental button press in the task is much more detrimental than when they perform 1000 trials, even if the overall rate of lapses/accidents is the same*.

Why does this matter?

In many experiments, the estimate of between-subject variance will be swamped by the within-subject variability. Returning to the example of gender height differences, this is essentially what would happen if you chose to eyeball each person’s height instead of using a tape measure. I’d suspect that is the case for many experiments in social or personality psychology where each measurement is essentially a single quantity (say, timing the speed with which someone walks out of the lab in a priming experiment) rather than being based on hundreds or thousands of trials as in psychophysics. Notoriously noisy measurements are also doubtless the major limiting factor in most neuroimaging experiments. On the other hand, I assume a lot of questionnaire-type results you might have in psychology (such as IQ or the Big Five personality factors) have actually pretty high test-retest reliability and so you probably do get mostly the between-subject variance.

The problem is that often it is very difficult to determine which scenario we are in. In psychophysics, we are often so extremely dominated by the measurement reliability that a knowledge of the “true” population effect size is actually completely irrelevant. This is a critical issue because you cannot use such an effect size for power analysis: If I take an experiment someone did and base my power analysis on the effect size they reported, I am not really powering my experiment to detect a similar effect but a similar design. (This is particularly useless if I then decide to use a different design…)

So next time you see an unusually large Cohen’s d (d>10 or even d>3) ask yourself not simply whether this is a plausible effect but whether this experiment can plausibly estimate the true population effect. If this result is based on a single observation per subject with a highly variable measurement (say, how often Breton drivers stop for female hitchhikers wearing red clothing…), even a d=1 seems incredibly large.

But if it is for a measurement that could have been made more reliable by doubling the amount of data collected in each subject (say, a change in psychometric thresholds), then a very high Cohen’s d is entirely plausible – but it is also pretty meaningless. In this situation, what we should really care about is the absolute effect size (How much does the threshold change? How much does the accuracy drop? etc).

And I must say, I remain unsure whether absolute effect sizes aren’t more useful in general, including for experiments on complex human behaviour, neuroimaging, or drug effects.

* Actually the lapse rate probably increases with a greater number of trials due to subject fatigue, drop in motivation, or out of pure spite. But even that increase is unlikely to be as detrimental as having too few trials.

Of hacked peas and crooked teas

The other day, my twitter feed got embroiled in another discussion about whether or not p-hacking is deliberate and if it constitutes fraud. Fortunately, I then immediately left for a trip abroad and away from my computer, so there was no danger of me being drawn into this debate too deeply and running the risk of owing Richard Morey another drink. However, now that I am back I wanted to elaborate a bit more on why I think the way our field has often approached p-hacking is both wrong and harmful.

What the hell is p-hacking anyway? When I google it I get this Wikipedia article, which uses it as a synonym for “data dredging”. There we already have a term that seems to me more appropriate. P-hacking refers to when you massage your data and analysis methods until your result reaches a statistically significant p-value. I will put it to you that in practice most p-hacking is not necessarily about hacking p-s but about dredging your data until your results fit a particular pattern. That may be something you predicted but didn’t find or could even just be some chance finding that looked interesting and is amplified this way. However, the p-value is usually probably secondary to the act here. The end result may very well be the same in that you continue abusing the data until a finding becomes significant, but I would bet that in most cases what matters to people is not the p-value but the result. Moreover, while null-hypothesis significance testing with p-values is still by far the most widespread way to make inferences about results, it is not the only way. All this fussing about p-hacking glosses over the fact that the same analytic flexibility or data dredging can be applied to any inference, whether it is based on p-values, confidence intervals, Bayes factors, posterior probabilities, or simple summary statistics. By talking of p-hacking we create a caricature that this is somehow a problem specific to p-values. Whether or not NHST is the best approach for making statistical inferences is a (much bigger) debate for another day – but it has little to do with p-hacking.
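To see how little dredging it takes, here is a toy simulation of one particular fork in the road: measuring several outcomes and reporting whichever one "worked". Everything in it is an assumption made for illustration: five independent outcome measures, 20 subjects per group, no true effect anywhere, and a z-test with a known standard deviation of 1. The same logic would apply if the decision rule were a Bayes factor threshold or a confidence interval excluding zero.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_test_p(group_a, group_b):
    """Two-tailed p for a mean difference between equal-sized groups,
    assuming a known SD of 1."""
    n = len(group_a)
    diff = sum(group_a) / n - sum(group_b) / n
    z = diff / sqrt(2 / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(n_outcomes=5, n_per_group=20, n_sims=2000, seed=5):
    """False positive rates with no true effect on any outcome."""
    rng = random.Random(seed)
    planned_hits = 0   # only the first (planned) outcome is tested
    flexible_hits = 0  # any outcome reaching p<0.05 gets reported
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p(a, b))
        planned_hits += p_values[0] < 0.05
        flexible_hits += min(p_values) < 0.05
    return planned_hits / n_sims, flexible_hits / n_sims

planned_rate, flexible_rate = false_positive_rates()
print(f"planned single outcome: {planned_rate:.1%} false positives")
print(f"best of five outcomes:  {flexible_rate:.1%} false positives")
```

The planned analysis stays near the nominal 5%. Picking the best of five independent outcomes pushes the rate towards 1 - 0.95^5, or about 23%, without anyone touching a single data point dishonestly.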

What is more, not only is p-hacking not really about p’s but it is also not really about hacking. Here is the dictionary entry for the term ‘hacking‘. I think we can safely assume that when people say p-hacking they don’t mean that peas are physically being chopped or cut or damaged in any way. I’d also hazard a guess that it’s not meant in the sense of “to deal or cope with” p-values. In fact, the only meaning of the term that seems to come even remotely close is this:

“to modify a computer program or electronic device in a skillful or clever way”

Obviously, what is being modified in p-hacking is the significance or impressiveness of a result, rather than a computer program or electronic device, but we can let this slide. I’d also suggest that it isn’t always done in a skillful or clever way either, but perhaps we can also ignore this. However, the verb ‘hacking’ to me implies that this is done in a very deliberate way. It may even, as with computer hacking, carry the connotation of fraud, of criminal intent. I believe neither of these things are true about p-hacking.

That is not to say that p-hacking isn’t deliberate. I believe in many situations it likely is. People no doubt make conscious decisions when they dig through their data. But the overwhelming majority of p-hacking is not deliberately done to create spurious results that the researcher knows to be false. Anyone who does so would be committing actual fraud. Rather, most p-hacking is the result of confirmation bias combined with analytical flexibility. This leads people to sleepwalk into creating false positives or – as Richard Feynman would have called it – fooling themselves. Simine Vazire already wrote an excellent post about this a few years ago (and you may see a former incarnation of yours truly in the comment section arguing against the point I’m making here… I’d like to claim that it’s cause I have grown as a person but in truth I only exorcised this personality :P). I’d also guess that a lot of p-hacking happens out of ignorance, although that excuse really shouldn’t fly as easily in 2017 as it may have done in 2007. Nevertheless, I am pretty sure people do not normally p-hack because they want to publish false results.

Some may say that it doesn’t matter whether or not p-hacking is fraud – the outcome is the same: many published results are false. But in my view it’s not so simple. First, the solution to these two problems surely isn’t the same. Preregistration and transparency may very well solve the problem of analytical flexibility and data dredging – but it is not going to stop deliberate fraud, nor is it meant to. Second, actively conflating fraud and data dredging implicitly accuses researchers of being deliberately misleading and thus automatically puts them on the defensive. This is hardly a way to have a productive discussion and convince people to do something about p-hacking. You don’t have to look very far for examples of that playing out. Several protracted discussions on a certain Psychology Methods Facebook group come to mind…

Methodological flexibility is a real problem. We definitely should do something about it and new moves towards preregistration and data transparency are at least theoretically effective solutions to improve things. The really pernicious thing about p-hacking is that people are usually entirely unaware of the fact that they are doing it. Until you have tried to do a preregistered study, you don’t appreciate just how many forks in the road you passed along the way (I may blog about my own experiences with that at some point). So implying, however unintentionally, that people are fraudsters is not helping matters.
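To give a feel for how much those forks in the road matter, here is a minimal simulation sketch. It assumes a deliberately simplified scenario that is not taken from any real study: each simulated study measures several independent outcomes that are all pure noise, and the researcher reports a finding if any one of them reaches p < 0.05. The function names, sample sizes, and the number of outcomes are all hypothetical choices for illustration.

```python
import math
import random


def p_value_two_sided(sample):
    """Two-sided z-test p-value against a mean of 0, assuming a known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_outcomes, n_studies=5000, n_per_study=20,
                        alpha=0.05, seed=1):
    """Fraction of null studies in which at least one outcome is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Every outcome is pure noise, so any significant result is a false positive
        p_values = [
            p_value_two_sided([rng.gauss(0, 1) for _ in range(n_per_study)])
            for _ in range(n_outcomes)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


single = false_positive_rate(n_outcomes=1)    # one planned analysis
flexible = false_positive_rate(n_outcomes=5)  # pick the best of five analyses
print(f"one planned test: {single:.3f}, best of five: {flexible:.3f}")
```

With one planned test the false positive rate should sit near the nominal 5%. With five independent chances it should approach 1 - 0.95^5, or roughly 23%, even though nobody in this simulation set out to deceive anyone.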

Preregistration and data sharing have gathered a lot of momentum over the past few years. Perhaps the opinions of some old tenured folks opposed to such approaches no longer carry so much weight now, regardless of how powerful they may be. But I’m not convinced that this is true. Just because there has been momentum now does not mean that these ideas will prevail. It is just as likely that they fizzle out for lack of enthusiasm or because people begin to feel that the effort isn’t worth it. It seems to me that “open science” very much exists in a bubble and I have bemoaned that before. To change scientific practices we need to open the hearts and minds of sceptics to why p-hacking is so pervasive. I don’t believe we will achieve that by preaching to them. Everybody p-hacks if left to their own devices. Preregistration and open data can help protect yourself against your mind’s natural tendency to perceive patterns in noise. A scientist’s training is all about developing techniques to counteract this tendency, and so open practices are just another tool for achieving that purpose.

[Image: fish, chips and mushy peas]
There is something fishy about those pea values…

 

Chris Chambers is a space alien

Imagine you are a radio astronomer and you suddenly stumble across a signal from outer space that appears to be evidence of an extra-terrestrial intelligence. Let’s also assume you already confidently ruled out any trivial artifactual explanation to do with naturally occurring phenomena or defective measurements. How could you confirm that this signal isn’t simply a random fluke?

This is actually the premise of the novel Contact by Carl Sagan, which happens to be one of my favorite books (I never watched the movie but only caught the end which is nothing like the book so I wouldn’t recommend it…). The solution to this problem proposed in the book is that one should quantify how likely the observed putative extraterrestrial signal would be under the assumption that it is the product of random background radiation.

This is basically what a p-value in frequentist null hypothesis significance testing represents. Using frequentist inference requires that you have a pre-specified hypothesis and a pre-specified design. You should have an effect size in mind, determine how many measurements you need to achieve a particular statistical power, and then you must carry out this experiment precisely as planned. This is rarely how real science works and it is often put forth as one of the main arguments why we should preregister our experimental designs. Any analysis that wasn’t planned a priori is by definition exploratory. The most extreme form of this argument posits that any experiment that hasn’t been preregistered is exploratory. While I still find it hard to agree with this extremist position, it is certainly true that analytical flexibility distorts the inferences we can make about an observation.
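As a rough illustration of that calculation, the sketch below computes a one-sided p-value for a single reading, assuming the background radiation is Gaussian noise with a known mean and standard deviation. That noise model and all the numbers are assumptions made up for this example, not anything from the novel or from real radio astronomy.

```python
import math


def one_sided_p_value(observed, noise_mean, noise_sd):
    """P(reading >= observed) if the reading is pure Gaussian background noise."""
    z = (observed - noise_mean) / noise_sd
    # Upper tail of the standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical reading that lies 4.3 standard deviations above the background mean
p = one_sided_p_value(observed=14.3, noise_mean=10.0, noise_sd=1.0)
print(f"p = {p:.2e}")
```

A reading that far above the background comes out at a p-value on the order of 0.00001. Note that this number only says how surprising the reading would be under pure noise. It says nothing yet about how plausible the alternative is.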

This proposed frequentist solution is therefore inappropriate for confirming our extraterrestrial signal. Because the researcher stumbled across the signal, the analysis is by definition exploratory. Moreover, you must also beware of the base-rate fallacy: even an event extremely unlikely under the null hypothesis is not necessarily evidence against the null hypothesis. Even if p=0.00001, a true extraterrestrial signal may be even less likely, say, p=10⁻¹⁰⁰. Even if extra-terrestrial signals are quite common, given the small amount of space, time, and EM bands we have studied thus far, how probable is it that we would just stumble across a meaningful signal?
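The base-rate argument can be made concrete with Bayes' rule. The sketch below makes two loud simplifications: it treats the p-value loosely as the probability of the observation under noise, and it assumes a genuine signal would produce the observation with certainty. The prior of 10⁻¹⁰⁰ is the illustrative figure from the paragraph above, not an estimate of anything.

```python
def posterior_probability(prior, p_data_given_signal, p_data_given_noise):
    """Posterior probability of a real signal after the observation (Bayes' rule)."""
    numerator = prior * p_data_given_signal
    return numerator / (numerator + (1.0 - prior) * p_data_given_noise)


posterior = posterior_probability(
    prior=1e-100,              # assumed chance of stumbling across a real signal
    p_data_given_signal=1.0,   # assumption: a real signal always looks like this
    p_data_given_noise=1e-5,   # the p-value, loosely read as P(data | noise)
)
print(f"posterior = {posterior:.1e}")
```

Under these assumptions the observation raises the probability of a real signal by a factor of about 100,000, but it remains astronomically small. How impressive the p-value is depends entirely on the prior you started from.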

None of that means that exploratory results aren’t important. I think you’d agree that finding credible evidence of an extra-terrestrial intelligence capable of sending radio transmissions would be a major discovery. The other day I met up with Rob McIntosh, one of the editors for Registered Reports at Cortex, to discuss the distinction between exploratory and confirmatory research. A lot of the criticism of preregistration focuses on whether it puts too much emphasis on hypothesis-driven research and whether it in turn devalues or marginalizes exploratory studies. I have spent a lot of time thinking about this issue and (encouraged by discussions with many proponents of preregistration) I have come to the conclusion that the opposite is true: by emphasizing which parts of your research are confirmatory I believe exploration is actually valued more. Under conventional scientific publishing, many studies are written up in a way that pretends they were hypothesis-driven when in truth they weren’t. Probably for a lot of published research the truth lies somewhere in the middle.

So preregistration just keeps you honest with yourself and if anything it allows you to be more honest about how you explored the data. Nobody is saying that you can’t explore, and in fact I would argue you should always include some exploration. Whether it is an initial exploratory experiment that you did that you then replicate or test further in a registered experiment, or whether it is a posthoc robustness test you do to ensure that your registered result isn’t just an unforeseen artifact, some exploration is almost always necessary. “If we knew what we were doing, it would not be called research, would it?” (a quote by Albert Einstein, apparently).

One idea I discussed with Rob is whether there should be a publication format that specifically caters to exploration (Chris Chambers has also mentioned this idea previously). Such Exploratory Reports would allow researchers to publish interesting and surprising findings without first registering a hypothesis. You may think this sounds a lot like what a lot of present day high impact papers are like already. The key difference is that these Exploratory Reports would contain no inferential statistics and critically they are explicit about the fact that the research is exploratory – something that is rarely the case in conventional studies. However, this idea poses a critical challenge: on the one hand you want to ensure that the results presented in such a format are trustworthy. But how do you ensure this without inferential statistics?

Proponents of the New Statistics (which aren’t actually “new” and it is also questionable whether you should call them “statistics”) will tell you that you could just report the means/medians and confidence intervals, or perhaps the whole distributions of data. But that isn’t really helping. Inspecting confidence intervals and how far they are from zero (or another value of no interest) is effectively the same thing as a significance test. Even merely showing the distribution of observations isn’t really helping. If a result is so blatantly obvious that it convinces you by visual inspection (the “inter-ocular trauma test”), then formal statistical testing would be unnecessary anyway. If the results are even just a little subtler, it can be very difficult to decide whether the finding is interesting. So the way I see it, we either need a way to estimate statistical evidence, or you need to follow up the finding with a registered, confirmatory experiment that specifically seeks to replicate and/or further test the original exploratory finding.
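The claim that inspecting a confidence interval is effectively a significance test can be checked directly. This is a minimal sketch for the simplest case I can think of, a z-test on a sample mean with a known standard deviation. The function name and the example data are made up for illustration.

```python
import math
from statistics import NormalDist, mean


def ci_and_p(sample, sd=1.0, level=0.95):
    """Confidence interval for the mean and two-sided p-value against zero."""
    se = sd / math.sqrt(len(sample))          # standard error with known SD
    m = mean(sample)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (m - z_crit * se, m + z_crit * se)
    p = math.erfc(abs(m / se) / math.sqrt(2))
    return ci, p


# Two hypothetical samples of 16 measurements each
for sample in ([0.5] * 16, [0.4] * 16):
    (low, high), p = ci_and_p(sample)
    excludes_zero = low > 0 or high < 0
    print(f"CI = ({low:.2f}, {high:.2f}), p = {p:.3f}, excludes zero: {excludes_zero}")
```

In this setting the 95% interval excludes zero exactly when p < 0.05, because both are computed from the same mean and standard error. Reporting the interval instead of the p-value changes the presentation, not the inference.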

In the case of our extra-terrestrial signal you may plan a new measurement. You know the location in the sky where the signal came from, so part of your preregistered methods is to point your radio telescope at the same point. You also have an idea of the signal strength, which allows you to determine the number of measurements needed to have adequate statistical power. Then you carry out this experiment, sticking meticulously to your planned recipe. Finally, you report your result and the associated p-value.
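The step from signal strength to number of measurements could look something like the sketch below. It assumes a one-sided z-test with known noise and expresses the signal strength as a standardized effect size (signal divided by the noise standard deviation). These are textbook simplifications chosen for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def measurements_needed(effect_size, alpha=0.05, power=0.8):
    """Sample size for a one-sided z-test to reach the desired power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # critical value for the significance level
    z_beta = z.inv_cdf(power)       # quantile matching the desired power
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


# Hypothetical signal strengths in units of the noise standard deviation
for d in (0.5, 0.25):
    print(f"effect size {d}: {measurements_needed(d)} measurements")
```

Halving the expected signal strength roughly quadruples the number of measurements you need, which is why the estimate from the exploratory observation matters so much for the plan.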

Sounds good in theory. In practice, however, this is not how science typically works. Maybe the signal isn’t continuous. There could be all sorts of reasons why the signal may only be intermittent, be it some interstellar dust clouds blocking the line of transmission, the transmitter pointing away from Earth due to the rotation of the aliens’ home planet, or even simply the fact that the aliens are operating their transmitter on a random schedule. We know nothing about what an alien species, let alone their civilization, may be like. Who is to say that they don’t just fall into random sleeping periods at irregular intervals?

So some exploratory, flexible analysis is almost always necessary. If you are too rigid in your approach, you are very likely to miss important discoveries. At the same time, you must be careful not to fool yourself. If we are really going down the route of Exploratory Reports without any statistical inference we need to come up with a good way to ensure that such exploratory findings aren’t mostly garbage. I think in the long run the only way to do so is to replicate and test results in confirmatory studies. But this could already be done as part of a Registered Report in which your design is preregistered. Experiment 1 would be exploratory without any statistical inference but simply reporting the basic pattern of results. Experiment 2 would then be preregistered and replicate or test the finding further.

However, Registered Reports can take a long time to publish. This may in fact be one of the weak points about this format that may stop the scientific community from becoming more enthusiastic about them. As long as there is no real incentive to doing slow science, the idea that you may take two or three years to publish one study is not going to appeal to many people. It will stop early career researchers from getting jobs and research funding. It also puts small labs in poorer universities at a considerable disadvantage compared to researchers with big grants, big data, and legions of research assistants.

The whole point of Exploratory Reports would be to quickly push out interesting observations. In some ways, this is then exactly what brief communications in high impact journals are currently for. I don’t think it will serve us well to replace snappy (and likely untrue) high impact findings based on inappropriate statistical inferences with snappy (and likely untrue) exploratory findings without statistical inference. If the purpose of Exploratory Reports is solely to provide an outlet for quick publication of interesting results, we still have the same kind of skewed incentive structure as now. Also, while removing statistical inference from our exploratory findings may be better statistical practice, I am not convinced that it is better scientific practice unless we have other ways of ensuring that these exploratory results are kosher.

The way I see it, the only way around this dilemma is to finally stop treating publications as individual units. Science is by nature a lengthy, incremental process. Yes, we need exciting discoveries to drive science forward. At the same time, replicability and robustness of our discoveries are critical. In order to combine these two needs I believe research findings should not be seen as separate morsels but as a web of interconnected results. A single Exploratory Report (or even a bunch of them) could serve as the starting point. But unless they are followed up by Registered Reports replicating or scrutinizing these findings further, they are not all that meaningful. Only once replications and follow-up experiments have been performed does the whole body of evidence for a finding take shape. A search on PubMed or Google Scholar would not merely spit out the original paper but a whole tree of linked experiments.

The perceived impact and value of a finding thus would be related to how much of an interconnected body of evidence it has generated rather than whether it was published in Nature or Science. Critically, this would allow people to quickly publish their exciting finding and thus avoid being deadlocked by endless review processes and disadvantaged compared to other people who can afford to do more open science. At the same time, they would be incentivized to conduct follow-up studies. Because a whole body of related literature is linked, it would also be an incentive for others to conduct replications or follow-up experiments on your exploratory finding.

There are obviously logistic and technical challenges with this idea. The current publication infrastructure still does not really allow for this to work. This is not a major problem however. It seems entirely feasible to implement such a system. The bigger challenge is how to convince the broader community and publishers and funders to take this on board.

[Image: the Arecibo message]