On the magic of independent piloting

TL;DR: Never decide to run a full experiment based on whether one of the small pilots in which you tweaked your paradigm supported the hypothesis. Use small pilots only to ensure that the experiment produces high-quality data, judged by criteria that are unrelated to your hypothesis.

Sorry for the bombardment with posts on data peeking and piloting. I felt this would have cluttered up the previous post so I wrote a separate one. After this one I will go back to doing actual work though, I promise! That grant proposal I should be writing has been neglected for too long…

In my previous post, I simulated what happens when you conduct inappropriate pilot experiments by running a small experiment and then continuing data collection if the pilot produces significant results. This is really just data peeking and it shouldn’t come as much of a surprise that it inflates false positives and massively skews effect size estimates. I hope most people realize that this is a terrible thing to do because it makes your final results dependent on the outcome of the pilot. Quite possibly, some people will have learned about this in their undergrad stats classes. As one of my colleagues put it, “if it ends up in the final analysis it is not a pilot.” Sadly, I don’t think this is as widely known as it should be. I was not kidding when I said that I have seen it happen before or overheard people discussing having done this type of inappropriate piloting.

But anyway, what is an appropriate pilot then? In my previous post, I suggested you should redo the same experiment but restart data collection. You now stick to the methods that gave you a significant pilot result. Now the data set used to test your hypothesis is completely independent, so it won’t be skewed by the pre-selected pilot data. Put another way, your exploratory pilot allows you to estimate a prior, and your full experiment seeks to confirm it. Surely there is nothing wrong with that, right?

I’m afraid there is, and it is actually obvious why: your small pilot experiment is underpowered to detect real effects, especially small ones. So if you use inferential statistics to determine whether a pilot experiment “worked,” this small pilot is biased towards detecting larger effect sizes. Importantly, this does not mean that the effect size estimate of your independent full experiment is biased. The problem is different: if you only continue the experiment when the pilot was significant, you are ignoring all of the pilots that would have shown true effects but which – due to the large uncertainty (low power) of the pilot – failed to do so purely by chance. Naturally, the proportion of these false negatives becomes smaller the larger you make your pilot sample – but since pilots are by definition small, the error rate is pretty high in any case. For example, for a true effect size of δ = 0.3, the false negative rate at a pilot sample of 2 is 95%. With a pilot sample of 15, it is still as high as 88%. Just for illustration, I show below the false negative rates (1 − power) for three different true effect sizes. Even for quite decent effect sizes, the sensitivity of a small pilot is abysmal:

False Negatives
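
These miss rates are easy to verify yourself. Below is a minimal Python sketch (the code linked at the end of this post is MATLAB; the function name and simulation counts here are mine) that estimates how often a small pilot misses a perfectly real effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_negative_rate(delta, n_pilot, n_sims=20000, alpha=0.05):
    """Fraction of small 'pilots' that miss a true effect of size delta."""
    a = rng.normal(0.0, 1.0, (n_sims, n_pilot))      # control group
    b = rng.normal(delta, 1.0, (n_sims, n_pilot))    # group with the true shift
    _, p = stats.ttest_ind(a, b, axis=1)             # one two-sample t-test per simulation
    return np.mean(p >= alpha)

for n in (2, 5, 10, 15):
    print(n, round(false_negative_rate(0.3, n), 2))
```

With δ = 0.3 this should reproduce the numbers above: roughly 95% misses at a pilot sample of 2 and still about 88% at 15 per group.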

Thus, if you only pick pilot experiments with significant results to turn into real experiments, you are deluding yourself into thinking that the methods you piloted are somehow better (or “precisely calibrated”). Remember, this is based on a theoretical scenario in which the effect is real and of fixed strength. Every single pilot experiment you ran investigated the same underlying phenomenon and any difference in outcome is purely due to chance – the tweaking of your methods had no effect whatsoever. You waste all manner of resources piloting methods whose apparent success or failure was nothing but noise.

So frequentist inferential statistics on pilot experiments are generally nonsense. Pilots are by nature exploratory. You should only determine significance for confirmatory results. But what are these pilots good for? Perhaps we just want to have an idea of what effect size they can produce and then do our confirmatory experiments for those methods that produce a reasonably strong effect?

I’m afraid that won’t do either. I simulated this scenario in a similar manner as in my previous post. 100,000 times I generated two groups (with a full sample size of n = 80, although the full sample size isn’t critical for this conclusion). Both groups are drawn from a population with standard deviation 1, but one group has a mean of zero while the other’s mean is shifted by 0.3 – so we have a true effect here (the actual magnitude of this true effect size is irrelevant for the conclusions). In each of the 100,000 simulations, the researcher first runs a number of pilot subjects per group (plotted on the x-axis). Only if the effect size estimate for this pilot exceeds a certain criterion level does the researcher run an independent, full experiment. The criterion is either 50%, 100%, or 200% of the true effect size. Obviously, the researcher cannot know the true effect size; I simply use these criteria as something a researcher might plausibly do in a real-world situation. (For the true effect size I used here, these criteria correspond to d = 0.15, d = 0.3, and d = 0.6, respectively.)
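
To make this selection rule concrete, here is a Python sketch of the pilot stage (the full simulation code linked below is MATLAB; names, seed, and simulation counts here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def pilot_miss_rate(true_d, n_pilot, criterion, n_sims=20000):
    """Fraction of pilots whose observed Cohen's d falls below the criterion,
    i.e. pilots of a perfectly real effect that never become full experiments."""
    a = rng.normal(0.0, 1.0, (n_sims, n_pilot))        # control group
    b = rng.normal(true_d, 1.0, (n_sims, n_pilot))     # shifted group
    pooled_sd = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    d = (b.mean(axis=1) - a.mean(axis=1)) / pooled_sd  # observed effect size
    return np.mean(d < criterion)

# criteria at 50%, 100%, and 200% of a true effect of 0.3, pilot n = 10 per group
for crit in (0.15, 0.3, 0.6):
    print(crit, round(pilot_miss_rate(0.3, 10, crit), 2))
```

With the criterion set at the true effect size you discard about half of all pilots; even the lax criterion of d = 0.15 throws away over a third at a pilot of 10 per group.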

The results are below. The graph on the left once again plots the false negative rates against the pilot sample size. A false negative here is not based on significance but on effect size, that is, any simulation in which the observed d fell below the criterion. When the criterion is equal to the true effect size, the false negative rate is constant at 50%. The reason for this is obvious: each simulation is drawn from a population centered on the true effect of 0.3, so half of these simulations will exceed that value. However, when the criterion is not equal to the true effect, the false negative rates depend on the pilot sample size. If the criterion is more lenient than the true effect, false negatives decrease with pilot sample size; if it is stricter, they increase. Either way, the false negative rates are substantially greater than the 20% mark you would have with an adequately powered experiment. So you will still delude yourself a considerable number of times if you only conduct the full experiment when your pilot shows a particular effect size. Even if your criterion is lax (and d = 0.15 for a pilot sounds pretty lax to me), you are missing a lot of true results. Again, remember that all of the pilot experiments here investigated a real effect of fixed size. Tweaking the method makes no difference – the difference between simulations is purely due to chance.

Finally, the graph on the right shows the mean effect sizes estimated by your completed experiments (not the absolute effect sizes this time!). The criterion you used in the pilot makes no difference here (all colors are at the same level), which is reassuring. However, all is not necessarily rosy. The open circles plot the effect size you get under publication bias, that is, if you only publish the significant experiments with p < 0.05. This estimate is clearly inflated compared to the true effect size of 0.3. The asterisks plot the effect size estimate if you take all of the experiments. This is the situation you would have (Chris Chambers will like this) if you did a Registered Report for your full experiment and publication of the results were guaranteed irrespective of whether or not they are significant. On average, this is an accurate estimate of the true effect.

Simulation Results

Again, these are only the experiments that were lucky enough to go beyond the piloting stage. You already wasted a lot of time, effort, and money to get here. While the final outcome is solid if publication bias is minimized, you have thrown a considerable number of good experiments into the trash. You’ve also misled yourself into believing that you conducted a valid pilot experiment that honed the sensitivity of your methods when in truth all your pilot experiments were equally mediocre.

I have had a few comments from people saying that they are only interested in large effect sizes and surely that means they are fine? I’m afraid not. As I said earlier, the principle here does not depend on the true effect size. It is solely a function of the low sensitivity of the pilot experiment. Even with a large true effect, your outcome-dependent pilot is a blind chicken that errs around in the dark until it is lucky enough to hit a true effect more or less by chance. For this to happen you must use a very lenient criterion to turn your pilot into a real experiment. This, however, also means that if the null hypothesis is true, an unacceptable proportion of your pilots will produce false positives. Again, remember that your piloting is completely meaningless – you’re simply chasing noise here. Your decision whether to go from pilot to full experiment is (almost) completely arbitrary, even when the true effect is large.

So, for instance, when the true effect is a whopping δ = 1 and you are using d > 0.15 as a criterion in your pilot of 10 subjects (which is already large for the pilots I typically hear about), your false negative rate is nice and low at ~3%. But critically, if the null hypothesis of δ = 0 is true, your false positive rate is ~37%. How often you fool yourself by turning a pilot into a full experiment therefore depends on the base rate. If you give the hypothesis a 50:50 chance of being true, almost one in three of the pilots you continue will lead you to chase a false positive. If these odds are lower (which they very well may be), the situation becomes increasingly worse.
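
For those who want to check the arithmetic, here is a back-of-the-envelope version in Python. It uses a normal approximation for the sampling distribution of Cohen’s d (SD ≈ √(2/n)) rather than the full simulation – a simplification, but it lands close to the figures quoted above:

```python
from math import sqrt
from scipy.stats import norm

n_pilot, criterion, true_d = 10, 0.15, 1.0
sd_d = sqrt(2.0 / n_pilot)   # approximate sampling SD of Cohen's d

# pilots that pass the criterion under the null vs. under the true effect
go_if_null = 1 - norm.cdf(criterion / sd_d)              # false positive pilots
go_if_real = 1 - norm.cdf((criterion - true_d) / sd_d)   # correctly continued pilots

prior = 0.5  # give the hypothesis a 50:50 chance of being true
p_chasing_noise = (1 - prior) * go_if_null / (
    (1 - prior) * go_if_null + prior * go_if_real)
print(round(go_if_null, 2), round(p_chasing_noise, 2))   # 0.37 0.28
```

So roughly 37% of null pilots pass the lenient criterion, and with 50:50 odds a bit over a quarter of your continued experiments chase noise – "almost one in three."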

What should we do then? In my view, there are two options. Either run a well-powered confirmatory experiment that tests your hypothesis, based on an effect size you consider meaningful. This is the option I would choose if resources are a critical factor. Alternatively, if you can afford the investment of time, money, and effort, you could run an exploratory experiment with a reasonably large sample size (that is, larger than a pilot). If you must, tweak the analysis at the end to figure out what hides in the data. Then run a well-powered replication experiment to confirm the result. The power for this should be high enough to detect effects that are considerably weaker than the exploratory effect size. This exploratory experiment may sound like a pilot but it isn’t, because it has decent sensitivity and the only resource you might be wasting is your time* during the exploratory analysis stage.

The take-home message here is: don’t make your experiments dependent on whether your pilot supported your hypothesis, even if you use independent data. It may seem like a good idea but it’s tantamount to magical thinking. Chances are that you did not refine your method at all. Again (and I apologize for the repetition but it deserves repeating): this does not mean all small piloting is bad. If your pilot is about assuring that the task isn’t too difficult for subjects, that your analysis pipeline works, that the stimuli appear as you intended, that the subjects aren’t using a different strategy to perform the task, or quite simply to reduce the measurement noise, then it is perfectly valid to run a few people first and it can even be justified to include them in your final data set (although that last point depends on what you’re studying). The critical difference is that the criteria for green-lighting a pilot experiment are completely unrelated to the hypothesis you are testing.

(* Well, your time and the carbon footprint produced by your various analysis attempts. But if you cared about that, you probably wouldn’t waste resources on meaningless pilots in the first place, so this post is not for you…)

MatLab code for this simulation.

On the worthlessness of inappropriate piloting

So this post is just a brief follow-up to my previous post on data peeking. I hope it will be easy to see why the two are closely related:

Today I read this long article about the RRR of the pen-in-mouth experiments – another in a growing list of failures to replicate classical psychology findings. I was quite taken aback by one comment in it: the assertion that these classical psychology experiments (in particular the social priming ones) had been “precisely calibrated to elicit tiny changes in behavior.” It is an often-repeated argument to explain why findings fail to replicate – the “replicators” simply do not have the expertise and/or skill to redo these delicate experiments. And yes, I am entirely willing to believe that I’d be unable to replicate a lot of experiments outside my area, say, finding subatomic particles or even (to take an example from my general field) difficult studies on clinical populations.

But what does this statement really mean? How were these psychology experiments “calibrated” before they were run? What did the authors do to nail down the methods before they conducted the studies? It implies that extensive pilot experiments were conducted first. I am in no position to say that this is what the authors of these psychology studies did during their piloting stage, but one possibility is that several small pilot experiments were run and the experimental parameters were tweaked until a significant result supporting the hypothesis was observed. Only then did they continue the experiment and collect a full data set that included the pilot data. I have seen and heard of people who did precisely this sort of piloting until the “experiment worked.”

So, what actually happens when you “pilot” experiments to “precisely calibrate” them? I decided to simulate this and the results are in the graph below (each data point is based on 100,000 simulations). In this simulation, an intrepid researcher first runs a small number of pilot subjects per group (plotted on x-axis). If the pilot fails to produce significant results at p < 0.05, the experiment is abandoned and the results are thrown in the bin never to again see the light of day. However, if the results are significant, the eager researcher collects more data until the full sample in each group is n = 20, 40, or 80. On the y-axis I plotted the proportion of these continued experiments that produced significant results. Note that all simulated groups were drawn from a normal distribution with mean 0 and standard deviation 1. Therefore, any experiments that “worked” (that is, they were significant) are false positives. In a world where publication bias is still commonplace, these are the findings that make it into journals – the rest vanish in the file-drawer.


False Positives

As you can see, such a scheme of piloting until the experiment “works” can produce an enormous number of false positives among the completed experiments. Perhaps this is not really all that surprising – after all, this is just another form of data peeking. Critically, I don’t think this is unrealistic. I’d wager this sort of thing is not at all uncommon. And doesn’t it seem harmless? After all, we are only peeking once! If a pilot experiment “worked,” we continue sampling until the sample is complete.

Well, even under these seemingly benign conditions false positives can be inflated dramatically. The black curve is for the case where the final sample size of the completed studies is only 20. This is the worst case and it is perhaps unrealistic. If the pilot experiment consists of 10 subjects (that is, half the full sample), about a third of the completed results will be flukes. But even in the other cases, when only a handful of pilot subjects are collected compared to the much larger full samples, false positives are well above 5%. In other words, whenever you pilot an experiment and decide that it’s “working” because it seems to support your hypothesis, you are already skewing the final outcome.

Of course, the true false positive rate, taken across the whole set of 100,000 pilots that were run, would be much lower (0.05 times the rates I plotted above, to be precise, because under the null hypothesis only the 5% of “pilots” that came out significant were continued in the first place). However, since we cannot know how much of this inappropriate piloting went on behind the scenes, knowing this isn’t particularly helpful.

More importantly, we aren’t only interested in the false positive rate. A lot of researchers will care about the effect size estimates of their experiments. Crucially, this form of piloting will substantially inflate these effect size estimates as well and this may have even worse consequences for the interpretation of these experiments. In the graph below, I plot the effect sizes (the mean absolute Cohen’s d) for the same simulations for which I showed you the false positive rates above. I use the absolute effect size because the sign is irrelevant – the whole point of this simulation exercise is to mimic a full-blown fishing expedition via inappropriate “piloting.” So our researcher will interpret a significant result as meaningful regardless of whether d is positive or negative.

Forgive the somewhat cluttered plot but it’s not that difficult to digest really. The color code is the same as for the previous figure. The open circles and solid lines show you the effect sizes of the experiments that “worked,” that is, the ones for which we completed data collection and which came out significant. The asterisks and dashed lines show the effect sizes for all of the global false positives, that is, all the simulations with p < 0.05 after the pilot, computed on the full data set as if you had completed all of these experiments. Finally, the crosses and dotted lines show the effect sizes you get for all simulations (ignoring inferential statistics). This is just given as a reference.

Effect Sizes

Two things are notable about all this. First, effect size estimates increase with “pilot” sample size for the set of global false positives (asterisks) but not for the other curves. This is because the “pilot” sample size determines how strongly the fluke pilot effect contributes to the final effect size. More importantly, the effect size estimates for those experiments with significant pilots that also “worked” after completion are massively exaggerated (open circles); compare them against the baseline effect (crosses) to see the degree of exaggeration. The absolute effect size estimate depends on the full sample size. At the smallest full sample size (n = 20, black curve) the effect sizes are as high as d = 0.8. Critically, the degree of exaggeration does not depend on how large your pilot sample was. Whether your “pilot” had only 2 or 15 subjects, the average effect size estimate is around 0.8.

The reason for this is that the smaller the pilot experiment, the more underpowered it is. Since it is a condition for continuing the experiment that the pilot must be significant, the pilot effect size must be considerably larger for small pilots than for larger ones. Because the true effect size is always zero, this cancels out in the end, so the final effect size estimate is constant regardless of the pilot sample size. But in any case, the effect size estimates you got from your precisely calibrated inappropriately piloted experiments are enormously overrated. It shouldn’t be much of a surprise if these don’t replicate and if post-hoc power calculations based on these effect sizes suggest low power (of course, you should never use post-hoc power in that way, but that’s another story…).
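
Both results – the inflated false positive rate among completed experiments and the exaggerated but pilot-size-independent effect size – can be reproduced with a short simulation. Here is a Python sketch (my own code, not the linked MATLAB; numbers are Monte Carlo estimates):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeking_pilot(n_pilot, n_full, n_sims=50000, alpha=0.05):
    """Pilots under the null: continue to n_full per group only when the pilot
    is significant, then analyze the combined data (pilot included)."""
    a = rng.normal(0.0, 1.0, (n_sims, n_full))
    b = rng.normal(0.0, 1.0, (n_sims, n_full))
    _, p_pilot = stats.ttest_ind(a[:, :n_pilot], b[:, :n_pilot], axis=1)
    go = p_pilot < alpha                      # only these pilots "worked"
    a, b = a[go], b[go]
    _, p_full = stats.ttest_ind(a, b, axis=1)
    pooled_sd = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    d = np.abs(b.mean(axis=1) - a.mean(axis=1)) / pooled_sd
    sig = p_full < alpha
    # proportion of continued experiments that stay significant,
    # and the mean absolute effect size of those "successes"
    return sig.mean(), d[sig].mean()

rate, mean_d = peeking_pilot(n_pilot=10, n_full=20)
print(rate, mean_d)  # far above the nominal 0.05, with |d| grossly inflated
```

With a pilot of 10 and a full sample of 20 per group, about a third of the continued experiments stay significant and their mean |d| sits around 0.8 – despite a true effect of exactly zero.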

So what should we do? Ideally you should just throw away the pilot data, preregister the design, and restart the experiment anew with the methods you piloted. In this case the results are independent and only the methods are shared. Importantly, there is nothing wrong with piloting in general. After all, I had a previous post praising pilot experiments. But piloting should be about ensuring that the methods are effective in producing clean data. There are many situations in which an experiment seems clever and elegant in theory but once you actually start it in practice you realize that it just can’t work. Perhaps the participants don’t use the task strategy you envisioned. Or they simply don’t perceive the stimuli the way they were intended. In fact, this happened to us recently and we may have stumbled over an interesting finding in its own right (but this must also be confirmed by a proper experiment!). In all these situations, however, the decision on the pilot results is unrelated to the hypothesis you are testing. If they are related, you must account for that.

MatLab code for these simulations is available. As always, let me know if you find errors. (To err is human, to have other people check your code divine?)

Realistic data peeking isn’t as bad as you* thought – it’s worse

Unless you’ve been living under a rock, you have probably heard of data peeking – also known as “optional stopping”. It’s one of those nasty questionable research practices that could produce a body of scientific literature contaminated by widespread spurious findings and thus lead to poor replicability.

Data peeking is when you run a Frequentist statistical test every time you collect a new subject/observation (or after every few observations) and stop collecting data when the test comes out significant (say, at p < 0.05). Doing this clearly does not accord with good statistical practice because under the Frequentist framework you should plan your final sample size a priori based on power analysis, collect data until you have that sample size, and never look back (but see my comment below for more discussion of this…). What is worse, under the aforementioned data peeking scheme you can be theoretically certain to reject the null hypothesis eventually. Even if the null hypothesis is true, sooner or later you will hit a p-value smaller than the significance threshold.

Until recently, many researchers, at least in psychological and biological sciences, appeared to be unaware of this problem and it isn’t difficult to see that this could contribute to a prevalence of false positives in the literature. Even now, after numerous papers and blog posts have been written about this topic, this problem still persists. It is perhaps less common but I still occasionally overhear people (sometimes even in their own public seminar presentations) saying things like “This effect isn’t quite significant yet so we’ll see what happens after we tested a few more subjects.” So far so bad.

Ever since I heard about this issue (and I must admit that I was also unaware of the severity of this problem back in my younger, carefree days), I have felt somehow dissatisfied with how this issue has been described. While it is a nice illustration of a problem, the models of data peeking seem extremely simplistic to me. There are two primary aspects of this notion that in my opinion just aren’t realistic. First, the notion of indefinite data collection is obviously impossible, as this would imply having an infinite subject pool and other bottomless resources. However, even if you allow for a relatively manageable maximal sample size at which a researcher may finally stop data collection even when the test is not significant, the false positive rate is still massively inflated.

The second issue is therefore a bigger problem: the simple data peeking procedure described above seems grossly fraudulent to me. I would have thought that even if the researcher in question were unaware of the problems with data peeking, they would nonetheless feel that something isn’t quite right about checking for significant results after every few subjects and continuing until they get them. As always, I may be wrong about this, but I sincerely doubt this is what most “normal” people do. Rather, I believe people would be more likely to peek at the data to see if the results are significant, and only continue testing if the p-value “looks promising” (say, 0.05 < p < 0.1). This sampling plan sounds a lot more like what may actually happen. So I wanted to find out how this sort of sampling scheme would affect results. I have no idea if anyone already did something like this. If so, I’d be grateful if you could point me to that analysis.

So what I did is the following: I used Pearson’s correlation as the statistical test. In each iteration of the simulation I generated a data set of 150 subjects, each with two uncorrelated Gaussian variables, let’s just pretend it’s the height of some bump on the subjects’ foreheads and a behavioral score of how belligerent they are. 150 is thus the maximal sample size, assuming that our simulated phrenologist – let’s call him Dr Peek – would not want to test more than 150 subjects. However, Dr Peek actually starts with only 3 subjects and then runs the correlation test. In the simplistic version of data peeking, Dr Peek will stop collecting data if p < 0.05; otherwise he will collect another subject until p < 0.05 or 150 subjects are eventually reached. In addition, I simulated three other sampling schemes that feel more realistic to me. In these cases, Dr Peek will also stop data collection when p < 0.05 but he will also stop when p is either greater than 0.1, greater than 0.3, or greater than 0.5. I repeated each of these simulations 1000 times.
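
For readers who want to play with Dr Peek themselves, here is a Python sketch of these sampling schemes (the linked code is MATLAB; this version is mine and unoptimized):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def dr_peek(rho, p_giveup, n_max=150, n_start=3):
    """One experiment: add a subject at a time, claim an effect when p < .05,
    give up when p > p_giveup, or stop empty-handed at n_max."""
    # correlated Gaussian pairs with true correlation rho
    x = rng.standard_normal(n_max)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n_max)
    for n in range(n_start, n_max + 1):
        _, p = stats.pearsonr(x[:n], y[:n])
        if p < 0.05:
            return True        # "significant" - data collection stops
        if p > p_giveup:
            return False       # p looks hopeless - Dr Peek gives up
    return False

def positive_rate(rho, p_giveup, n_sims=1000):
    return np.mean([dr_peek(rho, p_giveup) for _ in range(n_sims)])

# under the null: simplistic peeking (never give up) vs. giving up at p > .1
print(positive_rate(0.0, np.inf), positive_rate(0.0, 0.1))
```

The first number should land near the massively inflated false positive rate of the simplistic scheme, the second close to the nominal 0.05 for the "marginal p only" scheme.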

The results are in the graph below. The four sampling schemes are denoted by the different colors. On the y-axis I plotted the proportion of the 1000 simulations in which the final outcome (that is, whenever data collection was stopped) yielded p < 0.05. The scenario I described above is the leftmost set of data points in which the true effect size, the correlation between forehead bump height and belligerence, is zero. Confirming previous reports on data peeking, the simplistic case (blue curve) has an enormously inflated false positive rate of around 0.42. Nominally, the false positive rate should be 0.05. However, under the more “realistic” sampling schemes the false positive rates are far lower. In fact, for the case where data collection only continues while p-values are marginal (0.05 < p < 0.1), the false positive rate is 0.068, only barely above the nominal rate. For the other two schemes, the situation is slightly worse but not by that much. So does this mean that data peeking isn’t really as bad as we have been led to believe?


Hold on, not so fast. Let us now look at what happens in the rest of the plot. I redid the same kind of simulation for a range of true effect sizes up to ρ = 0.9. The x-axis shows the true correlation between forehead bump height and belligerence. Unlike in the above case where the true correlation is zero, the y-axis now shows statistical power, the proportion of simulations in which Dr Peek concluded correctly that there actually is a correlation. All four curves rise steadily, as one might expect with stronger true effects. The blue curve showing the simplistic data peeking scheme rises very steeply and reaches maximal power at a true correlation of around 0.4. The slopes of the other curves are much more shallow, and while the power at strong true correlations is reasonable for at least two of them, they don’t reach the lofty heights of the simplistic scheme.

This feels somewhat counter-intuitive at first, but it makes sense: when the true correlation is strong, the probability of high p-values is low. However, at the very small sample sizes we start out with, even a strong correlation is not always detectable – the confidence interval of the estimated correlation is very wide. Thus there will be a relatively large proportion of p-values that exceed the give-up cut-off and terminate data collection prematurely without rejecting the null hypothesis.

Critically, these two things, inflated false positive rates and reduced statistical power to detect true effects, dramatically reduce the sensitivity of any analysis that is performed under these realistic data peeking schemes. In the graph below, I plot the sensitivity (quantified as d’) of the analysis. Larger d’ means there is a more favorable ratio between the number of simulations in which Dr Peek correctly detected a true effect and how often he falsely concluded there was a correlation when there wasn’t one. Sensitivity for the simplistic sample scheme (blue curve) rises steeply until power is maximal. However, sensitivity for the other sampling schemes starts off close to zero (no sensitivity) and only rises fairly slowly.


For reference, compare this to the situation under desired conditions, that is, without questionable research practices, with adequate statistical power of 0.8 and the nominal false positive rate of 0.05: in this case the sensitivity would be d′ = 2.49, higher than any of the realistic sampling schemes ever get. Again, this is not really surprising because data collection will typically be terminated at sample sizes that give far less than 0.8 power. But in any case, this is bad news. Even though the more realistic forms of data peeking don’t inflate false positives as massively as the most pessimistic predictions, they impede the sensitivity of experiments dramatically and are thus very likely to produce mostly rubbish. It should come as no surprise that many findings fail to replicate.
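
For completeness, d′ here is the standard signal-detection measure, the difference between the z-transformed hit rate (power) and the z-transformed false positive rate – consistent with the d′ = 2.49 benchmark quoted above:

```python
from scipy.stats import norm

def sensitivity(hit_rate, fp_rate):
    """Signal-detection sensitivity d' = z(hits) - z(false positives)."""
    return norm.ppf(hit_rate) - norm.ppf(fp_rate)

# the 'desired conditions' benchmark: power 0.8 at nominal alpha 0.05
print(round(sensitivity(0.80, 0.05), 2))  # 2.49
```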

Obviously, what I call here more realistic data peeking is not necessarily a perfect simulation of how data peeking may work in practice. For one thing, I don’t think Dr Peek would have a fixed cut-off of p > 0.1 or p > 0.5. Rather, such a cut-off might be determined on a case-by-case basis, dependent on the prior expectation Dr Peek has that the experiment should yield significant results. (Dr Peek may not use Bayesian statistics, but like all of us he clearly has Bayesian priors.) In some cases, he may be very confident that there should be an effect and he will continue testing for a while but then finally give up when the p-value is very high. For other hypotheses that he considered to be risky to begin with, he may not be very convinced even by marginal p-values and thus will terminate data collection when p > 0.1.

Moreover, it is probably also unrealistic that Dr Peek would start with a sample size of 3. Rather, it seems more likely that he would have a larger minimal sample size in mind, for example 20 and collect that first. While he may have been peeking at the data before he completed testing 20 subjects, there is nothing wrong with that provided he doesn’t stop early if the result becomes significant. Under these conditions the situation becomes somewhat better but the realistic data peeking schemes still have reduced sensitivity, at least for lower true effect sizes, which are presumably far more prevalent in real world situations. The only reason that sensitivity goes up fairly quickly to reasonable levels is that with the starting sample size of 20 subjects, the power to detect those stronger correlations is already fairly high – so in many cases data collection will be terminated as soon as the minimum sample is completed.


Finally, while I don’t think this plot is entirely necessary, I also show you the false positives / power rates for this latter case. The curves are such beautiful sigmoids that I just cannot help myself but to include them in this post…


So to sum up: leaving aside the fact that you shouldn’t peek at your data and stop data collection prematurely in any case, if you do this you can shoot yourself seriously in the foot. While the inflation of false positives through data peeking may have contributed a considerable number of spurious, unreplicable findings to the literature, what is worse, it may very well also have contributed a great number of false negatives to the proverbial file drawer: experiments that were run but failed to produce significant results after peeking a few times, and which were then abandoned, never to be heard of again. When it comes to spurious findings in the literature, I suspect the biggest problem is not actually data peeking but other questionable practices from the Garden of Forking Paths, such as tweaking the parameters of an experiment or the analysis.

* Actually it may just be me…

Matlab code for these simulations. Please let me know if you discover the inevitable bugs in this analysis.

On brain transplants, the Matrix, and Dualism

Warning: This post contains spoilers for the movie The Matrix.

Today a tweet by Neuroskeptic pointed me to this post entitled “You are not your brain: Why a head transplant is not what you think it is”. The title initially sparked my interest because it is a topic I have been thinking about a lot. I am actually writing a novel that deals with topics such as the scientific study of unconsciousness, non-free will, and disembodied cognition*. This issue is therefore directly relevant to me.

Unfortunately, this particular post does not really deal with this topic in any depth but only espouses a trivial form of mind-brain dualism. It discusses some cherry-picked findings without any proper understanding of current neuroscientific knowledge and brushes aside most scientific arguments about consciousness as “bizarre” claims, without providing any concrete argument why that is so. Don’t get me wrong, some claims by neuroscientists about free will and consciousness are probably on logically shaky ground, and neuroscientists themselves frequently espouse a form of inadvertent dualism in their own writing about how the brain relates to the mind. However, this post doesn’t really discuss these issues in an adequate way – but go and read it and make up your own mind.

Either way, I think the general thought is intriguing nonetheless. What would actually happen if we could transplant a human brain (or the whole head) into a different body? Let’s ignore for the moment the fact that our surgical technology is nowhere near the point where we could do this with humans and allow the transplanted head to actually control the new host body. Instead let us assume that we can in fact connect up all the peripheral neurons and muscles in the body to the corresponding neurons in the transplanted brain.

Thinking about this already reveals the first problem: there has got to be a mismatch between the number of neurons in the body and the brain. Perhaps this doesn’t matter and some afferent and efferent nerve fibers need not be connected up to the brain, or – vice versa – some of the brain’s neurons need not receive any input or have any targets in the body. If the bulk of the brain is connected up properly perhaps this suffices? In any case, our brains are calibrated to the body, so placing them into a new body must inevitably throw this calibration completely out of whack. Perhaps this can be overcome and a new calibration can emerge, but to what extent this is possible is anybody’s guess.

A related problem is how the brain represents the body in which it has been placed. Somehow we carry in our minds a body image that encodes the space that our body occupies, how it looks, how it feels, etc. There are illusions that distort this representation of our own bodies. Some malfunction or fluke in that system could also explain some out-of-body experiences although it is of course difficult to study such phenomena. It seems however pretty likely that such experiences should be exacerbated in a person whose brain has been transplanted into a new body. Imagine you are 1.5 meters tall but your brain has been transplanted into the body of an NBA player. Your experience of the world through this new, much taller body must inevitably involve far more than simply looking at the world from your new vantage point. Over a lifetime of existing in your short body you should have no representation of the sensory experiences related to being 2 meters tall, nor of the feats your muscles are capable of when you can slam dunk. It is possible that we can learn to live in this new corporeal shell but who can know whether that is the case.

In that sense, there may actually be truth to the claim in the aforementioned post that there is some kind of “bodily memory”. For one thing, the flexibility and strength of various muscles is presumably related to what you are doing with them on a daily basis. Who knows, perhaps the various nervous tissues in the body also undergo other forms of synaptic plasticity we don’t yet know about? Of course, none of this suggests – as this post claims – that much of your self is in fact stored inside your body or that you become the host person. The brain is undoubtedly the seat of consciousness and of much of your memory, including the fine procedural or motor memory that you take for granted. But I think it is fair to say that by having your brain wired up to a new body you would certainly experience the world in uniquely different ways. Insofar as your perception affects how you interact with the world this may very well alter your personality and thus really change who you are.

In the same vein I also view other thought experiments, such as the common science-fiction notion that we could one day upload our brains to a computer. Even if we had a computer with the data capacity to store a complete wiring diagram and the synaptic weights of all the neurons in the human brain, and even if neuronal wiring diagrams were all there is to processing in the brain (thus completely ignoring the role of astrocytes, possibly important functional roles of particular ion channel proteins, or of slow neurochemical transmissions), whatever this stores would presumably not really be the person’s mind. Simply running such a network in a computer would effectively cut this brain off from its host body and in this way it would be comparable to the brain transplant situation. It is difficult to imagine what such a brain would in fact experience in silico. To approximate normal functioning you would also have to simulate the sensory inputs and the reciprocal interactions between the simulated brain and the simulated world this brain inhabits. This would be a bit like the Matrix (although that movie does not involve disembodiment). It is hard to imagine what this might really feel like. Quite likely, at least the cruder beta versions of such a simulation would be highly uncanny because they wouldn’t accurately capture real-world experience. In fact, this is part of the plot of the Matrix movie because the protagonist senses that something isn’t right with the world.

I find this topic quite fascinating. Whatever the case may be, I think it is safe to say that we are not just our brains. As opposed to the simplistic notion of how the brain works suggested by many science fiction and fantasy stories, our minds aren’t merely software running on brain hardware. Our brain exists inside a body and that presumably must have some influence on the mind. I don’t buy into much of the embodied cognition literature, as a lot of that also seems very simplistic. I certainly agree in large part with Chaz Firestone and Brian Scholl that it has not really been demonstrated that things like the heaviness of a backpack can affect your perception of the steepness of the hill before you. But at the same time, I think some degree of embodiment must exist, precisely because I am not a dualist. I don’t think there is any evidence to suggest the mind simply floats in the ether, completely removed from the brain and body. Rather it is an emergent property of the brain, a brain that is intricately connected to the rest of the body it resides in (and even that is simplistic: I would in fact say that the brain is part of the body).

Coming back to that post about head transplants, the post is on a website called Religion News and the author is a professor of theological and social ethics. As such it is unsurprising that he discusses a dualist view of the mind and criticizes some of the neuroscientific claims that conflict with that notion. However, his argument is quite odd when you dig a little deeper: rather than saying that your self arises in your brain, the author implicitly suggests that it is inherent to your body – he literally states that the person whose brain is transplanted dies because they are in a new body. He further suggests that any children the patient would have in their new body would not be his but those of the host body. While genetically this argument is correct, it completely ignores the fact that there is now a new mind driving the body. Whatever distortions and changes to this mind may result from the brain transplant, it is clearly wrong to claim that the host body would completely override the mind inside the transplanted brain. Yes, biologically the children would be those of the host body but mentally they would be the children of the transplanted mind. Claiming otherwise is equivalent to the suggestion that adoptive parents are not real parents.

In conclusion, I agree that there are some interesting philosophical and theological ramifications to consider about brain transplants. If you believe in the existence of a soul, it is not immediately obvious how you should interpret such a case. I don’t think science can give you the answer to that but that is between you and your rabbi or guru or whoever holds your spirituality together. But one thing is clear to me: the soul is not inherently attached to your body any more than it is to your brain. No, you are definitely not just your brain. But you aren’t just your body either.



(* Work on this is going very slowly so don’t get your hopes up you’ll see any of this anytime soon – it’s more of a lifetime project…)

How funders could encourage replication efforts

As promised, here is a post about science stuff, finally back to a more cheerful and hopeful topic than the dreadful state the world outside science is in right now…

A Dutch research funding agency recently announced a new grant initiative that exclusively funds replication attempts. The idea is to support replication efforts of particularly momentous “cornerstone” research findings. It’s not entirely clear what this means but presumably such findings include highly cited findings, those with great media coverage, public policy impact, etc. It also isn’t clear who determines whether a finding falls into this category.

You can read about this announcement here. In that article you can see some comments by me on how I think funders should encourage replications by requiring that new grant proposals should also contain some replication of previous work. Like most people I believe replication to be one of the pillars supporting science. Before we treat any discovery as important we must know that it is reliable and meaningful. We need to know to what extent it generalizes or if it is fickle and subject to minor changes in experimental parameters. If you read anything I have written about replication, you will probably already know my view on this: most good research is built on previous findings. This is how science advances. You take some previously observed results and use them to generate new hypotheses to be tested in a new experiment. In order to do so, you should include a replication and/or sanity check condition in your new experiment. This is precisely the suggestion Richard Feynman made in his famous Cargo Cult Science lecture.

Imagine somebody published a finding that people perceive the world as darker when they listen to sad classical music (let’s ignore for the moment the inherent difficulty in actually demonstrating such an effect…). You now want to ask if they also perceive the world as darker when they listen to dark metal. If you simply run the same experiment but replace the music any result you find will be inconclusive. If you don’t find any perceptual effect, it could be that your participant sample simply isn’t affected by music. The only way to rule this out is to also include the sad classical music condition in your experiment to test whether this claim actually replicates. Importantly, even if you do find a perceptual effect of dark metal music, the same problem applies. While you could argue that this is a conceptual replication, if you don’t know that you could actually replicate the original effect of classical music, it is impossible to know that you really found the same phenomenon.

My idea is that when applying for funding we should be far more explicit about how the proposal builds on past research and, insofar as this is feasible, build more replication attempts into the proposed experiments. Critically, if you fail to replicate those experiments, this would in itself be an important finding that should be added to the scientific record. The funding thus implicitly sets aside some resources for replication attempts to validate previous claims. However, this approach also supports the advance of science because every proposal is nevertheless designed to test novel hypotheses. This stands in clear contrast to pure replication efforts such as those this Dutch initiative advocates or the various large-scale replication efforts like the RPP and Many Labs project. While I think these efforts clearly have value, one major concern I have with them is that they seem to stall scientific progress. They highlighted a lack of replicability in the current literature and it is undoubtedly important to flag that up. But surely this cannot be the way we will continue to do science from now on. Should we have a new RPP every 10 years now? And who decides which findings should be replicated? I don’t think we should really care whether every single surprising claim is replicated. Only the ones that are in fact in need of validation because they have an impact on science and society should be replicated. But determining what makes a cornerstone discovery is not really that trivial.

That is not to say that such pure replication attempts should no longer happen or that they should receive no funding at all. If anyone is happy to give you money to replicate some result, by all means do so. However, my suggestion differs from these large-scale efforts and the Dutch initiative in that it treats replication the way it should be treated, as an essential part of all research, rather than as a special effort that is somehow separate from the rest. Most research would only be funded if it is explicit about which previous findings it builds on. This inherently also answers the question which previous claims should be replicated: only those findings that are deemed important enough by other researchers to motivate new research are sufficiently important for replication attempts.

Perhaps most crucially, encouraging replication in this way will help to break down the perceived polarization between the replicators and original authors of high-impact research claims. While I doubt many scientists who published replications actually see themselves as a “replication police,” we continue to rehash these discussions. Many replication attempts are also suspected of being motivated by mistrust in the original claim. Not that there is really anything wrong with that because surely healthy skepticism is important in science. However, whether justified or not, skepticism of previous claims can lead to the perception that the replicators were biased and the outcome of the replication was a self-fulfilling prophecy. My suggestion would mitigate this problem at least to a large degree because most grant proposals would at least seek to replicate results that have a fighting chance of being true.

In the Nature article about this Dutch initiative there are also comments from Dan Gilbert, a vocal critic of the large-scale replication efforts. He criticizes such replication research for its “unoriginality” and suspects that we will learn more about the universe by spending money on “exploring important new ideas.” I think this betrays the same false dichotomy I described above. I certainly agree with Gilbert that the goal of science should be to advance our understanding of the world but originality is not really the only objective here. Scientific claims must also be valid and generalize beyond very specific experimental contexts and parameters. In my view, both are equally important for healthy science. As such, I don’t see a fundamental problem with the Dutch initiative, but it seems rather gimmicky to me and I am unconvinced its effects will be lasting. Instead I believe the only way to encourage active and on-going replication efforts will be to overhaul the funding structure as I outlined here.

52% seems barely above chance. Someone should try to replicate that stupid referendum.

The bottom line

So I haven’t posted in a while, first because I was depressed and lethargic from the dreadful outcome of the EU referendum, and then because I was busy with actual work. I was considering writing a post about how direct democracy has the same problems as citizen science (thanks to Chris Chambers for inspiring that thought a little) but I don’t feel like it right now.

There isn’t much left to be said about “Brexit” (how I hate that word) that others haven’t already said. The bottom line is, it is highly likely to seriously hurt British science and, I wager, also Britain in general. It seems the political will isn’t there to simply slide into EEA membership (which would keep freedom of movement) and any other solutions appear to be a terrible deal for the UK, for the EU, and for science. What exactly will happen nobody can predict (as you know I don’t believe in precognition) so we’ll just have to wait and see. Except we don’t have to stay here to wait and see. I don’t really see why I should suffer the consequences of a referendum I wasn’t even allowed to vote in despite being a settled and contributing member of society. It is too early to make any rash decisions but I can certainly perceive greener pastures elsewhere…

For the time being, however, I have merely decided to switch to American spelling. This is not reawakening the Devil’s Neuroscientist (She also used American English). It’s just a protest. And, perhaps, depending how the US elections in November go I may have to change it back… On the bright side, my next post will presumably be about something sciency.

I am currently considering petitioning UCL to open a branch in Gibraltar given that this region will almost certainly have to get some special status after the UK leaves the EU

Six flawed arguments for leaving the EU

As anyone who reads this blog probably knows, the UK will hold a referendum about its continued membership in the EU later this month, on 23rd June. I already discussed my views on this in my previous post, so I won’t go into any depth on that here. The discussion is raging, not only in the media but no doubt in many family homes and workplaces (would be curious to be a fly on the wall when Boris Johnson and his brother, science minister Jo Johnson, talk about this in private…). I do think I have said most that I can say about it already – but I keep hearing the same tired, naïve arguments over and over. So I’ll write something about it, one last time before putting my future career, my civil rights, and most likely my continued life in this country in the hands of voters. Here I address six flawed arguments for leaving the EU:

1. “It will change everything”

Actually, most likely nothing major will happen at all. By far the most likely scenario is that the UK leaves the EU, and then joins the EEA in which free movement of people remains in return for full access to the single market. EU citizens in the UK will retain the same rights they had previously. Parliament comprises a large number of MPs from Labour, the Lib Dems, the SNP, and one (I think?) from the Greens, plus a healthy number of Europhile Conservatives. This means that this outcome is essentially guaranteed, at least until the next general election (and even then it seems highly unlikely that this situation will change dramatically). Of course, the UK would nevertheless give up its rights to influence EU policy. Sounds like a rotten deal to me. Anyway, leaving this aside, in the remainder of this post I will pretend that a vote to leave the EU will also mean an end to freedom of movement, which is the illusory scenario the Leave campaign is peddling.

(Update 11 June 2016: The above EEA scenario of course assumes that the UK is allowed to remain in the single market. Wolfgang Schäuble seems to think that isn’t going to happen. I don’t agree with Schäuble much about anything but then again he is also highly influential in EU politics so it’s difficult to know what to think about his argument.)

2. “We can spend the money we save on UK science”

One reason I and many scientists are vehemently opposing this nostalgic independence nonsense is that a great deal of British science funding comes from the EU and that science in the UK would suffer if that were lost. An oft-repeated counterargument to this is that by leaving the EU the UK would no longer pay contributions to European funds and could thus use those savings to spend on British science. This is based on false economy and wishful thinking. The UK brings in more science funding than it pays in, so it would have to increase its science funding. When was the last time a British government did that? Do you honestly think it is likely they will do that now? Of course this argument does not even take into account the strain on the economy right now. It also ignores the likely hit the economy will take after leaving which will reduce and quite possibly wipe out any potential savings. And it blatantly neglects the substantial cost that the UK must pay to leave the EU in the first place. None of these things suggest there will be lots of spare pennies to fund UK research and development. (For similar reasons I also don’t believe this money will be used for the NHS or building homes but that’s outside the scope of my post).

3. “We will be free of EU bureaucracy”

Science has always been collaborative and it is increasingly so in our age. We need international science projects and the EU science initiatives (which go well beyond EU member states) can facilitate this far better than any single national body could. So the UK will quite likely continue to contribute to those initiatives, just as other non-EU countries (like Switzerland) are contributing – without any say in their direction.

4. “Scientists can still collaborate”

Funding is a big factor in science and the cynics on the Leave side are probably right that it is one of the driving factors why all vice-chancellors and governing bodies of British universities want the UK to stay in the EU. But it’s not just about that. Because science is collaborative and international, universities and research centres are usually extremely multinational. This may be especially true in English-speaking countries and this ability to attract bright minds from all over the world is what boosts British science output (e.g. a large proportion of research grants at UK universities are brought in by people who are not UK citizens). You do not help this by putting up barriers. Leave campaigners like to talk about “point-based immigration systems” that would allow the UK to hire people in professions it needs and that makes it possible for excellent students to come here. Sure, because the best thing is always to have more bureaucracy and paperwork! That will doubtless attract great applicants who could instead be free to move to Paris, Berlin – or Dublin.

5. “EU citizens already living here can stay”

Much of this referendum debate has focused on immigration. Recent years have seen unprecedented immigration of people from other EU nations (although this still only accounts for around half of overall immigration to the UK). It is not surprising that this could cause some issues and concerns. More people making demands on the health system, on housing, or on jobs may strain the country’s capacity. Stopping EU immigration dead in its tracks will perhaps relieve this strain – however, one question Leave campaigners steadfastly fail to address is what happens to the people who are already here. Unless they all pack up and leave voluntarily on 24th June they will still put a strain on the capacity for some time to come. One argument I often hear is “nobody will be kicked out”. However, non-EU citizens are being deported left and right, sometimes for ludicrous reasons and in ludicrous ways. Under the Reign of Terroresa May, neither having a doctorate nor a British spouse necessarily protects you from this. Unless some sort of special agreement is negotiated, the same rules will apply to EU citizens if the UK leaves the EU. There is a lot of conflicting information out there, the most insidious of which is blatant (but presumably lucrative?) scare-mongering by law firms pushing people to apply for citizenship. Now, I don’t think many EU citizens will be deported, especially not those who are already settled here. But Leave campaigners show an obvious disconnect: On the one hand, they seem to believe that by leaving the EU the burden on the NHS and housing is magically lifted. On the other hand, they (at least the sane ones) maintain that there won’t be any mass-deportation of the very people they blame for this burden.

6. “We will regain our sovereignty”

The UK still is, and will remain, a sovereign nation insofar as such a thing exists in this globalised world. I wasn’t overly impressed by David Cameron’s performance in that cringe-worthy ITV townhall meeting but one compelling answer he gave is that voting to Leave the EU will give an illusion of independence from foreign powers whilst sacrificing actual influence on the world and European stage. I call this the Libertarian Fallacy because it is the same faulty logic that leads many self-declared Libertarians to oppose all sorts of policies in the name of “liberty” without achieving any individual freedom at all. It’s the reasoning that allows some to decry background checks on guns as tyranny but sees no problem with strict tests for driving licenses. It’s the cognitive dissonance in which citizen ID cards evoke the spectre of fascist dictatorship but nobody worries about the far less controlled surveillance via credit card transactions or online activities. Whatever utopian dreams you may have about a “sovereign” UK after EU exit, it will lose its seat at the table and have reduced sway in any decision-making process in Europe – and by extension also in the world. Perhaps it’s fine with many to be an isolated island in a big sea dominated by China and the US, and a new Russian empire rattling its sabres. Fine, not all nations need to be world players. Perhaps these big guys will even leave you in peace. But don’t think for a second that by leaving the EU Britannia will rule the waves again.