Here is another post on experimental design and data analysis. I need to get this stuff out of my system before I let the domain of this blog expire at the end of the year (I had already planned this for last year but then decided to keep it going because of a certain difficult post I had to write then…)
Hidden group differences?
This (hopefully brief) post is inspired by a Twitter discussion I had last night. The people involved had held a journal club about one of my lab’s recent publications. The details of that are not really important for this post – you can read their excellent questions about our work and my replies to them in the tweet thread. However, this discussion reminded me of the issues you can run into when dealing with human volunteer participants: issues that you have no control over and – what is worse – may not even be aware of.
In this particular study, we compared retinotopic maps from groups of identical (MZ) and fraternal (DZ) twin pairs. One very notable point when you read our article is that the sample sizes for the two groups are quite different, with more MZ twin pairs than DZ pairs. We had some major difficulties finding DZ twins to take part, and what made matters worse is that we had to reclassify several purported DZ twins as MZ twins after genetic testing. Looking at the literature, this seems quite common. For example, we found that in the Human Connectome Project there is a similar imbalance in the sample sizes (see for instance this preprint that also looked at retinotopic maps in twins at a more macroscopic level). A colleague of ours working on another twin study experienced the same problem (I don’t think this study has been published yet). Finally, here is just one more example of a vision science study with a substantially larger sample of MZ than DZ twins.
There are clearly problems recruiting DZ twins. Undoubtedly MZ twins are more “special”, and so there are organisations through which they can be reached. While there are participant pools for twins that contain both zygosities, the people managing these can be rather protective. This is understandable because these databases are a valuable scientific resource and they don’t want to tire out their participants by allowing too many researchers to approach them with requests to participate. These pools of participants may also be imbalanced because MZ twins self-select into them, as they have a strong interest in learning how similar they are. In contrast, DZ twins may have less interest in this question (although some obviously do). And even if you have a well-balanced pool of potential participants, there may be additional social factors at play. The MZ twins in those pools may be keener to take part than the DZ twins. Of course, zygosity may also interact with this in hidden ways. MZ twins might have a closer relationship to one another, even if only by living geographically closer, and that will doubtless affect how easy it is for them to participate in your study. All these issues are extremely difficult to know about, let alone to control.
Not about twins
As I said, the details of our study aren’t really important and this post isn’t about twins. Rather, this is clearly a broader issue. Similar concerns affect any comparison between groups of participants. Anyone studying patients with a particular condition is probably familiar with that issue. Many patients are keen to take part in studies because they have an interest in better understanding their condition or – for some disorders or illnesses – contributing to the development of treatments. In contrast, recruiting the “matched” control participants can be very difficult and you may go to great and unusual lengths to find them. This can result in your control group being quite unusual compared to the standard participant sample you might have when you do fundamental research, especially considering a lot of such research is done on young undergraduate students recruited on university campuses.
Let’s imagine for example that we want to understand visual processing by professional basketball players in the NBA. A quick Google search suggests the average body height of NBA players is 1.98 m, considerably taller than the average male. Any comparison that does not take this into account would confound body height with basketball skill. Obviously you can control aspects like this to some extent by using covariates (e.g. in multiple regression analyses) – but that only works for the variables you know about. More importantly, you’d be well-advised to recruit a matched control group that has a similar body height to your basketball players but without the athletic skills. That way you cancel out the effect of body height.
But how does this recruitment drive interact with your samples? For one thing, it will probably be difficult to find these tall controls. While most NBA players are very tall (even short NBA players are presumably above average height), really tall people in the general population are rare. So finding them may take a long time. But what is worse, the ones you do find may also differ in other respects from your average person. For body height, this may not be too problematic but you never know what issues a very tall person faces who doesn’t happen to have a multi-million dollar athletic contract.
These issues can be quite nefarious. For instance, I was involved in a study a few years ago where we had to recruit control participants matched to our main group of interest both in terms of demographic details and psychological measures. What we ended up with was a lot of exclusions of potential control participants due to drug use, tattoos or metal implants (a safety hazard), and in one case an undisclosed medical history we only discovered serendipitously. The rationale for selecting participants with particular matched traits from the general population is based on the assumption that these traits are random – however, this fails if there is some hidden association between that trait and other confounding factors. In essence, this is just another form of selection bias that I have written about recently…
The problem is there is simply no good way to control for that. You cannot use a variable as covariate when you don’t know it exists. This means that particular variable simply becomes part of the noise, the variance not explained by your model. It is entirely possible that this noise masquerades as a difference that doesn’t really exist (Type I error) or obscures true effects (Type II error). You can and should obviously check for potential caveats and thus establish the robustness of the findings but that can only go so far.
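To make this a bit more concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the group sizes, the effect sizes, and the function names `group_effect_t` and `false_positive_rate` are made up and do not come from any real study. The two groups differ only in a trait that affects the outcome; group membership itself has no effect. If the trait is hidden, the group comparison produces far too many false positives. If you happen to know about the trait and can measure it, adding it as a covariate should bring the rate back to roughly the nominal level.

```python
import numpy as np


def group_effect_t(y, group, covariate=None):
    """t statistic for the group coefficient in an ordinary least squares fit."""
    columns = [np.ones_like(y), group]
    if covariate is not None:
        columns.append(covariate)
    X = np.column_stack(columns)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the estimates
    return beta[1] / np.sqrt(cov_beta[1, 1])


def false_positive_rate(n_per_group=50, n_sims=2000, adjust=False, seed=0):
    """Fraction of simulated studies reporting a 'significant' group effect."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0.0, 1.0], n_per_group)
    significant = 0
    for _ in range(n_sims):
        # Hypothetical trait that differs between groups, e.g. through recruitment.
        trait = rng.normal(loc=0.8 * group, scale=1.0)
        # The outcome depends on the trait only: there is NO true group effect.
        y = 0.5 * trait + rng.normal(size=group.size)
        t = group_effect_t(y, group, trait if adjust else None)
        # 1.98 approximates the two-sided 5% critical value for ~100 dof.
        significant += abs(t) > 1.98
    return significant / n_sims


if __name__ == "__main__":
    print("trait hidden:     ", false_positive_rate(adjust=False))
    print("trait as covariate:", false_positive_rate(adjust=True))
```

With these made-up numbers, the unadjusted comparison should flag a group difference far more often than 5% of the time, even though none exists. The catch is the one described above: the second call is only possible when you know the trait exists and have measured it.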
Small N designs
This brings me back to another one of my pet issues: small N designs, as are common in psychophysics. Some psychophysics experiments have as few as two participants, both of whom are also authors of the publication. It is debatable how valid this extreme might be – one of my personal heuristics is that you should always include some “naive” observers (or at least one?) to show that results do not crucially depend on knowledge of the hypothesis. But these designs can nevertheless be valid. Many experiments are actually difficult for the participant to influence through willpower alone. I’ve done some experiments on myself where I thought I was responding a certain way only to find the results didn’t reflect this intuition at all.
And there is definitely something to be said for having trained observers. I’ve covered this topic several times before so I won’t go into detail on this. But it doesn’t really make sense to contaminate your results with bad data. A lot of psychophysical experiments require steady eye gaze to ensure that stimuli are presented at the parafoveal and peripheral locations you want to test. It doesn’t make much sense to include participants who cannot maintain fixation. (On that note, it is interesting that some results can actually be surprisingly robust even in the presence of considerable eye movements – such as what we found in this study. This opens up a number of questions as to what those results mean but I have not yet figured out a good way to answer them…).
This is quite different from your typical psychology experiment. Imagine you want to test (bear with me here) how fast your participants walk down the corridor after leaving your lab cubicle where you had them do some task with words… While there may be some justified reasons for exclusion of participants (such as that they obviously didn’t comply with your task instructions, failed to understand the words, or got an urgent phone call that caused them to sprint down the hall), there is no such thing as a “trained observer” here. You want to make an inference about how the average person reacts to your experimental manipulation. Therefore you need to use a statistical approach that tests the group average. We don’t want only people who are well trained at walking down corridors.
In contrast, in threshold psychophysics you don’t care about the “average person” but rather you want to know what the threshold performance is after all that other human noise – say inattention, hand-eye coordination, fixation instability, uncorrected refractive error, mind-wandering, etc. – has been excluded. Your research question is what the just noticeable difference in stimuli is under optimal conditions, not what the just noticeable difference is when distracted by thoughts about dinner or your inability to press the right button at the right time. A related (and more insidious) issue is introspection. One could make the argument that many trained observers are also better at judging the contents of their perceptual awareness than someone you recruited off the street (or your student participant pool). A trained observer may be quite adept at saying that the grating you showed appeared to them tilted slightly to the left – your Average Jo(sephine) may simply say that they noticed no difference. (This could in part be quantified by differences in response criterion but that is not guaranteed to work).
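For readers unfamiliar with the idea of a response criterion, here is a small sketch using the standard equal-variance signal detection formulas, d′ = z(H) − z(F) and c = −(z(H) + z(F))/2. The function name `dprime_and_criterion` and the trial counts are hypothetical, chosen only to illustrate two observers with similar sensitivity but different willingness to say “I saw a tilt”. The log-linear correction (adding 0.5 to the counts) is one common convention for avoiding rates of exactly 0 or 1, not the only one.

```python
from statistics import NormalDist


def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Equal-variance signal detection estimates of sensitivity and criterion."""
    # Log-linear correction keeps both rates strictly between 0 and 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)            # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # > 0 means conservative
    return d_prime, criterion


if __name__ == "__main__":
    # Hypothetical counts out of 100 signal and 100 noise trials each.
    trained = dprime_and_criterion(69, 31, 31, 69)
    naive = dprime_and_criterion(50, 50, 16, 84)
    print("trained observer (d', c):", trained)
    print("naive observer   (d', c):", naive)
```

In this made-up example both observers have about the same d′, but the naive observer has a clearly positive criterion, meaning they report “no difference” more readily. As noted above, this only captures part of the story: it assumes the model fits, and it says nothing about how well someone can introspect on what they saw.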
Taken together, the problem here is not with the small N approach – it is doubtless justified in many situations. Rather I wonder how to decide when it is justified. The cases described above seem fairly obvious but in many situations things can be more complicated. And to return to the main topic of this post, there could be insidious interactions between finding the right observers and your results. If I need trained observers for a particular experiment but I also want to find some who are naive to the purpose of the experiment, my inclusion criteria may bias the participants I end up with (this usually means your participants are all members of your department :P). For many purposes these biases may not matter. In some cases they probably do – for instance reports that visual illusions differ considerably in different populations. Ideally you want trained observers from all the groups you are comparing in this case.