Recently I have been thinking a bit about the best way to represent group data. The most typical approach is to show summary statistics (usually the mean) and error bars (usually standard errors), either in bar plots or in plots with lines and symbols. A lot of people seem to think this is not an appropriate way to visualise results because it obscures the data distribution and whether outliers may influence the results. One thing prompting me to think about this is that in at least one of our MSc courses students are explicitly told by course tutors that they should be plotting individual subject data. It is certainly true that close inspection of your data is always important – but I am not convinced that it is the only or best way to represent all sorts of data. In particular, looking at the results from a recent student's experiment, you wouldn't be able to make heads or tails of the findings just by plotting individual data. Part of the reason is that most of the studies we do use within-subject designs, and standard ways of plotting individual data points can actually be misleading for those. There are probably better ones, and perhaps my next post will deal with that.
For now, though, I only want to consider group data that were actually derived from between-subject or at least mixed designs. A recently published study in Psychological Science reported that sad people are worse at discriminating colours along the blue-yellow colour axis but not along the red-green colour axis. This sparked a lot of discussion on Twitter and in the blogosphere, for example this post by Andrew Gelman and also this one by Daniel Lakeland. Publications like this tend to attract a lot of coverage by mainstream media, and this was no exception. This then further fuels the rage of skeptical researchers :P. There are a lot of things to debate here, from the fact that the study authors interpret a difference between differences as significant without testing the interaction, to the potential inadequacy of the general procedure for measuring perceptual differences (raw accuracy rather than a visual threshold measure), to the possibility that outliers may contribute to the main result. I won't go into this discussion, but I thought this data set (which, to the authors' credit, is publicly available) would be a good example for my musings.
So here I am representing the data from their first study by plotting it in four different ways. The first plot, in the upper left, is a bar plot showing the means and standard errors for the different experimental conditions. The main result in the article is that the difference between control and sadness is significant for discriminating colours along the blue-yellow axis (the two bars on the left).
And judging by the bar graph you could certainly be forgiven for thinking so (I am using the same truncated scale used in the original article). The error bars seem reasonably well separated, and this comparison is in fact statistically significant at p=0.0147 on a parametric independent-samples t-test or p=0.0096 on a Mann-Whitney U-test (let's ignore the issue of the interaction for this example).
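If you want to run the same two tests yourself, here is a minimal sketch (in Python with SciPy rather than the Matlab I used; the accuracy values below are made up for illustration and are not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores for two groups (NOT the study's data)
control = np.array([0.90, 0.92, 0.88, 0.91, 0.89, 0.93])
sadness = np.array([0.70, 0.72, 0.68, 0.71, 0.69, 0.73])

# Parametric independent-samples t-test
t_stat, p_t = stats.ttest_ind(control, sadness)

# Non-parametric Mann-Whitney U-test (two-sided)
u_stat, p_u = stats.mannwhitneyu(control, sadness, alternative="two-sided")

print(f"t-test: t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"U-test: U = {u_stat:.1f}, p = {p_u:.4f}")
```

With these made-up, fully separated samples both tests are of course significant; the point is only the two function calls.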
Now consider the plot in the upper right though. Here we have the individual data points for the different groups and conditions. To give an impression of how the data are distributed, I added a little Gaussian noise to the x-position of each point. The data are evidently quite discrete due to the relatively small number of trials used to calculate the accuracy for every subject. Looking at the data in this way does not seem to give a very clear impression that there is a substantial difference between the control and sadness groups in either colour condition. The most noticeable difference is that there is one subject in the sadness group whose accuracy, at 0.58, is not matched by any counterpart in the control group. Is this an outlier pulling the result?
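The jittering trick is simple to reproduce. A sketch in Python/matplotlib (again with made-up accuracy values rather than the real data):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Made-up discrete accuracy scores for two groups
groups = {"control": np.array([0.85, 0.90, 0.90, 0.95, 0.80]),
          "sadness": np.array([0.80, 0.85, 0.85, 0.90, 0.58])}

fig, ax = plt.subplots()
for i, (name, acc) in enumerate(groups.items()):
    # Small Gaussian noise on x so identical y-values don't overplot
    x = i + rng.normal(0.0, 0.05, size=acc.size)
    ax.plot(x, acc, "o", label=name)
ax.set_xticks([0, 1])
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("accuracy")
ax.legend()
fig.savefig("jitter.png")
```

The jitter only spreads the points horizontally; the y-values are untouched, so the plot stays honest about the data.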
Next I generated a box-and-whisker plot in the lower left panel. The boxes in these plots denote the inter-quartile range (IQR, i.e. between the 25th and 75th percentiles of the data), the red lines indicate the medians, the whiskers extend up to 1.5 times the IQR beyond the quartiles (although they are curtailed when there are no data points beyond that range, as with the ceiling at 1), and the red crosses are outliers that fall outside this range. The triangular notches surrounding the medians are a way to represent uncertainty, and if they do not overlap (as is the case for the blue-yellow data) this suggests a difference between medians at the 5% significance level. Clearly the data point at 0.58 accuracy in the sadness group is considered an outlier in this plot, although it is not the only one.
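Matplotlib's built-in boxplot can produce essentially the same figure. A sketch with made-up data, where `notch=True` draws the median notches and `sym="r+"` marks points beyond the 1.5×IQR whiskers as red crosses:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Made-up accuracy samples, with one low outlier added to the second group
control = np.clip(rng.normal(0.88, 0.05, 30), 0, 1)
sadness = np.append(np.clip(rng.normal(0.84, 0.05, 29), 0, 1), 0.58)

fig, ax = plt.subplots()
ax.boxplot([control, sadness],
           notch=True,   # notches approximate a 95% CI on the median
           sym="r+")     # mark outliers with red crosses
ax.set_xticks([1, 2])
ax.set_xticklabels(["control", "sadness"])
ax.set_ylabel("accuracy")
fig.savefig("boxplot.png")
```

Whether 0.58 actually ends up flagged as an outlier depends, as in the real data, on where the quartiles of the rest of the sample fall.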
Finally, I also wrote a Matlab function to create cat-eye plots (Wikipedia calls those violin plots – personally they look mostly like bottles, amphoras or vases to me – or, in this case, like balloons). This is shown in the lower right panel. These plots show the distribution of the data in each condition smoothed by a kernel density function. The filled circles indicate the median, the vertical lines the inter-quartile range, and the asterisk the mean. Plots like this seem to be becoming more popular lately. They do have the nice feature that they give a fairly direct impression of how the data are distributed. It seems fairly clear that these are not normal distributions, which probably has largely to do with the ceiling effect: as accuracy cannot be higher than 1 the distributions are truncated there. The critical data set, the blue-yellow discrimination for the sadness group, has a fairly thick tail towards the bottom which is at least partially due to that outlier. This all suggests that the traditional t-test was inappropriate here but then again we did see a significant difference on the U-test. And certainly, visual inspection still suggests that there may be a difference here.
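Matplotlib also ships a version of these under the Wikipedia name, `violinplot`. A sketch with made-up, ceiling-clipped data, overlaying the median, inter-quartile range and mean as in my figure:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Made-up accuracies, clipped at 1 to mimic the ceiling effect
control = np.clip(rng.normal(0.92, 0.06, 40), 0, 1)
sadness = np.clip(rng.normal(0.86, 0.08, 40), 0, 1)
data = [control, sadness]

fig, ax = plt.subplots()
ax.violinplot(data, positions=[1, 2], showextrema=False)
for pos, d in zip([1, 2], data):
    q1, med, q3 = np.percentile(d, [25, 50, 75])
    ax.vlines(pos, q1, q3, lw=3)   # inter-quartile range
    ax.plot(pos, med, "ko")        # median
    ax.plot(pos, np.mean(d), "k*") # mean
ax.set_xticks([1, 2])
ax.set_xticklabels(["control", "sadness"])
ax.set_ylabel("accuracy")
fig.savefig("violin.png")
```

Note that the default kernel density estimate happily smooths past the ceiling at 1, so for truncated data like these the shape near the boundary should be taken with a grain of salt.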
Next I decided to see what happens if I remove this outlier at 0.58. For consistency, I also removed their data from the red-green data set. This change does not alter the statistical inference in a qualitative way even though the p-values increase slightly. The t-test is still significant at p=0.0259 and the U-test at p=0.014.
Again, the bar graph shows a fairly noticeable difference. The scatter plot of the individual data points, on the other hand, now hardly seems to show any difference. Both the whisker and the cat-eye plot still seem to show qualitatively similar results as when the outlier is included. There seems to be a difference in medians for the blue-yellow data set. The cat-eye plot makes it more apparent that the tail of the distribution for the sadness group is quite heavy, something that isn't as clear in the whisker plot.
Finally, I decided to simulate a new data set with a similar pattern of results but in which I knew the ground truth. All four data sets contained 50 data points that were chosen from a Gaussian distribution with a mean of 70 and a standard deviation of 10 (I am a moron and therefore generated these on a scale of percent rather than proportion correct – and now I’m too lazy to replot all this just to correct it. It doesn’t matter really). For the control group in the blue-yellow condition I added an offset of 5 while in the sadness group I subtracted 5. This means that there is a significant difference (t-test: p=0.0017; U-test: p=0.0042).
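The simulation is straightforward to reproduce in outline. A Python sketch (fresh random draws obviously won't match my exact p-values, but with a true ±5 offset, σ=10 and n=50 per group the blue-yellow difference comes out significant essentially every time):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Four data sets of 50 points each, Gaussian with mean 70 and SD 10 (percent correct)
base = dict(loc=70, scale=10, size=50)
control_by = rng.normal(**base) + 5  # blue-yellow, control: +5 offset
sadness_by = rng.normal(**base) - 5  # blue-yellow, sadness: -5 offset
control_rg = rng.normal(**base)      # red-green: no true difference
sadness_rg = rng.normal(**base)

t_stat, p_t = stats.ttest_ind(control_by, sadness_by)
u_stat, p_u = stats.mannwhitneyu(control_by, sadness_by, alternative="two-sided")
print(f"blue-yellow: t-test p = {p_t:.4f}, U-test p = {p_u:.4f}")
```

The true effect size here is d = 1 (a 10-point mean difference against SD 10), so with n=50 per group both tests have very high power.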
Now all four types of plot reflect this difference between control and sadness groups fairly clearly. The bar graph in particular tracks the true population means in each group. Even in the scatter plot the difference is clearly apparent, even though the distributions overlap considerably. The difference seems a lot less obvious in the whisker and cat-eye plots, however. The notches in the whisker plot do not overlap, although they come very close. The difference is more visually striking in the cat-eye plot, but it isn’t immediately apparent from the plot how much confidence this should instill in the result.
Conclusions & Confusions
My preliminary conclusion is that all of this is actually more confusing than I thought. I am inclined to agree that the bar graph (or a similar symbol-and-error-bar plot) seems to overstate the strength of the evidence somewhat (although one should note that this is partly because of the truncated y-scales that such plots usually employ). On the other hand, showing the individual subject data does seem to understate the results considerably, except when the effect is pretty strong. So perhaps things like whisker or cat-eye (violin/bottle/balloon) plots are the most suitable, but in my view they also aren’t as intuitive as some people seem to suggest. Obviously, I am not the first person to think about these things, nor have I spent an extraordinarily long time thinking about them. It might be useful to conduct an experiment/survey in which people have to judge the strength of effects based on different kinds of plots. Anyway, in general I would be very curious to hear other people’s thoughts.
The Matlab code and data file for these examples can be found here.