Disclaimer: This is a follow-up to my previous post about the discussion between Niko Kriegeskorte and Brad Love. Here are my scientific views on the preprint by Bobadilla-Suarez, Ahlheim, Mehrotra, Panos, & Love and some of the issues raised by Kriegeskorte in his review/blog post. This is not a review and therefore not as complete as a review would be, and it contains some additional explanations and non-scientific points. Given my affiliation with Bobadilla-Suarez’s department, a formal review for a journal would constitute a conflict of interest anyway.
What’s the point of all this?
I was first attracted to Niko’s post because just the other day my PhD student and I discussed the possibility of running a new study using Representational Similarity Analysis (RSA). Given the title of his post, I jokingly asked him what was the TL;DR answer to the question “What’s the best measure of representational dissimilarity?”. At the time, I had no idea that this big controversy was brewing… I have used multivoxel pattern analyses in the past and am reasonably familiar with RSA but I have never used it in published work (although I am currently preparing a manuscript that contains one such analysis). The answer to this question is therefore pretty relevant to me.
RSA is a way to quantify the similarity of patterns of brain responses (usually measured as voxel response patterns with fMRI or the firing rates of a set of neurons etc) to a range of different stimuli. This produces a (dis)similarity matrix where each pairwise comparison is a cell that denotes how similar/confusable the response patterns to those stimuli are. In turn, the pattern of these similarities (the “representational similarity”) then allows researchers to draw inferences about how particular stimuli (or stimulus dimensions) are encoded in the brain. Here is an illustration:
The person called Warshort believes journal reviews, preprint comments, and blog posts to be more or less the same thing, public commentaries on published research. The logic of RSA is that somewhere in their brain the pattern of neural activity evoked by these three concepts is similar. Contrast this to person Liebe who regards reviews and preprint comments to be similar (but not as similar as Warshort would) but who considers personal blog posts to be diametrically opposed to reviews.
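To make the idea of a (dis)similarity matrix concrete, here is a minimal sketch using NumPy. The response patterns are made-up numbers (three stimuli by six voxels) chosen to mimic the Liebe example above, and the helper name `correlation_rdm` is my own invention, not something taken from the preprint or from any RSA toolbox. It uses one minus Pearson's r as the dissimilarity, which is only one of many possible choices.

```python
import numpy as np


def correlation_rdm(patterns):
    """Return the stimuli-by-stimuli matrix of 1 - Pearson r between rows."""
    patterns = np.asarray(patterns, dtype=float)
    # np.corrcoef treats each row as one variable, i.e. one stimulus pattern.
    return 1.0 - np.corrcoef(patterns)


# Hypothetical response patterns (rows: stimuli, columns: voxels).
patterns = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # journal review
    [1.2, 1.9, 3.1, 4.2, 4.8, 6.1],  # preprint comment
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # blog post
])

rdm = correlation_rdm(patterns)
print(np.round(rdm, 3))
```

In this toy case the review and the preprint comment come out as nearly identical (dissimilarity close to 0), while the blog post is maximally dissimilar from the review (dissimilarity of 2, because the two patterns are perfectly anti-correlated).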
What is the research question?
According to their introduction, Bobadilla-Suarez et al. set themselves the following goals:
“The first goal was to ascertain whether the similarity measures used by the brain differ across regions. The second goal was to investigate whether the preferred measures differ across tasks and stimulus conditions. Our broader aim was to elucidate the nature of neural similarity.”
In some sense, it is one of the overarching goals of cognitive neuroscience to answer that final question, so they certainly have their work cut out for them. But looking at this more specifically, the question of the best measure of comparing brain states across conditions and how this depends on where and what is being compared is an important one for the field.
Unfortunately, to me this question seems ill-posed in the context of this study. If the goal is to understand what similarity measures are “used by the brain” we immediately need to ask ourselves whether the techniques used to address this question are appropriate to answer it. This is largely a conceptual point, and the study’s first caveat for me. We could instead reinterpret this into a technical comparison of different methods, but therein lies another caveat and this seems to be the main concern Kriegeskorte raised in his review. I’ll elaborate on both these points in turn:
The conceptual issue
I am sure the authors are fully aware of the limitations of making inferences about neural representations from brain imaging data. Any such inferences can only be as good as the method for measuring brain responses. Most studies using RSA are based on fMRI data which measures a metabolic proxy of neuronal activity. While fMRI experiments have doubtless made important discoveries about how the brain is organised and functions, this is a caveat we need to take seriously: there may very well be information in brain activity that is not directly reflected in fMRI measures. It is almost certainly not the case that brain regions communicate with one another directly via reading out their respective metabolic activity patterns.
This issue is further complicated by the fact that RSA studies using fMRI are based on voxel activity patterns. Voxels are individual elements in a brain image, the equivalent of pixels in a digital image. How a brain scan is subdivided into voxels is completely arbitrary and depends on a lot of methodological choices and parameters. The logic of using voxel patterns for RSA is that individual voxels will usually exhibit biased responses depending on the stimulus. However, the nature of these response biases remains highly controversial and also quite likely depends on what brain states (visual stimuli, complex tasks, memories, etc) are being compared. Critically, voxel patterns cannot possibly be directly relevant to neural encoding. At best, they are indirectly correlated with the underlying patterns, and naturally, the voxel resolution may also matter. In theory, two stimuli could be encoded by completely non-overlapping and unconnected neuronal populations which are nevertheless mixed into the same voxels. Even if voxel responses were a direct measure of neuronal activity, they might not show any biased responses at all, and the voxel response pattern would therefore carry no information about the stimuli whatsoever.
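The worst case described above can be illustrated with a deliberately artificial toy model. Everything here is an assumption for illustration: eight "neurons", four "voxels", and a mixing matrix in which every voxel averages exactly one neuron from each population. Real voxels sample vastly more neurons and are never this perfectly balanced.

```python
import numpy as np

n_neurons, n_voxels = 8, 4

# Two stimuli drive completely non-overlapping neuronal populations.
resp_a = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
resp_b = 1.0 - resp_a

# Hypothetical mixing: each voxel averages one neuron from each population.
mixing = np.zeros((n_voxels, n_neurons))
for v in range(n_voxels):
    mixing[v, v] = 0.5      # neuron from the population driven by stimulus A
    mixing[v, v + 4] = 0.5  # neuron from the population driven by stimulus B

voxel_a = mixing @ resp_a
voxel_b = mixing @ resp_b

# Zero overlap at the neuronal level, yet identical voxel patterns.
neuron_overlap = float(resp_a @ resp_b)
voxels_identical = bool(np.allclose(voxel_a, voxel_b))
print(neuron_overlap, voxels_identical)
```

At the neuronal level the two stimuli could not be more distinct, but the voxel patterns are indistinguishable, so no similarity measure applied to the voxels could recover the difference.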
But there is an even more fundamental issue here, and it applies regardless of which brain measure is used, be it voxel patterns or the firing rates of actual neurons. The authors’ stated goal is to reveal what measure the brain itself uses to establish the similarity of brain states. The measures they compare are statistical methods, e.g. the Pearson correlation coefficient or the Mahalanobis distance between two response patterns. But the brain is no statistician. At most, a statistical quantity like a Pearson’s r might be a good description of what some read-out neurons somewhere in the processing hierarchy do to categorise the response patterns in up-stream regions. This may sound like an unnecessarily pedantic semantic distinction, but I’d disagree: by only testing predefined statistical models of how pattern similarity could be quantified, we may impose an artificially biased set of models. The actual way this is implemented in neuronal circuits may very well be a hybrid or a completely different process altogether. Neural similarity might linearly correlate with Pearson’s r over some range, say between r = 0.5 and r = 1, but then be more consistent with a magnitude code at the lower end of similarities. It might also come with built-in thresholding or rectifying mechanisms in which patterns below a certain criterion are automatically encoded as dissimilar. Of course, you have to start somewhere and the models the authors used are reasonable choices. However, this description should be more circumspect in my view because in the best case we could really only say that the results suggest a mechanism that is well described by a given statistical model.
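To show why the choice of statistical measure matters at all, here is a small sketch comparing two of the measures mentioned above on made-up patterns. The vectors `x`, `y`, `z` and the noise covariance `cov` are hypothetical values I picked so that the two measures disagree; they are not taken from the preprint. Pearson's r ignores overall response magnitude, whereas the Mahalanobis distance does not.

```python
import numpy as np


def pearson_distance(u, v):
    """Dissimilarity as 1 - Pearson r (insensitive to scaling and offset)."""
    return float(1.0 - np.corrcoef(u, v)[0, 1])


def mahalanobis_distance(u, v, cov):
    """Mahalanobis distance between u and v given a covariance matrix."""
    diff = u - v
    # Solve cov @ w = diff instead of explicitly inverting cov.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                         # same shape as x, doubled magnitude
z = np.array([1.5, 1.5, 3.5, 3.5])  # similar magnitude, different shape

# Assumed (made-up) noise covariance: voxels 3 and 4 are noisier.
cov = np.diag([1.0, 1.0, 4.0, 4.0])

print(pearson_distance(x, y), pearson_distance(x, z))
print(mahalanobis_distance(x, y, cov), mahalanobis_distance(x, z, cov))
```

By the correlation measure, `x` is closer to `y` than to `z`; by the Mahalanobis distance it is the other way round. Which of these orderings, if either, resembles what read-out neurons do is exactly the open question.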
Finally, the authors seem to make an implicit assumption that does not necessarily hold: there is actually no reason to accept up-front that the brain quantifies pattern similarity at all. I assume that it does, and it is certainly an important assumption to be tested empirically. But in theory it seems entirely possible that spatial patterns of neural activity in a particular brain region are an epiphenomenon of how neurons in that region are organised. The mere existence of such patterns does not mean that downstream neurons necessarily use this pattern information. I’d wager this almost certainly also depends on the stimulus/task. For instance, a higher-level neuron whose job it is to determine whether a stimulus appeared on the left or the right presumably uses the spatial pattern of retinotopically-organised responses in the earlier visual regions. For other, more complex stimulus dimensions, this may not be the case.
The technical issue
This brings me to the other caveat I see with Bobadilla-Suarez et al.’s approach here. As I said, this is largely the same point made by Kriegeskorte in his review and since this takes up most of his post I’ll keep it brief. If we brush aside the conceptual points I made above and instead assume that the brain indeed determines the similarity of response patterns in up-stream areas, what is the best way to test how it does this? The authors used a machine learning classifier to perform pair-wise decoding of different stimuli and construct a confusability matrix. Conceptually, this is pretty much the same as the similarity matrix derived from the other measures they are testing (e.g. Pearson’s r) but it instead uses a classifier algorithm to determine the discriminability of the response patterns. The authors then compare these decoding matrices with those based on the similarity measures they tested.
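As a rough sketch of what such a comparison could look like, the snippet below correlates the off-diagonal entries of a pair-wise decoding matrix with those of two candidate dissimilarity matrices. Note the assumptions: all numbers are invented, the helper names are mine, and I use a Spearman rank correlation as the comparison statistic purely for illustration. I am not claiming this is the statistic used in the preprint.

```python
import numpy as np


def symmetric_from_pairs(values, n):
    """Build an n x n symmetric matrix from its upper-triangle entries."""
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = values
    return m + m.T


def upper_triangle(m):
    """Return the entries above the diagonal, one per stimulus pair."""
    return m[np.triu_indices_from(m, k=1)]


def spearman(a, b):
    """Spearman rank correlation (simple version, assumes no tied values)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


n_stimuli = 4
# Made-up pair-wise decoding accuracies and two made-up dissimilarity measures.
decoding = symmetric_from_pairs([0.55, 0.80, 0.90, 0.75, 0.95, 0.60], n_stimuli)
measure_a = symmetric_from_pairs([0.10, 0.70, 1.20, 0.50, 1.50, 0.20], n_stimuli)
measure_b = symmetric_from_pairs([0.90, 0.30, 0.40, 0.80, 0.20, 1.00], n_stimuli)

rho_a = spearman(upper_triangle(decoding), upper_triangle(measure_a))
rho_b = spearman(upper_triangle(decoding), upper_triangle(measure_b))
print(rho_a, rho_b)
```

Here `measure_a` orders the stimulus pairs exactly as the decoder does, whereas `measure_b` does not, so under this logic `measure_a` would be declared the better match. Whether matching a decoder tells us anything about the brain is the issue discussed next.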
As Kriegeskorte suggests, these decoding methods are just another method of determining neural similarity. Different kinds of decoders are also closely related to the various methods Bobadilla-Suarez et al. compared: the Mahalanobis distance isn’t conceptually very far from a linear discriminant decoder, and you can actually build a classifier using Pearson’s r (in fact, this is the classifier I mostly used in my own studies).
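For readers who have not seen one, a correlation-based classifier can be sketched in a few lines. This is a generic nearest-template version written for illustration, with invented training and test patterns; it is not the code from my own studies or from the preprint. Each test pattern is assigned to the class whose mean training pattern it correlates with most strongly.

```python
import numpy as np


def correlation_classify(train_patterns, train_labels, test_pattern):
    """Assign test_pattern to the class whose mean pattern has the highest r."""
    train_patterns = np.asarray(train_patterns, dtype=float)
    labels = np.asarray(train_labels)
    classes = np.unique(labels)
    # One template per class: the average of its training patterns.
    templates = [train_patterns[labels == c].mean(axis=0) for c in classes]
    r_values = [np.corrcoef(t, test_pattern)[0, 1] for t in templates]
    return int(classes[int(np.argmax(r_values))])


# Made-up training data: two patterns per class, five voxels each.
train = np.array([
    [1.0, 2.0, 3.0, 2.0, 1.0],
    [1.1, 2.1, 2.9, 2.2, 0.9],
    [3.0, 2.0, 1.0, 2.0, 3.0],
    [2.9, 2.2, 1.1, 1.8, 3.1],
])
labels = [0, 0, 1, 1]

# Made-up test patterns, one resembling each class.
tests = np.array([
    [0.9, 1.8, 3.2, 2.1, 1.0],
    [3.2, 1.9, 0.8, 2.1, 2.9],
])
predicted = [correlation_classify(train, labels, t) for t in tests]
print(predicted)
```

The point of the sketch is that the decision rule is nothing more than a comparison of Pearson correlations, so a decoder of this kind and a correlation-based similarity measure are two views of the same statistic.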
The premise of Bobadilla-Suarez et al.’s study therefore seems circular. They treat decodability of neural activity patterns as the ground truth of neural similarity, and that assumption seems untenable to me. They discuss the confound that the choice of decoding algorithm would affect the results and therefore advocate using the best available algorithm, yet this doesn’t really address the underlying issue. The decoder establishes the statistical similarity between neural response patterns. It does not quantify the actual neural similarity code – as a matter of fact, it cannot possibly do so.
It is therefore also unsurprising if the similarity measure that best matches classifier performance is the one that is closest to what the given classifier algorithm is based on. I may have missed this, but I cannot discern from the manuscript which classifier was actually used for the final analyses, only that the best of three was chosen. The best classifier was determined separately for the two datasets the authors used, which could be one explanation for why their results differ between them.
Bobadilla-Suarez et al. ask an interesting and important question but I don’t think the study as it is can actually address it. There is a conceptual issue in that the brain may not necessarily use any of the available statistical models to quantify neural similarity, and in fact it may not do so at all. Of course, it is perfectly valid to compare different models of how it achieves this feat and any answer to this question need not be final. It does however seem to me that this is more of a methodological comparison than an attempt to establish what the brain is actually doing.
To my understanding, the approach the authors used to establish which similarity measure is best cannot answer this question. In this I appear to concur with Kriegeskorte’s review. Perhaps I am wrong of course, as the authors have previously suggested that Kriegeskorte “missed the point”, in which case I would welcome further explanation of the authors’ rationale here. However, from where I’m currently standing, I would recommend that the authors reframe their manuscript as a methodological comparison and be more circumspect with regard to claims about neural representations.
The results shown here are certainly not without merit. By comparing commonly used similarity measures to the best available decoding algorithm they may not establish which measure is closest to what the brain is doing, but they certainly do show how these measures compare to complex classification algorithms. This in itself is informative for practical reasons because decoding is computationally expensive. Any squabbling aside, the authors show that the most commonly used measure, Pearson’s correlation, clearly does not perform in the same way as a lot of other possible techniques. This finding should also be of interest to anyone conducting an RSA experiment.
Some final words
I hope the authors find this comment useful. Just because I agree with Kriegeskorte’s main point, I hope that doesn’t make me his “acolyte” (I have neither been trained by him nor would I say that we stem from the same theoretical camp). I may have “missed the point” too, in which case I would appreciate further insight.
I find it very unfortunate that instead of a decent discussion on science, this debate descended into something not far above a poo-slinging contest. I have deliberately avoided taking sides in that argument because of my relationship to either side. While I vehemently object to the manner in which Brad responded to Niko’s post, I think it should be obvious that not everybody is on the same wavelength when it comes to open reviewing. It is depressing and deeply unsettling how many people on either side of this divide appear to be unwilling to even try to understand the other point of view.