
The A Team Sets fMRI to Rights

Remember the voodoo correlations and double-dipping controversies that rocked the world of fMRI last year? Well, the guys responsible have teamed up and written a new paper together. They are...

The paper is Everything you never wanted to know about circular analysis, but were afraid to ask. Our all-star team of voodoo-hunters - including Ed "Hannibal" Vul (now styled Professor Vul), Nikolaus "Howling Mad" Kriegeskorte, and Russell "B. A." Poldrack - provide a good overview of the various issues and offer their opinions on how the field should move forward.

The fuss concerns a statistical trap that it's easy for neuroimaging researchers, and certain other scientists, to fall into. Suppose you have a large set of data - like a scan of the brain, which is a set of perhaps 40,000 little cubes called voxels - and you search it for data points where there is a statistically significant effect of some kind.

Because you're searching in so many places, in order to avoid getting lots of false positives you set the threshold for significance very high. That's fine in itself, but a problem arises if you find some significant effects and then take those significant data points and use them as a measure of the size of the effects - because you have specifically selected your data points on the basis that they show the very biggest effects out of all your data. This is called the non-independence error and it can make small effects seem much bigger.
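The trap is easy to see in a short simulation (a minimal sketch with entirely made-up numbers: 40,000 "voxels", each carrying the same small true effect). Averaging over all voxels recovers the true effect; averaging over only the voxels that pass a strict significance threshold overestimates it:

```python
# Sketch of the non-independence error on hypothetical data: every voxel
# has a small true effect, but the voxels that survive a strict threshold
# are the ones whose noisy estimates happened to come out largest.
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_subjects = 40_000, 20
true_effect = 0.3  # the same small true effect in every voxel

# Noisy per-subject data, then a per-voxel estimate and t statistic
data = true_effect + rng.normal(0, 1, size=(n_subjects, n_voxels))
estimates = data.mean(axis=0)
se = data.std(axis=0, ddof=1) / np.sqrt(n_subjects)
t = estimates / se

# Strict threshold, to control false positives across 40,000 tests
selected = estimates[t > 4.5]

print(f"true effect:               {true_effect}")
print(f"mean over all voxels:      {estimates.mean():.3f}")  # close to 0.3
print(f"mean over selected voxels: {selected.mean():.3f}")   # inflated
```

The strict threshold is fine for deciding *where* the effects are; the error only arises when the surviving voxels are reused to estimate *how big* the effects are.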

The latest paper offers little that's new in terms of theory, but it's a good read and it's interesting to get the authors' expert opinion on some hot topics. Here's what they have to say about the question of whether it's acceptable to present results that suffer from the non-independence error just to "illustrate" your statistically valid findings:

Q: Are visualizations of non-independent data helpful to illustrate the claims of a paper?

A: Although helpful for exploration and story telling, circular data plots are misleading when presented as though they constitute empirical evidence unaffected by selection. Disclaimers and graphical indications of circularity should accompany such visualizations.
Now an awful lot of people - and I confess that I've been among them - do this without the appropriate disclaimers. Indeed, it is routine. Why? Because it can be useful illustration - although the size of the effects appears to be inflated in such graphs, on a qualitative level they provide a useful impression of the direction and nature of the effects.

But the A Team are right. Such figures are misleading - they mislead about the size of the effect, even if only inadvertently. We should use disclaimers, or ideally, avoid using misleading graphs. Of course, this is a self-appointed committee: no-one has to listen to them. We really should though, because what they're saying is common sense once you understand the issues.

It's really not that scary - as I said on this blog at the outset, this is not going to bring the whole of fMRI crashing down and end everyone's careers; it's a technical issue, but it is a serious one, and we have no excuse for not dealing with it.

Kriegeskorte, N., Lindquist, M., Nichols, T., Poldrack, R., & Vul, E. (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. DOI: 10.1038/jcbfm.2010.86

New, Voodoo-Free fMRI Technique

Fedorenko et al of MIT present A new method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Also on the list of authors is Nancy Kanwisher, one of the feared fMRI voodoo correlations posse.

The paper describes a technique for mapping out the "language areas" of the brain in individual people, not for their own sake, but as a way of improving other fMRI studies of language. That's important because while everyone's brain is organized roughly the same way, there are always individual differences in the shape, size and location of the different regions.

This is a problem for fMRI researchers. Suppose you scan 10 people and show them pictures of apples and pictures of pears. And suppose that apples activate the brain's Fruit Cortex much more strongly than pears. But unfortunately, the Fruit Cortex is a small area, and its location varies between people. In fact, in your 10 subjects, no-one's Fruit Cortex overlaps with anyone else's, even though everyone has one and they all work exactly the same way.

If you did this experiment you'd fail to find the effect of apples vs. pears, even though it's a strong effect, because there will be no one place in the brain where apples reliably cause more activation. What you need is a way of finding the Fruit Cortex in each person beforehand. What you'd need to do is a functional localization scan - say, showing people a big bowl of fruit - as a preliminary step.
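The Fruit Cortex story can be simulated in a few lines (all numbers invented): a strong effect confined to a small region whose location differs across subjects is invisible in a voxel-wise group average, but obvious once each subject's region is localized individually.

```python
# Toy simulation: 10 subjects, each with a 10-voxel "Fruit Cortex" at a
# different (non-overlapping) location, showing a strong apples-vs-pears
# effect of 1.0 against low measurement noise.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_voxels, roi_size = 10, 1000, 10
effect = 1.0

data = rng.normal(0, 0.1, size=(n_subjects, n_voxels))
# Distinct, non-overlapping ROI locations for each subject
roi_starts = rng.choice(np.arange(0, n_voxels, roi_size), n_subjects, replace=False)
for s, start in enumerate(roi_starts):
    data[s, start:start + roi_size] += effect

# Group analysis: no single voxel is strong on average, because no two
# subjects' regions overlap
group_map = data.mean(axis=0)
print(f"best group-average voxel: {group_map.max():.2f}")

# Subject-specific ROIs (as if from a localizer scan): effect recovered
roi_means = [data[s, st:st + roi_size].mean() for s, st in enumerate(roi_starts)]
print(f"mean effect in own ROI:   {np.mean(roi_means):.2f}")
```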

Fedorenko et al scanned a bunch of people doing a simple reading task, and compared that to a control condition: reading random lists of nonsense with no linguistic meaning. As you can see, there's a lot of variation between people, but there's also clearly a basic pattern of activation: it looks a bit like a tilted "V" on the left side of the brain:

These are the language areas of each person. (Incidentally, this is why fMRI, despite its limitations, is an amazing technology. There is no better way of measuring this activation. EEG is cheaper but nowhere near as good at localizing activity; PET is close, but it's slow, expensive and involves injecting people with radioactivity.)

Fedorenko et al then overlapped all the individual images to produce a map of the brain showing how many people got activation in each part:

The most robust activations were on the left side of the brain, and they formed a nice "V" shape again. These are the areas which have long been known to be involved in language, so this is not surprising in itself.

Here's the clever bit: they then took the areas activated in a large percentage of people, and automatically divided them up into sub-regions; each of the "peaks" where an especially large proportion of subjects showed activation became a separate region.

This is on the assumption that these peaks represent parts of the brain with distinct functions - separate "language modules" as it were. But each module will be in a slightly different place in each person (see the first picture). So they overlapped the subdivisions with the individual activation blobs to get a set of individual functional zones they call Group-constrained Subject-Specific functional Regions of Interest, or GcSSfROIs to their friends.
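As a rough sketch of the idea (my own simplification, not the authors' code), a subject-specific fROI is essentially the intersection of a group-level parcel with that subject's own suprathreshold activation map:

```python
# Minimal sketch of the GcSSfROI construction with hypothetical masks:
# group parcel AND subject's thresholded activation = that subject's fROI.
import numpy as np

rng = np.random.default_rng(2)
n_voxels = 1000

# A parcel derived from the group overlap map (hypothetical location)
group_parcel = np.zeros(n_voxels, dtype=bool)
group_parcel[100:200] = True

# One subject's thresholded language-localizer map: some scattered noise
# plus a genuine response falling inside the parcel
subject_active = rng.random(n_voxels) < 0.1
subject_active[120:160] = True

gcss_froi = group_parcel & subject_active
print(f"parcel size: {group_parcel.sum()}, subject-specific fROI: {gcss_froi.sum()}")
```

The intersection keeps the parcel's anatomical constraint while respecting where this particular subject's activation actually is.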

Fedorenko et al claim various advantages to this technique, and present data showing that it produces nice results in independent subjects (i.e. not the ones they used to make the group map in the first place.)

In particular, they argue that it should allow future fMRI studies to have a better chance of finding the specific functions of each region. So far, experiments using fMRI to investigate language have largely failed to find activations specific to particular aspects of language like grammar, word meaning, etc. which is unexpected because patients suffering lesions to specific areas often do show very selective language problems.

Does this relate to the voodoo correlations issue? Indirectly, yes. The voodoo (non-independence error) problem arises when you do a large number of comparisons and then focus on the "best" results, because these are likely to be that good wholly, or partly, by chance.

Fedorenko et al's method allows you to avoid doing lots of comparisons in the first place. Instead of looking all over the whole brain for something interesting, you can first do a preliminary scan to map out where in each person's brain interesting stuff is likely to happen, and then focus on those bits in the real experiment.

There's still a multiple-comparisons problem: Fedorenko et al identified 16 candidate language areas per brain, and future studies could well provide more. But that's nothing compared to the 40,000 voxels in a typical whole-brain analysis. We'll have to wait and see if this technique proves useful in the real world, but it's an interesting idea...

Fedorenko, E., Hsieh, P., Nieto Castanon, A., Whitfield-Gabrieli, S., & Kanwisher, N. (2010). A new method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Journal of Neurophysiology. DOI: 10.1152/jn.00032.2010

Can We Rely on fMRI?

Craig Bennett (of Prefrontal.org) and Michael Miller, of dead fish brain scan fame, have a new paper out: How reliable are the results from functional magnetic resonance imaging?


Tal over at the [citation needed] blog has an excellent in-depth discussion of the paper, and Mind Hacks has a good summary, but here's my take on what it all means in practical terms.

Suppose you scan someone's brain while they're looking at a picture of a cat. You find that certain parts of their brain are activated to a certain degree by looking at the cat, compared to when they're just lying there with no picture. You happily publish your results as showing The Neural Correlates of Cat Perception.

If you then scanned that person again while they were looking at the same cat, you'd presumably hope that exact same parts of the brain would light up to the same degree as they did the first time. After all, you claim to have found The Neural Correlates of Cat Perception, not just any old random junk.

If you did find a perfect overlap in the area and the degree of activation that would be an example of 100% test-retest reliability. In their paper, Bennett and Miller review the evidence on the test-retest reliability of fMRI studies. They found 63 of them. On average, they found that the reliability of fMRI falls quite far short of perfection: the areas activated (clusters) had a mean Dice overlap of 0.476, while the strength of activation was correlated with a mean ICC of 0.50.

But those numbers, taken out of context, do not mean very much. Indeed, what is a Dice overlap? You'll have to read the whole paper to find out, but even when you do, they still don't mean that much. I suspect this is why Bennett and Miller don't mention them in the Abstract of the paper, and in fact they don't spend more than a few lines discussing them at all.

A Dice overlap of 0.476 and an ICC of 0.50 are what you get if you average over all of the studies that anyone's done looking at the test-retest reliability of any particular fMRI experiment. But different fMRI experiments have different reliabilities. Saying that the average reliability of fMRI is 0.5 is rather like saying that the mean velocity of a human being is 0.3 km per hour. That's probably about right, averaging over everyone in the world, including those who are asleep in bed and those who are flying on airplanes - but it's not very useful. Some people are moving faster than others, and some scans are more reliable than others.
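For concreteness, here is how the two statistics can be computed - Dice overlap between two thresholded activation maps, and ICC(2,1) between two sessions' activation strengths. The data below are made up; this is a sketch of the measures, not of Bennett and Miller's analysis:

```python
# Dice overlap and ICC(2,1) on hypothetical scan-rescan data.
import numpy as np

def dice(a, b):
    """Dice overlap of two boolean masks: 2|A intersect B| / (|A| + |B|)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    return 2 * (a & b).sum() / (a.sum() + b.sum())

def icc_2_1(x):
    """ICC(2,1), absolute agreement; x has shape (targets, sessions)."""
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two hypothetical thresholded maps from a scan-rescan study
session1 = np.array([0, 1, 1, 0, 1, 0, 0, 1], bool)
session2 = np.array([0, 1, 0, 0, 1, 1, 0, 1], bool)
print(f"Dice overlap: {dice(session1, session2):.2f}")  # 0.75

# Activation strengths for four regions, measured in two sessions
strengths = np.array([[1.0, 1.1], [2.0, 1.8], [0.5, 0.7], [1.5, 1.4]])
print(f"ICC(2,1): {icc_2_1(strengths):.2f}")  # 0.96
```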


Most of this paper is not concerned with "how reliable fMRI is", but rather, with how to make any given scanning experiment more reliable. And this is an important thing to write about, because even the most optimistic cognitive neuroscientist would agree that many fMRI results are not especially reliable, and as Bennett and Miller say, reliability matters for lots of reasons:

Scientific truth. While it is a simple statement that can be taken straight out of an undergraduate research methods course, an important point must be made about reliability in research studies: it is the foundation on which scientific knowledge is based. Without reliable, reproducible results no study can effectively contribute to scientific knowledge.... if a researcher obtains a different set of results today than they did yesterday, what has really been discovered?
Clinical and Diagnostic Applications. The longitudinal assessment of changes in regional brain activity is becoming increasingly important for the diagnosis and treatment of clinical disorders...
Evidentiary Applications. The results from functional imaging are increasingly being submitted as evidence into the United States legal system...
Scientific Collaboration. A final pragmatic dimension of fMRI reliability is the ability to share data between researchers...
So what determines the reliability of any given fMRI study? Lots of things. Some of them are inherent to the nature of the brain, and are not really things we can change: activation in response to basic perceptual and motor tasks is probably always going to be more reliable than activation related to "higher" functions like emotions.

But there are lots of things we can change. Although it's rarely obvious from the final results, researchers make dozens of choices when designing and analyzing an fMRI experiment, many of which can at least potentially have a big impact on the reliability of their findings. Bennett and Miller cover lots of them:
voxel size... repetition time (TR), echo time (TE), bandwidth, slice gap, and k-space trajectory... spatial realignment of the EPI data can have a dramatic effect on lowering movement-related variance ... Recent algorithms can also help remove remaining signal variability due to magnetic susceptibility induced by movement... simply increasing the number of fMRI runs improved the reliability of their results from ICC = 0.26 to ICC = 0.58. That is quite a large jump for an additional ten or fifteen minutes of scanning...
The details get extremely technical, but then, when you do an fMRI scan you're using a superconducting magnet to image human neural activity by measuring the quantum spin properties of protons. It doesn't get much more technical.

Perhaps the central problem with modern neuroimaging research is that it's all too easy for researchers to write off the important experimental design issues as "merely" technicalities, and just put some people in a scanner using the default scan sequence and see what happens. This is something few fMRI users are entirely innocent of, and I'm certainly not, but it is a serious problem. As Bennett and Miller point out, the devil is in the technical details.
The generation of highly reliable results requires that sources of error be minimized across a wide array of factors. An issue within any single factor can significantly reduce reliability. Problems with the scanner, a poorly designed task, or an improper analysis method could each be extremely detrimental. Conversely, elimination of all such issues is necessary for high reliability. A well maintained scanner, well designed tasks, and effective analysis techniques are all prerequisites for reliable results.
Bennett, C.M., & Miller, M.B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences.

fMRI Gets Slap in the Face with a Dead Fish

A reader drew my attention to this gem from Craig Bennett, who blogs at prefrontal.org:

Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction

This is a poster presented by Bennett and colleagues at this year's Human Brain Mapping conference. It's about fMRI scanning on a dead fish, specifically a salmon. They put the salmon in an MRI scanner and "the salmon was shown a series of photographs depicting human individuals in social situations. The salmon was asked to determine what emotion the individual in the photo must have been experiencing."

I'd say that this research was justified on comedic grounds alone, but they were also making an important scientific point. The (fish-)bone of contention here is multiple comparisons correction. The "multiple comparisons problem" is simply the fact that if you do a lot of different statistical tests, some of them will, just by chance, give interesting results.

In fMRI, the problem is particularly severe. An MRI scan divides the brain up into cubic units called voxels. There are over 40,000 in a typical scan. Most fMRI analysis treats every voxel independently, and tests to see if each voxel is "activated" by a certain stimulus or task. So that's at least 40,000 separate comparisons going on - potentially many more, depending upon the details of the experiment.

Luckily, during the 1990s, fMRI pioneers developed techniques for dealing with the problem: multiple comparisons correction. The most popular method uses Gaussian Random Field Theory to calculate the probability of falsely "finding" activated areas just by chance, and to keep this acceptably low (details), although there are other alternatives.

But not everyone uses multiple comparisons correction. This is where the fish comes in - Bennett et al show that if you don't use it, you can find "neural activation" even in the tiny brain of a dead fish. Of course, with the appropriate correction, you don't. There's nothing original about this, except the colourful nature of the example - but many fMRI publications still report "uncorrected" results (here's just the last one I read).
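The salmon's point is easy to reproduce with random numbers. The sketch below uses simple Bonferroni correction as a stand-in for the Gaussian Random Field methods mentioned above: on 40,000 tests of pure noise, an uncorrected threshold "finds" thousands of active voxels; a corrected one finds essentially none.

```python
# Pure-noise "scan": every voxel is a draw from N(0,1), like a dead fish.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n_voxels = 40_000
z = rng.standard_normal(n_voxels)

alpha = 0.05
z_uncorrected = NormalDist().inv_cdf(1 - alpha)             # ~1.64
z_bonferroni = NormalDist().inv_cdf(1 - alpha / n_voxels)   # ~4.70

print(f"uncorrected 'activations': {(z > z_uncorrected).sum()}")  # ~2,000
print(f"corrected 'activations':   {(z > z_bonferroni).sum()}")   # ~0
```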

Bennett concludes that "the vast majority of fMRI studies should be utilizing multiple comparisons correction as standard practice". But he says on his blog that he's encountered some difficulty getting the results published as a paper, because not everyone agrees. Some say that multiple comparisons correction is too conservative, and could lead to genuine activations being overlooked - throwing the baby salmon out with the bathwater, as it were. This is a legitimate point, but as Bennett says, in this case we should report both corrected and uncorrected results, to make it clear to the readers what is going on.

More Brain Voodoo, and This Time, It's Not Just fMRI

Ed Vul et al recently created a splash with their paper, Puzzlingly high correlations in fMRI studies of emotion, personality and social cognition (better known by its previous title, Voodoo Correlations in Social Neuroscience.) Vul et al accused a large proportion of the published studies in a certain field of neuroimaging of committing a statistical mistake. The problem, which they call the "non-independence error", may well have made the results of these experiments seem much more impressive than they should have been. Although there was no suggestion that the error was anything other than an honest mistake, the accusations still sparked a heated and ongoing debate. I did my best to explain the issue in layman's terms in a previous post.

Now, like the aftershock following an earthquake, a second paper has appeared, from a different set of authors, making essentially the same accusations. But this time, they've cast their net even more widely. Vul et al focused on only a small sub-set of experiments using fMRI to examine correlations between brain activity and personality traits. But they implied that the problem went far beyond this niche field. The new paper extends the argument to encompass papers from across much of modern neuroscience.

The article, Circular analysis in systems neuroscience: the dangers of double dipping, appears in the extremely prestigious Nature Neuroscience journal. The lead author, Dr. Nikolaus Kriegeskorte, is a postdoc in the Section on Functional Imaging Methods at the National Institutes of Health (NIH).

Kriegeskorte et al's essential point is the same as Vul et al's. They call the error in question "circular analysis" or "double-dipping", but it is the same thing as Vul et al's "non-independent analysis". As they put it, the error could occur whenever

data are first analyzed to select a subset and then the subset is reanalyzed to obtain the results.
and it will be a problem whenever the selection criteria in the first step are not independent of the reanalysis criteria in the second step. If the two sets of criteria are independent, there is no problem.


Suppose that I have some eggs. I want to know whether any of the eggs are rotten. So I put all the eggs in some water, because I know that rotten eggs float. Some of the eggs do float, so I suspect that they're rotten. But then I decide that I also want to know the average weight of my eggs. So I take a handful of eggs within easy reach - the ones that happen to be floating - and weigh them.

Obviously, I've made a mistake. I've selected the eggs that weigh the least (the rotten ones) and then weighed them. They're not representative of all my eggs. Obviously, they will be lighter than the average. Obviously. But in the case of neuroscience data analysis, the same mistake may be much less obvious. And the worst thing about the error is that it makes data look better, i.e. more worth publishing:
Distortions arising from selection tend to make results look more consistent with the selection criteria, which often reflect the hypothesis being tested. Circularity is therefore the error that beautifies results, rendering them more attractive to authors, reviewers and editors, and thus more competitive for publication. These implicit incentives may create a preference for circular practices so long as the community condones them.
To try to establish how prevalent the error is, Kriegeskorte et al reviewed all of the 134 fMRI papers published in the highly regarded journals Science, Nature, Nature Neuroscience, Neuron and the Journal of Neuroscience during 2008. Of these, they say, 42% contained at least one non-independent analysis, and another 14% may have done. That leaves 44% which were definitely "clean". Unfortunately, unlike Vul et al who did a similar review, they don't list the "good" and the "bad" papers.

They then go on to present the results of two simulated fMRI experiments in which seemingly exciting results emerge out of pure random noise, all because of the non-independence error. (One of these simulations concerns the use of pattern-classification algorithms to "read minds" from neural activity, a technique which I previously discussed). As they go on to point out, these are extreme cases - in real life situations, the error might only have a small impact. But the point, and it's an extremely important one, is that the error can creep in without being detected if you're not very careful. In both of their examples, the non-independence error is quite subtle and at first glance the methodology is fine. It's only on closer examination that the problem becomes apparent. The price of freedom from the error is eternal vigilance.

But it would be wrong to think that this is a problem with fMRI alone, or even neuroimaging alone. Any neuroscience experiment in which a large amount of data is collected and only some of it makes it into the final analysis is equally at risk. For example, many neuroscientists use electrodes to record the electrical activity in the brain. It's increasingly common to use not just one electrode but a whole array of them to record activity from more than one brain cell at once. This is a very powerful technique, but it raises the risk of the non-independence error, because there is a temptation to only analyze the data from those electrodes where there is the "right signal", as the authors point out:
In single-cell recording, for example, it is common to select neurons according to some criterion (for example, visual responsiveness or selectivity) before applying further analyses to the selected subset. If the selection is based on the same dataset as is used for selective analysis, biases will arise for any statistic not inherently independent of the selection criterion.
In fact, Kriegeskorte et al praise fMRI for being, in some ways, rather good at avoiding the problem:
To its great credit, neuroimaging has developed rigorous methods for statistical mapping from its beginning. Note that mapping the whole measurement volume avoids selection altogether; we can analyze and report results for all locations equally, while accounting for the multiple tests performed across locations.
With any luck, the publication of this paper and Vul's so close together will force the neuroscience community to seriously confront this error and related statistical weaknesses in modern neuroscience data analysis. Neuroscience can only emerge stronger from the debate.

Kriegeskorte, N., Simmons, W., Bellgowan, P., & Baker, C. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience. DOI: 10.1038/nn.2303

The Voodoo Strikes Back

Just when you thought it was safe to compute a correlation between a behavioural measure and a cluster mean BOLD change...

The fMRI voodoo correlations controversy isn't over. Ed Vul and colleagues have just responded to their critics in a new article (pdf). The critics appear to have scored at least one victory, however, since the original paper has now been renamed. So it's goodbye to "Voodoo Correlations in Social Neuroscience" - now it's "Puzzlingly high correlations in fMRI studies of emotion, personality and social cognition" by Vul et al. (2009). Not quite as catchy, but then, that's the point...

Just in case you need reminding of the story so far: A couple of months ago, MIT grad student Ed Vul and co-authors released a pre-publication manuscript, then titled Voodoo Correlations in Social Neuroscience. This paper reviewed the findings of a number of fMRI studies which reported linear correlations between regional brain activity and some kind of measure of personality. Vul et al. argued that many (but by no means all) of these correlations were in fact erroneous, with the reported correlations being much higher than the true ones. Vul et al. alleged that the problem arose due to a flaw in the statistical analysis used, the "non-independence error". For my non-technical explanation of the issue, see my previous post, or go read the original paper (it really doesn't require much knowledge of statistics).

Vul's paper attracted a lot of praise and also a lot of criticism, both in the blogosphere and in the academic literature. Many complained that it was sensationalistic and anti-fMRI. Others embraced it for the same reasons. My view was that while the paper's style was certainly journalistic, and while many of those who praised the paper did so for the wrong reasons, the core argument was both valid and important. While not representing a radical challenge to social neuroscience or fMRI in general, Vul et al. draw attention to a widespread and potentially serious technical issue with the analysis of fMRI data, one which all neuroscientists should be aware of.

That's still my opinion. Vul et al.'s response to their critics is a clearly worded and convincing defence. Interestingly, their defence is in many ways just a clarification of the argument. This is appropriate, because I think the argument is pretty much just common sense once it is correctly understood. As far as I can see the only valid defence against it is to say that a particular paper did not in fact commit the error - while not disputing that the error itself is a problem. Vul et al. say that to their knowledge no accused papers have turned out to be innocent - although I'm sure we haven't heard the last of that.

Vul et al. also now make explicit something which wasn't very clear in their original paper, namely that the original paper made accusations of two completely separate errors. One, the non-independence error, is common but probably less serious than the second, the "Forman error", which is pretty much fatal. Fortunately, so far, only two papers are known to have fallen prey to the Forman error - although there could be more. Go read the article for more details on what could be Vul's next bombshell...

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Reply to comments on "Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition". Perspectives on Psychological Science.

"Voodoo Correlations" in fMRI - Whose voodoo?

It's the paper that needs little introduction - Ed Vul et al.'s "Voodoo Correlations in Social Neuroscience". If you haven't already heard about it, read the Neurocritic's summary here or the summary at BPS research digest here. Ed Vul's personal page has some interesting further information here. (Probably the most extensive discussion so far, with a very comprehensive collection of links, is here.)

Few neuroscience papers have been discussed so widely, so quickly, as this one. (Nature, New Scientist, Newsweek, Scientific American have all covered it.) Sadly, both new and old media commentators seem to have been more willing to talk about the implications of the controversy than to explain exactly what is going on. This post is a modest attempt to, first and foremost, explain the issues, and then to evaluate some of the strengths and limitations of Vul et al's paper.

[Full disclosure: I'm an academic neuroscientist who uses fMRI, but I've never performed any of the kind of correlational analyses discussed below. I have no association with Vul et al., nor - to my knowledge - with any of the authors of any of the papers in the firing line. ]

1. Vul et al.'s central argument. Note that this is not their only argument.

The essence of the main argument is quite simple: if you take a set of numbers, then pick out some of the highest ones, and then take the average of the numbers you picked, the average will tend to be high. This should be no surprise, because you specifically picked out the high numbers. However, if for some reason you forgot or overlooked the fact that you had picked out the high numbers, you might think that your high average was an interesting discovery. This would be an error. We can call it the "non-independence error", as Vul et al. do.
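In code, the whole point fits in a few lines (made-up numbers, of course):

```python
# Pick the highest values from a noisy zero-mean set, and their average
# is high - not as a discovery, but by construction.
import numpy as np

rng = np.random.default_rng(4)
numbers = rng.normal(0, 1, 10_000)   # mean zero by construction

top = numbers[numbers > 2.0]         # keep only the "impressive" values
print(f"mean of all numbers:   {numbers.mean():.3f}")  # close to 0
print(f"mean of selected ones: {top.mean():.3f}")      # well above 2
```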

Vul et al. argue that roughly half of the published scientific papers in a certain field of neuroscience include results which fall prey to this error. The papers in question are those which attempt to correlate activity in certain parts of the brain (measured using fMRI) against behavioural or self-report measures of "social" traits - essentially, personality. Vul et al. call this "social neuroscience", but it's important to note that it's only a small part of that field.

Suppose, for example, that the magnitude of the neural activation in the amygdala caused by seeing a frightening picture was positively correlated with the personality trait of neuroticism - tending to be anxious and worried about things. The more of a worrier a person is, the bigger their amygdala response to the scary image. (I made this example up, but it's plausible.)

The correlation coefficient, r, is a measure of how strong the relationship is. A coefficient of 1.0 indicates a perfect linear correlation. A coefficient of 0.4 would mean that the link was a lot weaker, although still fairly strong. A coefficient of 0 indicates no correlation at all. This image from Wikipedia shows what linear correlations of different strengths "look like".

Vul's argument is that many of the correlation coefficients appearing in social neuroscience papers are higher than they ought to be, because they fall prey to the non-independence error discussed above. Many reported correlations were in the range of r=0.7-0.9, which they describe as being implausibly high.

They say that the problem arises when researchers search across the whole brain for any parts where the correlation between activity and some personality measure is statistically significant - that is to say, where it is high - and then work out the average correlation coefficient in only those parts. The reported correlation coefficient will tend to be a high number, because they specifically picked out the high numbers (since only high numbers are likely to be statistically significantly different from zero.)

Suppose that you divided the amygdala into 100 small parts (voxels) and separately worked out the linear correlation between activity and neuroticism for each voxel. Suppose that you then selected those voxels in which the correlation was greater than (say) 0.8, and worked out the average: (say) 0.86. This does not mean that activity across the amygdala as a whole is correlated with neuroticism with r=0.86. The "full" amygdala-neuroticism correlation must be less than this. (Clarification 5.2.09: Since there is random noise in any set of data, it is likely that some of the correlations which reached statistical significance were only that high by chance. This does not mean that there weren't any genuinely correlated voxels. However, it means that the average over the significant voxels is not a measure of the average over the genuinely correlated voxels. This is a case of regression to the mean.)
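The inflation is easy to demonstrate with a toy simulation (Python; all the numbers here are made up for illustration - this is not the authors' analysis). Every "voxel" below genuinely correlates with the trait at r = 0.5, yet averaging only the voxels that clear a strict threshold gives a much higher figure:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_voxels = 20, 100   # a small-n fMRI study, 100 "amygdala" voxels
true_r = 0.5                      # every voxel genuinely correlates at r = 0.5

trait = rng.normal(size=n_subjects)
noise = rng.normal(size=(n_voxels, n_subjects))
# Each voxel's activity = trait signal plus independent noise.
activity = true_r * trait + np.sqrt(1 - true_r**2) * noise

sample_r = np.array([np.corrcoef(trait, v)[0, 1] for v in activity])

threshold = 0.7   # stand-in for a strict whole-brain significance cutoff
selected = sample_r[sample_r > threshold]

print(f"mean r over all voxels:      {sample_r.mean():.2f}")   # near 0.5
print(f"mean r over selected voxels: {selected.mean():.2f}")   # inflated
```

The selected voxels' average exceeds the threshold by construction - you only kept the high numbers - even though no voxel's true correlation is above 0.5.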

Vul et al. say that out of 52 social neuroscience fMRI papers they considered, 28 (54%) fell prey to this problem. They determined this by writing to the authors of the papers and asking them to answer some multiple-choice questions about their statistical methodology. This chart shows the reported correlation coefficients in the papers which seemed to suffer from the problem (in red) vs. those which didn't (in green); unsurprisingly, the ones which did tended to give higher coefficients. (Each square is one paper.)
That's it. It's quite simple. But... there is a very important question remaining. We've said that non-independent analysis leads to "inflated" or "too high" correlations, but too high compared to what? Well, the "inflated" correlation value reported by a non-independent analysis is entirely accurate - in that it's not just made up - but it only refers to a small and probably unrepresentative collection of voxels. It only becomes wrong if you think that this correlation is representative of the whole amygdala (say).

So you might decide that the "true" correlation might be the mean correlation over all of the voxels in the amygdala. But that's only one option. There are others. It would be equally valid to take the average correlation over the whole amygdalo-hippocampal complex (a larger region). Or the whole temporal cortex. That would be silly, but not an error - so long as you make it clear what your correlation refers to, any correlation figure is valid. If you say "The voxel in the amygdala with the greatest correlation with neuroticism in this data-set had an r=0.99", that would be fine, because readers will realize that this r=0.99 figure was probably an outlier. However, if you say, or imply, that "The amygdala was correlated with neuroticism r=0.99" based on the same data, you're making an error.

My diagram (if you can call it that...) to the left illustrates this point. The ovals represent the brain. The colour of each point in the brain represents the degree of linear correlation between some particular fMRI signal in that spot, and some measure of personality.

Oval 1 represents a brain in which no area is really correlated with personality. So most of the brain is gray, meaning very low correlation. But a few spots are moderately correlated just by chance, so they show up as yellow.

Oval 2 represents a brain in which a large blob of the brain (the "amygdala" let's call it) is really correlated quite well i.e. yellow. However, some points within this blob are, just by chance, even more correlated, shown in red.

Now, if you took the average correlation over the whole of the "amygdala", it would be moderate (yellow) - i.e. picture 2a. However, suppose that instead, you picked out those parts of the brain where the correlation was so high that it could not have occurred by chance (statistically significant).

We've seen that yellow spots often occur by chance even without any real correlation, but red ones don't - it's just too unlikely. So you pick out the red spots. If you average those, the average is obviously going to be very high (red). i.e. picture 2b. But if you then noticed that all of the red spots were in the amygdala, and said that the correlation in the amygdala was extremely high, you'd be making (one form of) the non-independence error.

Some people have taken issue with Vul's argument, saying that it's perfectly valid to search for voxels significantly correlated with a behaviour, and then to report on the strength of that correlation. See for example this anonymous commentator:

many papers conducted a whole brain correlation of activation with some behavioral/personality measure. Then they simply reported the magnitude of the correlation or extracted the data for visualization in a scatterplot. That is clearly NOT a second inferential step, it is simply a descriptive step at that point to help visualize the correlation that was ALREADY determined to be significant.
The academic responses to Vul make the same point (but less snappily).

The truth is that while there is technically nothing wrong with doing this, it could easily be misleading in practice. Searching for voxels in the brain where activation is significantly correlated with something is perfectly valid, of course. But the magnitude of the correlation in these voxels will be high by definition. These voxels are not representative because they have been selected for high correlation. In particular, even if these voxels all happen to be located within, say, the amygdala, they are not representative of the average correlation in the amygdala.

A related question is whether this is a "one-step" or a "two-step" analysis. Some have objected that Vul implies it is a two-step analysis in which the second step is "wrong", whereas in fact it's just a one-step analysis. That's a purely semantic issue. There is only one statistical inference step (searching for significantly correlated voxels). But to then calculate and report the average correlation in those voxels is a second, descriptive step. The second step is not strictly wrong but it could be misleading, not because it introduces a new, flawed analysis, but because it would be a misinterpretation of the results of the first step.

2. Vul et al.'s Secondary Argument

The argument set out above is not the only argument in the Vul et al. paper. There's an entirely separate one, introduced on page 18 (Section F).

The central argument is limited in scope. If valid it means that some papers, those which used non-independent methods to compute correlations, reported inappropriately high correlation coefficients. But it does not even claim that the true correlation coefficients were zero, or that the correlated parts of the brain were in the wrong places. If one picks out those voxels in the brain which are significantly correlated with a certain measure, it may be wrong to then compute the average correlation, but the fact that the correlation is significantly greater than zero remains. Indeed, the whole argument rests upon the fact that they are!

But... this all assumes that the calculation of statistical significance was done correctly. Such calculations can get very complex when it comes to fMRI data. It can be difficult to correct for the multiple comparisons problem. Vul et al. point out that in some of the papers in question (they only cite one, but say that the same also applies to an unspecified number of others) the calculation of significance seems to have been done wrong. They trace the mistake to a table printed in a paper published in 1995. They accuse some people of having misunderstood this table, leading to completely wrong significance calculations.
The per-voxel false detection probabilities described by E. et al (and others) seem to come from Forman et al.’s Table 2C. Values in Forman et al’s table report the probability of false alarms that cluster within a single 2D slice (a single 128x128 voxel slice, smoothed with a FWHM of 0.6*voxel size). However, the statistics of clusters in 2D (a slice) are very different from those of a 3D volume: there are many more opportunity for spatially clustering false alarm voxels in the 3D case, as compared to the 2D case. Moreover, the smoothing parameter used in the papers in question was much larger than 0.6*voxel size assumed by Forman in Table 2C (in E. et al., this was >2*voxel size). The smoothing, too, increases the chances of false alarms appearing in larger spatial clusters.
If this is true, then it's a knock-down point. Any results based upon such a flawed significance calculation would be junk, plain and simple. You'd need to read the papers concerned in detail to judge whether it was, in fact, accurate. But this is a completely separate point to Vul et al.'s primary non-independence argument. The primary argument concerns a statistical phenomenon; this secondary argument accuses some people of simply failing to read a paper. The primary argument suggests that some reported correlation coefficients are too high, but only this second argument suggests that some correlation coefficients may in fact be zero. And Vul et al. do not say how many papers they think suffer from this serious flaw.

These two arguments seem to have gotten mixed up in the minds of many people. Responses to the Vul et al. paper have seized upon the secondary accusation that some correlations are completely spurious. The word "voodoo" in the title can't have helped. But this misses the point of Vul et al.'s central argument, which is entirely separate, and seems almost indisputable so far as it goes.

3. Some Points to Note
  • Just to reiterate, there are two arguments about brain-behaviour correlations in Vul et al. The main one - the one everyone's excited about - purports to show that 54% of the reported correlations in social neuroscience are weaker than claimed, but cannot be taken to mean that they are zero. The second one claims that some correlations are entirely spurious because they were based on a very serious error stemming from misreading a paper. But at present only one paper has been named as a victim of this error.
  • The non-independence error argument is easy to understand and isn't really about statistics at all. If you've read this far, you should understand it as well as I do. There are no "intricacies". (The secondary argument, about multiple-comparison testing in fMRI, is a lot trickier however.)
  • How much the non-independence error inflates correlation sizes is difficult to determine, and it will vary in every different case. Amongst many other things the degree of inflation will depend upon two factors: the strictness of the statistical threshold used to pick the voxels (a stricter threshold = higher correlations picked); and the number of voxels picked (if you pick 99% of the voxels in the amygdala, then that's nearly as good as averaging over the whole thing; if you pick the one best voxel, then you could inflate the correlation enormously.) Note, however, that many of the papers that avoided the error still reported pretty strong correlations.
  • It's easy to work out brain activity-behaviour correlations while avoiding the non-independence problem. Half of the papers Vul et al. considered in fact did this (the "green" papers). One simply needs to select the voxels in which to calculate the average correlation based on some criteria other than the correlation itself. One could, for example, use an anatomy textbook to select those voxels making up the amygdala. Or, one could select those voxels which are strongly activated by seeing a scary picture. Many of the "green" papers which did this still reported strong correlations (r=0.6 or above).
  • Vul et al.'s criticisms apply only to reports of linear correlations between regional fMRI activity and some behavioural or personality measure. Most fMRI studies do not try to do this. In fact, many do not include any behavioural or personality measures at all. At the moment, fMRI researchers are generally seeking to find areas of the brain which are activated during experience of a certain emotion, performance of a cognitive process, etc. Such papers escape entirely unscathed.
  • Conversely, although Vul et al. looked at papers from social neuroscience, any paper reporting on brain activity-behaviour linear correlations could suffer from the non-independence problem. The fact that the authors happened to have chosen to focus on social neuroscience is irrelevant.
  • Indeed, Vul & Kanwisher have also recently written an excellent book chapter discussing the non-independence problem in a more general sense. Read it and you'll understand the "voodoo" better.
  • Therefore, "social neuroscience" is not under attack (in this paper.) To anyone who's read & understood the paper, this will be quite obvious.
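A toy simulation makes the contrast between the two selection strategies vivid (Python; every number here is invented for illustration and nothing below is the authors' actual analysis). One "amygdala" of voxels genuinely correlates with the trait; the rest of the "brain" doesn't. Selecting voxels by an independent, anatomical criterion recovers the true correlation; selecting them by the correlation itself inflates it:

```python
import numpy as np

rng = np.random.default_rng(7)
n_subjects = 20
trait = rng.normal(size=n_subjects)

def simulate(n_voxels, true_r):
    # Voxel activity = trait signal + independent noise, population r = true_r.
    noise = rng.normal(size=(n_voxels, n_subjects))
    return true_r * trait + np.sqrt(1 - true_r**2) * noise

amygdala = simulate(100, 0.5)   # anatomically defined ROI, true r = 0.5
rest     = simulate(900, 0.0)   # everywhere else, no true correlation
brain = np.vstack([amygdala, rest])

r = np.array([np.corrcoef(trait, v)[0, 1] for v in brain])

# Non-independent: average over voxels picked for their high correlation.
circular = r[r > 0.7].mean()
# Independent: average over the anatomical ROI, chosen in advance.
anatomical = r[:100].mean()

print(f"circular estimate:   {circular:.2f}")    # inflated
print(f"anatomical estimate: {anatomical:.2f}")  # near the true 0.5
```

The anatomical average is an honest estimate because the selection criterion (a textbook's definition of the amygdala) knows nothing about the correlations being measured.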
4. Remarks: On the Art of Voodoo Criticism

Vul et al. is a sound warning about a technical problem that can arise with a certain class of fMRI analyses. The central point, although simple, is not obvious - no-one had noticed it before, after all - and we should be very grateful to have it pointed out. I can see no sound defense against the central argument: the correlations reported in the "red list" papers are probably misleadingly high, although we do not know by how much. (The only valid defense would be to say that your paper did not, in fact, use a non-independent analysis.)

Some have criticized Vul et al. for their combative or sensationalist tone. It's true that they could have written the paper very differently. They could have used a conservative academic style and called it "Activity-behaviour correlations in functional neuroimaging: a methodological note". But no-one would have read it. Calling their paper "Voodoo correlations" was a very smart move - although there is no real justification for this, it brilliantly served to attract attention. And attention is what papers like this deserve.

But this paper is not an attack on fMRI as a whole, or social neuroscience as a whole, or even the calculation of brain-behaviour correlations as a whole. Those who treat it as such are the real voodoo practitioners in the old-fashioned sense: they see Vul sticking pins into a small part of neuroscience, and believe that this will do harm to the whole of it. This means you, Sharon Begley of Newsweek: "The upcoming paper, which rips apart an entire field: the use of brain imaging in social neuroscience...". This means you, anyone who read about this paper and thought "I knew it". No, you didn't; you may have thought that there was something wrong with all of these social neuroscience fMRI papers, but unless you are Ed Vul, you didn't know what it was.

There's certainly much wrong with contemporary cognitive neuroscience and fMRI. Conceptual, mathematical, and technical problems plague the field, just a few of which have been covered previously on Neuroskeptic and on other blogs as well as in a few papers (although surprisingly few). In all honesty, a few inflated correlations ranks low on the list of the problems with the field. Vul's is a fine paper. But its scope is limited. As always, be skeptical of the skeptics.

Edward Vul, Christine Harris, Piotr Winkielman, Harold Pashler (2008). Voodoo Correlations in Social Neuroscience. Perspectives on Psychological Science.

 