
The Tufnel Effect


In This Is Spin̈al Tap, British heavy metal god Nigel Tufnel says, in reference to one of his band's less successful creations:

It's such a fine line between stupid and...uh, clever.
This is all too true when it comes to science. You can design a breathtakingly clever experiment, using state-of-the-art methods to address a really interesting and important question. And then at the end you realize that you forgot to type one word when writing the 1,000 lines of software code that runs this whole thing, and as a result, the whole thing's a bust.

It happens all too often. It has happened to me, let me think, three times in my scientific career; I know of several colleagues who have had similar problems, and I'm currently struggling to deal with the consequences of someone else's stupid mistake.

Here's my cautionary tale. I once ran an experiment involving giving people a drug or a placebo, and when I crunched the numbers I found - or thought I'd found - a really interesting effect, which was consistent with a lot of previous work giving this drug to animals. How cool is that?

So I set about writing it up and told my supervisor and all my colleagues. Awesome.

About two or three months later, for some reason I decided to reopen the data file, which was in Microsoft Excel, to look something up. I happened to notice something rather odd - one of the experimental subjects, who I remembered by name, was listed with a date-of-birth which seemed wrong: they weren't nearly that old.

Slightly confused - but not worried yet - I looked at all the other names and dates of birth and, oh dear, they were all wrong. But why?

Then it dawned on me and now I was worried: the dates were all correct but they were lined up with the wrong names. In an instant I saw the horrible possibility: mixed-up names would be harmless in themselves, but what if the group assignments (1 = drug, 0 = placebo) were lined up with the wrong results? That would render the whole analysis invalid... and oh dear. They were.

As the temperature of my blood plummeted I got up and lurched over to my filing cabinet, where the raw data was stored on paper. It was mercifully easy to correct the mix-up and put the data back together. I re-ran the analysis.

No drug effect.

I checked it over and over. Everything was completely watertight - now. I went home. I didn't eat and I didn't sleep much. The next morning I broke the news to my supervisor. Writing that email was one of the hardest things I've ever done.

What happened? As mentioned, I had been doing all the analysis in Excel. Excel is not a bad stats package and it's very easy to use, but the problem is that it's too easy: it just does whatever you tell it to do, even if this is stupid.

In my data as in most people's, each row was one sample (i.e. a person) and each column was a piece of info. What happened was that I'd tried to take all the data, which was in no particular order, and reorder the rows alphabetically by subject name to make it easier to read.

How could I screw that up? Well, by trying to select "all the data" but actually only selecting a few of the columns. Then I reordered those columns, but not the others, so the rows became mixed up. And the crucial column - drug = 1, placebo = 0 - was one of the ones I reordered.

The immediate lesson I learned from this was: don't use Excel, use SPSS, which simply does not allow you to reorder only some of the data. Actually, I still use Excel for making graphs and figures but every time I use it, I think back to that terrible day.
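
For anyone doing the same job in code today, the lesson translates directly: reorder whole rows, never a subset of columns. A minimal sketch in Python with pandas (the names, dates, and values are invented for illustration):

```python
import pandas as pd

# Invented subjects: one row per person, group coded 1 = drug, 0 = placebo.
df = pd.DataFrame({
    "name":  ["Smith", "Adams", "Jones"],
    "dob":   ["1980-05-01", "1975-02-14", "1990-09-30"],
    "group": [1, 0, 1],
    "score": [12.3, 9.8, 14.1],
})

# Sorting the whole DataFrame reorders entire rows at once -- the equivalent
# of selecting *every* column in Excel before hitting "sort".
df_sorted = df.sort_values("name").reset_index(drop=True)

# Each subject's dob, group and score still travel with their name.
print(df_sorted)
```

Because the sort operates on the whole table, it is simply impossible to scramble one column against the others the way a partial Excel selection can.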

The broader lesson though is that if you're doing something which involves 100 steps, it only takes 1 mistake to render the other 99 irrelevant. This is true in all fields but I think it's especially bad in science, because mistakes can so easily go unnoticed due to the complexity of the data, and the consequences are severe because of the long time-scale of scientific projects.


Here's what I've learned: Look at your data, every step of the way, and look at your methods, every time you use them. If you're doing a neuroimaging study, the first thing you do after you collect the brain scans is to open them up and just look at them. Do they look sensible?

Analyze your data as you go along. Every time some new results come in, put it into your data table and just look at it. Make a graph which just shows absolutely every number all on one massive, meaningless line from Age to Cigarettes Smoked Per Week to EEG Alpha Frequency At Time 58. For every subject. Get to know the data. That way if something weird happens to it, you'll know. Don't wait until the end of the study to do the analysis. And don't rely on just your own judgement - show your data to other experts.
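
That "one massive, meaningless line" plot is trivial to make in code. A minimal sketch in Python with pandas and matplotlib (the measures and values are invented):

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display required
import matplotlib.pyplot as plt
import pandas as pd

# Invented data table: one row per subject, one column per measure.
df = pd.DataFrame(
    {"Age": [23, 31, 45], "Cigs/Week": [0, 20, 5], "Alpha Hz": [9.8, 10.2, 8.9]},
    index=["subj01", "subj02", "subj03"],
)

# Transposing puts every variable along the x-axis, giving one long line
# per subject; a data-entry error shows up as a line that suddenly breaks
# away from all the others.
ax = df.T.plot(marker="o")
ax.set_ylabel("raw value")
plt.savefig("every_number.png")
```

The y-axis is meaningless, mixing ages with frequencies, but that's the point: the plot exists purely so that outliers and mix-ups jump out at you.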

Check and recheck your methods as you go along. If you're running, say, a psychological experiment involving showing people pictures and getting them to push buttons, put yourself in the hot seat and try it on yourself. Not just once, but over and over. Some of the most insidious problems with these kinds of studies will go unnoticed if you only look at the task once - such as the old "randomized"-stimuli-that-aren't-random issue.
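
One cheap, automated version of this check: before scanning or testing anyone, generate the exact stimulus sequence your script will use and inspect it. A sketch in Python (the condition names and trial counts are invented):

```python
import random
from collections import Counter

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def make_sequence(n_trials=100, conditions=("pic_A", "pic_B"), seed=42):
    # Equal numbers of each condition, shuffled with a fixed seed so the
    # exact sequence each subject will see can be inspected in advance.
    seq = list(conditions) * (n_trials // len(conditions))
    random.Random(seed).shuffle(seq)
    return seq

seq = make_sequence()
print(Counter(seq))      # are the conditions actually balanced?
print(longest_run(seq))  # any suspiciously long runs of one condition?
```

If the counts come out unbalanced, or one condition appears in an implausibly long streak, you've caught the "randomized"-stimuli-that-aren't-random problem before it costs you a dataset.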

Trust no-one. This sounds bad, but it's not. Don't rely on anyone else's work, in experimental design or data analysis, until you've checked it yourself. This doesn't mean you're assuming they're stupid, because everyone makes these mistakes. It just means you're assuming they're human, like you.

Finally, if the worst happens and you discover a stupid mistake in your own work: admit it. It feels like the end of the world when this happens, but it's not. However, if you don't admit it, or even worse, start fiddling other results to cover it up - that's misconduct, and if you get caught doing that, it is the end of the world, or your career, at any rate.

SSRIs and Suicide

Prozac and suicide: what's going on?

Many people think that SSRI antidepressants do indeed cause suicide, and in recent years this idea has gained a huge amount of attention. My opinion is that, well, it's all rather complicated...

At first glance, it seems as though it should be easy to discover the truth. SSRIs are some of the most studied drugs in the world. We have data from several hundred randomized placebo-controlled trials, totaling tens of thousands of patients. Let's just look and see whether people given SSRIs are more likely to die by suicide than people given placebo.

Unfortunately, that doesn't really work. Actual suicides are extremely rare in antidepressant trials. This is partly because most trials only last 4 to 6 weeks, but also because anyone showing evidence of suicidal tendencies is excluded from the studies at the outset. There just aren't enough suicides to be able to study.

What you can do is to look at attempted suicide, and at "suicidality", meaning suicidal thoughts and self-harming behaviours. Suicidality is more common than actual suicide, so it's easier to research. Here's the bad news: the evidence from a huge number of trials is that compared to placebo, antidepressants do raise the risk of suffering suicidality(1) and of suicide attempts(1) (from 1.1 per 1000 to 2.7 per 1000), when given to people with psychiatric disorders.
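
To put those numbers in context: a rise from 1.1 to 2.7 per 1000 is roughly a 2.5-fold relative risk, but an absolute increase of only about 1.6 attempts per 1000 patients treated. A quick back-of-the-envelope check in Python:

```python
# Suicide-attempt rates from the pooled trial figures quoted above.
placebo_rate = 1.1 / 1000
drug_rate = 2.7 / 1000

relative_risk = drug_rate / placebo_rate       # ~2.5-fold increase
absolute_increase = drug_rate - placebo_rate   # ~1.6 extra attempts per 1000
number_needed_to_harm = 1 / absolute_increase  # ~625 patients per extra attempt

print(f"RR = {relative_risk:.2f}")
print(f"NNH = {number_needed_to_harm:.0f}")
```

Relative risk sounds alarming; the number-needed-to-harm figure shows why the absolute scale of the problem is harder to judge, and why it only becomes worrying once millions of people are taking the drugs.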

There's no good evidence that SSRIs are any worse or any better than other antidepressants, or that any one SSRI stands out as particularly bad(1,2). The risk seems to be worst in younger people: compared to placebo, SSRIs raised suicidality in people below age 25, had no effect in most adults, and lowered it in the oldest age groups(1). This is why SSRIs (and all other antidepressants) now carry a "black box" in the USA, warning about the risk of suicide in young people.

*

This is very troubling. Hang on though. I mentioned that suicidality is an exclusion criterion from pretty much all antidepressant trials. This is for ethical as well as practical reasons: it's considered unethical to give a suicidal person an experimental drug, and it's really impractical to have patients dying during your trial.

Indeed the recorded rate of suicidality in these trials is incredibly tiny: only 0.5% of the psychiatric patients experienced any suicidal ideation or behaviour at all(1). The other 99.5% never so much as thought about it, apparently. If that were representative of the real world it would be great; unfortunately it isn't. What this all means is that antidepressants could not possibly reduce suicidality in these trials, because there's just nothing there to reduce. Even if, in the real world, they prevent loads of suicides, these trials wouldn't show it.

How do you investigate the effects of drugs "in the real world"? By observational studies - instead of recruiting people for a trial, you just look to see what happens to people who are prescribed a certain drug by their doctor. Observational studies have strengths and weaknesses. They're not placebo controlled, but they can be much larger than trials, and they can study the full spectrum of patients.

Observational studies have found very little evidence suggesting that antidepressants cause suicide. Most strikingly, since 1990 when SSRIs were introduced, antidepressant sales have increased enormously, and the suicide rate has fallen steadily; this is true of all Western countries.

More detailed analyses of antidepressant sales vs. suicide rates across time and location have generally found either no effect, or a small protective effect, of antidepressant sales(1,2,3, many others). In the past few years, concern over suicidality has led to a fall in antidepressant use in adolescents in many countries: but there is no evidence that this has reduced the adolescent suicide rate(1,2).

Another observational approach is to see whether people who have actually died by suicide were taking SSRIs at the time of death. Australian psychiatrists Dudley et al have just published a review of the evidence on this question, and they found that out of a total of 574 adolescent suicide victims from the USA, Britain, and Scandinavia, only 9 (1.5%) were taking an SSRI when they died. In other words, the vast majority of youth suicides occur in non-SSRI users. This sets a very low upper limit on the number of suicides that could be caused by SSRIs.


*

So what does all this mean? As I said, it's very controversial, but here's my take, with the standard caveat that I'm just some guy on the internet.

The evidence from randomized controlled trials is clear: SSRIs can cause suicidality, including suicide attempts, in some people, especially people below age 25. The chance of this happening is below 1% according to the trials, but this is still worrying given that lots of people take antidepressants. However, the use of antidepressants on a truly massive scale has not led to any rise in the suicide rate in any age group. This implies that overall, antidepressants prevent at least as many suicides as they cause.

My conclusion is that the clinical trials are not much use when it comes to knowing what will happen to any individual patient. The evidence is that antidepressants could worsen suicidality, or they could reduce it. This is hardly a satisfactory conclusion for people who want neat and tidy answers, but there aren't many of those in psychiatry. For patients, the implication is, boringly, that we should follow the instructions on the packet - be vigilant for suicidality, but don't stop taking them except on a doctor's orders.

Dudley, M., Goldney, R., & Hadzi-Pavlovic, D. (2010). Are adolescents dying by suicide taking SSRI antidepressants? A review of observational studies. Australasian Psychiatry, 18(3), 242-245. DOI: 10.3109/10398561003681319

New, Voodoo-Free fMRI Technique

MIT brain scanners Fedorenko et al present A new method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Also on the list of authors is Nancy Kanwisher, one of the feared fMRI voodoo correlations posse.

The paper describes a technique for mapping out the "language areas" of the brain in individual people, not for their own sake, but as a way of improving other fMRI studies of language. That's important because while everyone's brain is organized roughly the same way, there are always individual differences in the shape, size and location of the different regions.

This is a problem for fMRI researchers. Suppose you scan 10 people and show them pictures of apples and pictures of pears. And suppose that apples activate the brain's Fruit Cortex much more strongly than pears. But unfortunately, the Fruit Cortex is a small area, and its location varies between people. In fact, in your 10 subjects, no-one's Fruit Cortex overlaps with anyone else's, even though everyone has one and they all work exactly the same way.

If you did this experiment you'd fail to find the effect of apples vs. pears, even though it's a strong effect, because there will be no one place in the brain where apples reliably cause more activation. What you need is a way of finding the Fruit Cortex in each person beforehand: a functional localizer scan - say, showing people a big bowl of fruit - as a preliminary step.

Fedorenko et al scanned a bunch of people while they did a simple reading task, and compared that to a control condition: reading random lists of nonsense with no linguistic structure. As you can see, there's a lot of variation between people, but there's also clearly a basic pattern of activation: it looks a bit like a tilted "V" on the left side of the brain:

These are the language areas of each person. (Incidentally, this is why fMRI, despite its limitations, is an amazing technology. There is no better way of measuring this activation. EEG is cheaper but nowhere near as good at localizing activity; PET is close, but it's slow, expensive and involves injecting people with radioactivity.)

Fedorenko et al then overlapped all the individual images to produce a map of the brain showing how many people got activation in each part:

The most robust activations were on the left side of the brain, and they formed a nice "V" shape again. These are the areas which have long been known to be involved in language, so this is not surprising in itself.

Here's the clever bit: they then took the areas activated in a large % of people, and automatically divided them up into sub-regions; each of the "peaks" where an especially large proportion of subjects showed activation became a separate region.

This is on the assumption that these peaks represent parts of the brain with distinct functions - separate "language modules" as it were. But each module will be in a slightly different place in each person (see the first picture). So they overlapped the subdivisions with the individual activation blobs to get a set of individual functional zones they call Group-constrained Subject-Specific functional Regions of Interest, or GcSSfROIs to their friends.

Fedorenko et al claim various advantages to this technique, and present data showing that it produces nice results in independent subjects (i.e. not the ones they used to make the group map in the first place.)

In particular, they argue that it should allow future fMRI studies to have a better chance of finding the specific functions of each region. So far, experiments using fMRI to investigate language have largely failed to find activations specific to particular aspects of language like grammar, word meaning, etc. which is unexpected because patients suffering lesions to specific areas often do show very selective language problems.

Does this relate to the voodoo correlations issue? Indirectly, yes. The voodoo (non-independence error) problem arises when you do a large number of comparisons and then focus on the "best" results, because the best results are likely to look that good wholly, or partly, by chance.

Fedorenko et al's method allows you to avoid doing lots of comparisons in the first place. Instead of looking all over the whole brain for something interesting, you can first do a preliminary scan to map out where in each person's brain interesting stuff is likely to happen, and then focus on those bits in the real experiment.

There's still a multiple-comparisons problem: Fedorenko et al identified 16 candidate language areas per brain, and future studies could well provide more. But that's nothing compared to the 40,000 voxels in a typical whole-brain analysis. We'll have to wait and see if this technique proves useful in the real world, but it's an interesting idea...
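
The gain is easy to quantify. Using a simple Bonferroni correction purely for illustration (real fMRI analyses use more sophisticated corrections), the p-value threshold each individual test must clear shrinks with the number of comparisons:

```python
alpha = 0.05  # desired family-wise false-positive rate

# Bonferroni: per-test threshold = alpha / number of comparisons.
threshold_rois = alpha / 16        # 16 candidate language regions
threshold_voxels = alpha / 40_000  # a typical whole-brain voxel count

print(f"16 ROIs:    p < {threshold_rois:.6f}")
print(f"40k voxels: p < {threshold_voxels:.2e}")
```

A genuine effect only has to reach p < 0.003 in a 16-region analysis, versus roughly p < 1.3 × 10⁻⁶ voxel-wise: the ROI approach buys several orders of magnitude of statistical sensitivity.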

Fedorenko, E., Hsieh, P., Nieto Castanon, A., Whitfield-Gabrieli, S., & Kanwisher, N. (2010). A new method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Journal of Neurophysiology. DOI: 10.1152/jn.00032.2010

Can We Rely on fMRI?

Craig Bennett (of Prefrontal.org) and Michael Miller, of dead fish brain scan fame, have a new paper out: How reliable are the results from functional magnetic resonance imaging?


Tal over at the [citation needed] blog has an excellent in-depth discussion of the paper, and Mind Hacks has a good summary, but here's my take on what it all means in practical terms.

Suppose you scan someone's brain while they're looking at a picture of a cat. You find that certain parts of their brain are activated to a certain degree by looking at the cat, compared to when they're just lying there with no picture. You happily publish your results as showing The Neural Correlates of Cat Perception.

If you then scanned that person again while they were looking at the same cat, you'd presumably hope that exact same parts of the brain would light up to the same degree as they did the first time. After all, you claim to have found The Neural Correlates of Cat Perception, not just any old random junk.

If you did find a perfect overlap in the area and the degree of activation that would be an example of 100% test-retest reliability. In their paper, Bennett and Miller review the evidence on the test-retest reliability of fMRI studies. They found 63 of them. On average, they found that the reliability of fMRI falls quite far short of perfection: the areas activated (clusters) had a mean Dice overlap of 0.476, while the strength of activation was correlated with a mean ICC of 0.50.

But those numbers, taken out of context, do not mean very much. Indeed, what is a Dice overlap? You'll have to read the whole paper to find out, but even when you do, they still don't mean that much. I suspect this is why Bennett and Miller don't mention them in the Abstract of the paper, and in fact they don't spend more than a few lines discussing them at all.

A Dice overlap of 0.476 and an ICC of 0.50 are what you get if you average over all of the studies anyone has done looking at the test-retest reliability of any particular fMRI experiment. But different fMRI experiments have different reliabilities. Saying that the average reliability of fMRI is 0.5 is rather like saying that the mean velocity of a human being is 0.3 km per hour. That's probably about right, averaging over everyone in the world, including those who are asleep in bed and those who are flying on airplanes - but it's not very useful. Some people are moving faster than others, and some scans are more reliable than others.
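
For the curious, the Dice overlap between two activation maps A and B is 2|A∩B| / (|A| + |B|): 1.0 means the two scans activated exactly the same voxels, 0 means no overlap at all. A minimal sketch on made-up voxel coordinates:

```python
def dice_overlap(a, b):
    """Dice coefficient between two sets of activated voxels: 2|A&B|/(|A|+|B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty maps agree trivially
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical activated voxels from the same task scanned twice.
scan1 = {(10, 20, 5), (10, 21, 5), (11, 20, 5), (11, 21, 5)}
scan2 = {(10, 20, 5), (10, 21, 5), (12, 22, 5), (12, 23, 5)}

print(dice_overlap(scan1, scan2))  # 2*2 / (4+4) = 0.5
```

Here two 4-voxel clusters sharing 2 voxels give a Dice of 0.5, which is about what the average fMRI test-retest study achieves.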


Most of this paper is not concerned with "how reliable fMRI is", but rather, with how to make any given scanning experiment more reliable. And this is an important thing to write about, because even the most optimistic cognitive neuroscientist would agree that many fMRI results are not especially reliable, and as Bennett and Miller say, reliability matters for lots of reasons:

Scientific truth. While it is a simple statement that can be taken straight out of an undergraduate research methods course, an important point must be made about reliability in research studies: it is the foundation on which scientific knowledge is based. Without reliable, reproducible results no study can effectively contribute to scientific knowledge.... if a researcher obtains a different set of results today than they did yesterday, what has really been discovered?
Clinical and Diagnostic Applications. The longitudinal assessment of changes in regional brain activity is becoming increasingly important for the diagnosis and treatment of clinical disorders...
Evidentiary Applications. The results from functional imaging are increasingly being submitted as evidence into the United States legal system...
Scientific Collaboration. A final pragmatic dimension of fMRI reliability is the ability to share data between researchers...
So what determines the reliability of any given fMRI study? Lots of things. Some of them are inherent to the nature of the brain, and are not really things we can change: activation in response to basic perceptual and motor tasks is probably always going to be more reliable than activation related to "higher" functions like emotions.

But there are lots of things we can change. Although it's rarely obvious from the final results, researchers make dozens of choices when designing and analyzing an fMRI experiment, many of which can at least potentially have a big impact on the reliability of their findings. Bennett and Miller cover lots of them:
voxel size... repetition time (TR), echo time (TE), bandwidth, slice gap, and k-space trajectory... spatial realignment of the EPI data can have a dramatic effect on lowering movement-related variance ... Recent algorithms can also help remove remaining signal variability due to magnetic susceptibility induced by movement... simply increasing the number of fMRI runs improved the reliability of their results from ICC = 0.26 to ICC = 0.58. That is quite a large jump for an additional ten or fifteen minutes of scanning...
The details get extremely technical, but then, when you do an fMRI scan you're using a superconducting magnet to image human neural activity by measuring the quantum spin properties of protons. It doesn't get much more technical.

Perhaps the central problem with modern neuroimaging research is that it's all too easy for researchers to write off the important experimental design issues as "merely" technicalities, and just put some people in a scanner using the default scan sequence and see what happens. This is something few fMRI users are entirely innocent of, and I'm certainly not, but it is a serious problem. As Bennett and Miller point out, the devil is in the technical details.
The generation of highly reliable results requires that sources of error be minimized across a wide array of factors. An issue within any single factor can significantly reduce reliability. Problems with the scanner, a poorly designed task, or an improper analysis method could each be extremely detrimental. Conversely, elimination of all such issues is necessary for high reliability. A well maintained scanner, well designed tasks, and effective analysis techniques are all prerequisites for reliable results.
Bennett, C.M., & Miller, M.B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences.

That Sinking Feeling?

Sinking and Swimming is a paper just out from the Young Foundation, a British think-tank. It "explores how psychological and material needs are being met and unmet in Britain." I'm not sure how useful their broad concept of "unmet needs" is, but there's some rather interesting data in this report.

On page 238, and prominently in the executive summary, we find the following terrifying graph, which comes with warnings like "anxiety and depression looks set to double during the course of a single generation..."

The % of the population self-reporting suffering from depression or anxiety seems to have been consistently rising since 1990, from less than 6% to almost 10% today. And the line continues ever upwards. Eeek!

Is Britain really becoming more depressed and anxious? No, and that's what makes this graph terrifying. According to the large government Adult Psychiatric Morbidity Survey, the prevalence of self-reported depression and anxiety symptoms rose slightly from 1993 to 2000 (15.5% to 17.5%) and then stayed level up to 2007 (17.6%). Not very scary. Even the Young Foundation note (on page 80) that when you look at "well-being"

analysis of the English health survey that uses a variation of GHQ [General Health Questionnaire] suggested that the proportion of the working age population with poor psychological well-being decreased from 17% in 1997 to 13% in 2006.
On that measure, we're getting happier. And the rate of new diagnoses of clinical depression fell over the past decade.

So what about that ominous line? Well, that graph was based on "self-reported anxiety or depression", but in a specific sense. People were not reporting feeling scared or unhappy (see above for the data on that), but rather reporting having anxiety or depression as medical disorders. Curiously, the % of people reporting every other sort of health problem (except vision problems) increased from 1991 to 2007 as well:


What seems to be happening is that we British are becoming more willing to label our problems as medical illnesses, although in fact our mental health has not changed much over the past two decades, and may even have improved slightly. This is what's terrifying, because medicalizing emotional issues is a bad idea.

Mental illness does exist, and medicine can help treat it, but medicine can't resolve non-medical problems even if they're labelled as illnesses. Antidepressants, for example, are (imperfectly) effective for severe clinical depression but probably not for "mild depression"; much of what is labelled "mild depression" is probably not, in any meaningful sense, an illness.

Why does this matter? Drugs have side effects, and psychotherapy is expensive. The cost-benefit profile of any treatment is obviously negative when there are no benefits because the treatment is being used inappropriately. My biggest concern, though, is that if someone is unhappy because of tensions in their marriage or because they're in the wrong job, they don't need treatment, they need to do something about it. Labelling a problem as an illness and treating it medically may, in itself, make that problem harder to overcome.

[BPSDB]

Statistically

"Statistically, airplane travel is safer than driving..." "Statistically, you're more likely to be struck by lightning than to..." "Statistically, the benefits outweigh the risks..."

What does statistically mean in sentences like this? Strictly speaking, nothing at all. If airplane travel is safer than driving, then that's just a fact. (It is true on an hour-by-hour basis). There's no statistically about it. A fact can't be somehow statistically true, but not really true. Indeed, if anything, it's the opposite: if there are statistics proving something, it's more likely to be true than if there aren't any.

But we often treat the word statistically as a qualifier, something that makes a statement less than really true. This is because, psychologically, statistical truth is often different to, and less real than, other kinds of truth. As everyone knows, Joseph Stalin said that one death is a tragedy, but a million deaths is a statistic. Actually, Stalin didn't say that, but it's true anyway. And if someone has a fear of flying, then all the statistics in the world probably won't change that. Emotions are innumerate.

*

Another reason why statistics feel less than real is that, by their very nature, they sometimes seem to conflict with everyday life. Statistics show that regular smoking, for example, greatly raises your risk of suffering from lung cancer, emphysema, heart disease and other serious illnesses. But it doesn't guarantee that you will get any of them, the risk is not 100%, so there will always be people who smoke a pack a day for fifty years and suffer no ill effects.

In fact, this is exactly what the statistics predict, but you still hear people referring to their grandfather who smoked like a chimney and lived to 95, as if this somehow cast doubt on the statistics. Statistically, global temperatures are rising, which predicts that some places will be unusually cold (although more will be unusually warm), but people still think that the fact that it's a bit chilly this year casts doubt on the fact of global warming.
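
The arithmetic here is worth spelling out. Suppose, purely for illustration, that lifelong heavy smoking carries a 15% lifetime risk of lung cancer - a huge elevation over non-smokers, yet one that still leaves most heavy smokers cancer-free:

```python
# Purely illustrative risk figure -- not a real epidemiological estimate.
p_cancer = 0.15           # hypothetical lifetime lung-cancer risk, heavy smokers
p_no_cancer = 1 - p_cancer

# Even a hugely elevated risk leaves most smokers unscathed:
print(f"{p_no_cancer:.0%} of such smokers never get lung cancer")
print(f"expected lucky 'grandfathers' per million: {1_000_000 * p_no_cancer:,.0f}")
```

With numbers like these, hundreds of thousands of chain-smoking 95-year-olds per million smokers are exactly what the statistics predict, not evidence against them.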

*

Some people admit that they "don't believe in statistics". And even if we don't go that far, we're often a little skeptical. There are lies, damn lies, and statistics, we say. Someone wrote a book called How To Lie With Statistics. Few of us have read it, but we've all heard of it.

Sometimes, this is no more than an excuse to ignore evidence we don't like. It's not about all statistics, just the inconvenient ones. But there's also, I think, a genuine distrust of statistics per se. Partially, this reflects distrust towards the government and "officialdom", because most statistics nowadays come from official sources. But it's also because psychologically, statistical truth is just less real than other kinds of truth, as mentioned above.

*

I hope it's clear that I do believe in statistics, and so should you, all of them, all the time, unless there is a good reason to doubt a particular one. I've previously written about my doubts concerning mental health statistics, because there are specific reasons to think that these are flawed.

But in general, statistics are the best way we have of knowing important stuff. It is indeed possible to lie with statistics, but it's much easier to lie without them: there are more people in France than in China; most people live to be at least 110 years old; Africa is richer than Europe. None of those are true - and statistics are how we know that.

[BPSDB]

 