Lange's mind drifted to a knotty research problem he and scientists worldwide had been tangling with for years, one that severely hampered their ability to identify genes associated with complex diseases such as asthma, obesity, and cancer. Conditions like these arise from the interplay of DNA and a host of environmental factors, from air pollutants to diet.
Linking genes and disease conditions wouldn't seem to have much in common with games of chance. But from a statistical standpoint, the two are--to use a Mendelian allusion--like peas in a pod.
How so? Consider this scenario: At a poker tournament, there are 500,000 players, and 500,000 decks of cards. Each of the players draws five cards from his or her own deck. Quite a few of the players get a royal flush, just by chance. Now, suppose that 20 of those decks are stacked. Again each of the 500,000 players draws five cards from his or her own deck. Again, quite a few get a royal flush. But now a mystery has arisen: Which of the royal flushes are due to chance (the ordinary decks), and which ones are due to stacked decks?
Scientists looking for troublesome genes play a similar "game": Using statistical tests, they compute the relationship of a person's DNA--all 3 billion subunits of it--to 500,000 known variations in the human genome that signal troublesome genes to see if any of the subunits match up with the variations. Typically, each of the statistical tests will have a chance probability of 5 percent. That means that, for the 500,000 variations, about 25,000 chance matches can be expected. The mystery arises again: Which of those matches are due just to chance, and which are "real"--that is, due to a stacked deck (a troublesome gene)?
How to sift through those 25,000 "hits" to separate the real from the chance ones has long been the Achilles' heel of genetic studies. It is known in scientific circles as the "multiple-comparison problem."
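The arithmetic behind those 25,000 hits is easy to check. The short Python sketch below uses only the figures quoted above (500,000 tests, a 5 percent chance rate per test). The variable names are illustrative, and treating the tests as independent of one another is a simplifying assumption, not something the researchers claim.

```python
# Figures from the text: 500,000 SNP tests, each with a 5 percent
# chance of producing a match when no real association exists.
n_tests = 500_000
alpha = 0.05

# Expected number of matches produced by chance alone.
expected_chance_hits = n_tests * alpha

# Probability that at least one chance match turns up somewhere in the
# scan, assuming (for simplicity) that the tests are independent.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected chance matches: {expected_chance_hits:,.0f}")
print(f"Chance of at least one:  {p_at_least_one:.6f}")
```

With this many tests, a scan that turns up no chance matches at all is essentially impossible, which is why the raw list of hits cannot be taken at face value.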
Christoph Lange was mulling over this conundrum on that spring day in 2003 as the faint wind damped down to a whimper, leaving him stranded mid-river. And then it hit--his "Eureka!" moment: a way to cull the variations so only the most promising ones remained. It would be like thinning out a haystack, so the needles would glint in the sun.
Lange nearly leaped out of the boat. "I was curious to see if the idea would work in practice, but couldn't get to shore for an hour because there was no wind," he says. "When I finally made it, I raced to my apartment, opened my laptop, and tried it. It worked fantastically. It was scary: It made sense that such an idea would work, but it seemed too good to be true."
In short order, Lange and HSPH colleagues Professor of Biostatistics Nan Laird and then-postdoctoral fellow Kristal Van Steen developed a statistical methodology that fundamentally changes the way scientists approach the multiple-comparison problem. And in a paper published in the April 14 issue of Science, they've identified a single gene associated with obesity--and proved that their strategy is reliable.
THE GENE HUNT
Alas, the relationship between genes and complex diseases can be as murky as the Charles River. It's not that a troublesome gene causes a complex disease; rather, in concert with other troublesome genes, it contributes to the possibility that, with the right environmental triggers, a person harboring it could develop a condition. But trying to pinpoint those genes along the full length of the human genome would be like trying to locate, say, Chicago, Reno, or Minneapolis on a cross-country drive from Boston to San Francisco with neither a map nor road signs, just miles of meandering highways. So scientists identified a series of landmarks--sites of common tiny variations in the genome called SNPs ("snips"), or single nucleotide polymorphisms--that mark the troublesome genes. All told, scientists have identified 8 million SNPs. Today, they can track a whopping 500,000 of them dotting the genome's landscape--beacons of light shining on possible trouble spots.
The most fruitful genetic studies rely on family members as subjects rather than the population at large. That's because families share much of their DNA, so the number of genetic variables is reduced from the get-go. In these studies, researchers compare all 3 billion subunits of close relatives (say, parents and children) to each other, to see which of the 500,000 SNPs they share. Then, they cross-match those shared SNPs to the appearance of a trait--obesity, for instance--in the children. The degree of relatedness indicates how likely it is that the trait was inherited.
Here's where chance muddies the water. A scientist comparing the 3 billion subunits of a family member's DNA to the 500,000 SNPs will invariably get a match, simply by chance. Chance matches turn gene searches into hit-and-miss propositions. While various studies have turned up putative genes for obesity, for example, their findings can't be replicated.
The insight that smacked Christoph Lange in the head out on the Charles takes chance out of the picture.
What, Lange wondered, if researchers performed some statistical gymnastics before testing suspected genes against a trait, in order to whittle those 500,000 SNPs down to the 10 most likely to highlight the troublesome genes? By so doing, they could reduce the multiple-comparison problem to very manageable proportions.
Here's how Lange proposed to do that: First, researchers would pretend that the genetic makeup of the children was missing, and use genetic information only from the parents to surmise--using classic Mendel's laws--what it might be. They could then calculate the likelihood of each of the 500,000 SNPs' being passed on. "We wanted to estimate the heritability of each SNP," says Laird. They would use the degree of heritability to calculate how much influence a gene associated with each SNP would have on a trait. Finally, they would rank the SNPs in order of influence.
Selecting the 10 SNPs with the biggest influence, they could use just those to actually test against the trait in the children. The software program PBAT--which was developed at HSPH--was used to crunch those numbers (see "Built by Association," at right).
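The screen-then-test idea can be illustrated with a toy simulation. The Python sketch below is not PBAT and does not use the researchers' actual statistics: the scores and p-values are made-up stand-ins, the function name `screen_then_test` is hypothetical, and dividing the 5 percent threshold by the number of retained SNPs (a Bonferroni-style correction) is an assumption of the sketch rather than something described here. It shows only the general shape of the procedure: rank the SNPs, keep the top 10, and test just those.

```python
import random


def screen_then_test(scores, p_values, k=10, alpha=0.05):
    """Rank SNPs by screening score, keep the top k, and test only those.

    scores   -- hypothetical per-SNP influence estimates from the screening step
    p_values -- hypothetical per-SNP p-values from the follow-up test
    Returns the indices of retained SNPs whose p-value is at most alpha / k
    (a Bonferroni-style threshold, assumed for this sketch).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    threshold = alpha / k
    return sorted(i for i in top_k if p_values[i] <= threshold)


# Toy data: 1,000 SNPs, of which indices 0, 1, and 2 are "stacked decks".
rng = random.Random(0)
n_snps = 1_000
real = {0, 1, 2}

# Real SNPs get clearly high screening scores; the rest get random noise.
scores = [8.0 + i if i in real else rng.gauss(0.0, 1.0) for i in range(n_snps)]
# Real SNPs get small p-values; the rest are uniform, as chance would have it.
p_values = [1e-4 if i in real else rng.random() for i in range(n_snps)]

hits = screen_then_test(scores, p_values)
naive_hits = [i for i, p in enumerate(p_values) if p <= 0.05]

print("Hits after screening:", hits)
print("Hits when testing every SNP at 5 percent:", len(naive_hits))
```

Testing every SNP at the 5 percent level buries the three real ones among dozens of chance matches, while the screened list can hold at most 10 candidates. The sketch treats the screening score and the follow-up test as unrelated for SNPs with no real effect, which is what allows the second step to be read on its own.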
"Nan and Christoph are doing fundamental work in genetic epidemiology," says James Ware, HSPH Dean for Academic Affairs and Frederick Mosteller Professor of Biostatistics. "By perfecting a method for solving the multiple-comparison problem, they have overcome a major statistical obstacle in the interpretation of whole genome scans."
PROOF OF CONCEPT
Searching for genes common to obese people, the researchers followed two generations of families enrolled in the Framingham Heart Study using data collected by the study's originators. That information included the subjects' genetic makeup and traits--in particular, their body mass index (BMI), weight in kilograms divided by the square of height in meters, a measure commonly used to assess obesity. Individuals with a BMI greater than or equal to 25 kg/m² are considered overweight, and those with a BMI greater than or equal to 30 kg/m² are considered obese.
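For readers who want to see the BMI cutoffs in action, here is a minimal Python sketch. The function names `bmi` and `classify` and the label "not overweight" are illustrative choices, not part of the study; the thresholds of 25 and 30 kg/m² are the ones given above.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2


def classify(bmi_value):
    """Return the most severe category that applies to a BMI value."""
    if bmi_value >= 30:
        return "obese"
    if bmi_value >= 25:
        return "overweight"
    return "not overweight"


# Example: a person weighing 70 kg who is 1.75 m tall.
value = bmi(70, 1.75)
print(f"BMI {value:.1f}: {classify(value)}")
```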
Even before the Science paper, Laird and Lange had applied their new method to two studies at the Channing Laboratory at Boston's Brigham and Women's Hospital, both of which sought--and found--genes associated with chronic obstructive pulmonary disease. "Without Nan and Christoph's statistical methodology, we may not have identified the associations," says the Channing's Edwin K. Silverman, senior author on both papers. But those studies considered a few hundred SNPs apiece. The obesity study looked at huge numbers--116,204, to be precise.
"That was the proof of concept," says Lange. The pièce de résistance was being able to replicate the results of the obesity study in four subsequent trials, each with a very different population--more than 10,000 individuals in samples of Western European ancestry, African Americans, and children. "There are no other common obesity gene-variant associations that are reproducible," says Helen Lyon, of the Hirschhorn Laboratory at Children's Hospital Boston, who led two of the replication studies. Adds James Ware, "Their methodology sets a new standard for documenting an association between genetic makeup and a health outcome."
The multiple-comparison problem has always been present in familial genetic studies, but its ability to muck up the works multiplied exponentially as the capabilities of DNA-SNP matching technology grew. As recently as 2003, testing 10,000 SNPs--which cover a small proportion of the genome--against human DNA was considered heroic.
Today, researchers can investigate 500,000 SNPs at once, sweeping the entire genome like a Geiger counter sweeping soil. At their fingertips are computer-readable chips from Affymetrix, Inc., with 500,000 tiny test wells on their surface, about 20 for each SNP. The researchers simply apply sample DNA to the chemically treated wells and track the signals that go off, to see whether a SNP in the DNA coincides with a SNP in a particular well.
Laird and Lange's biostatistical breakthrough may have come just in time. Chips that can analyze a million SNPs at once are not far off.
The two HSPH biostatisticians are deeply ensconced in unraveling the double helix further. They are working to find genes implicated in bipolar disorder, disease, and other complex diseases that develop in the place where genes and the environment meet.
Thea Singer is the senior writer for the Review within HSPH's Office for Resource Development.
Copyright, 2006, President and Fellows of Harvard College