11
The past 30 years has seen considerable progress in understanding the molecular genetics of diseases that are inherited according to Mendelian rules — that is, those in which mutations in a single gene have a large effect on disease. A key challenge now is to unravel the molecular and genetic basis of complex human disease, which results from the combined effects of many genes — that is, diseases that show polygenic inheritance — a challenge that might be aided by advances in genomic technology. This will bring new insights to disease aetiology, which, in turn, will help in the development of new methods for disease prevention and treatment. As a result of these advances, it might also become possible to target interventions to individuals at greatest risk of disease. It is likely that the inheritance of most common cancers is polygenic. All the common cancers show some degree of familial clustering, with disease being 2–4-fold more common in first-degree relatives of individuals with cancer 1 . The higher rate of most cancers in the monozygotic twins of cases than in dizy- gotic twins or siblings indicates that most of the famil- ial clustering is the result of genetic variation rather than lifestyle or environmental factors 2 . Some of this clustering occurs as part of specific familial cancer syndromes where disease results from single genes conferring a high risk, and many genes associated with specific familial cancer syndromes have been identi- fied. However, the susceptibility alleles in these genes are rare in the population and they account for a small minority of the inherited component of cancer. For example highly PENETRANT variants in the breast cancer susceptibility genes BRCA1 and BRCA2 account for less than 20% of the genetic risk of breast cancer, with other rarer high-penetrance genes such as TP53, ATM and PTEN accounting for less than 5% 3 . Despite extensive efforts, attempts to identify more BRCA-like highly pen- etrant cancer-susceptibility genes have failed. Together with data on patterns of familial occurrence of cancer that exclude cases due to known high-risk genes 4 , this strongly indicates that most genetic susceptibility to common cancers results from the combined effects of many genetic variants, each of which have a modest effect individually.Various possibilities are consistent with a polygenic model for cancer susceptibility, ranging from a handful of alleles of moderate risk to a large number of alleles each conferring a slight increase in risk across a wide range of allele frequencies. Family-based LINKAGE STUDIES have been the founda- tion for the many successes in mapping of genes associ- ated with Mendelian disorders, including many rare family cancer syndromes such as multiple endocrine neoplasia type 2 and adenomatosis polyposis coli (APC). Linkage studies have also been successful in mapping genes for several common cancers, such as BRCA1 and BRCA2 in breast and ovarian cancer , MSH2 and MLH1 in colorectal cancer, and CDKN2A in melanoma. However, the success of this approach has mainly been limited to genes with rare, highly penetrant ASSOCIATION STUDIES FOR FINDING CANCER-SUSCEPTIBILITY GENETIC VARIANTS Paul D. P. Pharoah*,Alison M. Dunning*, Bruce A. J. Ponder*and Douglas F. Easton Abstract | Cancer is the result of complex interactions between inherited and environmental factors. Known genes account for a small proportion of the heritability of cancer, and it is likely that many genes with modest effects are yet to be found. Genetic-association studies have been widely used in the search for such genes, but success has been limited so far. Increased knowledge of the function of genes and the architecture of human genetic variation combined with new genotyping technologies herald a new era of gene mapping by association. PENETRANCE The frequency with which individuals who carry a given mutation show the manifestations associated with that mutation. If the penetrance of a disease allele is 100%, then all individuals carrying that allele will express the associated phenotype. *Cancer Research UK Human Cancer Genetics Group, Department of Oncology and Genetic Epidemiology Group, Department of Public Health and Primary Care, Strangeways Research Laboratory, Worts Causeway, Cambridge, CB1 8RN, UK. Correspondence to B.A.J.P. e-mail: [email protected] doi:10.1038/nrc1476 850 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer REVIEWS

Association studies for finding cancer-susceptibility genetic variants

Embed Size (px)

Citation preview

Page 1: Association studies for finding cancer-susceptibility genetic variants

The past 30 years has seen considerable progress inunderstanding the molecular genetics of diseases that areinherited according to Mendelian rules — that is, those inwhich mutations in a single gene have a large effect ondisease. A key challenge now is to unravel the molecularand genetic basis of complex human disease, whichresults from the combined effects of many genes — thatis, diseases that show polygenic inheritance — a challengethat might be aided by advances in genomic technology.This will bring new insights to disease aetiology, which, inturn, will help in the development of new methods for disease prevention and treatment. As a result ofthese advances, it might also become possible to targetinterventions to individuals at greatest risk of disease.

It is likely that the inheritance of most commoncancers is polygenic. All the common cancers showsome degree of familial clustering, with disease being2–4-fold more common in first-degree relatives ofindividuals with cancer1. The higher rate of most cancers in the monozygotic twins of cases than in dizy-gotic twins or siblings indicates that most of the famil-ial clustering is the result of genetic variation ratherthan lifestyle or environmental factors2. Some of thisclustering occurs as part of specific familial cancer syndromes where disease results from single genesconferring a high risk, and many genes associated withspecific familial cancer syndromes have been identi-fied. However, the susceptibility alleles in these genesare rare in the population and they account for a small

minority of the inherited component of cancer. Forexample highly PENETRANT variants in the breast cancersusceptibility genes BRCA1 and BRCA2 account for lessthan 20% of the genetic risk of breast cancer, with otherrarer high-penetrance genes such as TP53, ATM andPTEN accounting for less than 5%3. Despite extensiveefforts, attempts to identify more BRCA-like highly pen-etrant cancer-susceptibility genes have failed. Togetherwith data on patterns of familial occurrence of cancerthat exclude cases due to known high-risk genes4, thisstrongly indicates that most genetic susceptibility tocommon cancers results from the combined effects ofmany genetic variants, each of which have a modesteffect individually. Various possibilities are consistentwith a polygenic model for cancer susceptibility, rangingfrom a handful of alleles of moderate risk to a largenumber of alleles each conferring a slight increase inrisk across a wide range of allele frequencies.

Family-based LINKAGE STUDIES have been the founda-tion for the many successes in mapping of genes associ-ated with Mendelian disorders, including many rarefamily cancer syndromes such as multiple endocrineneoplasia type 2 and adenomatosis polyposis coli(APC). Linkage studies have also been successful inmapping genes for several common cancers, such asBRCA1 and BRCA2 in breast and ovarian cancer, MSH2and MLH1 in colorectal cancer, and CDKN2A inmelanoma. However, the success of this approach hasmainly been limited to genes with rare, highly penetrant

ASSOCIATION STUDIES FORFINDING CANCER-SUSCEPTIBILITYGENETIC VARIANTSPaul D. P. Pharoah*,Alison M. Dunning*, Bruce A. J. Ponder*and Douglas F. Easton‡

Abstract | Cancer is the result of complex interactions between inherited and environmentalfactors. Known genes account for a small proportion of the heritability of cancer, and it is likelythat many genes with modest effects are yet to be found. Genetic-association studies have beenwidely used in the search for such genes, but success has been limited so far. Increasedknowledge of the function of genes and the architecture of human genetic variation combinedwith new genotyping technologies herald a new era of gene mapping by association.

PENETRANCE

The frequency with whichindividuals who carry a givenmutation show themanifestations associated withthat mutation. If the penetranceof a disease allele is 100%, thenall individuals carrying thatallele will express the associatedphenotype.

*Cancer Research UKHuman Cancer GeneticsGroup, Department ofOncology and ‡Genetic EpidemiologyGroup, Department ofPublic Health and PrimaryCare, Strangeways ResearchLaboratory, WortsCauseway, Cambridge,CB1 8RN, UK.Correspondence to B.A.J.P.e-mail:[email protected]:10.1038/nrc1476

850 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer

R E V I E W S

Page 2: Association studies for finding cancer-susceptibility genetic variants

NATURE REVIEWS | CANCER VOLUME 4 | NOVEMBER 2004 | 851

R E V I E W S

LINKAGE STUDIES

A statistical method in which thegenotypes and phenotypes ofparents and offspring in familiesare studied to determinewhether two or more loci areassorting independently orexhibiting linkage duringmeiosis.

PHENOCOPY

A non-hereditary alteration inphenotype, induced byenvironmental factors such asnutritional status, that mimicsthe phenotype produced by aspecific gene.

POLYMORPHISM

A polymorphism is the existenceof two or more variants (alleles,sequence variants, chromosomalstructural variants) at significantfrequencies in the population. Itis conventional for a geneticvariant with a frequency of >1%to be called a polymorphism.

genetic predisposition to cancer is due to alleles withmoderate effects or how many such alleles exist (FIG. 1).The association-study design is being increasinglyused in the study of inherited susceptibility to dis-ease: a Medline search using the terms ‘case–control’and ‘POLYMORPHISM’ yielded 2,793 hits for the period2000–2004, compared with 991 for the period1995–1999.

In this review, we describe the methods used inearly association studies in cancer genetics and thelessons learned from them. We then discuss the com-monly used, current approaches to study design beforedescribing some of the developments that mightimprove the chances of success in the future. The dis-cussion is presented from the point of view of cancer,but important lessons and ideas have also been derivedfrom the experience of carrying out association studiesin other diseases.

The candidate-gene approachAssociation studies can either be population-based orfamily-based, although family-based association stud-ies are rarely used in cancer because of the difficulty inrecruiting elderly relatives of cases. The population-based association study is a simple case–control studyin which genotype frequencies at a particular locus arecompared in cases and unrelated controls. Most asso-ciation studies are based on candidate genes thatencode proteins thought to be involved in carcinogen-esis, such as those involved in apoptosis, cell-cyclecontrol, carcinogen metabolism or DNA repair, orthose known to be somatically altered in cancer. Therationale for the candidate-gene approach is that bymaximizing the biological plausibility, the chances ofsuccess are increased. However, the approach is lim-ited by its reliance on existing knowledge to identifycandidate genes based on function.

alleles. Linkage studies lack power to detect the allelesconferring moderate risks that are likely to be the normin complex disease. There are several reasons for this,including genetic heterogeneity, PHENOCOPIES and incom-plete penetrance. Genetic heterogeneity limits the use-fulness of combining data from multiple families, as different genes might be responsible for disease cluster-ing in different families. Within families, phenocopiesand incomplete penetrance are a problem, as the carrierstatus of a putative disease allele cannot be definitivelyinferred from disease status.

The main alternative to linkage studies for disease-gene mapping is the association study, in which thefrequency of a genetic variant in diseased individuals(cases) and individuals without the disease (controls)are compared5,6. Allelic association is present when thedistribution of genotypes differs in cases and controls.Such an association provides evidence that the locusunder study, or a neighbouring locus, is related to dis-ease susceptibility. Association studies for diseasegenes are generally based on the ‘common variant:common disease’ hypothesis7. Genetic variants arisinga long time ago might have become common in thepresent population at frequencies ranging from a fewpercent upwards. Some of these variants might pre-dispose to common diseases, and combinations ofthese variants are proposed to underlie differences indisease susceptibility. Association mapping will be apowerful tool for mapping such loci with moderateeffects. However, it is not clear how much of the

Summary

• The polygenic model for cancer susceptibility indicates that much of the inherited riskof cancer is due to multiple risk alleles, each with a low to moderate risk. The numberof such alleles for any specific cancer is unknown, but might be in the hundreds orthousands.

• Although linkage studies have been highly successful in mapping the genes thatunderlie monogenic disorders, these studies are of limited use for investigatingpredisposition to polygenic disease, such as cancer. Genetic-association studies — orcase–control studies — provide an efficient design for identifying common geneticvariants that confer modest disease risks.

• Few convincing cancer-susceptibility alleles have been identified so far using thegenetic-association study design. The limited success of these studies can be attributedmainly to the use of small study sizes — which provide insufficient statistical powerand give a high rate of false positives — and limitations in the selection of candidategenes.

• The rapid acquisition of data on the occurrence of common single-nucleotidepolymorphisms (SNPs) has made it possible to test for the association of a candidategene or region with disease using a tagging-SNP approach.

• Several approaches can be used to increase the efficiency of candidate-gene associationstudies, such as improving the selection of candidate genes that are likely to beassociated with cancer predisposition and enriching for genetic susceptibility bystudying families with a history of cancer.

• A combination of cheaper genotyping technologies with efficient study design willmake empirical, whole-genome studies a feasible prospect in the near future.

• Elucidating how multiple susceptibility alleles interact with each other and withlifestyle and environmental factors will be a key future challenge for the molecular andgenetic epidemiology of cancer predisposition.

Num

ber

of a

llele

s re

quire

d

RR = 1.3RR = 1.5

Allele frequency

0.00 0.05 0.10 0.15 0.20 0.25 0.300

40

80

120

160

200

Figure 1 | The number of alleles required to explain theexcess familial risk of a typical common cancer accordingto alleles with different frequencies and conferringdifferent risks. The graph illustrates the situation for a co-dominant model — that is, a disease in which risk increasesaccording to the number of susceptibility alleles carried — andassumes either a 1.3- or 1.5-fold increase in relative risk (RR) tothe siblings or offspring of a case (for example, breast cancer).Therefore, the excess risk could be explained by a modestnumber of common variants or a large number of rare variants.

Page 3: Association studies for finding cancer-susceptibility genetic variants

852 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer

R E V I E W S

HAPLOTYPE

The physical arrangement ofmultiple alleles along achromosome or segment of achromosome.

TANDEM-REPEAT

POLYMORPHISM

A tandem repeat is two or morecopies of the same DNAsequence arranged in a directhead to tail succession along achromosome. The number ofcopies of the repeat might varyin the population.

RESTRICTION FRAGMENT

LENGTH POLYMORPHISM

A polymorphic difference inDNA sequence betweenindividuals that can berecognized by restrictionendonucleases.

SINGLE-NUCLEOTIDE

POLYMORPHISM

Any polymorphic variation at asingle nucleotide (base) in thegenome.

potential sources of bias and the small sample sizes (14 studies included fewer than 99 cases).

Most associations reported in the literature havenot been confirmed by subsequent studies14,15. Themost likely explanation is that most initial reports arefalse positives, and the most common reason for thisis simply chance (type I error), exacerbated by publi-cation bias. The probability of a type I error is quanti-fied in terms of the level of statistical significance.Unfortunately, the levels of significance appropriate inother contexts (P = 0.05 or P = 0.01) can be highlymisleading in association studies. Because the numberof possible genetic polymorphisms is very large andthe prior probability that any polymorphism will beassociated with disease will be low, most polymor-phisms achieving a modest level of statistical signifi-cance will be false positives, as explained in BOX 1. Thefalse-positive rate can be reduced by using more strin-gent levels of statistical significance and by improvingthe selection of candidate genes16 (see below). In addi-tion, non-replication might occur because of a lack ofadequate statistical power, resulting in false negatives(type II error) in the replication study (BOX 2).Alternatively, failure to confirm associations might bethe result of heterogeneity in risk between popula-tions. This might occur if there were population dif-ferences in linkage disequilibrium (see later fordetailed definition), or population differences in allelefrequencies of interacting genes or interacting lifestyleand environmental factors.

An empirical approach to candidate genesThe main lesson to be derived from these observa-tions is that large sample sizes are needed to detectand confirm, at appropriate levels of statistical signifi-cance, genetic variants that confer modest risks17,18.The chance of success can be further improved bycareful selection of both candidate gene and candidatepolymorphism16.

However, the selection of candidate SNPs has amajor drawback when no association is found, as alack of association does not necessarily rule out thepresence of another important variant in the samegene. The identification of large numbers of SNPsacross the human genome has made it possible tomove on from the study of specific candidate poly-morphisms to the study of candidate genes in whichan empirical approach is used to address the question‘are there any allelic variants of this candidate geneassociated with cancer?’ For any given gene of inter-est, there might be tens or even hundreds of differentvariable loci. Fortunately, it is not necessary to geno-type all possible variants to detect an association,because the alleles of SNPs that are physically close toeach other tend to be correlated with each other. Thisphenomenon is called linkage disequilibrium (LD; BOX 3). The ability of one SNP to report onanother depends on the strength of LD betweenthem, as described in BOX 3.

There has been much discussion of the theoreticalbasis for such an empirical approach in recent years19–23,

The first association studies used methods that coulddistinguish allelic variants at the protein level. Suchstudies led to the established associations between gastric cancer and the ABO blood group8, and betweencertain human leukocyte antigen HAPLOTYPES andnasopharyngeal cancer9. The advent of DNA-basedassays enabled investigators to study genetic variationdirectly. Gel-based methods were developed to detectsimple TANDEM-REPEAT POLYMORPHISMS and RESTRICTION

FRAGMENT LENGTH POLYMORPHISMS. These methods were eas-ier to perform and cheaper than protein-based meth-ods, but were still relatively expensive (US $1-10 pergenotype), difficult to implement in a high-throughputformat and prone to observer bias.

Before the completion of the human genomesequence, the range of possible association studies waslimited by the small number of genes with known poly-morphisms. In recent years, the identification of largenumbers of SINGLE-NUCLEOTIDE POLYMORPHISMS (SNPs)across the genome has markedly increased the numberof associations that can be tested. This, in combinationwith the rapid development of cheap gentoyping meth-ods that can be adapted to be high throughput, has ledto an explosion in the number of association studiesbeing carried out.

Lessons from early association studiesIt is beyond the scope of this article to provide a com-prehensive review of all cancer genetic-association stud-ies published so far. It is clear, however, that of the manyassociations that have been published, few, if any, havebecome established beyond reasonable doubt. This is, ofcourse, not the same as saying that all published associa-tions are incorrect, merely that adequately poweredstudies have not been carried out to confirm them. Insome cases, for example between the genes encodingglutathione S-transferase M1 and N-acetyltransferase 2and smoking-related cancers, quite impressive associa-tions have been reported in several studies10,11. However,no study is large and the possibility of substantial publi-cation bias means that these associations remain opento some doubt.

Two published systematic reviews illustrate some ofthe limitations of many of the studies published sofar12,13. In 1999 Dunning et al. carried out a meta-analy-sis, using published data from 46 studies, that examinedthe effect of common alleles of 18 different genes onbreast cancer risk12. No definitive conclusions could bedrawn; most reported studies were small (median num-ber of cases was 391), few studies reported significantrisks, and, where significant results were reported, thesewere not replicated by subsequent studies. Anothermeta-analysis summarized association studies of gastriccancer published up to October 2001 (REF. 13). Thirtyone articles based on 25 case–control studies carried outin Caucasian, Asian and African populations were iden-tified, most of which assessed the effect of genesinvolved in detoxifying pathways and inflammatoryresponses. Again, no definitive conclusions were drawn,and the authors concluded that the most importantlimitations were the lack of appropriate controls for

Page 4: Association studies for finding cancer-susceptibility genetic variants

NATURE REVIEWS | CANCER VOLUME 4 | NOVEMBER 2004 | 853

R E V I E W S

RELATIVE RISK

The relative risk of diseaseassociated with a particular riskfactor (also known as anexposure), such as a particulargenotype, is the ratio of theincidence of disease inindividuals with that risk factorto the incidence of disease inindividuals without the riskfactor.

Identifying a set of polymorphisms to assayGiven that it is not necessary to genotype all possiblevariants, the most efficient set of markers that ade-quately report on all the other variants needs to beelucidated. One approach would be to select the sub-set of SNPs that individually provide a minimumacceptable level of LD for all other SNPs (BOX 3).However, even this set of ‘tagging’ SNPs might be lessthan optimal because a combination of SNPs consid-ered together might report on unmeasured SNPsmore efficiently than any single SNP. It is also possiblethat it is not a single polymorphism within a gene that

but until very recently detailed knowledge about poly-morphic variation has been lacking for most genes.Consequently, there have been few examples of itsapplication in the published literature (a notable exam-ple of which is discussed below24). However, the pastfew years have seen rapid advances in our knowledge ofpolymorphic variation, and the availability of thisinformation in public databases has made it possiblefor the empirical gene-based approach to be morewidely used. Many researchers are now using methodsbroadly similar to those described below in the study ofa range of different cancers.

Box 1 | False positives in association studies

False positives that occur due to chance (type I error)Consider a disease, such as breast cancer, in which a putative dominant allele would account for approximately 1% of theexcess familial RELATIVE RISK.Assuming an additive effect of such genes, the maximum number of such disease alleles is 100.Assuming that there are 105 candidate polymorphic loci across the genome, the probability that a random candidatevariant will be associated with disease (known as prior probability) is no more than 1 in 1,000. If all these loci are tested forassociation with 90% statistical power to detect a true association at the 0.01 level of significance, the probability that astatistically significant association is a true positive is just 8% (90/1,089). The table below illustrates this, showing the likelyresult of 100,000 independent association studies with 90% power to detect associations with a prior probability of 0.001 ata significance of 0.01. For instance, of 100,000 association studies, 90 would be predicted to be true susceptibility alleles.

Note that if the significance level is made more stringent, the positive predictive value improves: 47% for a significanceof 10–3, 90% for a significance of 10–4 and 99% for a significance of 10–5. Of course, in reality, neither the expected number ofloci, their allele frequencies and risk, nor the number of variant loci in the genome are known. Nevertheless, adoption ofstringent significance levels, perhaps 10–4 for candidate-gene studies and 10–7 for whole-genome studies would avoid mostof the problems of non-replication41.Wacholder and colleagues recently introduced the idea of a false-positive reportprobability (FPRP) to evaluate a positive association47. The FPRP is the probability of no true association between a geneticvariant and disease given a statistically significant finding. It depends not only on the observed P value, but also on anassumed prior probability that the association between the genetic variant and the disease is real, and on the statisticalpower of the test. This might be a useful concept for interpreting positive associations, although the requirement for anarbitrary choice of a prior probability means that it is unlikely to be useful in routine reporting of results48.

Other sources of false positivesA false positive might also be caused by bias (confounding). Observer bias might arise if genotyping is not carried out blindto sample case–control status. If controls are not chosen from the same population as the cases, selection bias might occur.A more subtle cause of this bias is population admixture (otherwise known as stratification), in which allele frequenciesdiffer between subpopulations. False-positive associations might arise if cases and controls are drawn differentially fromthese subgroups.

This bias can be avoided using family-based approaches such as the transmission distortion (TDT) test, in whichthe case and both parents are genotyped and the untransmitted parental chromosomes act as controls. However,this approach is rarely used for cancer because recruitment of parents is impossible in large numbers, and use ofother family members severely reduces power. More recently, methods that have focused on recognizing whenpopulation substructure might affect a case–control study and correcting for it when it is present have beendeveloped49,50. Despite the availability of these methods, few published case–control studies have been formallyevaluated for the presence of population substructure. Overall, it seems likely that (in comparison with type I error)population stratification is not a serious problem for most association studies published so far, and, indeed, theexistence of significant population stratification that has resulted in a false genetic association has never beenempirically demonstrated51. On the other hand, it has been indicated that population stratification is more likely tobe an issue with the large sample sizes that are now being used52,53. It is likely that it can largely be avoided bythoughtful choice of controls (matching where necessary on place of residence and main ethnic group) andreplication of positive associations in different populations.

Result of True susceptibility Not true susceptibility Totalassociation study allele (number) allele (number)

Positive 90 999 1,089

Negative 10 98,901 98,911

Total 100 99,900 100,000

Page 5: Association studies for finding cancer-susceptibility genetic variants

854 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer

R E V I E W S

‘locus scoring’ method26,30. Indeed, Chapman et al. haveshown that locus scoring is generally more powerfulthan scoring methods that include haplotype informa-tion. The moral here is that, provided that all variantsin a given gene have been identified, available methodsare generally adequate for identifying tagging SNPs andthat tagging SNPs can then be analysed using a simplecase–control approach, with little to be gained frommore complex haplotype-based methods.

The ideal candidate-gene association studyDrawing together these various strands, a practicalapproach to answer the question ‘are there any allelicvariants of this candidate gene associated with can-cer?’ can be envisaged. The stages involved in carryingout a candidate-gene association study are shown inFIG. 2. The first step is to describe SNP and haplotypediversity across each gene by re-sequencing a sampleof the relevant population across the genomic regionof interest. This will identify the common polymor-phisms in that population and the extent of LD acrossthat region. The sample size that is needed for re-sequencing will depend on the minimum frequencyof variants that you wish to identify. For example, if110 individuals are genotyped, there is a 99% chancethat any polymorphism with a population allele fre-quency of 0.03 will be detected at least twice. Variantsthat are detected just once are frequently not poly-morphic, either being sequencing errors or rare vari-ants that have been detected by chance. These data canthen be used to identify the minimum set of polymor-phisms that need to be assayed to adequately reporton all the other polymorphisms. Re-sequencing ofthis type is underway for a range of genes, includingmany candidate cancer-susceptibility genes, in severalpopulations, although the number of individuals is

has a functional effect, but it is a specific haplotypethat is important. For example, there might be twocoding SNPs in a given gene that only alter proteinfunction significantly when they occur together.Under these circumstances, a set of tagging SNPs thatprovides sufficient correlation between the haplotypesestimated using the tagging SNPs and each of the truehaplotypes is needed.

Various algorithms have been developed that min-imize the number of tagging SNPs that need to beassayed. These include those based on simple pair-wise LD-based SNP selection (LDSelect)25 and thosebased on HAPLOTYPE RESOLUTION (HaploScore26, Haplo-BlockFinder27 and tagSNPS28)29. Carlson et al.recently used data from resequencing 100 genes in 23European Americans and 24 African Americans toidentify tagging SNPs using an LD-selection-basedalgorithm25. In the European-American population,689 SNPs were required to tag 2,375 common SNPs,and 1,366 SNPs were required to tag the 3,178 com-mon SNPs found in the African-American popula-tion. The performance of the LDSelect algorithm wasalso compared with that of the HaploBlockFinderand tagSNPS programs. For a given number of tag-ging SNPs selected, the ability to predict all otherSNPs identified was similar for all three programs.

Although there is emerging literature on methodsfor choosing an optimal set of SNPs to detect associa-tion between a genetic region and disease, less attentionhas been given to the problem of how such studiesshould be analysed when completed. A commonapproach is to use the multilocus genotype data to esti-mate and compare haplotype frequencies in cases andcontrols. However, some authors have suggested thatthe most efficient approach is to limit analysis to themain effects for individual tagging variants — the

HAPLOTYPE RESOLUTION

The estimation of haploypefrequencies in a population iscomplicated by the fact thathaplotypes for diploid data arenot usually directly observable.Haplotypes can be resolved(inferred) by using parentalgenotype data or estimated byusing statistical estimation.

Box 2 | False negatives in association studies

It is important that case–control association studiesare sufficiently large to detect associations of realisticsize. The figure summarizes the sample sizes requiredto give 90% power to show significant associations forvarious co-dominant susceptibility alleles at asignificance of 10–4. For example, a study ofapproximately 1,000 cases and 1,000 controls will beneeded to detect a co-dominant susceptibility allelewith a frequency of 20% conferring a relative risk(RR) of 1.5. This is true provided that either the riskallele was measured directly or a marker in perfectlinkage disequilibrium (r2 = 1; see BOX 3) with the riskallele was measured. Such an allele would explain asibling relative risk of approximately 1.03, or about3% of the total familial risk of most cancers where thesibling relative risk is generally about 2. This seems areasonable aim. Recessive alleles require much largersample sizes for the same RR. Fortunately, most of thecommon cancers show some evidence of dominance(that is, similar risks to offspring and siblings ofcases), so one would anticipate that few of the important susceptibility genes are likely to be recessive54. For agenome-wide search (with a significance level of 10–7) the study size would need to be about 60% larger.

Allele frequency

Sam

ple

size

requ

ired

(log

scal

e)

RR = 1.2

RR = 1.5

RR = 2.0

0 0.2 0.4

100,000

10,000

1,000

100

Page 6: Association studies for finding cancer-susceptibility genetic variants

NATURE REVIEWS | CANCER VOLUME 4 | NOVEMBER 2004 | 855

R E V I E W S

Even when an association study is completed, theremight be residual uncertainty about the risks that areassociated with the susceptibility allele because of adifference in frequency between it and the marker.However, if it is assumed that this allele is in LD withthe markers typed, it is possible to place limits on thepossible effect of that allele, depending on its fre-quency (BOX 4). Finally, the identification of an associa-tion might not, by itself, determine the true functionalvariant; this has to be inferred or determined sepa-rately. Often, there will be several associated SNPs instrong LD that cannot be separated statistically, andfunctional assays might be required.

Improving candidate-gene approachesImproving candidate-gene selection. The history of thesearch for cancer-susceptibility genes indicates thatchoosing candidates based on known function is notoften successful, and alternative methods might beadvantageous. For example, candidate genes identifiedfrom LINKAGE PEAKS generated by family-based studiesmight still provide a useful approach if sufficient multi-ple-case families can be identified. This approach hasidentified important associations for other diseases, suchas cytotoxic-T-lymphocyte-associated protein 4 (CTLA4)in type 1 diabetes31 and NOD2 in Crohn’s disease32.

Another potential method for identifying candi-date genes is the use of animal models, in which locithat are associated with susceptibility or resistance can

smaller than the ideal number; see links to the FredHutchinson Center Seattle SNPs Program and theNational Institute of Environmental Health SciencesEnvironmental Genome Project SNPs Program in theonline links box for further information. Once the setof tagging SNPs has been identified, they can beassayed in a case–control study with an appropriatesample size.

There are few examples in the literature of thisideal approach, but one such study of CYP19 as a can-didate breast cancer susceptibility gene has been pub-lished24. CYP19 encodes the aromatase enzyme that isinvolved in the synthesis of oestrogen. Haiman andcolleagues identified 74 densely spaced SNPs spanning189 kb of the CYP19 locus and characterized LD andhaplotype patterns among 69–70 individuals fromseveral ethnic populations. Four regions of strong LDwere detected and found to be quite closely conservedacross populations. Within each region, there was alimited diversity of common haplotypes (5–10 with afrequency ≥5%) and most haplotypes were found tobe shared across populations. Twenty-five haplotype-tagging SNPs were selected to predict the commonhaplotypes with high probability and were genotypedin a moderately large breast cancer case–control study(1,355 cases; 2,580 controls). There was some evi-dence that a long-range haplotype carries a predispos-ing breast cancer susceptibility allele. Larger studieswill be needed to confirm this finding.

LINKAGE PEAKS

In a whole-genome linkageanalysis, the strength of linkageat any given marker is given bythe log of odds (LOD) score. Ahigh LOD score at one or severaladjacent markers can be called alinkage peak.

Box 3 | Linkage disequilibrium and tagging single-nucleotide polymorphisms

Linkage disequilibriumThe term linkage disequilibrium (LD) refers to the fact that particular alleles at nearby sites can co-occur on the samehaplotype more often than is expected by chance. Consider a polymorphic genetic locus with two alleles (locus A).When a new mutation occurs during meiosis on the same chromosome as locus A, the new allele (mutation) will beassociated with one or other of the two possible alleles at locus A — the marker allele — in the first-generationoffspring. The chromosome with the mutation will then be transmitted to the offspring of the individual. In the absenceof recombination, the marker allele and the mutation will be transmitted together. Each generation provides anopportunity for recombination to occur, and this recombination rearranges the association between the mutation andthe marker allele. Over an infinite number of generations, and therefore infinite recombinations, the marker allele andthe mutation would become randomly associated in the population (equilibrium). However, over a finite number ofgenerations the association between marker and mutation will persist.

Tagging single-nucleotide polymorphismsIt is the existence of LD between single-nucleotide polymorphisms (SNPs) that enables the genetic diversity acrossa region to be captured by a limited number of tagging SNPs. The ability of one SNP to report on another dependson the strength of LD between them. This can be measured in several ways, but the most important pairwisemeasures of LD are r2 (sometimes denoted ∆2) and |D′|. Both measures range from 0 (no disequilibrium) to 1(‘complete’ disequilibrium), but their interpretation is different. |D′| is defined in such a way that it is equal to 1 ifjust two or three of the four possible haplotypes are present, and it is <1 if all four possible haplotypes are present.So, |D′| = 1 will be observed for loci for which no ‘recombinant’ haplotypes are present in the population. This isoften described as ‘complete’ LD between loci. However, complete LD might be present between twopolymorphisms with substantially different allele frequencies. Under these circumstances, neither SNP will be anefficient marker for the other, as the ability of one polymorphism to report on another depends on both the LDbetween them and their relative allele frequencies. r2 takes into account both recombination and allele frequencies.It represents the statistical correlation between two sites, and takes the value of 1 if only two haplotypes are present.This might be referred to as ‘perfect’ LD.

Although |D′| is a better measure for describing the patterns of LD, r2 is of more direct relevance to the power ofassociation studies. Suppose that SNP1 is involved in disease susceptibility, but we genotype cases and controls at anearby site, SNP2. Then, to achieve the same power to detect association at SNP2 as we would have at SNP1, we need toincrease our sample size by a factor of 1/r2.

Page 7: Association studies for finding cancer-susceptibility genetic variants

856 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer

R E V I E W S

Cancers occurring in specific sites are heterogenous,and it is possible that different subtypes of disease mighthave different genetic determinants. Therefore it mightbe possible to improve association-study signals byrestricting analysis to tumours of a particular subtype.The increasing sophistication of immunophenotypingusing expression-array analyses revolutionizes this pos-sibility. Precedents such as the ‘basal’ phenotype inbreast cancers associated with BRCA1 mutations34 givesus hope that such subtyping might be effective. Thechallenge here will be to identify those subtypes thatshow strong heritability.

Making use of available SNP data. The main drawbackof the ‘ideal’ approach is that re-sequencing genes ofinterest in a reasonable number of samples in all rele-vant populations is laborious and expensive. However, itmight also be possible to adequately tag all gene variantswithout formal re-sequencing. For many genes, manypolymorphisms might have been identified and vali-dated already. This makes it possible to genotype all theknown SNPs in a small sample set and to identify all thecommon haplotypes and haplotype-tagging SNPs basedon what is already known. This is the approach beingtaken by the International HapMap project (see onlinelinks box). The difficulty with this approach is that theprocess by which most SNPs have been identified is ill-defined, so it is unclear what SNPs have been missed. Atpresent, it remains unclear how much information islost using known SNPs without re-sequencing.

Limiting analyses to coding regions. Most of the knownmutations that underlie Mendelian diseases have beenfound to occur in the coding sequences of genes35, andso an alternative approach is to limit analysis to the cod-ing regions of candidate genes. The number of commoncoding non-synonymous SNPs is quite limited (approx-imately one per gene), making this a feasible proposi-tion even on a genome-wide scale. The weakness of thisapproach is that it ignores the existence of variants thatcan affect gene expression. Regulatory regions might beparticularly relevant to diseases that are due to inappro-priate levels of expression of enzymes involved in home-ostasis or response to environmental variation, and asthe effects of regulatory variation might be subtle, theymight be more important in common complex diseasessuch as cancer than in Mendelian diseases. The effects ofnon-coding variants on gene expression are incom-pletely understood, and an indirect approach throughtagging SNPs is the only method of determining theirassociation with disease. The relevance of regulatoryvariation to cancer susceptibility in humans is unclear,but it is possible that polymorphisms in non-codingregions might have an important role.

Enriching for genetic susceptibility. The number of indi-viduals to be genotyped can be reduced by selectingcases enriched for genetic susceptibility36. A recent studyidentified a rare truncating variant of CHEK2(1100delc)that abrogates the kinase activity and showed that it isassociated with breast cancer by comparing the allele

be mapped empirically, and the genes from the corre-sponding genomic region in man are then assessed ascandidates. The advantage of this approach is thatanimal models provide a tractable means to mapgenes of moderate effect using linkage and CONGENIC

STRAINS, and the availability of high-quality SYNTENIC

maps allows the accurate determination of the corre-sponding human loci. However, it is not clear howclosely animal models for cancer correspond tohuman disease. Furthermore, there is no guaranteethat the susceptibility genes most relevant in man willbe polymorphic in the model system or vice versa.

Another possibility is to search (in man or experi-mental animals, depending on feasibility) for allelesassociated with specific ‘assayable’ phenotypes that arestrongly heritable and also associated with the diseasephenotype of interest. For example, women with exten-sive dense breast tissue visible on a mammogram have arisk of breast cancer that is 1.8–6.0 times that of womenof the same age with little or no density, and breast den-sity has been shown to be heritable33. Other potentialintermediate phenotypes include radiation response,inflammation and serum growth-factor levels. For agiven sample size, the power to detect an association isgreater for a quantitative trait, but the main difficulty ofthis approach is to identify intermediate phenotypesthat are ‘assayable’ on an epidemiological scale.

CONGENIC STRAIN

A congenic strain is derived bymating mice carrying a locus ofinterest in each succeedinggeneration to mice of an inbredstrain. A fully congenic strainand the inbred partner areexpected to be identical at all lociexcept for the transferred locusand a linked segment ofchromosome.

SYNTENY

The physical presence of two ormore genetic loci on the samechromosome, whether or notthey are close enough together todemonstrate linkage.

Identify gene or genomicregion of interest

Decide what range of risksand allele frequencies (p) are to be detected

Identify all SNPs across thegene or region of interest with allele frequency >p

Select the minimum set of‘tagging’ SNPs that will be acceptable markers forall others

Estimate sample sizerequirement

Collect study subjects

Genotype tagging SNPs and test for association

Confirm results inindependent data set(s)

Identify causal variante.g. functional studies

Figure 2 | The stages in the design of an association studyfor cancer-susceptibility genes. The schematic shows thestages in testing a genomic region — whether simply acandidate gene or the whole genome — for a polymorphicvariant associated with cancer. The type of allele to be detected(frequency and risk conferred) is largely a pragmatic choicebased on what is feasible in terms of sample size and cost.Methods for selecting tagging single-nucleotide polymorphisms(SNPs) and for analysis of the data are described in the main text.

Page 8: Association studies for finding cancer-susceptibility genetic variants

NATURE REVIEWS | CANCER VOLUME 4 | NOVEMBER 2004 | 857

R E V I E W S

pooled, this provides the potential for a marked reduc-tion is genotyping costs. DNA-pooling methods and themethods for analysis of data from studies using DNApooling are reviewed in detail by Sham and colleagues39.However, current experience indicates that, even withreplicate PCR experiments, the error in the measure-ment of allele frequencies is typically a few percent40. Atthis level, there is unacceptable loss of power at the effectsizes (relative risk = 1.2–1.5) that we consider impor-tant. A further perceived disadvantage is the difficulty ofinferring haplotype information from the results,although statistical methods to overcome this are nowavailable. As a result of these difficulties, DNA poolinghas rarely been used in practice, despite the fact thatnumerous laboratory methods have been developed.

A genome-wide approach?In principle, it would be possible to avoid the limitationsof the candidate-gene approach by using an empiricalwhole- or partial-genome approach in which a set ofSNPs that adequately tag all the other SNPs of a given fre-quency are identified and assayed41,42. This would involvefirst identifying all the common variants in the genomeand then selecting an appropriate set of tagging SNPs forgenotyping in a case–control set — a formidable task. It islikely that 200,000–500,000 SNPs will be needed to ade-quately tag all SNPs with a minor-allele frequency of 5%or more43. Such a study is not yet feasible because a high-density SNP map that covers the whole genome is not yetavailable. Nevertheless, the process of SNP discovery andvalidation is progressing very rapidly, and sets of SNPsthat tag the known polymorphisms across extensiveregions of the genome are already known and it is likelythat a complete map will soon be available.

A further barrier to progress remains the cost ofgenotyping. For example to genotype 200,000 markersin a 2,000/2,000 case–control set would cost US $8 mil-lion at 1c per genotype. Cost cannot be reduced bydecreasing the number of markers that need to be testedwithout serious loss of power, but substantial gains inefficiency can be made by careful study design to reducethe number of samples to be assayed. One approach isto genotype the cases and controls in stages. The first setof cases and controls is assayed for all the polymor-phisms of interest. This set is designed to have high sta-tistical power, but its size is kept to a minimum by usinga nominal level of significance. The sample size mightalso be reduced by selecting cases enriched for geneticsusceptibility as described above. The polymorphismsthat exceed the nominal significance threshold are thenexcluded from subsequent analyses. A second, larger,case–control series is then assayed for the remainingpolymorphisms to identify the true positives from thefalse positives after the first stage. The second sample setis designed to achieve an appropriately stringent level ofsignificance (BOX 1).

What if the underlying hypothesis is wrong?The current generation of association studies dependson the ‘common disease: common variant’ hypothesis,because they are testing common polymorphisms or

frequency in cases with a family history with that inunrelated controls37. The study was successful despitethe fact that the risk allele has a population frequency ofless than 1% and conferred an estimated relative risk tocarriers of less than 2. Over 4,000 cases and 4,000 con-trols would be needed to detect such an allele with 90%power at a significance of 10–4 using a standardcase–control study design.

The success of this approach raises the question asto the extent that the use of familial cases improves thepower of an association study. Assuming a polygenicmodel for breast cancer susceptibility, Antoniou andEaston showed that, relative to a standard case–controlassociation study with cases unselected for family his-tory, the sample size required to detect a common dis-ease-susceptibility allele was typically reduced by morethan twofold if cases with an affected first-degree rela-tive were selected, and by more than fourfold if caseswith two affected first-degree relatives were used36. Therelative efficiency obtained by using familial cases wasgreater for rarer alleles. Analysis of extended familiesindicated that the power of this approach was mostdependent on the immediate (first-degree) family his-tory. Bilateral cases might offer a similar gain in powerto cases with two affected first-degree relatives. It wouldalso be possible to increase power by selecting controlswithout a family history (‘hypernormal’ controls), butfor most cancers the benefit will be slight, as only aminority of controls would have a positive family his-tory. The estimates derived by Antoniou and Eastonwere based on a polygenic model of susceptibility tobreast cancer, but the results will also apply with othermodels. Enrichment for genetic factors might also besought by selecting cases with an early age of onset.However, in contrast to the strong effect of family his-tory, varying the ages at diagnosis of the cases across therange of 35–65 years of age has been shown to havelimited effects on the power to detect association inbreast cancer36.

Use of suitable study populations. A related issue is thechoice of a suitable study population. Populationsderived from a smaller number of FOUNDERS (for exam-ple, Finland) have been extremely powerful for mappingMendelian disorders because genetic heterogeneity isreduced and the disease might be associated with a sin-gle founder haplotype. The advantages of using afounder population for complex disorders is less clear;the degree of LD between common SNPs is typicallysimilar to that in larger outbred populations38 and anyadvantage might be outweighed by the difficulties ofobtaining sufficiently large numbers of cases. Such pop-ulations might be useful in the study of rarer variants ofmore recent origin.

DNA pooling. Some genotyping technologies allow thepossibility of DNA-pooling experiments, in which allelefrequencies are estimated for pools of individuals.Apparent differences in allele frequency can then beconfirmed and re-tested by individual genotyping.Where hundreds, or even thousands, of samples can be

FOUNDER

When a population expandsfrom a limited number ofindividuals, those individuals areknown as founders. The foundereffect is when a particular alleleis frequent in a populationderived from a small number offounders.

Page 9: Association studies for finding cancer-susceptibility genetic variants

858 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer

R E V I E W S

The relative importance of common and rare vari-ants in cancer susceptibility cannot be inferred fromexisting data and remains a major uncertainty over-hanging future studies. The identification of commonalleles is more tractable and, because they wouldexplain a greater fraction of disease burden, is of moredirect public-health relevance, and it is logical to con-centrate on these. However, alleles are likely to be rare ifthere is any degree of selection against the variant, forexample, if homozygotes are non-viable. If most cancersusceptibility is related to fundamental processes of cel-lular control, rare alleles might turn out to be the moreimportant component.

Describing gene–gene interactionsIdentifying disease-susceptibility alleles is only the firststep towards defining genetic risk. An equally impor-tant problem is to determine the risks associated withthe suceptibility genotypes. Although defining therisks associated with individual polymorphisms isstraightforward given sufficiently large case–controlstudies, defining the risks associated with combina-tions of genotypes is a more complex problem. Suchmultilocus genotypic risks are often referred to looselyas ‘gene–gene interactions’. However, the term ‘interac-tion’ can be misleading in this context. It is sometimestaken to refer to a biological interaction, implyingsome co-participation of two factors in the samecausal mechanism. Alternatively, it might imply a sta-tistical interaction; that is, risks that are non-additiveon some scale. For disease risks, this is often taken tobe the log-scale (that is, multiplicative disease risks),but this is an arbitrary choice. In practice, it is lessimportant to determine whether or not there is inter-action (however it is defined) than to estimate the risksassociated with combined genotypes.

The principle that combinations of common allelescan exert a profound influence on tumour susceptibil-ity is clearly seen in mouse models, but few if anyexamples in humans have been identified. Study sizemight present a major stumbling block here: if thereare n two-allele polymorphisms associated with dis-ease, the number of possible multilocus genotypes is3n. Even for as few as ten disease-causing alleles, thenumber of genotypes (59,049) would be prohibitivelylarge for any study to estimate genotype-specific risksindividually. Unless the number of disease loci is small,therefore, it will be necessary to impose some limits onthe analytical model. For example, interactions mightbe limited to those between genes within a given bio-logical pathway. Even then, some pathways have manypotentially interacting genes. Further refinementwould then be needed. One approach would be tolimit possible interactions to those between genesencoding proteins that are known to physically interactat the cellular level. For example, there are at least 11proteins (or genes) involved in repair of double-strandDNA breaks by homologous recombination. Theseproteins combine to form four different protein com-plexes, each responsible for a different step in therepair process. It would not be possible to test for all

haplotypes. It is, however, equally possible that much ofthe variation in cancer risk is due to rarer alleles (FIG. 1).Indeed, virtually all susceptibility alleles identified so farhave frequencies of less than 1%. These include high-penetrance mutations such as those in MLH1 and APCthat predispose to colorectal cancer, but also low-pene-trance variants in ATM and CHEK2 that predispose tobreast cancer. The rarity of these variants is likely to bedue to the fact that they are of recent origin and havenot had time to spread through the population.

The indirect tagging-SNP approach is unlikely towork for rare variants, because a rare allele will be tooweakly correlated with any common SNP haplotype.For example, a recent large breast cancer case–controlstudy of two common SNPs in CHEK2 did not pickup the signal from the rare 1100delc variant44.However, rare alleles are likely to be in LD with mark-ers over a larger interval because they have arisenmore recently, and it might be possible to useextended haplotype methods to map such variants.The use of these approaches for complex disease is,however, unproven. A further complication is the pos-sibility of allelic heterogeneity, in which multiple alle-les at each locus confer susceptibility. This will furtherreduce the power of LD-mapping approaches,because each new mutation will arise on a differenthaplotype background.

Although direct testing of rare variants is possible,the problems are formidable. Current efforts to iden-tify genetic variation have concentrated on commonpolymorphisms identifiable by re-sequencing smallnumbers of individuals. Identifying rare variants willrequire re-sequencing much larger numbers of indi-viduals, perhaps concentrating on individuals fromhigh-risk families where the frequencies might behigher. Furthermore, because the numbers of rarevariants will be much larger, the problems of multipletesting will be much greater. Finally, the frequencies ofsuch alleles are likely to vary between populations.

Box 4 | Determining risk associated with unknown linked alleles

When the disease risk for a given haplotype has been estimated in an association study,there might be uncertainty about the true magnitude of the risk, because it is possiblethat there is an unknown polymorphic variant linked to the apparent risk haplotype.However, if it is assumed that this allele is in linkage disequilibrium with the markerstyped, it is possible to place limits on the possible effect of the unknown allele. Supposethe true (unknown) dominant susceptibility allele has a frequency p and confers a relativerisk r (unknown) relative to non-carriers. Suppose this allele is associated with a markerhaplotype that has frequency p′ and for which the estimated relative risk of disease is r′.If D′ is the linkage-disequilibrium coefficient between marker haplotype and truesusceptibility allele, then it can be shown that

where

Therefore, for given values of p′, D′ and r′, the corresponding value of r can bedetermined.

r′ =(1 – p′)(rs + p′– s)

[(p – s)r + (1 – p – p′ + s)]p′

s = p(D′ – p′D′ + p′).

Page 10: Association studies for finding cancer-susceptibility genetic variants

NATURE REVIEWS | CANCER VOLUME 4 | NOVEMBER 2004 | 859

R E V I E W S

MULTIFACTOR-

DIMENSIONALITY REDUCTION

Uses case-control data to poolmultilocus genotypes into eithera high-risk or a low-risk group,effectively reducing the numberof genotype predictors to one.The new one-dimensionalmultilocus genotype can then beevaluated to classify and predictdisease status.

ConclusionThe association study design has been used as amethod to map cancer-susceptibility alleles for overfour decades. The advent of modern molecular genet-ics has seen a considerable increase in the use of thistype of study, which until now has been largely basedon the candidate-polymorphism approach. Theseefforts have, perhaps, been most notable for their fewsuccesses. However, improved knowledge of the mole-cular basis of carcinogenesis will enable better selec-tion of candidate genes, and improvements in studydesign, notably the use of much larger sample sizesthrough multicentre collaborations, might herald anew era of susceptibility-gene discovery. An exampleof this type of collaboration is the Consortium ofCohorts, which was formed by National CancerInstitute to address the need for large-scale collabora-tions for the study of gene–gene and gene–environmentinteractions in the aetiology of cancer and has morethan 20 cohorts participating (see online links box).In the future, advances in our understanding of thenature of human genetic variation coupled to newgenotyping technologies raise the hope that empiricalapproaches will bring similar successes to thoseachieved by empirical family-based linkage studies.Empirical association studies to scan 60% of thegenome for breast cancer susceptibility genes are nowin progress, and similar studies of other cancers arelikely in the near future.

possible interactions between all the genes in this path-way. However, limiting interactions to those between thegenes that code for proteins that form protein complexeswould make the problem more manageable. It is likelythat the next 5 years will see rapid developments in ourunderstanding of the interaction between proteins andpathways at the cellular level, and it might be possible tobe even more selective by limiting possible interactionsto those between polymorphisms that alter amino acidsknown to be in protein domains specifically involved inprotein–protein binding.

An alternative to parametric statistical methods suchas logistic regression is to implement statistical and com-putational methods that have improved power to identifymultivariate effects in relatively small samples. Theseinclude cluster analysis, linear discriminant analysis,recursive partitioning, support vector machine learning,neural networks and MULTIFACTOR-DIMENSIONALITY

REDUCTION45. Much of the recent impetus for developingthese methods in molecular biology has come from theanalysis of gene-expression array data. Other data — suchas germline genotype at multiple biallelic loci in cases andcontrols obtained in association studies — pose related,but different, problems. There has been little work on theuse of these methods to identify high-order interactionsin genetic susceptibility, although multifactor-dimension-ality reduction has recently been used to identify high-order interactions among oestrogen-metabolism genes inbreast cancer46.

1. Houlston, R. S. & Peto, J. in Genetic predisposition tocancer (eds Eeles, R. A., Ponder, B. A. J., Easton, D. F. &Horwich, A.) 208–226 (Chapman & Hall, London, 1996).

2. Lichtenstein, P. et al. Environmental and heritable factors inthe causation of cancer — analyses of cohorts of twins fromSweden, Denmark, and Finland. N. Engl. J. Med. 343,78–85 (2000).A landmark paper reporting the heritability of thecommon cancers based on data from over 40,000twin pairs from Scandinavia.

3. Easton, D. F. How many more breast cancerpredisposition genes are there. Breast Cancer Res. 1,14–17 (1999).

4. Antoniou, A. C. et al. A comprehensive model for familialbreast cancer incorporating BRCA1, BRCA2 and othergenes. Br. J. Cancer 86, 76–83 (2002).

5. Risch, N. Searching for genetic determinants in the newmillenium. Nature 405, 847–856 (2000).An excellent description of the strengths andweaknesses of different methods for gene mapping incomplex diseases.

6. Cardon, L. R. & Bell, J. I. Association study designs forcomplex diseases. Nature Rev. Genet. 2, 91–99 (2001).

7. Chakravarti, A. Population genetics — making sense out ofsequence. Nature Genet. 21, 56–60 (1999).

8. Glober, G. A., Cantrell, E. G., Doll, R. & Peto, R. Interactionbetween ABO and rhesus blood groups, the site of origin ofgastric cancers, and the age and sex of the patient. Gut 12,570–573 (1971).

9. Hildesheim, A. et al. Association of HLA class I and IIalleles and extended haplotypes with nasopharyngealcarcinoma in Taiwan. J. Natl Cancer Inst. 94, 1780–1789(2002).

10. Engel, L. S. et al. Pooled analysis and meta-analysis ofglutathione S-transferase M1 and bladder cancer: a HuGEreview. Am. J. Epidemiol. 156, 95–109 (2002).

11. Vineis, P. et al. Current smoking, occupation, N-acetyltransferase-2 and bladder cancer: a pooled analysisof genotype-based studies. Cancer Epidemiol. BiomarkersPrev. 10, 1249–1252 (2001).

12. Dunning, A. M. et al. A systematic review of geneticpolymorphisms and breast cancer risk. Cancer Epidemiol.Biomarkers Prev. 8, 843–854 (1999).

13. Gonzalez, C. A., Sala, N. & Capella, G. Geneticsusceptibility and gastric cancer risk. Int. J. Cancer 100,249–260 (2002).

14. Ioannidis, J. P., Ntzani, E. E., Trikalinos, T. A. & Contopoulos-Ioannidis, D. G. Replication validity of genetic associationstudies. Nature Genet. 29, 306–309 (2001).

15. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. &Hirschhorn, J. N. Meta-analysis of genetic associationstudies supports a contribution of common variants tosusceptibility to common disease. Nature Genet. 33,177–182 (2003).

16. Tabor, H. K., Risch, N. J. & Myers, R. M. Candidate-gene approaches for studying complex genetic traits:practical considerations. Nature Rev. Genet. 3, 391–397(2002).

17. Dahlman, I. et al. Parameters for reliable results in geneticassociation studies in common disease. Nature Genet. 30,149–150 (2002).

18. Colhoun, H. M., McKeigue, P. M. & Davey Smith, G.Problems of reporting genetic associations with complexoutcomes. Lancet 361, 865–872 (2003).

19. Patil, N. et al. Blocks of limited haplotype diversity revealedby high-resolution scanning of human chromosome 21.Science 294, 1719–1723 (2001).

20. Johnson, G. C. et al. Haplotype tagging for the identificationof common disease genes. Nature Genet. 29, 233–237(2001).

21. Gabriel, S. B. et al. The structure of haplotype blocks in thehuman genome. Science 296, 2225–2229 (2002).

22. Zhang, K., Calabrese, P., Nordborg, M. & Sun, F. Haplotypeblock structure and its applications to association studies:power and study designs. Am. J. Hum. Genet. 71,1386–1394 (2002).

23. Meng, Z., Zaykin, D. V., Xu, C. F., Wagner, M. & Ehm, M. G.Selection of genetic markers for association analyses, usinglinkage disequilibrium and haplotypes. Am. J. Hum. Genet.73, 115–130 (2003).

24. Haiman, C. A. et al. A comprehensive haplotype analysis ofCYP19 and breast cancer risk: the Multiethnic Cohort. Hum.Mol. Genet. 12, 2679–2692 (2003).One of the first studies to use a comprehensivehaplotype-tagging approach to examine a gene forcommon variants associated with breast cancer risk.

25. Carlson, C. S. et al. Selecting a maximally informative set ofsingle-nucleotide polymorphisms for association analysesusing linkage disequilibrium. Am. J. Hum. Genet. 74,106–120 (2004).This paper reports the results of re-sequencing 100genes in 24 African-American and 23 European-American samples. They showed that a tagging-SNPset can comprehensively interrogate for main effectsof common variants, but that tagging SNPs should beselected separately for populations of differentancestries.

26. Chapman, J. M., Cooper, J. D., Todd, J. A. & Clayton, D. G.Detecting disease associations due to linkagedisequilibrium using haplotype tags: a class of tests and thedeterminants of statistical power. Hum. Hered. 56, 18–31(2003).

27. Zhang, K. & Jin, L. HaploBlockFinder: haplotype blockanalyses. Bioinformatics 19, 1300–1301 (2003).

28. Stram, D. O. et al. Choosing haplotype-tagging SNPSbased on unphased genotype data using a preliminarysample of unrelated subjects with an example from theMultiethnic Cohort Study. Hum. Hered. 55, 27–36(2003).

29. Ke, X. & Cardon, L. R. Efficient selective screening ofhaplotype tag SNPs. Bioinformatics 19, 287–288 (2003).

30. Neale, B. M. & Sham, P. C. The future of association studies:gene-based analysis and replication. Am. J. Hum. Genet.75, 353–362 (2004).

31. Marron, M. P. et al. Insulin-dependent diabetes mellitus(IDDM) is associated with CTLA4 polymorphisms in multipleethnic groups. Hum. Mol. Genet. 6, 1275–1282 (1997).

32. Hugot, J. P. et al. Association of NOD2 leucine-rich repeatvariants with susceptibility to Crohn’s disease. Nature 411,599–603 (2001).

33. Boyd, N. F. et al. Heritability of mammographic density, a riskfactor for breast cancer. N. Engl. J. Med. 347, 886–894(2002).

34. Lakhani, S. R. et al. Multifactorial analysis of differencesbetween sporadic breast cancers and cancers involvingBRCA1 and BRCA2 mutations. J. Natl Cancer Inst. 90,1138–1145 (1998).

35. Botstein, D. & Risch, N. Discovering genotypes underlyinghuman phenotypes: past successes for mendelian disease,

Page 11: Association studies for finding cancer-susceptibility genetic variants

860 | NOVEMBER 2004 | VOLUME 4 www.nature.com/reviews/cancer

R E V I E W S

future approaches for complex disease. Nature Genet. 33,S228–S237 (2003).

36. Antoniou, A. & Easton, D. F. Polygenic inheritance of breastcancer: implications for design of association studies.Genet. Epidemiol. 25, 190–203 (2003).

37. Meijers-Heijboer, H. et al. Low-penetrance susceptibility tobreast cancer due to CHEK2(*)1100delC in noncarriers ofBRCA1 or BRCA2 mutations. Nature Genet. 31, 55–59(2002).

38. Dunning, A. M. et al. The extent of linkage disequilibrium infour populations with distinct demographic histories. Am. J.Hum. Genet. 67, 1544–1554 (2000).

39. Sham, P., Bader, J. S., Craig, I., O’Donovan, M. & Owen, M.DNA Pooling: a tool for large-scale association studies.Nature Rev. Genet. 3, 862–871 (2002).

40. Barratt, B. J. et al. Identification of the sources of error inallele frequency estimations from pooled DNA indicates anoptimal experimental design. Ann. Hum. Genet. 66,393–405 (2002).

41. Risch, N. & Merikangas, K. The future of geneticstudies of complex diseases. Science 273, 1516–1517(1996).

42. Carlson, C. S., Eberle, M. A., Kruglyak, L. & Nickerson, D. A.Mapping complex disease loci in whole-genome associationstudies. Nature 429, 446–452 (2004).

43. Kruglyak, L. Prospects for whole-genome linkagedisequilibrium mapping of common disease genes. NatureGenet. 22, 139–144 (1999).

44. Kuschel, B. et al. Common polymorphisms in CHEK2(checkpoint kinase 2) are not associated with breast cancerrisk. Cancer Epidemiol. Biomarkers Prev. 12, 809–812 (2003).

45. Hastie, T., Tibshirani, R. & Friedman, J. The elements ofstatistical learning (Springer–Verlag, New York, 2001).

46. Ritchie, M. D. et al. Multifactor-dimensionality reductionreveals high-order interactions among estrogen-metabolismgenes in sporadic breast cancer. Am. J. Hum. Genet. 69,138–147 (2001).

47. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L.& Rothman, N. Assessing the probability that a positivereport is false: an approach for molecular epidemiologystudies. J. Natl Cancer Inst. 96, 434–442 (2004).

48. Thomas, D. C. & Clayton, D. G. Betting odds and geneticassociations. J. Natl Cancer Inst. 96, 421–423 (2004).

49. Devlin, B. & Roeder, K. Genomic control for associationstudies. Biometrics 55, 997–1004 (1999).

50. Pritchard, J. K. & Rosenberg, N. A. The use of unlinkedgenetic markers to detect population stratification inassociation studies. Am. J. Hum. Genet. 65, 220–228(1999).

51. Cardon, L. R. & Palmer, L. J. Population stratificationand spurious allelic association. Lancet 361, 598–604(2003).An excellent review of methods to detect and accountfor population stratification in genotype–phenotypeassociation studies.

52. Marchini, J., Cardon, L. R., Phillips, M. S. & Donnelly, P.The effects of human population structure on largegenetic association studies. Nature Genet. 36, 512–517(2004).

53. Freedman, M. L. et al. Assessing the impact of populationstratification on genetic association studies. Nature Genet.36, 388–393 (2004).

54. Risch, N. The genetic epidemiology of cancer: interpretingfamily and twin studies and their implications for moleculargenetic approaches. Cancer Epidemiol. Biomarkers Prev.10, 733–741 (2001).

AcknowledgementsWe thank the referees and editors, whose comments on earlierdrafts of this manuscript were very helpful.

Competing interests statementThe authors declare no competing financial interests.

Online links

DATABASESThe following terms in this article are linked online to:Entrez Gene:http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=geneATM | BRCA1 | BRCA2 | CDKN2A | CHEK2 | CTLA4 | CYP19 |MLH1 | MSH2 | NOD2 | PTEN | TP53National Cancer Institute: http://cancer.gov/breast cancer | colorectal cancer | gastric cancer | melanoma |ovarian cancerOMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=omimadenomatosis polyposis coli | multiple endocrine neoplasia type 2 |type 1 diabetes

FURTHER INFORMATIONFred Hutchinson Center Seattle SNPs Program:http://pga.gs.washington.edu/International HapMap Project: http://www.hapmap.orgNational Cancer Institute Consortium of Cohorts:http://epi.grants.cancer.gov/Consortia/cohort.htmlNational Institute of Environmental Health SciencesEnvironmental Genome Project SNPs Program:http://egp.gs.washington.edu/Access to this interactive links box is free online.