EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 2: Patterns of LD and “tag SNP” selection Peter Kraft [email protected] Bldg 2 Rm 207 2-4271

Page 1: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

EPI293: Design and analysis of gene association studies

Winter Term 2008

Lecture 2: Patterns of LD and “tag SNP” selection

Peter Kraft
[email protected]

Bldg 2 Rm 207, 2-4271

Page 2: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Before HapMap: “looking under lamppost”

Study 1: Pop’n A, small N, no assoc’n

Study 2: Pop’n A, large N, no assoc’n

Study 3: Pop’n B, large N, assoc’n

After HapMap

Study 2 revisited: Pop’n A, large N, assoc’n

Page 3: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection

Page 4: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection

Page 5: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Basic idea: linkage disequilibrium

Alleles at two (or more) loci are correlated on chromosomes drawn at random from the population

[Figure: 16 sampled chromosomes, each carrying one of the two-locus haplotypes A-G (8 chromosomes), A-g (2) or a-g (6); no a-G chromosome appears]

Page 6: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Measures of linkage disequilibrium

• Basic data: table of haplotype frequencies

[Figure: the same 16 chromosomes: A-G x 8, A-g x 2, a-g x 6]

          A        a
G         8        0       50%
g         2        6       50%
        62.5%    37.5%

Page 7: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Linkage disequilibrium and marginal allele freqs.

          A                     a
G    pA pG + δ = x        qA pG − δ = y        pG
g    pA qG − δ = w        qA qG + δ = z        qG
          pA                    qA              1

• pA & pG are (minor) allele frequencies
  – qA = 1 − pA; qG = 1 − pG

• $\delta = xz - yw$ is a measure of departure from independence
  – No association between A and G ⇔ $\delta = 0$
  – $\max(\delta) = \min(p_A q_G,\ p_G q_A)$
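As a check on the table above, write δ for the departure term that is added to or subtracted from each cell. Substituting the four cell frequencies into xz − yw and using pA + qA = pG + qG = 1 recovers δ exactly. This is standard algebra added here for illustration; it is not taken from the slides.

```latex
\begin{aligned}
xz - yw &= (p_A p_G + \delta)(q_A q_G + \delta) - (q_A p_G - \delta)(p_A q_G - \delta) \\
        &= \delta\,(p_A p_G + q_A q_G + q_A p_G + p_A q_G) \\
        &= \delta\,(p_A + q_A)(p_G + q_G) = \delta
\end{aligned}
```

For the 16-chromosome example, x = 0.5, y = 0, w = 0.125 and z = 0.375, so δ = 0.5 × 0.375 − 0 × 0.125 = 0.1875.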

Page 8: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

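          A       a
G        n11     n10     n1·
g        n01     n00     n0·
         n·1     n·0

(n1·, n0· are row totals; n·1, n·0 are column totals)

Measure       Formula                                                                                 Ref.
D’            $(n_{11}n_{00} - n_{10}n_{01}) / \min(n_{1\cdot}n_{\cdot 0},\ n_{0\cdot}n_{\cdot 1})$    Lewontin (1964)
Δ² = r²       $(n_{11}n_{00} - n_{10}n_{01})^2 / (n_{1\cdot}n_{0\cdot}n_{\cdot 1}n_{\cdot 0})$          Hill and Weir (1994)
δ*            $(n_{11}n_{00} - n_{10}n_{01}) / (\ldots)$ [denominator illegible in source]              Levin (1953)
Odds ratio    $n_{11}n_{00} / (n_{10}n_{01})$                                                           Edwards (1963)
Q             $(n_{11}n_{00} - n_{10}n_{01}) / (n_{11}n_{00} + n_{10}n_{01})$                           Yule (1900)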

Page 9: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

|D’| and r2 are most common

• D prime …
  – …ranges from 0 [no LD] to 1 [complete LD]…
  – …is less sensitive to marginal allele frequencies…
  – …is directly related to recombination fraction

• R squared…
  – …also ranges from 0 to 1…
  – …is the squared correlation between alleles on the same chromosome…
  – …is very sensitive to marginal allele frequencies…
  – …is directly related to study power

• If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r²N cases and controls that directly measured G

• r²N is the “effective sample size”
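The effective-sample-size rule above is simple arithmetic, and a small sketch may make the practical consequence concrete: to match the power of N directly genotyped subjects, roughly N / r² subjects are needed at the marker. The function names and the example numbers below are illustrative assumptions, not part of the lecture.

```python
def effective_sample_size(n_genotyped, r2):
    """Sample size that would give the same power if the causal variant were typed directly."""
    return r2 * n_genotyped


def sample_size_needed(n_direct, r2):
    """Subjects to genotype at the marker to match n_direct subjects typed at the causal variant."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return n_direct / r2


# Illustrative numbers only: 1000 subjects, marker with r2 = 0.6 to the causal variant.
n_eff = effective_sample_size(1000, 0.6)
# Inflating the study by 1 / r2 compensates for imperfect LD (here r2 = 0.5).
n_needed = sample_size_needed(1000, 0.5)
print(n_eff, n_needed)
```

This is one reason tagging thresholds such as r² ≥ 0.8 are popular: the implied inflation in sample size, 1 / 0.8, stays modest.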

Page 10: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

[Figure: the 16 chromosomes from the earlier example: A-G x 8, A-g x 2, a-g x 6]

          A        a
G         8        0       50%
g         2        6       50%
        62.5%    37.5%

$D' = (8 \times 6 - 0) / (8 \times 6) = 1$

$r^2 = (8 \times 6 - 0)^2 / (10 \times 6 \times 8 \times 8) = 0.6$
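The worked example on this slide can be reproduced in a few lines. The sketch below takes the four haplotype counts, converts them to frequencies, and returns the raw disequilibrium, D’ and r². The function name is my own, and the branch for negative disequilibrium uses the usual Lewontin normalisation, which the slides do not spell out.

```python
def ld_measures(n_AG, n_aG, n_Ag, n_ag):
    """Return (delta, D', r^2) from counts of the four two-locus haplotypes."""
    n = n_AG + n_aG + n_Ag + n_ag
    x, y, w, z = n_AG / n, n_aG / n, n_Ag / n, n_ag / n
    p_A, p_G = x + w, x + y            # marginal allele frequencies
    q_A, q_G = 1 - p_A, 1 - p_G
    delta = x * z - y * w              # departure from independence
    if delta >= 0:
        delta_max = min(p_A * q_G, p_G * q_A)   # bound given on the slide
    else:
        delta_max = min(p_A * p_G, q_A * q_G)   # assumption: standard bound for negative delta
    d_prime = delta / delta_max if delta_max > 0 else 0.0
    # r^2 is undefined if either locus is monomorphic; this sketch does not guard against that.
    r2 = delta ** 2 / (p_A * q_A * p_G * q_G)
    return delta, d_prime, r2


# Counts from the slide: A-G = 8, a-G = 0, A-g = 2, a-g = 6
delta, d_prime, r2 = ld_measures(8, 0, 2, 6)
print(delta, d_prime, r2)
```

The example shows how D’ can equal 1 (one haplotype is missing) while r² is only 0.6, because the two loci have different allele frequencies.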

Page 11: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Computational detail

• Haplotypes are rarely directly observed
• Have to infer from genotype data
  – Genotypes consistent with haplotype pairs: the double heterozygote Aa/Gg is consistent with either AG + ag or Ag + aG
• Most popular algorithm: Expectation Maximization [1]
• Related to, but not exactly equal to, 3x3 table of genotypes (AA=2, Aa=1, aa=0 by BB=2, Bb=1, bb=0)
  – Correlation from this table makes no assumptions about HWE (Weir, Genetic Data Analysis)

[1] Thomas pp. 243-245
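To make the EM idea concrete for two SNPs: only double heterozygotes are phase-ambiguous, so the E-step splits them between the two possible haplotype pairs in proportion to the current frequency estimates, and the M-step recounts haplotypes. The sketch below is a minimal illustration under assumptions of my own (a 3x3 genotype table as input, random pairing of haplotypes, a uniform starting point); it is not the implementation referenced in the lecture.

```python
def em_haplotype_freqs(geno_counts, n_iter=200, tol=1e-10):
    """Estimate two-SNP haplotype frequencies by EM.

    geno_counts[i][j] = number of subjects with i copies of allele A
    and j copies of allele G (a 3x3 table). Assumes random mating.
    """
    base = {"AG": 0.0, "Ag": 0.0, "aG": 0.0, "ag": 0.0}
    for i in range(3):
        for j in range(3):
            if i == 1 and j == 1:
                continue  # double heterozygotes are handled in the E-step
            c = geno_counts[i][j]
            # Phase is unambiguous here: write down the two haplotypes directly.
            first_A, second_A = ("A" if i >= 1 else "a"), ("A" if i == 2 else "a")
            first_G, second_G = ("G" if j >= 1 else "g"), ("G" if j == 2 else "g")
            base[first_A + first_G] += c
            base[second_A + second_G] += c

    n_dh = geno_counts[1][1]
    total = 2 * sum(sum(row) for row in geno_counts)
    f = {h: 0.25 for h in base}  # uniform start (an arbitrary choice)
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AG/ag rather than Ag/aG
        cis, trans = f["AG"] * f["ag"], f["Ag"] * f["aG"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: recount haplotypes with the ambiguous subjects split by w
        new = {
            "AG": (base["AG"] + n_dh * w) / total,
            "ag": (base["ag"] + n_dh * w) / total,
            "Ag": (base["Ag"] + n_dh * (1 - w)) / total,
            "aG": (base["aG"] + n_dh * (1 - w)) / total,
        }
        converged = max(abs(new[h] - f[h]) for h in f) < tol
        f = new
        if converged:
            break
    return f


# Hypothetical data: 2 subjects aa/gg, 4 double heterozygotes, 4 subjects AA/GG
freqs = em_haplotype_freqs([[2, 0, 0], [0, 4, 0], [0, 0, 4]])
print(freqs)
```

With these hypothetical counts the unambiguous subjects carry only A-G and a-g haplotypes, so the algorithm resolves the double heterozygotes as AG/ag.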

Page 12: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection

Page 13: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Why does LD exist?

1. “Recombination coldspots”
2. Demographics (e.g. bottlenecks)
3. Population stratification or admixture
   • Confounds gene-disease association
   • Does not decay with distance

(among other reasons… selective pressure … etc.)

Page 14: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Decay of LD in Pictures

Page 15: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Decay of LD: $\delta_T = \delta_0 (1 - \theta)^T$

[Figure: delta (y-axis, 0 to 1) plotted against theta (x-axis, 0.05 to 0.25), with one curve each for 1, 5, 10, 20, 40 and 80 generations]
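The decay formula is easy to tabulate. The sketch below evaluates the fraction of the initial disequilibrium remaining after T generations for a recombination fraction theta, and the number of generations needed to halve it. Function names and the chosen theta values are illustrative assumptions.

```python
import math


def ld_after(delta0, theta, generations):
    """Disequilibrium remaining after `generations` of random mating."""
    return delta0 * (1 - theta) ** generations


def ld_half_life(theta):
    """Generations needed for disequilibrium to fall to half its starting value."""
    return math.log(0.5) / math.log(1 - theta)


# Fraction of the initial disequilibrium remaining, for the generation counts shown in the figure
for theta in (0.05, 0.10, 0.25):
    remaining = [round(ld_after(1.0, theta, t), 3) for t in (1, 5, 10, 20, 40, 80)]
    print(theta, remaining)
```

For tightly linked markers (theta far below the values plotted) the half-life runs to hundreds or thousands of generations, which is why LD persists over short physical distances.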

Page 16: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 17: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

200 kbp from chr2, positions 51,783,239 to 51,983,238

Data from the ENCODE project http://www.hapmap.org/downloads/encode1.html.en

Page 18: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Implications

• Admixture can lead to false positives
  – Two unlinked loci can stay in LD
  – Recent admixture, continual gene flow problematic

• Isolated populations have advantages for fine-mapping
  – LD extends long distances, so fewer markers need be typed
  – But resolution may be poor

Knowledge of local LD structure is essential for candidate gene studies!

Page 19: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection

Page 20: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Basic “tagging” design

1. Measure haplotypes/LD pattern in a subsample (often external database)

2. Choose subset of SNPs (“tagSNPs”) that contain majority of information

3. Genotype “tagSNPs” in main study, analyze appropriately

Page 21: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Over 750 known SNPs – at least 50 are common in Europeans

ATM

Page 22: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

ATM

Page 23: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

“block” = region of limited haplotype diversity and/or high LD

Page 24: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

But there are unappealing aspects of the “haplotype block” idea

• Definition and “block finding” algorithms are ad hoc
  – Different defns, algs lead to different block structures
  – Block structure changes with sample size, marker density

• “Hard boundaries” are…
  – …unappealing for tagSNP selection (what about “between blocks”?)…
  – …an inaccurate description of LD patterns (some haps overlap boundaries)

• Plus, haplotypes present analytic challenges

[Wall & Pritchard (2003a) Nat Rev Genet 4:587; (2003b) AJHG 73:502]
[Nothnagel and Rohde (2005) AJHG 77:988]

Page 25: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

CYP19

Page 26: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

CYP19

Page 27: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 28: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Keep it simple

• We want SNPs that predict unobserved variants
• Why not choose SNPs based on pairwise correlations?

[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) shown on a set of haplotypes, with three pairs of SNPs marked “high r2”]

• Q: What if we don’t know enough about common genetic variation to say we’ve captured it?

• A: HapMap and resequencing projects

Page 29: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection

Page 30: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

HapMap:

application in the design and interpretation of association studies

Mark J. Daly, PhD on behalf of

The International HapMap Consortium

[OK it may look like I’m totally stealing these slides—but they are free on the web at http://www.hapmap.org/tutorials.html.en]

Page 31: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Goals of this segment

• Briefly summarize HapMap design and current status

• Discuss the application of HapMap to all aspects of association study design, analysis and interpretation

Page 32: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

HapMap Project

High-density SNP genotyping across the genome provides information about
  – SNP validation, frequency, assay conditions
  – correlation structure of alleles in the genome

A freely-available public resource to increase the power and efficiency of genetic association studies to medical traits

All data is freely available on the web for application in study design and analyses as researchers see fit

Page 33: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

HapMap Samples

• 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)

• 90 individuals (30 trios) of European descent from Utah (CEU)

• 45 Han Chinese individuals from Beijing (CHB)

• 45 Japanese individuals from Tokyo (JPT)

Page 34: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

HapMap progress

PHASE I – completed, described in Nature paper

* 1,000,000 SNPs successfully typed in all 270 HapMap samples

* ENCODE variation reference resource available

PHASE II – data generation complete, data released early November 2005

* >3,500,000 SNPs typed in total !!!

Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, et al. (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-61.

Page 35: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

ENCODE-HAPMAP variation project

• Ten “typical” 500kb regions

• 48 samples sequenced

• All discovered SNPs (and those in dbSNP) typed in all 270 HapMap samples

• Current data set – 1 SNP every 279 bp

A much more complete variation resource by which the genome-wide map can be evaluated

Page 36: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Completeness of dbSNP

Vast majority of common SNPs are contained in or highly correlated with a SNP in dbSNP

Page 37: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Recombination hotspots are widespread and account for LD structure

7q21

Page 38: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Coverage of Phase II HapMap (estimated from ENCODE data)

From Table 6 – “A Haplotype Map of the Human Genome”, Nature

Panel      % r2 > 0.8    max r2
YRI        81            0.90
CEU        94            0.97
CHB+JPT    94            0.97

Vast majority of common variation (MAF > .05) captured by Phase II HapMap

Page 39: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Applying the HapMap

• Study design - tagging
• Study coverage evaluation
• Study analysis - improving association testing
• Study interpretation
  – Comparison of multiple studies
  – Connection to genes/genomic features
  – Integration with expression and other functional data
• Other uses of HapMap data
  – Admixture, LOH, selection

Page 40: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Tagging from HapMap

• Since HapMap describes the majority of common variation in the genome, choosing non-redundant sets of SNPs from HapMap offers considerable efficiency without power loss in association studies

Page 41: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 42: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Pairwise tagging

[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) shown on a set of haplotypes, with three pairs of SNPs marked “high r2”]

Tags: SNP 1, SNP 3, SNP 6 (3 in total)

Test for association: SNP 1, SNP 3, SNP 6

After Carlson et al. (2004) AJHG 74:106
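One way to picture pairwise tagging is as a greedy binning procedure: compute r² between every pair of SNPs in the reference panel, repeatedly pick the SNP that captures the most still-untagged SNPs at the chosen threshold, and remove its bin. The sketch below follows that idea in the spirit of the approach cited above; it is a simplified illustration, not the published algorithm or Haploview’s implementation, and the haplotype data are hypothetical.

```python
def r_squared(a, b):
    """Squared correlation between two 0/1 allele vectors observed on the same chromosomes."""
    n = len(a)
    p_a, p_b = sum(a) / n, sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0


def greedy_pairwise_tags(haplotypes, threshold=0.8):
    """Return 0-based indices of tag SNPs chosen greedily from phased haplotypes."""
    n_snps = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_snps)]
    # partners[j] = SNPs (including j itself) captured by j at the threshold
    partners = {
        j: {k for k in range(n_snps)
            if k == j or r_squared(cols[j], cols[k]) >= threshold}
        for j in range(n_snps)
    }
    untagged, tags = set(range(n_snps)), []
    while untagged:
        # pick the untagged SNP capturing the most untagged SNPs (ties: lowest index)
        best = max(sorted(untagged), key=lambda j: len(partners[j] & untagged))
        tags.append(best)
        untagged -= partners[best]
    return tags


# Hypothetical phased data: 8 chromosomes x 6 SNPs, alleles coded 0/1.
# SNPs 1-2, 3-5 and 4-6 are built as perfectly correlated pairs.
x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 0, 0, 1, 1, 0, 0]
z = [1, 0, 1, 0, 1, 0, 1, 0]
haplotypes = [[x[i], x[i], y[i], z[i], y[i], z[i]] for i in range(8)]

tags = greedy_pairwise_tags(haplotypes)
print([t + 1 for t in tags])  # 1-based SNP numbers
```

With ties broken by position the sketch selects SNP 4 rather than SNP 6 for the last bin; the two are interchangeable because they are perfectly correlated in this toy panel.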

Page 43: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Pairwise Tagging Efficiency

Table 7 Number of selected tag SNPs to capture all observed common SNPs in the Phase I HapMap for the three analysis panels using pairwise tagging at different r2 thresholds

                      YRI        CEU        CHB+JPT
Pairwise r2 ≥ 0.5     324,865    178,501    159,029
Pairwise r2 ≥ 0.8     474,409    293,835    259,779
Pairwise r2 = 1       604,886    447,579    434,476

Tag SNPs were picked to capture common SNPs in release 16c.1 for every 7,000 SNP bin using Haploview.

Tagging Phase I HapMap offers 2-5x gains in efficiency

Page 44: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Use of haplotypes can improve genotyping efficiency

Pairwise tagging
  Tags: SNP 1, SNP 3, SNP 6 (3 in total)
  Test for association: SNP 1, SNP 3, SNP 6

Multi-marker tagging
  Tags: SNP 1, SNP 3 (2 in total)
  Test for association:
    SNP 1 captures 1+2
    SNP 3 captures 3+5
    “AG” haplotype captures SNP 4+6

[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) shown on a set of haplotypes]

Tags in multi-marker test should be conditional on significance of LD in order to avoid overfitting

Page 45: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Efficiency and power

[Figure: relative power (%) against average marker density (per kb), comparing tag SNPs with random SNPs]

~300,000 tag SNPs needed to cover common variation in whole genome in CEU

P.I.W. de Bakker et al. (2005) Nat Genet Advance Online Publication 23 Oct 2005

Page 46: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Will tag SNPs picked from HapMap apply to other population samples?

Two issues:

• What if LD structure strongly differs between my samples and the HapMap samples?
  – Are CEU or YRI panels good surrogates for Latinos from Los Angeles? Are CEU samples even good surrogates for whites from France?

• Is HapMap sample size sufficient?
  – Small sample correlation overestimated; are tagging algorithms “overfitting” the sample?

PK slide

Page 47: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Will tag SNPs picked from HapMap apply to other population samples?

Population differences add very little inefficiency
Paul de Bakker, Pac Symp Biocomput 2006

[Figure: panels labelled CEU (Utah residents with European ancestry, CEPH), Whites from Los Angeles, CA, and Botnia, Finland]

Page 48: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

De Bakker et al (2006) Nat Genet

Page 49: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Need and Goldstein (2006) Nat Genet

Page 50: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Impact of training set sample size

Tags chosen as pairwise tags

Tags chosen as multimarker tags(up to 6 markers)

Zeggini et al Nature Genetics 37, 1320 - 1322 (2005)

Page 51: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Impact of training set sample size

Tags chosen for common variants

Tags chosen for common and rare variants

Zeggini et al Nature Genetics 37, 1320 - 1322 (2005)

Page 52: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection

Page 53: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Public sources of SNP data

• Candidate genes
  – “Seattle SNPs” http://pga.gs.washington.edu/ *
  – Environmental Genome Project http://egp.gs.washington.edu/ *
  – IIPGA http://innateimmunity.net/IIPGA2/index_html *
  – HAPMAP http://www.hapmap.org/
  – BPC3 http://www.uscnorris.com/MECGenetics/

• Genome-wide
  – HAPMAP

• dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/

• OMIM (Online Mendelian Inheritance in Man)

* Resequencing data

Page 54: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Bioinformatics tools

– https://innateimmunity.net/IIPGA2/Bioinformatics/
– http://pga.gs.washington.edu/software.html
– Haploview http://www.broad.mit.edu/mpg/haploview/index.php
– SNPSelector

Page 55: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

So, OK, how should I select SNPs?

• PubMed/lit search
  – Previous associations with your (or related) phenotype
    • GWAS!
  – Functional studies

• Potentially functional variants
  – nsSNPs (perhaps ranked by SIFT or Polyphen score)
  – Splice sites
  – Conserved regions

• tagSNPs

Page 56: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

SNP Selector
Bioinformatics 21:4181

http://primer.duhs.duke.edu/

Page 57: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

SNP Selector
Bioinformatics 21:4181

Page 58: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 59: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Molecular genotyping methods

David G. Cox M.S. Ph.D.
Instructor of Epidemiology
[email protected]
Bldg. 2 Rm. 211
(617) 432-2262

Page 60: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Overview

• How it works
• Considerations in choosing a method
• Quality Control (QC)
• Organizing your data
• Completing the study

Page 61: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

PCR

• Rapid, versatile, in vitro method for amplifying defined target DNA sequences to yield multiple copies of a specific region of DNA sequence
• 1980s, K. Mullis invented PCR
  – Won Nobel Prize in 1993
• Applications for basic science, epidemiology, evolution, linkage analysis, forensics, anthropology

Page 62: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

PCR (2)

• Allows for screening of uncharacterized mutations
• Rapid genotyping for polymorphic markers
• Detecting point mutations

Page 63: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

PCR cycle

Three steps

1. Denaturation
   • Denature DNA to separate strands
2. Annealing
   • Primers bind to strands
3. Extension
   • Polymerase synthesizes new strands

Page 64: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

PCR cycle (2)

• Reaction mixture proceeds through repeated cycles of primer annealing, DNA synthesis, and denaturation
• Target sequence concentration increases exponentially with each cycle
  – Each newly synthesized DNA strand acts as a template for further DNA synthesis in subsequent cycles
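The exponential growth described above can be put in numbers. If every template is copied in every cycle, the number of target copies doubles each time; real reactions fall short of that, which the sketch below models with a per-cycle efficiency parameter. The function name, the efficiency parameter and the example values are my own illustrative assumptions.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of target copies after `cycles` rounds of PCR.

    efficiency = 1.0 means perfect doubling every cycle;
    lower values mean only that fraction of templates is copied per cycle.
    """
    return initial_copies * (1 + efficiency) ** cycles


perfect = pcr_copies(1, 30)          # one starting molecule, 30 cycles, perfect doubling
realistic = pcr_copies(1, 30, 0.9)   # same, assuming 90% efficiency per cycle
print(perfect, realistic)
```

Even a modest drop in per-cycle efficiency reduces the final yield several-fold, because the shortfall compounds over every cycle.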

Page 65: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Denaturation

Page 66: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Annealing

Page 67: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Extension

Page 68: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 69: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 70: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Main assays used

• Looking to optimize three things
  – Cost of genotyping
  – Speed of genotyping
  – Reliability of data

• Three main categories
  – Low-plexed
    • Usually PCR based
  – High-plexed
    • PCR or non-PCR based
  – Mega-plexed
    • Non-PCR based

Page 71: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

PCR based methods

• Plex = number of separate assays in an individual tube
• Single to low-plexed
  – Usually limited to number of tags
    • Either fluorescent or mass
    • Tags are expensive part
• Micro scale reactions
• Low start-up costs
  – Robotics not necessary
  – Machines in many labs

Page 72: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Taqman

Page 73: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

BioTrove

• Miniaturized Taqman
  – Primers and probes spotted into holes
  – Taqman reaction exactly the same
• Reduces cost
  – Lowers quantity of probe and master mix
  – Still need to order a minimum amount of primer and probe

Page 74: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

iPLEX

Page 75: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Non-PCR based methods

• Usually rely on some sort of genome wide amplification step
• Hybridization techniques increase plex
  – Stick DNA to some sort of chip
    • Chips are roughly the size of microscope slides
• Nano scale reactions
• High start-up costs for machines and robotics
  – Core facilities normally used

Page 76: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Illumina products

• Highly multiplexed assays
  – From 384 to ~1M SNPs
    • Custom chips designed up to 72k
    • GWAS products of ~500k and ~1M
  – Based on pair-wise tagging of SNPs from HapMap
• Use specially etched holes
  – Solves “spotting” problem
  – Addressing system

Page 77: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Goldengate

Page 78: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Infinium

Page 79: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Affymetrix

• GWAS chip
  – Over 1.8M features
• SNPs
  – ~500k from earlier version
    • Evenly spaced across genome
  – ~500k additional SNPs
    • Tag
    • X/Y
    • mtDNA
    • New SNPs
    • Hotspots
• CNVs
  – ~200k specifically targeted to CNVs
  – ~750k additional probes across genome

Page 80: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Quick word on CNVs

• Latest craze in genetic epi (one of)
• Copy Number Variants
  – Either more (3+) or fewer (1) copies of a genetic region present
• Polymorphic regions of varying zygosity
  – Detected as Mendelian errors in HapMap
  – Behave (from a population genetic standpoint) like any other polymorphism
• Still not well characterized
  – e.g. regions with high homology can show up as CNVs
• Genotyped using quantification of genotype signal

Page 81: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Back to Affy

Page 82: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Affy vs. Illumina

Affymetrix
  – Earlier product
    • Began with ~100k
    • Assay and software issues
  – Costs have drastically declined
  – SNP coverage has drastically increased
    • tagSNPs added
  – WGA DNA OK

Illumina
  – Later product
    • Began with ~500k
    • Better assay and software design (originally)
  – Cost issues
  – SNP coverage relatively constant
    • Always based on HapMap
  – WGA DNA discouraged

Page 83: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Genotype Clustering

Page 84: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Clustering continued

• Low-plex assays
  – Usually done by eye by a technician
  – Can be labor intensive and subject to user bias
• High- and mega-plex assays
  – Usually computer assisted or completely automated
  – Less labor intensive but subject to clustering errors

Page 85: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Best case scenario

Page 86: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Software clustering

[Figure: genotype cluster plot for rs4804195, Norm R (0 to 4) against Norm Theta (0 to 1); cluster counts 1049, 1566 and 459]

Page 87: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Human clustering

[Figure: genotype cluster plot for rs4804195, Norm R (0 to 4) against Norm Theta (0 to 1); cluster counts 1088, 1549 and 491]

Page 88: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

What would you do with this?

[Figure: genotype cluster plot for rs6451182, Norm R (0 to 4) against Norm Theta (0 to 1); cluster counts 17, 2885 and 225]

Page 89: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Or this?

[Figure: genotype cluster plot for rs598558, Norm R (-0.20 to 1.60) against Norm Theta (0 to 1); cluster counts as extracted: 3 38286]

Page 90: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

So you want to genotype?!?

• Three main things to consider
  – Number of SNPs
  – Number of samples
  – Budgetary considerations
• Minor considerations
  – DNA source
  – DNA quantity/quality

Page 91: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

So you want to genotype?!?

Biotrove

Page 92: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Budgetary Considerations

• Low plex
  – Per SNP cost normally doesn’t decrease as the number of SNPs increases
  – Per genotype cost may decrease as sample size increases
  – Overall study cost can be low ($Ks)
• Higher plex
  – Per SNP cost decreases as you get closer to the maximum plex
  – Per genotype cost decreases drastically as plex goes up
  – Overall study cost can be high ($Ms)

Page 93: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

And the BIG question

[Figure: “Genotype Costs” chart by assay and scale, showing cost per well (axis 0 to 800) and cost per genotype (axis 0 to 1.2)]

Page 94: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

How to minimize costs while maximizing genotyping

• Genotype the right number of samples
  – Fill plates
  – Find the sweet spot in assay ordering
• Genotype the right number of SNPs
  – If you can fill the beads, your per genotype cost goes down without drastically increasing the total cost

Page 95: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Quality control

• Low-plex
  – Blinded QC
    • Repeated samples
    • ~10% of the total sample size
  – >95% completion rate
  – Easy to repeat individual plates to correct any errors
• High- to mega-plex
  – Blinded QC
    • One or two samples per plate
    • Same samples on every plate
  – Set both SNP and sample completion rates
  – Not easy to repeat plates

Page 96: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Data overload

• Low-plex
  – Data trickles in
  – Little need for elaborate databases
    • Assay description
      – Primer/probe sequence
      – Alleles
    • SNP description
      – Locus (usually rs# sufficient)
  – Relational db for genotype data
    • ID x genotype

Page 97: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Data overload

• High- to mega-plex
  – Data deluge
    • Up to 1M SNPs worth of data at once
  – Annotation of SNPs
    • Assay characteristics
    • SNP characteristics
  – Large sample sizes
    • 1536 x 1000 samples is over 1.5 million data points

Page 98: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Data analysis issues

• Low-plex
  – Often <10 SNPs per study
    • Easy/quick to analyze
    • Data presentation simple
    • Data archival simple
  – Multiple comparison issues have largely been ignored
• High- to mega-plex
  – Massive data sets
    • Even summary stats for all the SNPs take hours
    • Need to be able to access individual SNPs as well
    • Presenting 1536 to 1M SNPs worth of data is a challenge
  – Multiple comparison issues more obvious

Page 99: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

In summary

• Genotyping is now a numbers game
  – Methods are VERY accurate
  – Budgets are tighter
• Considerations
  – Number of SNPs
  – Number of samples
  – Quantity/Quality of DNA
• Feel free to contact me regarding DNA sources etc.

Page 100: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Online resources (and slide sources)

• Taqman (appliedbiosystems.com)
• iPLEX (sequenom.com)
• Illumina products (Illumina.com)
• Affymetrix products (Affymetrix.com)

Page 101: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Genotyping Quality Control

P Kraft

Page 102: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Quality control

Method of assessment      High quality standard
Completion rate           > 95% complete. High failure rate correlated with high error rate
Reproducible genotypes    Repeat genotyping of random 5% sample has <1% discordance
Hardy Weinberg            Single loci: no significant deviations or small magnitude of deviation
                          Multiple loci: no more deviations than expected (q-q plot), no consistent trend (e.g. all undercalling hets)
Non-paternity             Where family data available: remove all non-paternities

See Leal (2005) Genet Epidemiol, Cox & Kraft (2006) Hum Hered, Abecasis (2005) Am J Hum Genet

Page 103: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Testing for departure from HWE

Genotype    aa    Aa    AA
Count       N0    N1    N2

Observed A allele frequency: $p = (2N_2 + N_1)/(2N)$, where $N = N_2 + N_1 + N_0$

Pearson’s chi-square test for departures from HWE:

$$X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(N_2 - Np^2)^2}{Np^2} + \frac{(N_1 - 2Np(1-p))^2}{2Np(1-p)} + \frac{(N_0 - N(1-p)^2)^2}{N(1-p)^2}$$

This should be compared to a central chi-squared distribution with one degree of freedom

Implemented e.g. in SAS GENETICS – PROC ALLELE
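For readers without SAS, the same Pearson test takes only a few lines. The sketch below computes the allele frequency, the expected genotype counts under HWE, the chi-square statistic and its one-degree-of-freedom p-value using only the standard library. The function name and the example counts are illustrative assumptions, and the exact test mentioned later in the lecture is generally preferred when genotype counts are small.

```python
import math


def hwe_chi_square(n_aa, n_Aa, n_AA):
    """Pearson chi-square test for departure from Hardy-Weinberg proportions.

    Returns (p, chi2, p_value), where p is the observed frequency of allele A.
    Assumes the SNP is polymorphic (0 < p < 1), otherwise expected counts are zero.
    """
    n = n_aa + n_Aa + n_AA
    p = (2 * n_AA + n_Aa) / (2 * n)
    observed = (n_aa, n_Aa, n_AA)
    expected = (n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value


# Hypothetical counts with a deficit of heterozygotes: aa = 30, Aa = 40, AA = 30
p, chi2, p_value = hwe_chi_square(30, 40, 30)
print(p, chi2, p_value)
```

Here p = 0.5, so 25, 50 and 25 subjects are expected in the three genotype classes, and the heterozygote deficit gives a chi-square of 4 on one degree of freedom.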

Page 104: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

http://www.sph.umich.edu/csg/abecasis/Exact/index.html

Page 105: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271
Page 106: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

$\delta = (P_{Aa} - 2 p_A p_a) / (2 p_A p_a)$

Example from BPC3

Page 107: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

9 of 22 tests significant at .05 level!
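To see why nine significant results out of 22 is alarming, compare it with what chance alone would produce. If all 22 SNPs were in HWE and the tests were independent (an assumption; SNPs in LD with each other are not independent), the number of tests significant at the .05 level would be binomial with mean 22 × 0.05 = 1.1. The sketch below computes the probability of seeing nine or more; the function name is my own.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


expected_significant = 22 * 0.05       # about 1.1 tests expected by chance
tail = binomial_tail(9, 22, 0.05)      # chance of 9 or more under the null
print(expected_significant, tail)
```

The tail probability is well below one in a million, so an excess of this size points to a systematic problem such as genotyping error or population structure rather than to chance.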

Page 108: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

quantile of Chi-square distribution

A Q-Q plot compares two distributions by plotting their quantiles against each other.

Here it is useful for assessing the similarity between the observed distribution of test statistics and the theoretical null distribution. Points should lie on the y=x diagonal!

Page 109: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

[Figure panels: “Good”, “Better than before”, “Admixture?”]

Page 110: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

log-log quantile plot of p-value for Hardy-Weinberg proportion

[Figure: log10(p-value) against log10(quantile); exact test, 299 779 SNPs; observed values compared with 20 simulations. Annotations: expected 244, observed 586; expected 2600, observed 3340]

CGEMS prostate cancer whole genome scan: phase 1a, slide courtesy of G Thomas

Page 111: Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

Statistical Methods to Handle Errors

• Family-based:
  – AE-TDT: Models both missing parental data and genotype errors
    » Am J Hum Genet. 2001 Aug;69(2):371-80.
    » Eur J Hum Genet. 2004 Sep;12(9):752-61.
  – Likelihood with nuisance parameters
    » Genet Epidemiol. 2004 Feb;26(2):142-54.
  – Bayesian
    » Genet Epidemiol. 2004 Jan;26(1):70-80.

• Case-control:
    » Rice & Holmans (2003) Ann J Hum Genet 29:204
    » Gordon et al. (2004) Stat App Genet Molec Biol 3
  – Need locus-specific error rates
    » Difficult to get for high-throughput platforms

Family-based designs: nondifferential genotyping error can lead to inflated Type I error rates!

Case-control designs: nondifferential genotyping error does not generally lead to inflated Type I error rates, but can lead to loss of power, bias away from null