Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome

Gururaj p


Page 1: SNP

Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome

Gururaj p

Page 2: SNP

Overview

► Biological Background

► Terminology

► SNP related general information

► SNP detection techniques

► SNP Applications

► References

Page 3: SNP

Biological Background

► How can researchers hope to identify and study all the changes that occur in so many different diseases?

► How can they explain why some people respond to treatment and not others?

Page 4: SNP

‘SNP’ is the answer to these questions…

► So what exactly are SNPs?

► How are they involved in so many different aspects of health?

Page 5: SNP

What is a SNP?

► A SNP is defined as a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population.
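As a rough illustration of the 1 percent threshold, the sketch below counts the alleles seen at one site across a sample and checks whether the less common allele exceeds the cutoff. The function names and the toy sample are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Return the frequency of the second most common allele at one site."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0  # only one allele observed: the site is monomorphic
    return counts[1][1] / len(alleles)

def is_snp(alleles, threshold=0.01):
    """Apply the 'more than 1 percent' rule from the definition above."""
    return minor_allele_frequency(alleles) > threshold

# Toy sample (hypothetical): 200 chromosomes, 6 of which carry 'T' instead of 'C'.
sample = ["C"] * 194 + ["T"] * 6
print(minor_allele_frequency(sample))  # 0.03
print(is_snp(sample))                  # True
```

A real survey would need a large, representative sample before the 1 percent rule means much; the toy numbers only show the arithmetic.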

Page 6: SNP

Variations in Genome

Page 7: SNP

Terminology

► Polymorphism

► Linkage Disequilibrium
   - Correlation of character states among polymorphic sites
   - Insufficient passage of time to randomize character states by meiotic recombination

► Haplotype
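One common way to put a number on the "correlation of character states" between two polymorphic sites is the coefficient D = p(AB) − p(A)·p(B), together with its normalized form r². The sketch below computes both from phased two-site haplotypes. The function name and the toy data are illustrative assumptions, not part of the slides.

```python
def linkage_disequilibrium(haplotypes, allele_a, allele_b):
    """Return (D, r_squared) for two sites.

    haplotypes: list of 2-character strings, one per chromosome,
                giving the allele at site 1 followed by the allele at site 2.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == allele_a + allele_b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2

# Toy data (hypothetical): the two sites always travel together ...
linked = ["AG"] * 60 + ["CT"] * 40
# ... versus alleles fully randomized by recombination.
shuffled = ["AG"] * 25 + ["AT"] * 25 + ["CG"] * 25 + ["CT"] * 25

print(linkage_disequilibrium(linked, "A", "G"))
print(linkage_disequilibrium(shuffled, "A", "G"))
```

In the first sample r² is essentially 1 (complete LD); in the second both D and r² are 0, which is what enough meiotic recombination would eventually produce.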

Page 8: SNP

Some Facts

► In human beings, 99.9 percent of bases are the same.

► The remaining 0.1 percent makes a person unique.
   - Different attributes / characteristics / traits: how a person looks, diseases he or she develops.

► These variations can be:
   - Harmless (change in phenotype)
   - Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia)
   - Latent (variations found in coding and regulatory regions that are not harmful on their own; the change in each gene only becomes apparent under certain conditions, e.g. susceptibility to lung cancer)

Page 9: SNP

SNP facts

► SNPs are found in coding and (mostly) noncoding regions.

► They occur with a very high frequency: about 1 in 1000 bases to 1 in 100 to 300 bases.

► The abundance of SNPs and the ease with which they can be measured make these genetic variations significant.

► SNPs close to a particular gene act as markers for that gene.

► SNPs in coding regions may alter the structure of the protein made by that coding region.

Page 10: SNP

SNPs may / may not alter protein structure
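Whether a coding-region SNP changes the protein depends on how the altered codon is translated. The sketch below builds the standard genetic code and classifies a single-base change as synonymous (same amino acid), missense (different amino acid) or nonsense (stop codon). These three labels and the helper name `classify_coding_snp` are standard terminology and an illustrative name respectively; neither appears on the slide.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_coding_snp(codon, position, new_base):
    """Return (amino acid before, amino acid after, kind of change)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        kind = "synonymous"
    elif after == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return before, after, kind

print(classify_coding_snp("GAG", 2, "A"))  # ('E', 'E', 'synonymous')
print(classify_coding_snp("GAG", 1, "T"))  # ('E', 'V', 'missense')
print(classify_coding_snp("TGG", 2, "A"))  # ('W', '*', 'nonsense')
```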

Page 11: SNP

SNPs act as gene markers

Page 12: SNP

SNP maps

► Sequence genomes of a large number of people.

► Compare the base sequences to discover SNPs.

► Generate a single map of the human genome containing all possible SNPs => SNP maps
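The compare-and-discover step can be sketched on toy data: given sequences from several people that are already aligned to the same coordinates, report every position where more than one base is observed. The function name and sequences below are made up for illustration, and real pipelines must also handle alignment, sequencing errors and the frequency threshold.

```python
def find_variable_sites(sequences):
    """Return {position: set of bases} for columns where aligned sequences differ."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        bases = {s[pos] for s in sequences}
        if len(bases) > 1:
            sites[pos] = bases
    return sites

# Toy aligned sequences (hypothetical), one per person; positions are 0-based.
people = ["ACGTTAGC", "ACGTTAGC", "ACATTAGC", "ACGTTCGC"]
print(sorted(find_variable_sites(people)))  # [2, 5]
```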

Page 13: SNP

SNP Maps

Page 14: SNP

SNP Profiles

► The genome of each individual contains a distinct SNP pattern.

► People can be grouped based on their SNP profile.

► SNP profiles are important for identifying response to drug therapy.

► Correlations might emerge between certain SNP profiles and specific responses to treatment.
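Grouping people by SNP profile can be pictured as bucketing individuals whose alleles agree at a chosen set of SNP positions. The sketch below is a deliberately simple version with hypothetical person IDs and three-SNP profiles; it is not a method described in the slides.

```python
from collections import defaultdict

def group_by_profile(profiles):
    """Map each distinct SNP profile to the list of people who carry it."""
    groups = defaultdict(list)
    for person, profile in profiles.items():
        groups[profile].append(person)
    return dict(groups)

# Hypothetical profiles: the allele each person carries at three SNP positions.
profiles = {"P1": "AGT", "P2": "AGT", "P3": "CGT", "P4": "AGC"}
print(group_by_profile(profiles))
# {'AGT': ['P1', 'P2'], 'CGT': ['P3'], 'AGC': ['P4']}
```

Once such groups exist, treatment outcomes recorded for each person could be tabulated per group to look for the correlations mentioned above.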

Page 15: SNP

SNP Profiles

Page 16: SNP

Techniques to detect known Polymorphisms

► Hybridization Techniques
   - Micro arrays
   - Real time PCR

► Enzyme based Techniques
   - Nucleotide extension
   - Cleavage
   - Ligation
   - Reaction product detection and display

► Comparison of Techniques used

Page 17: SNP

Hybridization Techniques

► Micro Arrays
   - ‘Sequencing by hybridization’
   - Utilize a set of ‘tiling’ oligonucleotides
   - Somewhat complex pooling and processing of PCR amplicons that are subsequently hybridized to a DNA micro array and visualized
   - Theoretically capable of genotyping thousands of polymorphisms simultaneously
   - Success rate 97% (somewhat low for this kind of analysis)
   - High false rates (11–21%)
   - Design and fabrication of micro arrays is expensive, hence users are confined to the set of genotypes established by the manufacturer

Page 18: SNP

► Real Time PCRs
   - Utilizes TaqMan™ DNA probes to detect PCR products in real time.
   - The TaqMan™ probe contains a fluorescent reporter at the 5' end and a fluorescence resonance energy transfer (FRET) moiety at the 3' end, which quenches the fluorescent signal of the reporter.
   - The probe sequence is complementary to the PCR amplicon and is designed to anneal at the extension temperature.
   - During extension, the 5'→3' exonuclease activity of Taq DNA polymerase I cleaves the probe, emitting signal due to the separation of the reporter from the quencher.
   - Polymorphism is determined solely by hybridization and not by the ability of the enzyme to discriminate.
   - Because the enzyme does not confer specificity in detection, this technique is classified as hybridization-based.
   - Depending on the optical thermocycler platform, 384 reactions can be monitored at each cycle without removing any sample.
   - Amenable to robotic automation.

Page 19: SNP

Real Time PCRs

Page 20: SNP

Enzyme based Techniques

► Nucleotide extension
   - Simplest technique for known polymorphism detection.
   - Existing in numerous variations (also known as minisequencing, SNuPE, GBA, APEX, AS-PE capture, FNC, TDI or PROBE), this assay typically involves the single base extension of an oligonucleotide by a polymerase.
   - The oligonucleotide is designed to anneal immediately upstream of the polymorphism locus, and differentially labeled fluorescent dideoxynucleotides are utilized as substrates for polymerase extension.
   - The fluorescent signal emitted corresponds to the nucleotide incorporated and thus the sequence of the polymorphism.
   - Offers simplicity and accuracy in distinguishing between heterozygous and homozygous genotypes.
   - Targets need to be PCR amplified, and PCR reagents must be removed.
   - False negatives can arise from mis-priming.
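The logic of reading a genotype from a single base extension can be sketched as follows: the polymerase adds the dideoxynucleotide complementary to the template base at the polymorphic position, so one fluorescent signal indicates a homozygote and two signals a heterozygote. This is an idealized model with hypothetical function names; it ignores mis-priming and signal noise.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def extended_base(template_base):
    """Dideoxynucleotide the polymerase adds opposite the template base."""
    return COMPLEMENT[template_base]

def call_genotype(template_alleles):
    """Return (signals, zygosity) for the two template bases at the SNP.

    template_alleles: the base at the polymorphic position on each of the
                      two chromosome copies, e.g. "GG" or "GA".
    """
    signals = sorted({extended_base(b) for b in template_alleles})
    zygosity = "homozygous" if len(signals) == 1 else "heterozygous"
    return signals, zygosity

print(call_genotype("GG"))  # (['C'], 'homozygous')
print(call_genotype("GA"))  # (['C', 'T'], 'heterozygous')
```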

Page 21: SNP

Nucleotide Extension

Page 22: SNP

► Cleavage
   - The Invader™ assay utilizes the exonuclease activity of Cleavase VIII on overlapping oligonucleotide strands.
   - Two oligonucleotides, an ‘invader’ probe and either a wild-type or mutant primary probe, overlap each other at a single nucleotide position on the template only if they are complementary to the polymorphism being queried.
   - Cleavage occurs when the specific overlapping conformation is present, freeing an oligonucleotide referred to as a ‘flap’.
   - This flap can be detected in a multiplex manner by size, mass or sequence.
   - Commonly the flap participates in a second cleavage assay with another complementary target, causing release of a fluorescent signal.
   - Advantage - the same flap may bind to many targets, generating a cascading signal amplification and thereby obviating the need for PCR amplification.
   - Single-tube one-step reaction.

Page 23: SNP

Cleavage

Page 24: SNP

► Ligation
   - One of the most specific assays due to the high specificity of T4 ligase (oligo ligation assay) and the even higher specificity of thermostable ligases (ligation detection reaction, LDR).
   - Two primers are designed to anneal adjacent to one another on the target of interest.
   - Generally, the upstream primer (discriminating primer) contains a fluorescent label at the 5' end, with the 3' nucleotide overlapping the polymorphic base.
   - The fluorescent signal corresponds to the allele being queried at the 3' position of the discriminating primer.
   - When the discriminating primer forms a perfect complement with the target at the junction, the ligase covalently attaches the adjacent downstream primer (common primer).
   - The resulting product is approximately twice as long as each of the individual primers and can be easily monitored for detection by means of capillary electrophoresis or by display on a microarray.
   - Advantage – very good sensitivity and specificity.
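The allele discrimination step of a ligation assay can be mimicked with plain string matching: a product forms only when the 3' base of the discriminating primer matches the polymorphic base and the common primer sits immediately downstream. To keep the sketch short, the target is written in the same sense as the primers, which is a simplifying assumption; the function name and sequences are hypothetical.

```python
def ligation_product(target, snp_pos, discriminating, common):
    """Return the ligated product, or None if the primers would not be joined.

    target is written in the same sense as the primers (a simplification),
    so 'annealing' is modelled as exact string matching. The 3' (last) base
    of the discriminating primer lies on the polymorphic position snp_pos.
    """
    start = snp_pos - len(discriminating) + 1
    if start < 0:
        return None  # primer would hang off the end of the target
    upstream_site = target[start:snp_pos + 1]
    downstream_site = target[snp_pos + 1:snp_pos + 1 + len(common)]
    if discriminating == upstream_site and common == downstream_site:
        return discriminating + common
    return None  # a mismatch at the junction (or elsewhere) blocks ligation

target = "GATTACAGGCTTAACG"   # toy target carrying 'G' at position 7
g_primer = "TACAG"            # discriminating primer for the G allele
a_primer = "TACAA"            # discriminating primer for the A allele
common = "GCTTA"              # common downstream primer

print(ligation_product(target, 7, g_primer, common))  # TACAGGCTTA
print(ligation_product(target, 7, a_primer, common))  # None
```

The product for the matching allele is twice the primer length (10 bases from two 5-base primers), which is the size difference the detection step looks for.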

Page 25: SNP

Techniques to detect unknown Polymorphisms

► Direct Sequencing

► Microarray

► Cleavage / Ligation

► Electrophoretic mobility assays

► Comparison of Techniques used

Page 26: SNP

Direct Sequencing

► Sanger dideoxy sequencing can detect any type of unknown polymorphism and its position when the majority of the DNA contains that polymorphism.

► It misses polymorphisms and mutations when the DNA is heterozygous.

► It has limited utility for analysis of solid tumors or pooled samples of DNA due to low sensitivity.

► Once a sample is known to contain a polymorphism in a specific region, direct sequencing is particularly useful for identifying the polymorphism and its specific position.

► Even if the identity of the polymorphism cannot be discerned in the first pass, multiple sequencing attempts have proven quite successful in elucidating sequence and position information.

Page 27: SNP

Microarray

► Variation detection arrays (VDA) scan large sequence blocks and identify regions containing unknown polymorphisms.

► This methodology suffers from the same limitations in fabrication and design as observed in known polymorphism analysis, but has demonstrated much greater success in the context of unknown polymorphism detection for both SNP and tumor analysis.

► With respect to SNP analysis, a recent study of chromosome 21 successfully identified approximately half of the estimated number of common SNPs (frequency of 10–50%) across the entire chromosome.

► The experimental design required a sacrifice in sensitivity in order to minimize false positives.

► This explains the decrease in successful identification from 80 to 50%.

Page 28: SNP

Cleavage/Ligation

► Unknown polymorphisms can also be identified by the cleavage of mismatches in DNA–DNA heteroduplexes.

► This can be achieved either chemically (chemical cleavage method, CCM) or enzymatically (T4 endonuclease VII, MutY cleavage or Cleavase).

► Typically, at least two samples are PCR amplified (one sample can be sufficient for solid tumor samples with high levels of stromal contamination), denatured and then hybridized to create DNA–DNA heteroduplexes of the variant strands.

► Enzymes cleave adjacent to the mismatch, and the products are resolved via gel or capillary electrophoresis.

► Unfortunately, the cleavage enzymes often nick complementary regions of DNA as well. This increases background noise, lowers specificity, and reduces the pooling capacity of the assay.
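The readout of a mismatch cleavage assay is a set of fragment sizes, from which the position of the unknown polymorphism can be inferred. The sketch below models an idealized case in which the enzyme cuts exactly at each mismatch and nowhere else, so it leaves out the non-specific nicking noted above. The function name and the two toy amplicons are assumptions for illustration.

```python
def cleavage_fragments(reference, variant):
    """Return (mismatch_positions, fragment_lengths) for an idealized heteroduplex.

    Both strands are written in the same sense and must have equal length.
    A cut is placed immediately before every mismatched position (0-based).
    """
    if len(reference) != len(variant):
        raise ValueError("amplicons must have the same length")
    mismatches = [i for i, (a, b) in enumerate(zip(reference, variant)) if a != b]
    cuts = [0] + mismatches + [len(reference)]
    fragments = [end - start for start, end in zip(cuts, cuts[1:]) if end > start]
    return mismatches, fragments

print(cleavage_fragments("ACGTACGTAC", "ACGTTCGTAC"))  # ([4], [4, 6])
print(cleavage_fragments("ACGTACGTAC", "ACGTACGTAC"))  # ([], [10])
```

An uncut, full-length product therefore indicates no detectable variant, while two shorter fragments point to the mismatch position.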

Page 29: SNP

Cleavage / Ligation

Page 30: SNP

SNP Applications

► Gene discovery and mapping

► Association-based candidate polymorphism testing

► Diagnostics/risk profiling

► Response prediction

► Homogeneity testing/study design

► Gene function identification

Page 31: SNP

High-resolution haplotype structure in the human genome

Mark J. Daly, John D. Rioux, Stephen F. Schaffner, Thomas J. Hudson & Eric S. Lander

Page 32: SNP

Abstract

► The authors describe a high-resolution analysis of the haplotype structure across 500 kb on chromosome 5q31 using 103 SNPs in a European-derived population.

► They developed an analytical model for linkage disequilibrium (LD) mapping based on high-resolution haplotype blocks, which offers a coherent framework for creating a haplotype map of the human genome.

Page 33: SNP

Data used

► 500 kb region on human chromosome 5q31 that is implicated as containing a genetic risk factor for Crohn disease.
   - Rioux, J. D. et al. Hierarchical linkage disequilibrium mapping of a susceptibility gene for Crohn's disease to the cytokine cluster on chromosome 5. Nature Genet. 29, 223-228 (2001)

► 103 common (>5% minor allele frequency) SNPs genotyped from a European-derived population. The study describes 258 chromosomes transmitted to individuals with Crohn disease and 258 untransmitted chromosomes.

Page 34: SNP

Data used

► The genotype data used in the study provide the highest-resolution picture of the patterns of genetic variation across a large genomic region, with a marker density of 1 SNP roughly every 5 kb.
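The quoted marker density follows directly from the figures given for the data set (a 500 kb region and 103 SNPs). A quick check, with variable names chosen only for this example:

```python
region_kb = 500   # length of the 5q31 region, from the slides
n_snps = 103      # number of genotyped SNPs, from the slides

spacing_kb = region_kb / n_snps
print(round(spacing_kb, 2))  # 4.85, i.e. roughly one SNP every 5 kb
```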

Page 35: SNP

Study

► The focus was on identifying the underlying haplotypes.

► The authors' initial focus was on untransmitted control chromosomes; however, the same haplotype structure was seen in the chromosomes transmitted to individuals with Crohn disease, with the only difference being that one of the haplotypes was enriched in frequency, reflecting its association with Crohn disease.

Page 36: SNP

Study

► It became evident during the study that the region could be largely decomposed into discrete haplotype blocks, each with a lack of diversity.

► As the haplotype block structure was the same in both groups, they presented combined data from all chromosomes (transmitted and untransmitted).

Page 37: SNP

Haplotype block structure on 5q31

Page 38: SNP

Haplotype block structure on 5q31

a. Common haplotype patterns in each block of low diversity. Dashed lines indicate locations where more than 2% of all chromosomes are observed to transition from one common haplotype to a different one.

Page 39: SNP

Haplotype block structure on 5q31

b. Percentage of observed chromosomes that match one of the common patterns exactly (total chromosomes = 258 transmitted + 258 untransmitted).

Page 40: SNP

Haplotype block structure on 5q31

c. Percentage of each of the common patterns among 258 untransmitted chromosomes.

Page 41: SNP

Haplotype block structure on 5q31

d. Rate of haplotype exchange between the blocks as estimated by the HMM.

Page 42: SNP

Haplotype block structure on 5q31

- The haplotype blocks span up to 100 kb and contain multiple (five or more) common SNPs.

- The blocks have only a few (2-4) haplotypes, which show no evidence of being derived from one another by recombination, and which account for nearly all chromosomes (>90%) in all cases in the sample.

Page 43: SNP

Haplotype block structure on 5q31

For example, an 84 kb block shows only two distinct haplotypes that together account for 95% of the observed chromosomes (Table 1).
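The "percentage of chromosomes accounted for by common haplotypes" can be computed by counting how often each haplotype occurs within a block. The sketch below uses invented five-SNP haplotypes, and the 5% cutoff for calling a haplotype common is an assumption of this example; only the 95% total echoes the figure quoted above.

```python
from collections import Counter

def common_haplotype_coverage(block_haplotypes, min_frequency=0.05):
    """Return ({common haplotype: frequency}, total fraction they cover)."""
    n = len(block_haplotypes)
    counts = Counter(block_haplotypes)
    common = {h: c / n for h, c in counts.items() if c / n >= min_frequency}
    return common, sum(common.values())

# Toy block (hypothetical): 100 chromosomes, two dominant haplotypes, five rare ones.
block = (["AAGTC"] * 57 + ["GGACT"] * 38 +
         ["AAGCT", "GGATC", "AGACT", "GAGTC", "AGGTC"])

common, coverage = common_haplotype_coverage(block)
print(sorted(common))       # ['AAGTC', 'GGACT']
print(round(coverage, 2))   # 0.95
```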

Page 44: SNP

Study

► The discrete blocks are separated by intervals in which several independent historical recombination events seem to have occurred, giving rise to greater haplotype diversity for regions spanning the blocks.

► The most common recombination events are indicated in the previous figure by lines connecting the haplotypes.

► The recombination events appear to be clustered; multiple obligate exchanges must have occurred between most blocks, with little or no exchange within blocks.

Page 45: SNP

Study

► Although there is detectable recombination between blocks, it is modest enough for there to be clear long-range correlation (that is, LD) among blocks.

► The haplotypes at the various blocks can be readily assigned to one of the four ancestral long-range haplotypes.

► Indeed, 38% of the chromosomes studied carried one of these four haplotypes across the entire length of the region.

Page 46: SNP

Study

► Using a hidden Markov model (HMM), they developed an approach to define the block structure formally.

► The HMM simultaneously assigns every position along each observed chromosome to one of the four ancestral haplotypes and estimates the maximum-likelihood values of the ‘historical recombination frequency’ (Θ) between each pair of markers.

Page 47: SNP

Study

► The quantity Θ provides a convenient summary of the degree of haplotype exchange across inter-marker intervals and relates directly to conventional measures of LD.

► In this study, Θ is estimated at less than 1% for 73 of the inter-marker intervals, 1-4% for 14 of the intervals, and more than 4% for only 9 of the intervals.

Page 48: SNP

Methods: Individuals and marker selection

► The individuals studied, Canadians from metropolitan Toronto of predominantly European descent, and the genotyping methodologies are described in the paper:
   - Rioux, J. D. et al. Hierarchical linkage disequilibrium mapping of a susceptibility gene for Crohn's disease to the cytokine cluster on chromosome 5. Nature Genet. 29, 223-228 (2001)

► To ensure the ability to reconstruct multi-marker haplotypes, SNPs for haplotype analysis were selected from the set of markers for which full genotypes were available for all members.

► SNPs at CpG sites were not included to prevent potential confounding of common haplotype patterns from recurrent mutations.

Page 49: SNP

Methods: Haplotype counting

► Haplotype percentages in the ‘Haplotype block structure on 5q31’ figure were computed using haplotypes generated by the transmission disequilibrium test (TDT) implementation in Genehunter 2.0 (ref. 22 in the paper), followed by use of an EM-type algorithm (refs. 23, 24 in the paper) to include the minority of chromosomes that had one or more markers with ambiguous phase or where one marker was missing genotype data.

Page 50: SNP

Methods: Hidden Markov model

► The observation that over long distances most haplotypes can be described as belonging to one of a small number of common haplotype categories suggested the use of an HMM in which haplotype categories were defined as states.

► The authors assigned observed chromosomes to those hidden states and simultaneously estimated the transition probability in each map interval by using an EM algorithm and by making the simplifying assumption that there was a single transition probability for each map interval rather than allowing specific transition probabilities from each state to each state.

► The output of this method was a maximum-likelihood assignment to haplotype category at each position and ML estimates of Θ, indicating how significantly recombination has acted to increase haplotype diversity in each map interval.
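To make the state-assignment idea concrete, the sketch below runs a Viterbi decoder over one chromosome, with the ancestral haplotypes as hidden states and one switch probability per map interval. It is not the authors' implementation: the per-interval probabilities are supplied rather than estimated by EM, and the allele mismatch rate `error`, the even split of the switch probability among the other states, and all names are assumptions of this example.

```python
import math

def viterbi_mosaic(observed, ancestral, theta, error=0.01):
    """Most likely ancestral-haplotype label at each marker of one chromosome.

    observed:  string of alleles along the chromosome
    ancestral: list of K >= 2 reference haplotype strings (the hidden states)
    theta:     len(observed) - 1 switch probabilities, each strictly in (0, 1)
    error:     assumed chance that an allele differs from its state's allele
    """
    k = len(ancestral)

    def emit(state, pos):
        match = ancestral[state][pos] == observed[pos]
        return math.log(1 - error if match else error)

    score = [math.log(1 / k) + emit(s, 0) for s in range(k)]
    back = []
    for pos in range(1, len(observed)):
        stay = math.log(1 - theta[pos - 1])
        switch = math.log(theta[pos - 1] / (k - 1))  # one probability per interval
        new_score, pointers = [], []
        for s in range(k):
            best = max(range(k), key=lambda p: score[p] + (stay if p == s else switch))
            pointers.append(best)
            step = stay if best == s else switch
            new_score.append(score[best] + step + emit(s, pos))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

ancestral = ["AAAAAA", "GGGGGG"]              # toy ancestral haplotypes
theta = [0.001, 0.001, 0.2, 0.001, 0.001]     # a recombination-prone middle interval
print(viterbi_mosaic("AAAGGG", ancestral, theta))  # [0, 0, 0, 1, 1, 1]
```

The decoded path switches state in the interval with the high switch probability, which is the behaviour that lets block boundaries show up as intervals with elevated Θ.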

Page 51: SNP

Discussion of Study

► The region of chromosome 5q31 may be largely divided into discrete blocks of 10-100 kb; each block has only a few common haplotypes; and the haplotype correlation between blocks gives rise to long-range LD.

► Focusing on haplotype blocks greatly clarifies LD analyses. Once the haplotype blocks are identified, they can be treated as alleles and tested for LD (instead of single-marker analyses of LD).

Page 52: SNP

Discussion of Study

► In analogous fashion, the haplotype structure provides a crisp approach for testing the association of genomic segments with disease. By contrast, disease association studies traditionally involve testing individual SNPs in and around a gene.

►Once the haplotype blocks are defined, it is straightforward to examine a subset of SNPs that uniquely distinguish the common haplotypes in each block. This allows the common variation in a gene to be tested exhaustively for association with disease.
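Choosing a subset of SNPs that uniquely distinguishes the common haplotypes of a block can be illustrated with a brute-force search. This is only a sketch under stated assumptions: the function name and the example haplotypes are hypothetical, and exhaustive search is practical only for small blocks.

```python
from itertools import combinations

def minimal_distinguishing_snps(haplotypes):
    """Return the smallest tuple of SNP indices whose alleles give every
    haplotype in the block a different signature, or None if two of the
    haplotypes are identical."""
    n_snps = len(haplotypes[0])
    for size in range(1, n_snps + 1):
        for subset in combinations(range(n_snps), size):
            signatures = {tuple(h[i] for i in subset) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return subset
    return None

# Four hypothetical common haplotypes over five SNPs.
block = ["AACGT", "AACCT", "GACGA", "GTCCA"]
print(minimal_distinguishing_snps(block))  # prints (0, 3)
```

Here genotyping only the first and fourth SNPs is enough to tell the four haplotypes apart, so the remaining three SNPs add no information about which common haplotype a chromosome carries.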

Page 53: SNP

Discussion of Study

►This approach provides a precise framework for creating a comprehensive haplotype map of the human genome.

►By testing a sufficiently large collection of SNPs, it should be possible to define all of the common haplotypes underlying blocks of LD. Once such a map is created, it will be possible to select an optimal reference set of SNPs for any subsequent genotyping study.

►This detailed understanding of common human variation represents an important step in the Human Genome Project.

Page 54: SNP

Linkage Disequilibrium

► Uses unrelated individuals

►Good for fine-scale mapping because there is greater opportunity for recombination to occur.

►Map of loci that contribute to inherited genetic disorders

► States cannot be considered independent because they are related by distance and recombination, so individual haplotypes may not be the cause of disease, but rather a combination of several haplotypes in blocks

Page 55: SNP

Linkage Disequilibrium

►The greater the distance between genes, the greater the chance of recombination

►The smaller the distance between genes, the smaller the chance of recombination

►Knowing the above and observing inherited alleles, one can estimate the relative distance between genes

Page 56: SNP

Measures of Linkage Disequilibrium

►cM – centiMorgans
►50 cM would mean that, to a first approximation, two genes have a 50% chance of recombination occurring; the genes are relatively far apart
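The rule that 1 cM corresponds to about a 1% chance of recombination is a short-distance approximation. As a hedged illustration, the sketch below converts map distance to recombination fraction using Haldane's map function, which is an assumption introduced here (the slides do not name a map function) and presumes crossovers occur without interference.

```python
import math

def haldane_recombination_fraction(distance_cm):
    """Recombination fraction for a map distance in centiMorgans,
    using Haldane's map function r = (1 - exp(-2d)) / 2 with d in Morgans."""
    morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))

for cm in (1, 10, 50):
    print(cm, round(haldane_recombination_fraction(cm), 4))
# prints:
# 1 0.0099
# 10 0.0906
# 50 0.3161
```

Under this model the recombination fraction never exceeds 0.5, so widely separated genes behave as if they were unlinked no matter how large the map distance becomes.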

Page 57: SNP

Importance of Linkage Disequilibrium

►Offers us a way to measure the distance between genes.

►Non-random
►Measure of relation between markers and disease mutations.
►Possibly used to map disease genes because high LD areas would be related to recombination and formation of new alleles
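The strength of the non-random association between two biallelic sites is commonly summarized by the coefficient D and its normalized forms D' and r². The sketch below computes them from haplotype and allele frequencies; the function name and the example frequencies are illustrative assumptions, and both sites are assumed to be polymorphic (otherwise r² is undefined).

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab -- frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a  -- frequency of allele A
    p_b  -- frequency of allele B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Complete association: allele A is always found together with allele B.
print(ld_measures(0.5, 0.5, 0.5))   # prints (0.25, 1.0, 1.0)
# Independence: the haplotype frequency equals the product of allele frequencies.
print(ld_measures(0.25, 0.5, 0.5))  # prints (0.0, 0.0, 0.0)
```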

Page 58: SNP

Data Mining Applied to Linkage Disequilibrium Mapping

► HPM - Haplotype Pattern Mining
► Method of data mining LD-based gene mapping
► Uses haplotypes as inputs, which can be obtained from genetic simulation programs such as GENEHUNTER
► Extension of traditional association analysis
► Search for shared and flexible haplotypes and find out which ones are strongly associated with a disease.
► Uses a non-parametric statistical model without any genetic models on the basis of the locations of the haplotypes

Page 59: SNP

What we know

► LD, the non-random association of haplotypes with a disease, is likely strongest around the DS (Disease Susceptibility) gene.

► The disease locus will most likely be where the strongest associations are.

Page 60: SNP

Notation

► Haplotype map M has k markers: M = (m_1, …, m_k)
► A haplotype pattern P on M is a vector (p_1, …, p_k), where each p_i is an allele of m_i or a wild-card (*)
► P occurs on a haplotype vector, which is simply the chromosome H = (h_1, …, h_k), if for every i either p_i = h_i or p_i = *

► Example:
P_1 = (*, 2, 5, *, 3, *, *, *, *, *)
P_C = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
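The occurrence test in this notation can be written directly as code. This is a minimal sketch: the function name and the tuple representation are assumptions, and the two vectors are the ones from the example on this slide.

```python
WILDCARD = "*"

def pattern_occurs(pattern, haplotype):
    """True if every non-wildcard position of the pattern equals the
    allele at the same position of the haplotype."""
    return len(pattern) == len(haplotype) and all(
        p == WILDCARD or p == h for p, h in zip(pattern, haplotype)
    )

P1 = ("*", 2, 5, "*", 3, "*", "*", "*", "*", "*")
PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)

print(pattern_occurs(P1, PC))  # prints True
```

The pattern fixes alleles only at the second, third, and fifth markers, and the chromosome carries 2, 5, and 3 there, so the pattern occurs on it.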

Page 61: SNP

Issues in Shape of Haplotype Pattern

1. Length of the pattern
► Defined as the maximal distance between any 2 markers, measured in centiMorgans
► Extremely long sequences don’t give us much information, so the length of P is constrained in HPM

2. Gaps in sequences
► Accounts for mutations, errors, missing data, and recombination
► Gap size and number can be controlled in HPM
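Both shape constraints are straightforward to compute for a given pattern. The sketch below assumes that the "markers" of a pattern are its non-wildcard positions and that a gap is a run of wild-cards lying between two fixed markers; the function names and the evenly spaced 1 cM map are hypothetical.

```python
def pattern_length_cm(pattern, positions_cm, wildcard="*"):
    """Distance in cM between the outermost non-wildcard markers."""
    fixed = [pos for p, pos in zip(pattern, positions_cm) if p != wildcard]
    return max(fixed) - min(fixed) if fixed else 0.0

def pattern_gaps(pattern, wildcard="*"):
    """Sizes of the wild-card runs that lie between non-wildcard markers."""
    fixed_idx = [i for i, p in enumerate(pattern) if p != wildcard]
    return [b - a - 1 for a, b in zip(fixed_idx, fixed_idx[1:]) if b - a > 1]

pattern = ("*", 2, 5, "*", 3, "*", "*", "*", "*", "*")
positions = [float(i) for i in range(10)]  # assumed map: one marker per cM

print(pattern_length_cm(pattern, positions))  # prints 3.0
print(pattern_gaps(pattern))                  # prints [1]
```

A search could then discard any pattern whose length, number of gaps, or largest gap exceeds user-chosen limits.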

Page 62: SNP

Procedure

► Depth-first search finds all haplotype patterns whose association measure exceeds the lower-bound threshold
► Calculate the frequency f(m_i) of marker m_i with respect to (M, H, Y, x), where Y = phenotype and x = positive association threshold
► Markers with the highest frequencies are predicted to be the area of the DS gene, assuming a DS gene is present.
► Prediction is made at the granularity of the marker density
► Markers are ranked based on frequency
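The frequency-counting step can be sketched as below. Several choices here are assumptions made for illustration and are not taken from the slides: candidate patterns are supplied as a list instead of being enumerated by depth-first search, association is measured with a Pearson chi-square on a 2x2 table, and a pattern is counted toward a marker only where it fixes an allele.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def occurs(pattern, haplotype, wildcard="*"):
    return all(p == wildcard or p == h for p, h in zip(pattern, haplotype))

def marker_frequencies(patterns, haplotypes, affected, x):
    """For each marker, count the positively associated patterns
    (chi-square >= x) that fix an allele at that marker."""
    freq = [0] * len(haplotypes[0])
    n_affected = sum(affected)
    for pattern in patterns:
        hits = [occurs(pattern, h) for h in haplotypes]
        a = sum(1 for hit, y in zip(hits, affected) if hit and y)      # affected, pattern present
        b = sum(1 for hit, y in zip(hits, affected) if hit and not y)  # control, pattern present
        c = n_affected - a
        d = len(affected) - n_affected - b
        if a * d > b * c and chi_square_2x2(a, b, c, d) >= x:
            for i, p in enumerate(pattern):
                if p != "*":
                    freq[i] += 1
    return freq

haplotypes = [(1, 2, 1), (1, 2, 2), (1, 2, 1), (2, 1, 1), (2, 1, 2), (1, 1, 2)]
affected = [True, True, True, False, False, False]
patterns = [("*", 2, "*"), (1, 2, "*"), ("*", "*", 1)]

print(marker_frequencies(patterns, haplotypes, affected, x=3.84))
# prints [1, 2, 0]
```

In this toy data the second marker collects the most associated patterns, so it would be ranked first as the predicted location of the DS gene.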

Page 63: SNP

Results: Simulated Data

► A founder population that grows from 300 to ~100,000 in 500 years was simulated with the “Populus simulator package”

► Simulated data were used because they are cheaper and can be easily manipulated

Page 64: SNP

• List of the 11 most strongly disease-associated haplotype patterns in the simulated data

• Chromosome has 101 markers

• Dashed line indicates the true gene location

Page 65: SNP

• Frequency histogram of the previous slide’s data, but with patterns exceeding the threshold of association

• Dashed line indicates the true gene location

• Marker 5 now has the highest frequency

Page 66: SNP

• The actual vs. predicted locations for 100 data sets

Page 67: SNP

a) Mutation-carrying chromosomes, denoted by A
b) Sample founder population size
c) Corrupted data
d) Missing data

Page 68: SNP

Real Data: HLA complex

► Data from affected sib-pair families with type 1 diabetes from the UK, genotyped for 25 markers, were used
► Markers spanned 14 Mb and covered the entire HLA complex
► The HLA-DQB1 and HLA-DRB1 loci, which are located in the middle of this 14-Mb region, are known to be the primary factors for type 1 diabetes
► Randomly selected 200 of the 385 samples to compare with simulated results

Page 69: SNP

• Frequency vs. Map Location of HLA markers
• Solid line (___): HPM calculated frequencies
• Dashed line (-----): Background LD frequencies
• Vertical lines indicate true locations of markers

Page 70: SNP

Discussion of HPM Technique

►Robust to lost and erroneous data
►Applicable to complex gene mapping
►Works well with small data sets, but accuracy increases as the amount of data increases
►Works with real and simulated data
►Does not include any previously derived models

Page 71: SNP

References

► Introduction to SNPs: Discovery of Markers of Disease
► SNP seeking long term association with complex diseases
► SNP mapping using Genome-wide Unique Sequences
► The Structure of Haplotypes Blocks in Human Genome
► Using Haplotype blocks to map human complex trait loci
► High Resolution haplotype structure in human genome
► Detection of regulatory variation in mouse genes
► http://linkage.rockefeller.edu/wli/lld.html
► http://statwww.epfl.ch/davison/teaching/Microarrays/snp.ppt
► http://www.cs.helsinki.fi/u/htoivone/pubs/ajhg_2000.pdf
► Resolution of Haplotypes and Haplotype Frequencies from SNP Genotype of Pooled Samples
► http://www.journals.uchicago.edu/AJHG/journal/issues/v71n6/024386/024386.html
► http://www.ncbi.nlm.nih.gov/entrez/utils/fref.fcgi?http://www.sciencemag.org/cgi/pmidlookup?view=full&pmid=11452081
► http://www.genome.gov/10001665
► http://walnut.usc.edu/~magnus/papers/tig.pdf