Informative SNP Selection Based on Multiple Linear Regression
Jingwu He, Alex Zelikovsky
Outline
• SNPs, haplotypes, and genotypes
• Tagging problem formulation
• Tagging based on multiple linear regression
• Experimental results
Human Genome
• Length of the human genome (DNA): 3 billion base pairs, each A, C, G, or T.
• Our DNA is similar: 99.9% of DNA is common.
SNPs
• Genome difference between any two people: 0.1% of the genome
• These differences are Single Nucleotide Polymorphisms (SNPs).
• Total number of SNPs in the human genome: 10^7
[Figure: aligned DNA sequences from four individuals, identical except at three marked SNP positions, where they read C G G, C A A, T G A, and C G G.]
Haplotypes and Genotypes
• Human = diploid organism: two different “copies” of each chromosome, one from mother, one from father.
[Figure: chromosome copies that are identical except at the SNP positions, e.g., C A A, T G A, and C G G.]
• Since individuals differ in SNPs, we keep only SNPs.
• Haplotype: SNPs in a single “copy” of a chromosome
• Genotype: a pair of haplotypes
[Figure: the two chromosome copies of individual A and the two copies of individual B, reduced to their SNPs, give Haplotypes 1 and 2 (from A) and Haplotypes 3 and 4 (from B). Each pair of haplotypes forms a genotype: Genotype 1 from A, Genotype 2 from B.]
Cause of Variation: Mutations and Recombinations
• Mutation: one nucleotide is replaced with another (e.g., G -> A).
• Recombination: one chromatid recombines with another.
Encoding
• SNPs are generally bi-allelic
• Only two alleles in a single SNP: wild type and mutation
• 0 stands for wild type, 1 stands for mutation
[Figure labels: homozygous, homozygous, heterozygous, heterozygous]
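The 0/1 encoding above can be illustrated with a minimal Python sketch. The function name `encode_haplotype` and the choice of "CGG" as the wild-type alleles are assumptions made only for this example; the slides do not say which allele is the wild type at each SNP.

```python
def encode_haplotype(haplotype, wild_type):
    """Return a 0/1 list: 0 where the allele matches the wild type, 1 otherwise."""
    if len(haplotype) != len(wild_type):
        raise ValueError("haplotype and wild type must cover the same SNPs")
    return [0 if allele == wild else 1 for allele, wild in zip(haplotype, wild_type)]


# Hypothetical choice: treat "CGG" as the wild-type alleles at the three SNPs.
wild_type = "CGG"
haplotypes = ["CGG", "CAA", "TGA", "CGG"]

encoded = [encode_haplotype(h, wild_type) for h in haplotypes]
print(encoded)  # [[0, 0, 0], [0, 1, 1], [1, 0, 1], [0, 0, 0]]
```

Because each SNP is assumed bi-allelic, any allele that differs from the wild type is treated as the mutation.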
Outline
• SNPs, haplotypes, and genotypes
• Tagging problem formulation
• Tagging based on multiple linear regression
• Experimental results
Tagging Motivation
• Decrease SNP genotyping cost and data analysis
– Many SNPs are linked (strongly correlated)
– Genotype only the informative SNPs (tag SNPs); other SNPs are inferred from tag SNPs
– Perform data analysis only on tag SNPs.
– Cost-saving ratio = m/k (m = total number of SNPs, k = number of tags)
Use only tag SNPs to infer non-tag SNPs
Tagging Problem
• Problem formulation
– Given the full pattern of all SNPs in a sample
– Find the minimum number of tag SNPs that will allow the reconstruction of the complete haplotype for each individual.
• Tag Selection Algorithm
• SNP Prediction Algorithm
Step 1: Find tags (SNP positions) in the sample
[Figure: Find tags, (0, 1, 2)]
Step 2: Reconstruct the complete haplotype
[Figure label: Computation Methods]
Tagging Methods
– HapBlock (K. Zhang, M.S. Waterman, et al.)
• Greedy algorithm for tag selection
• Majority voting for prediction
– V. Bafna, B.V. Halldórsson et al.
• Graph algorithm for tag selection
• Majority voting for prediction
– STAMPA (E. Halperin and R. Shamir)
• Dynamic programming for tag selection
• Majority voting for prediction
– …..
– Tagging based on Multiple Linear Regression
• Greedy selection
• Multiple linear regression for prediction
SNP Prediction Algorithm
• Given the values of k tags of an unknown individual x and the known full sample S, a SNP prediction algorithm A_k predicts the value of a single non-tag SNP in x, denoted x_(k+1).
• Treat each non-tag SNP separately.
Tag Selection based on Prediction
• Choose the optimal k tags
• The problem is NP-hard: there are (m choose k) candidate tag sets
– (m = total number of SNPs, k = number of tags)
• Use the Stepwise (greedy) Tag Selection Algorithm (STA) to reduce the cost and time
– STA starts with the best tag t0, i.e., the tag that minimizes the error when predicting all other SNPs with A_k.
– STA then finds the tag t1 that is the best extension of {t0}, and continues adding the best tag until the tag set reaches the given size k.
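The greedy selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` is a least-squares stand-in for the prediction algorithm, the error is counted by leave-one-out over the sample, the SNP values are taken to be 0, 1, 2, and the matrix `G` is made-up toy data. All function names are hypothetical.

```python
import numpy as np


def predict(T, s, t, values=(0, 1, 2)):
    """Stand-in predictor: pick the value whose extended SNP column
    lies closest to the span of the extended tag columns."""
    T_ext = np.vstack([T, t]).astype(float)

    def dist(v):
        s_v = np.append(s, v).astype(float)
        coef, *_ = np.linalg.lstsq(T_ext, s_v, rcond=None)
        return np.linalg.norm(s_v - T_ext @ coef)

    return min(values, key=dist)


def loo_errors(G, tags):
    """Count leave-one-out mispredictions over all non-tag SNPs."""
    n, m = G.shape
    errors = 0
    for j in (j for j in range(m) if j not in tags):
        for i in range(n):
            rest = np.arange(n) != i  # every individual except i
            guess = predict(G[rest][:, tags], G[rest, j], G[i, tags])
            errors += int(guess != G[i, j])
    return errors


def stepwise_tag_selection(G, k):
    """Greedily add the tag that minimizes the error until k tags are chosen."""
    tags = []
    while len(tags) < k:
        candidates = [j for j in range(G.shape[1]) if j not in tags]
        tags.append(min(candidates, key=lambda j: loo_errors(G, tags + [j])))
    return tags


# Toy sample (made up): 6 individuals x 3 SNPs; SNP 1 duplicates SNP 0.
G = np.array([[0, 0, 2], [1, 1, 0], [2, 2, 1], [1, 1, 1], [0, 0, 2], [2, 2, 0]])
tags = stepwise_tag_selection(G, 2)
print(tags, loo_errors(G, tags))
```

Each greedy step evaluates at most m candidate extensions, so the search examines on the order of m·k tag sets rather than all (m choose k) of them, at the price of no optimality guarantee.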
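Projection Method for SNP Prediction
[Figure: the possible resolutions s0, s1, s2 of the unknown SNP value (0, 1, 2) are projected onto span(T), the space spanned by the tags t1, t2, ...; d0, d1, d2 are the distances from each resolution to its projection.]
Choose the resolution minimizing its distance d to span(T), the space spanned by the tags.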
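One way to read the projection rule is the following Python sketch. It assumes that each resolution is the known sample column of the SNP extended by one candidate value for the unknown individual, that span(T) is the column space of the tag matrix extended by the individual's tag values, and that no intercept term is used. The function name, the argument layout, and the toy numbers are hypothetical.

```python
import numpy as np


def predict_by_projection(T, s, t, values=(0, 1, 2)):
    """Predict one non-tag SNP of a new individual.

    T: (n, k) tag values of the n sampled individuals
    s: (n,)   values of the non-tag SNP in the sample
    t: (k,)   tag values of the new individual
    """
    T_ext = np.vstack([np.asarray(T, dtype=float), np.asarray(t, dtype=float)])
    best_value, best_dist = None, np.inf
    for v in values:
        # Candidate resolution: sample column extended with value v.
        s_v = np.append(np.asarray(s, dtype=float), float(v))
        # Least squares gives the projection of s_v onto span(T_ext).
        coef, *_ = np.linalg.lstsq(T_ext, s_v, rcond=None)
        dist = np.linalg.norm(s_v - T_ext @ coef)
        if dist < best_dist:
            best_value, best_dist = v, dist
    return best_value


# Toy data (made up): the non-tag SNP equals the first tag in every individual.
T = [[0, 1], [1, 0], [2, 1], [1, 2], [0, 2]]
s = [0, 1, 2, 1, 0]
print(predict_by_projection(T, s, t=[2, 0]))  # 2
```

In this toy sample only the resolution with value 2 lies in the span of the extended tag columns, so its distance is zero up to rounding and it is selected.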
Data Sets
• Daly et al.
– 616 kilobase region of human Chromosome 5q31, genotyping 103 SNPs for 129 trios.
• Seven ENCODE regions from HapMap.
– Regions ENr123 and ENm010 from two populations: 45 singles Han Chinese (HCB) and 44 singles Japanese (JPT).
– Three regions (ENm013, ENr112, ENr113) from 30 CEPH family trios obtained from HapMap; STAMPA (E. Halperin and R. Shamir)
• Two gene regions: STEAP and TRPM8
– genotyping 23 and 102 SNPs for 30 trios
Experimental Results
[Figure: experimental results; annotation: "Directly to genotype data"]
Multivariate Linear Regression Tagging
• Genotype tagging
– Uses fewer tags (e.g., up to two times fewer tags to reach 90% prediction accuracy) than STAMPA (E. Halperin and R. Shamir, ISMB 2005 and Bioinformatics)
• Statistical tagging
– A linear combination of tags statistically covers non-tag SNPs
– Traditional methods use a single tag to cover non-tag SNPs
– Uses on average 30% fewer tags than ldSelect (C.S. Carlson et al. 2004) to statistically cover all SNPs.
Thank you
Any Questions?