Informative SNP Selection Based on Multiple Linear Regression Jingwu He Alex Zelikovsky


Informative SNP Selection Based on Multiple Linear Regression

Jingwu He, Alex Zelikovsky

Outline

• SNPs, haplotypes, and genotypes
• Tagging problem formulation
• Tagging based on multiple linear regression
• Experimental results

Human Genome

• The human genome (DNA) is about 3 billion base pairs long; each base is A, C, G, or T.
• Our DNA is very similar: 99.9% of DNA is common.

SNPs

• The genomes of any two people differ in about 0.1% of the genome.
• These differences are Single Nucleotide Polymorphisms (SNPs).
• The total number of SNPs in the human genome is about 10^7.

[Figure: four aligned DNA sequences that are identical except at three SNP positions, where they carry the alleles C G G, C A A, T G A, and C G G.]

Haplotypes and Genotypes

• Humans are diploid organisms: there are two different “copies” of each chromosome, one from the mother and one from the father.
• Since individuals differ in SNPs, we keep only SNPs.
• Haplotype: the SNPs in a single “copy” of a chromosome.
• Genotype: a pair of haplotypes.

[Figure: two chromosome copies from individual A and two from individual B. Keeping only the three SNP positions leaves four haplotypes (C G G, C A A, T G A, C G G). Haplotypes 1 and 2 come from A and form Genotype 1; haplotypes 3 and 4 come from B and form Genotype 2.]

Causes of Variation: Mutations and Recombinations

• Mutation: one nucleotide is replaced with another (e.g., G -> A).
• Recombination: one chromatid recombines with another.

Encoding

• SNPs are generally bi-allelic

• Only two alleles occur in a single SNP: wild type and mutation

• 0 stands for wild type, 1 stands for mutation

[Figure: example genotype with its homozygous and heterozygous SNP sites labeled.]
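The encoding can be illustrated with a short Python sketch. It is not code from the talk: the function names and the wild-type reference are hypothetical, and the genotype convention used here (0 = homozygous wild type, 1 = homozygous mutation, 2 = heterozygous) is an assumption, since the slide itself only defines 0 and 1 for haplotypes.

```python
def encode_haplotype(alleles, wild_type):
    """Map each allele to 0 if it matches the wild type, else 1 (mutation)."""
    return [0 if a == w else 1 for a, w in zip(alleles, wild_type)]


def encode_genotype(hap1, hap2):
    """Combine two 0/1 haplotypes site by site.

    Assumed convention (not stated on the slide):
    0 = homozygous wild type, 1 = homozygous mutation, 2 = heterozygous.
    """
    return [h1 if h1 == h2 else 2 for h1, h2 in zip(hap1, hap2)]


# Hypothetical wild-type alleles at three SNP positions.
wild = ["C", "G", "G"]
h1 = encode_haplotype(["C", "A", "A"], wild)  # [0, 1, 1]
h2 = encode_haplotype(["T", "G", "A"], wild)  # [1, 0, 1]
print(encode_genotype(h1, h2))                # [2, 2, 1]
```

Sites where the two haplotypes agree are homozygous and keep the shared value; sites where they differ are heterozygous.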

Outline

• SNPs, haplotypes, and genotypes
• Tagging problem formulation
• Tagging based on multiple linear regression
• Experimental results

Tagging Motivation

• Decrease the cost of SNP genotyping and data analysis
  – Many SNPs are linked (strongly correlated).
  – Genotype only the informative SNPs (tag SNPs); the other SNPs are inferred from the tag SNPs.
  – Perform data analysis only on tag SNPs.
  – Cost-saving ratio = m/k (m = total number of SNPs, k = number of tags)
• Use only tag SNPs to infer non-tag SNPs

Tagging Problem

• Problem formulation
  – Given the full pattern of all SNPs in a sample,
  – find the minimum number of tag SNPs that allows the reconstruction of the complete haplotype for each individual.
• Tag Selection Algorithm
• SNP Prediction Algorithm

[Figure: Step 1: find tags (SNP positions) in the sample. Step 2: reconstruct the complete haplotype (SNP values 0, 1, 2) using computational methods.]

Tagging Methods

• Tagging Methods
  – HapBlock (K. Zhang, M.S. Waterman, et al.)
    • Greedy algorithm for tag selection
    • Majority voting for prediction
  – V. Bafna, B.V. Halldorsson, et al.
    • Graph algorithm for tag selection
    • Majority voting for prediction
  – STAMPA (E. Halperin and R. Shamir)
    • Dynamic programming for tag selection
    • Majority voting for prediction
  – …
  – Tagging based on Multiple Linear Regression
    • Greedy selection
    • Multiple linear regression for prediction

SNP Prediction Algorithm

• Given the values of k tags of an unknown individual x and the known full sample S, a SNP prediction algorithm Ak predicts the value of a single non-tag SNP in x, denoted x(k+1).
• Each non-tag SNP is treated separately.

Tag Selection based on Prediction

• Goal: choose the optimal k tags.
• The problem is NP-hard; exhaustive search would have to consider "m choose k" tag sets
  – (m = total number of SNPs, k = number of tags)
• Use the Stepwise (greedy) Tag Selection Algorithm (STA) to reduce cost and time
  – STA starts with the best tag t0, i.e., the tag that minimizes the error when Ak predicts all other SNPs.
  – STA then finds the tag t1 that is the best extension of {t0}, and continues adding the best tag until the tag set reaches the given size k.
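The stepwise selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slide does not say how the prediction error is estimated, so leave-one-out over the sample is assumed, and `least_squares_predict` is a simplified stand-in for Ak (a plain least-squares fit rounded to 0, 1, or 2), passed in as a function so any other predictor can be substituted. All names are hypothetical.

```python
import numpy as np


def least_squares_predict(T, s, t):
    """Simplified stand-in for A_k (an assumption): fit s ~ T @ coef on the
    sample, then round the fitted value for tag values t to 0, 1 or 2."""
    coef, *_ = np.linalg.lstsq(T.astype(float), s.astype(float), rcond=None)
    return int(np.clip(np.rint(t @ coef), 0, 2))


def prediction_error(sample, tags, predict):
    """Leave-one-out error rate over all non-tag SNPs (assumed error measure)."""
    n, m = sample.shape
    non_tags = [j for j in range(m) if j not in tags]
    errors = 0
    for i in range(n):
        rest = np.delete(sample, i, axis=0)  # sample without individual i
        for j in non_tags:
            guess = predict(rest[:, tags], rest[:, j], sample[i, tags])
            errors += int(guess != sample[i, j])
    return errors / (n * len(non_tags)) if non_tags else 0.0


def stepwise_tag_selection(sample, k, predict=least_squares_predict):
    """Greedy selection: repeatedly add the SNP that gives the lowest error."""
    tags = []
    for _ in range(k):
        candidates = [j for j in range(sample.shape[1]) if j not in tags]
        best = min(candidates,
                   key=lambda j: prediction_error(sample, tags + [j], predict))
        tags.append(best)
    return tags
```

Here `sample` is an integer NumPy array with one row per individual and one column per SNP. Each greedy step evaluates every remaining SNP as a candidate, so the search examines on the order of m·k tag sets instead of all "m choose k" subsets.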

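Projection Method for SNP Prediction

[Figure: the tag space span(T), drawn with axes tag t1 and tag t2. The three possible resolutions of the non-tag SNP, s0 = (..., 0), s1 = (..., 1), and s2 = (..., 2), are projected onto span(T); d0, d1, and d2 are the distances from each resolution to its projection.]

• Choose the resolution that minimizes its distance d to the span of the tag space, span(T).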
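The projection rule can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the SNP values 0, 1, 2 are used directly as numbers (the slides do not say whether the values are rescaled before the regression), the projection is computed by ordinary least squares, and the name `predict_snp` is hypothetical.

```python
import numpy as np


def predict_snp(T, s, t, values=(0, 1, 2)):
    """Predict a non-tag SNP of an unknown individual by projection.

    T: (n, k) tag values of the n sample individuals
    s: (n,)   values of the non-tag SNP in the sample
    t: (k,)   tag values of the unknown individual
    """
    # Append the unknown individual as a new row, so that each tag
    # becomes a column vector of length n + 1.
    T_ext = np.vstack([T, t]).astype(float)
    best_value, best_dist = None, np.inf
    for v in values:
        # Candidate resolution: the non-tag SNP column completed with v.
        s_ext = np.append(s, v).astype(float)
        # Least squares yields the projection of s_ext onto span(T_ext).
        coef, *_ = np.linalg.lstsq(T_ext, s_ext, rcond=None)
        dist = np.linalg.norm(s_ext - T_ext @ coef)
        if dist < best_dist:
            best_value, best_dist = v, dist
    return best_value


# Toy sample in which the non-tag SNP always equals the single tag.
T = np.array([[0], [1], [2], [1]])
s = np.array([0, 1, 2, 1])
print(predict_snp(T, s, np.array([2])))  # 2
```

In the toy sample the resolution with value 2 lies in the span of the tag column, so its distance is essentially zero and it is chosen.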

Data Sets

• Daly et al.
  – 616-kilobase region of human chromosome 5q31, with 103 SNPs genotyped for 129 trios.
• Seven ENCODE regions from HapMap
  – Regions ENr123 and ENm010 from two populations: 45 Han Chinese singles (HCB) and 44 Japanese singles (JPT).
  – Three regions (ENm013, ENr112, ENr113) from 30 CEPH family trios obtained from HapMap (STAMPA, E. Halperin and R. Shamir).
• Two gene regions: STEAP and TRPM8
  – 23 and 102 SNPs, respectively, genotyped for 30 trios.

Experimental Results

[Figure: experimental results; the only surviving label reads "Directly to genotype data".]

Multivariate Linear Regression Tagging

• Genotype tagging
  – Uses fewer tags than STAMPA (E. Halperin and R. Shamir, ISMB 2005 and Bioinformatics), e.g., up to two times fewer tags to reach 90% prediction accuracy.
• Statistical tagging
  – A linear combination of tags statistically covers non-tag SNPs.
  – Traditional methods use a single tag to cover a non-tag SNP.
  – Uses on average 30% fewer tags than ldSelect (C.S. Carlson et al., 2004) to statistically cover all SNPs.

Thank you

Any Questions?
