Informative SNP Selection Based on Multiple Linear Regression

Jingwu He, Alex Zelikovsky

Outline

• SNPs, haplotypes, and genotypes
• Tagging problem formulation
• Tagging based on multiple linear regression
• Experimental results

Human Genome

• Length of the human genome (DNA): 3 billion base pairs, each A, C, G, or T.
• Our DNA is similar: 99.9% of DNA is common.

SNPs

• Genome difference between any two people: 0.1% of the genome.
• These differences are Single Nucleotide Polymorphisms (SNPs).
• Total number of SNPs in the human genome: approximately 10^7.

[Figure: two aligned DNA sequences that differ at three marked SNP positions]

Haplotypes and Genotypes

• Human = diploid organism: two different “copies” of each chromosome, one from the mother and one from the father.

[Figure: the two copies of a chromosome in one individual, differing at the SNP positions]

• Since individuals differ in SNPs, we keep only SNPs.
• Haplotype: the SNPs in a single “copy” of a chromosome.
• Genotype: a pair of haplotypes.

[Figure: the two chromosome copies of individual A give haplotypes 1 and 2, which form genotype 1; the two copies of individual B give haplotypes 3 and 4, which form genotype 2]

Cause of Variation: Mutations and Recombinations

• Mutation: one nucleotide is replaced with another (e.g., G -> A).
• Recombination: one chromatid recombines with another.

Encoding

• SNPs are generally bi-allelic: only two alleles occur at a single SNP, the wild type and the mutation.
• 0 stands for the wild type, 1 stands for the mutation.

[Figure: example genotypes labeled homozygous and heterozygous]
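As a small illustration of the 0/1 allele encoding above, the sketch below combines two haplotypes into one genotype. The three-value genotype code it uses (0 and 1 for the two homozygous cases, 2 for heterozygous) is an assumption made for this example; the slides do not spell out the mapping, and the function name is hypothetical.

```python
# Assumed encoding (not spelled out in the slides):
#   haplotype allele: 0 = wild type, 1 = mutation
#   genotype value:   0 = homozygous wild type, 1 = homozygous mutation,
#                     2 = heterozygous
HETEROZYGOUS = 2

def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes of equal length into one genotype."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value; differing alleles are heterozygous.
    return [a if a == b else HETEROZYGOUS for a, b in zip(h1, h2)]

hap1 = [0, 1, 1, 0]
hap2 = [0, 1, 0, 1]
genotype = genotype_from_haplotypes(hap1, hap2)
print(genotype)  # [0, 1, 2, 2]
```

Note that the genotype no longer records which haplotype carried the mutation at a heterozygous site.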


Tagging Motivation

• Decrease the cost of SNP genotyping and data analysis
– Many SNPs are linked (strongly correlated)
– Genotype only informative SNPs (tag SNPs); other SNPs are inferred from the tag SNPs
– Perform data analysis only on tag SNPs
– Cost-saving ratio = m/k (m = total number of SNPs, k = number of tags)

Use only tag SNPs to infer non-tag SNPs

Tagging Problem

• Problem formulation
– Given the full pattern of all SNPs in a sample
– Find the minimum number of tag SNPs that will allow the reconstruction of the complete haplotype for each individual.
• Tag Selection Algorithm
• SNP Prediction Algorithm

Step 1: Find tags (SNP positions) in the sample.
Step 2: Reconstruct the complete haplotype.

[Figure: two-step pipeline; surviving labels: "Find tags", "(0, 1, 2)", "Computation Methods"]

Tagging Methods

• HapBlock (K. Zhang, M.S. Waterman, et al.)
– Greedy algorithm for tag selection
– Majority voting for prediction
• V. Bafna, B.V. Halldórsson, et al.
– Graph algorithm for tag selection
• STAMPA (E. Halperin and R. Shamir)
– Dynamic programming for tag selection
• …
• Tagging based on Multiple Linear Regression
– Greedy selection
– Multiple linear regression for prediction

SNP Prediction Algorithm

• Given the values of k tags of an unknown individual x and the known full sample S, a SNP prediction algorithm Ak predicts the value of a single non-tag SNP in x, denoted x(k+1).
• Each non-tag SNP is treated separately.

Tag Selection based on Prediction

• Choose the optimal k tags.
• The problem is NP-hard; exhaustive search would examine (m choose k) tag sets
– (m = total number of SNPs, k = number of tags)
• Use the Stepwise (greedy) Tag Selection Algorithm (STA) to reduce cost and time
– STA starts with the best tag t0, i.e., the tag that minimizes the error when predicting all other SNPs with Ak.
– STA then finds the tag t1 that is the best extension of {t0}, and continues adding best tags until the tag set reaches the given size k.
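The stepwise procedure above can be sketched as a short greedy loop. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `prediction_error` is an assumed stand-in for the error of Ak that simply counts in-sample mismatches after a least-squares fit on the tag columns.

```python
import numpy as np

def prediction_error(S, tags):
    """Stand-in for the error of A_k: mismatches after a least-squares fit.

    Every non-tag column of S is fitted as a linear combination of the
    tag columns; fitted values are rounded and compared with the truth.
    """
    T = S[:, tags].astype(float)
    others = [j for j in range(S.shape[1]) if j not in tags]
    errors = 0
    for j in others:
        coeffs, *_ = np.linalg.lstsq(T, S[:, j].astype(float), rcond=None)
        errors += int(np.sum(np.rint(T @ coeffs) != S[:, j]))
    return errors

def stepwise_tag_selection(S, k):
    """Greedily grow a tag set of size k, one best extension at a time."""
    tags = []
    for _ in range(k):
        candidates = [j for j in range(S.shape[1]) if j not in tags]
        # Best extension: the SNP whose addition gives the lowest error.
        best = min(candidates, key=lambda j: prediction_error(S, tags + [j]))
        tags.append(best)
    return tags

# Toy sample: rows are individuals, columns are SNPs (0 = wild, 1 = mutation).
# Column 2 duplicates column 0, and column 3 duplicates column 1.
S = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])
print(stepwise_tag_selection(S, 2))  # [0, 1]
```

The loop evaluates at most m candidate extensions in each of its k rounds, instead of all (m choose k) tag sets; the price is that the greedy result is not guaranteed to be optimal.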

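Projection Method for SNP Prediction

[Figure: tag vectors t1 and t2 span the tag space span(T); the possible resolutions of the non-tag SNP are projected onto span(T), at distances d0 and d1]

• Choose the resolution that minimizes its distance d to the span of the tag space, span(T).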
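The projection rule above can be sketched with an ordinary least-squares solve. This is an illustrative reading of the slide, not the authors' code: it assumes each tag is a column vector over the sample individuals plus the unknown individual x, and that each "resolution" is the non-tag SNP column with one candidate value filled in for x. The function names and the `candidates` parameter are hypothetical.

```python
import numpy as np

def distance_to_span(T, s):
    """Euclidean distance from vector s to the column space of T."""
    coeffs, *_ = np.linalg.lstsq(T, s, rcond=None)
    return float(np.linalg.norm(s - T @ coeffs))

def predict_snp(sample_tags, sample_snp, x_tags, candidates=(0, 1)):
    """Predict one non-tag SNP of individual x from its tag values.

    sample_tags: (n, k) tag values of the n sample individuals
    sample_snp:  (n,)   values of the non-tag SNP in the sample
    x_tags:      (k,)   tag values of the unknown individual x
    """
    # Append x as an extra row, so each tag becomes an (n+1)-vector.
    T = np.vstack([np.asarray(sample_tags), np.asarray(x_tags)]).astype(float)
    best_value, best_dist = None, np.inf
    for value in candidates:
        # One possible resolution of the non-tag SNP column.
        s = np.append(sample_snp, value).astype(float)
        d = distance_to_span(T, s)
        if d < best_dist:
            best_value, best_dist = value, d
    return best_value

# Toy sample in which the non-tag SNP always equals the first tag.
sample_tags = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
sample_snp = np.array([0, 1, 1, 0])
print(predict_snp(sample_tags, sample_snp, x_tags=[1, 0]))  # 1
```

With `x_tags=[1, 0]`, the resolution with value 1 coincides with the first tag column, so its distance to span(T) is zero and it is chosen.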

Data Sets

• Daly et al.
– 616 kilobase region of human chromosome 5q31, genotyping 103 SNPs for 129 trios.
• Seven ENCODE regions from HapMap.
– Regions ENr123 and ENm010 from two populations: 45 unrelated Han Chinese individuals (HCB) and 44 unrelated Japanese individuals (JPT).
– Three regions (ENm013, ENr112, ENr113) from 30 CEPH family trios obtained from HapMap.
• Two gene regions: STEAP and TRPM8
– genotyping 23 and 102 SNPs for 30 trios

Experimental Results

[Figure: experimental results; surviving label: "Directly to genotype data"]

Multivariate Linear Regression Tagging

• Genotype tagging
– Uses fewer tags (e.g., up to two times fewer tags to reach 90% prediction accuracy) than STAMPA (E. Halperin and R. Shamir, ISMB 2005 and Bioinformatics)
• Statistical tagging
– A linear combination of tags statistically covers non-tag SNPs
– Traditional methods use a single tag to cover non-tag SNPs
– Uses on average 30% fewer tags than ldSelect (C.S. Carlson et al. 2004) for statistically covering all SNPs.

Thank you

Any Questions?
