
National Taiwan University
Department of Computer Science and Information Engineering

Introduction to SNP and Haplotype Analysis

Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.

Lecturer: Kun-Mao Chao

Assistant: Yao-Ting Huang

Thank Yao-Ting for preparing this wonderful lecture note.

Genetic Variations

Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
All humans share 99% of the same DNA sequence.
A genetic variation in a coding region may change the codon of an amino acid and thereby alter the amino acid sequence.

Single Nucleotide Polymorphism

A single nucleotide polymorphism (SNP, pronounced "snip") is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is kept through heredity.
SNP: a single DNA base variation found in more than 1% of the population.
Mutation: a single DNA base variation found in less than 1% of the population.

SNP:
C T T A G C T T    94%
C T T A G T T T    6%

Mutation:
C T T A G C T T    99.9%
C T T A G T T T    0.1%

Mutations and SNPs

[Figure: lineages descending from a common ancestor along a time axis up to the present; the observed genetic variations are marked as mutations or SNPs.]

Single Nucleotide Polymorphism

SNPs are the most frequent form of genetic variation.
90% of human genetic variations come from SNPs.
SNPs occur about every 300 to 600 base pairs.
Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

Single Nucleotide Polymorphism

A SNP is usually assumed to be a binary variable.
The probability of a repeated mutation at the same SNP locus is quite small.
Tri-allelic cases are usually considered to be the effect of genotyping errors.
The nucleotide at a SNP locus is called a major allele (if its allele frequency is > 50%) or a minor allele (if its allele frequency is < 50%).

A C T T A G C T T    94%    T: major allele
A C T T A G C T C    6%     C: minor allele
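The major/minor distinction can be sketched in a few lines of Python. This is only an illustration: the function name `classify_alleles` and the sample of 100 chromosomes are assumptions chosen to match the 94% / 6% split in the example.

```python
from collections import Counter

def classify_alleles(observed):
    """Return (major, minor) for a biallelic SNP given the observed nucleotides."""
    counts = Counter(observed)
    if len(counts) != 2:
        # The slides treat a SNP as binary; anything else is rejected here.
        raise ValueError("expected exactly two alleles")
    (major, _), (minor, _) = counts.most_common(2)
    return major, minor

# Hypothetical sample: 94 chromosomes carry T and 6 carry C at this locus.
sample = ["T"] * 94 + ["C"] * 6
major, minor = classify_alleles(sample)
print(major, minor)
```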

Haplotypes

A haplotype is a set of linked SNPs on the same chromosome.
A haplotype can be simply considered as a binary string since each SNP is binary.

Chromosome sequence       SNP1 SNP2 SNP3
-A C T T A G C T T-       C    A    T      Haplotype 1
-A A T T T G C T C-       A    T    C      Haplotype 2
-A C T T T G C T C-       C    T    C      Haplotype 3
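As a rough sketch of how haplotypes are read off aligned sequences, the following Python snippet finds the variable positions in the three example sequences and encodes each haplotype as a binary string. The helper names and the convention "0 = major allele, 1 = minor allele" (with the major allele judged from these three sequences alone) are assumptions for illustration.

```python
from collections import Counter

# The three chromosome sequences from the example, without the flanking dashes.
sequences = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]

def snp_positions(seqs):
    """Indices where the aligned sequences are not all identical."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def haplotypes(seqs):
    """Keep only the SNP positions of each sequence."""
    pos = snp_positions(seqs)
    return ["".join(s[i] for i in pos) for s in seqs]

def to_binary(haps):
    """Encode each haplotype with 0 for the major allele and 1 for the minor allele."""
    majors = [Counter(col).most_common(1)[0][0] for col in zip(*haps)]
    return ["".join("0" if c == m else "1" for c, m in zip(h, majors)) for h in haps]

haps = haplotypes(sequences)
print(haps)             # the three haplotypes over SNP1..SNP3
print(to_binary(haps))  # the same haplotypes as binary strings
```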

Genotypes

The use of haplotype information has been limited because the human genome is diploid.
In large sequencing projects, genotypes instead of haplotypes are collected due to cost considerations.

Haplotype data (SNP1 SNP2):
A T
C G

Genotype data (SNP1 SNP2):
A/C  G/T

Problems of Genotypes

Genotypes only tell us the alleles at each SNP locus, but we do not know how the alleles at different SNP loci are connected.
There can be several possible haplotype pairs for the same genotype.

Genotype data (SNP1 SNP2):
A/C  G/T

Possible haplotype pairs (SNP1 SNP2):
A T and C G
or
A G and C T

We do not know which haplotype pair is real.
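The ambiguity can be made concrete with a small enumeration. The sketch below lists every unordered haplotype pair consistent with a genotype; the representation of a genotype as a list of allele pairs and the name `haplotype_pairs` are assumptions for illustration.

```python
from itertools import product

def haplotype_pairs(genotype):
    """genotype: list of (allele, allele) per SNP; returns all consistent unordered pairs."""
    pairs = set()
    for choice in product([0, 1], repeat=len(genotype)):
        # One haplotype takes the chosen allele at each site, the other takes the remaining one.
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

g = [("A", "C"), ("G", "T")]   # the genotype A/C, G/T from the example
print(haplotype_pairs(g))
```

A genotype with k heterozygous sites (k ≥ 1) has 2^(k-1) such pairs, which is why the ambiguity grows quickly with the number of SNPs.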

Research Directions of SNPs and Haplotypes in Recent Years

[Diagram: SNP Database; Haplotype Inference (Maximum Parsimony, Perfect Phylogeny, Statistical Methods); Tag SNP Selection (Haplotype Block, LD Bin, Prediction Accuracy).]

Haplotype Inference

The problem of inferring haplotypes from a set of genotypes is called haplotype inference.
This problem is known to be not only NP-hard but also APX-hard.
Most combinatorial methods use the maximum parsimony model to solve this problem.
This model assumes that the number of distinct haplotypes in a natural population is small.
The solution of this problem is a minimum set of haplotypes that can explain the given genotypes.

Maximum Parsimony

Find a minimum set of haplotypes to explain the given genotypes.

G1 (SNP1 SNP2): A/C  G/T, resolved by h1 = A T and h2 = C G, or by h3 = A G and h4 = C T.
G2 (SNP1 SNP2): A/A  T/T, resolved by h1 = A T and h1 = A T.

Resolving G1 with h3 and h4 needs three haplotypes in total (A G, C T, A T); resolving G1 with h1 and h2 needs only two (A T, C G).
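For tiny inputs the maximum parsimony objective can be checked by exhaustive search. The following sketch is not one of the methods discussed in these slides; it simply tries haplotype subsets of increasing size, takes exponential time, and uses hypothetical helper names.

```python
from itertools import combinations, product

def resolutions(genotype):
    """All unordered haplotype pairs that explain one genotype."""
    pairs = set()
    for choice in product([0, 1], repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def min_haplotype_set(genotypes):
    """Smallest set of haplotypes such that every genotype has a pair inside the set."""
    options = [resolutions(g) for g in genotypes]
    candidates = sorted({h for opts in options for pair in opts for h in pair})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in opts) for opts in options):
                return chosen
    return set()

G1 = [("A", "C"), ("G", "T")]
G2 = [("A", "A"), ("T", "T")]
print(min_haplotype_set([G1, G2]))
```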

Related Works

Statistical methods:
  Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
  Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
Combinatorial methods:
  Gusfield (2003) proposed an integer linear programming algorithm.
  Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
  Brown and Harrower (2004) proposed a new integer linear formulation of this problem.

Our Results

We formulated this problem as an integer quadratic programming (IQP) problem.
We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
This algorithm finds a solution with an O(log n) approximation ratio.
We implemented this algorithm in MATLAB and compared it with existing methods.
Huang, Y.-T., Chao, K.-M., and Chen, T. "An approximation algorithm for haplotype inference by pure parsimony," to appear in Journal of Computational Biology, 2005.

Problem Formulation

Input: a set of n genotypes and m possible haplotypes.
Output: a minimum set of haplotypes that can explain the given genotypes.

Example:
G1 (SNP1 SNP2): A/C  G/T, resolved by h1 = A T and h2 = C G.
G2 (SNP1 SNP2): A/A  T/T, resolved by h1 = A T and h1 = A T.
Output: h1 = A T and h2 = C G.

Integer Quadratic Programming (IQP)

Define x_i as an integer variable with values 1 or -1.
x_i = 1 if the i-th haplotype is selected.
x_i = -1 if the i-th haplotype is not selected.

Minimizing the number of selected haplotypes is to minimize the following integer quadratic function:

$$\text{Minimize } \sum_{i=1}^{m} \frac{(1 + x_i)^2}{4},$$

since

$$\frac{(1 + x_i)^2}{4} = \begin{cases} \dfrac{(1 + 1)^2}{4} = 1 & \text{if the } i\text{-th haplotype is selected;} \\[2ex] \dfrac{(1 - 1)^2}{4} = 0 & \text{if the } i\text{-th haplotype is not selected.} \end{cases}$$

Integer Quadratic Programming (IQP)

Each genotype must be resolved by at least one pair of haplotypes.
G1 (SNP1 SNP2): A/C  G/T can be resolved by h1 = A T and h2 = C G, or by h3 = A G and h4 = C T.
For genotype G1, the following integer quadratic constraint must be satisfied:

$$\sum_{(r,t) \in \{(1,2),(3,4)\}} \frac{(1 + x_r)(1 + x_t)}{4} = \frac{(1 + x_1)(1 + x_2)}{4} + \frac{(1 + x_3)(1 + x_4)}{4} \ge 1.$$

Suppose h1 and h2 are selected, i.e., x_1 = x_2 = 1. Then the first term equals 1 and the constraint is satisfied.

Integer Quadratic Programming (IQP)

Maximum parsimony: find a minimum set of haplotypes to resolve all genotypes.

Objective function:

$$\text{Minimize } \sum_{i=1}^{m} \frac{(1 + x_i)^2}{4}$$

Constraint functions:

$$\text{Subject to } \sum_{(h_r, h_t) \in S_j} \frac{(1 + x_r)(1 + x_t)}{4} \ge 1, \quad j \in [1, n], \qquad x_i \in \{-1, 1\}.$$

Here S_j denotes the set of haplotype pairs that can resolve the j-th genotype.

We use the SDP-relaxation technique to solve this IQP problem.
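To make the IQP concrete, the sketch below evaluates the objective and the constraints for the two-genotype example and finds the optimum by exhaustive search. It is not the SDP-relaxation algorithm; the function names, the 0-based indices (h1..h4 map to 0..3), and the list `pair_sets` standing in for the sets of resolving pairs are all assumptions for illustration.

```python
from itertools import product

def objective(x):
    """Number of selected haplotypes: sum over i of (1 + x_i)^2 / 4 with x_i in {-1, +1}."""
    return sum((1 + xi) ** 2 // 4 for xi in x)

def is_feasible(x, pair_sets):
    """Each genotype j needs at least one resolving pair (r, t) with both haplotypes selected."""
    return all(
        sum((1 + x[r]) * (1 + x[t]) // 4 for r, t in pairs) >= 1
        for pairs in pair_sets
    )

# G1 is resolved by (h1, h2) or (h3, h4); G2 is resolved by (h1, h1).
pair_sets = [[(0, 1), (2, 3)], [(0, 0)]]

# Exhaustive search over all assignments; only practical for very small m.
best = min(
    (x for x in product([-1, 1], repeat=4) if is_feasible(x, pair_sets)),
    key=objective,
)
print(best, objective(best))
```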


Problems of Using SNPs for Association Studies

The number of SNPs is still too large to be used for association studies.
There are millions of SNPs in the human genome.
To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies.
Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
There are many definitions of tag SNPs.
We will first study one definition of tag SNPs based on the haplotype block model.

Haplotype Blocks and Tag SNPs

Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed by recombination hotspots (Daly et al., Patil et al.).
Within a haplotype block, little or no recombination has occurred.
The SNPs within a haplotype block tend to be inherited together.
Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block.
We only need to genotype tag SNPs instead of all SNPs within a haplotype block.

Recombination Hotspots and Haplotype Blocks

[Figure: a chromosome partitioned into haplotype blocks separated by recombination hotspots. For one block, a grid of SNP loci S1 to S12 against haplotype patterns P1 to P4 marks each cell as a major allele or a minor allele.]

A Haplotype Block Example

Chromosome 21 is partitioned into 4,135 haplotype blocks over 24,047 SNPs by Patil et al. (Science, 2001).

[Figure legend: blue box, major allele; yellow box, minor allele.]

Examples of Tag SNPs

Suppose we wish to identify an unknown haplotype sample.
We can genotype all SNPs to identify the haplotype sample.

[Figure: grid of SNP loci S1 to S12 against haplotype patterns P1 to P4, next to an unknown haplotype sample; each cell is a major allele or a minor allele.]

Examples of Tag SNPs

In fact, it is not necessary to genotype all SNPs.
SNPs S3, S4, and S5 can form a set of tag SNPs.

[Figure: grid of SNP loci S1 to S12 against haplotype patterns P1 to P4, with rows S3, S4, and S5 extracted.]

Examples of Wrong Tag SNPs

SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 will be ambiguous.

[Figure: grid of SNP loci S1 to S12 against haplotype patterns P1 to P4, with rows S1, S2, and S3 extracted.]

Examples of Tag SNPs

SNPs S1 and S12 can form a set of tag SNPs.
This set of SNPs is the minimum solution in this example.

[Figure: grid of SNP loci S1 to S12 against haplotype patterns P1 to P4, with rows S1 and S12 extracted.]
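The tag SNP condition (every pair of haplotype patterns must differ at some selected SNP) can be checked mechanically. The allele grid from the figure is not recoverable here, so the sketch below uses a small hypothetical block with five SNPs and four patterns (0 = major allele, 1 = minor allele); the function names are also assumptions, and the exhaustive search is only meant for tiny examples.

```python
from itertools import combinations

# Hypothetical block: rows are SNPs, columns are haplotype patterns P1..P4.
patterns = {
    "S1": [0, 1, 1, 0],
    "S2": [0, 1, 0, 0],
    "S3": [0, 0, 1, 0],
    "S4": [0, 0, 1, 1],
    "S5": [0, 1, 0, 1],
}

def is_tag_set(patterns, snps):
    """True if the selected SNPs give every haplotype pattern a distinct signature."""
    n = len(next(iter(patterns.values())))
    signatures = {tuple(patterns[s][p] for s in snps) for p in range(n)}
    return len(signatures) == n

def minimum_tag_set(patterns):
    """Smallest subset of SNPs that distinguishes all patterns (exhaustive search)."""
    names = list(patterns)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if is_tag_set(patterns, subset):
                return subset
    return None

print(is_tag_set(patterns, ["S1", "S2", "S3"]))  # P1 and P4 share a signature here
print(minimum_tag_set(patterns))
```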

Problems of Finding Tag SNPs

The problem of finding the minimum set of tag SNPs is known to be NP-hard.
This problem is the minimum test set problem.
A number of methods have been proposed to find the minimum set of tag SNPs (Bafna et al., Zhang et al.).
In reality, we may fail to obtain some tag SNPs if they do not pass the threshold of data quality.
In the current genotyping environment, the missing rate of SNPs is around 5% to 10%.
We proposed two greedy algorithms and one linear programming relaxation algorithm to solve this problem.

Introduction to Linkage Disequilibrium and Programming Assignment

Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.

Speaker: Yao-Ting Huang


Linkage Disequilibrium

The problem of finding tag SNPs can also be solved from a statistical point of view.
We can measure the correlation between SNPs and identify sets of highly correlated SNPs.
For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs.
Linkage disequilibrium (LD) is a measure that estimates such correlation between two SNPs.
We will formally introduce LD later.

Linkage Disequilibrium Bins

The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs.
An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other.
The value of a single SNP in one LD bin can predict the values of the other SNPs in the same bin.
These methods try to identify the minimum set of LD bins.

An Example of LD Bins (1/3)

SNP1 and SNP2 cannot form an LD bin.
For example, A in SNP1 may imply either G or A in SNP2.

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6

1 A G A C G T

2 T G C C G C

3 A A A T A T

4 T G C T A C

5 T A C C G C

6 T G C T A C

7 A A A T A T

8 A A A T A T

An Example of LD Bins (2/3)

SNP1, SNP3, and SNP6 can form an LD bin.
Any SNP in this bin is sufficient to predict the values of the others.

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6

1 A G A C G T

2 T G C C G C

3 A A A T A T

4 T G C T A C

5 T A C C G C

6 T G C T A C

7 A A A T A T

8 A A A T A T

An Example of LD Bins (3/3)

There are three LD bins, and only three tag SNPs need to be genotyped (e.g., SNP1, SNP2, and SNP4).

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6

1 A G A C G T

2 T G C C G C

3 A A A T A T

4 T G C T A C

5 T A C C G C

6 T G C T A C

7 A A A T A T

8 A A A T A T
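The three bins can be recovered from the table with a short sketch. It assumes the strictest notion of correlation, namely that two SNPs belong to the same bin only if each allele of one always co-occurs with the same allele of the other (perfect correlation); the function names are hypothetical.

```python
# Rows of the table above: one string per individual, one character per SNP.
rows = ["AGACGT", "TGCCGC", "AAATAT", "TGCTAC",
        "TACCGC", "TGCTAC", "AAATAT", "AAATAT"]

def perfectly_correlated(col_a, col_b):
    """True if the alleles of the two SNPs map one-to-one onto each other."""
    pairs = set(zip(col_a, col_b))
    return len(pairs) == len(set(col_a)) == len(set(col_b))

def ld_bins(rows):
    """Group SNP columns into bins of perfectly correlated SNPs."""
    columns = list(zip(*rows))
    bins = []
    for idx, col in enumerate(columns):
        for b in bins:
            # Perfect correlation is transitive, so comparing with one member is enough.
            if perfectly_correlated(columns[b[0]], col):
                b.append(idx)
                break
        else:
            bins.append([idx])
    return [[f"SNP{i + 1}" for i in b] for b in bins]

print(ld_bins(rows))
```

Taking the first SNP of each bin gives one possible tag set: SNP1, SNP2, and SNP4.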

Difference between Haplotype Blocks and LD Bins

Haplotype blocks are based on the assumption that SNPs in close proximity tend to be correlated with each other.
  The probability that recombination occurs between them is lower.
LD bins can group correlated SNPs that are distant from each other.
  A disease is usually affected by multiple genes instead of a single one.
The SNPs in one LD bin can be shared by other bins.
  The SNPs in a haplotype block do not appear in another block.

Introduction to Linkage Disequilibrium

                SNP2
                B       b       Total
SNP1    A       P_AB    P_Ab    P_A
        a       P_aB    P_ab    P_a
        Total   P_B     P_b     1.0

Possible haplotypes (SNP1 SNP2): A B, A b, a B, a b.

A, B: major alleles
a, b: minor alleles
P_A: probability of allele A at SNP1
P_a: probability of allele a at SNP1
P_B: probability of allele B at SNP2
P_b: probability of allele b at SNP2
P_AB: probability of haplotype AB
P_ab: probability of haplotype ab

Linkage Equilibrium

P_AB = P_A P_B
P_Ab = P_A P_b = P_A (1 - P_B)
P_aB = P_a P_B = (1 - P_A) P_B
P_ab = P_a P_b = (1 - P_A) (1 - P_B)

                SNP2
                B       b       Total
SNP1    A       P_AB    P_Ab    P_A
        a       P_aB    P_ab    P_a
        Total   P_B     P_b     1.0

Linkage Disequilibrium

P_AB ≠ P_A P_B
P_Ab ≠ P_A P_b = P_A (1 - P_B)
P_aB ≠ P_a P_B = (1 - P_A) P_B
P_ab ≠ P_a P_b = (1 - P_A) (1 - P_B)

                SNP2
                B       b       Total
SNP1    A       P_AB    P_Ab    P_A
        a       P_aB    P_ab    P_a
        Total   P_B     P_b     1.0

An Example of Linkage Disequilibrium

-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --

SNP1: P_A = 1/3, P_C = 2/3.
SNP2: P_G = 2/3, P_C = 1/3.

Suppose we have three haplotypes: AG, CG, and CC. There is no AC haplotype, i.e., P_AC = 0.
Note that P_AC = 0, P_A P_C = 1/9, and P_AC ≠ P_A P_C.
These two SNPs are in linkage disequilibrium.

An Example of Linkage Equilibrium

Before recombination:
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --

After recombination:
-- A -- -- -- C -- -- --
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --

SNP1: P_A = 1/2, P_C = 1/2.
SNP2: P_G = 1/2, P_C = 1/2.

After recombination, P_AG = P_A P_G = 1/4, P_CG = P_C P_G = 1/4, P_CC = P_C P_C = 1/4, and P_AC = P_A P_C = 1/4.
These two SNPs are in linkage equilibrium.
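Both examples can be verified numerically. The sketch below assumes each haplotype is a two-character string (allele at SNP1, allele at SNP2) and that all listed haplotypes are equally frequent; the function name is hypothetical. It reports equilibrium only if every haplotype frequency equals the product of its allele frequencies.

```python
from fractions import Fraction

def in_linkage_equilibrium(haplotypes):
    """Check P_xy == P_x * P_y for every allele x at SNP1 and allele y at SNP2."""
    n = len(haplotypes)
    for a in {h[0] for h in haplotypes}:
        for b in {h[1] for h in haplotypes}:
            p_a = Fraction(sum(h[0] == a for h in haplotypes), n)
            p_b = Fraction(sum(h[1] == b for h in haplotypes), n)
            p_ab = Fraction(sum(h == a + b for h in haplotypes), n)
            if p_ab != p_a * p_b:
                return False
    return True

before = ["AG", "CG", "CC"]        # no AC haplotype
after = ["AC", "AG", "CG", "CC"]   # AC appears after recombination

print(in_linkage_equilibrium(before))
print(in_linkage_equilibrium(after))
```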

Linkage Disequilibrium

There are many formulas to compute LD between two SNPs, and most of them are normalized to the range -1 to 1 or 0 to 1.
LD = 1 (perfect positive correlation)
LD = 0 (no correlation, i.e., linkage equilibrium)
LD = -1 (perfect negative correlation)
LD = 0.8 (strong positive correlation)
LD = 0.12 (weak positive correlation)

Linkage Disequilibrium Formulas

Mathematical formulas for computing LD:

r^2 or Δ^2:

$$r^2 = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)}$$

D':

$$D' = \begin{cases} \dfrac{D}{\min(P_A P_b,\; P_a P_B)} & \text{if } D > 0; \\[2ex] \dfrac{D}{\min(P_A P_B,\; P_a P_b)} & \text{if } D < 0, \end{cases} \qquad \text{where } D = P_{AB} - P_A P_B.$$

Chi-square test.
P value.
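As a worked check of these measures, the sketch below computes D = P_AB - P_A P_B, r^2, and D' from three frequencies. It assumes the common normalisation in which D is divided by min(P_A P_b, P_a P_B) when D > 0 and by min(P_A P_B, P_a P_b) when D < 0, so D' keeps the sign of D; the function name is hypothetical. The input uses the three-haplotype example (AG, CG, CC) with A and G playing the roles of the two tracked alleles.

```python
from fractions import Fraction

def ld_measures(p_ab, p_a, p_b):
    """Return (D, r^2, D') from the haplotype frequency P_AB and allele frequencies P_A, P_B."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d > 0:
        d_prime = d / min(p_a * (1 - p_b), (1 - p_a) * p_b)
    elif d < 0:
        d_prime = d / min(p_a * p_b, (1 - p_a) * (1 - p_b))
    else:
        d_prime = Fraction(0)
    return d, r2, d_prime

# Haplotypes AG, CG, CC: P_AG = 1/3, P_A = 1/3, P_G = 2/3.
d, r2, d_prime = ld_measures(Fraction(1, 3), Fraction(1, 3), Fraction(2, 3))
print(d, r2, d_prime)
```

In this example D' reaches its maximum because one of the four possible haplotypes is absent, while r^2 stays well below 1; the two measures answer different questions.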

Correlation Coefficient

The correlation between two random variables A and B can be measured by the correlation coefficient. Treating A and B as 0/1 indicators of the major allele at each SNP:

$$r^2 = \frac{\mathrm{Cov}(A, B)^2}{\mathrm{Var}(A)\,\mathrm{Var}(B)} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)}$$

where

$$\mathrm{Cov}(A, B) = E[AB] - E[A]E[B] = P_{AB} - P_A P_B$$

$$\mathrm{Var}(A) = E[A^2] - E[A]^2 = P_A - P_A^2 = P_A (1 - P_A)$$

Examples of Computing LD

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1          A    T    A    A    G    T
2          G    T    C    C    T    T
3          G    A    C    A    G    T
4          G    A    C    C    T    T
5          G    A    C    A    G    C

LD between SNP1 and SNP2:

$$r^2_{12} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)} = \frac{\left(\frac{3}{5} - \frac{4}{5} \cdot \frac{3}{5}\right)^2}{\frac{4}{5} \cdot \frac{1}{5} \cdot \frac{3}{5} \cdot \frac{2}{5}} = 0.375$$

LD between SNP1 and SNP3:

$$r^2_{13} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)} = \frac{\left(\frac{4}{5} - \frac{4}{5} \cdot \frac{4}{5}\right)^2}{\frac{4}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{1}{5}} = 1$$
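The two values can be reproduced directly from the table. In this sketch the rows are written as strings (one character per SNP), the major allele of each SNP is taken as the more frequent one, and the function names are hypothetical; a SNP with a single allele would cause a division by zero and is not handled.

```python
from collections import Counter
from fractions import Fraction

# One string per individual, one character per SNP (SNP1..SNP6).
rows = ["ATAAGT", "GTCCTT", "GACAGT", "GACCTT", "GACAGC"]

def major_allele(column):
    return Counter(column).most_common(1)[0][0]

def r_squared(rows, i, j):
    """r^2 between the SNPs in columns i and j (0-based), using major-allele frequencies."""
    n = len(rows)
    col_i = [r[i] for r in rows]
    col_j = [r[j] for r in rows]
    a, b = major_allele(col_i), major_allele(col_j)
    p_a = Fraction(col_i.count(a), n)
    p_b = Fraction(col_j.count(b), n)
    p_ab = Fraction(sum(x == a and y == b for x, y in zip(col_i, col_j)), n)
    return (p_ab - p_a * p_b) ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

print(float(r_squared(rows, 0, 1)))  # SNP1 vs SNP2
print(float(r_squared(rows, 0, 2)))  # SNP1 vs SNP3
```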

Minimum Clique Cover Problem

This problem asks for a minimum set of LD bins.
The minimum LD value required between two SNPs in one bin is usually set to 0.8.
This problem is known to be the minimum clique cover problem (Chao, K.-M., 2005).
Consider each SNP as a node of a graph.
There is an edge between two nodes iff the LD between the two SNPs is ≥ 0.8.

Relaxation of This Problem

The minimum clique cover problem is not easy to approximate.
The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has r^2 ≥ 0.8 with the other SNPs in the same bin.
The relaxed problem is known to be the minimum dominating set problem.
The minimum dominating set problem is still NP-hard but is easier to approximate.

Minimum Dominating Set Problem

Given a graph G(V, E), the minimum dominating set C is a minimum set of nodes such that each node in V either belongs to C or has at least one edge connecting it to a node in C.
Consider each node as a SNP and each edge as strong LD (r^2 ≥ 0.8) between two SNPs.
The minimum dominating set of this graph is the set of tag SNPs.
We can use only this set of SNPs to predict the other SNPs.
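One simple way to approach the dominating set problem is a greedy heuristic: repeatedly pick the node that covers the most still-uncovered nodes. The sketch below is an approximation, not an exact algorithm, and it is not presented as the method expected for the assignment; the graph is a hypothetical example in which an edge stands for r^2 ≥ 0.8, and the function name is an assumption.

```python
def greedy_dominating_set(adjacency):
    """adjacency: dict mapping each node to the set of its neighbours (undirected graph)."""
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # Pick the node whose closed neighbourhood covers the most uncovered nodes;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(adjacency),
                   key=lambda v: len(({v} | adjacency[v]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | adjacency[best]
    return chosen

# Hypothetical LD graph on six SNPs.
graph = {
    "SNP1": {"SNP3", "SNP6"},
    "SNP2": set(),
    "SNP3": {"SNP1", "SNP6"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": {"SNP1", "SNP3"},
}
tags = greedy_dominating_set(graph)
print(tags)
```

Each chosen node together with the neighbours it newly covers can be read as one LD bin, with the chosen node as its tag SNP.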

Experimental Data Sets

Hinds et al. (2005) identified 1,586,383 SNPs across three human populations: African American, European American, and Asian.
The database provides both genotype data and inferred haplotype data.

The Programming Assignment

Conduct an experiment on the Perlegen SNP database (http://www.perlegen.com).
Find the minimum set of LD bins such that, in each bin, at least one SNP has strong LD (r^2 ≥ 0.8) with the other SNPs in the same bin.
Please use r^2 ≥ 0.8 as the threshold to identify strong correlation between two SNPs.
The focus of this project is to design algorithms for solving the minimum dominating set problem.

Haplotype Data Format

Field                   Description
local_id                Local unique identifier for this SNP
accession               NCBI Build 34 sequence accession number
position                Position within the specified Build 34 sequence
alleles                 The two SNP alleles; order is arbitrary
NA?????_A, NA?????_B    Two inferred haploid alleles
                        Columns 5-50: African American haplotypes
                        Columns 51-98: European American haplotypes
                        Columns 99-146: Han Chinese haplotypes

Download phased haplotype data from http://genome.perlegen.com/browser/download.html.
Please use the 24 phased haplotype data sets.
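The column ranges above can be turned into a small helper that splits one record into per-population haplotype alleles. This is only a sketch: it takes an already split list of fields, because the file's delimiter and the exact encoding of the alleles field are not stated here, and the function name and the sample record are hypothetical.

```python
def split_populations(fields):
    """fields: the column values of one SNP record, in file order (column 1 first)."""
    if len(fields) < 146:
        raise ValueError("expected at least 146 columns")
    return {
        "local_id": fields[0],
        "accession": fields[1],
        "position": int(fields[2]),
        "alleles": fields[3],
        "african_american": fields[4:50],     # columns 5-50
        "european_american": fields[50:98],   # columns 51-98
        "han_chinese": fields[98:146],        # columns 99-146
    }

# Hypothetical record: 4 descriptive columns followed by 46 + 48 + 48 haploid alleles.
record = ["snp_0001", "NT_000000", "12345", "A/G"] + ["A"] * 46 + ["G"] * 48 + ["A"] * 48
parsed = split_populations(record)
print(len(parsed["african_american"]), len(parsed["european_american"]), len(parsed["han_chinese"]))
```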

The Programming Assignment

Work in teams of up to 5 people.
The program can be written in any programming language.
Exact and approximate algorithms are both welcome (more methods, higher grades).
Please provide an analysis of the proposed algorithms (e.g., the time complexity).
If you use an existing method, please add appropriate citations or references.

The Project Report

The project report should include at least the following contents (more information, higher grades):
(1) team member information,
(2) description of your algorithms,
(3) analysis of your algorithms (e.g., time complexity, approximation ratio),
(4) summary of experimental results, and
(5) contributions of each team member.

The Experimental Setup

The summary of your experimental results should at least include some statistics of the LD bins found by your algorithm.
We encourage you to conduct a comprehensive experiment and analysis.

              All      African   European   Chinese
1-10 SNPs     15123    12134     13123      11134
≥10 SNPs      1234     1111      1111       1111
Total bins    16357    13245     14234      12245

The Programming Assignment

Due date: 12/14.
Email your program (with a detailed running procedure) and project report to the TA, Yao-Ting Huang: [email protected]
We may ask you to come and demo your program if necessary.
Important messages will be announced on the following web page: http://www.csie.ntu.edu.tw/~kmchao/seq05fall/