Solving Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation

Alessandra Godi, IASI (CNR), Roma

Martine Labbé, Université Libre de Bruxelles

Airo Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007


The alphabet of life…

Base pairs (A-T, G-C) are complementary

DNA structure = double helix (Watson-Crick)

Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)

Humans have 23 pairs of chromosomes: 22 autosome pairs and 1 pair of sex chromosomes.

Each chromosome includes hundreds of different genes.

In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes.

Human Chromosomes

[Figure: inheritance diagram with the labels Father, Mother and Children and the chromosome copies CM, CP, CM1, CM2, CP1, CP2.]

Human Chromosomes

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGAT

AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

Chromosomes

A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype.

For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect.

Genotype data is easy to collect.

All humans are 99.99% identical.

Diversity? Polymorphism.

A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
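The definition above can be turned into a small check on aligned sequences. The following is only a sketch under stated assumptions: the function name `snp_sites`, the toy sequences and the way frequencies are counted (over the given copies) are illustrative and do not come from the slides.

```python
from collections import Counter


def snp_sites(sequences, min_freq=0.05):
    """Return the 0-based positions where at least two different
    nucleotides each reach the frequency threshold min_freq."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned and of equal length")
    sites = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        frequent = [b for b, c in counts.items() if c / len(sequences) >= min_freq]
        if len(frequent) >= 2:
            sites.append(pos)
    return sites


# Toy aligned chromosome copies (hypothetical data, not from the slides).
copies = ["AATCG", "AATCG", "ATTCG", "ATTCA"]
print(snp_sites(copies))
```

Raising `min_freq` filters out rare variants: with a threshold of 0.3 the last position of the toy data no longer counts as a SNP, because one of its two nucleotides appears in only a quarter of the copies.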

SNP (Single Nucleotide Polymorphism)

[Figure: six aligned chromosome copies sharing the segments AATATATCG, TCCGTATACCTA, GGGGTGTGTGTAC, TGCTAGCACGCG and TGTGTAATATACG, with the single-nucleotide positions that differ between the copies highlighted as SNP columns.]

SNP (Single Nucleotide Polymorphism)

               SNP 1          SNP 2          SNP 3        SNP 4
Haplotype 1:   A              G              A            C
Haplotype 2:   T              T              A            C
Genotype:      A/T            T/G            A            C
               Heterozygous   Heterozygous   Homozygous   Homozygous

SNP: encoding

                                     SNP 1    SNP 2    SNP 3    SNP 4
Alleles over the six chromosomes:    011000   100011   110000   000111
Haplotype 1:                         0        1        1        0
Haplotype 2:                         1        0        1        0
Genotype:                            0/1      1/0      1        0
Encoded genotype:                    2        2        1        0

Haplotyping of a population

Given a set of genotypes $G$ (strings over the alphabet $\{0,1,2\}$, i.e. elements of $\{0,1,2\}^n$), find a set of “generating” haplotypes $H$ (elements of $\{0,1\}^n$).

[Figure labels: genotype, individual]

The GENOME is the set of genetic information which lies in the DNA sequence of each living organism.

The DNA sequence is a linear arrangement of 4 different molecules, called nucleotides or bases: A, T, C, G.

The bases are paired with each other by hydrogen bonds.

DNA accounts for the differences between individuals of the same species.

What makes us different from each other is called polymorphism.

At DNA level: a Polymorphism is a nucleotide sequence which varies within a chromosome population:

atcagattagttagggcacaggacggac

atcagattagttagggcacaggacgtac

atccgattagttagggcacaggacgtac

atccgattagttagggcacaggacggac

atccgattagttagggcacaggacgtac

atcagattagttagggcacaggacggac

atcagattagttagggcacaggacgtac

atcagattagttagggcacaggacgtac

atcagattagttagggcacaggacggac

atccgattagttagggcacaggacggac

Single Nucleotide Polymorphism (SNP)

[Figure: the ten sequences above with the two varying positions (a/c and g/t) highlighted.]

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPES: chromosome at SNP level

GENOTYPES: “union” of two haplotypes

HOMOZYGOUS (O, followed by the allele): same allele on both chromosomes

HETEROZYGOUS (E): different alleles

Haplotype pair   Genotype
ag  at           OaE
ct  cg           OcE
ct  ag           EE
at  at           OaOt
ag  cg           EOg

CODING: each SNP has only 2 possible values in a biological population. Let us call them ‘0’ and ‘1’. Moreover, let ‘2’ denote a heterozygous site.

Haplotype pair   Genotype
01  00           02
10  11           12
10  01           22
00  00           00
01  11           21

$\oplus : \{0,1\} \times \{0,1\} \to \{0,1,2\}$

$0 \oplus 0 = 0, \quad 1 \oplus 1 = 1, \quad 0 \oplus 1 = 1 \oplus 0 = 2$
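The coding rule can be applied site by site to a pair of haplotypes. The sketch below assumes haplotypes are given as strings over '0' and '1'; the function name `combine` is illustrative, and the five pairs are the coded haplotype pairs of this slide.

```python
def combine(h1, h2):
    """Genotype generated by two haplotypes: equal alleles are kept,
    different alleles are coded as '2'."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    if set(h1 + h2) - {"0", "1"}:
        raise ValueError("haplotypes must be strings over '0' and '1'")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Coded haplotype pairs of the five individuals.
pairs = [("01", "00"), ("10", "11"), ("10", "01"), ("00", "00"), ("01", "11")]
for h1, h2 in pairs:
    print(h1, h2, "->", combine(h1, h2))
```

The operation is symmetric, so the order of the two haplotypes does not matter.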

HAPLOTYPING of a population

Given a set $G$ of genotypes (strings in $\{0,1,2\}^n$), find a set $H$ of generator haplotypes (strings in $\{0,1\}^n$).

[Figure labels: genotype, individual]
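To make the problem statement concrete, the sketch below checks whether a set of haplotypes generates a set of genotypes and, for tiny instances only, searches for a smallest generating set by brute force (the parsimony objective discussed in the following slides). This is an illustration under stated assumptions, not the method of the talk: the function names are hypothetical and the enumeration is exponential.

```python
from itertools import combinations, product


def combine(h1, h2):
    """Site-wise genotype of two haplotypes ('2' where they differ)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explains(haplotypes, genotypes):
    """True if every genotype is generated by two haplotypes of the set
    (the same haplotype may be used twice)."""
    hs = list(haplotypes)
    return all(any(combine(a, b) == g for a in hs for b in hs) for g in genotypes)


def candidates(genotypes):
    """All haplotypes compatible with at least one genotype."""
    found = set()
    for g in genotypes:
        free = [p for p, c in enumerate(g) if c == "2"]
        for bits in product("01", repeat=len(free)):
            h = list(g)
            for p, b in zip(free, bits):
                h[p] = b
            found.add("".join(h))
    return sorted(found)


def smallest_generating_set(genotypes):
    """Brute-force search by increasing size; only viable for tiny inputs."""
    cand = candidates(genotypes)
    for size in range(1, len(cand) + 1):
        for subset in combinations(cand, size):
            if explains(subset, genotypes):
                return list(subset)
    return []


print(smallest_generating_set(["021", "002", "012"]))
```

Restricting the search to compatible haplotypes loses nothing, since a haplotype that is compatible with no genotype can never take part in an explaining pair.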

HAPLOTYPING of a population: State of the Art

Perfect Phylogeny (Bafna, Gusfield, Yooseph 02)

Estimation of haplotype frequencies (probabilistic studies: Fallin - Schork, 00)

Parsimony Objective (Gusfield 02, Brown 05)

HAPLOTYPING of a population: Parsimony Objective (NP-hard)

Combinatorial Methods (Gusfield 2002, Brown 2004, Lancia - Rizzi 2002):

Exponential and Polynomial ILP formulations

Rule-based methods (HAPINFER - Clark 1990):

Starting from genotypes, haplotypes are inferred

Statistical methods (PHASE - Stephens 2004, HAPLOTYPER - Niu 2001, GERBIL - Shamir 2005)

HAPLOTYPING of a population: our approach to the problem by using ILP

A new exponential formulation

1. A pure set covering model obtained by applying the Fourier-Motzkin procedure to Gusfield's (2002) model

2. A branch and cut procedure to decrease the number of constraints

A new polynomial formulation

A formulation using class representatives

A new polynomial formulation

$G = \{g_1, g_2, \dots, g_m\}$: genotypes of length $n$

$I = \{h_1, \dots, h_q\}$: a solution of the problem

Main idea: class representatives

Each haplotype induces a subset of ordered genotypes, and each genotype belongs to exactly two of these subsets:

$h_1 \to \{g_i, g_j, g_k, \dots\} = S_i$

$h_2 \to \{g_i, g_l, g_r, g_s, \dots\} = S_{i'}$

$h_3 \to \{g_k, g_l, g_s, g_t, \dots\} = S_k$

…

The genotype with the smallest index identifies the subset; the prime appears if the corresponding index has already been used.

$K = \{1, 2, \dots, m\}$, $K' = \{1', 2', \dots, m'\}$
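The labelling of the subsets can be illustrated with a few lines of code. This is a sketch under assumptions: the input format (a dictionary mapping each genotype index to its explaining pair), the function name `label_classes` and the tie-break between two subsets with the same smallest index (lexicographic order of the haplotypes) are illustrative choices, not taken from the slides.

```python
def label_classes(resolution):
    """resolution: dict genotype index -> (h1, h2), the explaining pair.
    Returns dict label -> (haplotype, sorted genotype indices), where the
    label is the smallest genotype index of the subset induced by the
    haplotype, primed if that index has already been used."""
    induced = {}
    for k, (h1, h2) in resolution.items():
        for h in (h1, h2):
            induced.setdefault(h, set()).add(k)
    labels = {}
    for h in sorted(induced, key=lambda hap: (min(induced[hap]), hap)):
        i = min(induced[h])
        label = str(i) if str(i) not in labels else f"{i}'"
        if label in labels:
            raise ValueError("more than two subsets share the smallest index")
        labels[label] = (h, sorted(induced[h]))
    return labels


# Genotypes g1 = 021, g2 = 002, g3 = 012 with one possible resolution.
resolution = {1: ("001", "011"), 2: ("000", "001"), 3: ("010", "011")}
for label, (h, members) in label_classes(resolution).items():
    print(f"S{label}: haplotype {h}, genotypes {members}")
```

Since a genotype belongs to at most two subsets, at most two subsets can share the same smallest index, which is why one unprimed and one primed label per index are enough.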

A new polynomial formulation

VARIABLES

$$y^k_{\{i,j\}} = \begin{cases} 1 & \text{if genotype } g_k \text{ belongs to two subsets of genotypes, one whose smallest-index genotype is } i \text{ and the other whose smallest-index genotype is } j \\ 0 & \text{otherwise} \end{cases} \qquad \forall k \in K,\ \forall i,j \in K \cup K'$$

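A new polynomial formulation

Ex: $g_1 = 021$, $g_2 = 002$, $g_3 = 012$

$h_1 = 001 \to \{g_1, g_2\} = S_1$

$h_2 = 011 \to \{g_1, g_3\} = S_{1'}$

$y^1_{\{1,1'\}} = 1$

Let us note that some y variables do not exist: $y^2_{\{1',2'\}} = 0$. If $y^2_{\{1',2'\}} = 1$, then

$S_1 = \{g_1, \dots\}$, $S_{1'} = \{g_1, g_2, \dots\}$

$S_2 = \{g_2, \dots\}$, $S_{2'} = \{g_2, \dots\}$

Absurd! Genotype $g_2$ would belong to three subsets ($S_{1'}$, $S_2$ and $S_{2'}$), whereas each genotype belongs to exactly two.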

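A new polynomial formulation

VARIABLES

$$x_i = \begin{cases} 1 & \text{if there exists a subset of genotypes in the solution having genotype } i \text{ as its smallest-index genotype} \\ 0 & \text{otherwise} \end{cases} \qquad \forall i \in K \cup K'$$

$$z_{i,p} \in \{0,1\} \qquad \forall i \in K \cup K',\ \forall p \in SNP$$

$z_{i,p}$ is the value of the p-th coordinate of the haplotype explaining the subset of genotypes used in the solution and having genotype $i$ as its smallest-index genotype.

OBJECTIVE FUNCTION:

$$\min \sum_{i \in K \cup K'} x_i$$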

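A new polynomial formulation

CONSTRAINTS:

1. $$x_i \ge x_{i'} \qquad \forall i \in K,\ i' \in K'$$

2. $$\sum_{\substack{i,j \in K \cup K' \\ i \le k,\ j \le k}} y^k_{\{i,j\}} = 1 \qquad \forall k \in K$$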

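A new polynomial formulation

CONSTRAINTS:

3. $$\sum_{\substack{j \in K \cup K' \\ j \ge i}} y^k_{\{i,j\}} + \sum_{\substack{j \in K \cup K' \\ j < i}} y^k_{\{i,j\}} \le x_i \qquad \forall k \in K,\ \forall i \in K \cup K'$$

3a. (case $i = k'$) $$y^k_{\{k,k'\}} \le x_{k'} \qquad \forall k \in K$$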

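A new polynomial formulation

CONSTRAINTS:

4a. $$z_{i,p} = 0 \qquad \forall i \in K \cup K',\ \forall p \in SNP \text{ s.t. } g_i(p) = 0$$

4b. $$z_{i,p} = 1 \qquad \forall i \in K \cup K',\ \forall p \in SNP \text{ s.t. } g_i(p) = 1$$

4c. $$z_{i,p} + z_{j,p} = 1 \qquad \forall \{i,j\} \subseteq K \cup K',\ \forall p \in SNP \text{ s.t. } g_i(p) = 2$$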

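A new polynomial formulation

CONSTRAINTS:

5. $$z_{i,p} \le 1 - \sum_{\substack{j \in K \cup K' \\ j \ge i}} y^k_{\{i,j\}} - \sum_{\substack{j \in K \cup K' \\ j < i}} y^k_{\{i,j\}} \qquad \forall k \in K,\ \forall i \in K \cup K',\ \forall p \in SNP : g_k(p) = 0$$

5a. (case $i = k'$) $$y^k_{\{k,k'\}} + z_{k',p} \le 1 \qquad \forall k \in K,\ \forall p \in SNP : g_k(p) = 0$$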

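A new polynomial formulation

CONSTRAINTS:

6. $$z_{i,p} \ge \sum_{\substack{j \in K \cup K' \\ j \ge i}} y^k_{\{i,j\}} + \sum_{\substack{j \in K \cup K' \\ j < i}} y^k_{\{i,j\}} \qquad \forall k \in K,\ \forall i \in K \cup K',\ \forall p \in SNP : g_k(p) = 1$$

6a. (case $i = k'$) $$z_{k',p} \ge y^k_{\{k,k'\}} \qquad \forall k \in K,\ \forall p \in SNP : g_k(p) = 1$$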

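A new polynomial formulation

CONSTRAINTS:

7. $$z_{i,p} + z_{j,p} \ge y^k_{\{i,j\}} \qquad \forall k \in K,\ \forall i,j \in K \cup K',\ \forall p \in SNP : g_k(p) = 2$$

7a. $$z_{i,p} + z_{j,p} \le 2 - y^k_{\{i,j\}} \qquad \forall k \in K,\ \forall i,j \in K \cup K',\ \forall p \in SNP : g_k(p) = 2$$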

Preliminary results

Instance 10x10      Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                15    12      0,01        54        0,12         263        14
Brown model ['05]   15    2       0,05        140       4,85         16,646     1360

Instance 15x15      Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                27    22,83   0,01        173       0,08         173        11
Brown model ['05]   27    8       0,02        129       4.25         19.301     2.213

Instance 20x20      Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                16    15      0,2         268       16           573        9
Brown model ['05]   16    3       0,07        598       27.604       16*10^6    540.623

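From Gusfield’s formulation (2002)…

Let $G$ be the genotype set and $\hat{H}$ the set of haplotypes which are compatible with some genotype in $G$.

For each $g \in G$:

$$P_g = \{(h_1,h_2) \text{ with } h_1,h_2 \in \hat{H} \mid h_1 \oplus h_2 = g\}$$

INTEGER VARIABLES

$$x_h = \begin{cases} 1 & \text{if } h \text{ is chosen} \\ 0 & \text{otherwise} \end{cases} \qquad y_{h_1,h_2} = \begin{cases} 1 & \text{if } (h_1,h_2) \text{ is selected} \\ 0 & \text{otherwise} \end{cases}$$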
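The sets $P_g$ can be enumerated directly from a genotype: every site coded '2' is free, and fixing the free sites of one haplotype determines the other. The sketch below assumes genotypes are strings over '0', '1', '2'; the function name `explaining_pairs` is illustrative.

```python
from itertools import product


def explaining_pairs(g):
    """All unordered haplotype pairs (h1, h2) that generate genotype g."""
    free = [p for p, c in enumerate(g) if c == "2"]
    pairs = set()
    for bits in product("01", repeat=len(free)):
        h1, h2 = list(g), list(g)
        for p, b in zip(free, bits):
            h1[p] = b
            h2[p] = "1" if b == "0" else "0"  # complementary allele
        a, c = "".join(h1), "".join(h2)
        pairs.add((min(a, c), max(a, c)))     # store each pair once
    return sorted(pairs)


print(explaining_pairs("022"))
print(len(explaining_pairs("2222")))
```

A genotype with $t \ge 1$ heterozygous sites has $2^{t-1}$ explaining pairs, which is why the number of variables in this formulation grows exponentially with the number of 2s.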

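From Gusfield’s formulation (2002)…

OBJECTIVE FUNCTION

$$\min \sum_{h \in \hat{H}} x_h$$

CONSTRAINTS

1. $$\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1 \qquad \forall g \in G$$

2. $$y_{h_1,h_2} \le x_{h_1} \qquad \forall (h_1,h_2) \in P_g,\ \forall g \in G$$

3. $$y_{h_1,h_2} \le x_{h_2} \qquad \forall (h_1,h_2) \in P_g,\ \forall g \in G$$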

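…to a new set covering formulation by using the Fourier-Motzkin procedure

$$\min \sum_{h \in \hat{H}} x_h$$

$$\text{s.t.} \quad \sum_{(h_1,h_2) \in P_g} x_h \ge 1, \quad h = h_1 \vee h = h_2, \qquad \forall g \in G$$

$$x \in \{0,1\}^n$$

Set-Covering: Genotype Structure + Basic SC theory → Facets and Valid Inequalities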

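Set-covering for HIP

$N$ is the set of SNPs. For a genotype $g$, $F = \{p \in N : g(p) \in \{0,1\}\}$ is the set of fixed sites and $N \setminus F$ the set of free sites.

Proposition

1. The polytope HSC is full-dimensional iff $\forall g \in G$, $|N \setminus F| = 2$.

2. $x_j \ge 0$ is a facet for HSC iff, $\forall g \in G$ for which there exists $h_i$ s.t. $h_j \oplus h_i = g$, we have $|N \setminus F| = 3$.

3. $x_j \le 1$ is a facet $\forall j$.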

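Set-covering for HIP

$N$ is the set of SNPs. For two genotypes $g$ and $g'$:

$F = \{p \in N : g(p) \in \{0,1\}\}$ (fixed sites of $g$), $N \setminus F$ (free sites of $g$)

$F' = \{p \in N : g'(p) \in \{0,1\}\}$ (fixed sites of $g'$), $N \setminus F'$ (free sites of $g'$)

$C = (N \setminus F') \cap F$

Theorem

Let us consider a genotype $g$ and a subset $S$ of haplotypes which are associated to a minimal set covering inequality:

$$\sum_{h \in S} x_h \ge 1$$

This inequality is facet defining iff for each genotype $g' \ne g$ one of the following conditions holds:

1. $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') \ne \emptyset$

2. $|C| = |(N \setminus F') \cap F| \ge 3$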

Set-covering for HIP

NOTE: For the following cases:

1st case: $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') = \emptyset$

2nd case: $|C| = |\{p\}| = 1$

3rd case: $C = \emptyset$

the set covering inequality is dominated by another one that can be defined by using a SEQUENTIAL LIFTING procedure.

Set-covering for HIP: main idea

To overcome the exponential structure of the formulation:

1. Add only set-covering inequalities which are facet-defining

2. Add them in branch and cut procedure

Set-covering for HIP: a branch and cut procedure

$x^*$: a fractional solution of a subproblem of the original one

g: $(h_1, h_2)$, $(h_3, h_4)$, $(h_5, h_6)$, $(h_7, h_8)$

All set covering inequalities associated with g have the following structure:

$$x_{\{1 \text{ or } 2\}} + x_{\{3 \text{ or } 4\}} + x_{\{5 \text{ or } 6\}} + x_{\{7 \text{ or } 8\}} \ge 1$$

Set-covering for HIP: a branch and cut procedure

We want to find a set covering inequality of g that is violated by $x^*$:

$$\min\{x^*_1, x^*_2\} + \min\{x^*_3, x^*_4\} + \min\{x^*_5, x^*_6\} + \min\{x^*_7, x^*_8\} < 1$$

If it exists, we have found a set covering inequality which cuts off $x^*$!

We choose to add it to the system only if it is facet-defining.
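The separation test above amounts to picking, in every explaining pair of g, the haplotype with the smaller fractional value. The sketch below assumes $x^*$ is given as a dictionary from haplotype names to values; the function name `separate`, the tolerance `eps` and the numerical values are illustrative, and the facet-defining check mentioned on the slide is not implemented here.

```python
def separate(x_star, pairs, eps=1e-9):
    """Return (support, lhs) of the most violated set covering inequality
    of one genotype, or None if no such inequality is violated.
    pairs: explaining pairs (h1, h2) of the genotype."""
    # In each pair keep the haplotype with the smaller value of x*.
    support = [h1 if x_star[h1] <= x_star[h2] else h2 for h1, h2 in pairs]
    lhs = sum(x_star[h] for h in support)
    return (support, lhs) if lhs < 1 - eps else None


# Hypothetical fractional solution for the four pairs of genotype g.
pairs = [("h1", "h2"), ("h3", "h4"), ("h5", "h6"), ("h7", "h8")]
x_star = {"h1": 0.5, "h2": 0.2, "h3": 0.1, "h4": 0.9,
          "h5": 0.3, "h6": 0.3, "h7": 0.0, "h8": 1.0}
print(separate(x_star, pairs))
```

Because the smallest left-hand side over all inequalities of g is found in one pass over the pairs, the exponentially many inequalities never need to be listed explicitly.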

Branch and Cut preliminary results

Instance             Av. on max # of 2s   # constr master problem   # constr reduced problem   # added cuts   Solving time
50 genos, 10 SNPs    5                    >60.000                   7                          30             0.00 sec
50 genos, 30 SNPs    8                    >2512                     7                          200            0.05 sec

Average on 10 samples for each kind of instance generated by MS (Hudson, 2002) with recombination level r = 0

Future Works

On Polynomial formulation:

1. Strengthening of the model by Clique inequalities on genotype conflict graph

2. Cplex Concert Technologies

3. More tests vs other polynomial formulations

On Exponential formulation:

1. Implementation of Lifting Procedure

2. More tests in comparison with Gusfield formulation
