Solving the Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation

Alessandra Godi

IASI (CNR), Roma

Martine Labbé

Université Libre de Bruxelles

AIRO Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007

The alphabet of life…

Base pairs (A-T, G-C) are complementary

DNA structure = Double Helix (Watson-Crick)

Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)

Humans have 23 pairs of chromosomes: 22 autosome pairs and 1 pair of sex chromosomes.

Each chromosome includes hundreds of different genes.

In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes.

Human Chromosomes

[Figure: chromosomes of father, mother and children]

Human Chromosomes

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGAT

AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

Chromosomes

A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype.

For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect.

Genotype data is easy to collect.

All humans are 99.99% identical.

Diversity? Polymorphism.

A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
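The SNP definition above can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned and of equal length; the function name `snp_sites`, the `min_freq` parameter and the toy sequences are illustrative choices, not part of the original slides.

```python
from collections import Counter

def snp_sites(sequences, min_freq=0.05):
    """Return the 0-based positions where exactly two nucleotides occur,
    each in at least a fraction min_freq of the sequences."""
    n = len(sequences)
    sites = []
    # zip(*sequences) walks the alignment column by column
    for pos, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        if len(counts) == 2 and all(c / n >= min_freq for c in counts.values()):
            sites.append(pos)
    return sites

# Hypothetical toy population of four aligned chromosome copies
toy = ["AATCG", "AATCG", "ACTCG", "ACTGG"]
print(snp_sites(toy))
```

With a real population the 5% threshold only becomes meaningful when many more than a handful of sequences are sampled.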

SNP (Single Nucleotide Polymorphism)

AATATATCG

TCCGTATACCTA

GGGGTGTGTGTAC

TGCTAGCACGCG

TGTGTAATATACG

SNP (Single Nucleotide Polymorphism)

              SNP 1         SNP 2         SNP 3       SNP 4
Genotype:     A/T           T/G           A           C
Haplotype 1:  A             G             A           C
Haplotype 2:  T             T             A           C
              Heterozygous  Heterozygous  Homozygous  Homozygous

SNP: encoding

SNP 1 SNP 2 SNP 3 SNP 4

011000

100011

110000

000111

                   SNP 1  SNP 2  SNP 3  SNP 4
Genotype:          0/1    1/0    1      0
Haplotype 1:       0      1      1      0
Haplotype 2:       1      0      1      0
Encoded genotype:  2      2      1      0
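The encoding above (equal alleles keep their value, different alleles become 2) can be sketched in a few lines of Python. The function name `combine` is an illustrative choice, not taken from the slides.

```python
def combine(h1, h2):
    """Combine two 0/1 haplotypes into a genotype over {0, 1, 2}:
    a site keeps its value where the haplotypes agree and becomes 2 where they differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

# The two haplotypes of the slide
haplotype_1 = (0, 1, 1, 0)
haplotype_2 = (1, 0, 1, 0)
print(combine(haplotype_1, haplotype_2))
```

The operation is symmetric, so the order of the two haplotypes does not matter.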

Haplotyping of a population

Given a set of genotypes $G$ (strings in $\{0,1,2\}^n$), find a set of “generating” haplotypes $H$ (strings in $\{0,1\}^n$) such that every genotype in $G$ is explained by a pair of haplotypes in $H$.
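To make "generating" concrete, the following Python sketch enumerates the unordered haplotype pairs that can explain one genotype. It is an illustration under the 0/1/2 encoding of the previous slide; the name `compatible_pairs` is hypothetical. A genotype with k heterozygous sites has 2^(k-1) such pairs (one pair when k = 0).

```python
from itertools import product

def compatible_pairs(g):
    """List the unordered pairs (h1, h2) of 0/1 haplotypes that combine into genotype g."""
    het = [p for p, v in enumerate(g) if v == 2]
    if not het:
        h = tuple(g)
        return [(h, h)]  # homozygous everywhere: the haplotype is paired with itself
    first, rest = het[0], het[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(g), list(g)
        # Fixing the first heterozygous site avoids listing (h1, h2) and (h2, h1) twice
        h1[first], h2[first] = 0, 1
        for p, b in zip(rest, bits):
            h1[p], h2[p] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

for pair in compatible_pairs((2, 2, 1, 0)):
    print(pair)
```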


The GENOME is the set of genetic information which lies in the DNA sequence of each living organism.

The DNA sequence is a linear arrangement of 4 different molecules, called nucleotides or bases: A, T, C, G.

The bases are paired with each other by hydrogen bonds.

DNA accounts for the differences between individuals of the same species.

What makes us different from each other is called polymorphism.

At the DNA level, a polymorphism is a nucleotide sequence which varies within a chromosome population:

atcagattagttagggcacaggacggac

atcagattagttagggcacaggacgtac

atccgattagttagggcacaggacgtac

atccgattagttagggcacaggacggac

atccgattagttagggcacaggacgtac

atcagattagttagggcacaggacggac

atcagattagttagggcacaggacggacgtac

atcagattagttagggcacaggacggacggac

atccgattagttagggcacaggacggacggac

Single Nucleotide Polymorphism (SNP)

(SNP sites shown in upper case)

atcAgattagttagggcacaggacgGac

atcAgattagttagggcacaggacgTac

atcCgattagttagggcacaggacgTac

atcCgattagttagggcacaggacgGac

atcCgattagttagggcacaggacgTac

atcAgattagttagggcacaggacggacgTac

atcAgattagttagggcacaggacggacgGac

atcCgattagttagggcacaggacggacgGac


HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPES: chromosome at SNP level

ag at

ct cg

ct ag

at at

ag cg

GENOTYPES: “union” of two haplotypes

ag at

ct cg

ct ag

at at

ag cg

CODING: each SNP has only 2 possible values in a biological population. Let us call them ‘0’ and ‘1’. Moreover, let ‘2’ denote a heterozygous site.

00 10 11

10 01

00 00

01 11

CODING: each SNP has only 2 possible values in a biological population.

$\oplus : \{0,1\} \times \{0,1\} \to \{0,1,2\}$

$0 \oplus 0 = 0, \quad 1 \oplus 1 = 1, \quad 0 \oplus 1 = 1 \oplus 0 = 2$

HAPLOTYPING of a population

Given a set $G$ of genotypes (strings in $\{0,1,2\}^n$), find a set $H$ of generator haplotypes (strings in $\{0,1\}^n$)
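For very small instances the problem can be checked by exhaustive search. The sketch below is a brute-force illustration only, not one of the ILP formulations of this talk; the names `explains` and `min_haplotypes` are hypothetical, and the running time is exponential in the number of candidate haplotypes.

```python
from itertools import combinations, product

def explains(h1, h2, g):
    """True if haplotypes h1 and h2 combine into genotype g (2 = heterozygous site)."""
    return all((a != b) if v == 2 else (a == b == v) for a, b, v in zip(h1, h2, g))

def min_haplotypes(genotypes):
    """Return a smallest set of haplotypes in which every genotype has an explaining pair."""
    n = len(genotypes[0])
    # Candidates: haplotypes agreeing with at least one genotype on its homozygous sites
    candidates = [h for h in product((0, 1), repeat=n)
                  if any(all(v == 2 or v == a for a, v in zip(h, g)) for g in genotypes)]
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                   for g in genotypes):
                return list(subset)
    return []

print(min_haplotypes([(2, 2), (0, 0), (1, 1)]))
```

Such a search is only useful to validate a model on toy data; the formulations discussed in the following slides are what makes realistic sizes tractable.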


HAPLOTYPING of a population: State of the Art

Perfect Phylogeny (Bafna, Gusfield, Yooseph 02)

Estimation of haplotype frequencies (probabilistic studies: Fallin - Schork, 00)

Parsimony Objective (Gusfield 02, Brown 05)

HAPLOTYPING of a population: Parsimony Objective (NP-hard)

Combinatorial Methods (Gusfield 2002, Brown 2004, Lancia - Rizzi, 2002):

Exponential and Polynomial ILP formulations

Rule-based methods (HAPINFER - Clark 1990):

Starting from genotypes, haplotypes are inferred

Statistical methods (PHASE - Stephens 2004, HAPLOTYPER - Niu 2001, GERBIL - Shamir 2005)

HAPLOTYPING of a population: our approach to the problem by using ILP

A new exponential formulation

1. A pure set covering model obtained by applying the Fourier-Motzkin procedure to Gusfield’s (2002) model

2. A branch and cut procedure to decrease the number of constraints

A new polynomial formulation

A formulation using class representatives

Main idea: class representatives

$G = \{g_1, g_2, \dots, g_m\}$: genotypes of length $n$

$I = \{h_1, \dots, h_q\}$: a solution of the problem

Each haplotype induces a subset of ordered genotypes, and each genotype belongs to exactly two of these subsets:

$h_1 \to \{g_i, g_j, g_k, \dots\} = S_i$

$h_2 \to \{g_i, g_l, g_r, g_s, \dots\} = S_{i'}$

$h_3 \to \{g_k, g_l, g_s, g_t, \dots\}$

The genotype with the smallest index identifies the subset; the prime appears if the corresponding index has already been used.

$K = \{1, 2, \dots, m\}, \quad K' = \{1', 2', \dots, m'\}$

VARIABLES

$y^k_{\{i,j\}} = 1$ if genotype $g_k$ belongs to two subsets of genotypes, one whose smallest-index genotype is $i$ and the other whose smallest-index genotype is $j$; $0$ otherwise.

$k \in K, \quad i, j \in K \cup K'$

$g_1 = 021, \quad g_2 = 002, \quad g_3 = 012$

$h_1 = 001 \to \{g_1, g_2\} = S_1$

$h_2 = 011 \to \{g_1, g_3\} = S_{1'}$

$y^1_{\{1,1'\}} = 1$
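The labelling in this example can be reproduced with a short Python sketch. As a simplifying assumption, the subset induced by a haplotype is taken here to be all genotypes compatible with it, which coincides with the example above; in a real solution the subset only contains the genotypes the haplotype is actually used to explain. The names `compatible` and `label_classes` are illustrative.

```python
def compatible(h, g):
    """True if haplotype h agrees with genotype g on every homozygous site of g."""
    return all(v == 2 or v == a for a, v in zip(h, g))

def label_classes(haplotypes, genotypes):
    """Map each class label ("1", "1'", ...) to (haplotype, 1-based genotype indices).
    The label is the smallest genotype index, primed if that index is already used."""
    used = {}
    classes = {}
    for h in haplotypes:
        members = [k for k, g in enumerate(genotypes, start=1) if compatible(h, g)]
        if not members:
            continue
        i = min(members)
        count = used.get(i, 0)
        if count >= 2:
            raise ValueError(f"genotype {i} would represent more than two classes")
        classes[str(i) + ("'" if count == 1 else "")] = (h, members)
        used[i] = count + 1
    return classes

genotypes = [(0, 2, 1), (0, 0, 2), (0, 1, 2)]   # g1 = 021, g2 = 002, g3 = 012
haplotypes = [(0, 0, 1), (0, 1, 1)]             # h1 = 001, h2 = 011
for label, (h, members) in label_classes(haplotypes, genotypes).items():
    print(label, h, members)
```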

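Let us note that some $y$ variables do not exist: $y^2_{\{1',2'\}} = 0$.

If $y^2_{\{1',2'\}} = 1$, then

$S_1 = \{g_1, \dots\}, \quad S_{1'} = \{g_1, g_2, \dots\}$

$S_2 = \{g_2, \dots\}, \quad S_{2'} = \{g_2, \dots\}$

Absurd: $g_2$ would belong to three subsets ($S_{1'}$, $S_2$ and $S_{2'}$), whereas each genotype belongs to exactly two.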

$x_i = 1$ if there exists a subset of genotypes in the solution having genotype $i$ as the genotype with smallest index; $0$ otherwise. $\quad i \in K \cup K'$

$z_{i,p} \in \{0,1\}$ is the value of the $p$-th coordinate of the haplotype explaining the subset of genotypes used in the solution and having genotype $i$ as the genotype with smallest index. $\quad i \in K \cup K', \; p \in \mathrm{SNP}$

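OBJECTIVE FUNCTION:

$\min \sum_{i \in K \cup K'} x_i$

CONSTRAINTS:

1. $x_i \ge x_{i'} \quad \forall i \in K, \; i' \in K'$

2. $\sum_{i,j \in K \cup K',\; i \le k, \dots} y^k_{\{i,j\}} = 1 \quad \forall k \in K$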

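A new polynomial formulation

CONSTRAINTS:

3. $\sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\; j < i} y^k_{\{i,j\}} \le x_i \quad \forall k \in K, \; i \in K \cup K'$

3a. $y^k_{\{k,k'\}} \le x_{k'} \quad \forall k \in K \quad (i = k')$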

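4a. $z_{i,p} = 0 \quad \forall i \in K \cup K', \; p \in \mathrm{SNP}$ s.t. $g_i(p) = 0$

4b. $z_{i,p} = 1 \quad \forall i \in K \cup K', \; p \in \mathrm{SNP}$ s.t. $g_i(p) = 1$

4c. $z_{i,p} + z_{j,p} = 1 \quad \forall \{i,j\} \subseteq K \cup K', \; p \in \mathrm{SNP}$ s.t. $g_i(p) = 2$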

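5. $z_{i,p} \le 1 - \sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} - \sum_{j \in K \cup K',\; j < i} y^k_{\{i,j\}} \quad \forall k \in K, \; i \in K \cup K', \; p \in \mathrm{SNP} : g_k(p) = 0$

5a. $y^k_{\{k,k'\}} + z_{k',p} \le 1 \quad \forall k \in K \; (i = k'), \; p \in \mathrm{SNP} : g_k(p) = 0$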

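6. $z_{i,p} \ge \sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\; j < i} y^k_{\{i,j\}} \quad \forall k \in K, \; i \in K \cup K', \; p \in \mathrm{SNP} : g_k(p) = 1$

6a. $z_{k',p} \ge y^k_{\{k,k'\}} \quad \forall k \in K \; (i = k'), \; p \in \mathrm{SNP} : g_k(p) = 1$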

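7. $z_{i,p} + z_{j,p} \ge y^k_{\{i,j\}} \quad \forall i, j \in K \cup K', \; p \in \mathrm{SNP} : g_k(p) = 2$

7a. $z_{i,p} + z_{j,p} \le 2 - y^k_{\{i,j\}} \quad \forall i, j \in K \cup K', \; p \in \mathrm{SNP} : g_k(p) = 2$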

Preliminary results

10x10              Opt   zLP     sec (LP)   LP iter   sec (ILP)   MIP iter   B&B nodes
Poly               15    12      0,01       54        0,12        263        14
Brown model [‘05]  15    2       0,05       140       4,85        16,646     1360

15x15              Opt   zLP     sec (LP)   LP iter   sec (ILP)   MIP iter   B&B nodes
Poly               27    22,83   0,01       173       0,08        173        11
Brown model [‘05]  27    8       0,02       129       4.25        19.301     2.213

20x20              Opt   zLP     sec (LP)   LP iter   sec (ILP)   MIP iter   B&B nodes
Poly               16    15      0,2        268       16          573        9
Brown model [‘05]  16    3       0,07       598       27.604      16*10^6    540.623

From Gusfield’s formulation (2002)…

Let $G$ be the genotype set and $H$ the set of haplotypes which are compatible with some genotype in $G$.

For each $g \in G$:

$P_g = \{(h_1,h_2) \text{ with } h_1,h_2 \in H \mid h_1 \oplus h_2 = g\}$

INTEGER VARIABLES

$x_h = 1$ if $h$ is chosen, $0$ otherwise

$y_{h_1,h_2} = 1$ if $(h_1,h_2)$ is selected, $0$ otherwise

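OBJECTIVE FUNCTION

$\min \sum_{h \in H} x_h$

CONSTRAINTS

1. $\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1 \quad \forall g \in G$

2. $y_{h_1,h_2} \le x_{h_1} \quad \forall (h_1,h_2) \in P_g, \; g \in G$

3. $y_{h_1,h_2} \le x_{h_2} \quad \forall (h_1,h_2) \in P_g, \; g \in G$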

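…to a new set covering formulation by using the Fourier-Motzkin procedure

$\min \sum_{h \in H} x_h$

s.t. $\sum_{(h_1,h_2) \in P_g,\; h = h_1 \vee h = h_2} x_h \ge 1 \quad \forall g \in G$, for every choice of one haplotype $h$ from each pair $(h_1,h_2) \in P_g$

$x \in \{0,1\}^n$

Set covering: genotype structure + basic SC theory → facets and valid inequalities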

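Set-covering for HIP

$N$ is the set of SNPs. For a genotype $g$, the sites split into fixed and free ones:

$F = \{p \in N : g(p) \in \{0,1\}\}$ (fixed), $\quad N \setminus F$ (free)

Proposition

1. The polytope HSC is full-dimensional IFF $\forall g \in G$, $|N \setminus F| = 2$.

2. $x_j \ge 0$ is a facet for HSC IFF $\forall g \in G$ for which there exists $h_i$ s.t. $h_j \oplus h_i = g$, we have $|N \setminus F| = 3$.

3. $x_j \le 1$ is a facet $\forall j$.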

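Theorem

Let us consider a genotype $g$ and a subset $S$ of haplotypes which are associated with a minimal set covering inequality:

$\sum_{i \in S} x_i \ge 1$

With $N$ the set of SNPs, define for $g$ and for any other genotype $g'$:

$F = \{p \in N : g(p) \in \{0,1\}\}, \quad F' = \{p \in N : g'(p) \in \{0,1\}\}, \quad C = (N \setminus F') \cap F$

This inequality is facet defining IFF for each genotype $g' \ne g$ one of the following conditions holds:

$|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') \ne \emptyset$

$|C| = |(N \setminus F') \cap F| \ge 3$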

NOTE: For the following cases:

1st case: $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') = \emptyset$

2nd case: $|C| = |\{p\}| = 1$

3rd case: $C = \emptyset$

the set covering inequality is dominated by another one that can be defined by using a SEQUENTIAL LIFTING procedure.

Set-covering for HIP: main idea

To overcome the exponential structure of the formulation:

1. Add only set-covering inequalities which are facet-defining

2. Add them within a branch and cut procedure

Set-covering for HIP: a branch and cut procedure

Let x* be a fractional solution of a subproblem of the original one.

g: (h1, h2) (h3, h4) (h5, h6) (h7, h8)

All set covering inequalities associated with g have the following structure:

x{1 or 2}+ x{3 or 4} + x{5 or 6}+ x{7 or 8} ≥ 1
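The structure above means one inequality per way of picking a single haplotype from each pair, so a genotype with four pairs yields 2^4 = 16 inequalities. A minimal Python sketch of this enumeration follows; haplotypes are represented by their indices 1..8, and the function name `set_covering_inequalities` is an illustrative choice.

```python
from itertools import product

def set_covering_inequalities(pairs):
    """Enumerate the supports of the set covering inequalities of one genotype.
    Each support S stands for the inequality: sum of x_h over h in S >= 1."""
    return [tuple(choice) for choice in product(*pairs)]

# Pairs explaining g, as on the slide: (h1, h2), (h3, h4), (h5, h6), (h7, h8)
pairs_g = [(1, 2), (3, 4), (5, 6), (7, 8)]
inequalities = set_covering_inequalities(pairs_g)
print(len(inequalities))
print(inequalities[0], inequalities[-1])
```

The count doubles with every additional pair, which is why the full formulation cannot be written out explicitly for genotypes with many heterozygous sites.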

Set-covering for HIP: a branch and cut procedure

We want to find a set covering inequality of g that is violated by x*:

min {x*1, x*2} + min {x*3, x*4} + min {x*5, x*6} + min {x*7, x*8} < 1

If it exists, we have found a set covering inequality which cuts off x*!

We choose to add it to the system only if it is facet-defining.
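The separation test above can be sketched in Python: taking the smaller x* value in each pair gives the most violated set covering inequality of g, so a single pass over the pairs decides whether a cut exists. This is a minimal illustration, assuming x* is given as a dictionary from haplotype index to value; the name `separate` and the tolerance `eps` are hypothetical, and the facet-defining check is not included.

```python
def separate(pairs, x_star, eps=1e-9):
    """Return the support of a set covering inequality of g violated by x_star,
    or None if every set covering inequality of g is satisfied."""
    # In each pair keep the haplotype with the smaller fractional value
    support = [min(pair, key=lambda h: x_star[h]) for pair in pairs]
    lhs = sum(x_star[h] for h in support)
    return support if lhs < 1 - eps else None

pairs_g = [(1, 2), (3, 4), (5, 6), (7, 8)]
x_frac = {1: 0.5, 2: 0.0, 3: 0.25, 4: 0.5, 5: 0.5, 6: 0.25, 7: 0.0, 8: 1.0}
print(separate(pairs_g, x_frac))
```

Because the minimum is taken pair by pair, the routine never needs to enumerate the exponentially many inequalities of g.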

Branch and cut: preliminary results

Instance            Av. on max # of 2s   # constr. master problem   # constr. reduced problem   # added cuts   Solving time
50 genos, 10 SNPs   5                    >60.000                    7                           30             0.00 sec
50 genos, 30 SNPs   8                    >2512                      7                           200            0.05 sec

Average on 10 samples for each kind of instance generated by MS (Hudson, 2002) with recombination level r = 0

Future Works

On Polynomial formulation:

1. Strengthening of the model by clique inequalities on the genotype conflict graph

2. Cplex Concert Technology

3. More tests vs other polynomial formulations

On Exponential formulation:

1. Implementation of the lifting procedure

2. More tests in comparison with Gusfield’s formulation
