Clustering and optimization in genetic data: the problem of Tag-SNPs selection


Paola Bertolazzi, Serena D'Aguanno, Giovanni Felici*, Paola Festa**

* Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", CNR
** Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli "Federico II"

Summary

• Biological background
  – DNA
  – Chromosomes
  – Haplotypes and Genotypes
  – SNPs
• Haplotype analysis
• Tag SNPs selection
  – Problem definition
  – State of the art
  – Reconstruction Function and Linkage disequilibrium
  – Clustering techniques
  – Set covering techniques
  – Computational results
  – Conclusions and future work

Double helix (Watson-Crick) of two sequences of nucleotides A, T, C, G

Base pairs (A-T, G-C) are complementary

A DNA sequence contains regions (e.g. genes, introns, exons) located at the same position of the sequence in each individual of a species

DNA Structure

Chromosomes

One individual genome is organized in Chromosomes, i.e. large DNA macromolecules packaged in linear or circular shape

In polyploid organisms multiple copies of each chromosome exist

In diploid organisms (human) there are two copies of each chromosome, packaged in linear shape.

Each Chromosome includes hundreds of different genes

Four-arm structure during meiosis and mitosis

A single ‘copy’ of a chromosome is called haplotype, while a description of the mixed data on the two ‘copies’ is called genotype.

H1 AATCGCCTTA (maternal chrom)
H2 ACACGTCTCA (paternal chrom)

G(H1,H2) A A/C T/A C G C/T C T T/C A
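The relation between the two haplotypes and the genotype can be sketched in a few lines of Python. This is only an illustration: the function name `genotype` and the "X/Y" notation for heterozygous sites are assumptions made here, following the notation of the example.

```python
def genotype(h1: str, h2: str) -> list[str]:
    """Combine two haplotypes site by site into an unphased genotype."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Homozygous sites keep the single allele; heterozygous sites list both.
    return [a if a == b else f"{a}/{b}" for a, b in zip(h1, h2)]


h1 = "AATCGCCTTA"  # maternal copy, taken from the example
h2 = "ACACGTCTCA"  # paternal copy, taken from the example
print(" ".join(genotype(h1, h2)))
```

Note that the genotype loses the phase: from "A/C" and "T/A" alone one cannot tell which allele lies on which copy, which is why haplotype data is harder to obtain.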

• For disease association studies, haplotype data is more valuable than genotype data

• Haplotype data is hard to collect.

• Genotype data is easy to collect

Haplotypes and genotypes

SNPs

All humans are 99.99% identical.

Diversity? Polymorphism.

A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
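As a rough illustration of this definition, the sketch below scans a set of aligned sequences and reports the positions where exactly two nucleotides occur, each with frequency of at least 5%. The function name `snp_sites`, the 0-based positions, and the toy population are assumptions made for the example, not part of the original material.

```python
from collections import Counter


def snp_sites(sequences: list[str], min_freq: float = 0.05) -> list[int]:
    """Return the 0-based positions where exactly two nucleotides occur
    and each of them has relative frequency >= min_freq."""
    n = len(sequences)
    sites = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        if len(counts) == 2 and min(counts.values()) / n >= min_freq:
            sites.append(pos)
    return sites


# Toy population of 40 aligned sequences (invented for illustration).
population = ["AATCG"] * 36 + ["ATTCG"] * 3 + ["AATCC"] * 1
print(snp_sites(population))  # position 1 qualifies; position 4 is too rare
```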

[Figure: aligned DNA sequences from several individuals, with the SNP sites highlighted]

Haplotype analysis 1/2


Haplotype analysis* focuses on haplotypes and genotypes that are sequences of SNPs

*http://www.hapmap.org/

To reduce prohibitively expensive haplotyping costs, a two-stage methodology has been proposed [1]:

• Pilot Study
  – All SNPs of interest are genotyped in a small sample of the population
  – Common haplotypes are inferred using statistical methods
  – A set of tag SNPs is selected for the population study

• Population Study
  – Tag SNPs are genotyped in the remaining population
  – Statistical methods are used to infer haplotypes over the tag SNPs
  – Haplotypes over the tag SNPs are extrapolated to full haplotypes

• Two problems:
  – Find a set of tag SNPs of minimum cardinality
  – Find a reconstruction function

Haplotype analysis 2/2

Tag SNPs Selection: methods and models

1. Methods that find a minimum set of clusters of SNPs in high correlation (e.g. linkage disequilibrium) with each other (clusters are called blocks). SNP prediction should be easier within a block

2. Methods that, given the block structure (based on correlation or on proximity), find a minimum set of SNPs that is able to distinguish each pair of haplotypes in a block; or that assume the number of tag SNPs is given and find a set of tag SNPs that can reconstruct the haplotype of an unknown sample with high accuracy

Tag SNPs Selection: Problem definition

Problem Definition
• Given a population of N haplotypes over M SNPs, find a small set of SNPs (Tag SNPs) such that all the values of the other SNPs can be derived, with some reconstruction rule, from the values of the selected Tag SNPs.

Two aspects:
(1) Find a reconstruction function
(2) Find a set of minimum cardinality that can reconstruct the other SNPs using (1)

And also:
(3) Given (1) and (2), is there a proper way to identify blocks?

Tag SNPs Selection: Problem definition

The Approaches
• Use a reconstruction function based on SNPs similarity

Method 1
• Cluster the SNPs according to a proper metric
• Select the centroid of each cluster as a TAG SNP

Method 2
• Select a subset of SNPs that is able to differentiate each pair of haplotypes (Set Covering formulation)

• Both methods are coherent with the adopted reconstruction function

• The performance in reconstruction can be used to derive the blocks ex-post

The "Majority Vote"
1. Given
   – the set of TAG SNPs
   – a training set T of haplotypes of which we know the value of all the SNPs
   – a new haplotype H of which we know only the value of the TAG SNPs

2. Let S be the set of haplotypes in T that have the same values as H on the TAG SNPs

3. For each non-TAG SNP, determine its most frequent value in S and use it as the prediction of the value of this SNP in H
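The three steps of the majority vote can be sketched as follows. This is a minimal illustration, not the authors' implementation: haplotypes are assumed to be strings over the alleles "0"/"1", positions are 0-based, and the `"?"` returned when no training haplotype matches H on the TAG SNPs is a convention chosen here (the slides do not say how that case is handled).

```python
from collections import Counter


def majority_vote(training: list[str], tag_values: dict[int, str]) -> str:
    """Reconstruct a full haplotype from the values of its TAG SNPs.

    training   -- haplotypes with all SNP values known (the set T)
    tag_values -- {SNP position: allele} for the new haplotype H
    """
    # Step 2: S = training haplotypes that agree with H on every TAG SNP.
    S = [h for h in training
         if all(h[pos] == allele for pos, allele in tag_values.items())]
    result = []
    for pos in range(len(training[0])):
        if pos in tag_values:
            result.append(tag_values[pos])   # TAG SNP: value is known
        elif S:
            # Step 3: most frequent value of this SNP within S.
            result.append(Counter(h[pos] for h in S).most_common(1)[0][0])
        else:
            result.append("?")               # assumption: no match, no prediction
    return "".join(result)


T = ["0011", "0010", "0011", "1100"]
print(majority_vote(T, {0: "0"}))  # SNP 0 is the only TAG SNP
print(majority_vote(T, {0: "1"}))
```

When two alleles are equally frequent in S, `most_common` returns whichever it encountered first; a real implementation would need an explicit tie rule.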

The reconstruction function

• The majority vote rule is based on the assumption that TAG SNPs characterize almost completely the haplotype

• That is, if two haplotypes are equal on the TAG SNPs, they are assumed to be equal on the other SNPs as well.

The reconstruction function

Method 1: SNPs Clustering

• Clustering : find groups of elements with high dissimilarity between groups and small dissimilarity within each group w.r.t. a chosen distance function

• Main Assumption: TAG SNPs are those that are very similar to many other SNPs in the Training Data

1. Cluster the SNPs in the haplotypes space using Hamming Distance (HD) with the k-means algorithm, for a proper value of k

2. Select k TAG SNPs as those closest to the HD-centroids of each cluster

3. Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the Majority Rule
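A possible reading of the clustering step is sketched below: each SNP is treated as a point whose coordinates are its values across the training haplotypes, clusters are formed with Hamming distance, and centroids are updated coordinate-wise by the most frequent value (modal centroids). The function names, the `init` parameter for fixing the starting centroids, and the stopping rule are assumptions of this sketch; fewer than k TAG SNPs may be returned if a cluster ends up empty.

```python
import random
from collections import Counter


def hamming(u, v):
    """Number of coordinates on which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def cluster_tag_snps(haplotypes, k, iterations=20, seed=0, init=None):
    n_snps = len(haplotypes[0])
    # Each SNP becomes a point in haplotype space: its column of alleles.
    columns = [tuple(h[j] for h in haplotypes) for j in range(n_snps)]
    if init is None:  # random starting centroids
        init = random.Random(seed).sample(range(n_snps), k)
    centroids = [columns[j] for j in init]

    def assign(cents):
        clusters = [[] for _ in cents]
        for j, col in enumerate(columns):
            nearest = min(range(len(cents)), key=lambda c: hamming(col, cents[c]))
            clusters[nearest].append(j)
        return clusters

    for _ in range(iterations):
        clusters = assign(centroids)
        updated = []
        for c, members in enumerate(clusters):
            if not members:               # keep the centroid of an empty cluster
                updated.append(centroids[c])
                continue
            # Modal centroid: most frequent allele on each coordinate.
            updated.append(tuple(
                Counter(columns[j][i] for j in members).most_common(1)[0][0]
                for i in range(len(haplotypes))))
        if updated == centroids:
            break
        centroids = updated

    clusters = assign(centroids)
    # TAG SNP of a cluster = member closest to the cluster centroid.
    return sorted(min(members, key=lambda j: hamming(columns[j], centroids[c]))
                  for c, members in enumerate(clusters) if members)


haps = ["0011", "0011", "1100", "1100", "0000"]
print(cluster_tag_snps(haps, k=2, init=[0, 2]))
```

In the toy data SNPs 0 and 1 are identical, as are SNPs 2 and 3, so one representative per group is enough.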

Method 2: Set Covering Model

The "classical" model: find a minimal subset of TAG SNPs such that each pair of haplotypes in the training set differs in the value of at least 1 TAG SNP

\[
a_{ijk} =
\begin{cases}
1 & \text{if haplotypes } i \text{ and } j \text{ differ on SNP } k \\
0 & \text{otherwise}
\end{cases}
\]

\[
\begin{aligned}
\min \quad & \sum_k x_k \\
\text{s.t.} \quad & \sum_k a_{ijk}\, x_k \ge 1 \quad \forall\, i \ne j \\
& x_k \in \{0, 1\}
\end{aligned}
\]

Select the SNPs associated with x_k = 1 in the solution of the SC problem

Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the Majority Rule

The above problem cannot be solved optimally for realistic sizes
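To make the covering structure concrete, the sketch below builds one constraint per pair of haplotypes (the set of SNPs on which the pair differs) and then selects TAG SNPs with the standard greedy set-cover heuristic. The greedy rule is a stand-in chosen for illustration, not the method used in this work, and it does not guarantee a minimum-cardinality solution. Pairs of identical haplotypes are skipped, since no SNP can separate them.

```python
from itertools import combinations


def greedy_tag_cover(haplotypes: list[str]) -> list[int]:
    """Greedy heuristic for the set covering model of TAG SNP selection."""
    n_snps = len(haplotypes[0])
    # differs[(i, j)] = set of SNPs k on which haplotypes i and j differ.
    differs = {}
    for i, j in combinations(range(len(haplotypes)), 2):
        d = {k for k in range(n_snps) if haplotypes[i][k] != haplotypes[j][k]}
        if d:  # identical haplotypes yield no coverable constraint
            differs[(i, j)] = d
    uncovered = set(differs)
    tags = []
    while uncovered:
        # Pick the SNP that separates the most still-unseparated pairs.
        best = max(range(n_snps),
                   key=lambda k: sum(k in differs[p] for p in uncovered))
        tags.append(best)
        uncovered = {p for p in uncovered if best not in differs[p]}
    return sorted(tags)


haps = ["000", "011", "101", "110"]
tags = greedy_tag_cover(haps)
print(tags)
# Every pair of haplotypes now differs on at least one selected SNP.
print(len({tuple(h[k] for k in tags) for h in haps}) == len(haps))
```

The number of pair constraints grows quadratically with the number of haplotypes, which is one reason exact solution becomes impractical on large instances.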

Variants of the Set Covering Model

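• The SC problem has a number of constraints quadratic in the number of haplotypes

• We use variations of the SC model (SCV) that make it possible to control the number of TAGs and their quality in a more effective way

• We use an iterative heuristic based on reduced costs

Minimize the number of TAGs for a given level of differentiation α between haplotypes:

\[
\begin{aligned}
\min \quad & \sum_k x_k \\
\text{s.t.} \quad & \sum_k a_{ijk}\, x_k \ge \alpha \quad \forall\, i \ne j \\
& x_k \in \{0, 1\}
\end{aligned}
\]

Maximize the capacity to differentiate between haplotypes for a given number β of TAG SNPs:

\[
\begin{aligned}
\max \quad & \alpha \\
\text{s.t.} \quad & \sum_k a_{ijk}\, x_k \ge \alpha \quad \forall\, i \ne j \\
& \sum_k x_k \le \beta \\
& x_k \in \{0, 1\}, \quad \alpha \ge 0
\end{aligned}
\]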
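The second variant (maximize the differentiation for a given number of TAG SNPs) can be illustrated by exhaustive search on a toy instance. The names `beta` (number of TAG SNPs) and `alpha` (minimum number of selected SNPs on which any two haplotypes differ) follow the column names of the result tables; that reading of the two symbols is an assumption of this sketch. Enumeration is only feasible for very small instances and is not the iterative reduced-cost heuristic used in this work.

```python
from itertools import combinations


def max_differentiation(haplotypes: list[str], beta: int):
    """Try every set of exactly beta SNPs and return (alpha, tags), where
    alpha is the largest achievable minimum pairwise difference count.

    Requires at least two haplotypes and beta <= number of SNPs. Using
    exactly beta SNPs loses nothing, since adding a SNP never lowers alpha.
    """
    n_snps = len(haplotypes[0])
    pairs = list(combinations(haplotypes, 2))
    best_alpha, best_tags = -1, ()
    for tags in combinations(range(n_snps), beta):
        # alpha for this choice: the worst-separated pair of haplotypes.
        alpha = min(sum(a[k] != b[k] for k in tags) for a, b in pairs)
        if alpha > best_alpha:
            best_alpha, best_tags = alpha, tags
    return best_alpha, list(best_tags)


haps = ["0000", "0011", "1101", "1110"]
print(max_differentiation(haps, beta=2))
```

A result of alpha = 0 means that, with the given budget, at least one pair of haplotypes cannot be told apart on the selected SNPs.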

Some Remarks

• A good estimate of the number of TAG SNPs to be used in the model can be found efficiently by measuring the quality of the clusters for different values of k

• The quality of the two methods (Clustering and Set Covering) can be compared directly using the same dimensions of the TAG SNPs set

SC is still intractable if all SNPs are used (most literature uses the first 1000-1500 SNPs).

1. Start with the centroids of the clustering

2. Add columns with pricing until the LP is optimal

3. Add columns with a metric on SNPs until the objective function increases

4. Solve the IP

Computational results

International HapMap Project

Data on Chromosome 21 of the human genome

YRI: Yoruba in Ibadan, Nigeria
JPT: Japanese in Tokyo, Japan
CHB: Han Chinese in Beijing, China
CEU: Utah residents with Northern and Western European ancestry

Population  # haplotypes  # SNPs
YRI         120           38,852
JPT+CHB     180           33,878
CEU         120           34,103


Experiments Setting

a) Limited to the first block of 1500 SNPs (as in related literature), or
b) Using all SNPs (approximately 40,000)
c) Used clustering with standard HD with modal centroids and random starting centroids
d) Used SCV with fixed β, using iterative heuristics based on reduced costs, solved with CPLEX
e) Reconstruction with majority rule
f) Quality of reconstruction: if the SNP value is coherent in more than 70% of the matching haplotypes (set S), then predict, else declare undetermined
g) 2/3 of haplotypes used for training, 1/3 for testing
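The thresholded reconstruction of point f) and the resulting error and undecided rates can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: haplotypes are strings, both percentages are computed over all non-TAG SNP predictions on the test haplotypes, and an empty matching set S counts as undecided.

```python
from collections import Counter


def predict(S: list[str], pos: int, threshold: float = 0.7):
    """Majority allele at pos if its share in S exceeds threshold, else None."""
    if not S:
        return None
    allele, count = Counter(h[pos] for h in S).most_common(1)[0]
    return allele if count / len(S) > threshold else None


def evaluate(train: list[str], test: list[str], tags: list[int],
             threshold: float = 0.7):
    """Return (% error, % undecided) over all non-TAG SNPs of the test set."""
    non_tags = [p for p in range(len(train[0])) if p not in tags]
    errors = undecided = total = 0
    for h in test:
        # Matching set S: training haplotypes equal to h on the TAG SNPs.
        S = [t for t in train if all(t[k] == h[k] for k in tags)]
        for pos in non_tags:
            total += 1
            guess = predict(S, pos, threshold)
            if guess is None:
                undecided += 1
            elif guess != h[pos]:
                errors += 1
    return 100 * errors / total, 100 * undecided / total


train = ["0011", "0011", "0010", "1100"]
test = ["0011", "1101"]
print(evaluate(train, test, tags=[0]))
```

In this toy run one of the six predictions is wrong and one is left undecided, because only two of the three matching haplotypes agree on the last SNP.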


Set Covering results, 1500 SNPs, 0.7 majority threshold

DATASET   beta  alpha  %error  %undecided  %correct columns  %wrong columns
CEU       9     1      20.8    19.33       14.01             39.5
CEU       20    4      20.24   33.29       13.31             25.33
YRI       13    1      21.71   17.11       16.54             40.34
YRI       17    8      18.75   13.33       21.98             28.66
JPT+CHB   9     0      16.18   23.47       17.5              39.57
JPT+CHB   20    4      27.47   10.2        18.58             21.55

Set Covering results, ALL SNPs, 0.7 majority threshold

DATASET   beta  alpha  %error  %undecided  %correct columns  %wrong columns
CEU       13    2      26.88   14.83       13.27             47.33
YRI       20    4      24.92   13.27       13.2              42.11
JPT+CHB   20    2      26.16   17.04       13.09             50.77


Clustering results, 1500 SNPs, 0.7 majority threshold

DATASET   beta  iterations  %error  %undecided  %correct columns  %wrong columns
CEU       20    12          17.5    14.16       14.52             26.41
YRI       17    11          18.53   11.3        21.82             28.58
JPT+CHB   20    11          17.47   10.76       18.58             21.55

Clustering results, ALL SNPs, 0.7 majority threshold

DATASET   beta  iterations  %error  %undecided  %correct columns  %wrong columns
CEU       20    9           28.49   15.46       11.76             47.98
YRI       17    5           25.63   15.61       12.64             45.04
JPT+CHB   20    10          26.1    17.66       12.7              50.89


Observations

Reconstruction error in the range of 20% of the SNPs, improving on previous results (where comparable)

1. The SCV method performs better than clustering, especially when all SNPs are used

2. Best results are obtained with approx. 30 TAG SNPs. Larger values do not reduce the reconstruction error and slow down the computation

3. First time so many SNPs are treated simultaneously

4. Completely correct SNPs are in the range 10-20%

With 30 TAGs we can reconstruct correctly 6000 SNPs…


Work in Progress

Use the proposed method to identify the blocks:

• Use all SNPs on the Training Set
• Apply SCV to select TAG SNPs
• Apply the majority rule to the test set and select those SNPs that are predicted correctly all over the test set
• Create one block with these SNPs, associate them to the TAG set, remove these SNPs from the samples
• Iterate until the sample contains only TAG SNPs or no improvement is obtained

…Preliminary results are encouraging

… Larger data sets are needed in order to test the method properly
