Dr. Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel

Dr. Robertas Damaševičius, Software Engineering Department,

Kaunas University of Technology

Studentų 50-415, Kaunas, Lithuania


Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain

What is splicing?

Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined

Splice junctions: boundary points between exons and introns where splicing occurs

Donor: upstream part of intron, conserved dinucleotide GT

Acceptor: downstream part of intron, conserved dinucleotide AG

Pseudo splice sites: GT or AG dinucleotides that are not true splice sites

[Figure: the sequence …CGATAA AG ATC..AAT GT ATCGCA…, in which the acceptor (AG) and donor (GT) splice-junction sites separate the central exon from the two flanking introns.]
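The definitions above can be illustrated with a minimal Python sketch that lists every GT and AG dinucleotide in a sequence. The function name `candidate_sites` and the toy sequence (the fragments shown on the slide joined together, with the elided middle simply omitted) are assumptions made for illustration only.

```python
def candidate_sites(seq):
    """Return 0-based start positions of every GT and AG dinucleotide."""
    seq = seq.upper()
    hits = {"GT": [], "AG": []}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in hits:
            hits[pair].append(i)
    return hits


# Toy sequence assembled from the slide fragments (hypothetical, elided part dropped).
toy = "CGATAAAGATCAATGTATCGCA"
sites = candidate_sites(toy)
print("donor candidates (GT):", sites["GT"])
print("acceptor candidates (AG):", sites["AG"])
```

A scan like this only yields candidates: in real genomic sequences most GT and AG occurrences are pseudo splice sites, which is why a classifier is needed.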


Problem

Splice-junction site recognition

Important for: successful gene prediction, study of genetic diseases, understanding of genetic mechanisms

Difficulties: noisy data, pseudo splice sites, non-canonical splice sites (intron is not GT...AG), alternative splicing, multitude of consensus sequences

Machine Learning: Support Vector Machine (SVM)

Feature space mapping for SVM

Which frequency-based feature mapping is the best?


Support Vector Machine (SVM)

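$$g(x_j) = \operatorname{sgn}\left( \sum_{x_i \in SV} \alpha_i y_i K(x_i, x_j) + b \right)$$

$x_i \in X$ are training data vectors, $x_j \in X$ are unknown data vectors,

$y_i \in Y$, $Y = \{-1, 1\}$ is a target space,

$K(x_i, x_j)$ is the kernel function.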
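The SVM decision function can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the names `svm_decide` and `linear_kernel`, the support vectors, the multipliers `alphas` and the bias are all made-up values, and mapping a score of exactly zero to +1 is a convention assumed here.

```python
def linear_kernel(u, v):
    """Plain dot product, used here as the simplest possible kernel K."""
    return sum(a * b for a, b in zip(u, v))


def svm_decide(x, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Sign of the weighted kernel sum over the support vectors plus the bias."""
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + bias
    return 1 if score >= 0 else -1  # assumption: a zero score is labelled +1


# Hypothetical two-support-vector model in a 2-dimensional feature space.
svs = [(1.0, 0.0), (-1.0, 0.0)]
alphas = [0.5, 0.5]
labels = [1, -1]
print(svm_decide((2.0, 1.0), svs, alphas, labels, bias=0.0))
print(svm_decide((-3.0, 0.0), svs, alphas, labels, bias=0.0))
```

Any other kernel with the same two-argument signature can be passed through the `kernel` parameter.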


What factors influence quality of classification?

Training data: size of dataset, generation of negative examples, imbalanced datasets

Mapping of data into feature space: orthogonal, single nucleotide, nucleotide grouping, ...

Selection of an optimal kernel function: linear, polynomial, RBF, sigmoid

Kernel parameters

SVM learning parameters: regularization parameter, cost factor


SVM feature space

Feature space: multidimensional vector representing data instances

Mapping of data into features: achieving better classification accuracy

Feature space construction uses:

nucleotide position-dependent information

nucleotide position-independent information

both nucleotide position-dependent and -independent information

Feature mapping rule:

N – the length of a DNA sequence, M – the length of feature vector

$$\mathcal{M}: \hat{S} \to \hat{F}, \quad \hat{S} = (s_1, s_2, \ldots, s_N), \quad \hat{F} = (f_1, f_2, \ldots, f_M)$$


K-mers

K-mer: a k-base long sequence (k-tuple) of DNA

K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence

Σ – alphabet, N – length of a DNA sequence, k – length of k-mer,

nj – number of j-th k-mer in a DNA sequence

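$$(a_1, a_2, \ldots, a_k), \quad a_i \in \hat{S}, \quad i = 1, 2, \ldots, k$$

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad \hat{S} \in \Sigma^N, \quad j = 1, \ldots, \|\Sigma\|^k$$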
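A k-mer frequency vector of this kind could be computed as in the following sketch. The function name `kmer_frequencies`, the lexicographic ordering of the k-mers and the choice to reject symbols outside the alphabet are assumptions made for illustration; the normalisation by N - k + 1 overlapping windows follows the definitions above.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequency of every possible k-mer over the given alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1  # number of overlapping k-mers in the sequence
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer not in counts:
            raise ValueError(f"symbol outside alphabet in {kmer!r}")
        counts[kmer] += 1
    return [counts[m] / windows for m in kmers]


vector = kmer_frequencies("AAAGTC", 2)
print(len(vector), vector)
```

The vector length is the alphabet size raised to the power k, so it is independent of the sequence length N.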


K-mer frequency mapping rules

4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4

Disadvantage: feature space growth ~ 4^k

Nucleotide grouping based: SW, KM & RY

SW: Σ = {S, W}, ||Σ|| = 2; strong (C, G) nucleotides – 3 H bonds, weak (A, T) nucleotides – 2 H bonds

RY: Σ = {R, Y}, ||Σ|| = 2; A and G – purines (R), C and T – pyrimidines (Y)

KM: Σ = {K, M}, ||Σ|| = 2; A and C – amines (M), G and T – ketones (K)

Example: 2-mer frequency mapping


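Sequence: AAAGTC (N = 6, k = 2, so there are N - k + 1 = 5 overlapping 2-mers)

ACGT: mapped sequence AAAGTC; 2-mers AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT; feature vector (2/5, 0, 1/5, 0, 0, 0, 0, 0, 0, 0, 0, 1/5, 0, 1/5, 0, 0)

SW: mapped sequence WWWSWS; 2-mers SS, SW, WS, WW; feature vector (0, 1/5, 2/5, 2/5)

RY: mapped sequence RRRRYY; 2-mers RR, RY, YR, YY; feature vector (3/5, 1/5, 0, 1/5)

KM: mapped sequence MMMKKM; 2-mers KK, KM, MK, MM; feature vector (1/5, 1/5, 1/5, 2/5)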
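The worked example can be checked with a short script that applies each grouping rule to AAAGTC and counts overlapping 2-mers with exact fractions. The names `RULES` and `kmer_vector` are illustrative assumptions; the letter groupings themselves are the ones defined on the mapping-rule slide.

```python
from fractions import Fraction
from itertools import product

# For each rule: replacement letters for A, C, G, T (in that order) and the
# reduced alphabet in the order used to enumerate the k-mers.
RULES = {
    "ACGT": ("ACGT", "ACGT"),
    "SW": ("WSSW", "SW"),  # strong C, G versus weak A, T
    "RY": ("RYRY", "RY"),  # purines A, G versus pyrimidines C, T
    "KM": ("MMKK", "KM"),  # A, C mapped to M; G, T mapped to K
}


def kmer_vector(seq, rule, k=2):
    """Map the sequence with the given rule and return (mapped, frequencies)."""
    targets, alphabet = RULES[rule]
    mapped = seq.upper().translate(str.maketrans("ACGT", targets))
    windows = len(mapped) - k + 1
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    freqs = [
        Fraction(sum(mapped[i:i + k] == m for i in range(windows)), windows)
        for m in kmers
    ]
    return mapped, freqs


for rule in RULES:
    mapped, freqs = kmer_vector("AAAGTC", rule)
    print(rule, mapped, [str(f) for f in freqs])
```

Using `Fraction` keeps the values in the same n/5 form as the example, which makes a comparison by eye straightforward.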


Case study

Dataset: UCI repository, Genbank 64.1 primate data

3175 sequences, each (-30 bp, +30 bp) with regard to splice site

Three splice site recognition sub-problems: Exon/Intron (EI) vs. Negative (N); Intron/Exon (IE) vs. Negative (N); Exon/Intron (EI) vs. Intron/Exon (IE)

Three datasets: EI vs. N: 767 EI and 1655 N; IE vs. N: 768 IE and 1655 N; EI vs. IE: 767 EI and 768 IE

Power series kernel

Accuracy evaluation metric: F-measure


Classification results: Exon/Intron vs. Negative


Classification results: Intron/Exon vs. Negative


Classification results: Intron/Exon vs. Exon/Intron


Classification time

[Figure: classification time versus feature vector size; intron/exon splice sites, 2422 sequences]


Evaluation of results

Classification accuracy:

Exon/Intron vs. N. – 4-mer ACGT frequency mapping (78.05%)

Intron/Exon vs. N. – 6-mer ACGT frequency mapping (70.75%)

E/I vs. I/E – 6-mer ACGT frequency mapping (90.59%)

4-mers and 6-mers better than 5-mers

RY always better than SW or KM

Feature space size: ACGT k-mer: 4^k; SW, RY, KM k-mer: 2^k

Classification speed: SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification
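The difference in feature space size is easy to tabulate. The sketch below only evaluates the two growth formulas for the k values discussed in the results; the helper name `feature_space_size` is an illustrative assumption.

```python
def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers, i.e. the length of the frequency vector."""
    return alphabet_size ** k


for k in (4, 5, 6):
    acgt = feature_space_size(4, k)     # 4-letter ACGT alphabet
    grouped = feature_space_size(2, k)  # 2-letter SW, RY or KM alphabet
    print(f"k={k}: ACGT {acgt} features, grouped {grouped} features")
```

For k = 6 the 4-letter mapping needs 4096 features while a 2-letter mapping needs 64, which gives a sense of why the grouped mappings can classify faster.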


Why RY is better than SW or KM?

Rule   Donor (EI) consensus    Acceptor (IE) consensus
ACGT   (C|A)AG / GT(A|G)AGT    (C|T)nN(C|T)AG / G
SW     (S|W)WS / SW(S|W)WSW    (S|W)nN(S|W)WS / S
KM     MMK / KK(K|M)MKK        (K|M)nN(K|M)MK / K
RY     (R|Y)RR / RYRRRY        YnNYRR / R

Acceptor consensus sequence has long runs of pyrimidines (Y)
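The effect of a pyrimidine run on each mapping can be illustrated by measuring the longest run of identical symbols after mapping. In this sketch the sequence `toy_acceptor` is a made-up pyrimidine-rich string ending in AG/G, not data from the study, and the names `MAPPINGS` and `longest_run` are illustrative assumptions.

```python
from itertools import groupby

# Replacement letters for A, C, G, T (in that order) under each mapping rule.
MAPPINGS = {"ACGT": "ACGT", "SW": "WSSW", "RY": "RYRY", "KM": "MMKK"}


def longest_run(seq, rule):
    """Length of the longest run of one repeated symbol after mapping."""
    mapped = seq.upper().translate(str.maketrans("ACGT", MAPPINGS[rule]))
    return max((len(list(group)) for _, group in groupby(mapped)), default=0)


toy_acceptor = "TTTCTTCCTTTCAGG"  # hypothetical example sequence
for rule in MAPPINGS:
    print(rule, longest_run(toy_acceptor, rule))
```

Only the RY mapping collapses the mixed C/T stretch into a single long run of Y, while SW and KM break it up just as the 4-letter alphabet does.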


Conclusions

Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results

Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules

For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping

Open research problem: “forbidden” k-mers


Questions?


SVM kernel function optimization

Introduction of additional kernel parameters

Introduction of new kernels

Power series kernel function

Advantage: more parameters for optimization, better separation of classes in feature space

$$K_n(x_i, x_j) = \sum_{k=1}^{n} a_k \left( x_i^T x_j \right)^k + c$$
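A power series kernel of this kind might be sketched as a weighted sum of powers of the dot product. The slide's formula is only partly legible, so the form used here, with coefficients a_1..a_n multiplying the powers and the constant c added once outside the sum, is an assumption, as are the function name `power_series_kernel` and the example coefficients.

```python
def power_series_kernel(x_i, x_j, coeffs, c=0.0):
    """Sum of coeffs[k-1] * (x_i . x_j)**k for k = 1..n, plus a constant c.

    Assumption: c is added outside the sum; the original slide is ambiguous.
    """
    dot = sum(a * b for a, b in zip(x_i, x_j))
    return sum(a_k * dot ** k for k, a_k in enumerate(coeffs, start=1)) + c


# With a single coefficient of 1 and c = 0 this reduces to the linear kernel.
print(power_series_kernel((1.0, 2.0), (3.0, 4.0), coeffs=[1.0]))
# Two terms (n = 2) with illustrative coefficients and a constant offset.
print(power_series_kernel((1.0, 2.0), (3.0, 4.0), coeffs=[1.0, 0.5], c=1.0))
```

Each coefficient is a separate tunable parameter, which is the extra freedom for optimization that the slide refers to.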


SW k-mer frequency mapping rule

SW ({A,T} vs. {C,G}) mapping rule

reflects the difference in the number of hydrogen bonds in the DNA molecule: strong (C, G) nucleotides - 3 H bonds, weak (A, T) nucleotides - 2 H bonds

related to physical-chemical properties of DNA: transport of electrons, mechanical waves along the DNA helix

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad \hat{S} \in \{S, W\}^N, \quad j = 1, \ldots, 2^k$$


RY k-mer frequency mapping rule

The RY mapping rule ({A, G} vs. {C, T})

describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence: A and G – purines (R), C and T – pyrimidines (Y)

corresponds to the chemical composition bias in the DNA strand

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad \hat{S} \in \{R, Y\}^N, \quad j = 1, \ldots, 2^k$$


KM k-mer mapping rule

The KM mapping rule ({A,C} vs. {G,T})

describes how ketones (K) and amines (M) are distributed along the DNA sequence: A and C – amines (M), G and T – ketones (K)

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad \hat{S} \in \{K, M\}^N, \quad j = 1, \ldots, 2^k$$


Classification metric

F-measure

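Advantage: one measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other

$$F = \frac{2 \cdot precision \cdot recall}{precision + recall} \cdot 100\%$$

$$recall = \frac{n_{TN}}{n_{TN} + n_{FP}}, \quad precision = \frac{n_{TP}}{n_{TP} + n_{FN}}$$

As defined here, precision and recall coincide with the quantities more commonly called sensitivity and specificity.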
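The claim that a high score in one component cannot compensate for a low score in the other follows from the harmonic-mean form of the F-measure. The sketch below takes the two component scores as given fractions in [0, 1]; the function name `f_measure` and the example values are illustrative assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of the two scores, expressed as a percentage."""
    if precision + recall == 0:
        return 0.0  # assumption: define F as 0 when both scores are 0
    return 2 * precision * recall / (precision + recall) * 100


balanced = f_measure(0.55, 0.55)
lopsided = f_measure(0.99, 0.10)
arithmetic_mean = (0.99 + 0.10) / 2 * 100

print(f"balanced scores:       F = {balanced:.2f}%")
print(f"lopsided scores:       F = {lopsided:.2f}%")
print(f"arithmetic mean of the lopsided pair: {arithmetic_mean:.2f}%")
```

The lopsided pair has roughly the same arithmetic mean as the balanced pair, yet its F-measure is far lower because the harmonic mean is dominated by the smaller score.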
