Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel
Dr. Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
robertas.damasevicius@ktu.lt
Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain
What is splicing?
Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined
Splice junctions: boundary points between exons and introns where splicing occurs
Donor: upstream part of intron, conserved dinucleotide GT
Acceptor: downstream part of intron, conserved dinucleotide AG
Pseudo splice sites
[Figure: sequence fragment …CGATAA AG ATC..AAT GT ATCGCA… with the regions labelled intron, acceptor (AG), exon, donor (GT), intron; the acceptor and donor positions are marked as splice-junction sites]
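The conserved GT and AG dinucleotides occur far more often than true splice junctions do, which is where pseudo splice sites come from. A minimal Python sketch of the candidate scan follows. The function name `find_dinucleotide` and the example sequence are hypothetical, chosen only for illustration.

```python
def find_dinucleotide(seq, dinucleotide):
    """Return the 0-based start positions of every occurrence of a dinucleotide."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


# Hypothetical sequence, not taken from the dataset used in the talk.
seq = "CAGGTAAGTATCGTTAGAGGC"

donor_candidates = find_dinucleotide(seq, "GT")     # possible donor sites
acceptor_candidates = find_dinucleotide(seq, "AG")  # possible acceptor sites

print("GT positions:", donor_candidates)
print("AG positions:", acceptor_candidates)
```

Even this short sequence yields several GT and AG candidates, so a classifier is needed to separate true sites from pseudo sites.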
Problem
Splice-junction site recognition
  Important for successful gene prediction
  Study of genetic diseases
  Understanding of genetic mechanisms
Difficulties
  Noisy data
  Pseudo splice sites
  Non-canonical splice sites (intron is not GT...AG)
  Alternative splicing
  Multitude of consensus sequences
Machine learning: Support Vector Machine (SVM)
  Feature space mapping for SVM
  Which frequency-based feature mapping is the best?
Support Vector Machine (SVM)
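$g(x_j) = \mathrm{sgn}\left( \sum_{x_i \in SV} \alpha_i y_i K(x_i, x_j) + b \right)$
where
  $x_i \in X$ are training data vectors, $x_j \in X$ are unknown data vectors,
  $y_i \in Y$, $Y = \{-1, 1\}$ is a target space,
  $K(x_i, x_j)$ is the kernel function,
  $SV$ is the set of support vectors, and $\alpha_i$ and $b$ are the learned coefficients and bias.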
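The SVM decision rule on this slide is a signed, kernel-weighted vote of the support vectors. A minimal pure-Python sketch follows. The function names, the linear kernel and the toy support vectors, labels, coefficients and bias are all illustrative assumptions, not values from the talk.

```python
def linear_kernel(u, v):
    """Plain dot product, used here only as a stand-in kernel."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Return +1 or -1: the sign of sum_i alpha_i * y_i * K(x_i, x) + b.

    A score of exactly zero is assigned to the +1 class (an arbitrary choice).
    """
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if score >= 0 else -1


# Toy model (hypothetical values).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
bias = 0.0

print(svm_decision((2.0, 0.0), support_vectors, labels, alphas, bias))
print(svm_decision((-1.0, -3.0), support_vectors, labels, alphas, bias))
```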
What factors influence the quality of classification?
Training data
  size of dataset, generation of negative examples, imbalanced datasets
Mapping of data into feature space
  orthogonal, single nucleotide, nucleotide grouping, ...
Selection of an optimal kernel function
  linear, polynomial, RBF, sigmoid
Kernel parameters
SVM learning parameters
  regularization parameter, cost factor
SVM feature space
Feature space: multidimensional vector representing data instances
Mapping of data into features: achieving better classification accuracy
Feature space construction:
  nucleotide position-dependent
  nucleotide position-independent
  both nucleotide position-dependent and -independent information
Feature mapping rule:
$\mathcal{M}: \hat{S} \to \hat{F}$, $\hat{S} = (s_1, s_2, \ldots, s_N)$, $\hat{F} = (f_1, f_2, \ldots, f_M)$
where $N$ is the length of a DNA sequence and $M$ is the length of the feature vector
K-mers
K-mer: a k-base long sequence (k-tuple) of DNA
K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence
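K-mer of a sequence $\hat{S}$: $a_1 a_2 \ldots a_k$, $a_i \in \Sigma$, $i = 1, 2, \ldots, k$
K-mer frequency: $p_j(\hat{S}) = \frac{n_j}{N - k + 1}$, $\hat{S} \in \Sigma^N$, $j = 1, \ldots, \|\Sigma\|^k$
where $\Sigma$ is the alphabet, $N$ is the length of a DNA sequence, $k$ is the length of a k-mer, and $n_j$ is the number of occurrences of the j-th k-mer in the DNA sequence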
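A k-mer frequency vector has one entry per possible k-mer: the count of that k-mer divided by the number of length-k windows, N - k + 1. A minimal Python sketch follows. The function name `kmer_frequencies` and the lexicographic ordering of the k-mers are assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return the frequency of every k-mer over `alphabet`, in lexicographic order.

    Assumes len(seq) >= k and that seq contains only symbols from `alphabet`.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1  # number of overlapping k-mers in the sequence
    for i in range(windows):
        counts[seq[i:i + k]] += 1
    return [counts[kmer] / windows for kmer in kmers]


freqs = kmer_frequencies("AAAGTC", 2)
print(len(freqs))  # 4**2 entries for the ACGT alphabet
print(freqs)
```

The vector length is the alphabet size raised to the power k, so it grows quickly with k for the 4-letter alphabet.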
K-mer frequency mapping rules
4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4
  Disadvantage: feature space growth ~ 4^k
Nucleotide grouping based: SW, KM & RY
  SW: Σ = {S, W}, ||Σ|| = 2
    Strong (C, G) nucleotides: 3 H bonds
    Weak (A, T) nucleotides: 2 H bonds
  RY: Σ = {R, Y}, ||Σ|| = 2
    A and G: purines (R)
    C and T: pyrimidines (Y)
  KM: Σ = {K, M}, ||Σ|| = 2
    A and C: amines (M)
    G and T: ketones (K)
Example: 2-mer frequency mapping
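Mapping rule: ACGT
  Sequence: AAAGTC
  2-mers: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
  Feature vector: (2/5, 0, 1/5, 0, 0, 0, 0, 0, 0, 0, 0, 1/5, 0, 1/5, 0, 0)
Mapping rule: SW
  Sequence: WWWSWS
  2-mers: SS, SW, WS, WW
  Feature vector: (0, 1/5, 2/5, 2/5)
Mapping rule: RY
  Sequence: RRRRYY
  2-mers: RR, RY, YR, YY
  Feature vector: (3/5, 1/5, 0, 1/5)
Mapping rule: KM
  Sequence: MMMKKM
  2-mers: KK, KM, MK, MM
  Feature vector: (1/5, 1/5, 1/5, 2/5)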
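The grouped feature vectors for the example sequence AAAGTC can be checked with a short Python sketch: translate the sequence under each two-letter rule, then count 2-mers. The names `RULES` and `grouped_kmer_frequencies` are hypothetical. Exact fractions are used so the output can be compared directly with values such as 1/5.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Translation tables for the three grouping rules described in the slides.
RULES = {
    "SW": str.maketrans("ACGT", "WSSW"),  # strong: C, G; weak: A, T
    "RY": str.maketrans("ACGT", "RYRY"),  # purines: A, G; pyrimidines: C, T
    "KM": str.maketrans("ACGT", "MMKK"),  # amino: A, C; keto: G, T
}


def grouped_kmer_frequencies(seq, k, rule):
    """Map seq with the given rule and return (mapped sequence, k-mer frequencies)."""
    mapped = seq.translate(RULES[rule])
    windows = len(mapped) - k + 1
    counts = Counter(mapped[i:i + k] for i in range(windows))
    # The rule name doubles as its alphabet, e.g. "RY" -> ['R', 'Y'].
    kmers = ["".join(p) for p in product(sorted(rule), repeat=k)]
    return mapped, [Fraction(counts[kmer], windows) for kmer in kmers]


for rule in ("SW", "RY", "KM"):
    mapped, freqs = grouped_kmer_frequencies("AAAGTC", 2, rule)
    print(rule, mapped, [str(f) for f in freqs])
```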
Case study
Dataset: UCI repository, Genbank 64.1 primate data
  3175 sequences, each (-30 bp, +30 bp) with regard to splice site
Three splice site recognition sub-problems:
  Exon/Intron (EI) vs. Negative (N)
  Intron/Exon (IE) vs. Negative (N)
  Exon/Intron (EI) vs. Intron/Exon (IE)
Three datasets:
  EI vs. N: 767 EI and 1655 N
  IE vs. N: 768 IE and 1655 N
  EI vs. IE: 767 EI and 768 IE
Power series kernel
Accuracy evaluation metric: F-measure
Classification results: Exon/Intron vs. Negative
Classification results: Intron/Exon vs. Negative
Classification results: Intron/Exon vs. Exon/Intron
Classification time
Feature vector size
Intron/exon splice sites, 2422 sequences
Evaluation of results
Classification accuracy:
  Exon/Intron vs. N.: 4-mer ACGT frequency mapping (78.05%)
  Intron/Exon vs. N.: 6-mer ACGT frequency mapping (70.75%)
  E/I vs. I/E: 6-mer ACGT frequency mapping (90.59%)
  4-mers and 6-mers better than 5-mers
  RY always better than SW or KM
Feature space size:
  ACGT k-mer: 4^k
  SW, RY, KM k-mer: 2^k
Classification speed:
  SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification
Why is RY better than SW or KM?
Rule: donor (EI) consensus; acceptor (IE) consensus
  ACGT: (C|A)AG / GT(A|G)AGT; (C|T)_n N (C|T)AG / G
  SW: (S|W)WS / SW(S|W)WSW; (S|W)_n N (S|W)WS / S
  KM: MMK / KK(K|M)MKK; (K|M)_n N (K|M)MK / K
  RY: (R|Y)RR / RYRRRY; Y_n N YRR / R
Acceptor consensus sequence has long runs of pyrimidines (Y)
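The effect of the pyrimidine run can be seen by mapping an acceptor-like sequence under each grouping rule and measuring the longest run of identical symbols. A minimal Python sketch follows. The sequence `acceptor_like` is a made-up instance of the (C|T)_n N (C|T)AG / G pattern, and the helper names are hypothetical.

```python
from itertools import groupby

# Translation tables for the three grouping rules described in the slides.
RULES = {
    "SW": str.maketrans("ACGT", "WSSW"),
    "RY": str.maketrans("ACGT", "RYRY"),
    "KM": str.maketrans("ACGT", "MMKK"),
}


def longest_run(s):
    """Length of the longest run of one repeated symbol in s (0 for an empty string)."""
    return max((len(list(group)) for _, group in groupby(s)), default=0)


# Hypothetical acceptor-like sequence: ten pyrimidines, then A, C, AG, G.
acceptor_like = "CTTCTCCTTCACAGG"

runs = {}
for name, table in RULES.items():
    mapped = acceptor_like.translate(table)
    runs[name] = longest_run(mapped)
    print(name, mapped, runs[name])
```

Only the RY rule collapses the whole C/T tract into a single symbol, which is consistent with the observation that RY performs better than SW or KM.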
Conclusions
  Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results
  Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
  For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping
  Open research problem: “forbidden” k-mers
Questions?
SVM kernel function optimization
  Introduction of additional kernel parameters
  Introduction of new kernels
Power series kernel function:
$K_n(x_i, x_j) = \sum_{k=1}^{n} a_k (x_i^T x_j)^k + c$
Advantage: more parameters for optimization, better separation of classes in feature space
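A power series kernel can be sketched in a few lines of Python. This sketch assumes the form K_n(x_i, x_j) = sum over k = 1..n of a_k (x_i . x_j)^k + c. The function name, the placement of the constant c outside the sum and the coefficient values are assumptions for illustration, not the author's implementation.

```python
def power_series_kernel(x_i, x_j, coeffs, c=0.0):
    """Evaluate sum_{k=1..n} a_k * (x_i . x_j)**k + c, with coeffs = [a_1, ..., a_n].

    Assumption: the constant c is added once, outside the sum.
    """
    dot = sum(a * b for a, b in zip(x_i, x_j))
    return sum(a_k * dot ** k for k, a_k in enumerate(coeffs, start=1)) + c


# With a single coefficient a_1 = 1 and c = 0 this reduces to the linear kernel.
print(power_series_kernel([1, 2], [3, 4], [1.0]))

# Two terms plus a constant (hypothetical coefficients).
print(power_series_kernel([1, 2], [3, 4], [1.0, 0.5], c=1.0))
```

Each coefficient a_k is a separate tunable parameter, which is the extra freedom for optimization that the slide refers to.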
SW k-mer frequency mapping rule
The SW ({A, T} vs. {C, G}) mapping rule
  reflects the difference in the number of hydrogen bonds in the DNA molecule
    Strong (C, G) nucleotides: 3 H bonds
    Weak (A, T) nucleotides: 2 H bonds
  is related to physical-chemical properties of DNA
    transport of electrons
    mechanical waves along the DNA helix
$p_j(\hat{S}) = \frac{n_j}{N - k + 1}$, $\hat{S} \in \{S, W\}^N$, $j = 1, \ldots, 2^k$
RY k-mer frequency mapping rule
The RY ({A, G} vs. {C, T}) mapping rule
  describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence
    A and G: purines (R)
    C and T: pyrimidines (Y)
  corresponds to the chemical composition bias in the DNA strand
$p_j(\hat{S}) = \frac{n_j}{N - k + 1}$, $\hat{S} \in \{R, Y\}^N$, $j = 1, \ldots, 2^k$
KM k-mer mapping rule
The KM ({A, C} vs. {G, T}) mapping rule
  describes how ketones (K) and amines (M) are distributed along the DNA sequence
    A and C: amines (M)
    G and T: ketones (K)
$p_j(\hat{S}) = \frac{n_j}{N - k + 1}$, $\hat{S} \in \{K, M\}^N$, $j = 1, \ldots, 2^k$
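Classification metric
F-measure
$F = \frac{2 \cdot precision \cdot recall}{precision + recall} \cdot 100\%$
$precision = \frac{n_{TP}}{n_{TP} + n_{FN}}$, $recall = \frac{n_{TN}}{n_{TN} + n_{FP}}$
Note: $n_{TP} / (n_{TP} + n_{FN})$ is more commonly called sensitivity (or recall), and $n_{TN} / (n_{TN} + n_{FP})$ specificity.
Advantage: one measure that takes into account both recall and precision; a spectacular score in one does not compensate for a bad score in the other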
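The stated advantage follows from the F-measure being a harmonic mean, which a small Python sketch makes concrete. The function name `f_measure` and the input values are hypothetical. Precision and recall are passed in as fractions between 0 and 1, however they were computed.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, expressed as a percentage."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall) * 100


# A perfect score in one component does not rescue a poor score in the other:
unbalanced = f_measure(1.0, 0.1)
balanced = f_measure(0.55, 0.55)  # same arithmetic mean as the unbalanced case

print(round(unbalanced, 2))
print(round(balanced, 2))
```

Both input pairs have an arithmetic mean of 0.55, but the unbalanced pair scores far lower under the F-measure.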