5/31/2014
Cukurova University ---------------------
University of Aveiro 1
MACHINE LEARNING for
COMPUTATIONAL BIOLOGY Associate Prof. Dr. Turgay Ibrikci
ÇUKUROVA UNIVERSITY
Adana, TURKEY
www.cu.edu.tr
Outline
Who we are
Introduction
Research
– Experiments
– Results and Conclusion
Questions
Our Team
İrem ERSOZ KAYA, PhD. Mersin University,
Technology Faculty, Software Engineering Dept., Mersin, Turkey
Ayça ÇAKMAK PEHLIVANLI, PhD. Mimar Sinan Fine
Arts University, Statistics Dept. Istanbul, Turkey
Mustafa KARABULUT, PhD. Gaziantep University,
Gaziantep, Turkey
Doctorate Students
Esra Mahreseci KARABULUT
Jale BEKTAS
Collaborators
Prof. Dr. Jessica Kissinger, University of
Georgia, USA
Prof. Dr. Okan ERSOY Purdue University,
USA
Prof. Dr. Seyhan TUKEL, Cukurova University,
Turkey
Supporters
Çukurova University Research Fund
The Scientific and Technological Research
Council of Turkey (TÜBİTAK)
– Subdivision – EEEAG- TBAG
Introduction
Machine Learning Methods
– Protein secondary structures
– Disordered protein structures
– Drug design
– Motif finding on DNA
Prediction of ordered/disordered regions of a protein is one of the main problems in drug design.
Ordered/disordered regions affect a protein's functions.
Drug-like selection, potential drug candidates
Statistical and computational learning methods help predict the disordered regions.
What is Bioinformatics?
Bioinformatics is the organization and the analysis of biological data.
Computational
Tool
Biological
Data
Biological
Information
Biologists collect molecular data: DNA & Protein
sequences, gene expression, etc.
Computer scientists (+ mathematicians, statisticians, etc.)
develop tools, software, and algorithms
to store and analyze the data.
Bioinformaticians Study biological
questions by analyzing
biological data
What is Bioinformatics?
[Diagram: Bioinformatics at the intersection of Informatics (computer science, computer engineering, information science), Biology & other natural sciences, and Mathematics & Statistics, together with its ethical, legal, & social implications.]
Bioinformatics is not only biology plus computer science: it is the field where multiple sciences and information technology merge into a single discipline. Bioinformatics concerns...
Prediction
Comparison
Pattern Recognition
Data Modeling
Data Mining
Optimization
Rendering and Display
Doing it all on a computer….
Biological Data
Genes
– DNA sequences of A, T, C, G
– Annotated with function, features
Proteins
– Amino acid sequences: sequences over an alphabet of 20 letters
– Annotated with structure, function etc.
Proteins
A protein consists of a linear chain of amino acids drawn from the twenty naturally occurring ones.
Protein structures are described through four main hierarchical levels: Primary, Secondary, Tertiary, Quaternary.
[Diagram: general amino acid structure — a central carbon bonded to an amino group (N–H, H), a carboxyl group (C'=O, OH), a hydrogen, and a side chain R.]
Levels of Protein Structure

Prediction of Disordered Regions of Proteins with New N-Pieces Naive Bayes Algorithm*
13,250 amino acid chains
The SVM with Gaussian-RBF kernel function
Three classes for secondary structures
Sliding windows
The window size 17
Leave-one-out validation
* U. Orhan, I. Ersoz, T. Ibrikci, (2007), Intelligent Engineering Systems Through Artificial Neural Networks: Smart
System Engineering (ANNIE'07), 17: 43-48, 11-14 Nov. 2007, St. Louis, Missouri, USA.
Dataset
The dataset R80*
The dataset includes 80 proteins taken from the study of Yang et al.
The dataset has two categories
Train and test datasets
* Z. R. Yang, R. Thomson, P. McNeil and R. M. Esnouf. RONN: The Bio-Basis Function Neural
Network Technique Applied to the Detection of Natively Disordered Regions of Proteins. Bioinformatics, 21, 3369-3376, 2005.
Number of chains                                  80
Number of ordered regions                        151
Number of disordered regions                     183
Number of residues in the ordered regions      29909
Number of residues in the disordered regions    3649
Total residues in the dataset                  33558
N-Pieces Naïve Bayes Algorithm
This algorithm can be seen as a pre-classifier: data is first divided into pieces by threshold values.
After the N-Pieces step, classical Naïve Bayes is applied for classification.
The N-Pieces stage is thus an unsupervised step that clusters the data into N pieces; Naïve Bayes then performs supervised classification.
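The two-stage idea above — partition first, then run classical Naïve Bayes per piece — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the piece boundaries here are simple threshold values on the first feature, and the fallback to a global model is an assumption for pieces lacking both classes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def n_pieces_naive_bayes(X, y, X_test, n_pieces=4):
    """Pre-partition samples into N pieces by threshold values on the
    first feature, then train a classical Naive Bayes model per piece."""
    # Threshold values splitting the feature range into n_pieces bins
    edges = np.linspace(X[:, 0].min(), X[:, 0].max(), n_pieces + 1)
    piece = np.clip(np.digitize(X[:, 0], edges[1:-1]), 0, n_pieces - 1)
    piece_test = np.clip(np.digitize(X_test[:, 0], edges[1:-1]), 0, n_pieces - 1)

    preds = np.empty(len(X_test), dtype=y.dtype)
    for p in range(n_pieces):
        train_mask, test_mask = piece == p, piece_test == p
        if not test_mask.any():
            continue
        if train_mask.sum() == 0 or len(np.unique(y[train_mask])) < 2:
            # Fallback (an assumption): use a global model for pieces
            # that lack training samples of both classes
            preds[test_mask] = GaussianNB().fit(X, y).predict(X_test[test_mask])
        else:
            preds[test_mask] = GaussianNB().fit(
                X[train_mask], y[train_mask]).predict(X_test[test_mask])
    return preds
```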
Experiments
Two classes: ordered/disordered regions
Normalization of the dataset
Sliding windows
Controlling the outputs with five different measures
Eight different N values
N is searched in the range 101–201.
Results for the N values - II
The best N values for the N-Pieces algorithm are found to be 101, 133, and 201.
Conclusions
We presented a new Naive Bayesian Learning algorithm, called NP-NBL.
The algorithm was applied to the prediction of disordered regions of proteins.
Partition values of 101, 133, and 201 are optimal with respect to specificity and correct classification rate.
However, the optimal N value depends on the individual application.
COMPUTATIONAL PREDICTION OF DISORDERED REGIONS IN PROTEINS – PhD Thesis
Structure-Function Paradigm:
– The 3D structure of a protein is a prerequisite for its biological function.
– The tenet arose more than 100 years ago with Fischer's "lock and key" model.
– Generally, the loss of function is associated with the lack of a specific 3D structure.
DNA → mRNA → AA sequence → 3D structure → Function
Objective
Purpose?
– Developing an accurate computational method that can provide information about the structural class (ordered vs. disordered) of proteins, based on different biochemical and physical features of amino acids.
Why?
– Accurate prediction of disorder is a demanding problem, due to its importance for the structural and functional identification of proteins.
Protein Folding Problem
Structural Prediction
– The 3D structure ('the fold') is uniquely determined by the sequence.
– This breakthrough brought Anfinsen the 1972 Nobel Prize.
– Experimental and computational methods have been applied to this challenge, widely known as "The Protein Folding Problem".
Protein Non-Folding Problem
Intrinsically Disordered Proteins (IDP)
– Studies on structural genomics indicate that numerous protein segments remain unfolded in their native states.
– Contrary to the structure-function paradigm, regions that fail to fold into a fixed 3D structure can still exhibit function.
– Such proteins are generally referred to as "natively unfolded" or "intrinsically disordered".
– Intrinsically disordered proteins are also involved in several diseases, such as Alzheimer's disease, Parkinson's disease, and certain types of cancer.
– Structural identification of these proteins has been named "The Protein Non-Folding Problem".
An Example: Calcineurin
Computational Prediction
Alternative to experimental methods
Disorder prediction based on amino acid sequence
Structural properties of amino acids can be used
as a discriminator for characterizing disorder
– Flexibility, amino acid frequency, complexity, charge, and
secondary structure, etc.
The most widely used machine learning techniques are Artificial Neural Networks and Support Vector Machines.
Computational Tools for Disorder Prediction

NAME       YEAR  ACCURACY  COMPUTATIONAL METHOD                  BASED ON
PONDR      2001  70%       Feed-Forward Neural Networks          Physicochemical properties, frequency
DisEMBL    2003  64%       Feed-Forward Neural Networks          Parameters on different definitions of disorder
DISOPRED2  2004  93%       Linear Support Vector Machines        Position-specific score matrices
FoldIndex  2005  83%       Low hydrophobicity / high net charge  Hydrophobicity, net charge
RONN       2005  85%       Regional Order Neural Network         Homology alignment scores
PSSMP      2006  80%       Radial Basis Function Neural Network  Condensed position-specific score matrices
Data and Representation
Main Data Set (Training and Testing)
– 80 completely ordered proteins
– 79 completely disordered proteins
Blind Testing Set : 80 partially disordered proteins
Windowing
– Sliding windows technique
– 21 residue length window
– Each window is an input pattern
– The input pattern is labelled with the class of the central amino acid within window (‘1’ for disorder, ‘0’ for order)
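The windowing scheme above — each window is one pattern, labelled by the class of its central residue — can be sketched directly. The function name is illustrative, not from the slides:

```python
def sliding_windows(sequence, labels, w=21):
    """Extract fixed-length windows over a protein sequence; each window
    is one input pattern labelled by the class of its central residue
    ('1' for disorder, '0' for order), as described above."""
    half = w // 2
    patterns = []
    for i in range(half, len(sequence) - half):
        window = sequence[i - half:i + half + 1]
        patterns.append((window, labels[i]))
    return patterns
```

A sequence of length N thus yields N − w + 1 patterns of width w.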
Knowledge Presentation
– Average information within window
– Each feature is represented by only one attribute in a pattern
>1B8Z
MNKKELIDRVAKKAGAKKKDVKLILDTILETITEALAKGEKVQIVGFGSF
EVRKAAARKGVNPQTRKPITIPERKVPKFKPGKALKEKVK
PSI-BLAST produces a PSSM for the sequence: one row per residue, one column per each of the 20 amino acids. The pattern for the i-th residue is obtained by averaging the row values within the window w centred on residue i, column by column.

[PSSM excerpt: each row holds 20 substitution scores for one residue, e.g. -4 -4 -5 -6 -4 -3 -5 -6 -5 -1 2 -4 10 -3 -5 -4 -3 -4 -4 0]
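The column-wise window averaging of PSSM rows described above can be written compactly with NumPy. This is a sketch of the representation step; the function name is illustrative:

```python
import numpy as np

def pssm_window_features(pssm, w=21):
    """Average PSSM rows within a sliding window: residue i is
    represented by the column-wise mean of the w rows centred on i,
    so each pattern has one attribute per amino acid column."""
    half = w // 2
    n = pssm.shape[0]
    feats = []
    for i in range(half, n - half):
        feats.append(pssm[i - half:i + half + 1].mean(axis=0))
    return np.array(feats)
```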
Training
The main data set was divided into 6 roughly equal subsets.
Each subset includes a balanced number of disordered and ordered residues.
1 subset was used for validation.
5 subsets were used for training via 5-fold cross-validation.
5 repetitions of the training/testing procedure were run.
Performance Assessment

Accuracy (Acc)    = (TP + TN) / (TP + FP + TN + FN)
Sensitivity (Sens) = TP / (TP + FN)
Specificity (Spec) = TN / (TN + FP)
Matthews Correlation Coefficient (Mcc)
                  = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Probability Excess (ProbEx)
                  = (TP·TN − FP·FN) / ((TP+FN)·(TN+FP))
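The five measures used for assessment can be computed from the confusion matrix counts:

```python
from math import sqrt

def assess(tp, fp, tn, fn):
    """Compute the performance measures used in this work from
    true/false positive/negative counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    probex = (tp * tn - fp * fn) / ((tp + fn) * (tn + fp))
    return {"Acc": acc, "Sens": sens, "Spec": spec,
            "Mcc": mcc, "ProbEx": probex}
```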
Methods
Border Vectors Detection and Adaptation
Border Vectors Detection and Extended
Adaptation
Border Vectors Detection and Adaptation (BVDA)*
Based on the detection of so-called border vectors, adaptation of the vectors by adjusting their positions for wrongly classified input patterns, and addition of new vectors during training.
– Detection:
* IEEE TRAN. ON GEOSCIENCE AND REMOTE SENSING, VOL. 45, NO. 12, 3880-3893 DEC. 2007
For each training pattern x_i, the nearest pattern of the opposite class,

  argmin_{k : y_k ≠ y_i} d(x_i, x_k),

is marked as a border vector; the set of all such patterns forms the initial border vector set B_0. During training, the winning border vector b_w for an input pattern x_j is the one minimizing the distance:

  w = argmin_m D(x_j, b_m)
Border Vectors Detection and Adaptation (BVDA)
– Adaptation: when an input pattern x_j is wrongly classified, the nearest border vector of the correct class, b_w, is moved toward the pattern and the nearest border vector of the wrong class, b_l, is moved away from it:

  b_w(t+1) = b_w(t) + η(t)·(x_j − b_w(t))
  b_l(t+1) = b_l(t) − η(t)·(x_j − b_l(t))

The corresponding class means are updated incrementally,

  m_j(t+1) = (μ_j(t)·m_j(t) + x_j) / (μ_j(t) + 1),

where μ_j counts the patterns assigned so far, and new border vectors are added to B = B_0 ∪ {(b_j, y_j)} during training; M_0 = {(m_1, y_1), (m_2, y_2), …, (m_m, y_m)} denotes the set of class means.
Border Vectors Detection and Extended Adaptation (BVDEA)*
Based on the detection of so-called border vectors and adaptation of the vectors by adjusting their positions for all input patterns.
– Detection: applied in the way defined by BVDA.
– Adaptation: at least one border vector is ensured to be adapted in each repetition.
*İ. Ersöz Kaya, T. Ibrikci, O. K Ersoy, (2011), “Prediction of Disorder with New Computational Tool: BVDEA”,
Expert Systems with Applications. 38(12): 14451-14459. (ISI), DOI: 10.1016/j.eswa.2011.04.160
  b_w(t+1) = b_w(t) + η(t)·(x_j − b_w(t))
  b_l(t+1) = b_l(t) − η(t)·(x_j − b_l(t))

where b_w is the nearest border vector of the pattern's own class and b_l the nearest border vector of the opposite class.
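A minimal sketch of one adaptation step in the spirit of BVDEA, assuming LVQ-style pull/push updates (the function name and learning-rate value are illustrative, not taken from the cited paper):

```python
import numpy as np

def adapt_border_vectors(x, y, borders, labels, eta=0.05):
    """One adaptation step: the nearest border vector of the pattern's
    own class is pulled toward the pattern, and the nearest border
    vector of the opposite class is pushed away from it."""
    d = np.linalg.norm(borders - x, axis=1)
    same, other = labels == y, labels != y
    w = np.where(same)[0][np.argmin(d[same])]    # winner of own class
    l = np.where(other)[0][np.argmin(d[other])]  # nearest opposite-class vector
    borders[w] += eta * (x - borders[w])         # pull toward pattern
    borders[l] -= eta * (x - borders[l])         # push away from pattern
    return borders
```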
Evaluations on Parameters
BVDEA
– Learning rate and decay constant
– Stopping levels for rate-decay pairs
BVDA
– Learning rate and decay constant
– Stopping levels for rate-decay pairs
GRNN
– Sigma
LVQ
– Learning rate
– Number of codebook vectors (nC)
Main Testing Results
Methods Sens Spec Acc Mcc ProbEx
BVDEA 0.7964 0.7850 0.7907 0.5858 0.5814
LVQ 0.6981 0.8707 0.7844 0.5818 0.5688
GRNN 0.7263 0.8345 0.7804 0.5714 0.5608
BVDA 0.7309 0.7506 0.7408 0.4878 0.4815
Comparison on Main Testing
[Charts comparing BVDEA with DisPro, DISOPRED2, and DisPSSMP on the main testing set.]
Conclusions
Many intrinsically disordered proteins play key roles in vital functions and also in some diseases.
Identification of disordered regions is a demanding process for the structure prediction and functional characterization of proteins.
BVDEA provides more accurate, fast, and robust learning compared to the other methods (GRNN, LVQ, and BVDA).
As evident from the comparison with existing tools, BVDEA can be suggested as an effective method for accurate prediction of disordered regions of proteins, without either under-predicting or over-predicting the disorder.
The new method provides a significant contribution to predicting disordered and ordered regions of proteins.
Support Vector Machine
SVMs attempt to find a hyperplane as the decision surface such that the margin of separation between positive and negative examples is maximized (Vapnik, 1995).
[Figure: two candidate decision boundaries B1 and B2 with their margin boundaries b11/b12 and b21/b22; the hyperplane that maximizes the margin is sought, so B1 is better than B2.]
The original input space can always be mapped to some higher-dimensional feature space where the training set is separable
Φ: x → φ(x)
Non-Linear SVMs
Quadratic Optimization Problem
Minimize

  (1/2)·||w||² + C·Σ_{i=1}^{n} ξ_i

subject to

  w·x_i + b ≥ +1 − ξ_i  if y_i = +1
  w·x_i + b ≤ −1 + ξ_i  if y_i = −1
  ξ_i ≥ 0

Solution: the Lagrangian

  L(w, b, ξ) = (1/2)·||w||² + C·Σ_{i=1}^{n} ξ_i − Σ_{i=1}^{n} λ_i·[y_i(w·x_i + b) − 1 + ξ_i] − Σ_{i=1}^{n} μ_i·ξ_i

is minimized with respect to w, b, ξ and maximized with respect to the multipliers λ_i, μ_i ≥ 0.
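A generic soft-margin SVM with an RBF kernel of the kind used above can be trained in a few lines with scikit-learn. The data here is synthetic and for illustration only; the parameter values are library defaults, not the thesis settings:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data with a simple linear boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Soft-margin SVM with a Gaussian (RBF) kernel
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:150], y[:150])
acc = clf.score(X[150:], y[150:])
```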
Performance of Feature Selection

         SVM_COD159  SVM_ERCOD159
Sens     0.819       0.831
Spec     0.638       0.662
Acc      0.641       0.664
Mcc      0.854       0.862
ProbEx   0.786       0.800

37 attributes were selected by ERGS.
Prediction success increased with the reduced data, ERCOD159.
Conclusion
SVM was used with the modeled dataset, which was constructed according to several physicochemical properties, evolutionary knowledge, and amino acid compositions.
SVM_COD159 provides more accurate and robust learning compared to eleven common tools, without either under-predicting or over-predicting the disorder.
The most informative features for separating disordered/ordered regions in proteins were determined using the ERGS method.
SVM_ERCOD159 provides a significant contribution to predicting disordered and ordered regions of proteins.
CONSENSUAL CLASSIFICATION OF DRUG/NONDRUG COMPOUNDS FOR DRUG DESIGN – PhD Thesis
• Drug-like selection
• Potential drug candidates
• Reducing time and cost
In Vivo → In Vitro → In Silico
Chemoinformatics & Bioinformatics
          Chemoinformatics                          Bioinformatics
Data      Chemical data (small molecules)           Enzymes, genes, proteins, etc.
Since     1960s                                     1990s
Access    User-pay, limited public access           Web-based, open-access model
Funding   Large companies (MDL, Beilstein,          Large government agencies (NCBI,
          Sigma, CAS)                               EBI, NIH, GC)
Molecular Informatics
• Contains functional groups
• Similar physical properties to known drugs
• ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity)
• Fail fast, fail cheap
• "Filters", i.e. "criteria" or "rules"
Druglike Concept
1998
– Ajay et al., 80% accuracy
– Sadowski & Kubinyi, 77% drug, 83% nondrug
2000
– Wagener et al., 17.4% error
– Frimurer et al., 88.0%
2001
– Muegge et al., 83.7% drug, 75% nondrug
2003
– Murcia-Soler et al., 76.36% drug, 70.15% nondrug
– Byvatov et al., 82% SVM, 80% ANN
2007
– Li et al., 92.73%
– Hutter et al., 71%
Related Works
Data set

Therapeutic categories   Murcia-Soler (2003)   Cherkasov (2007)
Drugs                    416                   1482
Nondrugs                 225                   1202
Total                    641                   2684
Adaptive General Regression Neural Network (adGRNN)
A separate smoothing parameter σ_j for each descriptor:

  Ŷ(X) = [ Σ_{i=1}^{n} Y^i · exp( −(1/2)·Σ_{j=1}^{d} ((x_j − x_j^i)/σ_j)² ) ]
         / [ Σ_{i=1}^{n} exp( −(1/2)·Σ_{j=1}^{d} ((x_j − x_j^i)/σ_j)² ) ]
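The adGRNN prediction formula above is a kernel-weighted average over the training patterns, with per-descriptor smoothing. A minimal sketch (function name illustrative):

```python
import numpy as np

def adgrnn_predict(X_train, y_train, x, sigma):
    """GRNN prediction with a separate smoothing parameter per
    descriptor, following the formula above: a Gaussian-weighted
    average of the training outputs."""
    # Squared, per-descriptor-scaled distances to every training pattern
    d2 = (((X_train - x) / sigma) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2)
    return (w * y_train).sum() / w.sum()
```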
Experiment’s Outline
Murcia-Soler – Modal (a) 61 descriptors
– Modal (b) All 2D MOE descriptors
– Modal (c) Principle Components
60 PCs (full)
103 PCs (within groups)
Cherkasov – Modal with 2 classes All 2D MOE descriptors
– Modal with 3 classes All 2D MOE descriptors
[Pipeline diagram:
raw data → normalization (z-scores) → normalized data → pre-preprocessing unit → K transformed data sets → Classifier1 … ClassifierK → output1 … outputK → consensual result, obtained by one of:
– Genetic Algorithm
– Pseudoinverse
– Equal weights
Preprocessing unit:
1. Transform descriptor vectors with a random uniform matrix
2. 1-D median filtering]
Results on Murcia-Soler's Dataset
• Original study of Murcia-Soler et al.: feedforward network, 76% drugs and 70% nondrugs
• Best individual results: GRNN 82.37%, adGRNN 82.99%, and SOGR 81.90%
Results of Murcia-Soler’s Dataset Results of Cherkasov’s Dataset
Individual SOGR Results of 3-Class Cherkasov
[Chart: accuracy (%) of 20 individual classifiers, ranging roughly 60–80%, for antimicrobials, drugs, and drug-likes.]
Results of Cherkasov’s Dataset
Model with 3
Classes
Antimicrobials... 1
Drugs…………..2
Druglikes……… 3
Contribute automation tools to the drug designers at
the preprocessing stage of the design process
A novel consensual classification approach
Genetic Algorithm was used first time for combining
SOGR and also consensus approach was applied first
time to chemical data
Conclusions
PART II
DNA MOTIFS – PhD Thesis

DNA Motif Discovery (Pattern Recognition in DNA Sequences)

Outline
Introduction
Biological background of DNA motifs
Motif finding
– Existing algorithms
– Motif representation
Machine learning / Clustering for motif-finding
First clustering implementation impressions
Further ideas – Improving first implementation – Utilizing optimization techniques to support motif-finding
Particle Swarm Optimization (PSO)
Genetic Algorithm (GA)
Conclusion
Introduction & Motivation
Details of the transcription process have received considerable attention since the process is an important step in protein synthesis which is vital for life.
Identification of transcription binding sites, which are essentially short DNA subsequences with variability, has been one of the major and challenging tasks for bioinformaticians.
Introduction & Motivation
These genomic patterns (motifs) reside on very
long DNA sequences, which makes the task
intractable for traditional computational
methods.
Motifs are very short (generally 6-20
nucleotides long) and may have variations as a
result of mutations, insertions and deletions.
Introduction & Motivation
Many algorithms have been proposed in the literature,
but none of them has reached perfect accuracy yet!
Thus, research over motif discovery is still an
important task.
Biological background
The transcription of each gene is controlled by
a regulatory region of DNA relatively near the
transcription start site (TSS).
There are two fundamental types of components:
– short DNA regulatory elements (motifs)
– gene regulatory proteins that recognize and bind to
them.
Gene Regulatory Element, TF binding site, TF
binding motif, cis-regulatory motif (element)
Regulation of Genes
[Diagrams: a transcription factor (TF, a protein) binds a regulatory element near the promoter of a gene; RNA polymerase (also a protein) binds the promoter and transcribes the gene, yielding a new protein.]

Biological background
WHAT IS A MOTIF?
A subsequence (substring) that occurs in multiple sequences with a biological importance.
Motifs can be fully conserved or contain variable elements.
DNA motifs (regulatory elements):
– Binding sites for proteins
– Short sequences (5-25 bp)
– Up to 1000 bp (or farther) from the gene
– Inexactly repeating (overrepresented) patterns
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
Each sequence above contains a transcription factor (TF) binding site. The aligned sites (positions 1-9):
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC
Transcription Factor Binding Sites (TFBS)
DNA motif representation with consensus sequence:
TGGGTGGTN (Consensus sequence)
Biological background
DNA motif representation with a Position
Weight Matrix (PWM):

  PWM = log( P(S | PFM) / P(S | B) )
P S B
A: 0.25
T: 0.25
G: 0.25
C: 0.25
A: 0.25
T: 0.25
G: 0.25
C: 0.25
Background DNA (B)
.2
.2
.5
.1
.7 .2 .2 .1 .3
.1 .2 .4 .5 .4
.1 .2 .2 .2 .2
.1 .4 .1 .2 .1 A
C
G
T -0.3
-0.3
1
-1.3
1.4 -0.3 -0.3 -1.3 0.3
-1.3 -0.3 0.6 1 0.6
-1.3 -0.3 0.3 -0.3 -0.3
-1.3 0.6 -1.3 -0.3 -1.3 A
C
G
T
Position Weight
Matrix (PWM) PFM
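The PFM-to-PWM conversion shown in the slides — log-odds against a uniform 0.25-per-base background — is one NumPy expression:

```python
import numpy as np

# Position frequency matrix: one row per base (A, C, G, T),
# one column per motif position
pfm = np.array([
    [.2, .7, .2, .2, .1, .3],   # A
    [.2, .1, .2, .4, .5, .4],   # C
    [.5, .1, .2, .2, .2, .2],   # G
    [.1, .1, .4, .1, .2, .1],   # T
])

# Log-odds PWM against a uniform background of 0.25 per base
pwm = np.log2(pfm / 0.25)
```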
DNA motif representation with a sequence logo:
A logo represents both base frequency and conservation at each position:
– the height of a letter is proportional to the frequency of that base at the position
– the height of the stack is proportional to the conservation at that position
Motif finding
The Problem
– Given a collection of genes with common expression (co-
expressed)
– Find the TF-binding sites (motifs)
Difficulties
– Motif pattern is unknown
– Motif locations are unknown
– Motifs can differ slightly from one gene to another
– How to discern it from “random” motifs?
Motif finding
Existing algorithms generally fall into two
categories
– Probabilistic
MEME(1995), AlignACE(1999)
Try to optimize a Position Weight Matrix (PWM)
Fast
– Word enumerative exhaustive search
YMF(2000), Weeder(2004)
Slower but more reliable
Motif finding
Challenges that cause low prediction performance
– Low signal-to-noise (background) ratio
– Very long DNA sequences
– Low signal strength (weak conservation)
Solutions
– Using more biological information (e.g., phylogenetic footprints)
– Hybrid combinations of existing algorithms
– Ensembles of existing algorithms
– Developing new algorithms (using machine learning methods, evolutionary algorithms such as GA, PSO, etc.)
Machine Learning/Clustering for motif-finding
Machine learning
– A relatively new, promising direction
– Mahony et al. 2006: Self-Organizing Maps of Position Weight Matrices
– Clustering as a local alignment tool
Self-Organizing map
Fuzzy C-Means
K-Means
Gaussian Mixture Models
Machine Learning/Clustering for motif-finding
Clustering as a motif finding tool (The Steps of the process)
– Extracting inputs with sliding-windows
– Scoring inputs against Position weight matrices
– Associating inputs with clusters (PWMs) and
generating several local alignments
– Selecting statistically most significant alignments as
motif candidates
Machine Learning/Clustering for motif-finding
Extracting inputs with sliding windows
A sequence of length N yields N − l + 1 windows of length l, i.e., the inputs to the algorithm.
Machine Learning/Clustering for motif-finding
Scoring inputs against Position weight matrices
(1)
– PWMs are initialized randomly at the beginning
– PWMs are recalculated at each iteration of the
algorithm
Machine Learning/Clustering for motif-finding
Scoring inputs against Position weight matrices (2)
Position:   1    2    3    4    5    6    7    8    9
A:        -10  -10  -14  -12  -10    5   -2  -10   -6
C:          5  -10  -13  -13   -7  -15  -13    3   -4
G:         -3  -14  -13  -11    5  -12  -13    2   -7
T:         -5    5    5    5  -10   -9    5  -11    5

Input seq 1:  C T T T G A T C T  →  5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43
Input seq 2:  A C G T A C G T A  →  -10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83
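The scoring scheme in the worked example — sum, position by position, the matrix entry for the observed base — is a few lines of code (the matrix values are those of the example above):

```python
import numpy as np

BASES = "ACGT"

def score(seq, pwm):
    """Score a candidate window against a PWM by summing, for each
    position i, the matrix entry for the base observed at i."""
    return sum(pwm[BASES.index(b), i] for i, b in enumerate(seq))

pwm = np.array([
    [-10, -10, -14, -12, -10,   5,  -2, -10,  -6],  # A
    [  5, -10, -13, -13,  -7, -15, -13,   3,  -4],  # C
    [ -3, -14, -13, -11,   5, -12, -13,   2,  -7],  # G
    [ -5,   5,   5,   5, -10,  -9,   5, -11,   5],  # T
])
```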
Machine Learning/Clustering for motif-finding
Associating inputs with clusters (PWMs) and
generating several local alignments
(a) a set of associated sequences, (b) the
probability matrix, (c) the background
model and (d) the resultant PWM.
Machine Learning/Clustering for motif-finding
Selecting statistically most significant
alignments as motif candidates
– Rank all PWMs by Z-score:

  z = (O − E) / σ

O = the number of associated subsequences
E = the expected number of subsequences that coincide with the node by chance
σ = the standard deviation of the coincidence count
First clustering implementation impressions
– Karabulut, M., İbrikci, T., “Fuzzy C-Means Based DNA
Motif Discovery”, Lecture Notes in Computer Science,
Vol. 5226, pp 189-195 (2008)
– İbrikci, T., Karabulut, M., “Employing Fuzzy C-Means For
DNA Transcription Factor Binding Site Identification”,
Journal of Circuits, Systems and Computers, Vol 19-1,
pp 15-30 (2010)
First clustering implementation impressions
Some modifications are required to utilize FCM (Fuzzy C-Means) for the task.
First, the classical FCM algorithm runs in two main steps:
– Membership calculation
– Cluster update
Membership calculation:

  u_ij = 1 / Σ_{k=1}^{M} [ d(x_j, c_i) / d(x_j, c_k) ]^{2/(q−1)}

Cluster update:

  c_i = Σ_{j=1}^{N} u_ij^q · x_j / Σ_{j=1}^{N} u_ij^q
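The classical two-step FCM loop above can be sketched with Euclidean distance (as noted below, the DNA application replaces this distance with a PWM scoring function). The initialization strategy here is an assumption for illustration:

```python
import numpy as np

def fcm(X, M=2, q=2.0, iters=50):
    """Classical Fuzzy C-Means: alternate the membership calculation
    and the cluster update, as in the two formulas above."""
    # Spread the initial cluster centres across the dataset (an assumption)
    C = X[np.linspace(0, len(X) - 1, M).astype(int)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = d ** (-2.0 / (q - 1.0))
        U /= U.sum(axis=1, keepdims=True)        # memberships u_ij
        W = U ** q
        C = (W.T @ X) / W.sum(axis=0)[:, None]   # cluster update c_i
    return U, C
```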
First clustering implementation impressions
Classical FCM generally uses Euclidean distance
Euclidean distance does not fit DNA Sequence-PWM scoring
A scoring function similar to the previously mentioned scoring
scheme should be used:
  D(x, m) = 1 / Σ_{i=1}^{l} Σ_{c∈{A,C,G,T}} e(x_i, c) · m_{i,c}

  e(x_i, c) = 1 if x_i = c, 0 otherwise

x = DNA sequence, m = PWM, l = length
(i.e., the distance is the reciprocal of the PWM match score of the sequence)
First clustering implementation impressions
Additionally, updating clusters with the inputs is not a
straightforward task, either.
The x terms in the cluster update formula should be replaced
with the following R(x,c) function:
  R(x, c)_{i,b} = eq(x_i, b),  b ∈ {A, C, G, T},  i = 1, …, l

  eq(c1, c2) = 1 if c1 = c2, 0 otherwise

x = DNA sequence, c = PWM (cluster); each sequence is thus converted to its indicator (one-hot) matrix before entering the cluster update.
First clustering implementation impressions
A sample dataset: genomic sequences from
the organism Saccharomyces cerevisiae
First clustering implementation impressions
Performance Measures
First clustering implementation impressions
Comparison of performances of FCM, MEME and MDScan in
terms of Mathews Correlation Coefficient (MCC)
First clustering implementation impressions
Visual results, and Sequence logo format
First clustering implementation impressions
FCM implementation conclusions
– Machine learning, especially clustering techniques, presents a promising direction for DNA motif discovery
– FCM has the potential to outperform the well-known MEME and MDscan
– Results are good on lower organisms, but the algorithm is not proven on higher organisms, in which many motif discovery algorithms also fail
– No perfect accuracy yet
Further ideas
How to improve the first clustering based
implementation (FCM) ?
– Utilizing some other clustering algorithms instead of
FCM
Self-Organizing Map (SOM)
K-Means
Gaussian Mixture Models / Expectation Maximization
– To take advantage of different clustering algorithms
for different datasets
Overall performances of the algorithms
Sequence logos of the known motifs and the predicted ones

Further ideas
How to improve the first clustering based
implementation (FCM) ?
– Another alternative approach:
Post-processing clustering results with an optimizer
– Methods
Particle Swarm Optimization (PSO)
Genetic Algorithm (GA)
– Problems:
A proper fitness function
Is "information content" sufficient?
Further ideas
How to improve the first clustering based implementation (FCM) ? – An alternative approach:
Developing ensemble methods from existing clustering algorithms
Ensemble methods are proven to be effective for motif-finding task. Out of many relevant papers:
– Yanover et al, M are better than one: an ensemble-based motif finder and its application to regulatory element prediction (2009)
– Yang and Kihara, EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences (2006)
– Romer et al, WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches (2007)
Further ideas
How to improve the first clustering-based implementation (FCM)?
– Another alternative approach: post-processing clustering results with an optimizer
– Methods: Particle Swarm Optimization (PSO), Genetic Algorithm (GA)
– Combining learning with an optimizer
Particle Swarm Optimization (PSO)
PSO is a robust stochastic optimization technique based on the movement and intelligence of swarms.
PSO applies the concept of social interaction to problem solving.
It was developed in 1995 by James Kennedy (social-psychologist) and Russell Eberhart (electrical engineer).
It uses a number of agents (particles) that constitute a swarm moving around in the search space looking for the best solution.
Each particle is treated as a point in a N-dimensional space which adjusts its “flying” according to its own flying experience as well as the flying experience of other particles.
• Each particle keeps track of the coordinates in the solution space associated with the best solution (fitness) it has achieved so far. This value is called the personal best, pbest.
• Another best value tracked by PSO is the best value obtained so far by any particle in the neighborhood of that particle. This value is called gbest.
• The basic concept of PSO lies in accelerating each particle toward its pbest and gbest locations with a random weighted acceleration at each time step, as shown in Fig. 1.
Particle Swarm Optimization (PSO)
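The pbest/gbest acceleration described above corresponds to the standard PSO velocity and position update. The inertia and acceleration coefficients below are common defaults, not values from the slides:

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO update: accelerate the particle toward its personal
    best (pbest) and the global best (gbest) with random weights,
    then move it by the new velocity."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```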
Fig. 1: Concept of modification of a searching point by PSO
s_k: current searching point
s_{k+1}: modified searching point
v_k: current velocity
v_{k+1}: modified velocity
v_pbest: velocity based on pbest
v_gbest: velocity based on gbest
[Diagram: in the x-y plane, v_{k+1} combines v_k, v_pbest, and v_gbest, and s_{k+1} = s_k + v_{k+1}.]

Particle Swarm Optimization (PSO)

PSO-Motif finding
In the literature, PSO has been used for motif finding
– as an optimization post-processing operation
– as a standalone application
– Sample papers:
Particle Swarm Optimisation for Protein Motif Discovery. In: Genetic Programming and Evolvable Machines, 5(2):203-214 [2004]
Improved Hidden Markov Model training for multiple sequence alignment by a particle swarm optimization-evolutionary algorithm hybrid. In: Biosystems, 72(1-2):5-17 [2003]
Identification of Transcription Factor Binding Sites Using Hybrid Particle Swarm Optimization. In: Proc. 10th International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing (RSFDGrC 2005), Volume 3642 of LNCS, pp. 438-445 [2005]
Performance comparison of motif-finding tools on synthetic datasets:

Motif width  Cons.  Dataset length  PSO (Best)*  PSO (B.Ring)  GAME  MEME  BioPros.
Short        High   Small           0.80         0.80          0.75  0.85  0.78
Short        Low    Small           0.54         0.50          0.30  0.39  0.39
Short        High   Large           0.84         0.84          0.83  0.83  0.76
Short        Low    Large           0.46         0.46          0.36  0.42  0.45
Long         High   Small           0.96         0.96          0.97  0.98  0.97
Long         Low    Small           0.85         0.85          0.82  0.88  0.83
Long         High   Large           0.98         0.98          0.98  0.98  0.96
Long         Low    Large           0.90         0.90          0.90  0.90  0.80
AVERAGE                             0.82         0.79          0.79  0.78  0.74

[Chart: performance of PSO per number of particles.]
Conclusion
Machine learning methods, specifically clustering algorithms, are efficient means for DNA motif discovery.
Clustering algorithms should be studied further to improve standalone DNA motif-finding performance.
Ensemble methods have proven effective for the DNA motif-finding task.
Optimizers such as PSO and GA may help machine learning methods
– as a post-processor
– combined for learning
THANKS
Çukurova University Research Fund
TUBITAK – EEEAG- TBAG
for supporting the collaborations
ERASMUS
ICGEB - International Centre for Genetic Engineering and Biotechnology