Two Classifiers for Bioinformatics. Course: CSI7162. Presented to Dr. Stan Matwin. Presented by Jun Ouyang. Road map: Introduction; Discovery of regulatory connections in microarray data (gene level); Classifying protein fingerprints (protein level); Conclusion; Research challenges.
1
Two Classifiers for Bioinformatics
Course: CSI7162
Presented to Dr. Stan Matwin
Presented by Jun Ouyang
2
Road Map
Introduction
Discovery of regulatory connections in microarray data (gene level)
Classifying protein fingerprints (protein level)
Conclusion
Research challenges
3
Introduction to Bioinformatics
An emerging interdisciplinary research area
Interface between biological and computational sciences
Computational management of all kinds of biological information
Research scope of bioinformatics (hierarchical biological information):
    DNA -> mRNA -> protein -> protein interactions -> informational pathways -> informational networks -> cells -> tissues or networks of cells -> an organism -> populations -> ecologies
www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
4
Introduction to Bioinformatics
DNA level:
    DNA sequence alignment; gene prediction; gene evolution; …
RNA level:
    Study of gene expression; transcription mechanism; post-transcription modification; …
Protein level:
    Protein 2D and 3D structure prediction; protein active site prediction; protein-protein interactions; protein-DNA interactions; …
System level (pathways, networks):
    Genome (gene-to-gene interactions)
        Ex: use gene chips to study gene regulatory networks
    Proteome (protein-protein interactions)
        Ex: use protein chips to study protein interaction networks
www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
5
Discovery of Regulatory Connections in Microarray Data (Gene level)
What is a microarray?
Obtaining microarray data
Definition of regulatory relations between sets of genes (class labels)
Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm to learn dynamic Bayesian networks
Use dynamic Bayesian network classifiers to predict regulatory relations
Conclusion (1)
6
What is a Microarray?
A kind of gene chip used to discover gene function or gene expression patterns
Allows these patterns to be studied in parallel
Example:
    Colour indicates the relative abundance of a labeled cDNA, meaning the gene has been activated
    At each location, a known probe (cDNA) is combined with cDNA from a certain sample
    For example, cDNA from cancerous and healthy cells is combined with different probes (known strands of cDNA)
7
Obtaining Microarray Data
What are the steps?
[1] Choose cell population (or sample for diagnosis)
[2] Extract and purify mRNA
[3] Label cDNA fluorescently
[4] Combine different strands of cDNA on the microarray
[5] Scan data over time
[6] Interpret time series to study gene regulation over time
8
Regulatory Relations between Sets of Genes
Goal of this work: identify which genes regulate each other
We are interested in two types of gene regulation:
Co-regulation
    Two genes are perfectly co-regulated when their relative abundance functions w.r.t. time have the same first-order derivatives
Control-regulation
    Two genes are inversely co-regulated, or control-regulated, when their relative abundance functions w.r.t. time have first-order derivatives which are inverses of one another
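The two definitions can be illustrated with a minimal sketch. It rests on assumptions that are not in the slides: derivatives are approximated by first differences of equally spaced samples, "inverses" is read as equal magnitude with opposite sign, and the helper names and the tolerance `tol` are hypothetical.

```python
def first_differences(series):
    """Discrete stand-in for the first-order derivative of a time series."""
    return [b - a for a, b in zip(series, series[1:])]


def regulation_type(gene_a, gene_b, tol=1e-9):
    """Classify a gene pair by comparing first differences (hypothetical helper)."""
    da = first_differences(gene_a)
    db = first_differences(gene_b)
    # Same first differences at every time step: perfectly co-regulated.
    if all(abs(x - y) <= tol for x, y in zip(da, db)):
        return "co-regulated"
    # Assumption: "inverse" derivatives means equal magnitude, opposite sign.
    if all(abs(x + y) <= tol for x, y in zip(da, db)):
        return "control-regulated"
    return "neither"


# Toy relative-abundance series, invented for illustration only.
print(regulation_type([1.0, 2.0, 4.0, 3.0], [5.0, 6.0, 8.0, 7.0]))
print(regulation_type([1.0, 2.0, 4.0, 3.0], [9.0, 8.0, 6.0, 7.0]))
```

Real expression data is noisy, so an exact match like this would rarely occur; the sketch only makes the definitions concrete.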
9
Microarray Data Representation
From the microarray, we obtain a time series describing the gene interactions
At different moments in time the microarray shows a different colour, depending on which gene is active
10
Discretization and Classifier Construction
We must discretize the time signal in order to facilitate learning with a Markov model
For each point in time, the sample value is set to change, local minimum or local maximum
These features are used to learn a dynamic Bayesian classifier using a variant of the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique
The classifier can then identify gene interactions as co-regulated or control-regulated
Details provided later…
11
Discretization of Continuous Measurements
Time series represented as change, local minima, local maxima
Re-encoding of data using 2 binary variables for each of the 3 possible values
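A minimal sketch of this discretization follows. The slides do not give the exact rules, so the details are assumptions: a point is a local minimum or maximum only if it is strictly below or above both neighbours, end points are labelled "change", and the two-bit encoding table is one hypothetical way to use 2 binary variables.

```python
def discretize(series):
    """Label each time point as 'change', 'min' or 'max' (assumed rules)."""
    labels = []
    last = len(series) - 1
    for t, cur in enumerate(series):
        label = "change"  # default, also used for the two end points
        if 0 < t < last:
            prev, nxt = series[t - 1], series[t + 1]
            if cur < prev and cur < nxt:
                label = "min"
            elif cur > prev and cur > nxt:
                label = "max"
        labels.append(label)
    return labels


# Hypothetical encoding: (is_local_minimum, is_local_maximum)
ENCODING = {"change": (0, 0), "min": (1, 0), "max": (0, 1)}


def encode(labels):
    """Re-encode the three symbolic values with two binary variables."""
    return [ENCODING[label] for label in labels]


labels = discretize([1.0, 3.0, 2.0, 0.0, 4.0])
print(labels)
print(encode(labels))
```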
12
Monte Carlo Principle
If we take a sample every 1/100th of a second and we measure for 10 minutes, we get 60,000 samples per gene
We need a method for reducing the number of samples without destroying the pertinent details
Given a very large set X and a distribution p(x) over it
We draw an i.i.d. set of N samples x^{(1)}, ..., x^{(N)}
We can then approximate the distribution using these samples:
$$p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(x^{(i)} = x\right) \approx p(x)$$
An Introduction to Markov Chain Monte Carlo, Teg Grenager
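The Monte Carlo principle can be sketched in a few lines: draw N i.i.d. samples from a known distribution and compare the empirical frequencies with the true probabilities. The three states, their probabilities, the seed and the sample size below are invented for illustration.

```python
import random
from collections import Counter


def empirical_distribution(samples):
    """p_N(x): fraction of the N samples that are equal to x."""
    n = len(samples)
    return {x: count / n for x, count in Counter(samples).items()}


# Hypothetical target distribution p(x) over three discrete states.
states = ["change", "min", "max"]
true_p = [0.6, 0.2, 0.2]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(states, weights=true_p, k=10_000)
p_n = empirical_distribution(samples)

for state, p in zip(states, true_p):
    print(state, "true:", p, "estimated:", round(p_n[state], 3))
```

With larger N the estimates move closer to the true probabilities, which is what makes sampling a usable substitute for the full distribution.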
13
Dynamic Bayesian Network Classifiers
A Bayesian network: a statistical model for capturing the direct dependencies between discrete stochastic variables
Train dynamic Bayesian networks to discover relations between sets of genes
The Bayesian networks are trained using a variant of the RJMCMC sampling algorithm
[Figure: example expression time series for a control-regulated gene pair and a co-regulated gene pair; horizontal axes: time]
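To make "direct dependencies between discrete stochastic variables" concrete, the sketch below estimates one conditional probability table from two discretized series by counting, with an optional time lag. It is a simplification and not the RJMCMC procedure used in the work; the function name, the `lag` parameter and the toy series are assumptions.

```python
from collections import Counter, defaultdict


def conditional_table(parent, child, lag=0):
    """Estimate P(child[t] | parent[t - lag]) by relative frequencies."""
    counts = defaultdict(Counter)
    for t in range(lag, len(child)):
        counts[parent[t - lag]][child[t]] += 1
    return {
        p_state: {c_state: n / sum(c_counts.values())
                  for c_state, n in c_counts.items()}
        for p_state, c_counts in counts.items()
    }


# Toy discretized series for a regulator gene and a target gene.
regulator = ["max", "change", "min", "change", "max", "change"]
target = ["min", "change", "max", "change", "min", "change"]

print(conditional_table(regulator, target, lag=0))
print(conditional_table(regulator, target, lag=1))
```

In this toy example the lag-0 table is deterministic (a maximum in the regulator always coincides with a minimum in the target), which is the kind of pattern a control-regulated pair would show.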
14
Learning with Probabilistic Network Classifiers
D: the learning database
G: a DAG structure
X: a set of predictor variables
C: a classification variable
L(C): a score function
D = (C, X): the learning database is separated into X and C
Goal: sample models from the target distribution P(L(C)|D)
Objective function of the network classifier: [equation not recoverable from the slide text; it expresses the probability of L(C), conditioned on X, G and D, over the cases d ∈ D, in terms of P(c | x, G; d) and a function l(y, ·, ·)]
l(y, ·, ·): a modified step function with two parameters (symbols not recoverable): the span of indeterminacy and the bounding probability
15
Experimental Results
Target gene
Lag(0) Valid Close Lag(-1,-2) Valid
CLB1 CLB2 Y Y
BUD4 CLB2 Y Y
SWI4 CLB2 Y
CDC6 CLB2 Y
AGA1 CLB2 Y
ASH1 CLB2(SWI5) Y
CDC45 CLB2(MCM1) Y
CDC47 CDC28 N N
CTS1 SWI5 Y
FUS1 SWI5(MCM1) N (?) N
MFA2 CLN3(CLB2) N(Y)
Test results on the yeast cell-cycle expression dataset
Legend: co-regulated, control-regulated
16
Conclusion (1)
A new approach to discovering supposed regulatory relations between genes
Processing techniques capture dynamic relations between sets of genes
The results obtained from the microarray data are promising
17
Classifying Protein Fingerprints
Motivation
Task and Data Representation
Data Preprocessing and Classification Methods
Experimentation and Results
Conclusion (2)
18
Motivation
The need for automated protein fingerprint labeling
Protein fingerprint: group of amino acid motifs used to depict protein families
These fingerprints may be useful in grouping proteins together
Improve on PRECIS (an annotation tool)
    This tool performs poorly: 40% error rate
    Classifies fingerprints using simple heuristics
19
Task and Data Representation
Goal: replace PRECIS’s handcrafted heuristics with classification models extracted from data
Three distinct kinds of features describe each fingerprint:
    The fingerprint itself
    Its component motifs (a motif is a common sequence of amino acids)
    The proteins
20
Task and Data Representation
Fingerprint
    Number of motifs (nmt)
    Number of proteins (npr)
    True positive rate
    Partial positive rate
Motif
    Motif length (average, stdev, etc.)
    Motif coverage (average, stdev, etc.)
    Motif entropy (average, stdev, etc.)
    Intermotif distance (average, stdev, etc.)
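As an illustration of a motif feature summarized by "average, stdev", the sketch below computes a per-position Shannon entropy over aligned motif instances and then its mean and standard deviation. The slides do not define motif entropy, so this per-column definition, the function names and the toy motif are assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def column_entropy(column):
    """Shannon entropy (bits) of the residues observed at one motif position."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def motif_entropy_stats(instances):
    """Average and population stdev of per-position entropy (assumed definition)."""
    # zip(*instances) walks the aligned instances column by column.
    entropies = [column_entropy(col) for col in zip(*instances)]
    return mean(entropies), pstdev(entropies)


# Four aligned instances of a hypothetical three-residue motif.
average, spread = motif_entropy_stats(["ACD", "ACE", "ACD", "ACE"])
print(round(average, 3), round(spread, 3))
```

The first two positions are fully conserved (entropy 0) and the third is split evenly between two residues (entropy 1 bit), so the average is one third of a bit.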
21
Task and Data Representation
Protein sequence
    SWISS-PROT ID: fraction of proteins with ID
    LHS: fraction of proteins whose length >= 3|4 chars
         fraction of proteins with common first 1|2|3|4 chars in LHS
         entropy of LHS averaged over first 1|2|3|4 chars
    RHS: fraction of proteins with a common RHS (species)
         entropy of RHS taken as a unit
    CC-belongs: sequence belongs to family
    CC-contains: sequence contains domain
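The LHS and RHS features can be sketched as follows. The sketch assumes that a SWISS-PROT ID has the form LHS_RHS (protein name, underscore, species code) and that "common" means shared with the most frequent value; both readings, the helper names and the choice of example IDs are assumptions rather than definitions from the slides.

```python
from collections import Counter


def split_id(swissprot_id):
    """Split an ID such as 'OPSD_HUMAN' into (LHS, RHS) at the underscore."""
    lhs, _, rhs = swissprot_id.partition("_")
    return lhs, rhs


def common_prefix_fraction(ids, k):
    """Fraction of proteins sharing the most frequent first k chars of the LHS."""
    prefixes = [split_id(i)[0][:k] for i in ids]
    return Counter(prefixes).most_common(1)[0][1] / len(ids)


def common_rhs_fraction(ids):
    """Fraction of proteins sharing the most frequent RHS (species code)."""
    species = [split_id(i)[1] for i in ids]
    return Counter(species).most_common(1)[0][1] / len(ids)


# Illustrative set of IDs for the proteins matched by one fingerprint.
ids = ["OPSD_HUMAN", "OPSD_BOVIN", "OPSB_HUMAN", "ACM1_HUMAN"]
print(common_prefix_fraction(ids, 3))
print(common_prefix_fraction(ids, 4))
print(common_rhs_fraction(ids))
```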
22
Data Preprocessing
Dealt with missing values using a technique based on k-nearest neighbours (KNN)
Considered several feature selection algorithms
    Ranking based on information gain
        $$I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
    Ranking based on mutual information, normalized as the symmetrical uncertainty
        $$U(X,Y) = 2\,\frac{H(X) + H(Y) - H(X,Y)}{H(X) + H(Y)}$$
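Both ranking criteria can be computed directly from entropies of discrete variables. The sketch below assumes features and class labels are already discretized and uses the identity H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y); the function names and toy vectors are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(x, y):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), which equals H(X) - H(X|Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetrical_uncertainty(x, y):
    """U(X,Y) = 2 * I(X,Y) / (H(X) + H(Y)); defined as 0 if both entropies are 0."""
    denominator = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denominator if denominator > 0 else 0.0


labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]    # identical to the labels
uninformative = [0, 1, 0, 1]  # independent of the labels

print(symmetrical_uncertainty(informative, labels))
print(symmetrical_uncertainty(uninformative, labels))
```

Ranking features then amounts to sorting them by either score against the class variable. The normalization keeps U between 0 and 1, which avoids favouring features with many distinct values.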
23
Classification Methods
This work compares the performance of several machine learning algorithms when combined with a feature selection method
ML algorithms considered:
    Logic-based learning algorithms
        Decision trees and rules (J48, C5.0, etc.)
    Density-estimation based learners
        NBayes, IBL, Lindisc, MLP, SVM-RBF
24
Experimentation and Results
Error rates (%) on the full 45-feature set
CV: Cross-validation, HO: Holdout

Method         Parameters     CV error   HO error
Defaults                      45.60      46.19
PRECIS                        39.55      40.28
SVM-RBF        G=0.05,C=50    14.06      14.65
RandomForest   I=100,K=6      14.59      17.46
C5.0boost      B=10,C=0.1     15.13      18.59
MLP            H=10           15.13      16.62
IBL            K=10           15.47      19.44
Lindisc        -              15.80      17.18
LTree          C=0.05         16.27      17.46
J48            C=0.01         16.48      19.15
Part           C=0.05         19.97      21.69
Nbayes         K              23.20      27.07

Best performance: SVM-RBF
25
Experimentation and Results
Cross-validation (CV) and holdout (HO) error rates (%) after feature selection

Method         Parameters     Feature selector   #features   CV error   HO error
SVM-RBF        G=0.05,C=50    ReliefF            36          14.09      14.08
RandomForest   I=100,K=6      InfoGain           40          14.19      16.61
C5.0boost      B=10,C=0.1     ReliefF            32          14.79      16.90
MLP            H=10           ReliefF            40          14.86      16.90
IBL            K=10           SymmU              32          14.93      17.46
Lindisc        -              ReliefF            40          15.40      17.18
LTree          C=0.05         SymmU              32          15.53      18.59
J48            C=0.01         SymmU              32          15.53      19.72
Part           C=0.05         CFS                7-10        17.35      18.03
Nbayes         K              CFS                7-10        18.02      23.66
26
Conclusion (2)
SVM does not seem to benefit from the feature selection process (feature selection only removed 9 of the 45 features!)
The SVM-RBF classifier improves on PRECIS by about 26 percentage points (holdout error of roughly 14% versus 40.28% for PRECIS)
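The size of the improvement can be checked from the holdout error rates reported in the two results tables (PRECIS 40.28, SVM-RBF 14.65 on the full feature set and 14.08 after feature selection). The sketch reports both the absolute reduction in percentage points and the relative reduction; reading the 26% figure as percentage points is an interpretation, and the helper name is hypothetical.

```python
def error_reduction(baseline_error, new_error):
    """Return (absolute reduction in points, relative reduction as a fraction)."""
    points = baseline_error - new_error
    return points, points / baseline_error


precis_ho = 40.28        # PRECIS, holdout error (%)
svm_ho_full = 14.65      # SVM-RBF, all 45 features, holdout error (%)
svm_ho_selected = 14.08  # SVM-RBF, 36 features after ReliefF, holdout error (%)

points_full, relative_full = error_reduction(precis_ho, svm_ho_full)
points_sel, relative_sel = error_reduction(precis_ho, svm_ho_selected)

print(f"full set: {points_full:.2f} points, {relative_full:.1%} relative")
print(f"selected: {points_sel:.2f} points, {relative_sel:.1%} relative")
```

Both variants land near 26 percentage points, which corresponds to removing roughly two thirds of the PRECIS errors.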
27
Research Challenges
First paper
    Validate the new approach on real data sets (only simulated data was used)
Second paper
    Correct data imbalance to increase accuracy
    Incorporate available data from other databases
28
References
M. Egmont-Petersen, W. de Jonge, A. Siebes. "Discovery of regulatory connections in microarray data," in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 149-160, 2004.
M. Hilario, A. Mitchell, J.-H. Kim, P. Bradley, T. K. Attwood. "Classifying protein fingerprints," in PKDD 2004, pp. 197-208, 2004.
M. Egmont-Petersen. "Discovering possible co-relations and control-regulations between gene pairs in time series microarray data using salient dynamic features," presented at the working group Bioinformatics, Symposium 2004.
M. Egmont-Petersen. "Feature selection by Markov Chain Monte Carlo Sampling - a Bayesian approach," in Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the Joint IAPR Workshops SSPR 2004 and SPR 2004, Lecture Notes in Computer Science 3138, Eds. A. Fred et al., pp. 1034-1042, 2004.
P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, B. Futcher. "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, Vol. 9, No. 12, pp. 3273-3297, 1998.
P. J. Green. "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination," Biometrika, Vol. 82, No. 4, pp. 711-732, 1995.
T. Grenager. "An Introduction to Markov Chain Monte Carlo," July 1, 2004.
www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
29
Q & A
Thank you!