
Two Classifiers for Bioinformatics


Page 1: Two Classifiers for Bioinformatics

1

Two Classifiers for Bioinformatics

Course: CSI7162

Presented to Dr. Stan Matwin

Presented by Jun Ouyang

Page 2: Two Classifiers for Bioinformatics

2

Road Map

Introduction
Discovery of regulatory connections in microarray data (gene level)
Classifying protein fingerprints (protein level)
Conclusion
Research challenges

Page 3: Two Classifiers for Bioinformatics

3

Introduction to Bioinformatics

An emerging interdisciplinary research area
The interface between the biological and computational sciences
Computational management of all kinds of biological information

Research scope of bioinformatics (hierarchical biological information):
DNA, mRNA, proteins, protein interactions, informational pathways, informational networks, cells, tissues or networks of cells, an organism, populations, ecologies

www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt

Page 4: Two Classifiers for Bioinformatics

4

Introduction to Bioinformatics

DNA level: DNA sequence alignment; gene prediction; gene evolution; …
RNA level: study of gene expression; transcription mechanism; post-transcription modification; …
Protein level: protein 2D and 3D structure prediction; protein active-site prediction; protein-protein interactions; protein-DNA interactions; …
System level (pathways, networks):
Genome (gene-to-gene interactions), e.g. using gene chips to study the gene regulatory network
Proteome (protein-protein interactions), e.g. using protein chips to study the protein interaction network

www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt

Page 5: Two Classifiers for Bioinformatics

5

Discovery of Regulatory Connections in Microarray Data (Gene Level)

What is a microarray?
Obtaining microarray data
Definition of regulatory relations between sets of genes (class labels)
Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm to learn a dynamic Bayesian network
Using dynamic Bayesian network classifiers to predict regulatory relations
Conclusion (1)

Page 6: Two Classifiers for Bioinformatics

6

What is a Microarray?

A kind of gene chip used to discover gene function or gene expression patterns
Allows many expression patterns to be studied in parallel

Example:
At each location, a known probe (a known strand of cDNA) is placed together with labeled cDNA from a sample, for example cDNA from cancerous and from healthy cells
The colour at each spot indicates the relative abundance of the labeled cDNA, i.e. whether the corresponding gene has been activated

Page 7: Two Classifiers for Bioinformatics

7

Obtaining Microarray Data

What are the steps?

[1] Choose a cell population [or sample for diagnosis]
[2] Extract and purify mRNA
[3] Fluorescently label cDNA
[4] Combine the different strands of cDNA on the microarray
[5] Scan the array over time
[6] Interpret the time series to study gene regulation over time

[Figure: microarray scans at successive time points]

Page 8: Two Classifiers for Bioinformatics

8

Regulatory Relations between Sets of Genes

The goal of this work is to identify which genes regulate each other

We are interested in two types of gene regulation (see the sketch below):

Co-regulation: two genes are perfectly co-regulated when their relative abundance functions w.r.t. time have the same first-order derivatives

Control-regulation: two genes are inversely co-regulated, or control-regulated, when their relative abundance functions w.r.t. time have first-order derivatives that are inverses of one another
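To make the two definitions concrete, here is a minimal sketch (my own illustration, not the paper's method) that labels a pair of expression time series by comparing their first-order differences; it treats "inverses of one another" as derivatives equal in magnitude but opposite in sign, and the tolerance tol is an assumed parameter.

```python
# Minimal sketch: classify a gene pair as co-regulated or control-regulated
# from the first-order differences of their expression time series.
import numpy as np

def regulation_type(x, y, tol=1e-6):
    """x, y: 1-D arrays of relative abundance sampled at the same time points."""
    dx, dy = np.diff(x), np.diff(y)           # discrete first-order derivatives
    if np.allclose(dx, dy, atol=tol):
        return "co-regulated"                  # identical derivatives
    if np.allclose(dx, -dy, atol=tol):
        return "control-regulated"             # derivatives opposite in sign
    return "neither (not a perfect match)"

t = np.linspace(0, 1, 8)
print(regulation_type(np.sin(t), np.sin(t)))    # co-regulated
print(regulation_type(np.sin(t), -np.sin(t)))   # control-regulated
```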

Page 9: Two Classifiers for Bioinformatics

9

Microarray Data Representation

From the microarray we obtain a time series describing the gene interactions
At different moments in time the microarray shows a different colour pattern, depending on which genes are active

Page 10: Two Classifiers for Bioinformatics

10

Discretization and Classifier Construction

We must discretize the time signal in order to facilitate learning with a Markov model
Each point in time is labelled as change, local minimum, or local maximum
These features are used to learn a dynamic Bayesian classifier using a variant of the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique
The classifier can then identify gene interactions as co-regulated or control-regulated
Details provided later…

Page 11: Two Classifiers for Bioinformatics

11

Discretization of Continuous Measurements

Time series are represented as change, local minima, and local maxima

The data are re-encoded using 2 binary variables, which together represent the 3 possible values (illustrated in the sketch below)
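A minimal sketch of this discretization step (my own illustration; the exact labelling rules and the 2-bit encoding shown are assumptions, not the paper's):

```python
# Minimal sketch: discretize an expression time series into
# {change, local minimum, local maximum} and re-encode with 2 binary variables.
ENCODING = {"change": (0, 0), "local_min": (0, 1), "local_max": (1, 0)}  # assumed mapping

def discretize(series):
    labels = []
    for i in range(1, len(series) - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if cur < prev and cur < nxt:
            labels.append("local_min")
        elif cur > prev and cur > nxt:
            labels.append("local_max")
        else:
            labels.append("change")            # monotone rise or fall
    return labels

series = [0.1, 0.4, 0.9, 0.5, 0.2, 0.6, 0.8]
labels = discretize(series)
print(labels)                                  # ['change', 'local_max', 'change', 'local_min', 'change']
print([ENCODING[l] for l in labels])           # 2-bit re-encoding of each label
```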

Page 12: Two Classifiers for Bioinformatics

12

Monte Carlo Principle

If we take a sample every 1/100th of a second and measure for 10 minutes, we get 60,000 samples per gene
We need a method for reducing the number of samples without destroying the pertinent details

Given a very large set X and a distribution p(x) over it, we draw an i.i.d. set of N samples x^{(1)}, …, x^{(N)}
We can then approximate the distribution using these samples:

p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x) \;\longrightarrow\; p(x) \quad \text{as } N \to \infty

An Introduction to Markov Chain Monte Carlo, Teg Grenager
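A minimal sketch of the Monte Carlo principle itself (my own illustration): the empirical distribution of N i.i.d. samples puts mass 1/N on each sample, so expectations under p(x) become simple sample averages.

```python
# Minimal sketch: approximate expectations under p(x) from N i.i.d. samples.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
samples = rng.normal(loc=2.0, scale=1.0, size=N)    # x^(1), ..., x^(N) ~ p(x)

# p_N puts mass 1/N on each sample, so E_p[f(x)] is approximated by a mean over samples.
print("E[x]   ≈", samples.mean())                    # true value: 2.0
print("E[x^2] ≈", (samples ** 2).mean())             # true value: 5.0 (variance 1 + mean^2)
```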

Page 13: Two Classifiers for Bioinformatics

13

Dynamic Bayesian Network Classifiers

A Bayesian network: a statistical model for capturing the direct dependencies between discrete stochastic variables

Train dynamic Bayesian networks to discover relations between sets of genes

The Bayesian networks are trained using a variant of the RJMCMC sampling algorithm

[Figure: example expression-over-time profiles for a control-regulated gene pair and a co-regulated gene pair]
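The paper's classifiers are dynamic Bayesian networks trained with RJMCMC; as a much simpler stand-in for the idea, the sketch below (my own; the toy symbol sequences and the use of one first-order Markov transition table per class are assumptions) assigns a discretized sequence to whichever class makes it most probable.

```python
# Minimal sketch: a first-order "dynamic" classifier with one transition table per class.
from collections import defaultdict
import math

def train_transitions(sequences, alpha=1.0):
    """Estimate P(symbol_t | symbol_{t-1}) with Laplace smoothing alpha."""
    counts = defaultdict(lambda: defaultdict(float))
    symbols = set()
    for seq in sequences:
        symbols.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    totals = {a: sum(counts[a].values()) for a in symbols}
    return {a: {b: (counts[a][b] + alpha) / (totals[a] + alpha * len(symbols))
                for b in symbols}
            for a in symbols}

def log_likelihood(seq, trans):
    return sum(math.log(trans.get(a, {}).get(b, 1e-9)) for a, b in zip(seq, seq[1:]))

# Toy training data: discretized sequences for pairs known to be co- or control-regulated.
co      = [["change", "local_max", "change", "local_min"]] * 5
control = [["local_max", "change", "local_min", "change"]] * 5
models = {"co-regulated": train_transitions(co),
          "control-regulated": train_transitions(control)}

new_pair = ["change", "local_max", "change", "local_min"]
print(max(models, key=lambda c: log_likelihood(new_pair, models[c])))   # -> co-regulated
```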

Page 14: Two Classifiers for Bioinformatics

14

Learning with Probabilistic Network Classifiers

Notation:
D = (C, X): the learning database, separated into a set of predictor variables X and a classification variable C
G: a DAG structure
L(C): a score function

Objective function of the network classifier (the target distribution):

P(L(C) \mid X, G^{*}, \Theta^{*}, D) \propto \prod_{d \in D} l\big( P(c \mid x, G^{*}, \Theta^{*}, d);\ \varepsilon, \delta \big)

where l(y; \varepsilon, \delta) is a modified step function, \varepsilon is the span of indeterminacy, and \delta is the bounding probability

Goal: sample models from the above target distribution P(L(C) | D)
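The paper samples network structures with reversible-jump MCMC; the sketch below is a deliberately simplified Metropolis-Hastings version (the score function, the candidate edges, and the single-edge-flip proposal are all my own stand-ins, not the paper's algorithm) that samples edge sets in proportion to a classification-based score.

```python
# Minimal sketch: Metropolis-Hastings over edge sets, targeting a score-based distribution.
import math, random

random.seed(0)
CANDIDATE_EDGES = [("g1", "C"), ("g2", "C"), ("g3", "C"), ("g4", "C")]

def score(structure):
    """Stand-in for a classification-based score of a structure (hypothetical)."""
    s = 1e-3
    s += 1.0 * (("g1", "C") in structure) + 0.7 * (("g2", "C") in structure)
    return s * math.exp(-0.3 * len(structure))          # mild penalty on structure size

def mh_sample(n_iter=5000):
    current, samples = frozenset(), []
    for _ in range(n_iter):
        edge = random.choice(CANDIDATE_EDGES)            # symmetric proposal: flip one edge
        proposal = current ^ frozenset([edge])
        if random.random() < min(1.0, score(proposal) / score(current)):
            current = proposal
        samples.append(current)
    return samples

samples = mh_sample()
best = max(set(samples), key=samples.count)              # most frequently visited structure
print(sorted(best))                                      # expected: [('g1', 'C'), ('g2', 'C')]
```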

Page 15: Two Classifiers for Bioinformatics

15

Experimental Results

Testing results on the yeast cell-cycle expression dataset (co-regulated and control-regulated relations):

Target gene   Regulator      Lag(0) valid   Close lags(-1,-2) valid
CLB1          CLB2           Y              Y
BUD4          CLB2           Y              Y
SWI4          CLB2           Y
CDC6          CLB2           Y
AGA1          CLB2           Y
ASH1          CLB2 (SWI5)    Y
CDC45         CLB2 (MCM1)    Y
CDC47         CDC28          N              N
CTS1          SWI5           Y
FUS1          SWI5 (MCM1)    N (?)          N
MFA2          CLN3 (CLB2)    N (Y)

Page 16: Two Classifiers for Bioinformatics

16

Conclusion (1)

A new approach to discovering putative regulatory relations between genes

Processing techniques capture dynamic relations between sets of genes

The results obtained from the microarray data are promising

Page 17: Two Classifiers for Bioinformatics

17

Classifying Protein Fingerprints

Motivation
Task and data representation
Data preprocessing and classification methods
Experimentation and results
Conclusion (2)

Page 18: Two Classifiers for Bioinformatics

18

Motivation

The need for automated protein fingerprint labeling
A protein fingerprint is a group of amino acid motifs used to characterize protein families
These fingerprints may be useful for grouping proteins together

Improve on PRECIS (an annotation tool)
PRECIS performs poorly, with a 40% error rate
It classifies fingerprints using simple heuristics

Page 19: Two Classifiers for Bioinformatics

19

Task and Data Representation

Goal: replace PRECIS’s handcrafted heuristics with classification models extracted from data.

Features describe three distinct levels:
The fingerprint itself
Its component motifs (a motif is a common sequence of amino acids)
The proteins matched by the fingerprint

Page 20: Two Classifiers for Bioinformatics

20

Task and Data Representation

Fingerprint features:
Number of motifs (nmt)
Number of proteins (npr)
True positive rate
Partial positive rate

Motif features:
Motif length (average, stdev, etc.)
Motif coverage (average, stdev, etc.)
Motif entropy (average, stdev, etc.)
Intermotif distance (average, stdev, etc.)
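A minimal sketch of computing motif-level summary features (my own illustration; the toy motifs and the use of residue-composition Shannon entropy are assumptions, not the paper's exact feature definitions):

```python
# Minimal sketch: summary statistics over the motifs of one fingerprint.
import math
from collections import Counter
from statistics import mean, stdev

def shannon_entropy(seq):
    """Shannon entropy (bits) of the residue composition of one motif."""
    counts, n = Counter(seq), len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

motifs = ["GAVLK", "WWPGH", "AAAAA", "KRDEQS"]     # toy motifs of one fingerprint

lengths   = [len(m) for m in motifs]
entropies = [shannon_entropy(m) for m in motifs]

print("number of motifs:", len(motifs))
print("motif length:  avg %.2f, stdev %.2f" % (mean(lengths), stdev(lengths)))
print("motif entropy: avg %.2f, stdev %.2f" % (mean(entropies), stdev(entropies)))
```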

Page 21: Two Classifiers for Bioinformatics

21

Task and Data Representation

Protein sequence features (from the SWISS-PROT identifiers of the matched proteins):
SWISS-PROT ID: fraction of proteins that have an ID
LHS: fraction of proteins whose LHS is at least 3 or 4 characters long
     fraction of proteins sharing a common first 1, 2, 3, or 4 characters in the LHS
     entropy of the LHS averaged over the first 1, 2, 3, or 4 characters
RHS: fraction of proteins with a common RHS (species)
     entropy of the RHS taken as a unit
CC-belongs: the sequence belongs to a family
CC-contains: the sequence contains a domain
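A minimal sketch of the LHS/RHS features (my own illustration; it assumes SWISS-PROT-style entry names of the form NAME_SPECIES split at the underscore, and the toy identifiers are made up):

```python
# Minimal sketch: LHS/RHS features from SWISS-PROT-style entry names.
import math
from collections import Counter

ids = ["OPSD_HUMAN", "OPSD_BOVIN", "OPS2_DROME", "OPSD_MOUSE"]   # toy identifiers

lhs = [i.split("_")[0] for i in ids]     # name part
rhs = [i.split("_")[1] for i in ids]     # species part

def frac_common_prefix(parts, k):
    """Fraction of entries sharing the most common first-k-character prefix."""
    prefixes = Counter(p[:k] for p in parts)
    return prefixes.most_common(1)[0][1] / len(parts)

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print("frac LHS length >= 4:            ", sum(len(p) >= 4 for p in lhs) / len(lhs))
print("frac common first 3 chars of LHS:", frac_common_prefix(lhs, 3))
print("frac common RHS (species):       ", Counter(rhs).most_common(1)[0][1] / len(rhs))
print("entropy of RHS:                  ", round(entropy(rhs), 3))
```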

Page 22: Two Classifiers for Bioinformatics

22

Data Preprocessing

Dealt with missing values using a technique based on KNN

Considered several feature selection algorithms:

Ranking based on information gain:

I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

Ranking based on mutual information (symmetric uncertainty):

U(X, Y) = \frac{2\,[H(X) + H(Y) - H(X, Y)]}{H(X) + H(Y)}
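A minimal sketch of both ranking criteria for a discrete feature X and class Y (my own illustration):

```python
# Minimal sketch: information gain and symmetric uncertainty for discrete variables.
import math
from collections import Counter

def H(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(x, y):
    # I(X,Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)
    return H(x) + H(y) - H(list(zip(x, y)))

def symmetric_uncertainty(x, y):
    # U(X,Y) = 2 * I(X,Y) / (H(X) + H(Y)), normalized to [0, 1]
    return 2 * info_gain(x, y) / (H(x) + H(y))

x = ["a", "a", "b", "b", "a", "b"]       # a toy feature
y = [ 1,   1,   0,   0,   1,   0 ]       # a toy class label
print(info_gain(x, y))                   # 1.0 bit: x determines y exactly
print(symmetric_uncertainty(x, y))       # 1.0
```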

Page 23: Two Classifiers for Bioinformatics

23

Classification Methods

This work compares the performance of several machine learning algorithms when combined with a feature selection method (a comparison sketch follows below)

ML algorithms considered:
Logic-based learning algorithms: decision trees and rules (J48, C5.0, etc.)
Density-estimation-based learners: NBayes, IBL, Lindisc, MLPs, SVM-RBF
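As a rough stand-in for this comparison (my own sketch using scikit-learn, which is not what the paper used; the synthetic data, parameters, and model choices are assumptions), one could cross-validate a few of the listed learners on a 45-feature dataset:

```python
# Minimal sketch: cross-validated error of several classifiers on 45-feature data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 45-feature fingerprint dataset.
X, y = make_classification(n_samples=500, n_features=45, n_informative=15, random_state=0)

models = {
    "SVM-RBF":      SVC(kernel="rbf", gamma=0.05, C=50),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP":          MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
    "IBL (kNN)":    KNeighborsClassifier(n_neighbors=10),
    "NBayes":       GaussianNB(),
    "Tree (~J48)":  DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name:12s}  CV error: {100 * (1 - acc):.2f}%")
```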

Page 24: Two Classifiers for Bioinformatics

24

Experimentation and Results

Error rates (%) on the full 45-feature set (CV: cross-validation, HO: holdout):

Method         Parameters      CV error   HO error
Defaults                       45.60      46.19
PRECIS                         39.55      40.28
SVM-RBF        G=0.05, C=50    14.06      14.65   (best performance)
RandomForest   I=100, K=6      14.59      17.46
C5.0boost      B=10, C=0.1     15.13      18.59
MLP            H=10            15.13      16.62
IBL            K=10            15.47      19.44
Lindisc        -               15.80      17.18
LTree          C=0.05          16.27      17.46
J48            C=0.01          16.48      19.15
Part           C=0.05          19.97      21.69
NBayes         K               23.20      27.07

Page 25: Two Classifiers for Bioinformatics

25

Experimentation and Results

Cross-validation (CV) and holdout (HO) error rates (%) after feature selection:

Method         Parameters      Feature selector   #features   CV error   HO error
SVM-RBF        G=0.05, C=50    ReliefF            36          14.09      14.08
RandomForest   I=100, K=6      InfoGain           40          14.19      16.61
C5.0boost      B=10, C=0.1     ReliefF            32          14.79      16.90
MLP            H=10            ReliefF            40          14.86      16.90
IBL            K=10            SymmU              32          14.93      17.46
Lindisc        -               ReliefF            40          15.40      17.18
LTree          C=0.05          SymmU              32          15.53      18.59
J48            C=0.01          SymmU              32          15.53      19.72
Part           C=0.05          CFS                7-10        17.35      18.03
NBayes         K               CFS                7-10        18.02      23.66

Page 26: Two Classifiers for Bioinformatics

26

Conclusion (2)

The SVM does not seem to benefit much from the feature selection process (feature selection removed only 9 of the 45 features)

An SVM-RBF classifier improves accuracy over PRECIS by roughly 26 percentage points (holdout error drops from about 40% to about 14%)

Page 27: Two Classifiers for Bioinformatics

27

Research Challenges

First paper:
Validate the new approach on real data sets (only simulated data was used)

Second paper:
Correct the data imbalance to increase accuracy
Incorporate available data from other databases

Page 28: Two Classifiers for Bioinformatics

28

References

M. Egmont-Petersen, W. de Jonge, A. Siebes. "Discovery of regulatory connections in microarray data." In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 149-160, 2004.

M. Hilario, A. Mitchell, J.-H. Kim, P. Bradley, T. K. Attwood. "Classifying protein fingerprints." In Proceedings of PKDD, pp. 197-208, 2004.

M. Egmont-Petersen. "Discovering possible co-relations and control-regulations between gene pairs in time series microarray data using salient dynamic features." Presented at the working group Bioinformatics, Symposium 2004.

M. Egmont-Petersen. "Feature selection by Markov Chain Monte Carlo sampling - a Bayesian approach." In Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the Joint IAPR Workshops SSPR 2004 and SPR 2004, Lecture Notes in Computer Science 3138, Eds. A. Fred et al., pp. 1034-1042, 2004.

P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, B. Futcher. "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization." Molecular Biology of the Cell, Vol. 9, No. 12, pp. 3273-3297, 1998.

P. J. Green. "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination." Biometrika, Vol. 82, No. 4, pp. 711-732, 1995.

T. Grenager. "An Introduction to Markov Chain Monte Carlo." July 1, 2004.

www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt

Page 29: Two Classifiers for Bioinformatics

29

Q & A

Thank you!