Limsoon Wong
Laboratories for Information Technology
Singapore

From Datamining to Bioinformatics
What is Bioinformatics?
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science
From Informatics to Bioinformatics
Integration Technology (Kleisli)
Cleansing & Warehousing (FIMM)
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Venom Informatics
1994  1996  1998  2000  2002
8 years of bioinformatics R&D in Singapore
ISS KRDL LIT
Quick Samplings
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSEEVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLNLNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRSLLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVILTDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNRFLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEKTASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQCEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENIIDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQKPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDNQNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGNRHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHEKPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVPGAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
Prediction by our ANN model for HLA-A11:
29 predictions, 22 epitopes, 76% specificity
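The 76% figure is the fraction of predictions experimentally confirmed as epitopes; a one-line arithmetic check (illustrative only):

```python
# 22 of the 29 predicted peptides were confirmed epitopes,
# which matches the ~76% quoted on the slide.
confirmed, predictions = 22, 29
print(round(100 * confirmed / predictions))  # 76
```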
Prediction by BIMAS matrix for HLA-A*1101:
Rank by BIMAS                   1           66          100
Number of experimental binders  19 (52.8%)  5 (13.9%)   12 (33.3%)
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
Looking for patterns that are valid, novel, useful, and understandable
age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y
Gene Expression Analysis
Classifying gene expression profiles:
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression
Medical Record & Gene Expression Analysis Results
PCL, a novel "emerging pattern" method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
Works well for gene expressions
Cancer Cell, March 2002, 1(2)
Behind the Scene
Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang
Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more:
students, folks from geneticXchange, MolecularConnections, and other collaborators…
Questions?
A More Detailed Account
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
What is Datamining?
Whose block is this?
Jonathan’s blocks
Jessica’s blocks
What is Datamining?
Question: Can you explain how?
The Steps of Data Mining
Training data gathering
Signal generation: k-grams, colour, texture, domain know-how, ...
Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...
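The signal steps above can be sketched in a few lines of Python. This is a toy illustration only: the mean-difference scoring rule and the small kNN below are simplified stand-ins for the methods listed, not the talk's actual code.

```python
# Toy data-mining pipeline: generate k-gram signals, select the most
# discriminative ones, and integrate them with a majority-vote kNN.

def kgram_counts(seq, k):
    """Signal generation: count every k-gram occurring in seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        g = seq[i:i + k]
        counts[g] = counts.get(g, 0) + 1
    return counts

def select_signals(samples, labels, k, top=3):
    """Signal selection: rank k-grams by |difference of class-mean counts|."""
    grams = set()
    for s in samples:
        grams.update(kgram_counts(s, k))
    def score(g):
        pos = [kgram_counts(s, k).get(g, 0) for s, y in zip(samples, labels) if y == 1]
        neg = [kgram_counts(s, k).get(g, 0) for s, y in zip(samples, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(grams, key=score, reverse=True)[:top]

def knn_predict(train_X, train_y, x, k=3):
    """Signal integration: majority vote of the k nearest feature vectors."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(v, x)), y)
                   for v, y in zip(train_X, train_y))
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)
```

In practice each selected k-gram becomes one coordinate of the feature vector fed to the classifier.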
Translation Initiation Recognition
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG  80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA  160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA  240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................  80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE  160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
Signal Generation
K-grams (i.e., k consecutive letters)
K = 1, 2, 3, 4, 5, …
Window size vs. fixed position
Upstream, downstream vs. anywhere in window
In-frame vs. any frame
[Bar chart: A, C, G, T frequencies for three sequences seq1, seq2, seq3; y-axis 0–3]
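The window and frame options can be made concrete with a small helper. This is a sketch only; the function name, the window default, and the step-of-3 walk are assumptions for illustration, not code from the talk.

```python
# Extract k-grams that are in frame with a candidate ATG at atg_pos,
# restricted to a window upstream or downstream of it (illustrative).

def inframe_kgrams(seq, atg_pos, k=3, window=99, upstream=True):
    if upstream:
        region = seq[max(0, atg_pos - window):atg_pos]
        i, step = len(region) - k, -3  # walk back in steps of 3 to stay in frame
    else:
        region = seq[atg_pos + 3:atg_pos + 3 + window]
        i, step = 0, 3
    grams = []
    while 0 <= i <= len(region) - k:
        grams.append(region[i:i + k])
        i += step
    return grams
```

For the sequence AAACCCATGGGGTTT with the ATG at position 6, the in-frame upstream 3-grams are CCC and AAA, and the in-frame downstream ones are GGG and TTT.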
Too Many Signals
For each value of k, there are 4^k × 3 × 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
This is too many for most machine learning algorithms
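The arithmetic checks out directly (the leading 4 in the slide's sum is taken as given):

```python
# Each k contributes 4**k * 3 * 2 k-gram features (3 frame options x 2
# window sides, per the formula above); summing k = 1..5 plus the
# slide's leading 4 gives the quoted total.
terms = [4] + [4 ** k * 3 * 2 for k in range(1, 6)]
print(terms)       # [4, 24, 96, 384, 1536, 6144]
print(sum(terms))  # 8188
```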
Signal Selection (Basic Idea)
Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance
Which of the following 3 signals is good?
Signal Selection (e.g., t-statistics)
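A minimal version of the two-sample t-statistic for scoring one signal (the standard unpooled form, not code from the talk): a large |t| means high inter-class distance relative to intra-class spread.

```python
import math

def t_statistic(xs, ys):
    """Two-sample (unpooled) t-statistic for one feature's values in
    class 1 (xs) vs class 2 (ys)."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):  # sample variance
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    return (mean(xs) - mean(ys)) / math.sqrt(var(xs) / len(xs) + var(ys) / len(ys))
```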
Signal Selection (e.g., MIT-correlation)
Signal Selection (e.g., χ²)
Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS: a good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
Homework: find a formula that captures the key idea of CFS above
Sample k-grams Selected
Position –3 (Kozak consensus)
In-frame upstream ATG (leaky scanning)
In-frame downstream TAA, TAG, TGA (stop codon)
CTG, GAC, GAG, and GCC (codon bias)
Signal Integration
kNN: given a test sample, find the k training samples that are most similar to it, and let the majority class win.
SVM: given a group of training samples from two classes, determine a separating plane that maximises the margin.
Naïve Bayes, ANN, C4.5, ...
Results (on Pedersen & Nielsen’s mRNA)
Classifier      Sensitivity TP/(TP+FN)  Specificity TN/(TN+FP)  Precision TP/(TP+FP)  Accuracy
Naïve Bayes     84.3%                   86.1%                   66.3%                 85.7%
SVM             73.9%                   93.2%                   77.9%                 88.5%
Neural Network  77.6%                   93.2%                   78.8%                 89.4%
Decision Tree   74.0%                   94.4%                   81.1%                 89.4%
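The four columns are the usual sensitivity, specificity, precision, and accuracy. A small helper (illustrative, not the evaluation code used in the experiments) makes the definitions explicit:

```python
def metrics(tp, fn, tn, fp):
    """The four measures reported in the results table."""
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP+FN)
        "specificity": tn / (tn + fp),   # TN/(TN+FP)
        "precision":   tp / (tp + fp),   # TP/(TP+FP)
        "accuracy":    (tp + tn) / (tp + fn + tn + fp),
    }
```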
Acknowledgements
Roland Yap, Zeng Fanfan, A.G. Pedersen, H. Nielsen
Questions?
Common Mistakes
Self-fulfilling Oracle
Consider this scenario:
Given classes C1 and C2 w/ explicit signals
Use χ² on C1 and C2 to select signals s1, s2, s3
Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
Is the accuracy really 90%? What can be wrong with this?
Phil Long’s Experiment
Let there be classes C1 and C2 w/ 100000 features having randomly generated values
Use χ² to select 20 features
Run k-fold x-validation on C1 and C2 w/ these 20 features
Expect: 50% accuracy. Get: 90% accuracy!
Lesson: choose features at each fold
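Phil Long's effect is easy to reproduce at reduced scale. The sketch below is an assumption-laden simulation, not the original experiment: 2000 rather than 100000 random features, mean-difference scoring standing in for χ², and a nearest-centroid classifier. It contrasts selecting features once on all the data against selecting them inside each fold; only the latter gives honest, near-chance accuracy on random data.

```python
import random

random.seed(0)
n, d = 20, 2000                      # scaled down from 100000 random features
labels = [i % 2 for i in range(n)]
data = [[random.random() for _ in range(d)] for _ in range(n)]

def score(j, idx):
    """Class-mean difference of feature j over the samples in idx."""
    a = [data[i][j] for i in idx if labels[i] == 1]
    b = [data[i][j] for i in idx if labels[i] == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def top_features(idx, m=20):
    return sorted(range(d), key=lambda j: score(j, idx), reverse=True)[:m]

def nearest_centroid_acc(features, train, test):
    correct = 0
    for i in test:
        dist = {}
        for c in (0, 1):
            members = [t for t in train if labels[t] == c]
            cent = [sum(data[t][j] for t in members) / len(members) for j in features]
            dist[c] = sum((data[i][j] - cent[p]) ** 2 for p, j in enumerate(features))
        correct += (min(dist, key=dist.get) == labels[i])
    return correct / len(test)

feats_all = top_features(range(n))   # WRONG: selection sees the test folds
accs_wrong, accs_right = [], []
for f in range(4):                   # 4-fold cross-validation
    test = list(range(f * 5, f * 5 + 5))
    train = [i for i in range(n) if i not in test]
    accs_wrong.append(nearest_centroid_acc(feats_all, train, test))
    accs_right.append(nearest_centroid_acc(top_features(train), train, test))

# On purely random data, the "wrong" protocol typically looks far better
# than chance, while per-fold selection stays near 50%.
print(sum(accs_wrong) / 4, sum(accs_right) / 4)
```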
Apples vs Oranges
Consider this scenario:
Fanfan reported 89% accuracy on his TIS prediction method
Hatzigeorgiou reported 94% accuracy on her TIS prediction method
So Hatzigeorgiou's method is better
What is wrong with this conclusion?
Apples vs Oranges
Differences in datasets used:
Fanfan's expt used Pedersen's dataset
Hatzigeorgiou's used her own dataset
Differences in counting:
Fanfan's expt was on a per-ATG basis
Hatzigeorgiou's expt used the scanning rule and thus was on a per-cDNA basis
When Fanfan ran the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!
Questions?