Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics

Page 1

Limsoon Wong
Laboratories for Information Technology
Singapore

From Datamining to Bioinformatics

Page 2

What is Bioinformatics?

Page 3

Themes of Bioinformatics

Bioinformatics = Data Mgmt + Knowledge Discovery

Data Mgmt = Integration + Transformation + Cleansing

Knowledge Discovery = Statistics + Algorithms + Databases

Page 4

Benefits of Bioinformatics

To the patient: Better drug, better treatment

To the pharma: Save time, save cost, make more $

To the scientist: Better science

Page 5

From Informatics to Bioinformatics

8 years of bioinformatics R&D in Singapore

Projects:
  Integration Technology (Kleisli)
  Cleansing & Warehousing (FIMM)
  MHC-Peptide Binding (PREDICT)
  Protein Interactions Extraction (PIES)
  Gene Expression & Medical Record Datamining (PCL)
  Gene Feature Recognition (Dragon)
  Venom Informatics

Timeline axis: 1994, 1996, 1998, 2000, 2002

Organisations: ISS, KRDL, LIT

Page 6

Quick Samplings

Page 7

Epitope Prediction

TRAP-559AA

MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSEEVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLNLNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRSLLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVILTDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNRFLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEKTASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQCEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENIIDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQKPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDNQNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGNRHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHEKPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVPGAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Page 8

Epitope Prediction Results

Prediction by our ANN model for HLA-A11
  29 predictions, 22 epitopes, 76% specificity

Prediction by BIMAS matrix for HLA-A*1101
  Rank by BIMAS (axis marks): 1, 66, 100
  Number of experimental binders: 19 (52.8%), 5 (13.9%), 12 (33.3%)

Page 9

Transcription Start Prediction

Page 10

Transcription Start Prediction Results

Page 11

Medical Record Analysis

Looking for patterns that are
  valid
  novel
  useful
  understandable

age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y

Page 12

Gene Expression Analysis

Classifying gene expression profiles
  find stable differentially expressed genes
  find significant gene groups
  derive coordinated gene expression

Page 13

Medical Record & Gene Expression Analysis Results

PCL, a novel “emerging pattern” method

Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks

Works well for gene expressions

Cancer Cell, March 2002, 1(2)

Page 14

Behind the Scene

Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang

Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang

and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators...

Page 15

Questions?

Page 16

A More Detailed Account

Page 17

What is Datamining?

Jonathan’s blocks

Jessica’s blocks

Whose block is this?

Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest

Page 18

What is Datamining?

Question: Can you explain how?

Page 19

The Steps of Data Mining

Training data gathering
Signal generation
  k-grams, colour, texture, domain know-how, ...
Signal selection
  Entropy, χ2, CFS, t-test, domain know-how, ...
Signal integration
  SVM, ANN, PCL, CART, C4.5, kNN, ...

Page 20

Translation Initiation Recognition

Page 21

A Sample cDNA

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

What makes the second ATG the translation initiation site?
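The candidate ATGs in this question can be listed mechanically. The following is a minimal Python sketch, assuming only the first 120 bases of the sample cDNA and the matching part of its annotation line; the helper name `find_atgs` is illustrative and does not come from the talk.

```python
# First 120 bases of the sample cDNA (HSU27655.1) as shown on the slide.
SEQ = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
)
# Matching annotation: '.' = upstream, 'i' = initiation site, 'E' = coding.
ANNOT = "." * 60 + "." * 32 + "i" + "E" * 27


def find_atgs(seq):
    """Return the 0-based start index of every ATG in seq."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


atgs = find_atgs(SEQ)
tis = ANNOT.index("i")
print("ATG positions:", atgs)
print("annotated initiation site:", tis)
print("TIS is the second ATG:", atgs.index(tis) == 1)
```

Both ATGs are the same three letters, so whatever separates them has to come from the surrounding sequence.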

Page 22

Signal Generation

K-grams (i.e., k consecutive letters)
  K = 1, 2, 3, 4, 5, …
  Window size vs. fixed position
  Up-stream, downstream vs. anywhere in window
  In-frame vs. any frame

[Figure: bar chart over A, C, G, T for seq1, seq2, seq3; vertical axis 0 to 3]
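The k-gram options listed on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the talk's implementation: the function name `kgram_counts`, the default window of 30 bases, and the toy sequence are all hypothetical.

```python
from collections import Counter


def kgram_counts(seq, atg, k, window=30, in_frame=False):
    """Count k-grams upstream and downstream of the ATG starting at index atg.

    window   -- number of bases considered on each side (assumed default)
    in_frame -- if True, only count k-grams whose start is a multiple of 3
                away from the ATG
    """
    step = 3 if in_frame else 1
    up, down = Counter(), Counter()
    # Upstream: k-grams lying entirely before the ATG.
    start = max(0, atg - window)
    for i in range(atg - step, start - 1, -step):
        if i + k <= atg:
            up[seq[i:i + k]] += 1
    # Downstream: k-grams starting after the ATG codon.
    end = min(len(seq), atg + 3 + window)
    for i in range(atg + 3, end - k + 1, step):
        down[seq[i:i + k]] += 1
    return up, down


# Toy sequence (hypothetical) with the candidate ATG at index 6.
toy = "AAATTTATGCCCGGG"
up3, down3 = kgram_counts(toy, atg=6, k=3, in_frame=True)
up1, down1 = kgram_counts(toy, atg=6, k=1)
print(up3, down3)
print(up1, down1)
```

Each (k-gram, upstream or downstream, frame) combination becomes one candidate signal, which is why the number of signals grows so quickly with k.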

Page 23

Too Many Signals

For each value of k, there are 4^k * 3 * 2 k-grams

If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!

This is too many for most machine learning algorithms
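The arithmetic can be checked directly. The short Python sketch below evaluates the slide's formula 4^k * 3 * 2 for k = 1 to 5; the function name is illustrative. The formula alone gives 8184, and the slide's total of 8188 is that sum plus the extra leading term 4 in its list.

```python
def num_kgram_features(k):
    """Number of k-gram features for one value of k: 4^k * 3 * 2."""
    return 4 ** k * 3 * 2


per_k = [num_kgram_features(k) for k in range(1, 6)]
print(per_k)           # [24, 96, 384, 1536, 6144]
print(sum(per_k))      # 8184
print(4 + sum(per_k))  # 8188, the total quoted on the slide
```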

Page 24

Signal Selection (Basic Idea)

Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance

Which of the following 3 signals is good?
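One common way to fold both criteria into a single score is a two-sample t-statistic: the numerator grows with the distance between class means, and the denominator grows with the spread inside each class. The Python sketch below uses the Welch form as an assumption; the exact formula used in the talk is not preserved in this transcript, and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_score(class1, class2):
    """Welch-style t-statistic of one signal across two classes.

    A large |t| means the class means are far apart (high inter-class
    distance) relative to the spread within each class (low intra-class
    distance).
    """
    m1, m2 = mean(class1), mean(class2)
    v1, v2 = variance(class1), variance(class2)  # sample variances
    return (m1 - m2) / sqrt(v1 / len(class1) + v2 / len(class2))


# Hypothetical values of two signals in classes C1 and C2.
good = t_score([1.0, 1.1, 0.9, 1.0], [3.0, 3.1, 2.9, 3.0])
poor = t_score([1.0, 3.0, 0.5, 2.5], [1.2, 2.8, 0.7, 2.6])
print(abs(good) > abs(poor))
```

Ranking signals by |t| and keeping the top few is the simplest selection scheme of this kind.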

Page 25

Signal Selection (e.g., t-statistics)

Page 26

Signal Selection (e.g., MIT-correlation)

Page 27

Signal Selection (e.g., χ2)

Page 28

Signal Selection (e.g., CFS)

Instead of scoring individual signals, how about scoring a group of signals as a whole?

CFS: A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other

Homework: find a formula that captures the key idea of CFS above
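One published formulation of this idea is the merit score from Hall's correlation-based feature selection, which divides the total signal-class correlation by a term that grows with the correlation among the signals. Whether this is the formula the homework has in mind is not stated in the slides, so the Python sketch below is only one possible answer; the input values are hypothetical.

```python
from math import sqrt


def cfs_merit(class_corrs, mean_inter_corr):
    """Merit of a group of signals (Hall's CFS heuristic).

    class_corrs     -- absolute correlation of each signal with the class
    mean_inter_corr -- average absolute correlation between pairs of signals
    """
    k = len(class_corrs)
    r_cf = sum(class_corrs) / k
    return k * r_cf / sqrt(k + k * (k - 1) * mean_inter_corr)


# Hypothetical groups: same class correlation, different redundancy.
independent = cfs_merit([0.6, 0.6, 0.6], mean_inter_corr=0.0)
redundant = cfs_merit([0.6, 0.6, 0.6], mean_inter_corr=0.9)
print(independent > redundant)
```

With equal class correlations, the group whose signals are uncorrelated with each other scores higher, which matches the description of a good group.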

Page 29

Sample k-grams Selected

Position –3: Kozak consensus
In-frame upstream ATG: leaky scanning
In-frame downstream TAA, TAG, TGA: stop codon
In-frame downstream CTG, GAC, GAG, and GCC: codon bias

Page 30

Signal Integration

kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.

SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes.

Naïve Bayes, ANN, C4.5, ...
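The kNN rule above is short enough to write out in full. This is a minimal Python sketch assuming Euclidean distance as the similarity measure (the slide does not name one) and using hypothetical two-dimensional training points; it needs Python 3.8 or later for `math.dist`.

```python
from collections import Counter
from math import dist


def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote of its k nearest training samples.

    train -- list of (feature_vector, label) pairs
    """
    nearest = sorted(train, key=lambda s: dist(s[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training samples: (signal values, class label).
train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
    ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS"),
]
print(knn_predict(train, (4.5, 5.0)))
print(knn_predict(train, (0.5, 0.5)))
```

An odd k avoids ties in the two-class case.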

Page 31

Results (on Pedersen & Nielsen’s mRNA)

                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
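The four columns are usually called sensitivity, specificity, precision, and accuracy. The Python sketch below computes them from a confusion matrix; the counts in the example are hypothetical and are not the confusion matrix behind the results above.

```python
def metrics(tp, fn, tn, fp):
    """Return the four measures used in the results table."""
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP + FN)
        "specificity": tn / (tn + fp),   # TN/(TN + FP)
        "precision": tp / (tp + fp),     # TP/(TP + FP)
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts for illustration only.
m = metrics(tp=80, fn=20, tn=270, fp=30)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

When negatives far outnumber positives, accuracy mostly tracks specificity, so a method can post high accuracy with modest sensitivity.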

Page 32

Acknowledgements

Roland Yap, Zeng Fanfan, A.G. Pedersen, H. Nielsen

Page 33

Questions?

Page 34

Common Mistakes

Page 35

Self-fulfilling Oracle

Consider this scenario:
  Given classes C1 and C2 w/ explicit signals
  Apply χ2 to C1 and C2 to select signals s1, s2, s3
  Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%

Is the accuracy really 90%?
What can be wrong with this?

Page 36

Phil Long’s Experiment

Let there be classes C1 and C2 w/ 100000 features having randomly generated values
Use χ2 to select 20 features
Run k-fold x-validation on C1 and C2 w/ these 20 features

Expect: 50% accuracy
Get: 90% accuracy!

Lesson: choose features at each fold
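The effect is easy to reproduce on a small scale. The Python sketch below is a scaled-down illustration, not Phil Long's actual setup: it uses 2000 random features instead of 100000, a simple mean-difference score in place of χ2, and a nearest-centroid classifier, all of which are assumptions made to keep the example short. It requires Python 3.8 or later for `statistics.fmean`.

```python
import random
from statistics import fmean


def top_features(X, y, rows, n_keep):
    """Rank features by |difference of class means|, using only `rows`."""
    def score(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(fmean(a) - fmean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]


def centroid_predict(X, y, train, feats, i):
    """Nearest-centroid prediction for sample i using features `feats`."""
    def sq_dist(label):
        members = [r for r in train if y[r] == label]
        return sum((X[i][j] - fmean(X[r][j] for r in members)) ** 2
                   for j in feats)
    return 0 if sq_dist(0) <= sq_dist(1) else 1


def cv_accuracy(X, y, folds, n_keep, select_inside_fold):
    """k-fold cross-validation with feature selection done once or per fold."""
    all_rows = list(range(len(X)))
    leaked = top_features(X, y, all_rows, n_keep)  # sees the test rows too
    correct = 0
    for test in folds:
        train = [r for r in all_rows if r not in test]
        feats = (top_features(X, y, train, n_keep)
                 if select_inside_fold else leaked)
        correct += sum(centroid_predict(X, y, train, feats, i) == y[i]
                       for i in test)
    return correct / len(X)


random.seed(0)
n, d = 40, 2000                      # scaled down from 100000 features
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [i % 2 for i in range(n)]        # labels carry no real signal
folds = [list(range(f, n, 5)) for f in range(5)]

leaky = cv_accuracy(X, y, folds, 20, select_inside_fold=False)
honest = cv_accuracy(X, y, folds, 20, select_inside_fold=True)
print(f"features selected once on all data: {leaky:.2f}")
print(f"features selected inside each fold: {honest:.2f}")
```

Selecting on all the data lets the held-out samples influence which features are kept, so the cross-validated accuracy is inflated even though the labels are pure noise. Re-selecting inside each fold removes that leak.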

Page 37

Apples vs Oranges

Consider this scenario:
  Fanfan reported 89% accuracy on his TIS prediction method
  Hatzigeorgiou reported 94% accuracy on her TIS prediction method
  So Hatzigeorgiou’s method is better

What is wrong with this conclusion?

Page 38

Apples vs Oranges

Differences in datasets used:
  Fanfan’s expt used Pedersen’s dataset
  Hatzigeorgiou’s used her own dataset

Differences in counting:
  Fanfan’s expt was on a per ATG basis
  Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis

When Fanfan ran the same dataset and counted the same way as Hatzigeorgiou, he got 94% also!
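The difference in counting can be made concrete with a toy example. The Python sketch below scores one set of hypothetical predictions in both ways. It assumes the scanning rule means "the first ATG predicted positive, reading from the 5' end, is taken as the TIS"; the data and function names are illustrative and come from neither experiment.

```python
# Hypothetical predictions: for each cDNA, one (predicted_is_TIS, truly_TIS)
# pair per ATG, listed in 5' to 3' order.
cdnas = [
    [(False, False), (True, True), (True, False), (False, False)],
    [(True, True), (False, False), (True, False)],
    [(False, False), (False, False), (True, True), (True, False)],
]


def per_atg_accuracy(cdnas):
    """Fraction of individual ATGs classified correctly."""
    pairs = [p for c in cdnas for p in c]
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


def per_cdna_accuracy(cdnas):
    """Fraction of cDNAs whose first predicted-positive ATG is the true TIS."""
    hits = 0
    for c in cdnas:
        first_positive_is_tis = next(
            (truth for pred, truth in c if pred), False)
        hits += first_positive_is_tis
    return hits / len(cdnas)


print(f"per ATG:  {per_atg_accuracy(cdnas):.1%}")
print(f"per cDNA: {per_cdna_accuracy(cdnas):.1%}")
```

The same predictions score differently under the two counting schemes, because false positives downstream of the true TIS are penalised per ATG but ignored under the scanning rule.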

Page 39

Questions?