Limsoon Wong
Laboratories for Information Technology
Singapore

From Datamining to Bioinformatics
What is Bioinformatics?
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science
From Informatics to Bioinformatics
Integration Technology (Kleisli)
Cleansing & Warehousing (FIMM)
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Venom Informatics
1994  1996  1998  2000  2002
8 years of bioinformatics R&D in Singapore
ISS KRDL LIT
Quick Samplings
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSEEVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLNLNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRSLLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVILTDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNRFLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEKTASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQCEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENIIDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQKPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDNQNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGNRHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHEKPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVPGAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
Prediction by our ANN model for HLA-A11:
29 predictions, 22 epitopes, 76% specificity
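The 76% figure is the fraction of predictions experimentally confirmed as epitopes; a one-line arithmetic check (illustrative only):

```python
# 22 of the 29 predicted peptides were confirmed epitopes,
# which matches the ~76% quoted on the slide.
confirmed, predictions = 22, 29
print(round(100 * confirmed / predictions))  # 76
```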
Prediction by BIMAS matrix for HLA-A*1101:
Rank by BIMAS                   1           66          100
Number of experimental binders  19 (52.8%)  5 (13.9%)   12 (33.3%)
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
Looking for patterns that are valid, novel, useful, and understandable
age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y
Gene Expression Analysis
Classifying gene expression profiles:
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression
Medical Record & Gene Expression Analysis Results
PCL, a novel "emerging pattern" method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
Works well for gene expressions
Cancer Cell, March 2002, 1(2)
Behind the Scene
Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang
Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more:
students, folks from geneticXchange, MolecularConnections, and other collaborators…
Questions?
A More Detailed Account
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
What is Datamining?
Whose block is this?
Jonathan’s blocks
Jessica’s blocks
What is Datamining?
Question: Can you explain how?
The Steps of Data Mining
Training data gathering
Signal generation: k-grams, colour, texture, domain know-how, ...
Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...
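The signal steps above can be sketched in a few lines of Python. This is a toy illustration only: the mean-difference scoring rule and the small kNN below are simplified stand-ins for the methods listed, not the talk's actual code.

```python
# Toy data-mining pipeline: generate k-gram signals, select the most
# discriminative ones, and integrate them with a majority-vote kNN.

def kgram_counts(seq, k):
    """Signal generation: count every k-gram occurring in seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        g = seq[i:i + k]
        counts[g] = counts.get(g, 0) + 1
    return counts

def select_signals(samples, labels, k, top=3):
    """Signal selection: rank k-grams by |difference of class-mean counts|."""
    grams = set()
    for s in samples:
        grams.update(kgram_counts(s, k))
    def score(g):
        pos = [kgram_counts(s, k).get(g, 0) for s, y in zip(samples, labels) if y == 1]
        neg = [kgram_counts(s, k).get(g, 0) for s, y in zip(samples, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(grams, key=score, reverse=True)[:top]

def knn_predict(train_X, train_y, x, k=3):
    """Signal integration: majority vote of the k nearest feature vectors."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(v, x)), y)
                   for v, y in zip(train_X, train_y))
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)
```

In practice each selected k-gram becomes one coordinate of the feature vector fed to the classifier.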
Translation Initiation Recognition
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG  80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA  160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA  240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................  80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE  160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
Signal Generation
K-grams (i.e., k consecutive letters)
K = 1, 2, 3, 4, 5, …
Window size vs. fixed position
Upstream, downstream vs. anywhere in window
In-frame vs. any frame
[Bar chart: A, C, G, T frequencies for three sequences seq1, seq2, seq3; y-axis 0–3]
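The window and frame options can be made concrete with a small helper. This is a sketch only; the function name, the window default, and the step-of-3 walk are assumptions for illustration, not code from the talk.

```python
# Extract k-grams that are in frame with a candidate ATG at atg_pos,
# restricted to a window upstream or downstream of it (illustrative).

def inframe_kgrams(seq, atg_pos, k=3, window=99, upstream=True):
    if upstream:
        region = seq[max(0, atg_pos - window):atg_pos]
        i, step = len(region) - k, -3  # walk back in steps of 3 to stay in frame
    else:
        region = seq[atg_pos + 3:atg_pos + 3 + window]
        i, step = 0, 3
    grams = []
    while 0 <= i <= len(region) - k:
        grams.append(region[i:i + k])
        i += step
    return grams
```

For the sequence AAACCCATGGGGTTT with the ATG at position 6, the in-frame upstream 3-grams are CCC and AAA, and the in-frame downstream ones are GGG and TTT.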
Too Many Signals
For each value of k, there are 4^k × 3 × 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
This is too many for most machine learning algorithms
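The arithmetic checks out directly (the leading 4 in the slide's sum is taken as given):

```python
# Each k contributes 4**k * 3 * 2 k-gram features (3 frame options x 2
# window sides, per the formula above); summing k = 1..5 plus the
# slide's leading 4 gives the quoted total.
terms = [4] + [4 ** k * 3 * 2 for k in range(1, 6)]
print(terms)       # [4, 24, 96, 384, 1536, 6144]
print(sum(terms))  # 8188
```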
Signal Selection (Basic Idea)
Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance
Which of the following 3 signals is good?
Signal Selection (e.g., t-statistics)
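A minimal version of the two-sample t-statistic for scoring one signal (the standard unpooled form, not code from the talk): a large |t| means high inter-class distance relative to intra-class spread.

```python
import math

def t_statistic(xs, ys):
    """Two-sample (unpooled) t-statistic for one feature's values in
    class 1 (xs) vs class 2 (ys)."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):  # sample variance
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    return (mean(xs) - mean(ys)) / math.sqrt(var(xs) / len(xs) + var(ys) / len(ys))
```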
Signal Selection (e.g., MIT-correlation)
Signal Selection (e.g., χ²)
Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS: a good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
Homework: find a formula that captures the key idea of CFS above
Sample k-grams Selected
Position –3 (Kozak consensus)
In-frame upstream ATG (leaky scanning)
In-frame downstream TAA, TAG, TGA (stop codon)
CTG, GAC, GAG, and GCC (codon bias)
Signal Integration
kNN: given a test sample, find the k training samples that are most similar to it, and let the majority class win.
SVM: given a group of training samples from two classes, determine a separating plane that maximises the margin.
Naïve Bayes, ANN, C4.5, ...
Results (on Pedersen & Nielsen’s mRNA)
Classifier      Sensitivity TP/(TP+FN)  Specificity TN/(TN+FP)  Precision TP/(TP+FP)  Accuracy
Naïve Bayes     84.3%                   86.1%                   66.3%                 85.7%
SVM             73.9%                   93.2%                   77.9%                 88.5%
Neural Network  77.6%                   93.2%                   78.8%                 89.4%
Decision Tree   74.0%                   94.4%                   81.1%                 89.4%
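The four columns are the usual sensitivity, specificity, precision, and accuracy. A small helper (illustrative, not the evaluation code used in the experiments) makes the definitions explicit:

```python
def metrics(tp, fn, tn, fp):
    """The four measures reported in the results table."""
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP+FN)
        "specificity": tn / (tn + fp),   # TN/(TN+FP)
        "precision":   tp / (tp + fp),   # TP/(TP+FP)
        "accuracy":    (tp + tn) / (tp + fn + tn + fp),
    }
```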
Acknowledgements
Roland Yap, Zeng Fanfan, A.G. Pedersen, H. Nielsen
Questions?
Common Mistakes
Self-fulfilling Oracle
Consider this scenario:
Given classes C1 and C2 w/ explicit signals
Use χ² on C1 and C2 to select signals s1, s2, s3
Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
Is the accuracy really 90%? What can be wrong with this?
Phil Long’s Experiment
Let there be classes C1 and C2 w/ 100000 features having randomly generated values
Use χ² to select 20 features
Run k-fold x-validation on C1 and C2 w/ these 20 features
Expect: 50% accuracy. Get: 90% accuracy!
Lesson: choose features at each fold
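Phil Long's effect is easy to reproduce at reduced scale. The sketch below is an assumption-laden simulation, not the original experiment: 2000 rather than 100000 random features, mean-difference scoring standing in for χ², and a nearest-centroid classifier. It contrasts selecting features once on all the data against selecting them inside each fold; only the latter gives honest, near-chance accuracy on random data.

```python
import random

random.seed(0)
n, d = 20, 2000                      # scaled down from 100000 random features
labels = [i % 2 for i in range(n)]
data = [[random.random() for _ in range(d)] for _ in range(n)]

def score(j, idx):
    """Class-mean difference of feature j over the samples in idx."""
    a = [data[i][j] for i in idx if labels[i] == 1]
    b = [data[i][j] for i in idx if labels[i] == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def top_features(idx, m=20):
    return sorted(range(d), key=lambda j: score(j, idx), reverse=True)[:m]

def nearest_centroid_acc(features, train, test):
    correct = 0
    for i in test:
        dist = {}
        for c in (0, 1):
            members = [t for t in train if labels[t] == c]
            cent = [sum(data[t][j] for t in members) / len(members) for j in features]
            dist[c] = sum((data[i][j] - cent[p]) ** 2 for p, j in enumerate(features))
        correct += (min(dist, key=dist.get) == labels[i])
    return correct / len(test)

feats_all = top_features(range(n))   # WRONG: selection sees the test folds
accs_wrong, accs_right = [], []
for f in range(4):                   # 4-fold cross-validation
    test = list(range(f * 5, f * 5 + 5))
    train = [i for i in range(n) if i not in test]
    accs_wrong.append(nearest_centroid_acc(feats_all, train, test))
    accs_right.append(nearest_centroid_acc(top_features(train), train, test))

# On purely random data, the "wrong" protocol typically looks far better
# than chance, while per-fold selection stays near 50%.
print(sum(accs_wrong) / 4, sum(accs_right) / 4)
```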
Apples vs Oranges
Consider this scenario:
Fanfan reported 89% accuracy on his TIS prediction method
Hatzigeorgiou reported 94% accuracy on her TIS prediction method
So Hatzigeorgiou's method is better
What is wrong with this conclusion?
Apples vs Oranges
Differences in datasets used:
Fanfan's expt used Pedersen's dataset
Hatzigeorgiou's used her own dataset
Differences in counting:
Fanfan's expt was on a per-ATG basis
Hatzigeorgiou's expt used the scanning rule and thus was on a per-cDNA basis
When Fanfan ran the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!
Questions?