Limsoon Wong
Laboratories for Information Technology
Singapore
From Informatics to Bioinformatics
What is Bioinformatics?
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science
From Informatics to Bioinformatics
Integration Technology (Kleisli)
Cleansing & Warehousing (FIMM)
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Venom Informatics
1994 1996 1998 2000 2002
8 years of bioinformatics R&D in Singapore
ISS KRDL LIT
Data Integration
A DOE "impossible query":
For each gene on a given cytogenetic band, find its non-human homologs.
source   type     location    remarks
GDB      Sybase   Baltimore   Flat tables; SQL joins; Location info
Entrez   ASN.1    Bethesda    Nested tables; Keywords; Homolog info
Data Integration Results
sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
#accn: g.#genbank_ref, #nonhuman-homologs: H
from
L as c, E as g,
{select u
from g.#genbank_ref.na-get-homolog-summary as u
where not(u.#title string-islike "%Human%") andalso
not(u.#title string-islike "%H.sapien%")} as H
where
c.#chrom_num = "22" andalso
g.#object_id = c.#locus_id andalso
not (H = { });
Using Kleisli:
• Clear
• Succinct
• Efficient
• Handles heterogeneity and complexity
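The shape of the query above can be illustrated outside Kleisli. The following is a minimal Python sketch, not Kleisli's implementation: the two GDB tables and the Entrez homolog lookup are replaced by small in-memory stand-ins, and every record, accession string, and function name here is invented for illustration. It keeps the same logic as the query: join location to accession, filter homolog titles, and drop genes whose homolog set is empty.

```python
# Toy stand-ins for the two sources. All records are invented for illustration.
LOCUS_CYTO_LOCATION = [  # plays the role of view L (GDB flat table)
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "22"},
    {"locus_id": 3, "chrom_num": "7"},
]
OBJECT_GENBANK_EREF = [  # plays the role of view E (GDB flat table)
    {"object_id": 1, "genbank_ref": "ACC_A"},
    {"object_id": 2, "genbank_ref": "ACC_B"},
    {"object_id": 3, "genbank_ref": "ACC_C"},
]
HOMOLOG_SUMMARIES = {  # plays the role of the Entrez homolog lookup
    "ACC_A": [{"title": "Mouse gene X"}, {"title": "Human gene X"}],
    "ACC_B": [{"title": "H.sapiens gene Y"}],
    "ACC_C": [{"title": "Rat gene Z"}],
}


def get_homolog_summary(accession):
    """Hypothetical stand-in for na-get-homolog-summary."""
    return HOMOLOG_SUMMARIES.get(accession, [])


def nonhuman_homologs(chrom_num):
    """For each gene on the given chromosome, collect its non-human homologs."""
    results = []
    for c in LOCUS_CYTO_LOCATION:
        if c["chrom_num"] != chrom_num:
            continue
        for g in OBJECT_GENBANK_EREF:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested subquery: keep homologs whose title does not mention human.
            h = [
                u
                for u in get_homolog_summary(g["genbank_ref"])
                if "Human" not in u["title"] and "H.sapien" not in u["title"]
            ]
            if h:  # corresponds to: not (H = { })
                results.append({"accn": g["genbank_ref"], "nonhuman_homologs": h})
    return results
```

Calling `nonhuman_homologs("22")` on this toy data returns one nested record for `ACC_A`; `ACC_B` is dropped because its only homolog is human.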
Data Warehousing
Motivation
efficiency
availability
"denial of service"
data cleansing
Requirements
efficient to query
easy to update
model data naturally
{(#uid: 6138971,
#title: "Homo sapiens adrenergic ...",
#accession: "NM_001619",
#organism: "Homo sapiens",
#taxon: 9606,
#lineage: ["Eukaryota", "Metazoa", …],
#seq: "CTCGGCCTCGGGCGCGGC...",
#feature: {
(#name: "source",
#continuous: true,
#position: [
(#accn: "NM_001619",
#start: 0, #end: 3602,
#negative: false)],
#anno: [
(#anno_name: "organism",
#descr: "Homo sapiens"), …] ), …)}
Data Warehousing Results
A relational DBMS alone is insufficient because it forces us to fragment data into 3NF.
Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use a flat relational DBMS such as Sybase, Oracle, or MySQL as its updatable complex object store.
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x
into GP
from aa-get-seqfeat-general "PTP" as x
using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title
from GP as x
where x.#uid = 131470;
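The idea of keeping a whole nested record in one column of a flat table can be sketched in Python. This is only an illustration under assumptions: it uses SQLite and JSON from the standard library rather than Oracle and Kleisli's own object encoding, and the helper names and the sample record are invented.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory flat table with a key column and one blob-like column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
    return conn


def put_record(conn, record):
    """Store the whole nested record, serialized, under its uid."""
    conn.execute(
        "INSERT OR REPLACE INTO GP (uid, detail) VALUES (?, ?)",
        (record["uid"], json.dumps(record)),
    )


def get_title(conn, uid):
    """Fetch one record by uid and project a field out of the nested object."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return None if row is None else json.loads(row[0])["title"]


conn = open_store()
put_record(
    conn,
    {
        "uid": 131470,
        "title": "example title (invented)",
        "feature": [{"name": "source", "position": [{"start": 0, "end": 10}]}],
    },
)
```

Here `get_title(conn, 131470)` plays the role of the final query above: the flat store locates the row by key, and the nested structure is recovered after retrieval without being fragmented into 3NF tables.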
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSEEVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLNLNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRSLLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVILTDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNRFLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEKTASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQCEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENIIDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQKPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDNQNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGNRHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHEKPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVPGAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity
Prediction by BIMAS matrix for HLA-A*1101 (axis: rank by BIMAS, marked at 1, 66, 100): number of experimental binders 19 (52.8%), 5 (13.9%), 12 (33.3%)
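The percentages on this slide are consistent with simple ratios, which the short check below reproduces. It assumes that the 76% figure is 22 epitopes out of 29 predictions and that the three binder counts are shares of their own total; the variable names are illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


ann_predictions = 29
ann_epitopes = 22
binders_by_band = [19, 5, 12]  # the three counts reported for the BIMAS ranking

total_binders = sum(binders_by_band)
ann_rate = percent(ann_epitopes, ann_predictions)
band_shares = [percent(n, total_binders) for n in binders_by_band]
```

Under these assumptions `ann_rate` is 75.9 and `band_shares` is `[52.8, 13.9, 33.3]`, matching the slide, with 36 experimental binders in total.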
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
Looking for patterns that are
valid
novel
useful
understandable
age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y
Gene Expression Analysis
Classifying gene expression profiles
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression
Medical Record & Gene Expression Analysis Results
PCL, a novel "emerging pattern" method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
Works well for gene expressions
Cancer Cell, March 2002, 1(2)
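To make the term "emerging pattern" concrete: it is commonly defined as a set of attribute values whose support rises sharply from one class to another. The sketch below computes support and growth rate on the five medical records shown earlier. It illustrates the notion only and is not the PCL classifier; the function names are illustrative.

```python
# The five records from the Medical Record Analysis slide.
RECORDS = [
    {"age": 49, "sex": "M", "chol": 266, "ecg": "Hyp", "heart": 171, "sick": "N"},
    {"age": 64, "sex": "M", "chol": 211, "ecg": "Norm", "heart": 144, "sick": "N"},
    {"age": 58, "sex": "F", "chol": 283, "ecg": "Hyp", "heart": 162, "sick": "N"},
    {"age": 58, "sex": "M", "chol": 284, "ecg": "Hyp", "heart": 160, "sick": "Y"},
    {"age": 58, "sex": "M", "chol": 224, "ecg": "Abn", "heart": 173, "sick": "Y"},
]


def support(pattern, records):
    """Fraction of records that match every attribute value in the pattern."""
    if not records:
        return 0.0
    hits = sum(all(r[k] == v for k, v in pattern.items()) for r in records)
    return hits / len(records)


def growth_rate(pattern, background, target):
    """Support in the target class divided by support in the background class."""
    s_bg = support(pattern, background)
    s_t = support(pattern, target)
    if s_bg == 0:
        return float("inf") if s_t > 0 else 0.0
    return s_t / s_bg


sick = [r for r in RECORDS if r["sick"] == "Y"]
healthy = [r for r in RECORDS if r["sick"] == "N"]
```

On this toy table the pattern `{"age": 58, "sex": "M"}` matches both sick records and no healthy one, so its growth rate is infinite, while `{"sex": "M"}` alone has a growth rate of only 1.5. Five records are far too few to draw conclusions; the point is just the computation.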
Protein Interaction Extraction
"What are the protein-protein interaction pathways from the latest reported discoveries?"
Protein Interaction Extraction Results
Rule-based system for processing free texts in scientific abstracts
Specialized in
extracting protein names
extracting protein-protein interactions
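A rule-based extractor of this general kind can be sketched with a single regular-expression rule. This is a toy illustration and not the PIES rule set: the name heuristic, the verb list, and the function name are all assumptions made for the example.

```python
import re

# Hypothetical interaction verbs; a real system would use a much richer rule set.
VERBS = ["interacts with", "binds to", "phosphorylates", "activates", "inhibits"]

# Naive protein-name heuristic (an assumption): a token that starts with a capital
# letter and contains at least one further capital letter or digit, e.g. "RAD51".
NAME = r"[A-Z][A-Za-z]*[A-Z0-9][A-Za-z0-9]*"

RULE = re.compile(
    r"\b(?P<a>" + NAME + r")\s+(?P<verb>"
    + "|".join(re.escape(v) for v in VERBS)
    + r")\s+(?P<b>" + NAME + r")\b"
)


def extract_interactions(text):
    """Return (protein, verb, protein) triples matched by the single rule."""
    return [(m.group("a"), m.group("verb"), m.group("b")) for m in RULE.finditer(text)]
```

For example, `extract_interactions("We show that RAD51 interacts with BRCA2. In addition, CDK2 phosphorylates RB1 in late G1.")` yields two triples, one per sentence. The heuristic misses lower-case protein names and passive constructions, which is the sort of gap a fuller rule base would have to cover.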
Behind the Scene
Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang
Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang
and many more: students, folks from geneticXchange, Molecular Connections, and other collaborators.