US Army Engineer Research & Development Center
One Team, One ERDC . . . Relevant, Ready, Responsive, Reliable
Ping Gong
Environmental Genomics and Genetics (EGG) Team @ Environmental Laboratory
Embracing the Post-Omics Era with Computational Biology/Toxicology
USM Seminar 1/22/2010
Outline
• Introduction
• Bioinformatics
• Knowledgebase
• Reverse engineering
• Machine learning
• Predictive modeling
• Summary
Omics: technological evolution
Adaptation to evolving technologies
U.S. EPA’s stand
U.S. Army’s stand
Limited number of chemicals (< 50 parent compounds)
Too many organisms (millions of species)
Computational biology/toxicology
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. (NIH 2000)
Computational biology is an interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems. (Wikipedia)
Computational Toxicology: "integration of modern computing and information technology with molecular biology to improve Agency prioritization of data requirements and risk assessment of chemicals" (U.S. EPA 2003)
Computational toxicology is the application of mathematical and computer models to predict adverse effects and to better understand the mechanism(s) through which a given chemical causes harm. (US National Library of Medicine, NIH)
Computational biology/toxicology
[Figure: ? + supercomputer cluster (Nautilus supercomputer, IBM) = Computational Biology/Toxicology]
Bioinformatics
Bioinformatics (NIH 2000): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Bioinformatics (Oxford Journal) Scope Guidelines (Wikipedia):
• Genome analysis
• Sequence analysis
• Phylogenetics
• Structural bioinformatics
• Gene expression
• Genetics and population analysis
• Systems biology
• Data and text mining
• Databases and ontologies
Sanger sequence analysis
Sanger electrophoresis sequencing (ABI PRISM® 3130xl DNA Sequencer): one template per reaction
Base-calling algorithm for automated DNA sequencing: Phred (Phil Green). Genome Res. (1998) 8:175-185 & 186-198.
1: locate predicted peaks
2: locate observed peaks
3: match observed and predicted peaks
4: find missed peaks
Quality score: $Q = -10 \times \log_{10}(P_e)$, where $P_e$ is the error probability
CodonCode
Q     Pe       Accuracy
10    10%      90%
20    1%       99%
30    0.1%     99.9%
40    0.01%    99.99%
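The relation between the Phred quality score and the error probability, Q = −10 × log10(Pe), can be checked with a few lines of Python. This is a minimal sketch for illustration only; the function names are hypothetical and are not part of Phred itself.

```python
import math


def phred_quality(p_error):
    """Quality score Q for a base-call error probability Pe (0 < Pe <= 1)."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Inverse relation: error probability Pe for a quality score Q."""
    return 10.0 ** (-q / 10.0)


def accuracy(q):
    """Base-call accuracy implied by Q, i.e. 1 - Pe."""
    return 1.0 - error_probability(q)


if __name__ == "__main__":
    # Print the Q / Pe / accuracy rows for the scores listed on the slide.
    for q in (10, 20, 30, 40):
        print(q, error_probability(q), accuracy(q))
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability, which is why Q20 is a common read-quality threshold.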
454 sequence analysis
High throughput sequencing (454, SOLiD, Solexa/Illumina)
454 pyrosequencing (sequencing by synthesis)
Basecalling
Workflow comparison between 454 and Sanger
454 GS FLX Titanium series:
• 1 M Q20 reads (400~600 Mb) per 10-hr run
• Average read length = 400 bp (1000 bp in year 2010)
Computational sequence assembly
Principle: Align, Overlap, Consensus
MIRA: a hybrid 454/Sanger reads assembler (Streptococcus pneumoniae TIGR4 genome) (www.chevreux.org/mira_ex_454sanger.html)
Low Sanger coverage (one overcall)
High Sanger coverage (4 of 5 trace undercall)
Sequence assembly
Computational Biology and Chemistry 2009. 33 (2): 121-136
http://genome.ku.dk/resources/assembly/methods.html
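The align/overlap/consensus principle can be illustrated with a toy example. The sketch below is not how MIRA works internally (real assemblers use quality values, error-tolerant alignment, and much more); it only shows exact suffix/prefix overlap detection and a per-column majority vote. All function names and the example reads are hypothetical.

```python
from collections import Counter


def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their exact overlap; None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    return left + right[k:] if k else None


def consensus(aligned_reads):
    """Majority base per alignment column; '-' marks uncovered positions."""
    result = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        result.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(result)


if __name__ == "__main__":
    print(merge_reads("ACGTAC", "TACGGA"))
    # The second read carries one miscalled base that the other reads outvote.
    print(consensus(["ACGT-", "ACCTA", "-CGTA"]))
```

The consensus step is where higher coverage helps: a single miscalled base is outvoted once several reads cover the same position.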
Bioinformatics – sequence annotation
Similarity/Homology/Motif-based tools: BLAST, InterProScan
Earthworm transcriptome project

E. fetida cDNA sequences            Sanger    454 GS
Raw sequence reads                    4032    562327
Quality filtered sequence reads       3144    518350
Average sequence length (base)         310       104
Assembled contigs                      448     31114
Reads assembled into contigs          1361    236963
Unassembled singletons                1783    157070
Repeats                                  -    121090
Outliers and partial sequences           -      3227
Unique sequences                      2231    188184

Annotation:
BLAST: 10K matches (E ≤ 10⁻⁴); InterProScan: 35K matches; Transcription factor (HMM prediction): 2627 matches
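The read counts in the E. fetida table are internally consistent, which the following sketch verifies. It assumes that the two single-valued rows (Repeats; Outliers and partial sequences) belong to the 454 GS column, and that unique sequences are contigs plus singletons; both are inferences from the arithmetic, not statements from the slide. The dictionary keys are hypothetical labels.

```python
# Counts copied from the table. Assigning "repeats" and "outliers" to the
# 454 GS column is an assumption (the slide gives one value for each row).
reads = {
    "Sanger": {
        "filtered": 3144, "in_contigs": 1361, "singletons": 1783,
        "repeats": 0, "outliers": 0, "contigs": 448, "unique": 2231,
    },
    "454 GS": {
        "filtered": 518350, "in_contigs": 236963, "singletons": 157070,
        "repeats": 121090, "outliers": 3227, "contigs": 31114, "unique": 188184,
    },
}


def accounted_reads(counts):
    """Filtered reads broken down by what happened to them during assembly."""
    return (counts["in_contigs"] + counts["singletons"]
            + counts["repeats"] + counts["outliers"])


def unique_sequences(counts):
    """Unique sequences taken as assembled contigs plus unassembled singletons."""
    return counts["contigs"] + counts["singletons"]


if __name__ == "__main__":
    for platform, counts in reads.items():
        print(platform,
              accounted_reads(counts) == counts["filtered"],
              unique_sequences(counts) == counts["unique"])
```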
In-depth functional annotation
Sanger sequences only
Knowledgebase: a new DOE initiative
What’s a knowledgebase?
Systems Biology Knowledgebase (Knowledgebase) serves as a foundation on which scientists can integrate modeling, simulation, experimentation, and bioinformatics. It will not only give scientists free and broad access to diverse data types but will also provide sophisticated tools for data analysis, visualization, and integration.
Specific objectives of DE-FOA-0000143:
(1) develop methods to integrate multiple data types;
(2) develop new methods to infer and curate (meta)genomic functional annotations;
(3) develop methods to couple multiple cellular pathways and processes; and
(4) develop new methods to model whole cellular processes.
Omics databases ≠ Knowledgebase
DECIPHER
DBD
Knowledgebase if data → knowledge
A query-oriented data management system developed jointly by
Earthworm Toxicogenomics Knowledgebase (ETKB)
• Proteomic data
• Transcription factors
• microRNA & target mRNA
• Genome sequence & annotation
• QTL
• SNP
• Metabolites
• Toolkits (BLOM)
Reverse engineering
Reverse engineering (RE) is the process of discovering the technological principles of a device, object or system through analysis of its structure, function and operation. (Wikipedia)
Network reconstruction: A fundamental problem in functional genomics is to determine the structure and dynamics of genetic networks based on expression data.
[Figure: An embryonic developmental gene network in a fruit fly. Gene network by Andrey Rzhetsky (From: The New Genetics)]
GRN in diseased neurons
Changes in gene regulatory networks during disease progression
-- Leroy Hood “Dynamics of a prion perturbed network in mice”
Changes in gene regulatory networks
Gene expression alteration; gene connectivity alteration
Normal state
Stressed
Diseased
DREAM project
RE for network inference
Comparison of different models for GRN inference
Earthworm Neurotransmission Network reconstruction using BLOM
Bayesian Learning and Optimization Model (BLOM)
$x_{t+1} = F x_t + w_t$
$y_t = H x_t + v_t$
[Diagram, panel "Biological system and measurement": Controls; System error source; System of Gene Regulation, a Dynamic System (Stochastic); System state (desired but unknown); Measuring devices; Measurement error source; Observed measurements]
[Diagram, panel "GRN reconstruction model": Prior knowledge; Forward Model (Deterministic); Inverse Model (State estimate and parameter learning); Estimate of system state; Update GRNs]
1) Apply three treatments to earthworms (mature, 0.5 ± 0.1 g): Control, RDX (2 µg/cm²), or Carbaryl (20 ng/cm²)
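The slide's linear state-space form, x[t+1] = F·x[t] + w[t] with measurements y[t] = H·x[t] + v[t], is the setting in which a Kalman filter estimates the hidden state from noisy observations. The sketch below is a generic scalar Kalman filter shown only to illustrate that kind of state estimation; the slide does not say which estimator BLOM actually uses, and the function names, the noise variances `q` and `r`, and the measurement values are all assumptions made for illustration.

```python
def kalman_step(x_est, p_est, y, F, H, q, r):
    """One predict/update cycle for the scalar model
    x[t+1] = F*x[t] + w[t],  y[t] = H*x[t] + v[t],
    with Var(w) = q and Var(v) = r (assumed known here)."""
    # Predict: push the estimate and its variance through the forward model.
    x_pred = F * x_est
    p_pred = F * p_est * F + q
    # Update: weigh the new measurement by the Kalman gain.
    gain = p_pred * H / (H * p_pred * H + r)
    x_new = x_pred + gain * (y - H * x_pred)
    p_new = (1.0 - gain * H) * p_pred
    return x_new, p_new


def filter_series(measurements, F=1.0, H=1.0, q=0.0, r=1.0, x0=0.0, p0=1e6):
    """Run the filter over a series; returns (estimate, variance) per step."""
    x, p = x0, p0
    history = []
    for y in measurements:
        x, p = kalman_step(x, p, y, F, H, q, r)
        history.append((x, p))
    return history


if __name__ == "__main__":
    # Hypothetical noisy readings of a constant expression level.
    for estimate, variance in filter_series([2.0, 4.0, 3.0, 3.2]):
        print(round(estimate, 3), round(variance, 3))
```

With F = 1 and no process noise, the estimate approaches the running mean of the measurements and its variance shrinks with every observation.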
Time series toxicity test results
[Figure phases: Acclimation, Exposure, Recovery]
Genes involved in cholinergic pathway
Reconstructed network using BLOM
[Figure panels: Control, Carbaryl_exposure, Carbaryl_recovery]
Preliminary results (1/3 of samples)
Reconstructed network using BLOM
[Figure panels: Control, RDX_exposure, RDX_recovery]
Preliminary results (1/3 of samples)
What’s Machine Learning?
Machine Learning is the study of computer algorithms that improve automatically through experience. Applications range from data mining programs that discover general rules in large data sets, to information filtering systems that automatically learn users' interests.
DAC pipeline for biomarker discovery
• DAC = discriminant analysis and clustering
• Treatments = control, RDX-exposed, TNT-exposed
• Statistical Analysis
• Tree-based classification
• Support Vector Machine (SVM) model
• Hierarchical Clustering
• Numbers in brackets indicate the number of genes remaining after each step
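Of the DAC pipeline stages, hierarchical clustering is the simplest to show in a few lines. The sketch below is a plain agglomerative clustering with single linkage and Euclidean distance; the slide does not state which linkage or distance the pipeline uses, so both are assumptions, and the gene names and expression values are made up for illustration.

```python
def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hierarchical_clusters(profiles, n_clusters):
    """Agglomerative clustering (single linkage) of named profiles.

    `profiles` maps a gene name to its expression values. Clusters are merged
    pairwise, closest first, until `n_clusters` remain.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical genes measured under control, RDX and TNT treatments.
    example = {
        "geneA": [1.0, 5.0, 5.2],
        "geneB": [1.1, 5.1, 5.0],
        "geneC": [4.0, 1.0, 0.9],
        "geneD": [4.2, 1.1, 1.0],
    }
    print(hierarchical_clusters(example, 2))
```

Genes that respond similarly across treatments end up in the same cluster, which is how a clustering step can group candidate biomarkers by response pattern.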
Mathematical modeling in biology
Mathematical modeling of biology is not a new endeavour
Large-scale systems model of cardiovascular physiology – Guyton 1972
From simulation to prediction
v-Lung
v-Liver™
v-Canary
Predictive Toxicology
Math will rock the world
"All models are wrong, but some are useful." -- George Box
• Models as containers of belief or disbelief.
• Belief = Understanding
• Biology = High complexity
Challenges
• Incomplete knowledge
• Wide time and size scale
• Extrapolation from in vitro to in vivo
• Validation often difficult
iSimBioSys
“A discrete event based stochastic simulation approach for studying the dynamics of biological networks”
• iSimBioSys - Discrete event based biosimulation engine
• Stochastic modeling of biological events and logic modules
• In silico results
• Integrated platform
• HimSim – Flow based simulation of metabolic network engine
• Integrated database of biological networks & pathways
Credit: P. Ghosh
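A discrete event based stochastic simulation advances time from one random event to the next rather than in fixed steps. The sketch below uses the well-known Gillespie stochastic simulation algorithm on a single molecular species that is produced at a constant rate and degraded in proportion to its count. It is a generic illustration of the idea, not the iSimBioSys engine, whose internals the slide does not describe; the rate constants and function name are hypothetical.

```python
import random


def simulate_birth_death(k_make, k_decay, n0, t_end, seed=0):
    """Gillespie simulation of production (rate k_make) and
    degradation (rate k_decay * n) of one molecular species.

    Returns a list of (time, count) pairs, one per event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        rate_make = k_make
        rate_decay = k_decay * n
        total = rate_make + rate_decay
        if total == 0.0:
            break  # nothing can happen any more
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which event fires, in proportion to its rate.
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


if __name__ == "__main__":
    path = simulate_birth_death(k_make=10.0, k_decay=1.0, n0=0, t_end=5.0)
    print(len(path), path[-1])
```

Because every run draws its own event times, repeated runs with different seeds give a distribution of trajectories rather than a single curve, which is the main difference from a deterministic rate-equation model.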
CompuCell3D
http://www.biocomplexity.indiana.edu
http://www.compucell3d.org
Credit: J. Glazier
Simulation & visualization
http://www.iac.rm.cnr.it/~filippo/OldWWW/CImmSimWeb/examples.html
CANCER GROWTH
A 3D plot of the tumor growth in a simulation.
Summary