Towards the virtual organism
• PART I: Databases and tools for biochemical pathways
• PART II: Relating expression data and pathways
• PART III: Guided Tour: elucidate organelle-related pathways
Pathway diagram: WIT database
Major contributions of Pathways databases
• Information Resource - Literature compilation
• Gene Ontology
• Sequence and Genome Annotation
• Relationship between pathways (function) and chromosomal position
• Analysis of Gene Expression Arrays
• Understanding Cellular Dynamics
• Disease Process Modeling
Without context and purpose, information is mere data. - Clement Mok
As when a highly connected node in the internet breaks down, the disruption of p53 has severe consequences.
Jeong et al. 2001 Nature
Towards the virtual organism
Introduce biochemical pathways resources
• What Is There (WIT/PUMA/EMP/ERGO)
• Kyoto Encyclopedia of Genes and Genomes (KEGG)
• Signalling Databases
• Pathways Database (PathDB)
Focus on
• Accessibility
• Database contents and models
• Query features
• Gene/Protein/Pathway analysis
• Visualization
Why do all these projects seem to do the same thing?
• Data model is a view of the world
– Different database management systems
– Tools particular to data model and database management systems
– Different content
• Analogous to model system approach to biology
– E. coli, yeast, C. elegans, Drosophila, mouse, etc. are all used to provide understanding of human biology
• No one system does everything, but concepts and data can often be shared
He may have stole that song from me, but I steal from everybody. - Woody Guthrie
WIT/PUMA/EMP System
• Argonne National Lab and Integrated Genomics Inc, USA
• http://wit.mcs.anl.gov/WIT2/
• Ross Overbeek, Evgeni Selkov, Natalia Maltsev
• Team: 7
• WIT is freely downloadable (ftp://ftp.mcs.anl.gov/pub/Genomics/WIT2/)
WIT/PUMA/EMP System
• Annotation/Literature database
• Blast, PSI-Blast
• ClustalW
• COG
• ProtScale
• Transmembrane helices/topology
• Prodom
• ProSite
• Operons (Pairs of close bidirectional best hits)
Focus on: sequence analysis, annotation of genomes with respect to metabolism
Ways to go: from genes to pathways
Starting from
• Gene/protein sequence
• Gene/protein name
• Organism/Genome (‘Metabolic reconstruction’)
To Pathways of
– Metabolism
– DNA
– Regulation of metabolism
From Blast results to genes
From genes to pathways
WIT Pathway diagrams: Tabular format
WIT Pathway diagrams: Picture
Links to further information
WIT Detail pages: Enzyme
Name, Reaction, EC, Description
Specific Activity
Preparative Protocol
Substrates, Coenzymes, Inhibitors, Modification, Kinetics, Genomes ...
Kyoto Encyclopedia of Genes and Genomes (KEGG)
• Institute for Chemical Research, Kyoto University
• http://www.genome.ad.jp/kegg/
• Minoru Kanehisa
• System development: 9
• Data entry and curation: 18
• Academic users may freely download the package
• ftp://kegg.genome.ad.jp/mirror/
KEGG: Data content and statistics
• 3705 EC numbers
• 11132 Enzyme names
• 3794 Substrates
• 5284 Metabolic reactions
• 113 Pathways
– mostly metabolic
• 36 Organisms
KEGG: Query capabilities
• Reconstruct pathway maps using blast
• Search and color genes, enzymes and compounds in pathway diagrams and ortholog tables
• Sequence: blast and fasta
• Genome Maps
• Generate reaction paths between compounds
Focus on: display gene-centric data in the context of predefined pathways
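The query "generate reaction paths between compounds" can be pictured as a path search in a graph whose nodes are compounds and whose edges are reactions. The following is only a minimal sketch of that idea using breadth-first search; it is not KEGG's actual algorithm, and the function name `reaction_path` and the three-reaction toy list are assumptions made for illustration.

```python
from collections import deque

def reaction_path(reactions, start, goal):
    """Breadth-first search for a shortest chain of compounds from start to goal.

    `reactions` is a list of (substrate, product) pairs; this toy model
    ignores cofactors, reversibility and stoichiometry.
    """
    graph = {}
    for substrate, product in reactions:
        graph.setdefault(substrate, []).append(product)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of reactions connects the two compounds

# Hand-written toy data: the first steps of glycolysis.
toy_reactions = [
    ("glucose", "glucose 6-phosphate"),
    ("glucose 6-phosphate", "fructose 6-phosphate"),
    ("fructose 6-phosphate", "fructose 1,6-bisphosphate"),
]
path = reaction_path(toy_reactions, "glucose", "fructose 1,6-bisphosphate")
print(" -> ".join(path))
```

A real resource would have to deal with ubiquitous compounds such as ATP or water, which otherwise create meaningless shortcuts between unrelated pathways.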
‘State of the Art’
KEGG picture of the glycolysis
genes present in E. coli
static Network
manually compiled
manually drawn
textbook knowledge
Representation of Networks
static Network
• manually compiled
• manually drawn
• textbook knowledge
versus
dynamic Network
• features complete knowledge
• restriction of content is up to the user
• experimental data can be reflected in net structure
• include user-owned data
Pathway related projects
Metabolic Pathways
• KEGG Metabolic Pathways
• EMP - Enzymes and Metabolic Pathways
• WIT - Metabolic Reconstruction
• UM-BBD - Microbial Biocatalysis/Biodegradation
• EcoCyc - E. coli Genes and Metabolism
• SoyBase - Soybean Metabolism
• Metalgen - Genes and Metabolism
• Boehringer Mannheim - Biochemical Pathways
• IUBMB-Nicholson Minimaps
• PathDB - Plant Metabolic Pathways
Protein-Protein Interactions
• BRITE Database for Biomolecular Relations
• DIP - Database of Interacting Proteins
• BIND - Biomolecular Interaction Network Database
Regulatory Pathways
• KEGG Regulatory Pathways
• SPAD - Signal Transduction
• CSNDB - Cell Signaling Networks
• Yeast Pathways in MIPS
• Interactive Fly - Drosophila Genes
• GIF_DB - Drosophila Gene Interactions
• FlyNets - Drosophila Molecular Interactions
• GeNet - Gene Networks Database
• HOX-Pro - Homeobox Genes Database
• Wnt Signaling Pathway
• TRANSPATH - Gene Regulatory Pathways
• GenMapp - Mostly mouse pathways
Enzymes, Compounds
• LIGAND - Chemical Database for Enzyme Reactions
• ENZYME - Enzymes
• BRENDA - Comprehensive Enzyme Information System
• Worthington Enzyme Manual
• Klotho - Biochemical Compounds
• ChemFinder - Searching Chemicals
• ChemIDplus at NLM
• PROMISE - Prosthetic Groups and Metal Ions
• GlycoSuiteDB - Glycan Structure Database
• CarbBank - Complex Carbohydrate Structure Database
• WebElements - Periodic Table
Transcription Factors
• TRANSFAC - Transcription Factor Database
• RegulonDB - E. coli Transcriptional Regulation
• DBTBS - B. subtilis Transcription Factors
• DPInteract - DNA binding proteins
Nomenclature - General
• IUBMB - Nomenclature
• IUPAC - Nomenclature
• SWISS-PROT - Documents
• GO - Gene Ontology (FlyBase/SGD/MGD/TAIR/WormBase)
Simulation of biochemical reactions and cellular processes
• BioKin - Enzyme kinetic software
• BioQuest - Metabolic Simulation
• BioSpice - still in progress
• Bioxml.org - a site collecting together a number of biologically-oriented open-source projects
• DBsolve - Software for metabolic, enzymatic and receptor-ligand binding simulation
• DMSS - Scalable, Discrete Event Metabolic Simulation System
• E-Cell - A simulation platform for the modelling of cells at a molecular level
• Electronic Arc - experimental visual simulator
• Elementary Modes - has a Java simulation
• Gepasi - A software package for modelling systems of biochemical reactions
• Jarnac - A language for describing and manipulating cellular system models
• StochSim - A general-purpose stochastic simulator of biological reaction networks.
• Systems Biology Workbench - An XML based integration system
• Virtual Cell - A general computational framework for modeling cell biological processes
Signal transduction browser (Transpath)
PathDB
• National Center for Genome Resources
• http://www.ncgr.org/software/pathdb/
• Jeff Blanchard
• Software Development: 5
• Literature Curation: 4
• The software is freely available (Client)
• The database server can be installed at the site of cooperation partners
PathDB data model
• Compounds
• Macromolecules: lipids, polysaccharides
• Information molecules: DNA, RNA
• States: development, disease, genotype, phenotype, environment
• metabolic reactions
• protein modifications and interactions
• Regulation: transcriptional, translational, posttranslational
• Transport
• biological hierarchies, ontologies
• incomplete and conflicting knowledge
PathDB data model
[Diagram labels: Step with Substrate, Product and Mediator; BiochemicalEntity; Transition of Entities; Construction of Entities; Protein, Subunit, Compound, DNA, RNA, Building Blocks; Attributes: Location, Biological Process, Genotype, Phenotype, Environment]
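To make the data model more concrete, the sketch below represents a step that links biochemical entities in the roles of substrate, product and mediator. It is only an illustration under stated assumptions: the class and field names are borrowed from the diagram labels, but the actual PathDB schema is not shown on the slides, and the example reaction is simplified (cofactors omitted).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BiochemicalEntity:
    """Anything that can take part in a step (hypothetical simplification)."""
    name: str
    kind: str  # e.g. "Protein", "Compound", "DNA", "RNA"

@dataclass
class Step:
    """One transition: substrates become products, helped by mediators."""
    name: str
    substrates: list = field(default_factory=list)
    products: list = field(default_factory=list)
    mediators: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # e.g. location, genotype

glucose = BiochemicalEntity("glucose", "Compound")
g6p = BiochemicalEntity("glucose 6-phosphate", "Compound")
hexokinase = BiochemicalEntity("hexokinase", "Protein")

# Simplified example step; ATP/ADP are left out for brevity.
step = Step(
    name="glucose phosphorylation",
    substrates=[glucose],
    products=[g6p],
    mediators=[hexokinase],
    attributes={"location": "cytoplasm"},
)
print(step.name, "mediated by", [m.name for m in step.mediators])
```

Keeping attributes such as location or genotype on the step, rather than on the entity, is one way to let the same protein appear in different contexts; whether PathDB does this is not stated here.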
Platform for Network Analysis
Focus on: building custom networks, compare to large scale experiments
Relational database for metabolic reactions, regulation and states
(disease, genotype, phenotype)
QueryTool
Query the database, e.g. to collect a set of reactions
transform between types: proteins, compounds, steps
restrict to attributes: organism, location, states
PathwayViewer
Visualize the results of the search
Query window showing “Proteins involved in Biological process DNA repair”
• Transform to ‘Phenotype’
• Select ‘Caffeine Sensitivity’ and get all Proteins
• Do Intersection and get all Steps
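The query sequence above (collect proteins for a process, collect proteins for a phenotype, intersect, then transform to steps) maps naturally onto set operations. The following is a minimal sketch with invented identifiers; the dictionaries and the helper `steps_for` are hypothetical and do not reflect the QueryTool's real interface or PathDB content.

```python
# Hypothetical toy annotations; none of these identifiers come from PathDB.
proteins_by_process = {"DNA repair": {"protA", "protB", "protC"}}
proteins_by_phenotype = {"Caffeine Sensitivity": {"protB", "protC", "protD"}}
steps_by_protein = {
    "protA": {"step1"},
    "protB": {"step2", "step3"},
    "protC": {"step3"},
    "protD": {"step4"},
}

def steps_for(proteins):
    """Transform a set of proteins into the set of steps they take part in."""
    steps = set()
    for protein in proteins:
        steps |= steps_by_protein.get(protein, set())
    return steps

repair = proteins_by_process["DNA repair"]
caffeine = proteins_by_phenotype["Caffeine Sensitivity"]
shared = repair & caffeine          # the "Intersection" operation
shared_steps = steps_for(shared)    # the "transform to Steps" operation
print(sorted(shared), sorted(shared_steps))
```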
PathwayViewer
• Inspect and manipulate pathways or routes between metabolites.
• Alternate topological representations of a pathway: primary and secondary metabolites
• Manipulate layout on screen
• Control how much data is displayed
• Automatically lays out pathways
– hierarchical or circular algorithm
• Visualization of gene expression and metabolic profiling data
Visualize Steps involved in DNA synthesis and Caffeine sensitivity
Exploring the network neighborhood: build pathways on the fly
What data sources are out there?
• Sequences, Annotation: SW, GenBank, MIPS, EMBL
• Large-Scale Experiments: Gene expression, Protein-Protein, Metabolic profiling, Protein-SmallMol, Protein expression
• Ontologies: GO, UMLS/MESH, MBO
• Regulation, Metabolism: EcoCyc, KEGG, WIT, BRENDA, PathDB, CSNdb, BIND, aMAZE, BRITE, DIP
• Knowledge: Medline
Ontology: Bind genes to hierarchies
GO Gene Ontology, 2000
Translation/Mapping between:
Cellular Location
Anatomy
Biological Process
Molecular Function
Browsing the ontology
Hierarchy of Complexity
Level | Entities or States | Processes
molecular | protein, RNA, DNA, compounds | metabolic reactions, protein-protein interactions, conformation change
micro | organelles, cell types, tissues | mitosis, apoptosis, transcription
macro | disease states, development states, phenotype | disease, development, environment
Processes/Entities and experimental support
[Diagram labels: PathDB; Complete Wiring Diagram; Reference experimental support; Annotation, Sequences; Regulation, Metabolism; Large-Scale Experiments (Gene expression, Protein-Protein, Metabolic profiling, Protein-SmallMol, Protein expression); Knowledge; Ontologies]
Questions
How well does my set of gene expression arrays support my model of cellular processes?
What is the difference between a normal and a cancer cell?
What is the effect of a knockout mutation on the cellular network?
What “classical” pathways are up or down regulated in my gene expression data?
How does a drug perturb a cellular network as judged through gene expression data?
What experiment promises to distinguish between contradictory hypotheses?
PART II
Relating gene expression and pathways
Analysis of Expression Data
Clustering of time courses: Iyer et al., Science, 1999
“Scatter plot” comparing two experiments: Roberts et al., Cell, 2000
Using pathways to contextualize gene expression arrays
Miki et al. PNAS, 2001
Expression Pattern Clustering
J-Express: B. Dysvik / I. Jonassen, U. Bergen, Norway
Mapping of J-Express Cluster onto Pathways
sce00051 Fructose and mannose metabolism
  EC 3.1.3.46 Fructose-2,6-bisphosphate 2-phosphatase; Fructose-2,6-bisphosphatase
sce00190 Oxidative phosphorylation
  EC 1.9.3.1 Cytochrome-c oxidase; Cytochrome oxidase; Cytochrome a3; Cytochrome aa3
  EC 3.6.1.34 H+-transporting ATP synthase; H+-transporting ATPase; Mitochondrial ATPase; Coupling factors (F0-F1 and C0-F1); Chloroplast ATPase; Bacterial Ca2+/Mg2+ ATPase
  EC 3.6.1.38 Ca2+-transporting ATPase; Calcium pump
sce00251 Glutamate metabolism
  EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase
sce00252 Alanine and aspartate metabolism
  EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase
sce00410 beta-Alanine metabolism
  EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase
sce00640 Propanoate metabolism
  EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase
sce00650 Butanoate metabolism
  EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase
sce03110 ATP Synthase
  EC 3.6.1.34 H+-transporting ATP synthase; H+-transporting ATPase; Mitochondrial ATPase; Coupling factors (F0-F1 and C0-F1); Chloroplast ATPase; Bacterial Ca2+/Mg2+ ATPase
Cluster represents genes of different contexts
Clustering and Incremental Pathway Construction
• Genes mapped to reactions
• dynamically build networks from reaction DB and clustered genes
A pathway (10 genes) from five clusters with 57 EC-annotated genes
24 (out of 54) gene clusters (6153 ORFs, 694 EC-annotated)
Fellenberg & Mewes, 1999
Pathway represents 10 genes out of 500
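The incremental construction described on this slide, where clustered genes are mapped to reactions and a network is assembled from a reaction database, can be sketched roughly as follows. This is an illustration only: the gene names, the cluster and the three-entry reaction table are invented stand-ins, and `build_network` is a hypothetical helper, not the method used by the cited authors.

```python
from collections import defaultdict

# EC number -> (substrate, product); a tiny hand-written stand-in for a reaction DB.
REACTIONS = {
    "2.7.1.1": ("glucose", "glucose 6-phosphate"),
    "5.3.1.9": ("glucose 6-phosphate", "fructose 6-phosphate"),
    "2.7.1.11": ("fructose 6-phosphate", "fructose 1,6-bisphosphate"),
}

# Hypothetical gene -> EC annotation (None = not EC-annotated) and one cluster.
GENE_TO_EC = {"geneA": "2.7.1.1", "geneB": "5.3.1.9", "geneC": "2.7.1.11", "geneD": None}
CLUSTER = ["geneA", "geneB", "geneD"]

def build_network(cluster, gene_to_ec, reactions):
    """Return {substrate: [(product, gene), ...]} for the annotated genes of a cluster."""
    network = defaultdict(list)
    for gene in cluster:
        ec = gene_to_ec.get(gene)
        if ec in reactions:  # skips genes without a usable EC annotation
            substrate, product = reactions[ec]
            network[substrate].append((product, gene))
    return dict(network)

network = build_network(CLUSTER, GENE_TO_EC, REACTIONS)
for substrate, edges in network.items():
    for product, gene in edges:
        print(f"{substrate} -> {product} ({gene})")
```

Only two of the three cluster genes contribute edges here, which mirrors the point of the slide: a cluster typically covers a pathway only partially, and most clustered genes carry no EC annotation at all.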
Principal Component Analysis (PCA)
• Eigen Analysis
• solve for eigenvalues and eigenvectors of a square symmetric matrix
– pure sums of squares and cross products (SSCP)
– scaled sums of squares and cross products (Covariance)
– standardized sums of squares and cross products (Correlation)
$$w_1 = \arg\max_{\|w\|=1} E\left\{ \left( w^T x \right)^2 \right\}$$
$$w_k = \arg\max_{\|w\|=1} E\left\{ \left( w^T \left[ x - \sum_{i=1}^{k-1} w_i w_i^T x \right] \right)^2 \right\}$$
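As a rough illustration of the eigen analysis, the sketch below computes principal components from the covariance matrix of a data matrix whose rows are observations (for example genes) and whose columns are variables (for example arrays). It assumes NumPy is available; the function name `principal_components` and the random toy data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

def principal_components(X):
    """Eigen-decompose the covariance matrix of X (rows = observations).

    Returns eigenvalues in descending order and the matching eigenvectors
    as columns.
    """
    Xc = X - X.mean(axis=0)               # center each variable
    cov = np.cov(Xc, rowvar=False)        # square symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]     # largest variance first
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))              # toy data, not expression data
eigvals, W = principal_components(X)

# Project onto the first two principal components for visualization.
scores = (X - X.mean(axis=0)) @ W[:, :2]
print(scores.shape)
```

Replacing `np.cov` with `np.corrcoef` would give the correlation-based variant, which is the usual choice when variables are measured on different scales.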
Principal components and visualization
J-Express: B. Dysvik / I. Jonassen, U. Bergen, Norway
Data driven vs hypotheses driven approach
• Erroneous and noisy expression data
• Many genes, measurements
• Many spurious hits/clusters of expression patterns
• Incomplete data (measurements, kinetic parameters)
• Cost of regulation: partially regulated pathways
The data driven approach to Genome and Expression Analysis
Basic Assumptions (Pathways ≠ Clusters)
• Expression time courses for pathways do not necessarily cluster together
• Clustered genes do not necessarily form pathways
Expression Data and Pathways
Biological Knowledge
Outline of a Hypothesis Driven Approach
GPE-Score(Pathway)
Different Questions - different Scoring Functions
• correlated
• conspicuous
• combined: correlated + conspicuous
Diauxic shift data, DeRisi et al, Science, 1997
Distribution of Relative Expression Levels: Error Model
Distribution of Relative Expression Levels: Null Model
[Figure panels: Measurement error; Null model]
[Formula: null model P_0(g_t), based on the measurement error sd_err]
Conspicuousness Score: Gene and Pathway Score
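Gene score
$$\mathrm{score}_t(g) := -\log P_0(g_t)$$
$$\mathrm{score}(g) := \frac{1}{|T|-1} \sum_{t \in T,\ t > 0} \mathrm{score}_t(g)$$
Pathway score
$$\mathrm{score}(P) := \frac{1}{|P|} \sum_{g \in P} \mathrm{score}(g)$$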
Diauxic shift data, DeRisi et al, Science, 1997
GPPathways
Pathway model
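Pathway score
$$\mathrm{score}(P) := \frac{1}{|P|} \sum_{g \in P} \mathrm{score}_P(g)$$
Gene score
$$\mathrm{score}_P(g) := \frac{1}{|P_g|} \sum_{h \in P_g} cc(g,h), \qquad P_g := \begin{cases} P \setminus \{g\} & \text{if } g \in P \\ P & \text{otherwise} \end{cases}$$
Covariance/Synchrony and Normalization/Gene Variability
$$cc(g,h) := \frac{\mathrm{cov}(g,h)}{sd_g \, sd_h}$$
[Definitions of m_g and sd_g over the time points t ∈ T]
Normalization/Conspicuousness: Combined Score
$$cc^{*}(g,h) := \frac{\mathrm{cov}(g,h)}{sd_g^{err} \, sd_h^{err}}$$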
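The correlation-based scores of the pathway model can be sketched in a few lines. This is a minimal sketch under stated assumptions: it takes the gene score to be the mean Pearson correlation of a gene with the other pathway members and the pathway score to be the mean of the gene scores; the function names and the three toy time courses are invented, and the combined score with error-based normalization is not implemented.

```python
from math import sqrt

def cc(g, h):
    """Pearson correlation cov(g, h) / (sd_g * sd_h) of two time courses.

    Raises ZeroDivisionError for a constant time course (sd = 0).
    """
    n = len(g)
    mg, mh = sum(g) / n, sum(h) / n
    cov = sum((a - mg) * (b - mh) for a, b in zip(g, h)) / n
    sd_g = sqrt(sum((a - mg) ** 2 for a in g) / n)
    sd_h = sqrt(sum((b - mh) ** 2 for b in h) / n)
    return cov / (sd_g * sd_h)

def gene_score(gene, pathway, expr):
    """Mean correlation of `gene` with the pathway genes other than itself."""
    others = [h for h in pathway if h != gene]
    return sum(cc(expr[gene], expr[h]) for h in others) / len(others)

def pathway_score(pathway, expr):
    """Mean gene score over all genes of the pathway."""
    return sum(gene_score(g, pathway, expr) for g in pathway) / len(pathway)

# Invented toy time courses: g1 and g2 rise together, g3 falls.
expr = {
    "g1": [0.0, 1.0, 2.0, 3.0],
    "g2": [0.0, 2.0, 4.0, 6.0],
    "g3": [3.0, 2.0, 1.0, 0.0],
}
print(pathway_score(["g1", "g2"], expr))
print(pathway_score(["g1", "g2", "g3"], expr))
print(gene_score("g3", ["g1", "g2"], expr))  # scoring a gene outside the pathway
```

Because `gene_score` also accepts a gene that is not a pathway member, the same function can be used to rank candidate genes against a given pathway, which is the "fishing" use case mentioned later in the slides.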
Statistical Evaluation of Expression Data: Correlation
Combinatorial Mapping of Proteins to Genes
Each row is one possible pathway in gene space
Each column specifies a gene for one node in pathway
Candidate genes per node: 3 × 2 × 3 × 2
Total: 36
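The total of 36 follows from taking one candidate gene per node, i.e. a Cartesian product over the nodes. A minimal sketch, with placeholder gene labels that are purely hypothetical:

```python
from itertools import product

# Candidate genes for each of four pathway nodes (placeholder labels).
candidates = [
    ["n1_a", "n1_b", "n1_c"],   # 3 candidates
    ["n2_a", "n2_b"],           # 2 candidates
    ["n3_a", "n3_b", "n3_c"],   # 3 candidates
    ["n4_a", "n4_b"],           # 2 candidates
]

# Each tuple is one row of the table: one possible pathway in gene space.
gene_pathways = list(product(*candidates))
print(len(gene_pathways))  # 3 * 2 * 3 * 2 = 36
```

The number of rows grows multiplicatively with every additional node, so for longer pathways it is usually preferable to iterate over `product(...)` lazily instead of materializing the list.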
Pathways, Functions, EC-numbers, Proteins, Genes
Nodes are labeled with (sets of) proteins
Nodes = { <2.7.1.2: YCL040W, YDR516c>, <2.7.1.1: YFR053C, YGL253W>, <3.1.3.1: YMR105C, YHR215W, YDL024C, ...>, <3.1.3.58: YCL040W, ...>, <3.1.3.8: YNL141W, ...>, <3.1.3.9: YBR011C, YJL130C> }
Nodes are labeled with (sets of) EC-numbers
Pathway = (Nodes, Edges)
Nodes = { 2.7.1.2, 2.7.1.4, 2.7.1.59, 2.7.1.61, 2.7.1.63, 2.7.1.7, 2.7.1.1, 3.1.3.1, 3.1.3.58, 3.1.3.8, 3.1.3.9 }
Extract from metabolic DB + Systematic Generation of Pathways
Nodes are labeled with reactions
900 different ORF pathways
Glycolysis pathways: Combined Score
[Figure: distribution of combined scores for 10000 random “pathways”, 900 putative glycolysis pathways and 36 valid glycolysis pathways]
Biologically Meaningful
Identifying Genes via Pathway Scores
ORFs fitting into a given pathway according to specific scoring function
ORFs related to a given pathway according to specific scoring function
Example 1: Oxidative Phosphorylation
High scoring genes (correlated with TCA cycle genes)
Example 2: Biosynthesis of Amino Acids
Low scoring genes (anti-correlated with TCA cycle genes)
Example 3: Urea cycle
Average scoring genes (not correlated with TCA cycle genes)
[Figure legend: Score correlated w.r.t. TCA cycle; Score not correlated; Score negatively correlated]
Pathway Scores ...
… are suitable for ...
– interpreting time series
– coping with erroneous data
– ranking pathways with respect to plausibility
– interpreting how well pathway genes fit to the pathway
– go fishing for further genes correlated to the pathway (with great care)
– posing different questions by defining new scoring functions
Genetic Network Reconstruction
• Boolean Models
– logical relationships between variables
• Differential equations
– continuous dynamics of biological reactions
• Bayesian Networks
– statistical testing of hypotheses
– use gene/protein annotations as priors to represent biological knowledge
Bayesian Networks
$$\mathrm{BayesianScore}(S) = \log p(S \mid D) = \log p(S) + \log p(D \mid S) + c$$
• Variables: Gal80p, Gal4m, Gal4p genes
• Binary quantization using maximum likelihood
• Compare all models possible in the system
• Experiment: reproduce currently accepted model of galactose regulation
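The Bayesian score of a network structure S can be illustrated on binarized data. The sketch below is not the method used in the study: it swaps in a K2-style marginal likelihood with uniform Dirichlet priors for log p(D|S), drops the constant c, and uses a flat structure prior by default. The variable names X and Y, the data and the function names are all hypothetical.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(data, parents):
    """log p(D|S) for binary variables under uniform Dirichlet priors (K2 metric).

    `data` is a list of dicts {variable: 0 or 1}; `parents` maps each
    variable to the list of its parent variables in structure S.
    """
    total = 0.0
    for var, pars in parents.items():
        counts = Counter()
        for row in data:
            counts[(tuple(row[p] for p in pars), row[var])] += 1
        for cfg in {config for config, _ in counts}:
            n0, n1 = counts[(cfg, 0)], counts[(cfg, 1)]
            # log( n0! * n1! / (n0 + n1 + 1)! )
            total += lgamma(n0 + 1) + lgamma(n1 + 1) - lgamma(n0 + n1 + 2)
    return total

def bayesian_score(data, parents, log_prior=0.0):
    """log p(S) + log p(D|S); the constant c = -log p(D) is omitted."""
    return log_prior + log_marginal_likelihood(data, parents)

# Toy binarized data in which Y always equals X.
data = [{"X": 0, "Y": 0}] * 4 + [{"X": 1, "Y": 1}] * 4

score_dependent = bayesian_score(data, {"X": [], "Y": ["X"]})
score_independent = bayesian_score(data, {"X": [], "Y": []})
print(score_dependent, score_independent)
```

On this toy data the structure with an edge from X to Y scores higher than the structure without it. Edge annotations from a pathway database could enter through `log_prior`, for instance by giving annotated edges a higher prior probability; that mapping is an assumption and is not specified here.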
Comparing two models for the control of the galactose metabolism in yeast
• Gal80p represses Gal4m; Gal2 independent of Gal80m
• Gal80p inhibits Gal4m posttranslationally; Gal4m independent of Gal80m
Edge annotation as Bayesian priors
• No annotation between X and Y
• Positive stimulation: X increases activity of Y
• Negative stimulation
• Undefined
Constraints on the dependence between the genes
Permits scoring of annotated models as unannotated models
Bayesian Networks: Evaluate and extend networks
• Retrieve network from database
• Curate part of the network
• Automatically generate hypotheses on the rest
• Quantify hypotheses using Bayesian metric
• Present high-scoring hypotheses to the user
• Present scores for single genes/edges in the network
• Manual investigation of high scoring hypotheses: New facts
• Generate another iteration of hypotheses
PART III
Guided Tour:
Elucidate organelle-related pathways
Compartments in the eukaryotic cell
Voet & Voet, Biochemistry
Construction of models in yeast
The Network
• Yeast2Hybrid interactions 5774 (6121)
• Other protein-protein interactions 2347 (4384)
• Other gene product interactions (MIPS) 6654 (15245)
• Protein complexes 934 (1020)
• Metabolism 2135 (4258)
Total: 17529 (31028)
Subcellular Localization
• Subcellular localization catalogue (MIPS) 2800 (2300)
• Prediction: TargetP 720
• Custom motif search
Generate Networks from Experiments: Yeast2Hybrid
‘The protein-protein interaction network of yeast’, Uetz et al., 2000
Map of protein-protein interactions
red, lethal; green, non-lethal; orange, slow growth; yellow, unknown
Jeong et al. 2001
Gray et al. 1999 Science
Origin of mitochondria
Subcellular location in yeast
• MIPS
– localization for 2300 gene products
– wide range of subcellular compartments
Compartment | Gene products
ER | 159
Peroxisome | 39
Transport vesicles | 48
Vacuole | 54
Nucleus | 820
Cytoskeleton | 113
Cell wall | 37
Golgi | 81
Endosome | 11
Cytoplasm | 583
Plasma membrane | 153
Mitochondria | 376
Chromosome struct. | 23
Subcellular localization of gene products
• Based on N-terminal sequence
• Except peroxisome: C-terminal sequence
• TargetP
– neural network based
– distinguishes between mitochondrion, chloroplast, secretory pathways
– estimated accuracy: 85%
– plants: 10% mitochondrion, 15% chloroplast
• Results for yeast:
– 383 mitochondrion
– 294 secretory pathways
Model of yeast mitochondrion
[Figure labels: mitochondrial ribosome, OM import, IM transloc., F1F0 ATP synthase, exonuclease, cytochrome red, cytochrome ox, succinate DH, isocitrate DH, citrate cycle, amino acids, Liponamide, glycolysis]
Evolution of peroxisomal import
Olsen et al. 2001
Subcellular localization: Peroxisome
PTS 1 (C-term) | position 1 | position 2 | position 3
property | small, uncharged | basic, pos. charged | hydrophobic, nonpolar
most common | S, C, A | K, R, H | L, M
acceptable | P, G, T, K, N | N, Q, S, A | I, V, F, Y
PTS 2 (N-term): (R/K) (L/I/Q/V) X5 (H/Q) (L/A); 10 .. 40
46126
657
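A custom motif search for the C-terminal peroxisomal targeting signal can be sketched with a regular expression. The pattern below assumes the PTS 1 tripeptide is built from the most common residues (S/C/A, then K/R/H, then L/M) at the very end of the protein; the exact rules used for the yeast analysis are not given on the slides, and the two sequences are made-up examples.

```python
import re

# Assumed PTS1 pattern: [S/C/A][K/R/H][L/M] as the last three residues.
PTS1 = re.compile(r"[SCA][KRH][LM]$")

def has_pts1(sequence):
    """Return True if the protein sequence ends with a PTS1-like tripeptide."""
    return bool(PTS1.search(sequence.strip().upper()))

# Made-up sequences for illustration only.
sequences = {
    "candidate_1": "MSTAVLENPGSKL",
    "candidate_2": "MSTAVLENPGDDE",
}
hits = [name for name, seq in sequences.items() if has_pts1(seq)]
print(hits)
```

A tripeptide pattern this short matches many proteins by chance, so hits from such a search are best treated as candidates to be cross-checked against localization catalogues and other evidence.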
Model of yeast peroxisome
Communication between peroxisomes and the cell
Peroxisomes and phenotype
ISYS - a platform for the integration of software tools and databases
• "plug and play" tools of interest.
• separately developed and independently evolving
• DynamicDiscovery - an exploratory environment to pass objects among components
• Supports visual synchronization among components.
• Integrates web-based resources with desktop applications
Double-edged integration problem: technology and IP/licensing (at least for non-profits)
ISYS
[Diagram labels: NCGR, Stanford, Berkeley, Wash. U, Manchester; Web; Other third party software; Your organization’s tools]
• Components: PathDB, CMD Tool, Table Viewer, Sequence Viewer, Similarity Search Viewer, X-Cluster, GO Browser, ATV, MaxD
• Web resources: Entrez - NCBI, BLAST - NCBI, GeneScan - MIT, Google, TAIR - NCGR, GeneX - NCGR
Compare regulated genes with Gene Ontology and MaxD
MaxD: David Hancock, University of Manchester
GO: Michael Ashburner, The Gene Ontology Consortium
Perform statistical analysis: MaxD and Pathway Scoring
MaxD: David Hancock, University of Manchester
Rosetta: Compendium of expression arrays
• 300 Yeast expression arrays: Hughes et al., 2000, Cell
– 280 gene knockout mutants
– 20 titration experiments
• Nutrients
• Antibiotics
• Choose 40 experiments
– Pex12
– 5 genes: human expert knowledge
• involved in gluconeogenesis, ER, vacuolar transport
– 36 more: contained in peroxisomal network
• Cellular organization
• Transcriptional control
• Metabolism, focus energy
• Protein destination, cellular transport
Regulation of metabolismInterface peroxisome and cytoplasm
Pathway scores:Comparing network and expression
YOR184W 39.68
YER090W 37.98
YPR145W 36.61
YGL062W 36.18
YKL211C 35.90
YLR027C 35.09
Pathway 36.90
YIR034C 86.46
YIL116W 76.63
YOL058W 75.38
YMR062C 72.87
YDL182W 58.25
YDL066W 52.03
YER052C 49.59
YOL140W 47.78
YDL131W 43.24
YGL202W 39.86
YHR208W 38.35
YJR148W 36.32
YNL220W 36.06
YCR005C 35.38
YHR137W 35.21
YNL037C 35.14
YNR050C 34.85
YMR300C 32.54
YLL018C 29.77
YAR015W 28.77
Iteration 1
Pathway scores:Distribution of correlated genes
Pathway scores:Regulatory proteins with correlated expression
Oleate response element (ORE)
Transcription and threonine pool
Non-classical protein export
Conclusion
• Organism-wide virtual experiments can be performed
• Comprehensive models can be constructed and evaluated
– complete sequence
– abundance of edges between genes/gene products or genes and phenotype
– abundance of information from annotations, large scale expression experiments and Yeast2Hybrid
• What we do not yet understand (well enough)
– relationship between proximity in the network and protein sequence
– network properties: high degree of interconnectivity yet limited effects from gene disruption
– translational and posttranslational regulation
– how to apply large scale experiments to regulation
In the case of simple eukaryotes, the ‘virtual organism’ is within reach