Towards the virtual organism PART I: Databases and tools for biochemical pathways PART II: Relating...

Towards the virtual organism

• PART I: Databases and tools for biochemical pathways

• PART II: Relating expression data and pathways

• PART III: Guided Tour: elucidate organelle-related pathways

Pathway diagram (WIT database)

Major contributions of Pathways databases

• Information Resource - Literature compilation

• Gene Ontology

• Sequence and Genome Annotation

• Relationship between pathways (function) and chromosomal position

• Analysis of Gene Expression Arrays

• Understanding Cellular Dynamics

• Disease Process Modeling

Without context and purpose, information is mere data. - Clement Mok

As when a highly connected node in the internet breaks down, the disruption of p53 has severe consequences.

Jeong et al. 2001 Nature

Towards the virtual organism

Introduce biochemical pathways resources

• What Is There (WIT/PUMA/EMP/ERGO)

• Kyoto Encyclopedia of Genes and Genomes (KEGG)

• Signalling Databases

• Pathways Database (PathDB)

Focus on

• Accessibility

• Database contents and models

• Query features

• Gene/Protein/Pathway analysis

• Visualization

Why do all these projects seem to do the same thing?

• Data model is a view of the world

– Different database management systems

– Tools particular to data model and database management systems

– Different content

• Analogous to model system approach to biology

– E. coli, yeast, C. elegans, Drosophila, mouse, etc. are all used to provide understanding of human biology

• No one system does everything, but concepts and data can often be shared

He may have stole that song from me, but I steal from everybody. - Woody Guthrie

WIT/PUMA/EMP System

• Argonne National Lab and Integrated Genomics Inc, USA

• http://wit.mcs.anl.gov/WIT2/

• Ross Overbeek, Evgeni Selkov, Natalia Maltsev

• Team: 7

• WIT is freely downloadable (ftp://ftp.mcs.anl.gov/pub/Genomics/WIT2/)

WIT/PUMA/EMP System

• Annotation/Literature database

• Blast, PSI-Blast

• ClustalW

• COG

• ProtScale

• Transmembrane helices/topology

• Prodom

• ProSite

• Operons (Pairs of close bidirectional best hits)

Focus on: sequence analysis, annotation of genomes with respect to metabolism

Ways to go: from genes to pathways

Starting from -

• Gene/protein sequence

• Gene/protein name

• Organism/Genome (‘Metabolic reconstruction’)

To Pathways of -

– Metabolism

– DNA

– Regulation of metabolism

From Blast results to genes

From genes to pathways

WIT Pathway diagrams: Tabular format

WIT Pathway diagrams: Picture

Links to further information

WIT Detail pages: Enzyme

Name, Reaction, EC, Description

Specific Activity

Preparative Protocol

Substrates, Coenzymes, Inhibitors, Modification, Kinetics, Genomes …

Kyoto Encyclopedia of Genes and Genomes (KEGG)

• Institute for Chemical Research, Kyoto University

• http://www.genome.ad.jp/kegg/

• Minoru Kanehisa

• System development: 9

• Data entry and curation: 18

• Academic users may freely download the package

• ftp://kegg.genome.ad.jp/mirror/

KEGG: Data content and statistics

• 3705 EC numbers

• 11132 Enzyme names

• 3794 Substrates

• 5284 Metabolic reactions

• 113 Pathways

– mostly metabolic

• 36 Organisms

KEGG: Query capabilities

• Reconstruct pathway maps using blast

• Search and color genes, enzymes and compounds in pathway diagrams and ortholog tables

• Sequence: blast and fasta

• Genome Maps

• Generate reaction paths between compounds

Focus on: display gene-centric data in the context of predefined pathways
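The query "Generate reaction paths between compounds" can be pictured as a shortest-path search over a graph whose nodes are compounds and whose edges are reactions. The sketch below is only an illustration of that idea: the slides do not say how KEGG computes its paths, so the breadth-first search, the function name `reaction_path` and the four-reaction toy list are all assumptions.

```python
from collections import deque

# Toy reaction list (substrate -> product); a real query would read these
# from a reaction database. The list is illustrative, not KEGG content.
REACTIONS = [
    ("glucose", "glucose-6-phosphate"),
    ("glucose-6-phosphate", "fructose-6-phosphate"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate"),
    ("glucose-6-phosphate", "6-phosphogluconolactone"),
]


def reaction_path(reactions, start, goal):
    """Return the shortest chain of compounds from start to goal, or None."""
    graph = {}
    for substrate, product in reactions:
        graph.setdefault(substrate, []).append(product)

    queue = deque([[start]])  # each queue entry is a path found so far
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = reaction_path(REACTIONS, "glucose", "fructose-1,6-bisphosphate")
print(" -> ".join(path))
```

Reactions are treated as directed here; adding the reverse pair for reversible reactions would be one way to extend the toy model.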

‘State of the Art’

KEGG picture of the glycolysis

genes present in E. coli

static Network

manually compiled

manually drawn

textbook knowledge

versus

Representation of Networks

dynamic Network

features complete knowledge

restriction of content is up to the user

experimental data can be reflected in net structure

include user-owned data

Pathway related projects

• KEGG Metabolic Pathways

• EMP - Enzymes and Metabolic Pathways

• WIT - Metabolic Reconstruction

• UM-BBD - Microbial Biocatalysis/Biodegradation

• EcoCyc - E. coli Genes and Metabolism

• SoyBase - Soybean Metabolism

• Metalgen - Genes and Metabolism

• Boehringer Mannheim - Biochemical Pathways

• IUBMB-Nicholson Minimaps

• PathDB - Plant Metabolic Pathways

Metabolic Pathways

Protein-Protein Interactions

• BRITE Database for Biomolecular Relations

• DIP - Database of Interacting Proteins

• BIND - Biomolecular Interaction Network Database

• KEGG Regulatory Pathways

• SPAD - Signal Transduction

• CSNDB - Cell Signaling Networks

• Yeast Pathways in MIPS

• Interactive Fly - Drosophila Genes

• GIF_DB - Drosophila Gene Interactions

• FlyNets - Drosophila Molecular Interactions

• GeNet - Gene Networks Database

• HOX-Pro - Homeobox Genes Database

• Wnt Signaling Pathway

• TRANSPATH - Gene Regulatory Pathways

• GenMapp - Mostly mouse pathways

Regulatory Pathways

• LIGAND - Chemical Database for Enzyme Reactions

• ENZYME - Enzymes

• BRENDA - Comprehensive Enzyme Information System

• Worthington Enzyme Manual

• Klotho - Biochemical Compounds

• ChemFinder - Searching Chemicals

• ChemIDplus at NLM

• PROMISE - Prosthetic Groups and Metal Ions

• GlycoSuiteDB - Glycan Structure Database

• CarbBank - Complex Carbohydrate Structure Database

• WebElements - Periodic Table

Enzymes, Compounds

• TRANSFAC - Transcription Factor Database

• RegulonDB - E. coli Transcriptional Regulation

• DBTBS - B. subtilis Transcription Factors

• DPInteract - DNA binding proteins

Transcription Factors

• IUBMB - Nomenclature

• IUPAC - Nomenclature

• SWISS-PROT - Documents

• GO - Gene Ontology

(FlyBase/SGD/MGD/TAIR/WormBase)

Nomenclature - General

Simulation of biochemical reactions and cellular process

• BioKin - Enzyme kinetic software

• BioQuest - Metabolic Simulation

• BioSpice - still in progress

• Bioxml.org - a site collecting together a number of biologically-oriented open-source projects

• DBsolve - Software for metabolic, enzymatic and receptor-ligand binding simulation

• DMSS - Scalable, Discrete Event Metabolic Simulation System

• E-Cell - A simulation platform for the modelling of cells at a molecular level

• Electronic Arc - experimental visual simulator

• Elementary Modes - has a Java simulation

• Gepasi - A software package for modelling systems of biochemical reactions

• Jarnac - A language for describing and manipulating cellular system models

• StochSim - A general-purpose stochastic simulator of biological reaction networks.

• Systems Biology Workbench - An XML based integration system

• Virtual Cell - A general computational framework for modeling cell biological processes

Signal transduction browser (Transpath)

PathDB

• National Center for Genome Resources

• http://www.ncgr.org/software/pathdb/

• Jeff Blanchard

• Software Development: 5

• Literature Curation: 4

• The software is freely available (Client)

• The database server can be installed at the site of cooperation partners

PathDB data model

• Compounds

• Macromolecules: lipids, polysaccharides

• Information molecules: DNA, RNA

• States: development, disease, genotype, phenotype, environment

• Metabolic reactions

• Protein modifications and interactions

• Regulation: transcriptional, translational, posttranslational

• Transport

• Biological hierarchies, ontologies

• Incomplete and conflicting knowledge

PathDB data model

Mediator

Substrate

Product

Biochemical Entity

Step

Transition of Entities

Construction of Entities

Protein

Subunit

Compound

DNA

Building Blocks

RNA

Attributes: Location, Biol. Process, Genotype, Phenotype, Environment

Platform for Network Analysis

Focus on: building custom networks, compare to large scale experiments

Relational database for metabolic reactions, regulation and states

(disease, genotype, phenotype)

QueryTool

Query the database, e.g. to collect a set of reactions

transform between types: proteins, compounds, steps

restrict to attributes: organism, location, states

PathwayViewer

Visualize the results of the search

Query window showing “Proteins involved in Biological process DNA repair”

• Transform to ‘Phenotype’

• Select ‘Caffeine Sensitivity’ and get all Proteins

• Do Intersection and get all Steps
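The query sequence above (collect proteins for a biological process, collect proteins for a phenotype, intersect, then transform to steps) maps naturally onto set operations. The following is a minimal sketch of that logic only; it does not use the PathDB QueryTool or its API, and every protein and step name in it is a made-up placeholder.

```python
# Hypothetical annotation tables; names are placeholders, not PathDB content.
PROCESS_TO_PROTEINS = {"DNA repair": {"ProtA", "ProtB", "ProtC"}}
PHENOTYPE_TO_PROTEINS = {"Caffeine Sensitivity": {"ProtB", "ProtC", "ProtD"}}
PROTEIN_TO_STEPS = {
    "ProtA": {"step1"},
    "ProtB": {"step2", "step3"},
    "ProtC": {"step3"},
    "ProtD": {"step4"},
}


def steps_for(proteins):
    """Transform a set of proteins into the set of steps they take part in."""
    steps = set()
    for protein in proteins:
        steps |= PROTEIN_TO_STEPS.get(protein, set())
    return steps


repair = PROCESS_TO_PROTEINS["DNA repair"]
sensitive = PHENOTYPE_TO_PROTEINS["Caffeine Sensitivity"]
both = repair & sensitive   # the "Intersection" of the two query results
steps = steps_for(both)     # transform between types: proteins -> steps
print(sorted(both), sorted(steps))
```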

PathwayViewer

• Inspect and manipulate pathways or routes between metabolites.

• Alternate topological representations of a pathway: primary and secondary metabolites

• Manipulate layout on screen

• Control how much data is displayed

• Automatically lays out pathways

– hierarchical or circular algorithm

• Visualization of gene expression and metabolic profiling data

Visualize Steps involved in DNA synthesis and Caffeine sensitivity
Exploring the network neighborhood - build pathways on the fly

What datasources are out there?

• Sequences / Annotation: SW, GenBank, MIPS, EMBL

• Large-Scale Experiments: Gene expression, Protein-Protein, Metabolic profiling, Protein-SmallMol, Protein expression

• Ontologies: GO, UMLS/MESH, MBO

• Regulation / Metabolism: EcoCyc, KEGG, WIT, BRENDA, PathDB, CSNdb, BIND, aMAZE, BRITE, DIP

• Knowledge: Medline

Ontology: Bind genes to hierarchies

GO Gene Ontology, 2000

Translation/Mapping between:

Cellular Location

Anatomy

Biological Process

Molecular Function

Browsing the ontology

Hierarchy of Complexity

• molecular

– Entities or States: protein, RNA, DNA, compounds

– Processes: metabolic reactions, protein-protein interactions, conformation change

• micro

– Entities or States: organelles, cell types, tissues

– Processes: mitosis, apoptosis, transcription

• macro

– Entities or States: disease states, development states, phenotype

– Processes: disease, development, environment

Processes/Entities and experimental support

• Annotation / Sequences

• Regulation / Metabolism

• Large-Scale Experiments: Gene expression, Protein-Protein, Metabolic profiling, Protein-SmallMol, Protein expression

• Knowledge

• Ontologies

PathDB: Complete Wiring Diagram

Reference experimental support

Questions

How well does my set of gene expression arrays support my model of cellular processes?

What is the difference between a normal and a cancer cell?

What is the effect of a knockout mutation on the cellular network?

What “classical” pathways are up or down regulated in my gene expression data?

How does a drug perturb a cellular network as judged through gene expression data?

What experiment promises to distinguish between contradictory hypotheses?

PART II

Relating gene expression and pathways

Analysis of Expression Data

Clustering of time courses: Iyer et al., Science, 1999

“Scatter plot” comparing two experiments: Roberts et al., Cell, 2000

Using pathways to contextualize gene expression arrays

Miki et al. PNAS, 2001

Expression Pattern Clustering

J-Express B. Dysvik / I. Jonassen, U.Bergen, Norway

Mapping of J-Express Cluster onto Pathways

• sce00051 Fructose and mannose metabolism

– EC 3.1.3.46 Fructose-2,6-bisphosphate 2-phosphatase; Fructose-2,6-bisphosphatase

• sce00190 Oxidative phosphorylation

– EC 1.9.3.1 Cytochrome-c oxidase; Cytochrome oxidase; Cytochrome a3; Cytochrome aa3

– EC 3.6.1.34 H+-transporting ATP synthase; H+-transporting ATPase; Mitochondrial ATPase; Coupling factors (F0-F1 and C0-F1); Chloroplast ATPase; Bacterial Ca2+/Mg2+ ATPase

– EC 3.6.1.38 Ca2+-transporting ATPase; Calcium pump

• sce00251 Glutamate metabolism

– EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase

• sce00252 Alanine and aspartate metabolism

– EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase

• sce00410 beta-Alanine metabolism

– EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase

• sce00640 Propanoate metabolism

– EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase

• sce00650 Butanoate metabolism

– EC 2.6.1.19 4-Aminobutyrate transaminase; beta-Alanine--oxoglutarate transaminase

• sce03110 ATP Synthase

– EC 3.6.1.34 H+-transporting ATP synthase; H+-transporting ATPase; Mitochondrial ATPase; Coupling factors (F0-F1 and C0-F1); Chloroplast ATPase; Bacterial Ca2+/Mg2+ ATPase

Cluster represents genes of different contexts
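Why one cluster spans several contexts becomes visible when the mapping is inverted from "pathway to EC numbers" into "EC number to pathways". The sketch below does this for the pathway identifiers and EC numbers listed in the cluster mapping above; the variable names are illustrative and the enzyme names are left out for brevity.

```python
from collections import defaultdict

# Pathway -> EC numbers, copied from the cluster mapping listed above.
PATHWAY_TO_EC = {
    "sce00051 Fructose and mannose metabolism": ["3.1.3.46"],
    "sce00190 Oxidative phosphorylation": ["1.9.3.1", "3.6.1.34", "3.6.1.38"],
    "sce00251 Glutamate metabolism": ["2.6.1.19"],
    "sce00252 Alanine and aspartate metabolism": ["2.6.1.19"],
    "sce00410 beta-Alanine metabolism": ["2.6.1.19"],
    "sce00640 Propanoate metabolism": ["2.6.1.19"],
    "sce00650 Butanoate metabolism": ["2.6.1.19"],
    "sce03110 ATP Synthase": ["3.6.1.34"],
}

# Invert the mapping: which pathways does each enzyme appear in?
ec_to_pathways = defaultdict(list)
for pathway, ec_numbers in PATHWAY_TO_EC.items():
    for ec in ec_numbers:
        ec_to_pathways[ec].append(pathway)

# Enzymes that occur in more than one pathway make the cluster ambiguous.
for ec, pathways in sorted(ec_to_pathways.items()):
    if len(pathways) > 1:
        print(f"EC {ec}: {len(pathways)} pathways")
```

A single transaminase (EC 2.6.1.19) accounts for five of the eight pathway hits, which is one reason a pathway hit list for a cluster has to be read with care.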

Clustering and Incremental Pathway Construction

• Genes mapped to reactions

• Dynamically build networks from reaction DB and clustered genes

A pathway (10 genes) from five clusters with 57 EC-annotated genes

24 (out of 54) gene clusters (6153 ORFs, 694 EC-annotated)

Fellenberg & Mewes, 99

Pathway represents 10 genes out of 500

Principal Component Analysis (PCA)

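• Eigen Analysis

• solve for eigenvalues and eigenvectors of a square symmetric matrix

– pure sums of squares and cross products (SSCP)

– scaled sums of squares and cross products (Covariance)

– sums of squares and cross products from standardized data (Correlation)

$$w_1 = \arg\max_{\lVert w \rVert = 1} E\left\{ \left( w^T x \right)^2 \right\}$$

$$w_k = \arg\max_{\lVert w \rVert = 1} E\left\{ \left[ w^T \left( x - \sum_{i=1}^{k-1} w_i w_i^T x \right) \right]^2 \right\}$$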
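The definition of the principal components as successive variance-maximising directions can be turned into a small program: find the leading direction, remove the projection onto it from the data (deflation), and repeat. The sketch below uses power iteration on the SSCP matrix of the centered data; it is a didactic pure-Python version under that assumption, not the routine used by J-Express, and the five data points are invented.

```python
import math
import random


def center(rows):
    """Subtract the column means so that the SSCP matrix is X^T X."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def leading_component(rows, iterations=500, seed=0):
    """Unit vector w maximising the mean of (w^T x)^2 (power iteration)."""
    rng = random.Random(seed)
    d = len(rows[0])
    w = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        # Multiply w by X^T X without forming the matrix: X^T (X w).
        proj = [sum(x[j] * w[j] for j in range(d)) for x in rows]
        nxt = [sum(p * x[j] for p, x in zip(proj, rows)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance left in the data
            break
        w = [v / norm for v in nxt]
    return w


def principal_components(rows, k):
    """First k components via deflation: x <- x - w w^T x after each step."""
    data = center(rows)
    components = []
    for _ in range(k):
        w = leading_component(data)
        components.append(w)
        data = [
            [x[j] - w[j] * sum(a * b for a, b in zip(w, x)) for j in range(len(w))]
            for x in data
        ]
    return components


# Invented toy data lying close to the line y = 2x.
points = [[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.8]]
w1, w2 = principal_components(points, 2)
print(w1, w2)
```

For real expression matrices a library eigen-solver would normally be preferred; the loop above only mirrors the arg max formulation step by step.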

Principal components and visualization

J-Express B. Dysvik / I. Jonassen, U.Bergen, Norway

Data driven vs hypotheses driven approach

• Erroneous and noisy expression data

• Many genes, measurements

• Many spurious hits/clusters of expression patterns

• Incomplete data (measurements, kinetic parameters)

• Cost of regulation: partially regulated pathways

The data driven approach to Genome and Expression Analysis

Basic Assumptions (Pathways ≠ Cluster)

• Expression time courses for pathways do not necessarily cluster together

• Clustered genes do not necessarily form pathways

Expression Data and Pathways

Biological Knowledge

Outline of a Hypothesis Driven Approach

GPE-Score(Pathway)

Different Questions - different Scoring Functions

correlated

combined: correlated + conspicuous

conspicuous

Diauxic shift data, DeRisi et al, Science, 1997

Distribution of Relative Expression Levels: Error Model

Distribution of Relative Expression Levels: Null Model

Measurement error / Null model

$$P_t^0(g) := 2 \cdot \Phi_{0,\,sd_{err}}\left( -\lvert m_{g,t} \rvert \right)$$

Conspicuousness Score: Gene and Pathway Score

Gene score

$$score_t(g) := -\log P_t^0(g)$$

$$score(g) := \frac{1}{T} \sum_{t=t_0+1}^{T} score_t(g)$$

Pathway score

$$score(P) := \frac{1}{|P|} \sum_{g \in P} score(g)$$

Plot axes: g, t

Diauxic shift data, DeRisi et al, Science, 1997
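A conspicuousness score of this kind can be sketched in a few lines. The code assumes that the null model is a zero-mean normal distribution whose standard deviation `sd_err` is the measurement error, that the per-time-point score is the negative log of the two-sided tail probability, and that gene and pathway scores are plain averages. The function names, the floor on the probability and the toy log ratios are illustrative assumptions, not values from the diauxic shift data.

```python
import math


def null_probability(m, sd_err):
    """Two-sided tail probability of a log ratio m under N(0, sd_err^2)."""
    # erfc(|m| / (sd * sqrt(2))) equals 2 * Phi(-|m| / sd).
    p = math.erfc(abs(m) / (sd_err * math.sqrt(2.0)))
    return max(p, 1e-300)  # assumed floor, avoids log(0) for extreme values


def gene_score(log_ratios, sd_err):
    """Average of -log P0 over the time points of one gene."""
    scores = [-math.log(null_probability(m, sd_err)) for m in log_ratios]
    return sum(scores) / len(scores)


def pathway_score(pathway, sd_err):
    """Average gene score over all genes of a pathway (name -> time course)."""
    return sum(gene_score(ts, sd_err) for ts in pathway.values()) / len(pathway)


# Invented toy time courses of log ratios.
quiet = {"geneA": [0.1, -0.1, 0.0], "geneB": [0.0, 0.2, -0.1]}
loud = {"geneC": [0.2, 1.5, 2.5], "geneD": [-0.1, -1.8, -2.2]}
print(pathway_score(quiet, sd_err=0.5), pathway_score(loud, sd_err=0.5))
```

Because the absolute value is used, strong induction and strong repression both count as conspicuous.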

GPPathways

Pathway model

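Statistical Evaluation of Expression Data: Correlation / Combined Score

Pathway score

$$score(P) := \frac{1}{|P|} \sum_{g \in P} score_P(g)$$

Gene score

$$score_P(g) := \frac{1}{|P_g|} \sum_{h \in P_g} cc(g,h)$$

$$P_g := \begin{cases} P \setminus \{g\} & g \in P \\ P & \text{otherwise} \end{cases}$$

Correlation: Covariance/Synchrony with Normalization/Gene Variability

$$cc(g,h) := \frac{cov_{g,h}}{sd_g \cdot sd_h}, \qquad sd_g := sd(m_g), \quad m_g := \left( m_{g,t} \mid t_0 \le t \le T \right)$$

Combined Score: Covariance/Synchrony with Normalization/Conspicuousness

$$cc^{*}(g,h) := \frac{cov_{g,h}}{sd_{err} \cdot sd_{err}}$$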
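The correlation-based scores can be sketched as follows, assuming that a gene is scored by its average correlation coefficient with the other genes of the pathway and that the pathway score is the average over its genes. The combined variant is assumed to divide the covariance by the squared measurement error instead of the genes' own standard deviations. Function names and time courses are illustrative.

```python
import math


def cov(g, h):
    """Population covariance of two equally long time courses."""
    mg, mh = sum(g) / len(g), sum(h) / len(h)
    return sum((a - mg) * (b - mh) for a, b in zip(g, h)) / len(g)


def cc(g, h):
    """Correlation coefficient: covariance normalised by gene variability."""
    denom = math.sqrt(cov(g, g) * cov(h, h))
    return cov(g, h) / denom if denom > 0 else 0.0  # flat profiles score 0


def cc_star(g, h, sd_err):
    """Combined variant: covariance normalised by the measurement error."""
    return cov(g, h) / (sd_err * sd_err)


def gene_score(name, series, pathway, similarity=cc):
    """Average similarity of a gene to all other genes of the pathway."""
    others = [ts for other, ts in pathway.items() if other != name]
    return sum(similarity(series, ts) for ts in others) / len(others)


def pathway_score(pathway, similarity=cc):
    """Average gene score over the genes of the pathway."""
    scores = [gene_score(n, ts, pathway, similarity) for n, ts in pathway.items()]
    return sum(scores) / len(scores)


# Invented toy pathway (gene name -> time course).
PATHWAY = {"geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8], "geneC": [1, 3, 2, 4]}
print(pathway_score(PATHWAY))
print(gene_score("geneX", [4, 3, 2, 1], PATHWAY))  # an anti-correlated outsider
```

Passing `similarity=lambda g, h: cc_star(g, h, sd_err=0.5)` switches the same functions to the combined score; a pathway needs at least two genes for the averages to be defined.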

Combinatorial Mapping of Proteins to Genes

Each row is one possible pathway in gene space

Each column specifies a gene for one node in pathway

Candidate genes per node: 3 × 2 × 3 × 2

Total: 36
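The count of 36 is the size of a Cartesian product: every pathway in gene space picks exactly one candidate gene per node. A minimal sketch with placeholder gene names (the real candidate ORFs per node are not reproduced here):

```python
from itertools import product

# Hypothetical candidate genes for four pathway nodes (3, 2, 3 and 2 choices).
CANDIDATES = [
    ["g1a", "g1b", "g1c"],
    ["g2a", "g2b"],
    ["g3a", "g3b", "g3c"],
    ["g4a", "g4b"],
]

# Each tuple is one row of the table: one gene chosen for every node.
gene_pathways = list(product(*CANDIDATES))
print(len(gene_pathways))
print(gene_pathways[0])
```

Each of these rows can then be scored separately, which is what makes it possible to rank alternative gene assignments for the same pathway.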

Pathways, Functions, EC-numbers, Proteins, Genes

Nodes are labeled with (sets of) proteins

Nodes = { <2.7.1.2: YCL040W, YDR516c>, <2.7.1.1: YFR053C, YGL253W>, <3.1.3.1: YMR105C, YHR215W, YDL024C, ...>, <3.1.3.58: YCL040W, ...>, <3.1.3.8: YNL141W, ...>, <3.1.3.9: YBR011C, YJL130C> }

Nodes are labeled with (sets of) EC-numbers

Pathway = (Nodes, Edges)

Nodes = { 2.7.1.2, 2.7.1.4, 2.7.1.59, 2.7.1.61, 2.7.1.63, 2.7.1.7, 2.7.1.1, 3.1.3.1, 3.1.3.58, 3.1.3.8, 3.1.3.9 }

Extract from metabolic DB + Systematic Generation of Pathways

Nodes are labeled with reactions

900 different ORF pathways

Glycolysis pathways: Combined Score

---- 10000 random “pathways”

---- 900 putative glycolysis pathways

---- 36 valid glycolysis pathways

Biologically Meaningful

Identifying Genes via Pathway Scores

ORFs fitting into a given pathway according to specific scoring function

ORFs related to a given pathway according to specific scoring function

Legend: Score correlated w.r.t. TCA cycle / Score not correlated / Score negatively correlated

High scoring genes (correlated with TCA cycle genes)

Example 1: Oxidative Phosphorylation

Low scoring genes (anti-correlated with TCA cycle genes)

Example 2: Biosynthesis of Amino Acids

Average scoring genes (not correlated with TCA cycle genes)

Example 3: Urea cycle

Pathway Scores ...

… are suitable for ...

– interpreting time series

– coping with erroneous data

– ranking pathways with respect to plausibility

– interpreting how well pathway genes fit to the pathway

– go fishing for further genes correlated to the pathway (with great care)

– posing different questions by defining new scoring functions

Genetic Network Reconstruction

• Boolean Models

– logical relationships between variables

• Differential equations

– continuous dynamics of biological reactions

• Bayesian Networks

– statistical testing of hypotheses

– use gene/protein annotations as priors to represent biological knowledge

$$BayesianScore(S) = \log p(S \mid D) = \log p(S) + \log p(D \mid S) + c$$

• Variables: Gal80p, Gal4m, Gal4p genes

• Binary quantization using maximum likelihood

• Compare all models possible in the system

• Experiment: reproduce currently accepted model of galactose regulation
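To make the Bayesian score concrete, the sketch below scores two candidate structures on binary (quantized) data as log prior plus log marginal likelihood, dropping the constant c. It assumes the K2 (Cooper-Herskovits) closed form for the marginal likelihood of binary variables; the slides do not state which metric was used, so this choice, the function names and the toy data are assumptions.

```python
from math import lgamma, log


def log_marginal_likelihood(data, parents):
    """K2 log p(D|S) for binary variables.

    data: list of dicts mapping variable name -> 0 or 1
    parents: dict mapping variable name -> tuple of parent names
    """
    total = 0.0
    for var, pa in parents.items():
        counts = {}  # parent configuration -> [count of 0s, count of 1s]
        for row in data:
            key = tuple(row[p] for p in pa)
            counts.setdefault(key, [0, 0])[row[var]] += 1
        for n0, n1 in counts.values():
            # log of (r-1)! / (N + r - 1)! * N0! * N1! with r = 2 states
            total += lgamma(2) - lgamma(n0 + n1 + 2) + lgamma(n0 + 1) + lgamma(n1 + 1)
    return total


def bayesian_score(data, parents, log_prior=0.0):
    """log p(S) + log p(D|S); the constant c is omitted."""
    return log_prior + log_marginal_likelihood(data, parents)


# Invented toy data: the target follows the regulator in every sample.
data = [{"regulator": i % 2, "target": i % 2} for i in range(20)]
dependent = {"regulator": (), "target": ("regulator",)}
independent = {"regulator": (), "target": ()}
print(bayesian_score(data, dependent), bayesian_score(data, independent))
```

With the same prior for both structures the dependent model scores higher on this data; an annotation-based prior would enter through `log_prior`.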

Bayesian Networks

Comparing two models for the control of the galactose metabolism in yeast

Gal80p represses Gal4m

Gal80p inhibits Gal4m posttranslationally

Gal2 independent of Gal80m

Gal4m independent of Gal80m

Edge annotation as Bayesian priors

• No annotation between X and Y

• Positive stimulation: X increases activity of Y

• Negative stimulation

• Undefined

Constraints on the dependence between the genes

Permits scoring of annotated models as unannotated models

Bayesian Networks: Evaluate and extend networks

• Retrieve network from database

• Curate part of the network

• Automatically generate hypotheses on the rest

• Quantify hypotheses using Bayesian metric

• Present high-scoring hypotheses to the user

• Present scores for single genes/edges in the network

• Manual investigation of high scoring hypotheses: New facts

• Generate another iteration of hypotheses

PART III

Guided Tour:

Elucidate organelle-related pathways

Compartments in the eukaryotic cell

Voet & Voet, Biochemistry

Construction of models in yeast

The Network

• Yeast2Hybrid interactions: 5774 (6121)

• Other protein-protein interactions: 2347 (4384)

• Other gene product interactions (MIPS): 6654 (15245)

• Protein complexes: 934 (1020)

• Metabolism: 2135 (4258)

17529 (31028)

Subcellular Localization

• Subcellular localization catalogue (MIPS): 2800 (2300)

• Prediction: TargetP: 720

• Custom motif search

Generate Networks from Experiments: Yeast2Hybrid

‘The protein-protein interaction network of yeast’, Uetz et al., 2000

Map of protein-protein interactions

red, lethal; green, non-lethal; orange, slow growth; yellow, unknown

Jeong et al. 2001

Gray et al. 1999 Science

Origin of mitochondria

Subcellular location in yeast

• MIPS

– localization for 2300 gene products

– wide range of subcellular compartments

ER 159, Peroxisome 39, Transport vesicles 48, Vacuole 54, Nucleus 820, Cytoskeleton 113, Cell wall 37

Golgi 81, Endosome 11, Cytoplasm 583, Plasma membrane 153, Mitochondria 376, Chromosome struct. 23

Subcellular localization of gene products

• Based on N-Terminal sequence

• Except Peroxisome: C-Terminal sequence

• TargetP

– neural network based

– distinguishes between mitochondrion, chloroplast, secretory pathways

– estimated accuracy: 85%

– plants: 10% mitochondrion, 15% chloroplast

• Results for yeast:

– 383 mitochondrion

– 294 secretory pathways

Model of yeast mitochondrion

Figure labels: Mitochondrial ribosome, OM import, IM transloc, F1F0 ATP synthase, Exonuclease, Cytochrome red, Cytochrome ox, Succinate DH, Isocitrate DH, Citrate cycle, Amino acids, Lipoamide, Glycolysis

Evolution of peroxisomal import

Olsen et al. 2001

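Subcellular localization: Peroxisome

• PTS 1 (C-term), three positions

– position 1 (small, uncharged): most common S, C, A; acceptable P, G, T, K, N

– position 2 (basic, pos. charged): most common K, R, H; acceptable N, Q, S, A

– position 3 (nonpolar, hydrophobic): most common L, M; acceptable I, V, F, Y

• PTS 2 (N-term): (R/K) (L/I/Q/V) X5 (H/Q) (L/A); 10 .. 40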

46126

657
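A custom motif search for the two peroxisomal targeting signals can be sketched with regular expressions. The patterns below are simplified consensus motifs read off the motif table above (most common residues only for PTS 1), and the 40-residue N-terminal window is an assumed reading of the "10 .. 40" label; both sequences are invented test strings, not yeast proteins.

```python
import re

# Assumed, simplified consensus patterns (one-letter amino acid codes).
PTS1 = re.compile(r"[SCA][KRH][LM]$")           # C-terminal tripeptide
PTS2 = re.compile(r"[RK][LIVQ].{5}[HQ][LA]")    # N-terminal nonapeptide


def peroxisomal_signals(sequence, n_window=40):
    """Return the names of the targeting signals found in a protein sequence."""
    found = set()
    if PTS1.search(sequence):
        found.add("PTS1")
    # PTS 2 is only searched near the N-terminus (window size is an assumption).
    if PTS2.search(sequence[:n_window]):
        found.add("PTS2")
    return found


print(peroxisomal_signals("MAAAAAAAAASKL"))
print(peroxisomal_signals("MSQRLQVVLGHLAAAAAAAA"))
```

Short motifs like these match many unrelated proteins by chance, so hits are best treated as candidates to be checked against localization data.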

Model of yeast peroxisome

Communication between peroxisomes and the cell

Peroxisomes and phenotype

ISYS - a platform for the integration of software tools and databases

• "plug and play" tools of interest.

• separately developed and independently evolving

• DynamicDiscovery - an exploratory environment to pass objects among components

• Supports visual synchronization among components.

• Integrates web-based resources with desktop applications

Double edged integration problem - technology and IP/licensing

(at least for non-profits)

ISYS

Sources: NCGR, Stanford, Berkeley, Wash. U, Manchester, Web, Other third party software, Your organization’s tools

Components: PathDB, CMD Tool, Table Viewer, Sequence Viewer, Similarity Search Viewer, X-Cluster, GO Browser, ATV, MaxD

Web resources: Entrez - NCBI, BLAST - NCBI, GeneScan - MIT, Google, TAIR - NCGR, GeneX - NCGR

Compare regulated genes with Gene Ontology and MaxD

MaxD: David Hancock, University of Manchester

GO: Michael Ashburner, The Gene Ontology Cons.

Perform statistical analysis: MaxD and Pathway Scoring

MaxD: David Hancock, University of Manchester

Rosetta: Compendium of expression arrays

• 300 Yeast expression arrays: Hughes et al., 2000, Cell

– 280 gene knockout mutants

– 20 titration experiments

• Nutrients

• Antibiotics

• Choose 40 experiments

– Pex12

– 5 genes: human expert knowledge

• involved in gluconeogenesis, ER, vacuolar transport

– 36 more: contained in peroxisomal network

• Cellular organization

• Transcriptional control

• Metabolism, focus energy

• Protein destination, cellular transport

Regulation of metabolism

Interface peroxisome and cytoplasm
Pathway scores: Comparing network and expression

YOR184W 39.68
YER090W 37.98
YPR145W 36.61
YGL062W 36.18
YKL211C 35.90
YLR027C 35.09

Pathway 36.90

YIR034C 86.46
YIL116W 76.63
YOL058W 75.38
YMR062C 72.87
YDL182W 58.25
YDL066W 52.03
YER052C 49.59
YOL140W 47.78
YDL131W 43.24
YGL202W 39.86
YHR208W 38.35
YJR148W 36.32
YNL220W 36.06
YCR005C 35.38
YHR137W 35.21
YNL037C 35.14
YNR050C 34.85
YMR300C 32.54
YLL018C 29.77
YAR015W 28.77

Iteration 1
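The numbers on this slide can be checked directly: the pathway value is the average of the six gene scores in the first list. The sketch below recomputes it and then picks out the genes of the second list that score above the pathway. Reading the second list as candidate genes outside the pathway is an assumption, as is using the pathway score as the cut-off.

```python
# Gene scores copied from the slide (first list: genes of the pathway).
PATHWAY_GENES = {
    "YOR184W": 39.68, "YER090W": 37.98, "YPR145W": 36.61,
    "YGL062W": 36.18, "YKL211C": 35.90, "YLR027C": 35.09,
}
# Second list, assumed here to be candidate genes scored against the pathway.
CANDIDATES = {
    "YIR034C": 86.46, "YIL116W": 76.63, "YOL058W": 75.38, "YMR062C": 72.87,
    "YDL182W": 58.25, "YDL066W": 52.03, "YER052C": 49.59, "YOL140W": 47.78,
    "YDL131W": 43.24, "YGL202W": 39.86, "YHR208W": 38.35, "YJR148W": 36.32,
    "YNL220W": 36.06, "YCR005C": 35.38, "YHR137W": 35.21, "YNL037C": 35.14,
    "YNR050C": 34.85, "YMR300C": 32.54, "YLL018C": 29.77, "YAR015W": 28.77,
}

# Pathway score as the mean of its gene scores.
pathway_score = sum(PATHWAY_GENES.values()) / len(PATHWAY_GENES)

# Candidates scoring above the pathway, best first.
above = sorted(
    (gene for gene, score in CANDIDATES.items() if score > pathway_score),
    key=CANDIDATES.get,
    reverse=True,
)
print(round(pathway_score, 2), len(above), above[:3])
```

The recomputed mean is about 36.91, in line with the 36.90 shown for the pathway.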

Pathway scores: Distribution of correlated genes

Pathway scores: Regulatory proteins with correlated expression

Oleate response element (ORE)

Transcription and threonine pool

Non-classical protein export

Conclusion

• Organism-wide virtual experiments can be performed

• Comprehensive models can be constructed and evaluated

– complete sequence

– abundance of edges between genes/gene products or genes and phenotype

– abundance of information from annotations, large scale expression experiments and Yeast2Hybrid

• What we do not yet understand (well enough)

– relationship between proximity in the network and protein sequence

– network properties: high degree of interconnectivity yet limited effects from gene disruption

– translational and posttranslational regulation

– how to apply large scale experiments to regulation

In case of simple eukaryotes the ‘virtual organism’ is in reach
