Computational Exploration of Metabolic Networks with Pathway Tools
Part 1: Overview & Representations
Suzanne Paley
Bioinformatics Research Group
SRI International
http://BioCyc.org/
SRI International Bioinformatics
Motivation: Theories of Cellular Function Too Large for One Mind to Grasp
Example: E. coli metabolic network
  160 pathways involving 744 reactions and 791 substrates
Example: E. coli genetic network
  Control by 97 transcription factors of 1174 genes in 630 transcription units
Past solutions:
  Partition theories across multiple minds
  Encode theories in natural-language text
We cannot compute with theories in those forms
  Evaluate theories for consistency with new data: microarrays
  Refine theories with respect to new data
  Compare theories describing different organisms
Solution: Biological Knowledge Bases
Store biological knowledge and theories in computers in a declarative form
  Amenable to computational analysis and generative user interfaces
Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
A high-quality, comprehensive knowledge base enables us to ask and answer important new questions
Terminology
Model Organism Database (MOD): DB describing genome and other information about an organism
Pathway/Genome Database (PGDB): MOD that combines information about
  Pathways, reactions, substrates
  Enzymes, transporters
  Genes, replicons
  Transcription factors, promoters, operons, DNA binding sites
BioCyc: Collection of 15 PGDBs at BioCyc.org
  EcoCyc, AgroCyc, HumanCyc
Pathway Tools Software
PathoLogic
  Prediction of metabolic network from genome
  Computational creation of new Pathway/Genome Databases
Pathway/Genome Editors
  Distributed curation of genome annotations
  Distributed object database system
  Interactive editing tools
Pathway/Genome Navigator
  WWW publishing of PGDBs
  Graphic depictions of pathways, chromosomes, operons
  Analysis operations
    Pathway visualization of gene-expression data
    Global comparisons of metabolic networks
Pathway Tools Software
[Diagram of components: Pathway/Genome Databases, Pathway/Genome Navigator, PathoLogic Pathway Predictor, Pathway/Genome Editors]
Pathway/Genome Database
[Diagram of a cell labeled with the PGDB datatypes: Chromosomes, Plasmids; Genes; Proteins; Reactions; Pathways; Compounds; Operons, Promoters, DNA Binding Sites]
Pathway Tools Algorithms
Visualization and editing tools for the following datatypes:
  Full Metabolic Map
    Paint gene expression data on metabolic network; compare metabolic networks
  Pathways
    Pathway prediction
  Reactions
    Balance checker
  Compounds
    Chemical substructure comparison
  Enzymes, Transporters, Transcription Factors
  Genes
  Chromosomes
  Operons
    Operon prediction; visualize genetic network
Definitions
Chemical reactions interconvert chemical compounds
  A + B → C + D
An enzyme is a protein that accelerates chemical reactions
A pathway is a linked set of reactions
  A → C → E
  Often regulated as a unit
  A conceptual unit of the cell’s biochemical machine
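The definitions above can be made concrete with a small sketch. It is illustrative only and does not reflect how Pathway Tools stores reactions; the `Reaction` class and the `is_linked` helper are hypothetical names, and "linked" is assumed here to mean that each reaction forms at least one compound the next reaction consumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    left: frozenset   # compounds consumed
    right: frozenset  # compounds formed


def is_linked(reactions):
    """True if every reaction forms a compound that the next one consumes."""
    return all(a.right & b.left for a, b in zip(reactions, reactions[1:]))


# The two-step pathway from the slide: A -> C -> E
r1 = Reaction(frozenset({"A"}), frozenset({"C"}))
r2 = Reaction(frozenset({"C"}), frozenset({"E"}))
# The single reaction from the slide: A + B -> C + D
r3 = Reaction(frozenset({"A", "B"}), frozenset({"C", "D"}))

pathway = [r1, r2]
```

Here `is_linked(pathway)` holds because C is formed by the first step and consumed by the second; reversing the order breaks the link.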
Operations of the Metabolic Overview
Find pathways, compounds
Find reactions
  By enzyme name, EC number, substrates, modulation
  All with isozymes
  All occurring in multiple pathways
  By EC class, pathway class
Find genes
  By name, gene class
  All regulated by transcriptional regulator protein
Metabolic Overview Queries
Species comparison
  Highlight reactions that are
    Shared/not-shared with
    Any-one/All-of
    A specified set of species
Overlay expression data
  Colors reflect expression level and are user-configurable
  Can show single experiment or animated time series
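The species-comparison query can be read as a quantified set test. The sketch below is one possible interpretation, not the Navigator's implementation: `highlight` is a hypothetical helper, reaction identifiers are made up, and "not-shared with any-one" is assumed to mean "absent from at least one of the specified species".

```python
def highlight(reference, others, shared=True, quantifier=any):
    """Select reactions of `reference` by comparing against other species.

    shared=True  keeps reactions that ARE present in the other species;
    shared=False keeps reactions that are NOT present.
    quantifier=any means "in any one species", quantifier=all means "in all".
    """
    return {
        rxn
        for rxn in reference
        if quantifier((rxn in species) == shared for species in others)
    }


# Made-up reaction sets for a reference organism and two comparison species.
reference = {"R1", "R2", "R3"}
others = [{"R1", "R2"}, {"R2", "R4"}]

shared_with_any = highlight(reference, others, shared=True, quantifier=any)
shared_with_all = highlight(reference, others, shared=True, quantifier=all)
missing_from_any = highlight(reference, others, shared=False, quantifier=any)
missing_from_all = highlight(reference, others, shared=False, quantifier=all)
```

Passing the built-in `any` or `all` as the quantifier keeps the four combinations of the query in one expression.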
EcoCyc Project
E. coli Encyclopedia
  Model-Organism Database for E. coli
  Began in 1992 as collaboration between Karp and Riley
  Over 3500 literature citations
Collaborative development via Internet
  Karp (SRI) -- Bioinformatics architect
  John Ingraham -- Advisor
  (SRI) Metabolic pathways
  Saier (UCSD) and Paulsen (TIGR) -- Transport
  Collado (UNAM) -- Regulation of gene expression
Ontology: 1000 biological classes
Database content: 17,700 instances
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
http://BioCyc.org/
  Genes: 4,393
  Proteins: 4,273
  Enzymes: 914
  Transporters: 162
  Reactions: 2,760
  Pathways: 165
  Compounds: 774
  Transcription Units: 724
  Transcription Factors: 110
  Promoters: 812
  TransFac Sites: 956
  Citations: 3,508
MetaCyc: Metabolic Encyclopedia
Nonredundant metabolic pathway database
  Describe a representative sample of every experimentally determined metabolic pathway
Literature-based DB with extensive references and commentary
  Pathways, reactions, enzymes, substrates
460 pathways, 1267 enzymes, 4294 reactions
  172 E. coli pathways, 2735 citations
Nucleic Acids Research 30:59-61 2002.
Jointly developed by SRI and Carnegie Institution
  New focus on plant pathways
MetaCyc Data
MetaCyc contains one DB object for each distinct pathway
  Distinct in terms of reaction steps
  Each pathway labeled with species it occurs in
MetaCyc pathways are experimentally determined
4218 reactions in MetaCyc
  401 lack EC numbers
MetaCyc Enzyme Data
Reaction(s) catalyzed
Alternative substrates
Cofactors / prosthetic groups
Activators and inhibitors
Subunit structure
Molecular weight, pI
Comment, literature citations
Species
MetaCyc Frequent Organisms
Escherichia coli 156
Arabidopsis thaliana 47
Homo sapiens 30
Pseudomonas 21
Bacillus subtilis 20
Salmonella typhimurium 20
Sulfolobus solfataricus 18
Pseudomonas putida 14
Saccharomyces cerevisiae 14
Haemophilus influenzae 13
Glycine max 11
Deinococcus radiodurans 10
EcoCyc and MetaCyc
Review-level databases
Data derived primarily from biomedical literature
  Manual entry by staff curators
  Updates by staff curators only
Data validation
  Consistency constraints
  Lisp programs that verify other semantic relationships
    Unbalanced chemical reactions
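As an illustration of the kind of check that flags unbalanced reactions, the sketch below compares element counts on the two sides of a reaction. It is not the Lisp validation code used for EcoCyc or MetaCyc: `parse_formula` and `imbalance` are hypothetical helpers, only simple formulas without parentheses or charges are handled, each compound is assumed to have a stoichiometric coefficient of 1, and the formulas used are those of the neutral compounds.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C4H6O4' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(left, right):
    """Return {element: left count - right count} for unbalanced elements."""
    total = Counter()
    for formula in left:
        total.update(parse_formula(formula))
    for formula in right:
        total.subtract(parse_formula(formula))
    return {element: n for element, n in total.items() if n != 0}


# succinate + FAD = fumarate + FADH2, written with neutral-form formulas
SUCCINATE, FUMARATE = "C4H6O4", "C4H4O4"
FAD, FADH2 = "C27H33N9O15P2", "C27H35N9O15P2"

balanced = imbalance([SUCCINATE, FAD], [FUMARATE, FADH2])
unbalanced = imbalance([SUCCINATE], [FUMARATE])
```

An empty result means every element balances; dropping FAD and FADH2 leaves two hydrogens unaccounted for on the right-hand side.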
Computationally-Derived PGDBs
[Diagram: PathoLogic Software takes an Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences) and the Multi-organism Pathway Database (MetaCyc: Reactions, Pathways, Compounds) and produces a Pathway/Genome Database (Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds)]
PathoLogic Software integrates genome and pathway data to identify putative metabolic networks
PathoLogic Input/Output
Inputs:
  File listing genetic elements
    http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
  Files containing DNA sequence for each genetic element
  Files containing annotation for each genetic element
  MetaCyc database
Output:
  Pathway/genome database for the subject organism
  Directory tree for the subject organism
  Reports that summarize:
    Evidence contained in the input genome for the presence of reference pathways
    Reactions missing from inferred pathways
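To make the two report types concrete, here is a toy sketch of how evidence for a reference pathway and its missing reactions could be tabulated. This is not PathoLogic's actual scoring method, which the slides do not describe; `pathway_evidence`, the pathway names, and the reaction identifiers are all hypothetical, and "evidence" is assumed to mean simply the fraction of a pathway's reactions that have an enzyme in the annotated genome.

```python
def pathway_evidence(reference_pathways, catalyzed_reactions):
    """Summarize, per reference pathway, which reactions are present or missing.

    reference_pathways: dict mapping pathway name -> list of reaction ids
    catalyzed_reactions: set of reaction ids with an enzyme in the genome
    """
    report = {}
    for name, reactions in reference_pathways.items():
        present = [r for r in reactions if r in catalyzed_reactions]
        missing = [r for r in reactions if r not in catalyzed_reactions]
        fraction = len(present) / len(reactions) if reactions else 0.0
        report[name] = {
            "present": present,
            "missing": missing,
            "fraction": fraction,
        }
    return report


# Made-up reference pathways and genome annotation results.
reference = {
    "pathway-X": ["rxn-1", "rxn-2", "rxn-3"],
    "pathway-Y": ["rxn-4", "rxn-5"],
}
found_in_genome = {"rxn-1", "rxn-3", "rxn-9"}

report = pathway_evidence(reference, found_in_genome)
```

In this example pathway-X has two of three reactions supported and one missing reaction, while pathway-Y has no support at all.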
PathoLogic Functionality
Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assist user with manual tasks
  Assign enzymes to reactions they catalyze
  Identify false-positive pathway predictions
  Build protein complexes from monomers
  Assemble Overview diagram
BioCyc Collection of Pathway/Genome DBs
Literature-based datasets:
  Escherichia coli (EcoCyc)
  MetaCyc
PGDBs at other sites:
  Arabidopsis thaliana (TAIR)
  Methanococcus jannaschii (EBI)
  Saccharomyces cerevisiae (SGD)
  Synechocystis PCC6803
Computationally-derived datasets:
  Agrobacterium tumefaciens
  Caulobacter crescentus
  Chlamydia trachomatis
  Bacillus subtilis
  Helicobacter pylori
  Haemophilus influenzae
  Homo sapiens
  Mycobacterium tuberculosis H37Rv
  Mycobacterium tuberculosis CDC1551
  Mycoplasma pneumoniae
  Pseudomonas aeruginosa
  Treponema pallidum
  Vibrio cholerae
Yellow = Open Database
http://BioCyc.org/
HumanCyc: Human Metabolic Pathway Database
PGDB of human metabolic pathways built using PathoLogic
Contains information on 28,700 genes, their products, and the metabolic reactions and pathways they catalyze (no signalling pathways)
  Chromosome and contigs from Ensembl
  Human genetic loci from LocusLink
  Mitochondrion data from GenBank
  Ensembl and LocusLink gene entries were merged to eliminate redundancies where possible.
Contains links to human genome web sites
Plan to hire one curator to refine and curate with respect to literature over a 2 year period
  Remove false-positive predictions
  Insert known pathways missed by PathoLogic
  Add comments and citations from pathways and enzymes to the literature
  Add enzyme activators, inhibitors, cofactors, tissue information
Funded by commercial consortium
BioCyc and Pathway Tools Availability
WWW BioCyc freely available to all
  BioCyc.org
  Six BioCyc DBs openly available to all
BioCyc DBs freely available to non-profits
  Flatfiles downloadable from BioCyc.org
  Binary executable:
    Sun UltraSparc-170 w/ 64MB memory
    PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  PerlCyc API
Pathway Tools freely available to non-profits
Information Sources
Pathway Tools User’s Guide
  aic-export/ecocyc/genopath/released/doc/userguide1.pdf
    Pathway/Genome Navigator
    Appendix A: Guide to the Pathway Tools Schema
  aic-export/ecocyc/genopath/released/doc/userguide2.pdf
    PathoLogic, Editing Tools
Pathway Tools Web Site
  http://bioinformatics.ai.sri.com/ptools/
  Publications, programming examples, etc.
Pathway Tools Tutorial
  http://bioinformatics.ai.sri.com/ptools/tutorial/
Pathway Tools Implementation Details
Allegro Common Lisp
Sun and PC platforms
Ocelot object database
250,000 lines of code
Lisp-based WWW server at BioCyc.org
  Manages 15 PGDBs
Frame Data Model
Frame Data Model -- organizational structure for a PGDB
  Knowledge base (KB, Database, DB)
  Frames
  Slots
Knowledge Base
Collection of frames and their associated slots, values, facets, and annotations
AKA: Database, PGDB
Can be stored within
  An Oracle DB
  A disk file
  A Pathway Tools binary program
Frames
Entities with which facts are associated
Kinds of frames:
  Classes: Genes, Pathways, Biosynthetic Pathways
  Instances (objects): trpA, TCA cycle
Classes:
  Superclass(es)
  Subclass(es)
  Instance(s)
A symbolic frame name (id, key) uniquely identifies each frame
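A minimal sketch of the frame idea follows, assuming a plain dictionary keyed by frame name. It is not the Ocelot API; the `KB` class and its methods are hypothetical, and the frame names `Biosynthetic-Pathways` and `example-pathway` are chosen only for illustration.

```python
class KB:
    """Toy knowledge base: every frame is identified by a unique name."""

    def __init__(self):
        # frame name -> {"kind": "class" | "instance", "parents": [...]}
        self.frames = {}

    def _add(self, name, kind, parents):
        if name in self.frames:
            raise ValueError(f"frame name already in use: {name}")
        self.frames[name] = {"kind": kind, "parents": list(parents)}

    def add_class(self, name, superclasses=()):
        self._add(name, "class", superclasses)

    def add_instance(self, name, classes):
        self._add(name, "instance", classes)

    def is_a(self, name, class_name):
        """True if `name` is `class_name` or reaches it via parent links."""
        if name == class_name:
            return True
        parents = self.frames[name]["parents"]
        return any(self.is_a(parent, class_name) for parent in parents)


kb = KB()
kb.add_class("Genes")
kb.add_class("Pathways")
kb.add_class("Biosynthetic-Pathways", superclasses=["Pathways"])
kb.add_instance("trpA", classes=["Genes"])
kb.add_instance("example-pathway", classes=["Biosynthetic-Pathways"])
```

Because classes and instances share one dictionary, a name can identify only one frame, and `kb.is_a("example-pathway", "Pathways")` follows the instance-to-class and subclass-to-superclass links in the same way.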
Slots
Encode attributes/properties of a frame
  Integer, real number, string
Represent relationships between frames
  The value of a slot is the identifier of another frame
Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Properties of Slots
Number of values
  Single valued
  Multivalued: sets, bags
Slot values
  Any LISP object: Integer, real, string, symbol (frame name)
Slotunits define properties of slots: datatypes, classes, constraints
Two slots are inverses if they encode opposite relationships
  Slot Product in class Genes
  Slot Gene in class Polypeptides
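One way to picture inverse slots is a store that mirrors every write onto the opposite slot. The sketch below is an assumption-laden illustration, not the Pathway Tools implementation: `INVERSES`, `slots`, and `add_value` are hypothetical names, and the pairing of gene sdhA with polypeptide Sdh-flavo is used only as sample data.

```python
from collections import defaultdict

# Declared inverse pairs: Product (in Genes) <-> Gene (in Polypeptides).
INVERSES = {"Product": "Gene", "Gene": "Product"}

# frame name -> slot name -> set of values (a multivalued slot as a set)
slots = defaultdict(lambda: defaultdict(set))


def add_value(frame, slot, value):
    """Add `value` to frame.slot and mirror it on the inverse slot, if any."""
    slots[frame][slot].add(value)
    inverse = INVERSES.get(slot)
    if inverse is not None:
        slots[value][inverse].add(frame)


add_value("sdhA", "Product", "Sdh-flavo")
```

After the single call, the gene frame records its product and the polypeptide frame records its gene, so the relationship can be followed from either end.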
Pathway Tools Ontology
1064 classes
  Main classes such as:
    Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
  Taxonomies for Pathways, Reactions, Compounds
205 slots
  Meta-data: Creator, Creation-Date
  Comment, Citations, Common-Name, Synonyms
  Attributes: Molecular-Weight, DNA-Footprint-Size
  Relationships: Catalyzes, Component-Of, Product
Classes, instances, slots all stored side by side in DBMS, share a single namespace
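Slot Links from Gene to Pathway Frame
[Diagram: genes sdhA, sdhB, sdhC, sdhD (on the chromosome, Chrom) link by the product slot to polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; these link by component-of to Succinate dehydrogenase, which links by catalyzes to an Enzymatic-reaction; that links by reaction to succinate + FAD = fumarate + FADH2, which links by in-pathway to TCA Cycle. The reaction's left slot holds succinate and FAD; its right slot holds fumarate and FADH2.]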
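The chain of slot links in this figure can be walked programmatically. The sketch below assumes a plain nested dictionary rather than a real PGDB; the `follow` helper and the `KB` layout are hypothetical, while the frame and slot names are taken from the figure labels.

```python
# Frame name -> slot name -> list of values, restricted to the sdhA branch.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["Enzymatic-reaction"]},
    "Enzymatic-reaction": {"reaction": ["succinate + FAD = fumarate + FADH2"]},
    "succinate + FAD = fumarate + FADH2": {
        "in-pathway": ["TCA Cycle"],
        "left": ["succinate", "FAD"],
        "right": ["fumarate", "FADH2"],
    },
}


def follow(kb, start, slot_path):
    """Follow a sequence of slots from `start`, returning the frames reached."""
    frontier = [start]
    for slot in slot_path:
        reached = []
        for frame in frontier:
            for value in kb.get(frame, {}).get(slot, []):
                if value not in reached:
                    reached.append(value)
        frontier = reached
    return frontier


GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]
pathways = follow(KB, "sdhA", GENE_TO_PATHWAY)
```

Swapping the last slot for `left` or `right` returns the reaction's substrates instead of its pathway, which shows how one traversal routine can serve several queries.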
Enzymatic-reaction frame stores properties of pairing between enzyme and reaction
[Diagram: the same sdhA, sdhB, sdhC, sdhD / Succinate dehydrogenase / Enzymatic-reaction / Succinate + FAD = fumarate + FADH2 / TCA Cycle frames, annotated with example properties: EC#, Keq; Cofactors, Inhibitors; Molecular wt, pI; Left-end-position]
Monofunctional Monomer
[Diagram: Gene, Monomer, Enzymatic-reaction, Reaction, Pathway linked in a chain]
Bifunctional Monomer
[Diagram: one Gene and its Monomer linked to two Enzymatic-reaction frames, each with its own Reaction; one Pathway is shown]
Monofunctional Multimer
[Diagram: four Genes, each with a Monomer product; the four Monomers form one Multimer, which links to an Enzymatic-reaction, a Reaction, and a Pathway]
Pathway and Substrates
[Diagram: a Pathway containing several Reactions via in-pathway links; a Reaction's left slot holds Reactant-1 and Reactant-2, and its right slot holds Product-1 and Product-2]
Genetic Network Representation
Describe biological entities involved in control of transcription initiation
  Promoters, operators, transcription factors, operons, terminators
Describe molecular interactions among these entities
  Modulation of transcription factor activity
  Binding of transcription factors to DNA binding sites
  Effects on transcription initiation
Ontology for Transcriptional Regulation
One DB object defined for each biological entity and for each molecular interaction
[Diagram: transcription unit trpLEDCBA with promoter pro001, DNA binding site site001, and genes trpL, trpE, trpD, trpC, trpB, trpA; interaction Int001 involves TrpR*trp and site001; interaction Int002 involves RpoSig70 and the promoter; a complexation reaction forms TrpR*trp from apoTrpR and trp]
Int001 (binding of TrpR*trp to site001) inhibits Int002 (binding of RNA Polymerase to promoter) and consequently prevents transcription of genes in transcription unit.
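Because each interaction is its own object, the inhibition described above can be evaluated mechanically. The sketch below is a simplified illustration, not the Pathway Tools representation: `INHIBITS`, `TRANSCRIPTION_UNIT`, `effective`, and `transcribed_genes` are hypothetical names, and inhibition is assumed to be all-or-nothing and applied in a single pass.

```python
# Int001: binding of TrpR*trp to site001
# Int002: binding of RNA polymerase to the promoter
INHIBITS = {"Int001": {"Int002"}}

TRANSCRIPTION_UNIT = {
    "name": "trpLEDCBA",
    "promoter_binding": "Int002",
    "genes": ["trpL", "trpE", "trpD", "trpC", "trpB", "trpA"],
}


def effective(candidate_interactions):
    """Return the candidate interactions not inhibited by another candidate."""
    blocked = set()
    for interaction in candidate_interactions:
        blocked |= INHIBITS.get(interaction, set())
    return set(candidate_interactions) - blocked


def transcribed_genes(candidate_interactions, unit=TRANSCRIPTION_UNIT):
    """Genes transcribed if the unit's promoter binding is not inhibited."""
    if unit["promoter_binding"] in effective(candidate_interactions):
        return list(unit["genes"])
    return []


without_repressor = transcribed_genes({"Int002"})
with_repressor = transcribed_genes({"Int001", "Int002"})
```

With only the polymerase binding present, all six genes are transcribed; adding the repressor binding blocks it and the list is empty.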
Principal Classes
Class names are capitalized, plural
Genetic-Elements, with subclasses:
  Chromosomes
  Plasmids
Genes
Transcription-Units
RNAs
Proteins, with subclasses:
  Polypeptides
  Protein-Complexes
Principal Classes
Reactions, with subclasses:
  Transport-Reactions
Enzymatic-Reactions
Pathways
Compounds-And-Elements
Slots in Multiple Classes
Common-Name
Synonyms
Names (computed as union of Common-Name, Synonyms)
Comment
Citations
DB-Links
Genes Slots
Chromosome
Left-End-Position
Right-End-Position
Centisome-Position
Transcription-Direction
Product
Proteins Slots
Molecular-Weight-Seq
Molecular-Weight-Exp
pI
Locations
Modified-Form
Unmodified-Form
Component-Of
Polypeptides Slots
Gene
Protein-Complexes Slots
Components
Reactions Slots
EC-Number
Left, Right
Substrates (computed as union of Left, Right)
Enzymatic-Reaction
DeltaG0
Spontaneous?
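A computed slot such as Substrates can be pictured as a value derived on access instead of stored. The sketch below uses a Python property for that purpose; it is an analogy only, and the `ReactionFrame` class is hypothetical and not part of Pathway Tools.

```python
from dataclasses import dataclass, field


@dataclass
class ReactionFrame:
    left: list = field(default_factory=list)
    right: list = field(default_factory=list)

    @property
    def substrates(self):
        """Union of left and right, computed on each access (never stored)."""
        combined = []
        for compound in self.left + self.right:
            if compound not in combined:
                combined.append(compound)
        return combined


rxn = ReactionFrame(
    left=["succinate", "FAD"],
    right=["fumarate", "FADH2"],
)
all_substrates = rxn.substrates
```

Because the union is recomputed each time, editing `left` or `right` can never leave `substrates` out of date, which is the usual motivation for computing such a slot rather than storing it.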
Enzymatic-Reactions Slots
Enzyme
Reaction
Activators
Inhibitors
Physiologically-Relevant
Cofactors
Prosthetic-Groups
Alternative-Substrates
Alternative-Cofactors
Reaction-direction
Pathways Slots
Reaction-List
Predecessors
Primaries