Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Reconstructing the metabolic network of a bacterium from its
genome
Marko Laakso21.2.2007
Reference
The model reconstruction process has been reviewed in:
Francke, C., Siezen, R. and Teusink, B.,Reconstructing the metabolic network of a bacterium from its
genome. Trends in Microbiology, 13,11(2005), pages 550–558. URL http://dx.doi.org/10.1016/j.tim.2005.09.001.
More references can be found from the seminar report.
Agenda● Bacteria and their genome● Identification of the genes● Sequence alignments● Gene annotations● Mapping proteins to pathways● Model curation
Metabolic Network● A graph of biochemical reactions connected to
their reactants● Represents the metabolism of an organism● Subgraphs may be referred as metabolic
pathways and they may represent alterations of key metabolites or certain important cascade of reactions
● Involves about 30% of the genes
Goals● Model should be
– Applicable to contemporary computers● Formal● Predictions should be solvable
– Accurate enough to be of any practical use– Cover all essential pathways
● Reconstruction should be– Fast and inexpensive
● Automated as much as possible● Take an advantage existing information
– Well known and systematic process● Recycling of the tools and methods● Easy curation and fixing
Why to Study Bacteria?● Bacteria are found practically everywhere on
Earth● Vital part of the human body (normal flora)● Known pathogens● Industrial interests
– Waste water treatment– Production of chemicals– Food production
● Model organisms
Bacteria● Unicellular● Prokaryotic
– No nuclei, Golgi complex, endoplasmic reticulum, etc...
● There is a vast variety of species and variants
● Escherichia coli is the best known bacterium
(c) Mariana Ruiz
Genomes of Bacteria● Circular chromosome
– Only 'few' associated proteins– May be found in multiple copies (for replication)
● Plasmids● From 0.58 Mbp to over 10 Mbp● The intracellular parasites are out of the scope
of this presentation as their metabolism is tight to their host cells
● Genes have known promoter sequences● No introns in genes
Identification of the Genes● Sequencing of the genome● Gene identification
– Codon based methods● Start codon (ATG, GTG, TTG, CTG)● Stop codon (TAA, TGA, TAG)● Selection of the longest open reading frames● All three reading frames should be used for both ends
(some bacteria have programmed frame shifts)– Sequencing of the mRNA– Align based algorithms
Types of Similar Genes● Homologous genes have a common ancestor
– Orthologous genes of two organisms originate from the same gene of their last common ancestor (the function may be conserved)
– Paralogous genes are copies of the same gene (likely to have functional differences)
● Analogous genes have the same function but they are not homologous– Hard to identify from the plain sequence
Enzymes● Enzymes are biological catalyst that is they
decrease the activation energy of a reaction● All chemical reactions are reversible● All chemical reactions may take place without a
catalyst if they can be catalysed● It is hard to predict enzyme activities from
sequences– 3D model– Molecular kinetics
Gene Annotations● Some genes encode proteins with enzyme
activity● Information about the known genes and their
enzyme activities have been collected into biological databases– The sequence and the protein database may not be
the same● Enzyme activities of the known genes may
apply to their homologies in other organism
Sequence Alignment● The sequences of homologous genes may be
similar and homologous genes can be found by comparing sequences of known genes to genes of the bacteria that is concerned
(c) Atin Jaiswal
Sequence Aligning Algorithms● Algorithms that align two strings so that the
order of symbols is preserved and the similarity between the symbols of the same locus is maximised
● Strings may be the nucleotide sequences or the amino acid sequences of proteins
● Similarities can be given as a distant matrix, which gives the 'probabilities' for the substitutions of the symbols– BLOcks Substitution Matrix– Point Accepted Mutation
Dynamic Programming● Recursive algorithms calculating values based
on already calculated values● Many alignment algorithms have been derived
from the same concept of a 2D array, which represents the sequences on its axis
● Global alignment: Needleman-Wunsch● Local alignment: Smith-Waterman
Needleman-Wunsch (global)
© R. Durbin et al.
BLAST● Basic local alignment search tool● Widely used and heuristic algorithm for
sequence comparisons● Steps:
1.Create a vocabulary of short 'words'2.Found all high scoring matches of these words3.Extend matches from both ends (maximum score)4.Sort results based on the scoring
FASTA● A heuristic alignment
algorithm that reduces the size of the alignment matrix by concentrating on the most identical parts of the sequences
(c) Geoffry Barton
Hidden Markov Models● Statistical models that can be used to find most
probable interpretations for different parts of sequences or likelihoods that sequences represent the given phenomenon such as certain active parts of a protein
● Identification of the common motifs of proteins● Consists of states, emission probabilities, and
transition probabilities
Reconstructing the Network● Enzymes are bind to their substrates and end
products of the reaction● Metabolic networks consists of reactions and
reactants– Anabolic reactions consume some energy in order
to produce complex molecules from the simple ones
– Catabolic reactions produce energy by breaking complex molecules
Binding Reactions to Reactants
Enzyme Commission numbers● Enzyme commission numbers (EC numbers) are widely
used identifiers for enzyme catalysed reaction● Nomenclature Committee of the International Union of
Biochemistry and Molecular Biology (NC-IUBMB)● Hierarchical structure of four layers each providing a
decimal number● Top level categories are: oxidoreductases, transferases,
hydrolases, lyases, isomerases, and ligases● EC numbers may be partial, which may lead to some
issues in model construction (the exact reaction may not be catalysed by the found enzymes of the group)
EC number Examples● MUTYH gene (myhY homolog (E. coli))
– A/G-specific adenine DNA glycosylase– EC 3.2.2.-
● hydrolases● glykosylases● hydrolysing N-glycosyl compounds
● CDC2L1 gene (cell division cycle 2-like 1 (PITSLRE proteins))– EC 2.7.11.22
● transferases● transferring phosphorus-containing groups● protein-serine/threonine kinases● cyclin-dependent kinase
Manual Curation● Biological databases consists of measurements
and predictions– Some are false– Many are inaccurate
● Alignments and phylogenetic trees are guesses● Gene functions may not be conserved● Bugs in programs
Manual Curation● All reactions should be verified
– Wrong information should be excluded from the model construction software
● The model should include references back to the original data sources– Verification of the results and automated bindings– Model updates as the underlying data gets changed
(well verified model may stand on its own)– Reasoning for the model and proper citing– References may provide an integration platform for
other tools
Model Integrity● Model should include all essential pathways● Consistent with the known physiology● There should be no gaps
– Reactions producing accumulating products– Reactions without reactants– Unidentified gene, missing annotations– Convergent evolution
● Elemental balance of all reactions
Reference Scoring● Annotations may be scored based on...
– Phylogenetic relationships– Common regulation
● Order of the genes● Expression profiles● Regulation motifs
– Co-occurrence– Gene fusions– Existence of chaperones and other modifiers
Further Improvements● Context modelling
– Flux estimation– Steady states– Gene regulation
● Standardisation of the model representations– Comparisons between the species– Simulations of co-operations– Biotechnological tools and simulators
Conclusions● Genes can be compared using sequence
aligning algorithms● Annotations may be copied from homologous
genes● There are many databases for protein functions● It is easier to improve network models
iteratively than construct them right from the scratch