Reconstructing the metabolic network of a bacterium from ... · Reconstructing the metabolic network of a bacterium from its genome Marko Laakso 21.2.2007. ... More references can

Reconstructing the metabolic network of a bacterium from its

genome

Marko Laakso21.2.2007

Reference

The model reconstruction process has been reviewed in:

Francke, C., Siezen, R. and Teusink, B.,Reconstructing the metabolic network of a bacterium from its

genome. Trends in Microbiology, 13,11(2005), pages 550–558. URL http://dx.doi.org/10.1016/j.tim.2005.09.001.

More references can be found from the seminar report.

Agenda● Bacteria and their genome● Identification of the genes● Sequence alignments● Gene annotations● Mapping proteins to pathways● Model curation

Metabolic Network● A graph of biochemical reactions connected to

their reactants● Represents the metabolism of an organism● Subgraphs may be referred as metabolic

pathways and they may represent alterations of key metabolites or certain important cascade of reactions

● Involves about 30% of the genes

Goals● Model should be

– Applicable to contemporary computers● Formal● Predictions should be solvable

– Accurate enough to be of any practical use– Cover all essential pathways

● Reconstruction should be– Fast and inexpensive

● Automated as much as possible● Take an advantage existing information

– Well known and systematic process● Recycling of the tools and methods● Easy curation and fixing

Why to Study Bacteria?● Bacteria are found practically everywhere on

Earth● Vital part of the human body (normal flora)● Known pathogens● Industrial interests

– Waste water treatment– Production of chemicals– Food production

● Model organisms

Bacteria● Unicellular● Prokaryotic

– No nuclei, Golgi complex, endoplasmic reticulum, etc...

● There is a vast variety of species and variants

● Escherichia coli is the best known bacterium

(c) Mariana Ruiz

Genomes of Bacteria● Circular chromosome

– Only 'few' associated proteins– May be found in multiple copies (for replication)

● Plasmids● From 0.58 Mbp to over 10 Mbp● The intracellular parasites are out of the scope

of this presentation as their metabolism is tight to their host cells

● Genes have known promoter sequences● No introns in genes

Identification of the Genes● Sequencing of the genome● Gene identification

– Codon based methods● Start codon (ATG, GTG, TTG, CTG)● Stop codon (TAA, TGA, TAG)● Selection of the longest open reading frames● All three reading frames should be used for both ends

(some bacteria have programmed frame shifts)– Sequencing of the mRNA– Align based algorithms

Types of Similar Genes● Homologous genes have a common ancestor

– Orthologous genes of two organisms originate from the same gene of their last common ancestor (the function may be conserved)

– Paralogous genes are copies of the same gene (likely to have functional differences)

● Analogous genes have the same function but they are not homologous– Hard to identify from the plain sequence

Enzymes● Enzymes are biological catalyst that is they

decrease the activation energy of a reaction● All chemical reactions are reversible● All chemical reactions may take place without a

catalyst if they can be catalysed● It is hard to predict enzyme activities from

sequences– 3D model– Molecular kinetics

Gene Annotations● Some genes encode proteins with enzyme

activity● Information about the known genes and their

enzyme activities have been collected into biological databases– The sequence and the protein database may not be

the same● Enzyme activities of the known genes may

apply to their homologies in other organism

Sequence Alignment● The sequences of homologous genes may be

similar and homologous genes can be found by comparing sequences of known genes to genes of the bacteria that is concerned

(c) Atin Jaiswal

Sequence Aligning Algorithms● Algorithms that align two strings so that the

order of symbols is preserved and the similarity between the symbols of the same locus is maximised

● Strings may be the nucleotide sequences or the amino acid sequences of proteins

● Similarities can be given as a distant matrix, which gives the 'probabilities' for the substitutions of the symbols– BLOcks Substitution Matrix– Point Accepted Mutation

Dynamic Programming● Recursive algorithms calculating values based

on already calculated values● Many alignment algorithms have been derived

from the same concept of a 2D array, which represents the sequences on its axis

● Global alignment: Needleman-Wunsch● Local alignment: Smith-Waterman

Needleman-Wunsch (global)

© R. Durbin et al.

BLAST● Basic local alignment search tool● Widely used and heuristic algorithm for

sequence comparisons● Steps:

1.Create a vocabulary of short 'words'2.Found all high scoring matches of these words3.Extend matches from both ends (maximum score)4.Sort results based on the scoring

FASTA● A heuristic alignment

algorithm that reduces the size of the alignment matrix by concentrating on the most identical parts of the sequences

(c) Geoffry Barton

Hidden Markov Models● Statistical models that can be used to find most

probable interpretations for different parts of sequences or likelihoods that sequences represent the given phenomenon such as certain active parts of a protein

● Identification of the common motifs of proteins● Consists of states, emission probabilities, and

transition probabilities

Reconstructing the Network● Enzymes are bind to their substrates and end

products of the reaction● Metabolic networks consists of reactions and

reactants– Anabolic reactions consume some energy in order

to produce complex molecules from the simple ones

– Catabolic reactions produce energy by breaking complex molecules

Binding Reactions to Reactants

Enzyme Commission numbers● Enzyme commission numbers (EC numbers) are widely

used identifiers for enzyme catalysed reaction● Nomenclature Committee of the International Union of

Biochemistry and Molecular Biology (NC-IUBMB)● Hierarchical structure of four layers each providing a

decimal number● Top level categories are: oxidoreductases, transferases,

hydrolases, lyases, isomerases, and ligases● EC numbers may be partial, which may lead to some

issues in model construction (the exact reaction may not be catalysed by the found enzymes of the group)

EC number Examples● MUTYH gene (myhY homolog (E. coli))

– A/G-specific adenine DNA glycosylase– EC 3.2.2.-

● hydrolases● glykosylases● hydrolysing N-glycosyl compounds

● CDC2L1 gene (cell division cycle 2-like 1 (PITSLRE proteins))– EC 2.7.11.22

● transferases● transferring phosphorus-containing groups● protein-serine/threonine kinases● cyclin-dependent kinase

Manual Curation● Biological databases consists of measurements

and predictions– Some are false– Many are inaccurate

● Alignments and phylogenetic trees are guesses● Gene functions may not be conserved● Bugs in programs

Manual Curation● All reactions should be verified

– Wrong information should be excluded from the model construction software

● The model should include references back to the original data sources– Verification of the results and automated bindings– Model updates as the underlying data gets changed

(well verified model may stand on its own)– Reasoning for the model and proper citing– References may provide an integration platform for

other tools

Model Integrity● Model should include all essential pathways● Consistent with the known physiology● There should be no gaps

– Reactions producing accumulating products– Reactions without reactants– Unidentified gene, missing annotations– Convergent evolution

● Elemental balance of all reactions

Reference Scoring● Annotations may be scored based on...

– Phylogenetic relationships– Common regulation

● Order of the genes● Expression profiles● Regulation motifs

– Co-occurrence– Gene fusions– Existence of chaperones and other modifiers

Further Improvements● Context modelling

– Flux estimation– Steady states– Gene regulation

● Standardisation of the model representations– Comparisons between the species– Simulations of co-operations– Biotechnological tools and simulators

Conclusions● Genes can be compared using sequence

aligning algorithms● Annotations may be copied from homologous

genes● There are many databases for protein functions● It is easier to improve network models

iteratively than construct them right from the scratch

Documents

Reconstructing the metabolic network of a bacterium from ... · Reconstructing the metabolic network of a bacterium from its genome Marko Laakso 21.2.2007. ... More references can