Upload
tevin
View
51
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Molecular biology in silico. Mikhail Gelfand Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS AlBio06, Moscow, July 2006. red: papers (experiments) blue: sequence fragments. Propaganda. Complete genomes. - PowerPoint PPT Presentation
Citation preview
Molecular biology in silico
Mikhail GelfandResearch and Training Center “Bioinformatics”,
Institute for Information Transmission Problems, RAS
AlBio06, Moscow, July 2006
Propaganda
100
1000
10000
100000
1000000
10000000
1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
год
red: papers (experiments)blue: sequence fragments
Complete genomes
2
149
4
18
30
55
84
8
19
422
1
107
4321
15
0
10
20
30
40
50
60
70
80
90
1995 1996 1997 1998 1999 2000 2001 2002
GOLD db.(III.2006):361 complete genomesIncomplete (in the process):
952 bacteria 58 archaea607 eukaryotes (incl. ESTs) 46 metagenomes
More propaganda
Most genes will never be studied in experimentEven in E.coli: only 20-30 new genes per year (hundreds are still uncharacterized)
Bioinformatics = molecular biology in silico• ~2% of all recent papers in biological journals• Essential component of biological research• Make predictions about function and regulation of genes
(many quite reliable!)• Metabolic reconstruction and prediction of phenotype given
genome• Identify really interesting cases, fill gaps in knowledge
– “Universally missing genes” – not a single known gene even for ~10% reactions of central metabolism. No genes for >40% reactions overall
– “Conserved hypothetical genes” (5-15% of any bacterial genome) – essential, but unknown function
Haemophilus influenzae, 1995
Vibrio cholerae, 2000
How?
Similarity to known proteins
• Useful for many purposes (allows one to annotate 50-75% genes in a bacterial genome)
• Necessary first step• May be automated
– … to some extent …– in particular, care is needed to avoid too specific
predictions– Problem: propagation of annotation errors
• Boring (nothing new)
Noradrenaline transporter in an archaeon?
SOURCE Methanococcus jannaschii. ORGANISM Methanococcus jannaschii Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus.
FEATURES Location/Qualifiers source 1..492 /organism="Methanococcus jannaschii" /db_xref="taxon:2190" Protein 1..492
/product="sodium-dependent noradrenaline transporter" CDS 1..492 /gene="MJ1319" /note="similar to EGAD:HI0736 percent identity: 38.5;
identified by sequence similarity; putative" /coded_by="U67572:71..1549" /transl_table=11
Now corrected: Hypothetical sodium-dependent transporter MJ1319.
Similarity to hypothetical proteins: somebody else’s errors…
The correct annotation
Genes with curious functional assignments
• C75604: Probable head morphogenesis protein, Deinococcus radiodurans
• O05360: Automembrane protein H, Yersinia enterocolitica
• Q8TID9: Benzodiazepine (valium) receptor TspO, Methanosarcina acetivorans
• NP_069403: DR-beta chain MHC class II, Archaeoglobus fulgidus
Errors in experimental papers
SwissProt:
DEFINITION Hypothetical 43.6 kDa protein.ACCESSION P48012
...
KEYWORDS Hypothetical protein.
SOURCE Debaryomyces occidentalis
ORGANISM Debaryomyces occidentalis
Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Debaryomyces.
[CAUTION] Was originally (Ref.1) thought to be 3-isopropylmalate dehydrogenase (LEU2).
PIR:DEFINITION 3-isopropylmalate dehydrogenase (EC 1.1.1.85)
- yeast(Schwanniomyces occidentalis).
ACCESSION S55845
KEYWORDS oxidoreductase.
SwissProt entry DSDX_ECOLI
-!- CAUTION: An ORF called dsdC was originally (Ref.3) assigned to the wrong DNA strand and thought to be a D-serine deaminase activator, it was then resequenced by Ref.2 and still thought to be "dsdC", but this time to function as a D-serine permease. It is Ref.1 that showed that dsdC is another gene and that this sequence should be called dsdX. It should also be noted that the C-terminal part of dsdX (from 338 onward) was also sequenced (Ref.6 and Ref.7) and was thought to be a separate ORF (don't worry, we also had difficulties understanding what happened!).
Positional clustering
• Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem – mainly in prokaryotes, very weak in eukaryotes– caused by operon structure, but not only
• horizontal transfer of loci containing several functionally linked operons
• compartmentalisation of products in the cytoplasm
– very weak evidence• stronger if observed in may unrelated genomes
• May be measured– e.g. the STRING database/server (P.Bork, EMBL) – and other sources
STRING: trpB –
positional clusters
Functionally dependent genes tend to cluster on chromosomes in many different organisms
Vertical axis: number of gene pairs with association score exceeding a threshold.
Control: same graph, random re-labeling of vertices
More genomes (stronger links) => highly significant clustering
Especially in linear pathways (right)
Fusions
• If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes• Sometimes useful for the analysis of
prokaryotes
STRING: trpB – fusions
Phyletic patterns
• Functionally linked genes tend to occur together
• Enzymes with the same function (isozymes) have complementary phyletic profiles
STRING: trpB – co-
occurrence (phyletic profiles)
Phyletic profiles in the Phe/Tyr pathway
shikimate kinase
Archaeal shikimate-kinaseChorismate biosynthesis pathway (E. coli)
Arithmetics of phyletic patterns
3-dehydroquinate dehydratase (EC 4.2.1.10):Class I (AroD) COG0710 aompkzyq---lb-e----n---i-- Class II (AroQ) COG0757 ------y-vdr-bcefghs-uj----
Two forms combined aompkzyqvdrlbcefghsnuj-i--+
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19) AroA COG0128 aompkzyqvdrlbcefghsnuj-i--
Shikimate dehydrogenase (EC 1.1.1.25):AroE COG0169 aompkzyqvdrlbcefghsnuj-i--
+
Shikimate kinase (EC 2.7.1.71):Typical (AroK) COG0703 ------yqvdrlbcefghsnuj-i--Archaeal-type COG1685 aompkz--------------------
Two forms combined aompkzyqvdrlbcefghsnuj-i--
Chorismate synthase (EC 2.5.1.19) AroC COG0082 aompkzyqvdrlbcefghsnuj-i--
Distribution of association scores (monotonic for subunits,
bimodal for isozymes)
E.g. transporters
• Transporters of end products of metabolic pathways may substitute the entire pathway
• Transporters of compounds for catabolic pathways co-occur with pathways
• Transporters for intermediates substitute upstream parts of pathways
Example: bioY
Other approaches to phyletic patterns
• Gene signatures of lifestyles – e.g. thermophily:
DNA gyrase is the only gene specific to all hyperthermophiles (bacterial and archaeal)
– see COGs
• Regulators and signals
Example: bioR
gene: black arrow;
candidate site: red dot
Comparative analysis of regulation
• Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions
• Consistency filtering: regulons (sets of co-regulated genes) are conserved =>– true sites occur upstream of orthologous genes– false sites are scattered at random
Enzymes
• Identification of a gap in a pathway (universal, taxon-specific, or in individual genomes)
• Search for candidates assigned to the pathway by co-localization and co-regulation (in many genomes)
• Prediction of general biochemical function from (distant) similarity and functional patterns
• Tentative filling of the gap• Verification by analysis of phylogenetic patterns:
– Absence in genomes without this pathway
– Complementary distribution with known enzymes for the same function
Transporters
• Identification of candidates assigned to the pathway by co-localization and co-regulation (in many genomes)
• Prediction of general function by analysis of transmembrane segments and similarity
• Prediction of specificity by analysis of phylogenetic patterns:– End product if present in genomes lacking this pathway
(substituting the biosynthetic pathway for an essential compound)
– Input metabolite if absent in genomes without the pathway (catabolic, also precursors in biosynthetic pathways)
– Entry point in the middle if substituting an upper or side part of the pathway in some genomes
5’ UTR regions of riboflavin genes from bacteria 1 2 2’ 3 Add. 3’ Variable 4 4’ 5 5’ 1’ =========> ==> <== ===> -><- <=== -> <- ====> <==== ==> <== <========= BS TTGTATCTTCGGGG-CAGGGTGGAAATCCCGACCGGCGGT 21 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTTTAA-GCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAT BQ AGCATCCTTCGGGG-TCGGGTGAAATTCCCAACCGGCGGT 19 AGTCCGTGAC-- 8 5 8 -----TGGATCTAGTGAAACTCTAGGGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATATG BE TGCATCCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGATTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATGCC HD TTTATCCTTCGGGG-CTGGGTGGAAATCCCGACCGGCGGT 19 AGTCCGTGAC-- 10 4 10 ----–TGGACCTGGTGAAAATCCGGGACCGACAGTGAA-AGTCTGGAT-GGGAGAAGGAAACG Bam TGTATCCTTCGGGG-CTGGGTGAAAATCCCGACCGGCGGT 23 AGCCCGTGAC-- 8 4 8 ----–TGGATTCAGTGAAAAGCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAG CA GATGTTCTTCAGGG-ATGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCAA--- 3 4 3 ------AGATCCGGTTAAACTCCGGGGCCGACAGTTAA-AGTCTGGAT-GAAAGAAGAAATAG DF CTTAATCTTCGGGG-TAGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCG---- 7 6 7 --------ATTTGGTTAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GGAAGAAGATATTT SA TAATTCTTTCGGGG-CAGGGTGAAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 ----–CTGATCTAGTGAGATTCTAGAGCCGACAGTTAA-AGTCTGGAT-GGGAGAAAGAATGT LLX ATAAATCTTCAGGG-CAGGGTGTAATTCCCTACCGGCGGT 2 AGCCCGCGA--- 4 4 4 -----ATGATTCGGTGAAACTCCGAGGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAATA PN AACTATCTTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGA--- 3 4 3 -----ATGATTTGGTGAAATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAAAA TM AAACGCTCTCGGGG-CAGGGTGGAATTCCCGACCGGCGGT 3 AGCCCGCGAG-- 5 4 5 ----–TTGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAGAGCGTGA DR GACCTCTTTCGGGG-CGGGGCGAAATTCCCCACCGGCGGT 15 AGCCCGCGAA-- 8 12 9 ----–CCGATGCCGCGCAACTCGGCAGCCGACGGTCAC-AGTCCGGAC-GAAAGAAGGAGGAG TQ CACCTCCTTCGGGG-CGGGGTGGAAGTCCCCACCGGCGGT 3 AGCCCGCGAA-- 5 4 5 -----CCGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAAGGAGGGC AO AATAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGCGGT 2 AGTCCGCGA--- 7 7 7 -----AGGAACCGGTGAGATTCCGGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGATGAAA DU TTTAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGTGGT 2 AGTCCGCGA--- 13 4 12 -----AGGAACTAGTGAAATTCTAGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGAGCAGA CAU GAAGACCTTCGGGG-CAAGGTGAAATTCCTGATCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGACCCGGTGTGATTCCGGGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTCGGC FN TAAAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 5 4 5 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GGGAGAAGAATTAG TFU ACGCGTGCTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TGGAACCGGTGAAACTCCGGTACCGACGGTGAA-AGTCCGGAT-GGGAGGTAGTACGTG SX -AGCGCACTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TTGACCAGGTGAAATTCCTGGACCGACGGTTAA-AGTCCGGAT-GGGAGGCAGTGCGCG BU GTGCGTCTTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 30 AGCCCGCGAGCG 137 GTCAGCAGATCTGGTGAGAAGCCAGAGCCGACGGTTAG-AGTCCGGAT-GGAAGAAGATGTGC BPS GTGCGTCTTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCTGGTCCGATGCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGATGTGC REU TTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 31 AGCCCGCGAGCG 7 5 7 GTCAGCAGATCTGGTGAGAGGCCAGGGCCGACGGTTAA-AGTCCGGAT-GAAAGAAGATGGGC RSO GTACGTCTTCAGGG-CGGGGTGGAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 11 3 11 GTCAGCAGATCCGGTGAGATGCCGGGGCCGACGGTCAG-AGTCCGGAT-GGAAGAAGATGTGC EC GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 8 4 8 GACAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAG-AGTCCGGAT-GGGAGAGAGTAACG TY GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 67 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGGGTAACG KP GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 20 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGAGTAACG HI TCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGAGCG 26 9 30 GTCAGCAGATTTGGTGAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAAAGAGAATAAAA VK GCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 14 AGCCCACGAGCG 11 9 11 GTCAGCAGATTTGGTGAGAATCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAGAATAAGC VC CAATATTCTCAGGG-CGGGGCGAAATTCCCCACCGGTGGT 13 AGCCCACGAGCG 5 4 5 GTCAGCAGATCTGGTGAGAAGCCAGGGCCGACGGTTAC-AGTCCGGAT-GAGAGAGAATGACA YP GCTTATTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 40 AGCCCGCGAGCG 16 6 16 GTCAGCAGACCCGGTGTAATTCCGGGGCCGACGGTTAT-AGTCCGGAT-GGGAGAGAGTAACG AB GCGCATTCTCAGGG-CAGGGTGAAAGTCCCTACCGGTGGT 25 AGCCCACGAGCG 16 4 27 GTCAGCAGATTTGGTGCGAATCCAAAGCCGACAGTGAC-AGTCTGGAT-GAAAGAGAATAAAA BP GTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 18 AGCCCGCGAGCG 10 4 10 GTCAGCAGACCTGGTGAGATGCCAGGGCCGACGGTCAT-AGTCCGGAT-GAGAGAAGATGTGC AC ACATCGCTTCAGGG-CGGGGCGTAATTCCCCACCGGCGGT 16 AGCCCGCGAGCA 10 3 11 ---CGCAGATCTGGTGTAAATCCAGAGCCGACGGT-AT-AGTCCGGAT-GAAAGAAGACGACG Spu AACAATTCTCAGGG-CGGGGTGAAACTCCCCACCGGCGGT 34 AGCCCGCGAGCG 6 6 6 GTCAGCAGATCTGGTG 52 TCCAGAGCCGACGGT 31 AGTCCGGAT-GGAAGAGAATGTAA PP GTCGGTCTTCAGGG-CGGGGTGTAAGTCCCCACCGGCGGT 13 AGCCCGCGAGCG 7 3 7 GTCAGCAGATCTGGTGCAACTCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGGCGTCA AU GGTTGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 7 9 7 GTCAGCAGATCCGGTGAGAGGCCGGAGCCGACGGT-AT-AGTCCGGAT-GGAAGAGGACAAGG PU AAACGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 19 AGCCCGCGAGCG 19 4 18 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAC-AGTCCGGATGAAGAGAGAACGGGA PY TAACGTTCTCAGGG-CGGGGTGCAACTCCCCACCGGCGGT 19 AGCCCGCGAGCG 15 4 16 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAT-AGTCCGGATGAAGAGAGAGCGGGA PA TAACGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 19 AGCCCGCGAGCG 14 4 13 GTCAGCAGACCCGGTGCGATTCCGGGGCCGACGGTCAT-AGTCCGGATAAAGAGAGAACGGGA MLO TAAAGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 16 AGCCCGCGAGCG 8 5 8 GTCAGCAGATCCGGTGTGATTCCGGAGCCGACGGTTAG-AGTCCGGAT-GAAAGAGGACGAAA SM AAGCGTTCTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 34 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTCGAATTCCGGAGCCGACGGTTAT-AGTCCGGAT-GGAAGAGAGCAAGC BME GCTTGTTCTCGGGG-CGGGGTGAAACTCCCCACCGGCGGT 17 AGCCCGCGAGCG 10 15 10 GTCAGCAGATCCGGTGAGATGCCGGAGCCGACGGTTAA-AGTCCGGAT-GGAAGAGAGCGAAT BS ATCAATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 18 AGCCCGCGA--- 5 4 5 -----AGGATTCGGTGAGATTCCGGAGCCGACAGT-AC-AGTCTGGAT-GGGAGAAGATGGAG BQ GTCTATCTTCGGGG-CAGGGTGAAAATCCCGACCGGCGGT 27 AGCCCGCGA—-- 3 5 3 -----AGGATTTGGTGTGATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG BE ATTCATCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGAGTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGAAG CA AATGATCTTCAGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCGAG-- 3 4 3 ----TATGATCCGGTTTGATTCCGGAGCCGACAGT-AA-AGTCTGGAT-GAAAGAAGATATAT DF GAAGATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCG---- 6 4 6 -------GATTTGGTGAGATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAGAGAAGATATTT EF GTTCGTCTTCAGGGGCAGGGTGTAATTCCCGACCGGTGGT 3 AGTCCACGAC-- 5 3 5 ----ATTGAATTGGTGTAATTCCAATACCGACAGT-AT-AGTCTGGAT—-AAAGAAGATAGGG LLX AAATATCTTCAGGG-CACCGTGTAATTCGGGACCGGCGGT 21 ACTCCGCGAT-- 4 4 4 ----–TTGAAGCAGTGAGAATCTGCTAGCGACAGT-AA-AGTCTGGAT-GGAAGAAGATGAAC LO GTTCATCTTCGGGG-CAGGGTGCAATTCCCGACCGGTGGT 3 AGTCCACGAT-- 3 10 3 ----TTGACTCTGGTGTAATTCCAGGACCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGTTG PN AAGAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGCGGT 125 AGTCCGTG---- 3 4 3 -------GATGTGGTGAGATTCCACAACCGACAGT-AT-AGTCTGGAT-GGGAGAAGACGAAA ST AAGTGTCTTCAGGG-CAGGGTGTGATTCCCGACCGGCGGT 14 AGTCCGCG---- 3 4 3 -------GATGTGGTGTAACTCCACAACCGACAGT-AT-AGTCTGGAT-GAGAGAAGACCGGG MN AAGTGTCTTCAGGG-CAGGGTGAGATTCCCGACCGGCGGT 104 AGTCCGCG---- 3 4 3 -------GATGTGGTGAAATTCCACAACCGACAGT-AA-AGTCTGGAT-GGGAGAAGACTGAG SA ATTCATCTTCGGGG-TCGGGTGTAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 ----–CTGATCTAGTGAGATTCTAGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG AMI TCACAGTTTCAGGG-CGGGGTGCAATTCCCCACTGGCGGT 14 AGCCCGCGC--- 5 5 5 ------TGATCTGGTGCAAATCCAGAGCCAACGGT-AT-AGTCCGGAT-GGAAGAAACGGAGC DHA ACGAACCTTCGAGG-TAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCAAC-- 11 4 11 --CGACTGACTTGGTGAGACTCCAAGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTACAA FN AATAATCTTCGGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 4 6 4 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GAGAGAAGAAAAGA GLU ---TGTTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 28 AGCCCGCGAGCG 10 4 10 GTCAGCAGATCCGGTTAAATTCCGGAGCCGACGGTCAT-AGTCCGGAT-GCAAGAGAACC---
Conserved secondary structure of the RFN-element
NNNNyYYUC
NNNNrRRAG
NgGGNcCC
rgGGxc
ARRgxuAG
GRCCYG
AcCG
AGCCRGY
GG YRCC
GRYBy CYRVrG N
YGNaA N U U x N
Nx
AGU
UrN A g
Y
variab lestem -loop
additionalstem -loop
3 4
2
1
5
5 ’ 3 ’
u K NRA
xK
*
****
Capitals: invariant (absolutely conserved) positions.
Lower case letters: strongly conserved positions.
Dashes and stars: obligatory and facultative base pairs
Degenerate positions: R = A or G; Y = C or U; K = G or U; B= not A; V = not U. N: any nucleotide. X: any nucleotide or deletion
RFN: the mechanism of regulation
• Transcription attenuation
• Translation attenuation
Early observation: an uncharacterized gene (ypaA) with an upstream RFN element
Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)
duplications
no riboflavin biosynthesis
no riboflavin biosynthesis
YpaA: riboflavin (vitamin B2) transporter in Gram-positive bacteria
• 5 predicted transmembrane segments => a transporter• Upstream RFN element (likely co-regulation with riboflavin
genes) => transport of riboflaving or a precursor• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin
pathway => transport of riboflavinPrediction: YpaA is riboflavin transporter (Gelfand et al., 1999)
Validation:• YpaA transports flavines (riboflavin, FMN, FAD) (by genetic
analysis, Kreneva et al., 2000)• ypaA is regulated by riboflavin (by microarray expression
study, Lee et al., 2001)• … via attenuation of transcription (and to some extent
inhibition of translaition) (Winkler et al., 2003)
A new family of nickel/cobalt transporters
• No experimental data
• No structural data
• Specificity predicted by comparative genomics
• … and then validated in experiment
• Mutational analysis under way
Conserved signal upstream of nrd genes
Identification of the candidate regulator by the analysis of phyletic patterns
• COG1327: the only COG with exactly the same phylogenetic pattern as the signal– “large scale” on the level of major taxa– “small scale” within major taxa:
• absent in small parasites among alpha- and gamma-proteobacteria
• absent in Desulfovibrio spp. among delta-proteobacteria
• absent in Nostoc sp. among cyanobacteria
• absent in Oenococcus and Leuconostoc among Firmicutes
• present only in Treponema denticola among four spirochetes
COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway?
Additional evidence
• sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA
• candidate signals upstream of other replication-related genes
• dNTP salvage• topoisomerase I, replication initiator dnaA,
chromosome partitioning, DNA helicase II
• experimental confirmation in Streptomyces (Borovok et al., 2004)
Multiple sites (nrd genes): FNR, DnaA, NrdR
Mode of regulation
• Repressor (overlaps with promoters)
• Co-operative binding:– most sites occur in tandem (> 90% cases)– the distance between the copies (centers of
palindromes) equals an integer number of DNA turns:
• mainly (94%) 30-33 bp, in 84% 31-32 bp – 3 turns• 21 bp (2 turns) in Vibrio spp.• 41-42 bp (4 turns) in some Firmicutes
Combined regulatory network for iron homeostasis genes in -proteobacteria.
RirA IrrFeS heme
RirA
degraded
FurFe
Fur
Iron uptake systems
Siderophoreuptake
Fe / Feuptake Transcription
factors
2+ 3+
Iron storage ferritins
FeS synthesis
Heme synthesis
Iron-requiring enzymes
[iron cofactor]
IscR
Irr
[- Fe] [+Fe]
[+Fe][- Fe]
[+Fe][ Fe]-
FeS
FeS statusof cell
The connecting line denote regulatory interactions, which the thickness reflecting the frequency of the interaction in the analyzed genomes. The suggested negative or positive mode of operation is shown by dead-end and arrow-end of the line.
Rhizobiales
Bradyrhizobiaceae
Rhizobiaceae
Rhodo-bacterales
Hyphomonadaceae
Rhodo-bacteraceae
Rickettsiales
Rhodo-spirillales
Sphingomo-nadales
- pro
teo
bacte
ria
Organism Irr MntR
Sinorhizobium meliloti
Rhizobium leguminosarum
Rhizobium etli
Agrobacterium tumefaciens
Mesorhizobium loti
Mesorhizobium sp. BNC1
Brucella melitensis
Bartonella quintana and spp.
Bradyrhizobium japonicum
Rhodopseudomonas palustris
Nitrobacter hamburgensis
Nitrobacter winogradskyi
Rhodobacter capsulatus
Rhodobacter sphaeroides
Silicibacter sp. TM1040
Silicibacter pomeroyi
Jannaschia sp.CC51
Rhodobacterales bacterium HTCC2654
Roseobacter sp. MED193
Roseovarius nubinhibens ISM
Roseovarius sp.217
Loktanella vestfoldensis SKA53
Sulfitobacter sp. EE-36
Oceanicola batsensis HTCC2597
Oceanicaulis alexandrii HTCC2633
Caulobacter crescentu s
Parvularcula bermudensis HTCC2503
Erythrobacter litoralis
Novosphingobium aromaticivorans
Sphinopyxis alaskensis g RB2256
Zymomonas mobilis
Gluconobacter oxydans
Rhodospirillum rubrum
Magnetospirillum magneticum
Pelagibacter ubique HTCC1002
SM +
MUR /
FUR RirA IscR
RL
RHE
AGR
ML
MBNC
BME
BQ
BJ
RPA
Nham
Nwi
RC
Rsph
STM
S PO
Jann
RB2654
MED193
ISM
ROS217
SKA53
EE36
OB2597
OA2633
CC
PB2503
ELI
Saro
Sala
ZM
GOX
Rrub
Amb
Abb.
PU1002
+ +- -
+ + +- -
+ + +- -
+ + +- -
+ + -
+ + +- -
+ + +- -
+ + +- -
+ + - -
+
+
+
-
-
+ + - --
+ + - --
+ + - --
+
+
+ ++- ++ ++ - +
+ ++ - +
+ ++ - +
+ + -
+ ++ - +
+ ++ - +
+ + - +
+ ++ - +
+ + - +
+ + - +
+ + - +
+ - +
#?
#?
#?
#?#?
- -
+ - +- -
+ - +- -
+ - +- -
+ - +- -
+ - +- -
+ - +- -
+ +- -
+ - +- -
+ - +- -
- +-
+
+
+
+
Group
Caulobacterales
Parvularculales
Rickettsia and Ehrlichia species - +- --
+ +SAR11 cluster
A.
B.
C.
D.
Fe and Mn regulons
Distribution of Irr,
Fur/Mur, MntR,
RirA, and IscR regulons
in α-proteobacteria
#?' in RirA column denotesthe absence of the rirA gene in an unfinished genomic sequence and the presence of candidate RirA-binding sites upstream of the iron uptake genes.
Phylogenetic tree of the Fur family of transcription factors in -proteobacteria - I
Fur in - and - proteobacteria
Fur in - proteobacteria Fur in Firmicutes
in proteobacteria
Fur
MBNC03003593
RB2654 19538AGR C 620
RL mur
Nwi 0013RPA0450
BJ furROS217 18337
Jann 1799SPO2477
STM1w01000993MED193 22541
OB2597 02997SKA53 03101Rsph03000505ISM 15430
GOX0771ZM01411
Saro02001148Sala 1452
ELI1325OA2633 10204
PB2503 04877CC0057
Rrub02001143Amb1009Amb4460
SM murMBNC03003179
BQ fur2BMEI0375
Mesorhizobium sp. BNC1 (I)
Sinorhizobium meliloti
Bartonella quintana
Rhodopseudomonas palustris
Bradyrhizobium japonicum
Caulobacter crescentus
Zmomonas mobilisy
Rhodobacter sphaeroides
Silicibacter sp. TM1040
Silicibacter pomeroyi
Agrobacterium tumefaciens
Rhizobium leguminosarum
Brucella melitensis
Mesorhizobium sp. BNC1 (II)
Rhodobacterales bacterium HTCC2654
Nitrobacter winogradskyiNham 0990 Nitrobacter hamburgensis X14
Jannaschia sp. CC51Roseovarius sp.217
Roseobacter sp. MED193Oceanicola batsensis HTCC2597
Loktanella vestfoldensis SKA53
Roseovarius nubinhibens ISM
Gluconobacter oxydans
Erythrobacter litoralis
Novosphingobium aromaticivoransSphinopyxis alaskensis RB2256
Oceanicaulis alexandrii HTCC2633
Rhodospirillum rubrum
Parvularcula bermudensis HTCC2503
Magnetospirillum magneticum (I)
EE36 12413Sulfitobacter sp. EE-36
ECOLIPSEAE
NEIMAHELPY
BACSUHelicobacter pylori : sp|O25671
Bacillus subtilis : P54574sp|
Neisseria meningitidis : sp|P0A0S7
Pseudomonas aeruginosa : sp|Q03456Escherichia coli: P0A9A9sp|
Mur
Fur
Magnetospirillum magneticum (II)
RHE_CH00378Rhizobium etli
PU1002 04436Pelagibacter ubique HTCC1002
Irr
in proteobacteria
proteobacteria
Regulator of manganese uptake genes (sit, mntH)
Regulator of iron uptake and metabolism genes
The A, B, and C groups
of - proteobacteria - Mur
Caulobacter crescentus
Zymomonas mobilis
Gluconobacter oxydans
Erythrobacter litoralis
Novosphingobium aromaticivorans
Rhodospirillum rubrum
Magnetospirillum magneticum
Escherichia coli
Sphinopyxis alaskensis
Parvularcula bermudensis -
Oceanicaulis alexandrii
Bacillus subtilis
Sequence logos for the identified Fur-binding sites in the D group of proteobacteria
Sequence logos for the known Fur-binding sites in Escherichia coli and Bacillus subtilis
Identified Mur-binding sites
Phylogenetic tree of the Fur family of transcription factors in -proteobacteria - II
Fur in - and - proteobacteria
Fur in - proteobacteria Fur in Firmicutes
Irr in proteo-bacteria regulator of ironhomeostasis
proteobacteria Fur
ECOLIPSEAE
NEIMAHELPY
BACSUHelicobacter pylori : sp|O25671
Bacillus subtilis : P54574sp|
Neisseria meningitidis : sp|P0A0S7
Pseudomonas aeruginosa : sp|Q03456Escherichia coli : P0A9A9sp|
Mur /
Fur
Irr-
AGR C 249SM irr
RL irr1RL irr2
MLr5570MBNC03003186
BQ fur1BMEI1955BMEI1563BJ blr1216
RB2654 182SKA53 01126
ROS217 15500ISM 00785
OB2597 14726Jann 1652
Rsph03001693EE36 03493
STM1w01001534MED193 17849
SPOA0445RC irr
RPA2339RPA0424*
BJ irr*Nwi 0035*Nham 1013* Nitrobacter hamburgensis X14
Nitrobacter winogradskyi
Bradyrhizobium japonicum (I)
Agrobacterium tumefaciens
Rhizobium leguminosarum (I)
Mesorhizobium sp. BNC1
Sinorhizobium meliloti
Mesorhizobium loti
Bartonella quintanaBrucella melitensis (I)
Bradyrhizobium japonicum (II)
Rhodobacter sphaeroides
Rhodobacter capsulatusSilicibacter pomeroyi
Silicibacter sp. TM1040Roseobacter sp. MED193
Sulfitobacter sp. EE-36
Jannaschia sp. CC51Oceanicola batsensis HTCC2597Roseovarius nubinhibens ISMRoseovarius sp.217Loktanella vestfoldensis SKA53
Rhodobacterales bacterium HTCC2654
Rhizobium etliRHE CH00106
Rhizobium leguminosarum (II)
Brucella melitensis (II)
Rhodopseudomonas palustris (II)Rhodopseudomonas palustris (I)
PU1002 04361 Pelagibacter ubique HTCC1002
Sequence logos for the identified Irr binding sites in -proteobacteria.
(8 species) - IrrThe A group
The B group (4 species) - Irr
The C group (12 species) - Irr
Phylogenetic tree of the Rrf2 family of transcription factors in -proteobacteria
proteins with the conserved C-X(6-9)-C(4-6)-C motif within effector-responsive domain proteins without a cysteine triad motif
Iron repressor RirA (Rhizobium leguminosarum)
Nitrite/NO-sensing regulator NsrR (Nitrosomonas europeae, Escherichia coli)
Cysteine metabolism repressor CymR(Bacillus subtilis)
Iron-Sulfur cluster synthesis repressor IscR(Escherichia coli)
Positional clustering of rrf2-like genes with:iron uptake and storage genes;
Fe-S cluster synthesis operons;genes involved in nitrosative stress protection;
sulfate uptake/assimilation genes;thioredoxin reductase;
carboxymuconolactone decarboxylase-family genes;
hmc cytochrome operon
Cytochrome complex regulator Rrf2(Desulfovibrio vulgaris)
ZMO0116
GOX0099
Rrub02000219
ZMO0422
Sala_1236
ELI0458
Saro3534
DV Rrf2
OA2633_03246CC1866
Ricket.
Am
b3030
Rrub 02002540
PB2503_09884
STM_3629
MED193_04321
ISM_16015
OB2597_03589
RO
S2
17
_ 20
54
2RB
26
54
040
09
SKA53_
05183RC_0477
Rsph023725SPO2025
EE36_14302
EC IscR
RPA0663GOX1196
Amb0200Rrub_1115
Sa
la_2
595
Sa
r o02
00
1 62
0
CC
2 62
5
PB
250
3 _0
371 2
R rub02002859
RC_0031Rsph023756
AGR_C_1499
RHE_CH01133
RL_1316
AGR_L_2801SMb20994SMc02267
RHE_CH03364RL_3916
MLl4516MLr1674
Rrub02001767Amb1054
ROS217_16231STM_634
MED193_09800
SPO0432Rsph023178
RB2654_19993RC 0780
BQ04990MBNC02002196
MLr1147BMEII0707
AGR_C_344
RL RirA
SMc00785RHE CH00735
OA2633_11510
Nwi_0743
NE NsrR
Amb1318GOX0860RC NsrR
ROS217_15206Rsph03001477
EC
_Ns
rR
SPOA0186
Ricket.
Sala_1049Saro02000305
OB2597_05195ROS217_02155
ROS217_14291
CC0132
SMc01160
BJ blr7974
RL_5159AGR_L_2343
AGR_C_402
AGR_L_1131
SPO3722RHE_CH02777RL_3336
SPO1393
MBNC02000669MLl1642
SMc02238AGR_C_872
RL_619RHE_CH00547
MBNC03004487
RirA
NsrR
IscR
IscR-II
Rhizo biales
Rh o dob acterales
Jann_2366
BS CymR
The A group - RirA (8 species)
(12 species)The C group - RirA
Sequence logos for the identified RirA-binding sites in -proteobacteria
Genes Functions:Iron uptakeIron storageFeS synthesis
Iron usageHeme biosynthesisRegulatory genesManganese uptake
Distribution of the conserved members of the Fe- and Mn-responsive regulons and the predicted RirA, Fur/Mur, Irr, and DtxR binding sites in -proteobacteria
An attempt to reconstruct the history
Acknowledgements• Dmitry Rodionov (comparative genomics)• Andrei Mironov (software)• Alexei Vitreschak (riboswitches)
• Slides:– Michael Galperin (NCBI, Bethesda)– Andrei Osterman (Burnham Institute, San-Diego)
• Collaboration:– Thomas Eitinger (Humboldt University, Berlin) – Co/Ni transporters– Andy Johnston (University of East Anglia) – Fe in alphas
• Funding:– Howard Hughes Medical Institute– Russian Fund of Basic Research– RAS, program “Molecular and Cellular Biology”– INTAS