Molecular biology in silico

Mikhail Gelfand

Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS

AlBio06, Moscow, July 2006

Propaganda

[Figure: growth by year, 1982-2000, on a logarithmic vertical axis (100 to 10,000,000). Red: papers (experiments); blue: sequence fragments.]

Complete genomes

[Figure: complete genomes by year, 1995-2002.]

GOLD db (March 2006): 361 complete genomes

Incomplete (in progress):
• 952 bacteria
• 58 archaea
• 607 eukaryotes (incl. ESTs)
• 46 metagenomes

More propaganda

Most genes will never be studied in experiment
Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized)

Bioinformatics = molecular biology in silico
• ~2% of all recent papers in biological journals
• Essential component of biological research
• Make predictions about function and regulation of genes (many quite reliable!)
• Metabolic reconstruction and prediction of phenotype given a genome
• Identify really interesting cases, fill gaps in knowledge
– “Universally missing genes”: not a single known gene even for ~10% of reactions of central metabolism; no genes for >40% of reactions overall
– “Conserved hypothetical genes” (5-15% of any bacterial genome): essential, but of unknown function

Haemophilus influenzae, 1995

Vibrio cholerae, 2000

How?

Similarity to known proteins

• Useful for many purposes (allows one to annotate 50-75% of genes in a bacterial genome)
• Necessary first step
• May be automated
– … to some extent …
– in particular, care is needed to avoid overly specific predictions
– Problem: propagation of annotation errors
• Boring (nothing new)

Noradrenaline transporter in an archaeon?

SOURCE      Methanococcus jannaschii.
  ORGANISM  Methanococcus jannaschii
            Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus.
FEATURES             Location/Qualifiers
     source          1..492
                     /organism="Methanococcus jannaschii"
                     /db_xref="taxon:2190"
     Protein         1..492
                     /product="sodium-dependent noradrenaline transporter"
     CDS             1..492
                     /gene="MJ1319"
                     /note="similar to EGAD:HI0736 percent identity: 38.5; identified by sequence similarity; putative"
                     /coded_by="U67572:71..1549"
                     /transl_table=11

Now corrected: Hypothetical sodium-dependent transporter MJ1319.

Similarity to hypothetical proteins: somebody else’s errors…

The correct annotation

Genes with curious functional assignments

• C75604: Probable head morphogenesis protein, Deinococcus radiodurans

• O05360: Automembrane protein H, Yersinia enterocolitica

• Q8TID9: Benzodiazepine (valium) receptor TspO, Methanosarcina acetivorans

• NP_069403: DR-beta chain MHC class II, Archaeoglobus fulgidus

Errors in experimental papers

SwissProt:
DEFINITION  Hypothetical 43.6 kDa protein.
ACCESSION   P48012
...
KEYWORDS    Hypothetical protein.
SOURCE      Debaryomyces occidentalis
  ORGANISM  Debaryomyces occidentalis
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Debaryomyces.
[CAUTION] Was originally (Ref.1) thought to be 3-isopropylmalate dehydrogenase (LEU2).

PIR:
DEFINITION  3-isopropylmalate dehydrogenase (EC 1.1.1.85) - yeast (Schwanniomyces occidentalis).
ACCESSION   S55845
KEYWORDS    oxidoreductase.

SwissProt entry DSDX_ECOLI

-!- CAUTION: An ORF called dsdC was originally (Ref.3) assigned to the wrong DNA strand and thought to be a D-serine deaminase activator, it was then resequenced by Ref.2 and still thought to be "dsdC", but this time to function as a D-serine permease. It is Ref.1 that showed that dsdC is another gene and that this sequence should be called dsdX. It should also be noted that the C-terminal part of dsdX (from 338 onward) was also sequenced (Ref.6 and Ref.7) and was thought to be a separate ORF (don't worry, we also had difficulties understanding what happened!).

Positional clustering

• Genes located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
– mainly in prokaryotes, very weak in eukaryotes
– caused by operon structure, but not only:
   • horizontal transfer of loci containing several functionally linked operons
   • compartmentalisation of products in the cytoplasm
– very weak evidence
   • stronger if observed in many unrelated genomes
• May be measured
– e.g. the STRING database/server (P. Bork, EMBL)
– and other sources
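One crude way to measure positional clustering is to count the genomes in which two gene families lie close together. The sketch below assumes each genome is given as an ordered list of family names; the function name, the `max_gap` threshold, and the toy gene orders are illustrative assumptions, not the scoring used by STRING. It ignores strand and chromosome circularity.

```python
def clustering_support(genomes, fam_a, fam_b, max_gap=2):
    """Count genomes in which fam_a and fam_b lie within max_gap positions."""
    support = 0
    for order in genomes.values():
        pos_a = [i for i, gene in enumerate(order) if gene == fam_a]
        pos_b = [i for i, gene in enumerate(order) if gene == fam_b]
        # One close pair per genome is enough to count that genome once.
        if any(abs(i - j) <= max_gap for i in pos_a for j in pos_b):
            support += 1
    return support


# Toy gene orders (hypothetical, not real genome data).
genomes = {
    "genome1": ["trpA", "trpB", "trpC", "x1", "x2"],
    "genome2": ["x3", "trpB", "trpA", "x4"],
    "genome3": ["trpA", "x5", "x6", "x7", "trpB"],
}
print(clustering_support(genomes, "trpA", "trpB"))
```

A pair supported in many unrelated genomes is stronger evidence than a pair seen once, which is why the count is taken over genomes rather than over gene copies.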

STRING: trpB – positional clusters

Functionally dependent genes tend to cluster on chromosomes in many different organisms

Vertical axis: number of gene pairs with association score exceeding a threshold.

Control: same graph, random re-labeling of vertices

More genomes (stronger links) => highly significant clustering

Especially in linear pathways (right)

Fusions

• If two (or more) proteins form a single multidomain protein in some organism, they are all likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of prokaryotes

STRING: trpB – fusions

Phyletic patterns

• Functionally linked genes tend to occur together

• Enzymes with the same function (isozymes) have complementary phyletic profiles

STRING: trpB – co-occurrence (phyletic profiles)

Phyletic profiles in the Phe/Tyr pathway

shikimate kinase

Archaeal shikimate kinase

Chorismate biosynthesis pathway (E. coli)

Arithmetics of phyletic patterns

3-dehydroquinate dehydratase (EC 4.2.1.10):
  Class I (AroD)       COG0710  aompkzyq---lb-e----n---i--
+ Class II (AroQ)      COG0757  ------y-vdr-bcefghs-uj----
= Two forms combined            aompkzyqvdrlbcefghsnuj-i--

5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
  AroA                 COG0128  aompkzyqvdrlbcefghsnuj-i--

Shikimate dehydrogenase (EC 1.1.1.25):
  AroE                 COG0169  aompkzyqvdrlbcefghsnuj-i--

Shikimate kinase (EC 2.7.1.71):
  Typical (AroK)       COG0703  ------yqvdrlbcefghsnuj-i--
+ Archaeal-type        COG1685  aompkz--------------------
= Two forms combined            aompkzyqvdrlbcefghsnuj-i--

Chorismate synthase (EC 4.2.3.5):
  AroC                 COG0082  aompkzyqvdrlbcefghsnuj-i--
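The pattern arithmetic can be checked mechanically. In the minimal sketch below, each pattern is a string with one position per genome, a letter for presence and "-" for absence; combining two forms is a position-wise OR. The function names are illustrative, and the pattern strings are copied from the slide.

```python
def combine(*patterns):
    """Position-wise OR: a genome counts as present if any form is present."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("patterns must have equal length")
    return "".join(
        next((c for c in column if c != "-"), "-")
        for column in zip(*patterns)
    )


def shared(a, b):
    """Genome codes in which both forms occur (isozymes should share few)."""
    return "".join(x for x, y in zip(a, b) if x != "-" and y != "-")


aro_d = "aompkzyq---lb-e----n---i--"       # class I dehydratase
aro_q = "------y-vdr-bcefghs-uj----"       # class II dehydratase
aro_k = "------yqvdrlbcefghsnuj-i--"       # typical shikimate kinase
arch_kinase = "aompkz--------------------"  # archaeal-type shikimate kinase
aro_a = "aompkzyqvdrlbcefghsnuj-i--"       # full-pathway pattern (AroA)

print(combine(aro_d, aro_q) == aro_a)        # True
print(combine(aro_k, arch_kinase) == aro_a)  # True
print(shared(aro_k, arch_kinase) == "")      # True: fully complementary
```

The two kinase forms never co-occur, while the two dehydratase classes overlap in a few genomes; `shared(aro_d, aro_q)` lists those genomes.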

Distribution of association scores (monotonic for subunits, bimodal for isozymes)

E.g. transporters

• Transporters of end products of metabolic pathways may substitute for the entire pathway

• Transporters of compounds for catabolic pathways co-occur with the pathways

• Transporters of intermediates substitute for upstream parts of pathways

Example: bioY

Other approaches to phyletic patterns

• Gene signatures of lifestyles
– e.g. thermophily: reverse gyrase is the only gene specific to all hyperthermophiles (bacterial and archaeal)
– see COGs

• Regulators and signals

Example: bioR

gene: black arrow; candidate site: red dot

Comparative analysis of regulation

• Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions

• Consistency filtering: regulons (sets of co-regulated genes) are conserved =>
– true sites occur upstream of orthologous genes
– false sites are scattered at random
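Consistency filtering can be sketched as a simple vote across genomes. The example assumes two inputs that would come from upstream steps: the genes with a candidate site in each genome, and a table of orthologous groups. All names, the `min_genomes` threshold, and the toy data are hypothetical.

```python
def consistent_sites(hits, orthologs, min_genomes=3):
    """Keep ortholog groups with candidate sites in at least min_genomes genomes.

    hits:      {genome: set of genes with a candidate site upstream}
    orthologs: {group: {genome: gene}}
    """
    kept = {}
    for group, members in orthologs.items():
        supporting = sorted(
            genome for genome, gene in members.items()
            if gene in hits.get(genome, set())
        )
        if len(supporting) >= min_genomes:
            kept[group] = supporting
    return kept


# Toy input (hypothetical gene and genome names).
hits = {"g1": {"a1", "z1"}, "g2": {"a2"}, "g3": {"a3", "q3"}, "g4": set()}
orthologs = {
    "famA": {"g1": "a1", "g2": "a2", "g3": "a3", "g4": "a4"},
    "famZ": {"g1": "z1", "g2": "z2", "g3": "z3"},
    "famQ": {"g3": "q3", "g4": "q4"},
}
print(consistent_sites(hits, orthologs))
```

Here only famA survives: its site recurs upstream of orthologs in three genomes, while the isolated hits for famZ and famQ are treated as likely false positives.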

Enzymes

• Identification of a gap in a pathway (universal, taxon-specific, or in individual genomes)

• Search for candidates assigned to the pathway by co-localization and co-regulation (in many genomes)

• Prediction of general biochemical function from (distant) similarity and functional patterns

• Tentative filling of the gap
• Verification by analysis of phylogenetic patterns:
– Absence in genomes without this pathway
– Complementary distribution with known enzymes for the same function

Transporters

• Identification of candidates assigned to the pathway by co-localization and co-regulation (in many genomes)

• Prediction of general function by analysis of transmembrane segments and similarity

• Prediction of specificity by analysis of phylogenetic patterns:
– End product if present in genomes lacking this pathway (substituting for the biosynthetic pathway for an essential compound)
– Input metabolite if absent in genomes without the pathway (catabolic, also precursors in biosynthetic pathways)
– Entry point in the middle if substituting for an upper or side part of the pathway in some genomes
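The three specificity rules can be written as a small decision function. This is a deliberately crude sketch: it assumes the genome sets are already known, checks the rules in a fixed order, and returns a single label. The function and argument names are illustrative assumptions.

```python
def predict_specificity(with_transporter, with_pathway,
                        with_lower_part_only=frozenset()):
    """Guess what a candidate transporter carries from phyletic patterns.

    with_transporter:      genomes encoding the candidate transporter
    with_pathway:          genomes encoding the complete pathway
    with_lower_part_only:  genomes encoding only the downstream part
    """
    with_transporter = set(with_transporter)
    if with_transporter & set(with_lower_part_only):
        # The transporter replaces the missing upstream steps.
        return "intermediate"
    if with_transporter - set(with_pathway):
        # Present where the pathway is absent: it replaces the pathway.
        return "end product"
    # Never seen without the pathway: it feeds the pathway.
    return "input metabolite"


print(predict_specificity({"g1", "g2"}, {"g1"}))
print(predict_specificity({"g1"}, {"g1", "g2"}))
print(predict_specificity({"g1", "g3"}, {"g1"}, {"g3"}))
```

In practice the patterns are noisy, so a label from such a rule is a hypothesis to be combined with co-localization and co-regulation evidence, not a conclusion.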

5’ UTR regions of riboflavin genes from bacteria

1 2 2’ 3 Add. 3’ Variable 4 4’ 5 5’ 1’
=========> ==> <== ===> -><- <=== -> <- ====> <==== ==> <== <=========

BS  TTGTATCTTCGGGG-CAGGGTGGAAATCCCGACCGGCGGT 21 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTTTAA-GCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAT
BQ  AGCATCCTTCGGGG-TCGGGTGAAATTCCCAACCGGCGGT 19 AGTCCGTGAC-- 8 5 8 -----TGGATCTAGTGAAACTCTAGGGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATATG
BE  TGCATCCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGATTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATGCC
HD  TTTATCCTTCGGGG-CTGGGTGGAAATCCCGACCGGCGGT 19 AGTCCGTGAC-- 10 4 10 -----TGGACCTGGTGAAAATCCGGGACCGACAGTGAA-AGTCTGGAT-GGGAGAAGGAAACG
Bam TGTATCCTTCGGGG-CTGGGTGAAAATCCCGACCGGCGGT 23 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTGAAAAGCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAG
CA  GATGTTCTTCAGGG-ATGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCAA--- 3 4 3 ------AGATCCGGTTAAACTCCGGGGCCGACAGTTAA-AGTCTGGAT-GAAAGAAGAAATAG
DF  CTTAATCTTCGGGG-TAGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCG---- 7 6 7 --------ATTTGGTTAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GGAAGAAGATATTT
SA  TAATTCTTTCGGGG-CAGGGTGAAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGTTAA-AGTCTGGAT-GGGAGAAAGAATGT
LLX ATAAATCTTCAGGG-CAGGGTGTAATTCCCTACCGGCGGT 2 AGCCCGCGA--- 4 4 4 -----ATGATTCGGTGAAACTCCGAGGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAATA
PN  AACTATCTTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGA--- 3 4 3 -----ATGATTTGGTGAAATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAAAA
TM  AAACGCTCTCGGGG-CAGGGTGGAATTCCCGACCGGCGGT 3 AGCCCGCGAG-- 5 4 5 -----TTGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAGAGCGTGA
DR  GACCTCTTTCGGGG-CGGGGCGAAATTCCCCACCGGCGGT 15 AGCCCGCGAA-- 8 12 9 -----CCGATGCCGCGCAACTCGGCAGCCGACGGTCAC-AGTCCGGAC-GAAAGAAGGAGGAG
TQ  CACCTCCTTCGGGG-CGGGGTGGAAGTCCCCACCGGCGGT 3 AGCCCGCGAA-- 5 4 5 -----CCGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAAGGAGGGC
AO  AATAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGCGGT 2 AGTCCGCGA--- 7 7 7 -----AGGAACCGGTGAGATTCCGGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGATGAAA
DU  TTTAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGTGGT 2 AGTCCGCGA--- 13 4 12 -----AGGAACTAGTGAAATTCTAGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGAGCAGA
CAU GAAGACCTTCGGGG-CAAGGTGAAATTCCTGATCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGACCCGGTGTGATTCCGGGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTCGGC
FN  TAAAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 5 4 5 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GGGAGAAGAATTAG
TFU ACGCGTGCTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TGGAACCGGTGAAACTCCGGTACCGACGGTGAA-AGTCCGGAT-GGGAGGTAGTACGTG
SX  -AGCGCACTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TTGACCAGGTGAAATTCCTGGACCGACGGTTAA-AGTCCGGAT-GGGAGGCAGTGCGCG
BU  GTGCGTCTTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 30 AGCCCGCGAGCG 137 GTCAGCAGATCTGGTGAGAAGCCAGAGCCGACGGTTAG-AGTCCGGAT-GGAAGAAGATGTGC
BPS GTGCGTCTTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCTGGTCCGATGCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGATGTGC
REU TTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 31 AGCCCGCGAGCG 7 5 7 GTCAGCAGATCTGGTGAGAGGCCAGGGCCGACGGTTAA-AGTCCGGAT-GAAAGAAGATGGGC
RSO GTACGTCTTCAGGG-CGGGGTGGAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 11 3 11 GTCAGCAGATCCGGTGAGATGCCGGGGCCGACGGTCAG-AGTCCGGAT-GGAAGAAGATGTGC
EC  GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 8 4 8 GACAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAG-AGTCCGGAT-GGGAGAGAGTAACG
TY  GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 67 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGGGTAACG
KP  GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 20 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGAGTAACG
HI  TCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGAGCG 26 9 30 GTCAGCAGATTTGGTGAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAAAGAGAATAAAA
VK  GCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 14 AGCCCACGAGCG 11 9 11 GTCAGCAGATTTGGTGAGAATCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAGAATAAGC
VC  CAATATTCTCAGGG-CGGGGCGAAATTCCCCACCGGTGGT 13 AGCCCACGAGCG 5 4 5 GTCAGCAGATCTGGTGAGAAGCCAGGGCCGACGGTTAC-AGTCCGGAT-GAGAGAGAATGACA
YP  GCTTATTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 40 AGCCCGCGAGCG 16 6 16 GTCAGCAGACCCGGTGTAATTCCGGGGCCGACGGTTAT-AGTCCGGAT-GGGAGAGAGTAACG
AB  GCGCATTCTCAGGG-CAGGGTGAAAGTCCCTACCGGTGGT 25 AGCCCACGAGCG 16 4 27 GTCAGCAGATTTGGTGCGAATCCAAAGCCGACAGTGAC-AGTCTGGAT-GAAAGAGAATAAAA
BP  GTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 18 AGCCCGCGAGCG 10 4 10 GTCAGCAGACCTGGTGAGATGCCAGGGCCGACGGTCAT-AGTCCGGAT-GAGAGAAGATGTGC
AC  ACATCGCTTCAGGG-CGGGGCGTAATTCCCCACCGGCGGT 16 AGCCCGCGAGCA 10 3 11 ---CGCAGATCTGGTGTAAATCCAGAGCCGACGGT-AT-AGTCCGGAT-GAAAGAAGACGACG
Spu AACAATTCTCAGGG-CGGGGTGAAACTCCCCACCGGCGGT 34 AGCCCGCGAGCG 6 6 6 GTCAGCAGATCTGGTG 52 TCCAGAGCCGACGGT 31 AGTCCGGAT-GGAAGAGAATGTAA
PP  GTCGGTCTTCAGGG-CGGGGTGTAAGTCCCCACCGGCGGT 13 AGCCCGCGAGCG 7 3 7 GTCAGCAGATCTGGTGCAACTCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGGCGTCA
AU  GGTTGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 7 9 7 GTCAGCAGATCCGGTGAGAGGCCGGAGCCGACGGT-AT-AGTCCGGAT-GGAAGAGGACAAGG
PU  AAACGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 19 AGCCCGCGAGCG 19 4 18 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAC-AGTCCGGATGAAGAGAGAACGGGA
PY  TAACGTTCTCAGGG-CGGGGTGCAACTCCCCACCGGCGGT 19 AGCCCGCGAGCG 15 4 16 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAT-AGTCCGGATGAAGAGAGAGCGGGA
PA  TAACGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 19 AGCCCGCGAGCG 14 4 13 GTCAGCAGACCCGGTGCGATTCCGGGGCCGACGGTCAT-AGTCCGGATAAAGAGAGAACGGGA
MLO TAAAGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 16 AGCCCGCGAGCG 8 5 8 GTCAGCAGATCCGGTGTGATTCCGGAGCCGACGGTTAG-AGTCCGGAT-GAAAGAGGACGAAA
SM  AAGCGTTCTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 34 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTCGAATTCCGGAGCCGACGGTTAT-AGTCCGGAT-GGAAGAGAGCAAGC
BME GCTTGTTCTCGGGG-CGGGGTGAAACTCCCCACCGGCGGT 17 AGCCCGCGAGCG 10 15 10 GTCAGCAGATCCGGTGAGATGCCGGAGCCGACGGTTAA-AGTCCGGAT-GGAAGAGAGCGAAT
BS  ATCAATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 18 AGCCCGCGA--- 5 4 5 -----AGGATTCGGTGAGATTCCGGAGCCGACAGT-AC-AGTCTGGAT-GGGAGAAGATGGAG
BQ  GTCTATCTTCGGGG-CAGGGTGAAAATCCCGACCGGCGGT 27 AGCCCGCGA--- 3 5 3 -----AGGATTTGGTGTGATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
BE  ATTCATCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGAGTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGAAG
CA  AATGATCTTCAGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCGAG-- 3 4 3 ----TATGATCCGGTTTGATTCCGGAGCCGACAGT-AA-AGTCTGGAT-GAAAGAAGATATAT
DF  GAAGATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCG---- 6 4 6 -------GATTTGGTGAGATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAGAGAAGATATTT
EF  GTTCGTCTTCAGGGGCAGGGTGTAATTCCCGACCGGTGGT 3 AGTCCACGAC-- 5 3 5 ----ATTGAATTGGTGTAATTCCAATACCGACAGT-AT-AGTCTGGAT--AAAGAAGATAGGG
LLX AAATATCTTCAGGG-CACCGTGTAATTCGGGACCGGCGGT 21 ACTCCGCGAT-- 4 4 4 -----TTGAAGCAGTGAGAATCTGCTAGCGACAGT-AA-AGTCTGGAT-GGAAGAAGATGAAC
LO  GTTCATCTTCGGGG-CAGGGTGCAATTCCCGACCGGTGGT 3 AGTCCACGAT-- 3 10 3 ----TTGACTCTGGTGTAATTCCAGGACCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGTTG
PN  AAGAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGCGGT 125 AGTCCGTG---- 3 4 3 -------GATGTGGTGAGATTCCACAACCGACAGT-AT-AGTCTGGAT-GGGAGAAGACGAAA
ST  AAGTGTCTTCAGGG-CAGGGTGTGATTCCCGACCGGCGGT 14 AGTCCGCG---- 3 4 3 -------GATGTGGTGTAACTCCACAACCGACAGT-AT-AGTCTGGAT-GAGAGAAGACCGGG
MN  AAGTGTCTTCAGGG-CAGGGTGAGATTCCCGACCGGCGGT 104 AGTCCGCG---- 3 4 3 -------GATGTGGTGAAATTCCACAACCGACAGT-AA-AGTCTGGAT-GGGAGAAGACTGAG
SA  ATTCATCTTCGGGG-TCGGGTGTAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
AMI TCACAGTTTCAGGG-CGGGGTGCAATTCCCCACTGGCGGT 14 AGCCCGCGC--- 5 5 5 ------TGATCTGGTGCAAATCCAGAGCCAACGGT-AT-AGTCCGGAT-GGAAGAAACGGAGC
DHA ACGAACCTTCGAGG-TAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCAAC-- 11 4 11 --CGACTGACTTGGTGAGACTCCAAGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTACAA
FN  AATAATCTTCGGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 4 6 4 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GAGAGAAGAAAAGA
GLU ---TGTTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 28 AGCCCGCGAGCG 10 4 10 GTCAGCAGATCCGGTTAAATTCCGGAGCCGACGGTCAT-AGTCCGGAT-GCAAGAGAACC---

Conserved secondary structure of the RFN-element

[Figure: consensus secondary structure of the RFN element, drawn from the 5’ to the 3’ end, with helices numbered 1 to 5, a variable stem-loop, and an additional stem-loop. Each position carries a consensus nucleotide in the notation explained below.]

Capitals: invariant (absolutely conserved) positions.

Lower case letters: strongly conserved positions.

Dashes and stars: obligatory and facultative base pairs

Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U; N = any nucleotide; X = any nucleotide or deletion
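The degenerate codes map directly onto regular-expression character classes, which gives a quick way to scan a sequence for one strand of a consensus helix. The sketch below is an illustration only: the function name is an assumption, upper-casing discards the distinction between invariant and strongly conserved positions, and base pairing is not checked.

```python
import re

# RNA alphabet; classes follow the legend (B = not A, V = not U).
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "K": "[GU]",
    "B": "[CGU]", "V": "[ACG]", "N": "[ACGU]",
    "X": "[ACGU]?",  # any nucleotide or a deletion
}


def consensus_to_regex(consensus):
    """Translate a degenerate consensus string into a compiled regex."""
    return re.compile("".join(CODES[ch] for ch in consensus.upper()))


pattern = consensus_to_regex("GRCCYG")
print(bool(pattern.fullmatch("GACCUG")))  # True: R = A, Y = U
print(bool(pattern.fullmatch("GCCCUG")))  # False: C is not allowed at R
```

A real RFN search would also score the complementary strands of each helix; a regex alone only captures the sequence constraints.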

RFN: the mechanism of regulation

• Transcription attenuation

• Translation attenuation

Early observation: an uncharacterized gene (ypaA) with an upstream RFN element

Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)

duplications

no riboflavin biosynthesis

no riboflavin biosynthesis

YpaA: riboflavin (vitamin B2) transporter in Gram-positive bacteria

• 5 predicted transmembrane segments => a transporter
• Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor
• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin

Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)

Validation:
• YpaA transports flavins (riboflavin, FMN, FAD) (by genetic analysis, Kreneva et al., 2000)
• ypaA is regulated by riboflavin (by microarray expression study, Lee et al., 2001)
• … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)

A new family of nickel/cobalt transporters

• No experimental data

• No structural data

• Specificity predicted by comparative genomics

• … and then validated in experiment

• Mutational analysis under way

Conserved signal upstream of nrd genes

Identification of the candidate regulator by the analysis of phyletic patterns

• COG1327: the only COG with exactly the same phylogenetic pattern as the signal
– “large scale”: on the level of major taxa
– “small scale”: within major taxa:

• absent in small parasites among alpha- and gamma-proteobacteria

• absent in Desulfovibrio spp. among delta-proteobacteria

• absent in Nostoc sp. among cyanobacteria

• absent in Oenococcus and Leuconostoc among Firmicutes

• present only in Treponema denticola among four spirochetes
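Searching for a regulator with the same phyletic pattern as a signal amounts to ranking gene families by how often their presence disagrees with the presence of the signal. The sketch below assumes both are given as sets of genome names; the function name, the family identifiers, and the genome names are placeholders, not real COG data.

```python
def rank_by_pattern(signal_genomes, family_genomes):
    """Sort families by the number of genomes where they disagree with the signal."""
    target = set(signal_genomes)
    mismatches = {
        family: len(target ^ set(genomes))  # symmetric difference
        for family, genomes in family_genomes.items()
    }
    return sorted(mismatches.items(), key=lambda item: (item[1], item[0]))


# Toy data (hypothetical family and genome names).
signal = {"g1", "g2", "g4"}
families = {
    "fam_x": {"g1", "g2", "g4"},
    "fam_y": {"g1", "g2", "g3", "g4"},
    "fam_z": {"g3"},
}
ranked = rank_by_pattern(signal, families)
print(ranked[0])  # ('fam_x', 0): identical pattern
```

A family with zero mismatches, on both the large and the small taxonomic scale, is the kind of candidate the slide describes.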

COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway?

Additional evidence

• sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA

• candidate signals upstream of other replication-related genes

• dNTP salvage
• topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II

• experimental confirmation in Streptomyces (Borovok et al., 2004)

Multiple sites (nrd genes): FNR, DnaA, NrdR

Mode of regulation

• Repressor (overlaps with promoters)

• Co-operative binding:
– most sites occur in tandem (> 90% of cases)
– the distance between the copies (centers of palindromes) equals an integer number of DNA turns:
   • mainly (94%) 30-33 bp, in 84% 31-32 bp: 3 turns
   • 21 bp (2 turns) in Vibrio spp.
   • 41-42 bp (4 turns) in some Firmicutes
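The "integer number of turns" observation is simple arithmetic. The sketch below assumes the textbook helical repeat of about 10.5 bp per turn for B-DNA (an assumption, not a value from the slides) and reports how far each spacing is from a whole number of turns.

```python
BP_PER_TURN = 10.5  # assumed textbook helical repeat of B-DNA


def nearest_turns(distance_bp, bp_per_turn=BP_PER_TURN):
    """Return (nearest whole number of turns, offset from it in turns)."""
    turns = distance_bp / bp_per_turn
    whole = round(turns)
    return whole, turns - whole


# Observed spacings from the slide, plus 26 bp as a contrast (half-turn offset).
for distance in (21, 31, 32, 41, 42, 26):
    whole, offset = nearest_turns(distance)
    print(f"{distance} bp: {whole} turns, offset {offset:+.2f}")
```

Spacings of 21, 31-32, and 41-42 bp all fall within a tenth of a turn of 2, 3, and 4 turns, so the two sites face the same side of the helix, which is what co-operative binding would require.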

Combined regulatory network for iron homeostasis genes in α-proteobacteria

[Figure: regulatory network diagram. Labels shown: regulators RirA, Irr, Fur, IscR; signals FeS, heme, Fe, “FeS status of cell”, “degraded”; targets: iron uptake systems (siderophore uptake, Fe2+/Fe3+ uptake), iron storage ferritins, FeS synthesis, heme synthesis, iron-requiring enzymes [iron cofactor], transcription factors; conditions marked [+Fe] and [-Fe].]

The connecting lines denote regulatory interactions, with line thickness reflecting the frequency of the interaction in the analyzed genomes. The suggested negative or positive mode of operation is shown by a dead end or an arrow end of the line, respectively.

Fe and Mn regulons: distribution of the Irr, Fur/Mur, MntR, RirA, and IscR regulons in α-proteobacteria

[Table: presence (+) or absence (-) of the Irr, MntR, Mur/Fur, RirA, and IscR regulons in each genome. Rows are grouped by taxon (Rhizobiales: Rhizobiaceae, Bradyrhizobiaceae; Rhodobacterales: Rhodobacteraceae, Hyphomonadaceae; Caulobacterales; Parvularculales; Sphingomonadales; Rhodospirillales; Rickettsiales; SAR11 cluster) and into groups A, B, C, and D.]

Organisms (abbreviations):
• Sinorhizobium meliloti (SM)
• Rhizobium leguminosarum (RL)
• Rhizobium etli (RHE)
• Agrobacterium tumefaciens (AGR)
• Mesorhizobium loti (ML)
• Mesorhizobium sp. BNC1 (MBNC)
• Brucella melitensis (BME)
• Bartonella quintana and spp. (BQ)
• Bradyrhizobium japonicum (BJ)
• Rhodopseudomonas palustris (RPA)
• Nitrobacter hamburgensis (Nham)
• Nitrobacter winogradskyi (Nwi)
• Rhodobacter capsulatus (RC)
• Rhodobacter sphaeroides (Rsph)
• Silicibacter sp. TM1040 (STM)
• Silicibacter pomeroyi (SPO)
• Jannaschia sp. CC51 (Jann)
• Rhodobacterales bacterium HTCC2654 (RB2654)
• Roseobacter sp. MED193 (MED193)
• Roseovarius nubinhibens ISM (ISM)
• Roseovarius sp. 217 (ROS217)
• Loktanella vestfoldensis SKA53 (SKA53)
• Sulfitobacter sp. EE-36 (EE36)
• Oceanicola batsensis HTCC2597 (OB2597)
• Oceanicaulis alexandrii HTCC2633 (OA2633)
• Caulobacter crescentus (CC)
• Parvularcula bermudensis HTCC2503 (PB2503)
• Erythrobacter litoralis (ELI)
• Novosphingobium aromaticivorans (Saro)
• Sphingopyxis alaskensis RB2256 (Sala)
• Zymomonas mobilis (ZM)
• Gluconobacter oxydans (GOX)
• Rhodospirillum rubrum (Rrub)
• Magnetospirillum magneticum (Amb)
• Pelagibacter ubique HTCC1002 (PU1002)
• Rickettsia and Ehrlichia species

'#?' in the RirA column denotes the absence of the rirA gene in an unfinished genomic sequence together with the presence of candidate RirA-binding sites upstream of the iron uptake genes.

Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - I

[Figure: phylogenetic tree. Labeled branches: Fur in γ- and β-proteobacteria; Fur in ε-proteobacteria; Fur in Firmicutes; Fur, Mur, and Irr in α-proteobacteria. Reference proteins: Escherichia coli (sp|P0A9A9), Pseudomonas aeruginosa (sp|Q03456), Neisseria meningitidis (sp|P0A0S7), Helicobacter pylori (sp|O25671), Bacillus subtilis (sp|P54574).]

Mur: regulator of manganese uptake genes (sit, mntH)

Fur: regulator of iron uptake and metabolism genes

[Figure: sequence logos.
• Identified Mur-binding sites in the A, B, and C groups of α-proteobacteria
• Identified Fur-binding sites in the D group of α-proteobacteria (panels: Caulobacter crescentus, Zymomonas mobilis, Novosphingobium aromaticivorans, Sphingopyxis alaskensis, Magnetospirillum magneticum, Parvularcula bermudensis, Oceanicaulis alexandrii)
• Known Fur-binding sites in Escherichia coli and Bacillus subtilis]

Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - II

[Figure: phylogenetic tree. Labeled branches: Fur in γ- and β-proteobacteria; Fur in ε-proteobacteria; Fur in Firmicutes; Mur/Fur in α-proteobacteria; Irr in α-proteobacteria. Reference proteins: Escherichia coli (sp|P0A9A9), Pseudomonas aeruginosa (sp|Q03456), Neisseria meningitidis (sp|P0A0S7), Helicobacter pylori (sp|O25671), Bacillus subtilis (sp|P54574).]

Irr in α-proteobacteria: regulator of iron homeostasis

[Figure: sequence logos for the identified Irr-binding sites in α-proteobacteria: the A group (8 species), the B group (4 species), and the C group (12 species).]

Phylogenetic tree of the Rrf2 family of transcription factors in α-proteobacteria

[Figure: phylogenetic tree with clades labeled RirA, NsrR, IscR, and IscR-II; Rhizobiales and Rhodobacterales branches are marked. Legend: proteins with the conserved C-X(6-9)-C(4-6)-C motif within the effector-responsive domain; proteins without a cysteine triad motif.]

Characterized family members:
• Iron repressor RirA (Rhizobium leguminosarum)
• Nitrite/NO-sensing regulator NsrR (Nitrosomonas europaea, Escherichia coli)
• Cysteine metabolism repressor CymR (Bacillus subtilis)
• Iron-sulfur cluster synthesis repressor IscR (Escherichia coli)
• Cytochrome complex regulator Rrf2 (Desulfovibrio vulgaris)

Positional clustering of rrf2-like genes with:
• iron uptake and storage genes
• Fe-S cluster synthesis operons
• genes involved in nitrosative stress protection
• sulfate uptake/assimilation genes
• thioredoxin reductase
• carboxymuconolactone decarboxylase-family genes
• hmc cytochrome operon

[Figure: sequence logos for the identified RirA-binding sites in α-proteobacteria: the A group (8 species) and the C group (12 species).]

[Figure: distribution of the conserved members of the Fe- and Mn-responsive regulons and the predicted RirA, Fur/Mur, Irr, and DtxR binding sites in α-proteobacteria. Gene functions shown: iron uptake, iron storage, FeS synthesis, iron usage, heme biosynthesis, regulatory genes, manganese uptake.]

An attempt to reconstruct the history

Acknowledgements
• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)
• Slides:
– Michael Galperin (NCBI, Bethesda)
– Andrei Osterman (Burnham Institute, San Diego)
• Collaboration:
– Thomas Eitinger (Humboldt University, Berlin): Co/Ni transporters
– Andy Johnston (University of East Anglia): Fe in alphas
• Funding:
– Howard Hughes Medical Institute
– Russian Fund of Basic Research
– RAS, program “Molecular and Cellular Biology”
– INTAS
