Identify Archaeal and Bacterial Phylogenetic Markers
Dongying Wu
We have more than 700 complete genome sequences:
1. Select 100 representatives
2. Build gene families
3. Identify families that are present in all organisms in equal numbers
4. Build HMMs and run phylogenetic analysis to identify the true markers
Phylogenetic Tree of Bacteria (built from 31 concatenated marker alignments)
[Figure: tree with the Proteobacteria and Firmicutes clades labeled]
313,139 genes from 100 genomes => 28,710,015 links
Blastp: E-value cutoff 1e-10, report 10000 hits
Only blastp hits that span 80% of the lengths of both genes are kept as links
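The link rule above can be sketched as a small filter. This is only an illustration: the function name, its parameters, and the reading of "span" as the aligned coordinate range divided by the gene length are assumptions, not the author's script.

```python
def is_link(evalue, q_start, q_end, q_len, s_start, s_end, s_len,
            max_evalue=1e-10, min_span=0.8):
    """Return True if a blastp hit qualifies as a link between two genes.

    Assumption: a hit "spans" a gene by the fraction of that gene's length
    covered by the aligned coordinate range (1-based, inclusive).
    """
    q_span = (abs(q_end - q_start) + 1) / q_len
    s_span = (abs(s_end - s_start) + 1) / s_len
    return evalue <= max_evalue and q_span >= min_span and s_span >= min_span


# A hit covering 90% of the query but only 45% of the subject is rejected.
print(is_link(1e-30, 1, 90, 100, 1, 90, 200))
```

The coordinates correspond to the start, end and length values that a tabular BLAST report can provide; how they were obtained in the original pipeline is not stated on the slide.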
Gene Family Classification
Links (matrix of sequence similarities)
Expansion
Inflation (I=2)
MCL Clustering Algorithm
equilibrium state
73686 singletons, 23336 families (239453 genes)
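The expansion and inflation steps named above can be illustrated on a toy similarity matrix. This is a minimal pure-Python sketch of the general MCL idea, not the MCL program used in the study; the added self-loops, the convergence tolerance and the way clusters are read from the final matrix are assumptions made for the example.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix (flow spreads along longer paths)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power=2.0):
    """Inflation: raise entries to a power, then renormalize the columns."""
    return normalize_columns([[v ** power for v in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(similarity)
    # Assumption: add self-loops so that every node keeps some flow.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:  # equilibrium state reached
            break
    # Each nonzero row of the final matrix lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > 1e-6) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two groups of three mutually similar genes with no links between groups.
links = [[1.0 if i != j and i // 3 == j // 3 else 0.0 for j in range(6)]
         for i in range(6)]
print(mcl(links))
```

With inflation set to 2, as on the slide, stronger flows are reinforced at every round until the matrix stops changing.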
Rules for Families of Markers:
1. The family has to cover all 100 genomes (high universality)
2. Each genome has to have equal numbers (high evenness)
Evenness = 100 \times e^{-4 \times \frac{\sum_i |N_i - N_m|}{N_g}}
N_i: the number of gene family members from genome i
N_m: the median of N_i over the 100 genomes
N_g: the total number of genomes
Universality is the number of genomes in which a family occurs
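The two scores can be sketched in a few lines. The evenness formula on the slide did not survive extraction cleanly, so this sketch assumes the exponent is -4 times the summed absolute deviation from the median, divided by the number of genomes; the function names are illustrative.

```python
import math
from statistics import median


def universality(counts):
    """Number of genomes that contain at least one member of the family."""
    return sum(1 for n in counts if n > 0)


def evenness(counts):
    """Assumed form: 100 * exp(-4 * sum(|Ni - Nm|) / Ng)."""
    ng = len(counts)                 # Ng: total number of genomes
    nm = median(counts)              # Nm: median copy number
    deviation = sum(abs(n - nm) for n in counts)
    return 100.0 * math.exp(-4.0 * deviation / ng)


# One copy in 99 genomes and two copies in one genome.
counts = [1] * 99 + [2]
print(universality(counts), round(evenness(counts), 2))
```

Under this reading a perfectly single-copy family scores 100, and each extra or missing copy lowers the score by a factor that depends on the number of genomes.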
Phylogenetic Marker Identification
Out of the 502 families with high universality:
* 31 phylogenetic markers from AMPHORA
* 39 marker candidates with a high evenness value (>=80) (25 families are either single-copy in each genome or double-copy in one genome, with the two copies co-branching in phylogenetic trees)
Build PHYML trees with the AMPHORA markers and the 25 marker candidates, and compare the tree topologies with the genome tree
NODAL distance (TOPD/FMTS)
Split (Robinson-Foulds) distance (TOPD/FMTS): ratio of the internal edges being bad (0-1)
[Figure: two example trees with leaves A to G; internal edges are labeled as good edges or bad edges]
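The "ratio of bad internal edges" can be illustrated with a small split comparison. This is a sketch of a normalized Robinson-Foulds style distance on trees written as nested tuples; it is not the TOPD/FMTS implementation, and the tree encoding and function names are assumptions.

```python
def leaf_set(node):
    """All leaf names below a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def internal_splits(tree):
    """Bipartitions induced by internal edges, stored in a canonical form."""
    everything = leaf_set(tree)
    reference = min(everything)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = everything - side
        if len(side) > 1 and len(other) > 1:
            # Keep the side that does not contain the reference leaf.
            found.add(other if reference in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def bad_edge_ratio(tree_a, tree_b):
    """Fraction of internal edges found in only one of the two trees (0-1)."""
    a, b = internal_splits(tree_a), internal_splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
print(bad_edge_ratio(t1, t2))
```

An edge is "good" when the same bipartition of the leaves exists in both trees and "bad" otherwise, which matches the labels on the example trees.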
Distances between gene trees and the AMPHORA concatenated genome tree
[Figure: two panels, NODAL distance (axis 0 to 6) and SPLIT distance (axis 0 to 0.9). Legend: ribosomal protein; transcription/translation related protein; DNA repair protein; protein of other function; AMPHORA marker. Reference shown: distance between the genome tree and 100 random trees (average and standard deviation).]
NODAL distance panel, gene labels row by row as they appear on the slide:
rRNA16S ruvB nusA rplB
purA rpsJ
secY rpsI
pyrH rpsE rplP rplN rpsC ruvA rplF rplA serS rplK rpsK priA
smpB rpsG guaA rpsQ rpsL rplU rplO rpsM infC rplS rplV rplC rpsP rplE rplT rplL rplQ rpsH
mraW rpsO rpsB rplI
rplM rplR
ttf frr tsf
rplD radA rpsS
trmD coaE
rpmA
SPLIT distance panel, gene labels row by row as they appear on the slide:
nusA rpsC rpsE priA rplB secY
rRNA16S rpsJ
rpsB ruvB guaA rplN serS rplF
frr rplA rplE rplC infC rplD rplK
purA radA ruvA rpsM pyrH
rplI rplM rpsG rpsL
mraW rpsI
ttf rplS
trmD tsf
rplU rpsK rpsP rplO rplT rplV rpsS rplP
rpsO smpB rpsH rplQ rplR
rpsQ rplL
rpmA coaE
[Figure: two trees over the same representative bacterial genomes, titled "Genome Tree" and "rpmA", each with a scale bar of 0.2. Clades labeled on the figure: gamma, beta, alpha, delta and epsilon (Proteobacteria), Spirochaetes, Planctomycetes, Chlamydiae, Chlorobi, Bacteroidetes, Fusobacteria, Actinobacteria, Cyanobacteria, Chloroflexi, Firmicutes.]
Better Tree Comparison is Needed
Not all edges are equal
We need to know how a marker performs at different taxonomic levels and groups
Only 63 completed actinobacterial genomes are included in this study
Basic rules:
Every genome should have only one copy from a family for that family to be counted as marker candidate (plus/minus 1)
Actinobacteria:
47 pre-GEBA genomes
26 GEBA genomes (16 completed)
63 genomes (251585 proteins, 18534 large family-proteins)
20460854 links
38450 MCL clusters
818 clusters (>=62 members and <2000 members)
BLASTP (cutoff 1e-10 over 80% span)
MCL (I=2)
170 families with 62-64 members:
105 can be marker candidates: universality 100 (size = 63-64) or 98 (size = 62)
ISSUE ONE:
Are there any markers embedded in the larger clusters?
YchF
ObgE
GTP-binding protein
Automatic Tree Screening:
1. Pick clades with the desirable number of taxa
2. Calculate universality and evenness
3. Generate families and build HMMs
4. Search the HMM profiles against the entire set of actinobacterial peptides to see if the families are distinct
818 trees (60-2000 genes/tree, built with MUSCLE/FastTree)
155 clades with leaf number = 63, universality = 100, evenness = 100
The Good
murD UDP-N-acetylmuramoyl-L-alanyl-D-glutamate synthetase
The Bad
serS seryl-tRNA synthetase
HMMER3 hmmbuild into 155 profiles
Search against the actinobacterial genomes and keep only the HMMs that fit one of the following:
Scenario 1: seed hits have E-value <= 1e-20 and non-seed hits have E-value > 1e-3
Scenario 2:
(1) The best non-seed hit E-value (En) is <= 1e-3
(2) The worst seed hit E-value is <= 1e-17 * En
(3) The worst seed bit-score is more than twice the non-seed bit-score
[Figure: extreme value distribution with the best non-seed hit and the worst seed hit marked]
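The two keep-or-discard scenarios can be written as one predicate. This is a sketch under stated assumptions: hits are given as (E-value, bit-score) pairs, the three conditions of scenario 2 must all hold, the "non-seed bit-score" is taken to be the best non-seed score, and a profile with no non-seed hits at all is judged on the seed threshold alone. None of these details are confirmed by the slide.

```python
def keep_profile(seed_hits, non_seed_hits):
    """Decide whether an HMM separates its seed peptides from everything else.

    Each hit is an (e_value, bit_score) tuple from one profile search.
    """
    worst_seed_e = max(e for e, _ in seed_hits)
    worst_seed_bits = min(b for _, b in seed_hits)
    if not non_seed_hits:
        # Assumption: without non-seed hits only the seed threshold applies.
        return worst_seed_e <= 1e-20
    best_non_e = min(e for e, _ in non_seed_hits)
    best_non_bits = max(b for _, b in non_seed_hits)
    # Scenario 1: strong seed hits, no significant non-seed hits.
    if worst_seed_e <= 1e-20 and best_non_e > 1e-3:
        return True
    # Scenario 2: significant non-seed hits exist but are clearly weaker.
    return (best_non_e <= 1e-3
            and worst_seed_e <= 1e-17 * best_non_e
            and worst_seed_bits > 2 * best_non_bits)


print(keep_profile([(1e-50, 170.0), (1e-43, 147.0)], [(2.8e-6, 25.7)]))
```

The example values are made up for illustration; real input would come from parsing the search output for each of the 155 profiles.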
ISSUE TWO:
Are there any marker families torn apart in the clustering process?
BLASTP links
Exclude:
(1) Marker candidates
(2) Large MCL family members (>=1000/family)
Single Linkage Families
Single linkage clustering
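Single-linkage clustering groups any two genes that are connected by a chain of links. A minimal union-find sketch is given below; the gene identifiers and the function name are hypothetical, and the input is assumed to be the list of BLASTP links that remain after the exclusions above.

```python
def single_linkage(links):
    """Group genes into families: two genes share a family if linked by a chain."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(f) for f in families.values())


print(single_linkage([("g1", "g2"), ("g2", "g3"), ("g4", "g5")]))
```

Because one link is enough to merge two groups, this recovers families that a stricter clustering step may have split.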
ISSUE THREE: Misplaced deep branch
lepA GTP-binding protein
136 actinobacterial markers from 63 Actinobacterial genomes
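[Figure: two breakdowns of the 136 markers. By copy number: one copy per genome, one duplication in one genome, one deletion in one genome (values shown on the slide: 18, 22, 96). By source: original MCL clusters, tree-based picking from MCL clusters, single-linkage clusters, tree topology correction (values shown on the slide: 93, 32, 9, 2).]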
Select completed genomes from IMG for the following groups:
Archaea Actinobacteria
Alphaproteobacteria Bacteroidetes
Betaproteobacteria Chlamydiae
Gammaproteobacteria Chloroflexi
Deltaproteobacteria Cyanobacteria
Epsilonproteobacteria Firmicutes
Spirochaetes
Thermi
Thermotogae
Use wget to get the sequences from the website
Gene marker identification pipeline (BLASTP, MCL clustering, tree building, clade evaluation)
Screen gene markers for any given taxonomic group
Phylogenetic group       Genome number   Gene number   Marker candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroidetes            25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684
Cluster HMM profiles
HMM profile -> one consensus sequence
Consensus Sequences for all HMMs (5133 lineage specific families + 56 Bacterial marker families)
All vs All BLASTP (E value cutoff = 1e-3)
Single Linkage Clustering -> 570 clusters (size >= 2) -> build 404 trees
Example of A tree (SIN323: carbamoyl-phosphate synthase)
Sampling and Analyzing the Tree Automatically
Split the tree one edge at a time
Evaluate evenness
If evenness = 100 and each group is single-copy:
HMM profile building and search against all the consensus sequences
Are the seed peptides distinct? What cutoff to use?
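The edge-by-edge sampling can be sketched on a tree written as nested tuples. The sketch only covers the single-copy test; it assumes that a leaf label starts with an upper-case group prefix followed by a number (as in the consensus names such as THERMI894), and it enumerates only the clade below each edge, not its complement. All function names are illustrative.

```python
import re
from collections import Counter


def leaves(node):
    """Leaf labels below a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def clades(tree):
    """Yield the leaf list below every edge of the tree, one edge at a time."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield leaves(child)
        yield from clades(child)


def group_of(label):
    """Assumed naming scheme: group prefix in capitals, then a number."""
    return re.match(r"[A-Z]+", label).group(0)


def single_copy_per_group(clade):
    counts = Counter(group_of(label) for label in clade)
    return all(n == 1 for n in counts.values())


tree = ((("ALPHA58", "GAMMA93"), "EPSI254"), ("CYANO18", "CYANO551"))
picked = [c for c in clades(tree) if len(c) >= 2 and single_copy_per_group(c)]
print(picked)
```

Clades that pass this test would then go on to HMM building and the search against all consensus sequences.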
A sample of HMM search output (the seeds were marked in red on the slide)
THERMI894 RELAX5_SIN270.tre.ID542.faa.trim 3.8e-57 191.5
THERMO479 RELAX5_SIN270.tre.ID542.faa.trim 4.8e-56 187.9
EPSI254 RELAX5_SIN270.tre.ID542.faa.trim 7.1e-56 187.3
ALPHA58 RELAX5_SIN270.tre.ID542.faa.trim 4.2e-55 184.8
CHLOFL232 RELAX5_SIN270.tre.ID542.faa.trim 8.5e-52 174.0
BARIO164 RELAX5_SIN270.tre.ID542.faa.trim 4.5e-51 171.7
CHLAM424 RELAX5_SIN270.tre.ID542.faa.trim 1.3e-50 170.1
GAMMA93 RELAX5_SIN270.tre.ID542.faa.trim 1.2e-43 147.4
CYANO551 RELAX5_SIN270.tre.ID542.faa.trim 1.4e-42 144.0
ARCH63 RELAX5_SIN270.tre.ID542.faa.trim 7.2e-21 73.3
CYANO18 RELAX5_SIN270.tre.ID542.faa.trim 2.8e-06 25.7
THERMO26 RELAX5_SIN270.tre.ID542.faa.trim 3.1e-06 25.6
EPSI354 RELAX5_SIN270.tre.ID542.faa.trim 6.5e-05 21.3
Exponential curve fitting to identify the hmmsearch E-value cutoff
lg(E_{seed\_cutoff} / E_{top\_non\_seed}) = A \times e^{B \times lg(E_{top\_non\_seed})}
Axes: x = lg(E_top_non_seed), y = lg(E_seed_cutoff / E_top_non_seed)
A and B are set by the two positions below
Position 1: [x1 = lg(1e-3), y1 = lg(1e-15 / 1e-3)]
Position 2: [x2 = lg(1e-250), y2 = lg(1e-1000 / 1e-250)]
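Two points are enough to fix an exponential of the form y = A e^(B x). The sketch below solves for A and B from the two positions on the slide and then turns a top non-seed E-value into a seed cutoff. It works in log10 space throughout because 1e-1000 cannot be stored as a float; the function names are assumptions, and the values of A and B are derived here, not copied from the slide.

```python
import math


def fit_exponential(p1, p2):
    """Solve y = A * exp(B * x) through two points with y values of equal sign."""
    (x1, y1), (x2, y2) = p1, p2
    b = math.log(y2 / y1) / (x2 - x1)
    a = y1 / math.exp(b * x1)
    return a, b


def seed_cutoff_log10(top_non_seed_log10, a, b):
    """lg(E_seed_cutoff) for a given lg(E_top_non_seed)."""
    return top_non_seed_log10 + a * math.exp(b * top_non_seed_log10)


# Position 1: x = lg(1e-3),   y = lg(1e-15 / 1e-3)     = -12
# Position 2: x = lg(1e-250), y = lg(1e-1000 / 1e-250) = -750
a, b = fit_exponential((-3.0, -12.0), (-250.0, -750.0))
print(a, b, seed_cutoff_log10(-3.0, a, b))
```

The curve makes the required gap between seed and non-seed hits grow as the best non-seed hit itself becomes more significant.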
Get all the potential groups that can be marker candidates
A. One consensus sequence from one phylogenetic group in one clade
B. The sequences are distinct from other sequences
Overlap problems and solutions
Example of A tree (SIN323: carbamoyl-phosphate synthase)
684 families that span multiple taxonomic groups
[Figure: family number plotted against family size, shown as a simple distribution and as a cumulative distribution]
383 families that span >=4 taxonomic groups
We have 382 clades (including whole trees) that are potential marker families, each spanning at least four different taxonomic groups
Example: Family 00001 (ribosomal protein S4)
Included                 Not included
Archaea                  Betaproteobacteria
Alphaproteobacteria      Deltaproteobacteria
Gammaproteobacteria      Actinobacteria
Epsilonproteobacteria    Firmicutes
Bacteroidetes            Spirochaetes
Chlamydiae
Chloroflexi
Cyanobacteria
Thermi
Thermotogae
A group of HMM profiles
Combine the seeds and build a new profile HMM
Search one group that is missing from the HMM list
Get a large number of hits from the top (2 x genome number) and mark the very top hits
Build trees and evaluate the clades:
(1) Must include the very top hits
(2) HMM building from the clades to estimate uniqueness
*Manual examinations are required in some cases
Search a group of genomes using an HMM profile of insiders
Search a group of genomes using a distant HMM profile
Alignment
HMM profile
HMM search against the database of all complete genomes
Look through the HMM search results and determine if the HMM can distinguish family members from others
All the peptides for a given family
MUSCLE
Tree building
Alignment ZORRO mask
Use 0.1 as the first round ZORRO cutoff
Trim the alignments and calculate the second ZORRO mask score
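Trimming by a per-column mask can be sketched as follows. The sketch assumes one confidence score per alignment column and drops columns scoring below the cutoff; reading the scores from the masking program's output is not shown, and the function name and the toy data are illustrative.

```python
def trim_alignment(alignment, column_scores, cutoff=0.1):
    """Keep only the alignment columns whose mask score reaches the cutoff.

    alignment: dict mapping sequence name to its aligned string.
    column_scores: one score per column (assumed input format).
    """
    keep = [i for i, score in enumerate(column_scores) if score >= cutoff]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


aligned = {"seq_a": "MK-L", "seq_b": "MRQL"}
scores = [9.5, 0.05, 0.0, 7.2]
print(trim_alignment(aligned, scores))
```

After this first pass the trimmed alignment would be scored again to obtain the second mask.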
Ribosomal protein S4 PHYML tree (MF00001)
Build PHYML trees for all the families (alignments trimmed by the second ZORRO mask)
Monophyletic Analysis
A list of taxa that are assumed to be monophyletic can be divided into separate clades
A monophyletic value is designed to estimate quantitatively whether a given list of taxa is monophyletic
Shannon entropy measures uncertainty in a dataset
All taxa from a phylum form one monophyletic clade (100%): uncertainty -> 0
All taxa from a phylum spread into N clades (proportions p1, p2, p3 with p1+p2+p3=1): uncertainty increases if
(1) The number of clades increases
(2) The evenness increases
Shannon entropy calculation:
Calculate Shannon entropy for 100 taxa distributed in N bins (N=2..10)(repeat the calculation for 10,000 random simulations for each N)
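The simulation can be sketched as follows. The entropy is the standard H = -sum(p_i ln p_i) over the clade proportions; the natural logarithm and the uniform random assignment of taxa to bins are assumptions, since the slide does not state either.

```python
import math
import random
from collections import Counter


def shannon_entropy(clade_sizes):
    """H = -sum(p * ln p) over the proportions of taxa in each clade."""
    sizes = [n for n in clade_sizes if n > 0]
    total = sum(sizes)
    return -sum((n / total) * math.log(n / total) for n in sizes)


def simulate(n_taxa=100, n_bins=3, repeats=10000, seed=0):
    """Entropy values for taxa dropped uniformly at random into n_bins bins."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        bins = Counter(rng.randrange(n_bins) for _ in range(n_taxa))
        values.append(shannon_entropy(bins.values()))
    return values


for n in range(2, 11):
    h = simulate(n_bins=n, repeats=1000)
    print(n, round(sum(h) / len(h), 3))
```

A single clade gives H = 0, and H grows both with the number of clades and with how evenly the taxa are spread across them.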
[Figure: distribution of the simulated Shannon entropy; axes: H and sample number]
Monophyletic Value = 100 xShannon Entropy
05/04/10
[Figure: monophyly value plotted against Shannon entropy]
ribosomal protein PHYML tree (MF00001)
161 families are kept
For at least 4 taxonomic groups:
Universality * Evenness * Monophyly >= 90*90*90
PMPROK00023: ribosome recycling factor
LIST:ARCH UNIVERSALITY:NA EVENNESS:NA MONOPHYLY:NA
LIST:BACT UNIVERSALITY:99.67 EVENNESS:98.68 MONOPHYLY:NA
LIST:ACTINO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:78.84
LIST:BARIO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:59.78
LIST:CHLAM UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
LIST:CHLOFL UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:60.37
LIST:CYANO UNIVERSALITY:100.00 EVENNESS:81.04 MONOPHYLY:100.00
LIST:FIRM UNIVERSALITY:99.06 EVENNESS:100.00 MONOPHYLY:85.98
LIST:SPIRO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:53.75
LIST:THERMI UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
LIST:THERMO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
LIST:PROTEO UNIVERSALITY:99.69 EVENNESS:100.00 MONOPHYLY:44.61
LIST:ALPHA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:63.18
LIST:BETAGAMMA UNIVERSALITY:99.45 EVENNESS:100.00 MONOPHYLY:97.47
LIST:BETA UNIVERSALITY:98.21 EVENNESS:100.00 MONOPHYLY:100.00
LIST:GAMMA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:79.67
LIST:DELTA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:88.17
LIST:EPSI UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
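A report in this LIST/UNIVERSALITY/EVENNESS/MONOPHYLY form can be checked against the 90*90*90 rule with a short parser. This is a sketch: the regular expression, the function names and the choice to skip any group with an NA value are assumptions, and whether combined lists such as PROTEO or BETAGAMMA count toward the four required groups is not stated on the slide.

```python
import re

RECORD = re.compile(
    r"LIST:([A-Z]+) UNIVERSALITY:(NA|[0-9.]+) "
    r"EVENNESS:(NA|[0-9.]+) MONOPHYLY:(NA|[0-9.]+)"
)


def passing_groups(report, threshold=90 * 90 * 90):
    """Names of groups whose three scores multiply to at least the threshold."""
    passed = []
    for name, u, e, m in RECORD.findall(report):
        if "NA" in (u, e, m):
            continue  # assumption: a group with a missing score cannot pass
        if float(u) * float(e) * float(m) >= threshold:
            passed.append(name)
    return passed


def family_is_kept(report, min_groups=4):
    return len(passing_groups(report)) >= min_groups


sample = (
    "LIST:BACT UNIVERSALITY:99.67 EVENNESS:98.68 MONOPHYLY:NA"
    "LIST:ACTINO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:78.84"
    "LIST:BARIO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:59.78"
    "LIST:CHLAM UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00"
)
print(passing_groups(sample))
```

The parser accepts records whether they are separated by line breaks or run together, since each record starts with the LIST: keyword.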
What is next:
1. Search IMG again to update the seqs and accessions
2. Develop CGI scripts to retrieve user defined markers (calculate universality, evenness and monophyly on the fly)