Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Page 1: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Identify Archaeal and Bacterial Phylogenetic Markers

Dongying Wu

Page 2: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

We have more than 700 complete genome sequences:

1. Select 100 representatives
2. Build gene families
3. Identify families that are present in all organisms in equal numbers
4. Build HMMs and run phylogenetic analysis to identify the true markers

[Figure: Phylogenetic tree of bacteria, built from 31 concatenated marker alignments; labeled clades include Proteobacteria and Firmicutes]

Page 3: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

313,139 genes from 100 genomes => 28,710,015 links

BLASTP: E-value cutoff 1e-10, report 10000 hits

Only BLASTP hits that span at least 80% of the lengths of both genes are kept as links
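The link filter can be sketched in Python as below. This is only an illustration: the `Hit` record and its field names are hypothetical (they are not the BLAST output format), and the span is assumed to be the aligned length divided by the full protein length.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one BLASTP hit (field names are assumptions)."""
    query: str
    subject: str
    q_start: int   # 1-based alignment start on the query
    q_end: int     # alignment end on the query
    q_len: int     # full length of the query protein
    s_start: int
    s_end: int
    s_len: int     # full length of the subject protein
    evalue: float


def is_link(hit, max_evalue=1e-10, min_span=0.80):
    """True if the hit passes the E-value cutoff and spans both genes."""
    q_span = (hit.q_end - hit.q_start + 1) / hit.q_len
    s_span = (hit.s_end - hit.s_start + 1) / hit.s_len
    return hit.evalue <= max_evalue and q_span >= min_span and s_span >= min_span


hits = [
    Hit("geneA", "geneB", 1, 95, 100, 5, 100, 110, 1e-40),   # spans both genes
    Hit("geneA", "geneC", 1, 50, 100, 1, 50, 300, 1e-30),    # too short on both
    Hit("geneA", "geneD", 1, 95, 100, 1, 95, 100, 1e-5),     # E-value too weak
]
links = [(h.query, h.subject) for h in hits if is_link(h)]
print(links)
```

Requiring the span on both the query and the subject keeps a short domain match inside a long multi-domain protein from linking two unrelated families.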

Gene Family Classification

Links (matrix of sequence similarities)

Expansion

Inflation (I=2)

MCL Clustering Algorithm

equilibrium state
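The expansion/inflation loop can be sketched as follows. This is a minimal pure-Python illustration of the MCL idea for small inputs, not the MCL program itself; the added self-loops, the convergence tolerance and the way clusters are read from the final matrix are assumptions made for the sketch.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def mcl(links, n, inflation=2.0, max_iter=100, tol=1e-6):
    """Cluster n genes from a dict {(i, j): similarity} of symmetric links."""
    m = [[0.0] * n for _ in range(n)]
    for (i, j), weight in links.items():
        m[i][j] = m[j][i] = weight
    for i in range(n):
        m[i][i] = 1.0  # self-loop (assumption): keeps every column non-empty
    normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: square the matrix (flow over two-step paths).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to the power I, then renormalize columns.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:  # equilibrium state reached
            break

    # Each row that still carries flow defines one cluster: its non-zero columns.
    clusters = {frozenset(j for j in range(n) if m[i][j] > tol)
                for i in range(n) if max(m[i]) > tol}
    return sorted(sorted(c) for c in clusters)


# Two disconnected triangles of genes should come out as two families.
toy_links = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (3, 4): 1, (4, 5): 1, (3, 5): 1}
print(mcl(toy_links, 6))  # [[0, 1, 2], [3, 4, 5]]
```

A dense matrix is used only for readability; with 313,139 genes a sparse representation would be needed.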

Page 4: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

73686 singletons, 23336 families (239453 genes)

Rules for Families of Markers:

1. The family has to cover all 100 genomes (high universality)
2. Each genome has to have an equal number of family members (high evenness)

$\mathrm{Evenness} = 100 \times e^{-4 \times \sum_i |N_i - N_m| / N_g}$

N_i: the number of gene family members from genome i
N_m: the median of N_i over the 100 genomes
N_g: the total number of genomes

Universality is the genome number a family involves
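The two scores can be sketched as below. The sketch assumes the evenness formula reads 100 × exp(−4 × Σ|Ni − Nm| / Ng), and it reports universality as a percentage of genomes (with 100 genomes this equals the genome count); the function names are illustrative.

```python
import math
from statistics import median


def universality(counts):
    """Percentage of genomes that contain at least one family member."""
    return 100.0 * sum(1 for n in counts if n > 0) / len(counts)


def evenness(counts):
    """100 * exp(-4 * sum(|Ni - Nm|) / Ng), with Nm the median copy number."""
    nm = median(counts)
    ng = len(counts)
    return 100.0 * math.exp(-4.0 * sum(abs(n - nm) for n in counts) / ng)


# counts[i] is the number of family members found in genome i.
single_copy = [1] * 100
one_duplication = [1] * 99 + [2]
print(universality(single_copy), evenness(single_copy))
print(universality(one_duplication), round(evenness(one_duplication), 2))
```

Under this reading, a single extra copy in one of 100 genomes lowers the evenness to about 96, so a cutoff of 80 tolerates a handful of duplications or losses.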

Page 5: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Phylogenetic Marker Identification

Out of the 502 families with high universality:

* 31 phylogenetic markers from AMPHORA

* 39 marker candidates with a high evenness number (>=80) (25 families are either single-copy in every genome or duplicated in one genome, with the two copies co-branching in phylogenetic trees)

Build PHYML trees with the AMPHORA markers and the 25 marker candidates, and compare the tree topologies with the genome tree

Page 6: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

NODAL distance

(TOPD/FMTS)

Page 7: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Split (Robinson-Foulds) Distance (TOPD/FMTS)

Fraction of the internal edges that are bad (0-1)

[Figure: two example trees on the leaves A to G, with each internal edge labeled as a good edge or a bad edge]
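A split distance of this kind could be computed along the following lines. This is a sketch, not the TOPD/FMTS implementation: trees are written as nested tuples of leaf names, an internal edge is treated as bad when its split is missing from the other tree, and the normalization (bad edges over all internal edges of both trees) is an assumption.

```python
def leaves(tree):
    """Set of leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions induced by the internal edges of the tree."""
    all_leaves = leaves(tree)
    reference = min(all_leaves)  # each split is stored as the side without it
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        side = all_leaves - below if reference in below else below
        if 1 < len(side) < len(all_leaves) - 1:  # skip leaf edges and the root
            found.add(side)
        for child in node:
            walk(child)

    walk(tree)
    return found


def split_distance(tree_a, tree_b):
    """Fraction of internal edges whose split is absent from the other tree."""
    a, b = splits(tree_a), splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


t1 = ((("A", "B"), ("C", "D")), ("E", ("F", "G")))
t2 = ((("A", "C"), ("B", "D")), ("E", ("F", "G")))
print(split_distance(t1, t1))  # 0.0
print(split_distance(t1, t2))  # 0.5
```

In the second comparison the splits separating {E, F, G} and {F, G} are shared (good edges), while the two splits around the A/B/C/D rearrangement differ in each tree (bad edges), giving 4 of 8.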

Page 8: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

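[Figure: distances between gene trees and the AMPHORA concatenated genome tree, in two panels. Gene labels are marked by category: ribosomal protein, transcription/translation-related protein, DNA repair protein, protein of other function, AMPHORA marker. The distance between the genome tree and 100 random trees (average, standard deviation) is shown for reference.]

NODAL distance panel (axis 0 to 6), gene labels in the order they appear in the figure:
rRNA16S, ruvB, nusA, rplB, purA, rpsJ, secY, rpsI, pyrH, rpsE, rplP, rplN, rpsC, ruvA, rplF, rplA, serS, rplK, rpsK, priA, smpB, rpsG, guaA, rpsQ, rpsL, rplU, rplO, rpsM, infC, rplS, rplV, rplC, rpsP, rplE, rplT, rplL, rplQ, rpsH, mraW, rpsO, rpsB, rplI, rplM, rplR, ttf, frr, tsf, rplD, radA, rpsS, trmD, coaE, rpmA

SPLIT distance panel (axis 0 to 0.9), gene labels in the order they appear in the figure:
nusA, rpsC, rpsE, priA, rplB, secY, rRNA16S, rpsJ, rpsB, ruvB, guaA, rplN, serS, rplF, frr, rplA, rplE, rplC, infC, rplD, rplK, purA, radA, ruvA, rpsM, pyrH, rplI, rplM, rpsG, rpsL, mraW, rpsI, ttf, rplS, trmD, tsf, rplU, rpsK, rpsP, rplO, rplT, rplV, rpsS, rplP, rpsO, smpB, rpsH, rplQ, rplR, rpsQ, rplL, rpmA, coaE

Distances between gene trees and the AMPHORA concatenated genome tree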

Page 9: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

[Figure: two trees over the same set of representative bacterial genomes, with panels titled "Genome Tree" and "rpmA"; scale bar 0.2 in each panel. Clades labeled on the genome tree: Proteobacteria classes gamma, beta, alpha, delta and epsilon; Spirochaetes; Planctomycetes; Chlamydiae; Chlorobi; Bacteroidetes; Fusobacteria; Actinobacteria; Cyanobacteria; Chloroflexi; Firmicutes.]

Page 10: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Better Tree Comparison is Needed

Not all edges are equal

We need to know how a marker performs at different taxonomic levels and groups

Page 11: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Only 63 completed actinobacterial genomes are included in this study

Basic rules:

Every genome should have only one copy from a family for that family to be counted as a marker candidate (plus/minus 1)

Actinobacteria:

47 pre-GEBA genomes
26 GEBA genomes (16 completed)

Page 12: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

63 genomes (251585 proteins, 18534 large family-proteins)

-> BLASTP (cutoff 1e-10 over 80% span) -> 20460854 links

-> MCL (I=2) -> 38450 MCL clusters

-> 818 clusters (>=62 members and <2000 members)

170 families with 62-64 members:

105 can be marker candidates: universality 100 (size = 63-64) or 98 (size = 62)

Page 13: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

ISSUE ONE:

Are there any markers embedded in the larger clusters?

YchF

ObgE

GTP-binding protein

Page 14: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Automatic Tree Screening:

1. Pick clades with the desirable number of taxa

2. Calculate universality and evenness

3. Generate families and build HMMs

4. Search the HMM profiles against all actinobacterial peptides to see if the families are distinct
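The first two screening steps might look like the sketch below. It assumes a rooted tree written as nested tuples of gene identifiers and a hypothetical `genome_of` lookup from gene to genome; the evenness formula is the assumed form 100 × exp(−4 × Σ|Ni − Nm| / Ng). Only rooted clades are visited, so the complement side of each edge is not examined here.

```python
import math
from collections import Counter
from statistics import median


def clade_leaves(node):
    """All gene identifiers below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    members = []
    for child in node:
        members.extend(clade_leaves(child))
    return members


def screen_clades(tree, genome_of, genomes, min_size, max_size,
                  min_universality=100.0, min_evenness=100.0):
    """Return clades of the wanted size that pass both score thresholds."""
    passed = []

    def walk(node):
        if isinstance(node, str):
            return
        members = clade_leaves(node)
        if min_size <= len(members) <= max_size:
            per_genome = Counter(genome_of[gene] for gene in members)
            counts = [per_genome.get(g, 0) for g in genomes]
            uni = 100.0 * sum(c > 0 for c in counts) / len(genomes)
            nm = median(counts)
            even = 100.0 * math.exp(
                -4.0 * sum(abs(c - nm) for c in counts) / len(genomes))
            if uni >= min_universality and even >= min_evenness:
                passed.append(sorted(members))
        for child in node:
            walk(child)

    walk(tree)
    return passed


# Toy tree: two paralogous families, each with one gene per genome.
genomes = ["g1", "g2", "g3"]
genome_of = {"a1": "g1", "a2": "g2", "a3": "g3",
             "b1": "g1", "b2": "g2", "b3": "g3"}
tree = (("a1", ("a2", "a3")), ("b1", ("b2", "b3")))
print(screen_clades(tree, genome_of, genomes, min_size=3, max_size=3))
```

The whole toy tree has two copies per genome and is skipped by the size filter, while each of its two subtrees is a clean single-copy clade, which is the situation the screening is meant to detect inside large clusters.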

Page 15: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

818 trees (60-2000 genes/tree, built with MUSCLE/FastTree)

155 clades with leaf number = 63, universality = 100, evenness = 100

The Good

murD UDP-N-acetylmuramoyl-L-alanyl-D-glutamate synthetase

Page 16: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

The Bad

serS seryl-tRNA synthetase

Page 17: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

HMMER3 hmmbuild into 155 profiles

Search against the actinobacterial genomes, only keep the following HMMs:

Scenario 1: seed hits E-value <= 1e-20, non-seed hits E-value > 1e-3

Scenario 2:

* Best non-seed hit E-value (En) <= 1e-3
* The worst seed hit E-value <= 1e-17 * En
* The worst seed bit-score is more than twice the non-seed bit-score

[Figure: extreme value distribution of hit scores, marking the best non-seed hit and the worst seed hit]
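The two scenarios can be written as a single check, sketched below. Hits are assumed to be `(evalue, bitscore)` pairs already parsed from the search output, and the "non-seed bit-score" in Scenario 2 is taken to be that of the best non-seed hit; both are assumptions of this sketch.

```python
def keep_hmm(seed_hits, nonseed_hits):
    """Decide whether a profile separates its seeds from everything else.

    seed_hits, nonseed_hits: lists of (evalue, bitscore) tuples.
    """
    worst_seed_e, worst_seed_bits = max(seed_hits, key=lambda hit: hit[0])
    if not nonseed_hits:
        return worst_seed_e <= 1e-20
    best_e, best_bits = min(nonseed_hits, key=lambda hit: hit[0])

    if best_e > 1e-3:
        # Scenario 1: no non-seed hit is significant.
        return worst_seed_e <= 1e-20
    # Scenario 2: a non-seed hit is significant, so require a wide gap.
    return worst_seed_e <= 1e-17 * best_e and worst_seed_bits > 2 * best_bits


# Illustrative numbers only.
print(keep_hmm([(1e-50, 170.0), (1e-43, 147.0)], [(2.8e-06, 25.7)]))  # True
print(keep_hmm([(1e-21, 73.3)], [(2.8e-06, 25.7)]))                   # False
print(keep_hmm([(1e-21, 73.3)], [(0.5, 10.0)]))                       # True
```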

Page 18: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

ISSUE TWO:

Are there any marker families torn apart in the clustering process?

BLASTP links

Exclude:
(1) Marker candidates
(2) Large MCL family members (>=1000/family)

Single linkage clustering

Single Linkage Families
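Single linkage clustering over the remaining links amounts to finding connected components, which can be sketched with a union-find structure. The gene names and the `exclude` argument are illustrative.

```python
def single_linkage(links, exclude=frozenset()):
    """Group genes into families: any chain of links joins two genes.

    links: iterable of (gene_a, gene_b) pairs; links touching an excluded
    gene (marker candidates, members of very large families) are skipped.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in links:
        if a in exclude or b in exclude:
            continue
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in families.values())


toy = [("a", "b"), ("b", "c"), ("d", "e"), ("c", "x")]
print(single_linkage(toy, exclude={"x"}))  # [['a', 'b', 'c'], ['d', 'e']]
```

Because one link is enough to merge two groups, single linkage recovers families that MCL split, at the cost of chaining unrelated genes together; excluding the large families first limits that effect.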

Page 19: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

ISSUE THREE: Misplaced deep branch

lepA GTP-binding protein

Page 20: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

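136 actinobacterial markers from 63 actinobacterial genomes

[Figure: two charts partitioning the 136 markers. The first chart, by copy number, has the categories "one copy/genome", "one duplication in one genome" and "one deletion in one genome", with the counts 18, 22 and 96 printed on it. The second chart, by how the family was found, has the categories "original MCL clusters", "tree-based picking from MCL clusters", "single-linkage clusters" and "tree topology correction", with the counts 93, 32, 9 and 2 printed on it.]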

Page 21: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Select completed genomes from IMG for the following groups:

Archaea
Actinobacteria
Alphaproteobacteria
Betaproteobacteria
Gammaproteobacteria
Deltaproteobacteria
Epsilonproteobacteria
Bacteroidetes
Chlamydiae
Chloroflexi
Cyanobacteria
Firmicutes
Spirochaetes
Thermi
Thermotogae

Use wget to get the sequences from the website

Gene marker identification pipeline (BLASTP, MCL clustering, tree building, clade evaluation)

Page 22: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Screen gene markers for any given taxonomic group

Phylogenetic group       Genome Number   Gene Number   Marker Candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroidetes            25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684

Page 23: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Cluster HMM profiles

HMM profile -> one consensus sequence

Consensus Sequences for all HMMs (5133 lineage specific families + 56 Bacterial marker families)

All vs All BLASTP (E value cutoff = 1e-3)

Single Linkage Clustering -> 570 clusters (size >= 2) -> build 404 trees

Page 24: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Example of a tree (SIN323: carbamoyl-phosphate synthase)

Page 25: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Sampling and Analyzing the Tree Automatically

Split the tree one edge at a time

Evaluate evenness

If evenness = 100 and each group is represented by a single copy:

HMM profile building and search against all the consensus sequences

Page 26: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Are the seed peptides distinct? What cutoff to use?

A sample of HMM search output (the seeds are marked red)

THERMI894   RELAX5_SIN270.tre.ID542.faa.trim   3.8e-57   191.5
THERMO479   RELAX5_SIN270.tre.ID542.faa.trim   4.8e-56   187.9
EPSI254     RELAX5_SIN270.tre.ID542.faa.trim   7.1e-56   187.3
ALPHA58     RELAX5_SIN270.tre.ID542.faa.trim   4.2e-55   184.8
CHLOFL232   RELAX5_SIN270.tre.ID542.faa.trim   8.5e-52   174.0
BARIO164    RELAX5_SIN270.tre.ID542.faa.trim   4.5e-51   171.7
CHLAM424    RELAX5_SIN270.tre.ID542.faa.trim   1.3e-50   170.1
GAMMA93     RELAX5_SIN270.tre.ID542.faa.trim   1.2e-43   147.4
CYANO551    RELAX5_SIN270.tre.ID542.faa.trim   1.4e-42   144.0
ARCH63      RELAX5_SIN270.tre.ID542.faa.trim   7.2e-21   73.3
CYANO18     RELAX5_SIN270.tre.ID542.faa.trim   2.8e-06   25.7
THERMO26    RELAX5_SIN270.tre.ID542.faa.trim   3.1e-06   25.6
EPSI354     RELAX5_SIN270.tre.ID542.faa.trim   6.5e-05   21.3

Page 27: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

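Exponential curve fitting to identify the hmmsearch E-value cutoff

$\lg(E_{\mathrm{seed\_cutoff}} / E_{\mathrm{top\_none\_seed}}) = A \, e^{B \cdot \lg(E_{\mathrm{top\_none\_seed}})}$

[Plot: lg(Eseed_cutoff/Etop_none_seed) on the vertical axis against lg(Etop_none_seed) on the horizontal axis, with the fitted constants A and B]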

Position 1: [x1=lg(1e-3) y1=lg(1e-15/1e-3)] Position 2: [x2=lg(1e-250) y2=lg(1e-1000/1e-250)]
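The constants A and B follow from the two positions, since an exponential y = A·e^(B·x) is fixed by two points. The sketch below derives them and evaluates the resulting seed cutoff; it assumes lg means log base 10, and the function names are illustrative.

```python
import math


def fit_exponential(point1, point2):
    """Solve y = A * exp(B * x) through two (x, y) points with y1 * y2 > 0."""
    (x1, y1), (x2, y2) = point1, point2
    b = math.log(y2 / y1) / (x2 - x1)
    a = y1 / math.exp(b * x1)
    return a, b


def lg_seed_cutoff(lg_top_nonseed, a, b):
    """lg of the seed E-value cutoff for a given lg(top non-seed E-value)."""
    return lg_top_nonseed + a * math.exp(b * lg_top_nonseed)


# Position 1: x = lg(1e-3),   y = lg(1e-15 / 1e-3)     -> (-3, -12)
# Position 2: x = lg(1e-250), y = lg(1e-1000 / 1e-250) -> (-250, -750)
A, B = fit_exponential((-3.0, -12.0), (-250.0, -750.0))
print(round(A, 3), round(B, 5))
print(lg_seed_cutoff(-3.0, A, B))    # close to -15
print(lg_seed_cutoff(-250.0, A, B))  # close to -1000
```

Working in lg units avoids floating-point underflow, since E-values such as 1e-1000 cannot be represented as ordinary doubles.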

Page 28: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Get all the potential groups that can be marker candidates

A. One consensus sequence from one phylogenetic group in one clade
B. The sequences are distinct from other sequences

Overlap problems and solutions

Page 29: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Example of a tree (SIN323: carbamoyl-phosphate synthase)

Page 30: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

684 families that span multiple taxonomic groups

[Figure: family number plotted against family size, shown as a simple distribution and as a cumulative distribution]

383 families that span >=4 taxonomic groups

Page 31: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

We have 382 clades (including whole trees) that are potential marker families, each spanning at least four different taxonomic groups

Example: Family 00001 (ribosomal protein S4)

Included: Archaea, Alphaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Thermi, Thermotogae

Not included: Betaproteobacteria, Deltaproteobacteria, Actinobacteria, Firmicutes, Spirochaetes

Page 32: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

A group of HMM profiles

Combine the seeds and build a new profile HMM

Search one group that is missing from the HMM list

Get a large number of hits from the top (2 x genome number) and mark the very top hits

Tree building, and evaluate the clades:
(1) Must include the very top hits
(2) HMM building from the clades to estimate uniqueness

*Manual examinations are required in some cases

Search a group of genomes using an HMM profile of insiders

Search a group of genomes using a distant HMM profile

Page 33: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

All the peptides from a given family

-> MUSCLE -> Alignment

-> HMM profile

-> HMM search against the database of all complete genomes

-> Look through the HMM search results and determine whether the HMM can distinguish family members from others

Page 34: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Tree building

Alignment ZORRO mask

Page 35: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Use 0.1 as the first round ZORRO cutoff

Trim the alignments and calculate the second ZORRO mask score

Page 36: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Ribosomal protein S4 PHYML tree (MF00001)

Build PHYML trees for all the families (alignments trimmed by the second ZORRO mask)

Page 37: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Monophyletic Analysis

In a gene tree, a list of taxa that is assumed to be monophyletic can end up divided into separate clades

A monophyletic value is designed to estimate quantitatively whether a given list of taxa is monophyletic

Page 38: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Shannon entropy measures uncertainty in a dataset

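All taxa from a phylum form one monophyletic clade (100% of the taxa in a single clade): uncertainty -> 0

All taxa from a phylum spread into N clades with proportions p1, p2, p3, ... (p1 + p2 + p3 = 1 in the three-clade example): uncertainty increases if
(1) the number of clades increases
(2) the evenness of the proportions increases

Shannon entropy calculation:

$H = -\sum_{i=1}^{N} p_i \log p_i$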
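As a worked illustration, the standard Shannon entropy H = −Σ p_i log p_i can be computed from the number of taxa that fall into each clade. The natural logarithm is an assumption here, since the base is not stated; the function name is illustrative.

```python
import math


def shannon_entropy(clade_sizes):
    """Entropy of the proportions of taxa in each clade (natural log)."""
    sizes = [n for n in clade_sizes if n > 0]
    total = sum(sizes)
    return sum((n / total) * math.log(total / n) for n in sizes)


print(shannon_entropy([100]))         # one monophyletic clade
print(shannon_entropy([90, 10]))      # two clades, uneven
print(shannon_entropy([50, 50]))      # two clades, even
print(shannon_entropy([34, 33, 33]))  # three clades, even
```

The four cases show both effects named above: the value is 0 for a single clade, and it grows when the taxa are spread over more clades or spread more evenly.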

Page 39: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Calculate Shannon entropy for 100 taxa distributed in N bins (N = 2..10); repeat the calculation for 10,000 random simulations for each N
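A simulation of this kind could be set up as follows. How the taxa were distributed over the bins is not stated, so the sketch assumes each taxon is placed in a bin uniformly at random; the seed and the natural logarithm are also assumptions.

```python
import math
import random
from collections import Counter


def entropy(sizes):
    """Shannon entropy (natural log) of a list of bin sizes."""
    sizes = [n for n in sizes if n > 0]
    total = sum(sizes)
    return sum((n / total) * math.log(total / n) for n in sizes)


def simulate(n_bins, n_taxa=100, trials=10000, seed=0):
    """Entropy values for random assignments of n_taxa taxa to n_bins bins."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        bins = Counter(rng.randrange(n_bins) for _ in range(n_taxa))
        values.append(entropy(list(bins.values())))
    return values


# A reduced number of trials keeps the example quick to run.
for n in range(2, 11):
    values = simulate(n, trials=200)
    print(n, round(sum(values) / len(values), 3), round(math.log(n), 3))
```

Each printed row compares the mean simulated entropy with log(N), the maximum reached when the N bins are perfectly even.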

[Figure: distribution of the simulated Shannon entropy values; axes: H and sample number]

Monophyletic Value = 100 xShannon Entropy

Page 40: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

05/04/10

[Figure: monophyly value plotted against Shannon entropy]

Page 41: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

Ribosomal protein S4 PHYML tree (MF00001)

Page 42: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

161 families are kept

For at least 4 taxonomic groups:

Universality * Evenness * Monophyly >= 90*90*90

PMPROK00023: ribosome recycling factor

LIST:ARCH UNIVERSALITY:NA EVENNESS:NA MONOPHYLY:NA
LIST:BACT UNIVERSALITY:99.67 EVENNESS:98.68 MONOPHYLY:NA
LIST:ACTINO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:78.84
LIST:BARIO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:59.78
LIST:CHLAM UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
LIST:CHLOFL UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:60.37
LIST:CYANO UNIVERSALITY:100.00 EVENNESS:81.04 MONOPHYLY:100.00
LIST:FIRM UNIVERSALITY:99.06 EVENNESS:100.00 MONOPHYLY:85.98
LIST:SPIRO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:53.75
LIST:THERMI UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
LIST:THERMO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
LIST:PROTEO UNIVERSALITY:99.69 EVENNESS:100.00 MONOPHYLY:44.61
LIST:ALPHA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:63.18
LIST:BETAGAMMA UNIVERSALITY:99.45 EVENNESS:100.00 MONOPHYLY:97.47
LIST:BETA UNIVERSALITY:98.21 EVENNESS:100.00 MONOPHYLY:100.00
LIST:GAMMA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:79.67
LIST:DELTA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:88.17
LIST:EPSI UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
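The selection rule can be applied to records like these with a short parser, sketched below. It assumes one `LIST:` record per line, skips groups with any `NA` score, and counts every listed group (including nested ones such as PROTEO or BETAGAMMA) as a taxonomic group; whether nested groups were counted in the original screen is not stated.

```python
def parse_record(line):
    """Split 'LIST:X UNIVERSALITY:u EVENNESS:e MONOPHYLY:m' into its fields."""
    fields = dict(item.split(":", 1) for item in line.split())

    def number(key):
        return None if fields[key] == "NA" else float(fields[key])

    return (fields["LIST"], number("UNIVERSALITY"),
            number("EVENNESS"), number("MONOPHYLY"))


def passing_groups(lines, threshold=90.0 ** 3):
    """Groups whose universality * evenness * monophyly reaches the cutoff."""
    passed = []
    for line in lines:
        name, uni, even, mono = parse_record(line)
        if None in (uni, even, mono):
            continue  # a score is not available for this group
        if uni * even * mono >= threshold:
            passed.append(name)
    return passed


def keep_family(lines, min_groups=4):
    """Keep the family if enough taxonomic groups pass the cutoff."""
    return len(passing_groups(lines)) >= min_groups


records = [
    "LIST:ARCH UNIVERSALITY:NA EVENNESS:NA MONOPHYLY:NA",
    "LIST:ACTINO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:78.84",
    "LIST:BARIO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:59.78",
    "LIST:CHLAM UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00",
    "LIST:CYANO UNIVERSALITY:100.00 EVENNESS:81.04 MONOPHYLY:100.00",
    "LIST:FIRM UNIVERSALITY:99.06 EVENNESS:100.00 MONOPHYLY:85.98",
]
print(passing_groups(records))
print(keep_family(records))
```

Using the product rather than three separate cutoffs lets a group with a perfect score in two measures pass with a third score somewhat below 90, as ACTINO does here with a monophyly of 78.84.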

Page 43: Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu

What is next:

1. Search IMG again to update the seqs and accessions

2. Develop CGI scripts to retrieve user defined markers (calculate universality, evenness and monophyly on the fly)