35
Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih Yu

Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Embed Size (px)

Citation preview

Page 1: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Analysis and comparison of very large metagenomes with fast

clustering and functional annotation

Weizhong Li, BMC Bioinformatics 2009Present by Chuan-Yih Yu

Page 2: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Outline

• Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP)– Goal– Methodology– Metagenome comparison– Conclusion

• Discussion

Page 3: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Goal

• Reduce computation time– Global Ocean Survey(GOS): 1 M CPU Hours = 144 yrs

• Discover the novel gene or protein families– Metagenomic Profiling of Nice Biomes(BIOME) :

~90% sequences unknown– GOS: double the protein families

• Compare metagenome data– Clustering-based– Protein family-based

Page 4: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

RAMMCAP

Page 5: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

RNA

Page 6: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

RAMMCAP

Page 7: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Meta_RNA & tRNA scan‐

• High sensitivity, Low specificity(Except 16S)

“Identification of ribosomal RNA genes in metagenomic fragments.“, Huang, Y., Gilna, P. & Li, W. Z. Bioinformatics

“tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.“, Lowe, T.M. and Eddy, S.R. Nucleic Acids Res

Page 8: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

CLUSTERINGCD-HIT

Page 9: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

RAMMCAP

Page 10: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

CD-HIT

• Greedy incremental clustering algorithm• Whole pairwise alignment avoid• Short word (2~5)• Index table

"Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, et al. Bioinformatics, (2001) "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, et al. Bioinformatics, (2002) "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li, et al. Bioinformatics, (2006).

Page 11: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih
Page 12: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Limitation of CD-HIT

• Evenly distributed mismatches

• Greedy issue– Group in first meet cluster

Page 13: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

CD-HIT Performance

Page 14: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

ORFS CLUSTERING

Page 15: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

RAMMCAP

Page 16: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Why Cluster ORFs

• Function studies• Novel genes finding

Page 17: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

ORF Prediction

• ORF_finder

• Metagene

Page 18: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

ORF Prediction Performance• MetaSim– Average 100, 200, 400, 800 bp, 1 million reads

• True ORF (sensitivity)– Overlap 30 AA with NCBI annotated ORF

• Predicted ORF (specificity)– 50% overlap with true ORF

Page 19: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

ORF Clustering

• Run 1 clustering– 90~95% identity

• Run 2 clustering– 60% identity over 80% of length (454)– 30% identity over 80% of length (Sanger)

• Merge run 1 & 2 result

Page 20: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Clustering Evaluation

• Test sets– GOS-ORF (30%),BIOME (95%),BIOME-ORF (60%)

Page 21: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

BIOME Microbiomes & Viromes

• Microbial sequences are more conserved than viral sequences.

Page 22: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Clustering Quality

• Need conservative threshold• Use only >30 AA Pfam sequence• Discard short sequence in overlapping Pfam

sequence• Place into different cluster– Sequence in the same Pfam, place into different

cluster.

Page 23: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Clustering Validation

• Generate a clusters whose sequences from the same Pfam• Minimize the number of clusters• Good clusters : >95% members from the same Pfam

– >97% sequences are in good clusters– ~30 times more than bad clusters

Number of sequences Number of clusters

Clus

ter S

ize

Page 24: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

RAMMCAP

Page 25: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Protein Family Annotation

• Pfam (24.0, Oct. 2009, 11912 families)– textual descriptions, other resources and literature

references• TIGRFAMs (9.0, Nov. 2009, 3808 models)– GO, Pfam and InterPro models

• COG(2003, 4873 clusters of orthologous groups)– 3 lineages and ancient conserved domain– RPS BLAST(Reverse psi-blast)‐

• E values ≤ 0.001

Page 26: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Novel Protein Families Discovery

• Spurious ORFs in a large size of cluster without homology match may contain novel protein families.

• In GOS only 1.3% of clusters with cluster size 10 map to 93% of true ORFs≧

• In BIOME only 1.0% of clusters with cluster size 5 map to 28% of true ORFs≧

Page 27: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

METAGENOME COMPARISON

Page 28: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Statistical Comparison of Metagenomics

• Occurrence profile coefficient

• z score, why? (not Rodriguez-Brito's require 105 simulated samples)

• Low occurrence cut off

HA=4 (0.95) z=1.96HA=7 (0.99) z=2.58

1.z> cut off2.PA f x P≧ B

Page 29: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Comparison between Rodriguez-Brito's method and z test method.

Page 30: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Clustering-based Comparison

GOS ORF clusters

rAB

No. of cluster

Page 31: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Clustering-based Comparison

• BIOME samples are more diverse than GOS

BIOME clusters

Page 32: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Protein Family-based Comparison

• Merge Pfam, Tigrfam and COG into super families– Pfam- clans, Tigrfam- role categories, and COG-

functional classes• Compare with a specific super family

Page 33: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Protein Family-based Comparison

(a) GOS on COG Class F, (b) GOS on COG Class T, (c) BIOME on COG Class F, (d) BIOME on COG Class T

Page 34: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Conclusion

• RAMMCAP improve performance – CD-HIT– z test

• Novel protein families discovery– ORFs clustering

• Metagenome comparison– Cluster-based– Protein family-based

Page 35: Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih

Discussion

• How much improvement when apply RNA prediction before raw reads?

• How to determine significant factor?– PA f * P≧ B (f>1)