Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
Multivariate exploration of microbial communities
Josh D. Neufeld
Braunschweig, Germany
December 2013
Michael Lynch (PhD): Taxonomy, phylogenetics, ecology
Michael Hall (co-op): mathematics, programming, user friendly!
Andre Masella (MSc): Computer science
Posted on Slideshare without images and unpublished data
Alpha and Beta diversity
Pipelines
Quick history
Future prospects and problems
Species that matter
Who lives with whom, and why, and where?
Data reduction is essential for:
a) summarizing large numbers of observations into manageable numbers
b) visualizing many interconnected variables in a compact manner
Alpha diversity: species richness (and evenness) within a single sample
Beta diversity: change in species composition across a collection of samples
Gamma diversity: total species richness across an environmental gradient
An (abbreviated) history
Numerical ecology: phenetics and statistical analysis of organismal counts
macroecology
16S rRNA gene era: sequence analysis as a surrogate for counting
mapping of marker to taxonomy
NGS: enabled synthesis of phenetics, phylogenetics, and numerical ecology
Now generate V3-V4 bacterial amplicons (~450 bases)
Usually PE 300 (paired-end, 300-base reads)
Assembling paired-end reads dramatically reduces error
Corrects mismatches in the region of overlap (quality threshold >0.9); set a minimum overlap.
Can compare to perfect-overlap assembly: “completelymissesthepoint” (name changing soon)
PANDAseq >30x faster than next fastest alternative assembler
1. p-value threshold 2. parallelizes correctly
(both are now added or fixed in PANDAseq)
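The overlap idea can be illustrated with a toy Python sketch. This is not PANDAseq's algorithm, which scores candidate overlaps probabilistically; the function names, the mismatch-fraction cutoff, and the "keep the higher-quality base" rule are all assumptions made for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def assemble_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=4, max_mismatch=0.2):
    """Toy paired-end assembler (hypothetical helper, not PANDAseq).

    Tries the longest overlap first and accepts the first one whose
    mismatch fraction is at most max_mismatch. Inside the overlap, a
    disagreement is resolved by keeping the base with the higher quality.
    Returns the merged sequence, or None if no overlap is acceptable.
    """
    rev_rc = reverse_complement(rev)
    rev_q = rev_qual[::-1]  # qualities must follow the reversed read
    for overlap in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        tail, head = fwd[-overlap:], rev_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / overlap <= max_mismatch:
            merged = [
                a if qa >= qb else b
                for a, qa, b, qb in zip(tail, fwd_qual[-overlap:], head, rev_q[:overlap])
            ]
            return fwd[:-overlap] + "".join(merged) + rev_rc[overlap:]
    return None

# Made-up 16-base fragment read from both ends with 12-base reads (8-base overlap).
fragment = "ACGTTGCAAGGCTTAC"
fwd = "ACGTTGCAAGTC"                    # one low-quality error: G read as T
fwd_qual = [40] * 10 + [5, 40]
rev = reverse_complement(fragment[4:])  # error-free reverse read
rev_qual = [40] * 12
merged = assemble_pair(fwd, fwd_qual, rev, rev_qual)
print(merged == fragment)
```

In this example the single mismatch in the overlap is resolved in favor of the high-quality reverse read, so the merged sequence matches the original fragment.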
Biological Observation Matrix
BIOM file format (McDonald et al. 2012)
Standard recognized by EMP, MG-RAST, VAMPS
Based on JSON data interchange format
Computational structure in multiple languages
“facilitates the efficient handling and storage of large, sparse biological contingency tables”
Encapsulates metadata and contingency table (e.g., OTU table) in one file
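To make the "sparse table plus metadata in one JSON file" idea concrete, here is a minimal Python sketch using only the standard library. The field names follow my reading of the BIOM 1.0 layout and should be checked against the specification; a real BIOM file carries additional fields that are omitted here, and the OTU IDs, sample IDs, and counts are invented.

```python
import json

# A BIOM-like document: observation (row) and sample (column) metadata travel
# with the counts, and only nonzero cells are stored as [row, column, value].
biom_like = {
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Planctomycetes"]}},
        {"id": "OTU_2", "metadata": None},
        {"id": "OTU_3", "metadata": None},
    ],
    "columns": [
        {"id": "soil_A", "metadata": {"soil_depth": 5}},
        {"id": "soil_B", "metadata": {"soil_depth": 20}},
    ],
    "data": [[0, 0, 12], [1, 1, 7], [2, 0, 3]],
}

def to_dense(table):
    """Expand the sparse [row, column, value] triples into a dense OTU table."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense

text = json.dumps(biom_like)          # what would be written to disk
dense = to_dense(json.loads(text))    # round trip back to a dense table
for row_info, counts in zip(biom_like["rows"], dense):
    print(row_info["id"], counts)
```

For real data, a dedicated BIOM library is the better choice; the sketch only shows why the format suits sparse contingency tables.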
Diversity (richness and evenness)
α-diversity: Richness and Evenness
Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity
Stearns et al., 2011 Hughes et al., 2001
Shannon index (H’): richness and evenness
Estimators: richness
Faith’s PD: phylogenetic richness
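As a rough illustration of two of these measures, the sketch below computes the Shannon index and a Chao1 richness estimate from a vector of OTU counts. It assumes natural logarithms for H’ and the bias-corrected form of Chao1; the counts are invented, and Faith's PD is left out because it needs a phylogenetic tree.

```python
import math

def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Made-up OTU counts for one sample (the last OTU was not observed).
otu_counts = [10, 5, 2, 2, 1, 1, 1, 0]
print("Shannon H':", round(shannon(otu_counts), 3))
print("Chao1:", chao1(otu_counts))
```

With 7 observed OTUs, 3 singletons, and 2 doubletons, Chao1 estimates one additional unseen OTU (8.0 in total).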
“All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community’s ‘true’ diversity.”
“Nonparametric estimators show particular promise for microbial data and in some habitats may require sample sizes of only 200 to 1,000 clones to detect richness differences of only tens of species.”
Hughes et al. 2001
[Figure: Google Scholar proportion of “[Sequencing tech] AND 16S” results for Illumina, 454, and Sanger, and “Rare biosphere” citations, plotted against time (year).]
Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
GOALS
Understanding of community structure
Better alpha-diversity measures
Robust beta-diversity measures
Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
Stearns et al. 2011
Bartram et al. 2011
Clustering algorithms (influence alpha diversity primarily)
CD-HIT (Li and Godzik, Sanford-Burnham Medical Research Institute)
‘longest-sequence-first’ removal algorithm
Fast, many implementations (nucleotide, protein, OTU-specific)
Tends to be more stringent than UCLUST
UCLUST (R. Edgar, drive5.com)
Faster than CD-HIT
Tends to generate a larger number of low-abundance OTUs
Broader range of clustering thresholds
“I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs” – Robert Edgar
CROP: Clustering 16S rRNA for OTU Prediction
“CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
Chimeras
DNA from two or more parent molecules
PCR artifact
Can easily be classified as a “novel” sequence
Increases α-diversity
Software: ChimeraSlayer, Bellerophon, UCHIME, Pintail
Reference database or de novo
Classification and taxonomy
Ribosomal Database Project (RDP) classifier
Naïve Bayesian classifier (James Cole and Tiedje) http://rdp.cme.msu.edu/
pplacer
Phylogenetic placement and visualization
BLAST
The tool we know and love
RTAX (UC Berkeley, Rob Knight involved) http://dev.davidsoergel.com/trac/rtax/
mothur (Patrick Schloss) http://www.mothur.org/
SINA (SILVA)
RDP classifier
Large training sets require active memory management
Can be easily run in parallel by breaking up very large data sets
Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)
Algorithm:
determine the probability that an unknown query sequence is a member of a known genus (training set), based on the profile of word subsets of known genera.
Confidence estimation:
the number of times in 100 trials that a genus was selected based on a random subset of words in the query
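The two steps above (word-based genus scoring, then confidence from 100 resampling trials) can be sketched as a toy Python example. This is not the RDP classifier's code: the 4-base word size, the smoothing formula, the number of words drawn per trial, and the two made-up "genera" are all assumptions chosen to keep the example small.

```python
import math
import random

def words(seq, size=4):
    """Set of overlapping words (k-mers) found in a sequence."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}

def train(genus_to_seqs, size=4):
    """Per genus: how many training sequences contain each word, and how many sequences there are."""
    model = {}
    for genus, seqs in genus_to_seqs.items():
        counts = {}
        for seq in seqs:
            for word in words(seq, size):
                counts[word] = counts.get(word, 0) + 1
        model[genus] = (counts, len(seqs))
    return model

def log_score(word_list, genus_model):
    """Naive Bayes log-likelihood of the words under one genus (simple smoothing, an assumption)."""
    counts, n_seqs = genus_model
    return sum(math.log((counts.get(w, 0) + 0.5) / (n_seqs + 1)) for w in word_list)

def classify(query, model, size=4, trials=100, seed=0):
    """Return (best genus, fraction of trials in which a random word subset picks that genus)."""
    rng = random.Random(seed)
    query_words = sorted(words(query, size))
    best = max(model, key=lambda g: log_score(query_words, model[g]))
    n_sample = max(1, len(query_words) // size)  # subset size is an assumption
    hits = 0
    for _ in range(trials):
        subset = [rng.choice(query_words) for _ in range(n_sample)]
        winner = max(model, key=lambda g: log_score(subset, model[g]))
        hits += winner == best
    return best, hits / trials

# Two invented genera with deliberately different base composition.
training = {"GenusA": ["ACACCACAACCCAACA"], "GenusB": ["GTGTTGTGGTTTGGTG"]}
model = train(training)
genus, confidence = classify("CACAACCCAA", model)
print(genus, confidence)
```

Because every word of the query occurs in the GenusA training sequence and none occurs in GenusB, every trial selects GenusA and the toy confidence is 1.0. Short or ambiguous queries would spread the trials across genera and lower the estimate.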
Take home:
The higher the diversity (bigger sequence space) of the training set, the better the assignment
Longer query = better and more reliable assignment
Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of 0.5 suggested)
Database sources
GreenGenes: latest May 2013
SILVA: latest 115 (August 2013); includes 18S, 23S, 28S, LSU
RDP Database: latest 11 (October 2013)
GenBank
Research-specific, e.g., CORE Oral
Multivariate data reduction
β-diversity
Visualization (ordination) versus hypothesis testing (MRPP, indicator species analysis)
Many more algorithms out there for exploration and statistical testing
mostly through widely used R packages
vegan (Community Ecology Package)
labdsv (Ordination and Multivariate Analysis for Ecology)
ape (Analyses of Phylogenetics and Evolution)
picante (community analyses etc.)
Visualization (ordination)
Complementary to data clustering, which looks for discontinuities
Ordination extracts main trends as continuous axes
analysis of the square matrix derived from the OTU table
Non-parametric, unconstrained ordination methods most widely used (and best suited)
methods that can work directly on a square matrix
An appropriate metric is required to derive this square matrix
many options...
Metrics
Ordination is essentially reducing dimensionality
first requirement: accurately model differences among samples
Models are *really* important. Examples include:
OTU presence/absence: Dice, Jaccard
OTU abundance: Bray-Curtis
Phylogenetic: UniFrac
“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.” - Susan Huse
“all models are wrong, some are useful” - G.E.P. Box
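A short Python sketch of the non-phylogenetic metrics named above, showing how the same pair of samples can look different depending on the metric. The counts are invented, the three functions are written as dissimilarities (0 = identical), and UniFrac is omitted because it needs a tree.

```python
def jaccard_distance(a, b):
    """1 - shared OTUs / OTUs present in either sample (presence/absence only)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)

def dice_distance(a, b):
    """1 - 2 * shared OTUs / (OTUs in a + OTUs in b) (presence/absence only)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    return 1 - 2 * len(present_a & present_b) / (len(present_a) + len(present_b))

def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total counts (abundance-based)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

# Made-up OTU counts; each list is one sample, positions are OTUs.
samples = {
    "s1": [10, 0, 5, 1],
    "s2": [8, 2, 0, 1],
    "s3": [0, 9, 0, 6],
}
names = sorted(samples)
# The square sample-by-sample matrix that ordination methods take as input.
square = [[bray_curtis(samples[r], samples[c]) for c in names] for r in names]
for name, row in zip(names, square):
    print(name, [round(v, 3) for v in row])
```

Samples with no reads at all would cause a division by zero here; a real analysis would filter or handle them first.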
Metrics: UniFrac
A distance measure comparing multiple communities using phylogenetic information
Requires sequence alignment and tree-building
PyNAST, MUSCLE, Infernal
Time-consuming and susceptible to poor phylogenetic inference (does it matter?)
Weighted (abundance): ecological features related to abundance
Unweighted: ecological features related to taxonomic presence/absence
Ordination example 1 (of many): Principal Coordinates Analysis
Classical Multidimensional Scaling (MDS; Gower 1966)
Procedure:
based on eigenvectors
position objects in low-dimensional space while preserving distance relationships as well as possible
highly flexible
can choose among many association measures
In microbial ecology, used for visualizing phylogenetic or count-based distances
Consistent visual output for given distance matrix
Include variance explained (%) on Axis 1 and 2
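The eigenvector procedure can be sketched in plain Python: double-center the squared distances, extract the leading eigenvectors, and scale them by the square root of their eigenvalues. Power iteration with deflation is used here only to avoid external libraries (an assumption for illustration; packaged implementations use a full eigendecomposition), and the four-sample distance matrix is invented.

```python
import math

def double_center(dist):
    """Gower centering of -1/2 * squared distances (dist must be symmetric)."""
    n = len(dist)
    sq = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[sq[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def top_eigenpair(mat, iterations=500):
    """Power iteration: approximate the dominant eigenvalue and its eigenvector."""
    n = len(mat)
    vec = [float(i + 1) for i in range(n)]  # fixed start keeps the output reproducible
    for _ in range(iterations):
        new = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break
        vec = [x / norm for x in new]
    value = sum(vec[i] * mat[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec

def pcoa(dist, n_axes=2):
    """Return per-sample coordinates and the percent of the trace captured by each axis."""
    b = double_center(dist)
    n = len(b)
    trace = sum(b[i][i] for i in range(n))  # equals the sum of all eigenvalues
    axes, explained = [], []
    for _ in range(n_axes):
        value, vec = top_eigenpair(b)
        value = max(value, 0.0)
        axes.append([x * math.sqrt(value) for x in vec])
        explained.append(100 * value / trace)
        # Deflation: remove the axis just found before searching for the next one.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    coords = [list(point) for point in zip(*axes)]
    return coords, explained

# Invented distances: four samples sitting on the corners of a 3 x 4 rectangle.
dist = [
    [0, 3, 4, 5],
    [3, 0, 5, 4],
    [4, 5, 0, 3],
    [5, 4, 3, 0],
]
coords, explained = pcoa(dist)
print([round(v, 1) for v in explained])
```

For this matrix the two axes capture 64% and 36% of the variation, which are the values that would be reported on the axis labels. Non-Euclidean dissimilarities can give negative eigenvalues, which this sketch does not correct for.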
Ordination example 2 (of many): Non-metric Multidimensional Scaling
Ordination not based on eigenvectors
Does not preserve exact distances among objects
attempts to preserve ordering of samples (“ranks”)
Procedure:
iterative: tries to position the objects in a few (2-3) dimensions in a way that minimizes the “stress”
stress: how well does the new ranked distribution of points represent the original distances in the association matrix? Can express as R² on axes 1 and 2.
the adjustment goes on until the stress value reaches a local minimum (heuristic solution)
NMDS often represents distance relationships better than PCoA in the same number of dimensions
Susceptible to the “local minimum issue”, and therefore should have a strong starting point (e.g., PCoA) or many random starts
You won't get the same result each time you run the analysis. Try several runs until you are comfortable with the result.
Do my treatments separate?
Beta-diversity: Hypothesis testing
Multiple methods, implemented in QIIME, mothur, AXIOME
e.g., MRPP, adonis, NP-MANOVA (perMANOVA), ANOSIM
Are treatment effects significant?
Because these are predominantly nonparametric methods, tests for significance rely on permutation
Let's focus on MRPP
Multiresponse Permutation Procedures
Compare intragroup average distances with the average distances that would have resulted from all the other possible combinations
T statistic: more negative with increasing group separation (T>-10 common for ecology)
A statistic: within-group agreement, i.e., the degree of scatter within groups (A=1 when all points within a group fall on top of one another)
p value: likelihood of similar separation with randomized data.
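A minimal Python sketch of the MRPP logic, assuming group-size weighting (n_i / N) and random label permutations. The classical test derives T from an analytic approximation; here T, A, and p are all estimated from the permutations, and the one-dimensional "samples" are invented.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def mrpp_delta(dist, labels):
    """Weighted mean of the average within-group distances (weights n_i / N)."""
    n = len(labels)
    delta = 0.0
    for group in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == group]
        within = mean(dist[i][j] for i, j in combinations(members, 2))
        delta += (len(members) / n) * within
    return delta

def mrpp(dist, labels, n_perm=999, seed=0):
    """Permutation-based MRPP sketch returning delta, T, A, and p."""
    rng = random.Random(seed)
    observed = mrpp_delta(dist, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted.append(mrpp_delta(dist, shuffled))
    expected = mean(permuted)
    return {
        "delta": observed,
        "T": (observed - expected) / pstdev(permuted),  # negative when groups separate
        "A": 1 - observed / expected,                   # chance-corrected agreement
        "p": (sum(d <= observed + 1e-12 for d in permuted) + 1) / (n_perm + 1),
    }

# Invented example: two treatments that separate cleanly along one gradient.
positions = [0.0, 1.0, 2.0, 1.5, 10.0, 11.0, 12.0, 10.5]
labels = ["control"] * 4 + ["treated"] * 4
dist = [[abs(a - b) for b in positions] for a in positions]
result = mrpp(dist, labels)
print({key: round(value, 3) for key, value in result.items()})
```

Every group needs at least two samples for the within-group average to exist. With very small groups the number of distinct permutations is limited, which puts a floor under the attainable p value.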
“PCoA plots are the first step of a community analysis, not the last.”
Josh Neufeld
Searching for species that matter
High dimensional data often have too many features to investigate
solution: identify and study species significantly associated with categorical metadata
Indicator species (Dufrene-Legendre)
calculates indicator value (fidelity and relative abundance) of species
Permutation test for significance
Need solution for sparse data - be wary of:
groups with small numbers of sites (influence on permutation tests)
low abundance, which can artificially inflate indicator values
IndVal (Dufrene & Legendre, 1997)
Specificity: large mean abundance within group relative to summed mean abundances across all groups
Fidelity: presence in most or all sites of that group
Groups defined a priori by metadata or by statistical clustering
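The indicator value combines the two components as IndVal = 100 × specificity × fidelity. A small Python sketch for a single species follows; the abundances and site groups are invented, and the permutation test for significance is left out.

```python
def indval(abundances, groups):
    """Indicator value (0-100) of one species for each group of sites.

    abundances: abundance of the species at each site
    groups: group label of each site, in the same order
    """
    labels = sorted(set(groups))
    mean_abundance, fidelity = {}, {}
    for group in labels:
        values = [a for a, lab in zip(abundances, groups) if lab == group]
        mean_abundance[group] = sum(values) / len(values)
        # Fidelity: fraction of the group's sites where the species is present.
        fidelity[group] = sum(1 for v in values if v > 0) / len(values)
    total = sum(mean_abundance.values())
    # Specificity: the group's mean abundance relative to the sum over all groups.
    return {g: 100 * (mean_abundance[g] / total) * fidelity[g] for g in labels}

# Invented counts for one OTU at three forest sites and three field sites.
abundances = [4, 6, 5, 0, 0, 1]
groups = ["forest", "forest", "forest", "field", "field", "field"]
scores = indval(abundances, groups)
print({group: round(value, 2) for group, value in scores.items()})
```

The OTU is abundant and present at every forest site, so its forest IndVal is high (93.75), while a single read at one field site yields a value near 2. A species absent from every site would cause a division by zero and should be filtered out beforehand.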
Metadata | Taxon | R^2 value
mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
Simple linear correlations
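An R^2 value of this kind can be computed as the squared Pearson correlation between one taxon's abundance and one metadata variable across samples. The Python sketch below shows the arithmetic; the depth and abundance values are invented and are not the data behind the reported values.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Invented per-sample values: a metadata variable and two taxon abundances.
soil_depth = [0, 5, 10, 20, 30]
taxon_linear = [1, 11, 21, 41, 61]   # exactly linear in depth
taxon_noisy = [3, 9, 4, 12, 10]      # loosely related to depth

print(round(r_squared(soil_depth, taxon_linear), 3))
print(round(r_squared(soil_depth, taxon_noisy), 3))
```

Screening many taxa against many metadata variables in this way multiplies the number of tests, so high R^2 values are best treated as leads to follow up rather than as confirmed associations.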
mothur: cooccurrence function, measuring whether populations are co-occurring more frequently than you would expect by chance.
Non-negative Matrix Factorization
NMF as a representation method for portraying high-dimensional data as a small number of taxonomic components. Patterns of co-occurring OTUs can be described by a smaller number of taxonomic components. Each sample represented by the collection of component taxa, helping identify relationships between taxa and the environment.
Jonathan Dushoff, McMaster University, Ontario, Canada
SSUnique
[Figure: SILVA-based phylogenetic trees, panels (a) to (f), with labeled groups including Bacteroidetes, Planctomycetes, Chloroflexi, Cyanobacteria, Gemmatimonadetes, and Alphaproteobacteria.]
[Figure: SILVA-based trees with labeled groups including remaining Cyanobacteria/chloroplasts, Chlorophyta/Embryophyta, Fungi, Rhodophyta, Chromalveolata, and outgroups.]
Lynch et al. 2012
Nakai et al. 2012
Why pipelines?
Merge and manage (many) disparate techniques
Democratize analysis
improve accessibility
Accelerate pace of innovation, collaboration, and research
Early synthesis for numerical microbial ecology
Synthesis of 16S phylogenetics (Woese et al.) and Hughes (Counting the uncountable)
Numerical ecology for microorganisms
Algorithm development: libshuff, dotur (mothur)
Analysis pipelines: QIIME, mothur
QIIME
Knight Lab, U. Colorado at Boulder
Predominantly a collection of integrated Python/R scripts
Many dependencies
easy managed installation: qiime-deploy
MacQIIME, VirtualBox and Ubuntu fork
avoid for anything but small runs
Becoming the standard for marker gene studies
integrated analysis and visualization
easy access to broad computational biology toolbox (Python/R)
Automation and extension
AXIOME and phyloseq
Extend existing technologies (QIIME, mothur, R, custom)
Layers of abstraction
Automation and rapid re-analysis
Promote reproducible research (IPython, XML, make)
Implement existing techniques (e.g., MRPP, Dufrene-Legendre IndVal)
numerical microbial ecology needs to better incorporate modern statistical theory
Develop and test new techniques
Axiometic
GUI companion for AXIOME
Cross-platform
New implementation in development
Generates AXIOME file (XML)
xls template coming soon for all commands, sample metadata, and extra info… much easier for everyone.
“QIIME wraps many other software packages, and these should be cited if they are used. Any time you're using tools that QIIME wraps, it is essential to cite those tools.” http://qiime.org/index.html
The future
As data get bigger, interpretation should be “hands off”
Move towards hypothesis testing of high-dimension taxonomic data
Convergence on Galaxy
e.g., QIIME in Galaxy is developing
Further extension to cloud services
e.g., Amazon EC2
Machine learning and data mining applications
Galaxy
Open-source, web-based platform
Deployed locally or in the cloud
Ongoing development of 16S rRNA gene analysis
Galaxy Tool Shed (available tools)
“The advantages of having large numbers of samples at shallow coverage (~1,000 sequences per sample) clearly outweigh having a small number of samples at greater coverage for many datasets, suggesting that the focus for future studies should be on broader sampling that can reveal association with key biological parameters rather than on deeper sequencing.”
“….even [phylogenetic beta-diversity] measures suited to the underlying mechanism of differentiation may require deep sequencing to reveal subtle patterns”
Dr. Donovan Parks
Method standardization
Impossible.
Data storage
Sequence read output is outpacing the decline in data storage costs
Federated data?
File formats
e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient, does not ensure data is in correct format, no space for metadata, no absolute standard)… relational databases?
Software
Free and Open Source enables an experiment to be faithfully replicated
Algorithms
Memory! Many clustering and phylogenetic inference algorithms scale as n²
Distributed, parallel, or cloud computing may not be helpful
Metadata
What to do with it? How to marry sequence and metadata sets?
We need better metadata integration, not necessarily more/better metadata
What should we be doing? (take-home messages)
*Surveys are really important for spatial and temporal mapping
*Hypothesis testing follows (or implicit)
*What species account for treatment effects?
*Who tracks with whom? (why=function)
*Who avoids whom?
*Are all microorganisms accounted for? (no)
*How can we use this information to manipulate, manage and predict ecosystems?
What should we be doing? (take-home messages)
There is no “one way” to analyze 16S rRNA
You need to build a pipeline for you.
If this seems daunting, it is.
If this is not daunting, your hands are dirty.
It’s getting better all the tii-ime.
Helpful resources
Thank you