Introduction to 16S rRNA gene multivariate analysis

Multivariate exploration of microbial communities
Josh D. Neufeld
Braunschweig, Germany
December 2013

Michael Lynch (PhD): Taxonomy, phylogenetics, ecology

Michael Hall (co-op): mathematics, programming, user friendly!

Andre Masella (MSc): Computer science

Posted on Slideshare without images and unpublished data

Alpha and Beta diversity

Pipelines

Quick history

Future prospects and problems

Species that matter

Who lives with whom, and why, and where?

Data reduction is essential for:
a) summarizing large numbers of observations into manageable numbers
b) visualizing many interconnected variables in a compact manner

Alpha diversity: species richness (and evenness) within a single sample
Beta diversity: change in species composition across a collection of samples
Gamma diversity: total species richness across an environmental gradient

An (abbreviated) history

Numerical ecology: phenetics and statistical analysis of organismal counts

macroecology

16S rRNA gene era: sequence analysis as a surrogate for counting

mapping of marker to taxonomy

NGS: enabled synthesis of phenetics, phylogenetics, and numerical ecology

Now generate V3-V4 bacterial amplicons (~450 bases), usually sequenced as paired-end 300-base reads (PE 300)

Assembling paired-end reads dramatically reduces error
Corrects mismatches in the region of overlap (quality threshold >0.9); set a minimum overlap
Can compare to perfect-overlap assembly: “completelymissesthepoint” (name changing soon)

PANDAseq >30x faster than next fastest alternative assembler

1. p-value threshold
2. parallelizes correctly

(both are now added or fixed in PANDAseq)

Biological Observation Matrix

BIOM file format (McDonald et al. 2012)
Standard recognized by EMP, MG-RAST, VAMPS
Based on JSON data interchange format

Computational structure in multiple languages

“facilitates the efficient handling and storage of large, sparse biological contingency tables”
Encapsulates metadata and contingency table (e.g., OTU table) in one file
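To make the "metadata plus sparse contingency table in one JSON file" idea concrete, here is a minimal sketch of a tiny table in the BIOM 1.0 sparse layout and a helper that expands it to a dense matrix. The OTU and sample names, the metadata values, and the helper `sparse_biom_to_dense` are hypothetical, and the field names follow the BIOM 1.0 JSON layout as best recalled, so check them against the format specification before relying on them.

```python
import json

# Hypothetical 2-OTU x 3-sample table in the BIOM 1.0 sparse JSON layout.
# Field names are an assumption based on the BIOM 1.0 format; verify before use.
biom_text = json.dumps({
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2013-12-01T00:00:00",
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Planctomycetes"]}},
        {"id": "OTU_2", "metadata": None},
    ],
    "columns": [
        {"id": "SampleA", "metadata": {"soil_depth": "0-5 cm"}},
        {"id": "SampleB", "metadata": {"soil_depth": "5-10 cm"}},
        {"id": "SampleC", "metadata": None},
    ],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 3],
    # Sparse triplets: [row index, column index, count]; absent cells are zero.
    "data": [[0, 0, 12], [0, 2, 3], [1, 1, 7]],
})


def sparse_biom_to_dense(text):
    """Expand the sparse triplets into a dense OTU-by-sample list of lists."""
    table = json.loads(text)
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    otu_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    return otu_ids, sample_ids, dense


otu_ids, sample_ids, dense = sparse_biom_to_dense(biom_text)
for otu, counts in zip(otu_ids, dense):
    print(otu, counts)
```

Only three of the six cells are stored, which is the point of the sparse layout for OTU tables that are mostly zeros.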



Diversity (richness and evenness)

α-diversity: Richness and Evenness

Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity

Stearns et al. 2011; Hughes et al. 2001

Shannon index (H’): richness and evenness
Estimators: richness
Faith’s PD: phylogenetic richness
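As a rough illustration of two of the measures named above, the sketch below computes the Shannon index, H' = -Σ p_i ln p_i, and the bias-corrected Chao1 richness estimate, S_obs + F1(F1 - 1) / (2(F2 + 1)), from the OTU counts of one sample. The function names and the example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H' (natural log) from raw OTU counts of one sample."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)   # F1: OTUs seen once
    doubletons = sum(1 for c in counts if c == 2)   # F2: OTUs seen twice
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical OTU counts for a single sample (the zero is an unobserved OTU).
sample = [10, 10, 5, 2, 2, 1, 1, 1, 0]
print("Shannon H':", round(shannon_index(sample), 3))
print("Chao1:", chao1(sample))
```

Shannon responds to both richness and evenness, whereas Chao1 uses only the singleton and doubleton counts to estimate how many OTUs were missed.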

“All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community’s ‘true’ diversity.”

“Nonparametric estimators show particular promise for microbial data and in some habitats may require sample sizes of only 200 to 1,000 clones to detect richness differences of only tens of species.”

Hughes et al. 2001

[Figure: Google Scholar proportion of “[sequencing tech] AND 16S” publications (Sanger, 454, Illumina) and “rare biosphere” citations, plotted against time (year).]

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.

GOALS
Understanding of community structure
Better alpha-diversity measures
Robust beta-diversity measures

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.

Stearns et al. 2011

Bartram et al. 2011

Clustering algorithms (influence alpha diversity primarily)

CD-HIT (Li and Godzik, Sanford-Burnham Medical Research Institute)
‘longest-sequence-first’ removal algorithm
Fast, many implementations (nucleotide, protein, OTU-specific)
Tends to be more stringent than UCLUST

UCLUST (R. Edgar, drive5.com)
Faster than CD-HIT
Tends to generate larger number of low-abundance OTUs
Broader range of clustering thresholds
“I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs” – Robert Edgar

CROP: Clustering 16S rRNA for OTU Prediction
“CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”

Chimeras
DNA from two or more parent molecules
PCR artifact
Can easily be classified as a “novel” sequence
Artificially increases α-diversity

Software: ChimeraSlayer, Bellerophon, UCHIME, Pintail

Reference database or de novo

Classification and taxonomy

Ribosomal Database Project (RDP) classifier
Naïve Bayesian classifier (James Cole and James Tiedje)
http://rdp.cme.msu.edu/

pplacer
Phylogenetic placement and visualization

BLAST
The tool we know and love

RTAX (UC Berkeley, Rob Knight involved)
http://dev.davidsoergel.com/trac/rtax/

mothur (Patrick Schloss) http://www.mothur.org/

SINA (SILVA)

RDP classifier
Large training sets require active memory management

Can be easily run in parallel by breaking up very large data sets

Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)

Algorithm:

determine the probability that an unknown query sequence is a member of a known genus (training set), based on the profile of word subsets of known genera.

Confidence estimation:

the number of times in 100 trials that a genus was selected based on a random subset of words in the query
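The algorithm and confidence estimate described above can be sketched as a toy word-based naïve Bayes classifier. This is a simplified illustration, not the RDP classifier's code: the two training sequences and genus names are made up, the word length of 8 and the bootstrap over one-eighth of the query's words in 100 trials follow the published description of the method as best recalled, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

WORD_SIZE = 8  # word length assumed here, following the published RDP method


def words(seq, k=WORD_SIZE):
    """Set of overlapping k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training):
    """training: {genus: [sequences]} -> (per-genus word counts, overall counts, N)."""
    all_seqs = [s for seqs in training.values() for s in seqs]
    overall = Counter(w for s in all_seqs for w in words(s))
    per_genus = {
        genus: (Counter(w for s in seqs for w in words(s)), len(seqs))
        for genus, seqs in training.items()
    }
    return per_genus, overall, len(all_seqs)


def classify(query_words, model):
    """Genus with the highest summed log word probability."""
    per_genus, overall, n_total = model

    def log_score(genus):
        counts, m = per_genus[genus]
        score = 0.0
        for w in query_words:
            prior = (overall[w] + 0.5) / (n_total + 1)  # word-specific prior
            score += math.log((counts[w] + prior) / (m + 1))
        return score

    return max(per_genus, key=log_score)


def bootstrap_confidence(query, model, trials=100, seed=0):
    """Fraction of trials in which a random word subset gives the same genus."""
    rng = random.Random(seed)
    query_words = sorted(words(query))
    subset_size = max(1, len(query_words) // WORD_SIZE)
    best = classify(query_words, model)
    hits = sum(
        classify(rng.choices(query_words, k=subset_size), model) == best
        for _ in range(trials)
    )
    return best, hits / trials


# Made-up training set: one short sequence per genus.
training = {
    "GenusA": ["ATGCATTAGCATTACGATTAGCTAGGATCCATTAGCAATG"],
    "GenusB": ["GGCCGCGGTCCGGCGCCGGACCGGGTCGCGCCGGTGCGCC"],
}
model = train(training)
genus, confidence = bootstrap_confidence(training["GenusA"][0], model)
print(genus, confidence)
```

A short query contributes few words to each bootstrap subset, which is one way to see why short reads tend to receive lower confidence estimates.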

Take home:

The higher the diversity (bigger sequence space) of the training set, the better the assignment

Longer query = better and more reliable assignment

Short reads (i.e., <250 bases) will have lower confidence estimates (confidence cutoff of 0.5 suggested)

Database sources

Greengenes
Latest: May 2013

SILVA
Latest: release 115 (August 2013)
Includes 18S, 23S, 28S, LSU

RDP Database
Latest: release 11 (October 2013)

GenBank

Research-specific
e.g., CORE Oral

Multivariate data reduction

β-diversity

Visualization (ordination) versus hypothesis testing (MRPP, indicator species analysis)
Many more algorithms out there for exploration and statistical testing, mostly through widely used R packages:
vegan (Community Ecology Package)
labdsv (Ordination and Multivariate Analysis for Ecology)
ape (Analyses of Phylogenetics and Evolution)
picante (community analyses etc.)

Visualization (ordination)
Complementary to data clustering, which looks for discontinuities
Ordination extracts main trends as continuous axes

analysis of the square matrix derived from the OTU table

Non-parametric, unconstrained ordination methods most widely used (and best suited)

methods that can work directly on a square matrix An appropriate metric is required to derive this square matrix

many options...

Metrics

Ordination is essentially reducing dimensionality
first requirement: accurately model differences among samples

Models are *really* important. Examples include:
OTU presence/absence: Dice, Jaccard
OTU abundance: Bray-Curtis
Phylogenetic: UniFrac
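To show how a presence/absence metric and an abundance metric can disagree on the same pair of samples, the sketch below computes the Jaccard distance (1 - shared OTUs / total OTUs) and the Bray-Curtis dissimilarity (Σ|x_i - y_i| / Σ(x_i + y_i)). The function names and the two count vectors are hypothetical.

```python
def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |shared OTUs| / |OTUs in either sample|."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Abundance-based dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical OTU counts for two samples (same OTU order in both).
sample_1 = [10, 0, 5, 1]
sample_2 = [5, 5, 0, 1]
print("Jaccard:", jaccard_distance(sample_1, sample_2))
print("Bray-Curtis:", round(bray_curtis(sample_1, sample_2), 3))
```

Applying either function to every pair of samples yields the square matrix that ordination and permutation tests operate on.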

“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.” - Susan Huse

“all models are wrong, some are useful” - G.E.P. Box

Metrics: UniFrac
A distance measure comparing multiple communities using phylogenetic information
Requires sequence alignment and tree-building
PyNAST, MUSCLE, Infernal
Time-consuming and susceptible to poor phylogenetic inference (does it matter?)

Weighted (abundance): ecological features related to abundance

Unweighted: ecological features related to taxonomic presence/absence
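The unweighted variant can be sketched on a toy tree as the fraction of observed branch length that leads to only one of the two communities. This is a simplified illustration under stated assumptions: the tree is given as a hand-built list of (branch length, descendant OTUs) pairs rather than parsed from an alignment-derived tree, and the function name and OTU labels are hypothetical.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unique branch length / total branch length observed in either community.

    branches: list of (length, set of OTUs descending from that branch).
    """
    shared = unique = 0.0
    for length, descendants in branches:
        in_a = bool(descendants & community_a)
        in_b = bool(descendants & community_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    total = shared + unique
    return unique / total if total else 0.0


# Toy tree ((otu1:1, otu2:1):1, (otu3:1, otu4:1):1) written as a branch list.
branches = [
    (1.0, {"otu1"}),
    (1.0, {"otu2"}),
    (1.0, {"otu1", "otu2"}),
    (1.0, {"otu3"}),
    (1.0, {"otu4"}),
    (1.0, {"otu3", "otu4"}),
]
community_a = {"otu1", "otu2"}
community_b = {"otu2", "otu3"}
print(unweighted_unifrac(branches, community_a, community_b))
```

The weighted variant would additionally scale each branch by the difference in relative abundance of the sequences descending from it.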

Ordination example 1 (of many): Principal Coordinates Analysis

Classical Multidimensional Scaling (MDS; Gower 1966)
Procedure:
based on eigenvectors
position objects in low-dimensional space while preserving distance relationships as well as possible

highly flexible: can choose among many association measures

In microbial ecology, used for visualizing phylogenetic or count-based distances
Consistent visual output for given distance matrix

Include variance explained (%) on Axis 1 and 2
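The eigenvector-based procedure can be sketched with NumPy as follows: square the distances, double-center them (Gower centering), take the eigendecomposition, and scale the leading eigenvectors by the square roots of their eigenvalues. This is a minimal sketch; the function name `pcoa` and the example distances are hypothetical, and negative eigenvalues (which arise with non-Euclidean metrics) are simply clipped to zero rather than corrected.

```python
import numpy as np


def pcoa(distance_matrix, n_axes=2):
    """Classical MDS: sample coordinates and percent variance explained per axis."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (d ** 2) @ centering      # double-centered matrix
    eigenvalues, eigenvectors = np.linalg.eigh(gower)
    order = np.argsort(eigenvalues)[::-1]                # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    positive = np.clip(eigenvalues, 0, None)             # assumption: drop negatives
    coordinates = eigenvectors[:, :n_axes] * np.sqrt(positive[:n_axes])
    explained = 100 * positive[:n_axes] / positive.sum()
    return coordinates, explained


# Hypothetical distances among three samples that lie on a line at 0, 1 and 3.
distances = [[0, 1, 3],
             [1, 0, 2],
             [3, 2, 0]]
coords, explained = pcoa(distances)
print(np.round(coords, 3))
print(np.round(explained, 1))
```

Because the result is fully determined by the distance matrix, repeated runs give the same plot (up to axis sign), and the `explained` values are what would be reported on Axis 1 and Axis 2.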

Ordination example 2 (of many): Non-metric Multidimensional Scaling
Ordination not based on eigenvectors

Does not preserve exact distances among objects

attempts to preserve ordering of samples (“ranks”)

Procedure:

iterative, tries to position the objects in a few (2-3) dimensions in such a way that minimizes the “stress”

how well does the new ranked distribution of points represent the original distances in the association matrix? Can express as R² on axes 1 and 2.

the adjustment goes on until the stress value reaches a local minimum (heuristic solution)

NMDS often represents distance relationships better than PCoA in the same number of dimensions

Susceptible to the “local minimum issue”, and therefore should have a strong starting point (e.g., PCoA) or many random starts

You won't get the same result each time you run the analysis. Try several runs until you are comfortable with the result.

Do my treatments separate?

Beta-diversity: Hypothesis testing

Multiple methods, implemented in QIIME, mothur, AXIOME

e.g., MRPP, adonis, NP-MANOVA (perMANOVA), ANOSIM
Are treatment effects significant?

Because these are predominantly nonparametric methods, tests for significance rely on testing by permutation
Let's focus on MRPP

Multiresponse Permutation Procedures

Compare intragroup average distances with the average distances that would have resulted from all the other possible combinations

T statistic: more negative with increasing group separation (T>-10 common for ecology)
A statistic: chance-corrected within-group agreement (A=1 when all points within a group fall on top of one another)
p value: probability of similar or stronger separation with randomized data
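The three MRPP statistics can be sketched with a small permutation routine. This is a minimal illustration under stated assumptions: group weights are taken as n_i / N, the expected delta and its spread are estimated from random label shuffles rather than an analytical approximation, and the function names and the six example samples are hypothetical.

```python
import random
import statistics
from itertools import combinations


def mrpp_delta(dist, groups):
    """Weighted mean within-group distance (weights n_i / N)."""
    members = {}
    for index, label in enumerate(groups):
        members.setdefault(label, []).append(index)
    delta = 0.0
    for indices in members.values():
        pairs = list(combinations(indices, 2))
        if not pairs:
            continue  # a single-sample group has no within-group distance
        mean_within = sum(dist[i][j] for i, j in pairs) / len(pairs)
        delta += (len(indices) / len(groups)) * mean_within
    return delta


def mrpp(dist, groups, permutations=999, seed=0):
    """Return observed delta, T, A and a permutation p value."""
    rng = random.Random(seed)
    observed = mrpp_delta(dist, groups)
    shuffled = list(groups)
    permuted = []
    for _ in range(permutations):
        rng.shuffle(shuffled)                  # randomize group labels
        permuted.append(mrpp_delta(dist, shuffled))
    expected = statistics.mean(permuted)
    t_stat = (observed - expected) / statistics.stdev(permuted)
    a_stat = 1 - observed / expected
    as_extreme = sum(d <= observed + 1e-12 for d in permuted)
    p_value = (as_extreme + 1) / (permutations + 1)
    return observed, t_stat, a_stat, p_value


# Hypothetical samples on one axis: two tight, well-separated treatments.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
groups = ["a", "a", "a", "b", "b", "b"]
delta, t_stat, a_stat, p_value = mrpp(dist, groups)
print(round(delta, 3), round(t_stat, 2), round(a_stat, 3), round(p_value, 3))
```

With only six samples there are few distinct ways to split the labels, so the smallest attainable p value is limited by the design, which is one reason small groups deserve caution.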



“PCoA plots are the first step of a community analysis, not the last.” - Josh Neufeld

Searching for species that matter

High dimensional data often have too many features to investigate
solution: identify and study species significantly associated with categorical metadata

Indicator species (Dufrene-Legendre)
calculates indicator value (fidelity and relative abundance) of species
Permutation test for significance
Need solution for sparse data: be wary of groups with small numbers of sites (influence on permutation tests)
low abundance can artificially inflate indicator values

Specificity

Fidelity

IndVal (Dufrene & Legendre, 1997)

Specificity: large mean abundance within group relative to summed mean abundances across all groups

Fidelity: presence in most or all sites of that group

Groups defined a priori by metadata or statistical clustering
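The indicator value combines the two components multiplicatively: IndVal = 100 × specificity × fidelity. The sketch below computes it for one taxon across predefined groups, taking specificity as the group's mean abundance divided by the sum of mean abundances over all groups, as in the Dufrene and Legendre definition. The function name, the abundances, and the group labels are hypothetical, and the permutation test for significance is omitted.

```python
def indval(abundances, groups):
    """IndVal (0-100) of one taxon for each group of sites.

    abundances: the taxon's count at each site; groups: group label per site.
    """
    by_group = {}
    for count, label in zip(abundances, groups):
        by_group.setdefault(label, []).append(count)
    means = {label: sum(v) / len(v) for label, v in by_group.items()}
    total_mean = sum(means.values())
    result = {}
    for label, values in by_group.items():
        specificity = means[label] / total_mean if total_mean else 0.0
        fidelity = sum(1 for c in values if c > 0) / len(values)
        result[label] = 100 * specificity * fidelity
    return result


# Hypothetical counts of one OTU at three "wet" and three "dry" sites.
abundances = [10, 8, 12, 0, 0, 2]
groups = ["wet", "wet", "wet", "dry", "dry", "dry"]
values = indval(abundances, groups)
print({label: round(v, 2) for label, v in values.items()})
```

A taxon scores high only when it is both concentrated in a group and present at most of that group's sites, which is why a low-abundance taxon found at a handful of sites can still look like a strong indicator in small groups.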

Metadata | Taxon | R^2 value
mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774

Simple linear correlations

mothur: cooccurrence function, measuring whether populations are co-occurring more frequently than you would expect by chance.

Non-negative Matrix Factorization

NMF as a representation method for portraying high-dimensional data as a small number of taxonomic components. Patterns of co-occurring OTUs can be described by a smaller number of taxonomic components. Each sample represented by the collection of component taxa, helping identify relationships between taxa and the environment.
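A minimal NumPy sketch of the factorization idea: a non-negative samples-by-OTUs table V is approximated as W × H, where each row of H is a component (a weighted set of co-occurring OTUs) and each row of W gives one sample's loading on those components. The multiplicative update rule used here (Lee and Seung) is one common choice and is an assumption, not necessarily the method behind the slide; the function name and the toy table are hypothetical.

```python
import numpy as np


def nmf(v, n_components, iterations=500, seed=0, eps=1e-9):
    """Factor non-negative v (samples x OTUs) into w (samples x k) and h (k x OTUs)."""
    v = np.asarray(v, dtype=float)
    rng = np.random.default_rng(seed)
    w = rng.random((v.shape[0], n_components)) + eps
    h = rng.random((n_components, v.shape[1])) + eps
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


# Hypothetical table: OTUs 1-2 co-occur in samples 1-2, OTUs 3-4 in samples 3-4.
otu_table = np.array([[5, 5, 0, 0],
                      [4, 4, 0, 0],
                      [0, 0, 3, 3],
                      [0, 0, 6, 6]], dtype=float)
w, h = nmf(otu_table, n_components=2)
print(np.round(w, 2))
print(np.round(h, 2))
print("reconstruction error:", round(float(np.linalg.norm(otu_table - w @ h)), 4))
```

The number of components is chosen by the analyst, and results depend on the random start, so several runs are usually compared.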

Jonathan Dushoff, McMaster University, Ontario, Canada

SSUnique

[Figure: phylogenetic trees of sequences labelled “UL” placed among SILVA reference sequences. Panels: (a) Bacteroidetes; (b) Bacteroidetes; (c) Planctomycetes; (d) Chloroflexi and Cyanobacteria; (e), (f) Gemmatimonadetes and Alphaproteobacteria.]

[Figure: tree of “UL” sequences among SILVA references for Cyanobacteria/chloroplasts, Chlorophyta/Embryophyta, Rhodophyta, Chromalveolata, and Fungi, with outgroups.]

Lynch et al. 2012

Nakai et al. 2012



Why pipelines?
Merge and manage (many) disparate techniques
Democratize analysis
improve accessibility

Accelerate pace of innovation, collaboration, and research

Early synthesis for numerical microbial ecology
Synthesis of 16S phylogenetics (Woese et al.) and Hughes et al. (“Counting the uncountable”)

Numerical ecology for microorganisms

Algorithm development: libshuff, dotur (mothur)

Analysis pipelines: QIIME, mothur

QIIME
Knight Lab, U. Colorado at Boulder
Predominantly a collection of integrated Python/R scripts
Many dependencies

easy managed installation: qiime-deploy

MacQIIME virtual box and Ubuntu fork

avoid for anything but small runs

Becoming the standard for marker gene studies
integrated analysis and visualization
easy access to broad computational biology toolbox (Python/R)

Automation and extension

AXIOME and phyloseq
Extend existing technologies (QIIME, mothur, R, custom)

Layers of abstraction
Automation and rapid re-analysis
Promote reproducible research (IPython, XML, make)

Implement existing techniques (e.g., MRPP, Dufrene-Legendre IndVal)

numerical microbial ecology needs to better incorporate modern statistical theory

Develop and test new techniques

Axiometic: GUI companion for AXIOME

Cross-platform
New implementation in development

Generates AXIOME file (XML)

xls template coming soon for all commands, sample metadata, and extra info… much easier for everyone.

“QIIME wraps many other software packages, and these should be cited if they are used. Any time you're using tools that QIIME wraps, it is essential to cite those tools.”
http://qiime.org/index.html



The future

As data get bigger, interpretation should be “hands off”

Move towards hypothesis testing of high-dimension taxonomic data

Convergence on Galaxy (e.g., QIIME in Galaxy is in development)

Further extension to cloud services (e.g., Amazon EC2)

Machine learning and data mining applications

Galaxy
Open-source, web-based platform
Deployed locally or in the cloud
Ongoing development of 16S rRNA gene analysis

Galaxy Tool Shed (available tools)

“The advantages of having large numbers of samples at shallow coverage (~1,000 sequences per sample) clearly outweigh having a small number of samples at greater coverage for many datasets, suggesting that the focus for future studies should be on broader sampling that can reveal association with key biological parameters rather than on deeper sequencing.”

“…even [phylogenetic beta-diversity] measures suited to the underlying mechanism of differentiation may require deep sequencing to reveal subtle patterns”
- Dr. Donovan Parks

Method standardization
Impossible.

Data storage
Sequence read output is outpacing reductions in data storage costs
Federated data?

File formats
e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient, does not ensure data is in correct format, no space for metadata, no absolute standard)… relational databases?

Software
Free and open-source software enables an experiment to be faithfully replicated

Algorithms
Memory! Many clustering and phylogenetic inference algorithms scale as n²

Distributed, parallel, or cloud computing may not be helpful
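A back-of-the-envelope sketch of why memory becomes the bottleneck: an all-against-all distance matrix grows with the square of the number of sequences. The function name and the choice of 8 bytes per stored distance are assumptions for illustration.

```python
def distance_matrix_gib(n_sequences, bytes_per_value=8, condensed=True):
    """Approximate memory (GiB) needed to hold all pairwise distances."""
    if condensed:
        values = n_sequences * (n_sequences - 1) // 2   # upper triangle only
    else:
        values = n_sequences ** 2                       # full square matrix
    return values * bytes_per_value / 2 ** 30


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} sequences: {distance_matrix_gib(n):12.2f} GiB (condensed)")
```

Each tenfold increase in sequences costs roughly a hundredfold more memory, so adding machines helps little unless the algorithm itself avoids holding the full matrix.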

Metadata
What to do with it?
How to marry sequence and metadata sets?

We need better metadata integration, not necessarily more/better metadata

What should we be doing? (take-home messages)

*Surveys are really important for spatial and temporal mapping
*Hypothesis testing follows (or implicit)
*What species account for treatment effects?
*Who tracks with whom? (why=function)
*Who avoids whom?
*Are all microorganisms accounted for? (no)
*How can we use this information to manipulate, manage and predict ecosystems?

What should we be doing? (take-home messages)

There is no “one way” to analyze 16S rRNA

You need to build a pipeline for you.

If this seems daunting, it is.

If this is not daunting, your hands are dirty.

It’s getting better all the tii-ime.

Helpful resources

Thank you
