QIIME: Quantitative Insights Into Microbial Ecology (part 1)
Thomas Jeffries, Federico M. Lauro, Grazia Marina Quero, Tiziano Minuzzo
The Omics Analysis Sydney Tutorial
Australian Museum 23rd-24th February 2015
QIIME
• Open source software package for taxonomic analysis of 16S rRNA sequences
• University of Colorado & Northern Arizona University
• www.qiime.org (great resource…)
• Good community support
• Can google most problems
• Multi-platform
• Widely used
Caporaso
Knight
Getting QIIME
Linux: https://github.com/qiime/qiime-deploy
Mac: http://www.wernerlab.org/software/macqiime
Ubuntu virtualbox: http://qiime.org/install/virtual_box.html
Linux remote machine, e.g. UTS FEIT cluster, NECTAR: http://nectar.org.au/research-cloud
http://qiime.org/install/install.html
Data formats
• 454:
DNA sequences (FASTA, .fna)
Quality (.qual)
Mapping file (.txt)
• Illumina
Sequences and quality in same file (.fastq)
Also supports paired end
Getting into QIIME
• Command line interface
• Some very basic commands needed for QIIME:
example:
/folder$ programme.py -i file_in -o file_out
ls : lists files in working directory
cd : changes directory
cd .. : goes back to parent directory
‘tab’ key : magically fills out file names
mkdir : makes a directory
pwd : tells you where you are
QIIME tutorial and example data
• Many tutorials @ http://qiime.org/tutorials/index.html
• Good place to start: http://qiime.org/tutorials/tutorial.html
• Great Microbial Ecology course (includes QIIME): http://edamame-course.org/
• A few of the commands have changed in the new version – the current commands are in this talk - and I have renamed the files to make it easier to follow
Some useful terminology
α-diversity
Alpha diversity is the diversity within ONE sample
α-diversity: Richness
α-diversity: Evenness
Common metric: Pielou’s evenness
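As a rough illustration of the two ideas above, here is a minimal Python sketch that computes richness and Pielou's evenness (J' = H' / ln S, where H' is the Shannon index and S the richness) for one sample. The count vectors are invented for illustration and the function names are hypothetical; this is not how QIIME computes its metrics internally.

```python
import math

def richness(counts):
    """Number of OTUs observed at least once in a sample."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index H' (natural logarithm)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's J' = H' / ln(S); returned as 0.0 here when S < 2."""
    s = richness(counts)
    if s < 2:
        return 0.0
    return shannon(counts) / math.log(s)

# Two invented samples with the same richness but different evenness
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]
for name, sample in (("even", even_sample), ("uneven", uneven_sample)):
    print(name, richness(sample), round(pielou_evenness(sample), 3))
```

Both samples have a richness of 4, but only the first is even: J' is close to 1 when all OTUs are equally abundant and approaches 0 when one OTU dominates.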
Tutorial dataset
1. Check mapping file format
• Checks that format of mapping file is ok
validate_mapping_file.py -m my_mapping_file.txt -o validate_mapping_file_output
“No errors or warnings were found in mapping file”
1. Check mapping file
Name (ID) of sample
Primer
Sequencing barcode
Sample categories (treatments)
Tab separated !!!
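To show why the tab separation matters, here is a minimal Python sketch of the kind of checks a mapping file has to pass. It is an assumption-laden illustration, not the logic of validate_mapping_file.py: the header names (#SampleID, BarcodeSequence, LinkerPrimerSequence) follow the usual QIIME 1 convention and are not shown on the slide, and the function name is hypothetical.

```python
def check_mapping_lines(lines):
    """Return a list of problems found in a tab-separated mapping file."""
    problems = []
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if not rows:
        return ["file is empty"]
    header = rows[0].split("\t")
    # Assumed QIIME 1 header convention (not taken from the slide)
    if header[0] != "#SampleID":
        problems.append("first column must be #SampleID")
    for name in ("BarcodeSequence", "LinkerPrimerSequence"):
        if name not in header:
            problems.append("missing column: " + name)
    seen_ids = set()
    for n, row in enumerate(rows[1:], start=2):
        if row.startswith("#"):
            continue  # comment line
        fields = row.split("\t")
        if len(fields) != len(header):
            problems.append("row %d: expected %d tab-separated fields, found %d"
                            % (n, len(header), len(fields)))
            continue
        if fields[0] in seen_ids:
            problems.append("row %d: duplicate sample ID %s" % (n, fields[0]))
        seen_ids.add(fields[0])
    return problems

good = ["#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatment\tDescription\n",
        "S1\tACGT\tGTGCCAGC\tcontrol\tfirst sample\n",
        "S2\tTTGG\tGTGCCAGC\ttreated\tsecond sample\n"]
bad = [good[0], "S1 ACGT GTGCCAGC control first sample\n"]  # spaces, not tabs
print(check_mapping_lines(good))
print(check_mapping_lines(bad))
```

A row typed with spaces instead of tabs collapses into a single field, which is the most common reason a mapping file fails validation.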
Hands on – validate your mapping file
validate_mapping_file.py -o moving_pictures_tutorial-1.8.0/illumina/cid_l1/ -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt
2. De-multiplex - 454
• Using sample specific barcodes, identify each sequence with a sample (renames sequences)
• Performs some QC:
Removes sequences < 200 bp
Removes sequences with a mean quality score < 25
Removes sequences with > 6 ambiguous bases or a homopolymer run longer than 6 bp
split_libraries.py -m my_mapping_file.txt -f my_sequence_file.fna -q my_quality_file.qual -o split_library_output
• Produces seqs.fna
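The logic of this step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation of split_libraries.py: the barcode is assumed to sit at the very start of the read, primer removal is not handled, the QC is applied after trimming the barcode, and all names are hypothetical. The thresholds are the ones listed above.

```python
def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    longest, run, previous = 0, 0, None
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest

def demultiplex(reads, barcode_to_sample, min_len=200, min_mean_qual=25,
                max_ambig=6, max_homopolymer=6):
    """reads: iterable of (read_id, sequence, quality_scores).
    Returns (new_id, sequence) pairs renamed as <SampleID>_<n>."""
    kept, per_sample = [], {}
    for read_id, seq, quals in reads:
        sample = None
        for barcode, name in barcode_to_sample.items():
            if seq.startswith(barcode):
                sample = name
                seq, quals = seq[len(barcode):], quals[len(barcode):]
                break
        if sample is None:
            continue  # barcode not recognised
        if len(seq) < min_len or not quals:
            continue  # too short
        if sum(quals) / len(quals) < min_mean_qual:
            continue  # low mean quality
        if seq.count("N") > max_ambig:
            continue  # too many ambiguous bases
        if longest_homopolymer(seq) > max_homopolymer:
            continue  # homopolymer run too long
        per_sample[sample] = per_sample.get(sample, 0) + 1
        kept.append(("%s_%d" % (sample, per_sample[sample]), seq))
    return kept

# Toy reads (min_len lowered so the example stays short)
barcodes = {"ACGT": "S1", "TTGG": "S2"}
body = "ACGTACGTAC" * 2
reads = [("r1", "ACGT" + body, [30] * 24),
         ("r2", "TTGG" + "A" * 10 + "CGTACGTACG", [30] * 24),
         ("r3", "GGGG" + body, [30] * 24)]
print(demultiplex(reads, barcodes, min_len=10))
```

In the toy data only r1 survives: r2 carries a long homopolymer run and r3 has no matching barcode.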
2. De-multiplex - Illumina (Step 1)
• If the samples contain paired-end reads, you first need to join them and update the barcodes using:
join_paired_ends.py -f my_forw_reads.fastq -r my_rev_reads.fastq -b my_barcodes.fastq -o my_joined.fastq
2. De-multiplex - Illumina (Step 2)
Then you can proceed to the split libraries step. If the sequences are NOT paired-end, go directly to split_libraries_fastq.py. This step also performs the QC of the Illumina reads:
split_libraries_fastq.py -m my_mapping_file.txt -i my_sequence_file.fastq -b my_barcodes.fastq -o split_library_output
• Data from multiple lanes can be processed together by separating inputs with a comma (,)
• Produces seqs.fna
Hands on: split your libraries
1.8.0/illumina/raw/subsampled_s_1_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence.fastq -b moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence_barcodes.fastq -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt
count_seqs.py -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna
3. OTU picking strategies
• De Novo OTU picking: clustering of sequences at 97%
Overlapping sequences
No reference database necessary
computationally expensive
• Closed-Reference
non overlapping reads
needs reference database
discards sequences with no match (e.g. erroneous reads)
• Open-reference
Overlapping reads
reads clustered against reference and non matching reads are clustered de-novo
Hands on – picking O.T.U.s
pick_open_reference_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna -r gg_13_8_otus/rep_set/97_otus.fasta -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt
pick_de_novo_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna
3. Pick OTUs
Note: the following steps can be automated (this is what we are doing) with:
pick_de_novo_otus.py -i seqs.fna -o otus
pick_otus.py -i seqs.fna -o picked_otus_default
• Will cluster your sequences at 97% similarity (can change this if you wish) and produce ‘seqs_otus.txt’, which maps each sequence to a cluster
• Uses the UCLUST algorithm (Edgar, 2010, Bioinformatics)
3. Pick OTUs
Generate OTUs by clustering reads based on similarity (default is 97%)
Sort reads according to size (long -> short)
[Figure: sorted reads clustered into OTU1 to OTU5]
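The "sort, then cluster" idea can be sketched as a greedy centroid clustering in Python. This is only a toy under loud assumptions: identity is counted position by position over the shorter read (no alignment, no gaps), which is far cruder than UCLUST, and the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n

def greedy_cluster(seqs, threshold=0.97):
    """seqs: dict read_id -> sequence.
    Returns a list of (centroid_id, [member_ids]) in order of creation."""
    clusters = []
    # Longest reads first, so they become the cluster centroids
    for read_id in sorted(seqs, key=lambda r: len(seqs[r]), reverse=True):
        for centroid_id, members in clusters:
            if identity(seqs[read_id], seqs[centroid_id]) >= threshold:
                members.append(read_id)
                break
        else:
            # No centroid is similar enough: this read seeds a new OTU
            clusters.append((read_id, [read_id]))
    return clusters

base = "ACGT" * 25                                # 100 bp toy read
seqs = {
    "r1": base,
    "r2": "T" + base[1:50] + "A" + base[51:],     # 2 mismatches vs r1
    "r3": "TTTTT" + base[5:],                     # 4 mismatches vs r1
    "r4": "TTTT" * 25,                            # unrelated
}
for centroid, members in greedy_cluster(seqs):
    print(centroid, members)
```

Each read is compared only with existing centroids, so the result depends on the input order; that is why the reads are sorted first.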
4. Pick representative sequences
• We want a representative sequence for each OTU: annotating every sequence is time consuming and they are already clustered…
• This will take the most abundant sequence in each OTU and make a file that has 1 sequence for each OTU (rep_set1.fna)
pick_rep_set.py -i seqs_otus.txt -f seqs.fna -o rep_set1.fna
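A minimal Python sketch of "take the most abundant sequence in each OTU", assuming the OTU map has already been read into a dictionary of OTU ID to read IDs and the reads into a dictionary of read ID to sequence. The function name and toy data are hypothetical, and ties are broken by first occurrence, which may differ from what pick_rep_set.py does.

```python
from collections import Counter

def pick_representatives(otu_map, seqs):
    """otu_map: dict otu_id -> list of read ids.
    seqs: dict read_id -> sequence.
    Returns dict otu_id -> (read_id, sequence) for the most abundant sequence."""
    reps = {}
    for otu_id, read_ids in otu_map.items():
        # Count how often each exact sequence occurs within the OTU
        tally = Counter(seqs[r] for r in read_ids)
        best_seq = tally.most_common(1)[0][0]
        # Report the first read that carries the winning sequence
        first_id = next(r for r in read_ids if seqs[r] == best_seq)
        reps[otu_id] = (first_id, best_seq)
    return reps

otu_map = {"0": ["S1_1", "S1_2", "S2_1"], "1": ["S2_2"]}
seqs = {"S1_1": "ACGTAC", "S1_2": "ACGTTC", "S2_1": "ACGTTC", "S2_2": "GGGTTT"}
for otu_id, (read_id, seq) in pick_representatives(otu_map, seqs).items():
    print(">%s %s\n%s" % (otu_id, read_id, seq))
```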
5. Annotate (assign taxonomy to each OTU)
• Compare each representative sequence to a database using one of several algorithms:
• UCLUST, BLAST, RDP Classifier, and others
• New Defaults: UCLUST against the Greengenes database
assign_taxonomy.py -i rep_set1.fna
(output in directory: uclust_assigned_taxonomy)
• BLAST example (reference sequences and taxonomy downloaded from database):
assign_taxonomy.py -i rep_set1.fna -r ref_seq_set.fna -t id_to_taxonomy.txt -m blast
5. Annotate
• Some useful databases that are compatible with QIIME:
http://greengenes.secondgenome.com
Good for everything and default in QIIME
http://unite.ut.ee
Fungal Internal Transcribed Spacer (ITS)
Good for soil fungi
http://www.arb-silva.de
Contains both 16S and 18S rRNA (Eukaryotes…)
Good representation of marine taxa
Recap
[Figure: Species A, B, C; mixed amplicons; Sample 1, 2, 3; OTU 1, 2, 3; reference database]
• Split library into samples using barcodes
• Used clustering to choose OTUs
• Picked representative sequences and assigned taxonomy
6. Putting it all together: making an OTU table
• Need to combine the OTU identity with the abundance information in the clusters and link back to each sample so we can do ECOLOGY
make_otu_table.py -i seqs_otus.txt -t rep_set1_tax_assignments.txt -o otu_table.biom
• The table is in .biom format:
• http://biom-format.org/documentation/biom_format.html
• Convert to text file:
• biom convert -i otu_table.biom -o otu_table.txt --table-type "otu table" --header-key taxonomy -b
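What "combining OTU identity with abundance per sample" means can be sketched in Python as a simple counting exercise. The sketch assumes that demultiplexed reads are named <SampleID>_<n>, so the sample can be recovered from the read ID; the function name, the toy OTU map and the taxonomy strings are hypothetical, and the output is a plain tab-separated table rather than a .biom file.

```python
from collections import defaultdict

def make_otu_table(otu_map, taxonomy):
    """otu_map: dict otu_id -> list of read ids named '<SampleID>_<n>'.
    taxonomy: dict otu_id -> taxonomy string.
    Returns rows (lists of strings) with one count column per sample."""
    table = defaultdict(lambda: defaultdict(int))
    samples = set()
    for otu_id, read_ids in otu_map.items():
        for read_id in read_ids:
            # Everything before the last underscore is the sample ID
            sample = read_id.rsplit("_", 1)[0]
            table[otu_id][sample] += 1
            samples.add(sample)
    samples = sorted(samples)
    rows = [["#OTU ID"] + samples + ["taxonomy"]]
    for otu_id in sorted(table):
        counts = [str(table[otu_id][s]) for s in samples]
        rows.append([otu_id] + counts + [taxonomy.get(otu_id, "Unassigned")])
    return rows

otu_map = {"OTU1": ["S1_1", "S1_2", "S2_1"], "OTU2": ["S2_2"]}
taxonomy = {"OTU1": "k__Bacteria"}
for row in make_otu_table(otu_map, taxonomy):
    print("\t".join(row))
```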
Closed reference O.T.U. picking
pick_closed_reference_otus.py -i seqs.fna -r reference.fna -o otus_w_tax/ -t taxa_map.txt
• Reference is a database, i.e. the Greengenes unaligned 97% OTUs and the matching taxa map (same files as for BLAST)
• Output has all of your sequences aligned to Greengenes and an OTU table
• So this picks OTUs and assigns taxonomy in 1 step (but we lose non-matching sequences… do we care? For taxa summaries no, for beta-diversity maybe…)
• Quick: good for Illumina
7. Aligning sequences
• Back to our representative sequences…
• How closely related are the organisms present in the samples, i.e. what is the phylogeny of our community and how does this shift between samples?
• Default: PyNAST aligns sequences to a reference set of pre-aligned sequences (e.g. Greengenes ALIGNED), which is more computationally efficient than de novo alignment
• Can also select other methods, e.g. MUSCLE
align_seqs.py -i rep_set1.fna -o pynast_aligned/
7. Aligning sequences
• Not all regions of the rRNA gene are informative or useful for phylogenetic inference
• Gaps – short length sequence vs full length rRNA gene
• filter_alignment.py -i rep_set1_aligned.fasta -o filtered_alignment/
• Optional lanemask template that defines informative regions for some databases
• filter_alignment.py -i seqs_rep_set_aligned.fasta -m lanemask_in_1s_and_0s -o filtered_alignment/
• If you are going to use this alignment for making a phylogenetic tree this step is essential…
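To make the filtering idea concrete, here is a minimal Python sketch that drops alignment columns switched off by a lanemask of 1s and 0s and columns that consist of gaps. It is a sketch under assumptions, not the filter_alignment.py implementation: the gap threshold parameter, its default, and the function name are hypothetical.

```python
def filter_alignment(aligned, lanemask=None, max_gap_fraction=1.0):
    """aligned: dict seq_id -> aligned sequence (all the same length).
    lanemask: optional string of '1' (keep) and '0' (drop), one per column.
    Columns whose gap fraction is >= max_gap_fraction are also dropped."""
    ids = list(aligned)
    length = len(aligned[ids[0]])
    if any(len(aligned[i]) != length for i in ids):
        raise ValueError("aligned sequences must all have the same length")
    if lanemask is not None and len(lanemask) != length:
        raise ValueError("lanemask length must match the alignment length")
    keep = []
    for col in range(length):
        if lanemask is not None and lanemask[col] == "0":
            continue  # region flagged as uninformative
        gaps = sum(1 for i in ids if aligned[i][col] in "-.")
        if gaps / len(ids) >= max_gap_fraction:
            continue  # column is (almost) entirely gaps
        keep.append(col)
    return {i: "".join(aligned[i][c] for c in keep) for i in ids}

aligned = {"a": "AC-G-T", "b": "AC-GAT", "c": "A--G-T"}
print(filter_alignment(aligned))
print(filter_alignment(aligned, lanemask="011111"))
```

Short reads aligned against a full-length template produce many all-gap columns, so removing them shrinks the alignment considerably before tree building.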
A note on chimera removal
• Chimeras: sequences formed from the DNA of 2 or more organisms (an artifact of PCR amplification)
• QIIME uses ChimeraSlayer to detect chimeric sequences using your alignment and a reference database
identify_chimeric_seqs.py -m ChimeraSlayer -i rep_set_aligned.fasta -a reference_set1_aligned.fasta -o chimeric_seqs.txt
• You should then remove these OTUs from your OTU table and alignment before proceeding with tree building and visualization of results:
• -e chimeric_seqs.txt when making the OTU table, filter_fasta.py for the alignment
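The removal itself is just filtering by ID, as in this minimal Python sketch. It assumes that the first whitespace-separated field on each line of the chimera report is a sequence ID (the exact layout of chimeric_seqs.txt is an assumption here), and both function names are hypothetical.

```python
def read_chimera_ids(lines):
    """Take the first whitespace-separated field of each non-empty line as an ID."""
    return {ln.split()[0] for ln in lines if ln.strip()}

def drop_chimeras(records, chimera_ids):
    """records: dict seq_id -> anything (aligned sequence, OTU members...).
    Returns a copy without the flagged IDs."""
    return {k: v for k, v in records.items() if k not in chimera_ids}

report = ["OTU7\tparentA\tparentB\n", "\n", "OTU12\tparentC\tparentD\n"]
alignment = {"OTU1": "AC-GT", "OTU7": "ACTGT", "OTU12": "A--GT"}
flagged = read_chimera_ids(report)
print(sorted(flagged))
print(drop_chimeras(alignment, flagged))
```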
8. Make a phylogenetic tree
make_phylogeny.py -i rep_set1_aligned_pfiltered.fasta -o rep_phylo.tre
• Builds a tree from the alignment using FastTree
• Outputs a tree in Newick format (.tre), which can be opened with software such as FigTree or used to calculate phylogenetic metrics
• Also filter chimeras from the tree
We now have 2 final outputs:
• OTU Table
1. Taxonomic composition
2. α-diversity (e.g. ‘species’ richness)
3. β-diversity (e.g. abundance similarity between samples)
• Phylogenetic tree
1. Phylogenetic β-diversity
QIIME has powerful visualization and statistical tools
Hands on – reformatting outputs
We have automated (piped) most of the steps I have talked about. We need to convert the OTU table to a text file and filter the alignment:
biom convert -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.txt --table-type "otu table" --header-key taxonomy -b
filter_alignment.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/seqs_rep_set_aligned.fasta -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/filtered_alignment
9. Merging the mapping files
• We started with 6 lanes of Illumina but now we have a single OTU table. The merged mapping file will have duplicated barcodes but these are not used anymore (already demultiplexed):
• merge_mapping_files.py -o combined_mapping_file.txt -m mapfile1.txt,mapfile2.txt…,mapfilexxx.txt
Hands on – merge your mapping files
merge_mapping_files.py -o moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt
biom summarize-table -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.summary
Visualizing diversity 1 – community composition
biom summarize-table -i otu_table.biom -o otu_table_summary.txt
Counts/Sample detail:
L3S237: 138.0
L3S235: 187.0
L3S372: 205.0
L3S373: 228.0
L3S367: 259.0
L3S370: 273.0
L3S368: 274.0
L3S369: 284.0
• Summary of OTU table: we want to standardize the number of sequences (sampling depth) to allow accurate comparison, i.e. 138 sequences (the smallest sample above)
single_rarefaction.py -i otu_table.biom -o otu_table_even138.biom -d 138
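Standardizing the sampling depth amounts to drawing the same number of sequences at random, without replacement, from every sample. The Python sketch below illustrates this with hypothetical function names; the two sample totals (138 and 187) come from the summary above, but the split into OTUs is invented, and this is not the single_rarefaction.py implementation.

```python
import random

def rarefy(counts, depth, seed=0):
    """counts: dict otu_id -> count for one sample.
    Returns subsampled counts, or None if the sample has fewer than depth sequences."""
    # Expand the counts into one entry per sequence, then draw without replacement
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    out = {}
    for otu in rng.sample(pool, depth):
        out[otu] = out.get(otu, 0) + 1
    return out

def rarefy_table(table, depth, seed=0):
    """table: dict sample -> {otu_id: count}. Samples below depth are dropped."""
    result = {}
    for sample, counts in table.items():
        sub = rarefy(counts, depth, seed)
        if sub is not None:
            result[sample] = sub
    return result

# Totals match the slide (138 and 187); the OTU breakdown is invented
table = {"L3S237": {"otu1": 100, "otu2": 38},
         "L3S235": {"otu1": 87, "otu2": 60, "otu3": 40}}
even = rarefy_table(table, 138)
print({s: sum(c.values()) for s, c in even.items()})
print(sorted(rarefy_table(table, 150)))
```

Choosing a depth above the smallest sample (150 here) drops that sample entirely, which is the trade-off behind picking 138.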
• How ‘deep’ do we need to go to adequately sample the community? = Rarefaction analysis
• The number of species increases until a point where producing more sequence does not significantly increase the number of observed species
• Repeated subsampling of your data at different intervals. Plots subsamples against the number of observed species. If curves flatten, then you have sequenced at sufficient depth.
alpha_rarefaction.py -i otu_table.biom -m combined_mapping_file.txt -o rarefaction/ -t rep_set.tre
• Rarefaction is a trade-off between ‘keeping’ samples below a given sequence cut-off and losing diversity
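The repeated subsampling behind a rarefaction curve can be sketched in Python as follows. This is a toy illustration with a hypothetical function name and invented counts; alpha_rarefaction.py does considerably more (multiple metrics, plots, per-category averaging).

```python
import random

def rarefaction_curve(counts, depths, iterations=10, seed=0):
    """counts: dict otu_id -> count for one sample.
    Returns [(depth, mean observed OTUs)] for each depth the sample can reach."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        if depth > len(pool):
            break  # the sample does not have this many sequences
        # Subsample repeatedly and count the distinct OTUs seen each time
        observed = [len(set(rng.sample(pool, depth))) for _ in range(iterations)]
        curve.append((depth, sum(observed) / iterations))
    return curve

sample = {"a": 50, "b": 30, "c": 15, "d": 5}   # invented counts, 100 sequences
for depth, mean_observed in rarefaction_curve(sample, [10, 50, 100, 200]):
    print(depth, mean_observed)
```

If the mean number of observed OTUs stops rising as the depth grows, the curve has flattened and more sequencing would add little.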
Hands on - Rarefaction
single_rarefaction.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table_even138.biom -d 138
alpha_rarefaction.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/rarefaction/ -m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -t moving_pictures_tutorial-1.8.0/illumina/otus_denovo/rep_set.tre
Tomorrow…
Visualizing and comparing diversity
Software references:
QIIME: Caporaso et al. 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7(5):335-336.
UCLUST: Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.
BLAST: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.
Greengenes: McDonald et al. 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6(3):610-618.
RDP Classifier: Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261-5267.
PyNAST: Caporaso JG et al. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26:266-267.
ChimeraSlayer: Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. 2011. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Research 21:494-504.
MUSCLE: Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-1797.
FastTree: Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE 5(3):e9490.
UniFrac: Lozupone C, Knight R. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71(12):8228-8235.
Emperor: Vazquez-Baeza Y, Pirrung M, Gonzalez A, Knight R. 2013. Emperor: A tool for visualizing high-throughput microbial community data. Gigascience 2(1):16.