- Slide 1
- NGS data processing: bioinformatics tips, tools of the trade and
pipeline writing. Na Cai, 4th-year DPhil in Clinical Medicine.
Supervisor: Jonathan Flint
- Slide 2
- Example projects
  - CONVERGE
    - 1.7x whole-genome sequencing in 12,000 Han Chinese women
    - 6,000 cases of MD, 6,000 controls
    - Detailed questionnaire
    - 45 TB of sequencing data
  - Commercial Outbred Mice
    - 0.1x whole-genome sequencing in 2,000 mice
    - Known breeding history
    - Extensive phenotyping
    - 2 TB of sequencing data
- Slide 3
- NGS data processing Taken from:
http://www.broadinstitute.org/gatk/guide/best-practices
- Slide 4
- Large-scale sequencing projects
  - Lots of data: terabytes!
    - Storage problems, I/O problems, RAM problems
    - Time-consuming to process
  - Errors! Lots of them!
    - Contamination
    - Duplication
    - Missing data
    - Difficult regions/features of the genome
- Slide 5
- Approach to NGS data
  - Explore the data before processing at large scale
  - Pilot your experiments with small subsets
  - Try the default parameters of software before altering them
  - Check the output
    - Right number of lines?
    - Did anything fail silently?
    - Are different classes of input handled differently?
    - How are missing values coded?
    - What percentage failed?
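
The output checks on this slide can be automated. The following is a minimal sketch in Python, assuming a tab-delimited output file with one line per expected record. The function name `check_output` and the set of missing-value codes are illustrative assumptions, not part of any tool named in the talk.

```python
import csv
import io

# Hypothetical set of strings treated as missing; adjust to how the tool codes them.
MISSING_CODES = {"", "NA", "nan", "."}

def check_output(handle, expected_rows, missing_codes=MISSING_CODES):
    """Return (row_count_ok, fraction_of_missing_fields) for a tab-delimited file."""
    rows = 0
    fields = 0
    missing = 0
    for record in csv.reader(handle, delimiter="\t"):
        rows += 1
        for value in record:
            fields += 1
            if value.strip() in missing_codes:
                missing += 1
    fraction_missing = missing / fields if fields else 0.0
    return rows == expected_rows, fraction_missing

# Example on an in-memory file; a real run would pass open("output.tsv").
demo = io.StringIO("s1\t0.5\tNA\ns2\tnan\t1.2\n")
row_count_ok, fraction_missing = check_output(demo, expected_rows=2)
```

Running a check like this on a small pilot subset first makes silent failures visible before the full dataset is processed.
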
- Slide 6
- Exploratory work in R
  - read.table(file, as.is=TRUE, na.strings=c("NA", "nan"))
  - dim(), str(), mode(), complete.cases()
  - head(), tail()
  - table(), summary()
  - order(), rank()
  - plot(), library(ggplot2)
  - library(plyr)
- Slide 7
- Pipeline writing
  - Arguments/options for different input
  - Arguments/options for parameters/auxiliary files
  - Reusable functions
  - Reasonably flexible input format recognition
  - Set up for parallelizing
  - Use stderr for debugging and checking progress, but beware of its size and I/O!
  - Create new directories as you go along
  - Create flag files to indicate successful completion of each step
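
The last two points, creating directories as you go and writing flag files, can be sketched as follows. This is a minimal illustration using only the Python standard library; `run_step` and the `.done` suffix are hypothetical names chosen for the example.

```python
import os
import tempfile

def run_step(name, workdir, action):
    """Run action(step_dir) once; a flag file marks successful completion."""
    step_dir = os.path.join(workdir, name)
    os.makedirs(step_dir, exist_ok=True)       # create directories as you go
    flag = os.path.join(step_dir, name + ".done")
    if os.path.exists(flag):
        return False                           # step already completed: skip it
    action(step_dir)                           # an exception here means no flag is written
    with open(flag, "w") as handle:
        handle.write("ok\n")
    return True

# Example: the second call is skipped because the flag file exists.
with tempfile.TemporaryDirectory() as demo_dir:
    first = run_step("sort", demo_dir, lambda step_dir: None)
    second = run_step("sort", demo_dir, lambda step_dir: None)
```

Because the flag is written only after the action returns, a crashed or killed step is rerun on the next invocation rather than being mistaken for a finished one.
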
- Slide 8
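- Make
  - Specify the input file and output file
  - Specify the command that turns the input into the output
  - Make checks whether the output file exists, and is newer than its input, before running the command
  - Make can delete the output of commands that did not finish running (on interrupt, or on error when .DELETE_ON_ERROR is set)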
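
The two behaviours described on this slide can be restated in a few lines of Python. This is a sketch of the idea only, not how Make is implemented; `needs_rebuild` and `build` are hypothetical helper names.

```python
import os
import tempfile

def needs_rebuild(source, target):
    """True if target is missing or older than source (Make's timestamp rule)."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)

def build(source, target, command):
    """Run command(source, target) if needed; remove a partial target on failure."""
    if not needs_rebuild(source, target):
        return False
    try:
        command(source, target)
    except Exception:
        if os.path.exists(target):
            os.remove(target)    # similar in effect to GNU Make's .DELETE_ON_ERROR
        raise
    return True
```

Removing a half-written target matters because a later run would otherwise see an up-to-date file and skip the step.
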
- Slide 9
- Ruffus
  - http://www.ruffus.org.uk
  - Flexible: one-to-many and many-to-one processes
  - Fully integrated with Python programming
  - Only the maximum number of cores allowed for parallelisation needs to be specified
  - Useful printout options to check the pipeline
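
The one-to-many and many-to-one pattern, with a single cap on parallel workers, can be illustrated without Ruffus. The sketch below uses only the Python standard library and is not the Ruffus API. `split`, `process`, `merge` and `run` are hypothetical names, and threads stand in for cores (CPU-bound work would normally use processes or cluster jobs).

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    """One-to-many: divide a list into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(items) // n_chunks))    # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process(chunk):
    """Placeholder for per-chunk work; here it just counts the items."""
    return len(chunk)

def merge(results):
    """Many-to-one: combine the per-chunk results."""
    return sum(results)

def run(items, n_chunks, max_cores):
    """Split, process chunks with at most max_cores workers, then merge."""
    with ThreadPoolExecutor(max_workers=max_cores) as pool:
        return merge(pool.map(process, split(items, n_chunks)))
```

The only parallelisation setting the caller supplies is `max_cores`, which mirrors the point made on the slide.
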
- Slide 10
- Setting up Ruffus
- Slide 11
- Once Ruffus is set up - Help
- Slide 12
- Once Ruffus is set up just print
- Slide 13
- NGS data processing Taken from:
http://www.broadinstitute.org/gatk/guide/best-practices
- Slide 14
- Processing a raw BAM file
  - Practical concerns
    - Number of samples
    - Size of files
    - Run time
    - Server/cluster usage: how the jobs can be parallelized
  - Scientific concerns
    - Ploidy of genome
    - Source of DNA
    - Features of genome
    - Variation between samples
    - Genome coverage
    - Error rates
- Slide 15
- Manipulating a BAM file
  - Converting between BAM and FASTQ
  - Indexing
  - Coordinate sorting
  - Splitting or merging
  - Filtering out reads using bitwise flags or other criteria
  - Masking entire regions
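
Filtering on bitwise flags works by testing individual bits of the FLAG field (the second column of a SAM record). A minimal sketch, assuming uncompressed SAM text lines as input; the function names and the default mask are illustrative choices.

```python
# SAM FLAG bits, as defined in the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
DEFAULT_EXCLUDE = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE   # decimal 1796

def keep_read(sam_line, exclude_mask=DEFAULT_EXCLUDE):
    """Return True if none of the bits in exclude_mask are set in FLAG."""
    flag = int(sam_line.split("\t")[1])     # FLAG is the second SAM column
    return (flag & exclude_mask) == 0

def filter_sam(lines, exclude_mask=DEFAULT_EXCLUDE):
    """Yield header lines unchanged and only the reads that pass the mask."""
    for line in lines:
        if line.startswith("@") or keep_read(line, exclude_mask):
            yield line
```

With this mask the filter corresponds to `samtools view -F 1796`, which is usually the more practical choice on real BAM files.
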
- Slide 16
- Example: Contaminants
- Slide 17
- Slide 18
- Useful Resource: Harvard Sysbio
  - Remove duplicate sequences in FASTA
  - Remove short sequences in FASTA
  - Format FASTA
  - http://archive.sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Protocols/Sequences.html
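
The first two tasks listed on this slide can also be written in a few lines of Python. This is an independent sketch, not the code from the linked resource; `read_fasta` and `clean_fasta` are hypothetical names, and "duplicate" is taken to mean an identical sequence ignoring case.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def clean_fasta(records, min_length=0):
    """Drop sequences shorter than min_length and exact duplicate sequences."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        yield header, seq
```

Only the first record carrying a given sequence is kept, so the order of the input decides which header survives.
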
- Slide 19
- Useful Resource: NGSUtils
  - Tools (in Python) for FASTA, BAM, BED and GTF file processing
  - E.g. bamutils filter can filter out reads with more than x mismatches
  - http://ngsutils.org
- Slide 20
- Useful Resource: PicardTools
  - Tools (in Java) for BAM and FASTA processing
  - Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
  - Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
  - http://broadinstitute.github.io/picard
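
An MD5 file written alongside a BAM is only useful if it is checked after copying or archiving. A minimal sketch of that check, assuming the `.md5` file sits next to the data file and begins with the hexadecimal digest; `md5_of` and `verify_md5` are hypothetical helper names.

```python
import hashlib
import os
import tempfile

def md5_of(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_md5(path, md5_path=None):
    """Compare a file's digest with the first token stored in its .md5 file."""
    md5_path = md5_path or path + ".md5"
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of(path) == expected
```

Reading in blocks keeps memory use constant, which matters for files in the tens of gigabytes.
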
- Slide 21
- Useful Resource: GATK
  - Tools (in Java) for NGS processing and analysis
  - Cool things about it: Best Practices page, Forum, Tutorials, Presentations
  - https://www.broadinstitute.org/gatk/
- Slide 22
- Useful Resource: GATK
http://www.broadinstitute.org/gatk/guide/best-practices
- Slide 23
- Indel Realignment
http://www.broadinstitute.org/gatk/guide/best-practices
- Slide 24
- Why Realign Around Indels?
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
- Slide 25
- Why Realign Around Indels?
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
- Slide 26
- How does it work?
  - Identified intervals:
    - Known indels
    - Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
    - Reads where there is evidence of possible misalignment
  - http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
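
One of the interval sources above is indels already present in the CIGAR strings of aligned reads. The sketch below shows how such positions can be read off a CIGAR string. It is an illustration of the idea, not GATK's implementation, and `indel_positions` is a hypothetical name.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indel_positions(pos, cigar):
    """Return [(op, ref_position, length)] for each I or D operation.

    pos is the 1-based leftmost reference position of the alignment.
    For a deletion, ref_position is the first deleted reference base.
    For an insertion, it is the reference base immediately after the insert.
    """
    ref = pos
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID":
            found.append((op, ref, length))
        if op in "MDN=X":          # operations that consume reference bases
            ref += length
    return found
```

For a read aligned at position 100 with CIGAR `5M2D3M1I4M`, the deletion starts at reference position 105 and the insertion falls before position 110.
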
- Slide 27
- The Indel Realigner Workflow
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
- Slide 28
- Implementing RealignerTargetCreator
  - [Figure: grid of reads from samples 1 to 7 across sites 1 to 8]
  - The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there
  - So data for all samples need to be passed in at the same time
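
Because every job needs all samples, this step parallelizes by region rather than by sample. A minimal sketch that builds one command per region, assuming GATK 3-style arguments (`-T`, `-R`, `-I`, `-L`, `-o`). The jar path, file names and the function name are hypothetical, and the commands are only constructed here, not run.

```python
def target_creator_commands(bams, regions, reference, jar="GenomeAnalysisTK.jar"):
    """Build one RealignerTargetCreator command per region, each listing every BAM."""
    commands = []
    for region in regions:
        cmd = ["java", "-jar", jar, "-T", "RealignerTargetCreator",
               "-R", reference, "-L", region, "-o", region + ".intervals"]
        for bam in bams:
            cmd += ["-I", bam]     # all samples go into the same job
        commands.append(cmd)
    return commands

# Example with hypothetical file names: two chromosomes give two jobs.
jobs = target_creator_commands(["s1.bam", "s2.bam"], ["chr1", "chr2"], "ref.fa")
```

Each list can be handed to a job scheduler or to `subprocess.run`, and the number of jobs equals the number of regions regardless of how many samples there are.
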
- Slide 29
- Slide 30
- Implementing IndelRealigner
  - [Figure: grid of reads from samples 1 to 7 across sites 1 to 8]
  - Once the intervals are identified, reads from any single sample can be realigned individually, based on the sample's own insertion/deletion lengths
  - So only one sample's data need to be passed in at a time
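
Since realignment needs only one sample at a time, this step parallelizes by sample, with every job sharing the same interval list. A minimal sketch, assuming GATK 3-style arguments (`-T IndelRealigner`, `-targetIntervals`). The jar path, file names and the function name are hypothetical, and the commands are only constructed, not run.

```python
def realigner_commands(bams, intervals, reference, jar="GenomeAnalysisTK.jar"):
    """Build one IndelRealigner command per sample BAM, sharing one intervals file."""
    commands = {}
    for bam in bams:
        stem = bam[:-4] if bam.endswith(".bam") else bam
        commands[bam] = ["java", "-jar", jar, "-T", "IndelRealigner",
                         "-R", reference, "-I", bam,
                         "-targetIntervals", intervals,
                         "-o", stem + ".realigned.bam"]
    return commands
```

Here the number of jobs equals the number of samples, so thousands of low-coverage samples can be spread across a cluster independently.
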
- Slide 31
- Slide 32
- Base Quality Score Recalibration (BQSR)
http://www.broadinstitute.org/gatk/guide/best-practices
- Slide 33
- Why BQSR?
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
- Slide 34
- The BQSR workflow
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
- Slide 35
- Implementing BaseRecalibrator
  - [Figure: grid of reads from samples 1 to 7 across sites 1 to 8]
  - The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset
  - So all of the data of each sample need to be passed in
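
The need for all of a sample's data follows from how a recalibration table is built: mismatches are tallied per reported quality over every base at unmasked sites. The sketch below is a deliberately simplified version that bins by reported quality only. GATK's BaseRecalibrator also uses covariates such as read group, machine cycle and sequence context, and the smoothing used here is an assumption made for the example.

```python
import math
from collections import defaultdict

def recalibration_table(observations):
    """Map each reported quality to an empirical Phred quality.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per
    base at an unmasked site.
    """
    counts = defaultdict(lambda: [0, 0])    # reported Q -> [bases seen, mismatches]
    for reported, is_mismatch in observations:
        counts[reported][0] += 1
        counts[reported][1] += int(is_mismatch)
    table = {}
    for reported, (seen, mismatches) in counts.items():
        error_rate = (mismatches + 1) / (seen + 2)    # smoothed, never exactly 0
        table[reported] = round(-10 * math.log10(error_rate), 1)
    return table
```

If bases reported as Q30 mismatch the reference about 1% of the time, the table assigns them an empirical quality near Q20, which is the kind of correction BQSR applies.
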
- Slide 36
- Slide 37
- Variant Calling
http://www.broadinstitute.org/gatk/guide/best-practices
- Slide 38
- Variant Calling
http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5-Variant_calling.pdf
- Slide 39
- Implementing Variant Calling
  - [Figure: grid of reads from samples 1 to 7 across sites 1 to 8]
  - The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at the site
  - So data for all samples at a particular site need to be passed in at the same time
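
As all samples are needed at each site, calling is usually split by genomic window, with every window job reading all BAMs. A minimal sketch of the windowing step, producing `chrom:start-end` strings of the form accepted by GATK's `-L` argument. The function name, chromosome lengths and window size are hypothetical.

```python
def windows(chrom_lengths, window_size):
    """Yield 'chrom:start-end' intervals (1-based, inclusive) tiling each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window_size):
            end = min(start + window_size - 1, length)
            yield "%s:%d-%d" % (chrom, start, end)

# Example with a hypothetical 250 bp chromosome and 100 bp windows.
intervals = list(windows({"chr1": 250}, 100))
```

The per-window VCFs then have to be merged in genomic order, which is the many-to-one step of the pipeline.
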
- Slide 40
- Slide 41
- Useful Resource: Variant Callers
- Slide 42
- Acknowledgements
  - Jonathan Flint, Richard Mott
  - Robbie Davies, Winni Kretzschmar
  - Kiran Garimella (GATK)
  - Leo Goodstadt (Ruffus)
  - Gerton Lunter (Stampy)
  - Andy Rimmer (Platypus)
  - Zam Iqbal (Cortex)
  - John Broxholme (all software help and maintenance)
  - Jon Diprose, Robert Esnouf (Clusters)
  - Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)