Read Processing and Mapping: From Raw to Analysis-ready Reads BEN PASSARELLI STEM CELL INSTITUTE GENOME CENTER NGS WORKSHOP 31 MAY 2013

Read Processing and Mapping:From Raw to Analysis-ready Reads

B E N P A S S A R E L L I S T E M C E L L I N S T I T U T E G E N O M E C E N T E R

N G S W O R K S H O P3 1 M A Y 2 0 1 3

2

From Raw to Analysis-ready Reads

Session Topics• Overview of high-throughput sequencing platforms• Understand read data formats and quality scores• Identify and fix some common read data problems• Find a genomic reference for mapping• Mapping reads to a reference genome• Understand alignment output• Sort, merge, index alignment for further analysis• Mark/eliminate duplicate reads• Locally realign at indels• Recalibrate base quality scores• How to get started

Raw reads

Read assessment and prep

Mapping

DuplicateMarking

Local realignment

Base quality recalibration

Analysis-readyreads

3

Sample to Raw Reads

Library Construction

QC and Quantification

SamplePreparation Sequencing Raw Reads

4

Instrument OutputIllumina

MiSeqIllumina

HiSeqIon PGM Ion Proton Pacific Biosciences

RS

Images (.tiff)Cluster intensity file (.cif)

Base call file (.bcl)

Standard flowgram file (.sff) MovieTrace (.trc.h5)Pulse (.pls.h5)Base (.bas.h5)

Sequence Data(FASTQ Format)

Sequencing Platforms at a Glance

6

Solid Phase Amplification

V3 HiSeq

Library DNA binds to Oligos Immobilized on Glass Flowcell Surface

Sequencing Steps•Clusters are linearized•Sequencing primer annealed•All four dNTPs added at each cycle Each with different **Fluorescent Tag**•Intensity of different tags base call•Error Profile: substitutions

FASTQ Format (Illumina Example)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAACAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT+BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AGAAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG+@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AGCCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC+CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AGGAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG+CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Read RecordHeader

Read BasesSeparator

(with optional repeated header)

Read Quality Scores

Flow Cell IDLane Tile

Tile CoordinatesBarcode

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads

Phred* quality score Q with base-calling error probability P

Q = -10 log10P* Name of first program to assign accurate base quality scores. From the Human Genome Project.

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126

S - Sanger Phred+33 range: 0 to 40 I - Illumina 1.3+ Phred+64 range: 0 to 40 L - Illumina 1.8+ Phred+33 range: 0 to 41

Q scoreProbability of base error

Base confidence

Sanger-encoded(Q Score + 33) ASCII character

10 0.1 90% “+”

20 0.01 99% “5”

30 0.001 99.9% “?”

40 0.0001 99.99% “I”

Base Call Quality: Phred Quality Scores

Initial Read Assessment and Processing

COMMON PROBLEMS THAT CAN AFFECT ANALYSIS

LOW CONFIDENCE BASE CALLS typically toward ends of reads criteria vary by application

PRESENCE OF ADAPTER SEQUENCE IN READS poor fragment size selection protocol execution or artifacts

OVER-ABUNDANT SEQUENCE DUPLICATES

LIBRARY CONTAMINATION

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

Quick Read Assessment: FastQC

FREE DOWNLOAD Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y

SAMPLES READS (200K DEFAULT): FAST, LOW RESOURCE USE

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://www.youtube.com/watch?v=bz93ReOv87Y

Read Assessment Example (Cont’d)

Trim for base quality or adapters(run or library issue)

Trim leading bases(library artifact)

Read Assessment Example (Cont’d)TruSeq Adapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

13

Comprehensive Read Assessment: Prinseq

http://prinseq.sourceforge.net/



Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/(partial list)FASTQ Information: Chart Quality Statistics and Nucleotide DistributionFASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise).FASTQ Clipper: Removing sequencing adaptersFASTQ Quality Filter: Filters sequences based on qualityFASTQ Quality Trimmer: Trims (cuts) sequences based on qualityFASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality*defaults to old Illumina fastq (ASCII offset 64). Use –Q33 option.

SepPrep https://github.com/jstjohn/SeqPrepAdapter trimmingMerge overlapping paired-end read

Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html(for python programmers)Especially useful for implementing custom/complex sequence analysis/manipulation

Galaxy http://galaxy.psu.eduGreat for beginners: upload data, point and clickJust about everything you’ll see in today’s presentations

SolexaQA2 http://solexaqa.sourceforge.netDynamic trimmingLength sorting (resembles read grouping of Prinseq)

Selected Tools to Process Reads

http://hannonlab.cshl.edu/fastx_toolkit/



https://github.com/jstjohn/SeqPrep

https://github.com/jstjohn/SeqPrep

http://biopython.org/

http://biopython.org/

http://biopython.org/DIST/docs/tutorial/Tutorial.html



http://galaxy.psu.edu/



http://solexaqa.sourceforge.net/

http://solexaqa.sourceforge.net/

15

Many Analysis Pipelines Start with Read Mapping

http://www.broadinstitute.org/gatk/guide/topic?name=best-practices http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Read Mapping

http://www.broadinstitute.org/igv/

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads



Sequence References and Annotations

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome Comprehensive reference information

http://hgdownload.cse.ucsc.edu/downloads.html Comprehensive reference, annotation, and translation information

ftp://[email protected]/bundle References and SNP information data by GATK Human only

http://cufflinks.cbcb.umd.edu/igenomes.html Pre-indexed references and gene annotations for Tuxedo suite Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast

http://www.repeatmasker.org

http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome

http://hgdownload.cse.ucsc.edu/downloads.html

ftp://[email protected]/bundle

http://cufflinks.cbcb.umd.edu/igenomes.html

http://www.repeatmasker.org/

Fasta Sequence Format

>chr1…TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggtagccaccaggcacatgcagccactgagcacttgaaatgtggatagtctgaattgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctagaaaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaataaccatatttgggatatactggattttaaaaatatatcactaatttcat…>chr2…>chr3…

• One or more sequences per file• “>” denotes beginning of sequence or contig• Subsequent lines up to the next “>” define sequence• Lowercase base denotes repeat masked base• Contig ID may have comments delimited by “|”

Read Mapping Novoalign

(3.0)SOAP3

(version 91)BWA

(0.7.4)Bowtie2(2.1.0)

Tophat2(2.0.8b)

STAR(2.3.0e)

License Commercial GPL v3 GPL v3 Artistic Artistic GPL v3

Mismatch allowed

up to 8 up to 3 user specified. max is function of read length and error rate

user specified uses Bowtie2 user specified

Alignments reported per read

random/all/none random/all/none user selected user selected uses Bowtie2 user selected

Gapped alignment

up to 7bp 1-3bp gap yes yes yessplice junctionsintrons

yessplice junctionsintrons

Pair-end reads yes yes yes yes yes yes

Best alignment highest alignment score

minimal number of mismatches

minimal number of mismatches

highest alignment score

uses Bowtie2 highest alignment score

Trim bases 3’ end 3’ end 3’ and 5’ end 3’ and 5’ end uses Bowtie2 3’ and 5’ end

Comments At one time, best performance and alignment quality

Element of Broad’s “best practices” genotyping workflow

Smith-Waterman quality alignments, currently fastest

Currently most popular RNA-seq aligner

Very fast; uses memory to achieve performance

BWA Features• Uses Burrows Wheeler Transform

— fast— modest memory footprint (<4GB)

• Accurate• Tolerates base mismatches

— increased sensitivity — reduces allele bias

• Gapped alignment for both single- and paired-ended reads• Automatically adjusts parameters based on read lengths and error

rates• Native BAM/SAM output (the de facto standard)• Large installed base, well-supported• Open-source (no charge)

Read Mapping: BWA

21

Read Mapping: Bowtie 2BOWTIE2 Uses dynamic programming (edit distance scoring)

o Eliminates need for realignment around indelso Can be tuned for different sequencing technologies

Multi-seed search - adjustable sensitivity Input read length limited only by available memory Fasta or Fastq input Caveats

o Longer input reads require much more memoryo Trade-off parallelism with memory requirement

http://bowtie-bio.sourceforge.net/bowtie2

Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2, Nature Methods. 2012, 9:357-359

Dynamic Programming Illustration

SAM (BAM) FormatSEQUENCE ALIGNMENT/MAP FORMAT Universal standard Human-readable (SAM) and compact (BAM) forms

STRUCTURE Header

› version, sort order, reference sequences, read groups, program/processing history

Alignment records

[benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam@HD VN:1.0 GO:none SO:coordinate@SQ SN:chrM LN:16571@SQ SN:chr1 LN:249250621@SQ SN:chr2 LN:243199373@SQ SN:chr3 LN:198022430…@SQ SN:chr19 LN:59128983@SQ SN:chr20 LN:63025520@SQ SN:chr21 LN:48129895@SQ SN:chr22 LN:51304566@SQ SN:chrX LN:155270560@SQ SN:chrY LN:59373566…@RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1@RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1@RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1@RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1@RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1…@RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023@PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null@PG ID:bwa PN:bwa VN:0.6.2-r126

SAM/BAM Format: Header

samtools to view bamheadersort order

reference sequence names with lengths

read groups with platform, library and sample information

program (analysis) history

[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam

HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/RG:Z:86-191

HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0RG:Z:SDH023* Many fields after column 12 deleted (e.g., recalibrated base scores) have been deleted for improved readability

SAM/BAM Format: Alignment Records

http://samtools.sourceforge.net/SAM1.pdf

13 4 5 6 8 9

10

11

2


25

Preparing for Next Steps

SUBSEQUENT STEPS REQUIRE SORTED AND INDEXED BAMS Sort orders: karyotypic, lexicographical Indexing improves analysis performance

PICARD TOOLS: FAST, PORTABLE, FREE http://picard.sourceforge.net/command-line-overview.shtml Sort: SortSam.jar Merge: MergeSamFiles.jar Index: BuildBamIndex.jar

ORDER: SORT, MERGE (OPTIONAL), INDEX

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

http://picard.sourceforge.net/command-line-overview.shtml

Duplicate Marking

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

$java -Xmx4g -jar <path to picard>/MarkDuplicates.jar \INPUT=aligned.sorted.bam \OUTPUT=aligned.sorted.dedup.bam \VALIDATION_STRINGENCY=LENIENT \METRICS_FILE=aligned.dedup.metrics.txt \REMOVE_DUPLICATES=false \ASSUME_SORTED=true

http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

http://picard.sourceforge.net/command-line-overview.shtml%23MarkDuplicates

[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam

HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/RG:Z:86-191

SAM/BAM Format: Alignment Records


http://picard.sourceforge.net/explain-flags.html


http://picard.sourceforge.net/explain-flags.html

Local RealignmentBWT-BASED ALIGNMENT IS FAST FOR MATCHING READS TO REFERENCE

INDIVIDUAL BASE ALIGNMENTS OFTEN SUB-OPTIMAL AT INDELS

APPROACH Fast read mapping with BWT-based aligner Realign reads at indel sites using gold standard (but much slower)

Smith-Waterman algorithm

BENEFITS Refines location of indels Reduces erroneous SNP calls Very high alignment accuracy in significantly less time, with fewer

resources1Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

Local Realignment

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Post re-alignment at indelsRaw BWA alignment

STEP 1: Find covariates at non-dbSNP sites using:Reported quality scoreThe position within the readThe preceding and current nucleotide (sequencer properties)

java -Xmx4g -jar GenomeAnalysisTK.jar \-T BaseRecalibrator \-I alignment.bam \-R hg19/ucsc.hg19.fasta \-knownSites hg19/dbsnp_135.hg19.vcf \-o alignment.recal_data.grp

STEP 2: Generate BAM with recalibrated base scores:

java -Xmx4g -jar GenomeAnalysisTK.jar \-T PrintReads \-R hg19/ucsc.hg19.fasta \-I alignment.bam \-BQSR alignment.recal_data.grp \-o alignment.recalibrated.bam

Base Quality Recalibration

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

Base Quality Recalibration (Cont’d)

32

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

Is there an easier way to get started?!

http://galaxyproject.org/ Click on “Use Galaxy”

http://galaxyproject.org/

http://galaxyproject.org/

Getting Started

35

Raw reads


Mapping

DuplicateMarking

Local realignment


Analysis-readyreads

Documents

Read Processing and Mapping: From Raw to Analysis-ready Reads BEN PASSARELLI STEM CELL INSTITUTE GENOME CENTER NGS WORKSHOP 31 MAY 2013