36626 - Next Generation Sequencing Analysis
Preprocessing and SNP calling Natasja S. Ehlers, PhD student Center for Biological Sequence Analysis Functional Human Variation Group
Data Preprocessing
Next Generation Sequencing analysis
DTU Bioinformatics
Generalized NGS analysis
Raw reads
Pre-processing
Assembly: Alignment / de novo
Application specific: Variant calling, count matrix, ...
Compare samples / methods
Answer?
Question
[Diagram label: Data size]
Assembly: Two basic approaches
• Alignment (Monday): Use a reference genome and align your reads to the genome
• de novo assembly (Wednesday): Try to assemble the reads into a genome without any prior knowledge
But first, a look at data preprocessing
Preprocessing
• Reads have qualities - bases are not always correct!
• Different error profiles per technology
• What can we do?
• Quality trimming
• Adapter clipping
• 5’ clipping
• k-mer correction
• ...
Analyze data using FastQC
• Report basic statistics on your data
• Identify issues with your data
Per base sequence quality
Trim from the 3’ end to quality 20 (Illumina)
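A minimal sketch of 3’ quality trimming, assuming Phred+33 encoded FASTQ quality strings. The function names and the default threshold of 20 are illustrative choices, not the behaviour of any particular trimming tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_quality=20, offset=33):
    """Drop bases from the 3' end until a base with quality >= min_quality is found."""
    scores = phred_scores(qual, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40, '#' encodes Q2 and '$' encodes Q3 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIIII#$")
print(trimmed_seq, trimmed_qual)
```

Real trimmers often use a running-sum or windowed criterion instead of stopping at the first good base; this sketch only shows the simplest variant.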
Average quality
Remove reads with average quality < 20 (Illumina)
Remove reads with “N” base calls
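The two read-level filters can be sketched as follows, assuming Phred+33 quality strings and taking the plain arithmetic mean of the Phred scores. The function names are hypothetical.

```python
def mean_quality(qual, offset=33):
    """Arithmetic mean of the Phred scores in a quality string (assumes Phred+33)."""
    scores = [ord(ch) - offset for ch in qual]
    return sum(scores) / len(scores) if scores else 0.0


def keep_read(seq, qual, min_mean_quality=20, offset=33):
    """Return True if the read has no 'N' base calls and a mean quality >= threshold."""
    if "N" in seq.upper():
        return False
    return mean_quality(qual, offset) >= min_mean_quality


reads = [("ACGT", "IIII"), ("ACNT", "IIII"), ("ACGT", "####")]
kept = [(seq, qual) for seq, qual in reads if keep_read(seq, qual)]
print(kept)
```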
Trim from 5’
• Sometimes something is fishy at the beginning of the read
Clip a certain number of bases from the 5’ end
Adapters
• Sometimes adapters/primers are also part of the read
• Adapters/primers are non-biological sequences
• Short read alignment is global, so reads containing adapters are a no-go
• de novo assembly will be confused by adapters, which look like artificial repeats
• If you don't know which adapters were used: FastQC will (may) find them for you!
Adapters - example
We will use “Cutadapt” and “AdapterRemoval” to cut adapters; many other options exist
Very important if your DNA fragment is shorter than the read length
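To illustrate the idea only, here is a simplified sketch of 3’ adapter clipping using exact string matching. It is not how Cutadapt or AdapterRemoval work internally (both tolerate mismatches); the function name, the `min_overlap` parameter and the example adapter sequence are assumptions made for this example.

```python
def clip_adapter(seq, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the whole adapter occurs inside the read (fragment shorter than read).
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Case 2: the read ends part-way into the adapter, so only an adapter
    # prefix is present at the 3' end. Try the longest prefix first.
    max_len = min(len(seq), len(adapter) - 1)
    for length in range(max_len, min_overlap - 1, -1):
        if seq.endswith(adapter[:length]):
            return seq[:len(seq) - length]
    return seq


adapter = "AGATCGGAAGAGC"  # example adapter sequence, assumed for illustration
print(clip_adapter("ACGTTGCA" + adapter + "TTT", adapter))
print(clip_adapter("ACGTTGCAAGATC", adapter))
```

A small `min_overlap` removes more adapter remnants but also clips a few genuine bases by chance, which is the usual trade-off when choosing this kind of parameter.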
454 / Ion Torrent data
• Main problem is indels at homopolymer runs
• (Trim homopolymers), trim trailing poor quality bases
• Remove very short reads
• For de novo assembly, adapters should be removed (prinseq)
• For alignment we use Smith-Waterman (local), so adapter removal is less important
Prinseq output
k-mer correction
• What is a k-mer?
• Create a sliding window of size k, move it over all your reads and count the occurrences of each k-mer
• We can use this to correct sequencing errors!
DNA: ACGTGTAACGTGACGTTGGA
E.g. k=5: ACGTG, CGTGT, GTGTA, ...
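The sliding-window count can be sketched in a few lines. The function name is illustrative, and reverse complements are not merged here, which real k-mer counters often do.

```python
from collections import Counter


def count_kmers(reads, k):
    """Slide a window of size k over every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


counts = count_kmers(["ACGTGTAACGTGACGTTGGA"], k=5)
for kmer, count in counts.most_common(3):
    print(kmer, count)
```

With many reads at sufficient coverage, k-mers that come from the genome are seen many times, while k-mers created by a sequencing error are seen only once or twice.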
k-mer correction
[...] such that the probability that a randomly selected k-mer from the space of $4^k/2$ (for odd k considering reverse complements as equivalent) possible k-mers occurs in a random sequence of nucleotides the size of the sequenced genome G is ~0.01. That is, we want k such that
$$\frac{2G}{4^k} \simeq 0.01 \qquad (2)$$
which simplifies to
$$k \simeq \log_4 200G \qquad (3)$$
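As a rough numerical check of equation (3), the sketch below evaluates $\log_4 200G$ for genome sizes of about 5 Mbp and 3 Gbp. The function name is hypothetical and the genome sizes are approximations.

```python
import math


def kmer_size_estimate(genome_size):
    """Evaluate k = log4(200 * G), i.e. the k at which 2G / 4^k equals 0.01."""
    return math.log(200 * genome_size, 4)


ecoli_k = kmer_size_estimate(5_000_000)        # roughly 14.9
human_k = kmer_size_estimate(3_000_000_000)    # roughly 19.6
print(round(ecoli_k, 2), round(human_k, 2))
```

These values sit just below 15 and between 19 and 20, which matches k = 15 for a 5 Mbp genome and k = 19 for the human genome after rounding down.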
For an approximately 5 Mbp genome such as E. coli, we set k to 15, and for the approximately 3 Gbp human genome, we set k to 19 (rounding down for computational reasons). For the human genome, counting all 19-mers in the reads is not a trivial task, requiring >100 GB of RAM to store the k-mers and counts, many of which are artifacts of sequencing errors. Instead of executing this computation on a single large memory machine, we harnessed the power of many small memory machines working in parallel on different batches of reads. We execute the analysis using Hadoop [43] to monitor the workflow, and also to sum together the partial counts computed on individual machines using an extension of the MapReduce word counting algorithm [45]. The Hadoop cluster used in these experiments contains 10 nodes, each with a dual core 3.2 gigahertz Intel Xeon processor, 4 GB of RAM, and 367 GB local disk (20 cores, 40 GB RAM, 3.6 TB local disk total).
In order to better differentiate true k-mers and error k-mers, we incorporate the quality values into k-mer counting. The number of appearances of low coverage true k-mers and high copy error k-mers may be similar, but we expect the error k-mers to have lower quality base calls. Rather than increment a k-mer’s coverage by one for every occurrence, we increment it by the product of the probabilities that the base calls in the k-mer are correct as defined by the quality values. We refer to this process as q-mer counting. q-mer counts approximate the expected coverage of a k-mer over the error distribution specified by the read’s quality values. By counting q-mers, we are able to better differentiate between true k-mers that were sequenced to low coverage and error k-mers that occurred multiple times due to bias or repetitive sequence.
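A minimal sketch of quality-weighted k-mer counting in the spirit of the q-mer idea described above. It assumes Phred+33 quality strings and the standard Phred relation (error probability $10^{-Q/10}$); the function names are hypothetical and this is not the implementation used by Kelley et al.

```python
import math
from collections import defaultdict


def base_correct_probability(quality_char, offset=33):
    """Probability that a base call is correct, from its Phred score (assumes Phred+33)."""
    q = ord(quality_char) - offset
    return 1.0 - 10 ** (-q / 10)


def count_qmers(reads, k):
    """Add, for each k-mer occurrence, the product of its per-base correctness probabilities."""
    counts = defaultdict(float)
    for seq, qual in reads:
        probs = [base_correct_probability(ch) for ch in qual]
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += math.prod(probs[start:start + k])
    return counts


# The same 5-mer seen once with high-quality calls (Q40) and once with poor ones (Q10).
qmer_counts = count_qmers([("ACGTA", "IIIII"), ("ACGTA", "+++++")], k=5)
print(qmer_counts["ACGTA"])
```

The low-quality occurrence contributes noticeably less than one, so a k-mer seen only in poor-quality reads ends up with a lower weighted count than its raw count.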
Coverage cutoff
A histogram of q-mer counts shows a mixture of two distributions - the coverage of true k-mers, and the coverage of error k-mers (see Figure 3). Inevitably, these distributions will mix and the cutoff at which true and error k-mers are differentiated must be chosen carefully [46]. By defining these two distributions, we can calculate the ratio of likelihoods that a k-mer at a given coverage came from one distribution or the other. Then the cutoff can be set to correspond to a likelihood ratio that suits the application of the sequencing. For instance, mistaking low coverage k-mers for errors will remove true sequence, fragmenting a de novo genome assembly and potentially creating mis-assemblies at repeats. To avoid this, we can set the cutoff to a point where the ratio of error k-mers to true k-mers is high, for example 1,000:1.
In theory, the true k-mer coverage distribution should be Poisson, but Illumina sequencing has biases that add variance [26]. Instead, we model true k-mer coverage as Gaussian to allow a free parameter for the variance. k-mers that occur multiple times in the genome due to repetitive sequence and duplications also complicate the distribution. We found that k-mer copy number in various genomes has a ‘heavy tail’ (meaning the tail of the distribution is not exponentially bounded) that is [...]
[Figure 3 plot: x-axis Coverage (0 to 100), y-axis Density (0.000 to 0.015); curves: True k-mers, Error k-mers]
Figure 3 k-mer coverage. 15-mer coverage model fit to 76× coverage of 36 bp reads from E. coli. Note that the expected coverage of a k-mer in the genome using reads of length L will be $\frac{L - k + 1}{L}$ times the expected coverage of a single nucleotide because the full k-mer must be covered by the read. Above, q-mer counts are binned at integers in the histogram. The error k-mer distribution rises outside the displayed region to 0.032 at coverage two and 0.691 at coverage one. The mixture parameter for the prior probability that a k-mer’s coverage is from the error distribution is 0.73. The mean and variance for true k-mers are 41 and 77 suggesting that a coverage bias exists as the variance is almost twice the theoretical 41 suggested by the Poisson distribution. The likelihood ratio of error to true k-mer is one at a coverage of seven, but we may choose a smaller cutoff for some applications.
Kelley et al. Genome Biology 2010, 11:R116
http://genomebiology.com/2010/11/11/R116
ACGTGGTTGCCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
Kelley et al., 2010
Concept: Rare k-mers are sequencing errors
Need >15X coverage
Merge paired ends
Insert size: 500 nt, Reads: 100 nt, Middle: 300 nt
Insert size: 180 nt, Reads: 100 nt, Middle: -20 nt (overlap)
• Merge overlapping pairs into a single longer read
• Useful because Illumina reads have poor qualities at the 3’ end
• Very useful for de novo assembly
Magoč and Salzberg, 2011
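The arithmetic behind the two cases, and a deliberately simplified merge, can be sketched as follows. This uses exact matching only and is not the algorithm of Magoč and Salzberg, which scores overlaps and tolerates mismatches; all function names and the `min_overlap` default are assumptions.

```python
def middle_length(insert_size, read_length):
    """Unsequenced middle of a fragment; a negative value means the reads overlap."""
    return insert_size - 2 * read_length


def reverse_complement(seq):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10):
    """Merge a pair if the 3' end of read1 exactly matches the start of read2's reverse complement."""
    read2_rc = reverse_complement(read2)
    max_overlap = min(len(read1), len(read2_rc))
    # Prefer the longest overlap; give up below min_overlap.
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if read1[-overlap:] == read2_rc[:overlap]:
            return read1 + read2_rc[overlap:]
    return None


print(middle_length(500, 100), middle_length(180, 100))

# Toy 30 nt fragment sequenced with 20 nt reads from both ends (10 nt overlap).
fragment = "ACGTTGCATGCCGATAGCTAGGCTAACGTT"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
print(merge_pair(read1, read2) == fragment)
```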
Coverage
• Coverage/depth is how many times your data covers the genome (on average)
• Example:
• N: Number of reads: 5 million
• L: Read length: 100
• G: Genome size: 5 Mbases
• C = N × L / G = 5 million × 100 / 5 million = 100X
• On average there are 100 reads covering each position in the genome
$$C = \frac{N \times L}{G}$$
G: genome size
N: number of reads
L: average read length
Example: 1,500,000,000 reads of 100 nt correspond to a human genome at 50X coverage
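Both coverage examples can be reproduced with a one-line function. The human genome size of 3 Gbp used here is an approximation, and the function name is illustrative.

```python
def coverage(num_reads, read_length, genome_size):
    """Average depth C = N * L / G."""
    return num_reads * read_length / genome_size


bacterial = coverage(5_000_000, 100, 5_000_000)        # 5 million reads, 5 Mbase genome
human = coverage(1_500_000_000, 100, 3_000_000_000)    # assumes a ~3 Gbp human genome
print(bacterial, human)
```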
Last, but important!
• Lots of data - storage is expensive!
• Keep data compressed whenever possible (gzip, bzip2, BAM)
• Remove intermediate files and files that can easily be re-created
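Many tools read gzip-compressed FASTQ directly, so files rarely need to be decompressed on disk. As a small sketch using Python's standard `gzip` module (the helper names and the four-line FASTQ layout without wrapped sequences are assumptions):

```python
import gzip


def write_fastq_gz(path, records):
    """Write (name, sequence, quality) records as gzip-compressed FASTQ."""
    with gzip.open(path, "wt") as handle:
        for name, seq, qual in records:
            handle.write(f"@{name}\n{seq}\n+\n{qual}\n")


def read_fastq_gz(path):
    """Stream records from a gzip-compressed FASTQ file without unpacking it on disk."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual
```

Reading and writing through the compressed stream keeps only the `.gz` file on disk, at the cost of some CPU time.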