Data Preprocessing Preprocessing and SNP calling · Preprocessing and SNP calling Natasja S. Ehlers, PhD student Center for Biological Sequence Analysis Functional Human Variation

36626 - Next Generation Sequencing Analysis

Preprocessing and SNP calling
Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group

Data Preprocessing
Next Generation Sequencing analysis

DTU Bioinformatics


Generalized NGS analysis

• Question

• Raw reads

• Pre-processing

• Assembly: Alignment / de novo

• Application specific: Variant calling, count matrix, ...

• Compare samples / methods

• Answer?

[Figure: flow diagram of the steps above; side axis labeled "Data size"]




Assembly: Two basic approaches

• Alignment: Use a reference genome and align your reads to the genome

• de novo assembly: Try to assemble the reads into a genome without any prior knowledge





Monday

Wednesday






But first a look at data preprocessing


Preprocessing

• Reads have qualities: base calls are not always correct!

• Different error profiles per technology

• What can we do?

  • Quality trimming

  • Adaptor clipping

  • 5’ clipping

  • k-mer correction

  • ...


Analyze data using FastQC

• Report basic statistics on your data

• Identify issues with your data


Per base sequence quality (Illumina)

• Trim from the 3’ end with a quality threshold of 20

Average quality (Illumina)

• Remove reads with average quality < 20

• Remove reads with “N” base calls
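The three rules above (3’ trimming at quality 20, dropping reads with mean quality below 20, dropping reads with "N") can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the quality strings are assumed to use the Phred+33 encoding; dedicated trimming tools typically use more refined windowed or running-sum schemes.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    scores = phred_scores(qual, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, qual, min_avg_q=20, offset=33):
    """Reject empty reads, reads containing N, and reads with low mean quality."""
    if not seq or "N" in seq.upper():
        return False
    scores = phred_scores(qual, offset)
    return sum(scores) / len(scores) >= min_avg_q


# 'I' is Q40, '5' is Q20 and '#' is Q2 in Phred+33
print(trim_3prime("ACGTACGTAC", "IIIIIII5##"))  # ('ACGTACGT', 'IIIIIII5')
```

Trimming before filtering matters: a read with a poor 3’ tail may pass the average-quality filter once the tail is gone.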


Trim from 5’

• Sometimes something is fishy at the beginning of the read

• Clip a certain number of bases from the 5’ end
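Clipping a fixed number of bases from the 5’ end is simple, but the sequence and the quality string must be cut together so they stay the same length. A minimal sketch (the function name `clip_5prime` is hypothetical):

```python
def clip_5prime(seq, qual, n_bases):
    """Remove the first n_bases from both the sequence and its quality string."""
    if n_bases < 0:
        raise ValueError("n_bases must be non-negative")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    return seq[n_bases:], qual[n_bases:]


print(clip_5prime("NNACGT", "##IIII", 2))  # ('ACGT', 'IIII')
```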


Adapters

• Sometimes adapters/primers are also part of the read

• Adapters/primers are non-biological sequences

• Short read alignment is global, so reads containing adapters are a no-go

• de novo assembly will be confused: adapters look like artificial repeats

• If you don’t know which were used: FastQC will (may) find them for you!


Adapters - example

We will use “Cutadapt” and “AdapterRemoval” to cut adapters; many other options exist

Very important if your DNA fragment is shorter than the read length
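To make the idea concrete, here is a deliberately naive sketch of 3’ adapter clipping. It is not how Cutadapt or AdapterRemoval work internally (those tools tolerate mismatches and use alignment); it only handles exact matches, and the function name, the `min_overlap` parameter and the adapter sequence in the example are assumptions chosen for illustration.

```python
def clip_adapter(read, adapter, min_overlap=3):
    """Remove an exactly matching 3' adapter, complete or truncated by the read end."""
    # Case 1: the whole adapter is inside the read (fragment shorter than the read).
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter fits before the read ends.
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:len(read) - length]
    return read


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used here only for illustration
print(clip_adapter("ACGTACGTTTGA" + adapter + "TTT", adapter))  # ACGTACGTTTGA
print(clip_adapter("ACGTACGTTTGAAGATCG", adapter))              # ACGTACGTTTGA
```

The second call shows the partial case: the read ends in the first six bases of the adapter, which is exactly what happens when the fragment is only slightly shorter than the read length.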


454 / Ion Torrent data

• Main problem is indels at homopolymer runs

• (Trim homopolymers), trim trailing poor quality bases

• Remove very short reads

• For de novo assembly, adapters should be removed (Prinseq)

• For alignment we use Smith-Waterman (local), so this is less important

Figure: Prinseq output


k-mer correction

• What is a k-mer?

• Create a sliding window of size k, move it over all your reads and count the occurrences of each k-mer

• We can use this to correct sequencing errors!

DNA: ACGTGTAACGTGACGTTGGA

E.g. k=5: ACGTG, CGTGT, GTGTA, ...
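The sliding window can be written directly in Python. The sketch below uses the DNA string from the slide; the function names are made up for this example, and a real k-mer counter would also merge reverse complements and use a far more memory-efficient data structure.

```python
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of seq in order, one per window position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


dna = "ACGTGTAACGTGACGTTGGA"
print(kmers(dna, 5)[:3])                 # ['ACGTG', 'CGTGT', 'GTGTA']
print(count_kmers([dna], 5)["ACGTG"])    # 2
```

A sequence of length n yields n - k + 1 k-mers, so this 20-base example gives 16 windows for k=5.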


k-mer correction

such that the probability that a randomly selected k-mer from the space of $4^k/2$ (for odd k, considering reverse complements as equivalent) possible k-mers occurs in a random sequence of nucleotides the size of the sequenced genome G is ~0.01. That is, we want k such that

$$\frac{2G}{4^k} \simeq 0.01 \qquad (2)$$

which simplifies to

$$k \simeq \log_4 200G \qquad (3)$$
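Equation (3) is easy to evaluate numerically. The sketch below (function name hypothetical) computes the unrounded k for the two genome sizes discussed in the excerpt, roughly 5 Mbp for E. coli and roughly 3 Gbp for human; how to round the result is left to the reader, since the excerpt rounds for practical reasons.

```python
import math


def kmer_size(genome_size):
    """Unrounded k from k = log4(200 * G), with G the genome size in bases."""
    return math.log(200 * genome_size, 4)


for name, size in [("E. coli", 5_000_000), ("human", 3_000_000_000)]:
    print(f"{name}: k = {kmer_size(size):.2f}")
```

The values come out just under 15 for E. coli and between 19 and 20 for human, in line with the k=15 and k=19 used in the excerpt.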

For an approximately 5 Mbp such as E. coli, we set k to 15, and for the approximately 3 Gbp human genome, we set k to 19 (rounding down for computational reasons). For the human genome, counting all 19-mers in the reads is not a trivial task, requiring >100 GB of RAM to store the k-mers and counts, many of which are artifacts of sequencing errors. Instead of executing this computation on a single large memory machine, we harnessed the power of many small memory machines working in parallel on different batches of reads. We execute the analysis using Hadoop [43] to monitor the workflow, and also to sum together the partial counts computed on individual machines using an extension of the MapReduce word counting algorithm [45]. The Hadoop cluster used in these experiments contains 10 nodes, each with a dual core 3.2 gigahertz Intel Xeon processors, 4 GB of RAM, and 367 GB local disk (20 cores, 40 GB RAM, 3.6 TB local disk total).

In order to better differentiate true k-mers and error k-mers, we incorporate the quality values into k-mer counting. The number of appearances of low coverage true k-mers and high copy error k-mers may be similar, but we expect the error k-mers to have lower quality base calls. Rather than increment a k-mer’s coverage by one for every occurrence, we increment it by the product of the probabilities that the base calls in the k-mer are correct as defined by the quality values. We refer to this process as q-mer counting. q-mer counts approximate the expected coverage of a k-mer over the error distribution specified by the read’s quality values. By counting q-mers, we are able to better differentiate between true k-mers that were sequenced to low coverage and error k-mers that occurred multiple times due to bias or repetitive sequence.
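The q-mer idea described above can be sketched as a small change to ordinary k-mer counting: each occurrence adds the product of the per-base probabilities of being correct instead of adding one. This is an illustration under stated assumptions, not the paper's implementation: qualities are assumed to be Phred+33, the error probability of a base is taken as 10^(-Q/10), and the function name is made up.

```python
import math
from collections import defaultdict


def count_qmers(reads, k, offset=33):
    """Quality-weighted k-mer counts from (sequence, quality_string) pairs."""
    counts = defaultdict(float)
    for seq, qual in reads:
        # Probability that each base call is correct: 1 - 10^(-Q/10)
        p_correct = [1 - 10 ** (-(ord(ch) - offset) / 10) for ch in qual]
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += math.prod(p_correct[i:i + k])
    return counts


reads = [("ACGTA", "IIIII"),   # all bases Q40
         ("ACGTA", "#####")]   # all bases Q2
print(count_qmers(reads, 5)["ACGTA"])
```

The same 5-mer is seen twice, but the low-quality occurrence contributes almost nothing, so the q-mer count stays close to 1 rather than 2.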

Coverage cutoff

A histogram of q-mer counts shows a mixture of two distributions - the coverage of true k-mers, and the coverage of error k-mers (see Figure 3). Inevitably, these distributions will mix and the cutoff at which true and error k-mers are differentiated must be chosen carefully [46]. By defining these two distributions, we can calculate the ratio of likelihoods that a k-mer at a given coverage came from one distribution or the other. Then the cutoff can be set to correspond to a likelihood ratio that suits the application of the sequencing. For instance, mistaking low coverage k-mers for errors will remove true sequence, fragmenting a de novo genome assembly and potentially creating mis-assemblies at repeats. To avoid this, we can set the cutoff to a point where the ratio of error k-mers to true k-mers is high, for example 1,000:1.

In theory, the true k-mer coverage distribution should be Poisson, but Illumina sequencing has biases that add variance [26]. Instead, we model true k-mer coverage as Gaussian to allow a free parameter for the variance. k-mers that occur multiple times in the genome due to repetitive sequence and duplications also complicate the distribution. We found that k-mer copy number in various genomes has a ‘heavy tail’ (meaning the tail of the distribution is not exponentially bounded) that is

[Figure: density of k-mer coverage (x-axis: Coverage, 0 to 100; y-axis: Density), with separate curves for true k-mers and error k-mers]

Figure 3 k-mer coverage. 15-mer coverage model fit to 76× coverage of 36 bp reads from E. coli. Note that the expected coverage of a k-mer in the genome using reads of length L will be $\frac{L-k+1}{L}$ times the expected coverage of a single nucleotide because the full k-mer must be covered by the read. Above, q-mer counts are binned at integers in the histogram. The error k-mer distribution rises outside the displayed region to 0.032 at coverage two and 0.691 at coverage one. The mixture parameter for the prior probability that a k-mer’s coverage is from the error distribution is 0.73. The mean and variance for true k-mers are 41 and 77 suggesting that a coverage bias exists as the variance is almost twice the theoretical 41 suggested by the Poisson distribution. The likelihood ratio of error to true k-mer is one at a coverage of seven, but we may choose a smaller cutoff for some applications.
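The scaling factor in the caption, (L - k + 1) / L, can be checked with a one-line function. The sketch below (function name hypothetical) plugs in the caption's numbers: 76× base coverage, 36 bp reads and k = 15. Note that this is the idealised expectation before any errors or quality weighting; the fitted mean of 41 reported in the caption comes from the data, not from this formula.

```python
def expected_kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage: base coverage scaled by (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


print(round(expected_kmer_coverage(76, 36, 15), 1))  # 46.4
```

The factor shrinks quickly as k approaches the read length, which is one reason short reads limit the usable k.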

Kelley et al. Genome Biology 2010, 11:R116 http://genomebiology.com/2010/11/11/R116

ACGTGGTTGCCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA

Kelley et al., 2010

Concept: Rare k-mers are seq. errors

Need >15X coverage



Merge paired ends

Insert size: 500 nt, reads: 100 nt, middle: 300 nt

Insert size: 180 nt, reads: 100 nt, middle: -20 nt (overlap)

• Merge overlapping pairs: single longer read

• Smart because Illumina reads have bad 3’ quals

• Very useful for de novo assembly

Magoč and Salzberg, 2011
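The "middle" values on the slide are simply the insert size minus twice the read length; a negative value means the two reads overlap. The sketch below reproduces that arithmetic and adds a toy merge that only accepts exact overlaps. It is not the algorithm of the tool cited above, which scores overlaps and tolerates mismatches; all function names and the `min_overlap` parameter are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (upper-case A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def middle_length(insert_size, read_length):
    """Unsequenced middle of the fragment; negative means the mates overlap."""
    return insert_size - 2 * read_length


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair into one sequence if the mates overlap exactly."""
    mate = reverse_complement(read2)  # bring read 2 onto the strand of read 1
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None  # no sufficiently long exact overlap found


print(middle_length(500, 100))  # 300
print(middle_length(180, 100))  # -20
```

For example, a 26 nt fragment sequenced with 16 nt reads from both ends has a 6 nt overlap, and `merge_pair(read1, read2, min_overlap=5)` returns the full fragment.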


Coverage

• Coverage (depth) is how many times, on average, your data covers each position in the genome

• Example:

  • N: Number of reads: 5 million

  • L: Read length: 100 nt

  • G: Genome size: 5 Mbases

  • C = N × L / G = (5 million × 100) / 5 million = 100X

• On average there are 100 reads covering each position in the genome

Coverage (general formula)

$$C = \frac{N \times L}{G}$$

G: genome size

N: number of reads

L: average read length

Example: 1,500,000,000 reads of 100 nt correspond to a human genome at 50x
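Both coverage examples can be verified with a few lines of Python. The sketch assumes a human genome size of about 3 Gbases, as quoted in the k-mer excerpt earlier; the function names are made up for this example. The second function inverts the formula to estimate how many reads a target coverage requires.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average coverage C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


print(coverage(5_000_000, 100, 5_000_000))              # 100.0
print(coverage(1_500_000_000, 100, 3_000_000_000))      # 50.0
print(reads_needed(50, 100, 3_000_000_000))             # 1500000000
```

Remember that this is an average: actual depth varies along the genome, so some positions will be covered far less than C times.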


Last, but important!

• Lots of data - storage is expensive!

• Keep data compressed whenever possible (gzip, bzip2, BAM)

• Remove intermediate files and files that can easily be re-created
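Keeping data compressed rarely gets in the way, because most tools and languages can read compressed files directly. As a small illustration, the sketch below streams records from a FASTQ file with Python's standard `gzip` module without ever writing an uncompressed copy to disk. The function name is hypothetical and the parser assumes the simple four-lines-per-record FASTQ layout.

```python
import gzip


def read_fastq(path):
    """Yield (name, sequence, quality) from a FASTQ file, gzip-compressed or plain."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip()
            handle.readline()  # the '+' separator line
            qual = handle.readline().rstrip()
            yield header[1:], seq, qual


# Usage (file name is a placeholder):
# for name, seq, qual in read_fastq("reads.fastq.gz"):
#     ...
```

Because the function is a generator, only one record is held in memory at a time, which also helps with very large files.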
