Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
ESSAY
Improving somatic mutation detection in cancer
Wouter Huiting, MSc
1769863
13-8-2016
Supervisor: Dr. C. van Diemen
2
Contents Preface ............................................................................................................................................................ 3
Abstract .......................................................................................................................................................... 4
1. Introduction .......................................................................................................................................... 5
1.1 The challenges of somatic variant calling in cancer ................................................................ 6
1.2 Goal of this essay ......................................................................................................................... 7
......................................................................................................................................................................... 9
2. Optimizing the workflow of tumour variant calling ..................................................................... 10
2.1 DNA extraction and shearing .................................................................................................. 10
2.2 Library preparation .................................................................................................................... 11
2.3 High-throughput sequencing: base-calling, depth and coverage ........................................ 13
2.4 Read quality assessment and pre-processing ......................................................................... 15
2.5 Read mapping ............................................................................................................................. 16
2.6 Variant calling, annotation and prioritization ........................................................................ 19
3. Discussion ........................................................................................................................................... 21
4. References ........................................................................................................................................... 24
3
Preface
This essay was written in the spring of 2016 in the context of my Master education Behavioral
and Cognitive Neuroscience at the University of Groningen. The execution of this essay was
supervised by Dr. C. van Diemen, whom I would like to thank for the opportunity.
4
Abstract
Somatic variation analysis is important to help us understand the onset and progression of
cancer. Unfortunately, although next-generation sequencing technology has advanced rapidly
over the last decade, most NGS strategies still prove inadequate to accurately grasp the molecular
complexity of cancer, a fact that is largely the result of intratumour heterogeneity. In addition, the
high-throughput of modern NGS platforms has also left us with the difficult task of managing
and analysing vast amounts of data, forcing researchers to rely heavily on the use of
bioinformatics. Advances in experimental and computational techniques designed to cope with
these challenges occur quickly. The result is a rapidly evolving workflow of somatic tumour
variant analysis. It is important that the entire cancer research community is informed regularly
on these advances. Here I try to provide an overview of the complete workflow of tumour
variant analysis in a way that is relevant to all those involved in cancer research. In addition, I
highlight several weak links in this workflow and provide recommendations on how to cope with
them.
5
1. Introduction
Since the dawn of next-generation sequencing (NGS) more than a decade ago, the sequencing
technology has evolved immensely, driving speed and throughput to an unprecedented level
(Goodwin et al. 2016). The financial costs of sequencing an entire human genome have been
brought down from roughly US$10 million to little over a thousand dollars in just a few years’
time (https://www.genome.gov/27541954/dna-sequencing-costs-data). NGS technologies are
now widely used in laboratories around the world, and many acclaim their potential as a
diagnostic tool (Katsanis et al. 2013; Vrijenhoek et al. 2015). The development and widespread
adoption of NGS technology would not have been possible without the efforts of Frederic
Sanger and his colleagues almost forty years ago. In 1977, it was Sanger who introduced the first
automated sequencing method based on his ground breaking ‘chain-termination’ technique. Due
to its high accuracy (e.g. through long read lengths) and ease of use, Sanger sequencing would
become the dominant DNA sequencing technology for the next decades. Around the turn of the
millennium fundamentally different sequencing methods started to spring up, collectively referred
to as ‘next-generation sequencing’ (NGS) techniques. Aided by the development of new
technologies such as high-resolution imaging, these exciting methods offered several advantages
over Sanger sequencing. Most importantly, they all allowed mass parallelisation of sequencing,
greatly improving throughput (Shendure, Ji. 2008). See Box 1 for an overview of the principles
behind Sanger and modern NGS platforms; an in depth overview lies beyond the scope of this
essay (for an excellent recent review see for example Goodwin et al. 2016).
One field that has particularly benefited from this development is cancer research. Cancer
comprises a group of diseases that arise from ‘a clone that has accumulated the requisite
somatically-acquired genetic aberrations, leading to malignant transformation’ (Stratton. 2011;
Watson et al. 2013). The development of effective therapies for cancer patients requires a
comprehensive assessment of the role of these somatic variants in tumour formation (Wang et al.
2013). Before the advent of NGS, studies relied on a costly and low-throughput workflow of
PCR amplification followed by Sanger sequencing to identify candidate cancer drivers. When
NGS technologies became widely used, this meant that the cancer research field could initiate
systematic sequencing ‘screens’ to identify such somatic variants much more effectively. NGS
studies have since then contributed greatly to our understanding of the mutations that drive
tumourigenesis in different tumour types (Watson et al. 2013). However, accurate somatic
mutation ‘calling’ remains highly challenging still.
6
1.1 The challenges of somatic variant calling in cancer
True somatic variant are very difficult to distinguish. Besides the errors in the sequencing process
itself, e.g. contaminations and sequencing errors, there are issues inherent to the short-read
nature of NGS: amplification bias and ambiguities in short read mapping, to name a few (Wang et
al. 2013). However, most of these problems arise with every NGS-project. What makes somatic
cancer variants in particular difficult to identify is the intratumour heterogeneity (Fidler. 1978;
Marusyk et al. 2012). As tumours evolve from single cells, clonal lineages begin to diverse, giving
rise to distinct subpopulations within the tumour (Gerlinger et al. 2012). It is this genomic
diversity that drives tumour proliferation, enabling cancer cells to survive a range of selective
pressures from the tumour’s microenvironment: pH, hypoxia, therapy and others. (Davis, Navin.
2016). The result is a highly resistant tumour in which variants are non-uniformly present
(Landau et al. 2013), with variant allele frequencies (VAFs) as low as 5% having been reported
(Carter et al. 2012). Several sequencing strategies have been employed to disentangle this
heterogeneity, and they can be divided into two different types of strategies. Methods belonging
to the first type aim to sample from different places in the tumour and sequence them
simultaneously. This can be executed with different macroscopic regions of the tumour mass, a
method referred to as multiregion sequencing (Gerlinger et al. 2012; Yates et al. 2015), but also
on a microscopic scale using a range of single cells (Navin et al. 2011; Xu et al. 2012). A related
technique is to fluorescently label and sort cells from the tumour to achieve homogenous
subpopulations of cancer cells (Bolognesi et al. 2016). The latter method has the added benefit of
removing healthy cells from further analysis. The second strategy is to sequence ultra-deep (Nik-
Zainal et al. 2012). As I will discuss later, this strategy has shown great promise of substantially
improving the identification of tumour variants, even with very low VAFs (Griffith et al. 2015).
Although choosing the appropriate sequencing method is key, other aspects have to be
taken into account as well. Among them are practical considerations, for example the choice of a
valid control input to be able to accurately distinguish germline variation from true somatic
variants. Others are more strategic, like the choice to include or omit a PCR amplification step.
As we will see further down, all of these can influence the outcome of the tumour variant
analysis. One element in particular in which there is still significant room for improvement is the
use of bioinformatics tools in the various data analysis steps of somatic variant calling. Modern
NGS platforms generate large amounts of heterogeneous data, with higher error rates and
(generally) shorter read lengths compared to Sanger sequencing platforms (see Box 1c). Because
of this, NGS poses considerable challenges for data management and computational analysis
(Schadt et al. 2010). Consequently, somatic variant analysis is forced to rely heavily on
7
bioinformatics (Pabinger et al. 2013). A wide range of algorithms has been written to aid in
specific parts of the data analysis (Li, Homer. 2010; Bao et al. 2014), but the fact that there are so
many makes choosing the right toolset a strenuous task, especially for the less experienced user
(Pabinger et al. 2013).
1.2 Goal of this essay
Further advances in somatic variation analysis are important to help us understand the onset and
progression of cancer. As existing sequencing technologies are constantly improved, and new
technologies emerge at a fast pace, it is key that the cancer research field is informed regularly on
new methods, tools etc. Indeed, excellent reviews are periodically written on how to optimize
variant calling, but these often highlight only one aspect of the entire workflow. Moreover, the
majority of these reviews is directed at geneticists and bioinformaticians (see for instance Alioto et
al. 2015 or Griffith et al. 2015). This makes it challenging for clinical researchers and cell
biologists to keep track of the advances in somatic variant calling methods in cancer. As NGS
techniques and the associated data analysis are becoming more and more complex, the danger of
a loss of crosstalk between bioinformatics and (clinically driven) biological research arises. This
can result in a knowledge gap that will directly affect the clinical impact of NGS. Such a gap
could prevent clinicians and researchers from smaller labs to exploit the power of NGS
technology in the future. To prevent such a schism from happening it is crucial that we bring
these fields closer to each other. For these reasons, I here present in interdisciplinary review of
the entire workflow of tumour variant calling, starting from the bench-work. For clarity, I focus
on several elements and principles that have arguably the largest effect on the accuracy of somatic
variant calls, including library preparation, sequencing depth and coverage, as well as the choice
of mappers and callers; other factors will be covered more briefly.
8
9
Box 1 (continued)
d. Sanger and Illumina sequencing
(I) Sanger sequencing workflow. (II-IV) General workflow of Illumina sequencing, the current
marketleader in NGS platforms. See also Box 1a. Images taken from Mardis, 2013. For additional
information, the reader is referred to the excellent recent reviews by Goodwin et al. (2016)
Template-strand
5’ 3’ 3’ 5’
5’
5’
5’
5’
5’
5’
5’
Direction of
electrophoresis
3’
T
C
G
A
A
T
C
5’
Direction of
sequence read
-
-
- - - -
5’ 3’
5’
10
2. Optimizing the workflow of tumour variant calling
2.1 DNA extraction and shearing
Tumour DNA is typically obtained from formaldehyde fixed-paraffin embedded (FFPE) tumour
samples, the standard preservation format for diagnostic surgical pathology (Kokkat et al. 2013).
In parallel to sampling the tumour, DNA is extracted from peripheral blood to serve as a
‘normal’ genome. This allows true somatic variants in the tumour to be discriminated from the
host’s germline variants later in the workflow (Shah et al. 2009; see also Figure 2). If no peripheral
blood is available, DNA from normal tissue surrounding the tumour in the FFPE sample is
sometimes taken as a control (Pleasance et al. 2010). Importantly, choosing the proper source for
a healthy control is no trivial task. For instance, peripheral blood often contains circulating
tumour cells (Pantel, Speicher. 2015), and tissue surrounding the tumour may appear to be
healthy but can in fact harbour cells with significant genomic defects (Sadanandam et al. 2012;
Troester et al. 2016). Both situations are unwanted, as they prevent the accurate identification of
tumour-related somatic alterations. Another potential problem stems from the notion that the
fixation and embedding needed to make FFPE tumour samples can damage the DNA (Ben-Ezra
et al. 1991; Williams et al. 1999). This not only reduces the amount of DNA available for
sequencing, it can also lead to inaccurate variant calling (Akbari et al. 2005). Pre-analytical
molecular sample characterization has been proposed to correct for these problems (Sah et al.
2013).
Next, the obtained DNA has to be sheared into the short oligonucleotides - typically tens
or hundreds of nucleotides long - required for high-throughput NGS (Poptsova et al. 2014).
Although several DNA shearing methods exist, all of them rely on one of two principles:
shearing by mechanical force or shearing through enzymatic fragmentation (Knierim et al.
2011).Few studies have investigated the effects of the various DNA fragmentation protocols on
variant calling accuracy, as it has long been assumed that shearing occurs in a random (i.e.
unbiased) manner. Indeed, there is data supporting the idea that DNA fragmentation is random,
regardless of the method employed, leading some to suggest that ‘a fragmentation method can be
chosen solely according to lab facilities, feasibility and experimental design’ (Knierim et al. 2011).
However, several studies suggest that DNA shearing is in fact subject to a ‘fragmentation bias’, as
both enzymatic- and mechanical fragmentation were shown to be sequence specific (Hansen et al.
2010; Grokhovsky et al. 2011). In an interesting study from 2014, Poptsova and coworkers
showed that ultrasound shearing of genomic DNA can cause an amplified (i.e. higher than
chance) cleavage of GC-rich areas, likely as a result of local variations in DNA structural
dynamics. Regardless of the exact sequence that is preferred, a fragmentation bias will result in a
11
non-random distribution of fragment lengths and sequence ends, which in turn leads to a non-
uniform read coverage after the alignment step (see section 3.3; Finotello et al. 2014). It is
important that sequencing studies are aware of this problem, and try to correct for it. The
solution could be a two-step fragmentation protocol, although that has only been validated for
ChIP-seq (Mokry et al. 2010). Alternatively, it might be possible to negate the fragmentation bias
by adding specific chemical agents in the shearing step (Grokhovsky et al. 2011).
2.2 Library preparation
An important principle of library preparation is ‘library complexity’, referring to the number of
unique fragments present in the library. The main goal when preparing any sequencing library is
to make it as complex (i.e. diverse) as possible, so that it accurately reflects the complexity of the
original genetic sequence (Head et al. 2014), and at the same time avoid bias (Van Dijk et al. 2014).
Once the dsDNA is properly fragmented, the sequencing library can be made. In this
process oligonucleotide adapters (specific to the NGS platform used) are attached to the ends of
the fragments, preparing them for sequencing. Fragment ends are processed before the adapters
can be ligated to generate the blunt-ended fragments that are required for adapter ligation (Head
et al. 2014). End-processing is a two-step process that starts with the enzymatic blunting and 5’
phosphorylation of both sides of the fragment, followed by the addition of an adenine nucleotide
to the 3’ ends. This A-tail not only reduces the risk of fragment-chimeras, but also facilitates the
ligation of the T-tailed adapter oligonucleotides (Quail et al. 2008). This is a crucial step in the
variant calling workflow, as the use of different protocols can result in marked variation in variant
calling effectiveness (see Rhodes et al. 2014; Alioto et al. 2015).
After end-repair and A-tailing, adapters are ligated to the fragments. These adapters are a
crucial part of library preparation as they hybridize to sequencing primers during the sequencing
reaction. Importantly, the choice of adapters not only depends on the NGS platform used, but
also on whether single-end or paired-end reads are pursued (see further down). As a result of
imperfect end-repair or A-tailing, artefacts can arise in the library, including adapter dimers. As
these dimers form clusters very efficiently during sequencing, they take up valuable space and
thus waste the capacity of the sequencing platform (Head et al. 2014). A size-selection step right
after shearing (but before adapter ligation), has been proposed as a means of reducing the effect
of adapter-chimeras (Quail et al. 2009). This also yields a tighter distribution of fragment sizes
resulting in a more homogenous PCR amplification, which in turn will provide a more uniform
read coverage after the alignment step (see section 3.3; Quail et al. 2009).
Standard NGS library preparation protocols then rely on a PCR step to amplify their
library (i.e. enriching for properly ligated fragments) before sequencing (Kebschul, Zador. 2015).
12
The need for this stems largely from the notion that libraries should be carefully quantified
before sequencing commences (see further down), and most quantification protocols require
large amounts of DNA to ensure accurate titration (Meyer et al. 2008; Parkinson et al. 2012).
Unfortunately, PCR is an inherently sensitive process: it not only skews data (and thus reduce
library complexity) but it can also introduce hybrid or erroneous sequences into the library (Aird
et al. 2011). Different factors are thought to underlie these PCR-induced imperfections, among
which fragment length, content dependent amplification of sequences, template switching and
even the intrinsic stochasticity of PCR (Kebschul, Zador. 2015). Variation in the PCR parameters
(temperature, polymerase, buffer) plays a substantial role in this, which can be exacerbated with
every PCR cycle (Dabney, Meyer. 2012; Meyer, Liu. 2014). Because of these issues with PCR
amplification, some studies opt to exclude this step from library preparation (Kozarewa et al.
2009; Quail et al. 2009). The idea behind this is that eliminating the PCR amplification leads to
improved coverage of regions with a high GC content and reduces the amount of duplicate reads
after sequencing (Kozarewa et al. 2009). Omitting PCR indeed results in a more homogenous
distribution of reads and can thus improve variant detection in cancer (Alioto et al. 2015).
However, a PCR-free approach does require a larger amount of starting material, and as FFPE
samples generally yield small quantities of DNA, a PCR-free approach to identify somatic tumour
variants might not always be possible (Luthra et al. 2015).
The last step of library preparation is quantification, during which the precise amount of
adapter ligated fragments present in the library is evaluated. This allows the researcher to load the
correct amount of sample onto the sequencing station (Liu et al. 2012; Loman et al. 2012). This is
important, as sequencing experiments performed with too many or too few correctly ligated
library fragments can yield poor data quality (Laurie et al. 2013). Several methods for library
quantification exist, but the most widespread method is real-time quantitative PCR (qPCR). This
is mainly due to its ability to assess only the amount of adapter ligated fragments (Buehler et al.
2010). Unfortunately, qPCR has also considerable disadvantages that can compromise variant
calling. Template size and sequence content can result in an amplification bias just like with a
regular PCR (Valasek, Repa. 2005). In addition, qPCR demands that a standard curve is created
for each sample, a laborious process that is prone to inaccuracies (Yun et al. 2006). To overcome
these problems White and coworkers developed a digital emulsion PCR method to quantify
libraries (White et al. 2009). This method is based on the (massively parallel) fluorescent detection
of a probe oligonucleotide (e.g. TaqMan) added to the adapter ligated fragments. A droplet digital
PCR (ddPCR) device (e.g. QX100®, Bio-Rad) first generates thousands of droplets; most are
empty, but some contain a single amplified fragment of DNA. Next, droplets are counted and
13
assessed for fluorescence (all or nothing, so binary read-out), after which a the total number of
input molecules can be calculated (White et al. 2009). DdPCR offers superior sensitivity and
stability over convential qPCR quantification (Robin et al. 2016).
2.3 High-throughput sequencing: base-calling, depth and coverage
While new NGS instruments are being developed at an astonishing pace (Goodwin et al. 2016),
the accuracy and speed of the main NGS platforms currently in use is also constantly improved.
Several elements are of particular importance in this respect.
Base-calling algorithms turn the sequencer’s output (fluorescence in the case of Illumina,
current changes in the case of Thermo Fisher) into a base-call. However, due to the inevitable
imperfections in sequencing chemistry and signal detection, errors in base-calling can arise. For
instance, Illumina technology suffers from a number of biases owing to its technology, including
phasing (and prephasing), signal decay and cross-talk (see Figure 1; Cacho et al. 2015). The
standard Illumina base-calling algorithm, ‘Bustard’, is able to reduce the effects of these
uncertainties on base-call accuracy by explicitly modelling these biases. However, the error rates
in Bustard’s calls can still be significantly improved (up to 30%; Nielsen et al. 2011) by more
sophisticated algorithms like BlindCall and freelbis (Das, Vikalo. 2013; Renaud et al. 2013; Ye et
al. 2014). An in depth description of the mathematical principles used by these base-callers lies far
beyond the scope of this report, but an excellent review of recently developed base-callers was
recently written by Cacho and coworkers (2015). Importantly, the use of these advanced
statistical tools instead of Bustard was shown to significantly reduce false positive SNP calls in
tumour variant analysis (Nielsen et al. 2011).
Two other important concepts that are crucial for accurate somatic mutation detection in
cancer are sequencing depth and coverage. Depth and coverage are two highly related terms that
are frequently used interchangeably. Some use coverage to describe the breadth of sequence
coverage, i.e. the percentage of the target genome that is sequenced a given number of times.
Depth can then be thought of as the ‘redundancy of coverage’ (Sims et. 2014), often denoted as n
x (e.g. 30x depth). The importance of sequencing depth becomes instantly clear when one
considers the internal error rate of high-throughput, short read sequencing - without sufficient
depth, it is impossible to distinguish sequencing mistakes from real sequence variants. In
addition, a uniform coverage is needed to eliminate the underrepresentation of SNPs in specific
regions (for instance regions with high GC content) of the genome. Accordingly, an increased
depth and uniformity of coverage can rescue sequencing errors (Sims et. 2014).
14
Figure 1. Commonly modelled base-calling errors for the Illumina platform
(a) Scaled cytosine (C) intensity versus cycle of a single read. A spike indicates a potential C nucleotide at that
position. Phasing is observed as an anticipation signal in the cycle before a C (left arrow) and after (right arrows). It
occurs during the sequencing process when one or more strands within a cluster fail to incorporate the next base in
the read. The reads start lagging behind, distorting the fluorescence emissions. Prephasing occurs when two bases are
incorporated in a single cycle. (b) Maximum intensity (signal) and median intensity (noise) plotted against cycle.
During the sequencing of the complementary strand, some material may be lost , causing a decreased signal to noise
ratio known as signal decay. (c) Intensity versus fluorophore emission spectrum. The spectrum of the guanine (G)
fluorophore bleeds into the optimal spectrum of the thymine (T) filter. Thus, when a G fluorophore is excited, a T
signal will also be detected. This causes a positive correlation between the intensities of these two channels, a
phenomenon known as cross-talk. (d) Two-dimensional histogram of intensity data of the T channel versus G
channel. The G fluorophores (right arrow) transmit to the T channel, hence the positive linearity. However, the T
fluorophores do not transmit to the G channel. A similar situation occurs with A and C channels (not shown). Figure
and text adapted from Cacho et al. 2015.
For the accurate detection of germline single nucleotide variants (SNPs) a 30x average depth in
95% of the genome was shown to be sufficient (Ajay et al. 2011). Most cancer genomes (and
‘normal’ control genomes) are sequenced to comparable depths (Mardis et al. 2012; Borad et al.
2014), as previous studies indicated that depths of 15x-50x are sufficient to detect all SNPs and
15
small indels (Bentley et al. 2008; Ajay et al. 2011). However, these estimates are largely based on
high-purity tumours, while in fact most tumours exhibit severe heterogeneity (as discussed
earlier). As a result, a somatic tumour variant can have a VAF of 5% (Biankin et al. 2012) or even
lower. Such rare variant are unlikely to be picked up with a sequencing depth of 30x. A recent
study by Griffith and coworkers (2015) showed that a 30x-50x depth for whole genome
sequencing is indeed insufficient for adequate variant identification in the face of sample
contamination, aneuploidy or even moderate intratumour heterogeneity. Instead, they
recommend a depth of 500x-1000x for the discovery of novel variants, especially for those with
VAFs < 10%. Performing such ultra-deep sequencing (Wagle et al. 2012) of both the tumour and
normal genomes on current NGS platforms is extremely costly, and so a 100x-300x depth has
been proposed as a compromise (Alioto et al. 2015). In addition, it is also important that the ratio
of tumour : normal depth is kept as close to one as possible, and at least within a 10% range, as
this appears to reduce the amount of false positives (Alioto et al. 2015).
2.4 Read quality assessment and pre-processing
Upon completion of the desired amount of sequencing runs, the output first needs to be
evaluated for its quality (Pabinger et al. 2013). Modern high-throughput sequencing platforms spit
out several millions of short DNA reads with every run (Goodwin et al. 2016). Importantly, not
all of these reads meet the predefined standards as they generally contain several sequencing
artefacts (Dai et al. 2010). These errors have to be removed by trimming and filtering the reads. A
range of tools has been developed to execute the different steps of quality assessment and
subsequent pre-processing (Pabinger et al. 2013). As this is a complex process to grasp fully
without prior bioinformatics training, only the basic principles will be discussed here.
The first step entails the visualization of base quality scores included in the output of the
NGS platform. The output (that is in the text-based FASTQ format) contains not only the
predicted sequence, but also contains for every base the estimated probability of an erroneous
call (as discussed earlier). These error-probability values (P), ranging from 10% to 0.0001%, are
converted into standard quality scores or ‘Phred scores’ by calculating the -10 log (P). As a result,
the Phred scores form an array that is equal in length to the array of base calls, with values
typically ranging from 10 to 40 (in ASCII format, see http://blog.nextgenetics.net/?e=33). Note
that a 0.1% error rate in base calling translates into a Phred score of 10; higher Phred scores
mean higher estimated accuracy (Nielsen et al. 2011). Programs like FastQC (Andrews. 2010)
process these output files and produce graphical summary reports allowing one to quickly assess
the quality of the data. Next, the reads are trimmed and filtered based on both the quality scores
and sequence properties (Pabinger et al. 2013). This is important as read quality is often not
16
consistent over the entire length of a read (Huse et al. 2007; Dohm et al. 2008), and ignoring low
quality base calls hampers downstream variant calling accuracy (Olson et al. 2015). Trimmomatic,
PRINSEQ and other tools can perform various trimming tasks, among which adapter and primer
trimming and 3’ and 5’ low quality stretch trimming. It is important to also remove reads that do
not meet a minimum average base quality, as well as reads that exceed the minimum or maximum
read length threshold. ClinQC integrates multiple quality control tools, making it a highly useful
tool for clinical research (Pandey et al. 2016). Importantly, performing these various trimming and
filtering steps was shown to result in significant improvements in variant calling (Del Fabbro et al.
2013).
2.5 Read mapping
After preprocessing and quality assessment, the reads are ready for further downstream analysis.
The classic approach used by many variant analysis projects is to align (‘map’) the reads of both
the tumour and the normal sample to a validated human reference genome like GRCh37 or the
newer GRCh38 (http://www.ncbi.nlm.nih.gov/project/genome/assembly/grc/human; see also
Figure 2). This process, commonly referred to as ‘resequencing’, entails multiple complex
bioinformatical steps (an alternative is to computationally ‘stitch’ the reads together, a process
referred to as de novo assembly; see Box 2). These steps have to overcome technical hurdles that
are collectively dubbed the ‘read-mapping problem’ (Trapnell, Salzberg. 2009). The core of this
read-mapping problem is two-fold. In a practical sense, aligning billions of short sequences to a
large genome requires highly efficient algorithms in the absence of extreme computational power
(a PC only offers so much bits of memory). A more strategic problem stems from the complexity
and heterogeneity of the human genome. Chromosomes are not just simple arrays of nucleotides
in which variation consists of the occasional single SNP. On the contrary, the genomic sequence
carries insertions and deletions (indels), translocations, inversions, duplications and copy number
variants (CNVs) (Feuk et al. 2006), and these structural variants likely account for ten times more
variation among human genomes than SNPs do (Pang et al. 2010). As a consequence,
chromosomes differ from person to person (Baker. 2012), and in somatic tissue even from cell to
cell (Astolfi et al. 2010; O’Huallachain et al. 2012). This heterogeneity is exacerbated greatly in
cancer genomes due to their so-called ‘mutator phenotype’ (Albertson et al. 2009; Loeb. 2016).
While the short reads generated by modern high-throughput sequencers are great for picking up
point mutations, they are difficult to work with in the face of (large) structural rearrangements
(Trapnell, Salzberg. 2009). Indeed, NGS technologies have long been ‘biased towards typing
unique tags in the genome’ (Baker. 2012).
17
A range of alignment algorithms (‘mappers’) have been developed over the last years to
tackle the read-mapping problem, including Bowtie (Langmead et al. 2009), BWA (Li, Durbin.
2009) and SOAP/SOAP2 (Li, Yu, Li et al. 2009), to name just a few. Instead of describing for
each of these how they work and how they can be wielded to optimize tumour variant calling, I
try here to provide some general focus points. The first thing to realise when executing a tumour
resequencing project is that it is very important to use an appropriate reference sequence to align
the tumour and normal reads to (Alioto et al. 2015). At this moment the main source of human
reference genomes is the Genome Reference Consortium (GRC). The GRC keeps updating the
human reference assembly, as even the most recent human genome build (GRCh38) contains
gaps, particularly around the centromeres (Chaisson et al. 2015). As a result, rare reads that belong
somewhere else are mapped to the wrong place in the genome (the true location is missing),
leading to local read pile-ups i.e. false positives. This phenomenon can be mitigated by providing
Figure 2. Analysis of tumour and matched normal DNA
Distinguishing somatic tumour variants from germline variants requires the parallel sequencing of DNA from
tumour tissue and DNA from normal tissue. Here, peripheral blood is used as a ‘normal’ (see also section 2.1). After
sequencing, the reads are mapped to the human reference genome (in green). Discrepancies observed in both
samples are germline variants (in this example heterozygous), whereas those observed only in the tumour sample are
inferred to be somatic variants.
a ‘decoy’ sequence additional to the reference genome. A decoy is made up out of sequences
known to be absent from the reference. By including it, rare reads are ‘scavenged’ from the
reference (Li. 2014). An extra control step is to filter the alignment for so-called ‘blacklisted sites’
in the genome. These are sites known to suffer from extensive read pile-up, and they should be
18
excluded from further downstream analysis (Miga et al. 2015). Indeed, the use of decoy sequences
and blacklists can reduce false positives in somatic mutation detection (Miga et al. 2015; Alioto et
al. 2015). To reduce false positives it is also key that reads mapped with many errors are filtered
out (Pabinger et al. 2013).
It is also recommended to use paired-end or mate-paired reads (Pabinger et al. 2013).
While in single-end sequencing short fragments are read only from one end, in paired-end
sequencing both ends of longer fragments are read (Volik et al. 2003). The result is a collection of
paired reads separated by a known distance, so sequence as well as relative positional data can be
inferred. This technique has proven to be very useful in detecting so-called ‘copy-neutral
rearrangements’ in cancer genomes (Bashir et al. 2008; Oesper et al. 2012). Paired-end sequencing
also helps to map reads over repetitive regions more precisely* (Treangen et al. 2011).
Finally, it is important to choose the right alignment software. For this it is wise to
consider the NGS platform that was used, as their output sometimes requires the use of specific
algorithms (Luthra et al. 2015). However, it is far more important to choose an alignment tool
based on the application at hand. Not only do the various algorithms often differ in their ability
to pick up specific genetic alterations, but in addition, they tend to suffer from a significant trade-
off between speed and accuracy (Ruffalo et al. 2011). Indeed, the use of different alignment tools
can clearly impact variant calling (Griffith et al. 2015). For alignment in the context of a tumour
variant workflow, the tools Bowtie2, Novoalign and GMAP appear to be valid choices, as they
show a high accuracy in picking up a range of genomic variations. However, Novoalign is much
slower than the two others, most likely because it is based on a different alignment algorithm
(Bao et al. 2014). Indeed, this speed-accuracy trade-off is something that has to be considered,
especially in a clinical setting. As choosing an appropriate alignment tool is clearly a strenuous
task, it is recommended to use multiple alignment strategies in parallel (Griffith et al. 2015). By
assuming that a consensus in alignment has a higher likelihood of being correct, one can increase
the accuracy of the variant calling workflow (Goode et al. 2013).
One post-alignment processing procedure that should also be mentioned here is the
removal of PCR duplicates. PCR duplicates are reads of the exact same length and sequence
identity that arise during library amplification. These duplicates consequently align with the exact
same mapping coordinates. As a result of PCR bias, some reads are amplified much more than
others, resulting in heterogeneous coverage (Aird et al. 2011), as described earlier. To correct for
this bias it is common practice to remove excess duplicate reads after the alignment step
*A thorough discussion on paired-end sequencing is not provided here; for an excellent review see for example Risca, Greenleaf (2015)
19
(DePristo et al. 2011). Importantly, duplicate removal should not be performed carelessly, as
overcorrecting read counts can produce flawed variant calls (Zhou et al. 2014).
2.6 Variant calling, annotation and prioritization
The last step of the tumour variant analysis workflow is variant calling, followed by variant
annotation and prioritization (Pabinger et al. 2013). By comparing the aligned reads of the tumour
and normal samples with each other, and with a reference genome, a range of somatic tumour
variants can be detected (see Figure 2). It is important that a distinction between germline and
somatic variant is made as these variants frequently play different roles in tumour development
and progression (Pujana. 2014). Like read alignment, variant calling and annotation relies heavily
on the use of bioinformatics - GATK and SAMtools are well known ‘callers’ (Li, Handsaker et al.
2009; McKenna et al. 2010), and some tools like Strelka (Saunders et al. 2012) are specifically
designed to pick up somatic variants. These programs employ different algorithms to identify
candidate variants (Altmann et al. 2012). Basic tools identify variants when the number of high
confidence base-calls that disagree with the reference base exceeds a certain threshold; more
refined tools also take into account strand bias and the quality of neighbouring base-calls (Olson
et al. 2015). Importantly, most commonly used variant callers appear ill-suited to handle ultra-
deep sequencing data. This is likely the result of an ‘over-training’ of parameters and filtering
procedures towards a 30x-40x tumour-normal pair (Griffith et al. 2015). As the performance of
different variant callers can change in different settings (as a result of different algorithm
parameters), ‘the selection of an appropriate algorithm should be driven by each experiment’s
design’ (Griffith et al. 2015). Recently a new variant caller, VarDict was developed specifically for
20
ultra-deep sequencing (Lai et al. 2016). Choosing the right caller is a crucial element in any variant
analysis workflow: callers have to be stringent enough to control false positive calls, but not too
stringent as this will result in false negatives (Olson et al. 2015).
Another factor that has to be taken into account when choosing a caller is the mapper
used in the alignment step. Recently, Alioto and co-workers showed that certain mapper-caller
combinations show a much higher compatibility than others (Alioto et al. 2015). If possible, it is
recommended to use multiple variant callers to correct for this cross-talk, as this was shown to
improve performance of somatic variant calling (for the same reason as using multiple mappers;
Bao et al. 2014; Alioto et al. 2015).
After variants have been called, the data enters another bioinformatics pipeline in which
variants are annotated for clinical relevance. The sheer amount of data generated during the
tumour variant analysis means that manually performing this step would be a long and difficult
process (Dienstmann et al. 2014). For this reason, again a range of software tools has been
developed to streamline this process. Together, these tools help to stepwise filter out calls that
are known to be irrelevant while prioritizing those with the largest clinical significance. First the
less reliable and common variant calls are removed, including those with low coverage, low
quality and those supported by a low-confidence read alignment (Patel et al. 2014). The remaining
variants can then be prioritized relative to the disease and the genomic context (Bao et al. 2014).
In this step, the aim is to identify those somatic tumour variants that can ‘confer diagnostic,
prognostic, or treatment-related information’ (Sukhai et al. 2016). For an overview of the tools
developed to perform these steps see Pabinger et al. 2013 or Bao et al. 2014. A particularly
powerful toolset in this respect is ANNOVAR. ANNOVAR integrates public databases that
store detailed information on possible variants, including experimental and/or clinical evidence,
as well as detailed genomics data from for instance the ENCODE project (Wang et al. 2010). By
doing so, it offers a complete annotation and prioritization of tumour variants, helping clinical
researchers to interpret the data and make an informed decision regarding treatment (Yang,
Wang. 2015). In addition, it is important that findings are communicated with the clinical
oncology practice in a clear and timely fashion. A recent review by Dienstmann and coworkers
describes not only a systematic approach to variant annotation and prioritization, but also
proposes to use a structured format of cancer pathology reports (Dienstmann et al. 2014). It is
initiatives like this that will help us to exploit the enormous potential of big genomics data
(Eisenstein. 2015) in the fight against cancer.
21
3. Discussion
In order to better understand cancer and tumourigenesis it is key that the somatic variants
underlying the disease are characterized well. Tumour variant analysis is however no trivial task,
in particular because of intratumour heterogeneity. At the same time, deciphering this
heterogeneity is one of the primary goals of cancer research, as it is thought to be a driving force
of tumour proliferation and resistance to therapy. Designing an efficient (and cost-effective)
tumour variant analysis workflow is therefore challenging. Researchers do not only have to
decide how to perform the bench-work, but they are also asked to find the appropriate tools to
support their specific NGS data analysis. It is important that all those involved understand how
the various components of the somatic variant analysis workflow are executed, and where there is
room for improvement. Indeed, errors that are picked up only in the final stages of variant calling
might arise at the bench.
The overview provided here points out several key points for improvement of the
somatic tumour variant workflow (see also Table 1). Some of these recommendations are
beginning to be followed up in cancer studies, for instance the use of combined analysis tools:
although researchers have long relied on the use of a single caller, in recent years more studies
have instead combined the output of multiple tools (Field et al. 2015). As pipelines employing
multiple callers significantly outperform those that don’t in terms of accuracy (Alioto et al. 2015),
this trend will likely benefit cancer research greatly. Importantly, the same holds true for the use
of multiple alignment tools (Griffith et al. 2015). Another positive development is the increased
number of studies that use ultra-deep sequencing for tumour variant analysis. However, as the
costs of whole-genome sequencing are still substantial, many researchers opt to get higher depths
by sequencing only parts of the genome. One technique is whole exome sequencing (WES).
Although WES is now a routinely used instrument to detect genetic variation in humans
(Koboldt et al. 2013), several recent studies show that sequencing the entire genome (WGS) is
more accurate than WES when it comes to detecting somatic variants (Fang et al. 2014; Meynert
et al. 2014). The reason for this is that WES produces a more heterogeneous read coverage than
WGS, likely because it is subject to higher levels of sequencing bias (Veal et al. 2012; Belkadi et al.
2015). Nonetheless, WES is still a powerful, and not to mention cheaper method. A technique
that is perhaps clinically more relevant than WES is targeted resequencing of mutational
‘hotspots’ of tumours (Mamanova et al. 2010; Gerstung et al. 2012). With this technique a
restricted gene panel can be sequenced ultra-deep, allowing one to pick up low VAF somatic
mutations in known causative genes (Agrawal et al. 2011). Importantly, both WES and targeted
sequencing do not allow for a complete characterization of a tumour genome (Griffith et al.
22
2015): the first does not capture variation in non-coding DNA*, and the latter sequences only
user-specified genomic stretches. Even with very high depth sequencing it can be extremely
difficult to identify low VAF SNPs (Griffith et al. 2015). The required depth depends heavily on
the complexity of the sequencing library. The question of how much additional information can
be gained when a specific library is sequenced deeper is therefore highly relevant. Recently, Daley
and Smith presented a computational method that can quantify library complexity. Such a
method is likely to prove very useful in controlling the costs of routine tumour variant analysis in
the clinic (Daley, Smith. 2013).
Table 1: the main recommendation to optimize tumour variant calling
Workflow component Recommendations
DNA extraction & shearing - Be aware of poor DNA yield and/or quality when extracting from FFPE
samples. Employ pre-analytical sample characterization
Library preparation - Size-selection after shearing
- PCR free preparation if possible
- Use Kapa HiFI polymerase
- Optimize temperatures and/or duration of PCR steps
- Size selection after PCR amplification
- Use ddPCR to quantify library
- Use paired-end reads
Depth and coverage - Sequence deep (100-300x) , especially with variants of low expected (< 10%)
VAF
- Keep tumour : normal depth ratio close to one
Quality assessment - Use multiple software tools, or a complete toolset like ClinQC
Read mapping - Pick most complete validated reference genome (GRCh37 or GRCh38)
- Use decoy sequences and blacklisted sites
- Use multiple mapping tools
- Use PCR duplicate removal with caution
Variant calling
- Optimize mapper-caller combination / use multiple callers
23
* Non-coding DNA is rapidly shedding its ‘junk-DNA’ nickname as a significant amount of it appears to have some
functional role (Flintoft, 2005; but see also Palazzo, Gregory. 2014); Importantly, somatic variants in non-coding
DNA have also been directly linked to cancer (Khurana et al. 2016).
Other recommendations made here are still far from becoming standard practice. For
example, the majority of cancer studies still relies on a PCR step to amplify their library ahead of
quantification, while this amplification has long been recognized as a source of artificial
mutations as well as substantial bias (Aird et al. 2011). Recently, alternative approaches have been
introduced, for instance the use of ddPCR for library quantification when input material is
limiting (Robin et al. 2016). The arrival of Illumina’s TruSeq® PCR-free technology is another
important step towards reducing PCR artefacts (Alioto et al. 2015). However, as TruSeq requires
very high amounts of starting material compared to other NGS technologies, it is currently not
always considered a viable option in tumour variant analysis (Huptas et al. 2016). Future studies
should evaluate whether ddPCR quantification could enable a PCR-free approach in the face of
FFPE tumour samples.
The ‘age of clinical sequencing’ (Daley, Smith. 2013) is clearly approaching fast. Before
genetic screening of tumours, and with it personalized cancer medicine, can become a routine
application however, several important issues will have to be dealt with. One aspect that is
sometimes overlooked is the duration of the complete variant calling workflow, which means that
in the case of some aggressive cancers NGS analysis might be simply too slow (Goodwin et al.
2016). Indeed, faster systems will need to be developed to allow a ubiquitous deployment of
NGS in the cancer clinic. In addition, the accuracy and sensitivity of somatic variant workflows
will need to be further optimized and standardized, especially in the face of somatic variants with
a low VAF. It is key that new variant calling workflows, as well as their individual pipeline
components, are thoroughly evaluated to identify and isolate potential sources of error (Davies et
al. 2016). This will greatly facilitate the analysis of tumour NGS data, and, ultimately, clinical
interpretation.
24
4. References
Agrawal, N., Frederick, M. J., Pickering, C. R., Bettegowda, C., Chang, K., Li, R. J., . . . Myers, J. N. (2011). Exome
sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1. Science (New
York, N.Y.), 333(6046), 1154-1157.
Aird, D., Ross, M. G., Chen, W. S., Danielsson, M., Fennell, T., Russ, C., . . . Gnirke, A. (2011). Analyzing and
minimizing PCR amplification bias in illumina sequencing libraries. Genome Biology, 12(2), R18-2011-12-2-r18.
Epub 2011 Feb 21.
Ajay, S. S., Parker, S. C., Abaan, H. O., Fajardo, K. V., & Margulies, E. H. (2011). Accurate and comprehensive
sequencing of personal genomes. Genome Research, 21(9), 1498-1505.
Akbari, M., Hansen, M. D., Halgunset, J., Skorpen, F., & Krokan, H. E. (2005). Low copy number DNA template
can render polymerase chain reaction error prone in a sequence-dependent manner. The Journal of Molecular
Diagnostics : JMD, 7(1), 36-39.
Albertson, T. M., Ogawa, M., Bugni, J. M., Hays, L. E., Chen, Y., Wang, Y., . . . Preston, B. D. (2009). DNA
polymerase epsilon and delta proofreading suppress discrete mutator and cancer phenotypes in mice. Proceedings
of the National Academy of Sciences of the United States of America, 106(40), 17101-17104.
Alioto, T. S., Buchhalter, I., Derdak, S., Hutter, B., Eldridge, M. D., Hovig, E., . . . Gut, I. G. (2015). A
comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nature
Communications, 6, 10001.
Altmann, A., Weber, P., Bader, D., Preuss, M., Binder, E. B., & Muller-Myhsok, B. (2012). A beginners guide to SNP
calling from high-throughput DNA-sequencing data. Human Genetics, 131(10), 1541-1554.
Andrews, S.FastQC: A quality control tool for high throughput sequence data. Available online
at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
25
Baker, M. (2012). Structural variation: The genome's hidden architecture. Nature Methods, 9(2), 133-137.
Bao, R., Huang, L., Andrade, J., Tan, W., Kibbe, W. A., Jiang, H., & Feng, G. (2014). Review of current methods,
applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer
Informatics, 13(Suppl 2), 67-82.
Bashir, A., Volik, S., Collins, C., Bafna, V., & Raphael, B. J. (2008). Evaluation of paired-end sequencing strategies
for detection of genome rearrangements in cancer. PLoS Computational Biology, 4(4), e1000051.
Belkadi, A., Bolze, A., Itan, Y., Cobat, A., Vincent, Q. B., Antipenko, A., . . . Abel, L. (2015). Whole-genome
sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proceedings of the
National Academy of Sciences of the United States of America, 112(17), 5473-5478.
Ben-Ezra, J., Johnson, D. A., Rossi, J., Cook, N., & Wu, A. (1991). Effect of fixation on the amplification of nucleic
acids from paraffin-embedded material by the polymerase chain reaction. The Journal of Histochemistry and
Cytochemistry : Official Journal of the Histochemistry Society, 39(3), 351-354.
Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., . . . Smith, A. J. (2008).
Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53-59.
Berlin, K., Koren, S., Chin, C. S., Drake, J. P., Landolin, J. M., & Phillippy, A. M. (2015). Assembling large genomes
with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology, 33(6), 623-630.
Biankin, A. V., Waddell, N., Kassahn, K. S., Gingras, M. C., Muthuswamy, L. B., Johns, A. L., . . . Grimmond, S. M.
(2012). Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature, 491(7424), 399-
405.
Bolognesi, C., Forcato, C., Buson, G., Fontana, F., Mangano, C., Doffini, A., . . . Manaresi, N. (2016). Digital sorting
of pure cell populations enables unambiguous genetic analysis of heterogeneous formalin-fixed paraffin-
embedded tumors by next generation sequencing. Scientific Reports, 6, 20944.
Borad, M. J., Champion, M. D., Egan, J. B., Liang, W. S., Fonseca, R., Bryce, A. H., . . . Carpten, J. D. (2014).
Integrated genomic characterization reveals novel, therapeutically relevant drug targets in FGFR and EGFR
pathways in sporadic intrahepatic cholangiocarcinoma. PLoS Genetics, 10(2), e1004135.
26
Buehler, B., Hogrefe, H. H., Scott, G., Ravi, H., Pabon-Pena, C., O'Brien, S., . . . Happe, S. (2010). Rapid
quantification of DNA libraries for next-generation sequencing. Methods (San Diego, Calif.), 50(4), S15-8.
Cacho, A., Smirnova, E., Huzurbazar, S., & Cui, X. (2015). A comparison of base-calling algorithms for illumina
sequencing technology. Briefings in Bioinformatics.
Carter, S. L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., . . . Getz, G. (2012). Absolute
quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5), 413-421.
Chaisson, M. J., Huddleston, J., Dennis, M. Y., Sudmant, P. H., Malig, M., Hormozdiari, F., . . . Eichler, E. E. (2015).
Resolving the complexity of the human genome using single-molecule sequencing. Nature, 517(7536), 608-611.
Dabney, J., & Meyer, M. (2012). Length and GC-biases during sequencing library amplification: A comparison of
various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques, 52(2), 87-
94.
Dai, M., Thompson, R. C., Maher, C., Contreras-Galindo, R., Kaplan, M. H., Markovitz, D. M., . . . Meng, F. (2010).
NGSQC: Cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics, 11 Suppl 4, S7-
2164-11-S4-S7.
Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4),
325-327.
Das, S., & Vikalo, H. (2013). Base calling for high-throughput short-read sequencing: Dynamic programming
solutions. BMC Bioinformatics, 14, 129-2105-14-129.
Davies, K. D., Farooqi, M. S., Gruidl, M., Hill, C. E., Woolworth-Hirschhorn, J., Jones, H., . . . Aisner, D. L. (2016).
Multi-institutional FASTQ file exchange as a means of proficiency testing for next-generation sequencing
bioinformatics and variant interpretation. The Journal of Molecular Diagnostics : JMD, 18(4), 572-579.
Davis, A., & Navin, N. E. (2016). Computing tumor trees from single cells. Genome Biology, 17(1), 113-016-0987-z.
Del Fabbro, C., Scalabrin, S., Morgante, M., & Giorgi, F. M. (2013). An extensive evaluation of read trimming effects
on illumina NGS data analysis. PloS One, 8(12), e85024.
27
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., . . . Daly, M. J. (2011). A
framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics,
43(5), 491-498.
Dienstmann, R., Dong, F., Borger, D., Dias-Santagata, D., Ellisen, L. W., Le, L. P., & Iafrate, A. J. (2014).
Standardized decision support in next generation sequencing reports of somatic cancer variants. Molecular
Oncology, 8(5), 859-873.
Eisenstein, M. (2015). Big data: The power of petabytes. Nature, 527(7576), S2-4.
Fang, H., Wu, Y., Narzisi, G., O'Rawe, J. A., Barron, L. T., Rosenbaum, J., . . . Lyon, G. J. (2014). Reducing INDEL
calling errors in whole genome and exome sequencing data. Genome Medicine, 6(10), 89-014-0089-z. eCollection
2014.
Ferrarini, M., Moretto, M., Ward, J. A., Surbanovski, N., Stevanovic, V., Giongo, L., . . . Sargent, D. J. (2013). An
evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC
Genomics, 14, 670-2164-14-670.
Feuk, L., Marshall, C. R., Wintle, R. F., & Scherer, S. W. (2006). Structural variants: Changing the landscape of
chromosomes and design of disease studies. Human Molecular Genetics, 15 Spec No 1, R57-66.
Fidler, I. J. (1978). Tumor heterogeneity and the biology of cancer invasion and metastasis. Cancer Research, 38(9),
2651-2660.
Field, M. A., Cho, V., Andrews, T. D., & Goodnow, C. C. (2015). Reliably detecting clinically important variants
requires both combined variant calls and optimized filtering strategies. PloS One, 10(11), e0143199.
Finotello, F., Lavezzo, E., Bianco, L., Barzon, L., Mazzon, P., Fontana, P., . . . Di Camillo, B. (2014). Reducing bias
in RNA sequencing data: A novel approach to compute counts. BMC Bioinformatics, 15 Suppl 1, S7-2105-15-S1-
S7. Epub 2014 Jan 10.
Flintoft, L. (2005). Genome evolution: an adaptive view of non-coding DNA. Nat.Rev.Genet.
28
Gerlinger, M., Santos, C. R., Spencer-Dene, B., Martinez, P., Endesfelder, D., Burrell, R. A., . . . Swanton, C. (2012).
Genome-wide RNA interference analysis of renal carcinoma survival regulators identifies MCT4 as a warburg
effect metabolic target. The Journal of Pathology, 227(2), 146-156.
Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., & Beerenwinkel, N. (2012). Reliable
detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3, 811.
Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., . . . Campbell, I. G. (2013). A simple
consensus approach improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90.
Goodwin, S., Gurtowski, J., Ethe-Sayers, S., Deshpande, P., Schatz, M. C., & McCombie, W. R. (2015). Oxford
nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Research,
25(11), 1750-1756.
Goodwin, S., McPherson, J. D., & McCombie, W. R. (2016). Coming of age: Ten years of next-generation
sequencing technologies. Nature Reviews.Genetics, 17(6), 333-351.
Griffith, M., Miller, C. A., Griffith, O. L., Krysiak, K., Skidmore, Z. L., Ramu, A., . . . Wilson, R. K. (2015).
Optimizing cancer genome sequencing and analysis. Cell Systems, 1(3), 210-223.
Grokhovsky, S. L., Il'icheva, I. A., Nechipurenko, D. Y., Golovkin, M. V., Panchenko, L. A., Polozov, R. V., &
Nechipurenko, Y. D. (2011). Sequence-specific ultrasonic cleavage of DNA. Biophysical Journal, 100(1), 117-125.
Hansen, K. D., Brenner, S. E., & Dudoit, S. (2010). Biases in illumina transcriptome sequencing caused by random
hexamer priming. Nucleic Acids Research, 38(12), e131.
Head, S. R., Komori, H. K., LaMere, S. A., Whisenant, T., Van Nieuwerburgh, F., Salomon, D. R., & Ordoukhanian,
P. (2014). Library construction for next-generation sequencing: Overviews and challenges. Biotechniques, 56(2),
61-4, 66, 68, passim.
Huptas, C., Scherer, S., & Wenning, M. (2016). Optimized illumina PCR-free library preparation for bacterial whole
genome sequencing and analysis of factors influencing de novo assembly. BMC Research Notes, 9, 269-016-2072-
9.
29
Huse, S. M., Huber, J. A., Morrison, H. G., Sogin, M. L., & Welch, D. M. (2007). Accuracy and quality of massively
parallel DNA pyrosequencing. Genome Biology, 8(7), R143.
Katsanis, S. H., & Katsanis, N. (2013). Molecular genetic testing and the future of clinical genomics. Nature
Reviews.Genetics, 14(6), 415-426.
Kebschull, J. M., & Zador, A. M. (2015). Sources of PCR-induced distortions in high-throughput sequencing data
sets. Nucleic Acids Research, 43(21), e143.
Khurana, E., Fu, Y., Chakravarty, D., Demichelis, F., Rubin, M. A., & Gerstein, M. (2016). Role of non-coding
sequence variants in cancer. Nature Reviews.Genetics, 17(2), 93-108.
Knierim, E., Lucke, B., Schwarz, J. M., Schuelke, M., & Seelow, D. (2011). Systematic comparison of three methods
for fragmentation of long-range PCR products for next generation sequencing. PloS One, 6(11), e28240.
Koboldt, D. C., Steinberg, K. M., Larson, D. E., Wilson, R. K., & Mardis, E. R. (2013). The next-generation
sequencing revolution and its impact on genomics. Cell, 155(1), 27-38.
Kokkat, T. J., Patel, M. S., McGarvey, D., LiVolsi, V. A., & Baloch, Z. W. (2013). Archived formalin-fixed paraffin-
embedded (FFPE) blocks: A valuable underexploited resource for extraction of DNA, RNA, and protein.
Biopreservation and Biobanking, 11(2), 101-106.
Kozarewa, I., Ning, Z., Quail, M. A., Sanders, M. J., Berriman, M., & Turner, D. J. (2009). Amplification-free
illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes.
Nature Methods, 6(4), 291-295.
Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., . . . Dry, J. R. (2016). VarDict: A
novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research,
44(11), e108.
Landau, D. A., Carter, S. L., Stojanov, P., McKenna, A., Stevenson, K., Lawrence, M. S., . . . Wu, C. J. (2013).
Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell, 152(4), 714-726.
Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short
DNA sequences to the human genome. Genome Biology, 10(3), R25-2009-10-3-r25. Epub 2009 Mar 4.
30
Laurie, M. T., Bertout, J. A., Taylor, S. D., Burton, J. N., Shendure, J. A., & Bielas, J. H. (2013). Simultaneous digital
quantification and fluorescence-based size characterization of massively parallel sequencing libraries.
Biotechniques, 55(2), 61-67.
Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics
(Oxford, England), 30(20), 2843-2851.
Li, H. (2016). Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics
(Oxford, England), 32(14), 2103-2110.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics
(Oxford, England), 25(14), 1754-1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . 1000 Genome Project Data Processing
Subgroup. (2009). The sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16),
2078-2079.
Li, H., & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in
Bioinformatics, 11(5), 473-483.
Li, R., Yu, C., Li, Y., Lam, T. W., Yiu, S. M., Kristiansen, K., & Wang, J. (2009). SOAP2: An improved ultrafast tool
for short read alignment. Bioinformatics (Oxford, England), 25(15), 1966-1967.
Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., . . . Law, M. (2012). Comparison of next-generation sequencing
systems. Journal of Biomedicine & Biotechnology, 2012, 251364.
Loeb, L. A. (2016). Human cancers express a mutator phenotype: Hypothesis, origin, and consequences. Cancer
Research, 76(8), 2057-2059.
Loman, N. J., Misra, R. V., Dallman, T. J., Constantinidou, C., Gharbia, S. E., Wain, J., & Pallen, M. J. (2012).
Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, 30(5), 434-
439.
Luthra, R., Chen, H., Roy-Chowdhuri, S., & Singh, R. R. (2015). Next-generation sequencing in clinical molecular
diagnostics of cancer: Advantages and challenges. Cancers, 7(4), 2023-2036.
31
Mamanova, L., Coffey, A. J., Scott, C. E., Kozarewa, I., Turner, E. H., Kumar, A., . . . Turner, D. J. (2010). Target-
enrichment strategies for next-generation sequencing. Nature Methods, 7(2), 111-118.
Mardis, E. R. (2013). Next-generation sequencing platforms. Annual Review of Analytical Chemistry (Palo Alto, Calif.), 6,
287-303.
Marusyk, A., Almendro, V., & Polyak, K. (2012). Intra-tumour heterogeneity: A looking glass for cancer? Nature
Reviews.Cancer, 12(5), 323-334.
McCoy, R. C., Taylor, R. W., Blauwkamp, T. A., Kelley, J. L., Kertesz, M., Pushkarev, D., . . . Fiston-Lavier, A. S.
(2014). Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-
repetitive transposable elements. PloS One, 9(9), e106689.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., . . . DePristo, M. A. (2010). The
genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome
Research, 20(9), 1297-1303.
Meyer, C. A., & Liu, X. S. (2014). Identifying and mitigating bias in next-generation sequencing methods for
chromatin biology. Nature Reviews.Genetics, 15(11), 709-721.
Meyer, M., Briggs, A. W., Maricic, T., Hober, B., Hoffner, B., Krause, J., . . . Hofreiter, M. (2008). From micrograms
to picograms: Quantitative PCR reduces the material demands of high-throughput sequencing. Nucleic Acids
Research, 36(1), e5.
Meynert, A. M., Ansari, M., FitzPatrick, D. R., & Taylor, M. S. (2014). Variant detection sensitivity and biases in
whole genome and exome sequencing. BMC Bioinformatics, 15, 247-2105-15-247.
Miga, K. H., Eisenhart, C., & Kent, W. J. (2015). Utilizing mapping targets of sequences underrepresented in the
reference assembly to reduce false positive alignments. Nucleic Acids Research, 43(20), e133.
Mokry, M., Hatzis, P., de Bruijn, E., Koster, J., Versteeg, R., Schuijers, J., . . . Cuppen, E. (2010). Efficient double
fragmentation ChIP-seq provides nucleotide resolution protein-DNA binding profiles. PloS One, 5(11), e15092.
Mostovoy, Y., Levy-Sakin, M., Lam, J., Lam, E. T., Hastie, A. R., Marks, P., . . . Kwok, P. Y. (2016). A hybrid
approach for de novo human genome sequence assembly and phasing. Nature Methods, 13(7), 587-590.
32
Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., . . . Venter, J. C. (2000). A
whole-genome assembly of drosophila. Science (New York, N.Y.), 287(5461), 2196-2204.
Navin, N., Kendall, J., Troge, J., Andrews, P., Rodgers, L., McIndoo, J., . . . Wigler, M. (2011). Tumour evolution
inferred by single-cell sequencing. Nature, 472(7341), 90-94.
Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation
sequencing data. Nature Reviews.Genetics, 12(6), 443-451.
Nik-Zainal, S., Alexandrov, L. B., Wedge, D. C., Van Loo, P., Greenman, C. D., Raine, K., . . . Breast Cancer
Working Group of the International Cancer Genome Consortium. (2012). Mutational processes molding the
genomes of 21 breast cancers. Cell, 149(5), 979-993.
Oesper, L., Ritz, A., Aerni, S. J., Drebin, R., & Raphael, B. J. (2012). Reconstructing cancer genomes from paired-end
sequencing data. BMC Bioinformatics, 13 Suppl 6, S10-2105-13-S6-S10.
O'Huallachain, M., Karczewski, K. J., Weissman, S. M., Urban, A. E., & Snyder, M. P. (2012). Extensive genetic
variation in somatic human tissues. Proceedings of the National Academy of Sciences of the United States of America,
109(44), 18018-18023.
Olson, N. D., Lund, S. P., Colman, R. E., Foster, J. T., Sahl, J. W., Schupp, J. M., . . . Zook, J. M. (2015). Best
practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in Genetics, 6,
235.
Pabinger, S., Dander, A., Fischer, M., Snajder, R., Sperk, M., Efremova, M., . . . Trajanoski, Z. (2014). A survey of
tools for variant analysis of next-generation genome sequencing data. Briefings in Bioinformatics, 15(2), 256-278.
Palazzo, A. F., & Gregory, T. R. (2014). The case for junk DNA. PLoS Genetics, 10(5), e1004351.
Pandey, R. V., Pabinger, S., Kriegner, A., & Weinhausel, A. (2016). ClinQC: A tool for quality control and cleaning
of sanger and NGS data in clinical research. BMC Bioinformatics, 17, 56-016-0915-y.
Pang, A. W., MacDonald, J. R., Pinto, D., Wei, J., Rafiq, M. A., Conrad, D. F., . . . Scherer, S. W. (2010). Towards a
comprehensive structural variation map of an individual human genome. Genome Biology, 11(5), R52-2010-11-5-
r52. Epub 2010 May 19.
33
Pantel, K., & Speicher, M. R. (2016). The biology of circulating tumor cells. Oncogene, 35(10), 1216-1224.
Parkinson, N. J., Maslau, S., Ferneyhough, B., Zhang, G., Gregory, L., Buck, D., . . . Fischer, M. D. (2012).
Preparation of high-quality next-generation sequencing libraries from picogram quantities of target DNA.
Genome Research, 22(1), 125-133.
Patel, Z. H., Kottyan, L. C., Lazaro, S., Williams, M. S., Ledbetter, D. H., Tromp, H., . . . Kaufman, K. M. (2014).
The struggle to find reliable results in exome sequencing data: Filtering out mendelian errors. Frontiers in
Genetics, 5, 16.
Pleasance, E. D., Cheetham, R. K., Stephens, P. J., McBride, D. J., Humphray, S. J., Greenman, C. D., . . . Stratton,
M. R. (2010). A comprehensive catalogue of somatic mutations from a human cancer genome. Nature,
463(7278), 191-196.
Poptsova, M. S., Il'icheva, I. A., Nechipurenko, D. Y., Panchenko, L. A., Khodikov, M. V., Oparina, N. Y., . . .
Grokhovsky, S. L. (2014). Non-random DNA fragmentation in next-generation sequencing. Scientific Reports, 4,
4532.
Pujana, M. A. (2014). Integrating germline and somatic data towards a personalized cancer medicine. Trends in
Molecular Medicine, 20(8), 413-415.
Quail, M. A., Kozarewa, I., Smith, F., Scally, A., Stephens, P. J., Durbin, R., . . . Turner, D. J. (2008). A large genome
center's improvements to the illumina sequencing system. Nature Methods, 5(12), 1005-1010.
Quail, M. A., Swerdlow, H., & Turner, D. J. (2009). Improved protocols for the illumina genome analyzer
sequencing system. Current Protocols in Human Genetics / Editorial Board, Jonathan L.Haines ...[Et Al.], Chapter 18,
Unit 18.2.
Renaud, G., Kircher, M., Stenzel, U., & Kelso, J. (2013). freeIbis: An efficient basecaller with calibrated quality scores
for illumina sequencers. Bioinformatics (Oxford, England), 29(9), 1208-1209.
Rhodes, J., Beale, M. A., & Fisher, M. C. (2014). Illuminating choices for library prep: A comparison of library
preparation methods for whole genome sequencing of cryptococcus neoformans using illumina HiSeq. PloS
One, 9(11), e113501.
34
Risca, V. I., & Greenleaf, W. J. (2015). Beyond the linear genome: Paired-end sequencing as a biophysical tool. Trends
in Cell Biology, 25(12), 716-719.
Robin, J. D., Ludlow, A. T., LaRanger, R., Wright, W. E., & Shay, J. W. (2016). Comparison of DNA quantification
methods for next generation sequencing. Scientific Reports, 6, 24067.
Ruffalo, M., LaFramboise, T., & Koyuturk, M. (2011). Comparative analysis of algorithms for next-generation
sequencing read alignment. Bioinformatics (Oxford, England), 27(20), 2790-2796.
Sadanandam, A., Lal, A., Benz, S. C., Eppenberger-Castori, S., Scott, G., Gray, J. W., . . . Benz, C. C. (2012).
Genomic aberrations in normal tissue adjacent to HER2-amplified breast cancers: Field cancerization or
contaminating tumor cells? Breast Cancer Research and Treatment, 136(3), 693-703.
Sah, S., Chen, L., Houghton, J., Kemppainen, J., Marko, A. C., Zeigler, R., & Latham, G. J. (2013). Functional DNA
quantification guides accurate next-generation sequencing mutation detection in formalin-fixed, paraffin-
embedded tumor biopsies. Genome Medicine, 5(8), 77.
Saunders, C. T., Wong, W. S., Swamy, S., Becq, J., Murray, L. J., & Cheetham, R. K. (2012). Strelka: Accurate somatic
small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford, England), 28(14), 1811-
1817.
Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., & Nolan, G. P. (2010). Computational solutions to large-scale
data management and analysis. Nature Reviews.Genetics, 11(9), 647-657.
Shah, S. P., Morin, R. D., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., . . . Aparicio, S. (2009). Mutational
evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265), 809-813.
Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135-1145.
Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: Key
considerations in genomic analyses. Nature Reviews.Genetics, 15(2), 121-132.
Sukhai, M. A., Craddock, K. J., Thomas, M., Hansen, A. R., Zhang, T., Siu, L., . . . Kamel-Reid, S. (2016). A
classification system for clinical relevance of somatic variants identified in molecular profiling of cancer.
Genetics in Medicine : Official Journal of the American College of Medical Genetics, 18(2), 128-136.
35
Trapnell, C., & Salzberg, S. L. (2009). How to map billions of short reads onto genomes. Nature Biotechnology, 27(5),
455-457.
Treangen, T. J., & Salzberg, S. L. (2011). Repetitive DNA and next-generation sequencing: Computational challenges
and solutions. Nature Reviews.Genetics, 13(1), 36-46.
Troester, M., Hoadley, K., & D'Arcy, M.,.. (2016). DNA defects, epigenetics, and gene expression in cancer-adjacent
breast: A study from the cancer genome atlas. Npj Breast Cancer, 2
Valasek, M. A., & Repa, J. J. (2005). The power of real-time PCR. Advances in Physiology Education, 29(3), 151-159.
van Dijk, E. L., Jaszczyszyn, Y., & Thermes, C. (2014). Library preparation methods for next-generation sequencing:
Tone down the bias. Experimental Cell Research, 322(1), 12-20.
Veal, C. D., Freeman, P. J., Jacobs, K., Lancaster, O., Jamain, S., Leboyer, M., . . . Brookes, A. J. (2012). A
mechanistic basis for amplification differences between samples and between genome regions. BMC Genomics,
13, 455-2164-13-455.
Volik, S., Zhao, S., Chin, K., Brebner, J. H., Herndon, D. R., Tao, Q., . . . Collins, C. (2003). End-sequence profiling:
Sequence-based analysis of aberrant genomes. Proceedings of the National Academy of Sciences of the United States of
America, 100(13), 7696-7701.
Vrijenhoek, T., Kraaijeveld, K., Elferink, M., de Ligt, J., Kranendonk, E., Santen, G., . . . Cuppen, E. (2015). Next-
generation sequencing-based genome diagnostics across clinical genetics centers: Implementation choices and
their effects. European Journal of Human Genetics : EJHG, 23(9), 1142-1150.
Wagle, N., Berger, M. F., Davis, M. J., Blumenstiel, B., Defelice, M., Pochanard, P., . . . Garraway, L. A. (2012). High-
throughput detection of actionable genomic alterations in clinical tumor samples by targeted, massively parallel
sequencing. Cancer Discovery, 2(1), 82-93.
Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-
throughput sequencing data. Nucleic Acids Research, 38(16), e164.
Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., . . . Zhao, Z. (2013). Detecting somatic point mutations in
cancer genome sequencing data: A comparison of mutation callers. Genome Medicine, 5(10), 91.
36
Watson, I. R., Takahashi, K., Futreal, P. A., & Chin, L. (2013). Emerging patterns of somatic mutations in cancer.
Nature Reviews.Genetics, 14(10), 703-718.
White, R. A.,3rd, Blainey, P. C., Fan, H. C., & Quake, S. R. (2009). Digital PCR provides sensitive and absolute
calibration for high throughput sequencing. BMC Genomics, 10, 116-2164-10-116.
Williams, C., Ponten, F., Moberg, C., Soderkvist, P., Uhlen, M., Ponten, J., . . . Lundeberg, J. (1999). A high
frequency of sequence alterations is due to formalin fixation of archival specimens. The American Journal of
Pathology, 155(5), 1467-1471.
Xu, X., Hou, Y., Yin, X., Bao, L., Tang, A., Song, L., . . . Wang, J. (2012). Single-cell exome sequencing reveals
single-nucleotide mutation characteristics of a kidney tumor. Cell, 148(5), 886-895.
Yang, H., & Wang, K. (2015). Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR.
Nature Protocols, 10(10), 1556-1566.
Yates, L. R., Gerstung, M., Knappskog, S., Desmedt, C., Gundem, G., Van Loo, P., . . . Campbell, P. J. (2015).
Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nature Medicine, 21(7),
751-759.
Ye, C., Hsiao, C., & Corrada Bravo, H. (2014). BlindCall: Ultra-fast base-calling of high-throughput sequencing data
by blind deconvolution. Bioinformatics (Oxford, England), 30(9), 1214-1219.
Yun, J. J., Heisler, L. E., Hwang, I. I., Wilkins, O., Lau, S. K., Hyrcza, M., . . . Der, S. D. (2006). Genomic DNA
functions as a universal external standard in quantitative real-time PCR. Nucleic Acids Research, 34(12), e85.
Zerbino, D. R., & Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de bruijn graphs.
Genome Research, 18(5), 821-829.
Zhou, W., Chen, T., Zhao, H., Eterovic, A. K., Meric-Bernstam, F., Mills, G. B., & Chen, K. (2014). Bias from
removing read duplication in ultra-deep sequencing experiments. Bioinformatics (Oxford, England)