Upload
ngobao
View
212
Download
0
Embed Size (px)
Citation preview
Researching Cancer with the minION:
Methylation and Structural Variation
Winston Timp
Department of Biomedical Engineering
Johns Hopkins University
Agilent Webinar
November 2017
For Research Use Only. Not for use in diagnostic procedures.
2
Nanopore: Single Molecule Sequencing
• Oxford Nanopore Technologies, CsgG
biological pore
• No theoretical upper limit to sequencing read
length, practical limit only in delivering DNA
to the pore intact
• Palm sized sequencer
• Predicted sequencing output 3-6Gb
Deamer et al 2016, Nature Biotech
Oxford Nanopore Google Hangout March 2016
ATCGATCGATAGTAT
TAGATACGACTAGC
GATCAG
Disclosure: Timp has two patents (US 2011/0226623 A1; US2012/0040343 A1) licensed to ONT
For Research Use Only. Not for use in diagnostic procedures.
3
• Four steps to generating usable data with nanopore sequencing
• Base-calling : the process of converting raw signal into nucleotide sequences
• Nanopolish : uses alignment and current signal to improve base-calls
Nanopore Sequencing Workflow
For Research Use Only. Not for use in diagnostic procedures.
4
• Protein nanopores on a synthetic polymer
• Multiple base-pairs at a time (“k-mers”)
• Characteristic current signature is converted to nucleotide sequences
Sequencing Operation
Oxford Nanopore Technologies
For Research Use Only. Not for use in diagnostic procedures.
5
Nanopore Library Prep
• Library prep is very similar to methods for short-read sequencing
• For DNA shearing we used Covaris gTubes or Diagenode Megaruptor
• After end-repair and A-tailing, leader adapter with motor protein is ligated
• MinION arrays 512 channels (with 4 pores possible per channel) (shown bottom left from running
software); dark green pores are sequencing, light green available, other colors inactive.For Research Use Only. Not for use in diagnostic procedures.
6
Problems with Nanopore basecalling
• Multiple bases influence the current
passing through the pore.
• Through simulation with Brownian
Dynamics, we calculated the
contribution from triplets of DNA in a
solid-state nanopore - 64 current
levels.
• Not all of these different currents are
distinguishable
Comer and Aksimentiev J. of Phys Chem C 116(5) 3376-3393 (2012)For Research Use Only. Not for use in diagnostic procedures.
7
Hidden Markov Model
• Markov chains represent a series of states occurring in sequence, with defined transition
probabilities. In a hidden Markov model the state is only observed indirectly.
• As an example, consider inferring the weather (without going outside) by observing
whether people coming in from outside are wearing gloves or notFor Research Use Only. Not for use in diagnostic procedures.
8
Prior Information for Decoding
• With no prior information, a given current value may not be called correctly (333pA would be called as GGG)
• If we know the previous triplet, the next triplet is well defined, leaving only four possibilities, resulting in the
correct call of TCG
Timp, et al. Biophys J. 30, 349-353 (2012)For Research Use Only. Not for use in diagnostic procedures.
9
• By using a sequence of observables and
maximizing the total joint probability given
below, we find the sequence of states.
• This is done using the Viterbi algorithm –
which grows, finding the most likely path
for each step, saving the probabilities, to
avoid recalculation.
• 1st generation basecallers from Oxford
used a HMM for basecalling similar to the
one detailed in our Biophysical paper
• Transistion probablility matrix for oxford
seems to allow for a 0, 1 (most common),
2, or 5 (reset) move.
• We think that Oxford trained its
basecalling model on unmethylated
lambda
dk P(I(t) | ktt
Õ )´T(t-1)(t )
Nanopore HMM basecalling
For Research Use Only. Not for use in diagnostic procedures.
10
Basecalling shifting to RNN
Oxford Nanopore
• Recently (~6 months) there has been a shift
to neural network based basecalling
• A recurrent neural network is still one with
memory, that has a dependence on past
computations
• Specifically two layers of Bidirectional Long
Short Term Memory (BLSTM)
• These still require the same “training” data
to learn what current distributions
correspond to which k-mers – and the
results are still k-mer based, as multiple
bases still influence the current.
For Research Use Only. Not for use in diagnostic procedures.
11
Assemblies
SPAdes (Illumina only) Canu (Nanopore only)
Long reads really help in getting complete assemblies – using twice as
many Illumina reads, the N50 is still ~0.3 Mb for Illumina only, while it’s
full length chromosome from nanopore onlyFor Research Use Only. Not for
use in diagnostic procedures.
12
Pilon Correction
• Though ONT error rate is
improving, still not
sufficient from raw
basecalled for SNPs
• Application of Illumina
data via pilon allows for
correction
• Multiple rounds of
correction needed –
initially Illumina reads
won’t align correctly to
error prone scaffold
• Next we are trying to
apply nanopolish which
returns to the electrical
signal to correct nanopore
only assemblies
For Research Use Only. Not for use in diagnostic procedures.
13
Nanopolish tools
Consensus Calling Reference-based SNP Calling
Methylation Detection Read Phasing
…
TAGAAGATATCATGTATAGTACGATTAGAAGATATCATGTAGCAGATATCATGTATATTACGAT
CATGTATATTACGAT
github.com/jts/nanopolish
For Research Use Only. Not for use in diagnostic procedures.
14
Nanopolish SNP Calling and Genotyping
-332-300-350
…
hidden Markov
model
…ACTACGATCGAC……ACTACGATCGAC…
…ACTACCATCGAC……ACTACCATCGAC…
…ACTACGATCGAC……ACTACCATCGAC…
input: pairs of
haplotypes
0/1
output:
genotype call
For Research Use Only. Not for use in diagnostic procedures.
15
Human Genotyping Results
Genotype accuracy at all sites: 99.2%
Genotype accuracy at variable sites: 94.8%
Jain et al bioRxiv 128835For Research Use Only. Not for use in diagnostic procedures.
16
Structural Variation
• Defined as an abnormality in large region
(50b-3mb) of a chromosome
• Pervasive in cancer – 50% of pancreatic
ductal adenocarcinoma (PDAC) have SVs
• Common in tumor suppressor genes such
as CDKN2A and SMAD4
• Short-read sequencing has difficulty
resolving SVs, but nanopore sequencing
long reads can stretch across them.
• High coverage needed to detect, but yield
from nanopore sequencing still relatively
low per flowcell (~3-5Gb)
Baker, Monya. 2012. Nature MethodsFor Research Use Only. Not for use in diagnostic procedures.
17
Solution-phase Hybridization Capture
Use of Agilent SureSelectXT Targeted Sequencing System in cancer research
• ~90 bps biotinylated RNA probes complementary to target sequence
• Biotin-streptavidin interaction to enrich for the targeted region
• Optimization for long-reads : > 2 kb
For Research Use Only. Not for use in diagnostic procedures.
18
Targeted Capture Optimization
Collaboration with Josh Wang from Agilent
• Trial 1
• Probe tiling, No empty spaces between probes
• Target region
• CDKN2A : 1.5 Mbp
• Low stringency to allow mismatches
• Result: 2.28 % on-target
• Trial 2
• No tiling, average 400 bp space between probes
• Target regions
• CDKN2A : 1.5 Mbps
• SMAD4 : 850 Kbps
• High stringency to limit off-target capture
• Consideration of known SV breakpoints
• PDAC SVs from James Eshleman lab
• Result: 30 % on-target
For Research Use Only. Not for use in diagnostic procedures.
19
Targeted Sequencing Performance
• Control : NA12878 lymphoblast
• Sample : PDAC from Eshleman
lab
• Illumina short-read targeted
sequencing for comparison
• > 300-fold enrichment
• > 20X average coverage
• Agilent App Note:
https://goo.gl/8V2Fei
Total yield
(reads)On-target
On-target
percentageFold enrichment Coverage
Illumina NA12878 4.4m 3.7m 85% 641X 113X
Nanopore
NA12878107k 32k 30% 353X 27X
Nanopore PDAC 56k 20k 26% 332X 20X
For Research Use Only. Not for use in diagnostic procedures.
20
Nanopore Structural Variation Detection in Cancer
Research
• NA12878 (ENCODE Human lymphoblast cell line)
• SVs detected with Sniffles (Schatz lab)
• chr9:21,038,354 - 21,038,506; 152 bps duplication
• Validated with PacBio data from Genome in a Bottle (Mt. Sinai School of Medicine)
w/Josh Wang (Agilent) & Jim Eshleman (JHMI)For Research Use Only. Not for use in diagnostic procedures.
21
• PDAC cell line (Eshleman): Novel, putative SVs detected from PDAC
• chr18: 51,198,535 – 51,199,143; 600 bps deletion
• Possibly allele-specific SV
w/Josh Wang (Agilent) & Jim Eshleman (JHMI)For Research Use Only. Not for use in diagnostic procedures.
Nanopore Structural Variation Detection in Cancer
Research
22
• Large window of coverage; Absence in CDKN2A region
• chr9:21,950,000 -22,436,000; 486 kbps SV
• Homozygous Deletion previously identified with Illumina data
• Sniffles did not detect this SV; No reads covering either breakpoint, unlucky
coincidence of probes w/Josh Wang (Agilent) & Jim Eshleman (JHMI)For Research Use Only. Not for
use in diagnostic procedures.
Nanopore Structural Variation Detection in Cancer
Research
23
Single Nucleotide Variation Detection in Cancer Research
Number of True SNVs: 3587(Eberle,et al. bioRxiv, 2016)
Illumina Pre-polish Post-polish
Avg.
Coverage 113 27 27
Correct 1133 2485 947
Total 1211 4138 1017
Precision 94% 60% 93%
Sensitivity 32% 69% 26%
Phased SNV analysis is possible with
coverage from targeted sequencing
For Research Use Only. Not for use in diagnostic procedures.
24
Epigenetics: Historical
• The historical definition of epigenetics,
by Waddington, is how genotype
interacts with the environment to create
phenotype. The analogy is the rolling
ball in the image to the right – the
genotype has set up the hills,
environment nudges the ball into one
valley or another, then the ball is, to mix
metaphor, lineage committed.
• Which valley the ball rolls into is not
predetermined, but a stochastic
behavior.Waddington, 1940; Raser et al. Science (2005)
For Research Use Only. Not for use in diagnostic procedures.
25
Epigenetics: Modern
• Modern Definition of epigenetics involves heritable changes other than genetic sequence,
e.g., positive feedback, high order structure, chromatin organization, histone modifications,
DNA methylation.
• An analogy to a computer system:
• DNA Sequence = Hardware
• User input = Environment
• Systems Biology = Running programs
• Epigenetics = RAM
For Research Use Only. Not for
use in diagnostic procedures.
26
Single Read Methylation: Distribution
Number of reads
per pattern
Number of reads
per pattern
• We performed hybridization
capture, then Illumina bisulfite
sequencing on 6 paired colon
cancer and normal samples.
• We then examined methylation
patterns *within reads* and
looked at the distribution in
normal vs. cancer samples.
• Colors in the stacked bar graph
represent different sequenced
samples.
• Areas are clusters in regions
which show significantly
different methylation levels (t-
test).
For Research Use Only. Not for use in diagnostic procedures.
27
Single Read Methylation: Distribution
Number of reads
per pattern
Number of reads
per pattern
• We performed hybridization
capture, then Illumina bisulfite
sequencing on 6 paired colon
cancer and normal samples.
• We then examined methylation
patterns *within reads* and
looked at the distribution in
normal vs. cancer samples.
• Colors in the stacked bar graph
represent different sequenced
samples.
• Areas are clusters in regions
which show significantly
different methylation levels (t-
test).
For Research Use Only. Not for use in diagnostic procedures.
28
Single Read Methylation: Distribution
Number of reads
per pattern
Number of reads
per pattern
• We performed hybridization
capture, then Illumina bisulfite
sequencing on 6 paired colon
cancer and normal samples.
• We then examined methylation
patterns *within reads* and
looked at the distribution in
normal vs. cancer samples.
• Colors in the stacked bar graph
represent different sequenced
samples.
• Areas are clusters in regions
which show significantly
different methylation levels (t-
test).
For Research Use Only. Not for use in diagnostic procedures.
29
Nanopore: Methylation
• Cytosine methylation increases
DNA stiffness, decreasing
stretching force
• This may allow for filtering of
methylated versus
unmethylated DNA using a
pore
Mirsaidov, et al. Biophys J. (2009)For Research Use Only. Not for use in diagnostic procedures.
30
Nanopore: Methylation
Schreiber, et al. PNAS. (2013)Laszlo, et al. PNAS (2013)
• Differences between methylated and
unmethylated cytosine have been detected
using nanopores.
• Methylation state can be called with 90%
accuracy.
• We have implemented a classifier for mC
on using Oxford Nanopore signals.
For Research Use Only. Not for use in diagnostic procedures.
31
Generation of methylated Samples
• To generated methylated samples, we treat unmethylated
DNA (lambda, dam-/dcm- E. Coli, PCR product) with M. SssI
methyltransferase
• We confirm the CG specific methylation using Illumina bisulfite
sequencing of the sample – pictured right is methylation in
different contexts E. Coli dataset treated with M. SssI (red)
versus untreated (green)
For Research Use Only. Not for use in diagnostic procedures.
32
Nanopore: Methylated Error•We sequenced
samples with either
SssI or no treatment.
•Notably, mismatch
error rate is higher on
methylated samples
than unmethylated,
though indel rate
seems mostly
unchanged
•At CG locations,
there is not a clear
alternative
basecalling, though
error is higher; Ts are
not substituted for
methylated Cs
Simpson, Workman, et al. Nature Methods (2017)
For Research Use Only. Not for use in diagnostic procedures.
33
Emission ProbabilitiesS
impson
, Work
man, e
t al. N
atu
re M
eth
od
s (2
01
7)
• We measured distributions of current
for k-mers from E. Coli M.SssI treated
(methylated; green) and untreated
(unmethylated; red) samples on two
different sets of pores - R7.3 and R9
flowcells.
• Boxplots of AGGTCG and TCGAGT k-
mers which both contain CGs show
significant differences in current in
some cases (AGGTCG R7.3) and little
to none in others (TCGAGT R7.3)
• R9 current distribution seem wider in
both cases, but gives better
discrimination in TCGAGT.
For Research Use Only. Not for use in diagnostic procedures.
34
Distance of methylation effect
• We looked at the difference in current
levels dependent on the position of the
methylated base – plotted are the current
differences for R7.3(blue) and R9
pores(orange).
• Signal seems again stronger but more
variable for R9 pores than R7.3
• Methylation can either reduce current or
increase it.
• Some positions are more sensitive to
methylation than others.
Simpson, Workman, Nature Methods (2017)
For Research Use Only. Not for use in diagnostic procedures.
35
Nanopore: nanopolish methyltrain
• Multiple bases influence the current
passing through the pore.
• Current basecallers use a neural-
network based methodology to call
bases.
• We currently use a HMM based
classifier to call methylation
• With nanopolish we can call the
probability:
• Where Sm is the probability
methylated for a given observable D
and Sr the probability unmethylated
• We then take the log of this likelihood
ratio.dk P(I(t) | kt
tÕ )´T(t-1)(t )
For Research Use Only. Not for use in diagnostic procedures.
36
NA12878 Methylation
• NA12878 (lymphoblast) gDNA: Illumina WGBS on X-axis (24X coverage) (SRA:
GSM1002650) vs. R7.3 (0.02X) or R9 (0.13X) nanopore sequencing.
• Correlation of 0.83 (R7.3) and 0.84 (R9) – most gene promoters unmethylated
Sim
pson
, W
ork
man, N
atu
re M
eth
od
s (
201
7)
R7.3 R9
For Research Use
Only. Not for use in
diagnostic procedures.
37
Nanopolish Methylation
Jain et al bioRxiv 128835
Methylation comparison between bisulfite and nanopore calling
In whole genome
For Research Use Only. Not for use in diagnostic procedures.
38
Binned Methylation vs. Transcription Start Sites
• On human genomic samples:
• Binning methylation levels vs. distance to TSS sites,
compared to bisulfite data (NA12878).
• We also generated completely methylated (M.SssI
treated; ~95% meth) and unmethylated – used to
generate the ROC curve (right)
• R7.3 91% accurate at 68% of sites
• R9 94% accurate at 77% of sites
Sim
pson
, W
ork
man, N
atu
re M
eth
od
s (
201
7)
For Research Use Only. Not for use in diagnostic procedures.
39
Cancer-Normal Comparison
Simpson, Workman, Nature Methods (2017)For Research Use Only. Not for
use in diagnostic procedures.
• Reduced representation method:12.5Mb of
the genome (3.5-6kb size selection)
• We sequenced this fraction on nanopore and
bisulfite Illumina seq
• Long reads measure phased methylation
40
Haplotype-Phased Methylation
nanopolish has experimental support for phasing methylation patterns
this haplotype is highly methylated
For Research Use Only. Not for use in diagnostic procedures.
41
Haplotype-Phased Methylation
nanopolish has experimental support for phasing methylation patterns
this haplotype isn’t
For Research Use Only. Not for use in diagnostic procedures.
42
Summary• Nanopore technology is full of potential for sequencing, but always choose the right tool for the right job.
Often multiple approaches with complementary data yield the best results.
• Multiple bases affect the electrical signal from nanopores; rather than a problem, this can be an
advantage, as each base is interrogated multiple times. Using a hidden Markov model and Viterbi
algorithm, we can decode the electrical signal to a DNA base sequence.
• Long-read sequencing is great at detection of structural variants in cancer samples used in clinical
research – when coupled with hybridization capture this can be a powerful approach.
• Modifications to the primary DNA sequence (e.g. cytosine methylation) can be detected directly using
nanopores, as compared to the gold standard of chemical modification (bisulfite treatment).
For Research Use Only. Not for use in diagnostic procedures.
43
Acknowledgments
• Timp Lab – Johns Hopkins
University
• Winston Timp, PhD
• Rachael Workman, MS
• Stephanie Hao, MS
• Yunfan Fan
• Isac Lee
• Amy Vandiver, MD, PhD
• Ontario Institute for Cancer
Research
• Jared Simpson, PhD
• P.C. Zuzarte, PhD
• Matei David, PhD
• L. J. Dursi, PhD
• Eshleman Lab – Johns
Hopkins School of Medicine
• James Eshleman, MD, PhD
• Alexis Norris, PhD
Agilent Technologies
Josh Wang, PhD
Jonathan Levine, PhD
1R01HG009190-01A1
• Simner Lab – Johns Hopkins
School of Medicine
• Patricia (Trish) Simner, PhD
• Belita Opene
• Annie Antar
• Yehudit Bergman
For Research Use Only. Not for use in diagnostic procedures.
44
Other mod detection
For Research Use Only. Not for use in diagnostic procedures.