Computational Identification of Tumor heterogeneity
2015-03-25, Sangwoo Kim
Tumor heterogeneity
• Inter-tumor heterogeneity: genetic and phenotypic variation between individuals with the same tumor type
• Intra-tumor heterogeneity: subclonal diversity within a tumor
Tumor heterogeneity in AML
Tumor progression and response
Heterogeneity and resistance
Inferring tumor heterogeneity
1. single-cell sequencing
2. bulk sequencing and reconstruction
COMPUTATIONAL IDENTIFICATION OF TUMOR SUBCLONES
Today’s paper 1 (PyClone)
• Sohrab Shah, Ph.D.
  – Associate Professor in the Departments of Pathology and Computer Science, University of British Columbia
  – Dr. Shah’s work focuses on characterizing cancer genomes to determine pathogenic driver mutations in cancer subtypes, and on measuring and quantifying tumour evolution
Conceptual overview
• Sequencing – pool sequencing – unclassified tools
Allele frequency and Cellular prevalence
• allele frequency (af): – fraction of all allele copies (haploid genomes) at the locus that carry the alternative allele
• cellular prevalence (cp): – proportion of tumor cells harboring a mutation
subclone 1 (AA): 70%
subclone 2 (AB): 30%
• allele frequency = 15%
• cellular prevalence = 30%
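subclone 1 (AA): 70%
subclone 2 (AAB): 30%
• allele frequency = 10% (taking AF = CP × b/c = 30% × 1/3; counting the two-copy cells of subclone 1 exactly gives 0.3 / 2.3 ≈ 13%)
• cellular prevalence = 30%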
Allele frequency to cellular prevalence
Example      AF    Genotype   CP
mutation 1   10%   AB         20%
mutation 2   10%   AAB        30%
mutation 3   10%   ABB        15%
mutation 4   20%   AB         40%
mutation 5   20%   AABB       40%
mutation 6   50%   AB         100%
mutation 7   50%   ABB        75%
Genotype (copy number) is essential for heterogeneity estimation
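The CP column of the table is consistent with the simplified conversion CP = AF × c(g) / b(g), where c(g) is the number of copies in the genotype and b(g) the number of B alleles. A minimal Python sketch of that conversion follows. It assumes every cell carries c(g) copies at the locus, so normal contamination and differing copy number in unmutated cells are ignored. The function name `cp_from_af` is hypothetical.

```python
def cp_from_af(af: float, genotype: str) -> float:
    """Convert an allele frequency to a cellular prevalence.

    Simplifying assumption: every cell has len(genotype) copies of the locus.
    """
    copies = len(genotype)                 # c(g): total copy number
    variant_copies = genotype.count("B")   # b(g): copies carrying the mutation
    if variant_copies == 0:
        raise ValueError("genotype must contain at least one variant allele (B)")
    return af * copies / variant_copies


# The seven rows of the table above: (AF, genotype)
examples = [
    (0.10, "AB"), (0.10, "AAB"), (0.10, "ABB"), (0.20, "AB"),
    (0.20, "AABB"), (0.50, "AB"), (0.50, "ABB"),
]
for af, genotype in examples:
    print(f"AF={af:.0%}  genotype={genotype:<4}  CP={cp_from_af(af, genotype):.0%}")
```

The same AF of 10% maps to a CP of 20%, 30%, or 15% depending on the genotype, which is the point of the slide.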
A toy example
Cellular prevalence and evolution model
Assumptions:
1) the clonal population follows a perfect phylogeny: no site mutates more than once in its evolutionary history, and each site harbors at most one somatic mutant genotype
2) the clonal population follows a persistent phylogeny: mutations do not disappear or revert
Cellular prevalence and evolution model
[Figure: example cellular prevalences of 10%, 10%, 20%, 30%, 30%]
What to infer:
1) number and composition of subclones
2) cellular prevalence (cp): proportion of tumor cells harboring a mutation
Input and Output
• Input (observation):
  – a set of deeply sequenced mutations (AF)
    • from one or more loci in each sample
  – a measure of allele-specific copy number at each mutation locus (genotype)
• Output:
  – CP of each mutation
  – clustering among mutations
  – overall CP and cluster
Clusters and CP
[Figure labels: CNV; mutation (AF)]
PyClone population structure
Allele frequency of this mutation: 6*4*(2/4) / {2*2 + 4*3 + 6*4} = 12/40 = 0.30
Cellular prevalence of this mutation: 6 / (4 + 6) = 0.60
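The arithmetic on this slide can be checked with a short Python sketch. The cell counts (2 normal, 4 reference, 6 variant) and copy numbers (2, 3, 4, with 2 variant copies in the variant population) are read off the formula above. The `Population` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Population:
    cells: int           # number of cells in this population
    copy_number: int     # c(g): copies of the locus per cell
    variant_copies: int  # b(g): copies per cell that carry the mutation


def allele_frequency(populations):
    """Variant allele copies divided by all allele copies in the mixture."""
    variant = sum(p.cells * p.variant_copies for p in populations)
    total = sum(p.cells * p.copy_number for p in populations)
    return variant / total


def cellular_prevalence(reference, variant):
    """Fraction of tumor cells (reference + variant) that carry the mutation."""
    return variant.cells / (reference.cells + variant.cells)


normal = Population(cells=2, copy_number=2, variant_copies=0)
reference = Population(cells=4, copy_number=3, variant_copies=0)
variant = Population(cells=6, copy_number=4, variant_copies=2)

print(allele_frequency([normal, reference, variant]))  # 12 / 40
print(cellular_prevalence(reference, variant))         # 6 / 10
```

Normal cells dilute the allele frequency but do not enter the cellular prevalence, which is defined over tumor cells only.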
Things to consider
• fraction of cancer cells: t
  – fraction of normal cells = 1 − t
• genotypes of the normal, reference, and variant populations at the nth mutation
  – gN, gR, gV ∈ {-, A, B, AA, AB, BB, AAA, AAB...}
  – $\psi_n = (g_N^n, g_R^n, g_V^n) \in \mathcal{G}^3$
• read depth at the locus of the nth mutation: dn
• number of reads harboring the nth mutation: bn
Cellular prevalence of nth mutation
The generative model
prior parameter
posterior parameter
$\psi_n = (g_N^n, g_R^n, g_V^n)$
φn = fraction of cancer cells belonging to the variant population
The probability
the probability of sampling a read containing the variant allele, covering a mutation with state ψ = (gN, gR, gV) and cellular prevalence φ
c(g): copy number of the genotype (e.g. g = AAB, c(g) = 3)
b(g): number of variant alleles in the genotype (e.g. g = AAB, b(g) = 1)
µ(g): probability of sampling a variant allele from a cell = b(g)/c(g)
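One way to combine these quantities is to weight each population by its share of cells times its copy number. This matches the worked allele-frequency example on the population-structure slide and, as far as I know, the form used in the PyClone paper. The sketch below ignores sequencing error, which is an assumption. The function names are hypothetical.

```python
def copy_number(genotype: str) -> int:
    """c(g): number of allele copies; '-' denotes a zero-copy genotype."""
    return genotype.count("A") + genotype.count("B")


def mu(genotype: str) -> float:
    """µ(g) = b(g) / c(g); taken as 0 when the genotype has no copies."""
    c = copy_number(genotype)
    return genotype.count("B") / c if c else 0.0


def sampling_probability(g_n: str, g_r: str, g_v: str, phi: float, t: float) -> float:
    """Probability that a read covering the locus carries the variant allele.

    phi: cellular prevalence, t: tumor content (fraction of cancer cells).
    """
    weighted = [
        ((1 - t) * copy_number(g_n), mu(g_n)),        # normal population
        (t * (1 - phi) * copy_number(g_r), mu(g_r)),  # reference population
        (t * phi * copy_number(g_v), mu(g_v)),        # variant population
    ]
    total = sum(weight for weight, _ in weighted)
    return sum(weight * m for weight, m in weighted) / total


# 2 normal, 4 reference (3 copies), 6 variant (AABB) cells: t = 10/12, phi = 6/10
print(sampling_probability("AA", "AAA", "AABB", phi=0.6, t=10 / 12))
```

With these inputs the result agrees with the 12/40 allele frequency computed by hand on the population-structure slide.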
The probability of bn
when cp is given we can calculate the probability of observing bn
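As an illustration, the sketch below scores an observed read count under a binomial model and scans a grid of CP values for the most likely one. It assumes the simplest setting (pure tumor, diploid locus, heterozygous AB mutation), where the variant-read probability is CP / 2. The counts (15 variant reads out of 100) are made up for the example.

```python
from math import lgamma, log


def binomial_log_pmf(b: int, d: int, xi: float) -> float:
    """log P(b variant reads | depth d, variant-read probability xi), 0 < xi < 1."""
    log_choose = lgamma(d + 1) - lgamma(b + 1) - lgamma(d - b + 1)
    return log_choose + b * log(xi) + (d - b) * log(1 - xi)


def xi_heterozygous_diploid(phi: float) -> float:
    """Assumed setting: t = 1, gR = AA, gV = AB, so xi = phi / 2."""
    return phi / 2


b_n, d_n = 15, 100                        # hypothetical observation
grid = [i / 100 for i in range(1, 101)]   # candidate CP values 0.01 .. 1.00
best_phi = max(grid, key=lambda phi: binomial_log_pmf(b_n, d_n, xi_heterozygous_diploid(phi)))
print(best_phi)
```

Working in log space keeps the computation stable at the read depths of thousands that deep sequencing produces.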
inferring cp from bn
1. mutations with the same cellular prevalence are clustered into the same clone
2. we want to infer the most likely cellular prevalence of each mutation from the observations, and find the clusters that form the subclones
e.g., if the best estimate is [0.7, 0.5, 0.5, 0.4, 0.2, 0.5, 1.0, 0.9, 0.1, 0.4]
always problematic!!
Getting cp by sampling
• CP prior ~ Dirichlet process
  – to have discrete CP values
• Sampling:
  – Metropolis-Hastings algorithm
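To see why a Dirichlet process prior yields discrete, shared CP values, here is a minimal sketch of a draw from the prior using the Chinese restaurant process representation. The Uniform(0, 1) base distribution and the concentration value are assumptions for illustration. The function name is hypothetical.

```python
import random


def sample_cp_from_crp(n_mutations: int, alpha: float, rng: random.Random):
    """Draw one CP value per mutation from a Dirichlet process prior.

    Mutation i joins an existing cluster with probability proportional to the
    cluster size, or opens a new cluster with probability proportional to alpha.
    """
    cluster_sizes, cluster_cps, assignments = [], [], []
    for _ in range(n_mutations):
        weights = cluster_sizes + [alpha]  # last slot = "open a new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
            cluster_cps.append(rng.random())  # assumed base measure: Uniform(0, 1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return [cluster_cps[k] for k in assignments], assignments


cps, clusters = sample_cp_from_crp(n_mutations=20, alpha=1.0, rng=random.Random(0))
print(len(set(clusters)), "clusters for", len(cps), "mutations")
```

Mutations assigned to the same cluster share exactly the same CP, so the number of clusters does not have to be fixed in advance.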
Let f(x) be a function that is proportional to the desired probability distribution P(x).
1. Initialization:
• Choose an arbitrary point x0 to be the first sample, and choose an arbitrary probability density Q(x|y) which suggests a candidate for the next sample value x, given the previous sample value y. For the Metropolis algorithm, Q must be symmetric; in other words, it must satisfy Q(x|y) = Q(y|x). A usual choice is to let Q(x|y) be a Gaussian distribution centered at y, so that points closer to y are more likely to be visited next, making the sequence of samples into a random walk. The function Q is referred to as the proposal density or jumping distribution.
2. For each iteration t:
• Generate a candidate x' for the next sample by picking from the distribution Q(x'|xt).
• Calculate the acceptance ratio α = f(x')/f(xt), which will be used to decide whether to accept or reject the candidate. Because f is proportional to the density of P, we have that α = f(x')/f(xt) = P(x')/P(xt).
• If α ≥ 1, then the candidate is more likely than xt; automatically accept the candidate by setting xt+1 = x'. Otherwise, accept the candidate with probability α; if the candidate is rejected, set xt+1 = xt instead.
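The steps above can be sketched as a generic random-walk Metropolis sampler in Python. This is an illustration of the algorithm as described, not PyClone's own sampler. The target (an unnormalised standard normal), the step size, and the function names are assumptions chosen for the example.

```python
import math
import random


def metropolis(f, x0: float, n_samples: int, step: float, rng: random.Random):
    """Random-walk Metropolis sampler for a density proportional to f.

    f(x0) must be positive so the first acceptance ratio is defined.
    """
    samples = []
    x, fx = x0, f(x0)
    for _ in range(n_samples):
        candidate = rng.gauss(x, step)        # symmetric Gaussian proposal Q
        f_candidate = f(candidate)
        alpha = f_candidate / fx              # acceptance ratio
        if alpha >= 1 or rng.random() < alpha:
            x, fx = candidate, f_candidate    # accept the candidate
        samples.append(x)                     # on rejection the old x is repeated
    return samples


def unnormalised_normal(x: float) -> float:
    """Proportional to the standard normal density (normalising constant dropped)."""
    return math.exp(-0.5 * x * x)


samples = metropolis(unnormalised_normal, x0=0.0, n_samples=20000, step=1.0,
                     rng=random.Random(1))
mean = sum(samples) / len(samples)
print(round(mean, 2))
```

Only the ratio f(x')/f(xt) is needed, so the normalising constant of the posterior never has to be computed.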
example of cluster
results (synthetic data)
• accuracy with synthetic data
  – di ~ Poisson(10,000), t = 0.75, 8 clusters with CP ~ Uniform(0,1)
  – genotype -> total copy number (1~5)
  – AB, BB, NZ, TCN, PCN -> genotype prior (see “prior for mutational genotype”)
results (synthetic data)
prior for mutational genotype
• copy number must be measured
  – for each mutation site:
    • c = total copy number
    • c1, c2 = copy number of each homologous chromosome
• 5 different strategies for assigning genotype
  – AB prior: gR = AA, gV = AB
  – BB prior: gR = AA, gV = BB
  – No Zygosity (NZ) prior: gR = AA, c(gV) = c, b(gV) = 1
  – Total Copy Number (TCN) prior: c(gV) = c, b(gV) ∈ {1, ..., c}
    • gR = AA or c(gR) = c, b(gR) = 0
  – Parental Copy Number (PCN) prior: c(gV) = c, b(gV) ∈ {1, c1, c2}
    • if b(gV) ∈ {c1, c2}, gR = gN (AA) => mutation occurred before copy number increase
    • if b(gV) = 1: c(gR) = c, b(gR) = 0 => mutation occurred after copy number increase
c=4, c1=c2=2
c=3, c1=1, c2=2
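The five strategies can be read as rules for listing candidate variant genotypes. The Python sketch below enumerates the (c(gV), b(gV)) pairs each prior allows, given the total copy number c and the parental copy numbers c1 and c2. It is an interpretation of the slide, not PyClone's code. The function name is hypothetical, and dropping b = 0 when a parental copy number is 0 is an assumption.

```python
def variant_genotype_states(prior: str, c: int, c1: int, c2: int):
    """List the (copy number, variant copies) pairs allowed for gV under a prior."""
    if prior == "AB":
        return [(2, 1)]
    if prior == "BB":
        return [(2, 2)]
    if prior == "NZ":
        return [(c, 1)]
    if prior == "TCN":
        return [(c, b) for b in range(1, c + 1)]
    if prior == "PCN":
        # b(gV) in {1, c1, c2}; a parental copy number of 0 cannot carry the mutation
        return [(c, b) for b in sorted({1, c1, c2} - {0})]
    raise ValueError(f"unknown prior: {prior}")


# The two copy-number examples above
for c, c1, c2 in [(4, 2, 2), (3, 1, 2)]:
    for prior in ["AB", "BB", "NZ", "TCN", "PCN"]:
        print(c, c1, c2, prior, variant_genotype_states(prior, c, c1, c2))
```

The PCN prior considers fewer states than TCN because allele-specific copy number rules out multiplicities that neither parental chromosome could produce.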
results (real data)
Data = physical mixture of 4 individuals (from 1000 Genomes) in proportions {0.01, 0.05, 0.20, 0.74}
– NA12156, NA12878, NA18507, NA19240
– generated 7 clusters (4 unique, NA18507+NA19240, NA12878+NA18507+NA19240, all four shared)
BeBin = Beta-Binomial (instead of binomial) to emulate over-dispersion
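To illustrate what over-dispersion means here, the Python sketch below compares the variance of read counts under a binomial and a beta-binomial with the same mean. The mean/precision parameterisation and the numbers (depth 100, mean 0.3, precision 50) are assumptions for the example. The function names are hypothetical.

```python
from math import exp, lgamma, log


def log_choose(d: int, b: int) -> float:
    return lgamma(d + 1) - lgamma(b + 1) - lgamma(d - b + 1)


def log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binomial_pmf(b: int, d: int, p: float) -> float:
    return exp(log_choose(d, b) + b * log(p) + (d - b) * log(1 - p))


def beta_binomial_pmf(b: int, d: int, mean: float, precision: float) -> float:
    """Beta-binomial with Beta(mean * precision, (1 - mean) * precision) mixing."""
    alpha, beta = mean * precision, (1 - mean) * precision
    return exp(log_choose(d, b) + log_beta(b + alpha, d - b + beta) - log_beta(alpha, beta))


def variance(pmf):
    """Variance of a count distribution given as a list of probabilities."""
    m = sum(k * p for k, p in enumerate(pmf))
    return sum((k - m) ** 2 * p for k, p in enumerate(pmf))


depth = 100
binom = [binomial_pmf(b, depth, 0.3) for b in range(depth + 1)]
bebin = [beta_binomial_pmf(b, depth, 0.3, 50.0) for b in range(depth + 1)]
print(variance(binom), variance(bebin))
```

Both distributions expect 30 variant reads, but the beta-binomial spreads its mass more widely, so noisy allele counts are less likely to be split into spurious clusters.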
results (real data)
True answer
PyClone (7 clusters)
naïve (12 clusters): false separation of clusters with homozygous and heterozygous mutations
cluster1
result (ovarian cancer)
Four spatially separated samples of a high-grade serous ovarian cancer -> 49 deeply sequenced, validated mutations
[Figure labels: LOH; heterozygous; CNV 1~3]
IBBMM cluster 1,2,6 should be collapsed to PyClone cluster 1 => single cell sequencing of 25
result (ovarian cancer)
IBBMM clusters 1 and 2 are one cluster (as PyClone expected)
PyClone cluster (yellow box = cluster 1)
IBBMM
non-somatic
Conclusions
• PyClone can infer clonal population structures in cancer
1. Using beta-binomial emission densities, which models data sets with more variance in allelic prevalence measurements more effectively than a binomial model.
2. Flexible prior probability estimates ('priors') of possible mutational genotypes are used, reflecting how allelic prevalence measurements are deterministically linked to zygosity and coincident copy-number variation events.
3. Bayesian nonparametric clustering is used to discover groupings of mutations and the number of groups simultaneously. This obviates fixing the number of groups a priori and allows for cellular prevalence estimates to reflect uncertainty in this parameter.
4. Multiple samples from the same cancer may be analyzed jointly to leverage the scenario in which clonal populations are shared across samples.
Software
• Implemented in Python
• Freely available at
  – http://compbio.bccrc.ca/software/request-to-download/?sw=pyClone
• License: GPL3 (free for academic use)
V-measure