
Supplementary Figure 1

Motif size distribution.

The number of MS loci per motif size across the whole genome (red), the exome (green), and an annotated set of cancer genes from Lawrence et al. (ref. 1; blue). Mono- and di-repeats represent ~99% of all MS loci.

1. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).

Nature Biotechnology: doi:10.1038/nbt.3966


[Figure: x-axis, No. of repeats; y-axis, No. of loci.]

Supplementary Figure 2

Sequencing coverage across motifs.

The number of MS loci per length for different motifs (A, C, AC, and AG) across the exome is shown in red while the average number of MS loci covered by at least 10 reads is shown in blue. The number of MS loci covered at 10x depth decreases more rapidly than the number of MS loci, demonstrating the difficulty in achieving sufficient coverage for longer repeat lengths. Together, the motifs A, C, AC, and AG represent 98% of MS loci in the exome.


[Figure: x-axis, MS repeat length; y-axis, Fraction of reads not in the mode.]

Supplementary Figure 3

Comparison of accuracy of sequence-alignment tools at MS loci.

Noise is plotted as a function of the MS repeat length for the standard alignment (using the Burrows-Wheeler Aligner, BWA; ref. 2) versus the MS-specific alignment (adapted from lobSTR; ref. 3). Data are shown for the AG motif. Noise was defined as the fraction of reads that differ from the modal number of repeats, aggregated over all the MS loci in the X chromosome from normal male samples (which are assumed to be homozygous at each MS locus). On average, noise is reduced by approximately a factor of 5 using the MS-specific alignment method.
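The noise definition in this caption can be illustrated with a minimal Python sketch. The input layout (one list of per-read repeat counts for each locus) and the function name `noise_fraction` are assumptions made for illustration; they are not taken from MSMuTect.

```python
from collections import Counter


def noise_fraction(loci_reads):
    """Fraction of reads whose repeat count differs from the per-locus mode,
    aggregated over all loci.

    loci_reads: iterable of lists; each list holds the repeat count observed
    in every read covering one MS locus (hypothetical input layout).
    """
    off_mode = 0
    total = 0
    for reads in loci_reads:
        if not reads:
            continue  # an uncovered locus contributes nothing
        # Number of reads that support the most common repeat count
        modal_reads = Counter(reads).most_common(1)[0][1]
        off_mode += len(reads) - modal_reads
        total += len(reads)
    return off_mode / total if total else float("nan")


# Two toy loci: 1 of 4 and 2 of 6 reads differ from the mode
example = [[10, 10, 10, 9], [12, 12, 12, 12, 13, 11]]
print(noise_fraction(example))
```

Aggregating read counts before dividing, rather than averaging per-locus fractions, weights each locus by its coverage; whether the authors weighted loci this way is not stated in the caption.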

2. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

3. Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).


[Figure: axis labels, No. of MS indels and Allele fraction.]

Supplementary Figure 4

Analysis of true-positive rates.

The number of detected simulated MS indels (out of 200) across repeat lengths (shown in different colors) and allele fractions. The sensitivity to detect MS indels decreases markedly at low allele fractions.


[Figure: axis labels, log10(KS test) and No. of loci.]

Supplementary Figure 5

False-positive rates.

False-positive rates for the A and C motifs as a function of MSMuTect parameters. Heat maps show the log10 false-positive rate per MS locus (i.e., the fraction of false-called MS indels among all MS loci) for the A and C motifs. The y-axis is the threshold for the different AIC scores (Tr) and the x-axis is the threshold for the Kolmogorov-Smirnov (KS) filtering step.
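The caption does not describe how the KS filtering step is computed. Purely as a generic illustration of the statistic involved, the sketch below computes a two-sample KS statistic between two histograms of per-read repeat lengths. Comparing a normal and a tumor histogram in this way is an assumption, as are the function name `ks_statistic` and the dictionary input layout.

```python
def ks_statistic(hist_a, hist_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical cumulative distributions.

    hist_a, hist_b: dicts mapping repeat length -> read count
    (hypothetical input layout).
    """
    total_a = sum(hist_a.values())
    total_b = sum(hist_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms need at least one read")
    cdf_a = cdf_b = 0.0
    largest_gap = 0.0
    # Walk the repeat lengths in order, accumulating both CDFs
    for length in sorted(set(hist_a) | set(hist_b)):
        cdf_a += hist_a.get(length, 0) / total_a
        cdf_b += hist_b.get(length, 0) / total_b
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


normal = {10: 18, 9: 2}         # toy histogram, mostly 10 repeats
tumor = {10: 10, 9: 8, 8: 2}    # toy histogram, shifted toward shorter alleles
print(ks_statistic(normal, tumor))
```

A filter built on this statistic would convert it to a P-value and compare that with a threshold; the conversion and the threshold direction used by MSMuTect are not given here, so they are left out.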


Supplementary Figure 6

Distribution of MS indels and SNVs across cancer.

Comparison of the fraction of MS indels (upper panel) and number of SNVs (lower panel) across 4,041 tumors from 20 tumor types. Only samples with annotated MS indels and SNVs are shown. Red horizontal lines represent the mean number of MS indels in each tumor type.


[Figure: eight panels in two rows (one allele in the normal; two alleles in the normal) and four columns (one, two, three, or four alleles in the tumor). Each panel shows the number of MS indels (y-axis) for MSI-H, MSI-L, and MSS samples (x-axis), with significance marks (ns or asterisks) above the comparisons.]

Supplementary Figure 7

The number of MS indels for different changes in the number of alleles.

The number of MS indels for STAD samples (divided into MSI-H, MSI-L, and MSS) plotted for different numbers of germline and tumor alleles. MSMuTect not only detects the presence of a somatic MS indel but also infers the actual alleles in both the germline and tumor samples. The upper row shows the number of MS indels for loci that had one allele in the germline, and the lower row for loci that had two alleles in the germline. The columns represent the number of alleles in the tumor (ranging from one to four). For example, the plot in the third column of the second row shows cases in which the germline has two alleles (i.e., heterozygous sites) but the tumor sample has three alleles. MS indels are more common in MSI-H tumors in all settings except when the germline has two alleles but the tumor has only a single allele (bottom left corner), which reflects loss of heterozygosity (LOH). MSI designations (MSI-H, MSI-L, or MSS) are based on the Bethesda gel classification (taken from TCGA). The y-axis scale varies across panels. The significance of the difference was calculated using a one-tailed t-test (ns, p>0.05; * p<0.05; ** p<10^-3; *** p<10^-8; **** p<10^-16).


Supplementary Figure 8

Correlation between germline variability and somatic MS indel frequency.

The x-axis represents the binned fraction of non-reference alleles at each MS locus (out of the 2*N alleles in our cohort, where N is the number of covered normal samples). The somatic MS indel frequency for each MS locus is plotted as blue dots. Black dots represent the mean of each bin. The upper panel shows A8 loci with germline variability in the range 0 to 0.1, and the lower panel shows the range 0 to 1. The effect of germline variability on the somatic rate is minor for germline variability <0.1.
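The germline variability measure defined in this caption (non-reference alleles out of 2*N) can be sketched as follows. The genotype layout (one pair of repeat lengths per covered normal sample) and the function name `germline_variability` are assumptions made for illustration.

```python
def germline_variability(genotypes, reference_length):
    """Fraction of non-reference alleles at one MS locus.

    genotypes: list of (allele_1, allele_2) repeat lengths, one pair per
    covered normal sample (hypothetical input layout).
    reference_length: repeat length of the reference allele.
    """
    n_samples = len(genotypes)
    if n_samples == 0:
        return float("nan")  # locus not covered in any normal sample
    non_reference = sum(
        (a != reference_length) + (b != reference_length)
        for a, b in genotypes
    )
    # Each diploid sample contributes two alleles, hence 2 * N
    return non_reference / (2 * n_samples)


# Four normal samples at an A8 locus: 3 of 8 alleles are non-reference
print(germline_variability([(8, 8), (8, 7), (7, 7), (8, 8)], 8))
```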


[Figure: x-axis, Number of MS indels (0 to 40); y-axis, log10(Frequency of loci) (−5 to 0).]

Supplementary Figure 9

Distribution of MS indels in A8 in noncoding regions.

The observed frequencies of mutated A8 loci for each number of indels are shown as black dots, whereas the expected frequency from a fit based on a binomial distribution is represented by the red line. The x-axis represents the number of MS indels and the y-axis represents the fraction of loci that have a particular number of MS indels.
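The expected curve in a plot of this kind can be sketched with a binomial model. The caption does not say how the fit was performed, so the sketch below assumes a single per-sample mutation probability shared by all loci and estimates it by maximum likelihood; the function names and the toy counts are hypothetical.

```python
import math


def fit_binomial_p(indel_counts, n_samples):
    """Maximum-likelihood estimate of the per-sample, per-locus mutation
    probability, assuming every locus shares the same probability.

    indel_counts: number of samples carrying an MS indel at each locus.
    n_samples: number of samples in which each locus could be mutated.
    """
    return sum(indel_counts) / (len(indel_counts) * n_samples)


def expected_fraction(k, n_samples, p):
    """Binomial probability that a locus carries exactly k MS indels."""
    return math.comb(n_samples, k) * p**k * (1 - p) ** (n_samples - k)


counts = [0, 0, 1, 0, 2, 0, 1, 0]  # toy data: 8 loci, 10 samples
p_hat = fit_binomial_p(counts, n_samples=10)
for k in range(4):
    observed = counts.count(k) / len(counts)
    expected = expected_fraction(k, 10, p_hat)
    print(k, observed, round(math.log10(expected), 3))
```

Loci whose observed frequency lies well above the binomial expectation at high indel counts are the ones such a plot is meant to expose.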


Supplementary Figure 10

STAD quantile–quantile plot.

MSMutSig QQ plot for stomach adenocarcinoma (STAD). Quantile-quantile plot of observed vs. expected P-values under the negative binomial (also called gamma-Poisson) model. Significant MS loci (q<0.1) are shown in red.
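The coordinates of a QQ plot such as this one, and the q-values used to flag significant loci, can be sketched in a few lines. The sketch assumes that q-values come from the Benjamini-Hochberg procedure, which the caption does not state; the function names are hypothetical, and the P-values themselves (from the negative binomial model) are taken as given.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(P) for a QQ plot.

    The i-th smallest of n uniform P-values is expected at i / (n + 1).
    All P-values must be greater than zero.
    """
    n = len(p_values)
    return [
        (-math.log10(rank / (n + 1)), -math.log10(p))
        for rank, p in enumerate(sorted(p_values), start=1)
    ]


def bh_q_values(p_values):
    """Benjamini-Hochberg q-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


p_values = [0.001, 0.02, 0.03, 0.5]  # toy P-values, one per MS locus
significant = [q < 0.1 for q in bh_q_values(p_values)]
print(qq_points(p_values)[0], significant)
```

Points that rise above the diagonal of expected versus observed values indicate loci mutated more often than the background model predicts.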


Supplementary Figure 11

COAD quantile–quantile plot.

MSMutSig QQ plot for colon adenocarcinoma (COAD). Quantile-quantile plot of observed vs. expected P-values under the negative binomial (also called gamma-Poisson) model. Significant MS loci (q<0.1) are shown in red.


Supplementary Figure 12

UCEC quantile–quantile plot.

MSMutSig QQ plot for endometrial cancer (UCEC). Quantile-quantile plot of observed vs. expected P-values under the negative binomial (also called gamma-Poisson) model. Significant MS loci (q<0.1) are shown in red.


[Figure: y-axis, Expression (RSEM); groups, WT (n = 36) and p.K1489fs (n = 31); the difference is marked *.]

Supplementary Figure 13

PRDM2 transcript levels in WT versus mutant PRDM2 cases.

PRDM2 transcript levels (by RNA-seq) were lower in cases with a PRDM2 p.K1489fs frameshift mutation than in PRDM2 WT cases (P=0.016, two-tailed Mann-Whitney test).


Supplementary Figure 14

MutSig quantile–quantile plot for endometrial cancer (UCEC).

Quantile-quantile plot of observed vs. expected P-values for MSI-H cases using only previously identified mutations (red) and using previously identified mutations and MS indels (green). Using MutSig for datasets with large numbers of MS indels leads to an inflation in the number of significantly mutated genes.
