Detection of Rare-Alleles and Their Carriers Using Compressed Se(que)nsing

Or Zuk

Broad Institute of MIT and Harvard

In collaboration with:

Amnon Amir, Dept. of Physics of Complex Systems, Weizmann Inst. of Science

Noam Shental, Dept. of Computer Science, The Open University of Israel

The Problem

Identify genotypes (disease) in a large population

[Figure: genotypes of nine individuals, two AB and seven AA]

Specifics:
• Large populations (hundreds to tens of thousands)
• Rare alleles
• Pre-defined genomic regions

Naïve Approach – Targeted selection + Next Gen Seq.: One Test per Individual

collect DNA samples

Apply 9 independent tests

[Figure: genotypes of nine individuals, two AB and seven AA]

Fraction of B’s out of tested alleles: 0, 1/2, 0, 0, 0, 1/2, 0, 0, 0

Problem: Finding rare alleles requires profiling a large number of individuals, which is still very costly. Multiplexing/barcoding provides only a partial solution (laborious, expensive, and often there are not enough distinct barcodes).

Targeted selection

Our approach - Targeted Selection + Smart pooling + Next Gen seq.

collect DNA samples. Prepare Pools

Advantages:
• Fewer pools
• Reduced sample preparation and sequencing costs
• Can still achieve accurate genotypes

Apply 3 pooled tests

[Figure: genotypes of nine individuals, two AB and seven AA]

Fraction of B’s out of tested alleles: 0, 1/2, 0, 0, 0, 1/2, 0, 0, 0

Targeted selection

Reconstruct genotypes
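The pooling step can be sketched in Python as follows. This is a minimal illustration only, assuming a random design in which each individual joins each pool with probability 1/2. The function names `make_pools` and `pooled_fractions` are hypothetical and are not taken from the Comseq package.

```python
import random
from fractions import Fraction

def make_pools(n_individuals, n_pools, rng):
    """Assign each individual to each pool with probability 1/2 (assumed design)."""
    pools = []
    for _ in range(n_pools):
        members = [j for j in range(n_individuals) if rng.random() < 0.5]
        if not members:                      # avoid an empty pool
            members = [rng.randrange(n_individuals)]
        pools.append(members)
    return pools

def pooled_fractions(pools, rare_allele_counts):
    """Ideal fraction of B alleles per pool (each person contributes 2 alleles)."""
    return [Fraction(sum(rare_allele_counts[j] for j in members), 2 * len(members))
            for members in pools]

rng = random.Random(0)
genotypes = [0, 1, 0, 0, 0, 1, 0, 0, 0]    # 0 = AA, 1 = AB (two carriers among nine)
pools = make_pools(len(genotypes), 3, rng)
pool_fractions = pooled_fractions(pools, genotypes)
for members, f in zip(pools, pool_fractions):
    print(members, f)
```

Each printed line shows the members of one pool and the ideal fraction of B alleles that a noiseless test of that pool would report.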

Application 1: Rare recessive genetic diseases

Genotype   Phenotype
Normal     Healthy
Carrier    Healthy!
Affected   Sick

Identify carriers of known deleterious mutations

Nationwide carrier screen

http://upload.wikimedia.org/wikipedia/commons/3/3e/Autorecessive.svg

Genetic Disorder Carrier rate

Tay-Sachs 1:25

Cystic Fibrosis 1:30

Familial Dysautonomia 1:30

Usher Syndrome 1:40

Canavan 1:40

Glycogen Storage 1:71

Fanconi Anemia C 1:80

Niemann-Pick 1:80

Mucolipidosis type 4 1:100

Bloom 1:102

Nemaline Myopathy 1:108

Large scale carrier screen

(rates vary across ethnic groups)
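To get a feel for what these rates mean for a screen, the arithmetic below computes the expected number of carriers in a population and the chance that a pool contains no carrier. It is a rough sketch that assumes individuals are independent and uses a few rates copied from the table; the helper names are hypothetical.

```python
from fractions import Fraction

# Carrier rates copied from the table above, written as "1 in N".
CARRIER_RATE_ONE_IN = {
    "Tay-Sachs": 25,
    "Cystic Fibrosis": 30,
    "Canavan": 40,
    "Bloom": 102,
}

def expected_carriers(population_size, one_in):
    """Expected number of carriers when each person is a carrier with rate 1/one_in."""
    return Fraction(population_size, one_in)

def prob_no_carrier(pool_size, one_in):
    """Probability that a pool of independent individuals contains no carrier."""
    return (1 - Fraction(1, one_in)) ** pool_size

for disorder, one_in in CARRIER_RATE_ONE_IN.items():
    print(disorder,
          float(expected_carriers(10_000, one_in)),
          round(float(prob_no_carrier(10, one_in)), 3))
```

Even for the most common disorder listed, carriers are a small minority of the population, so the vector of rare-allele counts is mostly zeros.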

Specific mutations - notation

“A”: Reference genome …AGCGTTCT…

“B”: Single-nucleotide polymorphism (SNP) …AGTGTTCT…

“B”: Insertion/Deletion (InDel) …AGGTTCT

Carrier test screen: Amplify a sample of DNA and then test

Fraction of B’s out of tested alleles: “AA” gives 0, “AB” gives 1/2

Application 2: Genome Wide Association Studies

collect DNA samples

Cases: AB AB BB AB BB AA AA AB AB

Controls: AA AB AA AA AA AA AB AA AA

Count   Cases   Controls
AA      X_AA    Y_AA
AB      X_AB    Y_AB
BB      X_BB    Y_BB

Try ~10^5 – 10^6 different SNPs. Significant ones are called ‘discoveries’/‘associations’.

Statistical test, p-value
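The slide does not say which statistical test is applied to the genotype counts. As one common choice, the sketch below assumes a Pearson chi-square test on the 2x3 table of (AA, AB, BB) counts; the counts used are toy numbers for illustration only.

```python
import math

def chi_square_2x3(cases, controls):
    """Pearson chi-square statistic and p-value for a 2x3 genotype count table.

    With 2 rows and 3 columns there are 2 degrees of freedom, and the
    chi-square survival function for 2 d.o.f. is exactly exp(-x / 2).
    """
    total = sum(cases) + sum(controls)
    stat = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for observed, c1, c2 in zip(row, cases, controls):
            expected = row_total * (c1 + c2) / total
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2)

# Toy counts in the order (AA, AB, BB); illustrative numbers only.
cases = (2, 5, 2)
controls = (7, 2, 0)
stat, p_value = chi_square_2x3(cases, controls)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

When ~10^5 to 10^6 SNPs are tested, the per-SNP significance threshold has to be corrected for multiple testing, so a p-value near 0.05 like this one would not count as a discovery.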

What Associations are Detected ?

[T.A. Manolio et al. Nature 2009]

Goal: push further

Find Novel mutations associated with common disease and their carriers

What Associations are Detected ?

Find Novel mutations associated with common disease and their carriers

Proposed approaches:

Profile larger populations.

Look at SNPs with lower Minor Allele Frequency

Re-sequencing in regions with common SNPs found, and other regions of interest

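infer/reconstruct

Fraction of B’s in each of the three pools: $\frac{1+1}{2 \cdot 5}$, $\frac{0}{2 \cdot 4}$, $\frac{1}{2 \cdot 5}$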

Compressed Sensing Based Group Testing

Next Generation Sequencing Technology

Compressed sensing (CS): a few tests instead of 9

fraction of B’s

Rare Allele Identification in a CS Framework

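$y_i = \frac{1}{2} m_i x$

$m_i = \frac{1}{5}(1,0,1,1,1,0,0,0,1)$ (individuals in the pool)

$x = (0,0,0,1,0,0,0,0,1)^T$ (# rare alleles per individual; genotypes AA AA AA AB AA AA AA AA AB)

$y_i = \frac{1+1}{2 \cdot 5} = \frac{1}{5}$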
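The single-pool example on this slide can be checked with a few lines of Python. This is a small sketch using exact fractions; the function name `ideal_pool_fraction` is made up for the illustration.

```python
from fractions import Fraction

def ideal_pool_fraction(membership, rare_allele_counts):
    """y_i = (1/2) * m_i . x, where m_i is the membership row scaled by pool size."""
    pool_size = sum(membership)
    b_alleles = sum(m * x for m, x in zip(membership, rare_allele_counts))
    return Fraction(b_alleles, 2 * pool_size)

membership = [1, 0, 1, 1, 1, 0, 0, 0, 1]   # five of nine individuals are in the pool
x = [0, 0, 0, 1, 0, 0, 0, 0, 1]            # number of rare alleles per individual
y = ideal_pool_fraction(membership, x)
print(y)  # 1/5
```

Both carriers fall inside the pool, so 2 of the pool's 10 alleles are B.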

Compressed Sensing (CS)

• The standard CS problem: $y = Mx$, with $x = (x_1, \dots, x_n)^T$ and $y = (y_1, \dots, y_k)^T$: n variables, k << n equations

• But: x is sparse: $\|x\|_0 \le s \ll n$

• Matrix should obey certain properties (Restricted Isometry Property). Example: random Gaussian or Bernoulli matrix, e.g. with rows $m_i = (\pm 1, \pm 1, \dots, \pm 1)$

• Then: Can reconstruct x uniquely with k = O(s log(n/s)) equations (a.k.a. ‘measurements’)

• Can do so efficiently, even for large matrices (L1 minimization)
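To make the idea of recovering a sparse x from few pooled measurements concrete, here is a toy noiseless sketch. It swaps in two simplifications that are not from the slides: a deterministic binary-code design instead of a random matrix, and a brute-force search over sparse candidates instead of L1 minimization. It is only practical for very small n.

```python
from fractions import Fraction
from itertools import combinations, product

def design_matrix(n):
    """Row i pools the individuals whose (1-based) index has bit i set."""
    n_pools = n.bit_length()
    return [[(j >> i) & 1 for j in range(1, n + 1)] for i in range(n_pools)]

def measure(matrix, x):
    """Noiseless pooled fractions y_i = (m_i . x) / (2 * pool size)."""
    return [Fraction(sum(m * v for m, v in zip(row, x)), 2 * sum(row))
            for row in matrix]

def sparse_decode(matrix, y, max_carriers):
    """Return every x in {0,1,2}^n with at most max_carriers nonzeros that explains y."""
    n = len(matrix[0])
    solutions = []
    for s in range(max_carriers + 1):
        for support in combinations(range(n), s):
            for values in product((1, 2), repeat=s):
                x = [0] * n
                for j, v in zip(support, values):
                    x[j] = v
                if measure(matrix, x) == y:
                    solutions.append(x)
    return solutions

n = 9
M = design_matrix(n)                   # 4 pools for 9 individuals
x_true = [0, 0, 0, 1, 0, 0, 0, 0, 0]   # one heterozygous carrier
y = measure(M, x_true)
print(sparse_decode(M, y, max_carriers=1))
```

With a single carrier, the set of pools that show any B reads spells out the carrier's index in binary, so 4 pools are enough for 9 individuals in this toy setting.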

NextGenSeq Output

Output: “reads”
Example: Illumina, a few million reads per lane
Read length – a few dozen to a few hundred bases

line = “read”

NextGenSeq – Targeted Sequencing

Measure the number of reads containing B out of total number of reads. Here: 1/16

Parts of this modeling appeared in [P. Prabhu & I. Pe’er, Genome Research, July 09]

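Model Formulation

Ideal measurement - the fraction of “B” reads: $y_i = \frac{1}{2} m_i x$

NGST measurement:

1. Sampling noise: finite number of reads from each site, $r$. $r$ is itself a random variable, with parameters (total # reads / # loci, 1).

$z_i \sim \mathrm{Binomial}(r, y_i)$. Estimated frequency: $\hat{y}_i = z_i / r$

2. Technical errors:
• Read errors $e_r$: 0.5-1%
• DNA preparation errors

Reconstruction:

$x^* = \arg\min_x \|x\|_1 \quad \text{s.t.} \quad \left\| \frac{1}{2} M x - \frac{\frac{1}{r} z - e_r}{1 - 2 e_r} \right\|_2 \le \epsilon, \qquad x \in \{0,1,2\}^N$

Here $\|x\|_1$ is the sparsity-promoting term and the norm in the constraint is the error term.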
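The measurement noise described on this slide can be simulated in a few lines. This is a sketch under stated assumptions: read errors are taken to be symmetric (a B read is miscalled as non-B and vice versa with the same rate), the number of reads is held fixed rather than drawn at random, and the function names are hypothetical.

```python
import random

def simulate_reads(y, n_reads, read_error, rng):
    """Draw z ~ Binomial(n_reads, p), where p is y after symmetric read errors."""
    p = y * (1 - read_error) + (1 - y) * read_error
    return sum(1 for _ in range(n_reads) if rng.random() < p)

def estimate_fraction(z, n_reads, read_error):
    """Invert the assumed read-error model: (z / r - e_r) / (1 - 2 * e_r)."""
    return (z / n_reads - read_error) / (1 - 2 * read_error)

rng = random.Random(1)
y_true = 0.2          # ideal fraction of B reads in a pool
r = 1000              # reads covering the site (fixed here for simplicity)
e_r = 0.01            # read error rate, within the 0.5-1% range quoted above
z = simulate_reads(y_true, r, e_r, rng)
print(z, estimate_fraction(z, r, e_r))
```

The corrected estimate fluctuates around the true fraction, and the fluctuation shrinks as the number of reads per site grows.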

Results (simulations)

arxiv 0909.0400v1

[f = freq. of rare allele]

Can reconstruct over 10,000 people with no errors, using only 200 lanes

Software package: Comseq [a dedicated solver for this application: noise model, translation to CS, reconstruction, ...]

Results (real data)

1. Pooled-sequencing experimental data: validate the pooling part (variation in amount of DNA)

2. 1000 Genomes data: validate all other technical errors (e.g. read error, sampling error) in a large-scale experiment

Results (dataset 1)

Pooling dataset from: [Out et al., Human Mutation 2009]

88 People in one pool – region length (hyb-selection)

sequenced by5 SNPs identified, of which 9 are ‘rare’ (carrier freq. < 4%): 5 with one carrier, 3 with two carriers, 1 with three carriers.

Create ‘in-silico’ pools:
• Randomize individuals’ identity in each pool
• Determine number of carriers
• Sample frequencies based on observed frequencies in the single pool for the same number of carriers

Results (dataset 1)

Pooling dataset from: [Out et al., Human Mutation 2009]

[Cartoon]

Results (dataset 1)

One and two carriers: real pooling results match the theoretical model.

Three carriers: real pooling results are worse due to one problematic SNP.

When constructing pools of at most 2 people, results match theoretical model

[Plot: % with perfect reconstruction vs. # tests]

Results (dataset 2) 1000 Genomes Data: http://www.1000genomes.org/

Pilot 3 data: Exome Sequencing, ~1000 genes, ~700 people

Filtered: 633 rare SNPs (MAF < 2%), of which 20 contained rare heterozygous

364 individuals sequenced by Illumina

Create ‘in-silico’ pools:
• Randomize individuals’ identity in each pool
• Determine number of carriers
• Sample an individual from the pool at random. Then sample a read from the set of reads for this individual.
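The in-silico pooling procedure listed above can be sketched as follows. The data layout (a dictionary from individual to a list of reads, with each read recorded only by the allele it shows) and the function name `sample_pool_reads` are assumptions made for the illustration, not the format of the 1000 Genomes files.

```python
import random

def sample_pool_reads(pool, reads_by_individual, n_reads, rng):
    """Sample reads from a pool: pick an individual uniformly, then one of their reads."""
    sampled = []
    for _ in range(n_reads):
        person = rng.choice(pool)
        sampled.append(rng.choice(reads_by_individual[person]))
    return sampled

# Toy data: each read is recorded only by the allele it shows at the SNP.
reads_by_individual = {
    "p1": ["A"] * 20,
    "p2": ["A"] * 10 + ["B"] * 10,   # heterozygous carrier
    "p3": ["A"] * 20,
    "p4": ["A"] * 20,
}
rng = random.Random(0)
pool = ["p1", "p2", "p3"]
reads = sample_pool_reads(pool, reads_by_individual, 200, rng)
print(reads.count("B") / len(reads))
```

With one heterozygous carrier among three pooled individuals, the observed fraction of B reads should land near 1/6, up to sampling noise.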


Results (dataset 2)

Results derived from actual 1000 Genomes reads match simulations from our statistical model

Conclusions

• Generic approach: combines sequencing and CS to identify rare allele carriers.

• Naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles.

• Much higher efficiency than the naive approach. Can be combined with barcoding.

• Manuscript available on arxiv: arxiv 0909.0400v1 [N. Shental, A. Amir and O. Zuk, in revision]

• Comseq Package: Code available at: http://www.broadinstitute.org/mpg/comseq [simulating, designing experiments, reconstructing genotypes, ...]

Thank You

Noam Shental Amnon Amir
