GNUMAP-SNP: Parallel Pair-HMM SNP Detection. Nathan Clement, The University of Texas, Austin, TX, USA. Outline: Motivation, NGS Issues and Requirements, Pair-HMM, Memory Optimizations, Results, Conclusion.
PARALLEL PAIR-HMM SNP DETECTION
GNUMAP-SNP
Nathan Clement
The University of Texas
Austin, TX, USA
Outline
• Motivation
• NGS Issues and Requirements
• Pair-HMM
• Memory Optimizations
• Results
• Conclusion
Motivation
• Mutation detection: SNP discovery
• HapMap and resequencing
• Species identification
• Bisulfite sequencing
• Epigenetic influences
• RNA editing
Error Rates*

Instrument              Run Time  Mb/run     Bases/read  Primary Error Type  Error Rate (%)
3730xl (Capillary)      2 h       0.06       650         Substitution        0.1-1
454 FLX+                18-20 h   900        700         Indel               1
Illumina HiSeq2000      10 days   ≤ 600,000  100+100     Substitution        ≥0.1
Ion Torrent – 318 chip  2 h       >1000      >100        Indel               ~1
PacBio RS               0.5-2 h   5-10       860-1100    CG Deletions        16

* Data current as of May 2011: Glenn, Travis C., "Field guide to next-generation DNA sequencers," Molecular Ecology Resources, vol. 11, pp. 759-769, 2011.
Pair-HMM
Pair-HMM (Mathematics)
• Match
• Gap (in both directions)
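The equations on this slide did not survive extraction. As a rough reference, the textbook pair-HMM forward recurrences, with one match state M and two gap states X and Y (one per direction), can be sketched as below. The function name, the parameters `delta` (gap open), `eps` (gap extend) and `match_prob`, and the uniform gap emission are illustrative assumptions, not the values or code used by GNUMAP-SNP.

```python
def pair_hmm_likelihood(x, y, delta=0.1, eps=0.3, match_prob=0.9):
    """Forward pass of a textbook three-state pair-HMM (M, X, Y).

    All parameter values are toy assumptions chosen for illustration.
    Returns the summed probability of every alignment of x against y.
    """
    n, m = len(x), len(y)
    gap_emit = 0.25  # assumed uniform emission for a base aligned to a gap

    def emit(a, b):
        # Assumed emission for an aligned pair: high when the bases agree.
        return match_prob if a == b else (1.0 - match_prob) / 3.0

    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]  # base of x against a gap
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]  # base of y against a gap
    f_m[0][0] = 1.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                f_m[i][j] = emit(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * f_m[i - 1][j - 1]
                    + (1 - eps) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                f_x[i][j] = gap_emit * (delta * f_m[i - 1][j] + eps * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = gap_emit * (delta * f_m[i][j - 1] + eps * f_y[i][j - 1])

    return f_m[n][m] + f_x[n][m] + f_y[n][m]
```

Under these toy parameters a read that agrees with the genome scores higher than one that disagrees at every position, for example `pair_hmm_likelihood("aaaa", "aaaa") > pair_hmm_likelihood("aaaa", "tttt")`. A backward pass of the same shape would be needed to obtain per-cell posterior matrices like the M, X and Y tables on the following slides.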
Pair-HMM (M)
     a     t     a     c     g     a     c     t
a  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.68  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.32  0.68  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.32  0.68  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
Pair-HMM (X)
     a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.31  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Pair-HMM (Y)
     a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.31  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Pair-HMM
     A     C     G     T
a  1.00  0.00  0.00  0.00
g  0.00  0.00  0.68  0.31
t  0.32  0.00  0.00  0.68
a  0.99  0.00  0.00  0.00
g  0.00  0.00  1.00  0.00
a  1.00  0.00  0.00  0.00
c  0.00  1.00  0.00  0.00
c  0.00  1.00  0.00  0.00
Expected Results

CHR   POS      TOT    A     C      G      T      SNP?      PVAL
chrX  1755234  17.00  0.00  0.00   17.00  0.00   N
chrX  1755235  18.00  0.00  18.00  0.00   0.00   N
chrX  1755236  19.00  9.99  0.00   9.00   0.01   Y:g->a/g  2.54e-08
chrX  1755237  19.50  0.00  0.00   0.00   19.50  N
chrX  1755238  19.50  0.00  0.00   19.50  0.00   N
chrX  1755239  46.00  0.01  19.49  0.00   0.00   N
Why Inline SNP Calling?
• Post-processing
  ○ More disk space, less memory
• Inline
  ○ Requires more memory
  ○ Less disk space
  ○ Can include specific probabilities for each read
Previous Optimizations
• Two methods for speeding up mapping:
  1. Entire genome on one machine
  2. Split memory among different machines
     ○ Must normalize across all genome portions
     ○ MPI reduction
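The normalization step for the split-genome method can be illustrated with a small simulation. This is a minimal sketch, assuming each machine holds raw (unnormalized) alignment scores for one read against its own genome portion; `allreduce_sum` is a hypothetical pure-Python stand-in for an MPI sum reduction, not a call into any MPI library.

```python
def allreduce_sum(local_values):
    """Stand-in for an MPI sum all-reduce: every rank receives the global sum."""
    total = sum(local_values)
    return [total] * len(local_values)


def normalize_across_nodes(scores_per_node):
    """Normalize one read's alignment scores over every genome portion.

    scores_per_node[k] holds the raw scores found on node k (assumed layout).
    """
    local_sums = [sum(scores) for scores in scores_per_node]
    global_sums = allreduce_sum(local_sums)
    normalized = []
    for scores, global_sum in zip(scores_per_node, global_sums):
        if global_sum == 0:
            # The read matched nowhere; leave the zero scores untouched.
            normalized.append(list(scores))
        else:
            normalized.append([s / global_sum for s in scores])
    return normalized
```

With two nodes holding scores `[2.0, 1.0]` and `[1.0]`, the global sum is 4.0, so the normalized weights across the whole genome add up to 1 even though neither node ever sees the other's portion.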
Memory Requirements
• Human genome (3 Gbp)
  ○ HashMap ≈ 12 GB
  ○ 4 bits/character = 1.5 GB
  ○ 5 floating point values per base (A, C, G, T plus N) = sizeof(float) * 5 * 3 G bases = 60 GB
  ○ Also stores total for easy computation = sizeof(float) * 3 G bases = 12 GB
• Total of ≈ 90 GB per run
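The arithmetic behind these figures can be checked with a few lines. This is a sketch under the slide's own assumptions: a 3-billion-base genome, 4-byte floats, and decimal gigabytes; the 12 GB hash figure is taken from the slide rather than derived, and the function name is illustrative.

```python
GENOME_BASES = 3e9   # human genome, roughly 3 billion bases
SIZEOF_FLOAT = 4     # bytes, assuming a 32-bit float
GB = 1e9             # decimal gigabyte


def memory_estimate_gb(genome_bases=GENOME_BASES, hash_gb=12.0):
    """Break down the unoptimized memory footprint in GB."""
    sequence = genome_bases * 4 / 8 / GB                # 4 bits per character
    per_base = SIZEOF_FLOAT * 5 * genome_bases / GB     # A, C, G, T, N
    totals = SIZEOF_FLOAT * genome_bases / GB           # one running total per base
    return {
        "hash": hash_gb,
        "sequence": sequence,
        "per_base": per_base,
        "totals": totals,
        "total": hash_gb + sequence + per_base + totals,
    }
```

The components sum to 85.5 GB, which is in line with the rounded figure of about 90 GB per run.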
Three Memory Optimizations
• Normal (no optimization)
• Integer discretization
• Centroid discretization
Integer Discretization
• Only need one floating point value (for total) and 1 byte/nucleotide
• "Parts per 255"
• Biggest hit: going into and out of "integer space"
Integer Discretization

Added from read r_i:
Total  A     C     G     T     N
1.0    0.00  0.68  0.31  0.01  0.00

Genome (stored in integer space):
Total  A     C     G     T     N
12.0   3     231   7     12    3

Step 1: Convert from integer space
Total  A     C     G     T     N
12.0   0.15  10.9  0.33  0.56  0.15

Step 2: Add from r_i to genome
Total  A     C     G     T     N
13.0   0.15  11.6  0.64  0.57  0.15

Step 3: Convert back to integer space
Total  A     C     G     T     N
13.0   2     228   13    11    2
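The three steps above can be sketched in a few lines. This is a minimal illustration, assuming each genome position stores one float total plus five byte-sized counts in "parts per 255"; the function names and the use of `round` are assumptions, so the output is close to, but not necessarily identical to, the slide's final row, which appears to round its intermediate values.

```python
SCALE = 255  # "parts per 255": each nucleotide share fits in one byte


def from_integer_space(total, counts):
    """Step 1: expand byte counts into floating point amounts."""
    return [c * total / SCALE for c in counts]


def to_integer_space(total, values):
    """Step 3: compress floating point amounts back into byte counts."""
    return [min(SCALE, round(v * SCALE / total)) for v in values]


def add_read(total, counts, read_total, read_values):
    """Add one read's contribution (A, C, G, T, N) to a genome position."""
    values = from_integer_space(total, counts)               # Step 1
    values = [v + r for v, r in zip(values, read_values)]    # Step 2
    new_total = total + read_total
    return new_total, to_integer_space(new_total, values)    # Step 3


new_total, new_counts = add_read(
    12.0, [3, 231, 7, 12, 3], 1.0, [0.00, 0.68, 0.31, 0.01, 0.00]
)
```

Every update pays for two conversions, which is the "biggest hit" named on the previous slide. In exchange, a position needs 4 + 5 = 9 bytes instead of six 4-byte floats.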
Centroid Discretization
• Many states not used:
  ○ [255, 255, 255, 255, 255]
  ○ [0, 0, 0, 0, 0]
• Many states not biologically relevant
  ○ SNP transition (common) vs transversion (not likely)
• MSA uses this compression to perform fast one-to-many alignment
Centroid Discretization (cont)
• Benefits:
  ○ Doesn't waste space on impossible or infrequently used states
  ○ Much smaller memory footprint
• Drawbacks:
  ○ Slight overhead in converting between centroid and floating point space
  ○ Rounding error (how significant?)
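The idea can be illustrated with a tiny codebook. This is a sketch only: the six centroids below are hypothetical (four homozygous states plus the two transition heterozygotes, A/G and C/T), and nearest-centroid lookup by Euclidean distance is an assumed encoding rule, not necessarily the one used in GNUMAP-SNP.

```python
from math import dist

# Hypothetical codebook of (A, C, G, T, N) proportions. A real codebook
# would hold up to 256 entries so that an index fits in one byte.
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # homozygous A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # homozygous C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # homozygous G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # homozygous T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G heterozygote (transition)
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T heterozygote (transition)
]


def encode(proportions):
    """Return the index of the centroid nearest to the given proportions."""
    return min(range(len(CENTROIDS)), key=lambda i: dist(CENTROIDS[i], proportions))


def decode(index):
    """Return the proportions represented by a stored centroid index."""
    return CENTROIDS[index]


def rounding_error(proportions):
    """Distance between the true proportions and what is actually stored."""
    return dist(decode(encode(proportions)), proportions)
```

For example, `encode((0.53, 0.0, 0.47, 0.0, 0.0))` selects the A/G centroid. Proportions that fall far from every centroid, such as a transversion heterozygote with this codebook, are forced onto a poor match, which is where the rounding error named above comes from.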
Speed Comparison
Optimization Stats (chrX)

Optimization  Memory   Mem %   Wallclock  TP    FP
Normal        4.76GB   100%    04:25:55   1309  127
CharDisc      2.58GB   54.2%   04:36:58   677   0
CentDisc      2.01GB   42.2%   04:27:29   166   9058
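The Mem % column and a rough accuracy measure can be recomputed from the table. In this sketch, TP and FP are read as true-positive and false-positive SNP calls, which is an assumption about the column labels; precision is derived here and is not reported on the slide.

```python
RUNS = {
    "Normal":   {"mem_gb": 4.76, "tp": 1309, "fp": 127},
    "CharDisc": {"mem_gb": 2.58, "tp": 677,  "fp": 0},
    "CentDisc": {"mem_gb": 2.01, "tp": 166,  "fp": 9058},
}


def summarize(runs, baseline="Normal"):
    """Memory relative to the baseline and precision = TP / (TP + FP)."""
    base_mem = runs[baseline]["mem_gb"]
    summary = {}
    for name, run in runs.items():
        summary[name] = {
            "mem_fraction": run["mem_gb"] / base_mem,
            "precision": run["tp"] / (run["tp"] + run["fp"]),
        }
    return summary


SUMMARY = summarize(RUNS)
```

Read this way, Mem % is the memory that remains relative to the unoptimized run, so 42.2% corresponds to a reduction of about 58%. Precision is roughly 0.91 for Normal, 1.0 for CharDisc, and under 0.02 for CentDisc. Recall cannot be computed from the table alone because the number of true SNPs is not given.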
Conclusion
• For high error rates, the HMM approach is ideal, but requires more memory
  ○ Distributing the genome across processors doesn't scale linearly
• Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint)
  ○ Centroid discretization performs poorly
  ○ Integer discretization can be used when available memory is low
Questions