
BitMAC: An In-Memory Accelerator for Bitvector-Based Sequence Alignment of Both Short and Long Genomic Reads

Damla Senol Cali1, Gurpreet S. Kalsi2, Lavanya Subramanian3, Can Firtina4, Anant V. Nori2, Jeremie S. Kim4,1, Zulal Bingöl5, Rachata Ausavarungnirun6,1, Mohammed Alser4, Juan Gomez-Luna4, Amirali Boroumand1,

Allison Scibisz1, Sreenivas Subramoney2, Can Alkan5, Saugata Ghose1, and Onur Mutlu4,1

1 Carnegie Mellon University, USA   2 Intel, USA   3 Facebook, USA   4 ETH Zürich, Switzerland   5 Bilkent University, Turkey   6 King Mongkut's University of Technology North Bangkok, Thailand

String Matching with Bitap Algorithm

The Bitap algorithm (also known as the Shift-Or or Baeza-Yates-Gonnet algorithm) [1] performs exact string matching with fast and simple bitwise operations. Wu and Manber extended the algorithm [2] to perform approximate string matching.

§ Step 1 – Preprocessing: For each character in the alphabet (i.e., A, C, G, T), generate a pattern bitmask that stores information about the presence of the corresponding character in the pattern.

§ Step 2 – Searching: Compare all characters of the text with the pattern by using the preprocessed bitmasks, a set of bitvectors that hold the status of the partial matches, and bitwise operations.

[1] Baeza-Yates, Ricardo, and Gaston H. Gonnet. "A new approach to text searching." Communications of the ACM 35.10 (1992): 74-82.
[2] Wu, Sun, and Udi Manber. "Fast text search allowing errors." Communications of the ACM 35.10 (1992): 83-91.
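As a concrete reference for the two steps above, the sketch below implements the Wu-Manber extension of Bitap in C for patterns of up to 64 characters. It is a minimal sketch of the textbook algorithm [1, 2], not BitMAC's modified in-memory version; the function and variable names are ours.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns the pattern bitmask for character c: bit i is set iff pattern[i] == c. */
static uint64_t pattern_bitmask(const char *pattern, int m, char c) {
    uint64_t mask = 0;
    for (int i = 0; i < m; i++)
        if (pattern[i] == c)
            mask |= 1ULL << i;
    return mask;
}

/* Wu-Manber Bitap (Shift-And convention): report every text position where the
 * pattern matches with at most k edits (substitutions, insertions, deletions).
 * Assumes the pattern has at most 64 characters so each bitvector fits in a word. */
void bitap_approx(const char *text, const char *pattern, int k) {
    int m = (int)strlen(pattern);
    int n = (int)strlen(text);
    if (m == 0 || m > 64)
        return;

    /* Step 1 - Preprocessing: one pattern bitmask per alphabet character. */
    uint64_t B[256] = {0};
    for (int c = 0; c < 256; c++)
        B[c] = pattern_bitmask(pattern, m, (char)c);

    /* R[d]: bit i is set iff pattern[0..i] matches a suffix of the text read so far
     * with at most d edits. Against the empty text, a prefix of length i+1 can only
     * match by deleting all of its characters, hence the (1 << d) - 1 initialization. */
    uint64_t R[65];
    for (int d = 0; d <= k; d++)
        R[d] = (1ULL << d) - 1;

    /* Step 2 - Searching. */
    for (int j = 0; j < n; j++) {
        uint64_t mask = B[(unsigned char)text[j]];
        uint64_t oldR_prev = R[0];               /* R[d-1] from the previous text position */
        R[0] = ((R[0] << 1) | 1) & mask;         /* exact-matching row */
        for (int d = 1; d <= k; d++) {
            uint64_t cur = R[d];
            R[d] = (((cur << 1) | 1) & mask)     /* match: pattern and text both advance   */
                 | oldR_prev                     /* insertion: consume text character only */
                 | (oldR_prev << 1)              /* substitution: advance both, +1 edit    */
                 | (R[d - 1] << 1)               /* deletion: advance pattern only         */
                 | 1;
            oldR_prev = cur;
        }
        if (R[k] & (1ULL << (m - 1)))
            printf("match ending at text position %d (<= %d edits)\n", j, k);
    }
}
```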

Problem

o Read mapping is the critical first step of the genome sequence analysis pipeline.

o In read mapping, each read is aligned against a reference genome to verify whether the potential location results in an alignment for the read (i.e., read alignment).

o Read alignment can be viewed as an approximate (i.e., fuzzy) string matching problem.

o Approximate string matching is typically performed with an expensive quadratic-time dynamic programming algorithm, which consumes over 70% of the execution time of read alignment.
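The quadratic-time cost refers to the classical edit-distance dynamic programming recurrence, sketched below in C as a generic textbook illustration (names are ours), not the code of any specific aligner.

```c
#include <stdlib.h>
#include <string.h>

/* Classical O(n*m) edit distance (Levenshtein) dynamic programming.
 * Each cell depends on its left, upper, and upper-left neighbors, which is what
 * makes the computation quadratic in time and hard to parallelize. */
int edit_distance(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    int *prev = malloc((m + 1) * sizeof(int));
    int *curr = malloc((m + 1) * sizeof(int));

    for (int j = 0; j <= m; j++)
        prev[j] = j;                                          /* build b[0..j) from the empty string */

    for (int i = 1; i <= n; i++) {
        curr[0] = i;                                          /* delete all of a[0..i) */
        for (int j = 1; j <= m; j++) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);   /* match or substitution */
            int del = prev[j] + 1;                            /* delete a[i-1] */
            int ins = curr[j - 1] + 1;                        /* insert b[j-1] */
            int best = sub < del ? sub : del;
            curr[j] = best < ins ? best : ins;
        }
        int *tmp = prev; prev = curr; curr = tmp;             /* reuse the two rows */
    }

    int result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```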

3D-Stacked DRAM and Processing-in-Memory (PIM)

[Figure: 3D-stacked DRAM organization (e.g., HBM): HBM DRAM dies stacked on a logic die and connected with TSVs and microbumps; the stack and the processor (GPU/CPU/SoC) die sit on an interposer over the package substrate and communicate through PHYs.]

Limitations of Bitap

(1) The algorithm itself cannot be parallelized due to data dependencies across loop iterations.

(2) Multiple searches can be done in parallel, but are limited by the number of compute units available in a CPU.

(3) Even if many compute units are made available (e.g., on a GPU), Bitap is highly constrained by the amount of available memory bandwidth.

Also, the standard Bitap algorithm cannot (1) perform the traceback step of the alignment, or (2) work effectively for both short and long reads.
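To make limitation (1) concrete, the fragment below reproduces the inner loop of the Wu-Manber update from the earlier sketch (same assumed variable names) and annotates what each term depends on; it is an annotated excerpt, not standalone code.

```c
/* Inner loop of the standard Wu-Manber update (excerpt from the earlier sketch).
 * Two kinds of loop-carried dependencies serialize the standard algorithm:
 *   - across text positions j: every R[d] needs R[d] and R[d-1] from position j-1
 *     (cur / oldR_prev), so text characters must be processed one after another;
 *   - across error rows d: the deletion term needs the *already updated* R[d-1]
 *     of the current text position, chaining the rows together. */
for (int d = 1; d <= k; d++) {
    uint64_t cur = R[d];                  /* value from text position j-1        */
    R[d] = (((cur << 1) | 1) & mask)      /* depends on R[d]   at position j-1   */
         | oldR_prev                      /* depends on R[d-1] at position j-1   */
         | (oldR_prev << 1)               /* depends on R[d-1] at position j-1   */
         | (R[d - 1] << 1)                /* depends on R[d-1] at position j (new) */
         | 1;
    oldR_prev = cur;
}
```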

Our Goal and Contributions

Goal: Design a fast and efficient customized accelerator for approximate string matching to enable faster read alignment, and therefore faster genome sequence analysis, for both short and long reads.

Contributions:
o Modified Bitap algorithm to perform efficient genome read alignment in memory, by
  ✓ parallelizing the Bitap algorithm by removing data dependencies across loop iterations,
  ✓ adding efficient support for long reads, and
  ✓ developing the first efficient Bitap-based algorithm for traceback.
o BitMAC, the first in-memory read alignment accelerator for both short accurate and long noisy reads:
  ✓ a hardware implementation of our modified Bitap algorithm, and
  ✓ designed specifically to take advantage of the high internal bandwidth available in the logic layer of 3D-stacked DRAM chips.

BitMAC: Efficient In-Memory Bitap

o Recent technological advances in memory design allow architects to tightly couple memory and logic within the same chip with very high bandwidth, low latency, and energy-efficient vertical connectors. → 3D-stacked memories (Hybrid Memory Cube (HMC), High-Bandwidth Memory (HBM))

o A customizable logic layer enables fast, massively parallel operations on large sets of data, and provides the ability to run these operations near memory at high bandwidth and low latency. → Processing-in-memory (PIM)

o BitMAC: implementation of our modified PIM-friendly Bitap algorithm using a dedicated hardware accelerator that we add to the logic layer of a 3D-stacked DRAM chip. BitMAC
  (1) achieves high performance and energy efficiency with specialized compute units and data locality,
  (2) balances the compute resources and the available memory bandwidth per compute unit,
  (3) scales linearly with the number of parallel compute units, and
  (4) provides generic applicability by performing both edit distance calculation and traceback for short and long reads.

o BitMAC consists of two components that work together to perform read alignment:
  (1) BitMAC-DC: calculates the edit distance between the reference genome and the query read, and
  (2) BitMAC-TB: performs traceback using the list of edit locations recorded by BitMAC-DC.

[Figure: BitMAC organization within a Hybrid Memory Cube (HMC, 16 GB, 32 vaults, TSVs to the logic layer). (a) Processing Block (PB): pattern mask generation, walk control, and memory control, operating on a slice of the reference text (Tn-Tn+i) and on the OldR and R bitvectors. (b) Processing Core (PC), per vault: 512 MB vault memory holding the text, a hardware accelerator (HWA) implementing BitMAC-DC with a 1.2 MB local SRAM scratchpad (pattern masks, OldR, and R), BitMAC-TB, and the host CPU, exchanging read data, read/write requests, pattern masks, and configuration/start signals.]

BitMAC-DC:
o Systolic-array-based configuration for an efficient implementation of edit distance calculation. This allows us to provide parallelism across multiple iterations of read alignment. BitMAC-DC
  (1) computes the edit distance between the whole reference genome and the input reads,
  (2) computes the edit distance between the candidate regions reported by an initial filtering step and the input reads, and also
  (3) finds the candidate regions of the reference genome by running the accelerator in exact matching mode.

BitMAC-TB:
o Makes use of a low-power, general-purpose PIM core to perform the traceback step of read alignment.
o New, efficient algorithm for traceback, which exploits the bitvectors generated by BitMAC-DC.
o Divides the matching region of the text (as identified by BitMAC-DC) into multiple windows, and then performs traceback in parallel on each window. This allows us to utilize the full memory bandwidth for the traceback step.
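The window-based partitioning idea can be illustrated with the minimal host-side sketch below, written with OpenMP purely for illustration: the window size, the worker function, and the non-overlapping partitioning are assumptions of this sketch, and the per-window traceback that BitMAC-TB actually performs on the in-memory core (using BitMAC-DC's bitvectors) is left as a placeholder.

```c
#include <stddef.h>

/* Hypothetical per-window worker: performs traceback for text[start..end).
 * In BitMAC-TB this work would run on the in-memory core; here it is only a
 * placeholder that marks where the per-window traceback would go. */
static void traceback_window(const char *text, size_t start, size_t end) {
    (void)text; (void)start; (void)end;   /* details omitted in this sketch */
}

/* Split the matching region (as identified by the distance-calculation phase)
 * into fixed-size windows and process the windows independently, so that the
 * traceback step can use the full available memory bandwidth.
 * Compile with -fopenmp to enable the parallel loop. */
void traceback_region(const char *text, size_t region_start, size_t region_end,
                      size_t window_size) {
    size_t num_windows = (region_end - region_start + window_size - 1) / window_size;

    #pragma omp parallel for
    for (size_t w = 0; w < num_windows; w++) {
        size_t start = region_start + w * window_size;
        size_t end   = start + window_size;
        if (end > region_end)
            end = region_end;
        traceback_window(text, start, end);
    }
}
```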

Results

[Figure: Alignment throughput (Reads/sec, log scale from 1E+00 to 1E+05) for simulated PacBio and ONT long reads at 10% and 15% error rates, plus the average, comparing BWA-MEM (t=1), BWA-MEM (t=12), Minimap2 (t=1), Minimap2 (t=12), and BitMAC.]

[Figure: Alignment throughput (KReads/sec, log scale from 1E+00 to 1E+09) for simulated Illumina short reads (100bp, 150bp, 250bp), plus the average, comparing BWA-MEM (t=1), BWA-MEM (t=12), Minimap2 (t=1), Minimap2 (t=12), and BitMAC.]

When compared with the alignment steps of BWA-MEM and Minimap2, BitMAC achieves:
For simulated PacBio and ONT datasets:
✓ 4997× and 79× throughput improvement → over the single-threaded baseline
✓ 455× and 7× throughput improvement → over the 12-threaded baseline
✓ 15.9× and 13.8× less power consumption
For simulated Illumina datasets:
✓ 102,558× and 16,867× throughput improvement → over the single-threaded baseline
✓ 8162× and 1445× throughput improvement → over the 12-threaded baseline

Future Work
We believe it is promising to explore:
o coupling PIM-based filtering methods with BitMAC to reduce the number of required read alignments,
o enhancing the design of BitMAC to support different scoring schemes and affine gap penalties,
o analyzing the effects of a larger alphabet on BitMAC, and
o examining the benefit of using AVX-512 support for multiple pattern searches in parallel.