
BitMAC: An In-Memory Accelerator for Bitvector-Based Sequence Alignment of Both Short and Long Genomic Reads

Damla Senol Cali1, Gurpreet S. Kalsi2, Lavanya Subramanian3, Can Firtina4, Anant V. Nori2, Jeremie S. Kim4,1, Zulal Bingöl5, Rachata Ausavarungnirun6,1, Mohammed Alser4, Juan Gomez-Luna4, Amirali Boroumand1,

Allison Scibisz1, Sreenivas Subramoney2, Can Alkan5, Saugata Ghose1, and Onur Mutlu4,1

1 Carnegie Mellon University, USA   2 Intel, USA   3 Facebook, USA   4 ETH Zürich, Switzerland   5 Bilkent University, Turkey   6 King Mongkut's University of Technology North Bangkok, Thailand

String Matching with Bitap Algorithm

The Bitap algorithm (also known as the Shift-Or or Baeza-Yates-Gonnet algorithm) [1] performs exact string matching with fast and simple bitwise operations. Wu and Manber extended the algorithm [2] to perform approximate string matching.

§ Step 1 – Preprocessing: For each character in the alphabet (i.e., A, C, G, T), generate a pattern bitmask that stores information about the presence of the corresponding character in the pattern.

§ Step 2 – Searching: Compare all characters of the text with the pattern by using the preprocessed bitmasks, a set of bitvectors that hold the status of the partial matches, and bitwise operations.

[1] Baeza-Yates, Ricardo, and Gaston H. Gonnet. "A new approach to text searching." Communications of the ACM 35.10 (1992): 74-82.
[2] Wu, Sun, and Udi Manber. "Fast text search allowing errors." Communications of the ACM 35.10 (1992): 83-91.
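As a concrete reference for the two steps above, the sketch below implements the Wu-Manber extension of Bitap in C for patterns of up to 64 characters. It is a minimal sketch of the textbook algorithm [1, 2], not BitMAC's modified in-memory version; the function and variable names are ours.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns the pattern bitmask for character c: bit i is set iff pattern[i] == c. */
static uint64_t pattern_bitmask(const char *pattern, int m, char c) {
    uint64_t mask = 0;
    for (int i = 0; i < m; i++)
        if (pattern[i] == c)
            mask |= 1ULL << i;
    return mask;
}

/* Wu-Manber Bitap (Shift-And convention): report every text position where the
 * pattern matches with at most k edits (substitutions, insertions, deletions).
 * Assumes the pattern has at most 64 characters so each bitvector fits in a word. */
void bitap_approx(const char *text, const char *pattern, int k) {
    int m = (int)strlen(pattern);
    int n = (int)strlen(text);
    if (m == 0 || m > 64)
        return;

    /* Step 1 - Preprocessing: one pattern bitmask per alphabet character. */
    uint64_t B[256] = {0};
    for (int c = 0; c < 256; c++)
        B[c] = pattern_bitmask(pattern, m, (char)c);

    /* R[d]: bit i is set iff pattern[0..i] matches a suffix of the text read so far
     * with at most d edits. Against the empty text, a prefix of length i+1 can only
     * match by deleting all of its characters, hence the (1 << d) - 1 initialization. */
    uint64_t R[65];
    for (int d = 0; d <= k; d++)
        R[d] = (1ULL << d) - 1;

    /* Step 2 - Searching. */
    for (int j = 0; j < n; j++) {
        uint64_t mask = B[(unsigned char)text[j]];
        uint64_t oldR_prev = R[0];               /* R[d-1] from the previous text position */
        R[0] = ((R[0] << 1) | 1) & mask;         /* exact-matching row */
        for (int d = 1; d <= k; d++) {
            uint64_t cur = R[d];
            R[d] = (((cur << 1) | 1) & mask)     /* match: pattern and text both advance   */
                 | oldR_prev                     /* insertion: consume text character only */
                 | (oldR_prev << 1)              /* substitution: advance both, +1 edit    */
                 | (R[d - 1] << 1)               /* deletion: advance pattern only         */
                 | 1;
            oldR_prev = cur;
        }
        if (R[k] & (1ULL << (m - 1)))
            printf("match ending at text position %d (<= %d edits)\n", j, k);
    }
}
```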

Problem

o Read mapping is the critical first step of the genome sequence analysis pipeline.

o In read mapping, each read is aligned against a reference genome to verify whether the potential location results in an alignment for the read (i.e., read alignment).

o Read alignment can be viewed as an approximate (i.e., fuzzy) string matching problem.

o Approximate string matching is typically performed with an expensive quadratic-time dynamic programming algorithm, which consumes over 70% of the execution time of read alignment.
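The quadratic-time cost refers to the classical edit-distance dynamic programming recurrence, sketched below in C as a generic textbook illustration (names are ours), not the code of any specific aligner.

```c
#include <stdlib.h>
#include <string.h>

/* Classical O(n*m) edit distance (Levenshtein) dynamic programming.
 * Each cell depends on its left, upper, and upper-left neighbors, which is what
 * makes the computation quadratic in time and hard to parallelize. */
int edit_distance(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    int *prev = malloc((m + 1) * sizeof(int));
    int *curr = malloc((m + 1) * sizeof(int));

    for (int j = 0; j <= m; j++)
        prev[j] = j;                                          /* build b[0..j) from the empty string */

    for (int i = 1; i <= n; i++) {
        curr[0] = i;                                          /* delete all of a[0..i) */
        for (int j = 1; j <= m; j++) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);   /* match or substitution */
            int del = prev[j] + 1;                            /* delete a[i-1] */
            int ins = curr[j - 1] + 1;                        /* insert b[j-1] */
            int best = sub < del ? sub : del;
            curr[j] = best < ins ? best : ins;
        }
        int *tmp = prev; prev = curr; curr = tmp;             /* reuse the two rows */
    }

    int result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```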

3D-Stacked DRAM and Processing-in-Memory (PIM)

[Figure: 3D-stacked DRAM organization (e.g., HBM): HBM DRAM dies stacked on a logic die and connected with TSVs and microbumps; the stack and the processor (GPU/CPU/SoC) die sit on an interposer over the package substrate and communicate through PHYs.]

Limitations of Bitap

(1) The algorithm itself cannot be parallelized due to data dependencies across loop iterations.

(2) Multiple searches can be done in parallel, but are limited by the number of compute units available in a CPU.

(3) Even if many compute units are made available (e.g., on a GPU), Bitap is highly constrained by the amount of available memory bandwidth.

Also, the standard Bitap algorithm cannot (1) perform the traceback step of the alignment, or (2) work effectively for both short and long reads.
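To make limitation (1) concrete, the fragment below reproduces the inner loop of the Wu-Manber update from the earlier sketch (same assumed variable names) and annotates what each term depends on; it is an annotated excerpt, not standalone code.

```c
/* Inner loop of the standard Wu-Manber update (excerpt from the earlier sketch).
 * Two kinds of loop-carried dependencies serialize the standard algorithm:
 *   - across text positions j: every R[d] needs R[d] and R[d-1] from position j-1
 *     (cur / oldR_prev), so text characters must be processed one after another;
 *   - across error rows d: the deletion term needs the *already updated* R[d-1]
 *     of the current text position, chaining the rows together. */
for (int d = 1; d <= k; d++) {
    uint64_t cur = R[d];                  /* value from text position j-1        */
    R[d] = (((cur << 1) | 1) & mask)      /* depends on R[d]   at position j-1   */
         | oldR_prev                      /* depends on R[d-1] at position j-1   */
         | (oldR_prev << 1)               /* depends on R[d-1] at position j-1   */
         | (R[d - 1] << 1)                /* depends on R[d-1] at position j (new) */
         | 1;
    oldR_prev = cur;
}
```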

Our Goal and Contributions

Goal: Design a fast and efficient customized accelerator for approximate string matching to enable faster read alignment, and therefore faster genome sequence analysis, for both short and long reads.

Contributions:
o Modified Bitap algorithm to perform efficient genome read alignment in memory, by
  ✓ parallelizing the Bitap algorithm by removing data dependencies across loop iterations,
  ✓ adding efficient support for long reads, and
  ✓ developing the first efficient Bitap-based algorithm for traceback.
o BitMAC, the first in-memory read alignment accelerator for both short accurate and long noisy reads:
  ✓ a hardware implementation of our modified Bitap algorithm, and
  ✓ designed specifically to take advantage of the high internal bandwidth available in the logic layer of 3D-stacked DRAM chips.

BitMAC: Efficient In-Memory Bitap

o Recent technological advances in memory design allow architects to tightly couple memory and logic within the same chip with very high bandwidth, low latency, and energy-efficient vertical connectors. → 3D-stacked memories (Hybrid Memory Cube (HMC), High-Bandwidth Memory (HBM))

o A customizable logic layer enables fast, massively parallel operations on large sets of data, and provides the ability to run these operations near memory at high bandwidth and low latency. → Processing-in-memory (PIM)

o BitMAC: implementation of our modified PIM-friendly Bitap algorithm using a dedicated hardware accelerator that we add to the logic layer of a 3D-stacked DRAM chip. BitMAC
  (1) achieves high performance and energy efficiency with specialized compute units and data locality,
  (2) balances the compute resources and the available memory bandwidth per compute unit,
  (3) scales linearly with the number of parallel compute units, and
  (4) provides generic applicability by performing both edit distance calculation and traceback for short and long reads.

o BitMAC consists of two components that work together to perform read alignment:
  (1) BitMAC-DC: calculates the edit distance between the reference genome and the query read, and
  (2) BitMAC-TB: performs traceback using the list of edit locations recorded by BitMAC-DC.

[Figure: BitMAC organization within a Hybrid Memory Cube (HMC, 16 GB, 32 vaults, TSVs to the logic layer). (a) Processing Block (PB): pattern mask generation, walk control, and memory control, operating on a slice of the reference text (Tn-Tn+i) and on the OldR and R bitvectors. (b) Processing Core (PC), per vault: 512 MB vault memory holding the text, a hardware accelerator (HWA) implementing BitMAC-DC with a 1.2 MB local SRAM scratchpad (pattern masks, OldR, and R), BitMAC-TB, and the host CPU, exchanging read data, read/write requests, pattern masks, and configuration/start signals.]

BitMAC-DC:
o Systolic-array-based configuration for an efficient implementation of edit distance calculation. This allows us to provide parallelism across multiple iterations of read alignment. BitMAC-DC
  (1) computes the edit distance between the whole reference genome and the input reads,
  (2) computes the edit distance between the candidate regions reported by an initial filtering step and the input reads, and also
  (3) finds the candidate regions of the reference genome by running the accelerator in exact matching mode.

BitMAC-TB:
o Makes use of a low-power, general-purpose PIM core to perform the traceback step of read alignment.
o New, efficient algorithm for traceback, which exploits the bitvectors generated by BitMAC-DC.
o Divides the matching region of the text (as identified by BitMAC-DC) into multiple windows, and then performs traceback in parallel on each window. This allows us to utilize the full memory bandwidth for the traceback step.
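The window-based partitioning idea can be illustrated with the minimal host-side sketch below, written with OpenMP purely for illustration: the window size, the worker function, and the non-overlapping partitioning are assumptions of this sketch, and the per-window traceback that BitMAC-TB actually performs on the in-memory core (using BitMAC-DC's bitvectors) is left as a placeholder.

```c
#include <stddef.h>

/* Hypothetical per-window worker: performs traceback for text[start..end).
 * In BitMAC-TB this work would run on the in-memory core; here it is only a
 * placeholder that marks where the per-window traceback would go. */
static void traceback_window(const char *text, size_t start, size_t end) {
    (void)text; (void)start; (void)end;   /* details omitted in this sketch */
}

/* Split the matching region (as identified by the distance-calculation phase)
 * into fixed-size windows and process the windows independently, so that the
 * traceback step can use the full available memory bandwidth.
 * Compile with -fopenmp to enable the parallel loop. */
void traceback_region(const char *text, size_t region_start, size_t region_end,
                      size_t window_size) {
    size_t num_windows = (region_end - region_start + window_size - 1) / window_size;

    #pragma omp parallel for
    for (size_t w = 0; w < num_windows; w++) {
        size_t start = region_start + w * window_size;
        size_t end   = start + window_size;
        if (end > region_end)
            end = region_end;
        traceback_window(text, start, end);
    }
}
```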

Results

[Figure: Alignment throughput (Reads/sec, log scale from 1E+00 to 1E+05) for simulated PacBio and ONT long reads at 10% and 15% error rates, plus the average, comparing BWA-MEM (t=1), BWA-MEM (t=12), Minimap2 (t=1), Minimap2 (t=12), and BitMAC.]

[Figure: Alignment throughput (KReads/sec, log scale from 1E+00 to 1E+09) for simulated Illumina short reads (100bp, 150bp, 250bp), plus the average, comparing BWA-MEM (t=1), BWA-MEM (t=12), Minimap2 (t=1), Minimap2 (t=12), and BitMAC.]

When compared with the alignment steps of BWA-MEM and Minimap2, BitMAC achieves:
For simulated PacBio and ONT datasets:
✓ 4997× and 79× throughput improvement → over the single-threaded baseline
✓ 455× and 7× throughput improvement → over the 12-threaded baseline
✓ 15.9× and 13.8× less power consumption
For simulated Illumina datasets:
✓ 102,558× and 16,867× throughput improvement → over the single-threaded baseline
✓ 8162× and 1445× throughput improvement → over the 12-threaded baseline

Future Work
We believe it is promising to explore:
o coupling PIM-based filtering methods with BitMAC to reduce the number of required read alignments,
o enhancing the design of BitMAC to support different scoring schemes and affine gap penalties,
o analyzing the effects of a larger alphabet on BitMAC, and
o examining the benefit of using AVX-512 support for multiple pattern searches in parallel.