A 4-mm 180-nm-CMOS 15-Giga- Cell-Updates-per …strukov/papers/2017/racelogicchip2017.pdfA 4-mm2 180-nm-CMOS 15-Giga- ... fabricated in standard 180-nm CMOS technology, ... Current

A 4-mm2 180-nm-CMOS 15-Giga-Cell-Updates-per-Second DNA Sequence Alignment

Engine Based on Asynchronous Race ConditionsAdvait Madhavan

Electrical andComputer Engineering

UC Banta [email protected]

Timothy SherwoodComputer ScienceUC Banta Barbara

[email protected]

Dmitri StrukovElectrical and

Computer EngineeringUC Banta Barbara

[email protected]

Abstract—We have fabricated and successfully tested, for thefirst time, a prototype chip of a Race Logic computing paradigm,which makes positive use of race conditions for accelerating abroad class of optimization problems, such as ones solved bydynamic programming algorithms. In Race Logic, information isencoded in signal propagation delay, rather than conventionallogic levels, and the result of the computation is observedfrom relative timing differences between injected signals, i.e. theoutcome of races. The 2×2 mm2 chip, fabricated in standard 180-nm CMOS technology, is designed to perform real-world DNAsequence alignment. Measurement results on typical benchmarkdata show 15 GCUPS sustained throughput at 70 mW powerconsumption, with only ∼ 15 mW spent for actual computation.These numbers compare very favorably with the state-of-the-artimplementations.

I. INTRODUCTIONAs we enter into a leakage limited regime, faced with

transistor utilization issues, the proliferation of hardware accel-erators seems inevitable [2], [9]. In order to extract maximumperformance and energy efficiency, researchers have begunquestioning fundamental assumptions of traditional designssuch as precision and binary encoding. Guided by error tolerantapplications and human perceptual imperfections, approximatecircuits are coming to the fore, in which the digital domain isoften abandoned to improve energy efficiency [4], [12], [13].In this paper, we present a prototype of an architecture, thatseems to lie somewhere between traditional digital and analogrealms. Though all wires are driven to a ”1” or ”0” like indigital systems, information is encoded in the delay of thesignal, which allows a continuum of values to be encoded ontoa single wire. Computation is then performed by observingthe relative timing differences between injected signals in areconfigurable circuit. Such a temporal information represen-tation makes some arithmetic operations, such as additionsand comparisons, trivial to implement, allowing for significantperformance and energy gains.

The race formulation is especially well suited to solvedynamic-programming-like problems. Though a whole host ofsuch problems exist [6], we chose the well studied problem ofDNA sequence alignment [1] such that a fair comparison canbe made. With the field of personalized medicine setting thegoal of a complete genome sequencing at a $1000, sequencealignment has recently seen considerable research activity witha number of general purpose [3], GPU [5], as well as ASIC[11] implementations. In contrast to these methods, we presentfor the first time a prototype of a fully custom 4 mm2 DNAsequence alignment engine fabricated in 0.18 µm process, thatmakes positive use of asynchronous race conditions.

II. BACKGROUNDThe core idea behind race logic is to encode information

in a timing delay, which regardless of its implementation,simplifies certain computation primitives [7]. For example,in the case of addition, simply stacking two delay elementsone after another causes the resultant delay to be a sum ofthe two. Similarly, in the case of comparisons, the smallest(or largest) magnitude between multiple data values can beeasily computed by arranging the corresponding delay valuesin parallel and selecting the first arriving (or last arriving)signal. Figure 1a shows the hardware implementations ofsimple addition and comparison operations with simple CMOSprimitives, under the basic assumption that the event that isbeing delayed is a rising edge. It can be seen that selectingthe first (or last) arriving rising edge can be implemented withOR (or AND) gates respectively. The simple ADD, MIN andMAX primitives are not to be underestimated, as when chainedeffectively, the can prove to be very powerful operations.

Fig. 1. (a) Race Logic basics showing unit delays, addition, and comparisontechniques. (b) Edit graphs with two highlighted paths representative of (c)nominal case and (d) worst case alignments. Best case alignment would involveonly the major diagonal and is not representative of such a computation.

To understand how these simple primitives can be chainedeffectively, we look to examples of shortest paths on directedacyclic graphs (DAGs), e.g. as shown in Figure 1b, which areknown to be good representations of dynamic programmingalgorithms. Starting from the root nodes, the shortest pathto the next node is computed by performing addition andcomparison operations at that node. Conventional softwaremethods implement this algorithm by stepping through nodesand following graph dependencies one computation step at atime, while GPU and FPGA based techniques aid in speedupby concurrently compute the scores of independent nodes.

ASIC implementations use either systolic arrays or heuris-tic SRAM-based processing elements with clever encodingschemes for density and performance.

Contrary to such implementations, the delay encoding ex-plicitly constructs the graphical dependency chain in hardwareby replacing edges with delay elements and nodes with ORgates respectively. A counter is used to calculate the total timetaken to traverse the graph which tells us the length of theoptimal path.

To highlight the full potential of Race Logic, we chose thewell studied problem of DNA sequence alignment, in whichthe objective is to measure the similarity between two givenDNA sequences (also called patterns or strings) based on agiven metric, known as the edit distance. String similarity is atypical bottleneck operation, whether it is in reference-assistedor de-novo DNA sequencing and is performed billions of timesin the sequencing of a whole human genome [1]. Not onlydoes his problem map very effectively to a shortest path on agraph, which can be implemented in a reconfigurable way fordifferent sequences, the resultant graph has a regular structurewith the edges weights having a dynamic range of aboutan order of magnitude. This edit graph is a two-dimensionalrepresentation of all the possible alignments between the twoinput sequences as shown in Fig. 1b. Any specific alignmentis just a path in this graph where every edge corresponds to anedit operation (vertical arrows = insertions, horizontal arrows= deletions and diagonal arrows = matches or mismatches). Toselect the alignment with the maximum number of matches,a scoring function (or matrix) is introduced, which penalizesmismatches with a higher score and a match with a lowerone. Determining the “best” alignment is therefore a matter offinding the shortest path in the graph. Typical score matricescan be written as as [1, 2, 1] [1, 4, 3] etc., where the scoresrepresent [match, mismatch, in-dels], respectively.

III. RACE LOGIC IMPLEMENTATIONIn order to be representative of the state of the art, we

chose a problem size of 50 which lies within the nominal rangeof NGS sequence data. Using two 50-symbol long strings, areconfigurable edit graph structure, similar to the one shownin Fig. 1b, is implemented in which weights of the matchand mismatch conditions are based on the real score matricesfrom Ref. [1]. Though the structure of the graph looks thesame for all query DNA sequence input, the magnitudes ofthe particular edge weights (i.e. values of delays) are governedonly the by the specific input pair of sequences. For each newpair of sequences, time spent in navigating the resultant graphis measured using a clock, and is representative of the qualityof alignment/similarity between the respective sequences. Thechoice of delay element is also an important one as previousstudies have shown that synchronous delay elements are veryarea expensive and incur cubic energy scaling with problemsize[7]. Current starved inverter based asynchronous delayelements have been proposed as a compact, fast and energyefficient alternative [8]. The main result of this paper is thatwe show that Race Logic with 10,000 delay elements can bedesigned in a robust and variation-tolerant manner.

A. Race Logic Array and Bias NetworksThe 2,500 cell array consists of repetitions of a funda-

mental unit cell whose tileable structure is shown in Fig. 2b.Each cell implements 4 delay elements, XOR based matching

circuitry and a symmetric resettable OR gate at the input. Thetiming on the array reset circuitry is adjusted externally toensure that all delay elements have been reset. Each delayelement is constructed out of a current starved inverter with itsoutput split by the current control transistor for better matching[10], followed by a regular minimum size inverter to sharpenthe edge (Fig. 2c). Since only rising edges go through thedelay element, the current starved inverter only uses NMOSbias and cascode control nodes. Each cell receives 2-bit symbolinput from the nucleotides being compared, to determine whichdelays to choose based on match and mismatch conditions, andrising edge input from its top, left and diagonal neighbours.The symbol input goes into XOR-based matching circuitrywhich chooses a diagonal delay element through a pass-gate MUX to make sure that both paths have similar delaycharacteristics. Dummy pass-gates were also added in offdiagonal paths to ensure similar delay across all delay paths.

In the design of this first prototype, functional correctnessand accuracy in the face of process variations and switch-ing/coupling noise was a major concern. Intra-die processvariations by virtue of large array size (∼ 1 mm a side) as wellas switching noise from high speed switching activity couldboth negatively affect correctness of the array with injectednoise being data dependant and hence more important to getrid of. To address these issues, we proposed heavy decouplingof bias nodes with MIM-caps to keep silicon area down toa minimum, as well as partitioning of the array into regionsof 10 × 10 with their own local bias networks that allow fordecoupling of switching activity from one part of the arrayto another as shown in Figure 3c. The local biases networks(Fig. 3b) are generated by a set of global bias networks thatdistribute timing information across the entire array. To ensurethat the generated currents are variable, yet tolerant to processand supply variations, the global bias network uses an Op-Ampto pin a fixed voltage across a precision potentiometer.

B. I/O system, Clock and Control LogicThe I/O system consists of a binary coded SPI interface

for sequence inputs with the following encoding for DNAnucleotides: A = “00”, G = “01”, C = “10” and D = “11”.The system can load up to four 50-symbol long sequencesat once and run them in either patterned or burst mode. Thepatterned mode allows for checking functional correctness ofeach pattern against a reference, while the burst mode repeatsone sequence pair alignment at maximum allowable throughputof the system until the user initiates stoppage. The idea hereis to allow enough time for correct measurement of powerthrough the external system.

The function of the clock generator block (Fig. 2a), is togenerate an output clock to count the time period of the criticalpath race in unit delay steps. Since the exact unit delay is notknown post fabrication, we directly vary the supply voltage ofa 11-stage ring oscillator, to generate large dynamic range (700ps to 15 ns), controllable, and calibratable clock. A differentialamplifier based level shifter was used to scale the clock signal,which is then buffered out to the rest of the circuit, as wellas 12 bit pre-scaled counter that divides it down for off-chipmeasurement.

The function of the control logic block is to take allthe elements discussed in this section and create a simplestate machine based interface that can be used externally forperformance and power measurement. The state machine of

Fig. 2. (a) Organization of the implemented chip. CS D1, D2 and OD represent current control for diagonal match, diagonal mismatch, and off diagonal delays,respectively, while LCS represents the local current sources. (b) Unit cell of the Race Logic array and gate level implementation of its (c) symmetric OR gate,and (d) delay element. The implemented delay element also has a cascode control node (not shown on panel d).

Fig. 3. (a) Micrograph of the Race Logic chip with its major functional units highlighted. The Race Logic array is an explicit implementation of the editgraph, and is reconfigured every computation based on the input patterns. (b) and (c) show a 10 × 10 race array with MIM-caps and local bias network. Thelocal bias networks receive their control input from the global bias network as shown in Fig. 2a. Panels (d) and (e) show the circuit and layout of the unit cellwhich is tiled to construct the whole array, while panels (f) and (g) show symmetric OR gate design and delay element.

such a control logic block is responsible for interfacing withthe array, and counter blocks, starting the race computation(array and counter), detecting the end of the race resetting thearray and counter.

IV. RESULTS AND DISCUSSIONCharacterization of the delay elements revealed a minimum

unit delay of approx 2 ns, with about an order of magnitudeof control over the dynamic range through the precisionpotentiometer. By using specific input sequences, diagonal andoff diagonal paths can be specifically isolated allowing reliableconstruction of score matrices from NCBI blast benchmark(such as, e.g. [1, 2, 1] [1, 4, 3] [1, 6, 4] [1, 10, 6]).

To test the functionality of the designed circuit, we ransimilarity measures on real DNA sequence data from human

genome by simulating the process of its shotgun sequencing.In particular, a section from chromosome 1 was partitionedinto 50-symbol long sequences taken at random places, witha coverage of 15. These “shotgunned strands” are comparedagainst each other, both in simulation as well as our imple-mentation. The results for score matrix [1, 4, 3] for a set of100 sequence samples is shown in Figure 4. The measuredscore closely tracks the expected one with about 2.9% errordue to process, mismatch and noise based variations, whilethe average and maximum sequencing throughput are ∼ 1010

/ ∼ 2.5 × 1010 cell-updates-per-second (CUPS), respectively.For the representative threshold value of 90 [8], the throughputis close to ∼ 1.5× 1010 CUPS.

In order to understand the power distribution of the array,

Fig. 4. Expected versus measured alignment scores for the case when multiplesequences are compared against a reference sequence.

we separated the power domains such that the array bias andswitching power could be measured independently of eachother. Using specific input we isolated the best case and worstcase array paths and compared energy numbers for variousscore matrices (Table I). For very large delay values, biascurrents are very small and hence idle power is in the low mWsrange. When the match condition is isolated for maximumspeed, 25 local biases are simultaneously activated causingan increase in bias power. Switching power is, however,minimally affected as only the main diagonal is switching. Forthe case of [1, inf, 1] score matrix, the bias power is almostdoubled as expected, but both best and worst case powerremains the same. Careful analysis reveals that though the bestcase alignment takes half the time as the worst case one, onlythe upper left triangle of the array switches, whereas in theworst case the whole array switches. Hence both cases havesimilar power. Looking at more representative score matrices(bottom two rows of the table) we see that at maximumthroughput the power of the array is about 70 mW.

TABLE I. MEASURED POWER NUMBERS FOR BEST AND WORST CASEALIGNMENTS FOR VARIOUS SCORE MATRICES.

The performance compares very favorably with those ofstate-of-the-art implementations [5], [11], [3] (Table II). (Thetable only shows results for comparable alignment algorithms,and, e.g., does not include recent work with higher reportedthroughput due to much simpler, dynamic-range-of-2 scorematrix.) Clearly, the throughput could be easily increasedfor DNA sequence alignment problem by processing multiplestrings in parallel in different arrays, and will roughly scalelinearly with chip area for Race Logic implementation. A crudeestimates show that the Race Logic implementation based onthe same process and comparable chip area could have atleast two orders of magnitude higher throughput at similarpower consumption as compared to GPU implementations [5].Moreover, threshold-based strategies, discussed in Refs. [8],[1], can be used to further improve Race Logic throughput andpower consumption. It should be noted, that though this work

focus on accelerating a particular task, we expect that RaceLogic would be useful for variety of other applications whichheavily rely on dynamic programming algorithms, such asprotein alignment, image seam carving, dynamic time warping,stereo correspondence [6].

TABLE II. COMPARISON OF THIS WORK WITH RECENT ASIC, GPUAND SIMD IMPLEMENTATIONS.

V. CONCLUSIONThis paper presents a functional prototype of a DNA

sequence alignment engine based on radically different com-puting paradigm - asynchronous Race Logic to perform highthroughout and low energy computation. A 4 mm2 prototypechip, fabricated in 180 nm CMOS process, allows for 15×109

CUPS throughput at 70 mW power when running practicalDNA sequence alignment tasks, which compares very favor-ably with state-of-the-art results.

ACKNOWLEDGMENTThe authors would like to thank Melika Payvand for her

help in chip layout and tapeout.

REFERENCES[1] R. Ekblom and J. B. Wolf. A field guide to whole-genome sequencing,

assembly and annotation. Evolutionary applications, 7(9):1026–1042,2014.

[2] H. Esmaeilzadeh, E. Blem, et al. Dark silicon and the end of multicorescaling. In Proc. ISCA’11, pages 365–376, 2011.

[3] M. Farrar. Striped smith–waterman speeds database searches six timesover other simd implementations. Bioinformatics, 23(2):156–161, 2007.

[4] Y. Huang, N. Guo, M. Seok, Y. Tsividis, and S. Sethumadhavan.Evaluation of an analog accelerator for linear algebra. In Proc. ISCA’16,pages 570–582, 2016.

[5] Y. Liu, A. Wirawan, and B. Schmidt. Cudasw++ 3.0: acceleratingsmith-waterman protein database search by coupling cpu and gpu simdinstructions. BMC bioinformatics, 14(1):1, 2013.

[6] A. Madhavan. Abusing hardware race conditions to perform usefulcomputation. PhD thesis, UC Santa Barbara, 2016.

[7] A. Madhavan, T. Sherwood, and D. Strukov. Race logic: A hardwareacceleration for dynamic programming algorithms. In Proc. ISCA’14,pages 517–528. IEEE, 2014.

[8] A. Madhavan, T. Sherwood, and D. Strukov. Energy efficient compu-tation with asynchronous races. In Proc. DAC’16, page 108, 2016.

[9] I. Magaki, M. Khazraee, L. V. Gutierrez, and M. B. Taylor. Asic clouds:Specializing the datacenter. In Proc. ISCA’16, pages 178–190, 2016.

[10] P. Mroszczyk and P. Dudek. Tunable CMOS delay gate with reducedimpact of fabrication mismatch on timing parameters. In Proc. NEW-CAS’13, pages 1–4, 2013.

[11] N. Neves, N. Sebastiao, D. Matos, P. Tomas, P. Flores, and N. Roma.Multicore simd asip for next-generation sequencing and alignmentbiochip platforms. IEEE Trans. VLSI, 23(7):1287–1300, 2015.

[12] R. St Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh,A. Hassibi, L. Ceze, and D. Burger. General-purpose code accelerationwith limited-precision analog computation. ACM SIGARCH ComputerArchitecture News, 42(3):505–516, 2014.

[13] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan.Approximate computing and the quest for computing efficiency. InProc. DAC’15, page 120, 2015.

Documents

A 4-mm 180-nm-CMOS 15-Giga- Cell-Updates-per …strukov/papers/2017/racelogicchip2017.pdfA 4-mm2 180-nm-CMOS 15-Giga- ... fabricated in standard 180-nm CMOS technology, ... Current