
Application-Driven Near-Data Processing for Similarity Search

Vincent T. Lee, Amrita Mazumdar, Carlo C. del Mundo, Armin Alaghi, Luis Ceze, Mark Oskin
University of Washington

{vlee2, amrita, cdel, armin, luisceze, oskin}@cs.washington.edu

ABSTRACT

Similarity search is key to a variety of applications including content-based search for images and video, recommendation systems, data deduplication, natural language processing, computer vision, databases, computational biology, and computer graphics. At its core, similarity search manifests as k-nearest neighbors (kNN), a computationally simple primitive consisting of highly parallel distance calculations and a global top-k sort. However, kNN is poorly supported by today's architectures because of its high memory bandwidth requirements.

This paper proposes an application-driven near-data processing accelerator for similarity search: the Similarity Search Associative Memory (SSAM). By instantiating compute units close to memory, SSAM benefits from the higher memory bandwidth and density exposed by emerging memory technologies. We evaluate the SSAM design down to layout on top of the Micron hybrid memory cube (HMC), and show that SSAM can achieve up to two orders of magnitude area-normalized throughput and energy efficiency improvement over multicore CPUs; we also show SSAM is faster and more energy efficient than competing GPUs and FPGAs. Finally, we show that SSAM is also useful for other data intensive tasks like kNN index construction, and can be generalized to semantically function as a high capacity content addressable memory.

1. INTRODUCTION

Similarity search is a key computational primitive found in a wide range of applications, such as computational biology [1], computer graphics [2], image and video retrieval [3, 4], image classification [5], content deduplication [6, 7], machine learning, databases [8], data mining [9], and computer vision [10]. While much attention has been directed towards accelerating feature extraction techniques like convolutional neural networks [11], there has been relatively little work focused on accelerating the task that follows: taking the resulting feature vectors and searching the vast corpus of data for similar content. In recent years, the importance and ubiquity of similarity search has increased dramatically with the explosive growth of visual content: users shared over 260 billion images on Facebook in 2010 [12], and uploaded over 300 hours of video on YouTube every minute in 2014 [13]. This volume of visual data is only expected to continue growing exponentially [14], and has motivated search-based graphics and vision techniques such as visual memex [15], 3D reconstruction [16], and cross-domain image matching [17].

Similarity search manifests as a simple algorithm: k-nearest neighbors (kNN). At a high level, kNN is an approximate associative computation which tries to find the most similar content with respect to the query content. At its core, kNN consists of many parallelizable distance calculations and a single global top-k sort, and is often supplemented with indexing techniques to reduce the volume of data that must be processed. While computationally very simple, kNN is notoriously memory intensive on modern CPUs and heterogeneous computing substrates, making it challenging to scale to large datasets. In kNN, distance calculations are cheap and abundantly parallelizable across the dataset, but moving data from memory to the computing device is a significant bottleneck. Moreover, this data is used only once per kNN query and discarded since the result of a kNN query is only a small set of identifiers. Batching requests to amortize this data movement has limited benefits as time-sensitive applications have stringent latency budgets. Indexing techniques such as kd-trees [18], hierarchical k-means clustering [19], and locality sensitive hashing [20] are often employed to reduce the search space but trade reduced search accuracy for enhanced throughput. Indexing techniques also suffer from the curse of dimensionality [21]; in the context of kNN, this means indexing structures effectively degrade to linear search for increasing accuracy targets.

Because of its significance, generality, parallelism, underlying simplicity, and small result set, kNN is an ideal candidate for near-data processing. The key insight is that a small accelerator can reduce the traditional bottlenecks of kNN by applying orders of magnitude data reduction near memory, substantially reducing the need for data movement. While there have been many attempts at processing-in-memory (PIM) in the past [22-27], much of the prior work suffered from DRAM technology limitations. Logic created in DRAM processes was too slow, while DRAM implemented in logic processes suffered from poor retention and high power demands; attempts at hybrid processes [28] result in the worst of both. PIM architectures are more appealing today with the advent of die-stacked memory technology, which enables the co-existence of an efficient DRAM layer and an efficient logic layer [29].

We propose the Similarity Search Associative Memory (SSAM), which integrates a programmable accelerator into a die-stacked memory module. Semantically, an SSAM takes a query as input and returns the top-k closest neighbors stored in memory as output. We evaluate the performance and energy efficiency gains of SSAM by implementing, synthesizing, and simulating the design down to layout. We then compare SSAM against current CPUs, GPUs, and FPGAs, and show that it can achieve better area-normalized throughput and energy efficiency.

Our paper makes the following contributions:

• A characterization of state-of-the-art k-nearest neighbors including both application-level and architectural opportunities that justify acceleration.

• An application-driven codesign of a near-memory vector processor-based accelerator architecture with hardware support for similarity search on top of Hybrid Memory Cube (HMC).

• Instruction extensions to leverage hardware units to accelerate similarity search.

The rest of the paper is organized as follows. Section 2 introduces and characterizes the kNN algorithm. Section 3 describes the SSAM architecture and the hardware/software interface. Section 4 outlines evaluation methodology, and Section 5 presents evaluation results. Section 6 discusses the impact of these results on different application areas and design points. Finally, Section 7 discusses related work.

2. CHARACTERIZATION OF KNN

We now introduce and characterize the kNN algorithm pipeline and indexing techniques, and highlight the application-level and architectural opportunities for acceleration.

2.1 Case study: content-based search

A typical kNN software application pipeline for content-based search (Figure 1) has five stages: feature extraction, feature indexing, query generation, k-nearest neighbors search, and reverse lookup. In feature extraction (Figure 1a), the raw multimedia corpus is converted into an intermediary feature vector representation. Feature vectors may represent pixel trajectories in a video, word embeddings of a document, or shapes in an image [6, 30, 31], and are extracted using feature descriptors or convolutional neural networks [31-36]. While feature extraction is an important component of this pipeline, it only needs to be performed once for the dataset and can be done offline; a significant body of work has also shown that feature extraction can be done efficiently [11, 37-40]. In indexing (Figure 1b), feature vectors from feature extraction are organized into data structures (discussed in Section 2.3). At query time, these data structures are used to quickly prune the search space; intuitively, these data structures should be able to reduce the search time from linear to logarithmic in the size of the data. Indexing, like feature extraction, can be performed offline and away from the critical path of the query.

While feature extraction and indexing can be performed offline, the query generation stage (Figure 1c) of the search pipeline occurs online. In query generation, a user uploads a multimedia file (image, video, etc.) and requests similar content back. The query runs through the same feature extractor used to create the database before being passed to the search phase. Once a query is generated, the k-nearest neighbors stage (Figure 1d) attempts to search for the most similar content in the database. The kNN algorithm consists of many highly parallelizable distance calculations and a global top-k sort; indexing structures may also be employed to prune the search space but trade accuracy for performance. The similarity metric employed by the distance calculation often depends on the application, but common distance metrics include Euclidean distance, Hamming distance [41-48], cosine similarity [49], and learned distance metrics [50]. The final step in the pipeline is reverse lookup, where the resulting k nearest neighbors are mapped to their original database content. The resulting media is then returned to the user.
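To make the core of this stage concrete, the following minimal C sketch performs exact linear kNN with squared Euclidean distance followed by a global top-k sort. The names (l2_sq, linear_knn, Neighbor) and the full qsort over all candidates are illustrative choices for clarity, not the paper's implementation.

#include <stdlib.h>
#include <string.h>

typedef struct { int id; float dist; } Neighbor;

/* Squared Euclidean distance between two dims-dimensional vectors;
 * squaring preserves the neighbor ordering so the sqrt can be skipped. */
static float l2_sq(const float *a, const float *b, size_t dims) {
    float acc = 0.0f;
    for (size_t d = 0; d < dims; d++) {
        float diff = a[d] - b[d];
        acc += diff * diff;
    }
    return acc;
}

static int by_dist(const void *x, const void *y) {
    float dx = ((const Neighbor *)x)->dist;
    float dy = ((const Neighbor *)y)->dist;
    return (dx > dy) - (dx < dy);
}

/* Exact linear kNN: one distance per dataset vector, then a global top-k sort. */
void linear_knn(const float *dataset, size_t n, size_t dims,
                const float *query, size_t k, Neighbor *out) {
    Neighbor *all = malloc(n * sizeof(Neighbor));
    for (size_t i = 0; i < n; i++) {
        all[i].id = (int)i;
        all[i].dist = l2_sq(&dataset[i * dims], query, dims);
    }
    qsort(all, n, sizeof(Neighbor), by_dist);
    memcpy(out, all, k * sizeof(Neighbor));   /* keep only the k closest ids */
    free(all);
}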

2.2 Typical workload parameters

Prior work shows the feature dimensionality for descriptors such as Speeded Up Robust Features (SURF) [32], word embeddings [51], Scale Invariant Feature Transform (SIFT) [31], GIST descriptors [33], AlexNet [52], and ResNet [36] ranges from 64 to 4096 dimensions. For higher dimensional feature vectors, it is common to apply techniques such as principal component analysis to reduce feature dimensionality to tractable lengths [53]. The number of nearest neighbors k for an array of search applications has been shown to range from 1 (nearest neighbor) up to 20 [6, 13, 31, 54, 55]. Each kNN algorithm variant also has a number of additional parameters such as indexing technique, distance function, bucket size, index-specific hyperparameters, and hardware-specific optimizations.

To simplify the characterization, we limit our initial evaluation to Euclidean distance and three real-world datasets: the Global Vectors for Word Representation (GloVe) dataset [51], the GIST dataset [56], and the AlexNet dataset [57]. The GloVe dataset consists of 1.2 million word embeddings extracted from Twitter tweets, and the GIST dataset consists of 1 million GIST feature vectors extracted from images. We also constructed an AlexNet dataset by taking 1 million images from the Flickr dataset [57] and applying AlexNet [34] to extract the feature vectors. For each dataset, we separate it into a "training" set used to build the search index, and a "test" set used as the queries when measuring application accuracy. The exact dataset parameters used for our characterization and evaluation are shown in Table 1.


Figure 1: Software application pipeline for similarity search. Feature extraction and indexing are done offline. (a) feature extraction, (b) feature indexing, (c) query generation, (d) index traversal, (e) k-nearest neighbor search (distance calculations and top-k sort), (f) reverse lookup and response generation.

Table 1: Evaluated kNN workload parameters

Workload   Dataset Vectors   Queries   Dimensions   Neighbors
GloVe      1183514           10000     100          6
GIST       1000000           1000      960          10
AlexNet    1000000           1000      4096         16

2.3 Approximate kNN algorithm tradeoffs

We now characterize three canonical indexing techniques employed by approximate kNN algorithms: kd-trees, hierarchical k-means, and multi-probe locality sensitive hashing (MPLSH). Indexing techniques employ hierarchical data structures which are traversed at query time to prune the search space. In kd-trees, the index is constructed by recursively cutting the dataset along dimensions chosen randomly from the top-N vector dimensions with highest variance [58]. The resulting index is a tree data structure where each leaf contains a bucket of similar vectors; the bucket size depends on how tall the tree is allowed to grow. Queries which traverse the index and end up in the same bucket should be similar; multiple trees with different cut orders are often searched in parallel. Multiple leaves in the tree can be visited to improve the quality of the search; to do this, the traversal employs backtracking to check additional "close by" buckets in a depth-first-search-like fashion. A user-specified bound typically limits the number of additional buckets visited when backtracking.
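The sketch below illustrates, under simplifying assumptions, this bounded backtracking for a single tree: the traversal descends to the query's bucket first, then depth-first backtracking visits "close by" buckets until a leaf budget is exhausted. The Node layout, scan callback, and max-leaves budget are illustrative; FLANN's actual traversal is priority-ordered across multiple randomized trees.

#include <stddef.h>

typedef struct Node {
    int          split_dim;      /* dimension this node cuts on; -1 marks a leaf */
    float        split_val;      /* cut threshold                                 */
    struct Node *left, *right;   /* children (NULL for leaves)                    */
    const float *bucket;         /* leaf only: bucket of similar vectors          */
    size_t       bucket_len;     /* number of vectors in the bucket               */
} Node;

/* Descend to the query's bucket, then backtrack into "close by" buckets
 * depth-first until the leaf budget runs out. */
static void kd_search(const Node *n, const float *query,
                      int *leaves_left,
                      void (*scan)(const float *bucket, size_t len)) {
    if (n == NULL || *leaves_left <= 0)
        return;
    if (n->split_dim < 0) {                      /* leaf: linear scan of bucket */
        scan(n->bucket, n->bucket_len);
        (*leaves_left)--;
        return;
    }
    const Node *near = (query[n->split_dim] < n->split_val) ? n->left  : n->right;
    const Node *far  = (query[n->split_dim] < n->split_val) ? n->right : n->left;
    kd_search(near, query, leaves_left, scan);   /* most promising child first     */
    kd_search(far,  query, leaves_left, scan);   /* backtracking, budget permitting */
}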

Similarly, in hierarchical k-means the dataset is partitioned recursively based on k-means cluster assignments to form a tree data structure [19]. Like kd-tree indices, the height of the tree is restricted, and each leaf in the tree holds a bucket of similar vectors which are searched when a query reaches that bucket; backtracking is also used to expand the search space and search "close by" buckets.

Finally, MPLSH constructs a set of hash tables where each hash location is associated with a bucket of similar vectors [59]. In MPLSH, hash functions are designed to intentionally cause hash collisions so that similar vectors map to the same bucket. To improve accuracy, MPLSH applies small perturbations to the hash result to create additional probes into the same hash table, searching "close by" hash partitions. In our evaluation, we use hyperplane MPLSH (HP-MPLSH), which cuts the space with random hyperplanes, and we set the number of hash bits (hyperplane cuts) to 20.
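As an illustration of the hashing step, the C sketch below computes a hyperplane LSH code, where each random hyperplane contributes one bit (the sign of its dot product with the vector), and generates extra probes by flipping single bits of the code. The planes layout and the one-bit perturbation schedule are simplifying assumptions; FALCONN's probing sequence is more sophisticated.

#include <stdint.h>
#include <stddef.h>

/* One hyperplane hash: bit b is the sign of the dot product with hyperplane b.
 * planes holds `bits` hyperplanes of length dims, row-major. */
uint32_t hp_hash(const float *v, const float *planes, size_t dims, int bits) {
    uint32_t h = 0;
    for (int b = 0; b < bits; b++) {
        float dot = 0.0f;
        for (size_t d = 0; d < dims; d++)
            dot += planes[(size_t)b * dims + d] * v[d];
        if (dot >= 0.0f)
            h |= 1u << b;
    }
    return h;
}

/* Multi-probe: the original bucket plus buckets whose code differs in one bit. */
void hp_probes(uint32_t h, int bits, int nprobes, uint32_t *out) {
    out[0] = h;
    for (int p = 1; p < nprobes && p - 1 < bits; p++)
        out[p] = h ^ (1u << (p - 1));
}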

Each of these approximate kNN algorithms trades accuracy for enhanced throughput. In kNN, accuracy is defined as |SE ∩ SA| / |SE|, where SE is the true set of neighbors returned by exact floating-point linear kNN search, and SA is the set of neighbors returned by approximate kNN. In general, searching more of the dataset improves search accuracy for indexing techniques. To quantify the accuracy of indexing structures, we benchmark the accuracy and throughput of indexing techniques for the GloVe, GIST, and AlexNet datasets. We use the Fast Library for Approximate Nearest Neighbors (FLANN) [19] to benchmark kd-trees and hierarchical k-means, and the Fast Lookups of Cosine and Other Nearest Neighbors (FALCONN) library [60] to benchmark HP-MPLSH. For kd-trees and hierarchical k-means we vary the number of leaf nodes or buckets in the tree that backtracking will check, while for HP-MPLSH we increase the number of probes used per hash table. Each of these modifications effectively increases the fraction of the dataset searched per query and lowers overall throughput.
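A small sketch of this accuracy metric, assuming both result sets are arrays of neighbor ids (the function name is illustrative):

#include <stddef.h>

/* Accuracy = |SE ∩ SA| / |SE|, with SE the exact neighbors and SA the
 * approximate neighbors returned by the index. */
double knn_accuracy(const int *exact, size_t k_exact,
                    const int *approx, size_t k_approx) {
    size_t hits = 0;
    for (size_t i = 0; i < k_exact; i++)
        for (size_t j = 0; j < k_approx; j++)
            if (exact[i] == approx[j]) { hits++; break; }
    return (double)hits / (double)k_exact;
}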

The resulting throughput versus accuracy curves are shown in Figure 2 for single-threaded implementations. In general, our results show indexing techniques can provide up to 170× throughput improvement over linear search while still maintaining at least 50% search accuracy, but only up to 13× to achieve 90% accuracy. Past 95-99% accuracy, we find that indexing techniques effectively degrade to linear search (blue solid line). More importantly, our results show there is a significant opportunity for also accelerating approximate kNN techniques. Hardware acceleration of approximate kNN search can either increase throughput at iso-accuracy by simply speeding up the computation, or increase search accuracy at iso-latency by searching larger volumes of data.

2.4 Alternative numerical representations and distance metrics

We now briefly discuss the space of numerical representations and distance metrics used in kNN search.

Fixed-Point Representations: Fixed-point arithmetic is much cheaper to implement in hardware than floating point units. To evaluate whether floating point is necessary for kNN, we converted each dataset to a 32-bit fixed-point representation and repeated the throughput versus accuracy experiments. Overall, we find there is negligible accuracy loss between 32-bit floating-point and 32-bit fixed-point data representations.

Hamming Space Representations: A large body of recent work has shown that Hamming codes can be an effective alternative to Euclidean space representations [41, 46-48, 61-63]. Binarization techniques trade accuracy for higher throughput: precision is lost by binarizing floating-point values, but throughput increases since the dataset becomes smaller; binarization also enables Hamming distance calculations, which are cheaper to implement in hardware. In practice, carefully constructed Hamming codes have been shown to achieve excellent results [64].

Alternative Distance Metrics: While the canonical distance metric for kNN is the Euclidean norm, there exists a wide variety of alternative distance metrics. Such alternatives include Manhattan distance, cosine similarity, Chi-squared distance, Jaccard similarity, and learned distance metrics [50].
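As a sketch of the fixed-point variant, the code below converts values to a 32-bit fixed-point format and computes a squared Euclidean distance in that format. The Q16.16 layout and 64-bit accumulator are assumptions made for illustration; the paper only specifies a 32-bit fixed-point representation.

#include <stdint.h>
#include <stddef.h>

/* Convert a float to Q16.16 fixed point (an assumed layout). */
static inline int32_t to_q16_16(float x) {
    return (int32_t)(x * 65536.0f);
}

/* Squared Euclidean distance over Q16.16 vectors.  Differences stay in 64 bits
 * and the product is rescaled so the accumulator does not overflow. */
int64_t fixed_l2_sq(const int32_t *a, const int32_t *b, size_t dims) {
    int64_t acc = 0;
    for (size_t d = 0; d < dims; d++) {
        int64_t diff = (int64_t)a[d] - (int64_t)b[d];
        acc += (diff * diff) >> 16;   /* product is Q32.32; shift back to a .16 fraction */
    }
    return acc;
}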

2.5 Architectural Characterization

To more concretely quantify the architectural behavior of kNN variants, we instrumented the baselines presented earlier using the Pin [65] instruction mix tool on an Intel i7-4790K CPU. Table 2 shows the instruction profiles for the linear, kd-tree, k-means, and MPLSH based algorithms. Recall that linear search performance is still valuable since higher accuracy targets reduce to linear search; in addition, approximate algorithms still use linear search to scan buckets of vectors at the end of their traversals. As expected, the instruction profile shows that vector operations and extensions are important for kNN workloads due to the many vector-parallel distance calculations. In addition, the high percentage of memory reads confirms that the computation has high data movement demands. Approximate kNN techniques like kd-trees and MPLSH exhibit less skew towards vectorized instructions but still exhibit similar memory-intensive behavior and show vectorization is valuable.

Table 2: Architectural behavior profiles of kNN algorithms for the GloVe dataset

Algorithm   AVX/SSE Inst. (%)   Mem. Reads (%)   Mem. Writes (%)
Linear      54.75               45.23            0.44
KD-Tree     28.75               31.60            10.21
K-Means     51.63               44.96            1.12
MPLSH       18.69               31.53            14.16

3. SSAM ARCHITECTURE

Based on the characterization results in Section 2, it is clear that similarity search algorithms (1) are an ideal match for vectorized processing units, and (2) can benefit from higher memory bandwidth to better support their data-intensive execution phases. We now present our application-guided SSAM module and accelerator architecture, which exploits near-data processing and specialized vector compute units to address these bottlenecks.

3.1 System integration and software interface

SSAM integrates into a typical system as a memory module, similar to existing DRAM, as shown in Figure 3. A host processor interfaces with an SSAM module much as it interacts with a DRAM memory module. The host processor is connected to each SSAM module over a communication bus; additional communication links are used if multiple SSAM-enabled modules are instantiated. Since HMC modules can be composed together, these additional links and SSAM modules allow us to scale up the capacity of the system. A typical system may also have multiple host processors (not shown) as the number of SSAM modules that the system must maintain increases.

To abstract the lower-level details of SSAMs away from the programmer, we assume a driver stack exposes a minimal memory allocation API which manages user interaction with SSAM-enabled memory regions. An SSAM-enabled memory region is defined as a special part of the memory space which is physically backed by an SSAM instead of a standard DRAM module. A sample of how one would program SSAM-enabled memory regions is shown in Figure 4. SSAM-enabled memory regions would be tracked and stored in a free list, similar to how standard memory allocation is implemented in modern systems. Allocated SSAM memory regions come with a set of special operations that allow the user to set the indexing mode, in addition to standard memory manipulation operations like memcpy. Similar to the CUDA programming model, we use analogous memory and execution operations to operate on SSAM-enabled memory. Pages with data subject to SSAM queries are pinned (not subject to swapping by the OS).

3.2 SSAM architecture and hybrid memory cube

The SSAM architecture is built on top of a Hybrid Memory Cube 2.0 (HMC) memory substrate [66] to capitalize on enhanced memory bandwidth.

Figure 2: Approximate kNN algorithms trade off accuracy for throughput (up and to the right is better). Each panel plots throughput (queries/s, log scale) versus accuracy (%) for the GloVe, GIST, and AlexNet datasets, comparing 100% accurate linear search, kd-trees, k-means, and hyperplane MPLSH.

Figure 3: SSAM system integration (modifications in grey). SSAM modules replace or coexist with standard DRAM modules. The host processor runs the application, SSAM memory API, and SSAM driver, and connects over data links to SSAM modules and standard DRAM modules.

// Example program using SSAM (Figure 4)
int *knn(int *query, int *dataset, size_t length, size_t dims, int k) {
    // allocate a buffer of SSAM-enabled memory and select the indexing mode
    int *nbuf = nmalloc(length * dims * sizeof(int));
    nmode(nbuf, LINEAR);
    // copy the dataset into SSAM memory and build the (linear) index
    nmemcpy(nbuf, dataset, length * dims * sizeof(int));
    nbuild_index(nbuf, /* params = */ NULL);
    // write the query and execute the kNN search near memory
    nwrite_query(nbuf, query);
    nexec(nbuf);
    // read back the result identifiers and release the region
    int *result = nread_result(nbuf);
    nfree(nbuf);
    return result;
}

Figure 4: Example program using SSAM-enabled memory regions. Lower-level hardware configuration details are abstracted away from the programmer.

The HMC is a die-stacked memory architecture composed of multiple DRAM layers and a compute layer. The DRAM layers are vertically partitioned into a number of vaults (Figure 5a). Each vault is accessed via a vault controller which resides on the top-level compute layer. In HMC 2.0, the module is partitioned into a maximum of 32 vaults (only 16 are shown), where each vault controller operates at 10 GB/s, yielding an aggregate internal memory bandwidth of 320 GB/s. The HMC architecture also provides four external data links (240 GB/s of external bandwidth) which send and receive information to the host processor or other HMC modules. These external data links allow one or more HMC modules to be composed, effectively forming a larger network of SSAMs if data exceeds the capacity of a single SSAM module.

Our SSAM architecture leverages the existing HMC substrate and introduces a number of SSAM accelerators to handle the kNN search. These SSAM accelerators are instantiated on the compute layer next to existing vault controllers, as shown in Figure 5b. SSAM accelerators are further decomposed into processing units (Figure 5d). To fully harness the available memory bandwidth, we replicate processing units based on the measured peak bandwidth needs of each processing unit across all indexing techniques. For kNN, we expect to achieve near-optimal memory bandwidth utilization since almost all accesses to memory are large contiguous blocks, such as bucket scans and data structures which are contiguously allocated in memory. Our modifications are orthogonal to the functionality of the HMC control logic, so the HMC module can still operate as a standard memory module (i.e., the acceleration logic can be bypassed). Our processing units do not implement a full cache hierarchy since there is little data reuse outside of the query vector and indexing data structure per query. Unlike GPU cores, processing units are not restricted to operating in lockstep, and multiple different indexing kernels can coexist on each SSAM module. Finally, we do not expect the external data links to become a bottleneck since the vast majority of data movement occurs within the SSAM modules themselves. As a result, we expect the communication between the host processors and SSAM units to consist only of kNN results, which are a fraction of the original dataset size, and configuration data.

3.3 Processing unit architecture

Each processing unit consists of a fully integrated scalar and vector processing unit similar to [67], augmented with several instructions and hardware units to better support kNN. Fully integrated vector processing units are naturally well-suited for accelerating kNN distance calculations because they (1) exploit the abundant data parallelism in kNN and (2) are well-suited for streaming computations. Using vector processing units also introduces flexibility in the types of distance calculations that can be executed. The scalar unit is better suited for executing index traversals, which are sequential in nature, and provides flexibility in the types of indexing techniques that can be employed.

Figure 5: SSAM accelerator architecture. (a) HMC die organization (only 16 vaults shown; HMC 2.0 has 32), (b) SSAM logic layer organization, (c) SSAM accelerator organization, (d) processing unit microarchitecture (instruction memory, decode, scalar ALU and registers, vector ALU and registers, scratchpad, stack unit, priority queue, and memory interface).

Vector units, on the other hand, are better suited for high-throughput, data-parallel distance calculations. We use a single instruction stream to drive both the scalar and vector processing units since at any given time a processing unit will only be performing either distance calculations or index traversals. For our evaluation, we perform a design sweep over several different vector lengths: 2, 4, 8, and 16. We find that 32 scalar registers and 8 vector registers are sufficient to support our kNN workloads. Finally, we use forwarding paths between pipeline stages to implement chaining of vector operations.

We also integrate several hardware units that are useful for accelerating similarity search. First, we introduce a priority queue unit, implemented using the shift-register architecture proposed in [68], which is used to perform the sort and global top-k calculations. For our SSAM design, priority queues are 16 entries deep. We opt for a hardware priority queue instead of a software one since the overhead of a priority queue insert becomes non-trivial for shorter vectors. Because of its modular design, priority queues can be chained to support higher k values; likewise, priority queues in the chain can be disabled if they are not needed. Second, we introduce a small hardware stack unit instantiated on the scalar datapath to aid kNN index traversals. The stack unit is a natural choice for facilitating backtracking when traversing hierarchical index structures. Finally, we integrate a 32 KB scratchpad to hold frequently accessed data structures, such as the query vector and indexing structures. We find that a modestly sized scratchpad memory is sufficient for kNN since the only heavily reused data are the query vectors and indices (data vectors are scanned and immediately discarded).
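A minimal C model of the insert behavior of such a 16-entry shift-register priority queue is sketched below. The hardware performs all entry comparisons and shifts in parallel in a single operation, whereas this model loops; the struct and field names are illustrative.

#include <stdint.h>

#define PQ_DEPTH 16

typedef struct { int32_t id; int32_t value; } PQEntry;
typedef struct { PQEntry e[PQ_DEPTH]; } PQueue;

/* Reset: every slot holds an "empty" entry with the worst possible value. */
void pqueue_reset(PQueue *q) {
    for (int i = 0; i < PQ_DEPTH; i++)
        q->e[i] = (PQEntry){ .id = -1, .value = INT32_MAX };
}

/* Insert: entries with larger values shift down by one and the new
 * (id, value) pair lands in sorted position; the worst entry falls off. */
void pqueue_insert(PQueue *q, int32_t id, int32_t value) {
    for (int i = 0; i < PQ_DEPTH; i++) {
        if (value < q->e[i].value) {
            for (int j = PQ_DEPTH - 1; j > i; j--)
                q->e[j] = q->e[j - 1];
            q->e[i] = (PQEntry){ .id = id, .value = value };
            return;
        }
    }
}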

Unlike conventional scalar-vector architectures, we introduce several new instructions that exercise the new hardware units for similarity search. First, we introduce priority queue insert (PQUEUE_INSERT), load (PQUEUE_LOAD), and reset (PQUEUE_RESET) instructions which are used to manipulate the hardware priority queue. The PQUEUE_INSERT instruction takes two registers and inserts them into the hardware priority queue as an (id, value) tuple. The PQUEUE_LOAD instruction reads either the id or the value of a tuple at a designated queue position, while PQUEUE_RESET clears the priority queue. We also introduce scalar and vector 32-bit fused xor-population-count instructions (SFXP and VFXP), which are analogous to a fused multiply-add. The FXP instruction is useful for cheaply implementing Hamming distance calculations and assumes that each 32-bit word holds 32 dimensions of a binary vector. The FXP instruction is also cheap to implement in hardware since the XOR adds only one additional layer of logic to the population count hardware. Finally, we introduce a data prefetch instruction, MEM_FETCH, since the linear scans through buckets of vectors exhibit predictable contiguous memory access patterns.
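For reference, the C sketch below models the effect of the FXP instruction across a binary vector: each 32-bit word contributes a fused xor-popcount to the running Hamming distance. __builtin_popcount is the GCC/Clang intrinsic standing in for the hardware population count unit.

#include <stdint.h>
#include <stddef.h>

/* Hamming distance between two binary vectors packed 32 dimensions per word.
 * Each loop iteration corresponds to one (S/V)FXP operation. */
uint32_t hamming_distance(const uint32_t *a, const uint32_t *b, size_t words) {
    uint32_t acc = 0;
    for (size_t i = 0; i < words; i++)
        acc += (uint32_t)__builtin_popcount(a[i] ^ b[i]);
    return acc;
}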

3.4 SSAM configuration

We assume that the host processor driver stack is able to communicate with each SSAM to initialize and bring up SSAM devices using a special address region dedicated to operating SSAMs. Execution binaries are written to the instruction memories on each processing unit and can be recompiled to support different distance metrics, indexing techniques, and kNN parameters. In addition, any indexing data structures are written to the scratchpad memory or larger DRAM prior to executing any queries on SSAMs. Large data structures, such as hash function weights in MPLSH or centroids in k-means, are stored in SSAM memory since they are larger and experience limited reuse. If hierarchical indexing structures such as kd-trees or hierarchical k-means do not fit in the scratchpad, they are partitioned such that the top half of the hierarchy resides in the scratchpad and the bottom halves are dynamically loaded from DRAM as needed during execution. A small portion of the scratchpad is also allocated for holding the query vector; this region is continuously rewritten as an SSAM services queries. If a kNN query must touch multiple vaults, the host processor broadcasts the search across SSAM processing units and performs the final global top-k reduction itself. Finally, if SSAM capabilities are not needed, the host processor can disable the SSAM accelerator logic so that it operates simply as a standard memory.


Table 3: Processing unit instruction set. (S/V) are scalar and vector instructions; (S) instructions are scalar only.

Type                           Instructions
Arithmetic (S/V)               ADD, SUB, MULT, POPCOUNT, ADDI, SUBI, MULTI
Bitwise/Shift (S/V)            OR, AND, NOT, XOR, ANDI, ORI, XORI, SR, SL, SRA
Control (S)                    BNE, BGT, BLT, BE, J
Stack Unit (S)                 POP, PUSH
Register Move/Memory (S/V)     SVMOVE, VSMOVE, MEM_FETCH, LOAD, STORE
New SSAM Instructions          (S) PQUEUE_INSERT, (S) PQUEUE_LOAD, (S) PQUEUE_RESET, (S/V) FXP

Table 4: Evaluated platforms

Platform      Type   Cores    Tech. (nm)   Freq. (MHz)   Area (mm^2)
Xeon E5-2620  CPU    6        32           2000          435
Titan X       GPU    3072     28           1075          601
Kintex-7      FPGA   N/A      28           91            164
SSAM          ASIC   80-320   65           250           131-329

4. EVALUATION METHODOLOGY

We now outline the evaluation methodology used to compare SSAMs with the competing CPUs, GPUs, and FPGAs shown in Table 4. To provide fair energy efficiency and performance comparisons, we normalize each platform to a 28 nm technology process.

SSAM ASIC: To evaluate SSAM, we implemented, synthesized, and placed-and-routed our design in Verilog with the Synopsys Design Compiler and IC Compiler using a TSMC 65 nm standard cell library; SRAM memories were generated using the ARM Memory Compiler. We also built an assembler and simulator to generate program binaries, benchmark assembly programs, and validate the correctness of our design. To measure throughput, we use post-placement-and-route frequency estimates and simulate the time it takes to process each of the workloads in Table 1. Each benchmark is handwritten using the instruction set defined in Table 3. For power and energy efficiency estimates, we generate traces from real datasets to measure realistic activity factors. We then use the PrimeTime power analyzer to estimate power and multiply by the simulated run time to obtain energy efficiency estimates. Finally, we report the area estimates provided in the post-placement-and-route reports, normalized to a 28 nm technology.

Xeon E5-2620 CPU: We evaluate a six-core Xeon E5-2620 as our CPU baseline. We benchmark wall-clock time using the implementations of kNN provided by the FLANN library [19] for linear, kd-tree, and k-means based search, and the FALCONN library for hyperplane MPLSH [60]. For power and energy efficiency measurements, we use an external power meter to measure dynamic compute power, computed as the difference between load and idle power when running each benchmark. Energy efficiency is then calculated as the product of run time and dynamic power. The estimate of the CPU die size is taken from [69].

Titan X GPU: For our GPU comparison, we use an NVIDIA Titan X with a well-optimized, off-the-shelf kNN implementation provided by Garcia et al. [70]. We again record wall-clock time, and measure idle and load power using a power meter to compute run time and energy efficiency. We estimate the die size of the Titan X from [71].

Kintex-7 FPGA: We measure the performance and energy efficiency of our implementation on a Xilinx Kintex-7 FPGA using Vivado 2014.5. We use post-placement-and-route frequency estimates and simulated run times to estimate the throughput of kNN on the FPGA fabric. For power measurements, we use the Vivado Power Analyzer tool, and we refer to [72] for device area estimates.

5. EVALUATION RESULTS

We now present throughput, power, and energy efficiency measurements of SSAMs relative to competing heterogeneous computing platforms. For brevity, we first evaluate Euclidean distance kNN and then separately evaluate different indexing techniques and distance metrics.

5.1 Accelerator power and area

Our post-placement-and-route power and area results are shown in Figures 6a and 6b, respectively, for different processing unit vector lengths and different submodules in the design. Area and power measurements are normalized to a 28 nm technology using linear scaling factors. In terms of area, a large portion of the accelerator design is devoted to the SRAMs composing the scratchpad memory. However, relative to the CPU or GPU, the SSAM acceleration logic is still significantly smaller: compared to the Xeon E5-2620, an SSAM is 6.23-15.62× smaller, and compared to the Titan X it is 9.84-24.66× smaller. For comparison, the die size of HMC 1.0 in [29] in a 90 nm process is 729 mm^2; normalized to a 28 nm process, the die size would be ≈70.6 mm^2, which is roughly the same as or larger than our SSAM accelerator design (die sizes for HMC 2.1 are not publicly available). In terms of power, an SSAM uses no more than a typical memory module, which makes it compatible with the power consumption of die-stacked memories. Prior work by Puttaswamy et al. [73] shows that the temperature increases from integrating logic on die-stacked memory are not fatal to the design even for a general purpose core. Since SSAM consumes less power than general purpose cores, we do not expect thermal issues to be fatal.

5.2 Throughput and energy efficiency

We now report area-normalized throughput and energy efficiency gains across each platform for exact linear search, which is agnostic to dataset composition and index traversal overheads; this quantifies the gains attributable to the different heterogeneous computing technologies. Figures 7a and 7b show the area-normalized throughput and energy efficiency of SSAM against competing heterogeneous solutions. The FPGA and SSAM designs are suffixed by the design vector length; for instance, SSAM-4 refers to an SSAM design whose processing units have vector length 4. We observe that SSAM achieves area-normalized throughput improvements of up to 426× and energy efficiency gains of up to 934× over the multi-threaded Xeon E5-2620 CPU results. We also observe that the GPU and the FPGA implementation of the SSAM acceleration logic exhibit comparable throughput and energy efficiency. The FPGA in some cases underperforms the GPU since it effectively implements a soft vector core instead of a fixed-function unit; we expect that a fixed-function FPGA core would fare better.

In terms of enhanced bandwidth, we attribute roughly one order of magnitude of the run time improvement to the higher internal bandwidth of HMC 2.0. Optimistically, standard DRAM modules provide up to 25 GB/s of memory bandwidth, whereas HMC 2.0 provides 320 GB/s. For similarity search, the difference in available bandwidth directly translates to raw performance. The remaining gains in energy efficiency and performance can be attributed mostly to accelerator specialization. To quantify the impact of the priority queue, we simulate the performance of SSAM using a software priority queue instead of the hardware queue; the hardware queue improves performance by up to 9.2% for wider vector processing units.

5.3 Approximate kNN search

We now evaluate the impact of approximate indexing structures and specialization on throughput and energy efficiency. Figure 8 compares the throughput versus accuracy curves of SSAM and the Xeon E5-2620 CPU for each dataset. In general, at a 50% accuracy target we observe up to two orders of magnitude throughput improvement for kd-trees, k-means, and HP-MPLSH over the CPU baselines. The kd-tree and k-means indexing structures are still dominated by distance calculations and benefit greatly from the augmented bandwidth when sequentially scanning through buckets for neighbors. HP-MPLSH, on the other hand, is composed of a combination of many hash function calculations and bucket traversals; we find that for the parameter sets used in our characterization, the performance of HP-MPLSH is dominated mostly by hashing rate.

Table 5: Relative throughput of alternative distance metrics

Distance Metric     GloVe   GIST    AlexNet
Euclidean           1×      1×      1×
Hamming             4.38×   7.98×   9.38×
Cosine similarity   0.46×   0.47×   0.47×
Manhattan           0.94×   0.99×   0.99×

However, the parameters of HP-MPLSH can be adjusted to reduce the dependence on hashing performance by reducing the number of hash bits; this would increase the number of vectors hashed to the same bucket and shift the performance bottleneck from hashing back to linear bucket scans.

5.4 Alternative distance metrics

We now briefly quantify the performance of three additional distance metrics on SSAM: Hamming distance, cosine similarity, and Manhattan distance (Table 5). Unsurprisingly, binarizing data vectors and using Hamming distance provides a good throughput improvement (up to 9.38×), since less data must be loaded per vector and Hamming distances are cheap to compute using the FXP instruction on SSAM. Manhattan distance and Euclidean distance have the same cost since they require roughly the same number of operations. Meanwhile, cosine similarity, defined here as (Σ_i a_i b_i)^2 / (Σ_i a_i^2 · Σ_i b_i^2), is about twice as expensive as Euclidean distance because of the additional numerator and divisor terms. Fixed-point division for cosine similarity is performed in software using shifts and subtracts; however, the software division is still much cheaper than the rest of the distance calculation.
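A sketch of the kind of shift-and-subtract (restoring) division that could implement this software divide is shown below for unsigned 32-bit operands; the exact routine used on SSAM is not specified in the paper, so this is only illustrative.

#include <stdint.h>

/* Restoring (shift-and-subtract) division: num / den for unsigned 32-bit values. */
uint32_t div_shift_subtract(uint32_t num, uint32_t den) {
    if (den == 0)
        return UINT32_MAX;                      /* caller is expected to avoid this */
    uint32_t quot = 0;
    uint64_t rem = 0;
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((num >> bit) & 1u); /* bring down the next dividend bit */
        if (rem >= den) {                       /* subtract when the divisor fits */
            rem -= den;
            quot |= 1u << bit;
        }
    }
    return quot;
}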

6. DISCUSSION

We now briefly evaluate the generality of SSAMs for other workloads and with respect to content addressable memories, and then compare SSAMs to alternative near-data processing technologies.

6.1 Index construction and other applications

SSAM is not limited to approximate kNN search and can also perform other data-intensive operations such as index construction. In kNN, the overhead of building indexing structures is amortized over the number of queries executed; however, index construction is still three orders of magnitude slower than a single query execution. SSAMs can be reprogrammed to perform these data-intensive tasks as well; index construction also benefits from near-data processing since techniques like k-means and kd-tree construction require multiple scans over the entire dataset. For instance, to build a hierarchical k-means index, we execute k-means by treating the cluster centroids as the dataset and streaming the data vectors in as kNN queries to determine the closest centroid. While a host processor must still handle the short serialized phases of k-means, SSAMs are able to accelerate the data-intensive scans in the k-means kernel by performing the computation near memory.


Figure 6: SSAM accelerator power and area by submodule (28 nm). (a) Accelerator logic power: 10.2 W (SSAM-2), 11.5 W (SSAM-4), 14.7 W (SSAM-8), 21.1 W (SSAM-16). (b) Accelerator logic area: 30.52 mm^2 (SSAM-2), 38.34 mm^2 (SSAM-4), 58.21 mm^2 (SSAM-8), 97.48 mm^2 (SSAM-16). Submodules: priority queue, stack unit, ALUs, scratchpad, register files, instruction memory, and pipeline/control.

Figure 7: Area-normalized throughput and energy efficiency for exact linear search using Euclidean distance across the E5-2620, Titan X, FPGA-2/4/8/16, and SSAM-2/4/8/16 platforms, for the GloVe, GIST, and AlexNet datasets. (a) Area-normalized throughput (queries / (s · mm^2)). (b) Energy efficiency (queries / J).

Figure 8: Area-normalized throughput (queries / (s · mm^2), log scale) versus accuracy for Euclidean distance kNN using indexing structures, for the GloVe, GIST, and AlexNet datasets (up and to the right is better). Curves compare CPU and SSAM implementations of kd-tree, k-means, and HP-MPLSH.

Similarly, for kd-tree index construction, SSAMs can be used to quickly scan the dataset and compute the variance across all dimensions; the host processor can then assign bifurcation points and generate the tree. In both cases, the host processor must provide some control and high-level orchestration, but the bulk of each index construction kernel can be offloaded to exploit the benefits of high memory bandwidth.
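As an illustration of this split, the C sketch below phrases one hierarchical k-means assignment pass as a stream of 1-NN queries against the centroids (the data-intensive scan that would be offloaded), leaving the short centroid update to the host. Names and the flat array layout are illustrative.

#include <stddef.h>
#include <string.h>
#include <float.h>

/* 1-NN query over the centroids: the kernel that would run near memory. */
static size_t nearest_centroid(const float *v, const float *centroids,
                               size_t k, size_t dims) {
    size_t best = 0;
    float best_d = FLT_MAX;
    for (size_t c = 0; c < k; c++) {
        float d = 0.0f;
        for (size_t j = 0; j < dims; j++) {
            float t = v[j] - centroids[c * dims + j];
            d += t * t;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}

/* One k-means pass: data-intensive assignment scan, then a short centroid update. */
void kmeans_pass(const float *data, size_t n, size_t dims,
                 float *centroids, size_t k, float *sums, size_t *counts) {
    memset(sums, 0, k * dims * sizeof(float));
    memset(counts, 0, k * sizeof(size_t));
    for (size_t i = 0; i < n; i++) {                 /* streamed over the dataset */
        size_t c = nearest_centroid(&data[i * dims], centroids, k, dims);
        for (size_t j = 0; j < dims; j++)
            sums[c * dims + j] += data[i * dims + j];
        counts[c]++;
    }
    for (size_t c = 0; c < k; c++)                   /* serialized host-side update */
        if (counts[c])
            for (size_t j = 0; j < dims; j++)
                centroids[c * dims + j] = sums[c * dims + j] / counts[c];
}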

SSAMs can also be used to accelerate other data-intensive applications that benefit from vectorization and enhanced memory bandwidth. Applications such as support vector machines, k-means, neural networks, and frequent itemset mining can all be implemented on SSAM. In particular, the vectorized FXP instruction is useful for classes of applications which rely on many Hamming distance calculations, such as binary neural networks [74] and binary hash functions.

6.2 SSAM as a high-density generalized content addressable memory

Semantically, SSAMs are a generalization of a content addressable memory (CAM) and a ternary content addressable memory (TCAM); more importantly, SSAMs are a semantically more powerful associative computing substrate. To use an SSAM as a CAM, we simply fix k = 1 and check whether a neighbor has a distance of zero to the query. To use an SSAM as a TCAM, we use a modified Hamming distance which accounts for ternary matches. To do this, a ternary bit mask T is supplied to the SSAM along with the query vector Q.


The ternary bit mask holds a 0 at positions in the query vector where a ternary ("don't care") match should occur and a 1 otherwise. The query vector and input data vector D are XORed and then ANDed with the ternary bit mask; this masks off the don't-care positions so that mismatches there are ignored. We then check whether the distance between D and Q is zero to determine whether there was a ternary match. The resulting modified Hamming distance can be expressed as POPCOUNT((D ⊕ Q) & T).
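In C-like terms, the ternary match check is simply the following, assuming vectors packed 32 bits per word and again using the __builtin_popcount intrinsic as a stand-in for the hardware population count:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Ternary match: bits where t is 0 are "don't care"; a match requires the
 * masked Hamming distance between d and q to be zero. */
bool tcam_match(const uint32_t *d, const uint32_t *q,
                const uint32_t *t, size_t words) {
    uint32_t dist = 0;
    for (size_t i = 0; i < words; i++)
        dist += (uint32_t)__builtin_popcount((d[i] ^ q[i]) & t[i]);
    return dist == 0;
}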

Unlike CAMs and TCAMs, SSAMs more generally support approximate and arbitrary-width matches. Approximate matches are more powerful since they are able to return similar content as opposed to only exact content matches. SSAMs can also return multiple matches, while CAMs and TCAMs are typically designed to return only one memory location.

Architecturally, SSAMs realize a different design point from traditional CAMs or TCAMs since the compute units used to perform the matching are separate from the underlying memory cell implementation. Binary CAMs and TCAMs are organized to search all of the stored data simultaneously, while SSAMs create the illusion of an associative memory but rely on high internal bandwidth to quickly scan data and process a query. SSAMs also benefit from the much higher bit density, capacity, and reduced operating power of DRAM while maintaining flexible associative computing capabilities. Table 6 compares the effective bit density of TCAM, DRAM, and SSAM for a 16 GB capacity memory. In terms of area, SSAMs effectively achieve the same bit density as DRAM because the additional acceleration logic is small relative to DRAM macro sizes. In contrast, TCAMs are more than 19× less dense than standard DRAM and the SSAM design presented in this paper. Finally, in terms of power, a 20 Mb TCAM macro consumes 10.6 W [75], which is already significantly more than the SSAM acceleration logic and most DRAM modules.

Table 6: TCAM, DRAM, and SSAM density comparison

                                    TCAM 28nm [75]   DRAM 65nm [76]   SSAM-4 + DRAM
Density (Mb/mm^2)                   0.61             2.17             N/A
Area @ 16GB (mm^2)                  214872           60402            60609
Area @ 28nm + 16GB (mm^2)           214872           11208            11247
Density @ 28nm + 16GB (Mb/mm^2)     0.61             11.69            11.65

6.3 Alternative near-data processing substrates

Near-data processing manifests in many different shapes and forms; in this section, we briefly contrast our approach against alternative near-data processing architectures.

Table 7: SSAM and AP throughput comparison for linear Hamming distance kNN

Dataset              GloVe   GIST   AlexNet
SSAM-4 (queries/s)   2059    481    134
AP (queries/s)       288     2.64   0.553

Micron Automata Processor (AP): The AP is a near-data processing architecture specialized for high-speed non-deterministic finite automata (NFA) evaluation [77]. Unlike SSAM, the AP cannot efficiently implement arithmetic operations and is limited to distance metrics like Hamming distance or Jaccard similarity. At a high level, the automata design is composed of multiple parallel NFAs, where each NFA encodes a distance calculation against a single dataset vector (the details of the automata design are beyond the scope of this paper). A query vector is streamed into the AP, compared against all NFAs in parallel, and sorted. To support execution of different NFAs, the AP can be reconfigured much like an FPGA. We briefly evaluate the AP by designing and compiling a design for each dataset, and use the results to build an analytical model to estimate the performance of a current generation AP device. Table 7 shows the AP's performance compared to SSAM. At a high level, we find that the large datasets presented in this paper do not fit in one AP board configuration, and as a result the AP is bottlenecked by high reconfiguration overheads compared to SSAM.

Compute in Resistive RAM (ReRAM): Computation in ReRAM is an emerging near-data processing technique that performs limited compute operations by directly exploiting the resistive properties of ReRAM memory cells [78]. This allows computational capabilities beyond what is available to DRAM-based near-data processing techniques such as SSAM. Most notably, Chi et al. [79] have shown how in-situ ReRAM computation can be used to accelerate convolutional neural networks without moving data out of the ReRAM cells themselves. As the technology matures, it would not be unprecedented to replace DRAM in favor of ReRAM and its augmented computing capabilities.

In-Storage Processing: There has also been renewed interest in instantiating computation near disk or SSD. Recent work such as Intelligent SSD [80, 81] and Biscuit [82] has proposed adding computation near mass storage devices and shown promising improvements for applications like databases. However, compared to SSAM, in-storage processing architectures target a different bandwidth-to-capacity design point: SSD-based near-data processing handles terabytes of data at lower bandwidth, which is less ideal for latency-critical applications like similarity search.

Die-Stacked HMC Architectures (This Paper): Instantiating an accelerator adjacent to HMC is not a new proposal [83-85]; prior work has shown that such an architectural abstraction is useful for accelerating graph processing [83] and neural networks [85]. This architecture has several advantages over in-situ ReRAM computation and the automata processor.

First, by abstracting the computation away from the memory substrate, the types of computation supported are decoupled from the restrictions of the underlying memory implementation. Second, by separating the computation from the actual memory cells, this architectural abstraction achieves much higher compute and memory density; this is unlike substrates such as the AP, where compute and memory are both instantiated in the same memory process.

7. RELATED WORK

The concept of near-data processing has been studied in the literature for decades. More interestingly, the concept of integrating similarity search accelerators with memory also has an established history, indicating ongoing interest.

CAMs: As far back as the 1980s and 1990s, near-memory accelerators were proposed to improve the performance of nearest neighbor search using CAMs [86]. Kohonen et al. [87] proposed using a combination of CAMs and hashing techniques to perform nearest neighbor search. Around the same time, Kanerva et al. [88] proposed sparse distributed memory (SDM) and a "Best Match Machine" to implement nearest neighbor search. The ideas behind SDM were later employed by Roberts in PCAM [86], which is, to the best of our knowledge, the first fabricated CAM-based accelerator capable of performing nearest neighbor search on its own.

Algorithms that exploit TCAMs to perform content-addressable search, such as ternary locality sensitive hashing (TLSH) [89] and binary-reflected Gray code [90], do exist. However, TCAMs suffer from lower memory density, higher power consumption, and smaller capacity than emerging memory technologies. While prior work [91] shows promising increases in performance, energy efficiency, and capacity, TCAM cells are less dense than DRAM cells. For the massive-scale datasets in kNN workloads, this density disparity translates to an order of magnitude in cost. Despite these limitations, there is still active work on TCAMs for data-intensive applications to accelerate associative computations [92].

Multiprocessor and Vector PIMs: In the late 1990s, Patterson et al. [93] proposed IRAM, which introduced processing units integrated with DRAM. In particular, Gebis et al. [94] and Kozyrakis et al. [95] proposed VIRAM, which used a vector processor architecture with embedded DRAM similar to our work. Like our work, the intention of VIRAM was to capitalize on higher bandwidth and reduce energy consumption by co-locating general-purpose MIPS cores and vector registers and compute units near DRAM. Unlike VIRAM, SSAM does not implement a full cache hierarchy, targets a different class of algorithms, and uses a 3D die-stacked solution.

Kogge et al. [96] propose the EXECUBE architecture, which integrates general-purpose cores with DRAM macros. Elliott et al. [97] propose C-RAM, which adds SIMD processing units capable of bit-serial operations adjacent to the sense amplifiers. Active Pages [27] and FlexRAM [24] envisioned a programmable processing element near each DRAM macro block which could be programmed for kNN acceleration. However, none of these prior efforts directly addresses the kNN search problems we discuss.

More recently, Active Memory Cube (AMC) [84] proposes a similar vector processing unit and cache-less system on top of HMC. While both SSAM and AMC arrive at the same architectural conclusion, namely that vector PIM on die-stacked DRAM is useful, our work provides a more application-centric approach which allows us to codesign architectural features such as the priority queue (a software analogue is sketched at the end of this section).

Application-Driven PIM: Application-justified PIM design is not a new idea. Deering et al. [98] propose FBRAM, a "new form" of memory optimized to improve random-access writes to accelerate z-buffering for 3D graphics algorithms. Lipman and Yang [99] propose a DRAM-based architecture called smart access memory (SAM) for nearest-neighbor search targeting database applications. Their design tightly integrates a k-nearest neighbor accelerator engine and microarchitecturally shares common elements with our design. Agrawal et al. [100] exploit accelerators to reduce the total cost of ownership of high-dimensional similarity search. Yu et al. [101] optimize all-pairs nearest neighbors by fusing neighbor selection with distance computations. Finally, Tandon et al. [102] propose an all-pairs similarity accelerator for NLP; however, their work integrates the accelerator with the last-level cache instead of memory.

The emergence and maturity of die-stacked memory and alternative memory substrates have also enabled a wide variety of PIM accelerator proposals [83–85, 103–108]. Chi et al. [103], Kim et al. [109], and Gao et al. [110] all propose PIM solutions for accelerating neural networks. Ahn et al. [83] propose PIM on top of HMC for graph processing, and Hsieh et al. [104] and Zhang et al. [111] propose PIM-based GPU architectures. Imani et al. [112] propose MPIM for linear kNN search; however, their architecture uses a resistive RAM-based approach and is limited to bitwise operations. Furthermore, MPIM does not consider modern approximate kNN indexing algorithms, nor does it evaluate the accuracy-versus-throughput tradeoffs that these algorithms make.
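The priority-queue codesign mentioned above refers, at the algorithmic level, to a fixed-capacity structure that keeps only the k best candidates seen so far during a scan. The sketch below is a software analogue of such a queue, written as a sorted shift-in structure similar in spirit to hardware shift-register priority queues; the class name, field names, and widths are illustrative and are not taken from the SSAM microarchitecture.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

// Software analogue of a fixed-capacity priority queue for top-k selection:
// the K best (smallest-distance) candidates are kept in ascending order,
// and each accepted insertion shifts worse entries down by one slot,
// much as a systolic shift-register implementation would in hardware.
template <size_t K>
struct TopK {
    struct Entry { uint32_t id; uint32_t dist; };
    std::array<Entry, K> slots;

    TopK() {
        for (auto& e : slots)
            e = {0, std::numeric_limits<uint32_t>::max()};
    }

    void insert(uint32_t id, uint32_t dist) {
        if (dist >= slots[K - 1].dist) return;   // not better than current k-th best
        size_t i = K - 1;
        while (i > 0 && slots[i - 1].dist > dist) {
            slots[i] = slots[i - 1];             // shift worse entries down
            --i;
        }
        slots[i] = {id, dist};
    }
};
```

Streaming (id, distance) pairs through insert() during a scan leaves the k nearest candidates in slots; a hardware queue of this shape is the kind of structure the codesign discussion above refers to.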

8. CONCLUSIONS

We presented SSAM, an application-driven near-data processing architecture for similarity search. We showed that by moving computation closer to memory, SSAM is able to address the data movement challenges of similarity search and exploit application codesign opportunities to accelerate it. While we used HMC as our memory backend, the high-level accelerator design and insights generalize to alternative memory technologies and in-memory processing architectures. The PIM proposal presented in this paper is also relevant to other data-intensive workloads where data movement is becoming an increasingly fundamental challenge in improving system efficiency.

9. REFERENCES

[1] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, “Gpu-fs-knn: A software tool for fast and scalable knn computation using gpus,” vol. 7, p. e44000, Public Library of Science, 2012.

[2] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen,and P. Hanrahan, “Photon mapping on programmablegraphics hardware,” in Proceedings of the ACMSIGGRAPH/EUROGRAPHICS Conference on GraphicsHardware, pp. 41–50, Eurographics Association, 2003.

[3] J. Sivic and A. Zisserman, “Video google: A text retrievalapproach to object matching in videos,” in Proceedings ofthe Ninth IEEE International Conference on ComputerVision - Volume 2, ICCV ’03, (Washington, DC, USA),pp. 1470–, IEEE Computer Society, 2003.

[4] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez,and C. Schmid, “Aggregating local image descriptors intocompact codes,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 34, pp. 1704–1716, Sept. 2012.

[5] O. Boiman, E. Shechtman, and M. Irani, “In defense ofnearest-neighbor based image classification,” in ComputerVision and Pattern Recognition, 2008. CVPR 2008. IEEEConference on, pp. 1–8, IEEE, 2008.

[6] M. Kusner, Y. Sun, N. Kolkin, and K. Q. Weinberger, “From Word Embeddings To Document Distances,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (D. Blei and F. Bach, eds.), pp. 957–966, JMLR Workshop and Conference Proceedings, 2015.

[7] L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch,and S. T. Klein, “The design of a similarity baseddeduplication system,” in Proceedings of SYSTOR 2009:The Israeli Experimental Systems Conference, SYSTOR’09, (New York, NY, USA), pp. 6:1–6:14, ACM, 2009.

[8] R. Agrawal, C. Faloutsos, and A. N. Swami, “Efficientsimilarity search in sequence databases,” in Proceedings ofthe 4th International Conference on Foundations of DataOrganization and Algorithms, FODO ’93, (London, UK,UK), pp. 69–84, Springer-Verlag, 1993.

[9] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang,H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu,Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg,“Top 10 algorithms in data mining,” Knowl. Inf. Syst.,vol. 14, pp. 1–37, Dec. 2007.

[10] A. Torralba, R. Fergus, and W. T. Freeman, “80 milliontiny images: A large data set for nonparametric objectand scene recognition,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 30, pp. 1958–1970, Nov. 2008.

[11] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li,T. Chen, Z. Xu, N. Sun, and O. Temam, “Dadiannao: Amachine-learning supercomputer,” in Microarchitecture(MICRO), 2014 47th Annual IEEE/ACM InternationalSymposium on, pp. 609–622, Dec 2014.

[12] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel,“Finding a needle in haystack: Facebook’s photo storage,”in Proceedings of the 9th USENIX Conference onOperating Systems Design and Implementation, OSDI’10,(Berkeley, CA, USA), pp. 47–60, USENIX Association,2010.

[13] YouTube, “Statistics - YouTube,” 2014.

[14] “Rebooting the it revolution: A call to action,” 2015.

[15] T. Malisiewicz and A. Efros, “Beyond categories: Thevisual memex model for reasoning about objectrelationships,” in Advances in neural informationprocessing systems, pp. 1222–1230, 2009.

[16] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon,B. Curless, S. M. Seitz, and R. Szeliski, “Building rome ina day,” Commun. ACM, vol. 54, pp. 105–112, Oct. 2011.

[17] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A.Efros, “Data-driven visual similarity for cross-domainimage matching,” ACM Transaction of Graphics (TOG)(Proceedings of ACM SIGGRAPH ASIA), vol. 30, no. 6,2011.

[18] C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fastimage descriptor matching,” in Computer Vision andPattern Recognition, 2008. CVPR 2008. IEEEConference on, pp. 1–8, IEEE, 2008.

[19] M. Muja and D. G. Lowe, “Fast approximate nearestneighbors with automatic algorithm configuration,” inInternational Conference on Computer Vision Theoryand Application VISSAPP’09), pp. 331–340, INSTICCPress, 2009.

[20] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni,“Locality-sensitive hashing scheme based on p-stabledistributions,” in Proceedings of the Twentieth AnnualSymposium on Computational Geometry, SCG ’04, (NewYork, NY, USA), pp. 253–262, ACM, 2004.

[21] P. Indyk and R. Motwani, “Approximate nearestneighbors: Towards removing the curse of dimensionality,”in Proceedings of the Thirtieth Annual ACM Symposiumon Theory of Computing, STOC ’98, (New York, NY,USA), pp. 604–613, ACM, 1998.

[22] W. A. Wulf and S. A. McKee, “Hitting the memory wall:Implications of the obvious,” SIGARCH Comput. Archit.News, vol. 23, pp. 20–24, Mar. 1995.

[23] D. Patterson, T. Anderson, N. Cardwell, R. Fromm,K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “Acase for intelligent ram,” IEEE Micro, vol. 17, pp. 34–44,Mar. 1997.

[24] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam,P. Pattnaik, and J. Torrellas, “Flexram: toward anadvanced intelligent memory system,” in ComputerDesign, 1999. (ICCD ’99) International Conference on,pp. 192–201, 1999.

[25] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett,J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang,I. Kim, and G. Daglikoca, “The architecture of the divaprocessing-in-memory chip,” in Proceedings of the 16thInternational Conference on Supercomputing, ICS ’02,(New York, NY, USA), pp. 14–25, ACM, 2002.

[26] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang,E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote,M. Parker, L. Schaelicke, and T. Tateyama, “Impulse:building a smarter memory controller,” inHigh-Performance Computer Architecture, 1999.Proceedings. Fifth International Symposium On,pp. 70–79, Jan 1999.

[27] M. Oskin, F. T. Chong, and T. Sherwood, “Active pages:A computation model for intelligent memory,” SIGARCHComput. Archit. News, vol. 26, pp. 192–203, Apr. 1998.

[28] IBM, “Blue Logic SA-27E ASIC,” 1999. http://www.ic72.com/pdffile/s/381279.pdf.

[29] J. T. Pawlowski, “Hybrid memory cube (hmc),” HotChips, vol. 23, 2011.

[30] Heng Wang and Klaser, A. and Schmid, C. and Cheng-LinLiu, “Action recognition by dense trajectories,” inComputer Vision and Pattern Recognition (CVPR), 2011IEEE Conference on, pp. 3169–3176, June 2011.

[31] D. G. Lowe, “Distinctive Image Features fromScale-Invariant Keypoints,” Int. J. Comput. Vision,vol. 60, pp. 91–110, Nov. 2004.

[32] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008. Similarity Matching in Computer Vision and Multimedia.

[33] M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, andC. Schmid, “Evaluation of GIST Descriptors for Web-scaleImage Search,” in Proceedings of the ACM InternationalConference on Image and Video Retrieval, CIVR ’09,(New York, NY, USA), pp. 19:1–19:8, ACM, 2009.


[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNetClassification with Deep Convolutional Neural Networks,”in Advances in Neural Information Processing Systems 25(F. Pereira, C. Burges, L. Bottou, and K. Weinberger,eds.), pp. 1097–1105, Curran Associates, Inc., 2012.

[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,D. Anguelov, D. Erhan, V. Vanhoucke, andA. Rabinovich, “Going deeper with convolutions,” inComputer Vision and Pattern Recognition (CVPR), 2015.

[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residuallearning for image recognition,” CoRR,vol. abs/1512.03385, 2015.

[37] J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, A. Khurana, R. Dreslinski, T. Mudge, V. Petrucci, L. Tang, and J. Mars, “Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers,” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ASPLOS ’15, (New York, NY, USA), ACM, 2015.

[38] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo,X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shiftingvision processing closer to the sensor,” in Proceedings ofthe 42Nd Annual International Symposium on ComputerArchitecture, ISCA ’15, (New York, NY, USA),pp. 92–104, ACM, 2015.

[39] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, R. Dreslinski, T. Mudge, J. Mars, and L. Tang, “Djinn and tonic: Dnn as a service and its implications for future warehouse scale computers,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), ISCA ’15, (New York, NY, USA), ACM, 2015.

[40] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, andO. Temam, “Diannao: A small-footprint high-throughputaccelerator for ubiquitous machine-learning,” SIGARCHComput. Archit. News, vol. 42, pp. 269–284, Feb. 2014.

[41] A. Gionis, P. Indyk, and R. Motwani, “Similarity Searchin High Dimensions via Hashing,” in Proceedings of the25th International Conference on Very Large Data Bases,VLDB ’99, (San Francisco, CA, USA), pp. 518–529,Morgan Kaufmann Publishers Inc., 1999.

[42] M. Raginsky and S. Lazebnik, “Locality-sensitive binarycodes from shift-invariant kernels,” in Advances in NeuralInformation Processing Systems 22 (Y. Bengio,D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta,eds.), pp. 1509–1517, Curran Associates, Inc., 2009.

[43] M. M. B. C. Strecha, A. M. Bronstein and P. Fua,“LDAHash: Improved Matching with SmallerDescriptors,” IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 34, no. 1, 2012.

[44] A. Torralba, R. Fergus, and Y. Weiss, “Small codes andlarge image databases for recognition,” in In Proceedingsof the IEEE Conf on Computer Vision and PatternRecognition, 2008.

[45] J. Wang, S. Kumar, and S.-F. Chang, “Semi-SupervisedHashing for Large-Scale Search,” IEEE Trans. PatternAnal. Mach. Intell., vol. 34, pp. 2393–2406, Dec. 2012.

[46] Y. Weiss, A. Torralba, and R. Fergus, “Spectral Hashing,”in Advances in Neural Information Processing Systems 21(D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou,eds.), pp. 1753–1760, Curran Associates, Inc., 2009.

[47] K. He, F. Wen, and J. Sun, “K-means hashing: Anaffinity-preserving quantization method for learningbinary compact codes,” in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern Recognition,CVPR ’13, (Washington, DC, USA), pp. 2938–2945, IEEEComputer Society, 2013.

[48] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hammingdistance metric learning,” in Proceedings of the 25thInternational Conference on Neural InformationProcessing Systems, NIPS’12, (USA), pp. 1061–1069,Curran Associates Inc., 2012.

[49] J. Weston, S. Chopra, and K. Adams, “#tagspace:Semantic embeddings from hashtags,” in Proceedings ofthe 2014 Conference on Empirical Methods in NaturalLanguage Processing, EMNLP 2014, October 25-29, 2014,Doha, Qatar, A meeting of SIGDAT, a Special InterestGroup of the ACL, pp. 1822–1827, 2014.

[50] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell,“Distance metric learning, with application to clusteringwith side-information,” in Proceedings of the 15thInternational Conference on Neural InformationProcessing Systems, NIPS’02, (Cambridge, MA, USA),pp. 521–528, MIT Press, 2002.

[51] J. Pennington, R. Socher, and C. D. Manning, “Glove:Global vectors for word representation,” in Proceedings ofthe 2014 Conference on Empirical Methods in NaturalLanguage Processing, EMNLP 2014, October 25-29, 2014,Doha, Qatar, A meeting of SIGDAT, a Special InterestGroup of the ACL, pp. 1532–1543, 2014.

[52] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeffand Karayev, Sergey and Long, Jonathan and Girshick,Ross and Guadarrama, Sergio and Darrell, Trevor, “Caffe:Convolutional Architecture for Fast Feature Embedding,”arXiv preprint arXiv:1408.5093, 2014.

[53] I. Jolliffe, “Principal component analysis,” 2014.

[54] A. Yershova and S. LaValle, “Improving Motion-PlanningAlgorithms by Efficient Nearest-Neighbor Searching,”Robotics, IEEE Transactions on, vol. 23, pp. 151–157, Feb2007.

[55] D. Kelly and L. Azzopardi, “How many results per page?:A study of serp size, search behavior and user experience,”in Proceedings of the 38th International ACM SIGIRConference on Research and Development in InformationRetrieval, SIGIR ’15, (New York, NY, USA), pp. 183–192,ACM, 2015.

[56] H. Jegou, M. Douze, and C. Schmid, “Productquantization for nearest neighbor search,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 33, pp. 117–128, Jan.2011.

[57] Yahoo!, “Webscope | Yahoo Labs,” 2014.

[58] C. Silpa-Anan and R. I. Hartley, “Optimised kd-trees forfast image descriptor matching.,” in CVPR, IEEEComputer Society, 2008.

[59] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li,“Multi-probe lsh: Efficient indexing for high-dimensionalsimilarity search,” in Proceedings of the 33rd InternationalConference on Very Large Data Bases, VLDB ’07,pp. 950–961, VLDB Endowment, 2007.

[60] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, andL. Schmidt, “Practical and optimal lsh for angulardistance,” in Advances in Neural Information ProcessingSystems 28 (C. Cortes, N. D. Lawrence, D. D. Lee,M. Sugiyama, and R. Garnett, eds.), pp. 1225–1233,Curran Associates, Inc., 2015.

[61] Y. Gong and S. Lazebnik, “Iterative Quantization: AProcrustean Approach to Learning Binary Codes,” inCVPR ’11.

[62] J. Wang, S. Kumar, and S.-F. Chang, “Semi-SupervisedHashing for Large-Scale Search,” in IEEE Trans. PatternAnal. Mach. Intell. [45], pp. 2393–2406.

[63] A. Torralba, R. Fergus, and Y. Weiss, “Small codes andlarge image databases for recognition,” in In Proceedingsof the IEEE Conf on Computer Vision and PatternRecognition [44].

[64] K. Lin, J. Lu, C.-S. Chen, and J. Zhou, “Learningcompact binary descriptors with unsupervised deep neuralnetworks,” in CVPR, 2016.

[65] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser,G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood,“Pin: Building customized program analysis tools withdynamic instrumentation,” in Proceedings of the 2005ACM SIGPLAN Conference on Programming LanguageDesign and Implementation, PLDI ’05, (New York, NY,USA), pp. 190–200, ACM, 2005.


[66] Hybrid Memory Cube Consortium, “Hybrid Memory Cube Specification 2.1,” 2014.

[67] R. M. Russell, “The cray-1 computer system,” Commun.ACM, vol. 21, pp. 63–72, Jan. 1978.

[68] S.-W. Moon, K. G. Shin, and J. Rexford, “Scalablehardware priority queue architectures for high-speedpacket switches,” IEEE Trans. Comput., vol. 49,pp. 1215–1227, Nov. 2000.

[69] A. L. Shimpi and R. Smith, “The Intel Ivy Bridge (Core i7 3770K) review.” Accessed: 2016-11-10.

[70] V. Garcia, E. Debreuve, and M. Barlaud, “Fast k nearestneighbor search using GPU,” in Computer Vision andPattern Recognition Workshops, 2008. CVPRW ’08.IEEE Computer Society Conference on, pp. 1–6, June2008.

[71] “Nvidia geforce gtx titan x.” Accessed: 2016-11-10.

[72] U. Technologies, “Logic detailed structural analysis of the Xilinx Kintex-7 28nm FPGA (ES),” 2011. Accessed: 2016-11-10.

[73] K. Puttaswamy and G. H. Loh, “Thermal analysis of a 3ddie-stacked high-performance microprocessor,” inProceedings of the 16th ACM Great Lakes Symposium onVLSI, GLSVLSI ’06, (New York, NY, USA), pp. 19–24,ACM, 2006.

[74] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi,“Xnor-net: Imagenet classification using binaryconvolutional neural networks,” CoRR,vol. abs/1603.05279, 2016.

[75] K. Nii, N. Watanabe, M. Yamawaki, K. Yoshinaga, M. Wada, and I. Hayashi, “A 28nm 400MHz 4-parallel 1.6Gsearch/s 80Mb ternary CAM,” IEEE, 2014.

[76] S. Romanovsky, A. Katoch, A. Achyuthan, C. O’Connell,S. Natarajan, C. Huang, C.-Y. Wu, M.-J. Wang, C. J.Wang, P. Chen, and R. Hsieh, “A 500mhz random-accessembedded 1mb dram macro in bulk cmos.,” in ISSCC,pp. 270–271, IEEE, 2008.

[77] P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, “An efficient and scalable semiconductor architecture for parallel automata processing,” Parallel and Distributed Systems, IEEE Transactions on, vol. 25, no. 12, pp. 3088–3098, 2014.

[78] X. Guo, E. Ipek, and T. Soyata, “Resistive computation: Avoiding the power wall with low-leakage, stt-mram based computing,” SIGARCH Comput. Archit. News, vol. 38, pp. 371–382, June 2010.

[79] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture [103], pp. 27–39.

[80] D.-H. Bae, J.-H. Kim, S.-W. Kim, H. Oh, and C. Park, “Intelligent ssd: a turbo for big data mining,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, (New York, NY, USA), pp. 1573–1576, ACM, 2013.

[81] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, “Active disk meets flash: A case for intelligent ssds,” in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, (New York, NY, USA), pp. 91–102, ACM, 2013.

[82] B. Gu, A. S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang, “Biscuit: A framework for near-data processing of big data workloads,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 153–165, IEEE Press, 2016.

[83] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 105–117, ACM, 2015.

[84] R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C. Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. H. Moreno, J. K. O’Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. D. M. Siegl, K. Sugavanam, and Z. Sura, “Active memory cube: A processing-in-memory architecture for exascale systems,” IBM Journal of Research and Development, vol. 59, pp. 17:1–17:14, March 2015.

[85] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory,” in Proceedings of the 43rd International Symposium on Computer Architecture [109], pp. 380–392.

[86] J. D. Roberts, “Proximity Content-Addressable Memory: An Efficient Extension to k-Nearest Neighbors Search (M.S. thesis),” tech. rep., Santa Cruz, CA, USA, 1990.

[87] T. Kohonen, Self-organization and Associative Memory: 3rd Edition. New York, NY, USA: Springer-Verlag New York, Inc., 1989.

[88] P. Kanerva, Sparse Distributed Memory. Cambridge, MA, USA: MIT Press, 1988.

[89] R. Shinde, A. Goel, P. Gupta, and D. Dutta, “Similarity search and locality sensitive hashing using ternary content addressable memories,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, (New York, NY, USA), pp. 375–386, ACM, 2010.

[90] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Hel-Or, “Ultra-fast similarity search using ternary content addressable memory,” in Proceedings of the 11th International Workshop on Data Management on New Hardware, DaMoN ’15, (New York, NY, USA), pp. 12:1–12:10, ACM, 2015.

[91] Q. Guo, X. Guo, Y. Bai, and E. Ipek, “A resistive tcam accelerator for data-intensive computing,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 339–350, ACM, 2011.

[92] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “Ac-dimm: Associative computing with stt-mram,” SIGARCH Comput. Archit. News, vol. 41, pp. 189–200, June 2013.

[93] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent ram,” IEEE Micro, vol. 17, pp. 34–44, Mar. 1997.

[94] J. Gebis, S. Williams, D. Patterson, and C. Kozyrakis, “Viram1: a media-oriented vector processor with embedded dram,” 2004.

[95] C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, and D. Patterson, “Vector iram: A media-oriented vector processor with embedded dram,” 2000.

[96] P. M. Kogge, T. Sunaga, H. Miyataka, K. Kitamura, and E. Retter, “Combined DRAM and logic chip for massively parallel systems,” in 16th Conference on Advanced Research in VLSI (ARVLSI ’95), March 27-29, 1995, Chapel Hill, North Carolina, USA, pp. 4–16, 1995.

[97] D. Elliott, M. Stumm, W. M. Snelgrove, C. Cojocaru, and R. McKenzie, “Computational ram: Implementing processors in memory,” IEEE Des. Test, vol. 16, pp. 32–41, Jan. 1999.

[98] M. F. Deering, S. A. Schlapp, and M. G. Lavelle, “Fbram: A new form of memory optimized for 3d graphics,” in Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’94, (New York, NY, USA), pp. 167–174, ACM, 1994.


[99] A. Lipman and W. Yang, “The Smart Access Memory: An Intelligent RAM for Nearest Neighbor Database Searching,” in ISCA Workshop on Mixing Logic and DRAM, 1997.

[100] S. R. Agrawal, C. M. Dee, and A. R. Lebeck, “Exploiting Accelerators for Efficient High Dimensional Similarity Search,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’16, (New York, NY, USA), pp. 3:1–3:12, ACM, 2016.

[101] C. D. Yu, J. Huang, W. Austin, B. Xiao, and G. Biros, “Performance Optimization for the K-nearest Neighbors Kernel on x86 Architectures,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, (New York, NY, USA), pp. 7:1–7:12, ACM, 2015.

[102] P. Tandon, J. Chang, R. G. Dreslinski, V. Qazvinian, P. Ranganathan, and T. F. Wenisch, “Hardware acceleration for similarity measurement in natural language processing,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ISLPED ’13, (Piscataway, NJ, USA), pp. 409–414, IEEE Press, 2013.

[103] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 27–39, IEEE Press, 2016.

[104] K. Hsieh, E. Ebrahim, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar, O. Mutlu, and S. W. Keckler, “Transparent offloading and mapping (tom): Enabling programmer-transparent near-data processing in gpu systems,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 204–216, June 2016.

[105] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and O. Mutlu, “Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation,” in 2016 IEEE 34th International Conference on Computer Design (ICCD), pp. 25–32, Oct 2016.

[106] G. H. Loh, N. Jayasena, M. H. Oskin, M. Nutter, D. Roberts, M. M. Dong, P. Zhang, and M. Ignatowski, “A processing-in-memory taxonomy and a case for studying fixed-function pim,” 2013.

[107] C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu, “Processing-in-memory enabled graphics processors for 3d rendering,” in 2017 IEEE 23rd International Symposium on High Performance Computer Architecture (HPCA), Feb 2017.

[108] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “Top-pim: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing [111], pp. 85–98.

[109] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 380–392, IEEE Press, 2016.

[110] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris: Scalable and efficient neural network acceleration with 3d memory,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ASPLOS ’17, (New York, NY, USA), ACM, 2017.

[111] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “Top-pim: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, (New York, NY, USA), pp. 85–98, ACM, 2014.

[112] M. Imani, Y. Kim, and T. Rosing, “Mpim: Multi-purpose in-memory processing using configurable resistive memory,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 757–763, Jan 2017.
