High Performance Content-Based Matching Using GPUs
Alessandro Margara and Gianpaolo Cugola
[email protected], [email protected]
Dip. Elettronica e Informazione (DEI), Politecnico di Milano
High Performance Content-Based Matching Using GPUs - DEBS 2011 2
The Problem: Content-Based Matching
• Publishers produce event notifications; subscribers express their interests as predicates; a content-based matcher forwards each event only to the subscribers whose predicates it satisfies
• Example predicate: (Smoke=true and Room="Kitchen") or (Light>30 and Room="Bedroom")
• Example event: Light=50, Room="Bedroom", Sender="Sensor1"
• Terminology
– Attribute: a name/value pair in an event (e.g., Light=50)
– Constraint: an elementary condition on an attribute (e.g., Light>30)
– Filter: a conjunction of constraints (e.g., Light>30 and Room="Bedroom")
– Predicate: a disjunction of filters
Programming GPUs: CUDA
• Introduced by Nvidia in 2006
• General purpose parallel computing architecture
– New instruction set
– New programming model
– Programmable using high-level languages
• CUDA C (a C dialect)
Programming Model: Basics
– The device (GPU) acts as a coprocessor for the host (CPU) and has its own separate memory space
• Input data must be copied from main memory to GPU memory before starting a computation …
• … and results must be copied back to main memory when the computation finishes
– These are often the most expensive operations
» They send information through the PCI-Express bus
» Bandwidth matters, but so does latency
– They also require serialization of data structures!
» Data structures must be kept simple
Typical Workflow
1. Allocate memory on the device
2. Serialize and copy data to the device
3. Execute one or more kernels on the device
4. Wait for the device to finish processing
5. Copy results back
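In CUDA C, the five steps above might be sketched as follows; the kernel body and all names are illustrative placeholders, not the paper's actual code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: doubles every input element.
__global__ void process(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

void run(const int *host_in, int *host_out, int n) {
    int *dev_in, *dev_out;
    // 1. Allocate memory on the device
    cudaMalloc((void **)&dev_in,  n * sizeof(int));
    cudaMalloc((void **)&dev_out, n * sizeof(int));
    // 2. Copy (already serialized) input data to the device
    cudaMemcpy(dev_in, host_in, n * sizeof(int), cudaMemcpyHostToDevice);
    // 3. Execute one or more kernels on the device
    process<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
    // 4. Wait for the device to finish processing
    cudaDeviceSynchronize();
    // 5. Copy results back to main memory
    cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
}
```

Steps 2 and 5 cross the PCI-Express bus, which is why they are often the most expensive part of the workflow.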
Programming Model: Fundamentals
• Single Program Multiple Threads implementation strategy
– A single kernel (function) is executed by multiple threads in parallel
• Threads are organized in blocks
– Threads within different blocks operate independently
– Threads within the same block cooperate to solve a single sub-problem
• The runtime provides blockIdx and threadIdx variables to uniquely identify each running thread
– Accessing these variables is the only way to differentiate the work done by different threads
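A minimal kernel showing how the two variables are combined into a per-thread index (the function and its parameters are illustrative):

```cuda
// Each thread computes a unique global index from blockIdx and threadIdx;
// this index is the only thing that differentiates its work.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    if (i < n)            // guard: the last block may be only partially full
        data[i] *= factor;
}

// Launch example: one thread per element, 256 threads per block.
// scale<<<(n + 255) / 256, 256>>>(devData, 2.0f, n);
```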
Programming Model: Memory Management
• Hierarchical organization of memory
– All threads have access to the same common global memory
• Large (512MB-6GB) but slow (DRAM)
• Stores information received from the host
• Persistent across different kernel calls
– Threads within a block coordinate through a shared memory
• Implemented on-chip
– Fast but limited (16-48KB)
– Each thread has its own local memory
• Shared memory is the only "cache" available
– No hardware/system support
– It must be explicitly controlled by the application code
More on Memory Management
• Without hardware-managed caches, accesses to global memory can easily become a bottleneck
• Issues to consider when designing algorithms and data structures
– Maximize usage of shared (block-local) memory
• Without exceeding its size
– Threads with contiguous ids should access contiguous global memory regions
• The hardware can coalesce them into a few memory-wide accesses
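Both guidelines can be seen in a short sketch (the kernel and names are illustrative): a block stages a tile of global memory into shared memory with a coalesced load, then works on the fast on-chip copy.

```cuda
#define TILE 256  // must not exceed the 16-48KB shared memory budget

// Each block sums one tile of the input into blockSums[blockIdx.x].
__global__ void sum_tiles(const float *in, float *blockSums, int n) {
    __shared__ float tile[TILE];          // fast, block-local, on-chip
    int i = blockIdx.x * TILE + threadIdx.x;
    // Contiguous thread ids read contiguous addresses: the hardware
    // coalesces these loads into a few memory-wide transactions.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // wait until the tile is complete
    if (threadIdx.x == 0) {               // thread 0 reduces the tile
        float s = 0.0f;
        for (int k = 0; k < TILE; ++k) s += tile[k];
        blockSums[blockIdx.x] = s;
    }
}
```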
Hardware Implementation
• An array of Streaming Multiprocessors (SMs), each containing many (extremely simple) processing cores
– Each SM executes threads in groups of 32 called warps
• Scheduling is performed in hardware with zero overhead
– Optimized for data-parallel problems
• Maximum efficiency only if all threads in a warp agree on the execution path
Some Numbers
• NVIDIA GTX 460
– 1GB RAM (global memory)
– 7 Streaming Multiprocessors
– Each SM contains 48 cores
– Each SM manages up to 48 warps (32 threads each)
– Up to 10752 threads managed concurrently!
» Up to 336 threads running concurrently!
• Today's cheap GPU: less than $160
Existing Algorithms
• Two approaches
– Counting algorithms
– Tree-based algorithms
• Complex data structures to optimize sequential execution
– Trees, maps, …
– Lots of pointers!
• They hardly fit the data-parallel programming model!
Algorithm Description
• Example subscriptions
– Interface S1: F1: A>10 and B=20, F2: B>15 and C<30
– Interface S2: F3: D=20

Constraint  Filter
A>10        F1
B=20        F1
B>15        F2
C<30        F2
D=20        F3

Filter  Size  Count  Interface
F1      2     2      S1
F2      2     1      S1
F3      1     0      S2

• Example event: A=12, B=20
– A>10 and B=20 are satisfied, so F1's count reaches its size (2): the event is delivered to interface S1
– B>15 is also satisfied, but F2's count stops at 1 and F3's at 0: no delivery to S2
Algorithm Description
• Constraints with the same attribute name are stored in an array on the GPU
– Contiguous memory regions
• When processing an event E, the CPU selects all relevant constraint arrays
– Based on the names of the attributes in E
Algorithm Description
• Bi-dimensional organization of threads
– One thread for each attribute/constraint pair
• Threads in the same block evaluate the same attribute
– It can be copied into shared memory
• Threads with contiguous ids access contiguous constraints
– Accesses are coalesced into a few memory-wide operations
• Filter counts are updated with an atomic operation
(Example event attributes: A=7, B=32, C=21)
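A one-dimensional sketch of how such an evaluation kernel might look; the struct layout, operator encoding, and names are illustrative, not the authors' actual code, and the event attribute is passed as a parameter here rather than staged in shared memory:

```cuda
// Hypothetical layout: one entry per constraint on a given attribute name.
struct Constraint {
    int op;        // 0: ==, 1: >, 2: <  (illustrative encoding)
    int value;     // constant the attribute is compared against
    int filterId;  // filter this constraint belongs to
};

// One thread per constraint of the processed attribute.
__global__ void evalConstraints(const Constraint *constraints, int numConstr,
                                int attrValue,             // value in event E
                                unsigned int *filterCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numConstr) return;
    Constraint c = constraints[i];  // coalesced: contiguous threads read
                                    // contiguous constraints
    bool sat = (c.op == 0 && attrValue == c.value) ||
               (c.op == 1 && attrValue >  c.value) ||
               (c.op == 2 && attrValue <  c.value);
    if (sat)  // several threads may hit the same filter: update atomically
        atomicAdd(&filterCount[c.filterId], 1u);
}
```

A filter whose count reaches its size is fully matched, and the event is forwarded to the associated interface.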
Improvement
• Problem: before processing each event we need to reset the filter counts and the interface selection vector
• Naïve version: use a memset
– Communication with the GPU introduces additional delay
• Solution: keep two copies of the filter counts and interface vector
• While processing an event
– One copy is used
– The other copy is reset for the next event
» Inside the same kernel
• No communication overhead
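The double-buffering trick can be sketched as follows; the signature and names are hypothetical, and the constraint-evaluation part is elided:

```cuda
// While counts[] serves the current event, the same kernel zeroes the
// spare copy nextCounts[] for the next event, so no separate memset (and
// no extra host-device round-trip) is needed between events.
__global__ void evalAndReset(/* ...constraint data for event E... */
                             unsigned int *counts,      // used for this event
                             unsigned int *nextCounts,  // reset for the next
                             int numFilters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numFilters)
        nextCounts[i] = 0;  // reset happens entirely on the device
    // ... evaluate constraints and atomically update counts[] as usual ...
}
// The host simply swaps the two pointers after each event.
```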
Results: Default Scenario
• Comparison against a state-of-the-art sequential implementation
– SFF (Siena) 1.9.4
– AMD CPU @ 2.8GHz
• Default scenario
– Relatively "simple"
– 10 interfaces, 25k filters, 1M constraints
• Analysis while changing various parameters
• We measure latency
– Processing time for a single event
• Measured speedup in the default scenario: 7x
Results: Number of Constraints
• Measured speedup: 10x
Results: Number of Filters
• Measured speedup: 13x
Results
• What is the time needed to install subscriptions?
– Need to serialize data structures
– Need to copy from CPU memory to GPU memory
– But the data structures are simple!
• Memory requirements?
– 35MB in the default scenario
– Up to 200MB in all our tests
– Not a problem for a modern GPU
Results
• We measured the latency when processing a single event
– 0.14ms processing time → 7000 events/s?
– What about the maximum throughput?
• Measured maximum throughput: 9400 events/s
Conclusions
• Benefits of GPUs in a wide range of scenarios
– In particular in the most challenging workloads
• Additional advantage
– The GPU leaves the CPU free to perform other tasks
» E.g., communication-related tasks
• Our implementation is available for download
– Includes a translator from Siena subscriptions / messages
– More info at http://home.dei.polimi.it/margara
Future Work
• We are currently extending the approach to multi-core CPUs
– Using OpenMP
• We are currently testing our algorithm within a real system
– With both GPUs and multi-core CPUs
– Taking communication overhead into account
– Measuring both latency and throughput
• We plan to explore the advantages of GPUs for probabilistic (as opposed to exact) matching
– Encoded filters (Bloom filters)
– Balancing performance against the percentage of false positives
Questions?