High Performance Content-Based Matching Using GPUs
Alessandro Margara and Gianpaolo Cugola
[email protected], [email protected]
Dip. Elettronica e Informazione (DEI), Politecnico di Milano
High Performance Content-Based Matching Using GPUs - DEBS 2011 2
The Problem: Content-Based Matching
• Publishers produce event notifications; subscribers express their interests as predicates; a content-based matcher forwards each event only to the subscribers whose predicates it satisfies
• Example predicate: (Smoke=true and Room="Kitchen") or (Light>30 and Room="Bedroom")
• Example event: Light=50, Room="Bedroom", Sender="Sensor1"
• Terminology
– Attribute: a name/value pair in an event (e.g., Light=50)
– Constraint: an elementary condition on an attribute (e.g., Light>30)
– Filter: a conjunction of constraints (e.g., Light>30 and Room="Bedroom")
– Predicate: a disjunction of filters
Programming GPUs: CUDA
• Introduced by Nvidia in 2006
• General purpose parallel computing architecture
– New instruction set
– New programming model
– Programmable using high-level languages
• CUDA C (a C dialect)
Programming Model: Basics
– The device (GPU) acts as a coprocessor for the host (CPU) and has its own separate memory space
• Input data must be copied from main memory to GPU memory before starting a computation …
• … and results must be copied back to main memory when the computation finishes
– These are often the most expensive operations
» They send information through the PCI-Express bus
» Bandwidth matters, but so does latency
– They also require serialization of data structures!
» Data structures must be kept simple
Typical Workflow
1. Allocate memory on the device
2. Serialize and copy data to the device
3. Execute one or more kernels on the device
4. Wait for the device to finish processing
5. Copy results back
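In CUDA C, the five steps above might be sketched as follows; the kernel body and all names are illustrative placeholders, not the paper's actual code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: doubles every input element.
__global__ void process(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

void run(const int *host_in, int *host_out, int n) {
    int *dev_in, *dev_out;
    // 1. Allocate memory on the device
    cudaMalloc((void **)&dev_in,  n * sizeof(int));
    cudaMalloc((void **)&dev_out, n * sizeof(int));
    // 2. Copy (already serialized) input data to the device
    cudaMemcpy(dev_in, host_in, n * sizeof(int), cudaMemcpyHostToDevice);
    // 3. Execute one or more kernels on the device
    process<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
    // 4. Wait for the device to finish processing
    cudaDeviceSynchronize();
    // 5. Copy results back to main memory
    cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
}
```

Steps 2 and 5 cross the PCI-Express bus, which is why they are often the most expensive part of the workflow.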
Programming Model: Fundamentals
• Single Program Multiple Threads implementation strategy
– A single kernel (function) is executed by multiple threads in parallel
• Threads are organized in blocks
– Threads within different blocks operate independently
– Threads within the same block cooperate to solve a single sub-problem
• The runtime provides blockIdx and threadIdx variables to uniquely identify each running thread
– Accessing these variables is the only way to differentiate the work done by different threads
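A minimal kernel showing how the two variables are combined into a per-thread index (the function and its parameters are illustrative):

```cuda
// Each thread computes a unique global index from blockIdx and threadIdx;
// this index is the only thing that differentiates its work.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    if (i < n)            // guard: the last block may be only partially full
        data[i] *= factor;
}

// Launch example: one thread per element, 256 threads per block.
// scale<<<(n + 255) / 256, 256>>>(devData, 2.0f, n);
```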
Programming Model: Memory Management
• Hierarchical organization of memory
– All threads have access to the same common global memory
• Large (512MB-6GB) but slow (DRAM)
• Stores information received from the host
• Persistent across different kernel calls
– Threads within a block coordinate through a shared memory
• Implemented on-chip
– Fast but limited (16-48KB)
– Each thread has its own local memory
• Shared memory is the only "cache" available
– No hardware/system support
– It must be explicitly controlled by the application code
More on Memory Management
• Without hardware-managed caches, accesses to global memory can easily become a bottleneck
• Issues to consider when designing algorithms and data structures
– Maximize usage of shared (block-local) memory
• Without exceeding its size
– Threads with contiguous ids should access contiguous global memory regions
• The hardware can coalesce them into a few memory-wide accesses
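Both guidelines can be seen in a short sketch (the kernel and names are illustrative): a block stages a tile of global memory into shared memory with a coalesced load, then works on the fast on-chip copy.

```cuda
#define TILE 256  // must not exceed the 16-48KB shared memory budget

// Each block sums one tile of the input into blockSums[blockIdx.x].
__global__ void sum_tiles(const float *in, float *blockSums, int n) {
    __shared__ float tile[TILE];          // fast, block-local, on-chip
    int i = blockIdx.x * TILE + threadIdx.x;
    // Contiguous thread ids read contiguous addresses: the hardware
    // coalesces these loads into a few memory-wide transactions.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // wait until the tile is complete
    if (threadIdx.x == 0) {               // thread 0 reduces the tile
        float s = 0.0f;
        for (int k = 0; k < TILE; ++k) s += tile[k];
        blockSums[blockIdx.x] = s;
    }
}
```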
Hardware Implementation
• An array of Streaming Multiprocessors (SMs), each containing many (extremely simple) processing cores
– Each SM executes threads in groups of 32 called warps
• Scheduling is performed in hardware with zero overhead
– Optimized for data-parallel problems
• Maximum efficiency only if all threads in a warp agree on the execution path
Some Numbers
• NVIDIA GTX 460
– 1GB RAM (global memory)
– 7 Streaming Multiprocessors
– Each SM contains 48 cores
– Each SM manages up to 48 warps (32 threads each)
– Up to 10752 threads managed concurrently!
» Up to 336 threads running concurrently!
• Today's cheap GPU: less than $160
Existing Algorithms
• Two approaches
– Counting algorithms
– Tree-based algorithms
• Complex data structures to optimize sequential execution
– Trees, maps, …
– Lots of pointers!
• They hardly fit the data-parallel programming model!
Algorithm Description
• Example subscriptions
– Interface S1: F1: A>10 and B=20, F2: B>15 and C<30
– Interface S2: F3: D=20

Constraint  Filter
A>10        F1
B=20        F1
B>15        F2
C<30        F2
D=20        F3

Filter  Size  Count  Interface
F1      2     2      S1
F2      2     1      S1
F3      1     0      S2

• Example event: A=12, B=20
– A>10 and B=20 are satisfied, so F1's count reaches its size (2): the event is delivered to interface S1
– B>15 is also satisfied, but F2's count stops at 1 and F3's at 0: no delivery to S2
Algorithm Description
• Constraints with the same attribute name are stored in an array on the GPU
– Contiguous memory regions
• When processing an event E, the CPU selects all relevant constraint arrays
– Based on the names of the attributes in E
Algorithm Description
• Bi-dimensional organization of threads
– One thread for each attribute/constraint pair
• Threads in the same block evaluate the same attribute
– It can be copied into shared memory
• Threads with contiguous ids access contiguous constraints
– Accesses are coalesced into a few memory-wide operations
• Filter counts are updated with an atomic operation
(Example event attributes: A=7, B=32, C=21)
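A one-dimensional sketch of how such an evaluation kernel might look; the struct layout, operator encoding, and names are illustrative, not the authors' actual code, and the event attribute is passed as a parameter here rather than staged in shared memory:

```cuda
// Hypothetical layout: one entry per constraint on a given attribute name.
struct Constraint {
    int op;        // 0: ==, 1: >, 2: <  (illustrative encoding)
    int value;     // constant the attribute is compared against
    int filterId;  // filter this constraint belongs to
};

// One thread per constraint of the processed attribute.
__global__ void evalConstraints(const Constraint *constraints, int numConstr,
                                int attrValue,             // value in event E
                                unsigned int *filterCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numConstr) return;
    Constraint c = constraints[i];  // coalesced: contiguous threads read
                                    // contiguous constraints
    bool sat = (c.op == 0 && attrValue == c.value) ||
               (c.op == 1 && attrValue >  c.value) ||
               (c.op == 2 && attrValue <  c.value);
    if (sat)  // several threads may hit the same filter: update atomically
        atomicAdd(&filterCount[c.filterId], 1u);
}
```

A filter whose count reaches its size is fully matched, and the event is forwarded to the associated interface.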
Improvement
• Problem: before processing each event we need to reset the filter counts and the interface selection vector
• Naïve version: use a memset
– Communication with the GPU introduces additional delay
• Solution: keep two copies of the filter counts and interface vector
• While processing an event
– One copy is used
– The other copy is reset for the next event
» Inside the same kernel
• No communication overhead
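The double-buffering trick can be sketched as follows; the signature and names are hypothetical, and the constraint-evaluation part is elided:

```cuda
// While counts[] serves the current event, the same kernel zeroes the
// spare copy nextCounts[] for the next event, so no separate memset (and
// no extra host-device round-trip) is needed between events.
__global__ void evalAndReset(/* ...constraint data for event E... */
                             unsigned int *counts,      // used for this event
                             unsigned int *nextCounts,  // reset for the next
                             int numFilters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numFilters)
        nextCounts[i] = 0;  // reset happens entirely on the device
    // ... evaluate constraints and atomically update counts[] as usual ...
}
// The host simply swaps the two pointers after each event.
```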
Results: Default Scenario
• Comparison against a state-of-the-art sequential implementation
– SFF (Siena) 1.9.4
– AMD CPU @ 2.8GHz
• Default scenario
– Relatively "simple"
– 10 interfaces, 25k filters, 1M constraints
• Analysis while changing various parameters
• We measure latency
– Processing time for a single event
• Measured speedup in the default scenario: 7x
Results: Number of Constraints
• Measured speedup: 10x
Results: Number of Filters
• Measured speedup: 13x
Results
• What is the time needed to install subscriptions?
– Need to serialize data structures
– Need to copy from CPU memory to GPU memory
– But the data structures are simple!
• Memory requirements?
– 35MB in the default scenario
– Up to 200MB in all our tests
– Not a problem for a modern GPU
Results
• We measured the latency when processing a single event
– 0.14ms processing time → 7000 events/s?
– What about the maximum throughput?
• Measured maximum throughput: 9400 events/s
Conclusions
• Benefits of GPUs in a wide range of scenarios
– In particular in the most challenging workloads
• Additional advantage
– The GPU leaves the CPU free to perform other tasks
» E.g., communication-related tasks
• Our implementation is available for download
– Includes a translator from Siena subscriptions / messages
– More info at http://home.dei.polimi.it/margara
Future Work
• We are currently extending the approach to multi-core CPUs
– Using OpenMP
• We are currently testing our algorithm within a real system
– With both GPUs and multi-core CPUs
– Taking communication overhead into account
– Measuring both latency and throughput
• We plan to explore the advantages of GPUs for probabilistic (as opposed to exact) matching
– Encoded filters (Bloom filters)
– Balancing performance against the percentage of false positives
Questions?