Shredder: GPU-Accelerated Incremental Storage and Computation
Pramod Bhatotia§, Rodrigo Rodrigues§, Akshat Verma¶
§MPI-SWS, Germany   ¶IBM Research-India
USENIX FAST 2012
Pramod Bhatotia 2
Handling the data deluge
• Data stored in data centers is growing at a fast pace
• Challenge: How to store and process this data?
• Key technique: Redundancy elimination
• Applications of redundancy elimination
  • Incremental storage: data de-duplication
  • Incremental computation: selective re-execution
Redundancy elimination is expensive
Content-based chunking [SOSP'01]
[Figure: a file passes through content-based chunking, which places chunk boundaries at content-defined markers; each chunk is hashed, and the resulting fingerprint is matched against stored fingerprints to decide whether the chunk is a duplicate.]
For large-scale data, chunking easily becomes a bottleneck.
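To make the chunking step concrete, here is a minimal sketch of content-based chunking: a rolling hash over a sliding window declares a boundary wherever the hash matches a bit pattern, so boundaries follow content rather than fixed offsets. The polynomial hash, window size, and mask below are illustrative assumptions; Shredder itself uses Rabin fingerprinting.

```python
import random

WINDOW = 16            # sliding-window size in bytes (illustrative)
MASK = (1 << 6) - 1    # boundary pattern: ~64-byte average chunks
PRIME, MOD = 31, 1 << 32
POW = pow(PRIME, WINDOW - 1, MOD)   # weight of the byte leaving the window

def chunks(data: bytes):
    """Split `data` into content-defined chunks."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD  # evict oldest byte
        h = (h * PRIME + byte) % MOD                # absorb newest byte
        if i >= WINDOW and (h & MASK) == MASK:      # content-defined marker
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(0)
original = bytes(random.randrange(256) for _ in range(4000))
edited = original[:100] + b"INSERTED" + original[100:]

# Boundaries realign shortly after the edit, so most chunks are
# byte-identical and de-duplication can skip them.
shared = set(chunks(original)) & set(chunks(edited))
```

The point of the demo: an insertion only disturbs the few chunks whose windows overlap it, unlike fixed-size chunking, where every later boundary shifts.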
Accelerate chunking using GPUs
GPUs have been successfully applied to compute-intensive tasks.
[Figure: chunking throughput (GBps) of a multicore design versus a GPU-based design; the GPU design is roughly 2X faster, while storage servers need to keep pace with rates around 20 Gbps (2.5 GBps).]
Using GPUs for data-intensive tasks presents new challenges.
Rest of the talk
• Shredder design
  • Basic design
  • Background: GPU architecture & programming model
  • Challenges and optimizations
• Evaluation
• Case studies
  • Computation: Incremental MapReduce
  • Storage: Cloud backup
Shredder basic design
[Figure: a Reader on the CPU (host) reads data for chunking; the data is transferred to the GPU (device), where the chunking kernel runs; the chunked data is then stored back on the host.]
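The Reader → Transfer → chunking kernel → Store structure above can be sketched as a small staged pipeline. In this model, bounded queues stand in for host/device buffers and a plain function stands in for the GPU kernel; the stage names follow the slide, but the 4-byte "chunks" and queue depths are assumptions for illustration.

```python
import queue
import threading

def reader(blocks, to_transfer):
    """'Reader' stage: fill host-side buffers with data for chunking."""
    for block in blocks:
        to_transfer.put(block)
    to_transfer.put(None)                      # end-of-stream marker

def transfer(to_transfer, to_kernel):
    """'Transfer' stage: stand-in for the host-to-device copy."""
    while (block := to_transfer.get()) is not None:
        to_kernel.put(block)
    to_kernel.put(None)

def chunking_kernel(block):
    """Stand-in for the GPU kernel: fixed 4-byte chunks for illustration."""
    return [block[i:i + 4] for i in range(0, len(block), 4)]

def run_pipeline(blocks):
    q1, q2 = queue.Queue(2), queue.Queue(2)    # bounded queues = backpressure
    stages = [threading.Thread(target=reader, args=(blocks, q1)),
              threading.Thread(target=transfer, args=(q1, q2))]
    for t in stages:
        t.start()
    store = []                                 # 'Store' stage: chunked data
    while (block := q2.get()) is not None:
        store.extend(chunking_kernel(block))
    for t in stages:
        t.join()
    return store

chunked = run_pipeline([b"abcdefgh", b"ijklmnop"])
```

Because the stages run concurrently, the Reader can fetch the next block while earlier blocks are still in flight, which is the property the later optimizations build on.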
GPU architecture
[Figure: the CPU (host) with its host memory connects over PCI to the GPU (device). The GPU contains N multiprocessors, each with several scalar processors (SPs) and a small shared memory; all multiprocessors share a large device global memory.]
GPU programming model
[Figure: input is copied from host memory across PCI into device global memory; many threads, spread across the multiprocessors' SPs, process it in parallel; the output is copied back to the host.]
Scalability challenges
1. Host-device communication bottlenecks
2. Device memory conflicts
3. Host bottlenecks (see paper for details)
Challenge #1: Host-device communication bottleneck
[Figure: the Reader performs I/O into host main memory; data is transferred over PCI into device global memory, where the chunking kernel consumes it.]
Synchronous data transfer and kernel execution:
• The cost of data transfer is comparable to kernel execution
• For large-scale data, it involves many data transfers
Asynchronous execution
[Figure: timeline with two device buffers; while the kernel computes on Buffer 1, the host asynchronously copies the next batch into Buffer 2, then the roles swap.]
Pros:
+ Overlaps communication with computation
+ Generalizes to multi-buffering
Cons:
- Requires page-pinning of buffers at the host side
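The double-buffering schedule can be modeled in a few lines: while the "compute" side works on one buffer, a copier thread fills the other. This is only a sketch of the schedule; real Shredder issues CUDA asynchronous copies into pinned buffers, whereas here plain Python threads and semaphores stand in for copy engines and kernel completion.

```python
import threading

def double_buffered(batches, kernel):
    """Run `kernel` over `batches`, overlapping 'copy' with 'compute'."""
    buffers = [None, None]
    ready = [threading.Semaphore(0), threading.Semaphore(0)]  # copy done
    free = [threading.Semaphore(1), threading.Semaphore(1)]   # buffer reusable

    def copier():
        for i, batch in enumerate(batches):
            slot = i % 2
            free[slot].acquire()        # wait until this buffer is reusable
            buffers[slot] = batch       # models the async host->device copy
            ready[slot].release()

    t = threading.Thread(target=copier)
    t.start()
    results = []
    for i in range(len(batches)):
        slot = i % 2
        ready[slot].acquire()           # wait for the copy into this buffer
        results.append(kernel(buffers[slot]))  # compute on one buffer while
        free[slot].release()                   # the other is being filled
    t.join()
    return results

out = double_buffered([b"aa", b"bb", b"cc"], lambda b: b.upper())
```

With more than two buffer slots this becomes the multi-buffering generalization the slide mentions.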
Circular ring of pinned memory buffers
[Figure: data is copied (memcpy) from pageable buffers on the host into a circular ring of pinned buffers, from which it is asynchronously copied to device global memory.]
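Since page-pinning is expensive, the idea above is to pin a fixed pool of buffers once and recycle the slots round-robin. A minimal sketch, assuming preallocated bytearrays stand in for page-pinned host memory (in CUDA this allocation would be something like `cudaHostAlloc`, named here only as an analogy):

```python
class PinnedRing:
    """Fixed pool of pre-'pinned' staging buffers, reused round-robin."""

    def __init__(self, nslots, slot_size):
        # One-time allocation/pinning; never pinned again per transfer.
        self.slots = [bytearray(slot_size) for _ in range(nslots)]
        self.lens = [0] * nslots
        self.next = 0

    def stage(self, pageable_data):
        """memcpy pageable data into the next ring slot; return its index."""
        slot = self.next
        self.next = (self.next + 1) % len(self.slots)
        n = len(pageable_data)
        assert n <= len(self.slots[slot]), "data larger than slot"
        self.slots[slot][:n] = pageable_data   # copy into the pinned slot
        self.lens[slot] = n
        return slot

ring = PinnedRing(nslots=2, slot_size=8)
a = ring.stage(b"hello")   # slot 0
b = ring.stage(b"world")   # slot 1
c = ring.stage(b"again")   # wraps around: slot 0 is reused
```

In the real system a slot may only be reused after its asynchronous copy to the device completes; that synchronization is the double-buffering logic from the previous slide.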
Challenge #2: Device memory conflicts
[Figure: the same host-to-device pipeline as before; the bottleneck is now the chunking kernel's accesses to device global memory.]
Accessing device memory
[Figure: threads running on a multiprocessor's SPs pay 400-600 cycles per access to device global memory, but only a few cycles per access to device shared memory.]
Memory bank conflicts
Uncoordinated accesses to device global memory lead to a large number of memory bank conflicts.
[Figure: SPs on a multiprocessor issuing scattered, uncoordinated accesses that collide on the same banks of global memory.]
Accessing memory banks
[Figure: interleaved memory with four banks; consecutive word addresses map round-robin across the banks (Bank 0 holds words 0, 4, 8; Bank 1 holds 1, 5; Bank 2 holds 2, 6; Bank 3 holds 3, 7). The low-order bits (LSBs) of the address select the bank via chip-enable, the high-order bits (MSBs) select the word within the bank, and the bank outputs are OR-ed onto the data-out bus.]
Memory coalescing
[Figure: threads 1-4 cooperatively load a contiguous region of device global memory into device shared memory; at each time step, consecutive threads access consecutive addresses, so the accesses coalesce into a single memory transaction.]
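Why coalescing matters can be shown with a toy cost model: the hardware can service one warp-wide transaction when all threads of a warp touch the same aligned memory segment. The warp and segment sizes below are illustrative assumptions, not a specific GPU's parameters.

```python
WARP = 32            # threads that issue a memory access together
SEGMENT_WORDS = 32   # words per aligned segment (illustrative)

def transactions(addresses):
    """Toy model: number of aligned segments one warp-wide access touches."""
    return len({addr // SEGMENT_WORDS for addr in addresses})

# Coalesced pattern: consecutive threads read consecutive words
# (the cooperative load into shared memory on the slide).
coalesced = transactions([t for t in range(WARP)])

# Strided pattern: each thread starts its own 32-word region (naive
# one-thread-per-chunk reading) -- every access hits a different segment.
strided = transactions([t * 32 for t in range(WARP)])
```

Under this model the naive layout costs 32 transactions per step where the cooperative layout costs 1, which is why Shredder stages data through shared memory before per-thread processing.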
Processing the data
[Figure: after the cooperative load, each thread (1-4) processes its own region of the data out of device shared memory.]
Outline
• Shredder design
• Evaluation
• Case studies
Evaluating Shredder
• Goal: Determine how Shredder works in practice
  • How effective are the optimizations?
  • How does it compare with multicores?
• Implementation
  • Host driver in C++, GPU code in CUDA
  • GPU: NVidia Tesla C2050 cards
  • Host machine: Intel Xeon with 12 cores
(See paper for details)
Shredder vs. Multicores
[Figure: chunking throughput (GBps) for four configurations: multicore, basic GPU, GPU with asynchronous execution, and GPU with asynchronous execution plus memory coalescing; the fully optimized GPU design achieves roughly 5X the multicore throughput.]
Outline
• Shredder design
• Evaluation
• Case studies
  • Computation: Incremental MapReduce
  • Storage: Cloud backup (see paper for details)
Incremental MapReduce
[Figure: MapReduce dataflow: read input → map tasks → reduce tasks → write output.]
Unstable input partitions
[Figure: the same read input → map tasks → reduce tasks → write output dataflow; a change in the input shifts the fixed-offset input partitions, so map tasks over unchanged data can no longer be reused.]
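The instability can be demonstrated in a few lines: a small insertion shifts every later fixed-size split, while content-defined boundaries realign after the edit. The "split after any `.` byte" rule below is a stand-in assumption for a real content-based chunker, chosen only to keep the demo short.

```python
def fixed_splits(data, size=8):
    """Fixed-offset partitions: boundaries at multiples of `size`."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_splits(data, marker=ord(".")):
    """Content-defined partitions: boundary after every marker byte."""
    out, start = [], 0
    for i, b in enumerate(data):
        if b == marker:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

old = b"aaaa.bbbb.cccc.dddd."
new = b"XX" + old          # a small insertion at the front

# Splits that survive the edit unchanged (reusable map-task inputs):
fixed_reuse = set(fixed_splits(old)) & set(fixed_splits(new))
content_reuse = set(content_splits(old)) & set(content_splits(new))
```

Here no fixed-size split survives the two-byte insertion, while all content-defined splits after the edit point do, which is why Inc-HDFS partitions input by content.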
GPU-accelerated Inc-HDFS
[Figure: on copyFromLocal, the HDFS client routes the input file through Shredder, which performs content-based chunking to produce the input splits (Split-1, Split-2, Split-3).]
Related work
• GPU-accelerated systems
  • Storage: Gibraltar [ICPP'10], HashGPU [HPDC'10]
  • SSLShader [NSDI'11], PacketShader [SIGCOMM'10], ...
• Incremental computations
  • Incoop [SOCC'11], Nectar [OSDI'10], Percolator [OSDI'10], ...
Conclusions
• Shredder is a GPU-accelerated framework for redundancy elimination
  • Exploits massively parallel GPUs in a cost-effective manner
• Shredder's design incorporates novel optimizations
  • Handles workloads more data-intensive than previous uses of GPUs
• Shredder can be seamlessly integrated with storage systems
  • Accelerates both incremental storage and incremental computation
Thank You!