Transcript
Page 1:

Shredder: GPU-Accelerated Incremental Storage and Computation

Pramod Bhatotia§, Rodrigo Rodrigues§, Akshat Verma¶

§MPI-SWS, Germany   ¶IBM Research-India

USENIX FAST 2012

Page 2:

Handling the data deluge

• Data stored in data centers is growing at a fast pace
• Challenge: How to store and process this data?
• Key technique: Redundancy elimination
• Applications of redundancy elimination
  • Incremental storage: data de-duplication
  • Incremental computation: selective re-execution

Page 3:

Redundancy elimination is expensive

For large-scale data, chunking easily becomes a bottleneck

Content-based chunking [SOSP’01]

[Figure: content-based chunking pipeline: the file is split into chunks at content-defined markers, each chunk is hashed to a fingerprint, and fingerprints are matched to decide whether a chunk is a duplicate (yes/no)]
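To make the chunking step concrete, here is a minimal host-side sketch of content-defined chunking with a rolling hash over a sliding window; the window size, mask, and constants are illustrative assumptions, not Shredder's actual parameters.

```cuda
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch of content-defined chunking: a rolling hash is maintained over a
// small sliding window, and a chunk boundary is declared wherever the low
// bits of the hash match a fixed pattern.  Boundaries depend only on local
// content, so an edit early in the file does not shift later chunks.
std::vector<size_t> chunkBoundaries(const uint8_t* data, size_t len) {
    const size_t   kWindow = 48;                 // sliding-window size (assumed)
    const uint64_t kPrime  = 1099511628211ULL;   // hash multiplier (assumed)
    const uint64_t kMask   = (1ULL << 13) - 1;   // ~8 KB expected chunk size
    const uint64_t kTarget = 0x78;               // boundary pattern (assumed)

    // Weight of the oldest byte in the window: kPrime^(kWindow-1) mod 2^64.
    uint64_t msbWeight = 1;
    for (size_t i = 0; i + 1 < kWindow; ++i) msbWeight *= kPrime;

    std::vector<size_t> boundaries;
    uint64_t hash = 0;
    for (size_t i = 0; i < len; ++i) {
        if (i >= kWindow)                         // drop the byte leaving the window
            hash -= msbWeight * data[i - kWindow];
        hash = hash * kPrime + data[i];           // add the incoming byte
        if (i + 1 >= kWindow && (hash & kMask) == kTarget)
            boundaries.push_back(i + 1);          // chunk ends after byte i
    }
    if (boundaries.empty() || boundaries.back() != len)
        boundaries.push_back(len);                // close the final chunk
    return boundaries;
}
```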

Page 4:

Accelerate chunking using GPUs

GPUs have been successfully applied to compute-intensive tasks

[Figure: chunking throughput in GB/s for a multicore design versus a basic GPU-based design: the GPU design is roughly 2X faster, but still short of a reference line at 20 Gbps (2.5 GB/s), the rate expected by storage servers]

Using GPUs for data-intensive tasks presents new challenges


Page 5:

Rest of the talk

• Shredder design
  • Basic design
  • Background: GPU architecture & programming model
  • Challenges and optimizations
• Evaluation
• Case studies
  • Computation: Incremental MapReduce
  • Storage: Cloud backup

Page 6:

Shredder basic design

[Figure: basic design: on the CPU (host), a Reader produces data for chunking and a Transfer stage moves it to the GPU (device); the chunking kernel runs on the device, and the chunked data is handed to a Store stage]
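A minimal sketch of what this basic, fully synchronous loop could look like in CUDA, assuming a hypothetical simplified chunkKernel (a byte test stands in for the real rolling-hash boundary test); error handling is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Simplified chunking kernel: each thread scans a strided subset of the bytes
// and records a boundary when a byte-level test fires.  (The real kernel uses
// a rolling hash over a window, as in the earlier host sketch.)
__global__ void chunkKernel(const uint8_t* data, size_t len,
                            size_t* boundaries, unsigned int* count) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < len;
         i += (size_t)gridDim.x * blockDim.x) {
        if ((data[i] & 0x3F) == 0) {                      // stand-in boundary test
            unsigned int slot = atomicAdd(count, 1u);
            boundaries[slot] = i;
        }
    }
}

// Basic design: Reader -> Transfer -> chunking kernel -> Store, one buffer,
// everything synchronous.
void basicPipeline(FILE* in, size_t bufSize) {
    uint8_t* hostBuf = (uint8_t*)malloc(bufSize);
    uint8_t* devBuf;
    cudaMalloc((void**)&devBuf, bufSize);
    size_t* devBoundaries;                         // sketch: one slot per byte (worst case)
    cudaMalloc((void**)&devBoundaries, bufSize * sizeof(size_t));
    unsigned int* devCount;
    cudaMalloc((void**)&devCount, sizeof(unsigned int));

    size_t n;
    while ((n = fread(hostBuf, 1, bufSize, in)) > 0) {                 // Reader
        cudaMemcpy(devBuf, hostBuf, n, cudaMemcpyHostToDevice);        // Transfer
        cudaMemset(devCount, 0, sizeof(unsigned int));
        chunkKernel<<<64, 256>>>(devBuf, n, devBoundaries, devCount);  // Chunking kernel
        unsigned int count;
        cudaMemcpy(&count, devCount, sizeof(count), cudaMemcpyDeviceToHost);
        // Store: copy `count` boundaries back and emit the chunked data ...
    }
    free(hostBuf);
    cudaFree(devBuf); cudaFree(devBoundaries); cudaFree(devCount);
}
```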

Page 7:

GPU architecture

[Figure: GPU architecture: the CPU (host) with its host memory is connected over PCI to the GPU (device), which consists of multiprocessors 1..N, each containing several scalar processors (SPs) and a shared memory, all backed by device global memory]
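For reference, the quantities in this figure (number of multiprocessors, shared memory per block, global memory size) can be queried at runtime; a small sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Print the architecture parameters from the figure for device 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("GPU: %s\n", prop.name);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Device global memory:    %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```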

Page 8:

GPU programming model

[Figure: the same architecture, annotated with the programming model: input is copied from host memory to device global memory over PCI, many threads run on the multiprocessors, and the output is copied back to the host]
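A minimal illustration of this programming model, with a placeholder kernel that is not part of Shredder: copy the input to device global memory, launch one thread per element, and copy the output back.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Placeholder kernel: each thread handles one element of the input.
__global__ void addOne(const uint8_t* in, uint8_t* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

int main() {
    const size_t n = 1 << 20;
    uint8_t *hIn = new uint8_t[n], *hOut = new uint8_t[n];
    for (size_t i = 0; i < n; ++i) hIn[i] = (uint8_t)i;

    uint8_t *dIn, *dOut;
    cudaMalloc((void**)&dIn, n);
    cudaMalloc((void**)&dOut, n);

    cudaMemcpy(dIn, hIn, n, cudaMemcpyHostToDevice);              // input: host -> device
    addOne<<<(unsigned)((n + 255) / 256), 256>>>(dIn, dOut, n);   // threads on the SMs
    cudaMemcpy(hOut, dOut, n, cudaMemcpyDeviceToHost);            // output: device -> host

    printf("out[0] = %u\n", hOut[0]);
    cudaFree(dIn); cudaFree(dOut);
    delete[] hIn; delete[] hOut;
    return 0;
}
```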

Page 9:

Scalability challenges

1. Host-device communication bottlenecks

2. Device memory conflicts

3. Host bottlenecks (see paper for details)

Page 10:

Host-device communication bottleneck

Challenge #1

[Figure: the Reader performs I/O into main memory on the CPU (host); the Transfer stage copies the data over PCI into device global memory, where the chunking kernel consumes it on the GPU (device)]

Synchronous data transfer and kernel execution:
• The cost of data transfer is comparable to kernel execution
• For large-scale data, it involves many data transfers

Page 11:

Asynchronous execution

[Figure: the host asynchronously copies data into Buffer 1 and Buffer 2 in device global memory; over time, the copy into one buffer overlaps with the kernel computing on the other (copy to Buffer 1, compute Buffer 1 while copying to Buffer 2, compute Buffer 2, and so on)]

Pros:
+ Overlaps communication with computation
+ Generalizes to multi-buffering
Cons:
- Requires page-pinning of buffers at the host side
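A minimal sketch of this double-buffered scheme using two CUDA streams and pinned host buffers; chunkKernel is the same simplified kernel defined in the page-6 sketch, and buffer counts and launch parameters are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Same simplified kernel as in the page-6 sketch.
__global__ void chunkKernel(const uint8_t* data, size_t len,
                            size_t* boundaries, unsigned int* count);

// Double buffering with two streams: while the kernel chunks buffer `cur`,
// the next read and copy can target the other buffer.
void asyncPipeline(FILE* in, size_t bufSize) {
    uint8_t *hostBuf[2], *devBuf[2];
    size_t* devBoundaries[2];
    unsigned int* devCount[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc((void**)&hostBuf[i], bufSize, cudaHostAllocDefault);  // pinned
        cudaMalloc((void**)&devBuf[i], bufSize);
        cudaMalloc((void**)&devBoundaries[i], bufSize * sizeof(size_t));
        cudaMalloc((void**)&devCount[i], sizeof(unsigned int));
        cudaStreamCreate(&stream[i]);
    }

    int cur = 0;
    for (;;) {
        // The Reader may only overwrite this pinned buffer once the previous
        // asynchronous copy out of it has completed.
        cudaStreamSynchronize(stream[cur]);
        size_t n = fread(hostBuf[cur], 1, bufSize, in);                      // Reader
        if (n == 0) break;
        cudaMemcpyAsync(devBuf[cur], hostBuf[cur], n,
                        cudaMemcpyHostToDevice, stream[cur]);                // async copy
        cudaMemsetAsync(devCount[cur], 0, sizeof(unsigned int), stream[cur]);
        chunkKernel<<<64, 256, 0, stream[cur]>>>(devBuf[cur], n,
                                                 devBoundaries[cur], devCount[cur]);
        // (Store stage that reads back the boundaries omitted; see basic sketch)
        cur = 1 - cur;       // next iteration overlaps with the work just queued
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(hostBuf[i]);
        cudaFree(devBuf[i]); cudaFree(devBoundaries[i]); cudaFree(devCount[i]);
    }
}
```

The synchronization sits before the fread so the Reader never overwrites a pinned buffer whose asynchronous copy is still in flight; everything else for a given buffer is simply queued on that buffer's stream.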

Page 12:

Circular ring pinned memory buffers

[Figure: on the CPU (host), data is memcpy'd from pageable buffers into a circular ring of pinned buffers, and from there asynchronously copied to device global memory on the GPU (device)]
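A sketch of such a ring, with illustrative sizes: data arriving in pageable memory is memcpy'd into the next pinned slot and then copied asynchronously to the device, so only the fixed-size ring needs to be page-pinned rather than the whole input.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

// Fixed-size ring of page-pinned staging buffers (sizes are illustrative).
struct PinnedRing {
    static const int kSlots = 8;
    uint8_t*     slot[kSlots];
    cudaStream_t stream[kSlots];
    size_t       slotSize;
    int          next = 0;

    void init(size_t size) {
        slotSize = size;
        for (int i = 0; i < kSlots; ++i) {
            cudaHostAlloc((void**)&slot[i], size, cudaHostAllocDefault);  // pinned
            cudaStreamCreate(&stream[i]);
        }
    }

    // Stage one pageable buffer (n <= slotSize) and start its copy to the device.
    void stageAndCopy(const uint8_t* pageable, size_t n, uint8_t* devDst) {
        int i = next;
        next = (next + 1) % kSlots;                 // circular reuse of slots
        cudaStreamSynchronize(stream[i]);           // wait until this slot is free
        memcpy(slot[i], pageable, n);               // pageable -> pinned
        cudaMemcpyAsync(devDst, slot[i], n,         // pinned -> device (async)
                        cudaMemcpyHostToDevice, stream[i]);
    }
};
```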

Page 13:

Device memory conflicts

Challenge #2

[Figure: as before, data flows from the Reader through main memory and over PCI into device global memory, where the chunking kernel reads it]

Page 14:

Accessing device memory

[Figure: threads 1-4 running on SP-1 through SP-4 of a multiprocessor pay 400-600 cycles per access to device global memory, but only a few cycles per access to device shared memory]

Page 15:

Memory bank conflicts

Uncoordinated accesses to global memory lead to a large number of memory bank conflicts.

[Figure: the SPs of a multiprocessor issue scattered accesses to device global memory; device shared memory sits inside the multiprocessor]

Page 16:

Accessing memory banks

[Figure: interleaved memory: the low-order address bits (LSBs) select the bank and the high-order bits (MSBs) select the word within the bank, so consecutive words map to consecutive banks (e.g., words 0, 4, 8 in Bank 0; 1, 5 in Bank 1; 2, 6 in Bank 2; 3, 7 in Bank 3)]
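To make the contrast concrete, a sketch of two global-memory access patterns for the same amount of work (kernel names and parameters are illustrative): the uncoordinated pattern the previous slide warns about, and an interleaved pattern that matches how the banks are laid out.

```cuda
#include <cstdint>

// Two ways for a thread block to read the same region of device global memory.
// In the uncoordinated pattern, each thread walks its own contiguous region, so
// a warp's simultaneous accesses are far apart and can pile up on the same
// banks.  In the interleaved pattern, consecutive threads read consecutive
// words, spreading each access across the banks (the next slide turns this
// into memory coalescing).
__global__ void uncoordinatedRead(const uint32_t* data, uint32_t* sum, size_t perThread) {
    size_t tid  = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t base = tid * perThread;
    uint32_t s = 0;
    for (size_t i = 0; i < perThread; ++i)
        s += data[base + i];                 // thread-private contiguous region
    sum[tid] = s;
}

__global__ void interleavedRead(const uint32_t* data, uint32_t* sum, size_t perThread) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t tid    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t s = 0;
    for (size_t i = 0; i < perThread; ++i)
        s += data[tid + i * stride];         // consecutive threads, consecutive words
    sum[tid] = s;
}
```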

Page 17:

Memory coalescing

[Figure: over time, threads 1-4 issue coalesced reads that copy contiguous words from device global memory into device shared memory]

Page 18:

Processing the data

[Figure: after the staged copy, each of threads 1-4 processes its own region of the data out of device shared memory]
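Putting the last two slides together, a sketch of a kernel that first stages a tile of global memory into shared memory with coalesced loads and then lets each thread chunk its own slice of the tile; the byte test again stands in for the rolling hash, and a real kernel would also have to handle the hash window overlapping slice and tile edges.

```cuda
#include <cstdint>

#define THREADS           128
#define BYTES_PER_THREAD  256
#define TILE_BYTES        (THREADS * BYTES_PER_THREAD)   // 32 KB of shared memory per block

// Stage one tile into shared memory with coalesced loads (consecutive threads
// read consecutive 32-bit words), then let each thread chunk its own slice out
// of shared memory, where accesses cost only a few cycles.
__global__ void chunkKernelCoalesced(const uint32_t* data, size_t nWords,
                                     size_t* boundaries, unsigned int* count) {
    __shared__ uint32_t tile[TILE_BYTES / 4];
    size_t tileBase = (size_t)blockIdx.x * (TILE_BYTES / 4);

    for (unsigned int i = threadIdx.x; i < TILE_BYTES / 4; i += THREADS) {
        size_t g = tileBase + i;                 // coalesced: in each round the warp
        tile[i] = (g < nWords) ? data[g] : 0;    // reads a contiguous run of words
    }
    __syncthreads();

    const uint8_t* bytes = (const uint8_t*)tile;
    size_t myStart = (size_t)threadIdx.x * BYTES_PER_THREAD;
    for (size_t i = myStart; i < myStart + BYTES_PER_THREAD; ++i) {
        if ((bytes[i] & 0x3F) == 0) {            // stand-in boundary test
            unsigned int slot = atomicAdd(count, 1u);
            boundaries[slot] = (size_t)blockIdx.x * TILE_BYTES + i;
        }
    }
}
// Example launch: chunkKernelCoalesced<<<numTiles, THREADS>>>(devData, nWords,
//                                                             devBoundaries, devCount);
```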

Page 19:

Outline

• Shredder design
• Evaluation
• Case studies

Page 20:

Evaluating Shredder

• Goal: Determine how Shredder works in practice
  • How effective are the optimizations?
  • How does it compare with multicores?
• Implementation
  • Host driver in C++ and GPU code in CUDA
  • GPU: NVIDIA Tesla C2050 cards
  • Host machine: Intel Xeon with 12 cores

(See paper for details)

Page 21:

Shredder vs. Multicores

[Figure: chunking throughput in GB/s for Multicore, GPU Basic, GPU Async, and GPU Async + Coalescing; the fully optimized GPU version achieves about a 5X speedup over the multicore baseline]

Page 22:

Outline

• Shredder design
• Evaluation
• Case studies
  • Computation: Incremental MapReduce
  • Storage: Cloud backup (see paper for details)

Page 23:

Incremental MapReduce

[Figure: MapReduce dataflow: read input, map tasks, reduce tasks, write output]

Page 24:

Unstable input partitions

[Figure: the same MapReduce dataflow (read input, map tasks, reduce tasks, write output), highlighting that the input partitions fed to the map tasks are unstable when the input changes]

Page 25:

GPU-accelerated Inc-HDFS

[Figure: during copyFromLocal, the HDFS client runs the input file through Shredder, which performs content-based chunking and produces content-defined splits (Split-1, Split-2, Split-3)]
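A hedged host-side sketch of the idea (Inc-HDFS itself lives in Java/Hadoop, and the chunker interface here is hypothetical): cut splits only at the content-defined boundaries returned by the chunker, so a local change to the input leaves most splits unchanged.

```cuda
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical interface to the chunker (see the earlier chunking sketches).
std::vector<size_t> chunkBoundaries(const uint8_t* data, size_t len);

struct Split { size_t offset, length; };

// Content-defined splits for copyFromLocal: group consecutive chunks until a
// target split size is reached, cutting only at chunk boundaries.  A local
// insertion changes only the splits that overlap it; later splits keep their
// content and can be matched against previously stored data.
std::vector<Split> makeSplits(const uint8_t* data, size_t len, size_t targetSplit) {
    std::vector<size_t> bounds = chunkBoundaries(data, len);
    std::vector<Split> splits;
    size_t start = 0;
    for (size_t b : bounds) {
        if (b - start >= targetSplit) {
            splits.push_back({start, b - start});
            start = b;
        }
    }
    if (start < len) splits.push_back({start, len - start});  // trailing split
    return splits;
}
```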

Page 26:

Related work

• GPU-accelerated systems
  • Storage: Gibraltar [ICPP'10], HashGPU [HPDC'10]
  • SSLShader [NSDI'11], PacketShader [SIGCOMM'10], …
• Incremental computations
  • Incoop [SOCC'11], Nectar [OSDI'10], Percolator [OSDI'10], …

Page 27:

Conclusions

• GPU-accelerated framework for redundancy elimination
  • Exploits massively parallel GPUs in a cost-effective manner
• Shredder's design incorporates novel optimizations
  • More data-intensive than previous uses of GPUs
• Shredder can be seamlessly integrated with storage systems
  • To accelerate incremental storage and computation

Page 28:

Thank You!

