Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?
Iya Arefyeva, Gabriel Campero Durand, Marcus Pinnecke, David Broneske, Gunter Saake
Workgroup Databases and Software Engineering, University of Magdeburg
Motivation: Context
GPGPUs are becoming essential for accelerating computation
- 3 out of the top 5 systems on the TOP500 list (June 2018) are powered by GPUs
- 56% of the FLOPS on the list come from GPU acceleration [10]
Summit Supercomputer, Oak Ridge
Motivation: Context
GPUs are also important for accelerating database workloads:

Online analytical processing (OLAP):
- few long-running tasks performed on big chunks of data
- easy to exploit data parallelism → good for GPUs
- GPU-accelerated systems for OLAP: GDB [1], HyPE [2], CoGaDB [3], Ocelot [4], Caldera [5], MapD [8]

Online transaction processing (OLTP) → comparably less studied:
- thousands of short-lived transactions within a short period of time
- data should be processed as soon as possible due to user interaction
- GPU-accelerated system for OLTP: GPUTx [6]
Motivation: Context
Hybrid transactional and analytical processing (HTAP):
- real-time analytics on data that is ingested and modified in the transactional database engine
- challenging due to conflicting requirements in workloads
GPU accelerated systems for HTAP: Caldera [5]*
*However, in Caldera, GPUs don’t process OLTP workloads → possible underutilization
Caldera Architecture [5]
Motivation: GPUs for OLTP
Intrinsic GPU challenges:
1. SPMD processing
2. Coalesced memory access
3. Branch divergence overheads
4. Communication bottleneck: data needs to be transferred from RAM to GPU and back over the PCIe bus
5. Bandwidth bottleneck: the bandwidth of the PCIe bus is lower than the bandwidth of a GPU
6. Limited memory

SM structure of Nvidia's Pascal GP100 [9]
Motivation: GPUs for OLTP
OLTP Challenges:
- Managing isolation and consistency with massive parallelism
- Previous research (GPUTx [6]) proposed a bulk execution model and K-set transaction handling
Experiments with GPUTx [6]
Our contributions
In this early work, we
1. Evaluate a simplified version of the K-set execution model from GPUTx, assuming single key operations and massive point queries.
2. Test on a CRUD benchmark, reporting impact of batch sizes and bounded staleness.
3. Suggest two possible characteristics that could aid in the adoption of GPUs for OLTP, as we seek to adopt them in the design of a GPU OLTP query processor.
Prototype Design
[Diagram: batch collection pipeline]
Implementation
- Storage engine is implemented in C++
- OpenCL for GPU programming
- The table is stored on the GPU (when the GPU is used); only the necessary data is transferred
- Client requests are handled in a single thread
- To support operator-based K-sets, several cases need to be considered
- These cases determine our transaction manager
[Diagram: clients 1…N send requests; requests are collected into batches, the batches are processed, and replies are sent back to the clients.]
Implementation

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes: the new request is simply appended to the collected batch for that operation type; once the batch is full, the whole batch is processed.

[Diagram: writes 1-7 on keys 22, 4, 19, 8, 10, 1, 56 sit in the collected batch for writes; the new write 8 on key 5 joins them, the batch is full, and batch processing starts.]
Implementation

Case 2: Write after Write: the new write hits a key that already has a pending write in the batch, so the collected writes are flushed first and the new write starts a fresh batch.

[Diagram: the new write 8 on key 4 conflicts with write 2 (key 4) in the collected batch for writes (writes 1-6 on keys 22, 4, 19, 8, 10, 1); the batched writes are flushed, then write 8 begins a new collected batch for writes.]
Implementation

Case 3: Read after Write: the new read hits a key with a pending write, so the collected writes are flushed first and the read joins the collected batch for reads.

[Diagram: the new read 5 on key 4 conflicts with write 2 (key 4); the collected writes (writes 1-6) are flushed, and read 5 joins the collected batch for reads (reads 1-4 on keys 7, 13, 32, 25).]
Implementation

Case 4: Write after Read: the new write hits a key with a pending read, so the collected reads are flushed first and the write joins the collected batch for writes.

[Diagram: the new write 5 on key 4 conflicts with read 2 (key 4); the collected reads (reads 1-7 on keys 22, 4, 19, 8, 10, 1, 56) are flushed, and write 5 joins the collected batch for writes (writes 1-4 on keys 7, 13, 32, 25).]
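The four conflict cases can be sketched as a tiny batch dispatcher. This is a minimal illustration under our single-key CRUD assumption, not the actual prototype code; `BatchManager`, `Request`, and `submit` are names we made up, and the batch-full and K-second-timeout triggers of Case 0 are omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// A request is a single-key CRUD operation; only reads/writes are shown.
enum class Op { Read, Write };

struct Request {
    Op op;
    int key;
};

class BatchManager {
public:
    // Returns the batch that had to be flushed (executed) before the new
    // request could be admitted; empty if no flush was needed.
    std::vector<Request> submit(const Request& r) {
        std::vector<Request> flushed;
        if (r.op == Op::Write) {
            // Case 2: write after write on the same key -> flush writes.
            // Case 4: write after read on the same key  -> flush reads.
            if (writeKeys.count(r.key)) flushed = flush(writes, writeKeys);
            else if (readKeys.count(r.key)) flushed = flush(reads, readKeys);
            writes.push_back(r);
            writeKeys.insert(r.key);
        } else {
            // Case 3: read after write on the same key -> flush writes.
            if (writeKeys.count(r.key)) flushed = flush(writes, writeKeys);
            // Case 1: independent reads just join the read batch.
            reads.push_back(r);
            readKeys.insert(r.key);
        }
        return flushed;
    }

    std::size_t pendingWrites() const { return writes.size(); }
    std::size_t pendingReads() const { return reads.size(); }

private:
    static std::vector<Request> flush(std::vector<Request>& batch,
                                      std::unordered_set<int>& keys) {
        std::vector<Request> out;
        out.swap(batch);  // hand the batch over for execution
        keys.clear();
        return out;
    }

    std::vector<Request> writes, reads;
    std::unordered_set<int> writeKeys, readKeys;
};
```

Submitting a write on a key that is already batched returns the old write batch for execution and starts a fresh batch with the new write, mirroring the flush-then-collect behavior in the diagrams.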
Evaluation
YCSB (Yahoo! Cloud Serving Benchmark)
YCSB client architecture [7]
Workloads

Read-only Workload R:
- 100k read operations
- All fields of a tuple are read
- Zipfian distribution of requests

Write-only Workload W:
- 1 million update operations
- Only one field is updated
- Zipfian distribution of requests

Goal for R and W: evaluating performance on independent reads or writes to find the impact of batch size

Mixed Workload M:
- 100k read/update operations (50% reads and 50% updates)
- 80% of operations access the last entries (20% of tuples)

Goal for M: What is the impact of concurrency control? Do stale reads improve performance?

Setup:
- 10k records in the table
- Each tuple consists of 10 fields (100 bytes each); key length is 24 bytes
- CPU: Intel Xeon E5-2630
- GPU: Nvidia Tesla K40c
- OpenCL 1.2
- CentOS 7.1 (kernel version 3.10.0)
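The Zipfian request skew used by workloads R and W can be illustrated with a simple inverse-CDF sampler. This is a toy stand-in for YCSB's generator (which uses a faster incremental algorithm); the class name and the exponent 0.99 (YCSB's default) are our choices:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Toy Zipfian key sampler: P(rank r) ~ 1 / r^s over n keys.
class ZipfSampler {
public:
    ZipfSampler(std::size_t n, double s, unsigned seed = 42)
        : cdf(n), rng(seed) {
        double sum = 0.0;
        for (std::size_t r = 1; r <= n; ++r) {
            sum += 1.0 / std::pow(static_cast<double>(r), s);
            cdf[r - 1] = sum;
        }
        for (double& c : cdf) c /= sum;  // normalize to a proper CDF
    }

    // Returns a key rank in [0, n); rank 0 is the hottest key.
    std::size_t next() {
        double u = dist(rng);
        // Binary search for the first CDF entry >= u.
        std::size_t lo = 0, hi = cdf.size() - 1;
        while (lo < hi) {
            std::size_t mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

private:
    std::vector<double> cdf;
    std::mt19937 rng;
    std::uniform_real_distribution<double> dist{0.0, 1.0};
};
```

With 10k keys and s = 0.99, a handful of hot keys receives a large share of the requests, which is exactly what makes the conflict cases (write-after-write etc.) frequent in workload M.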
Evaluation (workload R - read only)
- CPU & row store provides the best performance
- Small batches reduce collection time
- Very small batches for GPUs are not efficient
- Execution is faster with bigger batches
- However, it does not compensate for the slow response time
Evaluation (workload W - update only)
- CPU & row store provides the best performance
- Small batches reduce collection time
- Very small batches for GPUs are not efficient
- Execution is faster with bigger batches
- However, it does not compensate for the slow response time
Evaluation (workload M - read/update, CPU)
- Concurrency control is beneficial for the CPU: smaller batches → clients get replies quicker
- Allowing stale reads (0.01 s) improves the performance for the CPU due to shorter waiting time before execution
- Big batches are better because of the reduced waiting time in case of conflicting operations: big batches → more operations are executed & the server waits less often

[Charts: throughput and response time for workload M on the CPU, with CC vs. w/o CC]
Evaluation (workload M - read/update, GPU)
- Concurrency control is not beneficial for the GPU: smaller batches → the GPU is not utilized efficiently
- Allowing stale reads improves the performance for the GPU & column store due to shorter waiting time before execution
- Big batches are better because of the reduced waiting time in case of conflicting operations: more operations are executed → the server waits less often

[Charts: throughput and response time for workload M on the GPU, with CC vs. w/o CC]
Conclusions and Future Work
Discussion & Conclusion
The GPU batch size conundrum for OLTP:
Case 1: small batches are processed
- clients get replies quicker
- GPUs are not utilized efficiently due to the small number of data elements (this could be improved by splitting requests into fine-grained operations)

Case 2: big batches are processed

- many data elements are beneficial for GPUs
- but it takes long to collect batches and throughput can decrease (this gets faster with higher arrival rates)
+ Other considerations: transfer overhead in case the table is not stored on the GPU
Future Work
+ More complex transactions and support for rollbacks
+ Concepts for recovery and logging
+ Comparison with state of the art systems
Thank you!
Questions?
References
1. He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q. and Sander, P.V., 2009. Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS), 34(4), p.21.
2. Breß, S. and Saake, G., 2013. Why it is time for a HyPE: A hybrid query processing engine for efficient GPU coprocessing in DBMS. Proceedings of the VLDB Endowment, 6(12), pp.1398-1403.
3. Breß, S., 2014. The design and implementation of CoGaDB: A column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, 14(3), pp.199-209.
4. Heimel, M., Saecker, M., Pirk, H., Manegold, S. and Markl, V., 2013. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9), pp.709-720.
5. Appuswamy, R., Karpathiotakis, M., Porobic, D. and Ailamaki, A., 2017. The Case For Heterogeneous HTAP. In 8th Biennial Conference on Innovative Data Systems Research (No. EPFL-CONF-224447).
6. He, B. and Yu, J.X., 2011. High-throughput transaction executions on graphics processors. Proceedings of the VLDB Endowment, 4(5), pp.314-325.
7. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R. and Sears, R., 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pp.143-154. ACM.
8. MapD Product Website: https://www.mapd.com/
9. Soyata, T., 2018. GPU Parallel Program Development Using CUDA. Chapman and Hall/CRC.
10. Top 500 news: https://www.top500.org/news/new-gpu-accelerated-supercomputers-change-the-balance-of-power-on-the-top500/
11. Appuswamy, R., Anadiotis, A.C., Porobic, D., Iman, M.K. and Ailamaki, A., 2017. Analyzing the impact of system architecture on the scalability of OLTP engines for high-contention workloads. Proceedings of the VLDB Endowment, 11(2), pp.121-134.
Extra Slides: Isolation level
+ By assuming single-operation transactions and no consistency checks, our results avoid divergence and report on larger batches.
+ We are on the single-key CRUD level, not the SQL level.
+ All anomalies are preventable through K-sets and managing the dependency graph, so this more general approach supports complete serializability.
Source: https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/
Implementation: Introduction
- Two core components:
  - Process Manager (PM): in charge of how processes are executed
  - Transactional Storage Manager (TxSM): includes the concurrency control functionality
State-of-the-art PM and TxSM design for GPUs [6]
From the PM approaches we adopt the latter, but while GPUTx is stored-procedure-based, our implementation is operator-based.

Assumptions:
- Break transactions into smaller record-level operators (basic CRUD)
- More complex transactions have to be managed higher in the hierarchy
GPU computing
- For optimal execution behavior, each thread within a work group should access sequential blocks of memory (coalesced memory access)
- Communication bottleneck: data needs to be transferred from RAM to GPU and back over a PCIe bus
- Bandwidth bottleneck: the bandwidth of a PCIe bus is lower than the bandwidth of a GPU
- composed of many cores → multiple threads at a time
- efficient at data parallelism
- limited memory size

[Diagram: GPU memory hierarchy: per-thread registers and local memory; per-block shared memory; global, constant, and texture memory shared by all blocks; the GPU is connected to the CPU.]
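The coalesced-access requirement can be illustrated with plain index arithmetic (a sketch, not OpenCL code; the helper names are ours, and a real kernel would derive the thread id from get_global_id):

```cpp
#include <cassert>
#include <cstddef>

// Index a thread touches when consecutive threads read consecutive
// elements (coalesced): one memory transaction can serve the group.
std::size_t coalescedIndex(std::size_t threadId, std::size_t iteration,
                           std::size_t groupSize) {
    return iteration * groupSize + threadId;
}

// Index when each thread walks its own contiguous chunk: neighboring
// threads end up a whole chunk apart, so accesses are scattered.
std::size_t stridedIndex(std::size_t threadId, std::size_t iteration,
                         std::size_t chunkSize) {
    return threadId * chunkSize + iteration;
}
```

Both patterns cover the same range of elements; the difference is only in which addresses neighboring threads touch at the same time, which is what decides whether the hardware can merge the accesses.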
Motivation: GPUs for OLTP
GPUs for OLTP:
- The goal: to reduce the cost of database ownership by improvements in throughput
- There are challenges both from the device and from the workload
Row vs. column store
Row store:
- allows quickly performing operations that affect all attributes
- one pointer to access a tuple
- fields of a tuple are likely to be pre-fetched into the cache
- good fit for OLTP
- if only a fraction of attributes is needed, unnecessary fields are retrieved together with relevant data
[Diagram: the same table (attributes A, B, C; tuples (a1, b1, c1), (a2, b2, c2), (a3, b3, c3)) laid out row-wise and column-wise.]
Column store:
- allows reading only the necessary data
- good fit for OLAP
- better compression rate
- requires accessing each field separately
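The trade-off can be sketched with the two layouts side by side (a minimal sketch; the struct names and the single-attribute scan are our own illustration):

```cpp
#include <cassert>
#include <vector>

// Row store: one struct per tuple; all attributes travel together.
struct RowTuple {
    int a, b, c;
};
using RowTable = std::vector<RowTuple>;

// Column store: one contiguous array per attribute.
struct ColumnTable {
    std::vector<int> a, b, c;
};

// Scanning a single attribute touches every tuple in the row store,
// but only one contiguous array in the column store.
long long sumB_row(const RowTable& t) {
    long long s = 0;
    for (const RowTuple& r : t) s += r.b;  // drags a and c into cache too
    return s;
}

long long sumB_col(const ColumnTable& t) {
    long long s = 0;
    for (int v : t.b) s += v;  // reads only the needed column
    return s;
}
```

Both scans compute the same answer; the column version reads a third of the bytes, while the row version fetches each tuple whole, which is why the row store wins when all attributes are needed (OLTP) and the column store wins for selective scans (OLAP).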