Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Experimental Results Evaluating the performance impact of each of the six dimensions Measured runtime with accurate timer, reporting average of five runs Please see paper [3] for additional results and details
Aggregation is widely used to extract useful information from large volumes of data. In-memory databases are rising in popularity due to the demands of big data analytics applications.Many different algorithms and data structures can be used for in-memory aggregation, but their relative performance characteristics are inadequately studied. Prior studies in aggregationprimarily focused on small selections of query workloads and data structures. We undertook a comprehensive analysis of in-memory aggregation that encompasses 20 popular and state-of-the-art algorithms and data structures. Insights gained from theoretical and empirical evaluation are used to identify the trade-offs of each algorithm, with the goal of offering insights topractitioners. Our results allowed us to identify the best approach in different situations, based on specific characteristics of the query workload and dataset.
In-memory Aggregation for Big Data AnalyticsPuya Memarzia, Virendra C. Bhavsar, and Suprio Ray
Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada
ABSTRACT
References: [1] Puya Memarzia, Suprio Ray, and Virendra C. Bhavsar, “On Improving Data Skew Resilience In Main-memory Hash Joins”, 22nd International Database Engineering & Applications Symposium (IDEAS), May 2018 [2] Gray, Jim, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J. Weinberger. "Quickly generating billion-record synthetic databases." In Acm Sigmod Record, vol. 23, no. 2, pp. 243-252. ACM, 1994. [3] Puya Memarzia, Suprio Ray, and Virendra C. Bhavsar, “A Six-dimensional Analysis of In-memory Aggregation”, International Conference on Extending Database Technology (EDBT-2019), 2018.
(a) Vector COUNT (Q1)
Motivation Aggregation: a ubiquitous and expensive
operation commonly used in data analytics Applications: data warehousing, data
mining, business intelligence tools, … Plethora of algorithms and data structures
could be used to implement aggregation What are the tradeoffs and opportunities to
improve performance? How to select the best tool for the job?
Memory getting faster, denser, and cheaper Significant speedups (orders of magnitude)
over disk-based query processing Data resides in main memory Performance highly influenced by memory
access patterns [1]
Figure 3. Six analysis dimensions (important factors that can be evaluated independantly)
Figure 4. Variable dataset distribution - Vector COUNT (Q1) - 100M records - 1M group-by cardinality
Conclusions Aggregation heavily impacted by data and workload characteristics Sort-based approaches: generally perform best on holistic workloads Hash-based approaches: generally the fastest on distributive workloads Tree-based approaches: too slow for write-once-read-once (WORO)
workloads. Potential for write-once-ready-many (WORM) workloads orsituations requiring dynamic growth
Sponsored by:
1. Algorithm and Data Structure 3. Key Distribution and Skew2. Query and Aggregate Function
4. Group-By Cardinality 5. Dataset Size and Memory Usage 6. Concurrency and Multithreaded Scalingthreads
In-memory Query Processing
Data
App App
Data
(optional disk snapshots)
Data
Data
CPU
RAMRAM
Disk resident data
Experimental Methodology
Hash-based, sort-based, and tree-based approaches
Including popular, state-of-the-art, and custom implementations
Aggregation query categories Scalar or Vector Distributive, Algebraic, or Holistic Range predicate
Data key distribution (Zipfian, heavy hitter, moving cluster, etc.)
Data ordering (randomness/locality)
0
10
20
30
40
50
60
70
80
Rseq Rseq-Shf Hhit Hhit-shf Zipf MovCQue
ry E
xecu
tion
Tim
e (C
PU
Cyc
les)
Bill
ions
Dataset distribution
ART Judy Btree Hash_SC Hash_LP Hash_Sparse Hash_Dense Hash_LC Introsort Spreadsort
Number of unique values in group-by columns
Determines size of output
Support for concurrent (shared memory) access
Ability to scale up with additional threads
Size of input data Algorithm memory efficiency Query memory requirements
Figure 2. Comparison of disk-based and memory-based approaches
Label Type
ART Tree
Judy Tree
Btree Tree
Hash_SC Hash Table
Hash_LP Hash Table
Hash_Sparse Hash Table
Hash_Dense Hash Table
Hash_LC Hash Table
Hash_TBBSC Hash Table
Hash_LCMC Hash Table
Introsort Sort Algorithm
Spreadsort Sort Algorithm
Sort_QSLB Sort Algorithm
Sort_BI Sort Algorithm
ART Judy Btree Hash_SC Hash_LP Hash_Sparse Hash_Dense Hash_LC Introsort Spreadsort
102 103 104 105 106 107
0
5
10
15
20
25
30
35
40
Que
ry E
xecu
tion
Tim
e (C
PU
Cyc
les)
Bill
ions
Group-by Cardinality
102 103 104 105 106 107
0
10
20
30
40
50
60
Que
ry E
xecu
tion
Tim
e (C
PU
Cyc
les)
Bill
ions
Group-by Cardinality
102 103 104 105 106 107
higher cardinality = more unique values => more cache/TLB misses
Hash_LP
Spreadsort
0
2
4
6
8
10
12
14
16
18
1 2 3 4 5 6 7 8
No of threads
Que
ry E
xecu
tion
Tim
e (C
PU
Cyc
les)
Bill
ions
Hash_TBBSC Hash_LCMC Sort_QSLB Sort_BI
0
5
10
15
20
25
1 2 3 4 5 6 7 8
No of threads
Que
ry E
xecu
tion
Tim
e (C
PU
Cyc
les)
Bill
ions
Hash_TBBSC Hash_LCMC Sort_QSLB Sort_BI
(a) Vector COUNT (Q1) – 1K Groups (b) Vector COUNT (Q1) – 1M Groups
Hash_TBBSCHash_LCMC
Limited improvement beyond 4 threads
Reminder: we have four physical cores with hyperthreading
Runtime generally faster and more consistent on Hash_TBBSC
Hash_LCMC more efficient at high cardinality (1M groups)
AlgorithmQueryDataset
Six dataset distributions Extends popular datasets
originally proposed in [2] Up to 100M key-value pairs Variable cardinality/skew
Seven query workloads Covers all fundamental
aggregation categories
Initial pool of 20 algorithmsselected for evaluation
Microbenchmarks used to filterout inefficient implementations
Distributive query example (Q1)
SELECT product_id, MEDIAN(amount)FROM sales GROUP BY product_id
SELECT product_id, COUNT(*)FROM sales GROUP BY product_id
Holistic query example (Q3)
Table 1. Algorithms and Data Structures
(b) Vector MEDIAN (Q3)
Figure 5. Variable cardinality - 100M records
expensive to process: Zipf and Rseq-Shfcheaper to process: Rseq and MovC (locality matters!)
Figure 6. Multithreaded scaling (concurrent algorithms) - 100M records
Figure 1. Aggregation overview and motivation