22
MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan, NEC Labs Hakan Hacıgümüş. NEC Labs Junichi Tatemura, NEC Labs Neoklis Polyzotis, UC Santa Cruz Michael J. Carey, UC Irvine Appeared in SIGMOD 2014 *Now at HP Vertica R&D

MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

MISO: Souping Up Big Data Query Processing with a

Multistore System

Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan, NEC Labs

Hakan Hacıgümüş. NEC Labs Junichi Tatemura, NEC Labs

Neoklis Polyzotis, UC Santa Cruz Michael J. Carey, UC Irvine

Appeared in SIGMOD 2014 *Now at HP Vertica R&D

Page 2: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Analytical System in today’s Organizations

2

Big Data / Exploratory Queries

HDFS RDBMS

Business Reporting

Queries

Data important to your organization High business value Well formatted (schema-on-write) Well known workload

Data that collect but there may be a use for it Unknown business value Formatted on-the-fly as needed (schema-on-read) Evolutionary workload (i.e., adhoc)

Big Data Business Data

Page 3: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

System Characteristics

§  Big Data Store offers ability to store massive amounts of data and begin querying right away –  Executes SQL + UDFs –  Slow query performance [Pavlo et. al. 2009, Dean et al. 2010]

§  RDBMS offers the ability to achieve high performance for queries –  Executes SQL –  High data loading time [Simitsis et al. 2009]

§  Emerging system architectures combine both –  Create a “Multistore” system for workload co-processing

3

Page 4: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Multistore Architecture to Speedup Big Data Processing

4

Query Plan

Big Data / Exploratory Queries

HV (HDFS)

DW (RDBMS)

Business Reporting

Queries

DW has limited spare capacity to speedup big data queries

Big Data

Page 5: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Multistore System Architecture

§  Multistore query plan is “split” across both stores –  Transfers data and computation

during query processing

§  Two related work [Simitsis et al. 2012, DeWitt et al. 2013] focus on optimizing multistore plans

§  Let us look at the performance for

all possible splits for a single plan

5

QUERY OPTIMIZER

EXECUTION LAYER

q1

Multistore Plan

HV DW

Big Data

Page 6: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Execution Profile for all Possible Splits

0"

2"

4"

6"

8"

10"

12"

Execu&

on)Tim

e)(103)sec))

Mul&store)Plans)

DW*EXE" TRANSFER"DUMP" HV*EXE" ≈""

H"B"

S"28k"

6

Each plan represents a different split of the plan

On-the-fly data loading

Up-front data loading

Page 7: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Our Approach: Multistore Design

§  Problem is at the core of multistore architectures –  Up-front data loading too time consuming –  On-the-fly data loading results in redundant work –  Need to decide what subset of data will be useful to load into DW

§  Multistore physical design problem is a data placement

problem that determines: –  What data to materialize (which “views”) –  Where to materialize the data (which “store”)

§  We introduce MISO, our MultIStore Online tuner

7

Page 8: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Multistore System with MISO

HV DW

MISO

LOG DATA

q1

produces opportunistic views

Split across stores

q2

Meta-data Store

Keeps track of opportunistic views

Rewriter

Split across stores

Q2 benefits from the opportunistic views of Q1 [SIGMOD 2014 paper on Wednesday]

Periodically moves the useful views to DW [Today’s talk]

Tuner

Execution Layer

8

MISO is our “secret sauce” to tune the multistore

Page 9: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Our MultIStore Online Tuner (MISO)

§  Periodically reorganizes the data in each store –  Transfer views between the stores –  Adapt the design as the workload changes dynamically (system

observes a stream of queries)

§  High-level goals: 1.  Facilitate earlier splits for each query 2.  Most part of the queries run on DW, even bypassing HV

sometimes 3.  Limit the impact on Data Warehouse (DW)

§  Challenges –  Have to solve 2 physical design problems –  Problems are tied together in a unique way

9

Page 10: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Problem Statement

10

Given an •  observed query stream •  current view placement •  budget constraints,    compute new placement of views that minimizes the expected future workload cost.

QUERY OPTIMIZER

EXECUTION LAYER

q1, q2 …

MISO TUNER What-If

Plan new design

HV (HDFS)

DW (RDBMS)

View Transfer Budget

HISTORY

View Storage Budget

View Storage Budget

Big Data

Views Vh Views Vd

Page 11: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Views Vh

Storage budget Bh

HV View Transfer budget Bt

Views Vd

Storage budget Bd

DW

MISO Tuner Solution

Solve DW design FIRST

Solve HV design SECOND

11

Use as much of the transfer budget

Use leftover transfer budget

Page 12: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Selecting Views for Each Store

§  MISO solves a multi-dimensional 0-1 knapsack problem for each store –  Inputs are the set of views and budget constraints –  Dimensions are the storage and transfer budgets

§  Knapsack complication: view benefits are not independent –  A pair of views (x,y) can interact in two ways [Schnaitter et al. 2009]

•  Positive interaction •  Negative interaction

[Schnaitter et al. 2012] Packing Heuristics

12

Page 13: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Experimental Setup

§  2 Independent Clusters connected via 1 GbE –  Hadoop (HV) 15 nodes (1 head + 14 slaves)

•  Latest version –  DW 9 nodes (1 head + 8 slaves)

•  Commercial parallel data warehouse

§  All machines have identical hardware –  2x Xeon CPUs, 16 GB RAM, 2 TB disk

13

Page 14: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Experimental Setup (2)

§  Workload (See our SIGMOD DANAC 2013 paper) –  Mimic data analyst/scientist exploration to find value –  32 queries: 8 analysts each writing 4 versions of a query –  Complex queries with UDFs

•  Average query runtime on Hadoop cluster is about 10k secs

§  Data sets –  1 TB Twitter “tweet” stream (social data) –  1 TB Foursquare “check-in” stream (social data) –  12 GB Landmarks data (static/historic data)

14

Page 15: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

System Performance Comparison

0  

50  

100  

150  

200  

250  

300  

350  

400  

HV-­‐ONLY   DW-­‐ONLY   MS-­‐BASIC   HV-­‐OP   MS-­‐MISO  

TTI  (10

3  Sec)  

DW-­‐EXE   TRANSFER  HV-­‐EXE   TUNE  ETL  

15

TTI è Time To Insight. Holistic metric that includes everything, ETL, tuning, transfer, execution time.

Page 16: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

0  

50  

100  

150  

200  

250  

300  

350  

400  

HV-­‐ONLY   DW-­‐ONLY   MS-­‐BASIC   HV-­‐OP   MS-­‐MISO  

TTI  (10

3  Sec)  

DW-­‐EXE   TRANSFER   HV-­‐EXE  

TUNE   ETL  

Breakdown of Execution Time

16

(a) MS-BASIC (b) MS-MISO with small storage budget

(c) MS-MISO with large storage budget

0%#10%#20%#30%#40%#50%#60%#70%#80%#90%#100%#

1# 3# 5# 7# 9# 11# 13# 15# 32#

Percen

tage)of)Execu/o

n)Time)

Query)Rank)

%DW/EXE#%#TRANSFER#%HV/EXE#

|| 1" 3" 5" 7" 9" 11" 13" 15" 17" 19" 21" 32"

Query&Rank&

%HV+EXE"%"TRANSFER"%DW+EXE"

|| 1" 3" 5" 7" 9" 11" 13" 15" 17" 19" 21" 32"

Query&Rank&

%HV+EXE"%"TRANSFER"%DW+EXE"

||

Page 17: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

MultiStore Tuning Methods

0  

50  

100  

150  

200  

250  

300  

MS-­‐BASIC   MS-­‐OFF   MS-­‐LRU   MS-­‐MISO   MS-­‐ORA  

TTI  (10

3  Sec)  

DW-­‐EXE  TRANSFER  TUNE  

17

Page 18: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Varying Storage Budgets (Bt = 10GB)

0  

50  

100  

150  

200  

250  

300  

0.125X   0.5X   1X   2X  

TTI  (x  10

3 secon

ds)  

MS-­‐LRU   MS-­‐OFF   MS-­‐MISO  

18

Page 19: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Impact on DW

19

§  Execute a background workload on DW –  TPC-DS reporting queries consuming IO and CPU resources –  Consider DW with 20% spare capacity and 40% spare

capacity for both IO and CPU case

§  Avg. impact (slowdown) on DW queries < 2%

Page 20: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

Summary

§  Tuning the multistore important to obtaining good performance

§  Our MISO Tuner algorithm periodically reorganizes views in each store

§  This improves upon both up-front data loading and on-the-fly data loading

§  Outperforms multistore query processing without tuning

(MS-BASIC) and naïve tuning approaches such as LRU

20

Page 21: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

21

Page 22: MISO: Souping Up Big Data Query Processing with a ...€¦ · MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan,

22

www.nec-labs.com