A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching

Xiaodan Wang, Tanu Malik, Randal Burns
Johns Hopkins University

Stratos Papadomanolakis, Anastassia Ailamaki
Carnegie Mellon University

Hopkins Storage Systems Lab, Department of Computer Science

Overview

Motivation
– Data intensive scientific database federations
– Mid-tier caching improves scalability

Choosing the unit of cache replacement
– Minimize aggregate network traffic
– Improve query execution performance

Query prototypes
– Cache groups of columns
– Adapts to changes in the workload

OpenSkyQuery

Federation of sky surveys (a virtual telescope)
– Expected to grow from 30 sites to over 100

Available over the Internet (community of astronomers, educational users)

Sites are autonomous, heterogeneous, and geographically distributed

Data intensive workload (large data sets, network-bound)

Scaling through mid-tier caching
– Minimize network traffic
– Offload query processing

Caching Schema

Difficult to achieve good query performance
– Caches employ commodity hardware
– An index-free environment

Both network and query performance are sensitive to granularity of cache replacement

Fine granularity (column)
– Poor network performance at small cache sizes
– High I/O overhead

Coarse granularity (table)
– Groups unrelated columns
– Inefficient query and network performance

Contributions

Cache workload-defined groups of columns (query prototypes)

Adaptive – candidate query prototypes are discovered incrementally from the request stream

Self-organizing – each prototype describes a physical schema optimized for a specific class of queries

Improve in-cache query execution performance without sacrificing network savings

Caching for Network Savings

Identify and cache database objects that provide network savings

– Requests that access these objects are serviced from the cache

– Reduces contention for network bandwidth

Bypass Yield Caching (Malik et al., ICDE’05)
– Caching framework that uses economic principles to maximize network savings
– Database objects are ranked by yield (expected network savings per unit of cache space utilized)
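The yield ranking can be illustrated with a minimal sketch. It assumes each database object has a known size and an estimate of the network traffic it would save. The names `CacheObject`, `expected_savings_mb`, and `select_by_yield` are hypothetical, and the greedy fill is a simplification for illustration, not the Bypass Yield Caching algorithm itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheObject:
    name: str
    size_mb: float              # cache space the object occupies
    expected_savings_mb: float  # assumed estimate of network traffic avoided

    @property
    def yield_(self) -> float:
        # Yield: expected network savings per unit of cache space utilized.
        return self.expected_savings_mb / self.size_mb


def select_by_yield(objects, capacity_mb):
    """Greedily admit objects in descending yield order while they still fit."""
    chosen, used = [], 0.0
    for obj in sorted(objects, key=lambda o: o.yield_, reverse=True):
        if used + obj.size_mb <= capacity_mb:
            chosen.append(obj)
            used += obj.size_mb
    return chosen


# Toy candidates with made-up sizes and savings estimates.
candidates = [
    CacheObject("A", size_mb=100, expected_savings_mb=300),  # yield 3.0
    CacheObject("B", size_mb=50, expected_savings_mb=400),   # yield 8.0
    CacheObject("C", size_mb=80, expected_savings_mb=80),    # yield 1.0
]
cached = [o.name for o in select_by_yield(candidates, capacity_mb=150)]
```

With a 150 MB cache, the two highest-yield objects fit and the lowest-yield object is left out.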

Choosing the Unit of Cache Replacement

Semantic caching is unsuitable for astronomy
– The workload lacks locality (objects are rarely reused)
– Evaluating query containment is difficult (nested queries, complex joins, and user-defined functions are common)

Employ schema-based caching
– Queries reuse the same set of columns
– Derive popular columns from the workload
– Analogous to materialized views

File-Bundling (Otoo et al., SC’04)

Loading only columns with high yield at small cache sizes

[Figure: columns A through J, accessed by queries Q1 to Q4; the cache holds columns B, C, H, and I]

Caching columns B, C, H, and I results in no cache hits

Solution: cache groups of columns
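The file-bundling effect can be sketched as a subset test: a query is served from the cache only if every column it needs is present. The query-to-column mapping below is hypothetical, since the figure's actual mapping is not recoverable, and `is_cache_hit` is an illustrative name.

```python
def is_cache_hit(query_columns, cached_columns):
    """A query is served from the cache only if all of its columns are cached."""
    return set(query_columns) <= set(cached_columns)


# Hypothetical column sets for the four queries.
queries = {
    "Q1": {"A", "B", "C"},
    "Q2": {"B", "C", "D"},
    "Q3": {"G", "H", "I"},
    "Q4": {"H", "I", "J"},
}

# Caching the individually most popular columns serves no query in full.
column_cache = {"B", "C", "H", "I"}
column_hits = [q for q, cols in queries.items() if is_cache_hit(cols, column_cache)]

# Caching the complete column group of Q1 holds fewer columns yet serves Q1.
group_cache = {"A", "B", "C"}
group_hits = [q for q, cols in queries.items() if is_cache_hit(cols, group_cache)]
```

Under these assumed access sets, the four-column cache produces no hits, while the three-column group produces one.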

Caching Groups of Columns

Existing schema-based caching models are static (e.g. CacheTables, MTCache, TimesTen)
– Do not account for dynamic workload access patterns
– Groupings follow the physical schema of the backend database or are defined a priori
– May group columns that are rarely used together

Query prototype caching
– Identifies the best groupings from the workload
– Minimizes query execution cost against prototypes without sacrificing network savings

Query Prototype

Given a query qi, define the Query Access Set, QAS(qi), as the set of attributes accessed by qi

qi and qj share the same query prototype if they access the same attributes (QAS(qi) = QAS(qj))

Example:

SELECT objID
FROM Galaxy, SpecObj
WHERE objID = bestobjID and specclass = 2 and z between 0.121 and 0.127

QAS = {Galaxy:objID, SpecObj:bestobjID, SpecObj:specclass, SpecObj:z}
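The definition above amounts to grouping queries by set equality of their access sets. A minimal sketch follows, assuming the access sets have already been extracted (parsing SQL is out of scope here). The function names and the queries `q2` and `q3` are hypothetical; `q1` uses the QAS from the example.

```python
from collections import defaultdict


def prototype_key(qas):
    """Queries share a prototype exactly when their access sets are equal."""
    return frozenset(qas)


def group_by_prototype(query_access_sets):
    """Map each distinct QAS to the ids of the queries that share it."""
    groups = defaultdict(list)
    for query_id, qas in query_access_sets.items():
        groups[prototype_key(qas)].append(query_id)
    return dict(groups)


workload = {
    "q1": {"Galaxy:objID", "SpecObj:bestobjID", "SpecObj:specclass", "SpecObj:z"},
    # Same attributes as q1 (for example, different constants in the predicate).
    "q2": {"SpecObj:z", "SpecObj:specclass", "SpecObj:bestobjID", "Galaxy:objID"},
    "q3": {"Galaxy:objID"},
}
prototypes = group_by_prototype(workload)
```

Using a `frozenset` as the key makes the grouping independent of the order in which attributes appear in a query.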

Query Prototype

[Figure: base tables R1 (A1, A2, A3) and R2 (B1, B2, B3); the cache holds one prototype per query: (A2, A3, B1) for Q1 and (B1, B2, B3) for Q2]

QAS(Q1) = {R1:A2, R1:A3, R2:B1}
QAS(Q2) = {R2:B1, R2:B2, R2:B3}

B1 is replicated in the cache
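A short sketch of how replicated columns could be detected, using the two access sets from this slide. The variable names are hypothetical, and the sketch only counts copies; it does not model how the cache stores them.

```python
from collections import Counter

# Access sets of the two cached prototypes, taken from the slide.
cached_prototypes = {
    "Q1": {"R1:A2", "R1:A3", "R2:B1"},
    "Q2": {"R2:B1", "R2:B2", "R2:B3"},
}

# Count how many cached prototypes contain each column.
copies = Counter(col for qas in cached_prototypes.values() for col in qas)
replicated = sorted(col for col, n in copies.items() if n > 1)
```

Here `replicated` contains only `R2:B1`, the column both prototypes need.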

Workload Properties

Read-only queries

One-month trace against the Sloan Digital Sky Survey (SDSS) Data Release 4
– 2TB

1.4 million queries generating 360GB of network traffic

1176 query prototypes describe the entire workload

11 prototypes capture 91% of the queries

6 prototypes generate 89% of the network traffic
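Skew figures of this kind can be computed from per-prototype query counts. The sketch below is one way to do it, assuming such counts are available; `prototypes_needed` is a hypothetical helper and the counts are toy values, not the SDSS trace.

```python
def prototypes_needed(query_counts, target_fraction):
    """Return how many of the most frequent prototypes cover target_fraction of queries."""
    total = sum(query_counts)
    covered = 0
    for rank, count in enumerate(sorted(query_counts, reverse=True), start=1):
        covered += count
        if covered >= target_fraction * total:
            return rank
    return len(query_counts)


# Toy per-prototype query counts (made up for illustration).
toy_counts = [500, 300, 120, 40, 25, 15]
needed = prototypes_needed(toy_counts, 0.9)
```

For the toy counts, the three most frequent prototypes account for 920 of 1000 queries, so `needed` is 3.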

Experiments

Evaluate caching of tables, columns, vertical partitions, and query prototypes

AutoPart (Papadomanolakis et al., SSDBM’04)
– An automated partitioning algorithm for large scientific databases
– Groups columns in order to improve query execution performance
– Produces the best workload-driven, static grouping

Network Savings

Cache Pollution

Query Performance

Discussion

Improving network and query execution performance are complementary goals

Columns should be grouped together at small cache sizes (cache hits suffer due to file-bundling)

Column groupings should be adaptive because
– Workload access pattern is dynamic
– Indexes are not available

Questions

???

Schema Reuse

• Localized to a small subset of tables

Schema Reuse

• Similar reuse among columns

Object Reuse

• Few objects are reused

SkyQuery

Federation middleware built at Hopkins

Wrapper/Mediator architecture using web services

Load Cost

[Figure: Object Load Cost by Unit of Cache Replacement. x-axis: Unit of Cache Replacement (Qry Prototype, Column); y-axis: # Writes/MB, ticks from 0 to 200]

Scan Cost

When scanning large tables, the useful region is only a small fraction of the data read

Scans incur I/O overhead by reading data from extraneous columns

Spatial locality among related columns
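The scan overhead can be made concrete with a back-of-the-envelope sketch. It assumes a full scan reads every stored column of every row in a fragment; the column widths, row count, and the helper `scan_bytes` are all hypothetical.

```python
def scan_bytes(row_count, column_widths, stored_columns):
    """Bytes read by a full scan of a fragment that stores the given columns."""
    return row_count * sum(column_widths[c] for c in stored_columns)


# Hypothetical table: column widths in bytes and the row count are made up.
widths = {"A1": 8, "A2": 8, "A3": 4, "A4": 100, "A5": 200}
rows = 1_000_000
needed = {"A2", "A3"}  # columns the query actually uses

table_scan = scan_bytes(rows, widths, widths.keys())  # whole table cached
prototype_scan = scan_bytes(rows, widths, needed)     # only the needed group cached
useful_fraction = prototype_scan / table_scan
```

Under these assumptions, scanning the whole table reads 320 MB to use 12 MB of it, while a fragment holding only the needed columns reads just the useful 12 MB.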


Join Cost

Joining results for queries that access multiple fragments

Access should be localized to few fragments to minimize join overhead
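The join overhead can be sketched by counting the fragments a query touches. The sketch assumes disjoint fragments and a simple cost model in which reassembling rows from k fragments takes k - 1 joins; the fragment names and helper functions are hypothetical.

```python
def fragments_touched(qas, fragments):
    """Names of the cached fragments holding at least one attribute the query needs."""
    return [name for name, cols in fragments.items() if cols & qas]


def joins_required(qas, fragments):
    """Reassembling rows from k fragments takes k - 1 joins (assumed cost model)."""
    return max(len(fragments_touched(qas, fragments)) - 1, 0)


query = {"A2", "A3", "B1"}

# One fragment per column versus one fragment matching the query's access set.
per_column = {"fA2": {"A2"}, "fA3": {"A3"}, "fB1": {"B1"}, "fB2": {"B2"}}
per_prototype = {"p1": {"A2", "A3", "B1"}, "p2": {"B2", "B3"}}

column_joins = joins_required(query, per_column)
prototype_joins = joins_required(query, per_prototype)
```

With per-column fragments the query needs two joins; with a fragment that matches its access set it needs none.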
