“Good Enough” Database Caching
Hongfei Guo
University of Wisconsin-Madison
Motivation — Scaling Google
…
Motivation: Scaling a DBMS by Caching
[Figure: application servers running app-specific code query a caching DBMS, which receives asynchronous updates from the backend DBMS]
How can we tell whether the cached data is “good enough” for an application?
NO data quality requirements from the applications! NO data quality guarantees from the caching DBMS!
The Thesis
Apps: Specify data quality requirements in queries
Cache: Enforces data quality constraints [SIGMOD 2004] [SIGMOD 2004 Demo]
Cache admin: Specifies local data quality to be maintained by the cache (data quality-centric database caching model) [TR 2005] [submitted for publication]
Data quality-aware adaptive cache management [ongoing work]
[Figure: application servers, caching DBMS, backend DBMS]
Data Quality Metrics (informal)
Currency: The time elapsed since this copy became stale
Consistency: A query result is (snapshot) consistent iff it is as if evaluated from a snapshot of the master database
C&C: Currency & Consistency
Roadmap
Background
Specifying data quality constraints in SQL
Data quality-centric caching model
Enforcing data quality constraints
Other research
Future directions
Specifying Data Quality Constraints in SQL
[Guo, Larson, Ramakrishnan and Goldstein, SIGMOD 2004]
Currency requirements
Consistency requirements
Extend SQL to specify relaxed C&C requirements
Formal semantics of C&C constraints
Currency Requirements
Example 1: The caching database keeps BookCopy
Customer A is about to purchase: he wants the data to be exactly current (high data quality is preferred)
Customer B is browsing: it is OK if the data is no more than 3 days out of sync (quick response time is preferred)
Different apps may have different currency requirements for the same query
Consistency Requirements
Example 2:
SELECT * FROM Books B, Reviews R WHERE B.bid = R.bid AND B.title = 'Databases'

BookCopy:
bid  title      author
1    databases  Raghu
2    databases  Ullman

ReviewCopy:
rid  bid  text
1    1    …
2    1    …
3    2    …

Query result:
bid  title      author  bid  rid  text
1    databases  Raghu   1    1    …
1    databases  Raghu   1    2    …
2    databases  Ullman  2    3    …

Possible consistency requirements:
The whole query result is consistent
Books is consistent and Reviews is consistent
Each book is consistent with its reviews
Different apps may have different consistency requirements for the same query
Proposed SQL Syntax
SELECT * FROM Books B, Reviews R WHERE B.bid = R.bid AND B.title = 'Databases'
followed by one of the following currency clauses:
CURRENCY BOUND 10 min ON (B, R)
CURRENCY BOUND 10 min ON (B), 30 min ON (R)
CURRENCY BOUND 10 min ON (B, R) BY B.bid
Each clause names a currency bound (10 min), a consistency class (the tables listed in parentheses) and, optionally, a group-by key (BY B.bid).
Specifying Data Quality Constraints in SQL: Contributions
Extend SQL to express C&C constraints
  Single-block queries
  Multi-block (i.e., nested) queries
  Timeline constraint
Formal semantics of C&C constraints
Provides a correctness standard for using replicated or cached data
Data Quality-Centric Caching Model
[Guo, Larson and Ramakrishnan, submitted]
Cache data quality properties
Cache property specification
Maintenance and “safety”
Why Define Cache Properties?
[Figure: cache properties (= contract) sit between query processing and cache maintenance]
Cache Properties (P+3C)
Presence: per object
Consistency: a set of objects
Completeness: per predicate
Currency: object staleness
Basic Concepts
[Figure: master database with snapshots H1 and H2; cache containing View 1, View 2 and View 3; labels for objects and tables]
Cache Property Examples
[Figure: master database snapshots H1 and H2 and cached Views 1 to 3, annotated with the labels Present, Complete, Consistent and Stale point]
Currency = now - stale point
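The currency definition (now minus stale point) can be illustrated with a few lines of Python. This is only a sketch: the function name `currency` and the timestamps are assumptions made for the example, not part of the system described in the talk.

```python
from datetime import datetime, timedelta

def currency(now: datetime, stale_point: datetime) -> timedelta:
    """Currency of a cached copy: time elapsed since it became stale."""
    # A copy that has not become stale yet has currency zero.
    return max(now - stale_point, timedelta(0))

# Hypothetical example: the copy became stale at 12:00 and is read at 12:07.
stale_point = datetime(2005, 1, 1, 12, 0)
now = datetime(2005, 1, 1, 12, 7)

age = currency(now, stale_point)
within_bound = age <= timedelta(minutes=10)  # compare with a 10-minute bound
print(age, within_bound)
```

A smaller currency value means fresher data; a query's currency bound is an upper limit on this value.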
Specifying Cache Properties
Specified as integrity constraints:
Presence constraint
Consistency constraint
Completeness constraint
Presence correlation constraint
Consistency correlation constraint
Presence Constraint
CREATE VIEW AuthorCopy AS SELECT * FROM Authors
CREATE TABLE AuthorList_PCT (authorId int)
ALTER VIEW AuthorCopy ADD PRESENCE ON authorId IN (SELECT authorId FROM AuthorList_PCT)

AuthorCopy is a partially materialized view [Zhou et al. 2005]. AuthorList_PCT is its control-table and authorId is the control-key.

AuthorList_PCT:
authorId
1
2
3

AuthorCopy:
authorId  name    city
1         Alice   Madison
2         Bob     Madison
3         Cedric  Seattle

[Figure labels: Backend DBMS, Caching DBMS]
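The effect of a presence constraint can be sketched in Python: a row belongs to the partially materialized view exactly when its control-key value appears in the control-table. The names `authors` and `materialize` are hypothetical, and treating the three rows as the backend Authors table is an assumption for illustration.

```python
# Assumed backend table Authors as (authorId, name, city) tuples.
authors = [
    (1, "Alice", "Madison"),
    (2, "Bob", "Madison"),
    (3, "Cedric", "Seattle"),
]

def materialize(backend_rows, control_keys):
    """Return the rows whose control-key (authorId) is in the control-table."""
    return [row for row in backend_rows if row[0] in control_keys]

# With ids 1, 2 and 3 in AuthorList_PCT, every author row is present.
author_copy = materialize(authors, {1, 2, 3})

# Dropping id 2 from the control-table removes Bob from the view.
smaller_copy = materialize(authors, {1, 3})

print(author_copy)
print(smaller_copy)
```

Changing the contents of the control-table is thus what changes the set of cached rows.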
Consistency Constraint
CREATE TABLE CityList_CsCT (city string)
ALTER VIEW AuthorCopy ADD CONSISTENCY ON city IN (SELECT city FROM CityList_CsCT)

CityList_CsCT:
city
Madison

AuthorList_PCT:
authorId
1
2
3

AuthorCopy:
authorId  name    city
1         Alice   Madison
2         Bob     Madison
3         Cedric  Seattle

[Figure labels: Backend DBMS; a cache region is marked in AuthorCopy]
Completeness Constraint
CREATE TABLE CityList_CpCT (city string)
ALTER VIEW AuthorCopy ADD COMPLETENESS ON city IN (SELECT city FROM CityList_CpCT)

CityList_CpCT:
city
Madison

AuthorList_PCT:
authorId
1
3

AuthorCopy:
authorId  name    city
1         Alice   Madison
2         Bob     Madison
3         Cedric  Seattle

[Figure label: Backend DBMS]
Presence Correlation Constraint
ALTER VIEW BookCopy ADD PRESENCE ON authorId IN (SELECT authorId FROM AuthorCopy)

AuthorList_PCT:
authorId
1
2
3

AuthorCopy:
authorId  name    city
1         Alice   Madison
2         Bob     Madison
3         Cedric  Seattle

BookCopy:
isbn  authorId  title
111   1         aaa
222   1         bbb
333   2         ccc
444   3         ddd
555   3         eee

Cache graph: AuthorList_PCT -> AuthorCopy -> BookCopy (both edges on authorId)
Consistency Correlation Constraint
ALTER VIEW BookCopy ADD CONSISTENCY ROOT

AuthorList_PCT:
authorId
1
2
3

AuthorCopy:
authorId  name    city
1         Alice   Madison
2         Bob     Madison
3         Cedric  Seattle

BookCopy:
isbn  authorId  title
111   1         aaa
222   1         bbb
333   2         ccc
444   3         ddd
555   3         eee

Cache graph: AuthorList_PCT -> AuthorCopy -> BookCopy (both edges on authorId)
Cache Schema Example
[Figure: cache graph over AuthorList_PCT, AuthorCopy, BookCopy, ReviewerList_PCT, ReviewerCopy and ReviewCopy, with edges labeled authorId, authorId, isbn, reviewId and reviewerId]
Pull-Maintenance
Refresh a region by pulling query results
When refreshing a region, also refresh the affected closure:
All overlapping regions
All correlated regions
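Computing the affected closure can be sketched as a graph traversal. The sketch below assumes that overlap and correlation between regions are already known and given as adjacency lists, and that the closure is transitive; the region names and the function `affected_closure` are hypothetical.

```python
from collections import deque

def affected_closure(start, overlaps, correlated):
    """Return the set of regions to refresh together with `start`.

    `overlaps` and `correlated` map a region name to the regions it
    overlaps with or is correlated with (assumed inputs).
    """
    closure = {start}
    queue = deque([start])
    while queue:
        region = queue.popleft()
        neighbours = overlaps.get(region, []) + correlated.get(region, [])
        for other in neighbours:
            if other not in closure:
                closure.add(other)
                queue.append(other)
    return closure

# Hypothetical regions: R1 overlaps R2, and R2 is correlated with R3.
overlaps = {"R1": ["R2"], "R2": ["R1"]}
correlated = {"R2": ["R3"]}

print(sorted(affected_closure("R1", overlaps, correlated)))
```

Refreshing R1 here pulls in R2 through the overlap and R3 through the correlation, while refreshing R3 alone affects no other region.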
Pull-Maintenance
AuthorList_PCT:
authorId
1
3
4

TitleList_CsCT:
title
aaa

BookCopy:
isbn  authorId  title
111   1         aaa
222   1         bbb
333   1         ccc
444   3         aaa
555   4         eee
Pull-Maintenance
AuthorCopy:
authorId  name    city
1         Alice   Madison
3         Cedric  Seattle

BookCopy:
isbn  authorId  title
111   1         aaa
222   1         bbb
333   1         ccc
444   3         aaa
555   3         eee

Cache graph: AuthorList_PCT -> AuthorCopy -> BookCopy (both edges on authorId)
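Inefficient Pulling
AuthorCopy:
authorId  name    city
1         Alice   Madison
3         Cedric  Seattle

AuthorBookCopy:
authorId  isbn
1         111
1         222
1         333
3         111
3         555

BookCopy:
isbn  price  title
111   10     aaa
222   20     bbb
333   30     ccc
555   50     eee

Cache graph: AuthorCopy -> AuthorBookCopy (on authorId) -> BookCopy (on isbn)
Shared-row problem: book 111 is reachable from both author 1 and author 3, so its BookCopy row is shared.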
Issues
Inefficient pulling: Calculation of the affected closure requires checking the rows
Efficient pulling: The affected closure does NOT depend on the instance of a view
  Only requires forward pull among correlated views
Theoretical Results
Definition (Safe PMV): A partially materialized view V is safe if the following two conditions hold for every instance of the cache that satisfies all integrity constraints:
1. For any pair of regions in V, either they don’t overlap or one is contained in the other.
2. If V is gray, let X denote the set of regions in V defined by presence control-key values. X is a partitioning of V and no pair of regions in X is contained in any one region defined on V.
(Safety is a property that must hold for every instance.)
Cache schema design rules (syntactically checkable conditions, polynomial):
Rule 1: A cache graph is a DAG.
Rule 2: Only red nodes can have independent completeness or consistency control-tables.
Rule 3: Every PMV with more than one parent must be a red circle.
Rule 4: If a PMV has the shared-row problem according to Lemma 5.2, then it cannot be gray.
Rule 5: A PMV cannot have non-compatible control-tables.
Theorem: Given a cache schema <W, E>, if it satisfies the design rules, then every PMV in W is safe. Conversely, if the schema violates one of these rules, there is an instance of the cache satisfying all specified integrity constraints in which some PMV is unsafe.
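As an illustration of why the design rules are syntactically checkable, Rule 1 (the cache graph is a DAG) can be tested with a standard topological sort. The sketch below uses Kahn's algorithm; the function `is_dag` and the edge-list representation are assumptions for the example, and the other rules are not covered.

```python
from collections import deque

def is_dag(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Repeatedly remove nodes that have no remaining incoming edges.
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Nodes on a cycle are never removed, so a cycle leaves visited < len(nodes).
    return visited == len(nodes)

nodes = ["AuthorList_PCT", "AuthorCopy", "BookCopy"]
edges = [("AuthorList_PCT", "AuthorCopy"), ("AuthorCopy", "BookCopy")]

print(is_dag(nodes, edges))
print(is_dag(nodes, edges + [("BookCopy", "AuthorCopy")]))
```

The check looks only at the schema graph, never at cached rows, and runs in time linear in the number of nodes and edges.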
Data Quality-Centric Caching Model: Contributions
Four cache properties
Specifying cache properties
Cache property unit: cache region
Safe views and efficient pulling
Provides an abstraction layer (contract) between query processing and cache maintenance
Enforcing Data Quality Constraints
Overview
Simple case: View-level consistency [Guo, Larson, Ramakrishnan and Goldstein, SIGMOD 2004] [Guo, Larson, Ramakrishnan and Goldstein, SIGMOD 2004 Demo]
  Implemented in MS SQL Server code base
General case: Row-level consistency [Guo, Larson and Ramakrishnan, submitted]
MTCache Framework [Larson et al. 2004]
[Figure: queries enter the caching DBMS (query optimizer, execution engine, shadow databases, local materialized views) and results are returned; the backend DBMS sits behind it. The extension to the MTCache framework adds queries with relaxed C&C requirements, cache region metadata and heartbeat tables.]
Simple Case Assumptions
Fully materialized views
Each view is consistent
Push-based maintenance, e.g., MS replication service
C&C Tracking Mechanism
Consistency tracking: cache region (CR)
  The unit of update propagation
  Data mutually consistent all the time
  Properties, e.g., est. delay, est. interval
Currency tracking: heartbeat table
[Figure: views V1 to V5 at the backend and in the cache, grouped into cache regions CR1 and CR2; the heartbeat table has columns Cid and Timestamp]
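A heartbeat table can be used for currency tracking roughly as follows. The Python sketch assumes that each cache region's row carries the last heartbeat timestamp propagated from the backend, so the time since that timestamp bounds how stale the region can be. The dictionary `heartbeat`, the function `satisfies_bound` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical local heartbeat table: cache region id (Cid) -> Timestamp.
heartbeat = {
    1: datetime(2005, 1, 1, 12, 10),
    2: datetime(2005, 1, 1, 12, 0),
}

def satisfies_bound(region_id, bound, now):
    """True if the region's data is known to be at most `bound` old."""
    return now - heartbeat[region_id] <= bound

now = datetime(2005, 1, 1, 12, 15)
bound = timedelta(minutes=10)

print(satisfies_bound(1, bound, now))  # region 1 is 5 minutes behind
print(satisfies_bound(2, bound, now))  # region 2 is 15 minutes behind
```

Because all views in a cache region are updated together, a single timestamp per region is enough for every view in it.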
Extension to MTCache Framework
[Figure: MTCache architecture with the caching DBMS (query optimizer, execution engine, shadow databases, local materialized views, cache region metadata, heartbeat tables) and the backend DBMS]
For queries with relaxed C&C requirements, the optimizer produces the best plan that:
Satisfies consistency requirements
Includes run-time currency checking
Extension to the Optimizer
Compile-time consistency checking
Run-time currency checking
Cost estimation
Consistency Checking
Enforced at optimization time
Immediately prune a sub-plan if it violates consistency constraints
Q1: σ(Books ⋈ Reviews) CURRENCY 5 ON (Books, Reviews)
[Figure: a plan in which a merge join combines a local scan of Reviews with a remote query on Books]
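The pruning test can be sketched in Python under a strong simplification: each table in a sub-plan is read from one source (a cache region or the backend), and tables count as mutually consistent only if they share a source. The function `violates_consistency` and the source labels are hypothetical; a real optimizer tracks richer plan properties.

```python
def violates_consistency(leaf_sources, consistency_classes):
    """Return True if some consistency class spans more than one source.

    leaf_sources: table name -> source label ('backend' or a cache region id).
    consistency_classes: list of sets of tables that must be mutually consistent.
    """
    for tables in consistency_classes:
        sources = {leaf_sources[t] for t in tables}
        if len(sources) > 1:
            return True
    return False

# Q1 requires Books and Reviews to be mutually consistent.
classes = [{"Books", "Reviews"}]

mixed_plan = {"Books": "backend", "Reviews": "CR1"}  # remote Books, local Reviews
local_plan = {"Books": "CR1", "Reviews": "CR1"}      # both from one cache region

print(violates_consistency(mixed_plan, classes))
print(violates_consistency(local_plan, classes))
```

Under this assumption, the plan that mixes a remote query with a local scan is pruned, and the all-local plan from a single region survives.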
Run-time Currency Checking
When view V matches expression E
Currency guard: Check if local view V satisfies the currency requirement
[Figure: E is replaced by a SwitchUnion operator whose currency guard selects between the local plan using V and the remote plan requesting E]
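The behavior of the SwitchUnion operator can be sketched as a function that evaluates a guard and then runs exactly one of its two inputs. This is a minimal illustration, not the SQL Server implementation; `switch_union` and the two stand-in plans are hypothetical.

```python
def switch_union(guard, local_plan, remote_plan):
    """Run the local plan if the guard passes, otherwise the remote plan."""
    return local_plan() if guard() else remote_plan()

# Hypothetical stand-ins for the two branches; each returns (source, rows).
def local_rows():
    return ("local", [(1, "Customer#1")])

def remote_rows():
    return ("remote", [(1, "Customer#1")])

fresh = switch_union(lambda: True, local_rows, remote_rows)   # guard passes
stale = switch_union(lambda: False, local_rows, remote_rows)  # guard fails

print(fresh[0], stale[0])
```

Since the guard is evaluated at run time, the same compiled plan can use the local view when it is fresh enough and fall back to the backend otherwise.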
Cost Estimation
Cost for the SwitchUnion operator:
C = p * C_local + (1 - p) * C_remote + C_cg
p: probability that the local branch will be used
C_local: cost of executing the local branch
C_remote: cost of executing the remote branch
C_cg: cost of currency checking
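The cost formula translates directly into code. The sketch below is a plain transcription with a range check on p; the function name and the example costs are made up for illustration.

```python
def switch_union_cost(p, c_local, c_remote, c_cg):
    """Expected cost C = p * C_local + (1 - p) * C_remote + C_cg."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return p * c_local + (1.0 - p) * c_remote + c_cg

# Hypothetical costs: local branch 10, remote branch 100, guard check 1.
print(switch_union_cost(0.5, 10.0, 100.0, 1.0))  # 56.0
```

The guard cost is paid whichever branch runs, which is why C_cg is added outside the weighted sum.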
Estimating p
Compute p from three parameters:
f: estimated refresh interval
d: estimated minimal delay
B: currency bound
p = 0 if B - d ≤ 0
p = (B - d)/f if 0 < B - d ≤ f
p = 1 if B - d > f
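The three cases for p can be written as a small function. This is a sketch of the piecewise formula only; the function name, the rejection of non-positive f and the sample values are assumptions added for the example.

```python
def local_probability(bound, delay, interval):
    """Estimate p from currency bound B, minimal delay d and refresh interval f."""
    if interval <= 0:
        raise ValueError("refresh interval f must be positive")
    slack = bound - delay  # B - d
    if slack <= 0:
        return 0.0               # the view can never be fresh enough
    if slack <= interval:
        return slack / interval  # fresh enough for part of each refresh cycle
    return 1.0                   # always fresh enough

# Hypothetical values in minutes: d = 5 and f = 10, with varying bounds B.
print(local_probability(5, 5, 10))
print(local_probability(10, 5, 10))
print(local_probability(30, 5, 10))
```

The resulting value is the p used when costing the SwitchUnion operator.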
Changing The Assumptions
Fully materialized views -> Partially materialized views
Consistent views -> Row-level consistency
Push-based maintenance -> Pull-based maintenance
More general algorithms
Run-time check for consistency constraints that cannot be validated at compile time
Run-time C&C Checking
When view V matches expression E
Currency guard: Check if local view V satisfies the currency requirement
Consistency guard: Check if local view V satisfies the consistency requirement
[Figure: E is replaced by a SwitchUnion operator whose C&C guard selects between the local plan using V and the remote plan requesting E]
Performance Evaluation Goals
Currency guards overhead
Consistency guards overhead
  Simple checks
  A spectrum of checks ranging from simple to complicated
Experimental Setting
Back-end hosts a TPCD database tpcd1gh with scale factor 1.0 (~1GB)
Cache server has a shadow of tpcd1gh
Two local views: custCopy, orderCopy
LAN connection between cache and backend server
Queries Used
Qa: key select
SELECT * FROM Customers C WHERE c_custkey=1 CURRENCY 10 ON (C)
Qb: join query
SELECT * FROM Customers C, Orders O WHERE c_custkey=o_custkey AND c_custkey=1 CURRENCY 10 ON (C), 20 ON (O)
Qc: non-key select
SELECT * FROM Customers C WHERE c_nationkey = 1 CURRENCY 10 ON (C)
Currency Guards Overhead
[Figure: bar chart of execution time (ms), axis from 0 to 250, for queries Qa, Qb and Qc in the local and remote cases, with the currency guard portion marked. Overhead labels in the chart: 15.26%, 21.3%, 3.66%, 3.59%, 4.31%, 0.41%]
Simple Consistency Guards Overhead
[Figure: bar chart of execution time (ms), axis from 0 to 80, for queries Qa, Qb and Qc in the local and remote cases, with the consistency guard portion marked. Overhead labels in the chart: 16.56%, 14.00%, 1.72%, 1.59%, 1.66%, 1.6%]
Single Table Consistency Guard Overhead
[Figure: bar chart of execution time (ms), axis from 0 to 7, for guards A11a, A11b, A12, S11 and S12 in the local and remote cases, with the consistency guard portion marked (Qa is used). Overhead labels in the chart: 62.85%, 16.98%, 71.41%, 6.06%, 8.79%, 7.48%, 2.33%, 4.95%, 58.32%, 23.77%]
Enforcing Data Quality Constraints: Contributions
Algorithms for enforcing C&C constraints in query processing
Implemented a prototype in MS SQL Server code base for a restricted case
Provides DBMS guarantees for C&C requirements
Related Work
Relaxing data quality
  Distributed databases
    Read-only transactions [Garcia-Molina et al. 1982]
    Demarcation protocol [Barbará et al. 1992]
    TACC [Yu et al. 2000]
    Epsilon-serializability [Pu et al. 1992]
  Warehousing and web views
    WebViews [Labrinidis et al. 2003]
    FAS [Röhm et al. 2002]
    Obsolescent views [Gal 1999]
    Distributed views [Segev et al. 1990]
  Replica management
    Quasi-copies [Alonso et al. 1998], [Gallersdörfer et al. 1995]
    Good-enough views [Seligman et al. 1997]
    TRAPP [Olston et al. 2000]
Caching
  Database caching
    DBCache [Altinel et al. 2003]
    Constraint-based database caching [Härder et al. 2004]
    Mid-tier caching [TimesTen 2002]
    Shared-storage caching [Khalil et al. 2002]
  Others
    Semantic caching [Dar et al. 1996]
    Cache in Postgres [Stonebraker et al. 1990]
    Predicate-based caching [Keller et al. 1996]
    WATCHMAN [Scheuermann et al. 1996]
    Cache investment [Kossmann et al. 2000]
    DECAF [Kiernan and Carey 2000]
    Proxy caching [Luo et al. 2001]
Uniqueness of our approach (query-centric):
  Query: Specifies fine-grained C&C constraints
  Admin: Flexible data quality control in terms of granularity and properties
  Caching DBMS: Provides C&C guarantees for individual queries
Other Research
UW:
  Indexing large-scale, dynamic one-dimensional intervals [In preparation]
    A family of data structures
    Differed index
  Evaluating different locking protocols for database caching [ongoing]
  Quality of service evaluation of multicast streaming protocols [SIGMETRICS 2002]
MS: SchemaGen project [Software released]
  Designed and implemented a relational schema generator for annotated XML schemas
MSR-Redmond: RECYCLE project
  Added support for update statistics for query result caching in SQL Server
Future Directions
Improve current prototype
  Read-write transactions?
  Time-line constraints?
Automate cache design/tuning
  How to get a good cache schema?
Apply “good enough” to other forms of replication
  Indexing data?
Summary
Problem: Gap between applications and caching DBMS
A comprehensive solution
  Specifying data quality constraints
  Data quality-centric cache model
  Enforcing data quality constraints
  Data quality-aware adaptive cache management
Questions?