1
Querying Big Data: Theory and Practice
Theory
– Tractability revisited for querying big data
– Parallel scalability
– Bounded evaluability
Techniques
– Parallel algorithms
– Bounded evaluability and access constraints
– Query-preserving compression
– Query answering using views
– Bounded incremental query processing
TDD: Topics in Distributed Databases
2
Tractability revisited for querying big data
Parallel scalability
Bounded evaluability
New theory for querying big data
To query big data, we have to determine whether it is feasible at all.
For a class Q of queries, can we find an algorithm T such that given any Q in Q and any big dataset D, T efficiently computes the answers Q(D) of Q in D within our available resources?
Fundamental question
Is this feasible or not for Q?
BD-tractability
4
The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Polynomial time queries become intractable on big data!
What happens when it comes to big data?
Using an SSD with a read rate of 6GB/s, a linear scan of a dataset D would take
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
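These scan times are simple arithmetic; a quick sanity check, assuming the "6G/s" above means a 6 GB/s sequential read rate:

```python
# Sanity check of the scan times quoted above (assumption: "6G/s"
# means 6 GB/s of sequential read bandwidth).
RATE = 6e9  # bytes per second

def scan_time_seconds(size_bytes):
    """Time for one linear scan at RATE bytes/second."""
    return size_bytes / RATE

print(scan_time_seconds(1e15) / 86400)          # ~1.93 days for 1 PB
print(scan_time_seconds(1e18) / (86400 * 365))  # ~5.29 years for 1 EB
```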
5
Complexity classes within P
NC (Nick's class): highly parallel feasible
• parallel polylog time
• polynomially many processors
Too restrictive to include practical queries feasible on big data
BIG open: P = NC?
L: O(log n) space; NL: nondeterministic O(log n) space; polylog-space: log^k(n) space
Polynomial-time algorithms are no longer tractable on big data, so we may consider "smaller" complexity classes
parallel log^k(n) time
as hard as P = NP
L ⊆ NL ⊆ polylog-space; NC ⊆ P
6
Tractability revisited for queries on big data
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined,
D' = Π(D); for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC)
BD-tractable queries are feasible on big data
Does it work? If a linear scan of D could be done in log(|D|) time: 15 seconds when D is of 1PB instead of 1.9 days; 18 seconds when D is of 1EB rather than 5.28 years
[Figure: preprocessing Π maps D to Π(D); queries Q1, Q2, … are then answered against Π(D) in parallel log^k(|D| + |Q|) time]
7
BD-tractable queries
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined,
D' = Π(D); for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC)
Preprocessing: a common practice of database people
one-time process, offline, once for all queries in Q: indices, compression, views, incremental computation, …
not necessarily reduce the size of D
BDTQ0: the set of all BD-tractable query classes
in parallel with more resources
What is the maximum size of D’ ?
8
What query classes are BD-tractable?
Boolean selection queries. Input: a dataset D. Query: does there exist a tuple t in D such that t[A] = c?
Build a B+-tree on the A-column values in D. Then all such selection queries can be answered in O(log(|D|)) time.
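A minimal sketch of this idea, using Python's `bisect` over a sorted copy of the A-column as a stand-in for the B+-tree (the function names are illustrative):

```python
import bisect

def preprocess(D, A):
    """One-time PTIME preprocessing: sort the A-column of D."""
    return sorted(t[A] for t in D)

def exists(index, c):
    """Boolean selection in O(log |D|) time: is there a t with t[A] = c?"""
    i = bisect.bisect_left(index, c)
    return i < len(index) and index[i] == c

D = [{"A": 7}, {"A": 3}, {"A": 42}]
idx = preprocess(D, "A")
print(exists(idx, 42), exists(idx, 5))  # True False
```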
Some natural query classes are BD-tractable
Relational algebra + set recursion on ordered relational databases
D. Suciu and V. Tannen: A query language for NC, PODS 1994
What else?
Graph reachability queries. Input: a directed graph G. Query: does there exist a path from node s to t in G?
NL-complete
9
Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.
Can we make it BD-tractable?
No. The problem is well known to be P-complete! We need PTIME to process each query (G, (u, v)) ! Preprocessing does not help us answer such queries.
Breadth-Depth Search (BDS). Input: an undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V. Question: is u visited before v in the breadth-depth search of G?
The search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s's children are visited, it continues with the node on the top of the stack, which then plays the role of s.
Is this problem (query class) BD-tractable?
D is empty, Q is (G, (u, v)). What is P-complete?
10
Make queries BD-tractable
Factorization: partition instances to identify a data part D for preprocessing, and a query part Q for operations
BDTQ: The set of all query classes that can be made BD-tractable
Preprocessing: Π(G) performs BDS on G, and returns a list M consisting of the nodes of V in the order they are visited
For any query (u, v), whether u occurs before v can be decided by a binary search on M, in O(log |M|) time
Breadth-Depth Search (BDS) Input: An unordered graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V Question: Is u visited before v in the breadth-depth search of G?
Factorization: D is G = (V, E), Q is (u, v)
after proper factorization
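A sketch of this preprocessing, under one reading of the BDS description above (children are visited in increasing numbering, and pushed so that the lowest-numbered one is expanded first); a position table stands in for the binary search on M:

```python
def bds_order(adj, s):
    """Breadth-depth search: visit all children of the current node,
    push them on a stack, then continue from the top of the stack."""
    seen, order, stack = {s}, [s], []
    u = s
    while True:
        children = [v for v in sorted(adj.get(u, [])) if v not in seen]
        for v in children:            # visit all children of u at once
            seen.add(v)
            order.append(v)
        for v in reversed(children):  # lowest-numbered child ends on top
            stack.append(v)
        if not stack:
            return order
        u = stack.pop()               # it now plays the role of s

def preprocess(adj, s):
    """Pi(G): run BDS once and record each node's visit position."""
    return {u: i for i, u in enumerate(bds_order(adj, s))}

adj = {1: [2, 3], 2: [4], 3: [], 4: []}
pos = preprocess(adj, 1)   # visit order: [1, 2, 3, 4]
print(pos[3] < pos[4])     # True: 3 is visited before 4
```

Note that a plain DFS from node 1 would visit 4 before 3; BDS visits all children of a node before descending.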
11
Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine what query classes are tractable on big data.
Are we done yet?
No. A number of questions arise in connection with any complexity class!
Reductions: how to transform a problem to another in the class that we know how to solve, and hence make it BD-tractable?
Complete problems: Is there a natural problem (a class of queries) that is the hardest one in the complexity class? A problem to which all problems in the complexity class can be reduced
How large is BDTQ? BDTQ0? Compared to P? NC?
Analogous to our familiar NP-complete problems
Why do we care?
Fundamental to any complexity classes: P, NP, …
Name one NP-complete problem that you know
Why do we need reduction?
12
Reductions
Departing from our familiar polynomial-time reductions, we need reductions that are in NC, and deal with both data D and query Q!
transform a given problem to one that we know how to solve
NC-factor reductions ≤NC: a pair of NC functions that allow re-factorizations (repartitioning the data part and the query part), for BDTQ
F-reductions ≤F: a pair of NC functions that do not allow re-factorizations, for BDTQ0
transformations for making queries BD-tractable
Properties: transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then Q1 ≤NC Q3 (similarly for ≤F)
compatibility: if Q1 ≤NC Q2 and Q2 is in BDTQ, then so is Q1.
if Q1 ≤F Q2 and Q2 is in BDTQ0, then so is Q1.
to determine whether a query class is BD-tractable
13
Complete problems for BDTQ
A query class Q is complete for BDTQ if Q is in BDTQ, and moreover, for any query class Q' in BDTQ, Q' ≤NC Q
A query class Q is complete for BDTQ0 if Q is in BDTQ0, and for any query class Q' in BDTQ0, Q' ≤F Q
There exists a natural query class Q that is complete for BDTQ
BDS is both P-complete and BDTQ-complete!
Is there a complete problem for BDTQ?
Breadth-Depth Search (BDS)
What does this tell us?
14
Is there a complete problem for BDTQ0?
A query class Q is complete for BDTQ0 if Q is in BDTQ0, and for any query class Q' in BDTQ0, Q' ≤F Q
Unless P = NC, a query class complete for BDTQ0 is a witness in P \ NC
Whether P = NC is as hard a question as whether P = NP. If we can find a complete problem for BDTQ0 and show that it is not in NC, then we settle the big open question of whether P = NC
It is hard to find a complete problem for BDTQ0
An open problem
Comparing with P and NC
NC ⊆ BDTQ = P
All PTIME query classes can be made BD-tractable!
Unless P = NC, NC ⊆ BDTQ0 ⊊ P
Unless P = NC, not all PTIME query classes are BD-tractable
need proper factorizations to answer PTIME queries on big data
How large is BDTQ? How large is BDTQ0?
separation
[Figure: PTIME split into BD-tractable and not BD-tractable query classes]
Properly contained in P
Not all polynomial-time queries are BD-tractable
16
Polynomial hierarchy revisited
Tractability revisited for querying big data
NP and beyond
[Figure: P split into BD-tractable query classes (parallel polylog time after preprocessing) and those that are not BD-tractable]
17
What can we get from BD-tractability?
Guidelines for the following.
Why we need to study theory for querying big data
What query classes are feasible on big data?
What query classes can be made feasible to answer on big data?
How to determine whether it is feasible to answer a class Q of queries on big data?
Reduce Q to a complete problem Qc for BDTQ via ≤NC
If so, how to answer queries in Q?
• Identify factorizations (≤NC reductions) such that Q ≤NC Qc
• Compose the reduction and the algorithm for answering queries of Qc
BDTQ0
BDTQ
Parallel scalability
18
Parallel query answering
BD-tractability is hard to achieve.
Parallel processing is widely used, given more resources
[Figure: shared-nothing architecture: 10,000 processors, each (P) with its own memory (M) and database partition (DB), linked by an interconnection network]
Parallel scalable: the more processors, the "better"?
Using 10,000 SSDs of 6GB/s, a linear scan of D might take: 1.9 days/10,000 ≈ 16 seconds when D is of 1PB (10^15 B); 5.28 years/10,000 ≈ 4.63 hours when D is of 1EB (10^18 B)
Only ideally!
How to define “better”?
20
Degree of parallelism -- speedup
Speedup: for a given task, TS/TL. TS: time taken by a traditional DBMS; TL: time taken by a parallel system with more resources. TS/TL: more resources mean proportionally less time for the task
Linear speedup: the speedup is N while the parallel system has N times resources of the traditional system
resources
Speed: throughput, response time
Linear speedup
Question: can we do better than linear speedup?
21
Degree of parallelism -- scaleup
Scaleup: TS/TL. A task Q, and a task QN, N times bigger than Q
A DBMS MS, and a parallel DBMS ML, N times larger
TS: time taken by MS to execute Q
TL: time taken by ML to execute QN
Linear scaleup: TL = TS, i.e., the time is constant if the resources increase in proportion to the increase in problem size
resources and problem size
TS/TL
Question: can we do better than linear scaleup?
22
Better than linear scaleup/speedup?
NO, even hard to achieve linear speedup/scaleup!
Startup costs: initializing each process
Interference: competing for shared resources (network, disk, memory or even locks)
Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part
Linear speedup is the best we can hope for -- optimal!
Give 3 reasons
Think of blocking in MapReduce
Data shipment cost for shared-nothing architectures. In the real world, linear scaleup is too ideal to achieve!
A weaker criterion: the more processors are available, the less response time it takes.
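The obstacles above can be seen in a toy cost model (the 0.1 startup unit and 10% skew are illustrative assumptions, not measurements):

```python
def parallel_time(work, n, startup=0.1, skew=1.1):
    """Response time with n processors: per-process startup cost plus
    the largest (skewed) part of the divided task."""
    return startup * n + skew * (work / n)

work = 1000.0  # sequential time Ts, in arbitrary units
for n in (10, 100, 1000):
    print(n, round(work / parallel_time(work, n), 1))  # speedup
```

The speedup is sublinear for every n, and past some point adding processors increases the response time, which is why linear speedup is already the ideal.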
23
Parallel query answering
Given a big graph G and n processors S1, …, Sn: G is partitioned into fragments (G1, …, Gn), and distributed to the n processors: Gi is stored at Si
Performance guarantees: bounds on response time and data shipment
Performance
Parallel query answering. Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q. Output: Q(G), the answer to Q in G
Response time (aka parallel computation cost): the interval from the time when Q is submitted to the time when Q(G) is returned
Data shipment (aka network traffic): the total amount of data shipped between different processors, as messages
What is it? Why do we need to worry about it?
24
Parallel scalability
A distributed algorithm is useful if it is parallel scalable
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Complexity:
t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors
Parallel scalable: if
T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k)
Polynomial reduction (including the cost of data shipment; k is a constant)
When G is big, we can still query G by adding more processors, if we can afford them
25
Linear scalability
Querying big data by adding more processors
An algorithm T for answering a class Q of queries
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Algorithm T is linearly scalable in computation if its parallel complexity is a function of |Q| and |G|/n, and in data shipment if the total amount of data shipped is a function of |Q| and n
The more processors, the less response time
Independent of the size |G| of big G
Is it always possible?
26
Graph pattern matching via graph simulation
Input: a graph pattern Q and a graph G
Output: Q(G), a binary relation S on the nodes of Q and G
O((|V| + |VQ|) (|E| + |EQ|)) time
• each node u in Q is mapped to a node v in G, such that (u, v) ∈ S
• for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G, such that (u', v') ∈ S
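To make the definition concrete, here is a naive fixpoint computation of the maximum simulation relation S (a sketch only; it does not implement the O((|V| + |VQ|) (|E| + |EQ|)) algorithm behind the bound above, and the names are illustrative):

```python
from collections import defaultdict

def graph_simulation(VQ, EQ, V, E, lq, lg):
    """Remove (u, v) while some pattern edge (u, u') has no matching
    data edge (v, v') with (u', v') still in S; return the fixpoint."""
    succ_q, succ_g = defaultdict(list), defaultdict(list)
    for u, u2 in EQ:
        succ_q[u].append(u2)
    for v, v2 in E:
        succ_g[v].append(v2)
    S = {(u, v) for u in VQ for v in V if lq[u] == lg[v]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(S):
            for u2 in succ_q[u]:
                if not any((u2, v2) in S for v2 in succ_g[v]):
                    S.discard((u, v))
                    changed = True
                    break
    return S

VQ, EQ, lq = ["u1", "u2"], [("u1", "u2")], {"u1": "a", "u2": "b"}
V, E = ["v1", "v2", "v3"], [("v1", "v2")]
lg = {"v1": "a", "v2": "b", "v3": "a"}
print(sorted(graph_simulation(VQ, EQ, V, E, lq, lg)))
# [('u1', 'v1'), ('u2', 'v2')]
```

Node v3 is dropped because it has no b-labeled successor to match the pattern edge.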
Parallel scalable?
27
Impossibility
Nontrivial to develop parallel scalable algorithms
There exists NO algorithm for distributed graph simulation that is parallel scalable in either computation or data shipment
Why?
Pattern: 2 nodes. Graph: 2n nodes, distributed to n processors
Possibility: when G is a tree, parallel scalable in both response time and data shipment
28
Weak parallel scalability
Rationale: we can partition G as preprocessing, such that |Ef| is minimized (an NP-complete problem, but there are effective heuristic algorithms), and when G grows, |Ef| does not increase substantially
Algorithm T is weakly parallel scalable in computation if its parallel computation cost is a function of |Q|·|G|/n and |Ef|, and in data shipment if the total amount of data shipped is a function of |Q| and |Ef|
The cost is not a function of |G| in practice
Doable: graph simulation is weakly parallel scalable
edges across different fragments
29
MRC: Scalability of MapReduce algorithms
Characterize scalable MapReduce algorithms in terms of disk usage, memory usage, communication cost, CPU cost and rounds.
For a constant ε > 0 and a dataset D, with |D|^(1-ε) machines, a MapReduce algorithm is in MRC if
Disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total
Memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total
Data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) data, O(|D|^(2-2ε)) in total
CPU: in each round, each machine takes time polynomial in |D|
Number of rounds: polylog in |D|, that is, log^k(|D|)
the larger D is, the more processors
The response time is still polynomial in |D|
30
MMC: a revision of MRC
For a dataset D and n machines, a MapReduce algorithm is in MMC if
Disk: each machine uses O(|D|/n) disk, O(|D|) in total
Memory: each machine uses O(|D|/n) memory, O(|D|) in total
Data shipment: in each round, each machine sends or receives O(|D|/n) data, O(|D|) in total
CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine
Number of rounds: O(1), a constant number of rounds
Speedup: O(Ts/n) time; the more machines are used, the less time is taken
Compared with BD-tractability and parallel scalability
What algorithms are in MRC? Recursive computation? Restricted to MapReduce
Bounded evaluability
31
32
Scale independence
Input: a class Q of queries
Question: can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
|DQ| ≤ M for a fixed bound M, and
Q(D) = Q(DQ)?
Making the cost of computing Q(D) independent of |D|!
Independent of the size of D
Particularly useful for
a single dataset D, e.g., the social graph of Facebook
Minimum DQ: the necessary amount of data for answering Q
33
Facebook: Graph Search
Find me restaurants in New York my friends have been to in 2013
select rid
from friend(pid1, pid2), person(pid, name, city),
dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and
pid2 = dine.pid and city = NYC and yy = 2013
How many tuples do we need to access?
Facebook: 5000 friends per person
Each year has at most 366 days
Each person dines at most once per day
pid is a key for relation person
Access constraints (real-life limits)
34
Bounded query evaluation
Accessing 5000 + 5000 + 5000 * 366 tuples in total
Fetch 5000 pid's for friends of p0 -- 5000 friends per person
For each pid, check whether she lives in NYC -- 5000 person tuples
For each pid living in NYC, find the restaurants where they dined in 2013 -- 5000 * 366 tuples at most
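The plan above can be sketched with three in-memory hash indexes that realize the access constraints (the index layout and names are illustrative, not Facebook's actual storage):

```python
def restaurants(friend_idx, person_city, dine_idx, p0, city, year):
    """Bounded plan: the data accessed is capped by the access schema,
    independent of the total database size."""
    answers = set()
    for pid in friend_idx.get(p0, []):             # at most 5000 pid's
        if person_city.get(pid) == city:           # 1 tuple per pid (key)
            answers |= set(dine_idx.get((pid, year), []))  # at most 366 rid's
    return answers

friend_idx = {"p0": ["p1", "p2"]}
person_city = {"p1": "NYC", "p2": "LA"}
dine_idx = {("p1", 2013): ["r1", "r2"], ("p2", 2013): ["r3"]}
print(sorted(restaurants(friend_idx, person_city, dine_idx, "p0", "NYC", 2013)))
# ['r1', 'r2']
```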
Contrast to Facebook : more than 1.38 billion nodes, and over 140 billion links
A query plan
Find me restaurants in New York my friends have been to in 2013
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
35
Access constraints
On a relation schema R: X → (Y, N). X, Y: sets of attributes of R; for any X-value, there exist at most N distinct Y-values. Index on X for Y: given an X-value, find the relevant Y-values
Access schema: A set of access constraints
Combining cardinality constraints and index
friend(pid1, pid2): pid1 → (pid2, 5000) -- 5000 friends per person
dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366) -- each year has at most 366 days and each person dines at most once per day
person(pid, name, city): pid → (city, 1) -- pid is a key for relation person
Examples
36
Finding access schema
On a relation schema R: X → (Y, N)
Bounded evaluability: only a small number of access constraints
Functional dependencies X → Y: X → (Y, 1)
Keys X: X → (R, 1)
Domain constraints, e.g., each year has at most 366 days
Real-life bounds: 5000 friends per person (Facebook)
The semantics of real-life data, e.g., accidents in the UK from 1975-2005:
• (dd, mm, yy) → (aid, 610): at most 610 accidents in a day
• aid → (vid, 192): at most 192 vehicles in an accident
Discovery: an extension of functional dependency discovery (e.g., TANE)
How to find these?
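One way to approach "how to find these" is to validate candidate constraints against the data, in the spirit of dependency discovery. A small validator for X → (Y, N), with relations as lists of dicts (`holds` is a hypothetical name):

```python
from collections import defaultdict

def holds(relation, X, Y, N):
    """Check access constraint X -> (Y, N): every X-value is associated
    with at most N distinct Y-values in the relation."""
    groups = defaultdict(set)
    for t in relation:
        groups[tuple(t[a] for a in X)].add(tuple(t[a] for a in Y))
    return all(len(ys) <= N for ys in groups.values())

person = [{"pid": 1, "city": "NYC"}, {"pid": 2, "city": "NYC"}]
print(holds(person, ["pid"], ["city"], 1))  # True: pid -> (city, 1)
```

A discovery procedure could enumerate candidate (X, Y) pairs, TANE-style, and record the smallest N for which the check passes.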
37
Bounded queries
Input: a class Q of queries, an access schema A
Question: can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
|DQ| ≤ M, and
Q(D) = Q(DQ)?
Examples
The graph search query at Facebook
All Boolean conjunctive queries are bounded
– Boolean: Q(D) is true or false
– Conjunctive: SPC (selection, projection, Cartesian product)
Boundedness: to decide whether it is possible to compute Q(D)
by accessing a bounded amount of data at all
What are these?
But how to find DQ?
38
Boundedly evaluable queries
Input: a class Q of queries, an access schema A
Question: can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
|DQ| ≤ M,
Q(D) = Q(DQ), and moreover,
DQ can be identified in time determined by Q and A?
Examples
The graph search query at Facebook
All Boolean conjunctive queries are bounded, but are not necessarily effectively bounded!
effectively find
If Q is boundedly evaluable, for any big D, we can efficiently compute Q(D) by accessing a bounded amount of data!
39
Deciding bounded evaluability
Input: A query Q, an access schema A Question: Is Q boundedly evaluable under A?
Yes. doable
Conjunctive queries (SPC) with restricted query plans:
• Characterization: sound and complete rules
• PTIME algorithms for checking effective boundedness and for generating query plans, in |Q| and |A|
Relational algebra (SQL): undecidable
Many practical queries are in fact boundedly evaluable!
What can we do?
• Special cases
• Sufficient conditions
Parameterized queries in recommendation systems, even SQL
Techniques for querying big data
40
An approach to querying big data
41
Given a query Q, an access schema A and a big dataset D
1. Decide whether Q is effectively bounded under A
2. If so, generate a bounded query plan for Q
3. Otherwise, do one of the following:
① Extend access schema or instantiate some parameters of Q, to make Q effectively bounded
② Use other tricks to make D small (to be seen shortly)
③ Compute approximate query answers to Q in D
Very effective for conjunctive queries
77% of conjunctive queries are boundedly evaluable
Efficiency: 9 seconds vs. 14 hours of MySQL
60% of graph pattern queries are boundedly evaluable (via subgraph isomorphism)
Improvement: 4 orders of magnitude
42
Bounded evaluability using views
Input: a class Q of queries, a set V of views, an access schema A
Question: can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
|DQ| ≤ M,
there is a rewriting Q' of Q using V,
Q(D) = Q'(DQ, V(D)), and
DQ can be identified in time determined by Q, V, and A?
access views, and additionally a bounded amount of data
Query Q may not be boundedly evaluable, but may be boundedly evaluable with views!
43
Incremental bounded evaluability
Input: a class Q of queries, an access schema A
Question: can we find by using A, for any query Q ∈ Q, any dataset D, and any changes ∆D to D, a fraction DQ of D such that
|DQ| ≤ M,
Q(D ⊕ ∆D) = Q(D) ⊕ ∆Q(D, DQ), and
DQ can be identified in time determined by Q and A?
access an additional bounded amount of data
Query Q may not be boundedly evaluable, but may be incrementally boundedly evaluable!
Parallel query processing
44
Divide and conquer
partition G into fragments (G1, …, Gn), distributed to various sites
manageable sizes
upon receiving a query Q,
• evaluate Q( Gi ) in parallel
• collect partial answers at a coordinator site, and assemble them to find the answer Q( G ) in the entire G
evaluate Q on smaller Gi
Parallel processing = Partial evaluation + message passing
[Figure: Q(G) computed as partial answers Q(G1), Q(G2), …, Q(Gn) over the fragments]
Graph pattern matching in GRAPE: 21 times faster than MapReduce
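A minimal sketch of this scheme with a selection as Q; the union at the coordinator assumes Q distributes over fragments, which holds for selections but not for every query class:

```python
def evaluate_parallel(fragments, predicate):
    """Evaluate Q on each fragment Gi (conceptually in parallel), then
    assemble the partial answers at the coordinator."""
    partial = [{t for t in Gi if predicate(t)} for Gi in fragments]
    return set().union(*partial)

G = list(range(10))
fragments = [G[0:4], G[4:7], G[7:10]]  # (G1, G2, G3)
print(sorted(evaluate_parallel(fragments, lambda t: t % 3 == 0)))
# [0, 3, 6, 9]
```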
Query preserving compression
45
The cost of query processing: f(|G|, |Q|)
Query preserving compression <R, P> for a class L of queries
For any data collection G, Gc = R(G)
For any Q in L, Q(G) = P(Q, Gc)
[Figure: R compresses G to Gc; post-processing P answers Q on Gc so that Q(G) = P(Q, Gc)]
reduce the parameter?
18 times faster on average for reachability queries
In contrast to lossless compression, retain only relevant information for answering queries in L. Query preserving!
No need to restore the original graph G or decompress the data.
Better compression ratio!
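A classic instance of this idea for reachability queries: collapse each strongly connected component into a single node (Kosaraju's algorithm in the sketch below). Reachability is preserved exactly, while G itself cannot be recovered:

```python
def condense(adj):
    """Collapse SCCs of a digraph (dict: node -> successor list);
    return (comp, cadj): each node's component, and the compressed graph."""
    order, seen = [], set()
    def dfs1(u):                      # first pass: record finish order
        seen.add(u)
        for v in adj.get(u, []):
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in adj:
        if u not in seen:
            dfs1(u)
    radj = {}                         # reversed graph
    for u in adj:
        for v in adj[u]:
            radj.setdefault(v, []).append(u)
    comp = {}
    def dfs2(u, c):                   # second pass: assign components
        comp[u] = c
        for v in radj.get(u, []):
            if v not in comp:
                dfs2(v, c)
    for u in reversed(order):
        if u not in comp:
            dfs2(u, u)
    cadj = {c: set() for c in comp.values()}
    for u in adj:
        for v in adj[u]:
            if comp[u] != comp[v]:
                cadj[comp[u]].add(comp[v])
    return comp, cadj

def reachable(comp, cadj, s, t):
    """Answer the reachability query on the compressed graph only."""
    stack, seen = [comp[s]], {comp[s]}
    while stack:
        u = stack.pop()
        if u == comp[t]:
            return True
        for v in cadj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

adj = {"a": ["b"], "b": ["a", "c"], "c": []}   # a and b form an SCC
comp, cadj = condense(adj)
print(len(cadj))                               # 2 nodes after compression
print(reachable(comp, cadj, "a", "c"))         # True
print(reachable(comp, cadj, "c", "a"))         # False
```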
Answering queries using views
The complexity is no longer a function of |G|
can we compute Q(G) without accessing G, i.e., independent of |G|?
The cost of query processing: f(|G|, |Q|)
Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
Q and Q' are equivalent: for any G, Q(G) = Q'(V(G))
Q' only accesses V(G)
V(G) is often much smaller than G (4% -- 12% on real-life data)
Improvement: 31 times faster for graph pattern matching
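A toy instance (the view definition and labels are invented for illustration): Q asks for the nodes that have an incoming edge from an "a"-labeled node, and the rewriting Q' runs on the materialized view V(G) alone, never touching G again:

```python
G_edges = [(1, 2), (2, 3), (3, 1), (1, 4)]
label = {1: "a", 2: "b", 3: "a", 4: "b"}

# V(G): materialized once, offline -- the edges leaving an "a"-labeled node.
VG = [(u, v) for (u, v) in G_edges if label[u] == "a"]

# Q'(V(G)): computed on the view only, independent of |G|.
answer = sorted({v for (_, v) in VG})
print(answer)  # [1, 2, 4]
```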
Incremental query answering
Minimizing unnecessary recomputation
Incremental query processing:
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
(∆G: changes to the input; Q(G): old output; ∆M: changes to the output; Q(G⊕∆G): new output)
When changes ∆G to the data G are small, typically so are the
changes ∆M to the output Q(G⊕∆G)
Changes ∆G are typically small
Compute Q(G) once, and then incrementally maintain it
Real-life data is dynamic – constantly changes, ∆G
Re-compute Q(G ⊕ ∆G) starting from scratch?
5%/week in Web graphs
At least twice as fast for pattern matching for changes up to 10%
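A minimal sketch with a counting query as Q; the '+'/'-' encoding of ∆G is an assumption of the sketch. The point is that ∆M is computed from the 3 changes alone, not from the 1000 tuples of G:

```python
def eval_q(G, p):
    """Q: count the tuples satisfying predicate p (computed once)."""
    return sum(1 for t in G if p(t))

def incr_q(old_answer, delta, p):
    """Incremental step: fold the changes into the old answer."""
    dm = sum(1 if op == '+' else -1 for op, t in delta if p(t))
    return old_answer + dm

G = list(range(1000))
p = lambda t: t % 2 == 0
M = eval_q(G, p)                              # old output: 500
delta = [('+', 1000), ('+', 1002), ('-', 0)]  # small ∆G
print(incr_q(M, delta, p))                    # 501
```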
Complexity of incremental problems
Bounded: the cost is expressible as f(|CHANGED|, |Q|)?
Complexity analysis in terms of the size of changes
Incremental query answering
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of query processing: a function of |G| and |Q|
incremental algorithms: |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M
The updating cost that is inherent to the incremental problem itself
Incremental algorithms?
Incremental graph simulation: bounded
49
A principled approach: Making big data small
Bounded evaluable queries
Parallel query processing (MapReduce, GRAPE, etc)
Query preserving compression: convert big data to small data
Query answering using views: make big data small
Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data
. . .
Combinations of these can do much better than MapReduce!
Including but not limited to graph queries
Yes, MapReduce is useful, but it is not the only way!
50
Summary and Review
What is BD-tractability? Why do we care about it?
What is parallel scalability? Name a few parallel scalable algorithms
What is bounded evaluability? Why do we want to study it?
How to make big data “small”?
Is MapReduce the only way for querying big data? Can we do better than it?
What is query preserving data compression? Query answering using views? Bounded incremental query answering?
If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?
51
Projects (1)
Prove or disprove that one of the following query classes is
• BD-tractable,
• parallel scalable,
• in MMC
If so, give an algorithm as a proof. Otherwise, prove the impossibility, but identify practical sub-classes that are scalable.
The query classes include
• Distance queries on graphs
• Graph pattern matching by subgraph isomorphism
• Graph pattern matching by graph simulation
• Subgraph isomorphism and graph simulation on trees
Experimentally evaluate your algorithms
Both impossibility and possibility results are useful!
Pick one of these
52
Projects (2)
Improve the performance of graph pattern matching via subgraph isomorphism, using one of the following approaches:
• query-preserving graph compression
• query answering using views
Prove the correctness of your algorithm, give complexity analysis and provide performance guarantees
Experimentally evaluate your algorithm and demonstrate the improvement
A research and development project
53
Projects (3)
It is known that graph pattern matching via graph simulation can benefit from:
• query-preserving graph compression
• query answering using views
W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012. (query-preserving compression)
W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views, ICDE 2014. (query answering using views)
Implement one of the algorithms. Experimentally evaluate your algorithm and demonstrate the improvement. Bonus: can you combine the two approaches and verify its benefit?
A development project
54
Projects (4)
Find an application with a set of SPC (conjunctive) queries and a dataset
Identify access constraints on your dataset for your queries
Implement an algorithm that, given a query in your class, decides whether the query is boundedly evaluable under your access constraints
If so, generate a query plan to evaluate your queries by accessing a bounded amount of data
Experimentally evaluate your algorithm and demonstrate the improvement
A development project
55
Projects (5)
Write a survey on techniques for querying big data, covering• parallel query processing, • data compression• query answering using views• incremental query processing• …
Develop a good understanding on the topic
Survey:• A set of 5-6 representative papers• A set of criteria for evaluation• Evaluate each model based on the criteria• Make recommendation: what to use in different applications
56
Reading: data quality
W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012. (available upon request)
– Data consistency (Chapter 2)– Entity resolution (record matching; Chapter 4)– Information completeness (Chapter 5)– Data currency (Chapter 6)– Data accuracy (SIGMOD 2013 paper) – Deducing the true values of objects in data fusion (Chap. 7)
Reading for the next week: http://homepages.inf.ed.ac.uk/wenfei/publication.html
57
1. M. Arenas, L. E. Bertossi, J. Chomicki: Consistent Query Answers in Inconsistent Databases, PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf
2. I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf
4. W. Fan and F. Geerts , Relative information completeness, PODS, 2009.
5. Y. Cao. W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD 2013.
6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.