Scalable Discovery of Unique Column...

Preview:

Citation preview

Scalable Discovery of Unique Column Combinations

1Hasso Plattner Institute 2Qatar Computing Research Institute

Arvid Heise1 Jorge-Arnulfo Quiané-Ruiz2 Ziawasch Abedjan1 Anja Jentzsch1 Felix Naumann1

Motivation Problem Large datasets at very fast rates: -  Social networks -  Scientific applications -  Transactional applications -  ...

Finding uniques is crucial for: -  Query optimization -  Anomaly detection -  Data modeling -  Indexing -  ...

Ø Unique column combinations often unknown in big datasets Ø Exponential search space

Finding unique column combinations is an NP-Hard problem

DUCC

Results

* Work done in the context of Metanome: joint project between HPI and QCRI that provides a fresh view on data profiling and aims at providing scalability for Big Data. Website: http://www.hpi.uni-potsdam.de/naumann/projekte/metanome_data_profiling.html

Lattice with eight attributes

minimal unique

unique

maximal non-unique

non-unique

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE o  Graph divided into uniques and non-uniques

o  Minimal uniques summarize uniques

o  Maximal non-uniques summarize non-uniques

o  DuCC simultaneously detects (non-)uniques

o  Completeness verified (proof in paper)

Modeled as graph coloring problem

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

1.

2. 3.

4. 5.

6.

7.

Random walk graph traversal o  Randomly pick next CC

from current CC o  Go upwards if CC is non-

unique o  Go downwards if CC is

unique o  Trace back if no pruned CC

is left o  Check for holes by

comparing min uniques and max non-uniques

minimal unique

unique

maximal non-unique

non-unique

o  Worker receives next CC from traversal algorithm

o  Performs fast check with PLI intersection

o  Maintains pruning data structures

o  Seed provider detects holes and calculates restart CC

Modular architecture

o  Workers exchange minimal uniques and maximal non-uniques

o  Scale up with local event bus

o  Scale out with distributed event bus (ZooKeeper)

o  Fault-tolerance with Map-only Hadoop Job

Scalable architecture

Scaling the Number of Columns on 100,000 Rows

Scaling the Number of Rows on 15 Columns

Scaling up/out Solution space and checks

Related Work Gordian: [Gordian: efficient and scalable discovery of composite keys. VLDB’06] -  Row-based approach -  Prefix-tree data organization

HCA: [Advancing the discovery of unique column combinations. CIKM’11] -  Column-based approach -  Histograms- and value-counting-based

Swan: [Detecting Unique Column Combinations on Dynamic Data. ICDE’14] -  Builds on top of DUCC -  Focus on dealing with incremental data

Number of Columns

log

Exec

utio

n Ti

me

(s)

5 10 15 20 25 30 35 40 45 50 55 60 65 701

10

100

1000

10000GORDIANHCADUCC

● ●

log

Exec

utio

n Ti

me

(s)

Number of Columns

5 10 15 1610^0

10^1

10^2

10^3

10^4● GORDIAN

HCADUCC

log

Exec

utio

n Ti

me

(s)

Number of Rows

10,000 100,000 539,1651

10

100

1000

10000GORDIANHCADUCC

log

Exec

utio

n Ti

me

(s)

Number of Rows

10,000 100,000 1,000,000 6,001,2151

10

100

1000

10000GORDIANHCADUCC

● ● ● ●●

●●

Number of Columns

Exec

utio

n Ti

me

(s)

0

5,000

10,000

15,000

20,000

20 25 30 35 40 45 50

● 1 node2 nodes4 nodes8 nodes

Minimal uniques

Uni

quen

ess

chec

ks

50,000

100,000

150,000

200,000

250,000

300,000

350,000

400,000

450,000

10,000 30,000 50,000 70,000 90,000 110,000

R2 = 0.9983

Uniprot

Uniprot

TPC-H

TPC-H

Recommended