Scalable Discovery of Unique Column Combinations

(1) Hasso Plattner Institute  (2) Qatar Computing Research Institute

Arvid Heise (1), Jorge-Arnulfo Quiané-Ruiz (2), Ziawasch Abedjan (1), Anja Jentzsch (1), Felix Naumann (1)

Motivation

Large datasets at very fast rates:
-  Social networks
-  Scientific applications
-  Transactional applications
-  ...

Finding uniques is crucial for:
-  Query optimization
-  Anomaly detection
-  Data modeling
-  Indexing
-  ...

Problem

-  Unique column combinations often unknown in big datasets
-  Exponential search space

Finding unique column combinations is an NP-hard problem

Results

* Work done in the context of Metanome: joint project between HPI and QCRI that provides a fresh view on data profiling and aims at providing scalability for Big Data. Website: http://www.hpi.uni-potsdam.de/naumann/projekte/metanome_data_profiling.html

Lattice with eight attributes

minimal unique

unique

maximal non-unique

non-unique

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Modeled as graph coloring problem

o  Graph divided into uniques and non-uniques

o  Minimal uniques summarize uniques

o  Maximal non-uniques summarize non-uniques

o  DUCC simultaneously detects (non-)uniques

o  Completeness verified (proof in paper)
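The four node classes of the lattice can be illustrated with a brute-force sketch. It is not the DUCC algorithm: it checks every column combination, which is only feasible for tiny tables. The toy table and the function names `is_unique` and `summarize_lattice` are assumptions made for illustration.

```python
from itertools import combinations

def is_unique(rows, cols):
    """True if no two rows agree on all of the given column indices."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

def summarize_lattice(rows, n_cols):
    """Brute force: classify every non-empty column combination, then keep
    only the minimal uniques and the maximal non-uniques."""
    uniques, non_uniques = [], []
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            target = uniques if is_unique(rows, cols) else non_uniques
            target.append(frozenset(cols))
    # A unique is minimal if no proper subset of it is unique.
    minimal = {u for u in uniques if not any(v < u for v in uniques)}
    # A non-unique is maximal if no proper superset of it is non-unique.
    maximal = {n for n in non_uniques if not any(n < m for m in non_uniques)}
    return minimal, maximal

# Hypothetical toy table with columns A, B, C (indices 0, 1, 2).
rows = [
    (1, "x", "p"),
    (1, "y", "p"),
    (2, "x", "q"),
    (2, "y", "q"),
]
minimal, maximal = summarize_lattice(rows, 3)
# minimal uniques: AB and BC; maximal non-uniques: B and AC
```

Every other unique is a superset of AB or BC, and every other non-unique is a subset of B or AC, which is why the two sets summarize the whole lattice.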

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Random walk graph traversal

o  Randomly pick next CC from current CC

o  Go upwards if CC is non-unique

o  Go downwards if CC is unique

o  Trace back if no unpruned CC is left

o  Check for holes by comparing min uniques and max non-uniques
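The traversal steps above can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions, not the implementation from the paper: all names (`discover`, `walk`, `lookup`, `minimal_hitting_sets`) are hypothetical, uniqueness is checked by a plain scan, and the hole check assumes the set-theoretic fact that a combination is unique exactly when it is not contained in any maximal non-unique, so candidate minimal uniques are the minimal hitting sets of the complements of the maximal non-uniques found so far.

```python
import random

def is_unique(rows, cols):
    keys = [tuple(row[c] for c in cols) for row in rows]
    return len(set(keys)) == len(keys)

def minimal_hitting_sets(sets):
    """Minimal sets that share at least one element with every given set."""
    hitting = {frozenset()}
    for s in sets:
        extended = set()
        for h in hitting:
            if h & s:
                extended.add(h)
            else:
                extended.update(h | {c} for c in s)
        hitting = {h for h in extended if not any(o < h for o in extended)}
    return hitting

def discover(rows, n_cols, rng):
    all_cols = frozenset(range(n_cols))
    uniques, non_uniques = set(), set()  # CCs checked against the data

    def lookup(cc):
        """Pruning: infer the status of cc from checked CCs, else None."""
        if any(u <= cc for u in uniques):
            return True   # superset of a known unique
        if any(cc <= n for n in non_uniques):
            return False  # subset of a known non-unique
        return None

    def neighbors(cc, unique):
        if unique:  # go downwards: direct subsets
            return [cc - {c} for c in cc] if len(cc) > 1 else []
        return [cc | {c} for c in all_cols - cc]  # go upwards: direct supersets

    def walk(seed):
        stack = [seed]
        while stack:
            cc = stack[-1]
            unique = lookup(cc)
            if unique is None:
                unique = is_unique(rows, sorted(cc))
                (uniques if unique else non_uniques).add(cc)
            unpruned = [n for n in neighbors(cc, unique) if lookup(n) is None]
            if unpruned:
                stack.append(rng.choice(unpruned))  # random next CC
            else:
                stack.pop()                         # trace back

    seeds = [frozenset([c]) for c in all_cols]
    while seeds:
        rng.shuffle(seeds)
        for seed in seeds:
            if lookup(seed) is None:
                walk(seed)
        min_u = {u for u in uniques
                 if all(lookup(n) is False for n in neighbors(u, True))}
        max_n = {n for n in non_uniques
                 if all(lookup(p) is True for p in neighbors(n, False))}
        # Hole check: candidates not yet known as minimal uniques are new seeds.
        candidates = minimal_hitting_sets([all_cols - n for n in max_n])
        seeds = [h for h in candidates if h and h not in min_u]
    return min_u, max_n

# Hypothetical toy table with columns A, B, C (indices 0, 1, 2).
rows = [(1, "x", "p"), (1, "y", "p"), (2, "x", "q"), (2, "y", "q")]
min_u, max_n = discover(rows, 3, random.Random(0))
# min_u: AB and BC; max_n: B and AC
```

The order in which nodes are visited depends on the random seed, but the returned sets do not, because the loop only stops once the hole check yields no further seeds.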


Modular architecture

o  Worker receives next CC from traversal algorithm

o  Performs fast check with PLI intersection

o  Maintains pruning data structures

o  Seed provider detects holes and calculates restart CC
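The uniqueness check a worker performs can be sketched with PLIs, commonly expanded as position list indexes: for each column, the row positions that share a value are grouped into clusters, and clusters of size one are dropped. A column combination is unique exactly when the intersection of its columns' PLIs is empty. The function names `build_pli` and `intersect` and the toy columns are assumptions for illustration, not the poster's code.

```python
from collections import defaultdict

def build_pli(column):
    """Group row positions by value and keep only clusters with duplicates."""
    clusters = defaultdict(list)
    for pos, value in enumerate(column):
        clusters[value].append(pos)
    return [c for c in clusters.values() if len(c) > 1]

def intersect(pli_a, pli_b):
    """PLI of the combined columns: rows that collide in both inputs."""
    # Map each row position to the index of its cluster in pli_a.
    cluster_of = {pos: i for i, cluster in enumerate(pli_a) for pos in cluster}
    result = []
    for cluster in pli_b:
        groups = defaultdict(list)
        for pos in cluster:
            if pos in cluster_of:
                groups[cluster_of[pos]].append(pos)
        result.extend(g for g in groups.values() if len(g) > 1)
    return result

# Hypothetical toy columns A, B, C of a four-row table.
col_a = [1, 1, 2, 2]
col_b = ["x", "y", "x", "y"]
col_c = ["p", "p", "q", "q"]
pli_a, pli_b, pli_c = build_pli(col_a), build_pli(col_b), build_pli(col_c)

pli_ab = intersect(pli_a, pli_b)  # empty: AB is unique
pli_ac = intersect(pli_a, pli_c)  # two clusters remain: AC is non-unique
```

Because singleton clusters are dropped, the intermediate PLIs shrink as more columns are combined, which keeps checks higher up in the lattice cheap.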

Scalable architecture

o  Workers exchange minimal uniques and maximal non-uniques

o  Scale up with local event bus

o  Scale out with distributed event bus (ZooKeeper)

o  Fault-tolerance with Map-only Hadoop Job
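How workers might share findings over a local event bus can be sketched in a few lines. This is purely illustrative: the classes `LocalEventBus` and `Worker`, the event kinds, and the synchronous delivery are assumptions, and the distributed bus (ZooKeeper) and the Hadoop job are not modeled.

```python
class LocalEventBus:
    """In-process bus: delivers every published event to all subscribers."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, sender, kind, cc):
        for handler in self.subscribers:
            handler(sender, kind, cc)

class Worker:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.min_uniques = set()
        self.max_non_uniques = set()
        bus.subscribe(self.on_event)

    def report(self, kind, cc):
        """Store a local finding and broadcast it to the other workers."""
        self._store(kind, cc)
        self.bus.publish(self.name, kind, cc)

    def on_event(self, sender, kind, cc):
        if sender != self.name:  # own findings are already stored
            self._store(kind, cc)

    def _store(self, kind, cc):
        target = self.min_uniques if kind == "min_unique" else self.max_non_uniques
        target.add(frozenset(cc))

    def is_pruned(self, cc):
        """True if the status of cc follows from known (non-)uniques."""
        cc = frozenset(cc)
        return (any(u <= cc for u in self.min_uniques)
                or any(cc <= n for n in self.max_non_uniques))

bus = LocalEventBus()
w1, w2 = Worker("w1", bus), Worker("w2", bus)
w1.report("min_unique", {"A", "B"})
w2.report("max_non_unique", {"A", "C"})
```

After these two reports, each worker can skip combinations the other has already settled, for example `w2.is_pruned({"A", "B", "C"})` and `w1.is_pruned({"A"})` both hold.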

Scaling the Number of Columns on 100,000 Rows

Scaling the Number of Rows on 15 Columns

Scaling up/out

Solution space and checks

Related Work

Gordian: [Gordian: efficient and scalable discovery of composite keys. VLDB’06]
-  Row-based approach
-  Prefix-tree data organization

HCA: [Advancing the discovery of unique column combinations. CIKM’11]
-  Column-based approach
-  Histogram- and value-counting-based

Swan: [Detecting Unique Column Combinations on Dynamic Data. ICDE’14]
-  Builds on top of DUCC
-  Focus on dealing with incremental data

[Plots not reproduced; recoverable axis and legend details:]

-  x-axis: Number of Columns (5 to 70); y-axis range 1 to 10000; series: GORDIAN, HCA, DUCC
-  x-axis: Number of Columns (5 to 16); y-axis range 10^0 to 10^4; series: GORDIAN, HCA, DUCC
-  x-axis: Number of Rows (10,000 to 539,165); y-axis range 1 to 10000; series: GORDIAN, HCA, DUCC
-  x-axis: Number of Rows (10,000 to 6,001,215); y-axis range 1 to 10000; series: GORDIAN, HCA, DUCC
-  x-axis: Number of Columns (20 to 50); tick values 10,000 to 20,000; series: 1 node, 2 nodes, 4 nodes, 8 nodes
-  Minimal uniques (Uniprot): tick ranges 50,000 to 450,000 and 10,000 to 110,000; R2 = 0.9983
