BOOSTING TUPLE PROPAGATION IN MULTI-RELATIONAL CLASSIFICATION

Dept. of Mathematics, University of Calabria, Italy

Lucantonio Ghionna, Gianluigi Greco

15th International Database Engineering & Applications Symposium
Lisbon, Portugal, 21-23 September, 2011

Outline

Background
  • Multi-Relational Classification

Problem Complexity
  • Tractability Islands
  • Heuristic Approaches

DBMS Implementation
  • System Design
  • Experiments

Conclusion
  • Remarks

Multi-Relational Classification

Target relation: Loan. Each tuple has a class label, indicating whether the loan is paid on time.

Database schema (figure):

  • Loan(loan-id, account-id, date, amount, duration, payment)
  • Account(account-id, district-id, frequency, date)
  • Order(order-id, account-id, bank-to, account-to, amount, type)
  • Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol)
  • Disposition(disp-id, account-id, client-id)
  • Card(card-id, disp-id, type, issue-date)
  • Client(client-id, birth-date, gender, district-id)
  • District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)

How to make decisions on loan granting?

Multi-Relational Classification

Loan Applications

Applicant      Loan ID   Account ID   Amount   Duration   Decision
Applicant #1   1         124          1000     12         Yes
Applicant #2   2         124          4000     12         Yes
Applicant #3   3         108          10000    24         No
Applicant #4   4         45           12000    36         No

Accounts

Account ID   Frequency   Open date   District ID
128          monthly     02/27/96    61820
108          weekly      09/23/95    61820
45           monthly     12/09/94    61801
67           weekly      01/01/95    61822

Other relations: Orders, Districts

Search for good predicates across multiple relations.

Do good payers access their account with a "monthly" frequency?
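The question above can be checked mechanically by following the foreign key from each loan to its account and tallying the class labels of the loans whose account satisfies the predicate. The snippet below is a minimal sketch of that idea; the rows, the field names (`account_id`, `frequency`, `decision`) and the helper `label_counts` are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Hypothetical rows, loosely modelled on the Loan Applications and Accounts tables.
loans = [
    {"loan_id": 1, "account_id": 124, "decision": "Yes"},
    {"loan_id": 2, "account_id": 124, "decision": "Yes"},
    {"loan_id": 3, "account_id": 108, "decision": "No"},
    {"loan_id": 4, "account_id": 45, "decision": "No"},
]
accounts = [
    {"account_id": 124, "frequency": "monthly"},
    {"account_id": 108, "frequency": "weekly"},
    {"account_id": 45, "frequency": "monthly"},
    {"account_id": 67, "frequency": "weekly"},
]

def label_counts(loans, accounts, predicate):
    """Count class labels of the loans whose account satisfies `predicate`."""
    by_id = {a["account_id"]: a for a in accounts}  # index on the join attribute
    counts = Counter()
    for loan in loans:
        account = by_id.get(loan["account_id"])
        if account is not None and predicate(account):
            counts[loan["decision"]] += 1
    return counts

monthly = label_counts(loans, accounts, lambda a: a["frequency"] == "monthly")
weekly = label_counts(loans, accounts, lambda a: a["frequency"] == "weekly")
print(dict(monthly), dict(weekly))
```

A predicate is a good candidate when the two class counts it produces are strongly unbalanced; how that is scored (information gain, foil gain, and so on) is left open here.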

Solving CLP: State of the Art

Flattening approach [Krogel03]
  • Build the universal relation through joins
  • Combinatorial explosion of data, large tables with many attributes [Mugg92]

Upgrading approach [Xu06]
  • Keep the universal relation virtual by propagating labels through foreign keys
  • Global perspective [Xu06]
  • Local perspective [Blockeel03, Yin04, Xu06]

Contributions

  • We show that the propagation problem can be solved effectively on databases whose hypergraphs are nearly acyclic
  • We design effective algorithms for the global and local perspectives
  • We provide an implementation of a complete JDBC-based system for tuple propagation
  • Experiments

Problem Complexity
  • Tractability Islands
  • Heuristic Approaches

Global Perspective: Tractability Islands of CLP

Good news: exponentially large universal relations do not imply CLP intractability [Xu06]

Relations: p1(X,Y), p2(X,Z,W), p5(Y,T,X). Each tuple carries a count and a pair <·,·> of per-class counts (classes C1, C2). The animation shows the annotations after each propagation step:

Relation   Tuple          Initial    Bottom up (1)   Bottom up (2)   Top down (1)   Top down (2)
p1         x1 y1          1 <0,0>    1 <1,0>         2 <2,0>         2 <2,0>        2 <2,0>
p1         x2 y1          1 <0,0>    1 <0,1>         1 <0,1>         1 <0,1>        1 <0,1>
p1         x1 y2          1 <0,0>    1 <1,0>         1 <1,0>         1 <1,0>        1 <1,0>
p2         x1 z1 v1 C1    1 <1,0>    1 <1,0>         1 <1,0>         2 <2,0>        2 <2,0>
p2         x2 z2 v2 C2    1 <0,1>    1 <0,1>         1 <0,1>         1 <0,1>        1 <0,1>
p5         y1 t1 x1       1 <0,0>    1 <0,0>         1 <0,0>         1 <0,0>        1 <1,0>
p5         y1 t2 x1       1 <0,0>    1 <0,0>         1 <0,0>         1 <0,0>        1 <1,0>
p5         y1 t2 x2       1 <0,0>    1 <0,0>         1 <0,0>         1 <0,0>        1 <0,1>

CLP is tractable on dependency graphs whose undirected versions are (forests of) trees [Xu06]
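The bottom-up and top-down passes can be sketched as message passing over a join tree. The code below is a minimal illustration under stated assumptions, not the algorithm of [Xu06] or of the authors: the function names, the dictionary encoding of tuples, the attribute name `CL` for the class, and the choice of p1 as root are all hypothetical, and a nested-loop join is used for brevity where a real system would index on the join attributes. It computes, for every tuple, how many tuples of the (never materialized) full join contain it, once per class label. It uses exact join semantics, so a dangling tuple such as p1(x1, y2), which joins with no p5 tuple, gets count 0.

```python
from math import prod

def joins(t, u):
    """True if tuples t and u (dicts) agree on every attribute they share."""
    return all(t[a] == u[a] for a in t.keys() & u.keys())

def propagate(relations, children, root, weight):
    """Weighted join counting on a join tree rooted at `root`.

    relations: name -> list of dict tuples
    children:  name -> list of child relation names
    weight:    function (name, tuple) -> number
    Returns name -> list aligned with relations[name]: the summed weight of
    all full-join results that contain each tuple.
    """
    up, msg, total = {}, {}, {}

    def bottom_up(n):
        for c in children.get(n, []):
            bottom_up(c)
            # msg[(c, n)][i]: weight of c's subtree joinable with tuple i of n
            msg[(c, n)] = [sum(up[c][j] for j, u in enumerate(relations[c]) if joins(t, u))
                           for t in relations[n]]
        up[n] = [weight(n, t) * prod(msg[(c, n)][i] for c in children.get(n, []))
                 for i, t in enumerate(relations[n])]

    def top_down(n, down):
        total[n] = [u * d for u, d in zip(up[n], down)]
        for c in children.get(n, []):
            # weight of everything outside c's subtree, per tuple of n
            outside = [down[i] * weight(n, t) *
                       prod(msg[(o, n)][i] for o in children[n] if o != c)
                       for i, t in enumerate(relations[n])]
            top_down(c, [sum(outside[i] for i, t in enumerate(relations[n]) if joins(t, u))
                         for u in relations[c]])

    bottom_up(root)
    top_down(root, [1] * len(relations[root]))
    return total

# Tuples taken from the example above; p2 is the relation carrying the class label.
relations = {
    "p1": [{"X": "x1", "Y": "y1"}, {"X": "x2", "Y": "y1"}, {"X": "x1", "Y": "y2"}],
    "p2": [{"X": "x1", "Z": "z1", "W": "v1", "CL": "C1"},
           {"X": "x2", "Z": "z2", "W": "v2", "CL": "C2"}],
    "p5": [{"Y": "y1", "T": "t1", "X": "x1"}, {"Y": "y1", "T": "t2", "X": "x1"},
           {"Y": "y1", "T": "t2", "X": "x2"}],
}
children = {"p1": ["p2", "p5"]}

def class_counts(label):
    """Per-tuple number of join results whose class is `label`."""
    w = lambda name, t: int(name != "p2" or t["CL"] == label)
    return propagate(relations, children, "p1", w)

counts_c1 = class_counts("C1")
counts_c2 = class_counts("C2")
```

Running the pass once per class keeps every message a plain number; pairing `counts_c1` and `counts_c2` position by position gives the class-count pair of each tuple. Each relation is scanned a constant number of times per pass, which is why the size of the universal relation never enters the cost.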

Tractability Islands of CLP. Are trees enough?

Relations: R1(B1,A1,…,Am), R2(B2,A1,…,Am), …, Rn(Bn,A1,…,Am) and R’1(A1), R’2(A2), …, R’m(Am)

Q = R1(B1,A1,…,Am), R2(B2,A1,…,Am), …, Rn(Bn,A1,…,Am), R’1(A1), R’2(A2), …, R’m(Am)

The (undirected) dependency graph is a bipartite clique of size m × n; hence it is not a tree, and the result in [Xu06] does not apply.

CLP is still tractable!

Tractability Islands of CLP. Hypertree Decompositions

For fixed k:
  • deciding whether hw(Q) ≤ k is in P [Gottlob02]
  • computing hypertree decompositions is in P [Gottlob02]

Q = R1(B1,A1,…,Am), R2(B2,A1,…,Am), …, Rm(Bm,A1,…,Am), R’1(A1), R’2(A2), …, R’m(Am)

Hypertree decomposition of Q (figure):
  • root: {B1,…,Bm,A1,…,Am} {R1, R2, …, Rm}
  • children: {A1} {R’1}, {A2} {R’2}, …, {Am} {R’m}

Hypergraph of Q (figure): hyperedges R1, R2, …, Rm and R’1, R’2, …, R’m over the variables B1, B2, …, Bm and A1, A2, …, Am

Tractability Islands of CLP. Hypertree Decompositions

Cyclic dependency graph … bounded width!
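The simplest case of bounded width is width 1, which corresponds to acyclic hypergraphs, and acyclicity can be tested with the classical GYO reduction. The sketch below is a generic implementation of that standard test, not the decomposition procedure used in the talk; the function name `is_alpha_acyclic` and the small instance (the bipartite-clique query with two R relations and two R’ relations) are illustrative assumptions.

```python
def is_alpha_acyclic(edges):
    """GYO reduction: True iff the hypergraph (iterable of vertex sets) is alpha-acyclic."""
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: delete vertices that occur in exactly one hyperedge.
        for e in edges:
            lonely = {v for v in e if sum(v in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: delete one hyperedge that is empty or contained in another one.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    # The hypergraph is acyclic exactly when everything has been eliminated.
    return not edges

# Hyperedges of Q for two R relations and two R' relations (hypothetical small instance).
biclique_query = [{"B1", "A1", "A2"}, {"B2", "A1", "A2"}, {"A1"}, {"A2"}]
triangle = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

print(is_alpha_acyclic(biclique_query), is_alpha_acyclic(triangle))
```

On this instance the dependency graph is cyclic (a bipartite clique), yet the hypergraph reduces completely, which is the gap between the two notions that the slide points at. For width k > 1 a hypertree decomposition has to be computed instead, which this sketch does not attempt.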

CLPonDBk algorithm

Bottom up Phase

Decomposition nodes and their tuples (figure):

Node                   Tuples
Account                (A1, D1), (A2, D2)
Loan                   (L1, A1, #C1), (L2, A1, #C2)
Transaction            (T1, A1), (T2, A2)
Order                  (O1, A1), (O2, A2), (O1, A2)
Account,Disposition    (A1, D1, C1, S1), (A2, D2, C2, S1)
Card                   (P1, S1), (P2, S2)
Client                 (C1, D1), (C2, D1), (C1, D2)
District               (D1), (D2), (D3)

CLPonDBk algorithm

Top Down Phase

(Figure: the same decomposition nodes and tuples as in the bottom-up phase, with tuples annotated by labels <1,1>.)

CLPonDBk solves CLP in time $O(|D| \times \max_{R_i \in D} \|R_i\|^{k+3})$, on the class of those instances whose associated hypergraphs have hypertree width bounded by k.

L-CLP: Local Perspective on the Propagation Problem

  • In several multi-relational approaches, CLP is heuristically restricted to portions of the database
  • Reducing the search space can pragmatically speed up the computation
  • Still, joining many relations may be challenging from a computational viewpoint

L-CLP: NTtoT_onDBMS and TtoNT_onDBMS

“Target to Non-Target” propagation (TtoNT_onDBMS): propagate information from R1 to Rm, then evaluate C on the result.

“Non-Target to Target” propagation (NTtoT_onDBMS): start by filtering Rm with the condition C, join the result with Rm-1, and iterate the process back to R1.

The propagation path from R1 to Rm only requires joining pairs of “adjacent” relations.
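The two directions can be compared on a toy chain of relations. The following is a minimal in-memory sketch, assuming a linear path R1, …, Rm with one join attribute between adjacent relations; the function names `t_to_nt` and `nt_to_t`, the field names and the rows are hypothetical, and the backward direction is written as a semi-join style filter, which is an assumption of this sketch rather than the talk's SQL formulation.

```python
def t_to_nt(chain, keys, cond):
    """Target to non-target: push target IDs from chain[0] to chain[-1], then test cond."""
    reach = [{t["id"]} for t in chain[0]]  # target IDs attached to each tuple
    for left, right, key in zip(chain, chain[1:], keys):
        by_key = {}
        for t, ids in zip(left, reach):
            by_key.setdefault(t[key], set()).update(ids)
        reach = [by_key.get(u[key], set()) for u in right]
    hits = set()
    for u, ids in zip(chain[-1], reach):
        if cond(u):
            hits |= ids
    return hits

def nt_to_t(chain, keys, cond):
    """Non-target to target: filter chain[-1] with cond, then walk back to chain[0]."""
    alive = [u for u in chain[-1] if cond(u)]
    for left, key in zip(reversed(chain[:-1]), reversed(keys)):
        wanted = {u[key] for u in alive}
        alive = [t for t in left if t[key] in wanted]  # keep only joinable tuples
    return {t["id"] for t in alive}

# Hypothetical chain Loan -> Account -> District.
loans = [{"id": 1, "account": 124}, {"id": 2, "account": 124},
         {"id": 3, "account": 108}, {"id": 4, "account": 45}]
accounts = [{"account": 124, "district": 61820}, {"account": 108, "district": 61820},
            {"account": 45, "district": 61801}, {"account": 67, "district": 61822}]
districts = [{"district": 61820, "region": "north"},
             {"district": 61801, "region": "south"},
             {"district": 61822, "region": "north"}]

chain = [loans, accounts, districts]
keys = ["account", "district"]  # keys[i] joins chain[i] with chain[i + 1]
in_north = lambda d: d["region"] == "north"

forward = t_to_nt(chain, keys, in_north)
backward = nt_to_t(chain, keys, in_north)
```

Both functions return the same set of target IDs; they differ in the size of the intermediate results. The forward direction carries ID sets through every relation before the condition is applied, while the backward direction shrinks the working set as soon as the condition is selective.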

L-CLP: NTtoT_onDBMS and TtoNT_onDBMS

(Figure panels: NTtoT_onDBMS, TtoNT_onDBMS)

DBMS Implementation
  • System Design
  • Experiments

A JDBC System for CLP

Experimentation Settings

Scenarios:
  • CROSSMINE + NTtoT_onDBMS
  • CROSSMINE + TtoNT_onDBMS
  • CROSSMINE + TupleIDPropagation

Parameters:
  • The number m of relations
  • The number ||target|| of tuples in the target relation
  • The “propagation ratio” ||target||/||R||
  • The selectivity s of each join attribute

Environment:
  • 2.1 GHz Centrino PC, 1 GB RAM, 5400 rpm hard disk (Windows XP Professional)

Computation Time and Propagation Time

m = 5; ||target||/||R|| = 1; s = 50%

  • Dramatic improvements w.r.t. standard CrossMine
  • Effective scaling for large relations
  • …

Gains w.r.t. CrossMine

m = 5; s = 50%

  • Gain on propagation up to 95%
  • Gain on computation time up to 90%
  • …

NTtoT_onDBMS or TtoNT_onDBMS?

NTtoT_onDBMS vs TtoNT_onDBMS

||target|| = 100000; m = 5; s = 50%

||target|| = 100000; m = 5; s = 50%; ||target||/||R|| = 1

  • TtoNT_onDBMS is the best with a low propagation ratio
  • NTtoT_onDBMS is the best when the target relation is much larger than the other relations
  • Semi-join operators are a winning choice in practical database applications

Conclusion and Discussion

  • The CLP problem is a challenging task, which can be effectively tackled using state-of-the-art query-optimization methods
      • Propagation over a large class of nearly-acyclic database schemas is in fact tractable (polynomial upper bound guarantee)
      • The result in [Xu06] emerges as a special case
  • Database implementation of local-perspective methods shows tremendous benefits w.r.t. standard in-memory strategies
  • Potential benefits for many classification algorithms, such as Bayesian classifiers [Getoor01], probabilistic models [Taskar02], and decision tree learning methods [Leiva03]

THANK YOU!

References

P. A. Bernstein and N. Goodman. Power of natural semijoins. SIAM Journal on Computing, 10(4):751–771, 1981.

H. Blockeel and L. De Raedt. Top-down Induction of First-Order Logical Decision Trees. Artificial Intelligence, 101(1-2):285–297, 1998.

H. Blockeel and M. Sebag. Scalability and Efficiency in Multi-relational Data Mining. SIGKDD Explorations Newsletters, 5(1):17–30, 2003.

M. Ceci and D. Malerba. Mr-SBC: a Multi-Relational Naive Bayes Classifier. In Proc. of PKDD’03, pages 95–106, 2003.

S. Džeroski. Multi-relational Data Mining: an Introduction. SIGKDD Explorations Newsletters, 5(1):1–16, 2003.

P. A. Flach and N. Lachiche. IBC2: A True First-Order Bayesian Classifier. In Proc. of ILP’02, pages 133–148, 2002.

R. Frank and F.M.M. Ester. A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions. In Proc. of PKDD’07, pages 430–437, 2007.

L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proc. of ICML’01, pages 170–177, 2001.

G. Gottlob, N. Leone, and F. Scarcello. Hypertree decomposition and tractable queries. Journal of Computer and System Sciences, 64:579–627, 2002.

G. Gottlob, Z. Miklos, and T. Schwentick. Generalized hypertree decompositions: NP-hardness and tractable variants. In Proc. of PODS’07, pages 13–22, 2007.

H. Guo and H. L. Viktor. Multirelational classification: a multiple view approach. Knowledge and Information Systems, 17(3):287–312, 2008.

G. Jing-Feng, L. Jing, and B. Wei-Feng. An Efficient Relational Decision Tree Classification Algorithm. In Proc. of ICNC’07, pages 530–534, 2007.

M. A. Krogel, S. Rawles, F. Zelezny, P. A. Flach, N. Lavrac, and S. Wrobel. Comparative Evaluation of Approaches to Propositionalization. In Proc. of ILP’03, pages 197–214, 2003.

H. Leiva, A. Atramentov, and V. Honavar. A Multi-relational Decision Tree Learning Algorithm. In Proc. of ILP’03, pages 97–112, 2002.

H. Liu, X. Yin, and J. Han. An efficient Multi-relational Naïve Bayesian classifier based on Semantic Relationship Graph. In Proc. of MRDM’05, pages 39–48, 2005.

S. Muggleton. Inductive Logic Programming. Academic Press, New York, 1992.

J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning Relational Probability Trees. In Proc. of KDD’03, pages 625–630, 2003.

J. Neville, D. Jensen, and B. Gallagher. Simple Estimators for Relational Bayesian Classifiers. In Proc. of ICDM’03, page 609, 2003.

U. Pompe and I. Kononenko. Naive Bayesian Classifier within ILP-R. In Proc. of ILP’95, pages 417–436, 1995.

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI’02, 2002.

K. Wang, Y. Xu, P.S. Yu, and R. She. Building Decision Trees on Records Linked through Key References. In Proc. of SDM’05, 2005.

Y. Xu, K. Wang, A. Wai-Chee Fu, R. She, and J. Pei. Classification Spanning Correlated Data Streams. In Proc. of CIKM’06, pages 132–141, 2006.

M. Yannakakis. Algorithms for acyclic database schemes. In Proc. of VLDB’81, pages 82–94.

X. Yin, J. Han, J. Yang, and P.S. Yu. CrossMine: Efficient Classification Across Multiple Database Relations. In Proc. of ICDE’04, page 399, 2004.

Multi-Relational Classification

Formal Framework

Input: D (with target having attribute CL), I, a class label ‘l’, and a condition C over the attributes of some relation R ∈ D;

Output: $\pi_{key[target]}\bigl(\sigma_{C \wedge target.CL = \text{‘l’}}(R(D, I))\bigr)$
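Read literally, the output is a selection followed by a projection. The sketch below evaluates it in the most direct way, by materializing the join of all relations first; it is only a reference point for small inputs, since this is exactly the flattening that the propagation algorithms avoid. It assumes that R(D, I) denotes the natural join of all relations in the instance, which the slide does not spell out, and the names `natural_join`, `clp_naive`, the field names and the rows are hypothetical.

```python
from functools import reduce

def natural_join(r, s):
    """Natural join of two relations given as lists of dicts."""
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in t.keys() & u.keys())]

def clp_naive(relations, key, cond, label, class_attr="CL"):
    """Project `key` from the joined tuples that satisfy cond and carry class `label`."""
    universal = reduce(natural_join, relations)  # assumed reading of R(D, I)
    return {t[key] for t in universal if cond(t) and t[class_attr] == label}

# Hypothetical instance: Loan is the target relation, with class attribute CL.
loan = [{"loan_id": 1, "account_id": 124, "CL": "Yes"},
        {"loan_id": 2, "account_id": 124, "CL": "Yes"},
        {"loan_id": 3, "account_id": 108, "CL": "No"},
        {"loan_id": 4, "account_id": 45, "CL": "No"}]
account = [{"account_id": 124, "frequency": "monthly"},
           {"account_id": 108, "frequency": "weekly"},
           {"account_id": 45, "frequency": "monthly"}]

is_monthly = lambda t: t["frequency"] == "monthly"
good = clp_naive([loan, account], "loan_id", is_monthly, "Yes")
bad = clp_naive([loan, account], "loan_id", is_monthly, "No")
```

The sizes of `good` and `bad` are the per-class counts a classifier needs for the condition C.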

{account-id,district-id} {Account}

{transaction-id,account-id} {Transaction}

{loan-id,account-id} {Loan}

{order-id,account-id} {Order}

{account-id,disp-id,client-id,district-id} {Account,Disposition}

{disp-id,card-id} {Card}

{client-id,district-id} {Client}

{district-id} {District}