HeteroClass: A Framework for Effective Classification from Heterogeneous Databases
Mayssam Sayyadian, University of Illinois at Urbana-Champaign
Final Project Presentation for CS512: Principles and Algorithms for Data Mining, May 2006




Roadmap
– Motivation
  – From multi-relational classification to multi-DB classification
  – Motivating examples
– Challenges
  – Heterogeneous data sources
  – Attribute instability
– HeteroClass framework
  – Data integration techniques help data mining techniques
  – Diverse ensembles to address attribute instability
– Empirical evaluation
– Discussion, broader impact, and conclusion

Multi-relational Classification
– Classification: an old, but important problem!
– Most real-world data are stored in relational databases.
  – To classify objects in one relation, other relations provide crucial information.
  – Relational data cannot be converted into a single table without expert knowledge or losing essential information.
  – Multi-relational classification: automatically classifying objects using multiple relations

Motivation
– Necessity drives applications…
  – From multi-relational scenarios to multi-database scenarios
  – Cross-database links play an important role, but are difficult to capture automatically

[Figure: Vendor1 shopping database schema — tables Product(product_id, product_name, UPC, category, manufacturer), Transaction(trans_id, product_id, user_id, vendor_id, ip_address, amount, date), Listing(product_id, vendor_id, price), Vendor(vendor_id, vendor_name, description), Customer(user_id, customer_name, address, zipcode, phone, email)]

[Figure: schemas of multiple vendor databases. Vendor1 database: Product1(product_id, product_name, UPC, category, manufacturer), Transaction1(trans_id, product_id, user_id, vendor_id, ip_address, amount, date), Listing1(product_id, vendor_id, price), Vendor1(vendor_id, vendor_name, description), Customer1(user_id, customer_name, address, zipcode, phone, email). Vendor2–Vendor4 databases with differing schemas, e.g. Product1(product_id, product_name, UPC), Transaction1(trans_id, product_id, customer_id, credit_no, date), Customer2(customer_id, customer_name, address), Customer3(creditcard_no, customer_name, phone), Transaction3(trans_id, product_id, creditcard_no, date), Product3(prd_id, product_name, UPC, manufacturer), Manufacturer3(manuf_name, address, zipcode)]


Examples
– In many real-world settings, data sources are
  – Diverse
  – Autonomous and heterogeneous
  – Scattered over the Web, organizations, enterprises, digital libraries, "smart" homes, personal information management (PIM) systems, etc.
  – … but inter-related
– Examples:
  – Which UIUC graduates will donate money to the school in the future? (multiple databases in an organization)
  – Which products are likely to be sold in the next 30 days? (multiple databases in an enterprise)
  – Which customers will use both the traveling and dining services of our company?
  – Which movie scenes are captured outdoors? (multimedia/spatial databases)
  – Etc.

Challenge 1: Heterogeneity of Data Sources
– Data-level heterogeneity, e.g. Phil vs. Philippe, Jiawei Han vs. Han, J.
– Schema heterogeneity, e.g. Personnel(Pno, Pname, Dept, Born) vs. Empl(id, name, DeptNo, Salary, Birth), Dept(id, name)
– Structure/format heterogeneity, e.g. an XML schema vs. SQL DDL:

  <Schema name="HR-Schema">
    <ElementType name="Personnel">
      <element type="EmpNo"/>
      <element type="EmpName"/>
      <ElementType name="Dept">
        <element type="id"/>
        <element type="name"/>
      </ElementType>
    </ElementType>
  </Schema>

  vs.

  CREATE TABLE Personnel(
    EmpNo int PRIMARY KEY,
    EmpName varchar(50),
    DeptNo int REFERENCES Dept,
    Salary dec(15,2)
  );

  CREATE TABLE Dept(
    DeptNo int PRIMARY KEY,
    DeptName varchar(70)
  );

Challenge 2: Attribute Instability
– In distributed databases, objects/data values are
  – Horizontally distributed
    – E.g. the UIUC organization: HR, OAR, Facilities, Business, Academics, …
  – Vertically distributed
    – E.g. comparison shopping: Yahoo Shopping, Amazon, eBay, …
  – Distributed with various distributions
    – E.g. spatial/multimedia databases; DBs of different training sizes
  – Distributed with various characteristics (features)
    – Important when we need to run feature selection before learning the model

HeteroClass Framework: Observations
– Data integration before data mining is very costly and difficult (if not impossible)
– Data warehousing before data mining is difficult and inefficient
  – Security concerns: providers do not give full access to their data
  – Efficiency concerns (large databases)
– The data co-existence paradigm is most natural
  – Data sources expose standalone classifiers
  – DataSpace Support Platforms (DSSPs)
  – Data mining embedded in DBMSs
– It is possible (and useful) to build specialized classifiers
  – No need for full integration/warehousing; partial integration is enough

HeteroClass Framework: Architecture
– Join discovery module
  – Alleviates heterogeneity
  – Outputs useful links between databases (and their relations)
– Ensemble builder
  – Builds a diverse ensemble of classifiers
– Meta-learner
  – Voting
  – Decision-tree meta-learner

Approximate Join Discovery – Step 1
A two-step process. Step 1: find approximate join paths based on set resemblance.
– Field similarity by exact match/substring similarity
– The resemblance of two sets A and B is ρ(A, B) = |A ∩ B| / |A ∪ B|
– After some algebra, |A ∩ B| = ρ · (|A| + |B|) / (1 + ρ)
– There are fast algorithms (min-hashing) to estimate ρ:
  – h(a) is a hash function
  – m(A) = min{h(a) : a ∈ A}
  – Observation: Pr[m(A) = m(B)] = ρ
– The sets are the unique values in the fields
– The resemblance of two fields is a good measure of their similarity
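As an illustration, the min-hash estimator sketched above can be written in a few lines of Python. The affine hash family and the number of seeds are assumptions for this sketch (the affine family only approximates min-wise independence), not part of the original slides.

```python
import random

def minhash_signature(values, seeds, m=2**61 - 1):
    """One minimum per seeded hash h(a) = (x*hash(a) + y) mod m
    over the unique values of a field."""
    sig = []
    for seed in seeds:
        rnd = random.Random(seed)
        x, y = rnd.randrange(1, m), rnd.randrange(m)
        sig.append(min((x * hash(v) + y) % m for v in values))
    return sig

def estimated_resemblance(sig_a, sig_b):
    """Since Pr[m(A) = m(B)] = rho, the fraction of colliding minima
    estimates |A intersect B| / |A union B|."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

# Two fields sharing 50 of 150 distinct values: true resemblance = 1/3
seeds = range(200)
sig1 = minhash_signature(set(range(100)), seeds)
sig2 = minhash_signature(set(range(50, 150)), seeds)
est = estimated_resemblance(sig1, sig2)
```

With 200 hash functions the estimator's standard deviation is roughly sqrt(ρ(1−ρ)/200) ≈ 0.03 here, which is why comparing signatures is enough to rank candidate join columns without scanning full field contents.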

Approximate Join Discovery – Step 2
Step 2: for each set of joinable attributes in the two schemas
– Estimate their semantic distance using a schema matching tool
– Filter out join paths that are not semantically meaningful, i.e. whose semantic distance is high
– E.g. (DB1.last-name, DB2.street-name)
  – Might be joinable because of common substrings (Clinton St. vs. Bill Clinton, George West vs. West Main St., etc.)
  – But this join is not semantically meaningful
  – This happens frequently with categorical attributes (e.g. with "yes"/"no" values)
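A minimal sketch of this step-2 filter, assuming a `semantic_distance` callable supplied by a schema matcher and a `max_distance` threshold; both names and the example distances are hypothetical:

```python
def filter_join_paths(candidates, semantic_distance, max_distance=0.5):
    """Keep only joinable attribute pairs that the schema matcher also
    considers semantically related (low distance).

    candidates: list of (attr_a, attr_b, resemblance) triples from step 1.
    semantic_distance: callable (attr_a, attr_b) -> distance in [0, 1].
    """
    return [(a, b, rho) for a, b, rho in candidates
            if semantic_distance(a, b) <= max_distance]

# Hypothetical matcher output for two candidate pairs
distances = {("DB1.last-name", "DB2.street-name"): 0.9,    # spurious join
             ("DB1.cust_name", "DB2.customer_name"): 0.1}  # meaningful join
kept = filter_join_paths(
    [("DB1.last-name", "DB2.street-name", 0.4),
     ("DB1.cust_name", "DB2.customer_name", 0.6)],
    lambda a, b: distances[(a, b)])
```

Note that the spurious pair has a respectable set resemblance (0.4) yet is rejected: the two signals are complementary, which is exactly why step 2 follows step 1.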

Schema Matching
Given a schema:

  CREATE TABLE Personnel(
    Pno int,
    Pname string,
    Dept string);

Build its graph model:

[Figure: graph model of the schema — nodes &1–&6 for the table Personnel, its columns Pno, Pname, Dept, and the SQL types int and string, connected by name, type, column, and SQLtype edges]

Given two such models, run the similarity flooding algorithm: for each pair of schema elements (a, b), output a semantic similarity distance based on an iterative graph-matching algorithm.
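A simplified sketch of the similarity flooding fixpoint described here, assuming the pairwise propagation graph and the initial similarities have already been built; the data structures and coefficients below are illustrative, not the tool's actual API:

```python
def similarity_flooding(prop_edges, init_sim, iters=200, tol=1e-6):
    """Simplified similarity-flooding fixpoint over map pairs.

    prop_edges: {pair: [(neighbor_pair, coeff), ...]} propagation graph.
    init_sim:   {pair: sigma0} initial similarities (e.g. string matcher).
    Iterates sigma' = normalize(sigma0 + sigma + inflow) until stable.
    """
    sim = dict(init_sim)
    for _ in range(iters):
        nxt = {}
        for pair in sim:
            inflow = sum(sim[nb] * c for nb, c in prop_edges.get(pair, []))
            nxt[pair] = init_sim[pair] + sim[pair] + inflow
        top = max(nxt.values()) or 1.0      # normalize by the largest value
        nxt = {p: v / top for p, v in nxt.items()}
        delta = max(abs(nxt[p] - sim[p]) for p in sim)
        sim = nxt
        if delta < tol:
            break
    return sim

# Two map pairs that reinforce each other via a shared 'column' edge
edges = {("Personnel", "Empl"): [(("Pname", "name"), 1.0)],
         ("Pname", "name"): [(("Personnel", "Empl"), 1.0)]}
result = similarity_flooding(edges, {("Personnel", "Empl"): 0.5,
                                     ("Pname", "name"): 0.1})
```

The intuition the slide states — the similarity of two elements affects the similarity of their neighbors — shows up as the `inflow` term: here the weak initial match (Pname, name) is pulled up by its well-matched neighbor (Personnel, Empl).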

DI & Approximate Join Discovery
Why is it useful?
– When learning a classifier in a multi-relational setting, the number of possible predicates (the predicate space) grows exponentially with the number of links
  – Eliminating spurious links improves efficiency
  – It also improves accuracy
– Best-effort DI
  – We need only a semantic distance
  – No need for exact matches (no need for human confirmation)
  – No need for schema mapping
Contribution: using schema matching and DI techniques helps mine database structure and discover joins

Ensemble-of-Classifiers Approach
Ensemble methods use a combination of models to increase global accuracy. The ensemble should be:
– Accurate
  – Build specialized (site-specific) classifiers
  – Build classifiers on homogeneous regions
  – Improve prediction capability
– Diverse
  – Each learned hypothesis should be as different as possible
  – Each learned hypothesis should be consistent with the training data
  – Ensure the classifiers do not all make the same errors
  – Exploit various attribute collections (joins/schemas/DBs)

HeteroClass: Intuition
Main observation (theoretically proven): to improve the accuracy of an ensemble of classifiers, its members should disagree on some inputs.
– Diversity/ambiguity: a measure of disagreement
– Generalization error = mean error − diversity (variance): E = Ē − D̄
– Therefore: increase ensemble diversity while maintaining the average error of the ensemble members ⇒ decrease the generalization error
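The error–diversity relation stated here holds exactly for squared error and an averaging ensemble (the Krogh–Vedelsby ambiguity decomposition); for classification it is a guiding intuition. A small numeric check, with made-up member predictions, is:

```python
def ensemble_decomposition(preds, target):
    """For the uniform average fbar of member predictions f_i:
    (fbar - y)^2 = mean((f_i - y)^2) - mean((f_i - fbar)^2),
    i.e. ensemble error = mean member error - diversity."""
    fbar = sum(preds) / len(preds)
    ens_err = (fbar - target) ** 2
    mean_err = sum((f - target) ** 2 for f in preds) / len(preds)
    diversity = sum((f - fbar) ** 2 for f in preds) / len(preds)
    return ens_err, mean_err, diversity

e, m, d = ensemble_decomposition([1.0, 2.0, 4.0], target=2.5)
# e equals m - d exactly; since d >= 0, the averaging ensemble is
# never worse than the average member under squared error.
```

This is why the slide's prescription works: raising diversity D̄ without raising the members' mean error Ē directly lowers the ensemble error E.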

HeteroClass: Algorithm
1. C0 = BaseLearn(training data)
2. Initialize ensemble C = {C0}
3. Set err = ensemble error
4. While the error has not converged, training data remain, and not enough ensemble members have been created, do:
   4.1 Pick two (or more) site-specific classifiers and add a number of joins (J1 … Jn) that connect them
   4.2 training data = training data + data accessible through J1 … Jn
   4.3 C' = BaseLearn(training data)
   4.4 C = C ∪ {C'}
   4.5 If the new ensemble's error is increased: C = C − {C'}

Evaluation Settings
– Dataset 1: Inventory
  – Products, Stores, Inventory records: 5 databases
  – Classification task: product availability in a store
– Dataset 2: DBLife data sources
  – DBLP database (2 XML databases from the Clio project @ Toronto): Authors, Papers, Publications, Conferences
  – DBLife-Researchers database: CS researchers, their associations, and meta-data (Department, Title, etc.)
  – DBWorld-Events: People, Events (service, talk, etc.)
  – Classification task: research area for researchers

HeteroClass Accuracy

[Chart: classification accuracy (0–0.9 scale) of CrossMine, MDBM*, and HeteroClass on the Inventory and DBLife datasets]

HeteroClass: Sensitivity Analysis

[Chart: accuracy vs. ensemble size (1–10 classifiers in the ensemble) on Inventory and DBLife]

[Chart: join discovery accuracy on Inventory 1–5, comparing join discovery alone vs. join discovery + schema matching]

Conclusion
Take-home messages:
– Data mining across multiple data sources is important and necessary
– Heterogeneity is challenging
– Best-effort DI helps data mining
  – Improve structure mining using schema matching
– The ensemble approach helps in heterogeneous settings with the data co-existence paradigm
  – A committee of local, specialized classifiers can help build a global classifier
– The HeteroClass framework is general
  – Applicable to other data mining tasks, e.g. clustering: build global clusters from local clusters

Thank You!

Backup: Similarity Flooding Algorithm
– (s, p, o) ∈ A: edge labeled p from s to o in graph A
– Pairwise connectivity graph: ((x, y), p, (x', y')) ∈ PCG(A, B) ⟺ (x, p, x') ∈ A and (y, p, y') ∈ B
– Induced propagation graph: add edges in the opposite direction
  – Initial node similarities: from a q-gram string-similarity matcher
  – Edge weights: propagation coefficients (how similarity propagates to neighbors)

Backup: Similarity Flooding Algorithm
Input graph → mapping → filtering
– Mapping: similarity flooding (SFJoin)
  – Initial similarity values taken from an initial mapping, e.g. a name matcher or a fully functional schema matching tool
  – In each iteration, the similarity of two elements affects the similarity of their respective neighbors
  – Iterate until the similarity values are stable (fixpoint computation)
– Filtering:
  – Needed to choose the best candidate match
  – Many heuristics; difficult to implement fully automatically (needs human intervention)
  – In HeteroClass we only need a semantic distance, so no filtering is required