Mayssam Sayyadian
University of Illinois at Urbana-Champaign
Final Project Presentation for CS512: Principles and Algorithms for Data Mining
May 2006
HeteroClass: A Framework for Effective Classification from Heterogeneous Databases
Roadmap
Motivation
– From multi-relational classification to multi-DB classification
– Motivating examples
Challenges
– Heterogeneous data sources
– Attribute instability
HeteroClass framework
– Data integration techniques help data mining techniques
– Diverse ensembles to address attribute instability
Empirical evaluation
Discussion, broader impact, and conclusion
Multi-relational Classification
Classification is an old but important problem! Most real-world data are stored in relational databases.
– To classify objects in one relation, other relations provide crucial information.
– Relational data cannot be converted into a single table without expert knowledge or losing essential information.
– Multi-relational classification: automatically classifying objects using multiple relations.
Motivation
Necessity drives applications …
– From multi-relational scenarios to multi-database scenarios
– Cross-database links play an important role, but are difficult to capture automatically.
[Figure: example schemas. The Vendor1 shopping database has relations Product(product_id, product_name, UPC, category, manufacturer), Transaction(trans_id, product_id, user_id, vendor_id, ip_address, amount, date), Listing(product_id, vendor_id, price), Vendor(vendor_id, vendor_name, description), and Customer(user_id, customer_name, address, zipcode, phone, email). The Vendor2–Vendor4 databases store similar data under different schemas, e.g. Product3(prd_id, product_name, UPC), Transaction3(trans_id, product_id, creditcard_no, date), Customer2(customer_id, customer_name, address, credit_no), Customer3(creditcard_no, customer_name, phone), and Manufacturer3(manuf_name, address, zipcode).]
Examples
In many real-world settings, data sources are
– Diverse
– Autonomous and heterogeneous
– Scattered over the Web, organizations, enterprises, digital libraries, “smart” homes, personal information management (PIM) systems, etc.
– … but inter-related
Examples:
– Which UIUC graduates will donate money to the school in the future? (multiple databases in an organization)
– Which products are likely to be sold in the next 30 days? (multiple databases in an enterprise)
– Which customers will use both the traveling and dining services of our company?
– Which movie scenes are captured outdoors? (multimedia/spatial databases)
– Etc.
Challenge 1: Heterogeneity of Data Sources
Heterogeneity of data sources:
– Data-level heterogeneity; e.g. Phil vs. Philippe, Jiawei Han vs. Han, J.
– Schema heterogeneity; e.g. Personnel(Pno, Pname, Dept, Born) vs. Empl(id, name, DeptNo, Salary, Birth), Dept(id, name)
– Structure/format heterogeneity; e.g.
<Schema name="HR-Schema">
  <ElementType name="Personnel">
    <element type="EmpNo"/>
    <element type="EmpName"/>
    <ElementType name="Dept">
      <element type="id"/>
      <element type="name"/>
    </ElementType>
  </ElementType>
</Schema>

vs.

CREATE TABLE Personnel(
  EmpNo int PRIMARY KEY,
  EmpName varchar(50),
  DeptNo int REFERENCES Dept,
  Salary dec(15,2)
);
CREATE TABLE Dept(
  DeptNo int PRIMARY KEY,
  DeptName varchar(70)
);
Challenge 2: Attribute Instability
In distributed databases, objects/data values are
– Horizontally distributed; e.g. the UIUC organization: HR, OAR, Facilities, Business, Academics, …
– Vertically distributed; e.g. comparison shopping: Yahoo Shopping, Amazon, eBay, …
– Distributed with various distributions; e.g. spatial/multimedia databases; databases of different training size
– Distributed with various characteristics (features); this matters when feature selection must run before learning the model
HeteroClass Framework: Observations
Data integration before data mining is very costly and difficult (if not impossible).
Data warehousing before data mining is difficult and inefficient:
– Security concerns: providers do not give full access to their data
– Efficiency concerns (large databases)
The data co-existence paradigm is most natural:
– Data sources expose standalone classifiers
– Dataspace Support Platforms (DSSPs)
– Data mining embedded in DBMSs
It is possible (and useful) to build specialized classifiers.
– No need for full integration/warehousing; partial integration is enough.
HeteroClass Framework: Architecture
Join discovery module
– Alleviates heterogeneity
– Outputs useful links between databases (and their relations)
Ensemble builder
– Builds a diverse ensemble of classifiers
Meta-learner
– Voting
– Decision-tree meta-learner
Approximate Join Discovery – Step 1
A two-step process.
1. Find approximate join paths based on set resemblance.
– Field similarity by exact match/substring similarity
– The sets are the unique values in the fields; the resemblance of two fields is a good measure of their similarity.
– The resemblance of two sets A, B is

    ρ(A, B) = |A ∩ B| / |A ∪ B|

– After some algebra, we find that |A ∩ B| = ρ · (|A| + |B|) / (1 + ρ).
– There are fast algorithms to compute ρ (min-hashing):
  – h(a) is a hash function
  – m(A) = min{ h(a) : a ∈ A }
  – Observation: Pr[m(A) = m(B)] = ρ
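The min-hash observation above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation; the field values are made up, and a random salt simulates drawing a random hash function.

```python
import random

def resemblance(a, b):
    """Exact set resemblance rho = |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def minhash_resemblance(a, b, num_hashes=200, seed=0):
    """Estimate rho via min-hashing: for a random hash h and
    m(A) = min(h(x) for x in A), Pr[m(A) = m(B)] = rho."""
    rng = random.Random(seed)
    a, b = set(a), set(b)
    matches = 0
    for _ in range(num_hashes):
        salt = rng.getrandbits(64)          # simulate a random hash function
        h = lambda x: hash((salt, x))
        if min(map(h, a)) == min(map(h, b)):
            matches += 1
    return matches / num_hashes

# Fields are treated as the sets of their unique values (toy data).
field1 = {"Clinton", "West", "Han", "Smith", "Jones"}
field2 = {"Clinton", "West", "Han", "Lee", "Kim"}
exact = resemblance(field1, field2)          # 3 shared / 7 total = 3/7
estimate = minhash_resemblance(field1, field2)
```

Comparing only the single minimum hash per function keeps the cost per field linear in its size, which is what makes the estimate fast on large databases.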
Approximate Join Discovery – Step 2
2. For each set of joinable attributes in the two schemas:
– Estimate their semantic distance using a schema matching tool.
– Filter out join paths that are not semantically meaningful, i.e. whose semantic distance is high.
– E.g. (DB1.last-name, DB2.street-name):
  – Might be joinable because of common substrings (Clinton St. vs. Bill Clinton, George West vs. West Main St., etc.)
  – But this join is not semantically meaningful.
  – This happens frequently with categorical attributes (e.g. with “yes” / “no” values).
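The two-step filter can be sketched as follows. The thresholds and the candidate pairs are illustrative assumptions, not values from the paper:

```python
def filter_join_paths(candidates, min_resemblance=0.5, max_semantic_distance=0.6):
    """Keep join candidates that are both joinable (high set resemblance,
    step 1) and semantically meaningful (low schema-matching distance,
    step 2). Thresholds are hypothetical."""
    return [c for c in candidates
            if c["resemblance"] >= min_resemblance
            and c["semantic_distance"] <= max_semantic_distance]

candidates = [
    # Joinable and semantically meaningful: keep.
    {"pair": ("DB1.customer_name", "DB2.customer_name"),
     "resemblance": 0.8, "semantic_distance": 0.1},
    # Joinable via shared substrings, but semantically unrelated: drop.
    {"pair": ("DB1.last-name", "DB2.street-name"),
     "resemblance": 0.6, "semantic_distance": 0.9},
    # Categorical yes/no fields match trivially: drop.
    {"pair": ("DB1.is_active", "DB2.in_stock"),
     "resemblance": 1.0, "semantic_distance": 0.95},
]
kept = filter_join_paths(candidates)
```

Note how the yes/no example has perfect resemblance but is still rejected, which is exactly why step 2 is needed on top of step 1.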
Schema Matching
Given a schema:

CREATE TABLE Personnel(
  Pno int,
  Pname string,
  Dept string);

build its graph model:

[Figure: graph model of the schema. Internal nodes &1–&6 represent the table, its columns, and their SQL types; “column” edges connect the table node (Personnel) to its column nodes (Pno, Pname, Dept), “SQL type” edges connect columns to their type nodes (int, string), and “name”/“type” edges attach the identifiers.]
Given two such models, run the similarity flooding algorithm: for each pair of schema elements (a, b), it outputs a semantic similarity distance based on an iterative graph matching algorithm.
DI & Approximate Join Discovery
Why is it useful? When learning a classifier in a multi-relational setting:
– The number of possible predicates (the predicate space) grows exponentially with the number of links.
– Eliminating spurious links improves efficiency.
– It improves accuracy.
Best-effort DI:
– We need only a semantic distance.
– No need for exact matches (no need for human confirmation).
– No need for schema mapping.
Contribution: using schema matching and DI techniques helps in mining database structure and join discovery.
Ensemble-of-Classifiers Approach
Ensemble methods use a combination of models to increase global accuracy. The ensemble should be:
– Accurate
  – Build specialized (site-specific) classifiers.
  – Build classifiers on homogeneous regions.
  – Improve prediction capability.
– Diverse
  – Each learned hypothesis should be as different as possible.
  – Each learned hypothesis should be consistent with the training data.
  – Ensure the classifiers do not all make the same errors.
  – Exploit various attribute collections (joins/schemas/DBs).
HeteroClass: Intuition
Main observation (theoretically proved): to improve the accuracy of an ensemble of classifiers, they should disagree on some inputs. Diversity (ambiguity) is a measure of this disagreement:

    E = Ē − D

i.e. generalization error = mean error of the members − diversity (variance). So, increasing ensemble diversity while maintaining the average error of the ensemble members decreases the generalization error.
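For squared-error ensembles this error/diversity decomposition is an exact identity and can be checked numerically. A toy sketch with made-up member predictions (not data from the paper):

```python
# Check the ambiguity decomposition E = E_bar - D for a squared-error
# ensemble: the voted (averaged) prediction's error equals the members'
# mean error minus their diversity (variance around the mean prediction).
predictions = [0.2, 0.5, 0.9]   # member predictions for one input (toy values)
y = 0.6                          # true target

f_bar = sum(predictions) / len(predictions)                  # ensemble output
E = (f_bar - y) ** 2                                         # ensemble error
E_bar = sum((f - y) ** 2 for f in predictions) / len(predictions)   # mean error
D = sum((f - f_bar) ** 2 for f in predictions) / len(predictions)   # diversity

assert abs(E - (E_bar - D)) < 1e-9   # identity holds exactly
```

The identity makes the slide's point concrete: with mean member error fixed, every bit of diversity D is subtracted directly from the ensemble's error E.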
HeteroClass: Algorithm
1. Learn C0 = BaseLearn(training data)
2. Initialize ensemble = {C0}
3. Set ε = ensemble error
4. While the error has not converged, training data remains, and the ensemble is not large enough, do:
   4.1 Pick two (or more) site-specific classifiers and add a number of joins (J1 … Jn) that connect them
   4.2 training data = training data + data accessible through J1 … Jn
   4.3 C′ = BaseLearn(training data)
   4.4 ensemble = ensemble ∪ {C′}
   4.5 If the new ensemble's error increased: ensemble = ensemble − {C′}
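The greedy loop above can be sketched in Python. Everything here is a toy stand-in: the base learner is a 1-nearest-neighbor classifier, the data are made up, and "data accessible through joins" is simulated as pre-computed batches of rows.

```python
def majority_vote(ensemble, x):
    """Predict by unweighted voting over ensemble members."""
    votes = [clf(x) for clf in ensemble]
    return max(set(votes), key=votes.count)

def ensemble_error(ensemble, data):
    """Fraction of (x, y) pairs the voted prediction gets wrong."""
    return sum(majority_vote(ensemble, x) != y for x, y in data) / len(data)

def hetero_class(base_learn, initial_data, join_batches, val_data, max_size=10):
    """Greedy loop from the slide: start with C0, then repeatedly train a
    new classifier on data reached through additional join paths and keep
    it only if the ensemble's error does not increase."""
    training_data = list(initial_data)
    ensemble = [base_learn(training_data)]            # step 1: C0
    err = ensemble_error(ensemble, val_data)          # step 3
    for joined_rows in join_batches:                  # step 4: data via J1..Jn
        if len(ensemble) >= max_size:
            break
        training_data = training_data + joined_rows   # step 4.2
        candidate = base_learn(training_data)         # step 4.3: C'
        new_err = ensemble_error(ensemble + [candidate], val_data)
        if new_err <= err:                            # steps 4.4-4.5
            ensemble.append(candidate)
            err = new_err
    return ensemble, err

# Toy demonstration (hypothetical data): a 1-NN base learner.
def base_learn(data):
    memo = list(data)
    return lambda x: min(memo, key=lambda p: abs(p[0] - x))[1]

initial = [(0, 0), (1, 0), (2, 0)]               # site 1: only class 0
batches = [[(3, 1), (4, 1), (5, 1), (6, 1)],     # rows reached via join J1
           [(7, 1), (8, 1)]]                     # rows reached via join J2
val = [(0, 0), (5, 1)]
ensemble, err = hetero_class(base_learn, initial, batches, val)
```

The first classifier, trained only on site 1, never predicts class 1; members trained on data reached through the joins fix this, and the error check keeps the ensemble from absorbing members that hurt accuracy.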
Evaluation Settings
Dataset 1: Inventory
– Products, stores, inventory records: 5 databases
– Classification task: product availability in a store
Dataset 2: DBLife data sources
– DBLP database (2 XML databases from the Clio project @ Toronto): authors, papers, publications, conferences
– DBLife-Researchers database: CS researchers, their associations, and metadata (department, title, etc.)
– DBWorld-Events: people, events (service, talk, etc.)
– Classification task: research area for researchers
HeteroClass Accuracy

[Chart: accuracy of CrossMine, MDBM*, and HeteroClass on the Inventory and DBLife datasets (y-axis: accuracy, 0–0.9).]
HeteroClass: Sensitivity Analysis

[Chart: accuracy vs. ensemble size (1–10 classifiers in the ensemble) on the Inventory and DBLife datasets (y-axis: accuracy, 0.3–0.9).]

[Chart: join discovery accuracy (0–1) on Inventory 1–5, comparing join discovery alone against join discovery + schema matching.]
Conclusion
Take-home messages:
– Data mining across multiple data sources is important and necessary.
– Heterogeneity is challenging.
– Best-effort DI helps data mining: structure mining improves using schema matching.
– The ensemble approach helps in heterogeneous settings under the data co-existence paradigm: a committee of local, specialized classifiers can help build a global classifier.
The HeteroClass framework is general:
– Applicable to other data mining tasks, e.g. clustering: build global clusters from local clusters.
Backup: Similarity Flooding Algorithm
e(s, p, o) ∈ A: edge e labeled p from node s to node o in graph A.
Pairwise connectivity graph PCG(A, B):

    ((x, y), p, (x′, y′)) ∈ PCG(A, B)  iff  (x, p, x′) ∈ A and (y, p, y′) ∈ B

Induced propagation graph: add edges in the opposite direction.
– Forward edge weights: initialize with a q-gram string similarity matcher
– Edge weights: propagation coefficients (how the similarity propagates to neighbors)
Backup: Similarity Flooding Algorithm
Input graphs → mapping → filtering. Mapping: similarity flooding (SFJoin).
– Initial similarity values are taken from an initial mapping, e.g. a name matcher or a fully functional schema matching tool.
– In each iteration, the similarity of two elements affects the similarity of their respective neighbors.
– Iterate until the similarity values are stable (a fixpoint computation).
Filtering:
– Needed to choose the best candidate match
– Many heuristics
– Difficult to implement fully automatically (needs human intervention)
– In HeteroClass: only a semantic distance is needed, so no filtering is required.
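The fixpoint computation can be sketched minimally: each map-pair's similarity absorbs its PCG neighbors' similarities, values are normalized, and the process repeats until stable. This omits per-label propagation coefficients, so it is a simplification of SFJoin, and the schema-element pairs below are hypothetical.

```python
def similarity_flooding(pcg_edges, init_sim, iterations=50, tol=1e-9):
    """Minimal fixpoint sketch of similarity flooding. init_sim maps each
    PCG node (a pair of schema elements) to a positive initial similarity;
    pcg_edges connect PCG nodes. Edges propagate in both directions
    (the induced propagation graph)."""
    sim = dict(init_sim)
    nodes = set(sim)
    neighbors = {n: [] for n in nodes}
    for a, b in pcg_edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    for _ in range(iterations):
        # Each pair keeps its value and absorbs its neighbors' values...
        new = {n: sim[n] + sum(sim[m] for m in neighbors[n]) for n in nodes}
        # ...then everything is normalized by the largest value.
        norm = max(new.values())
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - sim[n]) for n in nodes) < tol:  # stable: fixpoint
            return new
        sim = new
    return sim

# Toy PCG over hypothetical schema-element pairs; initial values stand in
# for q-gram string-matcher scores.
A = ("Personnel", "Empl")
B = ("Pname", "name")
C = ("Pno", "id")
D = ("Pname", "Salary")          # spurious pair, disconnected in the PCG
init = {A: 0.1, B: 0.5, C: 0.2, D: 0.4}
sim = similarity_flooding([(A, B), (A, C)], init)
```

Even though the spurious pair starts with a higher string similarity than the table pair, flooding drives it toward zero because no neighboring pairs reinforce it, while the connected pairs reinforce each other.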