Bellwether Analysis
Hierarchies in Data Mining
Raghu Ramakrishnan
ramakris@yahoo-inc.com
Chief Scientist for Audience and Cloud Computing
Yahoo!
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma. Hierarchies in Data Mining.
About this Talk
• Common theme: a multidimensional view of data
– Reveals patterns that emerge at coarser granularity (widely recognized, e.g., generalized association rules)
– Helps handle imprecision (analyzing imprecise and aggregated data)
– Helps handle data sparsity (even with massive datasets, sparsity is a challenge!)
– Defines a candidate space of subsets for exploratory mining
• Forecasting query results over "future data"
• Using predictive models as summaries
• Potentially, a space of "mining experiments"?
Background: The Multidimensional Data Model
Cube Space
Star Schema
"FACT" TABLE:
– SERVICE(pid, timeid, locid, repair)
DIMENSION TABLES:
– PRODUCT(pid, pname, category, model)
– TIME(timeid, date, week, year)
– LOCATION(locid, country, region, state)
Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy:
– PRODUCT: automobile → category → model
– TIME: year → quarter → month → date, with week → date
– LOCATION: country → region → state
Multidimensional Data Model
• One fact table = (X, M)
– X = ⟨X1, X2, …⟩: dimension attributes
– M = ⟨M1, M2, …⟩: measure attributes
• Domain hierarchy for each dimension attribute:
– A collection of domains Hier(Xi) = (Di(1), …, Di(t))
– The extended domain: EXi = ∪1≤k≤t Di(k)
• Value mapping function γD1→D2(x):
– E.g., γmonth→year(12/2005) = 2005
– Forms the value hierarchy graph
– Stored as a dimension table attribute (e.g., week for a time value) or as conversion functions (e.g., month, quarter)
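As a sketch, the value-mapping functions γ for the TIME dimension above might look like this in Python (the function names and string encodings are ours, purely illustrative):

```python
from datetime import date

# Hypothetical sketch of the TIME hierarchy's value mappings:
# each gamma function rolls a value up to the next coarser domain.

def gamma_date_month(d: date) -> str:
    return f"{d.month:02d}/{d.year}"           # e.g. "12/2005"

def gamma_month_year(m: str) -> int:
    return int(m.split("/")[1])                # gamma_month->year("12/2005") = 2005

def gamma_month_quarter(m: str) -> str:
    month, year = m.split("/")
    return f"Q{(int(month) - 1) // 3 + 1}/{year}"

# Rolling one time value up the hierarchy:
d = date(2005, 12, 31)
m = gamma_date_month(d)        # "12/2005"
y = gamma_month_year(m)        # 2005, matching the slide's example
q = gamma_month_quarter(m)     # "Q4/2005"
```

Composing such functions is exactly how a value hierarchy graph is traversed from finer to coarser domains.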
Multidimensional Data

[Figure: the four facts plotted in cube space. The LOCATION axis is organized as ALL → Region (East, West) → State (MA, NY, TX, CA); the Automobile axis as ALL → Category (Sedan, Truck) → Model (Camry, Civic, F150, Sierra). Auto and Loc are the dimension attributes.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
Cube Space
• Cube space: C = EX1 × EX2 × … × EXd
• Region: a hyper-rectangle in cube space
– c = (v1, v2, …, vd), where vi ∈ EXi
– E.g., c1 = (NY, Camry); c2 = (West, Sedan)
• Region granularity:
– gran(c) = (d1, d2, …, dd), where di = Domain(c.vi)
– E.g., gran(c1) = (State, Model); gran(c2) = (Region, Category)
• Region coverage: coverage(c) = all facts in c
• Region set: all regions with the same granularity
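A tiny sketch of region granularity, assuming a minimal encoding of the example hierarchies (the level tables below are hypothetical helpers, not part of the model):

```python
# Each dimension value is tagged with the hierarchy level it belongs to,
# following the LOCATION and Automobile hierarchies from the slides.
LOCATION_LEVEL = {"MA": "State", "NY": "State", "TX": "State", "CA": "State",
                  "East": "Region", "West": "Region", "ALL": "ALL"}
AUTO_LEVEL = {"Camry": "Model", "Civic": "Model", "F150": "Model", "Sierra": "Model",
              "Sedan": "Category", "Truck": "Category", "ALL": "ALL"}

def gran(region):
    """gran(c) = the tuple of domains of c's component values."""
    loc, auto = region
    return (LOCATION_LEVEL[loc], AUTO_LEVEL[auto])

c1 = ("NY", "Camry")
c2 = ("West", "Sedan")
print(gran(c1))   # ('State', 'Model')
print(gran(c2))   # ('Region', 'Category')
```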
OLAP Over Imprecise Data
with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan
In VLDB 05 and 06; joint work with IBM Almaden
Imprecise Data

[Figure: the same cube space as above, now including p5. Its Auto value, Truck, is a non-leaf node, so p5 spans the (F150, MA) and (Sierra, MA) cells.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100
Querying Imprecise Facts
[Figure: the fact table above; the query region (F150, MA) contains p3 and partially overlaps the imprecise fact p5.]

Query: Auto = F150, Loc = MA, SUM(Repair) = ??? How do we treat p5?
Allocation (1)

[Figure: the imprecise fact p5 (Truck, MA) must be allocated between the cells (F150, MA) and (Sierra, MA).]
Allocation (2)

[Figure: p5 is split across the (F150, MA) and (Sierra, MA) cells.]

ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5

(Huh? Why 0.5 / 0.5? Hold on to that thought.)
Allocation (3)

[Figure: the allocated table, as above.]

Query: Auto = F150, Loc = MA, SUM(Repair) = 150. Query the extended data model!
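The weighted query over the extended data model can be sketched as follows; this is a minimal illustration of the semantics, not the implementation from the papers:

```python
# Extended data model: the imprecise fact p5 is allocated to its two
# possible completions with weight 0.5 each.
extended = [
    # (FactID, Auto, Loc, Repair, Weight)
    ("p1", "F150",   "NY", 100, 1.0),
    ("p2", "Sierra", "NY", 500, 1.0),
    ("p3", "F150",   "MA", 100, 1.0),
    ("p4", "Sierra", "MA", 200, 1.0),
    ("p5", "F150",   "MA", 100, 0.5),
    ("p5", "Sierra", "MA", 100, 0.5),
]

def weighted_sum_repair(auto, loc):
    """SUM(Repair) over a cell, weighting each row by its allocation."""
    return sum(r * w for (_, a, l, r, w) in extended if a == auto and l == loc)

print(weighted_sum_repair("F150", "MA"))   # 150.0 = 100 + 0.5 * 100
```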
Allocation Policies
• The procedure for assigning allocation weights is referred to as an allocation policy
– Each allocation policy uses different information to assign allocation weights
• Key contributions:
– An appropriate characterization of the large space of allocation policies (VLDB 05)
– Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
Motivating Example

[Figure: the cube with facts p1–p5.]

We propose desiderata that enable an appropriate definition of query semantics for imprecise data.

Query: COUNT
Desideratum I: Consistency
• Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: the cube with facts p1–p5; COUNT queries at different granularities over the same data set.]
Desideratum II: Faithfulness
• Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: three related data sets (Data Set 1, Data Set 2, Data Set 3) over the same cube, containing facts p1–p5 at varying degrees of imprecision.]
Imprecise facts lead to many possible worlds [Kripke63, …]

[Figure: four possible worlds w1–w4, each obtained by assigning the imprecise facts to precise cells in a different way.]
Query Semantics
• Given all possible worlds together with their probabilities, queries are easily answered using expected values
– But the number of possible worlds is exponential!
• Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
– The size increase is linear in the number of (completions of) imprecise facts
– Queries operate over this extended version
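For the running example, this equivalence can be checked by brute force: enumerating the two completions of p5, each with probability 0.5 under uniform allocation, yields the same answer as the extended data model (a toy check, not the general algorithm):

```python
# Two possible worlds for the MA facts: p5 completes to F150 or Sierra,
# each with probability 0.5 under uniform allocation.
worlds = [
    (0.5, [("p3", "F150", "MA", 100), ("p4", "Sierra", "MA", 200),
           ("p5", "F150", "MA", 100)]),
    (0.5, [("p3", "F150", "MA", 100), ("p4", "Sierra", "MA", 200),
           ("p5", "Sierra", "MA", 100)]),
]

def sum_repair(facts, auto, loc):
    return sum(r for (_, a, l, r) in facts if a == auto and l == loc)

# Expected value of SUM(Repair) for (F150, MA) over the possible worlds:
expected = sum(p * sum_repair(facts, "F150", "MA") for p, facts in worlds)
print(expected)   # 150.0, the same answer the extended data model gives
```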
Dealing with Data Sparsity
Deepak Agarwal, Andrei Broder, Deepayan Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam Sayyadian
Estimating Rates of Rare Events at Multiple Resolutions, KDD 2007
Motivating Application: the Content Match Problem

• Problem: which ads are good on which pages?
– Pages: no control; ads: can control
• First simplification: a (page, ad) pair is completely characterized by a set of high-dimensional features
• Naïve approach: experiment with all possible pairs several times and estimate CTR
– Of course, this doesn't work: most (page, ad) pairs have very few impressions, if any, and even fewer clicks → severe data sparsity
Estimation in the “Tail”
• Use an existing, well-understood hierarchy
– Categorize ads and webpages to leaves of the hierarchy
– CTR estimates of siblings are correlated; the hierarchy allows us to aggregate data
• Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions

Similar "coarsening", different motivation: Mining Generalized Association Rules, Ramakrishnan Srikant and Rakesh Agrawal, VLDB 1995
Sampling of Webpages
• Naïve strategy: sample at random from the set of URLs
– Sampling errors in impression volume AND click volume
• Instead, we propose:
– Crawling all URLs with at least one click, and
– A sample of the remaining URLs
– Variability is then only in impression volume
Imputation of Impression Volume
• Region node = (page node, ad node)
• Build a region hierarchy: the cross-product of the page hierarchy and the ad hierarchy

[Figure: levels Z(0), …, Z(i) of the region hierarchy; a leaf region pairs a page leaf with an ad leaf.]
Exploiting Taxonomy Structure
• Consider the bottom two levels of the taxonomy
• Each cell corresponds to a (page, ad)-class pair
• Key point: children under a parent node are alike and are expected to have similar CTRs (i.e., they form a cohesive block)
Imputation of Impression Volume
For any level Z(i), arrange the regions in a grid of page classes (rows) × ad classes (columns):
• #impressions in a cell = nij + mij + xij, where
– nij: impressions in the clicked pool
– mij: impressions in the sampled non-clicked pool
– xij: excess impressions (to be imputed)
• Column constraint: each column sums to the #impressions on ads of that ad class
• Row constraint: each row sums to ∑nij + K·∑mij
• The grand total sums to the total impressions (known)
• Block constraint: the cells of each block in Z(i+1) sum to the value of their parent cell in Z(i)
Imputing xij
Iterative proportional fitting [Darroch+/1972], applied between adjacent levels Z(i) and Z(i+1):
• Initialize xij = nij + mij
• Top-down, repeating for every level Z(i):
– Scale all xij in every block in Z(i+1) to sum to its parent in Z(i)
– Scale all xij in Z(i+1) to sum to the row totals
– Scale all xij in Z(i+1) to sum to the column totals
• Bottom-up: similar
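The row and column scaling steps can be sketched for the plain two-dimensional case (the cross-level block constraint is omitted, and the numbers below are made up):

```python
# Iterative proportional fitting: alternately rescale the rows and the
# columns of x until they match the target row and column totals.
def ipf(x, row_totals, col_totals, iters=100):
    for _ in range(iters):
        for i, row in enumerate(x):                    # match row totals
            s = sum(row)
            if s > 0:
                x[i] = [v * row_totals[i] / s for v in row]
        for j, target in enumerate(col_totals):        # match column totals
            s = sum(row[j] for row in x)
            if s > 0:
                for row in x:
                    row[j] *= target / s
    return x

# Initialize x_ij = n_ij + m_ij, then fit to known marginals
# (the totals must be consistent: sum of rows == sum of columns).
x = ipf([[1.0, 2.0], [3.0, 4.0]], row_totals=[6.0, 14.0], col_totals=[8.0, 12.0])
for row in x:
    print([round(v, 3) for v in row])
```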
Imputation: Summary
• Given:
– nij (impressions in the clicked pool)
– mij (impressions in the sampled non-clicked pool)
– the #impressions on ads of each ad class in the ad hierarchy
• We get the estimated impression volume Ñij = nij + mij + xij in each region ij of every level Z(·)
Dealing with Data Sparsity
Deepak Agarwal, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah
Real-time Content Optimization through Active User Feedback, NIPS 2008
Yahoo! Home Page Featured Box
• It is the top-center part of the Y! Front Page
• It has four tabs: Featured, Entertainment, Sports, and Video
Novel Aspects

• Classical: arms assumed fixed over time
– We gain and lose arms over time (some theoretical work by Whittle in the 80s; operations research)
• Classical: serving rule updated after each pull
– We compute an optimal design in batch mode
• Classical: CTR generally assumed stationary
– We have highly dynamic, non-stationary CTRs
Bellwether Analysis: Global Aggregates from Local Regions

with Bee-Chung Chen, Jude Shavlik, and Pradeep Tamma. In VLDB 06
Motivating Example
• A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie)
– By looking at features and profits of previous (similar) movies, we predict the expected total profit (1-year US sales) for the new movie
– Wait a year and write a query! If you can't wait, stay awake…
– The most predictive "features" may be based on sales data gathered by releasing the new movie in many "regions" (different locations over different time periods)
• Example "region-based" features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
• Gathering this data has a cost (e.g., marketing expenses, waiting time)
• Problem statement: find the most predictive region features that can be obtained within a given "cost budget"
Key Ideas
• Large datasets are rarely labeled with the targets that we wish to learn to predict
– But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st-week sales in Peoria) and even targets (e.g., profit) for mining
• We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
– The central problem is to find data subsets ("bellwether regions") that lead to predictive features which can be gathered at low cost for a new case
Motivating Example
• A company wants to predict the first year's worldwide profit for a new item, by using its historical database
• Database schema:
– Profit table: (Time, Location, CustID, ItemID, Profit)
– Item table: (ItemID, Category, R&D Expense)
– Ad table: (Time, Location, ItemID, AdExpense, AdSize)
• The combination of the underlined attributes forms a key
A Straightforward Approach
• Build a regression model to predict item profit
• By joining and aggregating tables in the historical database, we can create a training set (item-table features plus the target):

ItemID  Category  R&D Expense  Profit (target)
1       Laptop    500K         12,000K
2       Desktop   100K         8,000K
…       …         …            …

An example regression model:
Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense

• There is much room for accuracy improvement!
Using Regional Features
• Example region: [1st week, HK]
• Regional features:
– Regional profit: the 1st-week profit in HK
– Regional ad expense: the 1st-week ad expense in HK
• A possibly more accurate model:
Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, HK] + β5·AdExpense[1wk, HK]
• Problem: which region should we use?
– The smallest region that improves the accuracy the most
– We give each candidate region a cost
– The most "cost-effective" region is the bellwether region
Basic Bellwether Problem
[Figure: a week (1–52) × location (KR, USA, …, WI, WY, …) grid. Features fi,r(DB) are aggregates over the data records in a region, e.g., Profit[1-2, USA] = 45K for item i in region r = [1-2, USA]; the target ti(DB) is the item's total profit in [1-52, All], e.g., 2,000K.]

For each region r, build a predictive model hr(x); then choose the bellwether region r such that:
• Coverage(r) ≥ a minimum coverage support (the fraction of all items in the region)
• Cost(r, DB) ≤ a cost threshold
• Error(hr) is minimized
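The selection criterion amounts to a filter-then-minimize over candidate regions, sketched below with hypothetical coverage, cost, and error values:

```python
# Choose the bellwether region: among regions meeting the coverage and
# cost constraints, pick the one whose model error is minimal.
candidates = [
    # (region, coverage, cost, error)  -- hypothetical numbers
    ("[1-2, USA]",  0.9, 40, 5.1),
    ("[1-8, MD]",   0.8, 30, 3.2),
    ("[1-1, HK]",   0.3, 10, 2.9),   # fails the minimum-coverage support
    ("[1-52, All]", 1.0, 95, 1.5),   # exceeds the cost budget
]

def bellwether(cands, min_coverage=0.5, budget=50):
    feasible = [c for c in cands if c[1] >= min_coverage and c[2] <= budget]
    return min(feasible, key=lambda c: c[3])[0]

print(bellwether(candidates))   # [1-8, MD]
```

In the actual system the error of each candidate's model is estimated from data rather than given, but the constrained minimization is the same.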
Experiment on a Mail Order Dataset
[Error-vs-budget plot: RMSE (root mean square error, 0–30,000) against budget (5–85) for three curves: Bel Err, Avg Err, and Smp Err; the bellwether region found is [1-8 month, MD].]

• Bel Err: the error of the bellwether region found using a given budget
• Avg Err: the average error of all the cube regions with costs under a given budget
• Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
Experiment on a Mail Order Dataset
[Uniqueness plot: fraction of indistinguishable regions (0–0.9) against budget (5–85); the bellwether region is [1-8 month, MD].]

• Y-axis: the fraction of regions that are as good as the bellwether region
– That is, the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
• We have 99% confidence that [1-8 month, MD] is a quite unusual bellwether region
Basic Bellwether Computation
• OLAP-style bellwether analysis
– Candidate regions: regions in a data cube
– Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
• Efficient computation:
– Use iceberg-cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
– Infeasible regions: regions with cost > B or coverage < C
– Share computation by generating the features and target values for all the feasible regions together
– Exploit distributive and algebraic aggregate functions; simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
Subset-Based Bellwether Prediction
• Motivation: different subsets of items may have different bellwether regions
– E.g., the bellwether region for laptops may be different from the bellwether region for clothes
• Two approaches:
– Bellwether tree: a decision tree over item attributes whose leaves hold bellwether regions (e.g., split on R&D Expense at 50K, then on Category, with leaves such as [1-1, NY], [1-2, WI], and [1-3, MD])
– Bellwether cube: a cube over item attributes, e.g., Category × R&D Expenses (Low, Medium, High), whose cells hold bellwether regions (e.g., Software/OS with Low expenses → [1-3, CA]; Hardware/Laptop with Medium expenses → [1-1, NY])
Characteristics of Bellwether Trees & Cubes
Dataset generation:
• Use a random tree to generate different bellwether regions for different subsets of items
• Parameters: noise, and concept complexity (# of tree nodes)

Results:
• Bellwether trees & cubes have better accuracy than basic bellwether search
• Increased noise → increased error
• Increased complexity → increased error

[Plots: RMSE vs. noise (0.05–2, at 15 nodes) and RMSE vs. number of nodes (3–63, at noise level 0.5) for the basic, cube, and tree methods.]
Efficiency Comparison
[Plot: execution time (0–3000 sec) vs. thousands of examples (100–300). Naïve computation methods (naive cube, naive tree) are compared with our computation techniques (RF tree, single-scan cube, optimized cube).]
Scalability
[Plots: execution time vs. millions of examples (2.5–10); one plot shows single-scan cube and optimized cube (up to ~1200 sec), the other shows RF tree (up to ~7000 sec).]
Exploratory Mining: Prediction Cubes

with Bee-Chung Chen, Lei Chen, and Yi Lin. In VLDB 05
The Idea
• Build OLAP data cubes in which cell values represent decision/prediction behavior
– In effect, build a tree for each cell/region in the cube; observe that this is not the same as a collection of trees used in an ensemble method!
– The idea is simple, but it leads to promising data mining tools
– Ultimate objective: exploratory analysis of the entire space of "data mining choices" (choice of algorithms, data conditioning parameters, …)
Example (1/7): Regular OLAP
Goal: look for patterns of unusually high numbers of applications.

Z: Dimensions (Location, Time); Y: Measure (# of App.)

Location  Time     # of App.
…         …        …
AL, USA   Dec, 04  2
…         …        …
WY, USA   Dec, 04  3

Dimension hierarchies:
• Time: All → Year (85, 86, …, 04) → Month (Jan 86, …, Dec 86)
• Location: All → Country (Japan, USA, Norway) → State (AL, …, WY)
Example (2/7): Regular OLAP
Goal: look for patterns of unusually high numbers of applications.

Cell value: number of loan applications (Z: dimensions; Y: measure).

[Tables: a [Country, Month] view (rows CA and USA; columns Jan–Dec of 2004 and 2003) can be rolled up to coarser regions at [Country, Year], or drilled down to finer regions at [State, Month] (states AL, …, WY under USA; provinces AB, …, YT under CA).]
Example (3/7): Decision Analysis

Goal: analyze a bank's loan decision process w.r.t. two dimensions, Location and Time.

Fact table D (Z: dimensions; X: predictors; Y: class):

Location  Time     Race   Sex  …  Approval
AL, USA   Dec, 04  White  M    …  Yes
…         …        …      …    …  …
WY, USA   Dec, 04  Black  F    …  No

A model h(X; σZ(D)), e.g., a decision tree, is built on a cube subset σZ(D) of the fact table.

Dimension hierarchies:
• Time: All → Year (85, …, 04) → Month (Jan 86, …, Dec 86)
• Location: All → Country (Japan, USA, Norway) → State (AL, …, WY)
Example (3/7): Decision Analysis
• Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
– Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?”
• Are there branches and times with decision making reminiscent of 1950s Alabama?
– Requires comparison of classifiers trained using different subsets of data.
Example (4/7): Prediction Cubes

1. Build a model, e.g., a decision tree, using the data from USA in Dec 04: σ[USA, Dec 04](D)
2. Evaluate that model

Measure in a cell:
• The accuracy of the model
• The predictiveness of Race measured based on that model
• The similarity between that model and a given model

[Cube at level [Country, Month]: e.g., CA: 0.4, 0.8, 0.9, … and USA: 0.2, 0.3, 0.5, … across Jan–Dec of 2004 and 2003. The cell [USA, Dec 04] holds the measure of the model built on σ[USA, Dec 04](D), whose rows include (AL, USA; Dec, 04; White; M; …; Yes) and (WY, USA; Dec, 04; Black; F; …; No).]
Example (5/7): Model-Similarity

Given: the data table D, a target model h0(X), and a test set without labels.

For each cell at level [Country, Month], build a model on that cell's data, apply both it and h0(X) to the test set, and record the similarity of their predictions as the cell value.

[Tables: data table D (Location, Time, Race, Sex, …, Approval); a test set with columns Race, Sex, …; and the resulting similarity cube, e.g., CA: 0.4, 0.2, 0.3, 0.6, 0.5, …; USA: 0.2, 0.3, 0.9, ….]

Interpretation of a high cell value: the loan decision process in USA during Dec 04 was similar to a discriminatory decision model h0(X).
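One simple way to realize such a similarity measure is prediction agreement on the unlabeled test set; the sketch below uses stand-in rule-based "models" (the actual learners and the paper's similarity measure may differ):

```python
# Similarity between a cell's model h and the target model h0:
# the fraction of test examples on which their predictions agree.
def similarity(h, h0, test_set):
    agree = sum(1 for x in test_set if h(x) == h0(x))
    return agree / len(test_set)

# Stand-in models (hypothetical): h0 is a discriminatory rule on Race;
# the cell's model approves based on an income threshold instead.
h0 = lambda x: "No" if x["Race"] == "Black" else "Yes"
h_cell = lambda x: "Yes" if x["Income"] > 50 else "No"

test_set = [
    {"Race": "White", "Income": 80},
    {"Race": "Black", "Income": 90},
    {"Race": "White", "Income": 20},
    {"Race": "Black", "Income": 10},
]
print(similarity(h_cell, h0, test_set))   # 0.5
```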
Example (6/7): Predictiveness

Given: the data table D, a set of attributes V, and a test set without labels.

For each cell at level [Country, Month], build two models on the cell's data, h(X) and h(X − V), and measure how much their predictions on the test set differ; the difference is the predictiveness of V in that cell.

[Tables: data table D (Location, Time, Race, Sex, …, Approval); a test set (Race, Sex, …); and the resulting cube, e.g., CA: 0.4, 0.2, 0.3, 0.6, 0.5, …; USA: 0.2, 0.3, 0.9, ….]

Interpretation: Race was an important predictor of the loan approval decision in USA during Dec 04.
Example (7/7): Prediction Cube
Cell value: the predictiveness of Race.

[Tables: the cube at [Country, Month] (rows CA, USA; columns Jan–Dec of 2004 and 2003) can be rolled up to [Country, Year] or drilled down to [State, Month] (states AL, …, WY under USA; provinces AB, …, YT under CA).]
Efficient Computation
• Reduce prediction cube computation to data cube computation
– Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
Bottom-Up Data Cube Computation
Cell values: numbers of loan applications.

Base table (Country × Year):
         1985  1986  1987  1988
Norway     10    30    20    24
…          23    45    14    32
USA        14    32    42    11

Roll up over countries:
All        47   107    76    67

Roll up over years: Norway 84; … 114; USA 99

Roll up over both: (All, All) = 297
Functions on Sets
• Bottom-up computable functions: functions that can be computed using only summary information
• Distributive function: α(X) = F({α(X1), …, α(Xn)})
– where X = X1 ∪ … ∪ Xn and Xi ∩ Xj = ∅
– E.g., Count(X) = Sum({Count(X1), …, Count(Xn)})
• Algebraic function: α(X) = F({G(X1), …, G(Xn)})
– G(Xi) returns a fixed-length vector of values
– E.g., Avg(X) = F({G(X1), …, G(Xn)}), where
– G(Xi) = [Sum(Xi), Count(Xi)]
– F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci})
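The Avg example above can be run directly (a minimal sketch of the algebraic decomposition):

```python
# Avg is algebraic: each subset contributes a fixed-length summary
# G(Xi) = (Sum(Xi), Count(Xi)), and F combines the summaries without
# revisiting the underlying data.
def G(xs):
    return (sum(xs), len(xs))

def F(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

x1, x2, x3 = [10, 30], [20], [24, 16]
avg = F([G(x1), G(x2), G(x3)])
print(avg)   # 20.0, the average of all five values
```

This is exactly the property that lets coarse cube cells be computed bottom-up from the summaries of their child cells.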
Scoring Function
• Represent a model as a function of sets
• Conceptually, a machine-learning model h(X; σZ(D)) is a scoring function Score(y, x; σZ(D)) that gives each class y a score on test example x
– h(x; σZ(D)) = argmaxy Score(y, x; σZ(D))
– Score(y, x; σZ(D)) ≈ p(y | x, σZ(D))
– σZ(D): the set of training examples (a cube subset of D)
Machine-Learning Models
• Naïve Bayes: the scoring function is algebraic
• Kernel-density-based classifier: the scoring function is distributive
• Decision tree, random forest: neither distributive nor algebraic
• PBE: probability-based ensemble (new)
– Makes any machine-learning model distributive
– An approximation
Probability-Based Ensemble
[Figure: decision trees are built on the lowest-level cells ([WA, Jan 85], …, [WA, Dec 85]); the PBE version of the decision tree on [WA, 85] combines them, approximating a decision tree trained directly on [WA, 85].]
Efficiency Comparison
[Plot: execution time (0–2500 sec) vs. number of records (40K–200K). The exhaustive method (RFex, KDCex, NBex, J48ex) is compared with bottom-up score computation (NB, KDC, RF-PBE, J48-PBE).]
Conclusions
Related Work: Building Models on OLAP Results
• Multi-dimensional regression [Chen, VLDB 02]
– Goal: detect changes of trends
– Build linear regression models for cube cells
• Step-by-step regression in stream cubes [Liu, PAKDD 03]
• Loglinear-based quasi cubes [Barbara, J. IIS 01]
– Use a loglinear model to approximately compress dense regions of a data cube
• NetCube [Margaritis, VLDB 01]
– Build a Bayes net on the entire dataset to approximately answer count queries
Related Work (Contd.)
• Cubegrades [Imielinski, J. DMKD 02]
– Extend cubes with ideas from association rules
– How does the measure change when we roll up or drill down?
• Constrained gradients [Dong, VLDB 01]
– Find pairs of similar cell characteristics associated with big changes in measure
• User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
– Help users find the most informative unvisited regions in a data cube using the max-entropy principle
• Multi-structural DBs [Fagin et al., PODS 05, VLDB 05]
• Experiment databases: Towards an Improved Experimental Methodology in Machine Learning [Blockeel & Vanschoren, PKDD 2007]
Take-Home Messages
• A promising exploratory data analysis paradigm:
– Can use models to identify interesting subsets
– Concentrate only on subsets in cube space: those are meaningful subsets, and tractable
– Precompute results and provide the users with an interactive tool
• A simple way to plug "something" into cube-style analysis: try to describe/approximate the "something" by a distributive or algebraic function
Conclusion
• Hierarchies are widely used, and a promising tool to help us deal with:
– Data sparsity
– Data imprecision and uncertainty
– Exploratory analysis
– "Experiment" planning and management
• The area is as yet under-appreciated
– There is lots of work on taxonomies and how to use them, but there are many novel ways of using them that have not received enough attention