Estimation As A Data Mining Task
Theoretical Understanding
Vinh Ngo & Mike Ellis
Introduction
• Estimation
  – Predicting values not in predetermined categories
• Three main techniques
  – Regression
  – Decision trees
  – Neural networks
Regression
• Linear regression
  – \( Y = \alpha + \beta X \)
• Method of Least Squares (see sketch below)
  – \( \beta = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \)
  – \( \alpha = \bar{y} - \beta \bar{x} \)
• Easy technique to use
  – Excel “Data Analysis”
  – Excel chart example
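For illustration, a minimal Python sketch of the least-squares formulas above; the sample x and y values are made up:

```python
# Fit Y = alpha + beta * X by the method of least squares,
# using the formulas from the slide directly.

def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Illustrative data only (roughly y = 2x):
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha, beta = least_squares(x, y)
print(f"Y = {alpha:.3f} + {beta:.3f} X")
```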
Excel Linear Regression Example
Other Regression Forms
• Multiple regression
  – \( Y = \alpha + \beta_1 X_1 + \beta_2 X_2 \)
• Polynomial equation
  – \( Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \)
  – Define new variables \( X_1 = X,\; X_2 = X^2,\; X_3 = X^3 \)
  – Equation becomes \( Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \) (see sketch below)
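A short sketch of the variable-substitution trick, assuming NumPy is available; the cubic data is made up so the known coefficients can be recovered:

```python
import numpy as np

x = np.linspace(0.0, 3.0, 7)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * x**3   # known cubic, no noise

# Define new variables X1 = X, X2 = X^2, X3 = X^3, then solve the
# multiple regression via least squares on the design matrix [1, X1, X2, X3].
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # recovers alpha=1.0, beta1=2.0, beta2=-0.5, beta3=0.1
```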
Decision Trees
• Regression tree
  – Leaves → average values
• Model tree
  – Leaves → linear regression models
• Discretizing the data
  – Convert continuous data into discrete partitions
  – Threshold values
Threshold Values
• Entropy
  – Measure of purity
  – \( Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i \)
• Information gain
  – Expected reduction in entropy due to partitioning
  – \( Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \dfrac{|S_v|}{|S|}\, Entropy(S_v) \)
  – Maximize for best threshold (see sketch below)
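A small Python sketch of both formulas, with made-up values and class labels (the entropy is over class labels, as in the classification setting the formulas assume):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    # Gain(S, A): entropy before the split minus the weighted
    # entropy of the two partitions induced by the threshold.
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Illustrative data: a threshold of 3 separates the classes perfectly.
values = [1, 2, 3, 4, 5, 6]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
best = max(values[:-1], key=lambda t: info_gain(values, labels, t))
print(best, info_gain(values, labels, best))   # 3 1.0
```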
CART Algorithm
• Classification and Regression Tree
  – Grow a tree that overfits data
  – Prune the tree
  – Select best subtree (see sketch below)
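The grow-then-prune loop can be sketched with scikit-learn's CART-style regression tree (an assumption; the slides don't name an implementation), using synthetic data, cost-complexity pruning, and a held-out set for subtree selection:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. Grow a tree that overfits the training data.
full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# 2-3. Prune with increasing cost-complexity alphas and select the
#      subtree that scores best on held-out data.
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
best = max(
    (DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in alphas),
    key=lambda t: t.score(X_te, y_te),
)
print("leaves: full =", full.get_n_leaves(), " pruned =", best.get_n_leaves())
```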
Decision Trees
• Strengths
  – Understandable
  – Shows which fields are most important
• Weaknesses
  – Intended for discrete data
  – Time to grow and prune tree
Comparison Example
CPU Performance Data
        cycle    main memory     cache    channels      performance
        time     (Kb)            (Kb)
        (ns)     min     max              min    max
        MYCT     MMIN    MMAX    CACH     CHMIN  CHMAX   PRP
  1     125      256     6000    256      16     128     198
  2     29       8000    32000   32       8      32      269
  3     29       8000    32000   32       8      32      220
  4     29       8000    32000   32       8      32      172
  5     29       8000    16000   32       8      16      132
  …
207     125      2000    8000    0        2      14      52
208     480      512     8000    32       0      0       67
209     480      1000    4000    0        0      0       45
Linear Regression Result
PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX
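Plugging in the first machine from the table (MYCT=125, MMIN=256, MMAX=6000, CACH=256, CHMIN=16, CHMAX=128) shows how the equation is applied, and how far a single global linear model can miss an individual machine:

```python
coeffs = {"MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
          "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46}
machine_1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
             "CACH": 256, "CHMIN": 16, "CHMAX": 128}

prp = -56.1 + sum(coeffs[k] * machine_1[k] for k in coeffs)
print(round(prp, 1))   # 333.7, versus the actual PRP of 198
```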
Regression Tree

[Tree diagram: the root splits on CHMIN (≤ 7.5 vs. > 7.5); lower levels split on CACH, MMAX, MMIN, and MYCT at thresholds such as 0.5, 8.5, 28, 58, 550, 2500, 4250, 10000, 12000, and 28000. Each leaf predicts the average PRP of its partition, e.g. 19.3 (28 points), 29.8 (37 points), 64.6 (24 points), up to 783 (5 points).]
Model Tree

[Tree diagram: the root splits on CHMIN (≤ 7.5 vs. > 7.5); lower levels split on CACH (thresholds 0.5 and 8.5) and MMAX (thresholds 4250 and 28000). Each leaf holds a linear regression model, LM1 (65 points) through LM6 (23 points).]

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN − 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.012 MMIN
LM4: PRP = 19.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 − 1.46 MYCT + 1.02 CACH − 9.39 CHMIN
LM6: PRP = −65.8 + 0.03 MMIN − 2.94 CHMIN + 4.98 CHMAX
Side-By-Side
PRP = – - 56.1–+ 0.049 MYCT–+ 0.015 MMIN–+ 0.006 MMAX–+ 0.630 CACH– - 0.270 CHMIN–+ 1.46 CHMAX
CHMIN
CACH
MMAXMMAX CHMAX
MMAXCACH
MMIN
MYCT
≤ 7.5 > 7.5
64.6(24 points)
75.7(10 points)
133(16 points)
783(5 points)
29.8(37 points)
19.3(28 points)
157(21 points)
281(11 points)
492(7 points)
59.3(24 points)
37.3(19 points)
18.3(7 points)
≤ 8.5
≤ 2500
≤ 550
≤ 0.5
≤ 12000
≤ 58
≤ 28000
≤ 10000
> 28 > 28000
> 58
> 12000
> 550
> 4250
> 10000
(8.5,28]
(2500, 4250]
(0.5,8.5]
CHMIN
CACH
MMAX
MMAXCACH
≤ 7.5 > 7.5
LM4(50 points)
LM1(65 points)
LM5(21 points)
LM6(23 points)
LM3(24 points)
LM2(26 points)
≤ 8.5
≤ 4250
≤ 0.5
≤ 28000 > 28000
> 4250
> 8.5
(0.5,8.5]
LM1 PRP = 8.29 + 0.004 MMAX + 2.77 CHMINLM2 PRP = 20.3 + 0.004 MMIN – 3.99 CHMIN
+ 0.946 CHMAXLM3 PRP = 38.1 + 0.012 MMINLM4 PRP =19.5 + 0.002 MMAX + 0.698 CACH
+ 0.969 CHMAXLM5 PRP = 285 – 1.46 MYCT + 1.02 CACH
– 9.39 CHMINLM6 PRP = -65.8 + 0.03 MMIN – 2.94 CHMIN
+ 4.98 CHMAX
Linear regressionRegression tree
Model tree
Simple Neural Network

[Diagram: a feed-forward network with an input layer (Age, Gender, Income), one hidden layer, and an output layer producing the prediction; the edges carry weights such as 0.0, 0.4, 0.8, 1.9, 2.0, 3.5, and 3.6.]
Building the Neural Net
• Iterative process (see sketch below)
  – Assign initial weights
  – Run training values through network
  – Compare results to actual value
• Backpropagation
  – Pass errors back through net
  – Incorrect node gets less influence
• Military metaphor
• Recurrent networks
• Genetic algorithms
• Simulated annealing
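A minimal sketch of this loop in plain Python: one hidden layer, random initial weights, a forward pass, then backpropagation of the error. The network shape, learning rate, and training pair are illustrative, and bias terms are omitted for brevity:

```python
import math
import random

random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

n_in, n_hid = 3, 2
w_hid = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
w_out = [random.uniform(-1, 1) for _ in range(n_hid)]
lr = 0.5

# One toy training pair: three rescaled inputs (e.g. age, gender,
# income) and the actual value to predict.
x, target = [0.2, 1.0, 0.7], 0.9

for step in range(1000):
    # Run the training values through the network (forward pass).
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    out = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))

    # Compare to the actual value and pass the error back: each weight
    # is adjusted in proportion to its contribution to the error.
    d_out = (target - out) * out * (1 - out)
    d_hid = [d_out * w * hi * (1 - hi) for w, hi in zip(w_out, h)]
    w_out = [w + lr * d_out * hi for w, hi in zip(w_out, h)]
    w_hid = [[w + lr * d * xi for w, xi in zip(row, x)]
             for row, d in zip(w_hid, d_hid)]

print(round(out, 3))   # converges toward the target 0.9
```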
Neural Networks
• Strengths
  – Accurate
  – Fast to use
  – Handle missing or corrupt data well
• Weaknesses
  – Not intuitive
  – Don't handle large numbers of predictors well
  – Data preprocessing
Neural Net Example
• Four years of 30-minute trading data, 1985–1988
  – 1986 & 1987 for training
  – 1985 for testing
• USD/CHF
• Single layer model
  – Input nodes: 7
  – Hidden nodes: 7
• Two layer model
  – Input nodes: 7
  – Hidden nodes: 5/2
• Output
  – Value between −1.0 and 1.0
  – Rise or fall?
Model              % correct   Return %
Random             50.0        ----
Linear             52.5        -6.8
1 hidden layer     53.4         9.9
2 hidden layers    54.0         9.8
Average network    53.7        11.5
Accuracy of Models
• Overfitting
  – Applies to all three
  – Independent test data
• Statistical measures
Statistical Measures

mean-squared error
\( \dfrac{\sum_{i=1}^{n}(p_i - a_i)^2}{n} \)

root mean-squared error
\( \sqrt{\dfrac{\sum_{i=1}^{n}(p_i - a_i)^2}{n}} \)

mean absolute error
\( \dfrac{\sum_{i=1}^{n}|p_i - a_i|}{n} \)

relative squared error
\( \dfrac{\sum_{i=1}^{n}(p_i - a_i)^2}{\sum_{i=1}^{n}(a_i - \bar{a})^2} \), where \( \bar{a} = \dfrac{\sum_{i=1}^{n} a_i}{n} \)

root relative squared error
\( \sqrt{\dfrac{\sum_{i=1}^{n}(p_i - a_i)^2}{\sum_{i=1}^{n}(a_i - \bar{a})^2}} \)

relative absolute error
\( \dfrac{\sum_{i=1}^{n}|p_i - a_i|}{\sum_{i=1}^{n}|a_i - \bar{a}|} \)

correlation coefficient
\( \dfrac{S_{PA}}{\sqrt{S_P S_A}} \), where \( S_{PA} = \dfrac{\sum_{i=1}^{n}(p_i - \bar{p})(a_i - \bar{a})}{n-1} \), \( S_P = \dfrac{\sum_{i=1}^{n}(p_i - \bar{p})^2}{n-1} \), and \( S_A = \dfrac{\sum_{i=1}^{n}(a_i - \bar{a})^2}{n-1} \)
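The measures above translate directly into plain Python; p is the list of predicted values and a the list of actual values:

```python
import math

def mse(p, a):
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def rmse(p, a):
    return math.sqrt(mse(p, a))

def mae(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def rse(p, a):
    a_bar = sum(a) / len(a)
    return (sum((pi - ai) ** 2 for pi, ai in zip(p, a))
            / sum((ai - a_bar) ** 2 for ai in a))

def rrse(p, a):
    return math.sqrt(rse(p, a))

def rae(p, a):
    a_bar = sum(a) / len(a)
    return (sum(abs(pi - ai) for pi, ai in zip(p, a))
            / sum(abs(ai - a_bar) for ai in a))

def correlation(p, a):
    # S_PA / sqrt(S_P * S_A), each with an n-1 denominator as above.
    n = len(a)
    p_bar, a_bar = sum(p) / n, sum(a) / n
    s_pa = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - p_bar) ** 2 for pi in p) / (n - 1)
    s_a = sum((ai - a_bar) ** 2 for ai in a) / (n - 1)
    return s_pa / math.sqrt(s_p * s_a)
```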
Any questions?
Special Bonus Slide