Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining (summarization, classification, regression, association, clustering)
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining Algorithms
[Figure: taxonomy of data mining algorithms. Labels: Online Analytical Processing, Discovery Driven Methods, SQL Query Tools, Description, Prediction, Classification, Regressions, Decision Trees, Neural Networks, Visualization, Clustering, Association, Sequential Analysis]
Day  Outlook   Temp  Humidity  Wind    Play Tennis
1    Sunny     Hot   High      Weak    No
2    Sunny     Hot   High      Strong  No
3    Overcast  Hot   High      Weak    Yes
4    Rain      Mild  High      Weak    Yes
5    Rain      Cool  Normal    Weak    Yes
6    Rain      Cool  Normal    Strong  No
7    Overcast  Cool  Normal    Weak    Yes
8    Sunny     Mild  High      Weak    No
9    Sunny     Cool  Normal    Weak    Yes
10   Rain      Mild  Normal    Strong  Yes
11   Sunny     Mild  Normal    Strong  Yes
12   Overcast  Mild  High      Strong  Yes
13   Overcast  Hot   Normal    Weak    Yes
14   Rain      Mild  High      Strong  No
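[Figure: decision tree. Root node: Outlook. The Overcast branch leads directly to Yes; the other two branches lead to the nodes Humidity and Wind, each with a Yes leaf and a No leaf]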
[Figure: a chain of binary questions, Question 1 to Question 4, each with a Yes branch and a No branch]
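$$E(S) = -p_{+}\log_2(p_{+}) - p_{-}\log_2(p_{-})$$
$$-\left(\tfrac{8}{8}\right)\log_2\left(\tfrac{8}{8}\right) - \left(\tfrac{0}{8}\right)\log_2\left(\tfrac{0}{8}\right) = 0$$
$$-\left(\tfrac{7}{8}\right)\log_2\left(\tfrac{7}{8}\right) - \left(\tfrac{1}{8}\right)\log_2\left(\tfrac{1}{8}\right) = 0.54$$
$$-\left(\tfrac{6}{8}\right)\log_2\left(\tfrac{6}{8}\right) - \left(\tfrac{2}{8}\right)\log_2\left(\tfrac{2}{8}\right) = 0.81$$
$$-\left(\tfrac{5}{8}\right)\log_2\left(\tfrac{5}{8}\right) - \left(\tfrac{3}{8}\right)\log_2\left(\tfrac{3}{8}\right) = 0.95$$
$$-\left(\tfrac{4}{8}\right)\log_2\left(\tfrac{4}{8}\right) - \left(\tfrac{4}{8}\right)\log_2\left(\tfrac{4}{8}\right) = 1$$
$$-\left(\tfrac{3}{8}\right)\log_2\left(\tfrac{3}{8}\right) - \left(\tfrac{5}{8}\right)\log_2\left(\tfrac{5}{8}\right) = 0.95$$
$$-\left(\tfrac{2}{8}\right)\log_2\left(\tfrac{2}{8}\right) - \left(\tfrac{6}{8}\right)\log_2\left(\tfrac{6}{8}\right) = 0.81$$
$$-\left(\tfrac{1}{8}\right)\log_2\left(\tfrac{1}{8}\right) - \left(\tfrac{7}{8}\right)\log_2\left(\tfrac{7}{8}\right) = 0.54$$
$$-\left(\tfrac{0}{8}\right)\log_2\left(\tfrac{0}{8}\right) - \left(\tfrac{8}{8}\right)\log_2\left(\tfrac{8}{8}\right) = 0$$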
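The entropy values for a set of 8 examples can be reproduced with a few lines of Python. This is a minimal sketch: the function name `binary_entropy` is an illustrative choice, and the code assumes the usual convention that a class with zero examples contributes 0 to the sum (0 · log2 0 is taken as 0).

```python
import math

def binary_entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    entropy = 0.0
    for count in (pos, neg):
        if count:  # an empty class contributes nothing (0 * log2(0) taken as 0)
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Walk through every positive/negative split of 8 examples.
for pos in range(8, -1, -1):
    print(f"{pos}/8 positive: E = {binary_entropy(pos, 8 - pos):.2f}")
```

The printed values rise from 0 for a pure set to 1 for an even 4/4 split and fall back to 0, which is why entropy serves as a measure of impurity.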
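$$y = -\left(\frac{1}{10}\right)\log_4\left(\frac{1}{10}\right) - \left(\frac{3}{10}\right)\log_4\left(\frac{3}{10}\right) - \left(\frac{2}{10}\right)\log_4\left(\frac{2}{10}\right) - \left(\frac{4}{10}\right)\log_4\left(\frac{4}{10}\right)$$
$$y = -\sum_{i=1}^{k} p_i \log_k(p_i)$$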
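For more than two classes, the same sum can be evaluated with the logarithm taken in base k, so that a uniform distribution over k classes has entropy 1. The sketch below is one way to compute it; the name `entropy_base_k` is hypothetical, and the code assumes k is at least 2 and skips zero probabilities.

```python
import math

def entropy_base_k(probs):
    """Entropy of a distribution over k = len(probs) classes, using log base k."""
    k = len(probs)  # assumed to be at least 2, otherwise the base is invalid
    return -sum(p * math.log(p, k) for p in probs if p > 0)

# The four-class example: proportions 1/10, 3/10, 2/10 and 4/10, log base 4.
y = entropy_base_k([0.1, 0.3, 0.2, 0.4])
print(f"y = {y:.3f}")  # approximately 0.923
```

With this normalization the result always lies between 0 (one class only) and 1 (all k classes equally likely), whatever the number of classes.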
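[Figure: two candidate splits of a set S of 16 examples with E = 1. Question 1 gives branches of 10 and 6 examples with E = 0.97 and E = 0.92; Question 2 gives branches of 10 and 6 examples with E = 0.72 and E = 0]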
$$G(S,Q) = E(S) - \sum_{i=1}^{k} p_i\,E(S,Q_i)$$
Information Gain
$$G(S,Q_1) = 1 - \left(\frac{10}{16}\right)\times 0.97 - \left(\frac{6}{16}\right)\times 0.92 = 0.049$$
$$G(S,Q_2) = 1 - \left(\frac{10}{16}\right)\times 0.72 - \left(\frac{6}{16}\right)\times 0 = 0.55$$
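The two gains can be checked directly from the branch sizes and branch entropies. In this sketch the helper `information_gain` is an illustrative name; it takes the parent entropy and a list of (branch size, branch entropy) pairs, and uses the rounded entropy values quoted for the two questions.

```python
def information_gain(parent_entropy, branches):
    """Gain of a split; `branches` is a list of (sample_count, entropy) pairs."""
    total = sum(count for count, _ in branches)
    # Weighted average of the branch entropies, weights p_i = count / total.
    remainder = sum(count / total * entropy for count, entropy in branches)
    return parent_entropy - remainder

g_q1 = information_gain(1.0, [(10, 0.97), (6, 0.92)])
g_q2 = information_gain(1.0, [(10, 0.72), (6, 0.0)])
print(f"G(S,Q1) = {g_q1:.3f}")
print(f"G(S,Q2) = {g_q2:.2f}")
```

Question 2 removes far more uncertainty than Question 1, so it is the better question to ask first.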
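Wind
The full set S has 9 Yes and 5 No examples, so
$$E(S) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$
[Figure: split of S on Wind. The Weak branch holds 8 examples with E = 0.811; the Strong branch holds 6 examples with E = 1]
$$G(S,\text{Wind}) = 0.940 - \frac{8}{14}\times 0.811 - \frac{6}{14}\times 1 = 0.048$$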
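Humidity
[Figure: split of S on Humidity, with E(S) = 0.940. The High branch holds 7 examples with E = 0.985; the Normal branch holds 7 examples with E = 0.592]
$$G(S,\text{Humidity}) = 0.940 - \frac{7}{14}\times 0.985 - \frac{7}{14}\times 0.592 = 0.151$$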
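Temp
[Figure: split of S on Temp, with E(S) = 0.940. The Hot branch holds 4 examples with E = 1; the Mild branch holds 6 examples with E = 0.92; the Cool branch holds 4 examples with E = 0.81]
$$G(S,\text{Temp}) = 0.940 - \frac{4}{14}\times 1 - \frac{6}{14}\times 0.92 - \frac{4}{14}\times 0.81 = 0.029$$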
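Outlook
[Figure: split of S on Outlook, with E(S) = 0.940. The Sunny and Rain branches each hold 5 examples with E = 0.971; the Overcast branch holds 4 examples with E = 0]
$$G(S,\text{Outlook}) = 0.940 - \frac{5}{14}\times 0.971 - \frac{4}{14}\times 0 - \frac{5}{14}\times 0.971 = 0.247$$
G(S,Wind) = 0.048
G(S,Humidity) = 0.151
G(S,Temp) = 0.029
Outlook has the largest gain, so it is chosen as the root of the tree.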
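The four gains can be recomputed from the 14 rows of the Play Tennis table. The sketch below copies the rows as printed in the table; the names `ROWS`, `entropy` and `gain` are illustrative choices, not part of the slides.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "Play")
ROWS = [  # days 1 to 14, copied from the table
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, attribute):
    """Information gain obtained by splitting `rows` on `attribute`."""
    idx = COLUMNS.index(attribute)
    remainder = 0.0
    for value in {row[idx] for row in rows}:
        subset = [row[-1] for row in rows if row[idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[-1] for row in rows]) - remainder

gains = {attribute: gain(ROWS, attribute) for attribute in COLUMNS[:-1]}
best = max(gains, key=gains.get)
for attribute, value in gains.items():
    print(f"G(S, {attribute}) = {value:.3f}")
print("Best split:", best)
```

Because the code keeps full precision in the intermediate entropies, the Humidity gain prints as 0.152 rather than the 0.151 obtained from entropies rounded to three decimals; the ranking of the attributes is unaffected.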
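[Figure: resulting decision tree. Root node: Outlook. The Overcast branch leads directly to Yes; the other two branches lead to the nodes Humidity and Wind, each with a Yes leaf and a No leaf]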
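A tree of this shape can be stored as nested dictionaries and used to classify a day by following one branch per question. This is only a sketch: the branch labels are not legible in the figure, so the assignment below (Sunny to Humidity, Rain to Wind, High and Strong to No) is an assumption based on the usual form of this example, and the names `TREE` and `classify` are hypothetical.

```python
# Assumed branch labels; only the node names and the Overcast -> Yes leaf
# are taken from the figure itself.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Follow the branches matching `example` until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))  # each internal node tests one attribute
        tree = tree[attribute][example[attribute]]
    return tree

day1 = {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(TREE, day1))  # Day 1 is labelled "No" in the table
```

Under these assumed labels, Day 10 of the table as printed (Rain, Strong, Yes) would be classified as No, so the tree is best read as an illustration of the structure rather than an exact fit to all 14 rows.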