Data Mining and Machine Learning
DWML, 2007 1/46
What is Data Mining?
Definitions
• Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [Frawley, Piatetsky-Shapiro, Matheus 1991].
• Data Mining is a step in the KDD process consisting of applying computational techniques that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [Fayyad, Piatetsky-Shapiro, Smyth 1996].
• Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [Hand, Mannila, Smyth 2001].
• The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience [Mitchell, 1997].
Data Mining vs. Machine Learning
• Different roots: information extraction vs. intelligent machines
• Today very large overlap of techniques and applications
• Some remaining differences: emphasis on large datasets (DM), theoretical analysis of learnability (ML), . . .
• For this course: Data Mining ≈ Machine Learning
DWML, 2007 3/46
What is Data Mining?
Data Mining in practice
[Diagram: real-life data (preprocess) meets an off-the-shelf algorithm (adapt); evaluate + iterate. The data side involves data/domain-specific operations, the algorithm side general algorithmic methods.]
DWML, 2007 4/46
The CRISP model
Background
Developed by a four-member consortium in an EU project. Members of the consortium:
• Teradata (NCR)
• SPSS (statistical software)
• DaimlerChrysler
• OHRA (Insurance and Banking)

The consortium was supported by a special interest group composed of over 300 organizations involved in data mining projects.
Aim
From http://www.crisp-dm.org/:
The CRISP-DM project has developed an industry- and tool-neutral Data Mining process model. [. . . ] this project defined and validated a data mining process that is applicable in diverse industry sectors. This will make large data mining projects faster, cheaper, more reliable and more manageable. Even small scale data mining investigations will benefit from using CRISP-DM.
DWML, 2007 5/46
The CRISP model
Phases of the CRISP DM Process Model
(Illustration from www.crisp-dm.org)
DWML, 2007 6/46
The CRISP model
Business/Data understanding
• Vision: Data Mining extracts whatever interesting hidden information there is in the data
• Reality: Data Mining techniques solve several types of well-defined tasks
• Reality: The data used must support the task at hand
• Reality: The data miner must understand the background of the data, in order to select an appropriate data mining technique
DWML, 2007 7/46
The CRISP model
Our Focus
DWML, 2007 8/46
The CRISP model
Selecting the Modeling Technique
[Diagram: the universe of techniques is narrowed down to the techniques appropriate for the problem, subject to constraints (time, data characteristics, staff training/knowledge), political requirements (management, understandability), and the tool(s) selected (defined by tool).]
DWML, 2007 9/46
Types of Tasks and Models
Prediction (Supervised Learning)
• Task: predict some (unobserved) target variable based on observed values of attribute variables
  - Regression (Larose: Estimation), if target is continuous
  - Classification, if target is discrete
• Models e.g.: Decision trees, Neural networks, Bayesian (classification) networks, . . .
Clustering
• Task: identify coherent subgroups in data
• Models e.g.: k-means, hierarchical clustering, . . .
Association analysis
• Task: identify patterns of co-occurrence of attribute values
• Models: Apriori and extensions
Visualization (Exploratory Data Analysis)
• Task: find intelligible visualization of relevant data properties
• Models: Graphs, plots, . . .
DWML, 2007 10/46
Examples: Regression
Nutritional rating of cereals
Data: nutritional information and ratings for 77 cereals.
Task: find best linear approximation of the dependency of rating on sugars.
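Such a fit is a one-line least-squares problem; the sketch below uses made-up sugar/rating pairs (not the actual cereal data) just to show the shape of the computation.

```python
import numpy as np

# Hypothetical (sugars, rating) pairs standing in for the 77-cereal dataset.
sugars = np.array([1.0, 3.0, 6.0, 9.0, 12.0, 15.0])
rating = np.array([68.0, 59.0, 47.0, 40.0, 34.0, 29.0])

# Least-squares fit of rating ≈ a * sugars + b.
a, b = np.polyfit(sugars, rating, deg=1)
print(f"rating ≈ {a:.2f} * sugars + {b:.2f}")
```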
DWML, 2007 11/46
Examples: Classification
Text Categorization
The Association for Computing Machinery (ACM) maintains a subject classification scheme for computer science research papers. Part of the subject hierarchy (1998 version):

I. Computing Methodologies
  I.2 Artificial Intelligence
    I.2.6 Learning
      - Analogies
      - Concept learning
      - Connectionism and neural nets
      - Induction
      - Knowledge acquisition
      - Language acquisition
      - Parameter learning
Papers are manually classified by authors or editors.
Data: collection of classified papers (full text or abstracts).
Task: build a classifier that automatically assigns a subject index to new, unclassified papers.
DWML, 2007 12/46
Examples: Classification
Spam Filtering
Spam filtering in Mozilla: the user trains the mail reader to recognize spam by manually labeling incoming mails as spam/no spam.

Data: collection of user-classified emails (full text).
Task: build a classifier that automatically categorizes an incoming email as spam/no spam.
DWML, 2007 13/46
Examples: Classification
Character Recognition
Example of a Pattern Recognition problem (pattern recognition is an older discipline than data mining, but now can also be seen as a sub-area of data mining):

Data: collection of handwritten characters, correctly labeled.
Task: build a classifier that identifies new handwritten characters.
DWML, 2007 14/46
Examples: Classification
Credit Rating
From existing customer data, predict whether a person applying for a new loan will repay or default on the loan.

Data: existing customer records with attributes like age, employment type, income, . . . and information on payback history.
Task: build a classifier that predicts whether a new customer will repay the loan.
DWML, 2007 15/46
Examples: Clustering
Text Categorization
Web mining: automatically detect similarity between web pages (e.g. to support search engines or automatic construction of internet directories).

Data: the WWW.
Task: construct a (similarity) model for pages on the WWW.
DWML, 2007 16/46
Examples: Clustering
Bioinformatics: Phylogenetic Trees
From biological data construct a model of evolution.
[Phylogenetic tree over: Homo Sapiens, Pan Troglodytes, Rattus Norvegicus, Bacillus Subtilis, Bacillus Halodurans, Caulobacter Crescentus, Lactococcus Lactis]

Data: e.g. genome sequences of different animal species.
Task: construct a hierarchical model of similarity between the species.
DWML, 2007 17/46
Examples: Association Analysis
Association Rules
Data: transaction data

Transaction  Items bought
1            Beer, Soap, Milk, Butter
2            Beer, Chips, Butter
3            Milk, Spaghetti, Butter, Tomatos
. . .        . . .
Task: infer association rules
{Beer} ⇒ {Chips}
{Spaghetti, Tomatos} ⇒ {Wine}
. . .
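As a minimal illustration of how such rules are scored, the sketch below computes the support and confidence of a candidate rule over the toy transactions above; the rule-generation part of Apriori is not shown.

```python
# Toy transactions from the table above.
transactions = [
    {"Beer", "Soap", "Milk", "Butter"},
    {"Beer", "Chips", "Butter"},
    {"Milk", "Spaghetti", "Butter", "Tomatos"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent ∪ consequent divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Beer"}, transactions))                # ≈ 0.67: Beer occurs in 2 of 3 transactions
print(confidence({"Beer"}, {"Chips"}, transactions))  # 0.5: half of the Beer transactions contain Chips
```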
DWML, 2007 18/46
Tools
WEKA
• Free open source Java toolbox (www.cs.waikato.ac.nz/ml/weka/)
• Many methods, good interface
Clementine
• Commercial system, Windows only
• Many methods, good interface, integrated use of MS SQL server

For all toolboxes: easy use of methods can be dangerous – correct interpretation of results requires understanding of methods. Documentation essential (and often a weak point...)!
DWML, 2007 19/46
Classification and Decision Trees
DWML, 2007 20/46
Classification
A high-level view
[Figure: a spam classifier maps attribute values SubAllCap (yes/no), TrustSend (yes/no), InvRet (yes/no), Body 'adult' (yes/no), Body 'zambia' (yes/no) to the class Spam (yes/no).]

[Figure: a character classifier maps pixel attributes Cell-1, Cell-2, Cell-3, . . . , Cell-324 (each with values 1..64) to the class Symbol (A..Z, 0..9).]

DWML, 2007 21/46
Classification
Labeled Data

Rows are instances (cases, examples); columns are attributes (features, predictor variables) plus the class variable (target variable).

SubAllCap  TrustSend  InvRet  . . .  B'zambia'  Spam
y          n          n       . . .  n          y
n          n          n       . . .  n          n
n          y          n       . . .  n          y
n          n          n       . . .  n          n
. . .

Cell-1  Cell-2  Cell-3  . . .  Cell-324  Symbol
1       1       4       . . .  12        B
1       1       1       . . .  3         1
34      37      43      . . .  22        Z
1       1       1       . . .  7         0
. . .

(In principle, any attribute can become the designated class variable)
DWML, 2007 22/46
Classification
Attribute Types
Each attribute (including the class variable) has associated with it a set of possible values or states. E.g.

States(A) = {yes, no}
States(A) = {red, blue, green}
States(A) = {010100, 020100, . . . , 311299}
States(A) = R

States(A) finite: A is called discrete
States(A) = R: A is called continuous or numeric
States(A) = N: A can be interpreted as continuous (N ⊂ R), or made discrete by replacing N e.g. with {1, 2, . . . , 100, > 100} (few data mining methods are specifically adapted to integer valued attributes).
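A discretization of this kind is a trivial mapping; a sketch with the bins from the example above:

```python
def discretize(n):
    """Map a natural number to one of the states {1, 2, ..., 100, '>100'}."""
    return n if n <= 100 else ">100"

print([discretize(v) for v in (3, 100, 250)])   # [3, 100, '>100']
```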
DWML, 2007 23/46
Classification
Complete/Incomplete Data
Name            Gender  DoB     Income  Customer since  Last Purchase
Thomas Jensen   m       050367  190000  010397          250504
Jens Nielsen    m       171072  250000  051103          040204
Lene Hansen     f       021159  140000  300300          250105
Ulla Sørensen   f       220879  210000  180998          031099
. . .

Name            Gender  DoB     Income  Customer since  Last Purchase
Thomas Jensen   m       050367  190000  010397          250504
Jens Nielsen    m       ?       ?       051103          040204
Lene Hansen     f       021159  ?       300300          250105
Ulla Sørensen   f       ?       ?       180998          031099
. . .
DWML, 2007 24/46
Classification
Classification
Classification data in general:
Attributes : Variables A1, A2, . . . , An (discrete or continuous).
Class variable : Variable C. Always discrete: States(C) = {c1, . . . , cl} (set of class labels)
A (complete data) Classifier is a mapping
C : States(A1, . . . , An) → States(C).
A classifier able to handle incomplete data provides mappings
C : States(Ai1, . . . , Aik) → States(C)

for subsets {Ai1, . . . , Aik} of {A1, . . . , An}.
A classifier partitions the attribute-value space (also: instance space) into subsets labelled with class labels.
DWML, 2007 25/46
Classification
Iris dataset
[Illustration: iris flower with sepal length (SL), sepal width (SW), petal length (PL), petal width (PW)]

Measurement of petal width/length and sepal width/length for 150 flowers of 3 different species of Iris.

First reported in: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annual Eugenics, 7 (1936).

Attributes                Class variable
SL    SW    PL    PW      Species
5.1   3.5   1.4   0.2     Setosa
4.9   3.0   1.4   0.2     Setosa
6.3   2.9   6.0   2.1     Virginica
6.3   2.5   4.9   1.5     Versicolor
. . .
DWML, 2007 26/46
Classification
Labeled data in instance space:
[Scatter plot of the Iris data in instance space, with classes Setosa, Versicolor and Virginica, and the partition defined by a classifier.]
DWML, 2007 27/46
Classification
Decision Regions
Axis-parallel linear: e.g. Decision Trees
Piecewise linear: e.g. Naive Bayes
Nonlinear: e.g. Neural Network

DWML, 2007 28/46
Classification
Classifiers differ in . . .
• Model space: types of partitions and their representation.
• How they compute the class label corresponding to a point in instance space (the actual classification task).
• How they are learned from data.
Some important types of classifiers:
• Decision trees
• Naive Bayes classifier
• Other probabilistic classifiers (TAN, . . . )
• Neural networks
• K-nearest neighbors
DWML, 2007 29/46
Decision Trees
Example
Attributes: height ∈ [0, 2.5], sex ∈ {m, f}. Class labels: {tall, short}.
[Left: partition of the instance space {m, f} × [0, 2.5]. Right: the corresponding decision tree:
sex = m: height ≥ 1.8 → tall, height < 1.8 → short
sex = f: height ≥ 1.7 → tall, height < 1.7 → short]
DWML, 2007 30/46
Decision Trees
A decision tree is a tree
- whose internal nodes are labeled with attributes
- whose leaves are labeled with class labels
- edges going out from node labeled with attribute A are labeled with subsets of States(A), such that all labels combined form a partition of States(A).
Possible partitions e.g.:
States(A) = R : [−∞, 2.3[, [2.3, ∞]  or  [−∞, 1.9[, [1.9, 3.5[, [3.5, ∞]
States(A) = {a, b, c} : {a}, {b}, {c}  or  {a, b}, {c}
DWML, 2007 31/46
Decision Trees
Decision tree classification
Each point in the instance space is sorted into a leaf by the decision tree. It is classified according to the class label at that leaf.
[Figure: the instance [m, 1.85] is sorted down the tree (sex = m, height ≥ 1.8) and ends up in the leaf labeled tall.]

C([m, 1.85]) = tall
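One way to make this concrete is to represent the example tree as nested dictionaries and sort an instance down it; this representation is only an illustration, not something prescribed by the slides.

```python
def classify(node, instance):
    """Sort an instance down the tree; leaves are plain class labels."""
    while isinstance(node, dict):
        value = instance[node["attr"]]
        if "threshold" in node:                       # numeric split
            branch = ">=" if value >= node["threshold"] else "<"
        else:                                         # discrete split
            branch = value
        node = node["children"][branch]
    return node

# The height/sex example tree.
tree = {"attr": "sex", "children": {
    "m": {"attr": "height", "threshold": 1.8, "children": {">=": "tall", "<": "short"}},
    "f": {"attr": "height", "threshold": 1.7, "children": {">=": "tall", "<": "short"}},
}}

print(classify(tree, {"sex": "m", "height": 1.85}))   # -> tall
```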
DWML, 2007 32/46
Decision Trees
How to learn a decision tree?
Given a dataset:
Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    High     Medium  25               Bad
4    Medium   Medium  50               Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      Low     25               Bad
8    Medium   Medium  75               Good

We want to build a decision tree that
• is small
• has high classification accuracy
DWML, 2007 33/46
Decision Trees
Some simple candidate trees:
Savings (L / M / H): leaves {2,5,7}, {1,4,8}, {3,6} with class counts G:1 B:2, G:3 B:0, G:1 B:1
Assets (L / M / H): leaves {2,7}, {3,4,5,8}, {1,6} with class counts G:0 B:2, G:3 B:1, G:2 B:0
Income (≤ 50 / > 50): leaves {2,3,4,6,7}, {1,5,8} with class counts G:2 B:3, G:3 B:0
Income (≤ 25 / > 25): leaves {3,6,7}, {1,2,4,5,8} with class counts G:1 B:2, G:4 B:1
DWML, 2007 34/46
Decision Trees
How accurate are these trees? Accurate trees: pure class label distributions at the leaves:
Pure leaf distributions: (3, 0), (0, 2), (2, 0). Impure leaf distributions: (1, 1), (2, 2), (2, 3), (3, 1), (1, 2).
Entropy
A measure of impurity: for S = (x1, x2, . . . , xn) with x = x1 + · · · + xn:

Entropy(S) = − Σ_{i=1}^{n} (xi / x) · log2(xi / x)

Entropy(2, 0) = Entropy(0, 2) = Entropy(3, 0) = −(1 · log2(1) + 0 · log2(0)) = 0 + 0 = 0
Entropy(3, 1) = −(0.75 · log2(0.75) + 0.25 · log2(0.25)) = 0.311 + 0.5 = 0.811
Entropy(2, 3) = −(0.4 · log2(0.4) + 0.6 · log2(0.6)) = 0.528 + 0.442 = 0.97
Entropy(2, 2) = Entropy(1, 1) = −(0.5 · log2(0.5) + 0.5 · log2(0.5)) = 0.5 + 0.5 = 1.0
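The values above are easy to reproduce; a minimal sketch of the entropy of a count vector:

```python
from math import log2

def entropy(*counts):
    """Entropy of a class-label distribution given as absolute counts
    (using the convention 0 · log2(0) = 0)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy(3, 0))   # 0.0
print(entropy(3, 1))   # 0.811...
print(entropy(2, 3))   # 0.971...
print(entropy(2, 2))   # 1.0
```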
DWML, 2007 35/46
Decision Trees
Information Gain
Example: 12 cases with class distribution (9, 3), and two candidate split attributes:

A (true / false): leaf distributions (8, 2) and (1, 1), with entropies 0.722 and 1.0
B (L / M / H): leaf distributions (5, 1), (2, 2) and (2, 0), with entropies 0.65, 1.0 and 0.0

Expected Entropy:  A: (10/12) · 0.722 + (2/12) · 1.0 = 0.768
                   B: (2/12) · 0.0 + (6/12) · 0.65 + (4/12) · 1.0 = 0.658

Data Entropy: Entropy(9, 3) = 0.811

Information Gain:  A: 0.811 − 0.768 = 0.043
                   B: 0.811 − 0.658 = 0.153
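The expected entropies and gains follow directly from the leaf distributions; a self-contained sketch:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, leaves):
    """Entropy of the parent distribution minus the size-weighted
    average entropy of the leaf distributions."""
    total = sum(parent)
    expected = sum(sum(leaf) / total * entropy(leaf) for leaf in leaves)
    return entropy(parent) - expected

# Attribute A: leaves (8,2), (1,1); attribute B: leaves (5,1), (2,2), (2,0).
print(information_gain((9, 3), [(8, 2), (1, 1)]))          # ≈ 0.043
print(information_gain((9, 3), [(5, 1), (2, 2), (2, 0)]))  # ≈ 0.153
```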
DWML, 2007 36/46
Decision Trees
Expected entropies:
Savings (L / M / H): leaf distributions (1,2), (3,0), (1,1); expected entropy (3/8) · 0.918 + (3/8) · 0.0 + (2/8) · 1.0 = 0.594
Assets (L / M / H): leaf distributions (0,2), (3,1), (2,0); expected entropy (2/8) · 0.0 + (4/8) · 0.811 + (2/8) · 0.0 = 0.405
Income (≤ 50 / > 50): leaf distributions (2,3), (3,0); expected entropy (5/8) · 0.97 + (3/8) · 0.0 = 0.606
Income (≤ 25 / > 25): leaf distributions (1,2), (4,1); expected entropy (3/8) · 0.918 + (5/8) · 0.722 = 0.795

Information gains are Entropy(5, 3) = 0.954 minus expected entropies.
DWML, 2007 37/46
Decision Trees
After the second (and final) ID3 iteration:
Cases sorted into leaves: Assets = L → {2, 7} (G:0, B:2); Assets = H → {1, 6} (G:2, B:0); Assets = M → split on Savings: Savings = L → {5} (G:1, B:0), Savings = M → {4, 8} (G:2, B:0), Savings = H → {3} (G:0, B:1).

Resulting tree: Assets = L → bad, Assets = H → good; Assets = M → Savings: Savings = L → good, Savings = M → good, Savings = H → bad.

DWML, 2007 38/46
Decision Trees
ID3 algorithm for decision tree learning
• Determine attribute A with highest information gain (for continuous attributes: also determine split-value)
• Construct decision tree with root A, and one leaf for each value of A (two leaves if A is continuous)
• For a non-pure leaf L: determine attribute B with highest information gain for the data sorted into L.
• Replace L with a subtree consisting of root B and one leaf for each value of B (two leaves if B is continuous)
• Continue until all leaves are pure, or some other termination condition applies (e.g.: possible information gains below a given threshold)
• Label each leaf with the class label that is most frequent among the data sorted into the leaf
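A compact sketch of this procedure, for discrete attributes only (no continuous splits and no extra stopping criterion); the attribute and class names follow the credit-risk example, with Income left out to keep the split handling purely discrete.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    expected = 0.0
    for value in {r[attr] for r in rows}:
        sub = [r[target] for r in rows if r[attr] == value]
        expected += len(sub) / len(rows) * entropy(sub)
    return entropy([r[target] for r in rows]) - expected

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:          # pure leaf, or no attribute left
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    children = {}
    for value in {r[best] for r in rows}:
        sub = [r for r in rows if r[best] == value]
        children[value] = id3(sub, [a for a in attrs if a != best], target)
    return {"attr": best, "children": children, "label": majority}

data = [  # the 8 training cases from the slide (Savings, Assets, Risk)
    {"Savings": "M", "Assets": "H", "Risk": "Good"},
    {"Savings": "L", "Assets": "L", "Risk": "Bad"},
    {"Savings": "H", "Assets": "M", "Risk": "Bad"},
    {"Savings": "M", "Assets": "M", "Risk": "Good"},
    {"Savings": "L", "Assets": "M", "Risk": "Good"},
    {"Savings": "H", "Assets": "H", "Risk": "Good"},
    {"Savings": "L", "Assets": "L", "Risk": "Bad"},
    {"Savings": "M", "Assets": "M", "Risk": "Good"},
]
print(id3(data, ["Savings", "Assets"], "Risk"))  # splits on Assets first, then Savings
```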
DWML, 2007 39/46
Decision Trees
Pros and Cons
+ Easy to interpret.
+ Efficient learning methods.
- Difficulties with handling missing data.
DWML, 2007 40/46
Overfitting
The problem
The learned tree: Assets = L → bad, Assets = H → good; Assets = M → Savings: Savings = L → good, Savings = M → good, Savings = H → bad.
Predictions made by the learned model:
Assets=M, Savings=M ⇒ Risk=good
Assets=M, Savings=H ⇒ Risk=bad
• The training data contained a single case with Assets=M, Savings=H
• This case had the (uncharacteristic?) class label Risk=bad
• The model is overfitted to the training data
• With the prediction Assets=M, Savings=H ⇒ Risk=good we will likely obtain a higher accuracy on future cases

DWML, 2007 41/46
Overfitting
The general problem
• Complex models will represent properties of the training data very precisely
• The training data may contain some peculiar properties that are not representative for the domain
• The model will not perform optimally in classifying future instances
[Plot: classification error as a function of model size, for the training data and for future data.]
DWML, 2007 42/46
Overfitting
Decision Tree Pruning
To prevent overfitting, extensions of ID3 (C4.5, C5.0) add a pruning step after the tree construction:

• Data is split into training set and test set
• Decision tree is learned using training data only
• Pruning: for internal node A: replace subtree rooted at A with leaf if this reduces the classification error on the test set.
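A sketch of this kind of reduced-error pruning. It assumes a nested-dictionary tree representation (made up for this illustration) in which every node carries the majority class label of its training cases in a 'label' field, and internal nodes additionally have 'attr' and 'children'; the test cases mirror the ones shown in the example that follows.

```python
def classify(node, instance):
    """Sort an instance into a leaf and return that leaf's label."""
    while "attr" in node:
        node = node["children"][instance[node["attr"]]]
    return node["label"]

def accuracy(tree, rows, target):
    return sum(classify(tree, r) == r[target] for r in rows) / len(rows)

def prune(node, root, test_rows, target):
    """Bottom-up: collapse a subtree into its majority-label leaf whenever
    this strictly reduces the error on the test set (mutates the tree in place)."""
    if "attr" not in node:
        return
    for child in node["children"].values():
        prune(child, root, test_rows, target)
    before = accuracy(root, test_rows, target)
    attr, children = node.pop("attr"), node.pop("children")  # tentatively make it a leaf
    if accuracy(root, test_rows, target) <= before:           # no strict improvement: undo
        node["attr"], node["children"] = attr, children

# Example: the tree learned from the credit-risk training data.
tree = {"attr": "Assets", "label": "Good", "children": {
    "L": {"label": "Bad"}, "H": {"label": "Good"},
    "M": {"attr": "Savings", "label": "Good", "children": {
        "L": {"label": "Good"}, "M": {"label": "Good"}, "H": {"label": "Bad"}}},
}}
test = [{"Assets": "M", "Savings": "H", "Risk": "Good"},
        {"Assets": "M", "Savings": "L", "Risk": "Bad"},
        {"Assets": "M", "Savings": "H", "Risk": "Good"},
        {"Assets": "M", "Savings": "M", "Risk": "Good"}]
prune(tree, tree, test, "Risk")
print(tree)   # the Savings subtree collapses into a 'Good' leaf
```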
DWML, 2007 43/46
Overfitting
Example
Full tree: Assets = L → bad, Assets = H → good; Assets = M → Savings: Savings = L → good, Savings = M → good, Savings = H → bad.

Pruned tree: Assets = L → bad, Assets = M → good, Assets = H → good.

Test data (showing only cases with Assets=M):

Id.  Savings  Assets  Income  Risk
9    High     Medium  50      Good
10   Low      Medium  50      Bad
11   High     Medium  75      Good
12   Medium   Medium  50      Good

Accuracy of full tree on test data: 50%
Accuracy of pruned tree on test data: 75%
⇒ prune the Savings node.
DWML, 2007 44/46
Overfitting
Model Tuning with Test Set
[Diagram: the data is split into a training set and a test set; a model is learned on the training data and applied to the test data; the parameter setting is tuned and learning repeated; the final model is then learned from all the data.]
• Models can be adjusted or tuned (e.g. pruning subtrees, setting model parameters)
• Tuning can be an iterative process that requires repeated evaluations on the test set
• A final model is learned using all the data
• Problem: part of data “wasted” as test set
DWML, 2007 45/46
Overfitting
Cross Validation
• Partition the data into n subsets or folds (typically: n = 10).
• For each setting of the tuning parameter:
  - for i = 1 to n: learn a model using folds 1, . . . , i − 1, i + 1, . . . , n as training data; measure performance on fold i
  - model performance = average performance on the n test sets
• Choose parameter setting with best performance
• Learn final model with chosen parameter setting using the whole available data
Cross Validation is also used for final evaluation of a learned model.
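The procedure maps onto a couple of nested loops; learn and evaluate below are stand-ins for whatever learner and performance measure are being tuned (the demo uses a trivial majority-class "learner", purely for illustration).

```python
import random
from collections import Counter

def cross_validate(data, learn, evaluate, parameter_values, n_folds=10):
    """Pick the parameter setting with the best average performance over
    n folds, then learn the final model on all of the data."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]

    def cv_score(param):
        scores = []
        for i in range(n_folds):
            train = [row for j, fold in enumerate(folds) if j != i for row in fold]
            scores.append(evaluate(learn(train, param), folds[i]))
        return sum(scores) / n_folds

    best = max(parameter_values, key=cv_score)
    return best, learn(data, best)

# Demo with a stand-in majority-class learner that ignores its parameter.
data = [{"x": i, "class": "a" if i % 3 else "b"} for i in range(30)]
learn = lambda rows, param: Counter(r["class"] for r in rows).most_common(1)[0][0]
evaluate = lambda model, rows: sum(r["class"] == model for r in rows) / len(rows)
print(cross_validate(data, learn, evaluate, parameter_values=[1, 2, 3]))
```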
DWML, 2007 46/46