Knowledge Discovery via Data Mining

Enrico Tronci, Dipartimento di Informatica, Università di Roma "La Sapienza", Via Salaria 113,
00198 Roma, Italy, [email protected], http://www.dsi.uniroma1.it/~tronci

Workshop ENEA: I Sistemi di Supporto alle Decisioni (Decision Support Systems), Centro Ricerche ENEA Casaccia, Roma, October 28, 2003
Data Mining
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
A data miner is a computer program that sifts through data seeking regularities or patterns.
Obstructions: noise and computational complexity.
Some Applications
• Decisions involving judgments, e.g. loans.
• Screening images. Example: detection of oil slicks from satellite images, warning of ecological disasters, illegal dumping.
• Load forecasting in the electricity supply industry.
• Diagnosis, e.g. for preventive maintenance of electromechanical devices.
• Marketing and sales. … On Thursday customers often purchase beer and diapers together …
• Stock market analysis.
• Anomaly detection.
Data
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lens
young           myope                    no            reduced                none
young           myope                    no            normal                 soft
young           myope                    yes           reduced                none
young           myope                    yes           normal                 hard
young           hypermetrope             no            reduced                none
young           hypermetrope             no            normal                 soft
young           hypermetrope             yes           reduced                none
young           hypermetrope             yes           normal                 hard
pre-presbyopic  myope                    no            reduced                none
pre-presbyopic  myope                    no            normal                 soft
. . .           . . .                    . . .         . . .                  . . .

The first four columns are the attributes, the last column (recommended lens) is the goal, and each row is an instance.
Classification
Assume instances have n attributes A1, …, An-1, An. Let attribute An be our goal. A classifier is a function f from (A1 x … x An-1) to An. That is, f looks at the values of the first (n - 1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance w.r.t. the goal attribute.

The problem of computing a classifier from a set of instances is called the classification problem.

Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance.

Note that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.
Clustering

Assume instances have n attributes A1, …, An.

A clustering function is a function f from the set (A1 x … x An) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes.

The problem of computing a clustering function from our data set is called the clustering problem.

Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance.

Note that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning.
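As a concrete illustration of a clustering function (not part of the original slides), here is a minimal k-means sketch in Python; the 2-D points and the choice of k are assumptions made purely for the example.

```python
import random

def kmeans(points, k, iters=20):
    """Split 2-D points into k clusters by repeatedly assigning each
    point to its nearest centroid and recomputing the centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid closest to p (squared Euclidean distance)
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # move each centroid to the mean of its cluster (keep it if empty)
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters
```

The returned clusters play the role of the "small number of classes" above: the function maps every possible instance to a cluster index, including instances not in the original data set.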
In the following we will focus on classification.
Rules for Contact Lens Data (an example of classification)

if (<tear production rate> = <reduced>) then <recommendation> = <none>;

if (<age> = <young> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

if (<age> = <pre-presbyopic> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

. . . .

Attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column.

A possible way of defining a classifier is by using a set of rules as above.
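A rule set like the one above translates directly into code. The following sketch (an illustration, covering only the rules shown on this slide plus the rows visible in the data table) implements such a rule-based classifier in Python:

```python
def classify_lens(age, prescription, astigmatism, tear_rate):
    """Rule-based classifier for the contact lens data.
    Partial rule set: it mirrors only the rules listed on this slide,
    so unmatched instances fall through to "unknown"."""
    if tear_rate == "reduced":
        return "none"
    if astigmatism == "no" and tear_rate == "normal":
        return "soft"
    # consistent with the young/astigmatic/normal rows of the data table
    if astigmatism == "yes" and tear_rate == "normal":
        return "hard"
    return "unknown"
```

Note that the function accepts any combination of attribute values, not only those present in the data set, which is exactly what makes a rule set a classifier.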
Labor Negotiations Data

Attribute                  Type                    1      2      3      . . .  40
Duration                   years                   1      2      3      . . .  2
Wage increase first year   percentage              2%     4%     4.3%   . . .  4.5%
Wage increase second year  percentage              ?      ?      ?      . . .  ?
Working hours per week     number of hours         28     35     38     . . .  40
Pension                    {none, r, c}            none   ?      ?      . . .  ?
Education allowance        {yes, no}               yes    ?      ?      . . .  ?
Statutory holidays         number of days          11     15     12     . . .  12
Vacation                   {below-avg, avg, gen}   avg    gen    gen    . . .  avg
. . .                      . . .                   . . .  . . .  . . .  . . .  . . .
Acceptability of contract  {good, bad}             bad    good   good   . . .  good
Classification using Decision Trees (The Labor Negotiations Data Example (1))

Wage increase first year
├─ <= 2.5: bad
└─ > 2.5: Statutory holidays
          ├─ <= 10: Wage increase first year
          │         ├─ <= 4: bad
          │         └─ > 4: good
          └─ > 10: good
Classification using Decision Trees (The Labor Negotiations Data Example (2))

Wage increase first year
├─ <= 2.5: Working hours per week
│          ├─ <= 36: bad
│          └─ > 36: Health plan contribution
│                   ├─ none: bad
│                   ├─ half: good
│                   └─ full: bad
└─ > 2.5: Statutory holidays
          ├─ <= 10: Wage increase first year
          │         ├─ <= 4: bad
          │         └─ > 4: good
          └─ > 10: good
Which classifier is good for me?

From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, a property may or may not be relevant.

• Success rate. That is, the percentage of instances classified correctly.

• Ease of computation.

• Readability. There are cases in which the definition of the classifier must be read by a human being. In such cases the readability of the classifier definition is an important parameter for judging the goodness of a classifier.

Finally, we should note that, starting from the same data set, different classification algorithms may return different classifiers. Usually deciding which one to use requires running some testing experiments.
A Classification Algorithm
Decision Trees
Decision trees are among the most widely used and most effective classifiers.

We illustrate the decision tree classification algorithm with an example: the weather data.
Weather Data

Outlook    Temperature   Humidity   Windy   Play
sunny      hot           high       false   no
sunny      hot           high       true    no
overcast   hot           high       false   yes
rainy      mild          high       false   yes
rainy      cool          normal     false   yes
rainy      cool          normal     true    no
overcast   cool          normal     true    yes
sunny      mild          high       false   no
sunny      cool          normal     false   yes
rainy      mild          normal     false   yes
sunny      mild          normal     true    yes
overcast   mild          high       true    yes
overcast   hot           normal     false   yes
rainy      mild          high       true    no
Constructing a decision tree for the weather data (1)

Candidate splits and their class counts (y = play yes, n = play no):

Outlook:     sunny (2y, 3n), overcast (4y, 0n), rainy (3y, 2n)
Temperature: hot (2y, 2n), mild (4y, 2n), cool (3y, 1n)
Humidity:    high (3y, 4n), normal (6y, 1n)
Windy:       false (6y, 2n), true (3y, 3n)

For outlook, the branch entropies are:
H([2, 3]) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971 bits; H([4, 0]) = 0 bits; H([3, 2]) = 0.971 bits
H([2, 3], [4, 0], [3, 2]) = (5/14)*H([2, 3]) + (4/14)*H([4, 0]) + (5/14)*H([3, 2]) = 0.693 bits

Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits.
Gain(outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.247

Gain(outlook) = 0.247, Gain(temperature) = 0.029, Gain(humidity) = 0.152, Gain(windy) = 0.048

In general:
H(p1, …, pn) = -p1*log(p1) - … - pn*log(pn)
H(p, q, r) = H(p, q + r) + (q + r)*H(q/(q + r), r/(q + r))
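The entropy and information gain computations above are easy to reproduce in code. This Python sketch (an illustration, not from the slides) recomputes the numbers for the outlook attribute:

```python
from math import log2

def H(counts):
    """Entropy in bits of a class-count distribution, e.g. H([2, 3])."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(branches):
    """Weighted entropy after a split, e.g. H([2,3], [4,0], [3,2])."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * H(b) for b in branches)

# Reproduce the slide's numbers for the outlook attribute:
before = H([9, 5])                               # ≈ 0.940 bits
after = split_entropy([[2, 3], [4, 0], [3, 2]])  # ≈ 0.693 bits
gain = before - after                            # ≈ 0.247 bits
```

The decision tree algorithm simply evaluates this gain for every candidate attribute and splits on the one with the highest value.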
Constructing a decision tree for the weather data (2)

Outlook is chosen as the root. Within the Outlook = sunny branch (2 yes, 3 no), the candidate splits and class counts are:

Temperature: hot (0y, 2n), mild (1y, 1n), cool (1y, 0n)    Gain: 0.571
Humidity:    high (0y, 3n), normal (2y, 0n)                Gain: 0.971
Windy:       false (1y, 2n), true (1y, 1n)                 Gain: 0.020

Humidity has the highest gain, so it is chosen for the sunny branch.
Constructing a decision tree for the weather data (3)

The final decision tree:

Outlook
├─ sunny: Humidity
│         ├─ high: no
│         └─ normal: yes
├─ overcast: yes
└─ rainy: Windy
          ├─ false: yes
          └─ true: no

Computational cost of decision tree construction for a data set with m attributes and n instances:
O(m*n*log(n)) + O(n*(log(n))^2)
Naive Bayes

Counts and relative frequencies from the weather data:

Outlook       yes   no   |   P(.|yes)   P(.|no)
sunny         2     3    |   2/9        3/5
overcast      4     0    |   4/9        0/5
rainy         3     2    |   3/9        2/5

Temperature   yes   no   |   P(.|yes)   P(.|no)
hot           2     2    |   2/9        2/5
mild          4     2    |   4/9        2/5
cool          3     1    |   3/9        1/5

Humidity      yes   no   |   P(.|yes)   P(.|no)
high          3     4    |   3/9        4/5
normal        6     1    |   6/9        1/5

Windy         yes   no   |   P(.|yes)   P(.|no)
false         6     2    |   6/9        2/5
true          3     3    |   3/9        3/5

Play          yes   no   |   P(yes)     P(no)
              9     5    |   9/14       5/14
Naive Bayes (2)

A new day:

Outlook   Temperature   Humidity   Windy   Play
sunny     cool          high       true    ?

E = (sunny and cool and high and true). Bayes: P(yes | E) = (P(E | yes) P(yes)) / P(E).

Assuming the attributes are statistically independent:

P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E)
           = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E) = 0.0053 / P(E).
P(no | E)  = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) / P(E) = 0.0206 / P(E).

Since P(yes | E) + P(no | E) = 1 we have P(E) = 0.0053 + 0.0206 = 0.0259.
Thus: P(yes | E) = 0.205, P(no | E) = 0.795.

Thus we answer: NO.

Obstruction: usually attributes are not statistically independent. However, naive Bayes works quite well in practice.
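The whole calculation fits in a few lines of Python. This sketch (an illustration; the dictionaries just transcribe the count tables from the previous slide) classifies the new day:

```python
# Conditional probabilities read off the weather-data count tables.
p_given_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
p_given_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}

evidence = ["sunny", "cool", "high", "true"]

# Unnormalized likelihoods: P(E | class) * P(class)
like_yes = 9/14
like_no = 5/14
for v in evidence:
    like_yes *= p_given_yes[v]
    like_no *= p_given_no[v]

# Normalize: since P(yes|E) + P(no|E) = 1, P(E) = like_yes + like_no.
p_yes_given_e = like_yes / (like_yes + like_no)  # ≈ 0.205
p_no_given_e = like_no / (like_yes + like_no)    # ≈ 0.795

answer = "yes" if p_yes_given_e > p_no_given_e else "no"
```

The independence assumption is exactly the product over `evidence`; everything else is bookkeeping.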
Performance Evaluation

Split the data set into two parts: a training set and a test set.

Use the training set to compute the classifier.

Use the test set to evaluate the classifier. Note: test set data have not been used in the training process.

This allows us to compute the following quantities (on the test set). For the sake of simplicity we refer to a two-class prediction.

                 Predicted yes           Predicted no
Actual yes       TP (true positive)      FN (false negative)
Actual no        FP (false positive)     TN (true negative)
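The four quantities (and the success rate from the earlier slide) can be computed with a short helper. This is an illustrative sketch; the (actual, predicted) pair representation is an assumption, not something fixed by the slides:

```python
def evaluate(pairs):
    """Compute TP, FN, FP, TN and the success rate from a list of
    (actual, predicted) class pairs, taking "yes" as the positive class."""
    tp = sum(1 for a, p in pairs if a == "yes" and p == "yes")
    fn = sum(1 for a, p in pairs if a == "yes" and p == "no")
    fp = sum(1 for a, p in pairs if a == "no" and p == "yes")
    tn = sum(1 for a, p in pairs if a == "no" and p == "no")
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "success_rate": (tp + tn) / len(pairs)}
```

Run on the test set only, these counts feed directly into the lift charts and ROC curves discussed next.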
Lift Chart

A lift chart plots the number of true positives (y axis) against the size of the predicted-positive subset (x axis, from 0 to 100%):

x = predicted positive subset size = (TP + FP)/(TP + FP + TN + FN)
y = number of true positives = TP

Lift charts are typically used in marketing applications.
Receiver Operating Characteristic (ROC) Curve

An ROC curve plots the TP rate (y axis, 0 to 100%) against the FP rate (x axis, 0 to 100%):

FP rate = FP/(FP + TN)
TP rate = TP/(TP + FN)

ROC curves are typically used in communication applications.
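For a classifier that outputs a score rather than a hard class, the ROC curve is traced by sweeping a threshold over the scores. This sketch (an illustration; the scored-instance format is an assumption) computes the (FP rate, TP rate) points:

```python
def roc_points(scored):
    """scored: list of (score, actual) with actual in {"yes", "no"}.
    Assumes both classes are present. Sweeps the decision threshold
    from high to low and emits one (FP rate, TP rate) point per step."""
    pos = sum(1 for _, a in scored if a == "yes")
    neg = len(scored) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, a in sorted(scored, key=lambda s: -s[0]):
        if a == "yes":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

A perfect classifier hugs the left and top edges of the plot; a random one follows the diagonal from (0, 0) to (1, 1).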
A glimpse of data mining in Safeguard

We outline our use of data mining techniques in the Safeguard project.
On line schema

tcpdump captures TCP packets on port 2506; format filters turn the preprocessed TCP payload into four feature streams:

• Sequence of payload bytes
• Distribution of payload bytes
• Conditional probabilities of chars and words in payload
• Statistics info (avg, var, dev) on payload bytes

These streams feed Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), and a Cluster Analyzer. Their outputs go to a Supervisor, which computes the alarm level.
Training schema

For training, the tcpdump output on port 2506 is stored as a preprocessed TCP payload log and passed through the same format filters, producing:

• Sequence of payload bytes
• Distribution of payload bytes
• Conditional probabilities of chars and words in payload
• Statistics info (avg, var, dev) on payload bytes

WEKA (a data mining tool) and an HT Classifier Synthesizer produce Classifier 1 (hash table based); an HMM Synthesizer produces Classifier 2 (Hidden Markov Models); the Cluster Analyzer is trained on the same streams.