Knowledge Discovery via Data Mining

Enrico Tronci
Dipartimento di Informatica, Università di Roma “La Sapienza”, Via Salaria 113, 00198 Roma, Italy
tronci@dsi.uniroma1.it, http://www.dsi.uniroma1.it/~tronci

Workshop ENEA: I Sistemi di Supporto alle Decisioni (Decision Support Systems)
Centro Ricerche ENEA Casaccia, Roma, October 28, 2003



Data Mining

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

A data miner is a computer program that sifts through data seeking regularities or patterns.

Obstructions: noise and computational complexity.


Some Applications

• Decisions involving judgments, e.g. loans.

• Screening images. Example: detection of oil slicks from satellite images, warning of ecological disasters, illegal dumping.

• Load forecasting in the electricity supply industry.

• Diagnosis, e.g. for preventive maintenance of electromechanical devices.

• Marketing and Sales. … On Thursday customers often purchase beer and diapers together …

• Stock Market Analysis.

• Anomaly Detection.


Data

Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lens
young           myope                    no            reduced                none
young           myope                    no            normal                 soft
young           myope                    yes           reduced                none
young           myope                    yes           normal                 hard
young           hypermetrope             no            reduced                none
young           hypermetrope             no            normal                 soft
young           hypermetrope             yes           reduced                none
young           hypermetrope             yes           normal                 hard
pre-presbyopic  myope                    no            reduced                none
pre-presbyopic  myope                    no            normal                 soft
...             ...                      ...           ...                    ...

Each row is an instance. The first four columns are the attributes; the last column (Recommended lens) is the goal.


Classification

Assume instances have n attributes A1, …, An-1, An. Let attribute An be our goal. A classifier is a function f from (A1 x … x An-1) to An. That is, f looks at the values of the first (n-1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance w.r.t. the goal attribute.

The problem of computing a classifier from a set of instances is called the classification problem.

Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance.

Note that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.


Clustering

Assume instances have n attributes A1, …, An.

A clustering function is a function f from the set (A1 x … x An) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes.

The problem of computing a clustering function from our data set is called the clustering problem.

Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance.

Note that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning.
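As an illustration (a minimal Python sketch, not from the original slides), the classic k-means procedure computes such a clustering function from numeric instances: it returns k centroids, and the induced function maps any instance, seen or unseen, to the index of its nearest centroid.

```python
# Minimal k-means sketch (hypothetical example, plain Python).
import random

def dist2(p, q):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def kmeans(points, k, iters=100):
    """Return k centroids for a list of numeric attribute vectors."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def clustering_function(instance, centroids):
    """f: (A1 x ... x An) -> {0, ..., k-1}; works on unseen instances too."""
    return min(range(len(centroids)),
               key=lambda j: dist2(instance, centroids[j]))

centroids = kmeans([(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (8.5, 9.5)], k=2)
print(clustering_function((0.9, 1.1), centroids))  # cluster of the first pair
```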

In the following we will focus on classification.


Rules for Contact Lens Data (an example of classification)

if (<tear production rate> = <reduced>) then <recommendation> = <none>;

if (<age> = <young> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

if (<age> = <pre-presbyopic> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

. . . .

Attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column.

A possible way of defining a classifier is by using a set of rules as above.
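For instance, the rules above translate directly into code. A minimal sketch (hypothetical function name, plain Python, not from the slides), with the remaining rules elided as on the slide:

```python
# Rule-based classifier for the contact lens data: the first
# matching rule fixes the value of the goal attribute.
def classify_lens(age, prescription, astigmatic, tear_rate):
    """Return the recommended lens for one instance."""
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    # ... remaining rules elided, as on the slide ...
    return "none"  # default when no rule fires

print(classify_lens("young", "myope", "no", "normal"))  # soft
```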


Labor Negotiations Data

Attribute                   Type                    1     2     3     ...  40
Duration                    years                   1     2     3     ...  2
Wage increase first year    percentage              2%    4%    4.3%  ...  4.5%
Wage increase second year   percentage              ?     ?     ?     ...  ?
Working hours per week      number of hours         28    35    38    ...  40
Pension                     {none, r, c}            none  ?     ?     ...  ?
Education allowance         {yes, no}               yes   ?     ?     ...  ?
Statutory holidays          number of days          11    15    12    ...  12
Vacation                    {below-avg, avg, gen}   avg   gen   gen   ...  avg
...                         ...                     ...   ...   ...   ...  ...
Acceptability of contract   {good, bad}             bad   good  good  ...  good


Classification using Decision Trees (The Labor Negotiations Data Example (1))

Wage increase first year
  <= 2.5 → bad
  > 2.5  → Statutory holidays
             > 10  → good
             <= 10 → Wage increase first year
                       <= 4 → bad
                       > 4  → good


Classification using Decision Trees (The Labor Negotiations Data Example (2))

Wage increase first year
  <= 2.5 → Working hours per week
             <= 36 → bad
             > 36  → Health plan contribution
                       none → bad
                       half → good
                       full → bad
  > 2.5  → Statutory holidays
             > 10  → good
             <= 10 → Wage increase first year
                       <= 4 → bad
                       > 4  → good


Which Classifier Is Good for Me?

From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, some properties may or may not be relevant.

• Success rate. That is, the percentage of instances classified correctly.

• Ease of computation.

• Readability. There are cases in which the definition of the classifier must be read by a human being. In such cases the readability of the classifier definition is an important parameter for judging the goodness of a classifier.

Finally we should note that starting from the same data set different classification algorithms may return different classifiers. Usually deciding which one to use requires running some testing experiments.


A Classification Algorithm

Decision Trees

Decision trees are among the most widely used and most effective classifiers.

We will illustrate the decision tree classification algorithm with an example: the weather data.


Weather Data

Outlook  Temperature  Humidity  Windy  Play

sunny hot high false no

sunny hot high true no

overcast hot high false yes

rainy mild high false yes

rainy cool normal false yes

rainy cool normal true no

overcast cool normal true yes

sunny mild high false no

sunny cool normal false yes

rainy mild normal false yes

sunny mild normal true yes

overcast mild high true yes

overcast hot normal false yes

rainy mild high true no


Constructing a decision tree for the weather data (1)

Consider the four candidate splits at the root and the class counts (yes/no for Play) each branch induces:

Outlook:     sunny → [2 yes, 3 no]    overcast → [4 yes, 0 no]    rainy → [3 yes, 2 no]
Temperature: hot → [2 yes, 2 no]      mild → [4 yes, 2 no]        cool → [3 yes, 1 no]
Humidity:    high → [3 yes, 4 no]     normal → [6 yes, 1 no]
Windy:       false → [6 yes, 2 no]    true → [3 yes, 3 no]

For Outlook the branch entropies are:
H([2, 3]) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971 bits; H([4, 0]) = 0 bits; H([3, 2]) = 0.971 bits.

The weighted entropy after splitting on Outlook is:
H([2, 3], [4, 0], [3, 2]) = (5/14)*H([2, 3]) + (4/14)*H([4, 0]) + (5/14)*H([3, 2]) = 0.693 bits.

Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits.
Gain(Outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.247 bits.

Similarly: Gain(Outlook) = 0.247, Gain(Temperature) = 0.029, Gain(Humidity) = 0.152, Gain(Windy) = 0.048. Outlook has the highest gain and is chosen as the root.

In general H(p1, …, pn) = -p1*log(p1) - … - pn*log(pn), and H satisfies
H(p, q, r) = H(p, q + r) + (q + r)*H(q/(q + r), r/(q + r)).
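These computations are easy to mechanize. The following minimal sketch (plain Python, not from the slides) reproduces the four gains for the weather data using the entropy formula above:

```python
# Information gain for each attribute of the weather data.
from math import log2
from collections import Counter

# Each instance: (outlook, temperature, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def entropy(labels):
    """H(p1, ..., pn) = -sum pi * log2(pi) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(data, attr):
    """Information gain of splitting on attribute index `attr`."""
    before = entropy([row[-1] for row in data])
    after = 0.0
    for value in set(row[attr] for row in data):
        subset = [row[-1] for row in data if row[attr] == value]
        after += (len(subset) / len(data)) * entropy(subset)
    return before - after

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(f"Gain({name}) = {gain(DATA, i):.3f}")
# Prints 0.247, 0.029, 0.152, 0.048, as on the slide.
```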


Constructing a decision tree for the weather data (2)

Having chosen Outlook at the root, we repeat the procedure on the sunny branch (2 yes, 3 no; H([2, 3]) = 0.971 bits). The candidate splits and the class counts they induce are:

Temperature: hot → [0 yes, 2 no]    mild → [1 yes, 1 no]   cool → [1 yes, 0 no]   Gain: 0.571
Humidity:    high → [0 yes, 3 no]   normal → [2 yes, 0 no]                        Gain: 0.971
Windy:       false → [1 yes, 2 no]  true → [1 yes, 1 no]                          Gain: 0.020

Humidity has the highest gain, so it is chosen for the sunny branch.


Constructing a decision tree for the weather data (3)

Outlook
  sunny    → Humidity
               high   → no
               normal → yes
  overcast → yes
  rainy    → Windy
               true  → no
               false → yes
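Read as a program, this tree is just nested conditionals; evaluating it classifies any instance, including days not present in the weather data. A minimal sketch (not code from the slides):

```python
# The weather decision tree as nested conditionals.
def play(outlook, humidity, windy):
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy"
    return "no" if windy else "yes"

print(play("sunny", "high", False))    # no
print(play("overcast", "high", True))  # yes
print(play("rainy", "normal", True))   # no
```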

Computational cost of decision tree construction for a data set with m attributes and n instances:

O(m n log n) + O(n (log n)^2)


Naive Bayes

Counts and relative frequencies from the weather data (Play: 9 yes, 5 no):

Outlook       yes  no  |  P(·|yes)  P(·|no)
  sunny        2    3  |    2/9      3/5
  overcast     4    0  |    4/9      0/5
  rainy        3    2  |    3/9      2/5

Temperature   yes  no  |  P(·|yes)  P(·|no)
  hot          2    2  |    2/9      2/5
  mild         4    2  |    4/9      2/5
  cool         3    1  |    3/9      1/5

Humidity      yes  no  |  P(·|yes)  P(·|no)
  high         3    4  |    3/9      4/5
  normal       6    1  |    6/9      1/5

Windy         yes  no  |  P(·|yes)  P(·|no)
  false        6    2  |    6/9      2/5
  true         3    3  |    3/9      3/5

Play          yes: 9 (9/14)   no: 5 (5/14)


Naive Bayes (2)

A new day:

Outlook  Temperature  Humidity  Windy  Play
sunny    cool         high      true   ?

E = (sunny and cool and high and true). Bayes: P(yes | E) = (P(E | yes) P(yes)) / P(E).

Assuming attributes statistically independent:

P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E) = 0.0053 / P(E).

P(no | E) = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) / P(E) = 0.0206 / P(E).

Since P(yes | E) + P(no | E) = 1, we have P(E) = 0.0053 + 0.0206 = 0.0259. Thus: P(yes | E) = 0.205, P(no | E) = 0.795.

Thus we answer: NO

Obstruction: usually attributes are not statistically independent. However naive Bayes works quite well in practice.
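The whole computation fits in a few lines. A minimal sketch (plain Python, not from the slides), using the count tables above:

```python
# Naive Bayes posterior for the new day, from the weather data counts.
counts = {  # attribute -> value -> (count given yes, count given no)
    "outlook": {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity": {"high": (3, 4), "normal": (6, 1)},
    "windy": {"false": (6, 2), "true": (3, 3)},
}
N_YES, N_NO = 9, 5

def posterior(instance):
    """P(yes | E) and P(no | E) for E = the given attribute values."""
    like_yes = N_YES / (N_YES + N_NO)   # P(yes)
    like_no = N_NO / (N_YES + N_NO)     # P(no)
    for attr, value in instance.items():
        c_yes, c_no = counts[attr][value]
        like_yes *= c_yes / N_YES       # P(value | yes)
        like_no *= c_no / N_NO          # P(value | no)
    total = like_yes + like_no          # = P(E), by normalization
    return like_yes / total, like_no / total

new_day = {"outlook": "sunny", "temperature": "cool",
           "humidity": "high", "windy": "true"}
p_yes, p_no = posterior(new_day)
print(f"P(yes | E) = {p_yes:.3f}, P(no | E) = {p_no:.3f}")  # 0.205, 0.795
```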


Performance Evaluation

Split the data set into two parts: a training set and a test set.

Use the training set to compute the classifier.

Use the test set to evaluate the classifier. Note: the test set data have not been used in the training process.

This allows us to compute the following quantities (on the test set). For the sake of simplicity we refer to a two-class prediction.

                      Predicted class
                      yes                    no
Actual class   yes    TP (true positive)     FN (false negative)
               no     FP (false positive)    TN (true negative)
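As a sketch (hypothetical helper, plain Python, not from the slides), the four counts and the derived rates can be computed from the test set like this:

```python
# Confusion-matrix counts for a two-class prediction.
def confusion(actual, predicted):
    """Count TP, FN, FP, TN, taking 'yes' as the positive class."""
    tp = sum(a == "yes" and p == "yes" for a, p in zip(actual, predicted))
    fn = sum(a == "yes" and p == "no" for a, p in zip(actual, predicted))
    fp = sum(a == "no" and p == "yes" for a, p in zip(actual, predicted))
    tn = sum(a == "no" and p == "no" for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
tp, fn, fp, tn = confusion(actual, predicted)
print("success rate:", (tp + tn) / (tp + fn + fp + tn))
print("TP rate:", tp / (tp + fn), "FP rate:", fp / (fp + tn))
```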


Lift Chart

A lift chart plots the number of true positives (y-axis) against the size of the predicted positive subset (x-axis):

Predicted positive subset size = (TP + FP) / (TP + FP + TN + FN)

Number of true positives = TP

(Figure: lift chart; x-axis from 0 to 100%, y-axis from 0 to 1000 true positives.)

Lift charts are typically used in Marketing Applications


Receiver Operating Characteristic (ROC) Curve

FP rate = FP / (FP + TN)

TP rate = TP / (TP + FN)

(Figure: ROC curve; TP rate on the y-axis vs. FP rate on the x-axis, both from 0 to 100%.)

ROC curves are typically used in Communication Applications
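A ROC curve can be traced by ranking the test instances by the classifier's score for the positive class and sweeping a decision threshold. A sketch with made-up scores (not from the slides):

```python
# Trace ROC points: sort by score, then at each cut record
# (FP rate, TP rate) as if everything above the cut were 'yes'.
def roc_points(scored):  # scored: list of (score, actual_class)
    pos = sum(1 for _, c in scored if c == "yes")
    neg = len(scored) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, c in sorted(scored, key=lambda x: -x[0]):
        if c == "yes":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

scored = [(0.9, "yes"), (0.8, "yes"), (0.7, "no"),
          (0.6, "yes"), (0.5, "no"), (0.4, "no")]
print(roc_points(scored))
```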


A Glimpse of Data Mining in Safeguard

We outline our use of data mining techniques in the Safeguard project.


On-line Schema

tcpdump captures the TCP packets for port 2506; format filters preprocess the TCP payload and feed the detectors with four views of it:

• Sequence of payload bytes
• Distribution of payload bytes
• Conditional probabilities of chars and words in the payload
• Statistics info (avg, var, dev) on payload bytes

Three detectors analyze these views: Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), and a Cluster Analyzer. A Supervisor combines their outputs into an alarm level.


Training Schema

During training, tcpdump captures the TCP packets for port 2506 and the preprocessed TCP payload is logged; format filters again produce the four payload views:

• Sequence of payload bytes
• Distribution of payload bytes
• Conditional probabilities of chars and words in the payload
• Statistics info (avg, var, dev) on payload bytes

The logged views are processed off line by WEKA (a data mining tool), an HT Classifier Synthesizer, and an HMM Synthesizer to obtain the on-line detectors: Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), and the Cluster Analyzer.