Knowledge Discovery via Data Mining

Enrico Tronci, Dipartimento di Informatica, Università di Roma "La Sapienza", Via Salaria 113,
00198 Roma, Italy, [email protected], http://www.dsi.uniroma1.it/~tronci

Workshop ENEA: I Sistemi di Supporto alle Decisioni (Decision Support Systems), Centro Ricerche ENEA Casaccia, Roma, October 28, 2003
Data Mining
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
A data miner is a computer program that sifts through data seeking regularities or patterns.
Obstructions: noise and computational complexity.
Some Applications
• Decisions involving judgments, e.g. loans.
• Screening images. Example: detection of oil slicks from satellite images, warning of ecological disasters, illegal dumping.
• Load forecasting in the electricity supply industry.
• Diagnosis, e.g. for preventive maintenance of electromechanical devices.
• Marketing and sales. … On Thursday customers often purchase beer and diapers together …
• Stock market analysis.
• Anomaly detection.
Data
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lens
young           myope                    no            reduced                none
young           myope                    no            normal                 soft
young           myope                    yes           reduced                none
young           myope                    yes           normal                 hard
young           hypermetrope             no            reduced                none
young           hypermetrope             no            normal                 soft
young           hypermetrope             yes           reduced                none
young           hypermetrope             yes           normal                 hard
pre-presbyopic  myope                    no            reduced                none
pre-presbyopic  myope                    no            normal                 soft
. . .           . . .                    . . .         . . .                  . . .

The first four columns are the attributes, the last column (recommended lens) is the goal, and each row is an instance.
Classification
Assume instances have n attributes A1, …, An-1, An. Let attribute An be our goal. A classifier is a function f from (A1 x … x An-1) to An. That is, f looks at the values of the first (n - 1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance w.r.t. the goal attribute.

The problem of computing a classifier from a set of instances is called the classification problem.

Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance.

Note that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.
Clustering

Assume instances have n attributes A1, …, An.

A clustering function is a function f from the set (A1 x … x An) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes.

The problem of computing a clustering function from our data set is called the clustering problem.

Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance.

Note that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning.
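As a concrete illustration of a clustering function (not part of the original slides), here is a minimal k-means sketch in Python; the 2-D points and the choice of k are assumptions made purely for the example.

```python
import random

def kmeans(points, k, iters=20):
    """Split 2-D points into k clusters by repeatedly assigning each
    point to its nearest centroid and recomputing the centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid closest to p (squared Euclidean distance)
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # move each centroid to the mean of its cluster (keep it if empty)
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters
```

The returned clusters play the role of the "small number of classes" above: the function maps every possible instance to a cluster index, including instances not in the original data set.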
In the following we will focus on classification.
Rules for Contact Lens Data (an example of classification)

if (<tear production rate> = <reduced>) then <recommendation> = <none>;

if (<age> = <young> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

if (<age> = <pre-presbyopic> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

. . . .

Attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column.

A possible way of defining a classifier is by using a set of rules as above.
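A rule set like the one above translates directly into code. The following sketch (an illustration, covering only the rules shown on this slide plus the rows visible in the data table) implements such a rule-based classifier in Python:

```python
def classify_lens(age, prescription, astigmatism, tear_rate):
    """Rule-based classifier for the contact lens data.
    Partial rule set: it mirrors only the rules listed on this slide,
    so unmatched instances fall through to "unknown"."""
    if tear_rate == "reduced":
        return "none"
    if astigmatism == "no" and tear_rate == "normal":
        return "soft"
    # consistent with the young/astigmatic/normal rows of the data table
    if astigmatism == "yes" and tear_rate == "normal":
        return "hard"
    return "unknown"
```

Note that the function accepts any combination of attribute values, not only those present in the data set, which is exactly what makes a rule set a classifier.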
Labor Negotiations Data

Attribute                  Type                    1      2      3      . . .  40
Duration                   years                   1      2      3      . . .  2
Wage increase first year   percentage              2%     4%     4.3%   . . .  4.5%
Wage increase second year  percentage              ?      ?      ?      . . .  ?
Working hours per week     number of hours         28     35     38     . . .  40
Pension                    {none, r, c}            none   ?      ?      . . .  ?
Education allowance        {yes, no}               yes    ?      ?      . . .  ?
Statutory holidays         number of days          11     15     12     . . .  12
Vacation                   {below-avg, avg, gen}   avg    gen    gen    . . .  avg
. . .                      . . .                   . . .  . . .  . . .  . . .  . . .
Acceptability of contract  {good, bad}             bad    good   good   . . .  good
Classification using Decision Trees (The Labor Negotiations Data Example (1))

Wage increase first year
├─ <= 2.5: bad
└─ > 2.5: Statutory holidays
          ├─ <= 10: Wage increase first year
          │         ├─ <= 4: bad
          │         └─ > 4: good
          └─ > 10: good
Classification using Decision Trees (The Labor Negotiations Data Example (2))

Wage increase first year
├─ <= 2.5: Working hours per week
│          ├─ <= 36: bad
│          └─ > 36: Health plan contribution
│                   ├─ none: bad
│                   ├─ half: good
│                   └─ full: bad
└─ > 2.5: Statutory holidays
          ├─ <= 10: Wage increase first year
          │         ├─ <= 4: bad
          │         └─ > 4: good
          └─ > 10: good
Which classifier is good for me?

From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, a property may or may not be relevant.

• Success rate. That is, the percentage of instances classified correctly.

• Ease of computation.

• Readability. There are cases in which the definition of the classifier must be read by a human being. In such cases the readability of the classifier definition is an important parameter for judging the goodness of a classifier.

Finally, we should note that, starting from the same data set, different classification algorithms may return different classifiers. Usually deciding which one to use requires running some testing experiments.
A Classification Algorithm
Decision Trees
Decision trees are among the most widely used and most effective classifiers.

We illustrate the decision tree classification algorithm with an example: the weather data.
Weather Data

Outlook    Temperature   Humidity   Windy   Play
sunny      hot           high       false   no
sunny      hot           high       true    no
overcast   hot           high       false   yes
rainy      mild          high       false   yes
rainy      cool          normal     false   yes
rainy      cool          normal     true    no
overcast   cool          normal     true    yes
sunny      mild          high       false   no
sunny      cool          normal     false   yes
rainy      mild          normal     false   yes
sunny      mild          normal     true    yes
overcast   mild          high       true    yes
overcast   hot           normal     false   yes
rainy      mild          high       true    no
Constructing a decision tree for the weather data (1)

Candidate splits and their class counts (y = play yes, n = play no):

Outlook:     sunny (2y, 3n), overcast (4y, 0n), rainy (3y, 2n)
Temperature: hot (2y, 2n), mild (4y, 2n), cool (3y, 1n)
Humidity:    high (3y, 4n), normal (6y, 1n)
Windy:       false (6y, 2n), true (3y, 3n)

For outlook, the branch entropies are:
H([2, 3]) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971 bits; H([4, 0]) = 0 bits; H([3, 2]) = 0.971 bits
H([2, 3], [4, 0], [3, 2]) = (5/14)*H([2, 3]) + (4/14)*H([4, 0]) + (5/14)*H([3, 2]) = 0.693 bits

Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits.
Gain(outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.247

Gain(outlook) = 0.247, Gain(temperature) = 0.029, Gain(humidity) = 0.152, Gain(windy) = 0.048

In general:
H(p1, …, pn) = -p1*log(p1) - … - pn*log(pn)
H(p, q, r) = H(p, q + r) + (q + r)*H(q/(q + r), r/(q + r))
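The entropy and information gain computations above are easy to reproduce in code. This Python sketch (an illustration, not from the slides) recomputes the numbers for the outlook attribute:

```python
from math import log2

def H(counts):
    """Entropy in bits of a class-count distribution, e.g. H([2, 3])."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(branches):
    """Weighted entropy after a split, e.g. H([2,3], [4,0], [3,2])."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * H(b) for b in branches)

# Reproduce the slide's numbers for the outlook attribute:
before = H([9, 5])                               # ≈ 0.940 bits
after = split_entropy([[2, 3], [4, 0], [3, 2]])  # ≈ 0.693 bits
gain = before - after                            # ≈ 0.247 bits
```

The decision tree algorithm simply evaluates this gain for every candidate attribute and splits on the one with the highest value.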
Constructing a decision tree for the weather data (2)

Outlook is chosen as the root. Within the Outlook = sunny branch (2 yes, 3 no), the candidate splits and class counts are:

Temperature: hot (0y, 2n), mild (1y, 1n), cool (1y, 0n)    Gain: 0.571
Humidity:    high (0y, 3n), normal (2y, 0n)                Gain: 0.971
Windy:       false (1y, 2n), true (1y, 1n)                 Gain: 0.020

Humidity has the highest gain, so it is chosen for the sunny branch.
Constructing a decision tree for the weather data (3)

The final decision tree:

Outlook
├─ sunny: Humidity
│         ├─ high: no
│         └─ normal: yes
├─ overcast: yes
└─ rainy: Windy
          ├─ false: yes
          └─ true: no

Computational cost of decision tree construction for a data set with m attributes and n instances:
O(m*n*log(n)) + O(n*(log(n))^2)
Naive Bayes

Counts and relative frequencies from the weather data:

Outlook       yes   no   |   P(.|yes)   P(.|no)
sunny         2     3    |   2/9        3/5
overcast      4     0    |   4/9        0/5
rainy         3     2    |   3/9        2/5

Temperature   yes   no   |   P(.|yes)   P(.|no)
hot           2     2    |   2/9        2/5
mild          4     2    |   4/9        2/5
cool          3     1    |   3/9        1/5

Humidity      yes   no   |   P(.|yes)   P(.|no)
high          3     4    |   3/9        4/5
normal        6     1    |   6/9        1/5

Windy         yes   no   |   P(.|yes)   P(.|no)
false         6     2    |   6/9        2/5
true          3     3    |   3/9        3/5

Play          yes   no   |   P(yes)     P(no)
              9     5    |   9/14       5/14
Naive Bayes (2)

A new day:

Outlook   Temperature   Humidity   Windy   Play
sunny     cool          high       true    ?

E = (sunny and cool and high and true). Bayes: P(yes | E) = (P(E | yes) P(yes)) / P(E).

Assuming the attributes are statistically independent:

P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E)
           = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E) = 0.0053 / P(E).
P(no | E)  = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) / P(E) = 0.0206 / P(E).

Since P(yes | E) + P(no | E) = 1 we have P(E) = 0.0053 + 0.0206 = 0.0259.
Thus: P(yes | E) = 0.205, P(no | E) = 0.795.

Thus we answer: NO.

Obstruction: usually attributes are not statistically independent. However, naive Bayes works quite well in practice.
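The whole calculation fits in a few lines of Python. This sketch (an illustration; the dictionaries just transcribe the count tables from the previous slide) classifies the new day:

```python
# Conditional probabilities read off the weather-data count tables.
p_given_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
p_given_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}

evidence = ["sunny", "cool", "high", "true"]

# Unnormalized likelihoods: P(E | class) * P(class)
like_yes = 9/14
like_no = 5/14
for v in evidence:
    like_yes *= p_given_yes[v]
    like_no *= p_given_no[v]

# Normalize: since P(yes|E) + P(no|E) = 1, P(E) = like_yes + like_no.
p_yes_given_e = like_yes / (like_yes + like_no)  # ≈ 0.205
p_no_given_e = like_no / (like_yes + like_no)    # ≈ 0.795

answer = "yes" if p_yes_given_e > p_no_given_e else "no"
```

The independence assumption is exactly the product over `evidence`; everything else is bookkeeping.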
Performance Evaluation

Split the data set into two parts: a training set and a test set.

Use the training set to compute the classifier.

Use the test set to evaluate the classifier. Note: test set data have not been used in the training process.

This allows us to compute the following quantities (on the test set). For the sake of simplicity we refer to a two-class prediction.

                 Predicted yes           Predicted no
Actual yes       TP (true positive)      FN (false negative)
Actual no        FP (false positive)     TN (true negative)
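The four quantities (and the success rate from the earlier slide) can be computed with a short helper. This is an illustrative sketch; the (actual, predicted) pair representation is an assumption, not something fixed by the slides:

```python
def evaluate(pairs):
    """Compute TP, FN, FP, TN and the success rate from a list of
    (actual, predicted) class pairs, taking "yes" as the positive class."""
    tp = sum(1 for a, p in pairs if a == "yes" and p == "yes")
    fn = sum(1 for a, p in pairs if a == "yes" and p == "no")
    fp = sum(1 for a, p in pairs if a == "no" and p == "yes")
    tn = sum(1 for a, p in pairs if a == "no" and p == "no")
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "success_rate": (tp + tn) / len(pairs)}
```

Run on the test set only, these counts feed directly into the lift charts and ROC curves discussed next.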
Lift Chart

A lift chart plots the number of true positives (y axis) against the size of the predicted-positive subset (x axis, from 0 to 100%):

x = predicted positive subset size = (TP + FP)/(TP + FP + TN + FN)
y = number of true positives = TP

Lift charts are typically used in marketing applications.
Receiver Operating Characteristic (ROC) Curve

An ROC curve plots the TP rate (y axis, 0 to 100%) against the FP rate (x axis, 0 to 100%):

FP rate = FP/(FP + TN)
TP rate = TP/(TP + FN)

ROC curves are typically used in communication applications.
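For a classifier that outputs a score rather than a hard class, the ROC curve is traced by sweeping a threshold over the scores. This sketch (an illustration; the scored-instance format is an assumption) computes the (FP rate, TP rate) points:

```python
def roc_points(scored):
    """scored: list of (score, actual) with actual in {"yes", "no"}.
    Assumes both classes are present. Sweeps the decision threshold
    from high to low and emits one (FP rate, TP rate) point per step."""
    pos = sum(1 for _, a in scored if a == "yes")
    neg = len(scored) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, a in sorted(scored, key=lambda s: -s[0]):
        if a == "yes":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

A perfect classifier hugs the left and top edges of the plot; a random one follows the diagonal from (0, 0) to (1, 1).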
A glimpse of data mining in Safeguard

We outline our use of data mining techniques in the Safeguard project.
On line schema

tcpdump captures TCP packets on port 2506; format filters turn the preprocessed TCP payload into four feature streams:

• Sequence of payload bytes
• Distribution of payload bytes
• Conditional probabilities of chars and words in payload
• Statistics info (avg, var, dev) on payload bytes

These streams feed Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), and a Cluster Analyzer. Their outputs go to a Supervisor, which computes the alarm level.
Training schema

For training, the tcpdump output on port 2506 is stored as a preprocessed TCP payload log and passed through the same format filters, producing:

• Sequence of payload bytes
• Distribution of payload bytes
• Conditional probabilities of chars and words in payload
• Statistics info (avg, var, dev) on payload bytes

WEKA (a data mining tool) and an HT Classifier Synthesizer produce Classifier 1 (hash table based); an HMM Synthesizer produces Classifier 2 (Hidden Markov Models); the Cluster Analyzer is trained on the same streams.