
Knowledge Discovery via Data mining

Enrico Tronci, Dipartimento di Informatica, Università di Roma “La Sapienza”, Via Salaria 113,

00198 Roma, Italy, tronci@dsi.uniroma1.it, http://www.dsi.uniroma1.it/~tronci

Workshop ENEA: Decision Support Systems (I Sistemi di Supporto alle Decisioni), Centro Ricerche ENEA Casaccia, Roma, October 28, 2003

2

Data Mining

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

A data miner is a computer program that sifts through data seeking regularities or patterns.

Obstructions: noise and computational complexity.

3

Some Applications

• Decisions involving judgments, e.g. loans.

• Screening images. Example: detection of oil slicks from satellite images, warning of ecological disasters, illegal dumping.

• Load forecasting in the electricity supply industry.

• Diagnosis, e.g. for preventive maintenance of electromechanical devices.

• Marketing and Sales. … On Thursday customers often purchase beer and diapers together …

• Stock Market Analysis.

• Anomaly Detection.

4

Data

Age              Spectacle prescription   Astigmatism   Tear production rate   Recommended lens
young            myope                    no            reduced                none
young            myope                    no            normal                 soft
young            myope                    yes           reduced                none
young            myope                    yes           normal                 hard
young            hypermetrope             no            reduced                none
young            hypermetrope             no            normal                 soft
young            hypermetrope             yes           reduced                none
young            hypermetrope             yes           normal                 hard
pre-presbyopic   myope                    no            reduced                none
pre-presbyopic   myope                    no            normal                 soft
. . .            . . .                    . . .         . . .                  . . .

Each row is an instance; the first four columns are the attributes and the last column (the recommended lens) is the goal.

5

Classification

Assume instances have n attributes A1, …, An-1, An, and let attribute An be our goal. A classifier is a function f from (A1 x … x An-1) to An. That is, f looks at the values of the first (n-1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance with respect to the goal attribute.

The problem of computing a classifier from a set of instances is called the classification problem.

Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance.

Note that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.

6

Clustering

Assume instances have n attributes A1, …, An.

A clustering function is a function f from the set (A1 x … x An) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes.

The problem of computing a clustering function from our data set is called the clustering problem.

Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance.

Note that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning.

In the following we will focus on classification.
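To make these two definitions concrete, here is a minimal sketch in Python (the attribute names and the toy rule bodies are ours, for illustration only): an instance is a tuple of attribute values, a classifier maps the non-goal attributes to an estimated goal value, and a clustering function maps an instance to one of a small number of natural numbers.

from typing import Tuple

# An instance: a tuple of non-goal attribute values (here: age, prescription, astigmatism, tear rate).
Instance = Tuple[str, str, str, str]

def classify(instance: Instance) -> str:
    """Classifier: returns the (estimated) value of the goal attribute."""
    age, prescription, astigmatism, tear_rate = instance
    # Toy rule: no lens if tear production is reduced, soft lens otherwise.
    return "none" if tear_rate == "reduced" else "soft"

def cluster(instance: Instance) -> int:
    """Clustering function: assigns each instance to one of a small number of classes."""
    # Toy split into two clusters by age group.
    return 0 if instance[0] == "young" else 1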

7

Rules for Contact Lens Data (an example of classification)

if (<tear production rate> = <reduced>) then <recommendation> = <none>;

if (<age> = <young> and <astigmatism> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

if (<age> = <pre-presbyopic> and <astigmatism> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;

. . . .

Attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column.

A possible way of defining a classifier is by using a set of rules as above.
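A minimal Python sketch of such a rule-based classifier (ours, for illustration; only the rules listed above are encoded, and the remaining rules are elided just as in the slide):

def recommend_lens(age, astigmatism, tear_rate):
    # Rules taken from the excerpt above.
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatism == "no" and tear_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatism == "no" and tear_rate == "normal":
        return "soft"
    return None  # remaining rules elided (". . . .")

print(recommend_lens("young", "no", "normal"))  # soft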

8

Labor Negotiations Data

Attribute                   Type                    1      2      3      . . .   40
Duration                    years                   1      2      3      . . .   2
Wage increase first year    percentage              2%     4%     4.3%   . . .   4.5%
Wage increase second year   percentage              ?      ?      ?      . . .   ?
Working hours per week      number of hours         28     35     38     . . .   40
Pension                     {none, r, c}            none   ?      ?      . . .   ?
Education allowance         {yes, no}               yes    ?      ?      . . .   ?
Statutory holidays          number of days          11     15     12     . . .   12
Vacation                    {below-avg, avg, gen}   avg    gen    gen    . . .   avg
. . .                       . . .                   . . .  . . .  . . .  . . .   . . .
Acceptability of contract   {good, bad}             bad    good   good   . . .   good

Columns 1, 2, 3, …, 40 are the instances (contracts); acceptability of contract is the goal.

9

Classification using Decision Trees (The Labor Negotiations Data Example (1))

wage increase first year
  <= 2.5: bad
  > 2.5: statutory holidays
    > 10: good
    <= 10: wage increase first year
      <= 4: bad
      > 4: good

10

Classification using Decision Trees (The Labor Negotiations Data Example (2))

wage increase first year
  <= 2.5: working hours per week
    <= 36: bad
    > 36: health plan contribution
      none: bad
      half: good
      full: bad
  > 2.5: statutory holidays
    > 10: good
    <= 10: wage increase first year
      <= 4: bad
      > 4: good

11

Which classifier is good for me?

From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, a given property may or may not be relevant.

• Success rate, that is, the percentage of instances classified correctly.

• Ease of computation.

• Readability. There are cases in which the definition of the classifier must be read by a human being. In such cases the readability of the classifier definition is an important parameter for judging the goodness of a classifier.

Finally, we should note that, starting from the same data set, different classification algorithms may return different classifiers. Usually, deciding which one to use requires running some testing experiments.

12

A Classification Algorithm

Decision Trees

Decision trees are among the most widely used and most effective classifiers.

We will illustrate the decision tree classification algorithm with an example: the weather data.

13

Weather Data

Outlook Temperature Humidity Windy Play

sunny hot high false no

sunny hot high true no

overcast hot high false yes

rainy mild high false yes

rainy cool normal false yes

rainy cool normal true no

overcast cool normal true yes

sunny mild high false no

sunny cool normal false yes

rainy mild normal false yes

sunny mild normal true yes

overcast mild high true yes

overcast hot normal false yes

rainy mild high true no

14

Constructing a decision tree for the weather data (1)

[Figure: the four candidate splits of the weather data, on Outlook (sunny / overcast / rainy), Temperature (hot / mild / cool), Humidity (high / normal), and Windy (false / true), each branch annotated with its yes/no counts. For Outlook the three branch entropies are 0.971, 0.0, and 0.971 bits.]

H([2, 3]) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971 bits; H([4, 0]) = 0 bits; H([3, 2]) = 0.971 bits.
H([2, 3], [4, 0], [3, 2]) = (5/14)*H([2, 3]) + (4/14)*H([4, 0]) + (5/14)*H([3, 2]) = 0.693 bits.

Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits. Gain(outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.247 bits.

Gain(outlook) = 0.247, Gain(temperature) = 0.029, Gain(humidity) = 0.152, Gain(windy) = 0.048. Outlook gives the largest gain, so it is chosen as the root.


H(p1, …, pn) = -p1*log(p1) - … - pn*log(pn)

H(p, q, r) = H(p, q + r) + (q + r)*H(q/(q + r), r/(q + r))
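As a sketch of how these quantities are computed, here is a small Python fragment (the function names are ours); it reproduces the numbers above for the outlook split of the weather data:

import math

def H(counts):
    """Entropy, in bits, of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_entropy(partitions):
    """Weighted average entropy of the partitions produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * H(p) for p in partitions)

def gain(before, partitions):
    """Information gain of a split."""
    return H(before) - split_entropy(partitions)

# Weather data: 9 yes / 5 no overall; splitting on outlook gives [2,3], [4,0], [3,2].
print(H([9, 5]))                                  # ~0.940 bits
print(split_entropy([[2, 3], [4, 0], [3, 2]]))    # ~0.693 bits
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))     # ~0.247 -> Gain(outlook)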

15

Constructing a decision tree for the weather data (2)

[Figure: the three candidate splits at the "outlook = sunny" node: Temperature (hot / mild / cool, gain 0.571), Humidity (high / normal, gain 0.971), and Windy (false / true, gain 0.020), each branch annotated with its yes/no counts. Humidity gives the largest gain and yields pure branches, so it is chosen.]

16

Constructing a decision tree for the weather data (3)

outlook
  sunny: humidity
    high: no
    normal: yes
  overcast: yes
  rainy: windy
    false: yes
    true: no

Computational cost of decision tree construction for a data set with m attributes and n instances:

O(m n log n) + O(n (log n)^2)
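The tree above can be read directly as nested conditionals; a Python transliteration (ours, for illustration):

def play(outlook, temperature, humidity, windy):
    """Decision tree classifier learned from the weather data."""
    # Note: temperature is not tested anywhere in this tree.
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy"
    return "no" if windy == "true" else "yes"

print(play("sunny", "cool", "high", "true"))  # no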

17

Naive Bayes

Counts and relative frequencies from the weather data, per class (play = yes / no):

Attribute     Value      yes   no     P(value | yes)   P(value | no)
Outlook       sunny      2     3      2/9              3/5
Outlook       overcast   4     0      4/9              0/5
Outlook       rainy      3     2      3/9              2/5
Temperature   hot        2     2      2/9              2/5
Temperature   mild       4     2      4/9              2/5
Temperature   cool       3     1      3/9              1/5
Humidity      high       3     4      3/9              4/5
Humidity      normal     6     1      6/9              1/5
Windy         false      6     2      6/9              2/5
Windy         true       3     3      3/9              3/5
Play          -          9     5      9/14             5/14

18

Naive Bayes (2)

A new day:

Outlook   Temperature   Humidity   Windy   Play
sunny     cool          high       true    ?

E = (sunny and cool and high and true). Bayes: P(yes | E) = (P(E | yes) * P(yes)) / P(E).

Assuming the attributes are statistically independent:

P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E) = 0.0053 / P(E).

P(no | E) = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) / P(E) = 0.0206 / P(E).

Since P(yes | E) + P(no | E) = 1 we have that P(E) = 0.0053 + 0.0206 = 0.0259. Thus: P(yes | E) = 0.205 and P(no | E) = 0.795.

Thus we answer: NO.

Obstruction: usually attributes are not statistically independent. However, naive Bayes works quite well in practice.
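A short Python sketch of this computation (the conditional probabilities are taken from the frequency tables on the previous slide; the variable names are ours):

# P(value | play = yes) and P(value | play = no) for the attribute values observed on the new day.
p_given_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
p_given_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}
prior = {"yes": 9/14, "no": 5/14}

evidence = ["sunny", "cool", "high", "true"]

likelihood_yes = prior["yes"]
likelihood_no = prior["no"]
for value in evidence:
    likelihood_yes *= p_given_yes[value]   # naive independence assumption
    likelihood_no *= p_given_no[value]

# Normalize so that P(yes | E) + P(no | E) = 1.
p_yes = likelihood_yes / (likelihood_yes + likelihood_no)   # ~0.205
p_no = likelihood_no / (likelihood_yes + likelihood_no)     # ~0.795
print("play =", "yes" if p_yes > p_no else "no")            # play = no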

19

Performance Evaluation

Split the data set into two parts: a training set and a test set.

Use the training set to compute the classifier.

Use the test set to evaluate the classifier. Note: the test set data have not been used in the training process.

This allows us to compute the following quantities (on the test set). For the sake of simplicity we refer to a two-class prediction.

                            Predicted class
                            yes                   no
Actual class   yes          TP (true positive)    FN (false negative)
               no           FP (false positive)   TN (true negative)
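From these four counts the success rate defined earlier follows directly; a one-function Python sketch (ours):

def success_rate(tp, fn, fp, tn):
    """Fraction of test instances classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)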

20

Lift Chart

Predicted positive subset size = (TP + FP)/(TP + FP + TN + FN)

Number of true positives = TP

[Figure: lift chart; x-axis: predicted positive subset size (0 to 100%), y-axis: number of true positives (0 to 1000).]

Lift charts are typically used in Marketing Applications
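Each point of a lift chart can be computed from the confusion-matrix counts obtained for a given predicted-positive subset; a sketch (ours):

def lift_chart_point(tp, fn, fp, tn):
    """Returns (predicted positive subset size, number of true positives)."""
    subset_size = (tp + fp) / (tp + fp + tn + fn)   # x-axis
    return subset_size, tp                          # y-axis: TP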

21

Receiver Operating Characteristic (ROC) Curve

FP rate = FP/(FP + TN)

TP rate = TP/(TP + FN)

[Figure: ROC curve; x-axis: FP rate (0 to 100%), y-axis: TP rate (0 to 100%).]

ROC curves are typically used in Communication Applications
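Analogously, each point of a ROC curve is a (FP rate, TP rate) pair; a sketch (ours):

def roc_point(tp, fn, fp, tn):
    """Returns (FP rate, TP rate) for one classifier."""
    fp_rate = fp / (fp + tn)   # x-axis
    tp_rate = tp / (tp + fn)   # y-axis
    return fp_rate, tp_rate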

22

A glimpse of data mining in Safeguard

We outline our use of data mining techniques in the Safeguard project.

23

On-line schema

[Figure: on-line architecture. tcpdump captures TCP packets; a filter on port 2506 extracts the preprocessed TCP payload. Format filters derive four views of the payload: the sequence of payload bytes, the distribution of payload bytes, the conditional probabilities of chars and words in the payload, and statistics (avg, var, dev) on the payload bytes. These feed Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), and a Cluster Analyzer, whose outputs are combined by a Supervisor into an alarm level.]
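For illustration only (this is not the Safeguard code), a Python sketch of how two of these payload views, the distribution of payload bytes and the statistics (avg, var, dev), could be computed by a format filter:

from collections import Counter
import statistics

def payload_views(payload: bytes):
    """Derive two feature views from a preprocessed TCP payload."""
    # Distribution of payload bytes: relative frequency of each byte value.
    counts = Counter(payload)
    distribution = {byte: count / len(payload) for byte, count in counts.items()}
    # Statistics on payload bytes: average, variance, standard deviation.
    avg = statistics.mean(payload)
    var = statistics.pvariance(payload)
    dev = statistics.pstdev(payload)
    return distribution, (avg, var, dev)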

24

Training schema

[Figure: training architecture. tcpdump captures TCP packets; a filter on port 2506 produces a preprocessed TCP payload log. Format filters derive the same four views as in the on-line schema: the sequence of payload bytes, the distribution of payload bytes, the conditional probabilities of chars and words in the payload, and statistics (avg, var, dev) on the payload bytes. From these, WEKA (a data mining tool) and an HT classifier synthesizer produce Classifier 1 (hash table based), an HMM synthesizer produces Classifier 2 (Hidden Markov Models), and a Cluster Analyzer is also derived from the training data.]
