
The role of Domain Knowledge in a large scale Data Mining Project

Kopanas I., Avouris N., Daskalaki S.

University of Patras


University of Patras, HCI Group - SETN02 2

Outline of the talk

• Knowledge in a DM process

• Case study in a large DM project: Prediction of customer insolvency in Telecommunications business

• The role of domain expertise (and domain experts) in the process

• Summary and conclusions


Data Mining

• Evolution of knowledge-based systems

• Key partners in Data Mining

– Data analyst / statistician

– Knowledge Engineer

– Domain Expert

• Role of domain knowledge in Data Mining


DM phases

(a) Problem definition

(b) Creating target data set

(c) Data pre-processing and transformation

(d) Feature and algorithm selection

(e) Data Mining

(f) Evaluation of learned knowledge

(g) Fielding the knowledge base


Case study: Prediction of Customer Insolvency in Telecommunications business

Predict the customers who are about to become insolvent, that is, the customers who will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly avertive) measures

• Problem Objectives

– Detect as many insolvent customers as possible

– Minimize false alarms (solvent customers classified as insolvent)


Case study: problem characteristics

• Significant loss of revenue for the company

• Human behavior is (generally) unpredictable

• Insolvency cases are rare compared to non-insolvencies

• Information can be retrieved only after processing huge amounts of data from several sources


The billing process (domain knowledge)

[Figure: timeline of the billing process over the months June to April, marking the Billing Period, Issue of Bill, Due Date, Service Interruption, and Nullification]


Target data set definition (semantic value of data)

• Data from 3 different cities (combination of rural, urban and touristic areas)

• Types of data

– Customer data (coded)

– Data from billing and payments

– Call detail records (from switching centers)

• Time span of data studied

– Cases of collected and uncollected bills (10/99-2/01)

– Call records (8/99-12/00)


Data pre-processing (knowledge-based reduction of search space)

• Eliminating inexpensive calls (< 0.3 €)

• Synchronizing data

• Removing noise

• Handling missing values

• Data aggregation by period
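The first and last steps of this list can be sketched in a few lines. This is only an illustration: the record layout (customer id, call date, cost in euros) and the function name are hypothetical, and the two-week period length is an assumption taken from the description of the modelling dataset, not from the project's actual code.

```python
from collections import defaultdict
from datetime import date

MIN_CALL_COST_EUR = 0.3   # threshold from the slide: cheaper calls are dropped
PERIOD_DAYS = 14          # assumed aggregation period (two weeks)

def aggregate_calls(calls, start):
    """Drop inexpensive calls, then total the rest per customer and period.

    `calls` is an iterable of (customer_id, call_date, cost_eur) tuples;
    this record layout is hypothetical, not the project's actual schema.
    Returns {(customer_id, period_index): (call_count, total_cost)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for customer_id, call_date, cost_eur in calls:
        if cost_eur < MIN_CALL_COST_EUR:
            continue  # knowledge-based reduction: ignore cheap calls
        period = (call_date - start).days // PERIOD_DAYS
        entry = totals[(customer_id, period)]
        entry[0] += 1
        entry[1] += cost_eur
    return {key: (count, round(cost, 2)) for key, (count, cost) in totals.items()}

# Made-up call records, for illustration only.
calls = [
    ("c1", date(1999, 8, 2), 0.10),   # dropped: below the threshold
    ("c1", date(1999, 8, 3), 1.50),
    ("c1", date(1999, 8, 10), 0.50),
    ("c1", date(1999, 8, 20), 2.00),  # falls in the second period
    ("c2", date(1999, 8, 5), 0.30),   # kept: not below the threshold
]
summary = aggregate_calls(calls, start=date(1999, 8, 1))
```

Filtering before aggregating keeps the per-period totals small and focused on the calls that matter for the bill, which is the point of a knowledge-based reduction of the search space.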

[Figure: data warehouse]


Dataset for model fitting

• Stratified sample of solvent customers

– Class distribution: 90% solvent customers and 10% insolvent customers

• 2066 cases in total and 46 variables

– 2 variables describing the phone account

– 4 variables describing customer attitude towards previous phone bills

– 40 variables summarizing customer call habits over fifteen 2-week periods
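One way to arrive at such a class distribution is to keep every insolvent case and draw solvent cases stratum by stratum until they make up roughly 90% of the sample. The sketch below assumes this scheme; the dictionary keys `insolvent` and `stratum`, the function name, and the toy population are all hypothetical and not taken from the project.

```python
import random
from collections import defaultdict

def undersample_solvent(cases, solvent_share=0.9, seed=0):
    """Keep every insolvent case; sample solvent cases stratum by stratum.

    `cases` is a list of dicts with hypothetical keys "insolvent" (bool)
    and "stratum" (for example the city); neither name is from the project.
    """
    rng = random.Random(seed)
    insolvent = [c for c in cases if c["insolvent"]]
    solvent_by_stratum = defaultdict(list)
    for c in cases:
        if not c["insolvent"]:
            solvent_by_stratum[c["stratum"]].append(c)

    n_solvent_total = sum(len(v) for v in solvent_by_stratum.values())
    if n_solvent_total == 0:
        return insolvent
    # Solvent cases needed so that they form about `solvent_share` of the sample.
    n_target = round(len(insolvent) * solvent_share / (1 - solvent_share))
    fraction = min(1.0, n_target / n_solvent_total)

    sample = list(insolvent)
    for stratum_cases in solvent_by_stratum.values():
        # The same sampling fraction in every stratum preserves its proportions.
        k = round(len(stratum_cases) * fraction)
        sample.extend(rng.sample(stratum_cases, k))
    return sample

# Toy population with about 1% insolvent customers spread over three strata.
population = [{"insolvent": False, "stratum": s} for s in "ABC" for _ in range(990)]
population += [{"insolvent": True, "stratum": "A"} for _ in range(30)]
sample = undersample_solvent(population)
n_insolvent = sum(c["insolvent"] for c in sample)
```

Because each stratum is rounded separately, the resulting share is only approximately 90/10 in general; in this toy population it comes out exact.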


Data mining

• Classification problem

– 2 classes: solvent and insolvent customers

– Distribution among classes in original dataset: 99% of solvent customers and 1% of insolvent customers

– Very small number of insolvencies

– Very different costs of misclassification between the two classes of customers
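To see why unequal misclassification costs matter, it can help to compare two classifiers under a cost-weighted score. The sketch below is purely illustrative: the cost values, the error counts, and the function name are hypothetical, since the slides do not state the actual costs.

```python
def misclassification_cost(missed_insolvent, false_alarms,
                           cost_missed=10.0, cost_false_alarm=1.0):
    """Total cost of the errors; both unit costs are hypothetical values."""
    return missed_insolvent * cost_missed + false_alarms * cost_false_alarm

# Two made-up classifiers: A misses few insolvent customers but raises many
# false alarms, B does the opposite.
cost_a = misclassification_cost(missed_insolvent=5, false_alarms=50)
cost_b = misclassification_cost(missed_insolvent=12, false_alarms=10)

# With equal unit costs the ranking flips, which is why the cost ratio matters.
equal_a = misclassification_cost(5, 50, cost_missed=1.0, cost_false_alarm=1.0)
equal_b = misclassification_cost(12, 10, cost_missed=1.0, cost_false_alarm=1.0)
```

Under the assumed 10:1 ratio classifier A is cheaper, while a plain error count would prefer B.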


Criteria for evaluation of prediction

The precision of the classifier, defined as the percentage of actually insolvent customers among those predicted as insolvent by the classifier.

The accuracy of the classifier, defined as the percentage of correctly predicted insolvent customers out of all insolvent customers in the data set (this measure is usually called recall or sensitivity).

Criteria: Precision > 30% and Accuracy > 70%
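The two definitions translate directly into code. In this minimal sketch the function names and the example counts are hypothetical; only the definitions and the 30% / 70% thresholds come from the slide.

```python
def insolvency_metrics(true_pos, false_pos, false_neg):
    """Precision and 'accuracy' as defined on the slide.

    true_pos:  insolvent customers predicted as insolvent
    false_pos: solvent customers predicted as insolvent (false alarms)
    false_neg: insolvent customers predicted as solvent (missed cases)
    """
    precision = true_pos / (true_pos + false_pos)
    accuracy = true_pos / (true_pos + false_neg)  # i.e. recall on the insolvent class
    return precision, accuracy

def meets_criteria(precision, accuracy, min_precision=0.30, min_accuracy=0.70):
    """Check the evaluation criteria: Precision > 30% and Accuracy > 70%."""
    return precision > min_precision and accuracy > min_accuracy

# Made-up counts, for illustration only.
precision, accuracy = insolvency_metrics(true_pos=80, false_pos=120, false_neg=20)
ok = meets_criteria(precision, accuracy)
```

Neither measure uses the correctly classified solvent customers, so both stay informative even when solvent cases dominate the data set.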


Features selected (most popular in 50 classifiers)

• NewCust

• Latency

• Count_X_charges

• CountResiduals

• StdDif

• TrendDif11

• TrendDif10

• TrendDif7

• TrendDif6

• TrendDif3

• TrendUnitsMax

• TrendDif5

• TrendDif8

• Average_Dif

• Type

• MaxSec

• TrendUnits5

• AverageUnits

• TrendCount5

• CountInstallments

TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx


Deployment of the Knowledge-based system

• The classifiers are combined (voting algorithms have been used)

• Heuristics are used as applicability criteria

• Visualization plays an important role in the design of the system

• The roles of the user and the knowledge-based system have to be carefully defined
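Combining classifiers by voting can be sketched as follows. The slides do not say which voting algorithms were used, so plain majority voting is an assumption here, as are the function name and the tie-breaking rule (ties fall back to "solvent" to limit false alarms). Labels follow the coding used in the result tables: 0 = insolvent, 1 = solvent.

```python
from collections import Counter

INSOLVENT, SOLVENT = 0, 1

def majority_vote(predictions, tie_break=SOLVENT):
    """Combine the labels that several classifiers give to one customer.

    `predictions` must contain at least one label. On a tie the
    `tie_break` label is returned (an assumption, not from the slides).
    """
    ranked = Counter(predictions).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return tie_break
    return top_label

# Votes of three hypothetical classifiers for three customers.
votes_per_customer = [
    [INSOLVENT, INSOLVENT, SOLVENT],
    [SOLVENT, SOLVENT, INSOLVENT],
    [SOLVENT, SOLVENT, SOLVENT],
]
combined = [majority_vote(votes) for votes in votes_per_customer]
tie_result = majority_vote([INSOLVENT, SOLVENT])
```

Heuristic applicability criteria, as mentioned in the list above, could be layered on top by deciding which classifiers are allowed to vote for a given customer.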


Stepwise Discriminant Analysis

Classification Results E3 (rows: actual category; columns: predicted category)

Cases               Type             Measure  Category  Predicted 0  Predicted 1  Total
Cases Selected      Original         Count    0                  78           58    136
                                              1                  28         1184   1212
                                     %        0               57.35        42.65    100
                                              1                2.31        97.69    100
                    Cross-validated  Count    0                  77           59    136
                                              1                  35         1177   1212
                                     %        0               56.62        43.38    100
                                              1                2.89        97.11    100
Cases not Selected  Original         Count    0                  36           28     64
                                              1                  22          632    654
                                     %        0               56.25        43.75    100
                                              1                3.36        96.64    100

93.6% of selected original grouped cases correctly classified

93.02% of selected cross-validated cases correctly classified

93.04% of unselected original grouped cases correctly classified


Decision Tree

Classification Results E21 (rows: actual category; columns: predicted group)

Cases               Type      Measure  Category  Predicted 0  Predicted 1  Total
Cases Selected      Original  Count    0                 101           35    136
                                       1                   9         1203   1212
                              %        0               74.26        25.74    100
                                       1                0.74        99.26    100
Cases not Selected  Original  Count    0                  42           22     64
                                       1                  16          638    654
                              %        0               65.62        34.38    100
                                       1                2.45        97.55    100


Neural Network

Classification Results E30 (rows: actual category; columns: predicted group)

Cases               Type      Measure  Category  Predicted 0  Predicted 1  Total
Cases Selected      Original  Count    0                  65           69    136
                                       1                   8         1203   1212
                              %        0                47.7         50.7    100
                                       1                 0.6         99.2    100
Cases not Selected  Original  Count    0                  24           40     64
                                       1                  11          643    654
                              %        0                37.5         62.5    100
                                       1                 1.6         98.3    100


Evaluation of classifiers (example)

• Performance over 90% in the majority class and over 83% in the minority class.

• precision = 113/2844 ≈ 4.0%

• accuracy = 113/136 ≈ 83.1%

Actual cases (rows) against predicted cases (columns):

                        Predicted Insolvent (0)   Predicted Solvent (1)
Actual Insolvent (0)           113 (83.1%)               23 (16.9%)
Actual Solvent (1)            2731 (9.8%)             25081 (90.2%)
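The figures on this slide can be recomputed from the four counts of the confusion matrix. The sketch below uses only the counts shown above; the variable and function names are illustrative.

```python
# Rows: actual class, columns: predicted class (0 = insolvent, 1 = solvent).
confusion = {
    0: {0: 113, 1: 23},
    1: {0: 2731, 1: 25081},
}

def row_percentages(matrix):
    """Share of each predicted class within every actual class, in percent."""
    result = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        result[actual] = {pred: round(100 * n / total, 1) for pred, n in row.items()}
    return result

percentages = row_percentages(confusion)

# Precision: actually insolvent customers among all predicted as insolvent.
predicted_insolvent = confusion[0][0] + confusion[1][0]
precision = round(100 * confusion[0][0] / predicted_insolvent, 2)

# Accuracy (as defined in this talk): detected insolvent customers among all
# insolvent customers.
accuracy = round(100 * confusion[0][0] / sum(confusion[0].values()), 2)
```

The row percentages reproduce the per-class performance quoted in the first bullet, while the precision depends on the column total and is therefore dominated by the much larger number of solvent customers.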


Stage                                   DK      Type of DK
(a) Problem definition                  HIGH    Business and domain knowledge, requirements; implicit, tacit knowledge
(b) Creating target data set            MEDIUM  Attribute relations, semantics of corporate DB
(c) Data pre-processing                 HIGH    Tacit and implicit knowledge for inferences
(d) Feature and algorithm selection     MEDIUM  Interpretation of the selected features
(e) Data Mining                         LOW     Inspection of discovered knowledge
(f) Evaluation of learned knowledge     MEDIUM  Definition of criteria related to business objectives
(g) Fielding the knowledge base         HIGH    Supplementary domain knowledge necessary for implementing the system


Selection of DM tool (Elder 98)


Conclusion

• Data mining is a knowledge-driven process

• All stages contribute to the success of the process

• Domain experts play a significant role in most phases of the process

• Need for selection of algorithms and techniques that support interpretation of mined knowledge

• Need for integrated tools and adequate techniques to support involvement of domain experts in the process