54
mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000

Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Embed Size (px)

Citation preview

Page 1: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Knowledge discovery & data mining

Tools, methods, and experiences

Fosca Giannotti and Dino PedreschiPisa KDD Lab

CNUCE-CNR & Univ. Pisahttp://www-kdd.di.unipi.it/

A tutorial @ EDBT2000

Page 2: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 2

Contributors and acknowledgements

The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students

The many KDD tutorialists and teachers which made their slides available on the web (all of them listed in bibliography) ;-)

In particular: Jiawei HAN, Simon Fraser University, whose forthcoming book Data

mining: concepts and techniques has influenced the whole tutorial Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs Daniel A. KEIM, University of Halle Daniel Silver, CogNova Technologies

The EDBT2000 board who accepted our tutorial proposal

Page 3: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 3

Tutorial goals

Introduce you to major aspects of the Knowledge Discovery Process, and theory and applications of Data Mining technology

Provide a systematization to the many many concepts around this area, according the following lines the process the methods applied to paradigmatic cases the support environment the research challenges

Important issues that will be not covered in this tutorial: methods: time series, exception detection, neural nets systems: parallel implementations

Page 4: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 4

Tutorial Outline

1. Introduction and basic concepts1. Motivations, applications, the KDD process, the techniques

2. Deeper into DM technology1. Decision Trees and Fraud Detection

2. Association Rules and Market Basket Analysis3. Clustering and Customer Segmentation

3. Trends in technology1. Knowledge Discovery Support Environment2. Tools, Languages and Systems

4. Research challenges

Page 5: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 5

Introduction - module outline

Motivations Application Areas KDD Decisional Context KDD Process Architecture of a KDD system The KDD steps in short

Page 6: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 6

Evolution of Database Technology:from data management to data analysis

1960s: Data collection, database creation, IMS and network DBMS.

1970s: Relational data model, relational DBMS implementation.

1980s: RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).

1990s: Data mining and data warehousing, multimedia databases,

and Web technology.

Page 7: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 7

Motivations “Necessity is the Mother of Invention”

Data explosion problem: Automated data collection tools, mature database

technology and internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

We are drowning in information, but starving for knowledge! (John Naisbett)

Data warehousing and data mining : On-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

Page 8: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 8

Also referred to as: Data dredging, Data harvesting, Data

archeology

A multidisciplinary field: Database Statistics Artificial intelligence

Machine learning, Expert systems and Knowledge Acquisition

Visualization methods

A rapidly emerging fieldA rapidly emerging field

Page 9: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 9

Motivations for DM

Abundance of business and industry dataCompetitive focus - Knowledge

Management Inexpensive, powerful computing enginesStrong theoretical/mathematical

foundations machine learning & logic statistics database management systems

Page 10: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 10

What is DM useful for?

Marketing

DatabaseMarketing

DataWarehousing

KDD &Data Mining

Increase knowledge to base decision upon.

E.g., impact on marketing

Page 11: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 11

The Value Chain

DataData• Customer data• Store data• Demographical Data• Geographical data

InformationInformation• X lives in Z• S is Y years old• X and S moved• W has money in Z

KnowledgeKnowledge• A quantity Y of product A is used in region Z• Customers of class Y use x% of C during period D

DecisionDecision• Promote product A in region Z.• Mail ads to families of profile P• Cross-sell service B to clients C

Page 12: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 12

Application Areas and Opportunities

Marketing: segmentation, customer targeting, ... Finance: investment support, portfolio management Banking & Insurance: credit and policy approval Security: fraud detection Science and medicine: hypothesis discovery,

prediction, classification, diagnosis Manufacturing: process modeling, quality control,

resource allocation Engineering: simulation and analysis, pattern

recognition, signal processing Internet: smart search engines, web marketing

Page 13: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 13

Classes of applications

Market analysis target marketing, customer relation management, market

basket analysis, cross selling, market segmentation.

Risk analysis Forecasting, customer retention, improved underwriting,

quality control, competitive analysis.

Fraud detection

Text (news group, email, documents) and Web analysis.

Page 14: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Market Analysis

Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons,

customer complaint calls, plus (public) lifestyle studies.

Target marketing Find clusters of “model” customers who share the same

characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage,

etc.

Cross-market analysis Associations/co-relations between product sales Prediction based on the association information.

Page 15: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Customer profiling data mining can tell you what types of customers

buy what products (clustering or classification).

Identifying customer requirements identifying the best products for different

customers use prediction to find what factors will attract

new customers

Provides summary information various multidimensional summary reports; statistical summary information (data central

tendency and variation)

Market Analysis and ManagementMarket Analysis (2)

Page 16: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Risk Analysis

Finance planning and asset evaluation: cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-

ratio, trend analysis, etc.)

Resource planning: summarize and compare the resources and

spending

Competition: monitor competitors and market directions (CI:

competitive intelligence). group customers into classes and class-based

pricing procedures set pricing strategy in a highly competitive market

Page 17: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Fraud Detection

Applications: widely used in health care, retail, credit card services,

telecommunications (phone card fraud), etc.

Approach: use historical data to build models of fraudulent

behavior and use data mining to help identify similar instances.

Examples: auto insurance: detect a group of people who stage

accidents to collect on insurance money laundering: detect suspicious money

transactions (US Treasury's Financial Crimes Enforcement Network)

medical insurance: detect professional patients and ring of doctors and ring of references

Page 18: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

More examples: Detecting inappropriate medical treatment:

Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud: Telephone call model: destination of the call, duration,

time of day or week. Analyze patterns that deviate from an expected norm.

British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

Fraud Detection (2)

Page 19: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Sports IBM Advanced Scout analyzed NBA game statistics

(shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.

Astronomy JPL and the Palomar Observatory discovered 22

quasars with the help of data mining

Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web

access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Watch for the PRIVACY pitfall!

Other applications

Page 20: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 20

The selection and processing of data for: the identification of novel, accurate,

and useful patterns, and the modeling of real-world

phenomena.

Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

What is KDD? A process!

Page 21: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 21

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

The KDD process

Page 22: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 22

The KDD Process

Core Problems & Approaches Problems:

identification of relevant data representation of data search for valid pattern or model

Approaches: top-down deduction by expert interactive visualization of data/models * bottom-up induction from data *

DataMining

OLAP

Page 23: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Learning the application domain: relevant prior knowledge and goals of application

Data consolidation: Creating a target data set Selection and Preprocessing

Data cleaning : (may take 60% of effort!) Data reduction and projection:

find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining summarization, classification, regression, association,

clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Interpretation and evaluation: analysis of results.

visualization, transformation, removing redundant patterns, …

Use of discovered knowledge

The steps of the KDD process

Page 24: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 24

CogNovaTechnologies

9

The KDD ProcessThe KDD Process

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

IdentifyProblem or Opportunity

Measure effectof Action

Act onKnowledge

Knowledge

ResultsStrategy

Problem

The virtuous cycle

Page 25: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 25

Applications, operations, techniques

Page 26: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 26

Roles in the KDD process

Page 27: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 27

Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

MakingDecisions

Data Presentation

Visualization Techniques

Data MiningInformation Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

Data mining and business intelligence

Page 28: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 28

Graphical User Interface

DataConsolidation

Selectionand

Preprocessing

DataMining

Interpretationand Evaluation

Warehouse KnowledgeData Sources

Architecture of a KDD system

Page 29: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 29

A business intelligence environment

Page 30: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 30

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

The KDD process

Page 31: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 31

Garbage in Garbage out The quality of results relates directly to

quality of the data 50%-70% of KDD process effort is spent

on data consolidation and preparation Major justification for a corporate data

warehouse

Data consolidation and preparation

Page 32: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 32

From data sources to consolidated data repository

RDBMS

Legacy DBMS

Flat Files

DataConsolidationand Cleansing

Warehouse

Object/Relation DBMS Object/Relation DBMS

Multidimensional DBMS Multidimensional DBMS

Deductive Database Deductive Database

Flat files Flat files External

Data consolidation

Page 33: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 33

Determine preliminary list of attributes Consolidate data into working database

Internal and External sources

Eliminate or estimate missing values Remove outliers (obvious exceptions) Determine prior probabilities of

categories and deal with volume bias

Data consolidation

Page 34: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 34

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

The KDD process

Page 35: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 35

Generate a set of examples choose sampling method consider sample complexity deal with volume bias issues

Reduce attribute dimensionality remove redundant and/or correlating attributes combine attributes (sum, multiply, difference)

Reduce attribute value ranges group symbolic discrete values quantize continuous numeric values

Transform data de-correlate and normalize values map time-series data to static representation

OLAP and visualization tools play key role

Data selection and preprocessing

Page 36: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 36

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

The KDD process

Page 37: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 37

Data mining tasks and methods

Automated Exploration/Discovery e.g.. discovering new market segments clustering analysis

Prediction/Classification e.g.. forecasting gross sales given current factors regression, neural networks, genetic algorithms,

decision trees

Explanation/Description e.g.. characterizing customers by demographics

and purchase history decision trees, association rules

x1

x2

f(x)

x

if age > 35 and income < $35k then ...

Page 38: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 38

Clustering: partitioning a set of data into a set

of classes, called clusters, whose members share some interesting common properties.

Distance-based numerical clustering metric grouping of examples (K-NN) graphical visualization can be used

Bayesian clustering search for the number of classes which result in

best fit of a probability distribution to the data AutoClass (NASA) one of best examples

Automated exploration and discovery

Page 39: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 39

Learning a predictive modelClassification of a new

case/sample Many methods:

Artificial neural networks Inductive decision tree and rule systems Genetic algorithms Nearest neighbor clustering algorithms Statistical (parametric, and non-

parametric)

Prediction and classification

Page 40: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 40

The objective of learning is to achieve good generalization to new unseen cases.

Generalization can be defined as a mathematical interpolation or regression over a set of training points

Models can be validated with a previously unseen test set or using cross-validation methods

f(x)

x

Generalization and regression

Page 41: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 41

Classification and prediction

Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.

Use obtained model to predict some unknown or missing attribute values based on other information.

Page 42: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 42

Objective: Develop a general model or hypothesis from specific

examples

Function approximation (curve fitting)

Classification (concept learning, pattern recognition)

x1

x2

AB

f(x)

x

Summarizing: inductive modeling = learning

Page 43: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 43

Learn a generalized hypothesis (model) from selected data

Description/Interpretation of model provides new knowledge

Methods: Inductive decision tree and rule systems Association rule systems Link Analysis …

Explanation and description

Page 44: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 44

Generate a model of normal activity Deviation from model causes alert Methods:

Artificial neural networks Inductive decision tree and rule systems Statistical methods Visualization tools

Exception/deviation detection

Page 45: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 45

Outlier and exception data analysis

Time-series analysis (trend and deviation): Trend and deviation analysis: regression,

sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.

Similarity-based pattern-directed analysis

Full vs. partial periodicity analysis Other pattern-directed or statistical

analysis

Page 46: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 46

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidationand Warehousing

Knowledge

p(x)=0.02

Warehouse

The KDD process

Page 47: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

A data mining system/query may generate thousands of patterns, not all of them are interesting.

Interestingness measures: easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to

confirm

Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns,

e.g., support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g.,

unexpectedness, novelty, etc.

Are all the discovered pattern interesting?

Page 48: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Find all the interesting patterns: Completeness. Can a data mining system find all the interesting

patterns?

Search for only interesting patterns: Optimization. Can a data mining system find only the interesting

patterns? Approaches

First generate all the patterns and then filter out the uninteresting ones.

Generate only the interesting patterns - mining query optimization.

Completeness vs. optimization

Page 49: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 49

Evaluation Statistical validation and significance testing Qualitative review by experts in the field Pilot surveys to evaluate model accuracy

Interpretation Inductive tree and rule models can be read

directly Clustering results can be graphed and tabled Code can be automatically generated by

some systems (IDTs, Regression models)

Interpretation and evaluation

Page 50: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 50

Visualization tools can be very helpful sensitivity analysis (I/O relationship) histograms of value distribution time-series plots and animation requires training and practice

Response

Velocity

Temp

Interpretation and evaluation

Page 51: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

1989 IJCAI Workshop on KDD Knowledge Discovery in Databases (G. Piatetsky-

Shapiro and W. Frawley, eds., 1991)

1991-1994 Workshops on KDD Advances in Knowledge Discovery and Data Mining

(U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)

1995-1998 AAAI Int. Conf. on KDD and DM (KDD’95-98) Journal of Data Mining and Knowledge Discovery

(1997)

1998 ACM SIGKDD 1999 SIGKDD’99 Conf.

Important dates of data mining

Page 52: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 52

References - general P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE

Trans. Knowledge and Data Engineering, 8:866-883, 1996. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge

Discovery and Data Mining. AAAI/MIT Press, 1996. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To

appear. T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications

of ACM, 39:58-64, 1996. G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An

overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.

G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

Michael Berry & Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.

Sholom M. Weiss and Nitin Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.

W.H. Inmon, J.D. Welch, Katherine L. Glassey. Managing the data warehouse. Wiley, 1997. T. Mitchell. Machine Learning. McGraw-Hill, 1997.

Page 53: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 53

Main Web resources

http://www.kdnuggets.com KDD Newsletter and comprehensive website

http://www.acm.org/sigkdd ACM SIGKDD

http://www.research.microsoft.com/datamine/ Journal of Data Mining and Knowledge Discovery

Page 54: Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 54

Tutorial Outline

Introduction and basic concepts Motivations, applications, the KDD process, the techniques

Deeper into DM technology Decision Trees and Fraud Detection Association Rules and Market Basket Analysis Clustering and Customer Segmentation

Trends in technology Knowledge Discovery Support Environment Tools, Languages and Systems

Research challenges