COMP309 Data Mining: COMP309 — Week 04 Data Mining Process CRISP-DM Dr Qi Chen School of Engineering and Computer Science Victoria University of Wellington [email protected]




Week Overview

• What is Data Mining

• Data Mining and Machine Learning

• Data Mining Process: CRISP-DM

• The Six Phases in CRISP-DM

Cross Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a standard methodology that sets out procedures and best practices for data mining. It has six phases:

● Business Understanding

● Data Understanding

● Data Preparation

● Modeling

● Evaluation

● Deployment

Business Understanding

Gain a true understanding of the business and identify the specific goals and problems the business wishes to solve.

This phase includes four tasks:

● determine business objectives

● assess situation

● determine data mining goals

● develop project plan


Determine Business Objectives

• Business Objectives

⎼ describe the primary objective from a business perspective

⎼ list other related questions the business would like to address

• Produce Project Plan

⎼ the plan for achieving the business goals

• Business Success Criteria

⎼ lay out the criteria to determine whether the project has been successful from the business point of view

Example:

Primary goal: keep current customers by predicting when they are prone to move to a competitor

Related questions: does the channel used affect whether customers stay or leave? Will lower ATM fees reduce the number of high-value customers who leave?

Assess Current Situation

Determine resource availability, assess risks, and conduct a cost-benefit analysis

● list resources available to the project: personnel, data, computing resources and software

● list the requirements, assumptions and constraints (e.g. any legal issues around the data)

● list risks and the corresponding actions

● compile a glossary of terminology (business and data mining ones)

● construct a cost-benefit analysis


Determine Data Mining Goals

Convert the business objectives into a data mining problem definition and data mining goals

● State project objectives in technical terms

● Define the data mining success criteria (e.g. a certain level of predictive accuracy); these might be subjective

Example:

Business Goal: Increase catalogue sales to existing customers

Data Mining Goal: Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.), and the price of the item
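A data mining success criterion such as "a certain level of predictive accuracy" can be written down as a check that is either met or not. The sketch below is only an illustration: the widget counts, the choice of mean absolute error as the measure, and the threshold `MAX_MAE` are all assumptions, not values from the course.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def meets_success_criterion(actual, predicted, max_mae):
    """True when the model's mean absolute error is within the agreed limit."""
    return mean_absolute_error(actual, predicted) <= max_mae


# Hypothetical widget counts for five customers (illustrative numbers only).
actual_widgets = [2, 0, 5, 1, 3]
predicted_widgets = [1, 0, 4, 2, 3]

# Hypothetical criterion: be off by at most one widget per customer on average.
MAX_MAE = 1.0

mae = mean_absolute_error(actual_widgets, predicted_widgets)
criterion_met = meets_success_criterion(actual_widgets, predicted_widgets, MAX_MAE)
print(f"MAE = {mae:.2f}, criterion met: {criterion_met}")
```

Agreeing on the measure and the threshold during Business Understanding gives the later Evaluation phase something concrete to test against.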


Produce Project Plan

Describe the intended plan for achieving the data mining goals and thereby achieving the business goals

● Project plan: list the stages to be executed in the project with their duration, resources required, inputs, outputs, and dependencies

● Initial assessment of tools and techniques (e.g. a data mining tool)
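A project plan that lists stages with durations and dependencies can be kept as plain data and ordered automatically. The following is a minimal sketch, assuming a strictly linear plan built from the six CRISP-DM phases. The `Stage` class and every duration are hypothetical, and real projects usually loop back between phases.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Stage:
    name: str
    duration_weeks: int
    depends_on: list = field(default_factory=list)


# Hypothetical durations; stage names follow the six CRISP-DM phases.
stages = [
    Stage("Business Understanding", 2),
    Stage("Data Understanding", 3, ["Business Understanding"]),
    Stage("Data Preparation", 6, ["Data Understanding"]),
    Stage("Modeling", 2, ["Data Preparation"]),
    Stage("Evaluation", 1, ["Modeling"]),
    Stage("Deployment", 2, ["Evaluation"]),
]

by_name = {s.name: s for s in stages}

# Order the stages so that every stage comes after the stages it depends on.
graph = {s.name: s.depends_on for s in stages}
order = list(TopologicalSorter(graph).static_order())

# Earliest finish = own duration + latest finish among the dependencies.
finish = {}
for name in order:
    stage = by_name[name]
    start = max((finish[d] for d in stage.depends_on), default=0)
    finish[name] = start + stage.duration_weeks

for name in order:
    print(f"{name}: finishes in week {finish[name]}")
```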


Data Understanding

Take a closer look at the data: access and explore it, and check how well the data matches the business problem

This phase includes four tasks:

• Collect initial data

• Describe data

• Explore data

• Verify data quality


Data Understanding

• Collect initial data: acquire the data listed in the project resources; this may require loading the data and integrating multiple data sources

• Describe data: examine the “gross” or “surface” properties of the data (e.g. data format, quantity)

• Explore data: dig deeper into the data, query, visualize, and identify relationships among the data

• Verify data quality: examine the quality of the data, addressing questions such as: is the data complete? Is it correct? Are there missing values?
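The "describe" and "verify data quality" tasks above can be sketched with a few lines of standard-library Python. The CSV extract, its column names, and the plausible age range are hypothetical. The point is only to show the kind of question each task asks.

```python
import csv
import io
from collections import Counter

# Hypothetical customer extract; empty fields stand for missing values.
RAW_CSV = """customer_id,age,city,salary
1,34,Wellington,72000
2,,Auckland,58000
3,45,,91000
4,29,Wellington,
5,210,Christchurch,64000
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))


def describe(rows):
    """Surface properties: number of records and the column names."""
    return {"n_rows": len(rows), "columns": list(rows[0].keys())}


def missing_counts(rows):
    """Number of empty values in each column (is the data complete?)."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "":
                counts[column] += 1
    return dict(counts)


def out_of_range(rows, column, low, high):
    """IDs of records whose numeric value lies outside [low, high] (is it correct?)."""
    return [
        row["customer_id"]
        for row in rows
        if row[column] != "" and not low <= float(row[column]) <= high
    ]


print(describe(rows))
print(missing_counts(rows))
print(out_of_range(rows, "age", 0, 120))
```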


Data Preparation

Also known as “data munging”: prepare the final data set(s) for modelling

A common rule of thumb is that 80% of the project effort goes into data preparation

This phase includes five tasks:

• Data Selection

• Data Cleaning

• Data Construction

• Integrate data

• Format data


Data Preparation

• Data Selection: determine the data sets to be used, including the selection of features (columns) and records (rows)

• Data Cleaning: often the lengthiest task; correct, impute, or remove erroneous and missing values

• Data Construction: constructive data preparation operations

⎼ feature construction, instance generation, feature transformation

• Integrate data: create new records or values by combining information from multiple data sources

⎼ merge information from different sources, aggregations

• Format data: re-format the data into a form convenient for modelling
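The five tasks above can be illustrated end to end on a toy data set. This is a sketch under assumptions: the records, the choice of median imputation, and the derived feature `spend_per_1000_salary` are invented for illustration, and a real project would pick each step to suit its own data.

```python
from statistics import median

# Hypothetical raw records from two sources (illustrative values only).
customers = [
    {"customer_id": 1, "age": 34, "salary": 72000},
    {"customer_id": 2, "age": None, "salary": 58000},
    {"customer_id": 3, "age": 45, "salary": 91000},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 12.5},
]

# Data cleaning: impute missing ages with the median of the known ages.
known_ages = [c["age"] for c in customers if c["age"] is not None]
median_age = median(known_ages)
cleaned = [
    {**c, "age": c["age"] if c["age"] is not None else median_age}
    for c in customers
]

# Integrate data: aggregate purchases per customer and merge the totals in.
total_spent = {}
for p in purchases:
    total_spent[p["customer_id"]] = total_spent.get(p["customer_id"], 0.0) + p["amount"]
integrated = [
    {**c, "total_spent": total_spent.get(c["customer_id"], 0.0)} for c in cleaned
]

# Data construction: derive a new feature from existing ones.
for record in integrated:
    record["spend_per_1000_salary"] = record["total_spent"] / (record["salary"] / 1000)

# Data selection and formatting: keep the chosen features as numeric rows.
FEATURES = ["age", "salary", "total_spent", "spend_per_1000_salary"]
matrix = [[float(r[f]) for f in FEATURES] for r in integrated]

for row in matrix:
    print(row)
```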


Model Building

Build and assess various models based on several different modeling techniques

This phase is widely regarded as data science’s most exciting work, but it is often the shortest in the process


Model Building

• Select modelling technique: select the specific modelling technique, and record assumptions

• Generate test design: generate a procedure or mechanism to test the model’s quality and validity (e.g. separate data into training and test sets)

• Build model: run the modelling tool on the prepared dataset to create one or more models; choose parameter settings and describe the resulting models

• Assess model: interpret the models according to domain knowledge, data mining success criteria and the desired test design
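A test design and a model build can be sketched together in a few lines. The example below assumes a hold-out split and a 1-nearest-neighbour classifier purely as stand-ins. The slides do not prescribe either, and the labelled records are hypothetical.

```python
import random

# Hypothetical labelled records: (feature vector, class label).
data = [
    ((1.0, 1.1), "stay"), ((0.9, 1.0), "stay"),
    ((1.2, 0.8), "stay"), ((1.1, 1.3), "stay"),
    ((3.0, 3.2), "leave"), ((3.1, 2.9), "leave"),
    ((2.8, 3.0), "leave"), ((3.3, 3.1), "leave"),
]


def train_test_split(records, test_fraction, seed):
    """Test design: shuffle a copy of the records and hold out a fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """Model: label of the training record closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x)))
    return nearest[1]


def accuracy(train, test):
    """Assessment: fraction of held-out records that are labelled correctly."""
    correct = sum(predict_1nn(train, x) == label for x, label in test)
    return correct / len(test)


train, test = train_test_split(data, test_fraction=0.25, seed=0)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(train, test):.2f}")
```

Fixing the seed makes the split reproducible, which matters when several models are compared against the same test set.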


Assess Model

• Evaluation of model: how well did it perform on test data?

• Methods and criteria depend on model type:

‣ e.g. confusion matrix for classification models, mean error for regression models

• Interpretation of model: whether it is important, and whether it is easy or hard, depends on the algorithm
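The two kinds of evaluation named above can be computed directly from predictions on the test data. This is a minimal sketch with made-up labels and values. Mean squared error is used here as one common choice of "mean error" for regression, which is an assumption rather than something the slides specify.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a classification model."""
    return Counter(zip(actual, predicted))


def mean_squared_error(actual, predicted):
    """Average squared difference for a regression model."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical classification results on six test records.
actual_labels = ["leave", "stay", "stay", "leave", "stay", "stay"]
predicted_labels = ["leave", "stay", "leave", "stay", "stay", "stay"]
cm = confusion_matrix(actual_labels, predicted_labels)

labels = ["leave", "stay"]
print("actual \\ predicted", labels)
for a in labels:
    print(a, [cm[(a, p)] for p in labels])

# Hypothetical regression results on four test records.
actual_values = [3.0, 5.0, 2.0, 4.0]
predicted_values = [2.5, 5.0, 3.0, 4.5]
mse = mean_squared_error(actual_values, predicted_values)
print("MSE:", mse)
```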


Model Evaluation

Evaluate the models, determine which one best meets the business objectives, and decide what to do next

This phase has three tasks:

● Evaluate results

● Review process

● Determine next steps


Model Evaluation

• Evaluate results: assess the degree to which the model meets the business objectives

• Review process: do a more thorough review of the data mining engagement, also covering quality assurance issues

• Determine next steps: decide how to proceed depending on the results of the assessment and the process review


Deployment

The process of using the new insights to make improvements: either a formal integration of the model, or using the insights gained from data mining to make changes

This phase has four tasks:

● Plan deployment

● Plan monitoring and maintenance

● Produce final report

● Review project


This table summarizes the main inputs to the deliverables. This does not mean that only the inputs listed should be considered; for example, the business objectives should be considered for all deliverables. However, the deliverables should address specific issues raised by their inputs.


Data Mining with Privacy

• Data contains information about real people

• Data mining looks for patterns, not people!

• Technical solutions can limit privacy invasion

⎼ replacing sensitive personal data with an anonymous ID

⎼ giving randomized outputs

⎼ multi-party computation on distributed data

Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
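Two of the technical solutions listed on this slide can be sketched briefly: replacing an identifier with an anonymous ID, and giving randomized outputs. The code is an illustration under assumptions and is not taken from the cited article. It uses a keyed hash for the anonymous ID and the classic randomized-response scheme for the noisy answers. The key, the e-mail address, and the 30% "yes" rate are hypothetical. A keyed hash on its own is pseudonymization and does not make the rest of a record anonymous. Multi-party computation is not shown.

```python
import hashlib
import hmac
import random


def pseudonymize(identifier, secret_key):
    """Replace a personal identifier with a keyed hash used as an anonymous ID."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def randomized_response(true_answer, rng, p_truth=0.75):
    """Report the true yes/no answer with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5


def estimate_true_rate(reported, p_truth=0.75):
    """Invert the noise: observed = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reported) / len(reported)
    return (observed - (1 - p_truth) * 0.5) / p_truth


# Hypothetical key; in practice it would be stored separately from the data set.
SECRET_KEY = b"hypothetical-key-kept-outside-the-dataset"
anon_id = pseudonymize("alice@example.com", SECRET_KEY)
print("anonymous ID:", anon_id)

# Hypothetical survey: 30% of 10,000 people would truthfully answer "yes".
rng = random.Random(42)
true_answers = [i < 3000 for i in range(10000)]
reported = [randomized_response(answer, rng) for answer in true_answers]
estimate = estimate_true_rate(reported)
print(f"estimated 'yes' rate: {estimate:.3f}")
```

No individual reported answer can be trusted, yet the overall rate can still be estimated, which matches the slide's point that data mining looks for patterns, not people.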