COMP309 Data Mining:
COMP309 — Week 04
Data Mining Process
CRISP-DM
Dr Qi Chen
School of Engineering and Computer Science
Victoria University of Wellington
Week Overview
• What is Data Mining
• Data Mining and Machine Learning
• Data Mining Process: CRISP-DM
• The Six Phases in CRISP-DM
Cross Industry Standard Process for Data Mining (CRISP-DM)
● Business Understanding
● Data Understanding
● Data Preparation
● Modeling
● Evaluation
● Deployment
CRISP-DM is a standard methodology and set of best-practice procedures for data mining
Business Understanding
Gain a true understanding of the business, and identify the specific goals and problems the business wishes to solve.
This phase includes four tasks
● determine business objectives
● assess situation
● determine data mining goals
● develop project plan
Determine Business Objectives
• Business Objectives
⎼ describe the primary objective from a business perspective
⎼ list other related questions the business would like to address
• Produce Project Plan
⎼ the plan for achieving the business goals
• Business Success Criteria
⎼ lay out the criteria to determine whether the project has been successful from the business point of view
Primary goal: keep current customers by predicting when they are prone to move to a competitor
Related questions: does the channel used affect whether customers stay or leave? Will lower ATM fees reduce the number of high-value customers who leave?
Assess Current Situation
Determine resource availability, assess risks, and conduct a cost-benefit analysis
● list resources available to the project: personnel, data, computing resources and software
● list the requirements, assumptions and constraints (e.g. any legal issues with the data)
● list risks and the corresponding actions
● compile a glossary of terminology (both business and data mining terms)
● construct a cost-benefit analysis
Determine Data Mining Goals
Convert the business objectives into a data mining problem definition and goals
● State project objectives in technical terms
● Define the data mining success criteria (e.g. a certain level of predictive accuracy); these might be subjective
Business Goal: Increase catalogue sales to existing customers
Data Mining Goal: Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.), and the price of the item
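One way to keep the link between a business goal and its data mining goal explicit is to write both down in one small record, together with a success criterion that can be checked. The sketch below is illustrative only: the class name `DataMiningGoal`, its fields, the choice of metric, and the threshold of 1.5 are assumptions made for this example, not part of CRISP-DM or the course material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMiningGoal:
    business_goal: str
    target: str        # what the model predicts
    inputs: tuple      # what it predicts from
    metric: str        # how predictions are scored
    threshold: float   # hypothetical value agreed with the business

    def is_met(self, score: float) -> bool:
        """For an error metric lower is better, so success means score <= threshold."""
        return score <= self.threshold


goal = DataMiningGoal(
    business_goal="Increase catalogue sales to existing customers",
    target="number of widgets bought",
    inputs=("purchases over the past three years", "age", "salary", "city", "item price"),
    metric="mean absolute error (widgets per customer)",
    threshold=1.5,  # assumed for illustration only
)

print(goal.is_met(1.0))  # a model scoring below the threshold
print(goal.is_met(2.3))  # a model scoring above the threshold
```

Writing the criterion as a number makes it possible to decide later, in the evaluation phase, whether a candidate model actually satisfies what was agreed here.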
Produce Project Plan
Describe the intended plan for achieving the data mining goals and thereby achieving the business goals
● Project plan: list the stages to be executed in the project with their duration, resources required, inputs, outputs, and dependencies
● Initial assessment of tools and techniques (e.g. a data mining tool)
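A project plan of this kind can be held as a small data structure, with the dependencies used to derive a valid execution order. The following is a minimal sketch using Python's standard `graphlib` module. The `Stage` class and all durations are hypothetical, and treating the six phases as a strict chain is a simplification made for this example.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Stage:
    name: str
    duration_days: int  # invented figures, for illustration only
    depends_on: list = field(default_factory=list)


# Hypothetical plan: the six CRISP-DM phases as a simple chain.
plan = [
    Stage("business understanding", 5),
    Stage("data understanding", 10, ["business understanding"]),
    Stage("data preparation", 20, ["data understanding"]),
    Stage("modeling", 5, ["data preparation"]),
    Stage("evaluation", 5, ["modeling"]),
    Stage("deployment", 10, ["evaluation"]),
]

# TopologicalSorter expects a mapping of node -> set of predecessors.
graph = {stage.name: set(stage.depends_on) for stage in plan}
execution_order = list(TopologicalSorter(graph).static_order())
total_days = sum(stage.duration_days for stage in plan)

print(execution_order)
print("total duration if run strictly in sequence:", total_days, "days")
```

A real plan would also record resources, inputs and outputs per stage, which could be added as further fields on `Stage`.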
Data Understanding
Take a closer look at the data: access and explore it, and check how well it matches the business problem
This phase includes four tasks
• Collect initial data
• Describe data
• Explore data
• Verify data quality
Data Understanding
• Collect initial data: acquire the data listed in the project resources; this may require loading data and integrating multiple data sources
• Describe data: examine the “gross” or “surface” properties of the data (e.g. data format, quantity)
• Explore data: dig deeper into the data by querying, visualizing, and identifying relationships among the data
• Verify data quality: examine the quality of the data, addressing questions such as: is the data complete? Is it correct? Are there missing values?
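The describe, explore and verify tasks above can be sketched with nothing more than the Python standard library. The data set below is invented for illustration, and the column names are assumptions. In practice the rows would be read from the files or databases identified during data collection.

```python
import csv
import io
from collections import Counter

# Hypothetical raw extract; empty fields stand for missing values.
RAW = """customer_id,age,city,purchases
1,34,Wellington,5
2,,Auckland,2
3,51,Wellington,
4,29,,7
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Describe data: "surface" properties such as quantity and fields.
n_rows = len(rows)
columns = list(rows[0].keys())

# Verify data quality: count missing (empty) values per column.
missing = {col: sum(1 for r in rows if r[col] == "") for col in columns}

# Explore data: a simple query, here how customers are distributed by city.
city_counts = Counter(r["city"] for r in rows if r["city"])

print(n_rows, columns)
print(missing)
print(city_counts.most_common())
```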
Data Preparation
Also known as “data munging”: prepare the final data set(s) for modelling
A common rule of thumb is that 80% of the project effort goes into data preparation
This phase includes five tasks
• Data Selection
• Data Cleaning
• Data Construction
• Integrate data
• Format data
Data Preparation
• Data Selection: determine the data sets to be used, including the selection of features and the selection of records/rows
• Data Cleaning: often the lengthiest task; correct, impute, or remove erroneous and missing values
• Data Construction: constructive data preparation operations
⎼ feature construction, instance generation, feature transformation
• Integrate data: create new records or values by combining information from multiple data sources
⎼ merging information from different sources, aggregations
• Format data: re-format the data into a form convenient for modelling
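The five tasks above can be walked through on a toy example. This is a sketch under stated assumptions: the records, field names and the derived feature `spend_per_1k_salary` are all hypothetical, and mean imputation is just one of several possible cleaning choices. Integration is done before construction here only because the constructed feature uses the integrated total.

```python
from statistics import mean

# Hypothetical inputs: customer records and a separate list of purchases.
customers = [
    {"id": 1, "age": 34, "salary": 60000, "notes": "vip"},
    {"id": 2, "age": None, "salary": 45000, "notes": ""},
    {"id": 3, "age": 51, "salary": 82000, "notes": "new"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 10.0},
]

# Data selection: keep only the features intended for modelling.
selected = [{k: c[k] for k in ("id", "age", "salary")} for c in customers]

# Data cleaning: impute missing ages with the mean of the known ages.
mean_age = mean(c["age"] for c in selected if c["age"] is not None)
for c in selected:
    if c["age"] is None:
        c["age"] = mean_age

# Integrate data: aggregate purchases per customer and merge them in.
totals = {}
for p in purchases:
    totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]
for c in selected:
    c["total_spend"] = totals.get(c["id"], 0.0)

# Data construction: derive a new feature from existing ones.
for c in selected:
    c["spend_per_1k_salary"] = c["total_spend"] / (c["salary"] / 1000)

# Format data: convert to a plain numeric matrix, one row per customer.
feature_names = ["age", "salary", "total_spend", "spend_per_1k_salary"]
matrix = [[float(c[name]) for name in feature_names] for c in selected]

for row in matrix:
    print(row)
```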
Model Building
Build and assess various models based on several different modeling techniques
This phase is widely regarded as data science’s most exciting work, but it is often the shortest in the process
Model Building
• Select modelling technique: select the specific modelling technique, and record its assumptions
• Generate test design: generate a procedure or mechanism to test the model’s quality and validity (e.g. separate the data into training and test sets)
• Build model: run the modelling tool on the prepared dataset to create one or more models; choose parameter settings and describe the resulting models
• Assess model: interpret the models according to domain knowledge, the data mining success criteria, and the desired test design
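The simplest test design mentioned above, separating the data into training and test sets, might look like the following. This is a minimal sketch: the function name, the default test fraction of 0.25, and the fixed seed are assumptions chosen for the example, and other designs such as cross-validation are equally valid.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))  # stand-in for prepared data rows
train, test = train_test_split(records, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

The model is then built on `train` only, and `test` is held back so that the assessment reflects performance on data the model has not seen.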
Assess Model
• Evaluation of model: how well did it perform on test data?
• Methods and criteria depend on model type:
‣ e.g. confusion matrix with classification models, mean error rate with regression models
• Interpretation of model: whether this is important, and whether it is easy or hard, depends on the algorithm
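To make the two kinds of criteria concrete, the sketch below computes a confusion matrix and accuracy for a classifier, and a mean absolute error for a regression model. All labels and values are invented for illustration, and mean absolute error is used here as one possible reading of "mean error"; other error measures would work the same way.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a classifier."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference for a regression model."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test-set results for a churn classifier.
y_true = ["stay", "stay", "leave", "leave", "stay", "leave"]
y_pred = ["stay", "leave", "leave", "stay", "stay", "leave"]
cm = confusion_matrix(y_true, y_pred)
for (actual, predicted), count in sorted(cm.items()):
    print(f"actual={actual:5s} predicted={predicted:5s} count={count}")
print("accuracy:", accuracy(y_true, y_pred))

# Hypothetical test-set results for a regression model.
r_true = [3.0, 5.0, 2.0, 4.0]
r_pred = [2.5, 5.0, 3.0, 4.5]
print("mean absolute error:", mean_absolute_error(r_true, r_pred))
```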
Model Evaluation
Evaluate the results and determine which model best meets the business objectives and what to do next
This phase has three tasks:
● Evaluate results
● Review process
● Determine next steps
Model Evaluation
• Evaluate results: assess whether the model meets the business objectives
• Review process: do a more thorough review of the data mining engagement, also covering quality assurance issues
• Determine next steps: decide how to proceed depending on the results of the assessment and the process review
Deployment
The process of using new insights to make improvements, either through a formal integration of the model or by using the insights gained from data mining to make changes
This phase has four tasks:
● Plan deployment
● Plan monitoring and maintenance
● Produce final report
● Review project
This table summarizes the main inputs to the deliverables. This does not mean that only the inputs listed should be considered; for example, the business objectives should be considered for all deliverables. However, the deliverables should address the specific issues raised by their inputs.
Data Mining with Privacy
• Data contains information about real people
• Data mining looks for patterns, not people!
• Technical solutions can limit privacy invasion
⎼ replacing sensitive personal data with an anonymous ID
⎼ giving randomized outputs
⎼ multi-party computation on distributed data
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
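Two of the technical solutions listed on this slide can be sketched briefly. The code below is illustrative and makes several assumptions: a keyed hash (HMAC-SHA256) stands in for "replacing sensitive personal data with an anonymous ID", and a coin-flip randomized response scheme stands in for "randomized outputs". Neither is claimed to be the specific technique described in the cited article, and a keyed hash is strictly pseudonymization, since anyone holding the key can recompute the mapping. The function names, key and data are hypothetical.

```python
import hashlib
import hmac
import random


def pseudonymous_id(personal_value: str, secret_key: bytes) -> str:
    """Replace a sensitive identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same ID, so records can still be
    linked, but the original value is not stored in the mined data.
    """
    digest = hmac.new(secret_key, personal_value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def randomized_response(true_answer: bool, rng: random.Random) -> bool:
    """Report the truth half of the time; otherwise report a fair coin flip."""
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5


def estimate_true_rate(reports) -> float:
    """Invert the noise: P(report yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5


key = b"hypothetical-secret-key"  # assumption: stored separately from the data
print(pseudonymous_id("alice@example.com", key))

rng = random.Random(0)
truth = [i < 3000 for i in range(10000)]  # 30% of people truly answer "yes"
reports = [randomized_response(t, rng) for t in truth]
print(round(estimate_true_rate(reports), 2))
```

No individual report reveals that person's true answer with certainty, yet the overall rate can still be estimated, which fits the slide's point that data mining looks for patterns, not people.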