Facultad de Informática FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SISTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT Ernestina Menasalvas Facultad de Informática Universidad Politecnica de Madrid. Spain [email protected] November 2004


Facultad de Informática

FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SISTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT

Ernestina Menasalvas
Facultad de Informática, Universidad Politecnica de Madrid. Spain
[email protected]

November 2004

Background (I)

• 1995: doctoral student
  – Visit to the University of Regina (Prof. Ziarko)
  – Visit to Warsaw University (Prof. Pawlak)
• 1998: thesis defended. Data mining process model (Anita Wasilewska & C. Fernandez-Baizan)
• Since then:
  – Database professor: databases, data mining
  – Coordinator of the Data Mining group at Facultad de Informática, UPM
    • Techniques: Rough Sets, Bayes, …
    • Methodologies for data mining process management
  – Evaluation in data mining
  – Experimentation in Web mining
    • Web mining: Web Goal Mining

Background (II)

• Projects developed:
  – Pure research:
    • Data mining to be integrated into RDBMS
    • Web Profiler
    • Methodology for data mining process management
  – Research and application:
    • Data mining applied to different domains:
      – Car dealers
      – Travel agency
      – …

Data Mining Project Development

• Methodologies for data mining project development
  – Is data mining really a science?
  – Are we developing projects as an art?
  – Has research achieved the same results in all areas?
    • Algorithms
    • Data preparation
    • Data enrichment
    • Conceptualization of data mining problems

Data Mining: an art, a science?

• Since data mining appeared, many algorithms have been programmed
• Standards:
  – CRISP-DM
  – SEMMA
  – PMML 3.0
• The process depends on the expertise of the data miner
• The user speaks about business problems
• The data miner speaks about algorithms

Data Mining as a project

• Data mining is a data-intensive activity
  – Data understanding
  – Data preparation
• Database manager:
  – Transactional databases
  – Data warehouses
• The end result of a data mining project is a tool (software project) for a better decision-making process:
  – Software development project
• The IT department has to be involved

Project Management

• Why?
  – To organize the development process and to produce a project plan
• How?
  – Establish how the process is going to be developed:
    • Sequential
    • Incremental
• What?
  – Establish how the process is split into phases and define the tasks to be carried out in each step:
    • RUP
    • XP
    • CommonKADS

LIFECYCLE MODELS
• Way of making things
• Independent of the process being developed

METHODOLOGY
• Particular tasks
• Detail of tasks to be developed

Common pitfalls of data mining implementation

• The common pitfalls of data mining implementation are the following:
  – Not being able to communicate mining results efficiently within an organization
  – Not having the right data to conduct effective analysis
  – Not using existing data correctly
  – Not being able to evaluate results
• Questions that arise:
  – Can the adequacy of a data set for a problem be established when preparing the project plan?
  – How can the data set be used to produce the expected results?
  – How can we evaluate the results?
  – How can costs be estimated?

Data Mining Approaches

• Vendor independent:
  – CRISP-DM
• Based on commercial tools:
  – CATs
  – SEMMA
• CRM methodology:
  – CRM Catalyst

Process model
Not a real methodology
Based on CRISP-DM
Global CRM process
Does not concentrate on the data mining step

Cross-Industry Standard Process for Data Mining: CRISP-DM

Data Mining as a project: CATs

• CATs: Clementine Application Templates [CATs]
  – Specific libraries of best practices that provide immediate value right out of the box
  – They follow the CRISP-DM standard: every CAT stream is assigned to a CRISP-DM phase
  – They provide long-term value, as they can always be used with a new data set for new insight in other projects
• Available as add-on modules to Clementine; they include:
  – Telco CAT: improve retention and cross-selling efforts for telecommunications
  – CRM CAT: understand and predict customer migration between segments
  – Microarray CAT: accelerate biological discoveries, find genes
  – Fraud CAT: predict and detect instances of fraud in financial transactions, claims, tax returns, …
  – Web CAT

What is a CAT? [CATs]

SEMMA (1)

• SEMMA (Sample, Explore, Modify, Model, Assess): [SEMMA]
  – Is not a data mining methodology
  – Rather, it is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining
  – Enterprise Miner can be used as part of any iterative data mining methodology adopted by the client
  – Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project

SEMMA (2)

• SEMMA is focused on the model development aspects of data mining: [SEMMA]
  – Sample the data by extracting a portion of a large data set big enough to contain significant information, yet small enough to manipulate quickly
  – Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas
  – Modify the data by creating, selecting and transforming the variables to focus the model selection problem
  – Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. Modelling techniques include neural networks, tree classifiers, statistical models, etc.
  – Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs
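The five SEMMA steps can be illustrated with a small sketch. This is not SAS Enterprise Miner code: the function name `semma`, the toy threshold classifier and the 70/30 split are assumptions chosen only to show the order of the steps, using the Python standard library.

```python
import random
import statistics


def semma(records, sample_size, seed=0):
    """Toy walk through Sample, Explore, Modify, Model, Assess.

    `records` is a list of (value, label) pairs with a numeric value
    and a 0/1 label. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Sample: draw a manageable subset and keep part of it for assessment.
    sample = rng.sample(records, sample_size)
    cut = int(0.7 * len(sample))
    train, holdout = sample[:cut], sample[cut:]

    # Explore: summary statistics that hint at scale and spread.
    values = [v for v, _ in train]
    mean, stdev = statistics.mean(values), statistics.pstdev(values)

    # Modify: standardise the variable using the training statistics.
    def scale(v):
        return (v - mean) / stdev if stdev else 0.0

    # Model: a one-parameter threshold classifier chosen on the training part.
    def accuracy(threshold, rows):
        hits = sum((scale(v) > threshold) == bool(y) for v, y in rows)
        return hits / len(rows)

    candidates = [scale(v) for v in values]
    best = max(candidates, key=lambda t: accuracy(t, train))

    # Assess: estimate performance on data the model has not seen.
    return {"threshold": best, "holdout_accuracy": accuracy(best, holdout)}
```

For example, `semma([(v, int(v > 50)) for v in range(100)], sample_size=60)` returns the chosen threshold together with the accuracy measured on the held-out part of the sample.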

Methods for Project Management: CRM Catalyst (1)

• Developed jointly by CustomISe, MACS and SalesPathways. Together they have formed the Catalyst Foundation: http://www.crmmethodology.com/

Motivations:
• CRM projects are difficult to execute successfully because of the wide range of factors influencing their success, so it can take a long time to make CRM work properly for an organisation.
• Solution: CRM Catalyst.
• The methodology acts as a catalyst for CRM projects, enabling them to achieve their objectives more reliably and in less time.
• It gives a project life cycle with a set of defined phases broken down into steps with clearly stated inputs and outputs.

Methods for Project Management: CRM Catalyst (2)

Implementation requires
Data Mining development process
Implementation is knowledge intensive
The results are obtained in a progressive way
Progressive lifecycle model
In some steps a knowledge-intensive methodology could be appropriate

Main steps in a Data Mining Project

1. Define the goals:
  – Business and data mining experts together have to define the goals
  – Each goal must be defined with measurements for success
2. Obtain the models:
  – Apply data mining algorithms
  – Preprocessing is important
3. Evaluate results:
  – Ascertain the value of an object according to specified criteria, operationalised in terms of measures
4. Deploy:
  – Decide which patterns and models can be deployed
5. Evaluate:
  – Once the product is in operation, its results should be verified

1. Define the goals

• Distinguish between:
  – Data mining goals
  – Business goals
• How do we translate?

Business goal: Increase the lifetime value of valuable customers
Data mining goal: Classification? Estimation? Association?

It has to be solved in the Business Understanding step of CRISP-DM

Business Understanding in the CRISP-DM Process

• Determine Business Objectives
  – Background
  – Business Objectives
  – Business Success Criteria
• Assess Situation
  – Inventory & Resources
  – Requirements, Assumptions & Constraints
  – Risks & Contingencies
  – Terminology
  – Costs & Benefits
• Determine Data Mining Goals
  – Data Mining Goals
  – Data Mining Success Criteria
• Produce Project Plan
  – Project Plan
  – Initial Assessment of Tools & Techniques

1.1 Determine business objectives and success criteria

• Not only business objectives have to be established, but also measures with which to evaluate the results
• Business objectives:
  – What is the customer's primary objective?
    • Increase the number of loyal customers
    • Sell more of a certain product
    • Run a successful marketing campaign
• Business success criteria:
  – What constitutes a successful outcome of the project?
  – Objective measures so that success can be established
  – ROI

1.2 Costs & Benefits

• Perform a cost-benefit analysis
• Compute the benefits of the project
  – Which measures do we have?
  – ROI
  – APEX
  – OPEX …
• Compute the costs of the project (equipment, human resources, …)
  – Which methodology do we have?
  – COCOMO for software
• Quantify the risk that the project fails
  – Knowledge not available
  – Data not available
  – Proper tools not available
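As an illustration of the two measures named above, the sketch below computes ROI as (benefit - cost) / cost and a software effort estimate with the basic COCOMO formula, effort = a · KLOC^b person-months, using Boehm's published basic-model coefficients. The function names are assumptions, and basic COCOMO is shown only as the software-side reference point, not as a cost model for data mining projects.

```python
def roi(benefit, cost):
    """Return on investment as a fraction of cost: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


# Basic COCOMO coefficients (a, b) per project mode (Boehm, 1981).
BASIC_COCOMO = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for `kloc` thousand lines of code."""
    a, b = BASIC_COCOMO[mode]
    return a * kloc ** b
```

A project that costs 100 000 and returns 150 000 has `roi(150_000, 100_000) == 0.5`, i.e. a 50 % return.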

Data Mining Estimation Model

• Establishing a parametric estimation model for data mining (Marban’03)

DMCOMO (Data Mining COst MOdel)

Data Mining Cost Estimation

• Main factors in a data mining project
  – Data sources (number, kind, nature, …)
  – Data mining problem to be solved (descriptive, predictive, …)
  – Development platform
  – Available tools
  – Expertise of the development team
• Drivers
  – Data drivers
  – Model drivers
  – Platform drivers
  – Tools and techniques drivers
  – Project drivers
  – People drivers

1.3 Data Mining goals and success

• Data mining goals:
  – Translate the customer's primary objective into a data mining goal, e.g.
    • Loyalty program translated into a segmentation problem
    • Decreasing the attrition rate transformed into a classification problem
• Data mining success criteria:
  – Determine success in technical terms
    • Translate the notion of success into confidence, support, lift and other parameters
    • Determine the cost of errors
• How do we make the translation?
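To make the technical success criteria concrete, the following sketch computes support, confidence and lift for an association rule A → C over a list of transactions, using the standard definitions: support = P(A ∧ C), confidence = P(C | A), lift = confidence / P(C). The function name and the sample baskets are illustrative assumptions.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(antecedent <= set(t) for t in transactions)
    n_c = sum(consequent <= set(t) for t in transactions)
    n_ac = sum((antecedent | consequent) <= set(t) for t in transactions)

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence, lift = rule_measures(baskets, {"bread"}, {"milk"})
```

Here the rule bread → milk holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets containing bread (confidence 2/3). Its lift is below 1, so in this toy data bread does not make milk more likely than its base rate.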

Methodology

• Which methodology should be followed to translate business objectives into data mining objectives?
• Unfortunately, there is no such methodology. First we have to resolve:
  – How is a business objective expressed?
  – What is a data mining goal?
  – How are data mining goals achieved?
  – What are the requirements of data mining functions?

In order to describe everything in a standard way:

Conceptualize the problem

Conceptualization in other disciplines

• Databases:
  – E/R diagrams
  – Independent of the domain
  – A tool for business understanding and for the database designer
  – Translation from E/R to implementation

[Diagram: External view 1 … External view n, Conceptual Schema, Internal Schema]

Proposed three-level architecture

[Diagram: Business problem … Business problem, Conceptual Schema, Internal Schema]

• Conceptual Schema: requirements of algorithms will be solved at this level
• Internal Schema: tool requirements to be solved (SAS, WEKA, Clementine, …)

Three-layer architecture for data mining

• It is the bridge:
  – Between business goals and the final tool
  – Independent of the domain
• Provides independence:
  – Changes in the tool do not affect the solution
• It has to be decided what to model in the conceptualization
• Automatic translation of business goals into data mining goals
• Data mining goals + constraints = feasible data mining goals
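One way to read "goals + constraints = feasible goals" is as a filter: a candidate data mining goal is feasible only if everything it requires is actually available. The sketch below is a hypothetical illustration of that reading; the `MiningGoal` class, the requirement labels and `feasible_goals` are assumptions and are not part of the proposed conceptual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningGoal:
    name: str                           # e.g. "classification"
    requires: frozenset = frozenset()   # hypothetical requirement labels


def feasible_goals(goals, satisfied):
    """Keep the goals whose requirements are all among the satisfied ones."""
    satisfied = set(satisfied)
    return [g for g in goals if g.requires <= satisfied]


candidates = [
    MiningGoal("segmentation", frozenset({"customer_attributes"})),
    MiningGoal("classification", frozenset({"customer_attributes", "labelled_outcome"})),
]
# Only customer attributes are available, so classification is not feasible.
feasible = feasible_goals(candidates, {"customer_attributes"})
```

Keeping requirements as plain labels makes the check independent of any particular tool, which matches the independence the conceptual layer is meant to provide.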

Elements to conceptualize

• Elements to be taken into account:
  – Data:
    • Quality from the data mining point of view
    • Adequacy for the problem
    • Classification for data mining purposes
  – Knowledge:
    • Related to the process being analyzed
    • Related to the data used
  – People:
    • Owners of data
    • Experts in the process
  – Requirements of data mining problems
  – Requirements of data mining methods


Proposed process

DMMO

• Data Mining Modelling Objects:
  – Data
  – Knowledge
  – Constraints of data and applications
  – Data mining objects
    • Algorithms
    • Measures
    • Methods
• To bridge the gap between data miners and business users

Are data adequate for analysis?

• The adequacy of the data is analyzed taking into account the goals to fulfil.
• Data, together with the knowledge extracted from the experts, can be transformed so that, when used as input to a certain data mining algorithm, they produce the required patterns.
• Quality of the data, in this context:
  – is not only related to technical quality (proper model, percentage of null values),
  – but also has to do with:
    • the meaning of the attributes,
    • where each piece of data comes from,
    • the relationships among data, and
    • finally, how the data fulfil the requirements of the data mining functions
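The technical side of data quality mentioned above, the percentage of null values, is easy to measure. A minimal sketch, assuming records are held as Python dictionaries and a missing or `None` entry counts as null (the function name and sample rows are illustrative):

```python
def null_percentage(rows):
    """Percentage of missing (absent or None) values per attribute."""
    attributes = {key for row in rows for key in row}
    total = len(rows)
    return {
        attr: 100.0 * sum(row.get(attr) is None for row in rows) / total
        for attr in sorted(attributes)
    }


customers = [
    {"age": 30, "city": None},
    {"age": None, "city": "Madrid"},
    {"age": 41, "city": "Regina"},
    {"age": 25},  # "city" is absent here and is counted as null
]
report = null_percentage(customers)
```

The semantic side of quality (meaning, provenance, relationships, fit to the data mining function) cannot be computed this way and still needs the experts' knowledge.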

2. Data Mining: obtain models

• Apply the data mining process model
• Associated problems solved by the three-layer architecture:
  – Comparison of approaches
  – Evaluation of costs
  – Pros and cons of approaches
• Only experience or a conceptualization can help
• The conceptual model will help to establish the process to obtain each feasible model
• Requirements and transformations are implicit in the model

2.1 Determine type of problem

– What are data mining problems?
  • Classification
  • Estimation
  • Association
  • Segmentation
– In the conceptual model, requirements for each type will be established

2.2 Apply CRISP-DM process model

– The data mining problem has to be settled before going into the modeling step
– Requirements will be established in Business Understanding
– Requirements will be checked in Data Understanding and Data Preparation
– Preparation will be guided by the conceptual model
– Feasibility can be evaluated before applying the model

[Diagram: CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]

3. Evaluate results

[Spiliopoulou, Berendt]
• Evaluation: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures.
  – Object = model already obtained
  – Criteria and measures have to do with the goals
• Evaluation requires a well-defined notion of success, which must be in place before
  – the evaluation takes place
  – the data mining phase starts
  – any work with the data starts
• i.e. already during the business understanding process.
• Here once again conceptualization plays its role
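A sketch of what "criteria operationalised in terms of measures" can look like in practice: thresholds are written down during business understanding and the obtained model is later checked against them. The measure names, thresholds and the `evaluate` function are hypothetical.

```python
def evaluate(measured, criteria):
    """Compare measured values with thresholds agreed before any data work.

    `criteria` maps a measure name to (direction, threshold), where the
    direction is "min" (value must be at least the threshold) or "max"
    (value must be at most the threshold).
    """
    report = {}
    for name, (direction, threshold) in criteria.items():
        value = measured[name]
        report[name] = value >= threshold if direction == "min" else value <= threshold
    return report


# Fixed at the start of the project (illustrative values).
criteria = {"lift": ("min", 1.5), "error_cost": ("max", 1000)}
# Obtained after modeling (illustrative values).
measured = {"lift": 1.8, "error_cost": 1200}
outcome = evaluate(measured, criteria)
```

In this example the model meets the lift criterion but exceeds the agreed cost of errors, so it would not pass evaluation as a whole.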

Evaluation in the CRISP-DM Process

• The CRISP-DM process is
  – a never-ending cycle of iterations
  – a non-sequential process, where backtracking to previous phases is usually necessary
• In each sequential instantiation, evaluation takes place:
• But it is a cycle
• In all the iterations all the steps should be revisited
• Results have to be evaluated!

[Diagram: CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]

4. Deployment

• All the models that have a positive evaluation can be deployed
• For the measurements of success to be trusted, deployment has to follow the rules established at the beginning of the project
  – The real evaluation has not yet been performed

5. Evaluate after deployment

• After deployment there is a need to prove that the improvements are really due to the actions taken after a data mining discovery and not to any other factor or action carried out in the company
• None of the obvious claims about the success of data mining has ever been systematically tested.
• Experiments are crucial to establish whether the impact of the deployment is really positive or negative
• Experiments have to be designed at the beginning of the project
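One common way to design such an experiment is to compare a group exposed to the deployed model with a control group that is not, and to test whether the difference in success rates could be due to chance. The sketch below uses a standard two-proportion z-test; the choice of this test, the function name and the sample figures are assumptions for illustration only.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two success rates.

    Raises ZeroDivisionError if the pooled rate is exactly 0 or 1.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative figures: 120 of 1000 treated customers responded,
# against 90 of 1000 customers in the control group.
z, p_value = two_proportion_z(120, 1000, 90, 1000)
```

A small p-value suggests the improvement is unlikely to be chance alone, but only if the two groups were assigned before deployment, which is why the experiment has to be planned at the start of the project.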

Conclusions

• Data mining projects are being developed more as an art than as a science
• Many algorithms have been implemented, but no systematic proof that one is better than another in a real case is carried out after deployment
• A conceptual model is required:
  – To map business goals to the model
  – To map data mining algorithms to a conceptual model
• Achievements of the model:
  – Will be used throughout the process to guide the project
  – Evaluation tool

Future work

• Conceptual model
  – Define DMMO objects
• Evaluation techniques related to the model:
  – Evaluate data mining goals
  – Evaluate business goals
• Experimentation methods:
  – obtrusive and
  – non-obtrusive

References

• Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas: Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy, 20th September 2004.
• Menasalvas, Millán, Gonzalez-Aranda, Segovia: Towards a Methodology for Data Mining Project Development: The Importance of Abstraction.
• Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme: Web Mining: From Web to Semantic Web, First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Springer 2004.
• Myra Spiliopoulou, Carsten Pohle: Modelling and Incorporating Background Knowledge in the Web Mining Process. Pattern Detection and Discovery 2002: 154-169.
• www.crisp-dm.org
• www.spss.com/clementine/cats.htm
• www.sas.com/technologies/analytics/datamining/miner/semma.html
• www.crmmethodology.com
• www.emetrics.org/articles/whitepaper.html


Facultad de Informática

THANKS