Chapter 3 Data Mining Methodology and Best Practices




Data Mining’s Virtuous Cycle

1. Identify the business opportunity*
2. Mine the data to transform it into actionable information
3. Act on the information
4. Measure the results

* The textbook uses “problem” and “opportunity” interchangeably.


It’s time to…

• Turn our attention to translating business opportunities (problems) into data mining opportunities (problems), including:
  – Transforming data into information via:
    • Hypothesis testing
    • Profiling
    • Predictive modeling
  – Taking action
    • Model deployment
    • Scoring
  – Measurement
    • Assessing a model’s stability & effectiveness before it is used


DM General Guidelines

• The DM virtuous cycle (4 steps) is iterative
• No steps should be skipped
• Common sense prevails with respect to how rigorously each step is carried out
• Simplest approach: ad-hoc queries to test hypotheses
• Rigorous approach: the 4 steps of the virtuous cycle expand to become an 11-step methodology


Why have a Methodology?

• A DM methodology that includes DM best practices helps to avoid:
  – Learning things that are not true
  – Learning things that are true, but not useful
• Of the two, learning things that are not true is the more dangerous.

Why is that? …


Learning Things that are not True

• Patterns may not represent any underlying rule

• Sample may not reflect its parent population, which introduces bias

• Data may be at the wrong level of detail (granularity; aggregation)

Examples?
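One possible illustration of the first bullet, offered as a sketch and not as an example from the textbook: the code below searches many purely random yes/no variables for the one that best "predicts" a purely random outcome. Every name and size here (`best_chance_match`, 20 rows, 500 candidate variables) is an assumption made up for this illustration.

```python
import random


def agreement(xs, ys):
    """Fraction of positions where two equal-length binary lists agree."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


def best_chance_match(n_rows=20, n_vars=500, seed=0):
    """Return the best agreement found between a random outcome and
    n_vars random candidate variables. Nothing here is related to
    anything else, so any strong match is luck, not an underlying rule."""
    rng = random.Random(seed)
    outcome = [rng.randint(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.randint(0, 1) for _ in range(n_rows)]
        best = max(best, agreement(candidate, outcome))
    return best


if __name__ == "__main__":
    # With few rows and many candidates, the winner usually agrees with
    # the outcome far more often than the 50% expected of a coin flip.
    print(f"best agreement found by chance: {best_chance_match():.0%}")
```

The more variables searched relative to the number of rows, the more impressive the best chance match looks, which is one reason a pattern needs to be checked on data that was not used to find it.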


Learning Things that are True, but not Useful

• Learning things that are already known

Examples?

• Learning things that cannot be used

Examples?


Hypothesis Testing

• A hypothesis is a proposed explanation whose validity can be tested by analyzing data

• Purpose is to validate or invalidate preconceived ideas
• Usually included in all DM projects
• Data collection done via:
  – Observation
  – Experiment (lab, survey)
• Bias must be avoided; doing so usually requires both analytical and business knowledge
• Hypothesis testing is useful, but often insufficient, which leads us to…
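Before moving on, a minimal sketch of checking one hypothesis against data. It assumes a hypothetical question ("do customers who received an offer respond at a different rate from those who did not?") and uses a standard two-proportion z-test, which the slides do not prescribe. The counts are invented, and the function name `two_proportion_z_test` is illustrative.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the equality of two proportions.

    Uses the pooled standard error. Assumes both groups are reasonably
    large and that the pooled rate is strictly between 0 and 1.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


if __name__ == "__main__":
    # Invented counts: responders out of customers in each group.
    z, p = two_proportion_z_test(120, 1000, 90, 1000)
    print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value only says the difference is unlikely to be chance. It does not rule out bias in how the two groups were chosen, which is why the slide stresses business knowledge alongside the analysis.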


Models

• Model: An explanation or description of how something works that reflects reality well enough that it can be used to make inferences about the real world.

• We use models every day…Examples?

• The data used to build DM models is called the Model Set
• The new data to which a finished model is applied is called the Score Set
• Model Set includes:
  – Training Set – used to build a set of DM models
  – Validation Set – used to choose best DM model
  – Test Set – used to determine how the model performs

• Models – 3 kinds of DM models for 3 kinds of tasks…next slide
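The three-way partition of the model set described above can be sketched as follows. This is a minimal illustration using only the standard library; the 60/20/20 proportions, the fixed seed, and the function name `split_model_set` are assumptions, not values given in the slides.

```python
import random


def split_model_set(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and partition them into training, validation and
    test sets. Whatever is left after the first two goes to the test set."""
    if not 0 < train_frac + valid_frac < 1:
        raise ValueError("train_frac + valid_frac must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


if __name__ == "__main__":
    model_set = list(range(100))  # stand-in for 100 customer records
    training, validation, test = split_model_set(model_set)
    print(len(training), len(validation), len(test))
```

Each record lands in exactly one of the three sets, so the test set stays untouched until the final performance estimate. The score set is not part of this split at all, since it arrives later as new data.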


Profiling and Prediction

• Profiling
  – Describes what is in the data
  – Demographic variables
  – Inability to distinguish cause and effect (e.g., beer drinkers and males)
  – Focus is on the past to explain it (timing = past)
• Prediction
  – Finding patterns in data from prior period(s) that are capable of explaining or anticipating outcomes in a later period (timing = future)
  – Predictive models require separation in time between the model inputs and output.
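The separation in time between inputs and output can be sketched as below. The example assumes a hypothetical event log of `(customer_id, event_date, kind)` tuples and a churn target. The cutoff date, the field names, and the function `build_model_rows` are all invented for illustration.

```python
from datetime import date


def build_model_rows(events, cutoff, horizon_end):
    """Build (customer_id, purchases_before_cutoff, churned_after_cutoff) rows.

    Inputs come only from events strictly before the cutoff.
    The target comes only from the window [cutoff, horizon_end).
    """
    inputs = {}
    target = {}
    for customer_id, event_date, kind in events:
        if kind == "purchase" and event_date < cutoff:
            inputs[customer_id] = inputs.get(customer_id, 0) + 1
        elif kind == "churn" and cutoff <= event_date < horizon_end:
            target[customer_id] = 1
    # Keep only customers with input history before the cutoff.
    return [(cid, n, target.get(cid, 0)) for cid, n in sorted(inputs.items())]


# Invented event log for two customers.
events = [
    ("a", date(2023, 1, 10), "purchase"),
    ("a", date(2023, 2, 5), "purchase"),
    ("a", date(2023, 4, 20), "churn"),
    ("b", date(2023, 3, 1), "purchase"),
    ("b", date(2023, 4, 2), "purchase"),  # after the cutoff: not an input
]

rows = build_model_rows(events, cutoff=date(2023, 4, 1),
                        horizon_end=date(2023, 7, 1))
print(rows)
```

Counting customer b's April purchase as an input would leak information from the outcome period into the model, which tends to make it look better on historical data than it will be when deployed.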


Data Mining Methodology

1. Translate business opportunity (problem) into DM opportunity (problem)

2. Select appropriate data

3. Get to know the data

4. Create a model set

5. Fix problems with the data

6. Transform data to bring information to the surface

7. Build models

8. Assess models

9. Deploy models

10. Assess results

11. Begin again


In-Class Exercise

• 10 teams
• Each team takes one of methodology steps 1–10 (step 11 is skipped)
• Discuss it and prepare a summary of 5 minutes or less for your colleagues
• Each team presents its summary

Discussion: 15 minutes
Presentations: 45 minutes


End of Chapter 3