Strata: 9 laws of Data Mining

My 9 Laws of Data Mining presentation from Strata Santa Clara 2013-02-26

Advanced Analytics

Duncan Ross, @duncan3ross, duncan.ross@teradata.com

Based on the 9 Laws of Data Mining by Tom Khabaza

THE NINE LAWS OF DATA MINING

04/08/2023 @duncan3ross

• The last two algorithms you need to know!
• An explanation of Bayes’ theorem
• The name of the software that will make you $ millions
  > Not even a comparison of different software!

What you won’t get from this presentation

The grave of Thomas Bayes (probably) – near “silicon roundabout”
Image via Wikimedia

Data Mining laws also work as Data Science laws

THE 0TH LAW

• This question generates more arguments than answers

• Common features
  > Predicting or classifying things
  > Based on historical cases (with or without outcomes)
  > Machine learning techniques
  > No predefined underlying model assumed

What is data mining?

Image via Wikimedia


What, where, why and how of data mining

[Diagram: the questions What?, Where?, Who?, Why? and How?, with the labels 9 Laws, CRISP-DM and Unified data architecture (beside Where?)]

CRISP-DM created to help

Prediction increases information locally by generalisation

THE 7TH LAW

• Data mining learns from generalisations
  > Historical cases build a model of reality

• These general models then predict an outcome that is local to a case and a time
  > How likely is it that someone will purchase product ‘x’?
  > Will person a influence person b?
  > What number will the ball land on in roulette?

• The knowledge gained may have been implied in the data, but it is new and valuable

This may seem obvious

• Results need to be thought of at a group level for assessment
  > Individual results may be poor even when generated from a great model

• Two levels of value
  > Prediction (what, when etc…)
  > Model (how…)

• The gap between the general and the local is the difference between model building and scoring
  > Hadoop?
  > R?

Why the 7th Law is important
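The gap between model building and scoring can be sketched in a few lines. This is only an illustration under invented assumptions: the segments, purchase rates and function names are hypothetical, the historical cases are simulated, and the "model" is nothing more than a purchase rate per segment.

```python
import random


def build_model(history):
    """Model building: generalise from many historical cases.

    history is a list of (segment, purchased) pairs; the 'model' is simply
    the observed purchase rate for each segment.
    """
    totals, buyers = {}, {}
    for segment, purchased in history:
        totals[segment] = totals.get(segment, 0) + 1
        buyers[segment] = buyers.get(segment, 0) + int(purchased)
    return {seg: buyers[seg] / totals[seg] for seg in totals}


def score(model, segment):
    """Scoring: apply the general model to one local case."""
    return model[segment]


rng = random.Random(42)
true_rate = {"A": 0.30, "B": 0.05}  # hypothetical segments and rates

# Simulated historical cases with known outcomes.
history = [(seg, rng.random() < true_rate[seg])
           for seg in ("A", "B") for _ in range(5000)]
model = build_model(history)

# Local prediction for a single new customer in segment A.
p_single = score(model, "A")

# Group-level check: compare the prediction with what 2000 new
# segment-A customers actually do.
new_outcomes = [rng.random() < true_rate["A"] for _ in range(2000)]
observed_rate = sum(new_outcomes) / len(new_outcomes)

print(f"predicted for one customer: {p_single:.3f}")
print(f"observed across the group:  {observed_rate:.3f}")
```

A score of roughly 0.3 is a poor guess about any one customer, who either buys or does not, yet across the group the predicted and observed rates agree closely. That is the sense in which results need to be assessed at a group level.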

There are always patterns

THE 5TH LAW

… is taking the 5th Law to heart

• A major difference between the approach of data mining and data science is in the “Field of Dreams”
  > Data mining (usually) requires measurable ROI prior to projects
  > Data science is trading on probable ROI prior to projects

• Fortunately there is still a lot of gold in those hills
  > And as technologies and data increase the number of hills is also increasing

The heart of data science…

Graph of hills vs gold extracted

• Just because there are always patterns doesn’t mean that they are useful
  > Algorithms can (and will) cluster a cloud
  > Without Laws 1 and 2 patterns may not be a good thing

But…
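"Algorithms can (and will) cluster a cloud" is easy to demonstrate. The sketch below assumes a plain k-means written from scratch (not any particular product's implementation) and a cloud of uniformly random points with no structure at all. The algorithm still hands back three tidy clusters.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Very small k-means for 2-D points; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2
                + (y - centroids[c][1]) ** 2,
            )
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [points[i] for i in range(len(points))
                       if assignment[i] == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignment, centroids


# A structureless cloud: 600 points drawn uniformly from the unit square.
rng = random.Random(1)
cloud = [(rng.random(), rng.random()) for _ in range(600)]

assignment, centroids = kmeans(cloud, k=3)
sizes = [assignment.count(c) for c in range(3)]
print("cluster sizes:", sizes)
```

Every point ends up in a cluster even though nothing in the data suggests three groups. Whether such a segmentation means anything is a question for Laws 1 and 2, not for the algorithm.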

Business objectives are the origin of every data mining solution

THE 1ST LAW

Business knowledge is central to every step of the data mining process

THE 2ND LAW

• This story begins with a gains curve…

The sad tale of churn


• To predict churn

• What was the definition of churn?

• What did the business actually want to do?
  > Predict “churn”?
  > Predict people who became inactive?
  > Predict people who became inactive who might not if contacted?

What was the business objective?
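How much the definition matters can be shown with a toy example. Everything here is hypothetical: the customer records, the field names and the 60-day inactivity threshold are invented for illustration and do not come from the case in the talk.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    cancelled: bool      # contract formally terminated
    days_inactive: int   # days since the last recorded activity


def churn_by_cancellation(customer):
    """Definition 1: churn means the account was cancelled."""
    return customer.cancelled


def churn_by_inactivity(customer, threshold_days=60):
    """Definition 2: churn means no activity for threshold_days or more."""
    return customer.days_inactive >= threshold_days


customers = [
    Customer(1, cancelled=True, days_inactive=90),
    Customer(2, cancelled=False, days_inactive=120),  # silent leaver
    Customer(3, cancelled=False, days_inactive=3),
    Customer(4, cancelled=True, days_inactive=10),    # cancelled while active
]

by_cancellation = {c.customer_id for c in customers
                   if churn_by_cancellation(c)}
by_inactivity = {c.customer_id for c in customers
                 if churn_by_inactivity(c)}

# Customers that the two definitions label differently.
disagreement = by_cancellation ^ by_inactivity
print("cancelled:", sorted(by_cancellation))
print("inactive: ", sorted(by_inactivity))
print("disagree: ", sorted(disagreement))
```

The two definitions disagree on half of this tiny population, so a model trained on one target would be answering a different business question from a model trained on the other.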

• Because we aren’t doing this for the fun of it
  > Or at least not just for the fun of it

• At every stage ask:
  > Does this relate to the business question?
  > Is the original business question still valid?
  > Is there a better question that could be asked of this data?
  > Can this be acted on?
  > What does this actually mean?

• Document the answers, and refer back to them

Why the 1st and 2nd Laws are important

There is no free lunch for the data miner

THE 4TH LAW

• Is….

• I spent a lot of time on this in the 1990s
  > Neural nets
  > Regression
  > Decision trees

• If you know in advance what technique you need to use, the problem has already been solved

The last algorithm you will need to learn

The case that worked... then didn’t

Campaign Topic: Identify fingerprint of churners

Description: SNA (social network analysis) offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay.

[Diagram: Current Approach vs New Approach; labels: Active, Inactive, Churn detected]

• Solutions are not generally reproducible
  > It may work here, but not there

• Methodologies are reproducible

• Learnings may have value

• Time will invalidate even the best models

Why the 4th Law is important

Data preparation is more than half of every data mining process

THE 3RD LAW

Data preparation through a case…

The problems of text data

Data quality raises its head…

-- Match up to five rows of any kind followed, in the ORDER BY sequence,
-- by a REBOOT row, then count each distinct path.
CREATE dimension table wrk.npath_reboot_5events
AS SELECT path, COUNT(*) AS path_count
FROM nPath (
    ON wrk.w_event_f
    PARTITION BY srv_id
    ORDER BY evt_ts desc
    MODE (NONOVERLAPPING)
    PATTERN ('X{0,5}.reboot')
    SYMBOLS (true as X, evt_name = 'REBOOT' AS reboot)
    RESULT (
        FIRST (srv_id OF X) AS srv_id,
        ACCUMULATE (evt_name OF ANY (X, reboot)) AS path
    )
)
GROUP BY 1;

-- Draw the selected paths as a Sankey diagram.
SELECT * FROM GraphGen (
    ON (SELECT * from wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
    PARTITION BY 1
    ORDER BY path_count desc
    item_format('npath')
    item1_col('path')
    score_col('path_count')
    output_format('sankey')
    justify('right')
);

Note the number of paths with a reboot following another reboot!

What events lead up to a reboot?
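For readers without access to nPath, the question in the slide title can be approximated in plain Python. This is a simplified sketch rather than a reimplementation of the query: it assumes events are read in chronological order, it closes a path at every REBOOT (whereas the X symbol in the query can match any row), and every event name other than REBOOT is invented.

```python
from collections import Counter, defaultdict


def paths_before_reboot(events, max_prior=5):
    """Count the event paths that end in a REBOOT.

    events: iterable of (srv_id, evt_ts, evt_name) tuples.
    For each server the events are sorted by timestamp; every REBOOT closes
    a path made of at most max_prior events seen since the previous REBOOT.
    """
    by_server = defaultdict(list)
    for srv_id, evt_ts, evt_name in events:
        by_server[srv_id].append((evt_ts, evt_name))

    counts = Counter()
    for rows in by_server.values():
        rows.sort()
        window = []  # events seen since the last reboot
        for _, name in rows:
            if name == "REBOOT":
                path = tuple(window[-max_prior:] + [name])
                counts[path] += 1
                window = []  # paths do not overlap
            else:
                window.append(name)
    return counts


# Hypothetical events: SYNC_LOSS and CRC_ERROR are made-up names.
events = [
    (1, 1, "SYNC_LOSS"), (1, 2, "CRC_ERROR"), (1, 3, "REBOOT"),
    (1, 4, "REBOOT"),
    (2, 1, "SYNC_LOSS"), (2, 2, "CRC_ERROR"), (2, 3, "REBOOT"),
]

for path, n in paths_before_reboot(events).most_common():
    print(n, " > ".join(path))
```

In this simplification a reboot that directly follows another reboot shows up as a path containing REBOOT alone, which is one way to surface the suspicious pattern noted on the slide.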

There looks to be an issue with the data from 30th September onwards: the reboot data for October seems to have been aggregated and added to 30th September.

More data issues
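An issue like this one is often visible in simple daily counts before any modelling starts. The following sketch uses invented numbers, and the rule it applies (flag days with no events, or with at least five times the median) is an arbitrary assumption, not the check used in the talk.

```python
from statistics import median


def flag_suspect_days(daily_counts, factor=5.0):
    """Flag days with no events, or with at least factor x the median count."""
    typical = median(daily_counts.values())
    flagged = {}
    for day, count in sorted(daily_counts.items()):
        if count == 0:
            flagged[day] = "no events"
        elif typical > 0 and count >= factor * typical:
            flagged[day] = "spike"
    return flagged


# Hypothetical daily reboot counts, keyed by month-day.
daily_reboots = {
    "09-27": 100, "09-28": 110, "09-29": 95,
    "09-30": 3000,                      # a whole month piled onto one day?
    "10-01": 0, "10-02": 0, "10-03": 0,
}

for day, reason in flag_suspect_days(daily_reboots).items():
    print(day, reason)
```

A spike immediately followed by empty days is the signature of records that were aggregated onto a single date, which is the pattern described above.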

• Duncan’s theorem
  > The usefulness of a variable in a model is inversely related to the amount of time you spend creating it

• Edouard’s corollary
  > If it turns out to be useful you could have created it in the time indicated by Duncan’s theorem

Data preparation is tough


• Data just got noisier and less consistent

• Maintaining an analytical data dictionary just moved from vital to really really vital

Welcome to the world of big data

• Because data prep is such a huge task you need to plan for it well
  > Assume that you will need to do it at least twice
    – Experimentation
    – Model building
    – Deployment

• Look for software that makes it easy
  > And repeatable
  > And documentable
    – Scripts ≠ documentation

• Documentation of your data is even more important than documentation of your models
  > Models can be very sensitive to data inputs

Why the 3rd Law is important
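One way to keep data documentation from drifting away from the data is to make the dictionary machine-checkable. The sketch below is an assumption about how that might look: the field names are borrowed from the event table in the earlier query, while the descriptions, the types and the `check_row` helper are hypothetical.

```python
# A tiny analytical data dictionary: one entry per documented field.
DATA_DICTIONARY = {
    "srv_id": {"type": int, "required": True,
               "meaning": "service identifier"},
    "evt_ts": {"type": str, "required": True,
               "meaning": "event timestamp as text"},
    "evt_name": {"type": str, "required": True,
                 "meaning": "event type, for example REBOOT"},
}


def check_row(row, dictionary):
    """Return a list of ways in which row departs from the dictionary."""
    problems = []
    for field, spec in dictionary.items():
        value = row.get(field)
        if value is None:
            if spec["required"]:
                problems.append(f"{field}: missing")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(
                f"{field}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    for field in row:
        if field not in dictionary:
            problems.append(f"{field}: not documented")
    return problems


good_row = {"srv_id": 1, "evt_ts": "2000-01-01T00:00:00", "evt_name": "REBOOT"}
bad_row = {"srv_id": "1", "evt_name": "REBOOT", "dslam": "x"}

print(check_row(good_row, DATA_DICTIONARY))
print(check_row(bad_row, DATA_DICTIONARY))
```

Running the same check in experimentation, model building and deployment is one way to make data preparation repeatable as well as documented.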

Data mining amplifies perception in the business domain

THE 6TH LAW

srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
20785675  lgp44-2        2        248  MZL             2              15
22254516  ltc56-1        4        314  BOT            10              15
21059184  bch66-1        2        184  RIV            15              15
21149846  tsm83-1        2        308  LCR             3              13
20833837  did75-4       10        216  DID            23              13
22295785  gbw68-1       36        170  HRS             1              12
21807750  gmo34-1        2        117  BER            17              12
21374927  bgl93-1        2        246  G5Y             8              12
20291116  ien11-1        2        211  ALZ             2              12
21459244  pai34-1        4        210  M7C             3              11
21027647  bel60-1        4        223  TRO            10              11
20551629  pla13-1       10        332  BED             4              11
20633112  crj95-2        2        332  G5Y             8              11
20585199  bau06-1       46        349  BLA            21              10
21477790  cvl92-1        4        180  IMS            35              10
21292874  che78-1        2        163  PIT             2              10

Look for patterns in Network Infrastructure

• Too many end customers to visualise as a graph but network has a hierarchy
  > Internet Gateway Area Hub Customer Router

• Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view

Size of Node = number of customers
Width of Edge = number of errors

SELECT * FROM graphgen (
    ON (SELECT DISTINCT dmt_act_dslam, nra_id,
               nbr_of_srvid, errorspersrv, nbr_of_dslam
        FROM wrk.srvid_dslam_err)
    PARTITION BY 1
    ORDER BY errorspersrv
    item_format('cfilter')
    item1_col('dmt_act_dslam')
    item2_col('nra_id')
    score_col('errorspersrv')
    cnt1_col('nbr_of_srvid')
    cnt2_col('nbr_of_dslam')
    output_format('sigma')
    directed('false')
    width_max(10)
    width_min(1)
    nodesize_max(3)
    nodesize_min(1)
);

Visualise as a Graph using Aster GraphGen

Zoom in on area where the edge width/colour indicates a problem

Add churn information

• Add churn information to find customers connected to this Hub that have cancelled their accounts

Synch Issues by Hub Type

Error and Complaint rates by equipment type

• We don’t exist in a vacuum
  > We need to sell the results of analysis

• This is a virtuous feedback loop

Why the 6th Law is important

The value of data mining results is not determined by the accuracy or stability of predictive models

THE 8TH LAW

• Or if it’s right 1 time in 35?

If your model is 98% accurate – so what?
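A short calculation shows why a headline accuracy figure says little on its own. The numbers below are invented: assume 2% of 1,000 customers churn and the "model" simply predicts that nobody does.

```python
def accuracy(predictions, actuals):
    """Share of cases where the prediction matches the outcome."""
    hits = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return hits / len(actuals)


def recall(predictions, actuals):
    """Share of the actual positives that the model found."""
    found = sum(1 for p, a in zip(predictions, actuals) if a and p)
    return found / sum(1 for a in actuals if a)


# Hypothetical population: 20 churners among 1,000 customers.
actuals = [True] * 20 + [False] * 980

# A 'model' that never predicts churn.
always_no = [False] * len(actuals)

model_accuracy = accuracy(always_no, actuals)
model_recall = recall(always_no, actuals)
print(f"accuracy: {model_accuracy:.0%}, churners found: {model_recall:.0%}")
```

The do-nothing model is 98% accurate and finds no churners at all, so the accuracy figure alone cannot be what makes a result valuable.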

• Type I and Type II errors
  > What is the cost (opportunity and actual) of a false positive?
  > What is the cost of a false negative?

• Gains curves
  > But beware the over-accurate curve

• Don’t forget the user
  > Decision trees fight back

How can you evaluate models?
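Both ideas, error costs and gains curves, fit in a short sketch. The scores, outcomes and costs below are invented for illustration (a missed churner is assumed to cost twenty times as much as an unnecessary contact), and the function names are hypothetical.

```python
def accuracy(predictions, actuals):
    """Share of cases where the prediction matches the outcome."""
    return sum(1 for p, a in zip(predictions, actuals) if p == a) / len(actuals)


def expected_cost(predictions, actuals, cost_fp, cost_fn):
    """Total cost of the errors, pricing the two error types separately."""
    false_pos = sum(1 for p, a in zip(predictions, actuals) if p and not a)
    false_neg = sum(1 for p, a in zip(predictions, actuals) if a and not p)
    return false_pos * cost_fp + false_neg * cost_fn


def gains_curve(scores, actuals, steps=5):
    """Share of all positives captured in the top-scored fraction of cases."""
    ranked = [a for _, a in sorted(zip(scores, actuals),
                                   key=lambda pair: pair[0], reverse=True)]
    total_positives = sum(ranked)
    curve = []
    for i in range(1, steps + 1):
        cutoff = round(len(ranked) * i / steps)
        curve.append((i / steps, sum(ranked[:cutoff]) / total_positives))
    return curve


# Hypothetical model scores and true outcomes for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [True, True, False, True, False, False, False, False, False, False]

cautious = [s >= 0.75 for s in scores]  # contact only the top scores
generous = [s >= 0.5 for s in scores]   # contact more customers

for name, preds in (("cautious", cautious), ("generous", generous)):
    print(name, accuracy(preds, actuals),
          expected_cost(preds, actuals, cost_fp=1, cost_fn=20))

print(gains_curve(scores, actuals))
```

With these assumed costs the more accurate threshold is the more expensive one, because its single miss is a false negative. The gains curve reports how many of the positives sit in the top-ranked slice of customers.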

All patterns are subject to change

THE 9TH LAW

0 Listen to data miners…
7 Data mining brings new knowledge
5 And there will always be new knowledge
1 Start with the business
2 Keep going back to the business
4 It won’t get easier with time
3 Especially given the state your data is in
6 But you will improve business results
8 As long as you look for the right outputs
9 Goto 0

SUMMARY


• http://khabaza.codimension.net/index_files/9laws.htm

• The Society of Data Miners (coming soon)
  > Available on LinkedIn

• CRISP-DM

RESOURCES
