
Virtual Site Event


Predictive Analytics:

What Managers Need to Know

Presented by:

Paul Arnest, MS, MBA, PMP

February 11, 2015

Virtual Site Ground Rules

The PMI Code of Conduct applies to this virtual presentation.

The virtual attendees are expected to:

Participate for a minimum of 40 minutes. Login information will be verified.

Answer the question pertaining to the presentation correctly in the survey in order to obtain the PDU credit (1).

Respond to the survey within 48 hours of participation (by Friday, February 13, 2015) in order to obtain the PDU credit.

http://www.pmi.org/About-Us/Ethics/~/media/PDF/Ethics/ap_pmicodeofethics.ashx


A NEW ENVIRONMENT



Definition

Predictive analytics: techniques that quantify potential outcomes or events based on past data

Not descriptive analysis and descriptive statistics

Not techniques that enable end users to perform individual data discovery or to customize reports

Convergence

Once restricted to specialized statistics organizations, advanced modeling techniques are moving into the IT mainstream

[Figure labels: Stat/Analytics Shop, IT]

Concepts/Buzzwords

Machine learning: supervised learning, unsupervised learning

Response variable, also called target variable, dependent variable, or left-hand-side variable

Explanatory variable, also called independent variable or right-hand-side variable

Logistic regression, random forest, etc.

Sensitivity, specificity

Tool independence

Predictive techniques use mathematical algorithms that are independent of particular tools

SAS, R, Stata, SPSS, many more

Use specialized tools for model development

It is possible to implement models using general software tools, e.g., Java, .NET

Don’t be intimidated

Your stat/analysis package is programmed to do the heavy math

You’ll discover that most internal stat shops are using a small set of models and techniques over and over again

Most of the work:

Understanding what you want to accomplish

Understanding the data

Organizing the data

Understand the results

Predictive analytics produces a probability of a characteristic or behavior based on a detailed analysis of past characteristics or behaviors

A probability ≤ 100% is not certainty

Model accuracy depends on the similarity of past conditions to present conditions

HOW IT WORKS AND WHAT TO EXPECT

Logistic regression

Workhorse procedure for predictive analytics

Supervised technique

Step 1

Identify a known population that exhibits the characteristic you want to predict (the ‘dependent’, ‘target’ or ‘response’ variable), plus a known population that does not

You may take the whole population (‘big data’) or a sample

Use 80% or 90% of the sample as the training data set

Withhold the remainder for validation
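The training/validation split described in Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's procedure: the function name `split_training_validation`, the fixed seed, and the list of 100 claim IDs are all assumptions made for the example.

```python
import random

def split_training_validation(records, training_fraction=0.8, seed=42):
    """Shuffle a copy of the records, then split into training and validation sets."""
    if not 0.0 < training_fraction < 1.0:
        raise ValueError("training_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split repeatable for model re-estimation.
    random.Random(seed).shuffle(shuffled)
    cutoff = int(round(len(shuffled) * training_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical sample: 100 claim IDs, 80% for training, 20% withheld.
claim_ids = list(range(100))
training, validation = split_training_validation(claim_ids, training_fraction=0.8)
print(len(training), len(validation))
```

Shuffling before cutting avoids a split that merely reflects the order in which records were loaded.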


Step 2

Construct a hypothesis; the ‘null hypothesis’ is that the selected variables do not distinguish the target population

Select variables expected to distinguish the target population: ‘independent’ or ‘explanatory’ variables

Step 3

Run a logistic regression against the variables

Logistic regression estimates how strongly each independent variable is associated with the odds (predictive odds) of the dependent variable

Step 4

Test the hypothesis on the withheld sample and the broader population

Caution:

It’s critical to identify the target characteristics accurately

Logistic regression: targets


Target: Workers’ Compensation Fraudsters

Columns, in order: Name, Target, High Incidence Organization, Dr on CMS Ineligible List, High Risk Occupation, Psychological Impairment, Imperceptible Physical Impairment

Linda 1 1 1 1 1 1

Rebecca 1 1 1 1 0 1

Samuel 1 1 0 1 1 0

Stephen 1 0 0 0 1 1

Amanda 1 1 0 0 1 0

Hugh 1 0 1 0 0 1

Francesco 1 0 1 1 0 1

Allen 1 1 0 0 1 0

Eric 1 1 0 0 1 1

Gail 1 0 1 0 0 1

Joseph 1 1 1 1 0 0

Derek 1 1 1 0 1 0

Kevin 1 1 0 1 1 1

Logistic regression: general


General population of covered workers

Columns, in order: Name, Target, High Incidence Organization, Dr on CMS Ineligible List, High Risk Occupation, Psychological Impairment, Imperceptible Physical Impairment

Linda 0 1 1 1 1 1

Rebecca 0 0 0 1 0 1

Samuel 0 0 0 0 0 0

Stephen 0 0 0 0 0 1

Amanda 0 1 0 0 1 0

Hugh 0 0 1 0 0 1

Francesco 0 0 0 0 0 0

Allen 0 0 0 0 1 0

Eric 0 0 0 0 1 1

Gail 0 0 1 0 0 1

Joseph 0 0 0 1 1 0

Derek 0 0 1 0 0 0

Kevin 0 1 0 1 1 1
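As a rough illustration of Step 3, the two tables above can be stacked into one 26-row training set and fitted with a hand-written logistic regression. This is only a sketch: it uses plain gradient ascent rather than the routine a statistics package would use, the learning rate and iteration count are arbitrary assumptions, and its coefficients should not be expected to match the estimates reported on the Results slide, which names different variables.

```python
import math

# Five indicator values per worker, copied from the two tables above, in the order:
# High Incidence Organization, Dr on CMS Ineligible List, High Risk Occupation,
# Psychological Impairment, Imperceptible Physical Impairment.
FRAUD_ROWS = [
    (1, 1, 1, 1, 1), (1, 1, 1, 0, 1), (1, 0, 1, 1, 0), (0, 0, 0, 1, 1),
    (1, 0, 0, 1, 0), (0, 1, 0, 0, 1), (0, 1, 1, 0, 1), (1, 0, 0, 1, 0),
    (1, 0, 0, 1, 1), (0, 1, 0, 0, 1), (1, 1, 1, 0, 0), (1, 1, 0, 1, 0),
    (1, 0, 1, 1, 1),
]
GENERAL_ROWS = [
    (1, 1, 1, 1, 1), (0, 0, 1, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 0, 1),
    (1, 0, 0, 1, 0), (0, 1, 0, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 1, 0),
    (0, 0, 0, 1, 1), (0, 1, 0, 0, 1), (0, 0, 1, 1, 0), (0, 1, 0, 0, 0),
    (1, 0, 1, 1, 1),
]
rows = FRAUD_ROWS + GENERAL_ROWS
labels = [1] * len(FRAUD_ROWS) + [0] * len(GENERAL_ROWS)  # the Target column

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """weights[0] is the intercept; the rest pair up with the indicators in x."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def log_likelihood(weights, rows, labels):
    return sum(math.log(predict(weights, x) if y == 1 else 1.0 - predict(weights, x))
               for x, y in zip(rows, labels))

def fit_logistic(rows, labels, learning_rate=0.1, iterations=5000):
    """Maximize the log-likelihood by gradient ascent (illustrative settings)."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - predict(weights, x)
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights

weights = fit_logistic(rows, labels)
ll_start = log_likelihood([0.0] * 6, rows, labels)
ll_fit = log_likelihood(weights, rows, labels)
print([round(w, 3) for w in weights])
print(round(ll_start, 3), round(ll_fit, 3))
```

Several workers have identical indicator patterns in both tables, so no model can separate the two groups perfectly; the fit can only raise the log-likelihood above its starting value.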

Results

Maximum Likelihood Estimates:

Fraud likelihood = −1.9884 (intercept) + 2.1370 (multiple cases) + 1.2356 (CMS ineligible) + 0.3784 (rep disciplined) + 0.1877 (psychological) + 0.4805 (imperceptible physical)

Interpretation

All coefficients are positive, so every factor increases the likelihood of fraud

Coefficients reflect the actual weight the model places on each factor

Intercept (−1.9884) means this model predicts a 12% likelihood of fraud if no modeled factors are present
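The interpretation above can be checked with a short Python sketch that scores a claim from the reported estimates. The coefficient values come from the Results slide; the function names and the dictionary keys are illustrative assumptions, and a general-purpose language is used only to show that scoring does not depend on a statistics package.

```python
import math

# Estimates as reported on the Results slide.
INTERCEPT = -1.9884
COEFFICIENTS = {
    "multiple cases": 2.1370,
    "CMS ineligible": 1.2356,
    "rep disciplined": 0.3784,
    "psychological": 0.1877,
    "imperceptible physical": 0.4805,
}

def fraud_logit(present_factors):
    """Intercept plus the coefficient of every factor present on the claim."""
    return INTERCEPT + sum(COEFFICIENTS[name] for name in present_factors)

def fraud_probability(present_factors):
    """Convert the summed logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-fraud_logit(present_factors)))

baseline = fraud_probability([])                      # no modeled factors: about 0.12
one_factor = fraud_probability(["multiple cases"])    # strongest single factor
print(round(baseline, 4), round(one_factor, 4))
```

With only the strongest factor present the probability already passes one half, which shows how unevenly the model weights its inputs.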


Test of model accuracy

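C-statistic = 0.814: the probability that the model scores a randomly chosen fraud case higher than a randomly chosen non-fraud case (0.5 is no better than chance)

≥0.70 indicates an acceptable model

≥0.80 indicates a strong model; the closer to 1 the better

Visually represented as the area under the ROC curve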
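One way to see what the C-statistic measures is to compute it directly as the share of fraud/non-fraud pairs that the model ranks in the right order. The sketch below assumes this pairwise definition, with ties counted as half; the eight scores are invented for illustration and are not taken from the presentation.

```python
def c_statistic(scores_positive, scores_negative):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    concordant = 0.0
    for sp in scores_positive:
        for sn in scores_negative:
            if sp > sn:
                concordant += 1.0
            elif sp == sn:
                concordant += 0.5
    return concordant / (len(scores_positive) * len(scores_negative))

# Hypothetical model scores for four fraud and four non-fraud claims.
fraud_scores = [0.9, 0.8, 0.6, 0.4]
non_fraud_scores = [0.7, 0.5, 0.3, 0.2]
c = c_statistic(fraud_scores, non_fraud_scores)
print(c)  # 13 of 16 pairs are ranked correctly: 0.8125
```

A model that scored claims at random would rank about half the pairs correctly, which is why 0.5 is the no-skill baseline.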


Considerations

Accuracy only as good as the target population sample

Sum of the terms = ‘logit’ of the predicted probability, i.e. the natural log of the odds that a claim is fraudulent

Conversion of the model output, logit(p), to probability:

$$p = \frac{1}{1 + e^{-\operatorname{logit}(p)}}$$
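For readers who want the intermediate step, the conversion formula follows from the standard definition of the logit as the natural log of the odds. The derivation below is a sketch using that definition, with p and logit(p) as on the slide.

```latex
\begin{aligned}
\operatorname{logit}(p) &= \ln\!\left(\frac{p}{1-p}\right) \\
\frac{p}{1-p} &= e^{\operatorname{logit}(p)} \quad \text{(the odds)} \\
p &= \frac{e^{\operatorname{logit}(p)}}{1 + e^{\operatorname{logit}(p)}}
   = \frac{1}{1 + e^{-\operatorname{logit}(p)}}
\end{aligned}
```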


Logit transformation

p logit(p) p logit(p) p logit(p) p logit(p)

0.01 -4.5951 0.26 -1.0460 0.51 0.0400 0.76 1.1527

0.02 -3.8918 0.27 -0.9946 0.52 0.0800 0.77 1.2083

0.03 -3.4761 0.28 -0.9445 0.53 0.1201 0.78 1.2657

0.04 -3.1781 0.29 -0.8954 0.54 0.1603 0.79 1.3249

0.05 -2.9444 0.30 -0.8473 0.55 0.2007 0.80 1.3863

0.06 -2.7515 0.31 -0.8001 0.56 0.2412 0.81 1.4500

0.07 -2.5867 0.32 -0.7538 0.57 0.2819 0.82 1.5163

0.08 -2.4423 0.33 -0.7082 0.58 0.3228 0.83 1.5856

0.09 -2.3136 0.34 -0.6633 0.59 0.3640 0.84 1.6582

0.10 -2.1972 0.35 -0.6190 0.60 0.4055 0.85 1.7346

0.11 -2.0907 0.36 -0.5754 0.61 0.4473 0.86 1.8153

0.12 -1.9924 0.37 -0.5322 0.62 0.4895 0.87 1.9010

0.13 -1.9010 0.38 -0.4895 0.63 0.5322 0.88 1.9924

0.14 -1.8153 0.39 -0.4473 0.64 0.5754 0.89 2.0907

0.15 -1.7346 0.40 -0.4055 0.65 0.6190 0.90 2.1972

0.16 -1.6582 0.41 -0.3640 0.66 0.6633 0.91 2.3136

0.17 -1.5856 0.42 -0.3228 0.67 0.7082 0.92 2.4423

0.18 -1.5163 0.43 -0.2819 0.68 0.7538 0.93 2.5867

0.19 -1.4500 0.44 -0.2412 0.69 0.8001 0.94 2.7515

0.20 -1.3863 0.45 -0.2007 0.70 0.8473 0.95 2.9444

0.21 -1.3249 0.46 -0.1603 0.71 0.8954 0.96 3.1781

0.22 -1.2657 0.47 -0.1201 0.72 0.9445 0.97 3.4761

0.23 -1.2083 0.48 -0.0800 0.73 0.9946 0.98 3.8918

0.24 -1.1527 0.49 -0.0400 0.74 1.0460 0.99 4.5951

0.25 -1.0986 0.50 0.0000 0.75 1.0986
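The lookup table above does not need to be kept by hand; any entry can be regenerated from the logit definition. The sketch below assumes the standard formula logit(p) = ln(p / (1 − p)); the function names are illustrative.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Reproduce a few rows of the table.
for percent in (1, 12, 25, 50, 75, 99):
    p = percent / 100
    print(f"{p:.2f} {logit(p):.4f}")
```

The table is symmetric around p = 0.50: logit(1 − p) is the negative of logit(p), so only half of it carries independent information.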


If all factors are present, logit(p) = −1.9884 + 2.1370 + 1.2356 + 0.3784 + 0.1877 + 0.4805 = 2.4308, which corresponds to a 92% probability of fraud

LR weaknesses

All potential fraud factors combined into a single equation

With many independent predictor variables, characteristics can cancel each other out

Logistic regression has a hard time weighting interactions between individual variables

Interactions must be programmed explicitly

Requires additional data manipulation

LR weaknesses (ctd)

In rare-event modeling with a large number of predictive variables, logistic regression can produce many false positives

Difficult to differentiate rare events from normal events when the rare events occur with extremely low frequency

A bad solution is to boost the sensitivity of the model, which increases false positives further

Other supervised methods

A decision tree mitigates the problem, seen in logistic regression, of numerous weak predictors overwhelming a strong predictor

Sorts observations into buckets corresponding to the available classification values of the dependent variable

Conditional selection into paths (‘branches’)

Priority determined by frequency of characteristics

Decision tree example

[Figure: decision tree for the workers’ compensation fraud example. The root splits on High Incidence Organization (4F/10N when absent, 9F/3N when present); lower levels split on Doctor on CMS Ineligible List, High Risk Occupation, Imperceptible Physical Impairment and Psychological Impairment. Node labels give fraud (F) and non-fraud (N) counts; leaves are marked as pure, imperfectly pure, or tied.]

Left-facing arrows: characteristic is absent. Right-facing arrows: characteristic is present. 0 = no fraud, 1 = fraud. Misclassification rate = 23.08%
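The misclassification rate of a split can be computed from the fraud/non-fraud counts in each branch. The sketch below scores only the first split on High Incidence Organization, using the counts 4F/10N and 9F/3N from the example, and assumes each branch predicts its majority class; the function name is illustrative.

```python
def misclassification_rate(leaves):
    """Each leaf is (fraud_count, non_fraud_count) and predicts its majority class,
    so the minority count in each leaf is the number of errors."""
    errors = sum(min(fraud, non_fraud) for fraud, non_fraud in leaves)
    total = sum(fraud + non_fraud for fraud, non_fraud in leaves)
    return errors / total

# First split only: characteristic absent (4F/10N), characteristic present (9F/3N).
first_split = [(4, 10), (9, 3)]
rate = misclassification_rate(first_split)
print(f"{rate:.2%}")  # 7 of 26 observations misclassified
```

The 23.08% reported for the full tree corresponds to 6 of the 26 observations, so the deeper branches correct only one more case than the first split alone.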

Beyond decision tree

A decision tree may overweight high-frequency but insignificant characteristics

Boosted decision trees and random forests are techniques that improve on the results of the basic algorithm based on misclassification rates

Neural networks model all possible combinations and select the best ones based on misclassification rates

Unsupervised methods

K-means cluster

Consider it loosely a counterpart to logistic regression with no target variable

Identify a set of independent variables

Transformations likely required, as above

Procedure tries to identify a set of distinct ‘clusters’ based on the selected variables

Can tease out meaningful characteristics
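A bare-bones version of the k-means procedure is sketched below to show the two alternating steps: assign each observation to its nearest cluster center, then move each center to the mean of its members. The six two-dimensional points, the choice of the first k points as starting centers, and the fixed iteration count are assumptions made for the example; statistical packages use more careful initialization and stopping rules.

```python
def k_means(points, k, iterations=10):
    """Plain k-means on tuples of numbers, using squared Euclidean distance."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, point in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical observations described by two numeric variables.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

No target variable is supplied anywhere: the grouping comes entirely from the selected variables, which is what makes the method unsupervised.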


SOME BEST PRACTICES IN DATA MANAGEMENT

Data best practices

Understand your data

What does it represent?

How does it enter your data warehouse?

Check data for suitability

Missing values?

Do target and individual predictors correlate?

Ensure that data cleansing and transformation steps are documented and repeatable for model re-estimation

Counterintuitive-ness

The more independent variables, the less predictive value each individual variable, or characteristic, has, on average

Counterintuitive-ness (ctd)

In rare-event modeling, even a very accurate model can produce a disproportionately large number of false positives

Example: Target population is 1% of a population of 1,000,000 (10,000 targets).

If the predictive model has a 10% false positive rate and a 10% false negative rate (90% accurate):

Target (10,000)               General population (990,000)

True positives: 9,000         True negatives: 891,000

False negatives: 1,000        False positives: 99,000
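The arithmetic behind this example can be reproduced in a few lines, which also yields the figure managers usually care about: the share of flagged cases that are real targets. The variable names are illustrative, and the 10% false negative rate is an assumption read off the table (1,000 of 10,000 targets missed).

```python
population = 1_000_000
target_rate = 0.01
false_positive_rate = 0.10
false_negative_rate = 0.10  # assumption taken from the table: 1,000 of 10,000 missed

targets = round(population * target_rate)
non_targets = population - targets
true_positives = round(targets * (1 - false_negative_rate))
false_negatives = targets - true_positives
false_positives = round(non_targets * false_positive_rate)
true_negatives = non_targets - false_positives

sensitivity = true_positives / targets        # share of real targets that are flagged
specificity = true_negatives / non_targets    # share of non-targets correctly cleared
precision = true_positives / (true_positives + false_positives)

print(true_positives, false_negatives, false_positives, true_negatives)
print(f"sensitivity={sensitivity:.1%} specificity={specificity:.1%} precision={precision:.1%}")
```

Although sensitivity and specificity are both 90%, only about one flagged case in twelve is a true target, because false positives outnumber true positives eleven to one.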

Takeaways for success

1. Clearly identify target variable

2. Limit predictor variables

3. Know the model data and manage it; data management is most of the work

4. Know how to measure model performance

5. Set goals and expectations for the model

6. Monitor model performance and adjust/re-estimate as necessary

Thank you/Questions

Paul Arnest

[email protected]
