Virtual Site Event
1
Predictive Analytics:
What Managers Need to Know
Presented by:
Paul Arnest, MS, MBA, PMP
February 11, 2015
Virtual Site Ground Rules
The PMI Code of Conduct applies to this virtual presentation.
The Virtual Attendees are expected to:
Participate for a minimum of 40 minutes. Login information
will be verified.
Answer the question pertaining to the presentation correctly
in the survey in order to obtain the PDU credit (1).
Respond to the survey within 48 hours of participation (by Friday,
February 13, 2015) in order to obtain the PDU credit.
2
A NEW ENVIRONMENT
Predictive Analytics
4
Definition
Predictive Analytics: Techniques that
quantify potential outcomes or events
based on past data
Not descriptive analysis and descriptive
statistics
Not techniques that enable end-users to
perform individual data discovery or to
customize reports
5
Convergence
Once restricted to specialized statistics
organizations, advanced modeling
techniques are moving into the IT
mainstream
6
[Figure labels: Stat/Analytics Shop, IT]
Concepts/Buzzwords
Machine learning ▫ Supervised learning ▫
Unsupervised learning
Response variable ▫ Target variable ▫
Dependent variable ▫ Left hand side
variable
Explanatory variable ▫ Independent
variable ▫ Right hand side variable
Logistic regression ▫ Random forest, etc.
Sensitivity ▫ Specificity
7
Tool independence
Predictive techniques use mathematical
algorithms that are independent of
particular tools
SAS, R, Stata, SPSS, many more
Use specialized tools for model
development
It is possible to implement models using
general software tools, e.g., Java, .NET
8
Don’t be intimidated
Your stat/analysis package is
programmed to do the heavy math
You’ll discover that most internal stat shops
are using a small set of models and
techniques over and over again
Most of the work:
Understanding what you want to accomplish
Understanding the data
Organizing the data
9
Understand the results
Predictive analytics produces a probability
of a characteristic or behavior based on a
detailed analysis of past characteristics or
behaviors
A probability, even one near 100%, is not certainty
Model accuracy depends on how similar past
conditions are to present ones
10
HOW IT WORKS
AND WHAT TO EXPECT
Predictive Analytics
11
Logistic regression
Workhorse procedure for predictive
analytics
Supervised technique
12
Step 1
Identify a known population that exhibits
the characteristic you want to predict —
‘dependent’, ‘target’ or ‘response’ variable
— plus a known population that does not
You may take the whole population (‘big
data’) or a sample
Use 80% or 90% of the sample as the
training data set
Withhold the remainder for validation
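A minimal sketch of the training/validation split described in Step 1, using only the Python standard library. The function name, the fixed seed, and the stand-in records are assumptions for illustration; any stat package offers an equivalent.

```python
import random

def split_train_validation(rows, train_fraction=0.8, seed=2015):
    """Shuffle a copy of rows and split it into (training, validation) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed (an arbitrary choice here) makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))  # hypothetical stand-in for 100 labeled claims
train, validation = split_train_validation(records, train_fraction=0.8)
print(len(train), len(validation))  # 80 20
```

Keeping the seed fixed supports the later advice that data preparation steps be repeatable when the model is re-estimated.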
13
Step 2
Construct a hypothesis to test against the ‘null hypothesis’ of no association
Select variables expected to distinguish
target population — ‘independent’ or
‘explanatory’ variables
14
Step 3
Run a logistic regression against the
variables
Logistic regression will estimate how strongly
each independent variable is associated with
the dependent variable, expressed as
predictive odds (on a log scale)
15
Step 4
Test the hypothesis on the withheld
sample and the broader population
Caution:
It’s critical to identify the target
characteristics accurately
16
Logistic regression: targets
17
Target: Workers’ Compensation Fraudsters

                   High Incidence  Dr on CMS        High Risk   Psychological  Imperceptible
Name       Target  Organization    Ineligible List  Occupation  Impairment     Physical Impairment
Linda      1       1               1                1           1              1
Rebecca    1       1               1                1           0              1
Samuel     1       1               0                1           1              0
Stephen    1       0               0                0           1              1
Amanda     1       1               0                0           1              0
Hugh       1       0               1                0           0              1
Francesco  1       0               1                1           0              1
Allen      1       1               0                0           1              0
Eric       1       1               0                0           1              1
Gail       1       0               1                0           0              1
Joseph     1       1               1                1           0              0
Derek      1       1               1                0           1              0
Kevin      1       1               0                1           1              1
Logistic regression: general
18
General population of covered workers

                   High Incidence  Dr on CMS        High Risk   Psychological  Imperceptible
Name       Target  Organization    Ineligible List  Occupation  Impairment     Physical Impairment
Linda      0       1               1                1           1              1
Rebecca    0       0               0                1           0              1
Samuel     0       0               0                0           0              0
Stephen    0       0               0                0           0              1
Amanda     0       1               0                0           1              0
Hugh       0       0               1                0           0              1
Francesco  0       0               0                0           0              0
Allen      0       0               0                0           1              0
Eric       0       0               0                0           1              1
Gail       0       0               1                0           0              1
Joseph     0       0               0                1           1              0
Derek      0       0               1                0           0              0
Kevin      0       1               0                1           1              1
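Step 3 can be tried on the 26 rows in the two tables above. The sketch below fits a logistic regression by plain gradient ascent on the log-likelihood, using only the standard library. The learning rate, step count, and function names are assumptions, and a stat package would use a faster solver. The coefficients it produces are not expected to match the ones reported on the Results slide, which names different factors and presumably comes from a larger data set.

```python
import math

# Five 0/1 characteristics per worker, copied from the two tables.
# Column order: High Incidence Organization, Dr on CMS Ineligible List,
# High Risk Occupation, Psychological Impairment, Imperceptible Physical Impairment.
TARGETS = [
    (1, 1, 1, 1, 1), (1, 1, 1, 0, 1), (1, 0, 1, 1, 0), (0, 0, 0, 1, 1),
    (1, 0, 0, 1, 0), (0, 1, 0, 0, 1), (0, 1, 1, 0, 1), (1, 0, 0, 1, 0),
    (1, 0, 0, 1, 1), (0, 1, 0, 0, 1), (1, 1, 1, 0, 0), (1, 1, 0, 1, 0),
    (1, 0, 1, 1, 1),
]
GENERAL = [
    (1, 1, 1, 1, 1), (0, 0, 1, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 0, 1),
    (1, 0, 0, 1, 0), (0, 1, 0, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 1, 0),
    (0, 0, 0, 1, 1), (0, 1, 0, 0, 1), (0, 0, 1, 1, 0), (0, 1, 0, 0, 0),
    (1, 0, 1, 1, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, rows, labels):
    """Log-likelihood of the labels under the logistic model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * v for w, v in zip(weights, x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(rows, labels, learning_rate=0.1, steps=5000):
    """Gradient ascent on the mean log-likelihood, starting from all zeros."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - sigmoid(sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                gradient[j] += error * v
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

# A leading 1 in every row gives the model its intercept term.
rows = [(1,) + r for r in TARGETS + GENERAL]
labels = [1] * len(TARGETS) + [0] * len(GENERAL)
weights = fit_logistic(rows, labels)
print("intercept and coefficients:", [round(w, 3) for w in weights])
```

With only 26 rows this is a teaching example; the fitted values illustrate the mechanics, not a usable fraud model.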
Results
Maximum Likelihood Estimates:
Fraud likelihood, as logit(p) =
−1.9884 (intercept)
+ 2.1370 (multiple cases)
+ 1.2356 (CMS ineligible)
+ 0.3784 (rep disciplined)
+ 0.1877 (psychological)
+ 0.4805 (imperceptible physical)
19
Interpretation
Positive coefficients mean all factors
contribute to likelihood of fraud
Coefficients reflect the actual weight the
model places on each factor
Intercept (−1.9884) means this model
predicts a 12% likelihood of fraud if no
modeled factors present
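The 12% figure follows from converting the intercept, which is a log-odds value, into a probability. A worked version of that arithmetic, with the exponential rounded to two decimals:

```latex
p = \frac{1}{1 + e^{-(-1.9884)}}
  = \frac{1}{1 + e^{1.9884}}
  \approx \frac{1}{1 + 7.30}
  \approx 0.12
```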
20
Test of model accuracy
C-statistic (probability that the model scores a randomly
chosen fraud case above a randomly chosen non-fraud case;
0.5 is no better than chance) = 0.814
≥0.70 indicates an acceptable model
≥0.80 indicates a strong model — the closer
to 1 the better
Visually represented as ROC curve
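A minimal sketch of how a c-statistic can be computed directly from model scores: count the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The scores and labels below are hypothetical, not the data behind the 0.814 figure.

```python
def c_statistic(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half a concordant pair
    return concordant / (len(positives) * len(negatives))

# Hypothetical model scores and true outcomes (1 = fraud, 0 = no fraud).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(c_statistic(scores, labels))  # 0.8125
```

This pairwise count equals the area under the ROC curve, which is why the two are reported interchangeably.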
21
Considerations
Accuracy only as good as the target
population sample
Sum of the terms = ‘logit’ of the predicted
probability, i.e., the log of the odds that
a claim is fraudulent
Convert the summed logit, logit(p), to a
probability:
$$p = \frac{1}{1 + e^{-\operatorname{logit}(p)}}$$
22
Logit transformation
p logit(p) p logit(p) p logit(p) p logit(p)
0.01 -4.5951 0.26 -1.0460 0.51 0.0400 0.76 1.1527
0.02 -3.8918 0.27 -0.9946 0.52 0.0800 0.77 1.2083
0.03 -3.4761 0.28 -0.9445 0.53 0.1201 0.78 1.2657
0.04 -3.1781 0.29 -0.8954 0.54 0.1603 0.79 1.3249
0.05 -2.9444 0.30 -0.8473 0.55 0.2007 0.80 1.3863
0.06 -2.7515 0.31 -0.8001 0.56 0.2412 0.81 1.4500
0.07 -2.5867 0.32 -0.7538 0.57 0.2819 0.82 1.5163
0.08 -2.4423 0.33 -0.7082 0.58 0.3228 0.83 1.5856
0.09 -2.3136 0.34 -0.6633 0.59 0.3640 0.84 1.6582
0.10 -2.1972 0.35 -0.6190 0.60 0.4055 0.85 1.7346
0.11 -2.0907 0.36 -0.5754 0.61 0.4473 0.86 1.8153
0.12 -1.9924 0.37 -0.5322 0.62 0.4895 0.87 1.9010
0.13 -1.9010 0.38 -0.4895 0.63 0.5322 0.88 1.9924
0.14 -1.8153 0.39 -0.4473 0.64 0.5754 0.89 2.0907
0.15 -1.7346 0.40 -0.4055 0.65 0.6190 0.90 2.1972
0.16 -1.6582 0.41 -0.3640 0.66 0.6633 0.91 2.3136
0.17 -1.5856 0.42 -0.3228 0.67 0.7082 0.92 2.4423
0.18 -1.5163 0.43 -0.2819 0.68 0.7538 0.93 2.5867
0.19 -1.4500 0.44 -0.2412 0.69 0.8001 0.94 2.7515
0.20 -1.3863 0.45 -0.2007 0.70 0.8473 0.95 2.9444
0.21 -1.3249 0.46 -0.1603 0.71 0.8954 0.96 3.1781
0.22 -1.2657 0.47 -0.1201 0.72 0.9445 0.97 3.4761
0.23 -1.2083 0.48 -0.0800 0.73 0.9946 0.98 3.8918
0.24 -1.1527 0.49 -0.0400 0.74 1.0460 0.99 4.5951
0.25 -1.0986 0.50 0.0000 0.75 1.0986
23
If all factors are present,
logit(p) = −1.9884 + 2.1370 + 1.2356 + 0.3784 + 0.1877 + 0.4805 = 2.4308,
which converts to a 92% probability of fraud
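The scoring arithmetic can be sketched in a few lines of Python. The coefficients are the ones reported on the Results slide; the dictionary keys and function names are illustrative assumptions.

```python
import math

# Coefficients as reported on the Results slide.
INTERCEPT = -1.9884
COEFFICIENTS = {
    "multiple cases": 2.1370,
    "CMS ineligible": 1.2356,
    "rep disciplined": 0.3784,
    "psychological": 0.1877,
    "imperceptible physical": 0.4805,
}

def fraud_logit(present_factors):
    """Intercept plus the coefficient of every factor present on the claim."""
    return INTERCEPT + sum(COEFFICIENTS[name] for name in present_factors)

def logit_to_probability(logit):
    """Inverse of the logit transformation: p = 1 / (1 + e^-logit)."""
    return 1.0 / (1.0 + math.exp(-logit))

no_factors = logit_to_probability(fraud_logit([]))
all_factors = logit_to_probability(fraud_logit(list(COEFFICIENTS)))
print(f"{no_factors:.0%} {all_factors:.0%}")  # 12% 92%
```

Any other combination of factors can be scored by passing the names of the factors present, for example `fraud_logit(["multiple cases", "CMS ineligible"])`.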
LR weaknesses
All potential fraud factors combined into a
single equation
With many independent predictor variables,
characteristics can cancel each other out
Logistic regression has a hard time weighting
interactions between individual variables
Must be programmed explicitly
Requires additional data manipulation
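As an illustration of what "programmed explicitly" means, the sketch below adds an interaction column to a record before modeling. The field names are hypothetical; the point is only that the analyst has to create the combined variable, because logistic regression will not discover it.

```python
def add_interaction(row, first, second, name=None):
    """Return a copy of row with a column that is 1 only when both flags are 1."""
    name = name or f"{first}_x_{second}"
    extended = dict(row)  # copy, so the original record is left unchanged
    extended[name] = row[first] * row[second]
    return extended

# Hypothetical claim record with 0/1 flags.
claim = {"high_incidence_org": 1, "cms_ineligible_doctor": 1, "high_risk_occupation": 0}
both = add_interaction(claim, "high_incidence_org", "cms_ineligible_doctor")
print(both["high_incidence_org_x_cms_ineligible_doctor"])  # 1
```

Each interaction adds another predictor, so this has to be weighed against the advice to limit the number of predictor variables.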
24
LR weaknesses (ctd)
In rare-event modeling with a large
number of predictive variables, logistic
regression can produce many false
positives
Difficult to differentiate rare events from
normal events when the rare events occur
with extremely low frequency
Bad solution is to boost the sensitivity of the
model
25
Other supervised methods
Decision tree mitigates the problem of
numerous weak predictors overwhelming
a strong predictor (a weakness of logistic regression)
Sorts observations of the dependent
variable into buckets corresponding to its
available classification values
Conditional selection into paths (‘branches’)
Priority determined by frequency of
characteristics
26
Decision tree example
27
[Figure: decision tree for the fraud example. Root split on High Incidence Organization: 4F/10N when absent, 9F/3N when present. Lower-level splits use Doctor on CMS Ineligible List, High Risk Occupation, Psychological Impairment, and Imperceptible Physical Impairment. Leaves are marked Purity, Imperfect Purity, or Tie.]
Legend: left-facing arrows = characteristic is absent; right-facing arrows = characteristic is present; 0 = no fraud; 1 = fraud
Misclassification rate = 23.08%
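The root split in the example can be reproduced from the two 13-row tables shown earlier. The sketch below counts fraud (F) and non-fraud (N) rows on each side of the High Incidence Organization split and the errors made if the tree stopped there and predicted the majority class in each branch. Function names are assumptions.

```python
# High Incidence Organization flag for the 13 fraud rows and the 13 general
# population rows, in table order (Linda through Kevin).
FRAUD_HIGH_INCIDENCE = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
GENERAL_HIGH_INCIDENCE = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]

def split_counts(fraud_flags, general_flags):
    """Count fraud (F) and non-fraud (N) rows where the flag is absent (0) or present (1)."""
    counts = {0: {"F": 0, "N": 0}, 1: {"F": 0, "N": 0}}
    for flag in fraud_flags:
        counts[flag]["F"] += 1
    for flag in general_flags:
        counts[flag]["N"] += 1
    return counts

def majority_vote_errors(counts):
    """Rows misclassified if each branch predicts its majority class."""
    return sum(min(branch["F"], branch["N"]) for branch in counts.values())

counts = split_counts(FRAUD_HIGH_INCIDENCE, GENERAL_HIGH_INCIDENCE)
errors = majority_vote_errors(counts)
print(counts[0], counts[1], errors)
# {'F': 4, 'N': 10} {'F': 9, 'N': 3} 7
```

Stopping at the root would misclassify 7 of 26 rows (about 26.9%). The 23.08% reported for the full tree corresponds to 6 of 26, so the deeper splits remove one more error.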
Beyond decision tree
Decision tree may overweight high-
frequency but insignificant characteristics
Boosted decision tree and random forest
are techniques to improve on the results
of the basic algorithm based on
misclassification rates
Neural networks learn weighted combinations
of the inputs, adjusting the weights to
reduce misclassification
28
Unsupervised methods
K-means cluster
Loosely comparable to logistic regression,
but with no target variable
Identify a set of independent variables
Transformations likely required, as above
Procedure tries to group observations into
‘clusters’ that are similar on the selected
variables
Can tease out meaningful characteristics
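A minimal k-means sketch (Lloyd's algorithm) in plain Python, assuming two numeric variables per observation. The data points and the fixed starting centers are hypothetical; real packages choose starting centers automatically and usually run several restarts.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))

def assign(points, centers):
    """Put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for point in points:
        distances = [squared_distance(point, c) for c in centers]
        clusters[distances.index(min(distances))].append(point)
    return clusters

def k_means(points, initial_centers, iterations=10):
    """Alternate between assigning points and recomputing cluster centers."""
    centers = list(initial_centers)
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [centroid(cluster) if cluster else center
                   for cluster, center in zip(clusters, centers)]
    return centers, assign(points, centers)

# Hypothetical observations: (claims per year, average claim cost in $000).
points = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
centers, clusters = k_means(points, initial_centers=[(1, 2), (8, 9)])
print(clusters)
```

Because the variables are compared by distance, they generally need to be on comparable scales first, which is one reason transformations are likely required.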
29
SOME BEST PRACTICES
IN DATA MANAGEMENT
Predictive Analytics
30
Data best practices
Understand your data
What does it represent?
How does it enter your data warehouse?
Check data for suitability
Missing values?
Do target and individual predictors correlate?
Ensure that data cleansing and
transformation steps are documented
and repeatable for model re-estimation
31
Counterintuitive-ness
The more independent variables, the less
predictive value each individual variable,
or characteristic, has, on average
32
Counterintuitive-ness (ctd)
In rare-event modeling, even a very
accurate model can produce a
disproportionately large number of false positives
Example: Target population 1% in a
population of 1,000,000 (10,000 targets).
If the predictive model has a 10% false positive rate (90%
accurate):
33
Target                    General population
10,000                    990,000
True positives: 9,000     True negatives: 891,000
False negatives: 1,000    False positives: 99,000
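The counts in this example can be reproduced with a short calculation. The sketch assumes, as the table implies, that the model also catches 90% of true targets; the function and parameter names are illustrative.

```python
def rare_event_outcomes(population, target_rate, true_positive_rate, false_positive_rate):
    """Confusion-matrix counts and precision for a rare-event model."""
    targets = round(population * target_rate)
    non_targets = population - targets
    true_positives = round(targets * true_positive_rate)
    false_positives = round(non_targets * false_positive_rate)
    return {
        "true_positives": true_positives,
        "false_negatives": targets - true_positives,
        "false_positives": false_positives,
        "true_negatives": non_targets - false_positives,
        # Precision: share of flagged cases that are real targets.
        "precision": true_positives / (true_positives + false_positives),
    }

result = rare_event_outcomes(1_000_000, 0.01, 0.90, 0.10)
print(result["false_positives"], f"{result['precision']:.1%}")  # 99000 8.3%
```

Of the 108,000 cases the model flags, only 9,000 are real targets, so roughly 11 of every 12 flagged cases are false alarms.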
Takeaways for success
1. Clearly identify target variable
2. Limit predictor variables
3. Know the model data and manage it: data management is most of the work
4. Know how to measure model performance
5. Set goals and expectations for the model
6. Monitor model performance and adjust/re-estimate as necessary
34