

Page 1: How to Combat Modeling Hurdles › ... › Day2_1645_deMedina… · How to Combat Modeling Hurdles 1. Compensate for needles in the haystack 2. Pre-process your data carefully 3

How to Combat Modeling Hurdles

1. Compensate for needles in the haystack
2. Pre-process your data carefully
3. Be aware of institutional challenges
4. Set your misclassification costs
5. Increase model acceptance with visualization
6. Insist on accurately labeled historical data
7. Look for collusion
8. Prepare for an ever-changing landscape


1. Compensate for Needles in the Haystack

• Up-sample / down-sample data to obtain balanced training data
  ◦ Pitfall 1: up-sampling exaggerates the importance of particular instances of the rare event
  ◦ Pitfall 2: not partitioning into training and testing sets FIRST
• Importance of baselining
  ◦ Is a model that’s right 99.9% of the time a good one? That depends...
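A minimal sketch of both points, using only the Python standard library. The data (10 fraud cases among 10,000 records) and the helper names `stratified_split`, `upsample_minority`, and `majority_baseline_accuracy` are hypothetical and not from the talk. It partitions first, up-samples only the training portion, and computes the accuracy of a model that never flags anything.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def upsample_minority(train_idx, labels, seed=0):
    """Resample the rare class (with replacement) until the classes are balanced."""
    rng = random.Random(seed)
    rare = [i for i in train_idx if labels[i] == 1]
    common = [i for i in train_idx if labels[i] == 0]
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)

# Hypothetical data: 1 = fraud, 0 = not fraud (0.1% fraud).
labels = [1] * 10 + [0] * 9990

train_idx, test_idx = stratified_split(labels)         # partition FIRST
balanced_train = upsample_minority(train_idx, labels)  # then up-sample training only

# A model that never flags anything is already right 99.9% of the time here.
print(majority_baseline_accuracy(labels))
```

Because the split happens before up-sampling, no copy of a test-set fraud case can leak into the training data, and the test set keeps the real-world class ratio.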


2. Pre-Process your Data Carefully

* first name has been changed


Fuzzy Matching is Crucial


• Borrow techniques from text mining

• Change GUI inputs to restrict entries
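As one illustration of a text-mining style technique, the sketch below matches a mistyped name against a list of known names using `difflib` from the Python standard library. The vendor names, the `normalize` and `best_match` helpers, and the 0.8 cutoff are assumptions for illustration only; the talk does not say which matching method was used.

```python
import difflib

def normalize(name):
    """Lowercase, drop periods and commas, and collapse repeated whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def best_match(entry, known_names, cutoff=0.8):
    """Return the known name most similar to `entry`, or None if nothing is close."""
    lookup = {normalize(n): n for n in known_names}
    hits = difflib.get_close_matches(normalize(entry), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

# Hypothetical vendor list and free-text entries.
known = ["Acme Supply Co", "Baker Logistics", "Carter Freight Inc"]
print(best_match("ACME Suply Co.", known))   # misspelled entry
print(best_match("Zenith Holdings", known))  # unrelated entry
```

The cutoff trades missed matches against false matches, so it would need tuning on real entries.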


3. Be Aware of Institutional Challenges


• PR considerations
  ◦ Sensitivity to customer relations

• Understanding that the models are not deciding fraud, but flagging cases as suspect
  ◦ internal jobs, budget concerns
  ◦ “automated screening process”

• Convince data keepers of importance of maintaining data for fraud detection purposes


4. Set Your Misclassification Costs

• Terminology:
  ◦ specificity/sensitivity (biometrics, fraud)
  ◦ type I/II (statistics)
  ◦ false positive/negative (medicine)
  ◦ precision/recall (information retrieval)
  ◦ false alarm/false dismissal (fraud, others)

                 Prediction
Truth            Not Flagged       Flagged
Not Fraud        True negative     False positive
Fraud            False negative    True positive
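To connect the vocabularies, here is a small sketch that derives the common rates from the four cells of the matrix. The counts are made up for illustration, and `rates` is a hypothetical helper name.

```python
def rates(tn, fp, fn, tp):
    """Compute the usual rates from the four confusion-matrix cells."""
    return {
        "sensitivity (recall)": tp / (tp + fn),   # share of fraud that gets flagged
        "specificity": tn / (tn + fp),            # share of non-fraud left unflagged
        "precision": tp / (tp + fp),              # share of flags that are fraud
        "false alarm rate": fp / (fp + tn),       # false positives, type I errors
        "false dismissal rate": fn / (fn + tp),   # false negatives, type II errors
    }

# Hypothetical counts for 10,000 cases, 10 of them fraudulent.
counts = {"tn": 9900, "fp": 90, "fn": 2, "tp": 8}
metrics = rates(**counts)

for name, value in metrics.items():
    print(f"{name:<22}{value:.4f}")
```

With these counts the model catches 8 of 10 frauds, yet fewer than one flag in ten is actual fraud, which is typical when the event is rare.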


False Positives Can Range from Mildly Embarrassing…


…To Hugely Costly

Airport evacuation due to a bomb scare…


False Negatives: Not Always A Huge Deal…

Approving a $250 fraudulent insurance claim


But Sometimes Life or Death
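One way to turn these examples into a modeling decision is to pick the flagging threshold that minimizes total misclassification cost. The sketch below assumes a model that outputs a score per case. The scores, labels, and the $500 cost of investigating a false alarm are hypothetical; only the $250 claim amount comes from the example above.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a given threshold."""
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp      # false alarm: wasted investigation, unhappy customer
        elif not flagged and is_fraud:
            cost += cost_fn      # false dismissal: fraud goes through
    return cost

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Try every observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))

# Hypothetical model scores and true labels (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0,    0,    0,    1,    0,    0,    1,    1]

cheap_miss = best_threshold(scores, labels, cost_fp=500, cost_fn=250)
costly_miss = best_threshold(scores, labels, cost_fp=500, cost_fn=100_000)
print(cheap_miss, costly_miss)
```

When a miss is cheap, the chosen threshold is high and few cases are flagged; when a miss is very expensive, the threshold drops and more false alarms are tolerated.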


5. Increase Model Acceptance with Visualization

• Present users with an intuitive and easy-to-use GUI to encourage use of model results

• Use interactive graphs and charts
• Explain how model generated results
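As a rough illustration of explaining how a score was generated, the sketch below assumes a simple linear scoring model and shows each feature's contribution to one flagged case as a text bar. The feature names, weights, and helper functions are all hypothetical; a real tool would more likely use an interactive chart.

```python
def explain_score(weights, case):
    """Return (feature, contribution) pairs, largest absolute contribution first."""
    contributions = [(name, weights[name] * case.get(name, 0.0)) for name in weights]
    return sorted(contributions, key=lambda pair: abs(pair[1]), reverse=True)

def render_bars(contributions, width=20):
    """Draw one text bar per feature, scaled to the largest contribution."""
    largest = max(abs(value) for _, value in contributions) or 1.0
    lines = []
    for name, value in contributions:
        bar = "#" * round(width * abs(value) / largest)
        lines.append(f"{name:<24}{value:+6.2f} {bar}")
    return "\n".join(lines)

# Hypothetical linear model weights and one flagged case.
weights = {"invoice_amount_zscore": 0.9, "new_vendor": 1.5, "weekend_submission": 0.4}
case = {"invoice_amount_zscore": 2.0, "new_vendor": 1.0, "weekend_submission": 0.0}

explanation = explain_score(weights, case)
print(render_bars(explanation))
```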


Contracting with the USPS


USPS managed $33 Billion in contracts (FY2009)


RADR Risk Assessment Data Repository


6. Insist on Accurately Labeled Historical Data


• Undetected fraud shows up in the training set as examples of non-fraud, which can confuse the model

• Identified fraud might not be in the training set

• Flagged cases that proved to be non-fraudulent are not recorded as such

• Institutional challenges with keepers of the data
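One possible way to keep these situations apart is to record an explicit investigation outcome for every case, rather than a single fraud/non-fraud flag. The sketch below is an assumption about how such a record could look, not a description of any system from the talk. The `Outcome` values, the `training_labels` helper, and the case ids are hypothetical, and treating uninvestigated cases as unlabeled is only one option.

```python
from enum import Enum

class Outcome(Enum):
    CONFIRMED_FRAUD = "confirmed_fraud"
    CLEARED = "cleared"                    # flagged, investigated, found legitimate
    NOT_INVESTIGATED = "not_investigated"  # may still contain undetected fraud

def training_labels(cases):
    """Map case id to 1 (fraud), 0 (verified non-fraud), or None (unverified)."""
    mapping = {
        Outcome.CONFIRMED_FRAUD: 1,
        Outcome.CLEARED: 0,
        Outcome.NOT_INVESTIGATED: None,
    }
    return {case_id: mapping[outcome] for case_id, outcome in cases.items()}

# Hypothetical case history.
cases = {
    "C-001": Outcome.CONFIRMED_FRAUD,
    "C-002": Outcome.CLEARED,
    "C-003": Outcome.NOT_INVESTIGATED,
}

labels = training_labels(cases)
unverified = [case_id for case_id, label in labels.items() if label is None]
print(labels, unverified)
```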


7. Look for Collusion


Breakout Fraud:
• Collusion where every member flies “below the radar”.

• The group works in concert to commit one large act of fraud or several small ones.

• Perhaps 5 people pretending to be 100.

• Link analysis algorithms are very useful in detecting this type of fraud.
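A very simple form of link analysis is to group accounts that share any identifying attribute and then look at the connected groups. The sketch below does this with a union-find structure. The accounts, attribute names, and the `linked_groups` helper are hypothetical, and real link analysis tools go well beyond exact attribute matches.

```python
from collections import defaultdict

def linked_groups(accounts):
    """Group account ids that share any attribute value (phone, address, ...)."""
    parent = {acct: acct for acct in accounts}

    def find(a):
        # Follow parent links to the group representative, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (attribute, value) -> first account that used it
    for acct, attrs in accounts.items():
        for key_value in attrs.items():
            if key_value in first_seen:
                union(first_seen[key_value], acct)
            else:
                first_seen[key_value] = acct

    groups = defaultdict(set)
    for acct in accounts:
        groups[find(acct)].add(acct)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)

# Hypothetical accounts: A1 and A2 share a phone, A2 and A3 share an address.
accounts = {
    "A1": {"phone": "555-0101", "address": "12 Oak St"},
    "A2": {"phone": "555-0101", "address": "98 Elm Ave"},
    "A3": {"phone": "555-0199", "address": "98 Elm Ave"},
    "A4": {"phone": "555-0142", "address": "7 Pine Rd"},
}

groups = linked_groups(accounts)
print(groups)
```

No single pair of attributes ties A1 to A3 directly, but the chain through A2 puts all three in one group, which is how many apparent identities can collapse into a few actors.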


8. Prepare for an Ever-changing Landscape


• Moving target: fraudsters constantly refining and expanding their schemes

• Models must be very closely guarded

• Models must be updated often

• Subject matter expertise is crucial




Thank you. Questions? [email protected]


Antonia de Medinaceli

Antonia de Medinaceli is currently Director of Fraud Analytics at Elder Research, Inc., the nation’s largest independent data mining consultancy. Ms. de Medinaceli has applied Data Mining technologies to a range of projects, including direct marketing, web site personalization, pattern recognition in digital images, and financial analysis. In addition to her consulting experience, she has also co-taught Data Mining short courses with the Elder Research team. Her previous industry experience was largely focused on the design and implementation of algorithms for the optimization of large-scale systems. These projects included flight network optimization, data fusion, and efficiency improvements in manufacturing settings.
