Institute of Internal Auditors Puget Sound Chapter Fourth Annual Fraud Conference Advanced Data Analytics: Introduction to data science techniques and applications March 28, 2013 Anna Cueni and Kate Eckhart Fraud Investigation & Dispute Services


Institute of Internal Auditors
Puget Sound Chapter
Fourth Annual Fraud Conference

Advanced Data Analytics: Introduction to data science techniques and applications

March 28, 2013

Anna Cueni and Kate Eckhart
Fraud Investigation & Dispute Services


Agenda

► Introduction

►Fuzzy matching

►Text mining

►Clustering

►Automated data classification


Introduction


What is data science?

► Wikipedia: “Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.”

► Uses techniques from mathematics, statistics, linguistics, and computer science

► Examples of available data sources:
  ► Structured data
    ► Financial Records
    ► T&E
    ► Purchase Orders
    ► Inventory records
    ► Employee and Vendor Lists
  ► Unstructured data
    ► Description fields
    ► Email / Instant messages
    ► Text/Mobile messages


Why is data science important?

► Increasing variety, complexity and volume of data

► Near real-time information available, spread across multiple systems and communication channels

► Evolving inventory and maturity of fraud schemes and risks

► Three primary advantages over manual processes:

  ► Less time
  ► Lower cost
  ► More consistency


How can data science assist Internal Audit?

► Risk assessments
  ► Identify significant spend or activity across:
    ► Vendors
    ► Business units
    ► Locations
  ► Risk scoring

► Monitoring activities
  ► Ability to consider entire population
  ► Identify unusual patterns of activity / outliers
  ► Identify potential conflicts of interest

► Internal investigations
  ► Keyword searches and document review
  ► Identify transactions with similar attributes


Types of tools – open source vs. proprietary

► Open-source tools
  ► Advantages
    ► Free – licenses tend to be unrestrictive
    ► Support and tutorials available from large community of users/developers
    ► Can generally be customized to meet individual needs
    ► Large developer community means access to up-to-date innovations/algorithms
  ► Disadvantages
    ► UI may be weak or non-existent
    ► Might require in-house technical expertise and quality evaluation
  ► Examples
    ► R
    ► Weka
    ► Mahout / Hadoop

► Proprietary tools
  ► Advantages
    ► Support might be available as part of the licensing contract
    ► UI might be highly developed
    ► Tested and less likely to require in-house technical expertise
    ► Might use proprietary algorithms not available elsewhere
  ► Disadvantages
    ► Paid – licenses may be restrictive
    ► Customization may be limited, costly or impossible
    ► Might already be targeted at a specific use-case


Fuzzy matching


What is fuzzy matching?

► Attempts to emulate, in an automated fashion, the process a human uses to judge similarity:

Name | Address
Jeff D Walton Enterprises | 4798 Rand Blvd
Cyclex Inc | 1333 Jered St
PrintTech | 39 Oyster Lane
Maryanna Mason | 4978 Rand Blvd #4
Jeffrey Walton | P.O. Box 16926
JDW Corp | 4978 Ran d Boulevard

► Techniques
  ► Levenshtein distance
  ► Term-based similarity scoring


Levenshtein distance

► Compares two strings and determines the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other

► Examples
  ► “Jeff” and “Jeffrey” – distance of 3 (add “r”, “e”, “y”)
  ► “Jeffrey” and “Jeffley” – distance of 1 (substitute “l” for “r”)

► Limitations
  ► Not sensitive to common forms of language variation
  ► Not sensitive to word/letter frequency

► Variations
  ► Add additional types of edits
  ► Apply weights to edits
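The distance described on this slide can be computed with a standard dynamic-programming routine. The following is a minimal sketch, assuming the basic unweighted form with insertions, deletions and substitutions only; the function name `levenshtein` is illustrative and not taken from the presentation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the characters of a processed
    # so far (before the current one) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Jeff", "Jeffrey"))     # 3
print(levenshtein("Jeffrey", "Jeffley"))  # 1
```

The comparison is case-sensitive as written; lowercasing both strings first is one simple way to soften that, and the "apply weights to edits" variation would replace the constant costs of 1 with per-edit weights.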


Term-based similarity scoring

► Compares words instead of individual letters
  ► “535 Main Street” is treated as | 535 | Main | Street |
  ► Works best for longer pieces of data

► Represents each piece of data as a vector of words

► Two types of vectors:
  ► Term frequency (TF) vectors
    ► Counts the number of times each word appears in the data
    ► All words receive the same weight
  ► Term Frequency-Inverse Document Frequency (TF-IDF) vectors
    ► Weights words based on how frequently they occur across the entire data set, so that common words count for less than rare ones


Term-based similarity scoring - example

“535 Main Road” vs. “535 Apple Road, Apt 535”

► TF vector:

Data | 535 | Main | Road | Apple | Apt
535 Main Road | 1 | 1 | 1 | 0 | 0
535 Apple Road, Apt 535 | 2 | 0 | 1 | 1 | 1

► TF-IDF vector:
  ► Terms weighted based on occurrence in entire data set
  ► Would expect “Road” and “Apt” to get lower weights
  ► Would expect “Apple” to get higher weight
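The two example addresses can be turned into TF and TF-IDF vectors with a few lines of code. This is a sketch under stated assumptions: the slides do not say how vectors are compared or which IDF formula is used, so cosine similarity and a smoothed logarithmic IDF are assumptions, and the extra addresses in `records` are hypothetical filler added only so that the IDF weights have a data set to work from.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_vector(text: str) -> Counter:
    """Term frequency: how many times each term occurs in the text."""
    return Counter(tokenize(text))


def idf_weights(records: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency (an assumed formula): terms found
    in many records get lower weights than terms found in few records."""
    n = len(records)
    document_frequency = Counter()
    for record in records:
        document_frequency.update(set(tokenize(record)))
    return {term: math.log((1 + n) / (1 + df)) + 1
            for term, df in document_frequency.items()}


def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Scale each term count by the term's IDF weight."""
    return {term: count * idf.get(term, 0.0)
            for term, count in tf_vector(text).items()}


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a, b = "535 Main Road", "535 Apple Road, Apt 535"
print(tf_vector(b)["535"])                                     # 2
print(round(cosine_similarity(tf_vector(a), tf_vector(b)), 3))  # 0.655

# Hypothetical data set: the last two addresses are invented for illustration.
records = [a, b, "12 Oak Road", "77 Pine Road, Apt 2"]
idf = idf_weights(records)
print(idf["road"] < idf["apt"] < idf["apple"])                  # True
```

In this toy data set "Road" appears in every record and "Apple" in only one, so "Apple" ends up with the higher weight, which is the behavior the slide anticipates.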


Levenshtein distance – example

Name 1 | Name 2 | Distance score
Jeff D Walton | Jeffrey Walton | 3
JD Walton | Jeffrey Walton | 6
Jeffrey Walton | Jeff D Walton | 3
Maryanna Mason | Jeff D Walton | 11
Maryanna Mason | Jeffrey Walton | 11


Term-based similarity scoring – TF-IDF weight example

Vendor 1 | Vendor 2 | Similarity score
Maryanna Mason 4978 Rand Blvd #4 | JDW Corp 4978 ran d boulevard | 0.37
Jeff D Walton Enterprises 4798 Rand Blvd | Jeffrey Walton P.O. Box 16926 | 0.33
Jeff D Walton Enterprises 4798 Rand Blvd | Maryanna Mason 4978 Rand Blvd #4 | 0.28
Cyclex Inc 1333 Jered St | PrintTech 39 Oyster Lane | 0
Cyclex Inc 1333 Jered St | Jeffrey Walton P.O. Box 16926 | 0
PrintTech 39 Oyster Lane | JDW Corp 4978 ran d boulevard | 0
Maryanna Mason 4978 Rand Blvd #4 | Jeffrey Walton P.O. Box 16926 | 0
Jeffrey Walton P.O. Box 16926 | JDW Corp 4978 ran d boulevard | 0
PrintTech 39 Oyster Lane | Maryanna Mason 4978 Rand Blvd #4 | 0
Jeff D Walton Enterprises 4798 Rand Blvd | PrintTech 39 Oyster Lane | 0


Applications of fuzzy matching

► Identify potential conflicts of interest
  ► Comparison of name/address/phone/bank account information in vendor, employee and customer master files

► Data normalization/cleansing
  ► Identify multiple entries for the same vendor in the vendor master file

► Identify potential duplicate transactions
  ► Comparison of vendor name/description/amount/invoice number in AP
  ► Comparison of description/amount in T&E

► Find documents that are similar to a known set of “hot documents”


Text mining


What is text mining?

► Process of extracting relevant information from textual data

► When to use text mining:
  ► When valuable information is contained in unstructured data such as:
    ► Email / Instant messages
    ► Description fields
  ► To add structure to unstructured data

► Techniques
  ► Entity extraction
  ► Automated keyword expansion
  ► Sentiment analysis


Fraud triangle keywords

Rationalization

…I deserve it

…nobody will find out

…gray area

…they owe it to me

…everybody does it

…fix it later

…the company can afford it

…not hurting anyone

…won’t miss it

…don’t get paid enough

Incentive / Pressure

…make the number

…don’t let the auditor find out

…don’t leave a trail

…not comfortable

…why are we doing this

…pull out all the stops

…do not volunteer information

…want no part of this

…only a timing difference

…not ethical

Opportunity

…special fees

…client side storage

…off the books

…cash advance

…side commission

…backdate

…no inspection

…no receipt

…smooth earnings

…pull earnings forward
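One way to put keyword lists like these to work is to count hits per fraud triangle category and total them by sender. The sketch below is illustrative only: it uses a handful of the phrases listed above, plain substring matching, and invented sample messages. The names `score_message` and `score_by_sender` are hypothetical, and the presenters' actual scoring method is not described in the slides.

```python
from collections import Counter

# A few phrases per category, taken from the keyword lists above.
FRAUD_TRIANGLE_KEYWORDS = {
    "rationalization": ["i deserve it", "everybody does it", "fix it later"],
    "incentive_pressure": ["make the number", "don't leave a trail", "not ethical"],
    "opportunity": ["off the books", "backdate", "no receipt"],
}


def score_message(text: str) -> Counter:
    """Count keyword hits per fraud triangle category in one message."""
    # Lowercase and normalize curly apostrophes so "Don’t" matches "don't".
    normalized = text.lower().replace("’", "'")
    scores = Counter()
    for category, phrases in FRAUD_TRIANGLE_KEYWORDS.items():
        scores[category] = sum(normalized.count(phrase) for phrase in phrases)
    return scores


def score_by_sender(messages) -> dict:
    """Add up category scores per sender; messages is (sender, text) pairs."""
    totals = {}
    for sender, text in messages:
        totals.setdefault(sender, Counter()).update(score_message(text))
    return totals


# Invented sample messages, for illustration only.
sample_messages = [
    ("alice", "We need to make the number. Keep it off the books and backdate the invoice."),
    ("bob", "Lunch at noon?"),
]
for sender, scores in score_by_sender(sample_messages).items():
    print(sender, dict(scores))
```

Substring matching is deliberately loose here ("backdate" also matches "backdated"), which trades some false positives for broader coverage.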


Fraud triangle keywords (cont.)

[Chart: Comparison of individual scores]


Entity extraction

► Identifies people, places, locations, organizations and currency in unstructured data

► Applications:
  ► T&E
    ► Identify entertainment of government officials or organizations
    ► Identify location or date mismatches between fields
  ► GL
    ► Identify references to proper names or organizations in description fields


Entity extraction – example

The year 1941 marked a major turning point. Victor Z. Brink authored the first major book on internal auditing. And at the same time, John B. Thurston, internal auditor for the North American Company in New York, had been contemplating establishing an organization for internal auditors. He and Robert B. Milne had served together on an internal auditing subcommittee formed jointly by the Edison Electric Institute and the American Gas Association, and they agreed that further progress in bringing internal auditing to its proper level of recognition would be best made possible by forming an independent organization for internal auditors. When Brink’s book came to the attention of Thurston, the three men got together and found they had a mutual interest in furthering the role of internal auditing.

Tags: LOCATION, PERSON, ORGANIZATION, DATE
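Entity extraction is usually done with trained language models, which are beyond the scope of a short example. As a rough stand-in, the sketch below tags only two of the four entity types using hand-written rules; both regular expressions are hypothetical simplifications that happen to fit the passage above and would miss most real-world names and dates. The sample text is abridged from that passage.

```python
import re

# Deliberately narrow, hypothetical rules for illustration only.
PERSON_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+\b")  # "First M. Last"
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")                     # four-digit years


def extract_entities(text: str) -> dict[str, list[str]]:
    """Return the PERSON and DATE strings that the two rules find."""
    return {
        "PERSON": PERSON_PATTERN.findall(text),
        "DATE": YEAR_PATTERN.findall(text),
    }


sample = (
    "The year 1941 marked a major turning point. Victor Z. Brink authored "
    "the first major book on internal auditing. John B. Thurston and "
    "Robert B. Milne had served together on an internal auditing subcommittee."
)
print(extract_entities(sample))
```

Organizations and locations are left out because simple patterns handle them poorly; that gap is the main reason model-based extractors are preferred in practice.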


Automated keyword expansion

► Used to discover new keywords, expand coverage and reduce false positives by looking for:
  ► Misspelled words
  ► Synonyms
  ► Jargon/code names
  ► Grammatical expansions
  ► Translations
  ► Name expansions

► When to use automated keyword expansion:
  ► Anytime a keyword search is being performed, as most keywords are either over- or under-inclusive


Automated keyword expansion – example

Keyword | Potential expansions
Government fee | Government fees, govt fee, govt fees, government charge, government charges, govt charge, govt charges
One time payment | One time payment, one time payments, 1 time payment, 1 time payments, one time pmt, one time pmts, 1 time pmt, 1 time pmts, onetime payment, onetime payments, onetime pmt, onetime pmts
Pay on behalf of | Paid on behalf of, paying on behalf of, pays on behalf of, payment on behalf of, payments on behalf of, pmt on behalf of, pmts on behalf of
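Expansions of this kind can be generated mechanically by listing the interchangeable spellings of each part of a phrase and combining them. The following is a minimal sketch for the "One time payment" row; the `SLOTS` table and the `expand` function are hypothetical names, and a real tool would presumably derive the alternatives from dictionaries or the data itself rather than a hand-written list.

```python
from itertools import product

# Hypothetical substitution table: each slot lists interchangeable spellings.
SLOTS = [
    ["one time", "1 time", "onetime"],
    ["payment", "payments", "pmt", "pmts"],
]


def expand(slots: list[list[str]]) -> list[str]:
    """Return every combination of one alternative per slot."""
    return [" ".join(choice) for choice in product(*slots)]


expansions = expand(SLOTS)
print(len(expansions))  # 12
for phrase in expansions:
    print(phrase)
```

Three spellings of the first slot times four of the second give the twelve variants shown in the table.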


Sentiment analysis

► Identifies emotional states in free text fields

► Data sources include:
  ► Email / Instant messaging
  ► Internal surveys
  ► Social Media (e.g., Twitter, Facebook, Disqus)

► Sample emotions:
  ► Angry
  ► Confused
  ► Cursing
  ► Derogatory
  ► Frustrated
  ► Problem
  ► Secretive
  ► Suspicious
  ► Surprised
  ► Worried

► When to use sentiment analysis:
  ► Review of free text to identify potential documents of interest


Sentiment analysis – sample keywords

Suspicious

…I find it odd

…I find it strange

…I find it weird

…I find it fishy

…I am starting to suspect

…I don’t trust

…I do not trust

…I don’t really trust

…I do not really trust

…increasingly odd

…increasingly strange

…increasingly weird

…increasingly fishy

Secretive

…just between us

…just between you and me

…just between you and I

…strictly between us

…strictly between you and me

…strictly between you and I

…for your eyes only

…do not tell anyone

…do not forward

…don’t tell anyone

…don’t forward

…keep it under wraps

…keep this under wraps

Worried

…I have concerns

…I have real concerns

…I have serious concerns

…I have some concerns

…I have some real concerns

…I have some serious concerns

…I have deep concerns

…I have some deep concerns

…I remain deeply concerned

…I remain seriously concerned

…I remain really concerned

…I remain concerned
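Many of the sample phrases differ only by an optional word, so several of them can be folded into one regular expression per emotion. The sketch below covers a subset of the phrases listed above; the patterns and the `tag_sentiments` function are illustrative assumptions and not the presenters' implementation.

```python
import re

# Each pattern folds several of the listed phrasings into one expression.
SENTIMENT_PATTERNS = {
    "suspicious": re.compile(r"\bi (?:do not|don't)(?: really)? trust\b"),
    "secretive": re.compile(r"\b(?:just|strictly) between (?:us|you and (?:me|i))\b"),
    "worried": re.compile(r"\bi have (?:some )?(?:real |serious |deep )?concerns\b"),
}


def tag_sentiments(text: str) -> list[str]:
    """Return the emotion labels whose pattern occurs in the text."""
    # Lowercase and normalize curly apostrophes before matching.
    normalized = text.lower().replace("’", "'")
    return [label for label, pattern in SENTIMENT_PATTERNS.items()
            if pattern.search(normalized)]


print(tag_sentiments("Just between you and me, I don’t really trust these numbers."))
# ['suspicious', 'secretive']
print(tag_sentiments("I have some serious concerns about this."))
# ['worried']
```

Keyword matching of this kind flags wording rather than true emotional state, so hits are best treated as pointers to documents worth reading.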


Sentiment analysis – sample dashboard


Clustering analysis


What is clustering analysis?

► Groups similar items together into clusters to expedite review

► Basic approach:
  ► Represent each piece of data as a point in multidimensional space
  ► Find groups of items that are close together in space to form a cluster
  ► Fuzzy clustering allows each item to be a member of multiple clusters

► When to use clustering analysis:
  ► Characterize a large population of data
  ► Identify outliers and anomalies to target for further review
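The basic approach can be illustrated with k-means, one common clustering algorithm. The slides do not name a specific algorithm, so the choice of k-means is an assumption, as are the function names and the invented vendor points, which stand for (average transactions/day, average transaction amount) already scaled to comparable ranges.

```python
import math


def assign_clusters(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old)  # keep an empty cluster's centroid in place
        else:
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = assign_clusters(points, centroids)
    for _ in range(max_iterations):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Invented, pre-scaled vendor data for illustration only.
vendors = [(1.0, 1.2), (1.2, 0.9), (0.8, 1.1), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels, centroids = k_means(vendors, initial_centroids=[(0.0, 0.0), (6.0, 6.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Fixed starting centroids keep the example deterministic; this version assigns each item to exactly one cluster, unlike the fuzzy clustering mentioned above.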


Clustering analysis – example

The following chart represents vendor data in terms of average transaction amount and average transactions/day.

[Chart: x-axis is average transactions/day; y-axis is average transaction amount. Annotated groups: high-use vendors with highest average spend per transaction; infrequent vendors with high average spend per transaction; outliers that might be worth better understanding.]


Applications of clustering analysis

► T&E
  ► Identify outliers by business unit, location, title (e.g., Business Development Manager)
  ► Identify spend in unusual locations
  ► Identify fluctuations in T&E spend throughout a period of time

► Inventory
  ► Identify stores/departments/sales people with the highest amount of returns/shrinkage/exchanges

► Sales
  ► Identify stores with the highest/lowest number of sales


Automated data classification


What is automated data classification?

► Attempts to emulate the process a human uses to classify a piece of data

► Classifiers may include:
  ► In-policy vs. Non-compliant vs. Needs further review
  ► Responsive vs. Non-responsive
  ► English vs. Foreign language

► When to use automated data classification:
  ► Manual review processes
  ► Large data volumes
  ► Real-time/periodic monitoring


Automated data classification – process overview

Classifier development

1. Develop gold standard dataset
  ► Define categories to be classified
  ► If data is not already labeled, train expert reviewers and label sample

2. Attribute selection
  ► Identify and extract predictive attributes
  ► Key step in assuring quality of classifier

3. Select algorithm and train model
  ► Many algorithms are available, including decision trees, SVMs, regression models, and Bayesian models

4. Evaluate model
  ► Use labeled data sample as a test set
  ► Measure model’s precision and recall

5. Assess performance and iterate if necessary
  ► Performance can be improved through additional training data or new attributes
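The evaluation step relies on precision and recall, which can be computed directly from a labeled test set and the model's predictions. A minimal sketch follows; the labels and predictions are invented for illustration, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall for one class of interest.

    precision = correctly flagged items / all items the model flagged
    recall    = correctly flagged items / all items that should be flagged
    """
    true_positives = sum(1 for a, p in zip(actual, predicted)
                         if a == positive and p == positive)
    predicted_positives = sum(1 for p in predicted if p == positive)
    actual_positives = sum(1 for a in actual if a == positive)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Invented gold-standard labels and model output, for illustration only.
actual = ["non-compliant", "in-policy", "non-compliant", "in-policy", "non-compliant"]
predicted = ["non-compliant", "non-compliant", "in-policy", "in-policy", "non-compliant"]

precision, recall = precision_recall(actual, predicted, positive="non-compliant")
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Low precision means reviewers spend time on false alarms, while low recall means non-compliant items slip through; which matters more depends on the review being supported.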


Questions?


Thank you

Anna Cueni, Manager

Forensic Technology & Dispute [email protected]

415.894.8826

Kate Eckhart, Manager

Fraud Investigation & Dispute [email protected]

415.894.4365