Institute of Internal Auditors
Puget Sound Chapter
Fourth Annual Fraud Conference
Advanced Data Analytics: Introduction to data science techniques and applications
March 28, 2013
Anna Cueni and Kate Eckhart
Fraud Investigation & Dispute Services
Page 2 Advanced Data Analytics: Introduction to data science techniques and applications
Agenda
► Introduction
► Fuzzy matching
► Text mining
► Clustering
► Automated data classification
Introduction
What is data science?
► Wikipedia: “Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.”
► Uses techniques from mathematics, statistics, linguistics, and computer science
► Examples of available data sources:
  ► Structured data
    ► Financial records
    ► T&E
    ► Purchase orders
    ► Inventory records
    ► Employee and vendor lists
  ► Unstructured data
    ► Description fields
    ► Email / Instant messages
    ► Text/Mobile messages
Why is data science important?
► Increasing variety, complexity and volume of data
► Near real-time information available, spread across multiple systems and communication channels
► Evolving inventory and maturity of fraud schemes and risks
► Three primary advantages over manual processes:
  ► Less time
  ► Lower cost
  ► More consistency
How can data science assist Internal Audit?
► Risk assessments
  ► Identify significant spend or activity across:
    ► Vendors
    ► Business units
    ► Locations
  ► Risk scoring
► Monitoring activities
  ► Ability to consider entire population
  ► Identify unusual patterns of activity / outliers
  ► Identify potential conflicts of interest
► Internal investigations
  ► Keyword searches and document review
  ► Identify transactions with similar attributes
Types of tools – open source vs. proprietary
► Open-source tools
  ► Advantages
    ► Free – licenses tend to be unrestrictive
    ► Support and tutorials available from large community of users/developers
    ► Can generally be customized to meet individual needs
    ► Large developer community means access to up-to-date innovations/algorithms
  ► Disadvantages
    ► UI may be weak or non-existent
    ► Might require in-house technical expertise and quality evaluation
  ► Examples
    ► R
    ► Weka
    ► Mahout / Hadoop
► Proprietary tools
  ► Advantages
    ► Support might be available as part of the licensing contract
    ► UI might be highly developed
    ► Tested and less likely to require in-house technical expertise
    ► Might use proprietary algorithms not available elsewhere
  ► Disadvantages
    ► Paid – licenses may be restrictive
    ► Customization may be limited, costly or impossible
    ► Might already be targeted at a specific use-case
Fuzzy matching
What is fuzzy matching?
► Attempts to emulate the process a human uses to judge similarity in an automated fashion:
  Name                       Address
  Jeff D Walton Enterprises  4798 Rand Blvd
  Cyclex Inc                 1333 Jered St
  PrintTech                  39 Oyster Lane
  Maryanna Mason             4978 Rand Blvd #4
  Jeffrey Walton             P.O. Box 16926
  JDW Corp                   4978 Ran d Boulevard
► Techniques
  ► Levenshtein distance
  ► Term-based similarity scoring
Page 10 Advanced Data Analytics: Introduction to data science techniques and applications
Levenshtein distance
► Compares two strings and determines the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other
► Examples
  ► “Jeff” and “Jeffrey” – distance of 3 (add “r”, “e”, “y”)
  ► “Jeffrey” and “Jeffley” – distance of 1 (substitute “l” for “r”)
► Limitations
  ► Not sensitive to common forms of language variation
  ► Not sensitive to word/letter frequency
► Variations
  ► Add additional types of edits
  ► Apply weights to edits
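The edit-counting idea above can be sketched with the standard dynamic-programming approach. This is a minimal illustration, not the tool used for the presentation; the function name `levenshtein` is our own, and the comparison is case-sensitive with every edit costing 1 (no weights or extra edit types).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Jeff", "Jeffrey"))     # 3
print(levenshtein("Jeffrey", "Jeffley"))  # 1
```

Keeping only two rows of the table keeps memory proportional to the shorter dimension, which matters when comparing every pair of names in a large vendor file.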
Page 11 Advanced Data Analytics: Introduction to data science techniques and applications
Term-based similarity scoring
► Compares words instead of individual letters
  ► “535 Main Street” is treated as | 535 | Main | Street |
  ► Works best for longer pieces of data
► Represents each piece of data as a vector of words
► Two types of vectors:
  ► Term frequency (TF) vectors
    ► Counts the number of times each word appears in the data
    ► All words receive the same weight
  ► Term Frequency-Inverse Document Frequency (TF-IDF) vectors
    ► Weights words based on how frequently they occur across the entire data set
Page 12 Advanced Data Analytics: Introduction to data science techniques and applications
Term-based similarity scoring - example
“535 Main Road” vs. “535 Apple Road, Apt 535”
► TF vector:
  Data                      535  Main  Road  Apple  Apt
  535 Main Road               1     1     1      0    0
  535 Apple Road, Apt 535     2     0     1      1    1
► TF-IDF vector:
  ► Terms weighted based on occurrence in entire data set
  ► Would expect “Road” and “Apt” to get lower weights
  ► Would expect “Apple” to get higher weight
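The TF vectors in this example can be built in a few lines. The sketch below also compares them with cosine similarity; the slides do not say which similarity measure was used, so cosine is an assumption (it is a common choice for word vectors), and the names `tf_vector` and `cosine_similarity` are ours.

```python
import math
import re
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Count how many times each word appears (case-insensitive,
    punctuation ignored)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0 to 1)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a = tf_vector("535 Main Road")
b = tf_vector("535 Apple Road, Apt 535")
print(b["535"], b["main"], b["road"])  # 2 0 1
similarity = cosine_similarity(a, b)   # 3 / sqrt(3 * 7), roughly 0.65
```

Because every word carries the same weight here, the repeated “535” dominates the score; that is the gap TF-IDF weighting is meant to close.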
Page 13 Advanced Data Analytics: Introduction to data science techniques and applications
Levenshtein distance – example
Name 1          Name 2          Distance score
Jeff D Walton   Jeffrey Walton   3
JD Walton       Jeffrey Walton   6
Jeffrey Walton  Jeff D Walton    3
Maryanna Mason  Jeff D Walton   11
Maryanna Mason  Jeffrey Walton  11
Page 14 Advanced Data Analytics: Introduction to data science techniques and applications
Term-based similarity scoring – TF-IDF weight example
Vendor 1                                  Vendor 2                          Similarity score
Maryanna Mason 4978 Rand Blvd #4          JDW Corp 4978 ran d boulevard     0.37
Jeff D Walton Enterprises 4798 Rand Blvd  Jeffrey Walton P.O. Box 16926     0.33
Jeff D Walton Enterprises 4798 Rand Blvd  Maryanna Mason 4978 Rand Blvd #4  0.28
Cyclex Inc 1333 Jered St                  PrintTech 39 Oyster Lane          0
Cyclex Inc 1333 Jered St                  Jeffrey Walton P.O. Box 16926     0
PrintTech 39 Oyster Lane                  JDW Corp 4978 ran d boulevard     0
Maryanna Mason 4978 Rand Blvd #4          Jeffrey Walton P.O. Box 16926     0
Jeffrey Walton P.O. Box 16926             JDW Corp 4978 ran d boulevard     0
PrintTech 39 Oyster Lane                  Maryanna Mason 4978 Rand Blvd #4  0
Jeff D Walton Enterprises 4798 Rand Blvd  PrintTech 39 Oyster Lane          0
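A rough sketch of how TF-IDF scores like these could be produced for the six vendor records is shown below. The tokenization, the `log(N / df)` weighting and the cosine comparison are all assumptions on our part, so the numbers it produces will not match the scores on the slide; what it does reproduce is the pattern that records sharing no words score exactly 0.

```python
import math
import re
from collections import Counter

VENDORS = [
    "Jeff D Walton Enterprises 4798 Rand Blvd",
    "Cyclex Inc 1333 Jered St",
    "PrintTech 39 Oyster Lane",
    "Maryanna Mason 4978 Rand Blvd #4",
    "Jeffrey Walton P.O. Box 16926",
    "JDW Corp 4978 ran d boulevard",
]


def tokenize(text):
    """Lower-case the record and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(records):
    """One {word: weight} dict per record; rarer words weigh more."""
    token_lists = [tokenize(r) for r in records]
    n = len(records)
    # Document frequency: number of records containing each word
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log(n / d) for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tfidf_vectors(VENDORS)
scores = {(i, j): cosine(vectors[i], vectors[j])
          for i in range(len(VENDORS))
          for j in range(i + 1, len(VENDORS))}

# List the pairs from most to least similar
for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{s:.2f}  {VENDORS[i]}  |  {VENDORS[j]}")
```

Note that “Rand” and “ran d”, or “Blvd” and “boulevard”, do not match under plain word comparison, which is one reason to normalize addresses or to combine term-based scoring with a letter-level measure.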
Page 15 Advanced Data Analytics: Introduction to data science techniques and applications
Applications of fuzzy matching
► Identify potential conflicts of interest
  ► Comparison of name/address/phone/bank account information in vendor, employee and customer master files
► Data normalization/cleansing
  ► Identify multiple entries for the same vendor in the vendor master file
► Identify potential duplicate transactions
  ► Comparison of vendor name/description/amount/invoice number in AP
  ► Comparison of description/amount in T&E
► Find documents that are similar to a known set of “hot documents”
Page 16 Advanced Data Analytics: Introduction to data science techniques and applications
Text mining
Page 17 Advanced Data Analytics: Introduction to data science techniques and applications
What is text mining?
► Process of extracting relevant information from textual data
► When to use text mining:
  ► When valuable information is contained in unstructured data such as:
    ► Email / Instant messages
    ► Description fields
  ► To add structure to unstructured data
► Techniques
  ► Entity extraction
  ► Automated keyword expansion
  ► Sentiment analysis
Page 18 Advanced Data Analytics: Introduction to data science techniques and applications
Fraud triangle keywords
Rationalization
…I deserve it
…nobody will find out
…gray area
…they owe it to me
…everybody does it
…fix it later
…the company can afford it
…not hurting anyone
…won’t miss it
…don’t get paid enough
Incentive / Pressure
…make the number
…don’t let the auditor find out
…don’t leave a trail
…not comfortable
…why are we doing this
…pull out all the stops
…do not volunteer information
…want no part of this
…only a timing difference
…not ethical
Opportunity
…special fees
…client side storage
…off the books
…cash advance
…side commission
…backdate
…no inspection
…no receipt
…smooth earnings
…pull earnings forward
Page 19 Advanced Data Analytics: Introduction to data science techniques and applications
Fraud triangle keywords (cont.)
[Chart: Comparison of individual scores]
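One simple way to arrive at per-individual scores is to count keyword hits in each person's messages by fraud triangle category. The sketch below assumes that scoring method; the slides do not describe how the scores were actually calculated. It uses a small subset of the keywords listed earlier, and the sample messages and author names are invented purely for illustration.

```python
from collections import Counter

# Subset of the fraud triangle keywords, lower-cased
FRAUD_TRIANGLE = {
    "rationalization": ["i deserve it", "nobody will find out", "gray area"],
    "incentive_pressure": ["make the number", "don't leave a trail",
                           "only a timing difference"],
    "opportunity": ["off the books", "cash advance", "backdate"],
}


def score_messages(messages):
    """messages: iterable of (author, text) pairs.
    Returns {author: Counter mapping category -> number of keyword hits}."""
    scores = {}
    for author, text in messages:
        # Lower-case and straighten curly apostrophes before matching
        normalized = text.lower().replace("\u2019", "'")
        counts = scores.setdefault(author, Counter())
        for category, phrases in FRAUD_TRIANGLE.items():
            counts[category] += sum(normalized.count(p) for p in phrases)
    return scores


# Hypothetical messages, invented for this example
sample = [
    ("alice", "Keep it off the books, nobody will find out."),
    ("bob", "Lunch at noon?"),
    ("alice", "We need to make the number this quarter."),
]
scores = score_messages(sample)
print(scores["alice"])
print(scores["bob"])
```

Raw counts favor people who write more messages, so in practice the hits would likely be normalized, for example per hundred messages, before individuals are compared.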
Page 20 Advanced Data Analytics: Introduction to data science techniques and applications
Entity extraction
► Identifies people, locations, organizations, dates and currency amounts in unstructured data
► Applications:
  ► T&E
    ► Identify entertainment of government officials or organizations
    ► Identify location or date mismatches between fields
  ► GL
    ► Identify references to proper names or organizations in description fields
Page 21 Advanced Data Analytics: Introduction to data science techniques and applications
Entity extraction – example
The year 1941 marked a major turning point. Victor Z. Brink authored the first major book on internal auditing. And at the same time, John B. Thurston, internal auditor for the North American Company in New York, had been contemplating establishing an organization for internal auditors. He and Robert B. Milne had served together on an internal auditing subcommittee formed jointly by the Edison Electric Institute and the American Gas Association, and they agreed that further progress in bringing internal auditing to its proper level of recognition would be best made possible by forming an independent organization for internal auditors. When Brink’s book came to the attention of Thurston, the three men got together and found they had a mutual interest in furthering the role of internal auditing.
Tags: LOCATION, PERSON, ORGANIZATION, DATE
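Entity extraction is usually done with a trained language model. As a much simpler stand-in, the sketch below tags only two of the entity types with hand-written patterns: names written as “First M. Last” and four-digit years. The patterns and the function name `extract_entities` are our own assumptions and would miss most real-world names, locations and organizations.

```python
import re

# "First M. Last", e.g. "Victor Z. Brink" (assumed pattern, far from complete)
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+\b")
# Four-digit years in the 1900s or 2000s
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_entities(text):
    """Return the matched PERSON and DATE strings in order of appearance."""
    return {"PERSON": PERSON.findall(text), "DATE": YEAR.findall(text)}


sample = ("The year 1941 marked a major turning point. Victor Z. Brink "
          "authored the first major book on internal auditing. And at the "
          "same time, John B. Thurston, internal auditor for the North "
          "American Company in New York, had been contemplating "
          "establishing an organization for internal auditors.")
entities = extract_entities(sample)
print(entities["PERSON"])  # ['Victor Z. Brink', 'John B. Thurston']
print(entities["DATE"])    # ['1941']
```

Patterns like these cannot tell that “North American Company” is an organization or that “New York” is a location, which is why dedicated entity extraction tools are used for the T&E and GL applications above.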
Page 22 Advanced Data Analytics: Introduction to data science techniques and applications
Automated keyword expansion
► Used to discover new keywords, expand coverage and reduce false positives by looking for:
  ► Misspelled words
  ► Synonyms
  ► Jargon/code names
  ► Grammatical expansions
  ► Translations
  ► Name expansions
► When to use automated keyword expansion:
  ► Anytime a keyword search is being performed, as most keywords are either over- or under-inclusive
Page 23 Advanced Data Analytics: Introduction to data science techniques and applications
Automated keyword expansion – example
Keyword           Potential expansions
Government fee    Government fees, govt fee, govt fees, government charge, government charges, govt charge, govt charges
One time payment  One time payment, one time payments, 1 time payment, 1 time payments, one time pmt, one time pmts, 1 time pmt, 1 time pmts, onetime payment, onetime payments, onetime pmt, onetime pmts
Pay on behalf of  Paid on behalf of, paying on behalf of, pays on behalf of, payment on behalf of, payments on behalf of, pmt on behalf of, pmts on behalf of
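Expansions such as these can be generated by swapping each word for its known variants and combining the results. The sketch below assumes a small hand-built variant table (`VARIANTS`, a hypothetical name); a real tool would also draw on spelling, synonym and translation resources, and this version does not produce the joined “onetime” forms.

```python
from itertools import product

# Hypothetical variant table: each word maps to the forms to try
VARIANTS = {
    "one": ["one", "1"],
    "payment": ["payment", "payments", "pmt", "pmts"],
}


def expand(keyword):
    """Return every combination of per-word variants for a keyword."""
    options = [VARIANTS.get(word, [word]) for word in keyword.lower().split()]
    return [" ".join(combo) for combo in product(*options)]


expansions = expand("One time payment")
print(len(expansions))  # 8
print(expansions[0])    # one time payment
print(expansions[-1])   # 1 time pmts
```

The number of expansions grows multiplicatively with the number of words that have variants, so long keywords are worth reviewing before they are run as searches.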
Page 24 Advanced Data Analytics: Introduction to data science techniques and applications
Sentiment analysis
► Identifies emotional states in free text fields
► Data sources include:
  ► Email / Instant messaging
  ► Internal surveys
  ► Social media (e.g., Twitter, Facebook, Disqus)
► Sample emotions:
  ► Angry
  ► Confused
  ► Cursing
  ► Derogatory
  ► Frustrated
  ► Problem
  ► Secretive
  ► Suspicious
  ► Surprised
  ► Worried
► When to use sentiment analysis:
  ► Review of free text to identify potential documents of interest
Page 25 Advanced Data Analytics: Introduction to data science techniques and applications
Sentiment analysis – sample keywords
Suspicious
…I find it odd
…I find it strange
…I find it weird
…I find it fishy
…I am starting to suspect
…I don’t trust
…I do not trust
…I don’t really trust
…I do not really trust
…increasingly odd
…increasingly strange
…increasingly weird
…increasingly fishy
Secretive
…just between us
…just between you and me
…just between you and I
…strictly between us
…strictly between you and me
…strictly between you and I
…for your eyes only
…do not tell anyone
…do not forward
…don’t tell anyone
…don’t forward
…keep it under wraps
…keep this under wraps
Worried
…I have concerns
…I have real concerns
…I have serious concerns
…I have some concerns
…I have some real concerns
…I have some serious concerns
…I have deep concerns
…I have some deep concerns
…I remain deeply concerned
…I remain seriously concerned
…I remain really concerned
…I remain concerned
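Many of the sample keywords differ only in contractions (“don’t” vs. “do not”). A rough sketch of flagging documents by emotion is shown below; it normalizes the text first so that one stored phrase covers both spellings. The phrase subset, the normalization rules and the function names are assumptions for illustration, not the method behind the sample dashboard.

```python
import re

# Subset of the sample keywords, stored in normalized form
SENTIMENT_PHRASES = {
    "suspicious": ["i find it odd", "i find it fishy", "i do not trust"],
    "secretive": ["just between us", "do not forward", "keep this under wraps"],
    "worried": ["i have concerns", "i have serious concerns",
                "i remain concerned"],
}


def normalize(text):
    """Lower-case, straighten apostrophes, expand "don't", squeeze spaces."""
    text = text.lower().replace("\u2019", "'")
    text = text.replace("don't", "do not")
    return re.sub(r"\s+", " ", text)


def tag_sentiments(text):
    """Return the sorted list of emotions whose phrases appear in the text."""
    normalized = normalize(text)
    return sorted(label for label, phrases in SENTIMENT_PHRASES.items()
                  if any(p in normalized for p in phrases))


print(tag_sentiments("I don\u2019t trust these numbers. Please don't forward."))
# ['secretive', 'suspicious']
```

Simple substring matching will also fire inside longer words or quoted text, so flagged documents are best treated as candidates for review rather than conclusions.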
Page 26 Advanced Data Analytics: Introduction to data science techniques and applications
Sentiment analysis – sample dashboard
Page 27 Advanced Data Analytics: Introduction to data science techniques and applications
Clustering analysis
Page 28 Advanced Data Analytics: Introduction to data science techniques and applications
What is clustering analysis?
► Groups similar items together into clusters to expedite review
► Basic approach:
  ► Represent each piece of data as a point in multidimensional space
  ► Find groups of items that are close together in space to form a cluster
  ► Fuzzy clustering allows each item to be a member of multiple clusters
► When to use clustering analysis:
  ► Characterize a large population of data
  ► Identify outliers and anomalies to target for further review
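The basic approach above can be sketched with k-means, one common clustering algorithm. The slides do not name a specific algorithm, so k-means is our assumption, and this version assigns each item to exactly one cluster (it is not fuzzy clustering). The data points are invented, already-scaled values and the starting centroids are simply the first k points.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical vendors as (scaled transactions/day, scaled average amount)
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.9, 1.1), (8.8, 9.2), (9.1, 8.9)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Attributes measured on very different scales, such as dollar amounts and daily counts, should be scaled before clustering; otherwise the larger-valued attribute dominates the distance.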
Page 29 Advanced Data Analytics: Introduction to data science techniques and applications
Clustering analysis – example
The following chart represents vendor data in terms of average transaction amount and average transactions/day
[Chart: y-axis: Average transaction amount; x-axis: Average transactions/day]
Chart annotations:
► High-use vendors with highest average spend per transaction
► Outliers that might be worth better understanding
► Infrequent vendors with high average spend per transaction
Page 30 Advanced Data Analytics: Introduction to data science techniques and applications
Applications of clustering analysis
► T&E
  ► Identify outliers by business unit, location, title (e.g., Business Development Manager)
  ► Identify spend in unusual locations
  ► Identify fluctuations in T&E spend throughout a period of time
► Inventory
  ► Identify stores/departments/sales people with the highest amount of returns/shrinkage/exchanges
► Sales
  ► Identify stores with the highest/lowest number of sales
Page 31 Advanced Data Analytics: Introduction to data science techniques and applications
Automated data classification
Page 32 Advanced Data Analytics: Introduction to data science techniques and applications
What is automated data classification?
► Attempts to emulate the process a human uses to classify a piece of data
► Classifiers may include:
  ► In-policy vs. Non-compliant vs. Needs further review
  ► Responsive vs. Non-responsive
  ► English vs. Foreign language
► When to use automated data classification:
  ► Manual review processes
  ► Large data volumes
  ► Real-time/periodic monitoring
Page 33 Advanced Data Analytics: Introduction to data science techniques and applications
Automated data classification – process overview
Classifier development
1. Develop gold standard dataset
  ► Define categories to be classified
  ► If data is not already labeled, train expert reviewers and label sample
2. Attribute selection
  ► Identify and extract predictive attributes
  ► Key step in assuring quality of classifier
3. Select algorithm and train model
  ► Many algorithms are available, including decision trees, SVMs, regression models, and Bayesian models
4. Evaluate model
  ► Use labeled data sample as a test set
  ► Measure model’s precision and recall
5. Assess performance and iterate if necessary
  ► Performance can be improved through additional training data or new attributes
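For the evaluation step, precision is the share of items the model flagged that were truly in the category, and recall is the share of items truly in the category that the model flagged. A minimal sketch follows; the labels in the example are made up to show the arithmetic, and the function name `precision_recall` is our own.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall for one category, given reviewer labels
    (actual) and model output (predicted) for the same test items."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if p == positive and a == positive)
    false_pos = sum(1 for a, p in pairs if p == positive and a != positive)
    false_neg = sum(1 for a, p in pairs if p != positive and a == positive)
    flagged = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical test set: expert labels vs. classifier output
actual = ["non-compliant", "in-policy", "non-compliant",
          "in-policy", "non-compliant", "in-policy"]
predicted = ["non-compliant", "non-compliant", "in-policy",
             "in-policy", "non-compliant", "in-policy"]

precision, recall = precision_recall(actual, predicted, "non-compliant")
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Low precision means reviewers waste time on false alarms, while low recall means non-compliant items slip through; which matters more depends on the monitoring goal.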
Page 34 Advanced Data Analytics: Introduction to data science techniques and applications
Questions?
Page 35 Advanced Data Analytics: Introduction to data science techniques and applications
Thank you
Anna Cueni, Manager
Forensic Technology & Dispute [email protected]
415.894.8826
Kate Eckhart, Manager
Fraud Investigation & Dispute [email protected]
415.894.4365