Large Scale Entity Resolution: Tools for Finding the Important Needle in the Haystack. Global Directions Conference 2013
Large Scale Entity Resolution Tools for Finding the Important Needle in the Haystack
Mary Galvin, Technical Consultant, LexisNexis
Kodak Global Directions ‘13
Strategies for Entity Resolution to Reveal Hidden Connections
Semantics
1. ‘Entity’: A thing with distinct and independent existence, with enough attributes to set it uniquely apart from anything else.
2. ‘Entity Resolution’: The processes and methodologies used to uncover instances where the same ‘entity’ is referred to across disparate sources of digital information (e.g., records, news stories, blogs/microblogs).
Large Scale Entity Resolution Use Case #1
Scenario
Healthcare insurers need better analytics to identify drug-seeking behavior and schemes that recruit members to use their membership fraudulently.
Groups of people collude to source scheduled drugs through multiple members so that rules-based systems do not detect them. Providers recruit members so that they can claim, and escalate, services that are never rendered.
Result
The analysis detected social groups that are sourcing Vicodin and other scheduled drugs. It also identified the prescribers and pharmacies involved, which helps the insurer focus investigations and intervene strategically to mitigate risk.
Large Scale Entity Resolution Use Case #2
Non-Social: Almost every prescription is in social isolation (> 96%)
Social: Large % of prescriptions show socialization (long tail)
Large Scale Entity Resolution Challenges
1. Permanence/Persistence
2. Transparency
3. Spatial and Temporal Considerations
4. Source Credibility Considerations
5. Completeness
Entity Resolution Methodologies
Rules-Based:
− Based on logic (IF/ELSE or SWITCH statements)
− Example: If field values 1, 2 and 5 from source ‘a’ are equivalent to values 3, 6 and 7 in source ‘b’, respectively, then declare a match.
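The rule in the example above can be sketched in a few lines of Python. This is only an illustration: the field positions come from the slide, but the sample records, the `FIELD_PAIRS` name and the choice to treat "equivalent" as a case-insensitive, whitespace-trimmed comparison are assumptions.

```python
# Field positions from the slide's example: fields 1, 2 and 5 of source 'a'
# are compared with fields 3, 6 and 7 of source 'b' (1-based positions).
FIELD_PAIRS = {1: 3, 2: 6, 5: 7}


def normalize(value):
    """Assumed notion of 'equivalent': ignore case and surrounding whitespace."""
    return str(value).strip().lower()


def rules_match(record_a, record_b, field_pairs=FIELD_PAIRS):
    """Declare a match only if every paired field is equivalent (IF/ELSE logic)."""
    for pos_a, pos_b in field_pairs.items():
        if normalize(record_a[pos_a - 1]) != normalize(record_b[pos_b - 1]):
            return False
    return True


# Made-up records, for illustration only.
record_a = ["Mary", "Smith", "F", "1970", "20001"]
record_b = ["x", "y", "mary", "z", "w", "SMITH ", "20001"]
print(rules_match(record_a, record_b))
```

Every new source layout needs its own entry in a table like `FIELD_PAIRS`, which hints at why the maintenance burden of rules grows with the number of sources.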
Statistics-Based:
− Based on computation of weights and thresholds; a match is declared only when the sum of all weights surpasses a certain threshold
− Example: Fields 1 through 4 of a record in Source A are compared with fields 1 through 4 of a record in Source B. Each comparison yields a field score based on specificity values, and the sum of the individual field scores is compared against the threshold (Threshold = 29).
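A minimal sketch of the weight-and-threshold idea follows. The threshold of 29 is the value shown on the slide; the field names and the individual weights are hypothetical stand-ins for specificity-based scores, and exact equality is used as the simplest possible field comparison.

```python
THRESHOLD = 29  # threshold value shown on the slide

# Hypothetical per-field weights standing in for specificity-based scores.
FIELD_WEIGHTS = {
    "first_name": 8.0,
    "last_name": 11.0,
    "birth_year": 6.0,
    "zip_code": 9.0,
}


def match_score(record_a, record_b, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields whose values agree in both records."""
    score = 0.0
    for field, weight in weights.items():
        value_a = record_a.get(field)
        value_b = record_b.get(field)
        # A missing value contributes nothing to the score.
        if value_a is not None and value_a == value_b:
            score += weight
    return score


def is_match(record_a, record_b, threshold=THRESHOLD):
    """Declare a match only when the summed score surpasses the threshold."""
    return match_score(record_a, record_b) > threshold


# Made-up records, for illustration only.
source_a = {"first_name": "mary", "last_name": "smith",
            "birth_year": 1970, "zip_code": "20001"}
source_b = dict(source_a)
source_c = dict(source_a, birth_year=None)

print(match_score(source_a, source_b), is_match(source_a, source_b))
print(match_score(source_a, source_c), is_match(source_a, source_c))
```

With these assumed weights, agreement on all four fields clears the threshold, while losing a single field drops the pair just below it.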
Choosing the Right Methodology
Rules-Based:
− Pros: High precision; optimal for small datasets
− Cons: Heavy maintenance required; performance degradation as rule set and datasets increase; re-writing of rules required as additional languages are present
Statistics-Based:
− Pros: Language agnostic; entity agnostic; optimal for large datasets
− Cons: Overkill for small datasets
Why Statistically-Based Systems Excel
“The advantage of this [statistical] approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.”
- Ray Kurzweil, How To Create A Mind (in reference to using statistical approaches for speech recognition technology)
Consideration #1: “Dirty” Data
US Consumer Data: Frequent Zip Code Patterns
US Consumer Data: Frequent Phone Number Values
International Cargo Shipping Data: Shipper Names
Consideration #2: Incomplete Data
Null Field Value Scenarios
Partial Field Value Scenarios
Cluster #   F Name   M Name   L Name
1           Sardar   Khan     Niazi
2           S.       K.       Niazi
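One way to handle null and partial field values is to give each field comparison more than two outcomes. The sketch below is an assumption about how that could look, not the method used in the presentation: the function name and the four outcome labels are hypothetical, and a "partial" agreement is defined narrowly as a single initial matching the first letter of a full value.

```python
def compare_name_part(a, b):
    """Classify one name field as 'null', 'exact', 'partial' or 'mismatch'."""
    if not a or not b:
        return "null"  # at least one side has no value to compare
    a_clean = a.strip().rstrip(".").lower()
    b_clean = b.strip().rstrip(".").lower()
    if not a_clean or not b_clean:
        return "null"
    if a_clean == b_clean:
        return "exact"
    shorter, longer = sorted((a_clean, b_clean), key=len)
    # An initial such as "S." partially agrees with "Sardar".
    if len(shorter) == 1 and longer.startswith(shorter):
        return "partial"
    return "mismatch"


# The two clusters from the table above.
cluster_1 = ("Sardar", "Khan", "Niazi")
cluster_2 = ("S.", "K.", "Niazi")

outcomes = [compare_name_part(x, y) for x, y in zip(cluster_1, cluster_2)]
print(outcomes)
```

In a statistics-based system, a "partial" outcome would typically earn a smaller weight than "exact", and a "null" outcome would neither add to nor subtract from the score.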
Consideration #3: Semi-Structured Data
International Postal Addresses
OFFICE # 406 4TH FLOOR SUNNY PLAZA HASRAT MOHANI ROAD I.I
101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FA,KARACHI
E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
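Because the address variants above do not share a fixed field layout, one simple option is to compare them as sets of tokens. The sketch below uses Jaccard token overlap, which is a technique chosen here for illustration and not one named in the slides; the function names are hypothetical.

```python
import re


def tokens(address):
    """Upper-case the address and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[A-Z0-9]+", address.upper()))


def jaccard(address_a, address_b):
    """Share of distinct tokens that the two addresses have in common."""
    tokens_a, tokens_b = tokens(address_a), tokens(address_b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Addresses copied from the slide.
office = "OFFICE # 406 4TH FLOOR SUNNY PLAZA HASRAT MOHANI ROAD I.I"
garden = "101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"
gardens = "E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"

print(round(jaccard(garden, gardens), 3))
print(jaccard(office, garden))
```

Plain token overlap treats GARDEN and GARDENS as different tokens and cannot recover the truncated SHAHRAH-E-FA, so a production system would need fuzzier token comparison than this sketch provides.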
Consideration #4: Semi-Structured Data
US Postal Addresses
                      Street Name        City Name     State Name
                      939 JEFFERSON ST   Bakersfield   California
                      110 E ELM ST       Ashland       North Carolina
                      426 NEW YORK AVE   Newton        Ohio
                      212 E MAIN ST      Brookfield    Connecticut
                      1900 EAGLE DR      Middletown    Maryland
Average Specificity:  19.63              11.12         5.03
Location: 14.03
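The slides do not define how specificity is calculated. One common way to quantify it, assumed here purely for illustration, is information content: a value's specificity is the negative base-2 logarithm of its relative frequency, so rare values score higher than common ones. The function names and the sample data below are hypothetical, and the resulting numbers are not comparable to the averages shown on the slide.

```python
import math
from collections import Counter


def specificity(values):
    """Map each distinct value to -log2 of its relative frequency (assumed definition)."""
    counts = Counter(values)
    total = len(values)
    return {value: -math.log2(count / total) for value, count in counts.items()}


def average_specificity(values):
    """Mean specificity over all records of a field (non-empty list expected)."""
    scores = specificity(values)
    return sum(scores[value] for value in values) / len(values)


# Made-up state column: common values are less specific than rare ones.
states = ["Ohio", "Ohio", "Ohio", "Ohio",
          "Maryland", "Maryland", "California", "Connecticut"]

print(specificity(states))
print(average_specificity(states))
```

Under this assumed definition, a field with many distinct values (such as a street address) naturally averages higher than a field with few (such as a state name), which is consistent with the ordering of the averages in the table.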
Entity Resolution Benefits
Which Scenario is More Optimal for Your Business?
Entity Resolution Vision
• Across industry and government, many initiatives and missions boil down to 4 primary entity types:
  • People
  • Businesses/Organizations
  • Locations
  • Assets
• A deeper understanding of entities and their interconnections translates to:
  • Increased successes in cracking fraud, waste and abuse
  • Better matching of people to people across social networks
  • Stronger indicators of supply chain risk for the enterprise
Entity Resolution Vision
From a technical implementation standpoint, can scientific findings pertaining to the neocortex help us further revolutionize entity resolution technology as it stands today?
• Our statistical approach has us heading in the right direction
• We are continuously finding new ways to represent the hierarchical nature of entities
• We should take heed of the brain’s innate ability to “prune”, while possibly looking at ways to emulate “pruning” so that unnecessary retention of data with little to no value doesn’t continue to bog the enterprise down
Mary Galvin
Technical Consultant
LexisNexis Special Services, Inc. (LNSSI)
LexisNexis | Risk Solutions
202.595.4043 Mobile
Q&A