US007720846B1
(12) United States Patent
     Bayliss
(10) Patent No.: US 7,720,846 B1
(45) Date of Patent: May 18, 2010
(54) SYSTEM AND METHOD OF USING GHOST IDENTIFIERS IN A DATABASE
(75) Inventor: David Bayliss, Delray Beach, FL (US)
(73) Assignee: LexisNexis Risk Data Management, Inc., Boca Raton, FL (US)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 451 days.
(21) Appl. No.: 10/357,484
(22) Filed: Feb. 4, 2003
(51) Int. Cl.: G06F 7/00 (2006.01)
(52) U.S. Cl.: ...
(58) Field of Classification Search: None. See application file for complete search history.
Primary Examiner: Kavita Padmanabhan
(74) Attorney, Agent, or Firm: Hunton & Williams LLP
(57) ABSTRACT
Various exemplary systems and methods for linking entity references and identifying associations are presented. In particular, a method for linking a plurality of entity references to at least one entity using one or more constructed ghost entity references is provided. The method comprises the steps of identifying, for one or more common data fields, at least one unique field value from one or more entity references linked to a given entity and constructing, for the given entity, at least one ghost entity reference including the at least one unique field value, wherein the at least one ghost entity reference represents at least one potential entity reference for the given entity. The method further comprises linking at least one of the plurality of entity references to the given entity when a match probability between the at least one entity reference and the ghost entity reference is greater than a defined threshold.
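The linking step described in the abstract can be illustrated with a minimal Python sketch. It assumes the ghost entity references have already been constructed and are grouped by entity. The names `match_probability` and `link_with_ghosts` are hypothetical, and the scoring rule (the fraction of shared, non-empty fields that agree) is only a stand-in for whatever match probability the patent actually uses.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def match_probability(a: Record, b: Record) -> float:
    """Hypothetical score: fraction of shared, non-empty fields that agree."""
    shared = [f for f in a if f in b and a[f] and b[f]]
    if not shared:
        return 0.0
    agree = sum(1 for f in shared if a[f] == b[f])
    return agree / len(shared)


def link_with_ghosts(
    reference: Record,
    ghosts_by_entity: Dict[int, List[Record]],
    threshold: float,
) -> Optional[int]:
    """Return the entity whose best ghost scores strictly above the threshold."""
    best_entity, best_score = None, threshold
    for entity_id, ghosts in ghosts_by_entity.items():
        for ghost in ghosts:
            score = match_probability(reference, ghost)
            if score > best_score:  # "greater than a defined threshold"
                best_entity, best_score = entity_id, score
    return best_entity


ghosts = {
    1: [{"name": "Mary Smith", "city": "Delray Beach"},
        {"name": "Mary Jones", "city": "Delray Beach"}],
    2: [{"name": "Bob Lee", "city": "Boca Raton"}],
}
incoming = {"name": "Mary Jones", "city": "Delray Beach"}
linked_entity = link_with_ghosts(incoming, ghosts, threshold=0.75)
```

A reference that matches no ghost above the threshold is left unlinked (`None`) rather than forced onto the nearest entity.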
16 Claims, 31 Drawing Sheets
[Representative drawing (flowchart): Select subset N of data fields (1102); for each field (X) of the subset, generate Field Unique Value Table (1104); cross-produce Field Unique Value Tables to generate Ghost Table (Ghost Table 1128); update Master File to include Ghost Entity References (1108).]
U.S. Patent   May 18, 2010   Sheet 1 of 31   US 7,720,846 B1
Fig. 1A
[Drawing; surviving reference numeral: 100A]
Sheet 2 of 31
Fig. 1B
[Drawing; surviving reference numerals: 140, 144]
Sheet 3 of 31
[Flowchart]
1. Prepare Raw Data (Preparation Phase)
2. Translate Data to Entity References (Link Phase)
3. Determine Inter-Relationships Between Entities (Association Phase)
4. Perform One or More Queries Using Master File
Incoming Data -> Repeat for Iteration N (208)
Sheet 4 of 31
[Flowchart] Preparation Phase
1. Format Raw Data into Entity References
2. Join Entity References (Master File)
3. Remove Duplicate Entity References
4. Fill In Null Field Values
5. Remove Junk Field Values/Entries
Incoming Data -> Repeat for Iteration N
Sheet 5 of 31
[Flowchart] Link Phase
1. Select Relevant Fields
2. Measure Field Variance and Reset DIDs If Necessary
3. Fill In Null Field Values
4. Generate Ghost Entity References
5. Link Entity References
6. Transition Links
7. Append/Modify DIDs in Master File (414)
Incoming Data -> Repeat for Iteration N
Sheet 6 of 31
Fig. 5
[Block diagram]
Probability-Based Matching: Content Weighting, Field Weighting
Entity Reference A, Entity Reference B -> Compare Entity References -> Indication of a Link Between Entity References
Context: Location, Familial Relationships, Nicknames/Synonyms
Sheet 7 of 31
Fig. 6
1. For each particular field entry $f_n$, determine the total number (Count) of same field entries in the master file:
   $\mathrm{Count} = \sum_{i} [\text{if } (f_i = f_n) \text{ then } 1 \text{ else } 0]$
   -> Count Table
2. For each particular field entry $f_n$, determine the context weight $w_{c_i}$:
   $w_{c_i} = \dfrac{1}{\mathrm{Count} + \mathrm{Cautiousness}}$
   -> Context Weight Table
3. Calculate the probability (P) of a match between Entity References using the context weight(s):
   $P(er_1 = er_2) = \sum w_{c_j} \cdot \ldots$
4. Assign DIDs to Entity References based on probability (P)
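The count and context-weight steps of Fig. 6 can be sketched in a few lines of Python. The function names are hypothetical. The weight follows the figure's 1 / (Count + Cautiousness) form; because the figure's probability formula is truncated, `match_score` simply sums the weights of the fields on which two references agree, which is an assumption and not the patent's formula.

```python
from collections import Counter


def count_table(values):
    """Count how often each non-null field entry occurs in the master file."""
    return Counter(v for v in values if v is not None)


def context_weights(counts, cautiousness=1.0):
    """Rarer entries get larger weights: w = 1 / (Count + Cautiousness)."""
    return {value: 1.0 / (count + cautiousness) for value, count in counts.items()}


def match_score(ref_a, ref_b, weights_by_field):
    """Assumed combination rule: add the weight of every agreeing field."""
    score = 0.0
    for field, weights in weights_by_field.items():
        a, b = ref_a.get(field), ref_b.get(field)
        if a is not None and a == b:
            score += weights.get(a, 0.0)
    return score


last_names = ["Smith", "Smith", "Smith", "Bayliss", None]
weights = context_weights(count_table(last_names), cautiousness=1.0)
rare = match_score({"last": "Bayliss"}, {"last": "Bayliss"}, {"last": weights})
common = match_score({"last": "Smith"}, {"last": "Smith"}, {"last": weights})
```

With these inputs, agreement on the rare surname scores higher than agreement on the common one, which is the point of weighting by context.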
Sheet 8 of 31
[Flowchart]
1. Select subset N of Entity Reference fields
2. Next Entity Ref. A and Entity Ref. B
3. For each field (X) of the subset: Compare A.fX with B.fX (No Match / Match)
   On Match: Add (A,B) to Match Table
4. Common DID transition using Match Table
5. Adjust DID of Affected Entity References in Master File
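A minimal Python sketch of the match-table step in the flowchart above follows. It assumes each entity reference is a dict carrying a `did` key, and that a pair counts as a match only when every selected field agrees and is non-null; both the record layout and that rule are assumptions, and `build_match_table` is a hypothetical name.

```python
from itertools import combinations


def build_match_table(references, fields):
    """Compare every pair of references on the selected fields.

    Returns (left DID, right DID) pairs, lower DID first, for pairs whose
    selected fields all agree.
    """
    matches = []
    for a, b in combinations(references, 2):
        if all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields):
            left, right = sorted((a["did"], b["did"]))
            if left != right:  # references already sharing a DID need no entry
                matches.append((left, right))
    return matches


refs = [
    {"did": 1, "name": "D Bayliss", "zip": "33444"},
    {"did": 2, "name": "D Bayliss", "zip": "33444"},
    {"did": 3, "name": "D Bayliss", "zip": "33431"},
]
strict = build_match_table(refs, ["name", "zip"])
loose = build_match_table(refs, ["name"])
```

Comparing all pairs is quadratic in the number of references; it keeps the sketch short but would need blocking or sorting on real data.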
Sheet 9 of 31
Fig. 8
[Drawing; surviving reference numerals: 802, 804, 806, 808]
Sheet 10 of 31
Fig. 9A
[Drawing]
Sheet 11 of 31
Fig. 10
Match Table (1030)
1. Inner Join of Match Table with itself by left DID (1002) -> Expanded Match Table (1022)
2. Inner Join of Expanded Match Table with itself from right DID to left DID -> Transitive Closure Table (1024)
3. Transition DIDs to lowest possible DID value (1006)
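The end result of Fig. 10, every DID mapped to the lowest DID it is transitively matched with, can be sketched in Python. Note that this sketch swaps in a union-find structure instead of the self-joins shown in the figure; it reaches the same closure but is not the patent's procedure. The name `lowest_did_map` is hypothetical.

```python
def lowest_did_map(match_table):
    """Map each DID to the lowest DID in its transitively linked group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in match_table:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = min(root_a, root_b), max(root_a, root_b)
            parent[high] = low  # the lower DID always stays the group root
    return {did: find(did) for did in list(parent)}


closure = lowest_did_map([(1, 2), (2, 3), (5, 6)])
```

Because the higher root is always attached under the lower one, each group's root is its minimum DID, which corresponds to the "lowest possible DID value" in step 3.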
Sheet 12 of 31
Fig. 11
1. Select subset N of data fields (1102)
2. For each field (X) of the subset, generate Field Unique Value Table (1104)
3. Cross-Produce Field Unique Value Tables to generate Ghost Table -> Ghost Table (1128)
4. Update Master File to Include Ghost Entity References (1108)
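The cross-product construction in Fig. 11 can be sketched with `itertools.product`. The sketch assumes the input is the set of entity references already linked to one entity, represented as dicts. Skipping combinations that already exist as real references is an assumption made here, and `ghost_table` is a hypothetical name.

```python
from itertools import product


def ghost_table(entity_refs, fields):
    """Cross-produce the unique values of each field into ghost references."""
    # One "Field Unique Value Table" per selected field.
    unique_values = {
        f: sorted({r[f] for r in entity_refs if r.get(f) is not None})
        for f in fields
    }
    # Combinations already present as real references (assumed to be skipped).
    existing = {tuple(r.get(f) for f in fields) for r in entity_refs}
    ghosts = []
    for combo in product(*(unique_values[f] for f in fields)):
        if combo not in existing:
            ghosts.append(dict(zip(fields, combo)))
    return ghosts


refs_for_entity = [
    {"first": "Mary", "last": "Smith"},
    {"first": "Mary", "last": "Jones"},
    {"first": "M", "last": "Jones"},
]
ghosts = ghost_table(refs_for_entity, ["first", "last"])
```

Here the only unseen combination is "M Smith", a reference that has never been observed but plausibly belongs to the same entity. The ghost table grows with the product of the unique-value counts, so the field subset should stay small.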
Sheet 13 of 31
[Drawing; no text survives]
Sheet 14 of 31
Fig. 13
1. Measure variance along each field 'axis' (1302)
2. Decision: Variance > Threshold? (1304), with Yes and No branches
3. Reset DID of Each Entity Reference to its RID (1306)
4. Mark Entity References as having been 'Broken' (1308)
5. Decision: Entity Ref. Broken
6. Mark Entity References as suspect (1314)
7. End
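The reset path of Fig. 13 can be sketched in Python under several assumptions: the Yes branch of the variance test is taken to lead to the reset, "variance" is approximated by the number of distinct non-null values a field takes within one DID group, and each reference is a dict with `rid` and `did` keys. The function name is hypothetical, and the "suspect" branch is left out because its condition is not legible in the figure.

```python
def reset_if_high_variance(references, fields, threshold):
    """Break up any DID group whose field values vary more than the threshold.

    Each reference in such a group gets its DID reset to its own RID and is
    flagged as broken. The list is modified in place and returned.
    """
    groups = {}
    for ref in references:
        groups.setdefault(ref["did"], []).append(ref)
    for group in groups.values():
        for field in fields:
            distinct = {r[field] for r in group if r.get(field) is not None}
            if len(distinct) > threshold:  # assumed variance measure
                for r in group:
                    r["did"] = r["rid"]
                    r["broken"] = True
                break  # one offending field is enough to break the group
    return references


refs = [
    {"rid": 1, "did": 7, "ssn": "111"},
    {"rid": 2, "did": 7, "ssn": "222"},
    {"rid": 3, "did": 7, "ssn": "333"},
    {"rid": 4, "did": 9, "ssn": "444"},
    {"rid": 5, "did": 9, "ssn": "444"},
]
out = reset_if_high_variance(refs, ["ssn"], threshold=2)
```

Resetting to the RID returns each reference to a singleton entity, so a later link pass can regroup them.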
Sheet 15 of 31
Fig. 14
Association Phase
1. Determine Degree of Commonality (Association) Between Entities
2. Mark Highly Associated Entities as Related (1404)
3. Generate Ghost Entity References from Relations
4. Transitive Closure For Additional Associations Between Entities
Incoming Data -> Repeat for Iteration N
Sheet 16 of 31
Fig. 15
1. Select subset N of Entity Reference fields (1502)
2. Next Entity Ref. A and Entity Ref. B (1504)
3. For each field (X) of the subset (1506): Compare A.fX with B.fX (No Match / Match)
   On Match: Increase score of (C,D) pair in Score Table (1510) -> Score Table (1522)
4. Is Score(C, D) >= Threshold?
   Yes: Mark Entity C as Associate of Entity D, Entity D as Associate of Entity C
   No: Entity C not Associated with Entity D, Entity D not Associated with Entity C
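The score-table logic of Fig. 15 can be sketched in Python. The sketch assumes C and D are the DIDs of references A and B, that each matching field adds one point to the pair's score, and that references are dicts with a `did` key. These are illustrative assumptions, and `associate` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def associate(references, fields, threshold):
    """Score entity pairs by shared field values and mark associates."""
    scores = defaultdict(int)
    for a, b in combinations(references, 2):
        c, d = a["did"], b["did"]
        if c == d:
            continue  # references of the same entity are not associates
        pair = (min(c, d), max(c, d))
        for f in fields:
            if a.get(f) is not None and a.get(f) == b.get(f):
                scores[pair] += 1  # assumed: one point per matching field
    associates = defaultdict(set)
    for (c, d), score in scores.items():
        if score >= threshold:  # Score(C, D) >= Threshold
            associates[c].add(d)
            associates[d].add(c)
    return dict(scores), dict(associates)


refs = [
    {"did": 1, "address": "12 Elm St", "phone": "555-0100"},
    {"did": 2, "address": "12 Elm St", "phone": "555-0100"},
    {"did": 3, "address": "9 Oak Ave", "phone": "555-0100"},
]
score_table, associate_map = associate(refs, ["address", "phone"], threshold=2)
```

The association is recorded in both directions, matching the figure's "C as Associate of D, D as Associate of C".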
Sheet 17 of 31
Fig. 16
Relatives File (1620)
1. Filter (1602)
2. Duplicate Records (1604)
3. Inner Join by left DID (1606)
4. Set weight, separation, and dedup values (1608)
US. Patent
Match Table 1730
May 18,2010
Filter 1702
l Duplicate Records
1704
Duplicate Match Table
1722
lnner Join duplicate match table with master
?le ?Q?
Outlier Reference Table 1724
Sheet 18 0f 31
Fig. 17
Score DlDs using grading criteria
1708 <—-—— Grading Criteria
l Sum DID scores
1710
DID Score Table 1726
Filter DID Score Table 1712
US 7,720,846 B1
Obtain entity references of selected DlDs from
Outlier Reference Table LZH