
System and method of using ghost identifiers in a database


Page 1: System and method of using ghost identifiers in a database

US007720846B1

(12) United States Patent
     Bayliss
(10) Patent No.: US 7,720,846 B1
(45) Date of Patent: May 18, 2010

(54) SYSTEM AND METHOD OF USING GHOST IDENTIFIERS IN A DATABASE
(75) Inventor: David Bayliss, Delray Beach, FL (US)
(73) Assignee: LexisNexis Risk Data Management, Inc., Boca Raton, FL (US)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 451 days.
(21) Appl. No.: 10/357,484
(22) Filed: Feb. 4, 2003
(51) Int. Cl.: G06F 7/00 (2006.01)
(52) U.S. Cl.: ...
(58) Field of Classification Search: None
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

4,543,630 A    9/1985   Neches
4,860,201 A    8/1989   Stolfo et al.
4,870,568 A    9/1989   Kahle et al.
4,925,311 A    5/1990   Neches et al.
5,006,978 A    4/1991   Neches
5,276,899 A    1/1994   Neches
5,303,383 A    4/1994   Neches et al.
5,423,037 A    6/1995   Hvasshovd
5,471,622 A    11/1995  Eadline
5,495,606 A    2/1996   Borden et al.
5,551,027 A    8/1996   Choy et al.
5,555,404 A    9/1996   Torbjørnsen et al.
5,655,080 A    8/1997   Dias et al.
5,732,400 A    3/1998   Mandler et al.
5,745,746 A    4/1998   Jhingran et al.
5,878,408 A    3/1999   Van Huben et al.
5,884,299 A    3/1999   Ramesh et al.
5,897,638 A    4/1999   Lasser et al.
5,983,228 A    11/1999  Kobayashi et al.
6,006,249 A    12/1999  Leong
6,026,394 A    2/2000   Tsuchida et al.
6,026,398 A *  2/2000   Brown et al. ... 707/5
6981301 A      6/2000   Cochrane et al.
6,266,804 B1   7/2001   Isman
6,311,169 B2   10/2001  Duhon

(Continued)

OTHER PUBLICATIONS

Eike Schallehn et al., “Advanced Grouping and Aggregation for Data Integration,” Department of Computer Science, Paper ID: 222, pp. 1-16.

(Continued)

Primary Examiner: Kavita Padmanabhan
(74) Attorney, Agent, or Firm: Hunton & Williams LLP

(57) ABSTRACT

Various exemplary systems and methods for linking entity references and identifying associations are presented. In particular, a method for linking a plurality of entity references to at least one entity using one or more constructed ghost entity references is provided. The method comprises the steps of identifying, for one or more common data fields, at least one unique field value from one or more entity references linked to a given entity and constructing, for the given entity, at least one ghost entity reference including the at least one unique field value, wherein the at least one ghost entity reference represents at least one potential entity reference for the given entity. The method further comprises linking at least one of the plurality of entity references to the given entity when a match probability between the at least one entity reference and the ghost entity reference is greater than a defined threshold.

16 Claims, 31 Drawing Sheets

Select subset N of data fields (1102)
For each field (X) of the subset, generate Field Unique Value Table (1104)
Cross-Produce Field Unique Value Tables to generate Ghost Table
Ghost Table (1128)
Update Master File to Include Ghost Entity References (1108)

US 7,720,846 B1
Page 2

U.S. PATENT DOCUMENTS

6,427,148 B1        7/2002   Cossock
2002/0073099 A1 *   6/2002   Gilbert et al. ... 707/104.1
2002/0184222 A1 *   12/2002  Kohut et al. ... 707/10
2003/0167253 A1 *   9/2003   Meinig ... 707/1
2004/0088322 A1 *   5/2004   Elder et al. ... 707/103 Y

OTHER PUBLICATIONS

Vincent Coppola, “Killer APP,” Men’s Journal, vol. 12, No. 3, Apr. 2003, pp. 86-90.

Eike Schallehn et al., “Extensible and Similarity-based Grouping for Data Integration,” Department of Computer Science, pp. 1-17, 2002.

Rohit Ananthakrishna et al., “Eliminating Fuzzy Duplicates in Data Warehouses,” 12 pages, 2002.

Peter Christen et al., “Parallel Computing Techniques for High-Performance Probabilistic Record Linkage,” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-11.

Peter Christen et al., “Parallel Techniques for High-Performance Record Linkage (Data Matching),” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-27.

Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-14.

William E. Winkler, “Matching And Record Linkage,” U.S. Bureau of the Census, pp. 1-38.

Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” ANU Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, pp. 1-11.

William E. Winkler, “The State of Record Linkage and Current Research Problems,” U.S. Bureau of the Census, 15 pages.

William E. Winkler, “Advanced Methods For Record Linkage,” Bureau of the Census, pp. 1-21.

William E. Winkler, “Frequency-Based Matching in Fellegi-Sunter Model of Record Linkage,” Bureau Of The Census Statistical Research Division, Oct. 4, 2000, 14 pages.

William E. Winkler, “State of Statistical Data Editing And Current Research Problems,” Bureau Of The Census Statistical Research Division, 10 pages.

The First Open ETL/EAI Software For The Real-Time Enterprise, Sunopsis, A New Generation ETL Tool, “Sunopsis™ v3 expedites integration between heterogeneous systems for Data Warehouse, Data Mining, Business Intelligence, and OLAP projects,” <www.suopsis.com>, 6 pages.

Alan Dumas, “The ETL Market and Sunopsis™ v3 Business Intelligence, Data Warehouse & Datamart Projects,” 2002, Sunopsis, pp. 1-7.

Teradata Warehouse Solutions, “Teradata Database Technical Overview,” 2002, pp. 1-7.

WhiteCross White Paper, May 25, 2000, “wx/des-Technical Information,” pp. 1-36.

Teradata Alliance Solutions, “Teradata and Ab Initio,” pp. 1-2, 2001.

Peter Christen et al., The Australian National University, “Febrl - Freely extensible biomedical record linkage,” Oct. 2002, pp. 1-67.

William E. Winkler, “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage,” Bureau Of The Census Statistical Research Division, Oct. 4, 2000, 12 pages.

William E. Winkler et al., “An Application of the Fellegi-Sunter Model Of Record Linkage To The 1990 U.S. Decennial Census,” U.S. Bureau of the Census, pp. 1-22.

William E. Winkler, “Improved Decision Rules In The Fellegi-Sunter Model Of Record Linkage,” Bureau of the Census, pp. 1-13.

Fritz Scheuren et al., “Recursive Merging and Analysis of Administrative Lists and Data,” U.S. Bureau of the Census, 9 pages.

William E. Winkler, “Record Linkage Software and Methods for Merging Administrative Lists,” U.S. Bureau of the Census, Jul. 7, 2001, 11 pages.

Enterprises, Publishing and Broadcasting Limited, Acxiom-Abilitec, pp. 44-45.

TransUnion, Credit Reporting System, Oct. 9, 2002, 4 pages, <http://www.transunion.com/content/pagejsp?id:/transunion/general/data/business/BusCre...>.

TransUnion, ID Verification & Fraud Detection, Account Acquisition, Account Management, Collection & Location Services, Employment Screening, Risk Management, Automotive, Banking Savings & Loan, Credit Card Providers, Credit Unions, Energy & Utilities, Healthcare, Insurance, Investment, Real Estate, Telecommunications, Oct. 9, 2002, 46 pages, <http://www.transunion.com>.

White Paper An Introduction to OLAP Multidimensional Terminology and Technology, 20 pages.

* cited by examiner

U.S. Patent    May 18, 2010    Sheet 1 of 31    US 7,720,846 B1

Fig. 1A (drawing; reference numeral 100A)

U.S. Patent    May 18, 2010    Sheet 2 of 31    US 7,720,846 B1

Fig. 1B (drawing; reference numerals 140, 144)

U.S. Patent    May 18, 2010    Sheet 3 of 31    US 7,720,846 B1

Prepare Raw Data (Preparation Phase)
Translate Data to Entity References (Link Phase)
Determine Inter-Relationships Between Entities (Association Phase)
Perform One or More Queries Using Master File
Incoming Data -> Repeat for Iteration N (208)

U.S. Patent    May 18, 2010    Sheet 4 of 31    US 7,720,846 B1

Preparation Phase
Format Raw Data into Entity References
Join Entity References (Master File)
Remove Duplicate Entity References
Fill In Null Field Values
Remove Junk Field Values/Entries
Incoming Data -> Repeat for Iteration N

U.S. Patent    May 18, 2010    Sheet 5 of 31    US 7,720,846 B1

Link Phase
Select Relevant Fields
Measure Field Variance and Reset DIDs If Necessary
Fill In Null Field Values
Generate Ghost Entity References
Link Entity References
Transition Links
Append/Modify DIDs in Master File
Incoming Data -> Repeat for Iteration N

U.S. Patent    May 18, 2010    Sheet 6 of 31    US 7,720,846 B1

Fig. 5
Probability-Based Matching: Content Weighting, Field Weighting
Entity Reference A, Entity Reference B -> Compare Entity References -> Indication of a Link Between Entity References
Context: Location, Familial Relationships, Nicknames/Synonyms

U.S. Patent    May 18, 2010    Sheet 7 of 31    US 7,720,846 B1

Fig. 6
For each particular field entry $f_n$, determine total number (Count) of same field entries in master file:
$\mathrm{Count} = \sum_{i} [\text{if } (f_i = f_n) \text{ then } 1 \text{ else } 0]$
Count Table
For each particular field entry $f_n$, determine context weight $w_{c_i}$:
$w_{c_i} = \dfrac{1}{\mathrm{Count} + \mathrm{Cautiousness}}$
Context Weight Table
Calculate probability (P) of match between Entity References using context weight(s):
$P(er_1 = er_2) = \sum_{i} w_{c_i} \cdot \ldots$
Assign DIDs to Entity References based on probability (P)
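The weighting steps in the Fig. 6 flowchart can be illustrated with a minimal Python sketch. It assumes the weight is the reciprocal of the value's count plus a cautiousness constant, as the flowchart appears to state. Because the probability formula is truncated in this copy, the sketch assumes the score is simply the sum of the weights of the fields on which the two references agree. The function names, the dictionary record layout, and the default `cautiousness` value are all hypothetical.

```python
from collections import Counter


def count_table(master, field):
    """Count how often each non-null value of `field` occurs in the master file."""
    return Counter(rec[field] for rec in master if rec.get(field) is not None)


def content_weight(count, cautiousness=1.0):
    """Rarer values weigh more: w = 1 / (count + cautiousness)."""
    return 1.0 / (count + cautiousness)


def match_score(a, b, master, fields, cautiousness=1.0):
    """Assumed combination rule: sum the weights of the fields where a and b agree."""
    score = 0.0
    for field in fields:
        value = a.get(field)
        if value is not None and value == b.get(field):
            counts = count_table(master, field)
            score += content_weight(counts[value], cautiousness)
    return score


master = [
    {"last": "Smith", "first": "John"},
    {"last": "Smith", "first": "Mary"},
    {"last": "Smith", "first": "Ann"},
    {"last": "Bayliss", "first": "David"},
    {"last": "Bayliss", "first": "Dave"},
]
common = match_score(master[0], master[1], master, ["last", "first"])
rare = match_score(master[3], master[4], master, ["last", "first"])
```

In this toy master file a shared "Bayliss" scores higher than a shared "Smith" because the rarer surname is stronger evidence that two references describe the same entity.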

U.S. Patent    May 18, 2010    Sheet 8 of 31    US 7,720,846 B1

(reference numeral 700)
Select subset N of Entity Reference fields
Next Entity Ref. A and Entity Ref. B
For each field (X) of the subset: Compare A.fX with B.fX
No Match -> Next Entity Ref. A and Entity Ref. B
Match -> Add (A,B) to Match Table
Common DID transition using Match Table
Adjust DID of Affected Entity References in Master File
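The pairwise comparison loop in this flowchart might be sketched as follows. This is an illustration only: it assumes a pair matches when every selected field is non-null and equal, and that the match table stores the pair of DIDs. The names `build_match_table`, `did`, and the record layout are hypothetical.

```python
from itertools import combinations


def build_match_table(refs, fields):
    """Compare every pair of entity references on the selected fields.

    A pair is recorded (as a pair of DIDs, lower first) only when all
    selected fields are present and equal in both references.
    """
    matches = set()
    for a, b in combinations(refs, 2):
        if a["did"] == b["did"]:
            continue  # already linked to the same entity
        if all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields):
            low, high = sorted((a["did"], b["did"]))
            matches.add((low, high))
    return sorted(matches)


refs = [
    {"did": 1, "name": "J SMITH", "ssn": "111"},
    {"did": 2, "name": "J SMITH", "ssn": "111"},
    {"did": 3, "name": "D BAYLISS", "ssn": "222"},
]
table = build_match_table(refs, ["name", "ssn"])
```

Comparing all pairs is quadratic in the number of references; a production system would presumably join or block on the selected fields instead.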

U.S. Patent    May 18, 2010    Sheet 9 of 31    US 7,720,846 B1

Fig. 8 (drawing; reference numerals 802, 804, 806, 808)

U.S. Patent    May 18, 2010    Sheet 10 of 31    US 7,720,846 B1

Fig. 9A (drawing)

U.S. Patent    May 18, 2010    Sheet 11 of 31    US 7,720,846 B1

Fig. 10
Match Table (1030)
Inner Join of Match Table with itself by left DID (1002)
Expanded Match Table (1022)
Inner Join of Expanded Match Table with itself from right DID to left DID
Transitive Closure Table (1024)
Transition DIDs to lowest possible DID value (1006)
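As a rough illustration of the Fig. 10 idea, the sketch below closes a match table under transitivity by joining it with itself and then maps every DID to the lowest DID it is linked to. It is not the patented procedure: the figure shows two specific joins, whereas this sketch assumes the self-join is repeated until no new pairs appear. The function names are hypothetical.

```python
def transitive_closure(pairs):
    """Close a match table under symmetry, reflexivity, and transitivity."""
    closure = set()
    for a, b in pairs:
        closure |= {(a, b), (b, a), (a, a), (b, b)}
    while True:
        # Index rows by left DID so the self-join is a lookup.
        by_left = {}
        for left, right in closure:
            by_left.setdefault(left, set()).add(right)
        # Self-join: right DID of one row equals left DID of another.
        joined = {(a, c) for a, b in closure for c in by_left[b]}
        if joined <= closure:
            return closure
        closure |= joined


def lowest_did_map(pairs):
    """Map each DID to the lowest DID reachable through the match table."""
    lowest = {}
    for a, b in transitive_closure(pairs):
        lowest[a] = min(lowest.get(a, a), b)
    return lowest


mapping = lowest_did_map([(1, 2), (2, 3), (5, 6)])
```

Here DIDs 1, 2, and 3 collapse to DID 1 even though 1 and 3 were never directly matched, which is the effect the transition step aims for.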

U.S. Patent    May 18, 2010    Sheet 12 of 31    US 7,720,846 B1

Fig. 11
Select subset N of data fields (1102)
For each field (X) of the subset, generate Field Unique Value Table (1104)
Cross-Produce Field Unique Value Tables to generate Ghost Table
Ghost Table (1128)
Update Master File to Include Ghost Entity References
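The ghost table construction in Fig. 11 can be sketched in Python as a cross product of per-field unique value lists. The sketch assumes that combinations already present as real entity references are skipped and that ghosts carry a `ghost` flag; both choices, along with the function names and record layout, are illustrative assumptions rather than details taken from the patent.

```python
from itertools import product


def field_unique_values(refs, did, field):
    """Unique non-null values of `field` among references linked to `did`."""
    seen = []
    for ref in refs:
        value = ref.get(field)
        if ref["did"] == did and value is not None and value not in seen:
            seen.append(value)
    return seen


def ghost_references(refs, did, fields):
    """Cross-produce the unique value tables to build ghost entity references."""
    tables = [field_unique_values(refs, did, f) for f in fields]
    existing = {tuple(r.get(f) for f in fields) for r in refs if r["did"] == did}
    ghosts = []
    for combo in product(*tables):
        if combo not in existing:  # assumption: keep only new combinations
            ghosts.append({"did": did, "ghost": True, **dict(zip(fields, combo))})
    return ghosts


refs = [
    {"did": 1, "first": "Bob", "last": "Smith"},
    {"did": 1, "first": "Robert", "last": "Smyth"},
    {"did": 2, "first": "David", "last": "Bayliss"},
]
ghosts = ghost_references(refs, 1, ["first", "last"])
```

For entity 1 this yields "Bob Smyth" and "Robert Smith", two plausible references that have not been observed yet but would link to the entity if they arrived later.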

U.S. Patent    May 18, 2010    Sheet 13 of 31    US 7,720,846 B1

U.S. Patent    May 18, 2010    Sheet 14 of 31    US 7,720,846 B1

Fig. 13
Measure variance along each field 'axis' (1302)
Variance > Threshold? (1304)
Yes: Reset DID of Each Entity Reference to its RID (1306)
Mark Entity References as having been 'Broken' (1308)
Entity Ref. Broken
Mark Entity References as suspect (1314)
End
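The reset step in Fig. 13 could look roughly like the sketch below. The flowchart does not say how variance is measured, so this sketch assumes a simple stand-in: the number of distinct non-null values a field takes within one entity. When that exceeds a limit, every reference in the group gets its DID reset to its own RID and is flagged. The names `break_inconsistent_entities`, `max_distinct`, and `broken` are hypothetical.

```python
def break_inconsistent_entities(refs, field, max_distinct):
    """Reset DIDs to RIDs for entities whose `field` values vary too much.

    Variance is approximated by the count of distinct non-null values,
    which is an assumption made for this sketch.
    """
    by_did = {}
    for ref in refs:
        by_did.setdefault(ref["did"], []).append(ref)
    for group in by_did.values():
        distinct = {r[field] for r in group if r.get(field) is not None}
        if len(distinct) > max_distinct:
            for ref in group:
                ref["did"] = ref["rid"]  # each reference becomes its own entity
                ref["broken"] = True
    return refs


refs = [
    {"rid": 1, "did": 10, "ssn": "111"},
    {"rid": 2, "did": 10, "ssn": "222"},
    {"rid": 3, "did": 10, "ssn": "333"},
    {"rid": 4, "did": 20, "ssn": "444"},
    {"rid": 5, "did": 20, "ssn": "444"},
]
result = break_inconsistent_entities(refs, "ssn", max_distinct=1)
```

Entity 10 is broken apart because its references carry three different SSNs, while entity 20 is left intact.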

U.S. Patent    May 18, 2010    Sheet 15 of 31    US 7,720,846 B1

Fig. 14
Association Phase
Determine Degree of Commonality (Association) Between Entities
Mark Highly Associated Entities as Related (1404)
Generate Ghost Entity References from Relations
Transitive Closure For Additional Associations Between Entities
Incoming Data -> Repeat for Iteration N

U.S. Patent    May 18, 2010    Sheet 16 of 31    US 7,720,846 B1

Fig. 15
Select subset N of Entity Reference fields (1502)
Next Entity Ref. A and Entity Ref. B (1504)
For each field (X) of the subset (1506): Compare A.fX with B.fX
No Match -> Next Entity Ref. A and Entity Ref. B
Match -> Increase score of (C,D) pair in Score Table (1510)
Score Table (1522)
Is Score(C, D) >= Threshold?
Yes: Mark Entity C as Associate of Entity D, Entity D as Associate of Entity C
No: Entity C not Associated with Entity D, Entity D not Associated with Entity C
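The scoring loop of Fig. 15 might be sketched as follows. The sketch assumes that C and D are the DIDs of entity references A and B, that each matching field adds one point to the (C, D) score, and that pairs reaching the threshold are marked as mutual associates. The function name and data layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def associate_entities(refs, fields, threshold):
    """Score entity pairs by shared field values and mark associates."""
    scores = defaultdict(int)
    for a, b in combinations(refs, 2):
        c, d = a["did"], b["did"]
        if c == d:
            continue  # same entity: linking, not association
        key = (min(c, d), max(c, d))
        for field in fields:
            if a.get(field) is not None and a.get(field) == b.get(field):
                scores[key] += 1  # assumption: one point per matching field
    associates = defaultdict(set)
    for (c, d), score in scores.items():
        if score >= threshold:
            associates[c].add(d)
            associates[d].add(c)
    return dict(scores), dict(associates)


refs = [
    {"did": 1, "address": "12 Oak St", "phone": "555-0100"},
    {"did": 2, "address": "12 Oak St", "phone": "555-0100"},
    {"did": 3, "address": "12 Oak St", "phone": "555-0199"},
]
scores, associates = associate_entities(refs, ["address", "phone"], threshold=2)
```

Entities 1 and 2 share both address and phone and are marked as associates, while entity 3 shares only the address and stays below the threshold.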

U.S. Patent    May 18, 2010    Sheet 17 of 31    US 7,720,846 B1

Fig. 16
Relatives File (1620)
Filter (1602)
Duplicate Records (1604)
Inner Join by left DID (1606)
Set weight, separation, and dedup values (1608)

U.S. Patent    May 18, 2010    Sheet 18 of 31    US 7,720,846 B1

Fig. 17
Match Table (1730)
Filter (1702)
Duplicate Records (1704)
Duplicate Match Table (1722)
Inner Join duplicate match table with master file
Outlier Reference Table (1724)
Score DIDs using grading criteria (1708) <- Grading Criteria
Sum DID scores (1710)
DID Score Table (1726)
Filter DID Score Table (1712)
Obtain entity references of selected DIDs from Outlier Reference Table
