
Page 1: 1 Probabilistic Linkage: Issues and Strategies Craig A. Mason, Ph.D. University of Maine craig.mason@umit.maine.edu


Probabilistic Linkage: Issues and Strategies

Craig A. Mason, Ph.D.

University of Maine

craig.mason@umit.maine.edu


Faculty Disclosure Information

In the past 12 months, I have not had a significant financial interest or other relationship with the manufacturer(s) of the product(s) or provider(s) of the service(s) that will be discussed in my presentation.

This presentation will not include discussion of pharmaceuticals or devices that have not been approved by the FDA, or of unapproved or "off-label" uses of pharmaceuticals or devices.


Acknowledgements

• Shihfen Tu, Quansheng Song

• Keith Scott, Marygrace Yale, Tony Gonzalez

• Derek Chapman


Overview of Linkage Process

• Two databases contain information on some of the same individuals: Birth Certificates and EHDI Diagnostic Data
• Many births are not in the EHDI Diagnostic Data
• Some entries in the EHDI Diagnostic Data do not appear in the Electronic Birth Certificates
• The final linkage is a subset of each


Linkage Algorithms

• Deterministic
  – Exactly match on specified common fields
  – Easiest, quickest linkage strategy
  – Misconception that this is the “gold standard”

EHDI Data                     Birth Defects Registry
ID  First  Mid  Last          ID   First  Mid  Last
12  John   J    Dawson        382  John   J    Dawson


Linkage Algorithms

• Deterministic
  – May result in significant bias
    • Non-traditional spellings in African American names
  – Result in errors due to non-links
    • Many non-links can result in greater bias than a few erroneous pairings

EHDI Data                          Birth Defects Registry
ID  First    Mid  Last             ID   First    Mid  Last
9   Zbignew       Brezinsky        534  Zbignew  J    Brezinski
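The behavior of deterministic linkage can be sketched in a few lines of Python. This is only an illustration: the dict layout and the key names `id`, `first`, `mid`, and `last` are assumptions, and the records are the two example rows from the slides. A pair links only when every listed field agrees exactly, so the Dawson records link and the Brezinsky/Brezinski records do not.

```python
# Hypothetical record layout: dicts with "id", "first", "mid", "last".
def deterministic_link(records_a, records_b, fields):
    """Return (id_a, id_b) pairs whose values agree exactly on every field."""
    # Index the second dataset by its tuple of field values.
    index = {}
    for rec in records_b:
        key = tuple(rec[f] for f in fields)
        index.setdefault(key, []).append(rec["id"])
    links = []
    for rec in records_a:
        key = tuple(rec[f] for f in fields)
        for id_b in index.get(key, []):
            links.append((rec["id"], id_b))
    return links


ehdi = [
    {"id": 12, "first": "John", "mid": "J", "last": "Dawson"},
    {"id": 9, "first": "Zbignew", "mid": "", "last": "Brezinsky"},
]
registry = [
    {"id": 382, "first": "John", "mid": "J", "last": "Dawson"},
    {"id": 534, "first": "Zbignew", "mid": "J", "last": "Brezinski"},
]

links = deterministic_link(ehdi, registry, ["first", "mid", "last"])
print(links)  # [(12, 382)]
```

The Zbignew records are missed because the middle initial is blank in one source and the last name is spelled differently, which is the kind of non-link the slide warns about.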


Linkage Algorithms

• Probabilistic
  – Statistically estimate the likelihood or odds that two records are for the same individual, even if they disagree on some fields


Linkage Algorithms

• Factors Impacting Probabilistic Linkage
  – Likelihood that a field would agree if the pairing is a correct link
    • Good quality data counts more than poor quality data
  – Likelihood that a field would agree if the pairing is not a correct link
    • Rare values count more than common values
  – Number of expected matches
• Much more complicated and expensive strategy


Implementing an Effective Data Linkage

[Cartoon: a flowchart runs from "Start" through "Then a miracle occurs" to "out". Caption: "Good work, but I think we might need just a little more detail right here."]

• Modified from Kim Church, Maine Genetics Program


Probabilistic Matching

• Probabilistic Matching: Two records are not required to match in all fields
  – Two records are compared on each of the specified fields.
  – A weight, w_i, is calculated for each field in a potential match, reflecting the strength of the agreement or disagreement

EHDI Data                          Birth Defects Registry
ID  First    Mid  Last             ID   First    Mid  Last
9   Zbignew       Brezinsky        534  Zbignew  J    Brezinski

w_1 is the weight for the First field and w_2 is the weight for the Last field.


Factors Influencing Likelihood of Match

• Reliability of data fields
  – Greater reliability results in increased odds of a correct match
    • A match on a high-quality, reliably entered field is good
    • Not matching on a poor-quality field with lots of known data entry errors may not be a fatal error
  – If a field is pure noise, correct matches will be random across the databases


Factors Influencing Likelihood of Match

• Frequency of field values
  – The more common the value in a field, the greater the odds that the records will be erroneously matched
    • A match based on the name Zbignew is a relatively good indicator of a match, even if there may be disagreement in other fields
    • A match based on the name John may be of much less value, requiring matches on more fields in order to conclude two records are the same individual
• Number of expected matches one would obtain randomly


Calculating Match Weights

• Weight Calculation
  – M-probability
    • Probability that a field agrees if the pairing reflects a correct match
  – U-probability
    • Probability that a field agrees if the pairing reflects an incorrect match
    • Chance that a given field will agree randomly
    • Approximately = # records with a specific value / total # of records
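The frequency approximation for the U-probability can be sketched as follows. The function name `estimate_u` and the list of first names are made up for illustration; only the ratio (records with a specific value over total records) comes from the slide.

```python
from collections import Counter


def estimate_u(values):
    """Approximate u per value as (# records with that value) / (total # of records)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Made-up first names, for illustration only.
first_names = ["John"] * 6 + ["Mary"] * 3 + ["Zbignew"]
u = estimate_u(first_names)
print(u["John"])     # 0.6
print(u["Zbignew"])  # 0.1
```

A common value such as John gets a large u, so agreement on it carries little evidence; a rare value such as Zbignew gets a small u and carries much more.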


Probabilistic Matching

• If the field agrees, w_i is equal to

$$w_i = \log_2\left(\frac{m_i}{u_i}\right)$$


Probabilistic Matching

  – m_1 for first name = .98: 98% of the time, if it’s a correct match, the first names will agree
  – u_1 for Zbignew = .00001: the probability of randomly getting two first names that are Zbignew

$$w_1 = \log_2\left(\frac{m_1}{u_1}\right) = \log_2\left(\frac{.98}{.00001}\right) = 16.58049$$


Probabilistic Matching

• In cases where two records disagree on a specified field, w_i is equal to

$$w_i = \log_2\left(\frac{1 - m_i}{1 - u_i}\right)$$


Probabilistic Matching

  – m_2 for last name = .96: 96% of the time, if it’s a correct match, the last names will agree
  – u_2 for Brezinsky = .00003: the probability of randomly getting two last names that are Brezinsky

$$w_2 = \log_2\left(\frac{1 - m_2}{1 - u_2}\right) = \log_2\left(\frac{1 - .96}{1 - .00003}\right) = -4.64381$$
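The agreement and disagreement weights can be checked with a short Python sketch. The function name `field_weight` and its `agrees` flag are illustrative choices, not part of the presentation; the m and u values are the ones quoted on the slides.

```python
import math


def field_weight(m, u, agrees):
    """Base-2 log weight for one field, given its m- and u-probabilities."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


w1 = field_weight(0.98, 0.00001, agrees=True)   # first name agrees
w2 = field_weight(0.96, 0.00003, agrees=False)  # last name disagrees
print(round(w1, 5))  # 16.58049
print(round(w2, 5))  # -4.64381
```

Agreement on a rare, reliable field gives a large positive weight, while disagreement on a reliable field gives a negative one.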


Calculating Match Weights

• A composite weight, w_t, is calculated for each pair of records
  – The sum of weights across all fields used in linkage
• Larger w_t suggests a correct match
• Smaller or negative w_t suggests an incorrect match

$$w_t = \sum_{i=1}^{k} w_i = 16.58049 - 4.64381 = 11.93668$$
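As a rough sketch, the composite weight is just the sum of the per-field weights. The tuple layout `(m, u, agrees)` and the name `composite_weight` are assumptions made for this example; the numbers are the first-name and last-name values from the slides.

```python
import math


def composite_weight(fields):
    """Sum per-field weights; each field is a tuple (m, u, agrees)."""
    total = 0.0
    for m, u, agrees in fields:
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


pair = [
    (0.98, 0.00001, True),   # first name agrees
    (0.96, 0.00003, False),  # last name disagrees
]
w_t = composite_weight(pair)
print(round(w_t, 5))  # 11.93668
```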


Blocking

• Match Determination
  – Could compare every record in one dataset with every record in the second dataset
    • Results in N1 x N2 comparisons
  – Blocking
    • Records are first “blocked” on a subset of fields for which a deterministic match is required.
    • Within each block, all records from one dataset are compared to all records from the other dataset
    • w_t is calculated for each of these possible pairings.
• The distribution of w_t values across all blocks is examined in order to determine a critical cut-off score necessary to classify two records as a match.
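A minimal sketch of blocking, assuming records are dicts and using a hypothetical `dob` field as the blocking key (the presentation does not say which fields are used for blocking). Only pairs that agree exactly on the blocking fields become candidates for weight calculation.

```python
from collections import defaultdict


def candidate_pairs(records_a, records_b, block_fields):
    """Yield (rec_a, rec_b) pairs that agree exactly on every blocking field."""
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[tuple(rec[f] for f in block_fields)].append(rec)
    for rec_a in records_a:
        key = tuple(rec_a[f] for f in block_fields)
        for rec_b in blocks.get(key, []):
            yield rec_a, rec_b


# Made-up records; "dob" is a hypothetical blocking field.
a = [
    {"id": 1, "dob": "2005-01-01", "last": "Dawson"},
    {"id": 2, "dob": "2005-01-02", "last": "Brezinsky"},
    {"id": 3, "dob": "2005-01-02", "last": "Smith"},
]
b = [
    {"id": 10, "dob": "2005-01-01", "last": "Dawson"},
    {"id": 11, "dob": "2005-01-02", "last": "Brezinski"},
    {"id": 12, "dob": "2005-01-03", "last": "Jones"},
]
pairs = [(ra["id"], rb["id"]) for ra, rb in candidate_pairs(a, b, ["dob"])]
print(len(a) * len(b))  # 9 comparisons without blocking
print(pairs)            # [(1, 10), (2, 11), (3, 11)]
```

Blocking cuts the work from 9 comparisons to 3 here, at the cost of never comparing records whose blocking field was entered incorrectly.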


[Figure: plot over composite weights. Horizontal axis: "Wt for Pairings", from -8 to 20. Vertical axis: 0 to 1.2.]


[Figure: plot over composite weights. Horizontal axis: "Wt for Pairings", from -8 to 20. Vertical axis: 0 to 2.5.]


Estimating Probabilities

• The total weight required for two records to have a probability, p, of being a match is equal to

$$w_t = \log_2\left(\frac{p}{1 - p}\right) - \log_2\left(\frac{E}{N_1 N_2 - E}\right)$$

  – where p is the desired probability of a match,
  – E is the expected number of potential matches,
  – N_1 and N_2 are the number of records in each database, and
  – $\log_2\left(\frac{E}{N_1 N_2 - E}\right)$ is the base 2 log of the odds of a random match
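The cut-off calculation can be sketched as below. The function name `required_weight` and the database sizes (1,000 and 2,000 records with 500 expected matches) are invented for illustration; only the formula for the required total weight comes from the slide.

```python
import math


def required_weight(p, expected_matches, n1, n2):
    """Total weight needed for a pair to have probability p of being a match."""
    # Odds that a randomly chosen pair is a true match.
    prior_odds = expected_matches / (n1 * n2 - expected_matches)
    return math.log2(p / (1 - p)) - math.log2(prior_odds)


# Made-up sizes: 1,000 and 2,000 records with 500 expected matches.
cutoff_50 = required_weight(0.50, 500, 1000, 2000)
cutoff_95 = required_weight(0.95, 500, 1000, 2000)
print(cutoff_50 < cutoff_95)  # True
```

A higher desired probability, or fewer expected matches relative to N_1 x N_2, pushes the required weight up.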


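Estimating Probabilities

From this formula, it is possible to derive an equation for estimating the probability that any two records are a match

$$p = \frac{\prod_{i=0}^{K} x_i}{1 + \prod_{i=0}^{K} x_i}$$

$$x_i = \begin{cases} \dfrac{E}{N_1 N_2 - E}, & i = 0 \text{ (odds of a random match)} \\ \dfrac{m_i}{u_i}, & i > 0 \text{ if two fields agree, and} \\ \dfrac{1 - m_i}{1 - u_i}, & i > 0 \text{ if two fields do not agree} \end{cases}$$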
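One way to fill in the derivation, as a sketch: the intermediate steps below are not on the slide, and they assume that the composite weight is the sum of the field weights, $w_t = \sum_{i=1}^{K} \log_2 x_i$, and that $x_0 = E / (N_1 N_2 - E)$ is the odds of a random match.

$$
\begin{aligned}
w_t &= \log_2\left(\frac{p}{1 - p}\right) - \log_2 x_0 \\
\log_2\left(\frac{p}{1 - p}\right) &= \log_2 x_0 + \sum_{i=1}^{K} \log_2 x_i = \log_2 \prod_{i=0}^{K} x_i \\
\frac{p}{1 - p} &= \prod_{i=0}^{K} x_i \\
p &= \frac{\prod_{i=0}^{K} x_i}{1 + \prod_{i=0}^{K} x_i}
\end{aligned}
$$

The last step solves the odds for p.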


Notes

• Note that the probability equation is equivalent to a base-2 version of the logistic probability formula

$$p = \frac{2^{w_0 + w_t}}{1 + 2^{w_0 + w_t}}, \quad \text{where } w_0 = \log_2 x_0$$

• The computational formula avoids the need to repeatedly calculate powers of 2 and log2
  – This is due to the weights in the exponent themselves being a log-value
• The same probability is obtained using e and the natural log in place of 2 and log2 throughout
  – Base 2 results in improved computational speed
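The equivalences in these notes can be checked numerically with a small sketch. The function names and the prior odds `x_0 = 1/4000` are made up for illustration; the field ratios use the m and u values quoted earlier in the slides.

```python
import math


def p_from_weights(w_t, w_0):
    """Base-2 logistic form: p = 2^(w_0 + w_t) / (1 + 2^(w_0 + w_t))."""
    odds = 2.0 ** (w_0 + w_t)
    return odds / (1.0 + odds)


def p_from_ratios(x_0, ratios):
    """Computational form: p = prod(x_i) / (1 + prod(x_i)), no logs or powers."""
    odds = x_0 * math.prod(ratios)
    return odds / (1.0 + odds)


# Slide values for the two fields; x_0 is a made-up odds of a random match.
x_0 = 1 / 4000
ratios = [0.98 / 0.00001, (1 - 0.96) / (1 - 0.00003)]

w_0 = math.log2(x_0)
w_t = sum(math.log2(x) for x in ratios)

p_log = p_from_weights(w_t, w_0)
p_prod = p_from_ratios(x_0, ratios)

# Same calculation with e and the natural log in place of 2 and log2.
z = math.log(x_0) + sum(math.log(x) for x in ratios)
p_nat = math.exp(z) / (1.0 + math.exp(z))

print(math.isclose(p_log, p_prod), math.isclose(p_log, p_nat))  # True True
```

All three routes give the same probability; the product form simply skips the logarithms and exponentials.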


That’s nice, but …..

• All right. But apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health… What have the Romans ever done for us?

--- Reg, spokesman for the People’s Front of Judea

Monty Python, Life of Brian

(and Martin White, UC Berkeley)