48
Linking data records using probabilistic techniques Linking data records using Linking data records using probabilistic techniques probabilistic techniques

Linking data records using probabilistic techniques · Mainframe $1,455 / $1,190 Determ & Prob Automatch (Integrity) Generalized Record Linkage System (GRLS) LinkPro •Links: same

Embed Size (px)

Citation preview

Linking data records using probabilistic techniques

Linking data records using Linking data records using

probabilistic techniquesprobabilistic techniques

OverviewOverviewOverview

•• The choice between deterministic The choice between deterministic and probabilistic linkage methodsand probabilistic linkage methods

•• Demonstration of probabilistic Demonstration of probabilistic linkage software linkage software ---- LinksLinks

•• Conclusions regarding probabilistic Conclusions regarding probabilistic and deterministic methodsand deterministic methods

Linkage BeginningsLinkage BeginningsLinkage Beginnings

The Massachusetts Maternal Mortality The Massachusetts Maternal Mortality and Morbidity Projectand Morbidity Project

Hospital

Discharge

Mothers

Birth

Certificates

Hospital

Discharge

Children

Deterministic Linkage ChallengesDeterministic Linkage ChallengesDeterministic Linkage Challenges

•• Which are the best variables to link Which are the best variables to link with?with?

•• What is an objective way to decide What is an objective way to decide matched v. unmatched?matched v. unmatched?

•• When do we say When do we say ““enough is enough is enoughenough””??

The MA Pregnancy to Early Life Longitudinal Project (PELL)

The MA Pregnancy to Early Life The MA Pregnancy to Early Life

Longitudinal Project (PELL) Longitudinal Project (PELL)

Birth

Certificate/

Fetal Death

Hospital Discharge

Observational Stay

Emergency Department

Administrative DataAdministrative Data

The MA Pregnancy to Early Life Longitudinal Project (PELL)

The MA Pregnancy to Early Life The MA Pregnancy to Early Life

Longitudinal Project (PELL) Longitudinal Project (PELL)

Birth

Certificate/

Fetal Death

MA Healthy Start

Early Intervention

WIC

Programmatic DataProgrammatic Data

The MA Pregnancy to Early Life Longitudinal Project (PELL)

The MA Pregnancy to Early Life The MA Pregnancy to Early Life

Longitudinal Project (PELL) Longitudinal Project (PELL)

Birth

Certificate/

Fetal Death

Death Data

Birth Defects

Vital and Health Statistics DataVital and Health Statistics Data

PELL Data SetPELL Data SetPELL Data Set

•• Used for many different analysesUsed for many different analyses

-- Program ReviewProgram Review

-- SurveillanceSurveillance

-- ResearchResearch

How do we create a linked data set How do we create a linked data set with flexibility?with flexibility?

New ChallengesNew ChallengesNew Challenges

•• Reduce amount of time spent on Reduce amount of time spent on each linkage each linkage

•• Use one linkage algorithm for Use one linkage algorithm for multiple years of datamultiple years of data

•• Deal with matched and unmatched Deal with matched and unmatched in a consistent and objective wayin a consistent and objective way

•• But But howhow??

Probabilistic Record LinkageProbabilistic Record LinkageProbabilistic Record Linkage

•• Uses probabilities to determine Uses probabilities to determine whether a pair of records refer to whether a pair of records refer to the same individualthe same individual

•• Calculates weights to quantify the Calculates weights to quantify the likelihood that a pair of records are likelihood that a pair of records are a true matcha true match

•• Probabilistic weights may be either Probabilistic weights may be either nonnon--specific or value specificspecific or value specific

General (Non-Specific) WeightsGeneral (NonGeneral (Non--Specific) WeightsSpecific) Weights

•• Agreement on a specific variableAgreement on a specific variable

•• Example:Example:

-- AgreementAgreement on date of birth receives a on date of birth receives a higher weight then match on sex higher weight then match on sex

-- Disagreement Disagreement on sex receives a on sex receives a higher penalty than disagreement on higher penalty than disagreement on date of birthdate of birth

Value Specific WeightsValue Specific WeightsValue Specific Weights

•• Agreement on a specific value of Agreement on a specific value of the variable being comparedthe variable being compared

•• Example: Example: Comparing initials using value Comparing initials using value specific weightsspecific weights

-- Agreement on initial Z receives higher Agreement on initial Z receives higher weight than match on initial Sweight than match on initial S

-- Disagreement on initial S receives higher Disagreement on initial S receives higher penalty than disagreement on Zpenalty than disagreement on Z

Benefit of WeightsBenefit of WeightsBenefit of Weights

•• Weights objectively reflect our Weights objectively reflect our confidence in a matchconfidence in a match

•• Individual choice in cutting off low Individual choice in cutting off low weightsweights

Probabilistic Linkage MethodsProbabilistic Linkage MethodsProbabilistic Linkage Methods

•• Some SAS programmers write their Some SAS programmers write their own probabilistic codeown probabilistic code

•• Software packagesSoftware packages

-- Very expensiveVery expensive

-- Difficult to useDifficult to use

-- Some applications are available as Some applications are available as freeware or sharewarefreeware or shareware

Choosing Probabilistic SoftwareChoosing Probabilistic SoftwareChoosing Probabilistic Software

OS Initial $ Yearly $ Link Type Description Audience

Automatch (Integrity)

Windows $100,000 ??? Probablistic GUI Marketing

Generalized Record Linkage System (GRLS)

UNIX $18,800 10% Probablistic ORACLE Health care

LinkPro

None SAS Health careWindows/

Mainframe

$1,455 /

$1,190

Determ &

Prob

Automatch (Integrity)

Generalized Record Linkage System (GRLS)

LinkPro

•Links: same as LinkPro but freeware

•FEBRL: also freeware, opensource

LinkPro FeaturesLinkPro FeaturesLinkPro Features

•• InexpensiveInexpensive

•• Easy to useEasy to use

•• Created for health care record Created for health care record linkagelinkage

•• Both deterministic and probabilistic Both deterministic and probabilistic linkagelinkage

LinkPro FeaturesLinkPro FeaturesLinkPro Features

•• Capacity to recognize and Capacity to recognize and accommodate duplicate recordsaccommodate duplicate records

•• Supports full and partial/conditional Supports full and partial/conditional comparisons comparisons

LinkPro FeaturesLinkPro FeaturesLinkPro Features

•• Runs on any mainframe, mini, Runs on any mainframe, mini, workstation or PC with SAS 6.06 or workstation or PC with SAS 6.06 or higherhigher

•• Free technical supportFree technical support

How LinkPro WorksHow LinkPro WorksHow LinkPro Works

•• Automatically calculates and applies Automatically calculates and applies nonnon--specificspecific probabilistic weightsprobabilistic weights

•• Weights estimate the likelihood that Weights estimate the likelihood that a pair of records from separate files a pair of records from separate files correspond to the same individualcorrespond to the same individual

How Weights Are CalculatedHow Weights Are CalculatedHow Weights Are Calculated

Computed building on logComputed building on log22 of the of the odds or frequency ratio calculated odds or frequency ratio calculated for each variablefor each variable

weight = log2 OUTCOME freq in LINKED pairsOUTCOME freq in UNLINKABLE pairs

LinkPro Statements and SyntaxLinkPro Statements and SyntaxLinkPro Statements and Syntax

_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset

DATA2= DATA2= SASSAS--datadata--setset <<optionsoptions> ;> ;

_VAR _VAR variablevariable--list;list;

_BY _BY variablevariable--list;list;

_RUN;_RUN;

_LINKPRO <options>_LINKPRO _LINKPRO <options><options>

DATA1= DATA1= SASSAS--datadata--setset

DATA2= DATA2= SASSAS--datadata--setset ;;

_VAR _VAR variablevariable--list;list;

_BY _BY variablevariable--list;list;

_RUN;_RUN;

<options>

_LINKPRO

_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>

MIN = MIN = numbernumber

•• Minimum number of variables that Minimum number of variables that must agree in _VAR statementmust agree in _VAR statement

_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>

USEMEMUSEMEM

•• Stores input data in memory for Stores input data in memory for faster executionfaster execution

SIMPLESIMPLE

•• Deterministic linkage onlyDeterministic linkage only

_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>

PAIRS = PAIRS = SASSAS--datadata--setset

•• Creates data set containing all Creates data set containing all ‘‘linkablelinkable’’ pairs (potential links)pairs (potential links)

RESOLVE = RESOLVE = 1xN 1xN oror Nx1Nx1

•• Allows one to many matchAllows one to many match

_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>

DEBUGDEBUG

•• Prints all SAS statements and Prints all SAS statements and messages for problem diagnosingmessages for problem diagnosing

_VAR statement_VAR statement_VAR statement

_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset

DATA2= DATA2= SASSAS--datadata--setset <options> ;<options> ;

variablevariable--list;list;

_BY _BY variablevariable--list;list;

_RUN;_RUN;

_VAR

_VAR statement_VAR statement_VAR statement

•• Lists all variables used in linkageLists all variables used in linkage

•• Numeric or character variablesNumeric or character variables

_VAR statement_VAR statement_VAR statement

Partial/Conditional ComparisonsPartial/Conditional Comparisons

•• VAR1|VAR2VAR1|VAR2

•• VAR1 compared for agreement, if VAR1 compared for agreement, if no match, VAR2 comparedno match, VAR2 compared

•• 3 possible outcomes and weights3 possible outcomes and weights

_BY statement_BY statement_BY statement

_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset

DATA2= DATA2= SASSAS--datadata--setset <options> ;<options> ;

_VAR _VAR variablevariable--list;list;

variablevariable--list;list;

_RUN;_RUN;

_BY

_BY statement_BY statement_BY statement

•• Optional statementOptional statement

•• Variable(s) that must match Variable(s) that must match exactlyexactly

•• Speeds up linkageSpeeds up linkage

_LPX and _WTX statements_LPX and _WTX statements_LPX and _WTX statements

_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset

DATA2= DATA2= SASSAS--datadata--setset <options> ;<options> ;

_VAR _VAR variablevariable--list;list;

_BY _BY variablevariable--list;list;

'SAS'SAS--statement(s)';statement(s)';

'SAS'SAS--statement(s)';statement(s)';

_RUN;_RUN;

_LPX

_WTX

_LPX Statement_LPX Statement_LPX Statement

•• Optional statementOptional statement

•• Inserts SAS statements into the Inserts SAS statements into the data step that generates linkable data step that generates linkable pairspairs

_LPX 'if given1^=given2 _LPX 'if given1^=given2

and given1>given2 then and given1>given2 then

_matched = _matched+1;';_matched = _matched+1;';

_WTX Statement_WTX Statement_WTX Statement

•• Optional statementOptional statement

•• Inserts SAS statements into the Inserts SAS statements into the data step that calculates data step that calculates probabilistic weightsprobabilistic weights

_WTX 'if abs(birthyr1_WTX 'if abs(birthyr1--

birthyr2)<=3 then birthyr2)<=3 then

__wgtwgt=_wgt+1;'; =_wgt+1;';

LinkPro Output FilesLinkPro Output FilesLinkPro Output Files

Linked recordsLinked records

Links that could not be Links that could not be resolvedresolved

Unlinked records from the Unlinked records from the first data setfirst data set

Unlinked records from the Unlinked records from the second data setsecond data set

_LKD

_TIE

1

2

_DAT1

_DAT2

3

4

LinkPro Versus DeterministicLinkPro Versus DeterministicLinkPro Versus Deterministic

•• Replicated original BCReplicated original BC--HD link using HD link using LinkProLinkPro

•• Found over 99% agreement in Found over 99% agreement in resulting links from two linkage resulting links from two linkage methodsmethods

Benefits Of Probabilistic LinkageBenefits Of Probabilistic LinkageBenefits Of Probabilistic Linkage

•• Routinized linkage processRoutinized linkage process

•• Provided objective way to deal with Provided objective way to deal with matched and unmatched data matched and unmatched data

•• Reduced amount of code and time Reduced amount of code and time spent on linking dataspent on linking data

•• Able to inspect tied records or print Able to inspect tied records or print out those with lowest 5out those with lowest 5--10% weight10% weight

Abandon Deterministic Altogether?Abandon Deterministic Altogether?Abandon Deterministic Altogether?

Definitely NOTDefinitely NOT

Choice of deterministic or probabilistic Choice of deterministic or probabilistic methods depends on:methods depends on:

-- Type of project Type of project

-- DataData

ConclusionsConclusionsConclusions

•• Deterministic vs. ProbabilisticDeterministic vs. Probabilistic

-- Depends on your situation and goalsDepends on your situation and goals

•• Probabilistic linkage software can be Probabilistic linkage software can be affordable and easy to use!affordable and easy to use!

More Information on LinkProMore Information on LinkProMore Information on LinkPro

http://members.shaw.ca/andre.wajda/linkpro.htmlhttp://members.shaw.ca/andre.wajda/linkpro.html

AndrAndréé Wajda Wajda

[email protected]@shaw.ca

More Information on LinksMore Information on LinksMore Information on Links

Randy Randy WalldWalld

[email protected][email protected]