Upload
nguyenkhue
View
216
Download
0
Embed Size (px)
Citation preview
Linking data records using probabilistic techniques
Linking data records using Linking data records using
probabilistic techniquesprobabilistic techniques
OverviewOverviewOverview
•• The choice between deterministic The choice between deterministic and probabilistic linkage methodsand probabilistic linkage methods
•• Demonstration of probabilistic Demonstration of probabilistic linkage software linkage software ---- LinksLinks
•• Conclusions regarding probabilistic Conclusions regarding probabilistic and deterministic methodsand deterministic methods
Linkage BeginningsLinkage BeginningsLinkage Beginnings
The Massachusetts Maternal Mortality The Massachusetts Maternal Mortality and Morbidity Projectand Morbidity Project
Hospital
Discharge
Mothers
Birth
Certificates
Hospital
Discharge
Children
Deterministic Linkage ChallengesDeterministic Linkage ChallengesDeterministic Linkage Challenges
•• Which are the best variables to link Which are the best variables to link with?with?
•• What is an objective way to decide What is an objective way to decide matched v. unmatched?matched v. unmatched?
•• When do we say When do we say ““enough is enough is enoughenough””??
The MA Pregnancy to Early Life Longitudinal Project (PELL)
The MA Pregnancy to Early Life The MA Pregnancy to Early Life
Longitudinal Project (PELL) Longitudinal Project (PELL)
Birth
Certificate/
Fetal Death
Hospital Discharge
Observational Stay
Emergency Department
Administrative DataAdministrative Data
The MA Pregnancy to Early Life Longitudinal Project (PELL)
The MA Pregnancy to Early Life The MA Pregnancy to Early Life
Longitudinal Project (PELL) Longitudinal Project (PELL)
Birth
Certificate/
Fetal Death
MA Healthy Start
Early Intervention
WIC
Programmatic DataProgrammatic Data
The MA Pregnancy to Early Life Longitudinal Project (PELL)
The MA Pregnancy to Early Life The MA Pregnancy to Early Life
Longitudinal Project (PELL) Longitudinal Project (PELL)
Birth
Certificate/
Fetal Death
Death Data
Birth Defects
Vital and Health Statistics DataVital and Health Statistics Data
PELL Data SetPELL Data SetPELL Data Set
•• Used for many different analysesUsed for many different analyses
-- Program ReviewProgram Review
-- SurveillanceSurveillance
-- ResearchResearch
How do we create a linked data set How do we create a linked data set with flexibility?with flexibility?
New ChallengesNew ChallengesNew Challenges
•• Reduce amount of time spent on Reduce amount of time spent on each linkage each linkage
•• Use one linkage algorithm for Use one linkage algorithm for multiple years of datamultiple years of data
•• Deal with matched and unmatched Deal with matched and unmatched in a consistent and objective wayin a consistent and objective way
•• But But howhow??
Probabilistic Record LinkageProbabilistic Record LinkageProbabilistic Record Linkage
•• Uses probabilities to determine Uses probabilities to determine whether a pair of records refer to whether a pair of records refer to the same individualthe same individual
•• Calculates weights to quantify the Calculates weights to quantify the likelihood that a pair of records are likelihood that a pair of records are a true matcha true match
•• Probabilistic weights may be either Probabilistic weights may be either nonnon--specific or value specificspecific or value specific
General (Non-Specific) WeightsGeneral (NonGeneral (Non--Specific) WeightsSpecific) Weights
•• Agreement on a specific variableAgreement on a specific variable
•• Example:Example:
-- AgreementAgreement on date of birth receives a on date of birth receives a higher weight then match on sex higher weight then match on sex
-- Disagreement Disagreement on sex receives a on sex receives a higher penalty than disagreement on higher penalty than disagreement on date of birthdate of birth
Value Specific WeightsValue Specific WeightsValue Specific Weights
•• Agreement on a specific value of Agreement on a specific value of the variable being comparedthe variable being compared
•• Example: Example: Comparing initials using value Comparing initials using value specific weightsspecific weights
-- Agreement on initial Z receives higher Agreement on initial Z receives higher weight than match on initial Sweight than match on initial S
-- Disagreement on initial S receives higher Disagreement on initial S receives higher penalty than disagreement on Zpenalty than disagreement on Z
Benefit of WeightsBenefit of WeightsBenefit of Weights
•• Weights objectively reflect our Weights objectively reflect our confidence in a matchconfidence in a match
•• Individual choice in cutting off low Individual choice in cutting off low weightsweights
Probabilistic Linkage MethodsProbabilistic Linkage MethodsProbabilistic Linkage Methods
•• Some SAS programmers write their Some SAS programmers write their own probabilistic codeown probabilistic code
•• Software packagesSoftware packages
-- Very expensiveVery expensive
-- Difficult to useDifficult to use
-- Some applications are available as Some applications are available as freeware or sharewarefreeware or shareware
Choosing Probabilistic SoftwareChoosing Probabilistic SoftwareChoosing Probabilistic Software
OS Initial $ Yearly $ Link Type Description Audience
Automatch (Integrity)
Windows $100,000 ??? Probablistic GUI Marketing
Generalized Record Linkage System (GRLS)
UNIX $18,800 10% Probablistic ORACLE Health care
LinkPro
None SAS Health careWindows/
Mainframe
$1,455 /
$1,190
Determ &
Prob
Automatch (Integrity)
Generalized Record Linkage System (GRLS)
LinkPro
•Links: same as LinkPro but freeware
•FEBRL: also freeware, opensource
LinkPro FeaturesLinkPro FeaturesLinkPro Features
•• InexpensiveInexpensive
•• Easy to useEasy to use
•• Created for health care record Created for health care record linkagelinkage
•• Both deterministic and probabilistic Both deterministic and probabilistic linkagelinkage
LinkPro FeaturesLinkPro FeaturesLinkPro Features
•• Capacity to recognize and Capacity to recognize and accommodate duplicate recordsaccommodate duplicate records
•• Supports full and partial/conditional Supports full and partial/conditional comparisons comparisons
LinkPro FeaturesLinkPro FeaturesLinkPro Features
•• Runs on any mainframe, mini, Runs on any mainframe, mini, workstation or PC with SAS 6.06 or workstation or PC with SAS 6.06 or higherhigher
•• Free technical supportFree technical support
How LinkPro WorksHow LinkPro WorksHow LinkPro Works
•• Automatically calculates and applies Automatically calculates and applies nonnon--specificspecific probabilistic weightsprobabilistic weights
•• Weights estimate the likelihood that Weights estimate the likelihood that a pair of records from separate files a pair of records from separate files correspond to the same individualcorrespond to the same individual
How Weights Are CalculatedHow Weights Are CalculatedHow Weights Are Calculated
Computed building on logComputed building on log22 of the of the odds or frequency ratio calculated odds or frequency ratio calculated for each variablefor each variable
weight = log2 OUTCOME freq in LINKED pairsOUTCOME freq in UNLINKABLE pairs
LinkPro Statements and SyntaxLinkPro Statements and SyntaxLinkPro Statements and Syntax
_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset
DATA2= DATA2= SASSAS--datadata--setset <<optionsoptions> ;> ;
_VAR _VAR variablevariable--list;list;
_BY _BY variablevariable--list;list;
_RUN;_RUN;
_LINKPRO <options>_LINKPRO _LINKPRO <options><options>
DATA1= DATA1= SASSAS--datadata--setset
DATA2= DATA2= SASSAS--datadata--setset ;;
_VAR _VAR variablevariable--list;list;
_BY _BY variablevariable--list;list;
_RUN;_RUN;
<options>
_LINKPRO
_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>
MIN = MIN = numbernumber
•• Minimum number of variables that Minimum number of variables that must agree in _VAR statementmust agree in _VAR statement
_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>
USEMEMUSEMEM
•• Stores input data in memory for Stores input data in memory for faster executionfaster execution
SIMPLESIMPLE
•• Deterministic linkage onlyDeterministic linkage only
_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>
PAIRS = PAIRS = SASSAS--datadata--setset
•• Creates data set containing all Creates data set containing all ‘‘linkablelinkable’’ pairs (potential links)pairs (potential links)
RESOLVE = RESOLVE = 1xN 1xN oror Nx1Nx1
•• Allows one to many matchAllows one to many match
_LINKPRO <options>_LINKPRO <_LINKPRO <options>options>
DEBUGDEBUG
•• Prints all SAS statements and Prints all SAS statements and messages for problem diagnosingmessages for problem diagnosing
_VAR statement_VAR statement_VAR statement
_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset
DATA2= DATA2= SASSAS--datadata--setset <options> ;<options> ;
variablevariable--list;list;
_BY _BY variablevariable--list;list;
_RUN;_RUN;
_VAR
_VAR statement_VAR statement_VAR statement
•• Lists all variables used in linkageLists all variables used in linkage
•• Numeric or character variablesNumeric or character variables
_VAR statement_VAR statement_VAR statement
Partial/Conditional ComparisonsPartial/Conditional Comparisons
•• VAR1|VAR2VAR1|VAR2
•• VAR1 compared for agreement, if VAR1 compared for agreement, if no match, VAR2 comparedno match, VAR2 compared
•• 3 possible outcomes and weights3 possible outcomes and weights
_BY statement_BY statement_BY statement
_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset
DATA2= DATA2= SASSAS--datadata--setset <options> ;<options> ;
_VAR _VAR variablevariable--list;list;
variablevariable--list;list;
_RUN;_RUN;
_BY
_BY statement_BY statement_BY statement
•• Optional statementOptional statement
•• Variable(s) that must match Variable(s) that must match exactlyexactly
•• Speeds up linkageSpeeds up linkage
_LPX and _WTX statements_LPX and _WTX statements_LPX and _WTX statements
_LINKPRO DATA1= _LINKPRO DATA1= SASSAS--datadata--setset
DATA2= DATA2= SASSAS--datadata--setset <options> ;<options> ;
_VAR _VAR variablevariable--list;list;
_BY _BY variablevariable--list;list;
'SAS'SAS--statement(s)';statement(s)';
'SAS'SAS--statement(s)';statement(s)';
_RUN;_RUN;
_LPX
_WTX
_LPX Statement_LPX Statement_LPX Statement
•• Optional statementOptional statement
•• Inserts SAS statements into the Inserts SAS statements into the data step that generates linkable data step that generates linkable pairspairs
_LPX 'if given1^=given2 _LPX 'if given1^=given2
and given1>given2 then and given1>given2 then
_matched = _matched+1;';_matched = _matched+1;';
_WTX Statement_WTX Statement_WTX Statement
•• Optional statementOptional statement
•• Inserts SAS statements into the Inserts SAS statements into the data step that calculates data step that calculates probabilistic weightsprobabilistic weights
_WTX 'if abs(birthyr1_WTX 'if abs(birthyr1--
birthyr2)<=3 then birthyr2)<=3 then
__wgtwgt=_wgt+1;'; =_wgt+1;';
LinkPro Output FilesLinkPro Output FilesLinkPro Output Files
Linked recordsLinked records
Links that could not be Links that could not be resolvedresolved
Unlinked records from the Unlinked records from the first data setfirst data set
Unlinked records from the Unlinked records from the second data setsecond data set
_LKD
_TIE
1
2
_DAT1
_DAT2
3
4
LinkPro Versus DeterministicLinkPro Versus DeterministicLinkPro Versus Deterministic
•• Replicated original BCReplicated original BC--HD link using HD link using LinkProLinkPro
•• Found over 99% agreement in Found over 99% agreement in resulting links from two linkage resulting links from two linkage methodsmethods
Benefits Of Probabilistic LinkageBenefits Of Probabilistic LinkageBenefits Of Probabilistic Linkage
•• Routinized linkage processRoutinized linkage process
•• Provided objective way to deal with Provided objective way to deal with matched and unmatched data matched and unmatched data
•• Reduced amount of code and time Reduced amount of code and time spent on linking dataspent on linking data
•• Able to inspect tied records or print Able to inspect tied records or print out those with lowest 5out those with lowest 5--10% weight10% weight
Abandon Deterministic Altogether?Abandon Deterministic Altogether?Abandon Deterministic Altogether?
Definitely NOTDefinitely NOT
Choice of deterministic or probabilistic Choice of deterministic or probabilistic methods depends on:methods depends on:
-- Type of project Type of project
-- DataData
ConclusionsConclusionsConclusions
•• Deterministic vs. ProbabilisticDeterministic vs. Probabilistic
-- Depends on your situation and goalsDepends on your situation and goals
•• Probabilistic linkage software can be Probabilistic linkage software can be affordable and easy to use!affordable and easy to use!
More Information on LinkProMore Information on LinkProMore Information on LinkPro
http://members.shaw.ca/andre.wajda/linkpro.htmlhttp://members.shaw.ca/andre.wajda/linkpro.html
AndrAndréé Wajda Wajda
[email protected]@shaw.ca
More Information on LinksMore Information on LinksMore Information on Links
Randy Randy WalldWalld