Beyond 2011: Automating the linkage of anonymous data Pete Jones Office for National Statistics

• The Office for National Statistics (ONS) conducted a review (the Beyond 2011 Programme) of the future approach to the census and population statistics in England and Wales

• National Statistician made a recommendation to Government in March 2014 that there should be a predominantly online census in 2021

• This will be supplemented with increased use of administrative data and surveys to enhance census outputs and annual statistics

• Part of our research leading up to the recommendation was to explore an administrative data option for producing population statistics

• Involved large scale record linkage with national datasets and surveys

• Lots of research into developing fully automated methods to link anonymous data

The Beyond 2011 Programme

LA = local authority, between 2,200 and 1 million people, average size = 160,000

Postcode = An alpha-numeric code assigned to a postal address to assist the sorting of mail

• Sources used in Beyond 2011 research:

PR = Patient Register – list of all patients registered with an NHS doctor in England and Wales

CIS = Customer Information System – list of people who have a National Insurance Number – tax register

HESA = Higher Education Statistics Agency – list of students registered on a Higher Education course in England and Wales

SC = School Census – list of pupils registered at state schools in England and Wales

Definitions

• For the 2011 Census, data matching was undertaken between Census and Census Coverage Survey (CCS)

• By aggregating the number of matched / unmatched records we are able to adjust for non-response

• Requires near-perfect matching (zero false positives / false negatives)

• Matching error will result in over-estimate or under-estimate of the population

• To ensure quality a combination of exact matching, probabilistic matching and clerical matching was used

Matching to produce population estimates

• There are additional challenges associated with the use of admin data

• Data quality – particularly lags in data being up to date

• Efficiency – need to match datasets with 60 million + records

• Public acceptability - ONS unique in holding multiple admin sources in one place

• Made the decision that names, dates of birth and addresses will be anonymised with a hashing algorithm (SHA-256)

• Converts original identifiers into meaningless hashed values

(e.g. John hashes to XY143257461)

• Consistently maps same entities to the same hashed value
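The consistency property described above can be illustrated with a minimal sketch using Python's standard `hashlib`. The normalisation step (trimming and upper-casing) and the function name `hash_identifier` are assumptions for illustration, not the programme's actual pre-processing; the value shown on the slide (XY143257461) is illustrative and is not a real SHA-256 digest.

```python
import hashlib

def hash_identifier(value: str) -> str:
    """Return the SHA-256 hex digest of an identifier.

    The normalisation (strip + upper-case) is an assumed pre-processing
    step so that trivially different spellings map to the same hash.
    """
    normalised = value.strip().upper()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# The same entity always maps to the same hashed value...
john_a = hash_identifier("John")
john_b = hash_identifier(" john ")
# ...while a near-identical name maps to an unrelated value.
jon = hash_identifier("Jon")

print(john_a == john_b)  # same digest
print(john_a == jon)     # different digest
```

Because "John" and "Jon" produce unrelated digests, string comparison on the hashed values carries no information, which is the problem the later slides address.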

Beyond 2011 Linkage Model

• Hashing data makes many of the traditional methods for resolving inconsistencies redundant

- Cannot run direct string comparison algorithms

- Cannot use clerical resolution

• Developed alternative ways of tackling data capture inconsistencies

(1) The development of match-keys that can be derived in pre-processing and hashed before linking two datasets

Methodological Research

Beyond 2011 Match-Keys

Key  Type                                                             Unique records on EPR (%)
1    Forename, Surname, DoB, Sex, Postcode                            100.00%
2    Forename initial, Surname initial, DoB, Sex, Postcode District   99.55%
3    Forename bi-gram, Surname bi-gram, DoB, Sex, Postcode Area       99.44%
4    Forename initial, DoB, Sex, Postcode                             99.84%
5    Surname initial, DoB, Sex, Postcode                              99.44%
6    Forename, Surname, Age, Sex, Postcode Area                       99.46%
7    Forename, Surname, Sex, Postcode                                 99.19%
8    Forename, Surname, DoB, Sex                                      98.87%
9    Forename, Surname, DoB, Postcode                                 99.52%
10   Surname, Forename, DoB, Sex, Postcode (matched on key 1)         100.00%
11   Middle name, Surname, DoB, Sex, Postcode (matched on key 1)      99.90%
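A minimal sketch of how the first three match-keys might be derived in pre-processing and then hashed. It assumes that a "bi-gram" means the first two characters of a name, that the postcode district is the outward code and the area is its leading letters, and that key components are joined with a `|` separator. These are illustrative assumptions, and the record values are hypothetical.

```python
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def postcode_area(outward: str) -> str:
    """Leading letters of the outward code, e.g. 'PO15' -> 'PO'."""
    area = ""
    for ch in outward:
        if not ch.isalpha():
            break
        area += ch
    return area

def match_keys(rec: dict) -> dict:
    """Derive keys 1-3 from the table above and hash each one."""
    fn, sn = rec["forename"].upper(), rec["surname"].upper()
    dob, sex = rec["dob"], rec["sex"]
    postcode = rec["postcode"].upper()
    district = postcode.split()[0]          # assumed: outward code
    area = postcode_area(district)
    raw = {
        1: (fn, sn, dob, sex, postcode),
        2: (fn[0], sn[0], dob, sex, district),
        3: (fn[:2], sn[:2], dob, sex, area),  # assumed: bi-gram = first 2 chars
    }
    return {key: sha256_hex("|".join(parts)) for key, parts in raw.items()}

# Hypothetical records: same person, names captured inconsistently.
keys_a = match_keys({"forename": "Jon", "surname": "Smyth",
                     "dob": "13/02/1965", "sex": "M", "postcode": "PO15 5RR"})
keys_b = match_keys({"forename": "John", "surname": "Smith",
                     "dob": "13/02/1965", "sex": "M", "postcode": "PO15 5RR"})

for key in (1, 2, 3):
    print(key, keys_a[key] == keys_b[key])
```

In this example key 1 fails because the full names differ, while keys 2 and 3 agree, so the pair can still be linked by exact comparison of hashed keys.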

• Constructed during pre-processing to support score-based methods that involve string comparison

• Matching single variables in isolation prior to encryption is non-disclosive

(2) Similarity Tables

Data Storage Area: Original Dataset (Source 2)

Forename  Surname  DoB         PostCode
John      Davis    02/04/1993  B1 2TG
John      Thomas   23/07/1986  M2 1JH
John      Smith    16/06/2003  BH12 1LT
Jon       Reed     19/09/1993  DT8 4PB
Jon       Ellis    16/06/2008  KT1 1LL
Jonny     Johnson  06/01/2002  N7 4ER
Jonny     Daniels  21/10/1949  LN22 1AR
Jonny     Barker   14/10/1974  PO11 7TG
Jonny     King     26/02/1998  SO1 4KW
Jonathan  Khan     03/06/1999  E1 2BB
Jonathan  Wright   11/10/2004  CR21 2JJ
Jonathan  Walker   10/07/2002  W5 6AD
…         …        …           …

Reception Server (Data Import Area): extract list of unique forenames

List of unique forenames: John, Jon, Jonny, Jonathan, ……

• Follow the same process for the 2nd dataset import

Similarity Tables

Data Storage Area: Source 2 Dataset

Forename  Surname  DoB         PostCode
John      Davis    02/04/1993  B1 2TG
John      Thomas   23/07/1986  M2 1JH
John      Smith    16/06/2003  BH12 1LT
Jon       Reed     19/09/1993  DT8 4PB
Jon       Ellis    16/06/2008  KT1 1LL
Jonny     Johnson  06/01/2002  N7 4ER
Jonny     Daniels  21/10/1949  LN22 1AR
Jonnie    Barker   14/10/1974  PO11 7TG
Jonny     King     26/02/1998  SO1 4KW
Jonathan  Khan     03/06/1999  E1 2BB
Jonathan  Wright   11/10/2004  CR21 2JJ
Jonathan  Walker   10/07/2002  W5 6AD
…         …        …           …

Reception Server (Data Import Area): identify any additional names not on list

List of unique forenames: John, Jon, Jonny, Jonathan, Jonnie, ……

• Run string comparison algorithm between all names on the list

Similarity Tables

List of unique forenames (compared with itself by the string comparison algorithm): John, Jon, Jonny, Jonathan, Jonnie, ……

Forename  Matches   Score
John      John      1
John      Jonny     0.88
John      Jon       0.91
John      Jonathan  0.82
Jonny     Jonny     1
Jonny     John      0.88
Jonny     Jon       0.89
Jonny     Jonathan  0.79
Jon       Jon       1
Jon       John      0.91
Jon       Jonny     0.89
Jon       Jonathan  0.81
Jonathan  Jonathan  1
Jonathan  John      0.82
Jonathan  Jon       0.81
Jonathan  Jonny     0.79
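The construction of a similarity table can be sketched as follows: compare every pair of unique names in clear text, keep the score, and store only the hashed names alongside it. The comparator here is `difflib.SequenceMatcher` from the Python standard library, used purely as a stand-in; it is not the algorithm behind the scores on the slide, so its values will differ. The threshold of 0.6 and the function names are assumptions.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import product

def sha(value: str) -> str:
    return hashlib.sha256(value.upper().encode("utf-8")).hexdigest()

def similarity(a: str, b: str) -> float:
    # Stand-in comparator; any string similarity measure could be used here.
    return round(SequenceMatcher(None, a.upper(), b.upper()).ratio(), 2)

def build_similarity_table(names, threshold=0.0):
    """Score every ordered pair of unique names, then keep only hashes + score."""
    unique = sorted(set(names))
    table = []
    for a, b in product(unique, repeat=2):
        score = similarity(a, b)
        if score >= threshold:
            table.append((sha(a), sha(b), score))
    return table

forenames = ["John", "Jon", "Jonny", "Jonathan", "Jonnie"]
table = build_similarity_table(forenames, threshold=0.6)

for hashed_a, hashed_b, score in table[:3]:
    print(hashed_a[:8], hashed_b[:8], score)
```

The scoring happens before hashing, so the resulting table can be joined to hashed datasets without ever exposing the original names to the researcher.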

Similarity Tables (example)

Candidate record pair:

PR_Forename  PR_Surname  PR_DoB      PR_Sex  PR_Pcode  SC_Forename  SC_Surname  SC_DoB      SC_Sex  SC_Pcode
Jon          Smyth       13/02/1965  M       PO15 5RR  John         Smith       09/02/1965  M       PO15 5RR

Similarity tables:

PR_Forename  SC_Forename  Similarity Score  PR_Surname  SC_Surname  Similarity Score  PR_DoB      SC_DoB      Similarity Score
John         John         1                 Smith       Smyth       0.93              13/02/1965  08/02/1965  0.67
John         Jonny        0.88              Smith       Smithers    0.87              13/02/1965  09/02/1965  0.67
John         Jon          0.91              Smith       Smithson    0.85              13/02/1965  10/02/1965  0.67
John         Jonathan     0.82              Smith       Smith       1                 13/02/1965  11/02/1965  0.67
Jonny        Jonny        1                 Smyth       Smith       0.93              13/02/1965  12/02/1965  0.67
Jonny        John         0.88              Smyth       Smithers    0.9               13/02/1965  13/02/1965  1
Jonny        Jon          0.89              Smyth       Smithson    0.83              13/02/1965  14/02/1965  0.67
Jonny        Jonathan     0.79              Smyth       Smyth       1                 13/02/1965  15/02/1965  0.67
Jon          Jon          1                 Smithers    Smith       0.87              13/02/1965  16/02/1965  0.67
Jon          John         0.91              Smithers    Smyth       0.9               13/02/1965  17/02/1965  0.67
Jon          Jonny        0.89              Smithers    Smithson    0.92              13/02/1965  18/02/1965  0.67
Jon          Jonathan     0.81              Smithers    Smithers    1                 13/02/1965  13/01/1965  0.67
Jonathan     Jonathan     1                 Smithson    Smith       0.85              13/02/1965  13/03/1995  0.67
Jonathan     John         0.82              Smithson    Smyth       0.83              13/02/1965  13/04/1995  0.67
Jonathan     Jon          0.81              Smithson    Smithers    0.92              13/02/1965  13/05/1995  0.67
Jonathan     Jonny        0.79              Smithson    Smithson    1                 13/02/1965  13/06/1995  0.67
…            …            …                 …           …           …                 …           …           …


The same record pair and similarity tables after hashing (# marks a hashed field):

# PR_Forename  # PR_Surname  # PR_DoB  PR_Sex  # PR_Pcode  # SC_Forename  # SC_Surname  # SC_DoB  SC_Sex  # SC_Pcode
EFIJ2465       CTYG0289      GXCX6714  M       XXY1234     VRXM2613       XHDK5456      LRQP3671  M       XXY1234

# PR_Forename  # SC_Forename  Similarity Score  # PR_Surname  # SC_Surname  Similarity Score  # PR_DoB  # SC_DoB  Similarity Score
VRXM2613       VRXM2613       1                 XHDK5456      CTYG0289      0.93              GXCX6714  JVNJ7158  0.67
VRXM2613       XFVZ6018       0.88              XHDK5456      RDDM5656      0.87              GXCX6714  LRQP3671  0.67
VRXM2613       EFIJ2465       0.91              XHDK5456      LLZY2510      0.85              GXCX6714  NFBN7474  0.67
VRXM2613       UAXM3111       0.82              XHDK5456      XHDK5456      1                 GXCX6714  XPKA6238  0.67
XFVZ6018       XFVZ6018       1                 CTYG0289      XHDK5456      0.93              GXCX6714  LIOO1416  0.67
XFVZ6018       VRXM2613       0.88              CTYG0289      RDDM5656      0.9               GXCX6714  GXCX6714  1
XFVZ6018       EFIJ2465       0.89              CTYG0289      LLZY2510      0.83              GXCX6714  MTVL1447  0.67
XFVZ6018       UAXM3111       0.79              CTYG0289      CTYG0289      1                 GXCX6714  URHR3837  0.67
EFIJ2465       EFIJ2465       1                 RDDM5656      XHDK5456      0.87              GXCX6714  ATNY6182  0.67
EFIJ2465       VRXM2613       0.91              RDDM5656      CTYG0289      0.9               GXCX6714  QZIF1539  0.67
EFIJ2465       XFVZ6018       0.89              RDDM5656      LLZY2510      0.92              GXCX6714  HIFN0726  0.67
EFIJ2465       UAXM3111       0.81              RDDM5656      RDDM5656      1                 GXCX6714  HJRM4460  0.67
UAXM3111       UAXM3111       1                 LLZY2510      XHDK5456      0.85              GXCX6714  FFKD6141  0.67
UAXM3111       VRXM2613       0.82              LLZY2510      CTYG0289      0.83              GXCX6714  UUGF4224  0.67
UAXM3111       EFIJ2465       0.81              LLZY2510      RDDM5656      0.92              GXCX6714  YZWA1982  0.67
UAXM3111       XFVZ6018       0.79              LLZY2510      LLZY2510      1                 GXCX6714  UASD4867  0.67
…              …              …                 …             …             …                 …         …         …


• The similarity tables identify all the candidate pairs that achieve a specified similarity threshold on forename, surname and DoB

• The researcher will only ever see the hashed fields

• Hashed variables are now redundant (can delete them)

• The only usable information is the scores themselves

• But what do you do with the scores?

Candidate Matches

Source 1 Forename  Source 2 Forename  Forename Score  Source 1 Surname  Source 2 Surname  Surname Score  Source 1 DoB  Source 2 DoB  DoB Score  Overall Score
EFIJ2465           ZASG1635           0.78            CTYG0289          XHDK5456          0.93           GXCX6714      AFIQ8834      0.33       0.68
EFIJ2465           VRXM2613           0.91            CTYG0289          XHDK5456          0.93           GXCX6714      LRQP3671      0.67       0.84
EFIJ2465           HDNR3167           0.69            CTYG0289          CTYG0289          1              GXCX6714      EYGI9391      0.33       0.67
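The slides do not state how the overall score is calculated, but the three rows above are consistent with an unweighted mean of the forename, surname and DoB scores. The short check below makes that observation explicit; treating the overall score as a simple mean is an inference from the table, not a documented rule.

```python
# (source 1 forename hash, source 2 forename hash, forename, surname, DoB scores)
candidates = [
    ("EFIJ2465", "ZASG1635", 0.78, 0.93, 0.33),
    ("EFIJ2465", "VRXM2613", 0.91, 0.93, 0.67),
    ("EFIJ2465", "HDNR3167", 0.69, 1.00, 0.33),
]

def overall_score(*scores: float) -> float:
    """Unweighted mean of the field scores, rounded to 2 decimal places."""
    return round(sum(scores) / len(scores), 2)

overall = [overall_score(f, s, d) for _, _, f, s, d in candidates]
print(overall)  # [0.68, 0.84, 0.67], matching the Overall Score column
```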

• Impractical to rely on clerical review when linking datasets at national level

• Clerical review is redundant when records are hash encoded

• For the 2011 Census we relied on clerical review to set thresholds for identifying the clerical region for scores derived from probabilistic matching

• Needed to develop methods that automate the classification of match statuses between candidate pairs

• Supervised or unsupervised methods

• Most of our research to date has focused on supervised methods, i.e. the use of training data

• Requires a small number of clerically matched records that are sampled from the candidates

Role of clerical review

• Beyond 2011 is unable to undertake large-scale clerical work but will have access to a small sample of candidate pairs

• Modelling approach that moves away from setting two thresholds – logistic regression

• Clerically match a small sample of unencrypted records:

- Fit a logistic regression model where y-variable is the decision to match or not

- Predictor variables are the similarity scores, name frequencies, geographic distances

• The idea is to substitute the clerical decision with an automated procedure

• Beta coefficients serve as weights for the matching variables

• Regression equation can be applied to remaining candidates between the two datasets

• Generates a single cut-off point (match where p >= 0.5)
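The approach above can be sketched in pure Python as a logistic regression fitted by gradient descent on a small, clerically labelled sample, then applied to unlabelled candidates with the single cut-off p >= 0.5. The training rows below are hypothetical, only three similarity scores are used as predictors, and the learning rate and epoch count are arbitrary choices; the pilot described later used more predictors and a statistical package.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Return [intercept, beta_1, ..., beta_k] fitted by batch gradient descent."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

def match_probability(beta, xi) -> float:
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))

# Hypothetical clerical sample: (forename, surname, DoB scores) -> match decision
X = [(1.00, 0.93, 1.00), (0.91, 0.93, 0.67), (0.91, 1.00, 1.00),
     (0.88, 1.00, 1.00), (1.00, 1.00, 0.67), (0.89, 0.90, 1.00),
     (0.78, 0.93, 0.33), (0.69, 1.00, 0.33), (0.82, 0.85, 0.33),
     (0.79, 0.83, 0.67), (0.81, 0.87, 0.33), (0.60, 0.70, 0.67)]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

beta = fit_logistic(X, y)

# Apply the fitted equation to remaining candidates; match where p >= 0.5.
p_strong = match_probability(beta, (1.00, 1.00, 1.00))
p_weak = match_probability(beta, (0.60, 0.70, 0.33))
print(p_strong >= 0.5, p_weak >= 0.5)
```

The fitted coefficients play the role of the weights on each matching variable, and the 0.5 cut-off replaces the two thresholds of a conventional probabilistic approach.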

Supervised Learning

• Piloted in SC-PR matching (12 year olds)

• Following auto-match, used similarity tables to identify 7303 records

• A clerical decision was made for 5% of records (365 candidate pairs)

• Fitted a logistic regression model with the dependent variable as the clerical match decision (binary outcome ‘Yes’ or ‘No’) and the following variables as predictors:

- Agreement between forenames (SPEDIS Score)

- Agreement between surnames (SPEDIS Score)

- Forename weight (highest on both sources)

- Surname weight (highest on both sources)

- Sex agreement (agree = 2, disagree = 1)

- Postcode agreement (full=5, sector=4, district=3, area=2, none=1)

- DoB agreement (full=3, M/Y=2, D/Y=1)

- Distance between OA centroids
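Two of the categorical predictors listed above can be derived as in the sketch below. It assumes postcodes are written with a space between the outward and inward parts, that the sector is the outward code plus the first inward character, and that a DoB pair agreeing on neither month/year nor day/year is coded 0, a value the slide does not specify.

```python
def _postcode_parts(postcode: str):
    """Return (area, district, sector, full) for a spaced UK postcode."""
    outward, inward = postcode.strip().upper().split()
    area = ""
    for ch in outward:
        if not ch.isalpha():
            break
        area += ch
    return area, outward, f"{outward} {inward[0]}", f"{outward} {inward}"

def postcode_agreement(a: str, b: str) -> int:
    """full=5, sector=4, district=3, area=2, none=1."""
    parts_a, parts_b = _postcode_parts(a), _postcode_parts(b)
    # Check from the most specific level (full) down to the least (area).
    for level, idx in ((5, 3), (4, 2), (3, 1), (2, 0)):
        if parts_a[idx] == parts_b[idx]:
            return level
    return 1

def dob_agreement(a: str, b: str) -> int:
    """full=3, month/year=2, day/year=1; 0 otherwise (assumed coding)."""
    day_a, month_a, year_a = a.split("/")
    day_b, month_b, year_b = b.split("/")
    if (day_a, month_a, year_a) == (day_b, month_b, year_b):
        return 3
    if month_a == month_b and year_a == year_b:
        return 2
    if day_a == day_b and year_a == year_b:
        return 1
    return 0

print(postcode_agreement("PO15 5RR", "PO11 7TG"))   # area agreement
print(dob_agreement("13/02/1965", "09/02/1965"))    # month/year agreement
```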

Model Design


Model Fit – Training Data

Variables in the Equation (Step 1)

Variable                     B         S.E.    df  Sig.
forename_similarity_score    0.14      0.039   1   0.000
surname_similarity_score     0.19      0.047   1   0.000
dob_agreement (d/m/y)                          2   0.000
dob_agreement (m/y)          -5.218    1.047   1   0.000
dob_agreement (d/y)          -5.153    0.684   1   0.000
forename (weight)            -109.177  65.86   1   0.097
surname (weight)             -441.276  99.903  1   0.000
sex agreement (disagree)     -2.104    4.051   1   0.603
pcode_agreement (exact)                        4   0.000
pcode_agreement (none)       -14.765   1.349   1   0.000
pcode_agreement (area)       -13.958   1.032   1   0.000
pcode_agreement (district)   -7.257    12.64   1   0.999
pcode_agreement (sector)     -4.032    12.17   1   0.999
distance (km)                -0.018    0.004   1   0.000
constant                     12.78     1.475   1   0.000


Classifying Matches

Classification Table (a)

                     Predicted Match     Percentage
Observed             No       Yes        Correct
Match   No           78       3          96.3
Match   Yes          2        283        99.3
Overall Percentage                       98.6

a. The cut value is .500
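The percentages in the classification table follow directly from its four cell counts. The short calculation below reproduces them and also derives precision, which the table itself does not report.

```python
# Cell counts from the classification table (cut value 0.5)
true_neg, false_pos = 78, 3     # observed non-matches
false_neg, true_pos = 2, 283    # observed matches

total = true_neg + false_pos + false_neg + true_pos

pct_nonmatch_correct = 100 * true_neg / (true_neg + false_pos)
pct_match_correct = 100 * true_pos / (true_pos + false_neg)   # recall
pct_overall = 100 * (true_neg + true_pos) / total
precision = 100 * true_pos / (true_pos + false_pos)

print(round(pct_nonmatch_correct, 1))  # 96.3
print(round(pct_match_correct, 1))     # 99.3
print(round(pct_overall, 1))           # 98.6
print(round(precision, 1))             # 99.0
```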

• The rationale behind this approach is to automate decision making for more difficult candidate pairs

• Logistic regression provides an initial method for identifying a single threshold for classifying match candidates

• Optimum method could be something else, e.g. Support Vector Machines, Decision Trees or Bayesian Methods

• To what extent can we quality assure these matches?

• Can we apply this modelling approach to the match-keys?

• How much training data do we need to produce accurate models?

• Can synthetic data be used as training data?

Further research on supervised learning

• Supervised methods in 2001 Census matching resulted in over-fitting

• 2011 Census matching used Expectation-Maximisation (EM algorithm) to calculate m and u probabilities

• Winkler et al 2007 (Data Quality and Record Linkage Techniques) outline the method in detail

– does not require training data

– can use data from all of the blocked candidates

– incorporated into probabilistic framework (Fellegi-Sunter model)

• But still requires clerical review to decide on the threshold score for match / non-match

• Further research is planned to explore ways of setting thresholds that do not involve clerical resolution
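A minimal sketch of how the EM algorithm can estimate m and u probabilities without training data, assuming binary agree/disagree comparisons and conditional independence between fields (the usual Fellegi-Sunter simplification). The data are simulated, and the starting values, field count and iteration count are arbitrary; this is not the implementation used for the 2011 Census.

```python
import random
from collections import Counter

def em_m_u(pattern_counts, n_fields, iters=200, p=0.1, m0=0.9, u0=0.1):
    """Estimate m (P(agree | match)), u (P(agree | non-match)) and match share p."""
    m, u = [m0] * n_fields, [u0] * n_fields
    eps = 1e-9
    total = sum(pattern_counts.values())
    for _ in range(iters):
        # E-step: posterior probability that each agreement pattern is a match
        post = {}
        for gamma in pattern_counts:
            pm, pu = p, 1.0 - p
            for j, agree in enumerate(gamma):
                pm *= m[j] if agree else 1.0 - m[j]
                pu *= u[j] if agree else 1.0 - u[j]
            post[gamma] = pm / (pm + pu)
        # M-step: re-estimate parameters from the weighted pattern counts
        match_mass = sum(post[g] * c for g, c in pattern_counts.items())
        non_mass = total - match_mass
        for j in range(n_fields):
            agree_m = sum(post[g] * c for g, c in pattern_counts.items() if g[j])
            agree_u = sum((1 - post[g]) * c for g, c in pattern_counts.items() if g[j])
            m[j] = min(max(agree_m / match_mass, eps), 1 - eps)
            u[j] = min(max(agree_u / non_mass, eps), 1 - eps)
        p = match_mass / total
    return m, u, p

def simulate(n, p, m, u, seed=0):
    """Simulated agreement patterns for blocked candidate pairs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n):
        probs = m if rng.random() < p else u
        counts[tuple(int(rng.random() < q) for q in probs)] += 1
    return counts

true_m = [0.95, 0.90, 0.92, 0.85]   # e.g. forename, surname, DoB, postcode
true_u = [0.05, 0.02, 0.01, 0.10]
counts = simulate(20000, 0.2, true_m, true_u)
m_hat, u_hat, p_hat = em_m_u(counts, n_fields=4)
print([round(v, 2) for v in m_hat], [round(v, 2) for v in u_hat], round(p_hat, 2))
```

The estimates yield match weights for each field, but, as noted above, a threshold on the resulting score still has to be chosen by some other means.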

Unsupervised approaches

• Tested whether probabilistic matching outperforms logistic regression

• Blocked records by postcode and used the EM algorithm to calculate match weights → probabilistic scores

• Good estimates of m and u probabilities

• Sampled 2500 records for clerical review

• Found the optimum cut-off point for the probabilistic score

• Fitted a logistic regression model with the candidates

• Undertook ROC curve analysis and precision / recall plots
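Finding a cut-off and drawing precision / recall plots both rest on the same calculation: classify candidates at a given threshold and compare with the clerical decision. A minimal sketch, using hypothetical scores and labels:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when candidates with score >= threshold are matched."""
    tp = fp = fn = 0
    for score, is_match in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical candidate scores with clerical decisions (1 = match)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.2, 0.5, 0.75):
    print(threshold, precision_recall(scores, labels, threshold))
```

Raising the threshold trades recall for precision; sweeping it over all observed scores gives the points of a precision / recall plot for either scoring method.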

Comparison with synthetic data

Comparison between logistic regression and probabilistic (EM algorithm)

• Probabilistic algorithms in Beyond 2011 will not identify all matches

(1) People with common names moving address

Probabilistic algorithms are designed to leave records unmatched where there are multiple candidates of similar likelihood

(2) Where data is missing or of poor quality

For a candidate pair to qualify for the logistic regression stage they must achieve a similarity threshold

• Having access to broad coverage sources provides opportunities for correctly identifying some of these matches

• By relying on the strength of a match made by someone else at the same address we can match difficult cases by association

Associative Matching


Person Matching

Administrative Records / Coverage Survey

Record to be matched:

Name        DoB         Address            Pcode
John Smith  09/02/1965  12 Segensworth Rd  PO15 5RR

Candidates:

Name        DoB         Address            Pcode
John Smith  09/02/1965  132 Kings Ave      SO4 2BR
John Smith  09/02/1965  1 Wessex Way       DE1 2BR
John Smith  09/02/1965  44 London Rd       W12 4DE

• 3 candidates with the same name and DoB. Unable to determine which is the correct match




Record to be matched, with co-resident:

Name          DoB         Address            Pcode
John Smith    09/02/1965  12 Segensworth Rd  PO15 5RR
Brenda Smith  19/05/1968  12 Segensworth Rd  PO15 5RR

Candidates, with co-residents:

Name             DoB         Address        Pcode
John Smith       09/02/1965  132 Kings Ave  SO4 2BR
Clare Smith      24/11/1966  132 Kings Ave  SO4 2BR
Ben Smith        05/09/1991  132 Kings Ave  SO4 2BR

John Smith       09/02/1965  1 Wessex Way   DE1 2BR
Brenda Smith     19/05/1968  1 Wessex Way   DE1 2BR

John Smith       09/02/1965  44 London Rd   W12 4DE
Jane Smith       01/03/1964  44 London Rd   W12 4DE
Elizabeth Smith  30/12/1936  44 London Rd   W12 4DE

• Second resident at the current address has been successfully matched to administrative data

• The record can now be matched by association

• Can also be applied in cases of poor data capture
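The household example above can be expressed as a simple rule: among tied candidates, accept the one whose co-resident has already been matched to a co-resident of the record in question. The sketch below uses the names and addresses from the slide; the source labels `source_a` and `source_b`, the function names and the "exactly one supported candidate" rule are illustrative assumptions.

```python
from collections import namedtuple

Person = namedtuple("Person", "name dob address")

def co_residents(person, source):
    """Other people recorded at the same address in the same source."""
    return [p for p in source if p.address == person.address and p != person]

def match_by_association(target, source_a, candidates, source_b, confirmed):
    """Return the single candidate supported by an already-confirmed co-resident match."""
    partners = co_residents(target, source_a)
    supported = [
        cand for cand in candidates
        if any((a, b) in confirmed
               for a in partners
               for b in co_residents(cand, source_b))
    ]
    return supported[0] if len(supported) == 1 else None

john_a = Person("John Smith", "09/02/1965", "12 Segensworth Rd")
brenda_a = Person("Brenda Smith", "19/05/1968", "12 Segensworth Rd")
source_a = [john_a, brenda_a]

source_b = [
    Person("John Smith", "09/02/1965", "132 Kings Ave"),
    Person("Clare Smith", "24/11/1966", "132 Kings Ave"),
    Person("Ben Smith", "05/09/1991", "132 Kings Ave"),
    Person("John Smith", "09/02/1965", "1 Wessex Way"),
    Person("Brenda Smith", "19/05/1968", "1 Wessex Way"),
    Person("John Smith", "09/02/1965", "44 London Rd"),
    Person("Jane Smith", "01/03/1964", "44 London Rd"),
    Person("Elizabeth Smith", "30/12/1936", "44 London Rd"),
]
candidates = [p for p in source_b if p.name == "John Smith"]

# Brenda has already been matched across the two sources.
confirmed = {(brenda_a, source_b[4])}

resolved = match_by_association(john_a, source_a, candidates, source_b, confirmed)
unresolved = match_by_association(john_a, source_a, candidates, source_b, set())
print(resolved.address if resolved else None)
```

Without the confirmed co-resident match the three candidates remain indistinguishable and the function returns `None`, which mirrors the first version of the example.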


Person Matching


Record to be matched:

Name        DoB         Address            Pcode
John Smith  09/02/1965  12 Segensworth Rd  PO15 5RR

Records in the other source:

Name        DoB         Address            Pcode
John Smith  09/02/1965  132 Kings Ave      SO4 2BR
Iohn Bmith  09/02/1965  1 Wessex Way       DE1 2BR
John Smith  09/02/1965  44 London Rd       W12 4DE

• Scanning error for name entry on the coverage survey

• Potential candidates cannot be identified from similarity tables

• Major requirement to understand quality loss between survey to admin source matching

• To date quality loss has been measured by undertaking controlled comparisons between conventional approaches and SRE simulations

• Number of caveats

- Sample bias (small scale / dob blocking)

- No understanding of geographic variance

- Limited clerical (compared to Census / Census QA matching)

• Undertaking record level comparison with Census QA to establish a more robust picture of precision and recall of the Beyond 2011 algorithm

Precision = number of true positives / number of B2011 matches

Recall = number of true positives / number of Census QA matches

Testing the Algorithms

• Running it for 8 LAs to start with: Powys, Westminster, Birmingham, Mid Devon, Lambeth, Southwark, Aylesbury Vale, Newham

• Adjusting the Beyond 2011 matching strategy to make it comparable with census QA

• Subset PR by CCS cluster in LA Matching to Census / CCS LA

• Comparisons in cross LA matching to be undertaken at later date

Census QA Comparison Exercise

Census QA & B2011 Match Results

[Chart: Comparison of PR to Census/CCS match rates: Census QA and B2011. Vertical axis 0% to 100%. Series: Census QA match rate, B2011 match rate, B2011 Precision, B2011 Recall]

Summary Tables

Local Authority  PR Count  Census QA  Beyond 2011  B2011 True Positives
Birmingham       21,313    17,482     17,255       17,185
Westminster      9,626     6,268      6,178        6,152
Lambeth          10,532    6,740      6,684        6,633
Newham           13,461    9,193      9,032        8,990
Southwark        9,993     6,627      6,496        6,472
Powys            1,648     1,554      1,539        1,536
Aylesbury Vale   2,732     2,455      2,448        2,441
Mid Devon        613       543        543          542

Local Authority  Census QA match rate  B2011 match rate  B2011 false positives  B2011 false negatives
Birmingham       82.0%                 81.0%             0.4%                   1.7%
Westminster      65.1%                 64.2%             0.4%                   1.9%
Lambeth          64.0%                 63.5%             0.8%                   1.6%
Newham           68.3%                 67.1%             0.5%                   2.2%
Southwark        66.3%                 65.0%             0.4%                   2.3%
Powys            94.3%                 93.4%             0.2%                   1.2%
Aylesbury Vale   89.9%                 89.6%             0.3%                   0.6%
Mid Devon        88.6%                 88.6%             0.2%                   0.2%
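The percentages in the second table can be reproduced from the counts in the first. The formulas below are inferred from the figures rather than stated on the slides: match rates are taken relative to the PR count, the false positive rate is 1 − precision (relative to B2011 matches) and the false negative rate is 1 − recall (relative to Census QA matches).

```python
# (PR count, Census QA matches, B2011 matches, B2011 true positives)
counts = {
    "Birmingham":     (21313, 17482, 17255, 17185),
    "Westminster":    (9626, 6268, 6178, 6152),
    "Lambeth":        (10532, 6740, 6684, 6633),
    "Newham":         (13461, 9193, 9032, 8990),
    "Southwark":      (9993, 6627, 6496, 6472),
    "Powys":          (1648, 1554, 1539, 1536),
    "Aylesbury Vale": (2732, 2455, 2448, 2441),
    "Mid Devon":      (613, 543, 543, 542),
}

def rates(pr, qa, b2011, tp):
    """Percentages rounded to 1 decimal place, as in the summary table."""
    return {
        "qa_match_rate": round(100 * qa / pr, 1),
        "b2011_match_rate": round(100 * b2011 / pr, 1),
        "false_positives": round(100 * (b2011 - tp) / b2011, 1),  # 1 - precision
        "false_negatives": round(100 * (qa - tp) / qa, 1),        # 1 - recall
    }

summary = {la: rates(*row) for la, row in counts.items()}
for la, row in summary.items():
    print(la, row)
```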

• The B2011 algorithms automate the matching process for anonymised data

• False positives are minimal

• False negatives are currently higher than target (<1%)

• Consider additional methods to improve matching accuracy

- Longitudinal data

- Improved data capture

- Widening the blocking strategy

- Improving on classification methods

• Research will continue in phase 2 of the programme

Summary and Future Research


• Corporate strategy for record linkage at ONS

• Collaboration with statistics agencies internationally

• ONS partnering with ADRC England

• Working with researchers outside of ONS

• Publishing research

Summary and Future Research
