An Automated Record Linkage System for the Canadian Census, 1871-1881
L. Antonie (University of Guelph)
P. Baskerville (Universities of Alberta and Victoria)
K. Inwood (University of Guelph)
J. A. Ross (University of Guelph)
Record Linkage Workshop, May 24th-25th, 2010, University of Guelph
What we are working towards
• ‘Unbiased’ links connecting individuals/households over several census years
• A comprehensive infrastructure of longitudinal data
[Diagram: censuses to be linked: Canada 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916; US 1880 and 1900]
Current Work
[Diagram: 100% of 1871 Census (3,601,663 records) → Automatic Linking → 100% of 1881 Census (4,277,807 records)]
Partners and collaborators: FamilySearch, Church of Latter Day Saints, Minnesota Population Center, Université de Montréal, University of Alberta
Existing (True) Links
• Ontario Industrial Proprietors – 8429 links
• Logan Township – 1760 links
• St. James Church, Toronto – 232 links
• Quebec City Boys – 1403 links
• Bias
– family context
– others?
[Map: Logan Twp and Guelph]
Attributes for Automatic Linking
• Last Name – string
• First Name – string
• Gender – binary
• Age – number
• Birthplace – number
• Marital status – single, married, divorced, widowed, unknown
Automatic Linkage
• The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
• The system:
Data Cleaning and Standardization
• Cleaning
– Names – remove non-alphanumeric characters; remove titles
– Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
– All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
• Standardization
– Birthplace codes and granularity
– Marital status
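The cleaning steps above could look roughly like the sketch below. It is illustrative only: the title list, the unit table, the marital-status vocabulary and the choice to express sub-year ages as fractions of a year are all assumptions, not the lists or rules used by the actual system.

```python
import re
import unicodedata

# Hypothetical lookup tables; the real system's lists are not given on the slides.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir", "mme", "mlle"}
UNITS_PER_YEAR = {"": 1, "year": 1, "years": 1, "an": 1, "ans": 1,
                  "month": 12, "months": 12, "mois": 12,
                  "week": 52, "weeks": 52, "semaines": 52,
                  "day": 365, "days": 365, "jours": 365}
MARITAL = {"married": "married", "marie": "married", "mariee": "married",
           "single": "single", "celibataire": "single",
           "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
           "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced"}


def strip_accents(text):
    """Drop diacritics so that 'mariée' and 'mariee' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Remove non-alphanumeric characters and titles, then upper-case."""
    text = re.sub(r"[^a-z0-9 ]", "", strip_accents(raw).lower())
    return " ".join(t for t in text.split() if t not in TITLES).upper()


def parse_age(raw):
    """Turn '45' or '3 months' into an age in years; None if unreadable."""
    text = strip_accents(str(raw)).lower().strip()
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", text)
    if match is None or match.group(2) not in UNITS_PER_YEAR:
        return None
    return int(match.group(1)) / UNITS_PER_YEAR[match.group(2)]


def standardize_marital(raw):
    """Map English/French notations onto one vocabulary."""
    return MARITAL.get(strip_accents(raw).lower().strip(), "unknown")


print(clean_name("Mrs. O'Brien"), parse_age("3 months"), standardize_marital("Mariée"))
```

Values that cannot be parsed come back as None or "unknown" rather than being guessed, so they can be reviewed separately.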
Computational Expense
• Very expensive to compare all the possible pairs of records
• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
• Run-time estimate: (3.5M × 4M record pairs × 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days, i.e. about 40.5 days per attribute compared.
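The estimate can be checked with a few lines of arithmetic. With the inputs stated on the slide, one attribute costs about 40.5 days and two attributes about 81 days. The variable and function names are chosen here for illustration.

```python
# Inputs as stated on the slide (rounded record counts).
records_1871 = 3_500_000
records_1881 = 4_000_000
comparisons_per_second = 4_000_000
seconds_per_day = 60 * 60 * 24


def runtime_days(attributes_compared):
    """Days needed to compare every 1871 record with every 1881 record."""
    comparisons = records_1871 * records_1881 * attributes_compared
    return comparisons / comparisons_per_second / seconds_per_day


for n in (1, 2):
    print(f"{n} attribute(s): {runtime_days(n):.1f} days")
```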
Managing Computational Expense
• Blocking
– By first letter of last name
– By birthplace
• Using HPC
– Running the system on multiple processors
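A minimal sketch of blocking follows. It assumes the two criteria are combined into a single key (first letter of last name plus birthplace code); the slides do not say whether they were combined or applied separately, and the record layout and field names are hypothetical.

```python
from collections import defaultdict


def block_key(record):
    """Assumed key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(records_a, records_b):
    """Yield only the pairs whose records fall in the same block."""
    blocks = defaultdict(list)
    for rec_b in records_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Toy records; the birthplace codes follow the ones listed in the evaluation table.
census_1871 = [
    {"id": 1, "last_name": "Smith", "birthplace": 15030},
    {"id": 2, "last_name": "Tremblay", "birthplace": 15081},
    {"id": 3, "last_name": "Stewart", "birthplace": 41400},
]
census_1881 = [
    {"id": 10, "last_name": "Smyth", "birthplace": 15030},
    {"id": 11, "last_name": "Scott", "birthplace": 15030},
    {"id": 12, "last_name": "Tremblay", "birthplace": 15081},
    {"id": 13, "last_name": "Taylor", "birthplace": 41000},
]

pair_ids = [(a["id"], b["id"]) for a, b in candidate_pairs(census_1871, census_1881)]
print(pair_ids)  # 3 candidate pairs instead of all 3 x 4 = 12
```

The trade-off is that a true match is lost whenever the first letter of the surname or the birthplace code differs between the two censuses.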
Record Comparison
• Comparing Strings
– Jaro-Winkler
– Edit Distance
– Double Metaphone
• Age
– +/- 2 years
• Exact matches
– Gender
– Birthplace
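The comparison step can be sketched as a function that turns a candidate pair into a vector of similarity features. Only edit distance is implemented here; Jaro-Winkler and Double Metaphone are omitted. The field names, the normalization of edit distance to a 0-1 similarity, and the reading of "+/- 2 years" as a tolerance around the expected ten years of ageing between censuses are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def compare(rec_1871, rec_1881, years_apart=10, age_tolerance=2):
    """Feature vector for one candidate pair (hypothetical field names)."""
    expected_age = rec_1871["age"] + years_apart
    return [
        name_similarity(rec_1871["last_name"], rec_1881["last_name"]),
        name_similarity(rec_1871["first_name"], rec_1881["first_name"]),
        1.0 if abs(rec_1881["age"] - expected_age) <= age_tolerance else 0.0,
        1.0 if rec_1871["gender"] == rec_1881["gender"] else 0.0,
        1.0 if rec_1871["birthplace"] == rec_1881["birthplace"] else 0.0,
    ]


a = {"last_name": "SMITH", "first_name": "JOHN", "age": 25, "gender": "M", "birthplace": 15030}
b = {"last_name": "SMYTH", "first_name": "JOHN", "age": 36, "gender": "M", "birthplace": 15030}
print(compare(a, b))
```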
Classification
• Classifier
– Support Vector Machines
– 5-fold cross validation
• Training Data
– True links found by experts
– Ontario proprietors
• Classes
– Match
– Non-match
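The 5-fold cross-validation procedure can be sketched without any external library. The classifier below is a deliberately simple stand-in (a cut-off on the mean feature value), not a Support Vector Machine; in the real system an SVM would be trained in its place. All names and the toy data are hypothetical.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean(values):
    return sum(values) / len(values)


def train_mean_threshold(train_x, train_y):
    """Stand-in classifier: cut halfway between the two class means."""
    match_mean = mean([mean(x) for x, y in zip(train_x, train_y) if y == 1])
    other_mean = mean([mean(x) for x, y in zip(train_x, train_y) if y == 0])
    cut = (match_mean + other_mean) / 2
    return lambda x: 1 if mean(x) >= cut else 0


def cross_validate(features, labels, train, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and repeat k times."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predict = train([features[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        correct = sum(predict(features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


# Toy feature vectors: 10 matches (label 1) and 10 non-matches (label 0).
features = [[0.8 + 0.01 * i, 1.0, 1.0] for i in range(10)] + \
           [[0.2 + 0.01 * i, 0.0, 1.0] for i in range(10)]
labels = [1] * 10 + [0] * 10
accuracies = cross_validate(features, labels, train_mean_threshold)
print(accuracies)
```

Each record is held out exactly once, so the averaged accuracy is measured only on pairs the classifier did not see during training.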
Linkage Results
Province Linkage Rate (%)
New Brunswick 24.45
Nova Scotia 21.50
Ontario 18.36
Quebec 17.45
Linkage Results - Evaluation
True Links Set Total TP (%) FP (%)
Ontario_Props 1647 21.59 9.28
Logan 1760 21.64 8.85
St_James 232 24.72 7.12
Les_Boys 1403 17.99 11.41
Province TP FP Possible Unsure
New Brunswick 66 27 6 1
Nova Scotia 70 22 5 -
Ontario 53 40 5 2
Quebec 42 52 6 -
Linkage Results - Evaluation
Attribute ON71 QC71 CAN81 ON_Props Linked(ON) Linked(QC)
Gender Distribution
Female 47.46 49.83 49.35 48.63 45.26 43.50
Male 49.69 50.00 50.64 51.33 54.74 56.50
Age
0-15 42.20 41.84 38.68 60.28 40.96 43.24
15-25 20.12 20.72 21.22 9.44 20.70 22.56
25-50 26.42 25.78 27.68 31.35 26.95 23.07
>50 11.26 11.66 12.42 8.93 11.39 11.13
Birthplace
ON (15030) 67.29 0.57 34.04 73.24 66.30 0.48
QC (15081) 2.45 91.71 30.70 2.40 2.57 92.08
ENG (41000) 7.44 1.11 4.02 6.74 10.00 1.37
IRE (41100) 5.48 0.98 2.75 5.84 5.40 0.94
SCO (41400) 9.35 3.17 4.45 7.33 8.57 2.83
GER (45300) 1.23 0.06 0.56 1.12 2.10 0.07
USA (9900) 2.59 1.23 1.77 2.19 3.96 1.72
Marital Status
Married (1) 30.36 30.22 31.78 39.75 29.11 23.13
Widowed (5) 3.21 3.02 3.66 0.86 4.07 3.64
Single (6) 66.43 66.75 64.52 59.39 66.82 73.24
Directions to Improve
• Common patterns in incorrect links
– Big age difference
– Change in marital status for females
– First name change
• Probability estimate score of the classifier
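One way to act on the common patterns listed above is a rule-based check applied to each proposed link. The sketch below is hypothetical: the field names, the ten-year gap between censuses, the two-year tolerance and the exact form of each rule are assumptions, not the rules used by the authors.

```python
def suspicious_patterns(rec_1871, rec_1881, years_apart=10, max_age_gap=2):
    """Return the list of error patterns that a proposed link exhibits."""
    reasons = []
    expected_age = rec_1871["age"] + years_apart
    if abs(rec_1881["age"] - expected_age) > max_age_gap:
        reasons.append("big age difference")
    if (rec_1871["gender"] == "F"
            and rec_1871["marital_status"] != rec_1881["marital_status"]):
        reasons.append("marital status change (female)")
    if rec_1871["first_name"] != rec_1881["first_name"]:
        reasons.append("first name change")
    return reasons


person_1871 = {"first_name": "MARY", "gender": "F", "age": 20, "marital_status": "single"}
good_1881 = {"first_name": "MARY", "gender": "F", "age": 30, "marital_status": "single"}
doubtful_1881 = {"first_name": "MARIA", "gender": "F", "age": 36, "marital_status": "married"}

print(suspicious_patterns(person_1871, good_1881))
print(suspicious_patterns(person_1871, doubtful_1881))
```

Links with a non-empty result could be dropped or sent for manual review; dropping them lowers the linkage rate as well as the false positives.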
Results – Common Patterns
Before
Province Linkage Rate (%)
New Brunswick 24.45
Nova Scotia 21.50
Ontario 18.36
Quebec 17.45
After
Province Linkage Rate (%) Diff.
New Brunswick 22.24 -2.21
Nova Scotia 18.72 -2.78
Ontario 15.68 -2.68
Quebec 14.82 -2.63
Results – Common Patterns
Before
True Links Set Total TP (%) FP (%)
Ontario_Props 1647 21.59 9.28
Logan 1760 21.64 8.85
St_James 232 24.72 7.12
Les_Boys 1403 17.99 11.41
After
True Links Set TP (%) TP Diff. FP (%) FP Diff.
Ontario_Props 20.48 -1.11 7.32 -1.96
Logan 20.36 -1.28 7.25 -1.60
St_James 23.00 -1.72 5.92 -1.20
Les_Boys 16.66 -1.33 10.36 -1.05
Results – Classification Scores
Classification score: 0.8
True Links Set Total TP (%) FP (%)
Logan 1760 19.37 4.86
St_James 232 22.06 3.43
Les_Boys 1403 15.25 5.94
Classification score: 0.85
True Links Set Total TP (%) FP (%)
Logan 1760 18.97 4.61
St_James 232 22.06 3.00
Les_Boys 1403 14.64 5.31
Classification score: 0.9
True Links Set Total TP (%) FP (%)
Logan 1760 18.125 3.78
St_James 232 21.63 2.40
Les_Boys 1403 13.94 3.97
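Assuming the values 0.8, 0.85 and 0.9 are cut-offs on the classifier's probability estimate, the filtering behind these results can be sketched as follows. The tuple layout and the scores are made up for illustration.

```python
def keep_confident(scored_links, threshold):
    """Keep only the links whose classifier score reaches the threshold."""
    return [link for link in scored_links if link[2] >= threshold]


# Hypothetical links: (1871 record id, 1881 record id, classifier score).
scored_links = [(1, 10, 0.95), (2, 12, 0.88), (3, 17, 0.82), (4, 21, 0.60)]

kept_counts = {t: len(keep_confident(scored_links, t)) for t in (0.8, 0.85, 0.9)}
print(kept_counts)
```

Raising the cut-off removes low-confidence links, which is consistent with both the TP and FP percentages falling as the score increases.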
Conclusions
• Linking people across 1871-1881 Canadian censuses
• Preliminary automated linkage system
• More evaluation and experimentation are needed
Acknowledgements
• University of Guelph
• Ontario Ministry of Research and Innovation
• SHARCNET
• FamilySearch, Church of Latter Day Saints
• Minnesota Population Center
• University of Alberta
• Université de Montréal/PRDH
• Université Laval/CIEQ