23
Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph , May 24 th 2010

Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Embed Size (px)

Citation preview

Page 1: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Record Linkage at the Minnesota Population Center

Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick

RecordLink Workshop, 2010University of Guelph , May 24th 2010

Page 2: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Introduction

• Overview of linkage process– Prelims vs. final releases

• Name commonness scores• Error rate estimation• Weights• Looking ahead

Page 3: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Historical Record Linkage – U.S.

1850 1% sample1860 1% sample1870 1% sample

1880 complete-count 1900 1% sample1910 1% sample1920 1% sample1930 1% sample

Page 4: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Historical Record Linkage at the MPC

• Primary goals are to create linked sets that are– Representative – Accurate

Page 5: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Historical Record Linkage at the MPC

• Representative links– We use a very limited set of variables to predict

links to avoid linkage biasBlock by birthplace, sex and raceGiven (first) nameSurname (last) nameAge

Page 6: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Historical Record Linkage at the MPC

• Accurate links– If there is more than one ‘potential’ link for a

given person we exclude them all– We throw away a lot of potential links

Page 7: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Historical Record Linkage at the MPC

• Create given and surname and age similarity scores– Jaro-Winkler string similarity algorithm– 20% age difference score

• We apply name and age similarity thresholds to limit output of potential links

Page 8: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Additional Variables Based on Age

• Age– Age difference (absolute value , normalized)*– Age categories, in five-year groups*

Page 9: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Additional Variables Based on Name

• Phonetic Match (binary)– Double Metaphone– NYSIIS*

• Middle initials (if present) must not conflict (binary)*

Page 10: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Additional Variables Based on Name

• Name Commonness Scores*– Our answer to incorporating probabilistic information

into the process without complete standardization of all name strings.

– Proportion of records (by race, birthplace, and sex) in the 1880 data with a Jaro-Winkler score greater than 0.9

– Name commonness score works in tandem with a birthplace density measure, which is the proportion of 1880 records for specific birthplaces (by race and sex)

Page 11: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Classification of links

• Comparisons that beat the thresholds become ‘potential links’ that are classified as ‘true’ and ‘false’ links by two SVM models– One model includes age variables, the other does not*

• Link is accepted if both models call it a ‘true’ link and there are no conflicts

Page 12: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Name CommonnessTable 6. Distribution of 1870 Records (Males) by Name Commonness Scores

Page 13: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Linkage Rate by Name Commonness

Page 14: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Linkage Rates by Name Commonness and Birthplace Population Size

Page 15: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Linkage Rates by Name Commonness and Birthplace Population Size

Table 8. Linkage Rate for Native-Born 1870 Males by Birthplace Rank (number of males by birthplace) and Name Commonness Scores

Page 16: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Occupational Scores and Name Commonness

Page 17: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Estimating error rates

• Calculate migration rates by different slices of data, e.g. five-year age cats, age difference

• Split brothers• Compare link made in one dataset to link

made in another for same group of people• Compare to linked set made by another

independent source: Pleiades

Page 18: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Selected Linked Household – 1870-1880

LINKTYPE LAST70 FIRST70 LAST80 FIRST80 RELATE70 RELATE80 AGE70 AGE80

household UNDERWOOD NORMAN Head 52

household UNDERWOOD

MARY UNDERWOOD

MARY Spouse Head 33 49

household UNDERWOOD LUTHER Son 11

household UNDERWOOD IRVING UNDERWOOD ERVIN Son Son 3 13

primary UNDERWOOD VANDER UNDERWOOD VANDER Son Son 1 10

household UNDERWOOD ROSA Daughtr 18

household UNDERWOOD UNDERWOOD CHARLES Son 8

household UNDERWOOD UNDERWOOD ADDI Daughtr 5

Page 19: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Weights

• The weights are based on the linkable population, which is always based on the terminal census year data.

• Based on an iterative process• We capped weight minimums and maximums

(min is 1/5 the avg. weight for the subgroup; max is 4 times the avg. weight for subgroup)

Page 20: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Final Release Data Set Size, Males

MALEnat-b white for-b white af-am

1850 7013 299 82

1860 10426 634 235

1870 17725 879 2180

1900 18596 1515 1334

1910 14855 995 791

1920 10050 511 504

1930 9018 336 352

Page 21: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Final Release Data Set, Females

FEMALEnat-b white

married single Formerly1850 1077 468 798

1860 2253 1495 843

1870 4254 6700 1134

1900 3274 4241 1162

1910 1891 1407 1124

1920 793 849 894

1930 221 700 545

Page 22: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Final Release Data Set Size, Couples

COUPLEnat-b white for-b white af-am

1850 2135 227 6

1860 4538 932 19

1870 8862 2267 407

1900 7745 2132 180

1910 4650 1101 102

1920 2107 416 26

1930 612 105 7

Page 23: Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May

Looking Ahead

• Hope to alleviate small N problem in the future– Link 1900 and 1930 5% samples to 1800 complete

count– 1850 complete count database currently under

construction – Hope to have complete count data for 1860, 1870,

and 1900 in the future