Chris Dibben University of Edinburgh Linking historical administrative data


Page 1: Chris Dibben University of Edinburgh

Chris Dibben, University of Edinburgh

Linking historical administrative data

Page 2: Chris Dibben University of Edinburgh

Context
• History of very important contributions:
  – Dutch Famine Birth Cohort Study: epigenetics, thrifty phenotype
  – Överkalix study: epigenetics, sex differences
  – UK Longitudinal Study: health inequalities

Page 3: Chris Dibben University of Edinburgh

Two new developmental projects

• Scottish Mental Surveys 1932 and 1947

• Scottish civil registration data

• New cohorts for people now in old age

Page 4: Chris Dibben University of Edinburgh

The ‘Scottish Mental Survey’

Page 5: Chris Dibben University of Edinburgh

1947 Scottish Mental Survey (birth 1936)

[Diagram: data sources linked to the survey]
• 1939 register: ED code, address, household members (marital status, occupation)
• 1939 books recorded the date of death (up to 1980)
• Linkage to the death database (1974 onwards)
• The Scottish Longitudinal Study
• Scottish morbidity records
• Education, employment

Page 6: Chris Dibben University of Edinburgh

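[Timeline figure: life course of the 1936 birth cohort]

Year  1936  1947  1970  1991  2001  2011
Age   0     11    34    55    65    75

Labels shown on the timeline:
• Early life environment
• Birth (1936)
• Mental ability (1947, age 11)
• School achievement (time estimated)
• Occupation (estimated)
• Hospitalisation
• Mortality
• Detailed household/individual information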

Page 7: Chris Dibben University of Edinburgh

Background – Scottish vital events

• Civil registration of births, deaths and marriages in Scotland began on 1 January 1855

• All historical vital events records have been converted into digital image format with a supporting index

• Modern vital events data (from 1974 onwards) are available electronically

Page 8: Chris Dibben University of Edinburgh

Digitising Scotland

• Approximately 50 million occupation strings, 8 million causes of death

• Classify occupations to the Historical International Standard Classification of Occupations (HISCO)

• Classify causes of death to a modified ICD-10

• Each with a location

Page 9: Chris Dibben University of Edinburgh

Historical Geocoding

[Diagram: GEOCODING TOOL + GEOMETRY FEATURES, with one set of geometry features for each period: 1710, 1810, 1910, 2010]

Year  Historical address
2010  Ladywell House, Ladywell Road, Edinburgh, EH12 7T
1910  Ladywell House, Ladywell Street, Edinburgh
1810  Ladywell House, Ladywell Street, Edinburgh
1710  Ladywell House, Lady[vv]ell Street, Edinburgh

Annotations on the table: postcode change; without postcode; interpretation error

• Change of road networks (new roads replace old) over time
• Change of road names over time
• Interpretation errors from the address digitisation
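The problems listed on this slide suggest matching each address against place names valid for its own period, with some tolerance for misread characters. A minimal Python sketch of that idea follows. The per-period gazetteer `GAZETTEER`, its coordinates, the 0.8 threshold and the use of the standard-library `difflib` are all assumptions made for illustration; none of them is taken from the project's actual geocoding tool.

```python
from difflib import SequenceMatcher

# Hypothetical gazetteer: period -> {street name: (lat, lon)}.
# Names and coordinates are illustrative placeholders, not project data.
GAZETTEER = {
    1710: {"ladywell street": (55.94, -3.28)},
    1810: {"ladywell street": (55.94, -3.28)},
    1910: {"ladywell street": (55.94, -3.28)},
    2010: {"ladywell road": (55.94, -3.28)},
}


def nearest_period(year):
    """Pick the gazetteer period closest to the record's year."""
    return min(GAZETTEER, key=lambda period: abs(period - year))


def geocode(street, year, threshold=0.8):
    """Return (matched name, coordinates, score), or None if nothing is close enough."""
    period = nearest_period(year)
    query = street.lower().strip()
    best_name, best_score = None, 0.0
    for name in GAZETTEER[period]:
        score = SequenceMatcher(None, query, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None
    return best_name, GAZETTEER[period][best_name], best_score


# The misread "Lady[vv]ell" still resolves to the 1710 street name.
result = geocode("Lady[vv]ell Street", 1710)
print(result)
```

Fuzzy matching of this kind only helps with transcription errors. A renamed road (Ladywell Street becoming Ladywell Road) would more likely need an explicit table of name changes over time.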

Page 10: Chris Dibben University of Edinburgh
Page 11: Chris Dibben University of Edinburgh
Page 12: Chris Dibben University of Edinburgh

Challenges

• Significant methodological issues:

– How can we consistently code occupational data so that researchers can explore changing patterns and trends?

– How can we automate this process so that the majority of records do not need to be manually coded?


Page 13: Chris Dibben University of Edinburgh

Digitising Scotland

• Records of births, marriages and deaths recorded in Scotland from 1855 to the present day.


Page 14: Chris Dibben University of Edinburgh

Page 15: Chris Dibben University of Edinburgh
Page 16: Chris Dibben University of Edinburgh
Page 17: Chris Dibben University of Edinburgh
Page 18: Chris Dibben University of Edinburgh

Page 19: Chris Dibben University of Edinburgh

Experimental Dataset

• Use a dataset with similar content for experiments
• 60,000 records from the Cambridge Family History Study (records from 1800-1990)
• Occupation descriptions and associated HISCO codes
• HISCO coding done by historians
• Dataset contains 330 different HISCO codes

Page 20: Chris Dibben University of Edinburgh

HISCO Hierarchy Example
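The hierarchy figure itself is not preserved in this transcript. As a rough illustration, HISCO codes are five digits long and are usually read hierarchically: the first digit gives the major group, the first two the minor group, the first three the unit group, and all five the specific occupation. The helper below is a sketch under that assumption; `hisco_levels` is a hypothetical name, not part of any HISCO tooling.

```python
def hisco_levels(code):
    """Split a five-digit HISCO code into its hierarchical prefixes.

    Assumes the common reading of the code structure:
    1 digit = major group, 2 = minor group, 3 = unit group, 5 = occupation.
    """
    code = str(code).zfill(5)  # restore a leading zero lost in integer storage
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a five-digit HISCO code: {code!r}")
    return {
        "major_group": code[:1],
        "minor_group": code[:2],
        "unit_group": code[:3],
        "occupation": code,
    }


# 62460 is the code given for "Horse Worker" elsewhere in these slides.
levels = hisco_levels(62460)
print(levels)
```

Rolling codes up to a shorter prefix is one way to compare classifications at a coarser level when an exact five-digit match is too strict.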

Page 21: Chris Dibben University of Edinburgh

Classification Example

String from record | Gold Standard Classification | Automatic Classification Output
Farm horseman | 62460 Horse Worker | 62460 Horse Worker
Shoe maker | 80110 Shoemaker, General | 80110 Shoemaker, General
Fireman (railway) | 98330 Railway Steam-Engine Fireman | 98330 Railway Steam-Engine Fireman
Fireman | 58100 Fire-Fighter | 58100 Fire-Fighter
Stationer | 41000 Working Proprietors (Wholesale and Retail Trade) | 91000 Paper and Paperboard product makers
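Comparing automatic output against a gold standard comes down to counting agreements. The sketch below does this for the five example rows on this slide only; it says nothing about accuracy on the full dataset, and the function names are illustrative.

```python
# (record string, gold-standard code, automatic code) for the five rows on this slide
examples = [
    ("Farm horseman", "62460", "62460"),
    ("Shoe maker", "80110", "80110"),
    ("Fireman (railway)", "98330", "98330"),
    ("Fireman", "58100", "58100"),
    ("Stationer", "41000", "91000"),
]


def accuracy(rows):
    """Fraction of rows where the automatic code equals the gold standard."""
    correct = sum(1 for _, gold, predicted in rows if gold == predicted)
    return correct / len(rows)


def disagreements(rows):
    """Rows where the automatic code differs from the gold standard."""
    return [row for row in rows if row[1] != row[2]]


print(f"agreement on these rows: {accuracy(examples):.0%}")
print("to review:", disagreements(examples))
```

Listing the disagreements is often more useful than the single figure, since cases such as "Stationer" show where the string alone is ambiguous between a trade and a manufacturing occupation.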

Page 22: Chris Dibben University of Edinburgh


Page 23: Chris Dibben University of Edinburgh

Approach

• Text analysis
• Supervised machine learning
  – Apache Mahout framework
• Combination of these techniques

Page 24: Chris Dibben University of Edinburgh

Supervised Machine Learning

Training Data → Machine Learning → Prediction Model
Unseen Data → Prediction Model → Predicted Classification

Page 25: Chris Dibben University of Edinburgh

Supervised Machine Learning

Training Data → Machine Learning → Prediction Model
Unseen Data → Prediction Model → Predicted Classification

Training data:
• Farm horseman 62460
• Shoe maker 80110
• Fireman 58100
• Stationer 41000

Page 26: Chris Dibben University of Edinburgh

Supervised Machine Learning

Training Data → Machine Learning → Prediction Model
Unseen Data → Prediction Model → Predicted Classification

Training data:
• Farm horseman 62460
• Shoe maker 80110
• Fireman 58100
• Stationer 41000

Unseen data:
• Farm horseman
• Boot maker
• Fireman
• Painter

Page 27: Chris Dibben University of Edinburgh

Supervised Machine Learning

Training Data → Machine Learning → Prediction Model
Unseen Data → Prediction Model → Predicted Classification

Training data:
• Farm horseman 62460
• Shoe maker 80110
• Fireman 58100
• Stationer 41000

Unseen data:
• Farm horseman
• Boot maker
• Fireman
• Painter

Predicted classification: ?
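The train-then-predict flow on this slide can be sketched with a very small word-based Naïve Bayes classifier. The slides name Apache Mahout (a Java framework) as the tool used; the code below is a standard-library Python stand-in for illustration only, and its tokenisation, smoothing and the choice to return `None` for strings with no known words are assumptions, not the project's implementation.

```python
import math
from collections import Counter, defaultdict


def tokens(text):
    """Lower-case whitespace tokenisation (an illustrative simplification)."""
    return text.lower().split()


def train(examples):
    """Count tokens per class from (occupation string, HISCO code) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for text, code in examples:
        class_counts[code] += 1
        for tok in tokens(text):
            token_counts[code][tok] += 1
            vocabulary.add(tok)
    return class_counts, token_counts, vocabulary


def predict(model, text):
    """Return the most probable code, or None if no token was seen in training."""
    class_counts, token_counts, vocabulary = model
    known = [tok for tok in tokens(text) if tok in vocabulary]
    if not known:
        return None
    total = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denom = sum(token_counts[code].values()) + len(vocabulary)
        for tok in known:
            # Laplace-smoothed token likelihood
            score += math.log((token_counts[code][tok] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


training = [
    ("Farm horseman", "62460"),
    ("Shoe maker", "80110"),
    ("Fireman", "58100"),
    ("Stationer", "41000"),
]
model = train(training)
unseen = ["Farm horseman", "Boot maker", "Fireman", "Painter"]
predictions = {text: predict(model, text) for text in unseen}
print(predictions)
```

With this toy model, "Boot maker" is assigned the shoemaker code because it shares the word "maker" with a training string, while "Painter" shares no words with the training data and gets no prediction, which mirrors the "?" on the slide.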

Page 28: Chris Dibben University of Edinburgh

[Figure: cause-of-death strings "Asthma" and "Miners asthma", each shown with a value of 100%; associated terms: spasmodic, collier's, miner's, miners, asthma, dropsy, bronchial]

Page 29: Chris Dibben University of Edinburgh

[Chart: Classification Accuracy. x-axis: Techniques (String Similarity, SGD, Naïve Bayes, Majority Vote, Confidence Weighted 1, Confidence Weighted 20); y-axis: Accuracy %, 10 to 100]
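The chart compares single techniques with combinations of them. The sketch below shows two common ways classifier outputs can be combined: a plain majority vote and a vote weighted by each classifier's confidence. The slides do not define "Confidence Weighted 1" or "Confidence Weighted 20", so the weighted scheme here is an assumption, and the example votes are invented.

```python
from collections import Counter, defaultdict


def majority_vote(codes):
    """codes: one predicted code per classifier. Ties go to the earliest classifier."""
    counts = Counter(codes)
    top = max(counts.values())
    for code in codes:
        if counts[code] == top:
            return code


def confidence_weighted_vote(predictions):
    """predictions: (code, confidence in [0, 1]) pairs, one per classifier."""
    totals = defaultdict(float)
    for code, confidence in predictions:
        totals[code] += confidence
    return max(totals, key=totals.get)


# Hypothetical outputs for the string "Stationer" from three classifiers
votes = [("91000", 0.40), ("41000", 0.95), ("91000", 0.45)]
by_majority = majority_vote([code for code, _ in votes])
by_confidence = confidence_weighted_vote(votes)
print(by_majority, by_confidence)
```

In this invented case the two schemes disagree: two weakly confident classifiers outvote one strongly confident classifier under majority voting, but not under confidence weighting.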

Page 30: Chris Dibben University of Edinburgh

Creation of a fully-linked vital events database for the whole of Scotland back to 1855

[Diagram: timeline from 1855 through 1974 to the present]
• Vital Events (24 million births, deaths and marriages): Digital Images + Index
• Vital Events Database
• Fully-linked Vital Events Database
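A fully-linked database requires deciding which birth, marriage and death records refer to the same person. The sketch below shows one simple form of record linkage: group candidate records by birth year (blocking), then compare names with a string-similarity score. The records, field names and the 0.85 threshold are hypothetical, and the slides do not say which linkage method the project uses.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case, drop full stops and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link(births, deaths, threshold=0.85):
    """Link each death to the best-matching birth record with the same birth year."""
    by_year = {}
    for birth in births:
        by_year.setdefault(birth["birth_year"], []).append(birth)
    links = []
    for death in deaths:
        candidates = by_year.get(death["birth_year"], [])
        scored = [(similarity(death["name"], b["name"]), b) for b in candidates]
        if not scored:
            continue
        score, best = max(scored, key=lambda pair: pair[0])
        if score >= threshold:
            links.append((best["id"], death["id"], round(score, 2)))
    return links


# Hypothetical records; in practice the birth year on a death record
# would typically be derived from the recorded age at death.
births = [
    {"id": "B1", "name": "Margaret McDonald", "birth_year": 1861},
    {"id": "B2", "name": "John Stewart", "birth_year": 1861},
]
deaths = [
    {"id": "D1", "name": "Margaret Macdonald", "birth_year": 1861},
    {"id": "D2", "name": "John Stewart", "birth_year": 1870},
]
links = link(births, deaths)
print(links)
```

Exact blocking on birth year is deliberately strict here: the second death record is left unlinked because its year differs, even though the name matches. A real linkage would probably allow some tolerance in the year and use more fields, such as parents' names and place.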

Page 31: Chris Dibben University of Edinburgh

Large-scale family reconstruction studies and pedigrees

Page 32: Chris Dibben University of Edinburgh

Gottfredsson, Magnús, et al. "Lessons from the past: familial aggregation analysis of fatal pandemic influenza (Spanish flu) in Iceland in 1918." Proceedings of the National Academy of Sciences 105.4 (2008): 1303-1308.

Page 33: Chris Dibben University of Edinburgh
Page 34: Chris Dibben University of Edinburgh

Acknowledgments

• The Digitising Scotland project is funded by ESRC
• The support from National Records of Scotland is also gratefully acknowledged