Linking historical ship records to a newspaper archive

Talk at Histoinformatics, 10 November 2014, Barcelona

Andrea Bravo Balado, Victor de Boer, Guus Schreiber

VU University Amsterdam

2

Context: dutchshipsandsailors.nl/

3

Dutch Ships and Sailors (DSS) datasets

4

Results published as Linked Data

5

Data visualizations

6

This study

• An increasing number of historical databases are being digitized

• Finding matching occurrences of the same object in different datasets is both relevant (for historical research) and non-trivial
  – “Instance mapping”

• This paper: case study of linking ship instances in two maritime datasets

7

Focus on methodology

• This study is not about developing new techniques

• This study is about methodology:
  – What combination of existing techniques gets the “best” result?
  – What the “best” result is depends on context (i.e., the goal of the historical research)
• This is a case study, so be wary of generalization

8

Data

• Muster rolls (Northern Dutch Maritime Museum)
  – Period: 1803-1937
  – 77,043 records of 34,552 seamen
  – 17,098 mentions of 4,935 ships
• Newspaper archive (Dutch National Library)
  – Period: 1618-1995
  – 7K newspapers, 9M pages (coverage: 10%)
  – Text generated via OCR

9

Timeline newspapers in the archive

10

Example muster roll record (in Dutch)

11

Example newspaper article (in Dutch)

12

Approach

• Generate candidate set of links
• Apply two types of filters to the candidate set
  – Domain-specific filtering
    • Using domain heuristics about ship identification
  – Text classification of newspaper articles
    • Determine whether the article is about a ship
• Combine filters

13

Baseline generation

• Find all ship instances in the muster rolls
• Query newspaper archive for first 100 hits with this name
  – API: http://www.delpher.nl/
• Result set is expected to have high recall but low precision
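
The baseline step above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `search` callable is a hypothetical stand-in for whatever query interface the newspaper archive offers, and `toy_search` with its article ids is invented purely so the example runs offline.

```python
from typing import Callable, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    ship_name: str
    article_id: str


def generate_candidates(
    ship_names: Iterable[str],
    search: Callable[[str, int], List[str]],
    max_hits: int = 100,
) -> List[Candidate]:
    """Pair every distinct ship name with at most max_hits article ids."""
    candidates = []
    for name in sorted(set(ship_names)):
        # search() is assumed to return article ids ranked by the archive.
        for article_id in search(name, max_hits)[:max_hits]:
            candidates.append(Candidate(name, article_id))
    return candidates


def toy_search(name: str, limit: int) -> List[str]:
    """Invented in-memory archive; a real run would query the archive instead."""
    archive = {"Anna Maria": ["a1", "a2", "a3"], "Hoop": ["a4"]}
    return archive.get(name, [])[:limit]


baseline = generate_candidates(["Hoop", "Anna Maria", "Hoop"], toy_search, max_hits=2)
```

Matching on the ship name alone is what makes this set high in recall and low in precision: common ship names will also match articles that have nothing to do with the ship.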

14

Evaluation

• No gold standard
• Manual assessment of all links is infeasible
• Sampling method for evaluating candidates
  – 50 candidates per technique
  – 3 assessors (domain expert plus two authors)
  – Inter-observer agreement: Cohen’s kappa = 0.65
• Recall: approximation, based on the estimated number of correct links (using the baseline)
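
The sampling-based measures can be illustrated with a short sketch. Two assumptions are made here that the slides do not spell out: Cohen's kappa is computed for one pair of assessors (how the three assessors were combined is not stated), and approximate recall is taken to be the estimated number of correct links kept by a filter divided by the estimated number of correct links in the baseline. The judgements and numbers below are invented.

```python
from collections import Counter
from typing import Sequence


def sample_precision(judgements: Sequence[bool]) -> float:
    """Fraction of sampled links that an assessor judged correct."""
    return sum(judgements) / len(judgements)


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


def approximate_recall(
    filter_precision: float,
    filter_size: int,
    baseline_precision: float,
    baseline_size: int,
) -> float:
    """Assumed definition: estimated correct links kept / estimated correct in baseline."""
    estimated_correct_in_baseline = baseline_precision * baseline_size
    return (filter_precision * filter_size) / estimated_correct_in_baseline


# Invented judgements from two assessors on the same eight sampled links.
rater_a = [True, True, True, True, False, False, False, False]
rater_b = [True, True, True, False, False, False, False, True]
kappa = cohen_kappa(rater_a, rater_b)
recall = approximate_recall(0.9, 100, 0.3, 1000)
```

Because the denominator is itself an estimate from a 50-item sample of the baseline, the recall figure inherits that sample's uncertainty.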

15

Domain-specific filtering

• Heuristic 1: co-occurrence of name of ship captain
  – Common practice in historical maritime documentation
• Heuristic 2: date of newspaper article is within ship lifetime (as indicated by muster roll)
  – Average life span of ship is 30 years
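
A minimal sketch of the two heuristics, under stated assumptions: the record fields (`captain_name`, `first_year`, `last_year`) are hypothetical names, the lifetime is assumed to be the span of years in which the ship appears in the muster rolls, and captain matching is plain case-insensitive substring search, which ignores OCR noise. The example ship and articles are invented.

```python
from dataclasses import dataclass


@dataclass
class ShipRecord:
    ship_name: str
    captain_name: str
    first_year: int  # assumed: earliest muster-roll mention
    last_year: int   # assumed: latest muster-roll mention


@dataclass
class Article:
    text: str
    year: int


def mentions_captain(ship: ShipRecord, article: Article) -> bool:
    """Heuristic 1: the captain's name co-occurs in the article text."""
    return ship.captain_name.lower() in article.text.lower()


def within_lifetime(ship: ShipRecord, article: Article) -> bool:
    """Heuristic 2: the article was published during the ship's lifetime."""
    return ship.first_year <= article.year <= ship.last_year


ship = ShipRecord("Anna Maria", "Jansen", 1850, 1880)
departure = Article("Vertrokken: Anna Maria, kapt. Jansen, naar Riga.", 1862)
unrelated = Article("Anna Maria van der Berg is gehuwd.", 1900)

departure_ok = mentions_captain(ship, departure) and within_lifetime(ship, departure)
unrelated_ok = mentions_captain(ship, unrelated) or within_lifetime(ship, unrelated)
```

On OCR-generated text, an exact substring test will miss misrecognized names, so a real filter would likely need some tolerance for spelling variation.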

16

Text classification

• Task: decide whether a newspaper article is about a ship

• Two techniques used
  – Naive Bayes and Support Vector Machine (SVM) with Sequential Minimal Optimisation (SMO)
  – WEKA implementation
  – Training set: 200 samples (121 positive, 79 negative)
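
To make the classification task concrete, here is a from-scratch multinomial Naive Bayes with add-one smoothing. It is a stand-in for illustration only: the study used the WEKA implementations, the SVM/SMO variant is not shown, and the whitespace tokenizer and four training snippets are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayes:
    def fit(self, samples: List[Tuple[str, bool]]) -> "NaiveBayes":
        """Count documents and words per class (True = article is about a ship)."""
        self.doc_counts = Counter(label for _, label in samples)
        self.word_counts = {True: Counter(), False: Counter()}
        for text, label in samples:
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_score(self, text: str, label: bool) -> float:
        """Log prior plus smoothed log likelihood of the known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocabulary)
        for word in tokenize(text):
            if word in self.vocabulary:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][word] + 1) / denominator)
        return score

    def is_about_ship(self, text: str) -> bool:
        return self.log_score(text, True) > self.log_score(text, False)


training = [
    ("schip vertrokken kapitein naar riga", True),
    ("kapitein schip aangekomen haven", True),
    ("verkoop huis en tuin", False),
    ("raad vergadering stad huis", False),
]
model = NaiveBayes().fit(training)
ship_article = model.is_about_ship("schip aangekomen haven")
other_article = model.is_about_ship("verkoop huis")
```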

17

Configuration

• Filter 1a: captain name
• Filter 1b: time restriction
• Filter 2: combine filters 1a + 1b
• Filter 2 + text classification
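
The four configurations can be read as compositions of boolean filters over the candidate set. The sketch below assumes that "combine" means logical AND, which is consistent with precision rising and recall falling when filters are combined. The link ids and the per-filter accept sets are invented placeholders for real filter outcomes.

```python
from typing import Callable, Dict, List, Tuple

Link = Tuple[str, str]  # (ship id, article id)
Filter = Callable[[Link], bool]


def all_of(*filters: Filter) -> Filter:
    """Keep a link only if every given filter accepts it."""
    return lambda link: all(f(link) for f in filters)


def apply_filter(candidates: List[Link], keep: Filter) -> List[Link]:
    return [link for link in candidates if keep(link)]


# Invented outcomes of the individual filters on four candidate links.
captain_ok = {("s1", "a1"), ("s1", "a2")}
time_ok = {("s1", "a2"), ("s2", "a3")}
ship_text = {("s1", "a1"), ("s2", "a3")}


def filter_1a(link: Link) -> bool:
    return link in captain_ok


def filter_1b(link: Link) -> bool:
    return link in time_ok


def text_classifier(link: Link) -> bool:
    return link in ship_text


configurations: Dict[str, Filter] = {
    "1a": filter_1a,
    "1b": filter_1b,
    "2": all_of(filter_1a, filter_1b),
    "2+text": all_of(filter_1a, filter_1b, text_classifier),
}

candidates = [("s1", "a1"), ("s1", "a2"), ("s2", "a3"), ("s2", "a4")]
results = {name: apply_filter(candidates, keep) for name, keep in configurations.items()}
```

In this toy run, adding the text classifier to Filter 2 removes the only remaining link, which shows how stacking filters can cost recall without improving precision.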

18

Results

19

Analysis

• Captain’s name turns out to be a strong heuristic

• Time restriction much less useful
• When combined, precision becomes very high, at the cost of (approximate) recall
• Text classification has high precision (no false positives)
• Text classification combined with heuristic filtering has negative effect

20

Discussion

• Interestingly, the historian preferred very high precision at the cost of recall

• Consequently, 16K links published as Linked Data (precision 0.96; approximate recall 0.13)

• Links are to departure/arrival listings, but also to shipwrecks and sales

• In case of good heuristics the contribution of generic techniques is at best minimal

• Absence of gold standard is realistic

21

Limitations

• Evaluation
  – 50 samples
  – Choice of assessors
  – Approximation of recall
• Data
  – OCR quality of newspaper articles
  – Digitized newspaper archive covers only 10%

22

Acknowledgements

• Jurjen Leinenga, domain expert
• CLARIN-NL http://www.clarin.nl
• BiographyNet, Netherlands eScience Center http://esciencecenter.nl

• Online appendix with details of results at http://dx.doi.org/10.6084/m9.figshare.1189228

23

QUESTION TIME
