Information Extraction on Noisy Texts for Historical Research

1. Information Extraction on Noisy Texts for Historical Research
   Mike Bryant, Kepa Joseba Rodriquez, Tobias Blanke, Reto Speck
   19th July 2012
   http://www.ehri-project.eu

2. Why EHRI?
   Fragmentation and dispersal of archival sources
      Geographical scope of the Holocaust
      Attempts to destroy the evidence
      Migration of Holocaust survivors
      Multiplicity of documentation projects after the war

3. The Adler case

4. The Adler Case
   1 - Jewish Museum Prague
   2 - ITS International Tracing Service
   3 - Yad Vashem
   4 - NIOD
   5 - King's College
   Connecting collections

5. Connecting Collections
   Collection-level metadata
   Enhance existing services
   Develop new services
   Build a virtual observatory
   Build a virtual research environment
   A digital infrastructure to unlock sources
   Problem-driven
   User-driven

6. Integrate multiple layers of Metadata
   Archival (finding aids, thesaurus)
   Machine generated (extracted entities)
   User generated (annotations)

7. Services for partner archives
   OCR
      Provide a general-purpose OCR service tailored to the needs of historical material
      Allow attaching scanned paper finding aids to bare-bones collection descriptions and automatically storing/indexing OCR output
   Named Entity Extraction
      Integrate NEE services to bootstrap the process of tagging collection descriptions
      Integrate NEE with the EHRI thesaurus, to filter and validate NEE output
      Build candidate search indexes, with crowd-sourced validation

8. Workflow Tools: the Ocropodium Project
   1. Workflow development
   2. Batch process
   3. Transcript correction

9. NEE Experiment - Corpus data
   Wiener Library: Holocaust survivor testimonies
      17 pages
      ~93% OCR word accuracy
   King's College London: H.M.S. Kelly Newsletters
      33 pages
      ~92.5% OCR word accuracy

10. NEE Experiment - Tools
   Find all information about prisoners arriving in Therezin from the Netherlands in 1944
   Find all documentation from Hans Gunther Adler on SS guards in Therezin
   Extracted entities
      Person
      Location
      Organisation
   Tools
      Alchemy API
      OpenCalais
      Apache OpenNLP
      Stanford NER
   Manually annotated source data
      Tokenized and POS tagged using TreeTagger
      Imported into MMAX2 for manual entity tagging

11.
NEE Experiment - Results
   Low performance of the tools on both raw and corrected text

                    Raw                   Corrected
                 P     R     F1        P     R     F1
   Alchemy      0.61  0.38  0.47      0.63  0.38  0.48
   OpenCalais   0.75  0.29  0.41      0.69  0.30  0.42
   OpenNLP      0.42  0.12  0.19      0.53  0.13  0.21
   Stanford     0.57  0.52  0.54      0.60  0.61  0.60

12. LOC extraction most accurate, ORG least
   [Charts: WL F1-Score; KCL F1-Score]

13. NEE Experiment - Personal names
   Person names: commonly written in non-standard forms
   Person and location names are used for other kinds of entities, e.g. warships
      Warships frequently annotated as PER

14. NEE Experiment - Organisations
   Performance of type ORG extraction is very low
      Names of organizations appear in non-standard forms
      Jargon and abbreviations abound, particularly in the Kelly newsletters
      Many organizations no longer exist
      SS and other relevant Nazi organizations have not been detected
   Spelling errors and typos in the original files: OpenCalais used general knowledge to resolve this problem
      Use of general knowledge may be problematic, e.g. "Klan, Walter" -> "Ku Klux Klan"

15. Relative performance
   Stanford NER: best performance across both datasets
      Most effective on PER and LOC types
   Alchemy API: best results on ORG type
      Biggest difference between raw OCR and manually corrected text
      Not massively ahead of OpenCalais/Stanford
   Apache OpenNLP: worst performance on our data
      But: most open of the tools and theoretically trainable

16. Conclusions
   Manual correction of OCR output does not significantly improve the performance (on our material)
   Raw output is enough to obtain provisional candidates for N-gram indexing
   Best results likely to come from combinations of tools
   Specific workflows for specific material, no silver bullet
   Focus in near term:
      Identify most significant patterns of error
      Implement pre-processing pipeline using simple heuristics and pattern matching tools
   Focus in longer term:
      Integrate EHRI thesaurus and other forms of knowledge to validate and correct the output of NE extraction tools

17.
Thanks
   Any questions?
   Publications:
      Tobias Blanke, Mike Bryant, Mark Hedges: Ocropodium: open source OCR for small-scale historical archives. Journal of Information Science, Vol. 38, No. 1.
      Tobias Blanke, Michael Bryant, Mark Hedges: Open source OCR for Scientific Workflows in History. Journal of Documentation, Forthcoming.
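
The F1 column in the slide 11 results is the harmonic mean of precision (P) and recall (R). As a minimal sketch, the following recomputes F1 from the rounded raw-text P and R values reported there. The function and variable names are illustrative, not taken from the experiment, and because the inputs are already rounded, a recomputed value can differ from the slide in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) on raw OCR text, copied from the slide 11 results.
RAW_SCORES = {
    "Alchemy": (0.61, 0.38),
    "OpenCalais": (0.75, 0.29),
    "OpenNLP": (0.42, 0.12),
    "Stanford": (0.57, 0.52),
}

# Recompute F1 per tool, rounded to two decimals like the slide.
recomputed = {
    tool: round(f1_score(p, r), 2) for tool, (p, r) in RAW_SCORES.items()
}

for tool, f1 in recomputed.items():
    print(f"{tool:<12}F1 = {f1:.2f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, which is why OpenCalais, with the highest precision but low recall, still scores below Stanford NER.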

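The conclusions on slide 16 suggest combining tools and validating their output against the EHRI thesaurus. One way this might look, purely as a sketch: keep entities proposed by at least two tools, then keep only those whose text matches a thesaurus term. The tool outputs, the thesaurus contents, and the names `combine_by_vote` and `filter_by_thesaurus` are all hypothetical; none of this is the project's actual pipeline or data.

```python
from collections import Counter


def combine_by_vote(tool_outputs, min_votes=2):
    """Keep (text, type) entities proposed by at least min_votes tools."""
    votes = Counter()
    for entities in tool_outputs.values():
        # set() so that a tool repeating an entity still votes only once
        votes.update(set(entities))
    return {entity for entity, count in votes.items() if count >= min_votes}


def filter_by_thesaurus(entities, thesaurus_terms):
    """Keep entities whose text matches a thesaurus term, ignoring case."""
    known = {term.casefold() for term in thesaurus_terms}
    return {(text, etype) for text, etype in entities if text.casefold() in known}


# Hypothetical outputs of three tools for the same passage.
tool_outputs = {
    "tool_a": [("Netherlands", "LOC"), ("Ku Klux Klan", "ORG")],
    "tool_b": [("Netherlands", "LOC"), ("Hans Gunther Adler", "PER"), ("Kelly", "PER")],
    "tool_c": [("Hans Gunther Adler", "PER"), ("Netherlands", "LOC"), ("Kelly", "PER")],
}

# Hypothetical thesaurus terms (stand-in for the EHRI thesaurus).
thesaurus_terms = ["Netherlands", "Hans Gunther Adler"]

voted = combine_by_vote(tool_outputs)
validated = filter_by_thesaurus(voted, thesaurus_terms)
print(sorted(validated))
```

In this toy input, voting drops the single-tool "Ku Klux Klan" guess, while the thesaurus check drops "Kelly", a warship tagged as a person by two tools. Exact string matching is a deliberate simplification; noisy OCR output would likely need fuzzy matching.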