Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation


Slides of the presentation of the paper "An approach to Unsupervised Historical Text Normalisation" by Petar Mitankin, Stefan Gerdjikov and Stoyan Mihov at DATeCH 2014.


An approach to unsupervised historical text normalisation

Petar Mitankin, Sofia University, FMI
Stefan Gerdjikov, Sofia University, FMI
Stoyan Mihov, Bulgarian Academy of Sciences, IICT

DATeCH 2014, May 19-20, Madrid, Spain


Contents

● Supervised Text Normalisation
  – CULTURA
  – REBELS Translation Model
  – Functional Automata

● Unsupervised Text Normalisation
  – Unsupervised REBELS
  – Experimental Results
  – Future Improvements

Co-funded under the 7th Framework Programme of the European Commission

● Maye – 34 occurrences in the 1641 Depositions: 8022 documents of 17th-century Early Modern English

● CULTURA: CULTivating Understanding and Research through Adaptivity

● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI


Supervised Text Normalisation

● Manually created ground truth
  – 500 documents from the 1641 Depositions
  – All words: 205 291
  – Normalised words: 51 133

● Statistical Machine Translation from historical language to modern language combines:
  – Translation model
  – Language model


REgularities Based Embedding of Language Structures

[Figure: the input "shee" is fed to the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]

Automatic Extraction of Historical Spelling Variations

Training of the REBELS Translation Model

● Training pairs from the ground truth:

(shee, she), (maye, may), (she, she),

(tyme, time), (saith, says), (have, have),

(tho:, thomas), ...

Training of the REBELS Translation Model

● Deterministic structure of all historical/modern subwords

● Each word has several hierarchical decompositions in the DAWG:

[Figure: hierarchical decomposition of each historical word, and hierarchical decomposition of each modern word]

Training of the REBELS Translation Model

● For each training pair (knowth, knows) we find a mapping between the decompositions:

● We collect statistics about

historical subword -> modern subword

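As a toy illustration of the statistics-collection step (not the authors' code), the sketch below counts how often each historical subword maps to each modern subword over a few hand-made alignments, then turns the counts into the log-probabilities that the translation model uses:

```python
from collections import Counter, defaultdict
from math import log

# Hand-made subword alignments for illustration only; real REBELS
# alignments come from the hierarchical decompositions in the DAWG.
aligned_subwords = [
    ("know", "know"), ("th", "s"),   # from (knowth, knows)
    ("she", "she"), ("e", ""),       # from (shee, she)
    ("may", "may"), ("e", ""),       # from (maye, may)
]

# Count historical subword -> modern subword occurrences.
counts = defaultdict(Counter)
for hist, modern in aligned_subwords:
    counts[hist][modern] += 1

# Convert counts to conditional log-probabilities per historical subword.
log_probs = {
    hist: {m: log(c / sum(ms.values())) for m, c in ms.items()}
    for hist, ms in counts.items()
}

print(log_probs["e"])   # "e" always maps to deletion in this toy data
```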


REBELS generates normalisation candidates for unseen historical words

[Figure: each word of the historical phrase "shee knowth me" is passed through REBELS to generate its list of normalisation candidates]

relevance score(he knuth my) =
  REBELS TM(he knuth my) * C_tm +
  Statistical Language Model(he knuth my) * C_lm
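The combination above is just a weighted sum of the two log-scores. A minimal sketch, with made-up log-probabilities and placeholder weights C_tm and C_lm (the real values are trained):

```python
def relevance_score(tm_logprob, lm_logprob, c_tm, c_lm):
    # Linear combination of translation-model and language-model
    # log-scores, as on the slide.
    return tm_logprob * c_tm + lm_logprob * c_lm

# Hypothetical log-probabilities for the candidate "he knuth my".
score = relevance_score(-9.75, -12.3, c_tm=1.0, c_lm=0.5)
```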

Combination of REBELS with Statistical Bigram Language Model

● Bigram Statistical Model
  – Smoothing: Absolute Discounting, Backing-off
  – Gutenberg English language corpus
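A minimal bigram model in this spirit, assuming interpolated absolute discounting with backing-off to unigrams; the corpus and the discount value d are toy stand-ins for the Gutenberg corpus and the tuned discount:

```python
from collections import Counter

corpus = "she knows me she knows him he knows me".split()
d = 0.5   # discount, a guessed value for illustration

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total

def p_bigram(w2, w1):
    # P(w2 | w1): subtract d from each seen bigram count and
    # redistribute the freed mass to the unigram backoff model.
    seen = bigrams[(w1, w2)]
    n_types = len([b for b in bigrams if b[0] == w1])
    backoff_mass = d * n_types / unigrams[w1]
    if seen > 0:
        return (seen - d) / unigrams[w1] + backoff_mass * p_unigram(w2)
    return backoff_mass * p_unigram(w2)
```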

Functional Automata

L(C_tm, C_lm) is represented with Functional Automata

Automatic Construction of a Functional Automaton for the Partial Derivative w.r.t. x

L(C_tm, C_lm) is optimised with the Conjugate Gradient method
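The real objective L(C_tm, C_lm) and its derivative automata are beyond a short sketch, but the Conjugate Gradient idea can be illustrated on a stand-in problem: fitting the two weights by least squares over made-up (TM score, LM score, target score) triples, solving the normal equations with textbook linear CG:

```python
# Made-up score triples for illustration only.
rows = [(-1.89, -2.0, -3.0), (-1.69, -1.5, -2.5), (-9.75, -8.0, -14.0)]

# Normal equations A^T A x = A^T b for A = [[tm, lm]], b = targets.
ata = [[0.0, 0.0], [0.0, 0.0]]
atb = [0.0, 0.0]
for tm, lm, t in rows:
    a = (tm, lm)
    for i in range(2):
        atb[i] += a[i] * t
        for j in range(2):
            ata[i][j] += a[i] * a[j]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Linear CG: for a 2x2 SPD system it converges in at most 2 steps.
x = [0.0, 0.0]
r = [atb[i] - y for i, y in enumerate(matvec(ata, x))]
p = r[:]
for _ in range(2):
    if dot(r, r) < 1e-12:
        break
    ap = matvec(ata, p)
    alpha = dot(r, r) / dot(p, ap)
    x = [x[i] + alpha * p[i] for i in range(2)]
    r_new = [r[i] - alpha * ap[i] for i in range(2)]
    beta = dot(r_new, r_new) / dot(r, r)
    p = [r_new[i] + beta * p[i] for i in range(2)]
    r = r_new
# x now holds the fitted (C_tm, C_lm)
```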

Supervised Text Normalisation

[Figure: supervised pipeline – the Ground Truth feeds a Training Module based on Functional Automata, which trains the REBELS Translation Model; a Search Module based on Functional Automata then maps historical text to normalised text]

Unsupervised Text Normalisation

[Figure: unsupervised pipeline – the training pairs for the REBELS Translation Model, e.g. (knoweth, knows), come from Unsupervised Generation instead of a ground truth; the Search Module based on Functional Automata again maps historical text to normalised text]

Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:
  – For each historical word H:
    ● If H is a modern word, then generate (H, H), else
    ● Find each modern word M at Levenshtein distance 1 from H and generate (H, M). If no modern words are found, then
    ● Find each modern word M at distance 2 from H and generate (H, M). If no modern words are found, then
    ● Find each modern word M at distance 3 from H and generate (H, M).

● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
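The generation procedure above can be sketched directly in Python with a plain dynamic-programming Levenshtein distance; the word lists are toy examples rather than the 1641 Depositions lexicon, and the cut-off follows the "more than 6" threshold on the slide:

```python
def levenshtein(a, b):
    # Standard DP edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def generate_pairs(historical, modern_lexicon, max_dist=3, max_pairs=6):
    pairs = []
    for h in historical:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for dist in range(1, max_dist + 1):
            matches = [m for m in modern_lexicon
                       if levenshtein(h, m) == dist]
            if matches:
                # Discard H entirely if it has too many candidates.
                if len(matches) <= max_pairs:
                    pairs.extend((h, m) for m in matches)
                break   # stop at the smallest distance with matches
    return pairs

modern = {"she", "may", "time", "knows"}
print(generate_pairs(["shee", "maye", "she"], modern))
```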



Normalisation of the 1641 Depositions. Experimental results

Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model | Accuracy | BLEU
1 | ---- | ---- | ---- | 75.59 | 50.31
2 | Unsupervised | NO | YES | 67.84 | 45.52
3 | Unsupervised | YES | NO | 79.18 | 56.55
4 | Unsupervised | YES | YES | 81.79 | 61.88
5 | Unsupervised | Supervised Trained | Supervised Trained | 84.82 | 68.78
6 | Supervised | Supervised Trained | Supervised Trained | 93.96 | 87.30

Future Improvement

[Figure: the unsupervised pipeline extended so that the generated training pairs, e.g. (knoweth, knows), carry probabilities and feed a MAP Training Module; the REBELS Translation Model and the Search Module based on Functional Automata again map historical text to normalised text]

Thank You!

Comments / Questions?

ACKNOWLEDGEMENTS

The reported research work is supported by the project CULTURA, grant 269973, and the project AComIn, grant 316087, both funded by the FP7 Programme.
