Kristín Bjarnadóttir, Jón Friðrik Daðason & Ludger Zeevaert
The Árni Magnússon Institute for Icelandic Studies University of Iceland
Icelandic: Modern Tools and Old Texts
Nordic CLARIN Network Workshop on Historical Resources and Tools
Gothenburg 6.10.2015
Topic
• Ludger: The challenge: Njáls saga and NLP tools
• Jón Friðrik: Applying modern NLP tools to old texts
• Kristín: Why normalize to Modern Icelandic?
Linguistic resources
Co-operation
• Co-operation between the Manuscript Department and the Department for Lexicography
• Aim: Adapt linguistic tools developed for Modern Icelandic to Old Icelandic corpora
The Gullskinna-Project
• Funded by The Icelandic Research Fund (Rannís, grant number 152342-051)
• Start: August 2015 (3 years)
• Main questions:
  (1) a stemma of the Gullskinna-manuscripts
  (2) an analysis of the treatment of a number of morphological and syntactic variables exhibited by different scribes
Linguistic Variables with Differences Between Manuscripts of Njáls saga
• Historical present tense
• Order of noun and modifier
• Pronominal reference (reflexive or not)
• Agreement (past participle/supine)
• Indirect speech constructions
One Example: Indirect Speech
• Direct speech
• Conjunctional clause
• Accusative with infinitive (a.c.i.)
• Mixed constructions
One Short Example
Hrappur segist vera íslenskur
Hrappur-SBJ say-3.SG.PRS.MID be-INF Icelandic
‘Hrappur says himself to be Icelandic’

Hrappur sagði að hann væri íslenskur
Hrappur-SBJ said that he be-3.SG.PST.SBJV Icelandic
‘Hrappur said that he was Icelandic’
Our Approach
• Tag a version of the text in Modern Icelandic spelling
• Transform the tagged text to Menota-XML (Medieval Nordic Text Archive, menota.org) + syntactic tagging (TEI)
• Correct the tagging
• Add the text from the manuscript
• Analyse and compare linguistic variables in different manuscripts
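The transformation step can be pictured with a small sketch. It is only an illustration under stated assumptions, not the project's converter: the namespace URI, the helper names `me` and `word_element`, the placeholder tags, and the choice to store the Modern Icelandic form in `me:norm` while leaving `me:dipl` empty for the manuscript text are all assumptions made here.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for Menota-specific elements and attributes.
MENOTA_NS = "http://www.menota.org/ns/1.0"
ET.register_namespace("me", MENOTA_NS)


def me(name):
    """Qualify a name with the (assumed) Menota namespace."""
    return f"{{{MENOTA_NS}}}{name}"


def word_element(form, lemma, tag):
    """Build one <w> element from a tagged Modern Icelandic token (hypothetical helper)."""
    w = ET.Element("w", {"lemma": lemma, me("msa"): tag})
    choice = ET.SubElement(w, "choice")
    # Manuscript reading: left empty here, to be filled in at a later step.
    ET.SubElement(choice, me("dipl"))
    # Modern Icelandic spelling that was given to the tagger.
    ET.SubElement(choice, me("norm")).text = form
    return w


# Placeholder tags; a real run would carry the tagger's own tag strings.
tagged = [("Hrappur", "Hrappur", "TAG-A"), ("segist", "segja", "TAG-B")]

sentence = ET.Element("s")
for form, lemma, tag in tagged:
    sentence.append(word_element(form, lemma, tag))

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

Keeping the manuscript reading and the modern form side by side in one word element is what would let corrected tags be reused when the manuscript text is added later.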
Conjunctional Clause/ A.c.i.
Hrappur kvaðst vera íslenskur (AM 162 B theta fol. “Þetabrotið”, ca 1325)
Hrappur sagði hann væri utan af Íslandi (AM 133 fol. “Kálfalækjarbók”, ca 1325)
Hrappur segist vera íslenskur (AM 137 fol. “Vigfúsarbók”, ca 1650)
Hrappur kvaðst vera íslenskur maður (AM 163 i fol. “Saurbæjarbók”, 1668)
Hrappur sagði að hann væri íslenskur (AM 135 fol., ca 1690)
Chronological Development or not?
[Figure: stemma diagram with the nodes Þeta, Gullskinna, Vigfúsarbók, Saurbæjarbók, Kálfalækjarbók, AM 135 fol., *x4 and *x1]
Icelandic Taggers
• IceNLP
  • Offers a hybrid rule-based/HMM tagger
  • Trained on the Icelandic Frequency Dictionary (IFD)
    • 590,000 tokens
  • Tagging accuracy of 92.7%
• IceStagger
  • An adaptation of Stagger, an averaged perceptron tagger
  • Tagging accuracy of 93.8%
    • An improvement over Stagger’s accuracy of 91.0%
Tagging Old Icelandic
• Normalized Old Icelandic
  • IceNLP: 86.6% accuracy (vs. 92.7%)
  • Stagger: 84.9% accuracy (vs. 91.0%)
• Extending the training data
  • Add 95,000 tokens from the Saga-Gold corpus (13th-14th century texts) to the training set
• Improved tagging accuracy
  • IceNLP: 90.6% accuracy
  • Stagger: 91.8% accuracy
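One way to read these figures is as a relative reduction in tagging errors rather than as a gain in percentage points. The sketch below does that arithmetic on the accuracies quoted above; the function name `error_reduction` is ours, and the comparison assumes that the before and after figures were measured on the same test data.

```python
def error_reduction(before, after):
    """Relative reduction in error rate, given two accuracies in percent."""
    return (after - before) / (100.0 - before)


# Accuracies on normalized Old Icelandic, before and after adding Saga-Gold.
results = {"IceNLP": (86.6, 90.6), "Stagger": (84.9, 91.8)}

for name, (before, after) in results.items():
    gain = after - before
    reduction = error_reduction(before, after)
    print(f"{name}: +{gain:.1f} points, {reduction:.1%} fewer errors")
```

On this reading the extended training set removes roughly 30% of IceNLP's errors and roughly 46% of Stagger's.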
Lemmatizing Old Icelandic
• Lemmald
  • A rule-based lemmatizer included with IceNLP
  • Trained on the IFD
    • Approx. 74,000 distinct word form/tag/lemma combinations
• Nefnir
  • A new rule-based lemmatizer
  • Trained on the Database of Modern Icelandic Inflection
    • Approx. 6,000,000 distinct word form/tag/lemma combinations
A Sample Evaluation
• Approx. 1,700 tokens from Njála
• Without the extended training set

Task                       IceNLP   IceStagger
Tagging                    88.3%    83.5%
Lemmatization (Lemmald)    92.3%    92.3%
Lemmatization (Nefnir)     94.2%    93.2%
Automating the Process
Goals
• Make the tools as easy to use as possible
  • Tools can be added, replaced and updated as necessary
  • The less technical proficiency that is required of users, the better
• Introduce a simple and fast process
  • Offers a much better starting point than manually annotating documents from scratch
  • Fully processed documents can be shared back to us in a common format
    • Documents shared back can be used to enlarge the training set
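The goal that tools can be added, replaced and updated can be illustrated with a minimal sketch of a pipeline whose stages are looked up by name. Everything in it is hypothetical: the `Pipeline` class, the token dictionaries and the two placeholder stages stand in for real taggers and lemmatizers and do not reflect how the actual tools are wired together.

```python
from typing import Callable, Dict, List

Token = Dict[str, str]
Stage = Callable[[List[Token]], List[Token]]


class Pipeline:
    """Runs named stages in insertion order; a stage can be swapped by name."""

    def __init__(self) -> None:
        self.stages: Dict[str, Stage] = {}

    def set_stage(self, name: str, stage: Stage) -> None:
        # Re-using an existing name replaces that stage but keeps its position.
        self.stages[name] = stage

    def run(self, text: str) -> List[Token]:
        tokens: List[Token] = [{"form": form} for form in text.split()]
        for stage in self.stages.values():
            tokens = stage(tokens)
        return tokens


def placeholder_tagger(tokens: List[Token]) -> List[Token]:
    # Stand-in for a real tagger: every token gets the same dummy tag.
    return [{**t, "tag": "UNKNOWN"} for t in tokens]


def placeholder_lemmatizer(tokens: List[Token]) -> List[Token]:
    # Stand-in for a real lemmatizer: the lowercased form is used as the lemma.
    return [{**t, "lemma": t["form"].lower()} for t in tokens]


pipeline = Pipeline()
pipeline.set_stage("tag", placeholder_tagger)
pipeline.set_stage("lemmatize", placeholder_lemmatizer)
result = pipeline.run("Hrappur segist vera íslenskur")
```

Because each stage only reads and returns a list of token dictionaries, a user never has to touch the other stages when one tool is exchanged for another.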
Normalization & Resources
• Existing NLP tools are made for Modern Icelandic
• For NLP use, all pre 20th C texts have to be normalized, as spelling is not standardized before then
• Linguistic resources (lexicons) are needed for each period
Layers of Text: Normalization
Layers of Text
1. Photograph
2. Facsimile
3. Diplomatic: Modern character set
4. Normalized to standardized Old Norse
5. Normalized to Modern Icelandic
Our aim is to normalize a diplomatic version to Modern Icelandic in order to be able to use the PoS Taggers and other NLP tools. (And to make the texts readable for everyone.)
Continuity
• The cohesion of Icelandic word forms is extensive, apart from spelling variants
• Relatively slight changes in morphology
  • The inflectional system is intact
  • Minor changes in individual inflectional classes
  • Drift of individual words between inflectional classes
• Word formation: Unchanged rules …
Result: Very high rate of recognisable word forms between periods with no clear cut-off point in time.
Why Modern Icelandic?
Linguistic resources for the modern language are available
• Database of Modern Icelandic Inflection (BÍN): 380,000 paradigms, 5.8 million inflectional forms
• The Tagged Icelandic Corpus (MÍM): 25 million running words
• Íslenskur orðasjóður (Leipzig Wortschatz): 545 million running words from websites, 21st C.
Producing new tools for each period of the language would be expensive and not always feasible because of data scarcity.
Pre-standardized spelling is highly idiosyncratic, up to the 20th century.
There is no fundamental difference in normalizing a very idiosyncratic 19th century text and an ‘easy’ 15th century one.
Historical Resources
• Written Language Archive (WLA, post 1540): over 700,000 headwords (normalized to Modern Icelandic), 1.3 million citations in original spelling.
• Ordbog over de norrøne prosasprog (ONP): ca. 65,000 headwords (normalized to Old Norse)
• Individual texts and indices … such as Andrea de Leeuw van Weenen: Lemmatized Index to the Icelandic Homily Book (2004) … and all available texts, such as Ludger’s Njála project, etc. …
Skrambi, the Normalizer
• Skrambi is a spellchecker
  • Uses a statistical model for character substitution
  • Can adapt itself to the characteristics of individual documents
  • Is lexicon-based
• Skrambi is used for
  • Modern language spell-checking
  • OCR correction
  • Normalization of older texts
• Skrambi’s versatility depends on the lexicons used. By using a 19th C lexicon, 19th C OCR texts can be corrected and normalized, etc.
Spelling in the ONP / WLA
• ONP (pre 1540): dryckiom, drvckio, drvkkio, dryckiar-, drykkior, dryckio, dryccior, dryckiu, dryckiona, drykkivr, drykkju, drykkiu, ðryckiu, dryckiv …
• WLA (post 1540): dryckia, dryckiu, dryckju, drykkja, drykkju …

ONP:                   WLA:
• [ck|k|kk|cc] > kk    ck > kk
• i > j, v > u         i > j
• [u|o|v]$* > u
• v > y
• ^ð > d
Cf. BÍN →
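The substitution rules for the ONP spellings can be tried out with a small sketch. It is not Skrambi's statistical model: it simply applies each rule optionally, generates every resulting candidate, and keeps the candidates found in a lexicon. The two-word `LEXICON`, the `RULES` table, the function names, and the reading of `[ck|k|kk|cc]` as an alternation and of `$*` as a plain end-of-word anchor are all assumptions made for illustration.

```python
import re
from itertools import product

# Toy lexicon of Modern Icelandic forms, taken from the WLA examples.
LEXICON = {"drykkja", "drykkju"}

# (pattern, possible replacements); keeping the original text is always allowed.
RULES = [
    (re.compile(r"^ð"), ["d"]),
    (re.compile(r"cc|ck|kk|k"), ["kk"]),
    (re.compile(r"i"), ["j"]),
    (re.compile(r"v"), ["u", "y"]),
    (re.compile(r"[uov]$"), ["u"]),
]


def expand(word, pattern, replacements):
    """Return all variants of `word` with each match either kept or replaced."""
    matches = list(pattern.finditer(word))
    if not matches:
        return {word}
    options = [[m.group(0)] + replacements for m in matches]
    variants = set()
    for choice in product(*options):
        parts, pos = [], 0
        for match, replacement in zip(matches, choice):
            parts.append(word[pos:match.start()])
            parts.append(replacement)
            pos = match.end()
        parts.append(word[pos:])
        variants.add("".join(parts))
    return variants


def candidates(word):
    """Apply the rules one after another, collecting every optional outcome."""
    forms = {word}
    for pattern, replacements in RULES:
        forms = set().union(*(expand(f, pattern, replacements) for f in forms))
    return forms


def normalize(word):
    """Return the candidates that the lexicon recognises, in sorted order."""
    return sorted(candidates(word.lower()) & LEXICON)


for spelling in ["dryckiu", "drvkkio", "ðryckiu", "dryckia"]:
    print(spelling, "->", normalize(spelling))
```

Filtering against the lexicon is what resolves competing rules such as v > u and v > y: both candidates are generated, but only the one that is an attested word form survives.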
The Database of Modern Icelandic Inflection
Conclusion
The challenge is to reduce unknown words to improve the accuracy of the tools
• by linking and enlarging lexicons
• by using our compound analyser (It produces binary constituent trees)
• by developing a tool to normalize word boundaries
The more resources (i.e., texts) we get, the better the results of the tools will be.
The Árni Magnússon Institute for Icelandic Studies
Thank you for your attention
Ludger Zeevaert, Jón Friðrik Daðason, Kristín Bjarnadóttir
[email protected], jfd1@hi.is, [email protected]
Oct. 6th 2015, Gothenburg