IMPACT Final Conference - Language Parallel Sessions - Erjavec

Tomaž Erjavec

Department of Knowledge Technologies

Jožef Stefan Institute

Ljubljana

Resources for historical Slovene

IMPACT Conference 2011

October 24-25, 2011, London


Background

• Pre-story: AHLib (2004–08) (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
  • Corpus / DL of ger→slv books
  • AAS: transcription correction and markup (TEI P4)
  • JSI: automatic annotation and editing environment

• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction

• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene

Tomaž Erjavec: Slovene language resources


Methodology

• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words

• Develop annotation tool, ToTrTaLe

• How to tag and lemmatise historical Slovene?
  Little chance of developing training data comparable to that for contemporary Slovene

• Basic idea:
  • modernise words, then use models for modern Slovene
  • transcription is via fixed lexicon + transcription patterns
  • patterns implemented via LMU Vaam
  • mostly OK for XIX and XVIII century language
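The basic idea (fixed lexicon first, then transcription patterns, then modern-language models) can be sketched roughly as follows. This is only an illustration: the lexicon entries, the two rewrite patterns, and the function name `modernise` are hypothetical, and the real patterns are written for LMU Vaam, whose rule format is not shown here.

```python
import re

# Hypothetical fixed lexicon: historical word-form -> modern equivalent.
LEXICON = {"cajhen": "znamenje", "zajhen": "znamenje"}

# Hypothetical transcription patterns, applied in order (not the Vaam rules).
PATTERNS = [
    (re.compile(r"^lu"), "lju"),                      # lubesen -> ljubesen
    (re.compile(r"(?<=[aeiou])s(?=[aeiou])"), "z"),   # ljubesen -> ljubezen
]


def modernise(word, modern_vocab):
    """Return a modern equivalent of `word`, or `word` itself if none is found."""
    key = word.lower()
    if key in LEXICON:            # 1. the fixed lexicon has priority
        return LEXICON[key]
    if key in modern_vocab:       # 2. already a contemporary word-form
        return key
    candidate = key
    for pattern, replacement in PATTERNS:   # 3. rewrite step by step
        candidate = pattern.sub(replacement, candidate)
        if candidate in modern_vocab:
            return candidate
    return key                    # unresolved: leave for manual treatment


vocab = {"ljubezen", "znamenje"}
print(modernise("lubesen", vocab))
print(modernise("cajhen", vocab))
```

Accepting a rewritten candidate only when it appears in the contemporary vocabulary keeps the patterns from producing non-words; the modernised form can then be passed to a tagger and lemmatiser trained on modern Slovene.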


[Diagram: Texts, Historical lexicon, Contemporary models, ToTrTaLe, Annotators, Corpus]


Issues

• Tokenisation: words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči

• Variability:
  • archaic forms: ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection: ljubezen ← ljubezni, ljubeznijo
  • both: ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin

• Extinct words:
  • zajhen / cajhen / znamenje
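The tokenisation issue can be illustrated with a small re-tokenisation step that splits and merges tokens before modernisation. The two tables below contain only the examples from the slide; the table names and the function `retokenise` are hypothetical and do not reflect how ToTrTaLe stores its tokenisation patterns.

```python
# Hypothetical rule tables holding only the examples given above.
SPLIT = {"žnjo": ["z", "njo"]}            # one historical token -> several modern ones
MERGE = {("po", "noči"): "ponoči"}        # two historical tokens -> one modern one


def retokenise(tokens):
    """Map a list of historical tokens onto modern token boundaries."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in MERGE:
            out.append(MERGE[pair])       # join two tokens into one
            i += 2
        elif tokens[i].lower() in SPLIT:
            out.extend(SPLIT[tokens[i].lower()])   # split one token into several
            i += 1
        else:
            out.append(tokens[i])         # leave everything else untouched
            i += 1
    return out


print(retokenise(["žnjo", "po", "noči"]))
```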


Transcribed historical texts

• AHLib corpus/DL: 90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD: 5,000 pages, 1M words
• Google Books: 30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni): 200 books, 5M words (in progress)

~ 10M words

• most texts have associated facsimiles
• can be made freely available


Initial Lexicon

• Development of initial lexicon (2010), using the data and tools at hand
  • AHLib collection (70 books > 1850)
  • Transcription rules + FidaPLUS lexicon of contemporary slv
  • LMU LeXtractor editing tool
  • produced 3,000 entries (word-forms)


Reference corpus goo300k

• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL
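The three annotation layers and the two-pass workflow (automatic first, then manual correction) could be modelled along these lines. The class name, the field names, the `checked` flag, and the tag values are assumptions made for illustration; they are not the corpus encoding or the tagset actually used.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AnnotatedToken:
    form: str              # word as transcribed from the historical text
    modern: str            # contemporary equivalent
    lemma: str             # modern lemma
    pos: str               # part-of-speech tag (placeholder values only)
    checked: bool = False  # set once an annotator has reviewed the token


def correct(token, **fixes):
    """Return a manually corrected copy; the automatic annotation stays intact."""
    return replace(token, checked=True, **fixes)


# First pass: automatic annotation (here with a deliberately wrong tag).
auto = AnnotatedToken(form="lubesni", modern="ljubezni", lemma="ljubezen", pos="X")

# Second pass: an annotator fixes the tag and confirms the rest.
fixed = correct(auto, pos="N")
print(fixed)
```

Keeping the automatic record immutable makes it possible to compare the tool output with the corrected annotation afterwards.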


Period      Units   Pages   Tokens
1584            1       8     6000
1695            1      27    10000
1751-1800       8     155    27000
1801-1850      12     206    74000
1851-1875      36     380   126000
1876-1900      23     224    51000
∑              81    1000   296000


INL Cobalt lexicon building tool


TEI corpus dump


Final lexicon

Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection

Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples


goo300k          All   Historical
Lex. entries   56346        22849
Word-forms     53853        19627
Normalised     46996        15402
Modernised     37334        11396
Lemmas         19569         8605
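A lemma-oriented lexicon dump from an annotated corpus can be sketched as a grouping of attested historical forms under their modern lemma and modern word-form. The input triples reuse the examples given earlier in the talk; the function name `build_lexicon` and the nested-dictionary layout are assumptions, and the real lexicon is encoded in TEI P5 with further information (grammatical properties, glosses, examples) that is omitted here.

```python
from collections import defaultdict


def build_lexicon(tokens):
    """Group historical forms by modern lemma, then by modern word-form.

    `tokens` is an iterable of (historical_form, modern_form, lemma) triples.
    """
    lexicon = defaultdict(lambda: defaultdict(set))
    for form, modern, lemma in tokens:
        lexicon[lemma][modern].add(form)
    # Convert to plain dicts with sorted lists for stable output.
    return {
        lemma: {modern: sorted(forms) for modern, forms in by_modern.items()}
        for lemma, by_modern in lexicon.items()
    }


corpus = [
    ("lubesen", "ljubezen", "ljubezen"),
    ("ljubesen", "ljubezen", "ljubezen"),
    ("lubesni", "ljubezni", "ljubezen"),
    ("lubesen", "ljubezen", "ljubezen"),   # repeated attestation, stored once
]
lexicon = build_lexicon(corpus)
print(lexicon)
```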


Results

• Language resources for historical Slovene:
  • Text Collection hs5M:
    • facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k:
    • page-sampled, hand-annotated
  • Structured Lexicon imp20k:
    • grammar + glosses + forms + attestations
  • TEI P5, CC BY

• ToTrTaLe + resources for HS:
  • tokenisation & transcription patterns

• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012


Further work

• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian
