IMPACT Final Conference - Language Parallel Sessions - Erjavec

Tomaž Erjavec

Department of Knowledge Technologies

Jožef Stefan Institute

Ljubljana

Resources for historical Slovene

IMPACT Conference 2011

October 24-25, 2011, London

Background

• Pre-story: AHLib (2004–08) (Deutsch-slowenische/kroatische Übersetzung 1848–1918; German-Slovene/Croatian translation 1848–1918)
  • Corpus / DL of ger→slv books
  • AAS: transcription correction and markup (TEI P4)
  • JSI: automatic annotation and editing environment

• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction

• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene

Tomaž Erjavec: Slovene language resources

Methodology

• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words
• Develop annotation tool, ToTrTaLe
• How to tag and lemmatise historical Slovene?
  Little chance of developing training data comparable to that for contemporary Slovene
• Basic idea:
  • modernise words, then use models for modern Slovene
  • transcription is via fixed lexicon + transcription patterns
  • patterns implemented via LMU Vaam
  • mostly OK for XIX and XVIII century language
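The "fixed lexicon + transcription patterns" idea can be sketched in a few lines of Python. This is only an illustration, not the ToTrTaLe or Vaam implementation: the names `LEXICON`, `PATTERNS` and `modernise` are hypothetical, and the three rewrite rules are toy rules chosen to cover the ljubezen examples in these slides, not the project's actual pattern set.

```python
import re

# Hypothetical fixed lexicon: historical word-form -> modern word-form.
LEXICON = {
    "lubesen": "ljubezen",
    "ljubezin": "ljubezen",
}

# Hypothetical rewrite patterns (toy rules, applied in order).
PATTERNS = [
    (re.compile("ſ"), "s"),        # long s -> s
    (re.compile("^lub"), "ljub"),  # word-initial lub- -> ljub-
    (re.compile("esn"), "ezn"),    # -esn- -> -ezn-
]


def modernise(word: str) -> str:
    """Return a modernised form: lexicon lookup first, patterns as fallback."""
    if word in LEXICON:
        return LEXICON[word]
    for pattern, replacement in PATTERNS:
        word = pattern.sub(replacement, word)
    return word


print(modernise("lubesen"))   # found in the lexicon
print(modernise("lubeſni"))   # rewritten by the patterns
```

The modernised form would then be passed to a tagger and lemmatiser trained on contemporary Slovene. The lexicon is checked first because some forms (here "lubesen") are not reachable with the patterns alone.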

[Diagram: Texts, Historical lexicon, Contemporary models, ToTrTaLe, Annotators, Corpus]

Issues

• Tokenisation: words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči
• Variability:
  • archaic forms: ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection: ljubezen ← ljubezni, ljubeznijo
  • both: ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
  • zajhen / cajhen / znamenje
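The tokenisation issue listed above (historical tokens that must be split or merged to match modern word boundaries) can be sketched as follows. This is a minimal illustration under stated assumptions: the tables `SPLIT` and `MERGE` and the function `normalise_tokens` are hypothetical names, and they hold only the two examples from the slide.

```python
# Hypothetical rule tables, holding only the two examples from the slide.
SPLIT = {"žnjo": ["z", "njo"]}            # one historical token -> several modern ones
MERGE = {("po", "noči"): "ponoči"}        # two historical tokens -> one modern one


def normalise_tokens(tokens):
    """Re-segment a list of historical tokens to modern word boundaries."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE:
            # Two adjacent tokens that are written as one word today.
            out.append(MERGE[pair])
            i += 2
        elif tokens[i] in SPLIT:
            # One token that is written as several words today.
            out.extend(SPLIT[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


print(normalise_tokens(["žnjo", "po", "noči"]))
```

A real system would need context to decide when a merge applies, since "po" followed by "noči" is not necessarily the adverb in every sentence; the lookup here ignores that.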

Transcribed historical texts

• AHLib corpus/DL: 90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD: 5,000 pages, 1M words
• Google Books: 30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni): 200 books, 5M words (in progress)

~ 10M words

• most texts have associated facsimiles
• can be made freely available

Initial Lexicon

• Development of initial lexicon (2010), using the data and tools at hand
  • AHLib collection (70 books > 1850)
  • Transcription rules + FidaPLUS lexicon of contemporary slv
  • LMU LeXtractor editing tool
  • produced 3,000 entries (word-forms)

Reference corpus goo300k

• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually corrected
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL
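The per-word annotation and the two-step workflow (automatic first, manual correction second) can be pictured with a small record type. This is a sketch only: the class `TokenAnnotation`, its field names, the `checked` flag and the `correct` helper are assumptions made for illustration, not the corpus's TEI P5 encoding or the Cobalt data model. The tag value assumes a MULTEXT-East-style tag, which the slides do not specify.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TokenAnnotation:
    historical: str        # word-form as transcribed from the source text
    modern: str            # contemporary equivalent
    lemma: str             # modern lemma
    tag: str               # part-of-speech tag (tagset is an assumption here)
    checked: bool = False  # True once an annotator has reviewed the token


def correct(token: TokenAnnotation, **changes) -> TokenAnnotation:
    """Return a manually corrected copy, leaving the automatic annotation untouched."""
    return replace(token, checked=True, **changes)


# Automatic annotation (hypothetical tool output that got the modernisation wrong).
auto = TokenAnnotation("lubesni", "ljubesni", "ljubesen", "Ncfsg")

# Manual correction of the contemporary equivalent and the lemma.
fixed = correct(auto, modern="ljubezni", lemma="ljubezen")
print(fixed)
```

Keeping the automatic record immutable makes it possible to compare tool output with the corrected version afterwards, for example to measure tool accuracy.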

Period      Units  Pages  Tokens
1584            1      8    6000
1695            1     27   10000
1751-1800       8    155   27000
1801-1850      12    206   74000
1851-1875      36    380  126000
1876-1900      23    224   51000
∑              81   1000  296000

INL Cobalt lexicon building tool

TEI corpus dump

Final lexicon

Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection

Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples

goo300k        All    Historical
Lex. entries   56346  22849
Word-forms     53853  19627
Normalised     46996  15402
Modernised     37334  11396
Lemmas         19569   8605
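A "lemma oriented" lexicon groups attested historical spellings under the modern word-form and lemma they belong to. The sketch below shows one way such a dump could be built from annotated tokens. It is an assumption-laden illustration: `build_lexicon` and the nested-dictionary layout are hypothetical, the real lexicon is encoded in TEI P5, and the pairing of historical with modern forms in the sample data is illustrative.

```python
from collections import defaultdict


def build_lexicon(annotated_tokens):
    """Group (historical, modern, lemma) triples as lemma -> modern form -> spellings."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for historical, modern, lemma in annotated_tokens:
        lexicon[lemma][modern].add(historical)
    return lexicon


# Illustrative sample using word-forms from the "Issues" slide.
tokens = [
    ("lubesen", "ljubezen", "ljubezen"),
    ("ljubesen", "ljubezen", "ljubezen"),
    ("ljubesni", "ljubezni", "ljubezen"),
    ("ljubezni", "ljubezni", "ljubezen"),
]

lexicon = build_lexicon(tokens)
for modern, spellings in sorted(lexicon["ljubezen"].items()):
    print(modern, sorted(spellings))
```

Counting the keys at each level gives figures of the kind shown in the table: lemmas at the top level, modernised forms below them, and distinct historical word-forms at the leaves.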

Results

• Language resources for historical Slovene:
  • Text Collection hs5M:
    • facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k:
    • page-sampled, hand-annotated
  • Structured Lexicon imp20k:
    • grammar + glosses + forms + attestations
  • TEI P5, CC BY
• ToTrTaLe + resources for HS:
  • tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012

Further work

• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian
