Digital encoding of text Tomaž Erjavec

Page 1: Digital encoding of text

Digital encoding of text

Tomaž Erjavec

Page 2: Digital encoding of text

Scholarly digital editions of Slovenian literature http://nl.ijs.si/e-zrc/

Content provider: Institute of Slovenian Literature – Scientific research centre of the Slovenian Academy of Sciences and Arts, Ljubljana

Technology provider: Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana

Page 3: Digital encoding of text

Freising Manuscripts (FM):

• Three religious texts:

  – FM I: a confession form

  – FM II: a homily on penitence and remission

  – FM III: a confession form

• Provenance: Upper Carinthia or Freising (Austria, Germany)

• Place of use: Carinthian estates of the Freising diocese

• Written after 27 May 972 and not after 1023

Page 4: Digital encoding of text

The history of the Freising Manuscripts

• Discovered by B. J. Docen in 1806 in the Munich State Library

• Many printed editions since then

• First diplomatic transcription 1827 by P. Köppen & A. H. Vostokov, Sanktpeterburg

• Critical edition by Slovenian Academy of Sciences 1992, 1993, 2004

Page 5: Digital encoding of text

The printed edition (2004), our source, contains:

• Diplomatic transcription (DT) with apparatus, comparing 9 older DTs

• Critical transcription (CT) with apparatus, comparing 13 older CTs

• Phonetic transcription (PT) in IPA, with apparatus

• Translations into Latin and 3 modern languages

• Dictionary of all words in the CT, with PT, the 4 translations + Old Church Slavonic, and examples (concordances)

• Bibliography, with 600+ items

• Introductions

Page 6: Digital encoding of text

The goal of the e-edition: to gather the 200-year history of FM editions

• Annotated text of all major transcriptions so far: the history of understanding

• Alignment of all 16 transcriptions and translations: understanding through comparison

• Sound recording added to phonetic transcription: understanding through experiencing

• Addition of translations (Polish, Italian): understanding for non-Slovenian speakers

• Integration of materials: understanding for all
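The alignment goal above can be illustrated with a minimal sketch in Python. It assumes that every transcription and translation has already been split into segments carrying shared identifiers. The segment ids, the version names, and the bracketed placeholder strings are all hypothetical, and this is not the procedure used for the actual e-edition.

```python
def align(versions):
    """Line up parallel versions by shared segment id.

    `versions` maps a version name to a dict of {segment_id: text}.
    Returns the version names and one row per segment id, in order of
    first appearance; a version lacking a segment gets an empty string.
    """
    seg_ids = []
    for segments in versions.values():
        for seg_id in segments:
            if seg_id not in seg_ids:
                seg_ids.append(seg_id)
    names = list(versions)
    rows = [
        (seg_id, [versions[name].get(seg_id, "") for name in names])
        for seg_id in seg_ids
    ]
    return names, rows


# Placeholder content only; no real manuscript text is reproduced here.
versions = {
    "diplomatic": {"s1": "[diplomatic 1]", "s2": "[diplomatic 2]"},
    "critical": {"s1": "[critical 1]", "s2": "[critical 2]"},
    "english": {"s1": "[English 1]"},
}

names, rows = align(versions)
for seg_id, cells in rows:
    print(seg_id, " | ".join(cells))
```

Each row can then be rendered as one line of a parallel view, with gaps left visible where a version has no matching segment.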

Page 7: Digital encoding of text

Production of the e-edition

• Electronic original: a local editor format or re-keyed Word files

• Conversion: dedicated Perl and XSLT filters

• Target format: the Text Encoding Initiative Guidelines P4

• View format: XSLT transform into HTML

• Rapid prototyping and a cyclical process of refinement
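The two-stage pipeline above (source text to TEI, then TEI to HTML) can be sketched as follows. The project used Perl and XSLT filters; this sketch swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The function names and sample lines are hypothetical, and the output is a simplified TEI P4-style skeleton rather than a valid TEI document (a real header needs more than a title).

```python
import xml.etree.ElementTree as ET


def lines_to_tei(title, lines):
    """Stage 1: wrap plain transcription lines in a TEI P4-style tree."""
    tei = ET.Element("TEI.2")  # root element name used by TEI P4
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for number, line in enumerate(lines, start=1):
        # Give each segment an id so that views can link to it.
        seg = ET.SubElement(body, "p", {"id": f"p{number}"})
        seg.text = line.strip()
    return tei


def tei_to_html(tei):
    """Stage 2: derive a simple HTML view from the TEI tree."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = tei.findtext(
        "teiHeader/fileDesc/titleStmt/title"
    )
    for seg in tei.iterfind("text/body/p"):
        ET.SubElement(body, "p", {"id": seg.get("id")}).text = seg.text
    return ET.tostring(html, encoding="unicode")


sample_lines = ["placeholder line one ", "placeholder line two"]
tei_doc = lines_to_tei("Sample transcription", sample_lines)
html_page = tei_to_html(tei_doc)
print(html_page)
```

Keeping the TEI tree as the single target format means the HTML view can be regenerated after each correction, which suits the cyclical refinement mentioned above.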

Page 8: Digital encoding of text

Challenging issues

• Complex characters, e.g. (ZRCola font: http://zrcola.zrc-sazu.si/)

• Adding speech into the e-edition (manual segmentation, errors in the originals, inserting phrase & sentence boundaries into parallel views)

• Dictionary conversion (idiosyncratic format, complex structure, difficult cross-references)

Page 9: Digital encoding of text

Examples: the TEI-encoded phonetic transcription

Page 10: Digital encoding of text

BS (Brižinski spomeniki, the Freising Manuscripts) Dictionary

Page 11: Digital encoding of text

BS Bibliography

Page 12: Digital encoding of text

BS basic parallel view

Page 13: Digital encoding of text

Further work in finishing the BS eEdition

• TEI header (Slovene + English, also HTML view)

• Better treatment of PUA characters (documented in header, fallback)

• Resolving outstanding content issues

• Better overall structure and linking
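The PUA item in the list above can be illustrated with a minimal Python sketch that finds Private Use Area characters in a text and substitutes a documented fallback. The code point ranges are the standard Unicode private-use ranges. The `FALLBACKS` table is entirely hypothetical and does not reflect the actual ZRCola assignments.

```python
# Unicode private-use ranges: the BMP block plus planes 15 and 16.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

# Hypothetical fallback table: PUA character -> plain replacement.
# Real assignments would come from the font's own documentation.
FALLBACKS = {"\uE000": "e", "\uE001": "z"}


def is_pua(ch):
    """Return True if the character lies in a private-use range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_pua(text):
    """List the distinct PUA characters used, e.g. to document them."""
    return sorted({ch for ch in text if is_pua(ch)})


def apply_fallback(text, table, unknown="\uFFFD"):
    """Replace PUA characters via the table; unmapped ones become U+FFFD."""
    return "".join(
        table.get(ch, unknown) if is_pua(ch) else ch for ch in text
    )


sample = "ab\uE000c \uE001"
print([f"U+{ord(ch):04X}" for ch in find_pua(sample)])
print(apply_fallback(sample, FALLBACKS))
```

The list returned by `find_pua` is the kind of inventory that could be recorded in the header, while `apply_fallback` gives readers without the font a readable approximation.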

Page 14: Digital encoding of text

Further work: general goals

• Incorporating language technologies into the eEditions (concordancing, lemmatisation, part-of-speech tagging)

• Adaptable Web interface for viewing (select what and how to see: corrections, emendations, notes, facsimile)

• Accessing and connecting the e-library as a whole (cataloguing, searching)
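Of the language technologies named in the list above, concordancing is the simplest to sketch. The Python example below builds a keyword-in-context index over a short English placeholder sentence. The tokenizer and the context width are assumptions made for illustration; a real edition would need tokenization suited to the transcription and, ideally, lemmatisation first.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive)."""
    return re.findall(r"\w+", text.lower())


def concordance(text, width=2):
    """Map each word to its occurrences with `width` words of context."""
    tokens = tokenize(text)
    index = defaultdict(list)
    for i, token in enumerate(tokens):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        index[token].append((left, token, right))
    return index


# Placeholder sentence, not manuscript text.
sample = "the scribe copied the text and the reader read the text"
kwic = concordance(sample)
for left, key, right in kwic["text"]:
    print(f"{left:>12} [{key}] {right}")
```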