The Lifecycle of the DTA Base Format (DTABf) · 2013-11-09

Haaf/Geyken: The Lifecycle of the DTA Base Format

The Lifecycle of the DTA Base Format (DTABf)

Susanne Haaf, Alexander Geyken

Deutsches Textarchiv, BBAW – Berlin

The Context: DTA

Deutsches Textarchiv (German Text Archive)

Funded by / Partner of: [logos not preserved]

Context of the DTA Base Format

• DTABf: TEI/P5 format for text annotation within the corpora of the Deutsches Textarchiv (DTA)

• DTA corpora

– Core: Digitization of first editions of printed German historical texts (17th-19th century) from diverse disciplines

– Extensions (DTAE): texts, smaller corpora and large text corpora from currently ~15 other projects

• Purpose: Create a text corpus that reflects the historical development of the New High German language

DTA core corpus

DTA numbers (October 2013)

1 014 works freely available on www.deutschestextarchiv.de

345 431 digitized pages

80 094 247 tokens

563 091 033 characters (Unicode)

562 further works within DTAQ (www.deutschestextarchiv.de/dtaq; accessible after registration)

(Historical) German texts on the web

Types of Resources

1. DTA-“born”: pre-structured and transcribed according to the DTA guidelines, e.g. AEDit, Notes of A. v. Humboldt, Neue Rheinische Zeitung

2. Other TEI formats: e.g. Blumenbach online, sandrart.net, HAB Wolfenbüttel, Dingler’s Polytechnisches Journal

3. Other (non-TEI) formats: e.g. Wikisource, gutenberg.org, Gutzkow edition, A. v. Humboldt’s Kosmos

4. OCR-born: large collections, e.g. Digi20, GEI-Digital, Die Grenzboten (200,000 pages)

Primary focus: texts with high-quality metadata, without text normalization

Problem Statement

Problem Statement

• Different digitization formats

• Different but equivalent tagging solutions

• Different depths of text structuring

• Tagging of phenomena new to the DTABf

Goal: Homogeneity in text structuring

– obtain a homogeneous web presentation

– allow for consistent queries across all DTA corpora

Example: Combination of text and element nodes

Corpus query for the verb „laufen“ (to run) within a <stage> element
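A query of this kind only works if stage directions are tagged the same way in every text. The following is a minimal sketch of the idea, not the DTA's own search engine: it uses Python's standard `xml.etree.ElementTree`, a made-up sample document, and a hypothetical helper `stage_hits`. It matches the literal word form only, whereas a real corpus query would typically work on lemmas.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample; not taken from the DTA corpora.
SAMPLE = f"""<text xmlns="{TEI_NS}">
  <body>
    <sp><speaker>Karl</speaker>
      <p>Wir laufen morgen.</p>
      <stage>Beide laufen ab.</stage>
    </sp>
    <stage>Der Vorhang fällt.</stage>
  </body>
</text>"""


def stage_hits(xml_text: str, word: str) -> list[str]:
    """Return the text of every <stage> element that contains `word`."""
    root = ET.fromstring(xml_text)
    hits = []
    for stage in root.iter(f"{{{TEI_NS}}}stage"):
        # Join all text nodes below <stage> and normalize whitespace.
        text = " ".join("".join(stage.itertext()).split())
        # Naive tokenization: split on spaces, strip common punctuation.
        tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
        if word.lower() in tokens:
            hits.append(text)
    return hits


print(stage_hits(SAMPLE, "laufen"))
```

The occurrence of "laufen" inside `<p>` is ignored because the search is restricted to the element node, which is exactly the combination of text and element conditions the slide refers to.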

The DTABf: History & Structure

© Ed Brambley, http://commons.wikimedia.org/wiki/File:-_Flickr_-_edbrambley.jpg; CC BY-SA 2.0

DTABf: Re-inventing the wheel?

DTABf: Re-inventing the wheel?

Goal for the DTABf: high coverage of phenomena in written historical texts while being as precise as possible and remaining entirely TEI/P5 conformant

• TEI Lite

– Too flexible with respect to attributes and values

– No levels for the depth of annotation

• TEI Tite

– Not a proper subset of TEI/P5

– Problem of coverage

• …

TEI Formats

• Relation of the DTABf to other TEI formats: Geyken/Haaf/Wiegand: The DTA ‘base format’: A TEI-Subset for the Compilation of Interoperable Corpora. Wien, 2012.

• Here: life cycle of the DTABf, since it is continuously being adapted to new requirements

Genesis of the DTABf

• Phase 1 (2007–2010)

– Format for the structural phenomena encountered in the DTA core corpus

– Basis: large time span (1600–1900+), large variety of text genres

• Phase 2 (2010–2013)

– Thorough consistency checks: revision and comprehensive documentation of the DTABf based on the phenomena encountered in phase 1

• DTABf today

– Continuous adaptation to new phenomena

– Guiding ideas:

• Same tagging for semantically similar phenomena

• Being as concise as possible

• Documentation of decisions/explanation of tagging solutions

DTABf today

• Tagset:

– 80 <text> elements

– 25 <teiHeader> elements

– Proper subset of the TEI tagset

– Restricted selection of attributes and closed set of values (wherever possible and appropriate)

• Components:

– ODD (www.deutschestextarchiv.de/basisformat.odd)

– RelaxNG (www.deutschestextarchiv.de/basisformat.rng)

– Documentation/Guidelines for text annotation (www.deutschestextarchiv.de/doku/basisformat)

– Guidelines for text transcription (www.deutschestextarchiv.de/doku/richtlinien)

DTABf components: ODD

DTABf components: Documentation

http://www.deutschestextarchiv.de/doku/basisformat

http://www.deutschestextarchiv.de/doku/basisformat_table

The Life Cycle of the DTABf

Phenomena within the scope of the DTABf

Most new texts can be converted to an equivalent DTABf solution in a straightforward way

– Independent of format, e.g. Wikisource syntax

– TEI texts in which a different encoding was chosen

<unclear reason="problem">[Fehlender Text (engl.: missing text)]</unclear>

→ <gap reason="illegible"/>

(Source: Dingler’s Polytechnisches Journal)

<ornament type="line_long"/>

→ <milestone unit="section" rendition="#hr"/>

<hi rendition="#center">

→ <hi rendition="#c">

(Source: Blumenbach online)
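The three mappings above are simple enough to be applied mechanically. The following is a rough sketch under stated assumptions: it uses only Python's standard `xml.etree.ElementTree`, works on an invented, namespace-free fragment, and the function name `to_dtabf` is hypothetical. It is not the DTA's actual conversion routine, and it covers only the three cases shown on the slide.

```python
import xml.etree.ElementTree as ET

# Invented fragment combining the three source encodings from the slide.
SAMPLE = (
    '<div>'
    '<p>Anfang <unclear reason="problem">[Fehlender Text]</unclear> Ende</p>'
    '<ornament type="line_long"/>'
    '<p><hi rendition="#center">Titel</hi></p>'
    '</div>'
)


def to_dtabf(root: ET.Element) -> ET.Element:
    """Rewrite the three source encodings into their DTABf equivalents."""
    # Materialize the iterator first, since the tree is modified below.
    for el in list(root.iter()):
        if el.tag == "unclear" and el.get("reason") == "problem":
            # Placeholder text for missing content becomes an empty <gap/>.
            el.tag = "gap"
            el.attrib.clear()
            el.set("reason", "illegible")
            el.text = None
            for child in list(el):
                el.remove(child)
        elif el.tag == "ornament" and el.get("type") == "line_long":
            # A long ornamental line becomes a section milestone.
            el.tag = "milestone"
            el.attrib.clear()
            el.set("unit", "section")
            el.set("rendition", "#hr")
        elif el.tag == "hi" and el.get("rendition") == "#center":
            # Only the rendition value changes.
            el.set("rendition", "#c")
    return root


converted = to_dtabf(ET.fromstring(SAMPLE))
print(ET.tostring(converted, encoding="unicode"))
```

The text following each rewritten element is kept untouched, so the surrounding running text is not affected by the conversion.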

New phenomena within the scope of DTABf

www.deutschestextarchiv.de/dtaq/book/view/bacmeister_predigt_1614?p=18

New phenomena requiring changes to the DTABf

Example: Annotation depth

Figure references & descriptions

<figure xml:id="tx000027_005a" rendition="#center">
  <graphic url="images/tx000027_005a"/>
  <figDesc>illustration on titlepage</figDesc>
</figure>

<figure facs="images/tx000027_005a">
  <note type="editorial">illustration on titlepage</note>
</figure>

(Source: Blumenbach online)

New phenomena requiring changes to the DTABf

Example: New document types

New division types for funeral sermons (AEDit project)

<elementSpec ident="div" module="textstructure" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="fsBibleVerse"/>
        <valItem ident="fsSermon"/>
        <valItem ident="fsOration"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>

(cf. DTABf ODD: www.deutschestextarchiv.de/basisformat.odd)
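To illustrate what a closed value list means in practice, here is a small sketch that reads the permitted `type` values from the ODD excerpt above and reports `<div>` elements that use anything else. It is an illustration only: in real use the generated RelaxNG schema performs this check. The excerpt is parsed without a namespace for brevity, the helper names are hypothetical, and the value `fsEulogy` is invented to trigger a violation.

```python
import xml.etree.ElementTree as ET

# The excerpt shown on the slide (the full DTABf ODD defines more).
ODD_EXCERPT = """<elementSpec ident="div" module="textstructure" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="fsBibleVerse"/>
        <valItem ident="fsSermon"/>
        <valItem ident="fsOration"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>"""


def allowed_values(odd_xml: str, attribute: str) -> set[str]:
    """Collect the closed value list declared for `attribute`."""
    spec = ET.fromstring(odd_xml)
    values: set[str] = set()
    for att_def in spec.iter("attDef"):
        if att_def.get("ident") != attribute:
            continue
        for val_list in att_def.iter("valList"):
            if val_list.get("type") == "closed":
                values.update(i.get("ident") for i in val_list.iter("valItem"))
    return values


def invalid_div_types(doc_xml: str, allowed: set[str]) -> list[str]:
    """Return every div/@type value that is not in the closed list."""
    root = ET.fromstring(doc_xml)
    return [
        d.get("type")
        for d in root.iter("div")
        if d.get("type") is not None and d.get("type") not in allowed
    ]


# Invented document; "fsEulogy" is a made-up value.
DOC = '<body><div type="fsSermon"/><div type="fsEulogy"/><div/></body>'
print(invalid_div_types(DOC, allowed_values(ODD_EXCERPT, "type")))
```

Adding a new document type to the format then amounts to adding a `<valItem>` to the ODD and regenerating the schema, rather than changing any checking code.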

New phenomena requiring adjustments of the DTABf documentation

Example: List items? Paragraphs?

http://www.deutschestextarchiv.de/dtaq/book/view/kaempfer_japan01_1777?p=157; ~?p=158

New TEI/P5 releases requiring changes to the DTABf

Changes to the TEI Guidelines that are not backward compatible

<biblScope type="pages">volume number of the publication within a series</biblScope>

→ <biblScope unit="pages">volume number of the publication within a series</biblScope>

(Changes within TEI/P5, release 2.3.0; cf. http://www.tei-c.org/Vault/P5/current/doc/tei-p5-doc/readme-2.3.0.html)
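A change like this one can be applied to existing documents with a small migration step. The following is a minimal sketch, assuming a namespace-free fragment and Python's standard `xml.etree.ElementTree`; the function name `migrate_biblscope` and the sample are invented, and this is not the migration the DTA actually ran.

```python
import xml.etree.ElementTree as ET

# Invented sample: one element in the old style, one already migrated.
SAMPLE = (
    '<bibl>'
    '<biblScope type="pages">12-34</biblScope>'
    '<biblScope unit="volume">3</biblScope>'
    '</bibl>'
)


def migrate_biblscope(root: ET.Element) -> int:
    """Rename biblScope/@type to @unit; return the number of changes."""
    changed = 0
    for el in root.iter("biblScope"):
        # Leave elements alone that already carry @unit.
        if "type" in el.attrib and "unit" not in el.attrib:
            el.set("unit", el.attrib.pop("type"))
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
print(migrate_biblscope(root))
print(ET.tostring(root, encoding="unicode"))
```

Because elements that already have `unit` are skipped, the step can be run repeatedly over a mixed corpus without changing documents that were migrated earlier.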

Specification of annotation depth

The DTABf provides tagging solutions for a wide range of phenomena; encoding levels are specified to ensure interoperability

• Encoding Level 1 (required): <pb>, <list>, <lg>, <note>, …

• Encoding Level 2 (recommended): <choice>, <fw>, <lb>, …

• Encoding Level 3 (optional): <foreign>, <persName>, …

• Encoding Level 4 (proscribed): <ab>, <div1>, <g>, …

(Cf. TEI P5 Guidelines, Ch. 15.5: Recommendations for the Encoding of Large Corpora http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html#CCREC)
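The level scheme can be used to profile a document, for example to spot proscribed elements before ingestion. The sketch below is a hedged illustration: the `LEVELS` table contains only the element names listed on the slide (the real lists are longer), and the helper `elements_by_level` and the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the elements named on the slide; the actual level lists are longer.
LEVELS = {
    1: {"pb", "list", "lg", "note"},    # required
    2: {"choice", "fw", "lb"},          # recommended
    3: {"foreign", "persName"},         # optional
    4: {"ab", "div1", "g"},             # proscribed
}


def elements_by_level(xml_text: str) -> dict[int, set[str]]:
    """Group the element names used in a document by encoding level."""
    root = ET.fromstring(xml_text)
    found: dict[int, set[str]] = {level: set() for level in LEVELS}
    for el in root.iter():
        # Drop a namespace prefix of the form "{uri}name" if present.
        name = el.tag.rsplit("}", 1)[-1]
        for level, names in LEVELS.items():
            if name in names:
                found[level].add(name)
    return found


# Invented sample document.
SAMPLE = (
    "<text><body><pb/><ab>x<lb/>y</ab>"
    "<p><persName>Humboldt</persName></p></body></text>"
)
profile = elements_by_level(SAMPLE)
print("proscribed elements used:", sorted(profile[4]))
```

Elements that appear in none of the four sets (such as `<p>` here) are simply not reported, since the table is incomplete by construction.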

Life Cycle of the DTABf: Tools and Services for Users

DTABf: Software

• Editing: DTA-Framework for the oXygen Author mode (DTAoX) for DTABf conformant text annotation (used in Curation Project 1 of CLARIN-D WG 1)

• Proofreading and web-editing: Web-based distributed proofreading and editing environment (DTAQ; currently (Oct. 2013) 285 registered users, cf. Poster on TEI-MM 2013)

• Metadata: Web form for metadata ingestion

• Object data (text): Serialization routines (TokWrap) for use in NLP contexts

Dissemination of DTABf

• Best Practice model of CLARIN-D (for historical written text)

(www.clarin-d.de/de/sprachressourcen/benutzerhandbuch.html)

• Regular (national) hands-on workshops (next: Oct 25th in Berlin, BBAW)

• Presentation at CLARIN-D Summer School (2014)

DTABf and CLARIN

• Integration of the DTA corpora into the CLARIN-D service center

– Persistent identifiers

– OAI-PMH

• Conversion routines: DTABf to CMDI & TCF (enabling usage within the NLP tool chains of CLARIN-D WebLicht)

• Endpoint for Federated Content Search in CLARIN

• Long-term preservation via CLARIN-D service center

Thank you!

Visit the DTABf documentation: www.deutschestextarchiv.de/doku/basisformat

Read more: www.deutschestextarchiv.de/doku/publikationen

Participate: www.deutschestextarchiv.de/dtaq

www.deutschestextarchiv.de/dtae

Contact us: [email protected] @textarchiv