The Lifecycle of the DTA Base Format (DTABf) · 2013-11-09

Haaf/Geyken: The Lifecycle of the DTA Base Format

The Lifecycle of the DTA Base Format (DTABf)

Susanne Haaf, Alexander Geyken

Deutsches Textarchiv, BBAW – Berlin

The Context: DTA

Deutsches Textarchiv (German Text Archive)

Context of the DTA Base Format

• DTABf: TEI/P5 format for text annotation within the corpora of the Deutsches Textarchiv (DTA)

• DTA corpora

– Core: Digitization of first editions of printed German historical texts (17th-19th century) from diverse disciplines

– Extensions (DTAE): texts, smaller corpora and large text corpora from currently ~15 other projects

• Purpose: Create a text corpus that reflects the development of historical New High German

DTA core corpus

DTA numbers (October 2013)

1 014 works freely available on www.deutschestextarchiv.de

345 431 digitized pages

80 094 247 tokens

563 091 033 characters (Unicode)

562 further works within DTAQ (www.deutschestextarchiv.de/dtaq; accessible after registration)

(Historical) German texts on the web

Types of Resources

1. DTA-“born”: pre-structured and transcribed according to the DTA guidelines, e.g. AEDit, Notes of A. v. Humboldt, Neue Rheinische Zeitung

2. Other TEI formats: e.g. Blumenbach online, sandrart.net, HAB Wolfenbüttel, Dingler’s Polytechnisches Journal

3. Other (non-TEI) formats: e.g. Wikisource, gutenberg.org, Gutzkow edition, A. v. Humboldt’s Kosmos

4. OCR-born: large collections e.g. Digi20, GEI-Digital, Die Grenzboten (200,000 pages)

Primary focus: texts with high quality metadata, without text normalization

Problem Statement

• Different digitization formats

• Different but equivalent tagging solutions

• Different depth of text structuring

• Tagging of phenomena new to the DTABf

Problem Statement

Goal: Homogeneity in text structuring

• Obtain a homogeneous web presentation

• Allow for consistent queries across all DTA corpora

Example: Combination of text and element nodes

Corpus query for the verb „laufen“ (to run) within a <stage> element
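
To illustrate why mixed text and element nodes matter for such a query, here is a minimal Python sketch. It assumes a tiny invented TEI sample and a hypothetical helper `stage_hits`; it matches the surface form only (no lemmatization) and is not the DTA query engine. The point is that the text of a <stage> element has to be collected across nested elements such as <hi> before matching.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample: the word sits inside a nested <hi>, i.e. in mixed content.
SAMPLE = f"""<text xmlns="{TEI_NS}"><body>
  <sp><speaker>Franz</speaker>
    <stage>Sie <hi rendition="#g">laufen</hi> davon.</stage>
    <p>Wir laufen nicht.</p>
  </sp>
</body></text>"""


def stage_hits(xml_string, word):
    """Return the normalized text of every <stage> that contains `word`."""
    root = ET.fromstring(xml_string)
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    hits = []
    for stage in root.iter(f"{{{TEI_NS}}}stage"):
        # itertext() walks text nodes of the element and all its descendants.
        text = " ".join("".join(stage.itertext()).split())
        if pattern.search(text):
            hits.append(text)
    return hits


hits = stage_hits(SAMPLE, "laufen")
print(hits)
```

The occurrence inside <p> is ignored because only <stage> elements are searched; such a restriction works reliably only if all texts in the corpus tag stage directions the same way.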

The DTABf: History & Structure

© Ed Brambley, http://commons.wikimedia.org/wiki/File:-_Flickr_-_edbrambley.jpg; CC BY-SA 2.0

DTABf: Re-inventing the wheel?

DTABf: Re-inventing the wheel?

Goal for DTABf: high coverage of phenomena in written historical text while being as precise as possible; keep entirely TEI/P5 conformant

• TEI Lite:

– Too flexible with respect to attributes and values

– No levels for the depth of annotation

• TEI Tite

– Not a proper subset of TEI/P5

– Problem of coverage

• …

TEI Formats

• Geyken/Haaf/Wiegand: The DTA ‘base format’: A TEI-Subset for the Compilation of Interoperable Corpora. Wien, 2012.

Relation DTABf ~ other TEI formats

• Here: Life Cycle of the DTABf, since it is being continuously adapted to new necessities

Genesis of the DTABf

• Phase 1 (2007 – 2010)

– Format for the structural phenomena encountered in the DTA core corpus

– Basis: large time span (1600-1900+), large variety of text genres

• Phase 2 (2010 – 2013)

– Thorough consistency checks: revision and comprehensive documentation of the DTABf, based on the phenomena encountered in phase 1

• DTABf today

– Continuous adaptation to new phenomena

– Guiding ideas:

• Same tagging for semantically similar phenomena

• Being as concise as possible

• Documentation of decisions/Explanation of tagging solutions

DTABf today

• Tagset:

– 80 <text> elements

– 25 <teiHeader> elements

– Proper subset of the TEI tagset

– Restricted selection of attributes and closed set of values (wherever possible and appropriate)

• Components:

– ODD (www.deutschestextarchiv.de/basisformat.odd)

– RelaxNG (www.deutschestextarchiv.de/basisformat.rng)

– Documentation/Guidelines for text annotation (www.deutschestextarchiv.de/doku/basisformat)

– Guidelines for text transcription (www.deutschestextarchiv.de/doku/richtlinien)

DTABf components: ODD

DTABf components: Documentation

http://www.deutschestextarchiv.de/doku/basisformat

http://www.deutschestextarchiv.de/doku/basisformat_table

The Life Cycle of the DTABf

Phenomena within the scope of the DTABf

Most new texts can be converted to an equivalent DTABf solution in a straightforward way

– Independent of format, e.g. Wikisource syntax

– TEI texts, where different encoding was chosen

<unclear reason="problem">[Fehlender Text (engl.: missing text)]</unclear>

→ <gap reason="illegible"/>

(Source: Dingler’s Polytechnisches Journal)

<ornament type="line_long"/>

→ <milestone unit="section" rendition="#hr"/>

<hi rendition="#center"> → ~#c

(Source: Blumenbach online)
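
The three mappings above can be sketched as a small conversion routine. This is a simplified illustration, not the DTA's actual conversion tooling: the function name `to_dtabf` and the sample paragraph are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample combining the three source encodings shown on the slide.
SOURCE = (
    '<p>Der <unclear reason="problem">[Fehlender Text]</unclear> Apparat.'
    '<ornament type="line_long"/>'
    '<hi rendition="#center">Titel</hi></p>'
)


def to_dtabf(root):
    """Rewrite source-specific encodings into the equivalent DTABf forms."""
    for el in list(root.iter()):
        if el.tag == "unclear" and el.get("reason") == "problem":
            tail = el.tail          # clear() also resets the tail, so keep it
            el.clear()              # drops placeholder text and attributes
            el.tag = "gap"
            el.set("reason", "illegible")
            el.tail = tail
        elif el.tag == "ornament" and el.get("type") == "line_long":
            tail = el.tail
            el.clear()
            el.tag = "milestone"
            el.set("unit", "section")
            el.set("rendition", "#hr")
            el.tail = tail
        elif el.get("rendition") == "#center":
            el.set("rendition", "#c")   # same element, DTABf rendition value
    return root


converted = to_dtabf(ET.fromstring(SOURCE))
print(ET.tostring(converted, encoding="unicode"))
```

Keeping each mapping as an explicit rule makes it easy to document which source encoding was replaced by which DTABf solution.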

New phenomena within the scope of DTABf

www.deutschestextarchiv.de/dtaq/book/view/bacmeister_predigt_1614?p=18

New phenomena requiring changes to the DTABf

Example: Annotation depth

Figure references & descriptions

<figure xml:id="tx000027_005a" rendition="#center">

<graphic url="images/tx000027_005a"/>

<figDesc>

illustration on titlepage

</figDesc>

</figure>

<figure facs="images/tx000027_005a">

<note type="editorial">

illustration on titlepage

</note>

</figure>

(Source: Blumenbach online)

New phenomena requiring changes to the DTABf

Example: New document types

New division types for funeral sermons (AEDit project)

<elementSpec ident="div" module="textstructure" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="fsBibleVerse"/>
        <valItem ident="fsSermon"/>
        <valItem ident="fsOration"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>

(cf. DTABf ODD: www.deutschestextarchiv.de/basisformat.odd)
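
A closed value list means that any other @type value on <div> is rejected. The following Python sketch shows the idea under simplifying assumptions: it reads only the ODD excerpt from this slide (the full ODD may list further values), omits namespaces, and uses hypothetical helper names as well as an invented value `fsPoem` to trigger a violation. In practice this check is done by validating against the RelaxNG schema generated from the ODD.

```python
import xml.etree.ElementTree as ET

# ODD excerpt as shown on the slide (namespace omitted for brevity).
ODD_FRAGMENT = """
<elementSpec ident="div" module="textstructure" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="fsBibleVerse"/>
        <valItem ident="fsSermon"/>
        <valItem ident="fsOration"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>
"""

# Invented sample text; "fsPoem" is a made-up value outside the closed list.
SAMPLE_TEXT = """
<body>
  <div type="fsSermon"><p>...</p></div>
  <div type="fsPoem"><p>...</p></div>
</body>
"""


def allowed_values(odd_xml, attribute):
    """Collect the valItem idents of closed value lists for one attribute."""
    spec = ET.fromstring(odd_xml)
    values = set()
    for att in spec.iter("attDef"):
        if att.get("ident") != attribute:
            continue
        for val_list in att.iter("valList"):
            if val_list.get("type") == "closed":
                values.update(item.get("ident") for item in val_list.iter("valItem"))
    return values


def invalid_div_types(text_xml, allowed):
    """Return every div/@type value that is not in the allowed set."""
    root = ET.fromstring(text_xml)
    return [
        div.get("type")
        for div in root.iter("div")
        if div.get("type") is not None and div.get("type") not in allowed
    ]


allowed = allowed_values(ODD_FRAGMENT, "type")
violations = invalid_div_types(SAMPLE_TEXT, allowed)
print(sorted(allowed), violations)
```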

New phenomena requiring adjustments of the DTABf documentation

Example: List items? Paragraphs?

http://www.deutschestextarchiv.de/dtaq/book/view/kaempfer_japan01_1777?p=157; ~?p=158

New TEI/P5 releases requiring changes to the DTABf

Changes to the TEI guidelines, which are not backward compatible

<biblScope type="pages">volume number of the publication within a series</biblScope>

→ <biblScope unit="pages">volume number of the publication within a series</biblScope>

(Changes within TEI/P5, release 2.3.0; cf. http://www.tei-c.org/Vault/P5/current/doc/tei-p5-doc/readme-2.3.0.html)
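
A change of this kind can be applied to existing documents mechanically. The sketch below is an assumption about how such a migration might look, not the DTA's own routine: the function name `migrate_biblscope` and the sample <bibl> are invented, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample: one <biblScope> in the old form, one already migrated.
SAMPLE = (
    '<bibl>'
    '<biblScope type="pages">12-34</biblScope>'
    '<biblScope unit="volume">3</biblScope>'
    '</bibl>'
)


def migrate_biblscope(root):
    """Rename biblScope/@type to @unit; return the number of changed elements."""
    changed = 0
    for el in root.iter("biblScope"):
        # Leave elements alone that already carry @unit.
        if "type" in el.attrib and "unit" not in el.attrib:
            el.set("unit", el.attrib.pop("type"))
            changed += 1
    return changed


bibl = ET.fromstring(SAMPLE)
changed = migrate_biblscope(bibl)
print(changed, [el.attrib for el in bibl.iter("biblScope")])
```

Running the function a second time changes nothing, so it is safe to apply to a corpus in which some files have already been updated.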

Specification of annotation depth

DTABf: tagging solutions for a wide range of phenomena, with encoding levels to ensure interoperability

• Encoding Level 1 (required): <pb>, <list>, <lg>, <note>, …

• Encoding Level 2 (recommended): <choice>, <fw>, <lb>, …

• Encoding Level 3 (optional): <foreign>, <persName>, …

• Encoding Level 4 (proscribed): <ab>, <div1>, <g>, …

(Cf. TEI P5 Guidelines, Ch. 15.5: Recommendations for the Encoding of Large Corpora http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html#CCREC)
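
The levels can be used to profile a document, for example to flag proscribed elements before a text is integrated. The following sketch is illustrative only: the mapping contains just the elements named on this slide (the lists there are open-ended), the function name `level_report` and the sample are invented, and namespaces are omitted.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Only the elements named on the slide; the real level lists are longer.
LEVELS = {
    "pb": 1, "list": 1, "lg": 1, "note": 1,   # required
    "choice": 2, "fw": 2, "lb": 2,            # recommended
    "foreign": 3, "persName": 3,              # optional
    "ab": 4, "div1": 4, "g": 4,               # proscribed
}

# Invented sample containing elements from all four levels.
SAMPLE = (
    "<text><body><div1><pb/>"
    "<p>Ein <foreign>exemplum</foreign><lb/>Text</p>"
    "<ab>x</ab></div1></body></text>"
)


def level_report(xml_string):
    """Count elements per encoding level and list proscribed tags in order."""
    counts = Counter()
    proscribed = []
    for el in ET.fromstring(xml_string).iter():
        level = LEVELS.get(el.tag)
        if level is None:
            continue  # element not covered by this partial mapping
        counts[level] += 1
        if level == 4:
            proscribed.append(el.tag)
    return counts, proscribed


counts, proscribed = level_report(SAMPLE)
print(dict(counts), proscribed)
```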

Life Cycle of the DTABf: Tools and Services for Users

DTABf: Software

• Editing: DTA-Framework for the oXygen Author mode (DTAoX) for DTABf conformant text annotation (used in Curation Project 1 of CLARIN-D WG 1)

• Proofreading and web-editing: Web-based distributed proofreading and editing environment (DTAQ; currently (Oct. 2013) 285 registered users, cf. Poster on TEI-MM 2013)

• Metadata: Web form for metadata ingestion

• Object data (text): Serialization routines (TokWrap) for use in NLP contexts

Dissemination of DTABf

• Best Practice model of CLARIN-D (for historical written text)

(www.clarin-d.de/de/sprachressourcen/benutzerhandbuch.html)

• Regular (national) hands-on workshops (next: Oct 25th in Berlin, BBAW)

• Presentation at CLARIN-D Summer School (2014)

DTABf and CLARIN

• Integration of DTA corpora into the CLARIN-D service center

– Persistent Identifiers

– OAI-PMH

• Conversion routines: DTABf to CMDI & TCF (enabling usage within the NLP tool chains of CLARIN-D WebLicht)

• Endpoint for Federated Content Search in CLARIN

• Long-term preservation via CLARIN-D service center

Thank you!

Visit the DTABf documentation: www.deutschestextarchiv.de/doku/basisformat

Read more: www.deutschestextarchiv.de/doku/publikationen

Participate: www.deutschestextarchiv.de/dtaq

www.deutschestextarchiv.de/dtae

Contact us: dta@bbaw.de @textarchiv
