Haaf/Geyken: The Lifecycle of the DTA Base Format
The Lifecycle of the DTA Base Format (DTABf)
Susanne Haaf, Alexander Geyken
Deutsches Textarchiv, BBAW – Berlin
The Context: DTA
Deutsches Textarchiv (German Text Archive)
Context of the DTA Base Format
• DTABf: TEI/P5 format for text annotation within the corpora of the Deutsches Textarchiv (DTA)
• DTA corpora
– Core: Digitization of first editions of printed German historical texts (17th-19th century) from diverse disciplines
– Extensions (DTAE): texts, smaller corpora and large text corpora from currently ~15 other projects
• Purpose: Create a text corpus that reflects the development of historical New High German
DTA core corpus
DTA numbers (October 2013)
1 014 works freely available on www.deutschestextarchiv.de
345 431 digitized pages
80 094 247 tokens
563 091 033 characters (Unicode)
562 further works within DTAQ (www.deutschestextarchiv.de/dtaq; accessible after registration)
(Historical) German texts on the web
Types of Resources
1. DTA-“born”: pre-structured and transcribed according to the DTA guidelines, e.g. AEDit, Notes of A. v. Humboldt, Neue Rheinische Zeitung
2. Other TEI formats: e.g. Blumenbach online, sandrart.net, HAB Wolfenbüttel, Dingler’s Polytechnisches Journal
3. Other (non-TEI) formats: e.g. Wikisource, gutenberg.org, Gutzkow edition, A. v. Humboldt’s Kosmos
4. OCR-born: large collections e.g. Digi20, GEI-Digital, Die Grenzboten (200,000 pages)
Primary focus: texts with high quality metadata, without text normalization
Problem Statement
• Different digitization formats
• Different, equivalent tagging solutions
• Different depth of text structuring
• Tagging of phenomena new to the DTABf
Goal: Homogeneity in text structuring
• obtain homogeneous web presentation
• allow for consistent queries throughout the entire DTA corpora
Example: Combination of text and element nodes
Corpus query for the verb „laufen“ (run) within <stage> elements
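A query that combines text and element nodes can be sketched in a few lines of Python. This is only an illustration, not the DTA query engine: the sample document, the list of surface forms in FORMS, and the function name stage_hits are assumptions made for this example. A real corpus query would match on lemmas rather than on a hand-written list of word forms.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample, not taken from the DTA corpora.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <sp><speaker>Karl</speaker><p>Wir laufen nicht davon.</p>
      <stage>Die Diener <hi rendition="#i">laufen</hi> hinaus.</stage></sp>
    <stage>Der Vorhang fällt.</stage>
  </body>
</text>"""

# Assumed surface forms of "laufen"; a stand-in for a lemma-based search.
FORMS = re.compile(r"\b(laufen|läuft|lief|liefen|gelaufen)\b", re.IGNORECASE)


def stage_hits(xml_string):
    """Return the normalized text of every <stage> that contains a form of 'laufen'."""
    root = ET.fromstring(xml_string)
    hits = []
    for stage in root.iter(f"{{{TEI_NS}}}stage"):
        # itertext() also collects text inside child elements such as <hi>.
        text = " ".join("".join(stage.itertext()).split())
        if FORMS.search(text):
            hits.append(text)
    return hits


print(stage_hits(SAMPLE))
```

The occurrence of "laufen" inside <p> is ignored because it is not within a <stage> element. Such a query only works across corpora if every text marks stage directions with the same element, which is the homogeneity the DTABf aims for.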
The DTABf: History & Structure
© Ed Brambley, http://commons.wikimedia.org/wiki/File:-_Flickr_-_edbrambley.jpg; CC BY-SA 2.0
DTABf: Re-inventing the wheel?
DTABf: Re-inventing the wheel?
Goal for DTABf: high coverage of phenomena in written historical text while being as precise as possible; keep entirely TEI/P5 conformant
• TEI Lite:
– Too flexible with respect to attributes and values
– No levels for the depth of annotation
• TEI Tite
– Not a proper subset of TEI/P5
– Problem of coverage
• …
TEI Formats
• Geyken/Haaf/Wiegand: The DTA ‘base format’: A TEI-Subset for the Compilation of Interoperable Corpora. Wien, 2012.
Relation DTABf ~ other TEI formats
• Here: Life Cycle of the DTABf, since it is being continuously adapted to new necessities
Genesis of the DTABf
• Phase 1 (2007 – 2010) – Format for the structural phenomena encountered in the DTA core corpus
– Basis: large time span (1600-1900+), large variety of text genres
• Phase 2 (2010 – 2013) – Thorough consistency checks: revision and comprehensive documentation of the DTABf based on the phenomena encountered in phase 1
• DTABf today – Continuous adaptation to new phenomena
– Guiding ideas:
• Same tagging for semantically similar phenomena
• Being as concise as possible
• Documentation of decisions/Explanation of tagging solutions
DTABf today
• Tagset:
– 80 <text> elements
– 25 <teiHeader> elements
– Proper subset of the TEI tagset
– Restricted selection of attributes and closed set of values (wherever possible and appropriate)
• Components:
– ODD (www.deutschestextarchiv.de/basisformat.odd)
– RelaxNG (www.deutschestextarchiv.de/basisformat.rng)
– Documentation/Guidelines for text annotation (www.deutschestextarchiv.de/doku/basisformat)
– Guidelines for text transcription (www.deutschestextarchiv.de/doku/richtlinien)
DTABf components: ODD
DTABf components: Documentation
http://www.deutschestextarchiv.de/doku/basisformat
http://www.deutschestextarchiv.de/doku/basisformat_table
The Life Cycle of the DTABf
Phenomena within the scope of the DTABf
Most new texts can be converted to an equivalent DTABf solution in a straightforward way
– Independent of format, e.g. Wikisource syntax
– TEI texts, where different encoding was chosen
<unclear reason="problem">
[Fehlender Text (engl.: missing text)]
</unclear>
→ <gap reason="illegible"/>
(Source: Dingler’s Polytechnisches Journal)
<ornament type="line_long"/>
→ <milestone unit="section" rendition="#hr"/>
<hi rendition="#center"> → <hi rendition="#c">
(Source: Blumenbach online)
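The mappings above lend themselves to simple rule-based conversion. The following Python sketch applies the three rules shown on this slide to a small, made-up fragment. The sample input, the function name to_dtabf, and the omission of the TEI namespace are simplifications for illustration; the sketch reads the slide's "~#c" shorthand as rendition="#c". It is not the DTA's actual conversion routine.

```python
import xml.etree.ElementTree as ET

# Hypothetical input combining the source encodings from the slide
# (TEI namespace omitted for brevity).
SAMPLE = """<div>
  <p>Text <unclear reason="problem">[Fehlender Text]</unclear> weiter.</p>
  <ornament type="line_long"/>
  <p><hi rendition="#center">Titel</hi></p>
</div>"""


def to_dtabf(root):
    """Rewrite the three source encodings into their DTABf equivalents, in place."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "unclear" and child.get("reason") == "problem":
                # Placeholder text for missing content becomes an empty <gap/>.
                gap = ET.Element("gap", {"reason": "illegible"})
                gap.tail = child.tail
                parent[index] = gap
            elif child.tag == "ornament" and child.get("type") == "line_long":
                # A long ornamental line becomes a section milestone.
                milestone = ET.Element(
                    "milestone", {"unit": "section", "rendition": "#hr"}
                )
                milestone.tail = child.tail
                parent[index] = milestone
            elif child.tag == "hi" and child.get("rendition") == "#center":
                # Only the rendition value changes; the element is kept.
                child.set("rendition", "#c")
    return root


converted = to_dtabf(ET.fromstring(SAMPLE))
print(ET.tostring(converted, encoding="unicode"))
```

Each replacement copies the tail of the element it replaces, so the running text after the element (" weiter." in the sample) is preserved.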
New phenomena within the scope of DTABf
www.deutschestextarchiv.de/dtaq/book/view/bacmeister_predigt_1614?p=18
New phenomena requiring changes to the DTABf
Example: Annotation depth
Figure references & descriptions
<figure xml:id="tx000027_005a" rendition="#center">
<graphic url="images/tx000027_005a"/>
<figDesc>
illustration on titlepage
</figDesc>
</figure>
→
<figure facs="images/tx000027_005a">
<note type="editorial">
illustration on titlepage
</note>
</figure>
(Source: Blumenbach online)
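As a rough illustration of the figure mapping above, the following Python sketch builds the target <figure> from the source encoding. The function name convert_figure is an assumption for this example, the TEI namespace is omitted, and the sketch mirrors only what the slide shows: the image reference moves to @facs, the description moves to an editorial note, and the remaining source attributes are not carried over.

```python
import xml.etree.ElementTree as ET

# Source encoding as shown on the slide (TEI namespace omitted for brevity).
SOURCE = """<figure xml:id="tx000027_005a" rendition="#center">
  <graphic url="images/tx000027_005a"/>
  <figDesc>illustration on titlepage</figDesc>
</figure>"""


def convert_figure(source_figure):
    """Build a new <figure> with @facs and an editorial <note>."""
    graphic = source_figure.find("graphic")
    fig_desc = source_figure.find("figDesc")
    target = ET.Element("figure")
    if graphic is not None and graphic.get("url"):
        target.set("facs", graphic.get("url"))
    if fig_desc is not None:
        note = ET.SubElement(target, "note", {"type": "editorial"})
        # Collapse whitespace left over from pretty-printed source files.
        note.text = " ".join("".join(fig_desc.itertext()).split())
    return target


result = convert_figure(ET.fromstring(SOURCE))
print(ET.tostring(result, encoding="unicode"))
```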
New phenomena requiring changes to the DTABf
Example: New document types
New division types for funeral sermons (AEDit project)
<elementSpec ident="div" module="textstructure"
mode="change">
<attList>
<attDef ident="type" mode="change">
<valList type="closed" mode="replace">
…
<valItem ident="fsBibleVerse"/>
<valItem ident="fsSermon"/>
<valItem ident="fsOration"/>
</valList>
</attDef>
</attList>
</elementSpec>
(cf. DTABf ODD: www.deutschestextarchiv.de/basisformat.odd)
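A closed value list like the one above can be read out of the ODD and used to check texts. The Python sketch below is a simplified illustration: the fragment contains only the three values visible on the slide (the remaining values are elided there and are not reconstructed here), the TEI namespace is omitted, and the function names are assumptions. In practice, validation would run against the RelaxNG schema generated from the ODD.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the slide; values elided there are left out here too.
ODD_FRAGMENT = """<elementSpec ident="div" module="textstructure" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <!-- further values elided on the slide -->
        <valItem ident="fsBibleVerse"/>
        <valItem ident="fsSermon"/>
        <valItem ident="fsOration"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>"""


def allowed_values(odd_xml, attribute):
    """Collect the valItem identifiers defined for one attribute."""
    spec = ET.fromstring(odd_xml)
    for att_def in spec.iter("attDef"):
        if att_def.get("ident") == attribute:
            return {item.get("ident") for item in att_def.iter("valItem")}
    return set()


def invalid_div_types(text_xml, allowed):
    """List div/@type values that are not in the closed value list."""
    root = ET.fromstring(text_xml)
    return [
        div.get("type")
        for div in root.iter("div")
        if div.get("type") is not None and div.get("type") not in allowed
    ]


allowed = allowed_values(ODD_FRAGMENT, "type")
# "funeralPoem" is a made-up value used to show a rejected type.
print(invalid_div_types(
    '<body><div type="fsSermon"/><div type="funeralPoem"/></body>', allowed
))
```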
New phenomena requiring adjustments of the DTABf documentation
Example: List items? Paragraphs?
http://www.deutschestextarchiv.de/dtaq/book/view/kaempfer_japan01_1777?p=157; ~?p=158
New TEI/P5 releases requiring changes to the DTABf
Changes to the TEI Guidelines that are not backward compatible
<biblScope type="pages">
volume number of the publication
within a series
</biblScope>
→
<biblScope unit="pages">
volume number of the publication
within a series
</biblScope>
(Changes within TEI/P5, release 2.3.0; cf. http://www.tei-c.org/Vault/P5/current/doc/tei-p5-doc/readme-2.3.0.html)
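A change of this kind can be applied to existing texts mechanically. The following Python sketch renames @type to @unit on every <biblScope>. It is a minimal illustration with an assumed function name (migrate_biblscope) and without the TEI namespace, not the migration procedure actually used for the DTA corpora.

```python
import xml.etree.ElementTree as ET


def migrate_biblscope(root):
    """Rename @type to @unit on <biblScope>, in place; return the number of changes."""
    changed = 0
    for scope in root.iter("biblScope"):
        # Leave elements alone that already carry @unit.
        if "type" in scope.attrib and "unit" not in scope.attrib:
            scope.set("unit", scope.attrib.pop("type"))
            changed += 1
    return changed


# Hypothetical bibliographic fragment for demonstration.
doc = ET.fromstring(
    '<bibl><biblScope type="pages">12-34</biblScope>'
    '<biblScope unit="volume">2</biblScope></bibl>'
)
print(migrate_biblscope(doc))
print(ET.tostring(doc, encoding="unicode"))
```

Elements that already use @unit are skipped, so running the function a second time changes nothing.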
Specification of annotation depth
The DTABf offers tagging solutions for a wide range of phenomena; encoding levels ensure interoperability
• Encoding Level 1: required <pb>, <list>, <lg>, <note>, …
• Encoding Level 2: recommended <choice>, <fw>, <lb>, …
• Encoding Level 3: optional <foreign>, <persName>, …
• Encoding Level 4: proscribed <ab>, <div1>, <g>, …
(Cf. TEI P5 Guidelines, Ch. 15.5: Recommendations for the Encoding of Large Corpora http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html#CCREC)
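The encoding levels can be used to profile a document. The Python sketch below counts elements per level and lists proscribed ones. The LEVELS table contains only the elements named on this slide (the full assignment is in the DTABf documentation and is not reproduced here), elements missing from the table are reported as "unlisted", and the function names and sample input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Partial table: only the elements named on the slide.
LEVELS = {
    "pb": 1, "list": 1, "lg": 1, "note": 1,
    "choice": 2, "fw": 2, "lb": 2,
    "foreign": 3, "persName": 3,
    "ab": 4, "div1": 4, "g": 4,
}
LABELS = {1: "required", 2: "recommended", 3: "optional", 4: "proscribed"}


def level_report(xml_string):
    """Count elements per encoding level (TEI namespace omitted for brevity)."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for element in root.iter():
        level = LEVELS.get(element.tag)
        counts[LABELS.get(level, "unlisted")] += 1
    return counts


def proscribed_elements(xml_string):
    """Return the names of level 4 elements used in the document."""
    root = ET.fromstring(xml_string)
    return sorted({el.tag for el in root.iter() if LEVELS.get(el.tag) == 4})


# Hypothetical input for demonstration.
SAMPLE = (
    "<body><div1><pb/><p>Text<lb/><persName>Kant</persName></p>"
    "<ab>x</ab></div1></body>"
)
print(level_report(SAMPLE))
print(proscribed_elements(SAMPLE))
```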
Life Cycle of the DTABf: Tools and Services for Users
DTABf: Software
• Editing: DTA-Framework for the oXygen Author mode (DTAoX) for DTABf conformant text annotation (used in Curation Project 1 of CLARIN-D WG 1)
• Proofreading and web-editing: Web-based distributed proofreading and editing environment (DTAQ; currently (Oct. 2013) 285 registered users, cf. Poster on TEI-MM 2013)
• Metadata: Web form for metadata ingestion
• Object data (text): Serialization routines (TokWrap) for use in NLP contexts
Dissemination of DTABf
• Best Practice model of CLARIN-D (for historical written text)
(www.clarin-d.de/de/sprachressourcen/benutzerhandbuch.html)
• Regular (national) hands-on workshops (next: Oct 25th in Berlin, BBAW)
• Presentation at CLARIN-D Summer School (2014)
DTABf and CLARIN
• Integration of DTA corpora into the CLARIN-D service center
– Persistent Identifiers
– OAI-PMH
• Conversion routines: DTABf to CMDI & TCF (enabling usage within the NLP tool chains of CLARIN-D WebLicht)
• Endpoint for Federated Content Search in CLARIN
• Long-term preservation via CLARIN-D service center
Thank you!
Visit the DTABf documentation: www.deutschestextarchiv.de/doku/basisformat
Read more: www.deutschestextarchiv.de/doku/publikationen
Participate: www.deutschestextarchiv.de/dtaq
www.deutschestextarchiv.de/dtae
Contact us: [email protected] @textarchiv