S. Haaf: Text Type Classification for the Historical DTA Corpus
Text Type Classification for the Historical DTA Corpus
Susanne Haaf Deutsches Textarchiv, BBAW – Berlin
NeDiMAH-CLARIN-Workshop Exploring Historical Sources with Language Technology:
Results and Perspectives
About the Project
• Deutsches Textarchiv / German Text Archive (DTA)
• Funding:
• Partner:
• Duration: 2007-2014/15
• Goal:
– Provide the basis for a reference corpus for the development of the New High German language (17th to 19th century)
About the Project
• Ca. 1,500 texts of different disciplines and text types
• Automatic linguistic analysis (lemmatization, tokenization, POS-tagging, orthographic normalization)
• DTA 'Base Format'
• Guidelines for transcription close to the source
• Structural XML-annotation according to TEI/P5
• Guidelines for metadata entry
• Web-based quality assurance
• DTA-Extensions
• Integration of historical text data from other project contexts
• Curation and Collection of diverse text resources
The DTA Bibliography
• Selection of works for the DTA core corpus: fixed bibliography
• The bibliography was created with the help of BBAW members, i.e. experts in the (history of) different (scientific) disciplines
• Requirements for the Selection
– reflect the diversity of text types …
– … at different points in time
– represent works which were
• Important for the scientific field
• Or: Widely recognised (i.e. of great public influence)
• Or even: Not very influential
Genuinely lexicographic approach
• Phase 3: New selection of another 200 works
Filling gaps …
– … considering time
– … considering text type
Text Type Classification for the DTA
• Created in a data-driven way, i.e. for each new book in the DTA corpus:
– Is there an existing category that fits?
– Yes: assign the fitting existing category
– No: create a new category
• Based on the classification of the DWDS (Digital Dictionary of the German Language) which was continually extended
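The decision procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual tooling: the function name `assign_category` and the example category names are hypothetical, and in practice an editor judges whether a category fits.

```python
def assign_category(proposed, categories):
    """Return the category to use for a new book.

    `proposed` is the category an editor judges fitting for the book;
    `categories` is the set of categories already in the classification.
    Both names are hypothetical and exist only for this sketch.
    """
    if proposed in categories:
        # An existing category fits: assign it.
        return proposed
    # No existing category fits: create a new one, then assign it.
    categories.add(proposed)
    return proposed


categories = {"Drama", "Cookbook"}
first = assign_category("Cookbook", categories)    # existing category reused
second = assign_category("Gardening", categories)  # new category created
print(sorted(categories))
```

The sketch also shows the limitation discussed later in the talk: the set of categories only ever grows from what is already in the corpus.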
Text Type Classification for the DTA
3 main (super-)categories:
2 levels: super- & sub-categories
Text Type Classification for the DTA
• Fiction:
– Drama, Poetry, Prose
– Biography, Epistolary Novel, Travel Literature, Novels, Children's Books, …
• Functional Literature:
– Handbooks (Good Behaviour/Etiquette, Pedagogy, Gardening, …)
– Travel Books, Cookbooks, Newspapers, Devotional Literature, …
• Scientific Texts:
– Science: Biology, Geography, Medicine, Chemistry, …
– Humanities: Literature, Linguistics, History, Musical Studies, …
– Social Sciences and Economics
– …
Text Type Classification
What for?
http://www.deutschestextarchiv.de
1. Access based on Text Types
http://www.deutschestextarchiv.de/list/browse?genre=Gebrauchsliteratur
1. Access based on Text Types
2. Queries based on Text Types
Travel destinations mentioned in functional literature?
3. Analyses based on Text Types
[Figure panels: Fiction, Functional Literature, Science]
3. Analyses based on Text Types
Children's toys (GermaNet) within Functional Literature
Query: Kinderspielzeug|gn-sub #has[textClassDWDS, /Gebrauchsliteratur/]
Problem statement
• The text classification was created in a data-driven way, so it only shows what we have …
… but gives no indication of what we do not have (i.e. text types that were important at a certain time but are not represented in the DTA corpus)
Hence it is difficult to evaluate the representativeness of the DTA corpus in this respect
• The DTA text classification is not mapped to any established classification
• There are only two levels, which leads to ambiguities, e.g. for funeral sermons:
Functional Literature::Theology ?
Functional Literature::FuneralSermon?
Functional Literature::SpecialOccasion? …
Solution: Switch to an existing classification?
Example AAD: classification of the Working Group on Old Prints by the major German libraries
• Certain text types we need are not represented (e.g. gardening), others we don't need (e.g. maps)
• Other text types are modelled inconsistently
• In some cases it is too detailed for us
• In other cases it is not detailed enough
• Sometimes there are no descriptions at all, or the descriptions are not detailed enough
• Text types belong to different description levels (text type vs. knowledge area …)
AAD: Inconsistencies
Legend: BS = use synonym, OB = supercategory, UB = subcategory
#OfficialPrintedPublication (2)
SubC: Law
#law
#CollectionOfLaws
SupC: OfficialPrintedPublication
#OfficialPrintedPublication (1)
AAD: Different Description Levels
#Children's Book
#Church Song
#Catechism
Type of intended usage
Text type
Type of text presentation
#Rhetorics Knowledge area
Solution: Revised DTA text type classification
• Redesign and extend the DTA text type classification based on different existing classifications
• Mapping from the one to the other
– DTA text types can be transferred semi-automatically to the new classification
– (Digitized) works of text types still missing in the corpus can be found in library catalogues
– Sources:
AAD (http://aad.gbv.de/empfehlung/aad_gattung.pdf)
Wikisource (http://de.wikisource.org/wiki/Wikisource:Systematik)
DWDS
DTA
…
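To illustrate what a semi-automatic transfer could look like: a mapping table carries over old labels that have exactly one target and flags the rest for manual review. The Python sketch below is hypothetical throughout. The function `transfer`, the table `MAPPING`, and the label pairs are invented for illustration and do not reproduce the DTA's actual mapping.

```python
# Hypothetical mapping from old labels to candidate terms in the revised
# classification. The entries are invented examples, not DTA data.
MAPPING = {
    "Cookbook": ["cookbook"],
    "Autobiography": ["autobiography"],
    "FuneralSermon": ["funeral-print", "theology"],  # ambiguous on purpose
}


def transfer(old_labels):
    """Split labels into automatically transferred ones and ones needing review."""
    transferred, needs_review = {}, []
    for label in old_labels:
        targets = MAPPING.get(label, [])
        if len(targets) == 1:
            # Exactly one target: safe to carry over automatically.
            transferred[label] = targets[0]
        else:
            # No target or several candidates: leave the decision to an editor.
            needs_review.append(label)
    return transferred, needs_review


done, todo = transfer(["Cookbook", "FuneralSermon", "Gardening"])
print(done)
print(todo)
```

Labels with no entry or with several candidates end up in the review list, which is where the "semi" in semi-automatic comes in.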
Solution: Revised DTA text type classification
• Small set of Supercategories
– Non-fiction
• Scientific Literature
• Functional Literature
– Fiction
• Detailed (but still manageable) set of subcategories
• Hierarchies are allowed but kept shallow
• Descriptions/Documentation
Revised DTA text type classification
Classification of text types (i.e. of the subcategories)
• Präsentationsform (i.e. Type of text presentation)
Flyer, Funeral Print (Funeralschrift), Book of Prayer, Cookbook, Catalogue
• Sitz im Leben (i.e. Life context which texts are embedded in)
Devotional Literature, Texts for/from women, Occasional texts
• Textsorte (Text type)
Poem, Novel, Scientific Paper
• Wissensbereich (i.e. Knowledge area covered by the text)
Theology, Chemistry, Math, Linguistics
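One way to picture the four description levels is as a tag attached to each subcategory. The following Python sketch assumes a simple one-dimension-per-term model. The names `Dimension`, `SUBCATEGORIES`, and `by_dimension` are hypothetical, and only a few of the example terms from the list above are included.

```python
from enum import Enum


class Dimension(Enum):
    """The four description levels named on the slide."""
    PRESENTATION = "Präsentationsform"   # type of text presentation
    LIFE_CONTEXT = "Sitz im Leben"       # life context the text is embedded in
    TEXT_TYPE = "Textsorte"              # text type
    KNOWLEDGE_AREA = "Wissensbereich"    # knowledge area covered by the text


# A few example subcategories from the slide, each tagged with its level.
SUBCATEGORIES = {
    "Flyer": Dimension.PRESENTATION,
    "Cookbook": Dimension.PRESENTATION,
    "Devotional Literature": Dimension.LIFE_CONTEXT,
    "Poem": Dimension.TEXT_TYPE,
    "Novel": Dimension.TEXT_TYPE,
    "Theology": Dimension.KNOWLEDGE_AREA,
}


def by_dimension(dimension):
    """Return the names of all subcategories tagged with the given level."""
    return sorted(name for name, dim in SUBCATEGORIES.items() if dim is dimension)


print(by_dimension(Dimension.TEXT_TYPE))
```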
Term description (via eXist)
<term type="textType" source="#aad" id="autobiography">
<name>Autobiography</name>
<desc type="main">
<p>Life memories; Description of historical events
by personal witnesses</p>
<bibl>AAD</bibl>
</desc>
<desc type="alternative-1">[…]</desc>
<subordinates/>
<superordinates>
<term id="#biography"/>
</superordinates>
<mapping>
<term source="#dwds">Autobiography</term>
</mapping>
[features, notes, …]
</term>
Term description (via eXist)
<term type="textType" source="#dta" id="flyer">
<name>Flyer</name>
<desc type="main">
<p>Easily produced brochure, produced for the purpose of
agitation, information, or documentation</p>
<bibl>Cf. AAD</bibl>
</desc>
<desc type="alternative-1">[…]</desc>
[…]
<mapping>
<term source="#aad">Flyer</term>
<term source="#aad">Broadsheet</term>
</mapping>
[features, notes, …]
</term>
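Term entries of this shape are straightforward to read programmatically. The slides mention eXist, where this would typically be done with XQuery. As a stand-in, the sketch below uses Python's standard `xml.etree.ElementTree` to pull the mappings out of the flyer entry. The helper `read_mappings` is a hypothetical name, and the elided parts of the entry are left out so that the sample is well-formed.

```python
import xml.etree.ElementTree as ET

# The flyer entry from the slide, without the elided parts,
# so that the sample is well-formed XML.
TERM_XML = """
<term type="textType" source="#dta" id="flyer">
  <name>Flyer</name>
  <desc type="main">
    <p>Easily produced brochure, produced for the purpose of
       agitation, information, or documentation</p>
    <bibl>Cf. AAD</bibl>
  </desc>
  <mapping>
    <term source="#aad">Flyer</term>
    <term source="#aad">Broadsheet</term>
  </mapping>
</term>
"""


def read_mappings(xml_text):
    """Return (term id, [(source, label), ...]) for one term entry."""
    root = ET.fromstring(xml_text.strip())
    # Only <term> elements directly inside <mapping> are mapping targets.
    pairs = [(t.get("source"), t.text) for t in root.findall("./mapping/term")]
    return root.get("id"), pairs


term_id, mappings = read_mappings(TERM_XML)
print(term_id, mappings)
```

Here one DTA term maps to two AAD terms, which is the kind of many-to-one relation the mapping element is meant to record.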
Thank you!
Contact:
Project Deutsches Textarchiv:
www.deutschestextarchiv.de
www.deutschestextarchiv.de/doku/basisformat
www.deutschestextarchiv.de/dtaq
www.deutschestextarchiv.de/dtae
Literature:
www.deutschestextarchiv.de/doku/publikationen