S. Haaf: Text Type Classification for the Historical DTA Corpus

Text Type Classification for the Historical DTA Corpus

Susanne Haaf Deutsches Textarchiv, BBAW – Berlin

NeDiMAH-CLARIN-Workshop "Exploring Historical Sources with Language Technology: Results and Perspectives"

About the Project

• Deutsches Textarchiv/ German Text Archive (DTA)

• Funding:

• Partner:

• Duration: 2007-2014/15

• Goal:

– Provide the basis for a reference corpus for the development of the New High German language (17th to 19th century)

About the Project

• Ca. 1,500 texts of different disciplines and text types

• Automatic linguistic analysis (lemmatization, tokenization, POS-tagging, orthographic normalization)

• DTA 'Base Format'

– Guidelines for transcription close to the source

– Structural XML annotation according to TEI P5

– Guidelines for metadata entry

• Web-based quality assurance

• DTA Extensions

– Integration of historical text data from other project contexts

– Curation and collection of diverse text resources

The DTA Bibliography

• Selection of works for the DTA core corpus: fixed bibliography

• Bibliography was created with the help of BBAW members, i.e. experts for the (history of) different (scientific) disciplines

• Requirements for the Selection

– reflect the diversity of text types …

– … at different points in time

– represent works which were

• Important for the scientific field

• Or: Widely recognised (i.e. of huge public influence)

• Or even: Not very influential

Genuinely lexicographic approach

• Phase 3: New selection of another 200 works, filling gaps …

– … considering time

– … considering text type

Text Type Classification for the DTA

• Created in a data-driven way, i.e. for each new book in the DTA corpus:

– Is there an existing category that fits?

– Yes: assign the fitting existing category

– No: create a new category

• Based on the classification of the DWDS (Digital Dictionary of the German Language) which was continually extended
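The data-driven procedure described on this slide can be sketched in a few lines of Python. This is only an illustration, not the DTA's actual tooling: the function names and the case-insensitive `same_label` test are assumptions, and in practice the question "does an existing category fit?" is an editorial decision rather than a string comparison.

```python
from typing import Callable


def same_label(proposed: str, existing: str) -> bool:
    """Hypothetical stand-in for the editorial judgement 'this category fits'."""
    return proposed.casefold() == existing.casefold()


def assign_category(
    proposed: str,
    categories: list[str],
    fits: Callable[[str, str], bool] = same_label,
) -> tuple[str, bool]:
    """Return (category, created) for a new book.

    If an existing category fits, it is reused; otherwise the proposed
    label is appended to the list as a new category.
    """
    for existing in categories:
        if fits(proposed, existing):
            return existing, False
    categories.append(proposed)
    return proposed, True


categories = ["Cookbook", "Travel Literature"]
print(assign_category("cookbook", categories))   # ('Cookbook', False)
print(assign_category("Gardening", categories))  # ('Gardening', True)
```

The sketch also makes the limitation visible: the category list can only grow from the books that arrive, so it never reveals which text types are missing.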

Text Type Classification for the DTA

3 main (super-)categories:

2 levels: super- & sub-categories

Text Type Classification for the DTA

• Fiction:

– Drama, Poetry, Prose

– Biography, Epistolary Novel, Travel Literature, Novels, Children's Books, …

• Functional Literature:

– Handbooks (Good Behaviour/Etiquette, Pedagogy, Gardening, …)

– Travel Books, Cookbooks, Newspapers, Devotional Literature, …

• Scientific Texts:

– Science: Biology, Geography, Medicine, Chemistry, …

– Humanities: Literature, Linguistics, History, Musical Studies, …

– Social Sciences and Economics

– …

Text Type Classification

What for?

http://www.deutschestextarchiv.de

1. Access based on Text Types

http://www.deutschestextarchiv.de/list/browse?genre=Gebrauchsliteratur

1. Access based on Text Types

2. Queries based on Text Types

Travel destinations mentioned in functional literature?

3. Analyses based on Text Types

[Figure: one panel each for Fiction, Functional Literature, and Science]

3. Analyses based on Text Types

Children's toys (GermaNet) within Functional Literature

Query: Kinderspielzeug|gn-sub #has[textClassDWDS, /Gebrauchsliteratur/]
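As a small illustration of how such a query can be assembled, the sketch below rebuilds the query string shown on the slide from its two variable parts: the search term and the text class (here Gebrauchsliteratur). The function name and parameters are hypothetical, the query syntax is copied from the slide rather than from a specification, and no search endpoint is assumed; the string is only URL-encoded in case it needs to be passed as a request parameter.

```python
from urllib.parse import quote


def text_class_query(term: str, text_class: str, expand_germanet: bool = True) -> str:
    """Build a query restricted to one text class (hypothetical helper).

    The '|gn-sub' suffix and the '#has[textClassDWDS, /.../]' filter are
    taken verbatim from the slide.
    """
    expansion = "|gn-sub" if expand_germanet else ""
    return f"{term}{expansion} #has[textClassDWDS, /{text_class}/]"


query = text_class_query("Kinderspielzeug", "Gebrauchsliteratur")
print(query)         # Kinderspielzeug|gn-sub #has[textClassDWDS, /Gebrauchsliteratur/]
print(quote(query))  # percent-encoded form, e.g. for use in a URL parameter
```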

Problem statement

• The text classification was created in a data-driven way: it only shows what we have …

… but it gives no clues about what we do not have (i.e. text types that were important at a certain time but are not represented in the DTA corpus)

Hence it is difficult to evaluate the representativeness of the DTA corpus in this respect

• The DTA text classification is not mapped to established existing classifications

• There are only two levels, which leads to ambiguities, e.g. for funeral sermons:

Functional Literature::Theology?

Functional Literature::FuneralSermon?

Functional Literature::SpecialOccasion? …

Solution: Switch to an existing classification?

Example AAD: Classification by the Working Group on Old Prints of the major German libraries

• Certain text types we need are not represented (e.g. gardening); others we do not need (e.g. maps)

• Other text types are modeled inconsistently

• In some cases it is too detailed for us

• In other cases it is not detailed enough

• Some descriptions are missing altogether, others are not extensive enough

• Text types belong to different description levels (text type vs. knowledge area …)

AAD: Inconsistencies

BS: use synonym; OB: supercategory; UB: subcategory

#OfficialPrintedPublication (2)

SubC: Law

#law

#CollectionOfLaws

SupC: OfficialPrintedPublication

#OfficialPrintedPublication (1)

AAD: Different Description Levels

#Children's Book: type of intended usage

#Church Song: text type

#Catechism: type of text presentation

#Rhetorics: knowledge area

BS: use synonym; OB: supercategory; UB: subcategory

Solution: Revised DTA text type classification

• Redesign and extend the DTA text type classification based on several existing classifications

• Mapping from one to the other

– DTA text types can be transferred semi-automatically to the new classification

– (Digitized) works of text types still missing in the corpus can be found via library catalogues

– Sources:

AAD (http://aad.gbv.de/empfehlung/aad_gattung.pdf)

Wikisource (http://de.wikisource.org/wiki/Wikisource:Systematik)

DWDS

DTA
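A semi-automatic transfer of this kind could look roughly like the following Python sketch. It assumes a hand-maintained mapping table from old DTA labels to term ids in the revised classification; the table contents, the ids, and the function name are hypothetical examples, not the project's actual data. Labels with exactly one target are transferred automatically, while everything else is collected for manual review.

```python
def transfer_text_types(
    old_labels: list[str],
    mapping: dict[str, list[str]],
) -> tuple[dict[str, str], list[str]]:
    """Split old labels into automatically transferable and review cases."""
    transferred: dict[str, str] = {}
    needs_review: list[str] = []
    for label in old_labels:
        targets = mapping.get(label, [])
        if len(targets) == 1:
            # Unambiguous mapping: transfer without human intervention.
            transferred[label] = targets[0]
        else:
            # No target or several candidate targets: leave to an editor.
            needs_review.append(label)
    return transferred, needs_review


# Hypothetical mapping table (old DTA label -> candidate new term ids).
MAPPING = {
    "Autobiography": ["autobiography"],
    "Funeral Sermon": ["theology", "funeralSermon", "specialOccasion"],
}

done, review = transfer_text_types(
    ["Autobiography", "Funeral Sermon", "Gardening"], MAPPING
)
print(done)    # {'Autobiography': 'autobiography'}
print(review)  # ['Funeral Sermon', 'Gardening']
```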

Solution: Revised DTA text type classification

• Small set of Supercategories

– Non-fiction

• Scientific Literature

• Functional Literature

– Fiction

• Detailed (but still manageable) set of subcategories

• Hierarchies are allowed but kept shallow

• Descriptions/Documentation

Revised DTA text type classification

Classification of text types (i.e. of the subcategories)

• Präsentationsform (i.e. Type of text presentation)

Flyer, Funeral Print (Funeralschrift), Book of Prayer, Cookbook, Catalogue

• Sitz im Leben (i.e. Life context which texts are embedded in)

Devotional Literature, Texts for/from women, Occasional texts

• Textsorte (Text type)

Poem, Novel, Scientific Paper

• Wissensbereich (i.e. Knowledge area covered by the text)

Theology, Chemistry, Math, Linguistics

Term description (via eXist)

<term type="textType" source="#aad" id="autobiography">
  <name>Autobiography</name>
  <desc type="main">
    <p>Life memories; Description of historical events by personal witnesses</p>
    <bibl>AAD</bibl>
  </desc>
  <desc type="alternative-1">[…]</desc>
  <subordinates/>
  <superordinates>
    <term id="#biography"/>
  </superordinates>
  <mapping>
    <term source="#dwds">Autobiography</term>
  </mapping>
  [features, notes, …]
</term>

Term description (via eXist)

<term type="textType" source="#dta" id="flyer">
  <name>Flyer</name>
  <desc type="main">
    <p>Easily produced brochure, produced for the purpose of agitation, information, or documentation</p>
    <bibl>Cf. AAD</bibl>
  </desc>
  <desc type="alternative-1">[…]</desc>
  […]
  <mapping>
    <term source="#aad">Flyer</term>
    <term source="#aad">Broadsheet</term>
  </mapping>
  [features, notes, …]
</term>
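To show how term records of this shape might be consumed, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample record is the autobiography entry from the slides with the elided parts ("[…]", "[features, notes, …]") left out; the function name and the shape of the returned dictionary are assumptions made for illustration, and nothing here reflects how the records are actually queried in eXist.

```python
import xml.etree.ElementTree as ET

# Sample record based on the slide, without the elided parts.
TERM_XML = """
<term type="textType" source="#aad" id="autobiography">
  <name>Autobiography</name>
  <desc type="main">
    <p>Life memories; Description of historical events by personal witnesses</p>
    <bibl>AAD</bibl>
  </desc>
  <subordinates/>
  <superordinates>
    <term id="#biography"/>
  </superordinates>
  <mapping>
    <term source="#dwds">Autobiography</term>
  </mapping>
</term>
"""


def read_term(xml_text: str) -> dict:
    """Extract the main fields of one term record into a plain dict."""
    root = ET.fromstring(xml_text.strip())
    description = root.findtext("desc[@type='main']/p", default="")
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "source": root.get("source"),
        "name": root.findtext("name"),
        # Collapse line breaks and repeated spaces inside the description.
        "description": " ".join(description.split()),
        "superordinates": [
            t.get("id", "").lstrip("#") for t in root.findall("superordinates/term")
        ],
        "mappings": [
            (t.get("source", "").lstrip("#"), t.text)
            for t in root.findall("mapping/term")
        ],
    }


record = read_term(TERM_XML)
print(record["name"], record["superordinates"], record["mappings"])
```

Reading the `mapping` entries into (source, label) pairs is what would make lookups between the revised classification and AAD or DWDS labels straightforward.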

Thank you!

Contact:

[email protected]

Project Deutsches Textarchiv:

www.deutschestextarchiv.de

www.deutschestextarchiv.de/doku/basisformat

www.deutschestextarchiv.de/dtaq

www.deutschestextarchiv.de/dtae

Literature:

www.deutschestextarchiv.de/doku/publikationen