
http://www.text-technology.de

Text Technological Modelling of Information

Information Modelling of Language and Text: XML-based, multi-level, semantic-oriented. Some methods (only)


Research Group „Texttechnological Information Modelling“

- University of Bielefeld: D. Gibbon (MODELEX), D. Metzing (SEKIMO)
- associated: J.-T. Milde (Multimodal Corpora, TASX)
- University of Dortmund: A. Storrer (HYTEX)
- University of Giessen: H. Lobin (SEMDOC)
- University of Tübingen: U. Mönnich (COMOD)

The TASX-Annotator: http://tasxforce.lili.uni-bielefeld.de/


Methodological issues: Multidimensionality of linguistic data requires:

(1) multiple tiers of annotation (XML-based)

(2) connections between multiple tiers (specific methods)

(3) multi-annotation of identical raw data (multiple trees)

(4) specific relations between multi-level annotations


(5) a distinction between one or more conceptual levels (semantic markup) and one or more annotation layers (syntactic markup), together with mappings between the two

(6) ways to use and to generate different annotation sets (annotation + data) from more uniform conceptual representations, which improves the accessibility of corpora for search, hypothesis testing, and comparative or typological analysis
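Requirements (1) to (3) can be illustrated with standoff-style tiers that all point into the same raw data. The following is only a minimal sketch, assuming character offsets as anchors; the tier names, labels, and the helper `units_within` are hypothetical and are not taken from any of the projects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One annotation unit: a label over a half-open character span [start, end)."""
    start: int
    end: int
    label: str


# The shared raw data; every tier points into this one string.
TEXT = "Ich heiße Meier"

# Two independent tiers over identical raw data (tier names are hypothetical).
TIERS = {
    "word": [Unit(0, 3, "pronoun"), Unit(4, 9, "verb"), Unit(10, 15, "name")],
    "phrase": [Unit(0, 3, "subject"), Unit(4, 15, "predicate")],
}


def covered_text(unit: Unit) -> str:
    """Return the stretch of raw data that a unit annotates."""
    return TEXT[unit.start:unit.end]


def units_within(outer: Unit, tier: str) -> list[Unit]:
    """Connect two tiers: units of `tier` that lie completely inside `outer`."""
    return [u for u in TIERS[tier] if outer.start <= u.start and u.end <= outer.end]


predicate = TIERS["phrase"][1]
inside = units_within(predicate, "word")
print(covered_text(predicate), [u.label for u in inside])
```

Because neither tier contains the other's markup, further tiers can be added without touching the existing ones.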


Semdoc: Annotation

Thematic annotation:

<segment id="s24" parent="g6" newtopic="illustration_bck" litref="s" footnoteref="s33a">From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </segment>

Structural annotation:

<sect1> <para> ... From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996). <footnoteref linkend="i5">5</footnoteref> </para></sect1>

Rhetorical annotation:

<segment id="i17" parent="i56" relname="span">From the now infamous McDonald&apos;s coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet</segment>
<segment id="i18" parent="i17" relname="evidence"> (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5</segment>
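The three annotations above mark up the same passage, so a basic consistency check is that all layers yield identical primary text once the markup is removed. The sketch below is an illustration only: the passage is shortened, the `<layer>` wrapper around the two rhetorical segments is a hypothetical addition needed to obtain a single root element, and whitespace is ignored because the layers place the footnote marker slightly differently.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the three annotation layers (text abbreviated for the example).
LAYERS = {
    "thematic": (
        '<segment id="s24" parent="g6" newtopic="illustration_bck">'
        "High stakes civil cases have become familiar staples of our media diet "
        "(see e.g., Stossel, 1996).5 </segment>"
    ),
    "structural": (
        "<sect1><para>"
        "High stakes civil cases have become familiar staples of our media diet "
        '(see e.g., Stossel, 1996). <footnoteref linkend="i5">5</footnoteref> '
        "</para></sect1>"
    ),
    # <layer> is a hypothetical wrapper so that the two segments form one XML document.
    "rhetorical": (
        '<layer><segment id="i17" parent="i56" relname="span">'
        "High stakes civil cases have become familiar staples of our media diet"
        '</segment><segment id="i18" parent="i17" relname="evidence">'
        " (see e.g., Stossel, 1996).5</segment></layer>"
    ),
}


def primary_text(xml_string: str) -> str:
    """Strip all markup and all whitespace, leaving only the annotated characters."""
    root = ET.fromstring(xml_string)
    return "".join("".join(root.itertext()).split())


texts = {name: primary_text(xml) for name, xml in LAYERS.items()}
same_primary_data = len(set(texts.values())) == 1
print(same_primary_data)
```

If the check fails, the layers no longer annotate identical raw data and cannot be related to each other by position.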


Sekimo: Multiple annotations of Japanese dialogue corpora

Annotation categories are based upon widely used tag-sets like IPADIC (ChaSen)

The results of corpus analysis can be used to

- compare the tag-sets empirically,

- augment tag-sets with conceptual information,

- reuse existing corpora which are based upon the same tag-sets.


Sekimo: Sample Annotation


Example: Modeling of congruency in Japanese and German

German: Ich heiße Meier ("My name is Meier")

Japanese: watashi ha murano to moushimasu ("My name is Murano")

Conceptual differences in congruency are reflected in different configurations of annotations, related via secondary information structuring:

[Figure: congruency types (General, Lexical-pragmatic congruency, Morpho-syntactic congruency) with the rule sets Ja-Germ-1, Ja-Germ-2, Ja-Germ-3 and Ja-1, and the conditions "verb has marker", "sentence has subject", "subject has marker", "two annotation units have marker", "verb and utterance have marker".]


Visualisation as SVG graphic


Sekimo: Example for mapping annotations <-> concepts

Annotations (three encodings of the same unit):

- <noun>watashi</noun>

- <word pos="noun">watashi</word>

- <word><feature>pos</feature><value>noun</value>watashi</word>

Mapping (one expression per annotation format):

- noun

- word[@pos="noun"]

- word[feature="pos" & value="noun"]

Concepts: NOUN, WORD, KOPULA

Transformation (result for the unit above): <NOUN>watashi</NOUN>
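The mapping from annotation formats to a shared concept can be sketched with the path expressions that Python's `xml.etree.ElementTree` supports. This is a minimal sketch, not the Sekimo implementation: the `<corpus>` wrapper, the `RULES` table, and the helper names are assumptions made for the example, and the chained predicates `[feature='pos'][value='noun']` stand in for the `&` conjunction shown in the mapping expression.

```python
import xml.etree.ElementTree as ET

# Concept -> path expressions, one per annotation format (hypothetical rule table).
RULES = {
    "NOUN": [
        "noun",
        "word[@pos='noun']",
        "word[feature='pos'][value='noun']",
    ],
}

# Hypothetical wrapper element holding the three encodings of the same unit.
CORPUS = (
    "<corpus>"
    "<noun>watashi</noun>"
    '<word pos="noun">watashi</word>'
    "<word><feature>pos</feature><value>noun</value>watashi</word>"
    "</corpus>"
)


def own_text(elem: ET.Element) -> str:
    """Text that belongs to the unit itself, skipping feature/value child content."""
    parts = [elem.text or ""] + [child.tail or "" for child in elem]
    return "".join(parts).strip()


def to_concepts(corpus_xml: str) -> list[str]:
    """Rewrite every annotation matched by a rule as an element named after its concept."""
    root = ET.fromstring(corpus_xml)
    result = []
    for concept, paths in RULES.items():
        for path in paths:
            for elem in root.findall(path):
                node = ET.Element(concept)
                node.text = own_text(elem)
                result.append(ET.tostring(node, encoding="unicode"))
    return result


print(to_concepts(CORPUS))
```

All three encodings end up as the same conceptual representation, which is what makes corpora with different tag-sets comparable.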


ModeLex: Temporal Calculus (Allen) for multimodal annotations

● Relations between annotation layers

● Can be applied to

  - Text: Order is given by character sequence

  - Signal: Order is given by timestamps
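Allen's calculus distinguishes thirteen relations between two intervals. The function below is a small sketch of how such a relation could be computed for two annotation units, assuming each unit is a (start, end) pair of either character offsets or timestamps; the function name and the relation labels are choices made for this example, not ModeLex identifiers.

```python
def allen_relation(a: tuple[float, float], b: tuple[float, float]) -> str:
    """Name the Allen relation that holds between interval a and interval b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must have positive length")

    # Disjoint or touching intervals.
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"

    # Intervals that share at least one endpoint.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"

    # Proper containment, then partial overlap.
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"


# Text: character offsets of a word and of the phrase that begins with it.
print(allen_relation((4, 9), (4, 15)))
# Signal: timestamps (in seconds) of two units on different layers.
print(allen_relation((0.5, 1.2), (1.0, 1.6)))
```

With measured timestamps, exact equality of endpoints is rare, so a real tool would probably compare endpoints within a tolerance; that refinement is left out here.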


Lexicon Model: Subclassification of annotation units, based upon temporal relations

[Figure: Corpus, Classification hierarchy, Properties. The hierarchy runs from class to subclass to subsubclass, and each level has its own properties (properties of class, properties of subclass, properties of subsubclass).]
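One way to read the class hierarchy is as an inheritance lexicon in which each level adds to, or overrides, the properties of the level above. Inheritance is an assumption here, since only the three levels and their property sets are named; the class `LexClass` and all property names and values below are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LexClass:
    """A node in the classification hierarchy with its own properties."""
    name: str
    properties: dict[str, str] = field(default_factory=dict)
    parent: LexClass | None = None

    def all_properties(self) -> dict[str, str]:
        """Own properties merged over those of all ancestors (assumed inheritance)."""
        inherited = self.parent.all_properties() if self.parent else {}
        return {**inherited, **self.properties}


# Hypothetical three-level hierarchy; the property values are illustrative only.
cls = LexClass("class", {"modality": "multimodal"})
sub = LexClass("subclass", {"relation": "overlaps"}, parent=cls)
subsub = LexClass("subsubclass", {"relation": "during"}, parent=sub)

print(subsub.all_properties())
```

Under this reading, a subsubclass only needs to state what distinguishes it from its subclass.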


HyTex: Multi-level approach

HyTex Corpus: Documents about the domain of „text technology“ (syntactic level)

- Linguistic annotation: POS tagging, lemmatization, chunk parsing

- Text-grammatical annotation: definitions and technical terms, topical and rhetorical structures

Domain-specific knowledge (semantic level)

- TermNet: representation of knowledge about terms and concepts of the domain, in particular of semantic relations between technical terms (in the style of WordNet)

User models (pragmatic level)

- User model (static or dynamic): fixed user profiles or dynamic generation of links according to the history of previous usage

Adaptive generation of hypertext views based on coherence criteria


HyTex: Research Cooperation and Contacts

DFG-Forschergruppe 437: Text technological modelling of information

Text-grammatical foundations for the (semi)automated text-to-hypertext conversion (HyTex)

DEREKO: Corpus Technology at the University of Tübingen

- Chunk parser for the syntactic annotation of the HyTex corpus

WordNet Project, Princeton University; GermaNet Project, University of Tübingen

- Exchange of entities and relations for the TermNet model

TEMIS: Text Mining Solutions, Heidelberg/Paris

- Annotation schema for anaphoric and co-reference relations in German texts. Use of the text-mining tool Knowledge Extractor for the annotation of definitions

Intelligent Views: Knowledge Management, Darmstadt

- Use of the tool „K-Infinity“, which supports the convenient construction and maintenance of the TermNet


Sekimo: Project Context

SFB Mehrsprachigkeit Hamburg: Jadex

- Japanese and German expert discourse in mono- and multilingual constellations

Secondary information structuring and comparative discourse analysis

DFG-Forschergruppe 437: Texttechnologische Informationsmodellierung

NITE: Natural Interactivity Tools Engineering

- University of Southern Denmark

- Universitat Autònoma de Barcelona

- DFKI Saarbrücken

- HCRC Edinburgh

- IMS Stuttgart

- ILC Pisa


Research Group „Texttechnological Information Modelling“

International Conference „Modeling Linguistic Information Resources“

January 2004

Center for Interdisciplinary Research, University of Bielefeld

(1) Semantics of Generic Document Structures and Discourse Parsing

(2) Modelling Textual, Lexical and World Knowledge as a Basis for Hypertext Linking

(3) Multiple Annotation of Language Data

(4) Multimodal Lexical Information for Language Documentation