UAM CorpusTool: An Overview Debopam Das Discourse Research Group Department of Linguistics Simon Fraser University Feb 5, 2014

Outline

UAM CorpusTool (O’Donnell, 2008)
  Tool description
  A short tutorial
Annotating signals of coherence relations by UAM CorpusTool

Feb 5, 2014 Discourse Research Group 2

UAM CorpusTool

Created by Mick O’Donnell in 2008
Replaces the earlier Systemic Coder, which allowed coding of single documents at a single layer
Available at http://www.wagsoft.com/CorpusTool/
Runs on Windows and Mac OS
“… primarily aimed at the linguist or computational linguist who does not program, and would rather spend their time annotating text than learning how to use the system.” (O’Donnell, 2008: 13)


UAM CorpusTool

Annotate documents
  Text type, writer characteristics, register, etc.
Annotate segments
  Tagging sections of a text by function (abstract, introduction, body, conclusion)
  Tagging sentences (active/passive; simple/complex) or clauses (relative/imperative/non-finite)
  Semantic or pragmatic annotation (synonymy/antonymy; speech acts)
  Tagging POS (noun, verb, adjective)
Automatic grammar analysis (English only) using the Stanford parser
Rhetorical structure annotation


Annotation in UAM CorpusTool Main Steps

Start a new project
Add (an) annotation layer(s)
  You can use some pre-built annotation schemes or design your own
Add file
  Import .txt files and incorporate them
Annotate


Annotation in UAM CorpusTool Main Window Screenshot


Annotation in UAM CorpusTool Annotation Scheme Screenshots


Annotation in UAM CorpusTool Document Coding Screenshot


Annotation in UAM CorpusTool Segment Coding Screenshot


Other Components

Search
Autocode
Statistics
Explore
Options
Help


Annotating Signals of Coherence Relations Goal

Annotate signals of coherence relations
Signals of coherence relations
  E.g., John is tall, but Mary is short.
  One straightforward signal: the discourse marker ‘but’
  Also, there are two more signals
    Antonyms (tall ~ short)
    Parallel syntactic constructions (subj – copula – adj)
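
The example above, where one relation carries several signals at once, can be pictured as a small data structure. This is only an illustrative sketch: the class names `Signal` and `Relation`, the field names, and the relation label "Contrast" are assumptions made for the example and are not taken from UAM CorpusTool or from the annotation scheme itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Signal:
    kind: str      # hypothetical label, e.g. "discourse marker"
    evidence: str  # the words or pattern that carry the signal


@dataclass
class Relation:
    name: str                      # assumed relation label for the example
    text: str                      # the text span the relation holds over
    signals: List[Signal] = field(default_factory=list)


# One relation, three co-occurring signals, as in the slide's example.
contrast = Relation(
    name="Contrast",
    text="John is tall, but Mary is short.",
    signals=[
        Signal("discourse marker", "but"),
        Signal("antonymy", "tall ~ short"),
        Signal("parallel syntax", "subj - copula - adj"),
    ],
)

for s in contrast.signals:
    print(f"{s.kind}: {s.evidence}")
```

The point of the sketch is that signals form a list per relation, so a tool for this task has to accept more than one tag on the same element.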


Annotating Signals of Coherence Relations

Annotate the RST Discourse Treebank (Carlson et al., 2002)
  Contains 385 Wall Street Journal articles
  The texts are already annotated for rhetorical (coherence) relations
  Approx. 22,000 discourse units and 17,000 relations in total


Annotating Signals of Coherence Relations

Requirements from an annotation tool
Importability
  Relevant data to be imported into the tool
Annotation Scheme
  Support for three-level hierarchical taxonomy
Customizability
  Easy access to the annotation scheme for editing
Multiple Annotations
  Two or more tags for a single element
Convertibility
  XML output
Simplicity
  No advanced computational knowledge
  Graphical interface
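
Three of these requirements (a three-level hierarchical taxonomy, several tags on one element, and XML output) can be sketched together in a few lines of Python. Everything below is hypothetical: the taxonomy entries, the function names `is_valid` and `to_xml`, and the XML element and attribute names are invented for illustration and do not reproduce UAM CorpusTool's own scheme or export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level taxonomy: class -> type -> specific signal.
TAXONOMY = {
    "single": {
        "discourse marker": ["but", "because"],
        "semantic": ["antonymy", "synonymy"],
        "syntactic": ["parallel construction"],
    },
}


def is_valid(path):
    """Return True if (class, type, specific) is a complete path in TAXONOMY."""
    if len(path) != 3:
        return False
    top, mid, leaf = path
    return leaf in TAXONOMY.get(top, {}).get(mid, [])


def to_xml(segment_text, tags):
    """Serialise one segment, carrying any number of tags, to an XML string."""
    seg = ET.Element("segment")
    ET.SubElement(seg, "text").text = segment_text
    for top, mid, leaf in tags:
        if not is_valid((top, mid, leaf)):
            raise ValueError(f"not in taxonomy: {top}/{mid}/{leaf}")
        ET.SubElement(seg, "tag", {"class": top, "type": mid, "signal": leaf})
    return ET.tostring(seg, encoding="unicode")


xml_out = to_xml(
    "John is tall, but Mary is short.",
    [
        ("single", "discourse marker", "but"),
        ("single", "semantic", "antonymy"),
    ],
)
print(xml_out)
```

Keeping the taxonomy as plain nested data is what makes the customizability requirement cheap here: editing the scheme means editing one dictionary.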


Signalling Annotation by UAM CorpusTool

Problem with importing data
  UAM CorpusTool supports RST annotation and can directly import RST files
  However, it cannot provide layered annotation on top of the RST-level structure
Solution to the problem
  Convert RST base files from LISP to text format
  Import the converted files
  This retains discourse structures and all relational information
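
The conversion step can be illustrated with a minimal sketch that flattens a LISP-style bracketed tree into indented plain text while keeping the structure and relation labels. This is not the converter used in the project. The bracketed input in `SAMPLE`, the attribute names (`span`, `leaf`, `rel2par`, `text`), the `_!…!_` text delimiters, and the output layout are all assumptions made for the example and may not match the treebank files exactly.

```python
import re

# Hypothetical input, loosely modelled on a LISP-style discourse tree.
SAMPLE = """
( Root (span 1 2)
  ( Nucleus (leaf 1) (rel2par Contrast) (text _!John is tall,!_) )
  ( Nucleus (leaf 2) (rel2par Contrast) (text _!but Mary is short.!_) )
)
"""

# A token is a delimited text span, a parenthesis, or a bare word.
TOKEN = re.compile(r"_!.*?!_|\(|\)|[^\s()]+", re.DOTALL)
ATTRS = {"span", "leaf", "rel2par", "text"}


def parse(tokens):
    """Turn a flat token list into nested lists, one list per parenthesis pair."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    return stack[0]


def to_lines(node, depth=0):
    """Render a node and its children as indented, pipe-separated text lines."""
    role = node[0]
    items = [item for item in node[1:] if isinstance(item, list) and item]
    attrs = {item[0]: item[1:] for item in items if item[0] in ATTRS}
    children = [item for item in items if item[0] not in ATTRS]

    parts = [role]
    if "span" in attrs:
        parts.append("span " + "-".join(attrs["span"]))
    if "leaf" in attrs:
        parts.append("unit " + attrs["leaf"][0])
    if "rel2par" in attrs:
        parts.append("rel " + attrs["rel2par"][0])
    if "text" in attrs:
        parts.append(attrs["text"][0][2:-2])  # strip the _! and !_ delimiters

    lines = ["  " * depth + " | ".join(parts)]
    for child in children:
        lines.extend(to_lines(child, depth + 1))
    return lines


lines = to_lines(parse(TOKEN.findall(SAMPLE))[0])
print("\n".join(lines))
```

Because each output line keeps the node role, unit numbers, and relation label next to the text, the resulting .txt file can be imported as ordinary text while the relational information stays visible to the annotator.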


Signalling Annotation by UAM CorpusTool

How did we do the rest?


Signalling Annotation by UAM CorpusTool

Annotation Scheme Screenshot


Signalling Annotation by UAM CorpusTool

Annotation Window Screenshot


References

Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.

O'Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. Paper presented at the XXVI Congreso de AESLA, Almeria, Spain.


Thank You!
