UAM CorpusTool: An Overview
Debopam Das
Discourse Research Group
Department of Linguistics
Simon Fraser University
Feb 5, 2014
Outline
UAM CorpusTool (O’Donnell, 2008)
  Tool description
  A short tutorial
Annotating signals of coherence relations by UAM CorpusTool
Feb 5, 2014 Discourse Research Group 2
UAM CorpusTool
Created by Mick O’Donnell in 2008
Replaces prior software, Systemic Coder, which allowed coding of single documents at a single layer
Available at http://www.wagsoft.com/CorpusTool/
Runs on Windows and Mac OS
“… primarily aimed at the linguist or computational linguist who does not program, and would rather spend their time annotating text than learning how to use the system.” (O’Donnell, 2008: 13)
UAM CorpusTool
Annotate documents
  Text type, writer characteristics, register, etc.
Annotate segments
  Tagging sections of a text by function (abstract, introduction, body, conclusion)
  Tagging sentences (active/passive; simple/complex) or clauses (relative/imperative/non-finite)
  Semantic or pragmatic annotation (synonymy/antonymy; speech acts)
  Tagging POS (noun, verb, adjective)
Automatic grammar analysis (English only) using the Stanford parser
Rhetorical structure annotation
Annotation in UAM CorpusTool: Main Steps
Start a new project
Add (an) annotation layer(s)
  You can use some pre-built annotation schemes or design your own
Add files
  Import .txt files and incorporate them
Annotate
Other Components
Search
Autocode
Statistics
Explore
Options
Help
Annotating Signals of Coherence Relations: Goal
Annotate signals of coherence relations
Signals of coherence relations
  E.g., John is tall, but Mary is short.
  One straightforward signal: the discourse marker ‘but’
  Also, there are two more signals
    Antonyms (tall ~ short)
    Parallel syntactic constructions (subj – copula – adj)
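The three signals in the example sentence can be illustrated with a minimal Python sketch. It is not part of UAM CorpusTool or of the annotation procedure described here; the word lists, the comma-based clause split, and the function names are all hypothetical and only large enough to cover the example.

```python
import re

# Hypothetical toy lexicons, just large enough for the example sentence.
DISCOURSE_MARKERS = {"but", "however", "although"}
ANTONYM_PAIRS = {frozenset({"tall", "short"}), frozenset({"hot", "cold"})}
COPULAS = {"is", "are", "was", "were"}


def tokenize(clause):
    """Lowercase a clause and split it into word tokens."""
    return re.findall(r"[a-z']+", clause.lower())


def is_subj_cop_adj(tokens):
    """Crude stand-in for a parser: three tokens with a copula in the middle."""
    return len(tokens) == 3 and tokens[1] in COPULAS


def find_signals(sentence):
    """Return (signal_type, evidence) pairs for a sentence of two clauses
    separated by a comma (an assumption made for this sketch)."""
    left, _, right = sentence.partition(",")
    left_tokens, right_tokens = tokenize(left), tokenize(right)
    signals = []

    # Signal 1: a discourse marker opening the second clause.
    if right_tokens and right_tokens[0] in DISCOURSE_MARKERS:
        signals.append(("discourse marker", right_tokens[0]))
        right_tokens = right_tokens[1:]

    # Signal 2: an antonym pair spanning the two clauses.
    for a in left_tokens:
        for b in right_tokens:
            if frozenset({a, b}) in ANTONYM_PAIRS:
                signals.append(("antonyms", f"{a} ~ {b}"))

    # Signal 3: both clauses share the subj - copula - adj pattern.
    if is_subj_cop_adj(left_tokens) and is_subj_cop_adj(right_tokens):
        signals.append(("parallel syntax", "subj - copula - adj"))

    return signals


if __name__ == "__main__":
    for signal in find_signals("John is tall, but Mary is short."):
        print(signal)
```

The point of the sketch is that a single relation can carry several signals at once, which is why the annotation tool has to allow more than one tag per element.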
Annotating Signals of Coherence Relations
Annotate the RST Discourse Treebank (Carlson et al., 2002)
  Contains 385 Wall Street Journal articles
  Texts in those articles are already annotated for rhetorical (coherence) relations
  Approx. 22,000 discourse units and 17,000 relations in total
Annotating Signals of Coherence Relations
Requirements from an annotation tool
Importability
  Relevant data to be imported into the tool
Annotation Scheme
  Support for three-level hierarchical taxonomy
Customizability
  Easy access to the annotation scheme for editing
Multiple Annotations
  Two or more tags for a single element
Convertibility
  XML output
Simplicity
  No advanced computational knowledge
  Graphical interface
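Three of these requirements (a three-level taxonomy, multiple tags per element, XML output) can be pictured with a short Python sketch. The taxonomy entries, the relation label, and the XML element and attribute names below are hypothetical illustrations; they are not the actual annotation scheme and not the XML format that UAM CorpusTool writes.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level taxonomy: class -> type -> specific signal.
TAXONOMY = {
    "single": {
        "discourse marker": ["but", "because"],
        "semantic": ["antonymy", "synonymy"],
        "syntactic": ["parallel construction"],
    },
}


def is_valid(path):
    """Check that a (class, type, specific) triple exists in the taxonomy."""
    signal_class, signal_type, specific = path
    return specific in TAXONOMY.get(signal_class, {}).get(signal_type, [])


def to_xml(relation, text, tags):
    """Serialise one annotated element carrying any number of tags."""
    segment = ET.Element("segment", {"relation": relation})
    ET.SubElement(segment, "text").text = text
    for path in tags:
        if not is_valid(path):
            raise ValueError(f"not in taxonomy: {path}")
        signal_class, signal_type, specific = path
        ET.SubElement(
            segment,
            "signal",
            {"class": signal_class, "type": signal_type, "specific": specific},
        )
    return ET.tostring(segment, encoding="unicode")


if __name__ == "__main__":
    # "contrast" is an illustrative relation label chosen for this sketch.
    print(
        to_xml(
            "contrast",
            "John is tall, but Mary is short.",
            [
                ("single", "discourse marker", "but"),
                ("single", "semantic", "antonymy"),
                ("single", "syntactic", "parallel construction"),
            ],
        )
    )
```

Keeping the taxonomy as plain nested data is one way to meet the customizability requirement, since editing the scheme then means editing one structure.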
Signalling Annotation by UAM CorpusTool
Problem with importing data
  UAM CorpusTool supports RST annotation and can directly import RST files
  However, it cannot provide layered annotation on top of the RST-level structure
Solution to the problem
  Convert RST base files from LISP to text format
  Import the converted files
  This retains discourse structures and all relational information
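The slides do not say how the conversion was carried out. The Python sketch below shows one possible way to turn a LISP-style bracketed tree into indented plain text that keeps the structure and the relation labels. The sample input is an assumption: it is written to resemble the bracketed layout of RST annotation files (role, span or leaf, `rel2par`, and text wrapped in `_!` and `!_`), not copied from the corpus, and the output layout is likewise invented for illustration.

```python
import re

# Assumed sample input, written for this sketch; not taken from the corpus.
SAMPLE = """
( Root (span 1 2)
  ( Nucleus (leaf 1) (rel2par Contrast) (text _!John is tall,!_) )
  ( Nucleus (leaf 2) (rel2par Contrast) (text _!but Mary is short.!_) )
)
"""

# A token is a delimited text string, a parenthesis, or a bare atom.
TOKEN = re.compile(r"_!(.*?)!_|([()])|([^\s()]+)", re.DOTALL)
STRUCTURAL = {"Root", "Nucleus", "Satellite"}


def parse(source):
    """Parse the bracketed source into nested Python lists."""
    stack = [[]]
    for match in TOKEN.finditer(source):
        text, paren, atom = match.groups()
        if paren == "(":
            stack.append([])
        elif paren == ")":
            node = stack.pop()
            stack[-1].append(node)
        elif text is not None:
            stack[-1].append(text)
        else:
            stack[-1].append(atom)
    if len(stack) != 1 or not stack[0]:
        raise ValueError("unbalanced or empty input")
    return stack[0][0]


def to_lines(node, depth=0):
    """Render a parsed node and its children as indented text lines."""
    role = node[0]
    parts = [c for c in node[1:] if isinstance(c, list)]
    fields = {c[0]: c[1:] for c in parts if c[0] not in STRUCTURAL}
    children = [c for c in parts if c[0] in STRUCTURAL]

    pieces = [role]
    if "rel2par" in fields:
        pieces.append("relation=" + fields["rel2par"][0])
    if "span" in fields:
        pieces.append("span=" + "-".join(fields["span"]))
    if "leaf" in fields:
        pieces.append("unit=" + fields["leaf"][0])
    if "text" in fields:
        pieces.append('"' + fields["text"][0] + '"')

    lines = ["  " * depth + " ".join(pieces)]
    for child in children:
        lines.extend(to_lines(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(to_lines(parse(SAMPLE))))
```

Indentation depth stands in for the tree structure here, so a plain .txt import would still show which units belong together and which relation links them.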
Signalling Annotation by UAM CorpusTool
How did we do the rest?
Signalling Annotation by UAM CorpusTool
Annotation Scheme Screenshot
Signalling Annotation by UAM CorpusTool
Annotation Window Screenshot
References
Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
O'Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. Paper presented at the XXVI Congreso de AESLA, Almeria, Spain.