Attribution and the PDTB Silvia Pareti The University of Edinburgh School of Informatics

Attribution and the PDTB - Penn Engineeringpdtb2012/assets/... · •Annotation of attributions not overlapping with discourse relations •Annotation of nested attributions ["The

Attribution and the PDTB

Silvia Pareti

The University of Edinburgh School of Informatics

Outline

• Introduction

• Attribution in the PDTB

• Annotation schema extension

• Resources development

• Preliminary achievements

• Future directions

Introduction - Attribution

(wsj 0961)

PDTB - Attribution Annotation

Mr. Nemeth said in parliament that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)

____Explicit____
9163..9165
#### Text ####
if
#### Features ####
Ot, Comm, Null, Null
9067..9096
#### Text ####
Mr. Nemeth said in parliament
##############
if, Contingency.Condition.Unreal present
____Arg1____
9097..9162
#### Text ####
that Czechoslovakia and Hungary would suffer environmental damage
#### Features ####
Inh, Null, Null, Null
____Arg2____
9166..9201
#### Text ####
the twin dams were built as planned
#### Features ####
Inh, Null, Null, Null

(Prasad et al., 2008)
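
The span notation in the record above can be sanity-checked with a few lines of Python. This is only a sketch: it assumes that `start..end` values are character offsets with an exclusive end, which is consistent with the four spans shown here but is not stated on the slide. The helper names `parse_span` and `span_matches_text` are hypothetical.

```python
from __future__ import annotations


def parse_span(span: str) -> tuple[int, int]:
    """Split a 'start..end' span string into two integer offsets."""
    start, end = span.split("..")
    return int(start), int(end)


def span_matches_text(span: str, text: str) -> bool:
    """Check that the span length equals the length of the annotated text.

    Assumption: offsets count characters and the end offset is exclusive.
    """
    start, end = parse_span(span)
    return end - start == len(text)


# Span strings and texts copied from the wsj_0037 record.
record = {
    "connective": ("9163..9165", "if"),
    "attribution": ("9067..9096", "Mr. Nemeth said in parliament"),
    "arg1": ("9097..9162",
             "that Czechoslovakia and Hungary would suffer environmental damage"),
    "arg2": ("9166..9201", "the twin dams were built as planned"),
}

consistency = {
    name: span_matches_text(span, text) for name, (span, text) in record.items()
}
```

Under that assumption every entry in `consistency` is true, and the attribution span ends exactly one character before Arg1 begins, i.e. the two are separated only by a space.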

Other corpora with attribution

• MPQA Opinion Corpus (Wiebe et al., 2002)

– 692 articles

– intra-sentential annotation

• RST Discourse Treebank (Carlson & Marcu, 2001)

– 385 articles

– intra-sentential; only explicit sources, with verb cues or "according to"

• GraphBank (Wolf & Gibson, 2005)

– 135 articles

– only attributions not overlapping with other discourse relations

• Other smaller or low-coverage projects

– Sydney Morning Herald Corpus (O’Keefe et al., submitted)

– Corpus TCC and RHETALHO (Pardo et al., 2004)

PDTB - Advantages

Large corpus

Less frequent structures and strategies are better observed, e.g.:

Groused Robert Antolini, head of over-the-counter trading at Donaldson, Lufkin & Jenrette: "It's making it tough for traders to make money." (wsj_1142)

For some at the SEC, an agency that covets its independence, Mr. Breeden may be too much of a Washington insider. (wsj_0955)

PDTB - Advantages

The range of attributions covered is not pre-defined

• Attributions are not limited to the sentence level

• A wide range of attributions are annotated:

– direct, indirect and mixed

– with named or unnamed sources, explicit as well as implicit (e.g. it is believed…)

– having verb and non-verb cues (e.g. idea, for)

• Includes some relevant features

PDTB - Extensions

• Finer grained annotation of the attribution span: source, cue, circumstantial information

• Completing content spans of some direct or mixed attributions

"It's just sort of a one-upsmanship thing with some people," added Larry Shapiro. "They like to talk about having the new Red Rock Terrace one of Diamond Creek's Cabernets or the Dunn 1985 Cabernet, or the Petrus.

Producers have seen this market opening up and they're now creating wines that appeal to these people."

(wsj_0071)

PDTB - Extensions

• Annotation of attributions not overlapping with discourse relations

• Annotation of nested attributions

["The Caterpillar people aren't too happy when they see their equipment used like that,"] [shrugs] [Mr. George]. ["They figure it's not a very good advert."] (wsj_1121)

[They] [figure] [it's not a very good advert]

Annotation Schema

source

cue

SUPPLEMENT

content

[Mr. Nemeth said IN PARLIAMENT] that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)

[PDTB attribution span]

PDTB discourse connective /Arg1/Arg2 text spans
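
One way to picture the schema is as a small record type. The sketch below is illustrative only: the class name `Attribution` and its method are hypothetical, and the division of the example into source, cue and supplement is an assumption read off the slide's typography (the supplement is the part shown in capitals).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribution:
    """An attribution relation split into the schema's four elements."""
    source: str
    cue: str
    content: str
    supplement: Optional[str] = None  # circumstantial information, if any

    def attribution_span(self) -> str:
        """Rebuild a PDTB-style attribution span from its parts.

        Simplification: joins source, cue and supplement in that fixed
        order, which fits this example but not every sentence.
        """
        parts = [self.source, self.cue, self.supplement]
        return " ".join(p for p in parts if p)


# The wsj_0037 example; the element boundaries are assumed, see above.
nemeth = Attribution(
    source="Mr. Nemeth",
    cue="said",
    supplement="in parliament",
    content=("that Czechoslovakia and Hungary would suffer environmental "
             "damage if the twin dams were built as planned"),
)
```

Here `nemeth.attribution_span()` gives back the undivided PDTB attribution span "Mr. Nemeth said in parliament", while the content covers the connective and both arguments of the discourse relation.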

PDTB Attribution Features

attribution type

• assertion (e.g. say, mention)

• belief (e.g. think, doubt)

• fact (e.g. remember, know)

• eventuality (e.g. allow)

source type

• writer (if explicit, e.g. I think...)

• other (e.g. Mr. Brown, a witness)

• arbitrary (e.g. one, people)

• mixed (e.g. My assessment and everyone's assessment is… (wsj_2012))

PDTB Attribution Features

factuality (determinacy)

• factual

• non-factual

scopal change (scopal polarity)

• none

• scopal change

Se c’è, cioè, una maggioranza in Parlamento in grado di affrontare seriamente una fase di riforme anche elettorali, Ø penso che la legislatura possa utilmente proseguire. (re075)

If, that is, there is a majority in Parliament able to seriously tackle a phase of reforms, including electoral ones, (I) think that the legislature could usefully continue.

New Attribution Features

source attitude

• neutral (e.g. say, add)

• positive (e.g. welcome, beam)

• critical (e.g. lament, fume)

• tentative (e.g. believe, suggest)

• other (e.g. joke)

authorial stance

• committed (e.g. admit, know)

• not-committed (e.g. lie, claim)

• neutral (e.g. say, suggest)

New Attribution Features

Mr. Nemeth said IN PARLIAMENT that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)

Attribution type: assertion

Source type: other

Factuality: factual

Scopal change: none

Source attitude: neutral

Authorial stance: neutral

Confronted, Mrs. Yeargin admitted she had given the questions and answers two days before the examination to two low-ability geography classes. (wsj_0044)

Authorial stance: committed

"I think that this magazine is not only called Garbage, but it is practicing journalistic garbage," fumes a spokesman for Campbell Soup. (wsj_0062)

Source attitude: negative

Inter-Annotator Agreement

• 2 annotators

• 14 articles (PDTB)

• annotation manual

• training on an article

• MMAX2 annotation tool (Müller & Strube, 2006)

• complete annotation schema

Data:

• 491 attributions

(22% are nested)

(Pareti, 2012 submitted)

Results - Existence of Attribution

agr = 0.87: the proportion of attribution relations annotated by both annotators, relative to all the relations identified by Annotator A and Annotator B

NOTE: writer attributions were annotated only if explicit

Span selection tasks (agr metric):

Cue     Source   Content   Supplement
0.97    0.94     0.95      0.37
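
The slide does not spell out the formula behind agr. A minimal sketch follows, assuming the directional formulation often used for span annotation: agr(A||B) is the share of A's annotations that B also made, and the two directions are averaged. The function names and the relation identifiers are hypothetical toy values, not data from the study.

```python
def agr(a: set, b: set) -> float:
    """Directional agreement: share of A's annotations that B also made."""
    return len(a & b) / len(a) if a else 0.0


def mean_agr(a: set, b: set) -> float:
    """Arithmetic mean of the two directional scores (an assumed choice)."""
    return (agr(a, b) + agr(b, a)) / 2


# Hypothetical relation identifiers marked by each annotator.
annotator_a = {"r1", "r2", "r3", "r4", "r5"}
annotator_b = {"r2", "r3", "r4", "r5", "r6", "r7"}

a_to_b = agr(annotator_a, annotator_b)      # 4 of A's 5 relations
b_to_a = agr(annotator_b, annotator_a)      # 4 of B's 6 relations
overall = mean_agr(annotator_a, annotator_b)
```

Because the measure is directional, an annotator who marks many extra relations lowers the score in one direction only, which is why the two directions are usually reported together or averaged.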

Results - Features

FEATURE             PERCENT AGREEMENT   COHEN'S KAPPA
TYPE                83.42 (317)         0.63
SOURCE              95 (361)            0.71
SCOPAL CHANGE       98.68 (375)         0.60
AUTHORIAL STANCE    94.47 (359)         0.20
SOURCE ATTITUDE     82.36 (313)         0.48
FACTUALITY          97.63 (371)         0.73
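
The gap between percent agreement and kappa for authorial stance is what one would expect when almost all items fall into a single category. The sketch below computes both measures from a confusion matrix. The matrix `toy` is invented for illustration and is not the study's data; only its total (380) and diagonal sum (359) are chosen to echo the authorial stance row, on the inference that the parenthesised counts are out of 380 items.

```python
def percent_agreement(matrix):
    """Share of items on which both annotators chose the same label."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return agreed / total


def cohens_kappa(matrix):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Rows are annotator A's labels, columns are annotator B's labels.
    Undefined (division by zero) if chance agreement is exactly 1.
    """
    total = sum(sum(row) for row in matrix)
    p_observed = percent_agreement(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Invented two-label matrix with one dominant category.
toy = [
    [355, 10],
    [11, 4],
]

toy_percent = percent_agreement(toy)
toy_kappa = cohens_kappa(toy)
```

With one label covering nearly every item, chance agreement is already above 90%, so even a percent agreement near 94% leaves kappa low.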

Italian Attribution Corpus-ItAC

• 50 articles (37,000 tokens) from Italian newspaper corpora (e.g. La Repubblica)

• 460 attribution relations

• Freely available from: http://homepages.inf.ed.ac.uk/s1052974/resources.php

(Pareti and Prodanof, 2010)

PDTB Attribution Corpus

Stand-off annotation of attribution based on the PDTB:

• Comprises all attribution relations annotated in the PDTB (reconstructed from the current annotation)

• The annotation is further extended according to the revised annotation schema

(Pareti, 2012)

9868 attributions

PDTB Attribution Corpus

Annotation of the attribution span: source, cue, SUPPLEMENT

80% annotated automatically and then manually revised, using 48 matching rules, e.g.:

(NP-SBJ)(VP): one person said

(PP-LOC)(NP)(VB): IN DALLAS, LTV said

(NP-SBJ)(VBP)(JJ): I am sure

20% had rarer syntax and was manually annotated, e.g.:

Judge Curry ordered the refunds to begin Feb. 1 and said (wsj_0015)
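
The matching rules can be pictured as a lookup from a sequence of constituent labels to a sequence of roles. This is a hypothetical sketch, not the 48 rules actually used: the three patterns are the ones quoted above, the role assigned to each constituent is an assumption, and the pre-chunked input format is invented for the example.

```python
# Label sequence -> role of each constituent (role assignments are assumed).
RULES = {
    ("NP-SBJ", "VP"): ("source", "cue"),
    ("PP-LOC", "NP", "VB"): ("supplement", "source", "cue"),
    ("NP-SBJ", "VBP", "JJ"): ("source", "cue", "cue"),
}


def label_span(chunks):
    """Assign roles to an attribution span given as (label, text) pairs.

    Returns a role -> text dict, or None when no rule matches, in which
    case the span would be left for manual annotation.
    """
    labels = tuple(label for label, _ in chunks)
    roles = RULES.get(labels)
    if roles is None:
        return None
    result = {}
    for role, (_, text) in zip(roles, chunks):
        # Adjacent constituents with the same role are joined with a space.
        result[role] = f"{result[role]} {text}" if role in result else text
    return result


dallas = label_span([("PP-LOC", "In Dallas,"), ("NP", "LTV"), ("VB", "said")])
sure = label_span([("NP-SBJ", "I"), ("VBP", "am"), ("JJ", "sure")])
unmatched = label_span([("NP-SBJ", "Judge Curry"), ("VP", "ordered"), ("CC", "and")])
```

A span whose label sequence matches no rule returns `None`, mirroring the share of spans with rarer syntax that had to be annotated by hand.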

PDTB Attribution Corpus

Further annotation of the content span:

– adding punctuation (direct quotation marks)

– completing content spans that had only been partially annotated

– annotating the quote status of the attribution, based on the position of the quote span (QS) and the content span (CS):

• direct: QS = CS

• indirect: CS outside or contained in QS

• mixed: CS overlaps QS or QS contained in CS
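
The three cases can be written as a small interval comparison. This is a sketch under stated assumptions: each span is a single `(start, end)` pair of character offsets with an exclusive end, whereas real content and quote spans may be discontinuous, and the function name `quote_status` is hypothetical.

```python
def quote_status(qs, cs):
    """Classify an attribution from its quote span (QS) and content span (CS).

    Both arguments are (start, end) offsets with an exclusive end.
    """
    q_start, q_end = qs
    c_start, c_end = cs
    if (q_start, q_end) == (c_start, c_end):
        return "direct"                       # QS = CS
    disjoint = c_end <= q_start or q_end <= c_start
    cs_inside_qs = q_start <= c_start and c_end <= q_end
    if disjoint or cs_inside_qs:
        return "indirect"                     # CS outside or contained in QS
    return "mixed"                            # partial overlap, or QS inside CS


statuses = [
    quote_status((10, 50), (10, 50)),  # identical spans
    quote_status((10, 50), (60, 90)),  # content entirely outside the quote
    quote_status((10, 50), (20, 30)),  # content inside the quote
    quote_status((10, 50), (0, 80)),   # quote inside a longer content span
    quote_status((10, 50), (30, 70)),  # partial overlap
]
```

The equality test comes first so that identical spans are not swallowed by the containment test that follows it.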

PDTB Attribution Corpus

ATTRIBUTION ID: wsj_0003.pdtb_05
SOURCE SPAN: Darrell Phillips, vice president of human resources for Hollingsworth & Vose
CUE SPAN: said
CONTENT SPAN: “There’s no question that some of those workers and managers contracted asbestos-related diseases,” “But you have to recognize that these events took place 35 years ago. It has no bearing on our work force today.”
SUPPLEMENT SPAN: None
FEATURES: Ot, Comm, Null, Null
QUOTE STATUS: Direct

Use of PDTB Attribution Corpus

Independent analysis of attribution:

• cue composition

– several cues other than verbs (prepositions, nouns, adverbs)

– wide range of attributional verbs (266 types in the corpus)

• source composition

– NEs only about 50% of the sources

• attribution structures

Use of PDTB Attribution Corpus

Testing a system for identifying direct quotes and their speakers in the literature and news domains (University of Sydney and Sydney Morning Herald)

(O’Keefe et al., 2012, submitted)

• Rule-based and machine-learning approaches have been tested on 3 corpora

• The results show that direct quotes differ by domain and style

Future

• Development of an attribution extraction system using the data to train a classifier

• Semi-automatic extension of the annotation to comprise all attributions in the corpus

• Annotation of the level of nesting of each attribution

• Release of the corpus for development, testing and shared-task use

Conclusion

• Advantages of attribution in the PDTB

• Development of a finer-grained annotation schema and its inter-annotator agreement results

• Application of the schema to a small corpus of Italian

• Collection and further annotation of attribution in the PDTB

• Importance of this resource for the analysis of attribution and its ‘long tail’ and for testing and developing attribution extraction systems


Bibliography

Carlson, L. and Marcu, D. Discourse tagging reference manual. Technical Report ISI-TR-545, ISI, University of Southern California, September 2001.

Müller, C. and Strube, M. Multi-Level Annotation of Linguistic Data with MMAX2. In Sabine Braun, Kurt Kohn and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods (English Corpus Linguistics, Vol. 3), pages 197-214. Peter Lang, Frankfurt, 2006.

O’Keefe, T., Pareti, S., Curran, J., Koprinska, I. and Honnibal, M. A sequence labelling approach to quote attribution. Manuscript submitted for publication, 2012.

Pardo, T., das Graças Volpe Nunes, M. and Rino, L. Dizer: An automatic discourse analyzer for Brazilian Portuguese. In Ana Bazzan and Sofiane Labidi, editors, Advances in Artificial Intelligence – SBIA 2004, volume 3171 of Lecture Notes in Computer Science, pages 224–234. Springer Berlin / Heidelberg, 2004.

Pareti, S. and Prodanof, I. Annotating attribution relations: Towards an Italian discourse treebank. In Proceedings of LREC10, 2010.

Pareti, S. A database of attribution relations. In Proceedings of LREC12, Istanbul, 23-25 May 2012 (to appear).

Pareti, S. Theory and practise of annotating attributions. Manuscript submitted for publication, 2012.

Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A. and Webber, B. The Penn Discourse Treebank 2.0. In Proceedings of LREC08, 2008.

Wiebe, J. Instructions for annotating opinions in newspaper articles. Technical report, University of Pittsburgh, 2002.

Wolf, F. and Gibson, E. Representing discourse coherence: A corpus-based study. Comput. Linguist., 31:249–288, June 2005.