1 Identifying Subjective Language Janyce Wiebe University of Pittsburgh

Identifying Subjective Language

Janyce Wiebe, University of Pittsburgh

Overview

General area: acquire knowledge of evaluative and speculative language and use it in NLP applications

Primarily corpus-based work

Today: results of exploratory studies

Collaborators

Rebecca Bruce, Vasileios Hatzivassiloglou, Joseph Phillips, Matthew Bell, Melanie Martin, Theresa Wilson

Subjectivity Tagging

Recognizing opinions and evaluations (Subjective sentences) as opposed to material objectively presented as true (Objective sentences)

Banfield 1982, Fludernik 1993, Wiebe 1994, Stein & Wright 1995

Examples

At several different levels, it’s a fascinating tale. subjective

Bell Industries Inc. increased its quarterly to 10 cents from 7 cents a share. objective

Subjectivity

“Complained”, “You Idiot!”, “Terrible product”

“Speculated”, “Maybe”

“Enthused”, “Wonderful!”, “Great product”

Examples

Strong addressee-oriented negative evaluation
  Recognizing flames (Spertus 1997)
  Personal e-mail filters (Kaufer 2000)

I had in mind your facts, buddy, not hers.

Nice touch. “Alleges” whenever facts posted are not in your persona of what is “real.”

Examples

Opinionated, editorial language
  IR, text categorization (Kessler et al. 1997)
  Do the writers purport to be objective?

Look, this is a man who has great numbers.

We stand in awe of the Woodstock generation’s ability to be unceasingly fascinated by the subject of itself.

Examples

Belief and speech reports
  Information extraction, summarization, intellectual attribution (Teufel & Moens 2000)

Northwest Airlines settled the remaining lawsuits, a federal judge said.

“The cost of health care is eroding our standard of living and sapping industrial strength”, complains Walter Maher.

Other Applications

Review mining (Terveen et al. 1997)

Clustering documents by ideology (Sack 1995)

Style in machine translation and generation (Hovy 1987)

Potential Subjective Elements

"The cost of health care is eroding standards of living and sapping industrial strength,” complains Walter Maher.

Sap: potential subjective element

Subjective element

Subjectivity

Multiple types, sources, and targets

We stand in awe of the Woodstock generation’s ability to be unceasingly fascinated by the subject of itself.

Somehow grown-ups believed that wisdom adhered to youth.

Outline

Data and annotation
Sentence-level classification
Individual words
Collocations
Combinations

Annotations

Three levels: expression level, sentence level, document level

Manually tagged + existing annotations

Expression Level Annotations

[Perhaps you’ll forgive me] for reposting his response

They promised [e+ 2 yet] more for [e+ 3 really good] [e? 1 stuff]

Expression Level Annotations

Difficult for manual and automatic tagging:
  detailed
  no predetermined classification unit

To date: used for training and bootstrapping

Probably the most natural level

Document Level Annotations

Manual: flames in Newsgroups

Existing: opinion pieces in the WSJ: editorials, letters to the editor, arts & leisure reviews

* to ***** reviews

+ More directly related to applications, but …

Document Level Annotations

Opinion pieces contain objective sentences, and non-opinion pieces contain subjective sentences

Editorials contain facts supporting the argument

News reports present reactions (van Dijk 1988)
  “Critics claim …”
  “Supporters argue …”

Reviews contain information about the product

Document Level Annotations

In a WSJ data set:

                     subj   obj
opinion pieces       74%    26%
non-opinion pieces   43%    57%

Data in this Talk

Sentence level
  1000 WSJ sentences
  3 judges reached good agreement after rounds
  Used for training and evaluation

Expression level
  1000 WSJ sentences (2J)
  462 newsgroup messages (2J) + 15413 words (1J)
  Single round; results promising
  Used to generate features, and not for evaluation

Data in this Talk

Document level:
  Existing opinion-piece annotations used to generate features
  Manually refined classifications used for evaluation
    Identified editorials not marked as such
    Only clear instances labeled
    To date: 1 judge
  Distinct from the other data
  3 editions, each more than 150K words

Sentence Level Annotations

A sentence is labeled subjective if any significant expression of subjectivity appears

“The cost of health care is eroding our standard of living and sapping industrial strength,” complains Walter Maher.

“What an idiot,” the idiot presumably complained.

Sentence Classification

Binary features: pronoun, adjective, number, modal other than “will”, adverb other than “not”, new paragraph

Lexical feature: good for subj; good for obj; good for neither

Probabilistic classifier

10-fold cross validation; 51% baseline
72% average accuracy across folds
82% average accuracy on sentences rated certain
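
The slide does not say which probabilistic classifier was used. As a minimal sketch, assuming a Bernoulli naive Bayes model over binary features like those listed above, classification could look roughly like this. The function names, the smoothing constant, and the toy data are all illustrative assumptions, not the system behind the reported accuracies.

```python
import math
from collections import defaultdict


def train_naive_bayes(examples, smoothing=1.0):
    """Fit a Bernoulli naive Bayes model (an assumed classifier).

    examples: list of (features, label), where features maps a
    binary feature name (e.g. "adjective") to True/False.
    """
    label_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    feature_names = set()
    for features, label in examples:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                feature_counts[label][name] += 1
    total = sum(label_counts.values())
    model = {}
    for label, count in label_counts.items():
        prior = math.log(count / total)
        # Smoothed estimate of P(feature is on | label).
        probs = {
            name: (feature_counts[label][name] + smoothing)
            / (count + 2 * smoothing)
            for name in feature_names
        }
        model[label] = (prior, probs)
    return model


def classify(model, features):
    """Return the label with the highest log posterior score."""
    best_label, best_score = None, float("-inf")
    for label, (prior, probs) in model.items():
        score = prior
        for name, p in probs.items():
            score += math.log(p if features.get(name, False) else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data for illustration only.
toy = [
    ({"adjective": True, "number": False}, "subjective"),
    ({"adjective": True, "number": False}, "subjective"),
    ({"adjective": False, "number": True}, "objective"),
    ({"adjective": False, "number": True}, "objective"),
]
toy_model = train_naive_bayes(toy)
```

With this toy model, a sentence with an adjective and no number is labeled subjective, and the reverse is labeled objective.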

Identifying PSEs

There are few high precision, high frequency potential subjective elements

Identifying Individual PSEs

Classifications correlated with adjectives

Good subsets
  Dynamic adjectives (Quirk et al. 1985)
  Positive, negative polarity; gradability automatically identified in corpora (Hatzivassiloglou & McKeown 1997)

Results from distributional similarity

Distributional Similarity

Word similarity based on distributional pattern of words

Much work in NLP (see Lee 1999, Lee & Pereira 1999)

Purposes:
  Improve estimates of unseen events
  Thesaurus and dictionary construction from corpora

Lin’s Distributional Similarity

Lin 1998

Example sentence: I have a brown dog (dependency relations labeled R1 to R4)

Word    R    W
I       R1   have
have    R2   dog
brown   R3   dog
. . .

Lin’s Distributional Similarity

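Each word (Word1, Word2) has a set of (R, W) pairs statistically correlated with it.

Similarity of Word1 and Word2:

$$\frac{\sum_{RW_{int}} \left[ I(\text{Word1}, RW_{int}) + I(\text{Word2}, RW_{int}) \right]}{\sum_{RW_{w1}} I(\text{Word1}, RW_{w1}) + \sum_{RW_{w2}} I(\text{Word2}, RW_{w2})}$$

RW_int ranges over the pairs correlated with both words, RW_w1 over the pairs correlated with Word1, and RW_w2 over the pairs correlated with Word2.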
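
To make the ratio concrete, here is a small sketch that assumes the mutual-information values I(word, R, W) have already been computed and stored per word. The data layout, the function name `lin_similarity`, and the example values are hypothetical; only the ratio itself comes from the slide.

```python
def lin_similarity(info, word1, word2):
    """Lin-style similarity from precomputed information values.

    info maps a word to {(relation, other_word): I}, holding only the
    (R, W) pairs statistically correlated with that word (assumed to
    have been selected beforehand).
    """
    pairs1 = info.get(word1, {})
    pairs2 = info.get(word2, {})
    shared = pairs1.keys() & pairs2.keys()  # RW_int
    numerator = sum(pairs1[p] + pairs2[p] for p in shared)
    denominator = sum(pairs1.values()) + sum(pairs2.values())
    return numerator / denominator if denominator else 0.0


# Made-up values, for illustration only.
example_info = {
    "bizarre": {("mod-of", "story"): 2.0, ("mod-of", "behavior"): 1.0},
    "strange": {("mod-of", "story"): 1.0, ("mod-of", "noise"): 2.0},
}
example_sim = lin_similarity(example_info, "bizarre", "strange")
```

In this toy case the two words share one pair, so the numerator is 2.0 + 1.0 and the denominator is 3.0 + 3.0, giving 0.5. A word compared with itself scores 1.0.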

Bizarre

strange, similar, scary, unusual, fascinating, interesting, curious, tragic, different, contradictory, peculiar, silly, sad, absurd, poignant, crazy, funny, comic, compelling, odd

Filtering

Seed words → Words + clusters → Filtered set

A word and its cluster are removed if precision on the training set < threshold
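
The filtering step can be sketched as follows. This is an illustrative reading, not the original implementation: it assumes training data given as (word, is_subjective) instances and defines precision as the fraction of a group's instances that are subjective. All names and data are hypothetical.

```python
def group_precision(words, training):
    """Fraction of training instances of `words` that are subjective.

    training: list of (word, is_subjective) pairs (assumed format).
    """
    hits = [is_subj for word, is_subj in training if word in words]
    return sum(hits) / len(hits) if hits else 0.0


def filter_clusters(seed_words, clusters, training, threshold):
    """Drop a seed word and its cluster if precision < threshold."""
    kept = set()
    for seed in seed_words:
        group = {seed} | set(clusters.get(seed, []))
        if group_precision(group, training) >= threshold:
            kept |= group
    return kept


# Made-up example data.
toy_clusters = {"bizarre": ["strange"], "similar": ["different"]}
toy_training = [
    ("bizarre", True), ("strange", True), ("strange", False),
    ("similar", False), ("different", False), ("different", True),
]
filtered = filter_clusters(["bizarre", "similar"], toy_clusters,
                           toy_training, threshold=0.5)
```

Here the "bizarre" group has precision 2/3 and survives, while the "similar" group has precision 1/3 and is removed.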

Parameters

Seed words

Words + clusters

Cluster size

Threshold

Seeds from Annotations

1000 WSJ sentences with sentence level and expression level annotations

They promised [e+ 2 yet] more for [e+ 3 really good] [e? 1 stuff].

"It's [e? 3 really] [e- 3 bizarre]," says Albert Lerman, creative director at the Wells agency.

Experiments

1/10 used for training, 9/10 for testing

Parameters:
  Cluster size fixed at 20
  Filtering threshold: precision of baseline adjective feature on the training data

+7.5% average over 10-fold cross validation

[More improvements with other adj features]

Opinion Pieces

3 WSJ data sets, over 150K words each

Skewed distribution: 13-17% words in opinions

Baseline for comparison: # words in opinions / total # words

For measuring precision: Prec(S) = # instances of S in opinions / total # instances of S
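
Both quantities are simple ratios over tokens. A minimal sketch, assuming the corpus is available as (word, in_opinion_piece) pairs; that format, the function names, and the toy tokens are illustrative assumptions.

```python
def baseline(tokens):
    """# words in opinions / total # words."""
    return sum(in_op for _, in_op in tokens) / len(tokens)


def prec(feature_set, tokens):
    """# instances of S in opinions / total # instances of S."""
    flags = [in_op for word, in_op in tokens if word in feature_set]
    return sum(flags) / len(flags) if flags else 0.0


# Made-up tokens: (word, appears in an opinion piece?).
toy_tokens = [
    ("bizarre", True), ("shares", False), ("bizarre", True),
    ("shares", False), ("strange", False), ("rose", False),
    ("cents", False), ("awe", True),
]
toy_baseline = baseline(toy_tokens)
toy_prec = prec({"bizarre", "strange"}, toy_tokens)
```

A feature set is useful when its precision exceeds the baseline; in this toy corpus the baseline is 3/8 and the set {bizarre, strange} reaches 2/3.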

Parameters

Seed words

Words + clusters

Cluster size: 2-40

Threshold: 1-70%

Results

Varies with parameter settings, but there are smooth regions of the space

Here: training/validation/testing

Low Frequency Words

Single instance in a corpus ~ low frequency

Analysis of expression level annotations: there are many more single-instance words in subjective elements than outside them

Unique Words

Replace all words that appear once in the test data with “UNIQUE”

+5-10% points
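
The replacement itself is a one-pass count followed by a substitution. A minimal sketch, assuming the data is already tokenized and that matching is case-sensitive (the slide specifies neither); `map_unique` is a hypothetical name.

```python
from collections import Counter


def map_unique(tokens, placeholder="UNIQUE"):
    """Replace every token that occurs exactly once with a placeholder."""
    counts = Counter(tokens)
    return [placeholder if counts[t] == 1 else t for t in tokens]


mapped = map_unique(["highly", "talented", "and", "highly", "erotic", "and"])
# mapped == ["highly", "UNIQUE", "and", "highly", "UNIQUE", "and"]
```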

Collocations

here we go again
get out of here
what a
well and good
rocket science
for the last time
just as well
… !

Start with the observation that low precision words often compose higher precision collocations

Collocations

Identify n-gram PSEs as sequences whose precision is higher than the maximum precision of their constituents

W1,W2 is a PSE if prec(W1,W2) > max (prec(W1),prec(W2))

W1,W2,W3 is a PSE if prec(W1,W2,W3) > max(prec(W1,W2),prec(W3)) or prec(W1,W2,W3) > max(prec(W1),prec(W2,W3))
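
The two rules above translate directly into code. The sketch below assumes a precomputed table mapping n-gram tuples to precision; the table layout, function names, and precision values are made up for illustration.

```python
def is_bigram_pse(prec, w1, w2):
    """W1,W2 is a PSE if prec(W1,W2) > max(prec(W1), prec(W2))."""
    return prec[(w1, w2)] > max(prec[(w1,)], prec[(w2,)])


def is_trigram_pse(prec, w1, w2, w3):
    """W1,W2,W3 is a PSE if it beats either way of splitting it."""
    whole = prec[(w1, w2, w3)]
    return (whole > max(prec[(w1, w2)], prec[(w3,)])
            or whole > max(prec[(w1,)], prec[(w2, w3)]))


# Hypothetical precision table keyed by n-gram tuples.
toy_prec_table = {
    ("get",): 0.20, ("out",): 0.15, ("of",): 0.14,
    ("get", "out"): 0.30, ("out", "of"): 0.16, ("of", "get"): 0.10,
    ("get", "out", "of"): 0.25,
}
```

With these values, "get out" qualifies (0.30 > 0.20), "of get" does not (0.10 < 0.20), and "get out of" qualifies through the second split only (0.25 > max(0.20, 0.16)).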

Collocations

Moderate improvements: +3-10% points

But with all unique words mapped to “UNIQUE”: +13-24% points

Example Collocations with Unique

highly||adverb UNIQUE||adj

highly unsatisfactory

highly unorthodox

highly talented

highly conjectural

highly erotic

Example Collocations with Unique

UNIQUE||verb out||IN

farm out

chuck out

ruling out

crowd out

flesh out

blot out

spoken out

luck out

Collocations

UNIQUE||adj to||TO UNIQUE||verb
  impervious to reason
  strange to celebrate
  wise to temper

UNIQUE||noun of||IN its||pronoun
  sum of its
  usurpation of its
  proprietor of its

they||pronoun are||verb UNIQUE||noun
  they are fools
  they are noncontenders

Opinion Results: Summary

             Best (baseline 17%)   Worst (baseline 13%)
             +prec/freq            +prec/freq
Adjs         +21/373               +09/2137
Verbs        +16/721               +07/3193
2-grams      +10/569               +04/525
3-grams      +07/156               +03/148
1-U-grams    +10/6065              +06/6045
2-U-grams    +24/294               +14/288
3-U-grams    +27/138               +13/144

Disparate features have consistent performance

Collocation sets largely distinct

Does it add up?

Good preliminary results classifying opinion pieces using density and feature count features.

Future Work

Mutual bootstrapping (Riloff & Jones 1999)

Co-training (Collins & Singer 1999) to learn both PSEs and contextual features

Integration into a probabilistic model

Text classification and review mining

References

Banfield, A. (1982). Unspeakable Sentences. Routledge and Kegan Paul.
Collins, M. & Singer, Y. (1999). Unsupervised Models for Named Entity Classification. EMNLP-VLC-99.
van Dijk, T.A. (1988). News as Discourse. Lawrence Erlbaum.
Fludernik, M. (1993). The Fictions of Language and the Languages of Fiction. Routledge.
Hatzivassiloglou, V. & McKeown, K. (1997). Predicting the Semantic Orientation of Adjectives. ACL-EACL-97.
Hatzivassiloglou, V. & Wiebe, J. (2000). Effects of Adjective Orientation and Gradability on Sentence Subjectivity. COLING-00.
Hovy, E. (1987). Generating Natural Language Under Pragmatic Constraints. PhD dissertation.
Kaufer, D. (2000). Flaming. www.eudora.com
Kessler, B., Nunberg, G., & Schütze, H. (1997). Automatic Detection of Genre. ACL-EACL-97.
Lee, L. (1999). Measures of Distributional Similarity. ACL-99.
Lee, L. & Pereira, F. (1999). ACL-99.
Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. COLING-ACL-98.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A Comprehensive Grammar of the English Language. Longman.
Riloff, E. & Jones, R. (1999). Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. AAAI-99.
Sack, W. (1995). Representing and Recognizing Point of View. AAAI Fall Symposium on Knowledge Navigation and Retrieval.
Stein, D. & Wright, S. (1995). Subjectivity and Subjectivisation. Cambridge.
Terveen, L., Hill, W., Amento, B., McDonald, D., & Creter, J. (1997). Building Task-Specific Interfaces to High Volume Conversational Data. CHI-97.
Teufel, S. & Moens, M. (2000). What’s Yours and What’s Mine: Determining Intellectual Attribution in Scientific Texts. EMNLP-VLC-00.
Wiebe, J. (1994). Tracking Point of View in Narrative. Computational Linguistics 20(2).
Wiebe, J. (2000). Learning Subjective Adjectives from Corpora. AAAI-00.
Wiebe, J., Bruce, R., & O’Hara, T. (1999). Development and Use of a Gold Standard Data Set for Subjectivity Classifications. ACL-99.

Sentence Annotations

Average pair-wise Kappa scores:
  all data: .69
  certain data: .88 (60% of the corpus)
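
The slide does not state which Kappa variant was computed. As a sketch, assuming Cohen's kappa for each pair of judges and a plain mean over pairs, the calculation could be written as below; the function names and tag sequences are illustrative.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two judges tagging the same items.

    Undefined (division by zero) when chance agreement is 1.
    """
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement from each judge's own tag distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def average_pairwise_kappa(judges):
    """Mean kappa over all pairs of judges."""
    pairs = list(combinations(judges, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Made-up tags: "s" = subjective, "o" = objective.
judge_1 = ["s", "s", "o", "o"]
judge_2 = ["s", "s", "o", "s"]
toy_kappa = cohen_kappa(judge_1, judge_2)
```

For these two toy judges, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.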

Case study of analyzing and improving intercoder reliability:

if there is symmetric disagreement resulting from bias, assessed by fitting probability models (Bishop et al. 1975, CoCo)

• bias: marginal homogeneity
• symmetric disagreement: quasi-symmetry

use the latent class model to correct disagreements

Test for Bias: Marginal Homogeneity

The worse the fit, the greater the bias

[Figure: 4 x 4 contingency table over categories C1 to C4, with row totals X_{1+}, ..., X_{4+} and column totals X_{+1}, ..., X_{+4}]

Marginal homogeneity: $p_{i+} = p_{+i}$ for all $i$

Test for Symmetric Disagreement: Quasi-Symmetry

[Figure: 4 x 4 contingency table over categories C1 to C4, with the off-diagonal cells marked]

Tests relationships among the off-diagonal counts

The better the fit, the higher the correlation

(Potential) Subjective Elements

Same word, different types
  “Great majority”: objective
  “Great!”: positive evaluative
  “Just great.”: negative evaluative

Review Mining

From: Hoodoo <[email protected]>
Newsgroups: rec.gardens
Subject: Re: Garden software

I bought a copy of Garden Encyclopedia from Sierra. Well worth the time and money.