24
AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam CIKM 2008 1

AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Embed Size (px)

Citation preview

Page 1: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL

Ben He

Craig Macdonald

Iadh Ounis

University of Glasgow

Jiyin HeUniversity of Amsterdam

CIKM 2008

1

Page 2: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Introduction

Finding opinionated blog posts is still an open problem.

A popular solution is to utilize the external resources and manual efforts in identifying subjective features.

The authors proposed a dictionary-based statistical approach, which automatically derives evidence for subjectivity from the blog collection itself, without requiring any manual effort.

2

Page 3: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

TREC Opinion Finding Task (1/2) Text REtrieval Conference. Goal: To identify sentiment at the

document-level. The dataset are composed of:

Feed documents: XML format, usually a short summary of the blog post.

Permalink documents: HTML format, the complete blog post and its comments.

Homepage documents: HTML format, main entry to the blog.

3

Page 4: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

TREC Opinion Finding Task (2/2) Sample query format:

<top><num> 863<title> netflix<desc>

Identify documents that show customer opinions of Netflix.

<narr>A relevant document will indicate subscriber satisfaction with Netflix. Opinions about the Netflix DVD allocation system, promptness or delay in mailings are relevant.Indications of having been or intent to become a Netflix subscriber that do not state an opinion are not relevant.

</top>

4

Page 5: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Statistical Dictionary-based Approach

5

Page 6: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Dictionary Generation

The Skewed Query Model Rank all terms in the collection by term

frequencies in descending order. The terms, whose rankings are in the range

(S·#terms, U·#terms)are selected in the dictionary. #terms : the number of unique terms in the

collection S,U : model parameters. S=0.00007 and

U=0.001 in this paper.

6

Page 7: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Dictionary Generation

Ex:#terms=200,000#terms x 0.00007=14#terms x 0.001=200Only those terms ranked 14 to 200 will be

preserved

The dictionary is not necessary opinionated.

7

Page 8: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Term Weighting (1/2)

KL divergence method

D(Rel): Collection of relevant documents. D(opRel): Collection of opinionated and relevant documents. c(D(opRel))= #tokens in the opinionated documents. c(D(Rel))= #tokens in the relevant documents. tfx=term frequency of the term t in the opinionated

documents. tfrel=term frequency of the term t in the relevant

documents.

8

Page 9: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Term Weighting (2/2)

Bose-Einstein statistics method Measures how informative a term is in the

set D(opRel) against D(Rel).

= : the frequency of the term t in the D(Rel). : the number of documents in D(Rel). : the frequency of the term t in the

D(opRel).

9

Page 10: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Generating the Opinion Score Take the X top weighted terms from the

opinion dictionary. X will be tuned in the training step.

Submit them to the retrieval system as a query Qopn.

Score(d,Qopn): the opinion score of document d.

Score(d,Q): the initial ranking score.

10

Page 11: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Score Combination

Linear combination:

Log combination:

a, k will be tuned in the training step.

11

Page 12: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment Settings (1/3)

TREC06: 50 topics for training. TREC07: 50 topics for testing. Only the “title” field is used (1.74

words/topic). Baseline 1: Apply InLB model, a variation of

the BM25 ranking function. Retrieve as many relevant documents as possible.

12

Page 13: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment Settings (2/3)

Baseline 2: favor documents where the query terms appear in close proximity.

Q2: The set of all query term pairs in query Q. N: #Docs in the collection. T: #Tokens in the collection. pfn: The normalized frequency of the tuple p.

13

Page 14: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment Settings (3/3)

Manually collecting an external dictionary from OpinionFinder and several other resources.

Contains approximately 12,000 English words, mostly adjectives, adverbs and nouns.

14

Page 15: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Term Weighting (1/2) Hypothesis: the most opinionated terms

for one query set are also good indicators of opinion for other queries.

Sampling:

For each sample set, calculate the weight of each terms.

Training Set(50 Topics)

Set1

Set2

Set10…

Each with 25 Topics

Overlap :65%

maximum

15

Page 16: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Term Weighting (2/2) Compute the cosine similarity between

the weights of the top 100 weighted terms from each two samples

16

Page 17: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Validation (1/3)

Tuning the parameters X, a and k mentioned before.

Maximize X by maximizing the mean MAP of the 10 samples.

17

Training Set(50 Topics)

Set1for

assigning term weight

Set1’for

validation

Page 18: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Validation (2/3)18

Page 19: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Validation (3/3)

Fix X=100, tuning a and k. a within [0, 1] , step=0.05 k within (0, 1000] , step=50

19

Page 20: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Evaluation (1/3)

20

Page 21: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Evaluation (2/3)

21

Page 22: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Experiment: Evaluation (3/3) Comparison with the OpinionFinder

All being equal, replace the opinion score Score(d,Qopn) with

22

Page 23: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Conclusion

An effective and practical approach to retrieving opinionated blog posts without manual effort.

Opinion scores are computed during indexing Computational cost is negligible.

The automatically generated internal dictionary performs as good as the external dictionary.

Diferrent random samples from the collection reach a high consensus on the opinionated terms if the Bose-Einstein statistics given by the geometric distribution are applied.

23

Page 24: AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam

Thank you for listening!24