29
LOGO Comments-Oriented Blog Summarization by Sentence Extraction Meishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07) Advisor Dr. Koh Jia-Ling Speaker Tu Yi-Lang Date 2008.04.17

Comments-Oriented Blog Summarization by Sentence Extraction

  • Upload
    olwen

  • View
    23

  • Download
    3

Embed Size (px)

DESCRIPTION

Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : 2008.04.17. Comments-Oriented Blog Summarization by Sentence Extraction. Meishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07). Introduction. Entries of blogs, also known as blog posts, often contain comments from blog readers. - PowerPoint PPT Presentation

Citation preview

Page 1: Comments-Oriented Blog Summarization by Sentence Extraction

LOGO

Comments-Oriented Blog Summarization by

Sentence ExtractionMeishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07)

Advisor : Dr. Koh Jia-LingSpeaker : Tu Yi-Lang

Date : 2008.04.17

Page 2: Comments-Oriented Blog Summarization by Sentence Extraction

2

Introduction

Entries of blogs, also known as blog posts, often contain comments from blog readers.

A recent study on blog conversation showed that readers treat comments associated with a post as an inherent part of the post.

It tries to find out whether the reading of comments would change a reader’s understanding about the post.

Page 3: Comments-Oriented Blog Summarization by Sentence Extraction

3

Introduction

It conducted a user study on summarizing blog posts by labeling representative sentences in those posts.

Significant differences between the sentences labeled before and after reading comments were observed.

Page 4: Comments-Oriented Blog Summarization by Sentence Extraction

4

Introduction

In this research, the task is to summarize a blog post by extracting representative sentences from the post using information hidden in its comments.

The extracted sentences represent the topics presented in the post that are captured by its readers.

Page 5: Comments-Oriented Blog Summarization by Sentence Extraction

5

Introduction

Given a blog post and its comments, the solution consists of three modules : sentence detection word representativeness measure sentence selection

Page 6: Comments-Oriented Blog Summarization by Sentence Extraction

6

Problem Definition

Definition 1: Given a blog post P consisting of a set of sen

tences P = {s1, s2, . . . , sn} and the set of comments C = {c1,c2, . . . , cm} associated with P, the task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by Sr (Sr P), that best represents ⊂the discussion in C.

Page 7: Comments-Oriented Blog Summarization by Sentence Extraction

7

Problem Definition

One straightforward approach is to compute a representativeness score for each sentence si, denoted by Rep(si).

As a sentence consists of a set of words, si = {w1,w2, . . . ,wm}, one can derive Rep(si) using representativeness scores of all words contained in si.

Page 8: Comments-Oriented Blog Summarization by Sentence Extraction

8

Problem Definition

Word representativeness denoted by Rep(wk), can be measured by counting the number of occurrences of a word in comments.

Such as the following three schemes : Binary. Comment Frequency (CF). Term Frequency (TF).

Page 9: Comments-Oriented Blog Summarization by Sentence Extraction

9

Problem Definition

All three measures are simple statistics on comment content, binary captures minimum information; CF and TF capture slightly more, and the measures suffer from spam comments.

So it tries to find a measure that could capture more information from comments (besides content) and is less sensitive to spam.

Page 10: Comments-Oriented Blog Summarization by Sentence Extraction

10

ReQuT Model

Here states three common observations on how comments may link to each other, and they provide some guidelines on measuring word representativeness. Observation 1 :

• A reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s) posted by the mentioned reader.

• A reader may mention multiple readers in one comment.

Page 11: Comments-Oriented Blog Summarization by Sentence Extraction

11

ReQuT Model

Observation 2 :• A comment may contain quoted sentences from on

e or more comments to reply these comments or continue the discussion.

Observation 3 :• Discussion in comments often branches into sever

al topics and a set of comments are linked together by sharing the same topic.

Page 12: Comments-Oriented Blog Summarization by Sentence Extraction

12

Reader, Quotation and Topic Measures

ReQuT means “Reader”, “Quotation” and “Topic”.

Based on the three observations, it can find that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.

Page 13: Comments-Oriented Blog Summarization by Sentence Extraction

13

Reader, Quotation and Topic Measures

Reader measure : A directed reader graph GR :=(VR, ER), each

node ra V∈ R is a reader, and an edge eR(rb, ra) E∈ R exists if rb mentions ra in one of rb’s comments.

WR(rb, ra), is the ratio between the number of times rb mention ra against all times rb mention other readers (including ra).

Page 14: Comments-Oriented Blog Summarization by Sentence Extraction

14

.

• |R| : the total number of readers of the blog.• d : damping factor.

.

• tf(wk,ci) : the term frequency of word wk in

comment ci.• ci ← ra : ci is authored by reader ra.

Reader, Quotation and Topic Measures

Page 15: Comments-Oriented Blog Summarization by Sentence Extraction

15

Reader, Quotation and Topic Measures

Quotation measure : A directed acyclic quotation graph GQ := (VQ,

EQ), each node ci V∈ Q is a comment, and an edge (cj, ci) E∈ Q indicates cj quoted sentences from ci.

WQ(cj, ci), is 1 over the number of comments that cj ever quoted.

Page 16: Comments-Oriented Blog Summarization by Sentence Extraction

16

Reader, Quotation and Topic Measures

.

• |C| : the number of comments associated with the

given post.

.

• wk c∈ i : word wk appears in comment ci.

Page 17: Comments-Oriented Blog Summarization by Sentence Extraction

17

Reader, Quotation and Topic Measures

Topic measure : A hotly discussed topic has a large number of

comments all close to the topic cluster centroid. It tries to compute the importance of a topic clu

ster.

Page 18: Comments-Oriented Blog Summarization by Sentence Extraction

18

Reader, Quotation and Topic Measures

.

• |Ci| : the length of comment ci in number of words.• C : the set of comments.• sim(ci, tu) : the cosine similarity between commen

t

ci and the centroid of topic cluster tu.

.

• ci t∈ u : comment ci is clustered into topic cluster tu.

Page 19: Comments-Oriented Blog Summarization by Sentence Extraction

19

Word Representativeness Score

The representativeness score of a word Rep(wk) is the combination of reader, quotation and topic measures in ReQuT model.

Page 20: Comments-Oriented Blog Summarization by Sentence Extraction

20

Word Representativeness Score

. α, β and γ are the coefficients. 0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0.

As both readers and bloggers have no control on authority measure and very minimum control on quotation and topic measure, so the ReQuT is less sensitive to spam comments.

Page 21: Comments-Oriented Blog Summarization by Sentence Extraction

21

Sentence Selection

Two sentence selection methods : Density-based selection (DBS).

• DBS was proposed to rank and select sentences in question answering.

Summation-based selection (SBS).• SBS proposed in this paper, gives a higher represe

ntativeness score to a sentence if it contains more representative words.

Page 22: Comments-Oriented Blog Summarization by Sentence Extraction

22

Sentence Selection

Density-based selection (DBS) : Here adopted DBS in the problem by treating

words appearing in comments as keywords and the rest non-keywords.

.

• K : the total number of keywords contained in si.• Score(wj) : the score of keyword wj.• distance(wj, wj+1) : the number of non-keywords b

etween two adjacent keywords wj and wj+1 in si.

Page 23: Comments-Oriented Blog Summarization by Sentence Extraction

23

Sentence Selection

Summation-based selection (SBS) : SBS does not favor long sentences by conside

ring the number of words in a sentence.

.

• |si| : the length of sentence si in number of words.• τ (τ > 0) : a parameter to flexibly control the contrib

ution of a word’s representativeness score.

Page 24: Comments-Oriented Blog Summarization by Sentence Extraction

24

User Study and Experiments

Here collected data from two famous blogs, i.e., Cosmic Variance and IEBlog, both having relatively large readership and being widely commented.

Page 25: Comments-Oriented Blog Summarization by Sentence Extraction

25

User Study

With 10 posts randomly picked up from each of the two blogs and 3 human summarizers recruited from Computer Engineering students.

The hypothesis is that one’s understanding about a blog post does not change after he or she read the comments associated with the post.

Page 26: Comments-Oriented Blog Summarization by Sentence Extraction

26

User Study

The user study was conducted in two phrases : provided 3 summarizers the 20 blog posts, and

asked the summarizers to select approximately 30% of sentences from each post as its summary.

phrase 1 : without comments (RefSet-1). phrase 2 : provided the nearly 1000 comment

s

associated with the 20 posts (RefSet-2).

Page 27: Comments-Oriented Blog Summarization by Sentence Extraction

27

User Study

For each human summarizer, it computed the level of self-agreement.

Self-agreement level is defined by the percentage of sentences labeled in both reference sets against sentences in RefSet-1 by the same summarizer.

Page 28: Comments-Oriented Blog Summarization by Sentence Extraction

28

Experimental Results

As the sentences are labeled after reading comments, RefSet-2 was used to evaluate the two sentence selection methods with four word representativeness measures.

Here adopted R-Precision and NDCG as performance metrics.

Page 29: Comments-Oriented Blog Summarization by Sentence Extraction

29

Experimental Results

In the experiment : in SBS : τ=0.2 in combining reader, quotation and topic mea

sure : α=β=γ=0.33