Identifying and Ranking Topic Clusters in the Blogosphere Muhammad Atif Qureshi Korea Advanced Institute of Science and Technology Arjumand Younus Korea Advanced Institute of Science and Technology Muhammad Saeed University of Karachi Nasir Touheed Institute of Business Administration


Slides presented at the COLING 2010 workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources.


Page 1: Identifying and ranking topic clusters in the blogosphere

Identifying and Ranking Topic Clusters in the Blogosphere

Muhammad Atif Qureshi, Korea Advanced Institute of Science and Technology

Arjumand Younus, Korea Advanced Institute of Science and Technology

Muhammad Saeed, University of Karachi

Nasir Touheed, Institute of Business Administration

Page 2: Identifying and ranking topic clusters in the blogosphere

Outline

1. Introduction
2. Approach
3. Experiments and Results
4. Conclusions

COLING 2010 CCSR WORKSHOP

Page 3: Identifying and ranking topic clusters in the blogosphere

Web 1.0 to Web 2.0

Paradigm shift
  From a read-only Web to a read-write Web
  Increased user participation

User-generated content
  Wikis (Wikipedia, Wiktionary)
  Social networking sites (Facebook, Myspace, Twitter)
  Digital media sharing websites (YouTube, Flickr)
  Blogs (Blogspot, Wordpress)


Page 4: Identifying and ranking topic clusters in the blogosphere

The Blogosphere

Blogs empower people to voice their opinions and share their ideas.

Bloggers can also link to other blogs, forming a social network of bloggers who share an interest in the same topics.

How can we identify these topic clusters?
Who is the most influential blogger in a given cluster?


Page 7: Identifying and ranking topic clusters in the blogosphere

Problem Definition

Given the blogosphere, with blogs containing diverse information on a broad range of topics:
  Find the cluster of blogs to read that share an interest in some particular topic.
  Which blog holds the greatest influence for that particular topic?


Page 9: Identifying and ranking topic clusters in the blogosphere

Link Based Methods for the Blogosphere

Link-based methods don’t work well for the blogosphere
  Blog pages are only weakly linked
  Blog posts need some time to acquire in-links
  Bloggers try to exploit link-based methods by acting as spammers


Page 11: Identifying and ranking topic clusters in the blogosphere

Blog Communities vs. Topic Clusters

Blog community
  Discovered by following the discussions in blog threads

Topic clusters
  The role of blogs as a conversational medium has diminished
  Bloggers interested in a specific topic form a socially linked network with other bloggers writing about the same topic


Page 12: Identifying and ranking topic clusters in the blogosphere

Blog Dimensions

A blog is considered along three dimensions:
  Part of speech
  Occurrence
  Blog post number


Page 13: Identifying and ranking topic clusters in the blogosphere

Topic Discussion Isolation Rank

Metric used to discover the topic clusters
Based on a set of given topic words and some linguistic rules

We define the TDIR score of a blog as follows:

$$\mathrm{TDIR} = \frac{1 + (n_{noun} \times w_{n}) + (n_{adjective} \times w_{adj}) + (n_{adverb} \times w_{adv})}{\text{Number of total posts}}$$

$n_{noun}$, $n_{adjective}$ and $n_{adverb}$ are, respectively, the number of times a noun, adjective or adverb for a specific topic is found in all the blog posts

$w_{n}$, $w_{adj}$ and $w_{adv}$ are the respective weights assigned to the noun, adjective and adverb for a specific topic
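The TDIR definition on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the constant 1 belongs to the numerator of the fraction, and the `TopicCounts` container, the function name `tdir` and the default weights are hypothetical, since the slides give no concrete weight values.

```python
from dataclasses import dataclass


@dataclass
class TopicCounts:
    """Topic-word hits summed over all posts of one blog (hypothetical container)."""
    nouns: int
    adjectives: int
    adverbs: int


def tdir(counts, total_posts, w_n=1.0, w_adj=1.0, w_adv=1.0):
    """Assumed reading of the slide: (1 + weighted topic-word hits) / number of posts.

    The default weights of 1.0 are placeholders, not values from the slides.
    """
    if total_posts <= 0:
        raise ValueError("a blog needs at least one post")
    weighted_hits = (counts.nouns * w_n
                     + counts.adjectives * w_adj
                     + counts.adverbs * w_adv)
    return (1 + weighted_hits) / total_posts


# Made-up blog: 20 posts with 30 noun, 10 adjective and 5 adverb hits for the topic.
example_score = tdir(TopicCounts(30, 10, 5), total_posts=20,
                     w_n=1.0, w_adj=0.5, w_adv=0.25)
print(example_score)
```

Dividing by the number of posts means the score reflects how densely a blog discusses the topic rather than how large the blog is.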


Page 14: Identifying and ranking topic clusters in the blogosphere

Topic Discussion Rank

Metric used to rank the blogs within a topic cluster
Based on the hyperlinked social network of blogs and on blog post contents

We define the TDR score of a blog as follows:

$$\mathrm{TDR}(b) = \begin{cases} \mathrm{TDIR}_b, & \text{if zero outlinks from blog } b \\ \mathrm{TDIR}_b + \sum_{o:(o,b)} \left[ \dfrac{\text{Matching\_Outlinks}}{\text{Total\_Outlinks}} \times \mathrm{TDIR}_o \times damp \right], & \text{otherwise} \end{cases}$$

Matching_Outlinks represents the outlinks to blogs that are part of the topic cluster

o : (o,b) – outlinks from blog b

damp is the damping factor
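The TDR definition above can be sketched as a small Python function over a link graph. This is a sketch under stated assumptions: the ratio Matching_Outlinks / Total_Outlinks is taken per linking blog b, the sum runs only over outlinks that point into the topic cluster, and the default `damp=0.9` is borrowed from the damping example in these slides. The function name, the dictionary-based graph and the sample blogs are hypothetical.

```python
def tdr(blog, tdir_scores, outlinks, cluster, damp=0.9):
    """Topic Discussion Rank of `blog` under the assumptions stated above.

    tdir_scores: dict mapping blog name -> TDIR score
    outlinks:    dict mapping blog name -> list of blogs it links to
    cluster:     set of blogs that belong to the topic cluster
    """
    links = outlinks.get(blog, [])
    if not links:
        # Zero outlinks: TDR falls back to the blog's own TDIR.
        return tdir_scores[blog]
    matching = [o for o in links if o in cluster]
    ratio = len(matching) / len(links)
    return tdir_scores[blog] + sum(ratio * tdir_scores[o] * damp for o in matching)


# Made-up graph: A links to B (in the cluster) and C (outside it); B links to A.
scores = {"A": 2.0, "B": 1.0, "C": 0.5}
graph = {"A": ["B", "C"], "B": ["A"], "C": []}
topic_cluster = {"A", "B"}

for name in ("A", "B", "C"):
    print(name, round(tdr(name, scores, graph, topic_cluster), 2))
```

In this made-up graph, blog A gains only half of B's damped score because just one of its two outlinks stays inside the cluster.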


Page 15: Identifying and ranking topic clusters in the blogosphere

Role of Damping Factor

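Assume TDIR of blog A is 2 and TDIR of blog B is 1, and each blog's only outlink points to the other

TDR without damping factor
  A: 2 + (1/1 x 1) = 3
  B: 1 + (1/1 x 2) = 3

TDR with damping factor (damp = 0.9)
  A: 2 + (1/1 x 1 x 0.9) = 2.9
  B: 1 + (1/1 x 2 x 0.9) = 2.8

Without damping the two blogs tie; with damping, blog A, which has the higher TDIR, ranks first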
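The arithmetic of this example can be checked with a few lines of Python. This is a sketch that assumes "without damping" is equivalent to a damping factor of 1.0 and that each blog has exactly one outlink, pointing to the other blog; the helper name `tdr_two_blogs` is hypothetical.

```python
def tdr_two_blogs(tdir_self, tdir_other, damp):
    """TDR of a blog whose single outlink points to one in-cluster blog."""
    matching_outlinks, total_outlinks = 1, 1
    return tdir_self + (matching_outlinks / total_outlinks) * tdir_other * damp


for damp in (1.0, 0.9):
    tdr_a = tdr_two_blogs(2, 1, damp)  # blog A: TDIR 2, links to B
    tdr_b = tdr_two_blogs(1, 2, damp)  # blog B: TDIR 1, links to A
    print(f"damp={damp}: A={tdr_a:.1f}, B={tdr_b:.1f}")
```

With damp = 1.0 both blogs score 3.0, while with damp = 0.9 blog A scores 2.9 and blog B scores 2.8, so the damping factor keeps a blog's own content weighted above the score it inherits through links.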


Page 17: Identifying and ranking topic clusters in the blogosphere

Experimental Setup

Experimental data
  Real blog data collected by crawling the blogspot domain
  102 blog sites comprising 50,471 blog posts

Experimental topics
  “compute”, “democracy”, “secularism”, “bioinformatics”, “Haiti”, “Obama”


Page 18: Identifying and ranking topic clusters in the blogosphere

Experimental Measures

$$\text{Precision} = \frac{|C_a \cap C_t|}{|C_a|}$$

$$\text{Recall} = \frac{|C_a \cap C_t|}{|C_t|}$$

$C_a$ represents the topic cluster set found using our algorithm
$C_t$ represents the true topic cluster set
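The two measures can be computed directly with set operations. The following is a minimal sketch; the function name `precision_recall` and the blog names are made up for illustration, and returning 0.0 for an empty set is a convention chosen here, not something stated in the slides.

```python
def precision_recall(found, truth):
    """Precision and recall of a found cluster (Ca) against the true cluster (Ct)."""
    found, truth = set(found), set(truth)
    overlap = len(found & truth)  # |Ca ∩ Ct|
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(truth) if truth else 0.0
    return precision, recall


# Made-up clusters: the algorithm returns four blogs, three of which are correct.
c_a = {"blog1", "blog2", "blog3", "blog4"}
c_t = {"blog1", "blog2", "blog3"}
print(precision_recall(c_a, c_t))
```

In this made-up case every true member is found, so recall is 1.0, while the one extra blog lowers precision to 0.75.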


Page 19: Identifying and ranking topic clusters in the blogosphere

Experimental Results - Precision

Average precision found to be 0.87


Page 20: Identifying and ranking topic clusters in the blogosphere

Experimental Results - Recall

Average recall found to be 0.971


Page 22: Identifying and ranking topic clusters in the blogosphere

Conclusions

This work presents the concept of “topic clusters” to solve the blog categorization problem for the Information Retrieval domain.

The proposed method takes into account both blog posts’ content and link structure.

Natural language processing techniques incorporated into the method ensure high coverage.

The method was evaluated using a real-world dataset from the blogspot domain.


Page 23: Identifying and ranking topic clusters in the blogosphere

Thanks

Muhammad Atif Qureshi

[email protected]


Page 24: Identifying and ranking topic clusters in the blogosphere

Appendix


Page 25: Identifying and ranking topic clusters in the blogosphere

Additional Experiments

Experiment on topic “Obama” repeated with additional term “Democrats”
  Precision increased from 0.907 to 0.95
  Ranks of some blogs were higher than the ranks obtained previously

Two more experiments on fine-grained topics
  Healthcare bill: precision was found to be 0.857 and recall was 1; the additional term “obamacare” was used
  Avatar: precision was found to be 0.47 and recall was 1; additional terms had no effect
