Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus


Page 1: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Automatic Blog Monitoring and Summarization

Ka Cheung “Richard” Sia

PhD Prospectus

Page 2: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

With/without organized access

Page 3: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Inaccessible?

[Figure: % of feeds vs. # of subscribers. x-axis: # of subscribers (1+, 20+, 50+, 1000+, 5000+); y-axis: % of feeds (0% to 100%). By AskJeeves]

Page 4: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Introduction

Organized access to blogs: full coverage; reflect changes quickly; filtered and organized presentation

Intended contributions: efficient techniques to harvest blogs; algorithms to monitor frequently changing data sources; algorithms to reconstruct implicit networks and compose topic summaries

Page 5: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Modules

Monitoring
Collection (future work)
Topic detection and tracking (future work)
Conclusion

Page 6: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Monitoring

Preliminary results

Page 7: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Framework

A central server monitors data source changes and provides succinct summaries to users

Page 8: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Overview

New challenges: content changes more rapidly, with recurring patterns; requirements are more time-sensitive

Modeling of posting updates
Definition of delay
Strategies for allocation and scheduling

Page 9: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Characteristics

Homogeneous Poisson model: λ(t) = λ at any t

Periodic inhomogeneous Poisson model: λ(t) = λ(t − nT), n = 1, 2, …
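The two posting models can be compared by simulation. The sketch below draws posting times by thinning: candidates are generated at a constant bounding rate and kept with probability λ(t) divided by that bound. The function names, the horizon of 500 periods, and the reuse of the rate 2 + 2 sin(2πt) from the later example slide are illustrative assumptions, not part of the original slides.

```python
import math
import random

def periodic_rate(t):
    """Periodic posting rate with period T = 1 (illustrative choice)."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

def sample_posts(rate, rate_max, horizon, rng):
    """Sample posting times on [0, horizon) by thinning.

    Candidates come from a homogeneous process with rate rate_max; each one
    is kept with probability rate(t) / rate_max, so rate_max must bound rate.
    """
    posts = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)  # gap to the next candidate
        if t >= horizon:
            return posts
        if rng.random() * rate_max < rate(t):
            posts.append(t)

rng = random.Random(0)
# Homogeneous model: constant rate, so every candidate is kept.
homogeneous = sample_posts(lambda t: 2.0, 2.0, 500.0, rng)
# Periodic inhomogeneous model: same mean rate, but posts cluster near the peaks.
periodic = sample_posts(periodic_rate, 4.0, 500.0, rng)
```

Both samples have the same average rate of 2 posts per period; only the periodic one concentrates posts at recurring times, which is what the scheduling slides later exploit.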

Page 10: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Definition of metrics

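Delay of a post published at time t_i and picked up at the next retrieval time τ_j:

$$D(t_i) = \tau_j - t_i$$

Delay of a data source O: sum of the elapsed time for every post

$$D(O) = \sum_{i=1}^{k} D(t_i)$$

Delay experienced by the aggregator A:

$$D(A) = \sum_{i=1}^{n} w_i D(O_i)$$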
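To make the three delay definitions concrete, here is a minimal sketch that computes them from observed posting times and retrieval times. The function names and the sample numbers are made up for illustration; the only assumption about the model is that each post is picked up by the first retrieval at or after its posting time.

```python
from bisect import bisect_left

def source_delay(post_times, retrieval_times):
    """D(O): sum over posts of (next retrieval time - posting time)."""
    retrievals = sorted(retrieval_times)
    total = 0.0
    for t in post_times:
        j = bisect_left(retrievals, t)  # index of the first retrieval at or after t
        if j == len(retrievals):
            raise ValueError(f"post at {t} is never retrieved")
        total += retrievals[j] - t
    return total

def aggregator_delay(sources, weights):
    """D(A): weighted sum of D(O_i) over (post_times, retrieval_times) pairs."""
    return sum(w * source_delay(posts, retrievals)
               for w, (posts, retrievals) in zip(weights, sources))

# Hypothetical data: two sources with weights 1 and 0.5.
o1 = ([1.0, 2.0, 5.0], [3.0, 6.0])  # delays 2 + 1 + 1 = 4
o2 = ([0.0], [4.0])                 # delay 4
total = aggregator_delay([o1, o2], [1.0, 0.5])  # 1 * 4 + 0.5 * 4
```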

Page 11: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Definition of metrics

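τ_j – retrieval time
λ(t) – posting rate

Expected delay for posts between consecutive retrievals τ_j and τ_{j+1}

Homogeneous Poisson model:

$$D(O) = \frac{\lambda\,(\tau_{j+1} - \tau_j)^2}{2}$$

Inhomogeneous Poisson model:

$$D(O) = \int_{\tau_j}^{\tau_{j+1}} \lambda(t)\,(\tau_{j+1} - t)\,dt$$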
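As a consistency check, the homogeneous expression is what the inhomogeneous integral gives when the rate is constant, λ(t) = λ:

```latex
\int_{\tau_j}^{\tau_{j+1}} \lambda\,(\tau_{j+1} - t)\,dt
  = \lambda \left[ -\frac{(\tau_{j+1} - t)^2}{2} \right]_{\tau_j}^{\tau_{j+1}}
  = \frac{\lambda\,(\tau_{j+1} - \tau_j)^2}{2}
```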

Page 12: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Problem formulation

Minimization of expected delay experienced by the aggregator under constraint of limited resources.

Schedule the τ_j’s such that

$$D(A) = \sum_{i=1}^{n} w_i D(O_i)$$

is minimized.

Page 13: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Approach

Resource allocation: how often to contact data sources?
  If O1 is more active than O2, how much more often should we contact O1 than O2?

Retrieval scheduling: when to contact a data source?
  If 3 retrievals are allocated to O1, when should these 3 retrievals be scheduled?

Page 14: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Resource allocation

Consider n data sources O1, …, On

λi – posting rate of Oi

wi – weight of Oi

N – total number of retrievals per day

mi – number of retrievals per day allocated to Oi

Optimal allocation:

$$m_i \propto \sqrt{w_i \lambda_i}$$
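A minimal sketch of the allocation step, assuming the optimal rule on this slide is the square-root rule m_i ∝ √(w_i λ_i). That rule minimizes Σ w_i λ_i / (2 m_i) subject to Σ m_i = N when each source posts at a constant rate and its retrievals are evenly spaced. The function names and sample numbers are illustrative, and the allocation is left fractional rather than rounded.

```python
import math

def allocate_retrievals(weights, rates, total):
    """Split `total` retrievals per day so that m_i is proportional to sqrt(w_i * rate_i)."""
    scores = [math.sqrt(w * lam) for w, lam in zip(weights, rates)]
    norm = sum(scores)
    if norm == 0:
        raise ValueError("at least one source needs a positive weight and posting rate")
    return [total * s / norm for s in scores]

def weighted_delay(weights, rates, allocation):
    """Sum of w_i * rate_i / (2 * m_i): expected delay with evenly spaced retrievals."""
    return sum(w * lam / (2.0 * m) for w, lam, m in zip(weights, rates, allocation))

# Hypothetical example: O2 posts four times as often as O1, equal weights, N = 30.
weights, rates = [1.0, 1.0], [1.0, 4.0]
sqrt_rule = allocate_retrievals(weights, rates, 30)  # O2 gets twice as many, not four times
proportional = [6.0, 24.0]                           # retrievals proportional to posting rate
uniform = [15.0, 15.0]                               # equal split
```

In this example a source that is four times as active receives only twice as many retrievals, and `weighted_delay` is lower for `sqrt_rule` than for either alternative split.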

Page 15: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Retrieval scheduling

If m retrievals per day are allocated to a data source O, how should we schedule them?

Two cases: m = 1 and m > 1

Page 16: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Single retrieval per period

λ(t) = 1 for t ∈ [0,1], λ(t) = 0 for t ∈ [1,2]; periodicity T = 2

τ = 0.5, expected delay = 1
τ = 1, expected delay = 0.5
τ = 2, expected delay = 1.5

Page 17: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Single retrieval per period

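For a data source with posting rate λ(t) and period T, the expected delay when retrieved at time τ is given by:

$$D(\tau) = \int_{0}^{\tau} \lambda(t)\,(\tau - t)\,dt + \int_{\tau}^{T} \lambda(t)\,(T + \tau - t)\,dt$$

Criteria for optimality:

$$\lambda(\tau) = \frac{1}{T} \int_{0}^{T} \lambda(t)\,dt \quad \text{and} \quad \frac{d\lambda(\tau)}{d\tau} < 0$$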
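The single-retrieval delay can be checked numerically. The sketch below evaluates D(τ) with a midpoint rule, assuming posts before τ wait until τ and posts after τ wait until τ in the next period. It is applied to the step-rate example from the previous slide (λ = 1 on [0,1], 0 on [1,2], T = 2); the function names and the step count are illustrative choices.

```python
def expected_delay(rate, tau, period, steps=20000):
    """Midpoint-rule estimate of D(tau) for one retrieval per period, 0 <= tau <= period."""
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # Posts before tau wait until tau; later posts wait until tau in the next period.
        wait = tau - t if t <= tau else period + tau - t
        total += rate(t) * wait * h
    return total

def step_rate(t):
    """Example rate: 1 on [0, 1), 0 on [1, 2)."""
    return 1.0 if t < 1.0 else 0.0

# Retrieving right after the active interval (tau = 1) beats retrieving at the
# end of the period (tau = 2).
delays = {tau: expected_delay(step_rate, tau, period=2.0) for tau in (1.0, 2.0)}
```

For this rate the estimates agree with the values quoted for τ = 1 and τ = 2, namely 0.5 and 1.5.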

Page 18: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Multiple retrievals per period

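When m retrievals per period are allocated and scheduled at times τ_1, …, τ_m, the expected delay is given by:

$$D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t)\,(\tau_{i+1} - t)\,dt, \qquad \tau_{m+1} = T + \tau_1$$

Criteria for optimality:

$$\lambda(\tau_j)\,(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$$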

Page 19: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Example

6 retrievals for λ(t)=2+2sin(2πt)

Criteria for optimality:

$$\lambda(\tau_j)\,(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$$
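The example can be explored with a short script. The sketch below evaluates the expected delay of a schedule for λ(t) = 2 + 2 sin(2πt) in closed form and then improves an evenly spaced schedule of 6 retrievals by a coordinate-wise grid search. The grid search is a simple stand-in chosen for illustration, not the method used in the slides; the function names, sweep count, and grid size are assumptions.

```python
import math

T = 1.0  # period of the example rate

def cum_rate(t):
    """An antiderivative of the rate 2 + 2 sin(2*pi*t)."""
    return 2.0 * t - math.cos(2.0 * math.pi * t) / math.pi

def cum_cum_rate(t):
    """An antiderivative of cum_rate."""
    return t * t - math.sin(2.0 * math.pi * t) / (2.0 * math.pi ** 2)

def segment_delay(a, b):
    """Integral of rate(t) * (b - t) over [a, b], via integration by parts."""
    return cum_cum_rate(b) - cum_cum_rate(a) - cum_rate(a) * (b - a)

def expected_delay(taus):
    """Expected delay per period for sorted retrieval times; the last segment wraps around."""
    total = 0.0
    for i, a in enumerate(taus):
        b = taus[i + 1] if i + 1 < len(taus) else taus[0] + T
        total += segment_delay(a, b)
    return total

def improve(taus, sweeps=50, grid=50):
    """Move one retrieval at a time to the best grid point between its neighbours."""
    taus = list(taus)
    m = len(taus)
    for _ in range(sweeps):
        for j in range(m):
            lo = taus[j - 1] if j > 0 else taus[-1] - T
            hi = taus[j + 1] if j + 1 < m else taus[0] + T
            # Keep the current position as a candidate so the delay never increases.
            candidates = [taus[j]] + [lo + (hi - lo) * k / (grid + 1)
                                      for k in range(1, grid + 1)]
            taus[j] = min(candidates,
                          key=lambda x: expected_delay(taus[:j] + [x] + taus[j + 1:]))
    return taus

uniform = [i / 6 for i in range(6)]
best = improve(uniform)
print(expected_delay(uniform), expected_delay(best))
```

Because every candidate stays strictly between its neighbours, the schedule remains sorted, and the improved schedule can be compared against the optimality criterion by evaluating both sides for each τ_j.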

Page 20: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Experiment

Data – 10k RSS feeds over Oct – Dec 2004

Page 21: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Performance

CGM03 – optimizes for “age”
Ours – both resource allocation and retrieval scheduling

Page 22: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Size of estimation window

Resource constraint: 4 retrievals per day per feed on average
An estimation window of 2 weeks is an appropriate choice

Page 23: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Predictability of posting rate

90% of the RSS feeds post consistently

Page 24: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Summaries and extensions

Summary: resource allocation is more aggressive; retrieval scheduling optimizes within an individual data source

Extensions: include user access patterns; variable retrieval cost

Page 25: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Collection

Future work

Page 26: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Collection

Blog hosting website
Central repository

~5.3M URLs from weblogs.com: limited and contaminated

Crawling: retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded

Domain               Count     Category
spaces.msn.com       839,663   Blog
blogspot.com         362,957   Blog
wretch.cc            116,161   Blog
search-net101.com     89,750   Spam/ads
abalty.com            86,329   Spam/ads
search-now854.com     80,109   Spam/ads
bigebiz.org           79,059   Spam/ads

Page 27: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Collection

Blogs are inter-connected (blogrolls)
Selectively following links, discovering hubs for blogs

[1] Chakrabarti et al., “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”, The International WWW Conference, 1999

Page 28: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Relinquishment of blogs

Detection of abandoned blogs to save resources

[2] D.R. Cox, “Regression models and life-tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research

Page 29: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Topic detection and tracking

Future work

Page 30: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Overview

Characteristics: document stream; traces of information propagation among blogs

Challenges: modeling growth and death of a topic; ranking of blog articles; malicious content

Page 31: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Influence network in blogs

Information is “diffused” among blogs

Indicator of popularity
Social relationships among bloggers

Page 32: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Influence network in blogs

Four major patterns of propagation

Reconstruction of implicit network
Ranking (source authority)
Advertising campaign

Page 33: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Data characteristics

~97–98% of daily content is new

Page 34: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Data characteristics

The same content lasts for ~8 days

Page 35: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Topics

Topics with different lifespans: bursty; mid-range; sustaining

Evolution of topics

[4] J. Kleinberg, “Bursty and Hierarchical Structure in Streams”, in SIGKDD 2002
[5] J. Kleinberg, “Temporal Dynamics of On-Line Information Streams”, Data Stream Management: Processing High-Speed Data Streams, Springer 2005

Page 36: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Document similarity

Sparse and diverse: ~400 articles clustered into 21 clusters out of 10,000 daily articles (by DBSCAN)

Page 37: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Framework

Document stream approach: filtering; aggregation

Page 38: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Problems

Selecting a representative subset of documents from a topic cluster: coverage; distinctiveness within the subset

Ranking of documents: time; source authority

Page 39: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Conclusion

1. Efficient collection of blogs and modeling the relinquishment

2. Monitoring and retrieval scheduling of rapidly changing data sources

3. Composing topic summaries
   1. Reconstruction of an implicit influence network
   2. Representative document selection problem

Page 40: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

End

Questions?

Page 41: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

More examples

Page 42: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

Major posting patterns

K-means clustering