Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR 08 2009. 1. 14 Summarized and Presented by Gihyun Gong, IDS Lab.


Exploring Traversal Strategy for Web Forum Crawling

Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai
Microsoft Research Asia, Beijing

SIGIR 08

2009. 1. 14
Summarized and Presented by Gihyun Gong, IDS Lab.


Outline
• Motivation
• Web Forum Structure
• Our System
  – Crawling Process
  – Traversal Strategy
    – Skeleton link identification
    – Page-flipping link detection
• Evaluation



What is the Web Forum?
• Forum structures
  – User Groups: Posters, Moderators, Admin
  – Post
  – Thread


Characteristics
• Structure
  – 'Thread' based (not page based)
  – Many related/shortcut links
  – Pre-defined templates
  – Links with permission control
• Contents
  – User-created content
  – Changes frequently
  – Weight and quality of 'Reply'
  – Suitable for discussing a topic


Why Crawl Web Forums?
• Web forums are a huge resource of human knowledge
  – Over 20% of search results are from web forums
  – Can yield related external links (sites)
  – Topic-based structure

[Figure: a web forum organized into topics, each containing posts]


The Limitation of Generic Crawlers
• In general crawling, each page is treated independently, and each link is treated indiscriminately
  – Leads to more than 50% useless pages
  – Ignores the relationships between pages from the same thread
• Forum crawling needs a site-level perspective and a careful selection of links


Related Work and Limitations
• Basic strategies
  – Breadth-first: hard to select a proper crawling depth for each site
  – Shallow/deep crawling: shallow crawling cannot ensure access to all valuable content; deep crawling may result in too many duplicated and invalid pages
• Enhanced strategies
  – On-line page importance computation: calculates a partial PageRank to estimate the importance of a page; not suitable for forum sites
  – Deep web crawling: forums are a kind of deep Web, but this kind of crawler focuses on how to prepare appropriate queries to probe
  – Focused crawling: semantic topics in forums are too diverse to be simply characterized with a list of terms


Web Forum Structure – Sitemap
• The sitemap is a directed graph
  – A set of vertices and links


Web Forum Structure – Skeleton Link
• The most important links supporting the structure


Web Forum Structure – Page Flipping
• A kind of loop-back link in the sitemap


Site-Level Perspective
• Understand the organization structure
• Find an optimal traversal strategy

[Figure: the site-level perspective of "forums.asp.net", with vertices Entry, Login Portal, List-of-Board, List-of-Thread, Post-of-Thread, Browse-by-Tag, Search Result, and Digest]


Process
1. Random Sampling
2. Sitemap Construction
3. Traversal Strategy Exploring
4. Crawling


Random Sampling
• Adopts a combined breadth-first and depth-first strategy using a double-ended queue
• Tries to cover as many unseen URL patterns as possible


Random Sampling
• Randomly sample some pages from a given site
  – Push as many unseen URLs as possible onto the queue, then randomly pop a URL from the front or the end of the queue
  – Goal: cover as many unseen URL patterns as possible
• 1,000 pages will be enough
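
The sampling loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `url_pattern` helper (digits replaced by a placeholder), the `get_links` callback, and the toy link graph are all assumptions made for the example, and a real sampler would fetch and parse pages instead.

```python
import random
import re
from collections import deque


def url_pattern(url):
    """Hypothetical URL pattern: replace every run of digits with a placeholder."""
    return re.sub(r"\d+", "{n}", url)


def sample_pages(entry_url, get_links, budget=1000, seed=0):
    """Sample up to `budget` pages, popping randomly from either end of a deque."""
    rng = random.Random(seed)
    queue = deque([entry_url])
    seen = {entry_url}   # URLs already pushed onto the queue
    sampled = []         # URLs in the order they were visited
    patterns = set()     # distinct URL patterns covered so far

    while queue and len(sampled) < budget:
        # Popping from the front behaves like breadth-first search,
        # popping from the end behaves like depth-first search.
        url = queue.popleft() if rng.random() < 0.5 else queue.pop()
        sampled.append(url)
        patterns.add(url_pattern(url))

        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return sampled, patterns


# Toy link graph standing in for a real forum site (illustrative data only).
TOY_SITE = {
    "/index": ["/board/1", "/board/2", "/login"],
    "/board/1": ["/thread/10", "/thread/11", "/index"],
    "/board/2": ["/thread/20", "/index"],
    "/thread/10": ["/thread/10?page=2", "/board/1"],
}

pages, patterns = sample_pages("/index", lambda u: TOY_SITE.get(u, []))
print(len(pages), sorted(patterns))
```

Mixing front and back pops lets the sampler reach deep pages (such as later pages of a thread) without giving up the broad coverage of a breadth-first scan, which is one plausible reading of why a double-ended queue is used.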


Sitemap Construction
• Utilizes the repetitive regions to characterize the content layout of each page
• Represents links with their location and URL patterns
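
As a rough sketch of the resulting data structure, the sitemap can be held as a set of layout vertices plus a counter of link types between them. Everything below is an assumption for illustration: the layout label of each page is taken as given (the slides derive it from repetitive regions, which is not implemented here), and `url_pattern`, the location names, and the sample data are hypothetical.

```python
import re
from collections import Counter


def url_pattern(url):
    """Hypothetical URL pattern: replace every run of digits with a placeholder."""
    return re.sub(r"\d+", "{n}", url)


def build_sitemap(pages):
    """Build a directed graph over layout vertices.

    `pages` maps URL -> {"layout": str, "links": [(location, target_url), ...]}.
    Returns (vertices, edges), where `edges` counts how often each
    (source_layout, location, url_pattern, target_layout) link type occurs.
    """
    vertices = {page["layout"] for page in pages.values()}
    edges = Counter()
    for page in pages.values():
        for location, target in page["links"]:
            if target not in pages:
                continue  # target was not sampled, so its layout is unknown
            edge = (page["layout"], location, url_pattern(target),
                    pages[target]["layout"])
            edges[edge] += 1
    return vertices, edges


# Illustrative sampled pages with layout labels assumed to be known already.
SAMPLED = {
    "/board/1": {"layout": "list-of-thread",
                 "links": [("main-table", "/thread/10"),
                           ("main-table", "/thread/11"),
                           ("footer", "/login")]},
    "/thread/10": {"layout": "post-of-thread",
                   "links": [("nav-bar", "/board/1")]},
    "/thread/11": {"layout": "post-of-thread",
                   "links": [("nav-bar", "/board/1")]},
}

vertices, edges = build_sitemap(SAMPLED)
for edge, count in sorted(edges.items()):
    print(edge, count)
```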


Traversal Strategy Exploring
• Skeleton Link Identification
• Page-Flipping Link Detection


Skeleton Links
• Skeleton links are the most important links supporting the structure of a forum site
• Crawlers crawl as many unique pages as possible in a given forum site by following skeleton links
• Skeleton links point to all valuable pages without introducing redundant and valueless pages

How to Identify?
• Aim at all unique pages without duplicates
• An optimal set of skeleton links leads to the most unique pages and the fewest duplicates
• Search skeleton links for each valuable vertex
  – Level by level: inspired by user browsing behavior
  – Find an optimal combination of links


How to Identify Skeleton Links
• Coverage
• Informativeness


Page-Flipping Links
• Crawlers can completely download a long discussion thread that is divided into several pages by following page-flipping links
• Page-flipping links are a kind of loop-back link in the sitemap. However, not all loop-back links are page-flipping links

How to Detect?
• For page-flipping links, if there is a path from page A to page B, there must be a path following the same type of links from B to A
• Page-flipping links have a larger connectivity score

[Figure: Page A and Page B connected by hyperlinks]


An illustration of the characteristics of page-flipping links

Connectivity = 722 / 890 = 0.81

Connectivity = 108 / 1153 = 0.09
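
The slides do not spell out how the connectivity score is computed. One simple reading, used here purely as an assumption, is the fraction of links of a given type whose reverse link of the same type also exists. The sketch below checks only direct links rather than longer paths, and the page names are made-up sample data.

```python
def connectivity(links):
    """Fraction of directed links (a, b) whose reverse (b, a) is also present.

    This direct-link definition is an assumption for illustration; the
    slides only state that page-flipping links have a larger score.
    """
    link_set = set(links)
    if not link_set:
        return 0.0
    reciprocated = sum(1 for a, b in link_set if (b, a) in link_set)
    return reciprocated / len(link_set)


# Illustrative loop-back links among pages that share one layout vertex.
page_flipping = [("t10?p=1", "t10?p=2"), ("t10?p=2", "t10?p=1"),
                 ("t10?p=2", "t10?p=3"), ("t10?p=3", "t10?p=2")]
related_threads = [("t10", "t11"), ("t11", "t12"),
                   ("t12", "t13"), ("t13", "t10")]

print(connectivity(page_flipping))    # every link is reciprocated
print(connectivity(related_threads))  # no link is reciprocated

# The two ratios quoted on the slide, rounded to two decimals.
print(round(722 / 890, 2), round(108 / 1153, 2))
```

Under this reading, a link type with a ratio near the slide's 0.81 would be a page-flipping candidate and one near 0.09 would not; the cut-off between them is not given in the slides.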


Crawling
• Map a new page to an existing layout vertex
• Follow the traversal strategy for out-links


Crawling
• Start from the given entry page
• Map a new page to an existing layout vertex
• Follow the explored traversal strategy for out-links from that page
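
The crawling step above could look roughly like this. It is a sketch under stated assumptions: `layout_of` is a hypothetical stand-in for mapping a page to its layout vertex (here it just inspects the URL), the `strategy` table of allowed URL patterns per vertex is hand-written rather than learned, and the toy site replaces real page fetching.

```python
import re
from collections import deque


def url_pattern(url):
    """Hypothetical URL pattern: replace every run of digits with a placeholder."""
    return re.sub(r"\d+", "{n}", url)


def layout_of(url):
    """Hypothetical stand-in for mapping a page to an existing layout vertex."""
    if url == "/index":
        return "list-of-board"
    if url.startswith("/board/"):
        return "list-of-thread"
    if url.startswith("/thread/"):
        return "post-of-thread"
    return "other"


def crawl(entry_url, get_links, strategy):
    """Crawl from the entry page, following only links the strategy allows."""
    queue = deque([entry_url])
    visited = {entry_url}
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        # Look up which link patterns to follow from this layout vertex.
        allowed = strategy.get(layout_of(url), set())
        for link in get_links(url):
            if link not in visited and url_pattern(link) in allowed:
                visited.add(link)
                queue.append(link)
    return order


# Hand-written strategy (assumed): skeleton links plus one page-flipping link.
STRATEGY = {
    "list-of-board": {"/board/{n}"},
    "list-of-thread": {"/thread/{n}"},
    "post-of-thread": {"/thread/{n}?page={n}"},
}

# Toy link graph standing in for a real forum site (illustrative data only).
TOY_SITE = {
    "/index": ["/board/1", "/login", "/search?q=1"],
    "/board/1": ["/thread/10", "/index", "/login"],
    "/thread/10": ["/thread/10?page=2", "/board/1", "/user/5"],
    "/thread/10?page=2": ["/thread/10", "/user/6"],
}

crawled = crawl("/index", lambda u: TOY_SITE.get(u, []), STRATEGY)
print(crawled)
```

In this toy run the login, search, and user-profile links are never followed, because no vertex lists their patterns in the strategy table.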


Experimental Setup
• Conducted experiments on eight forums from diverse categories
• Mirror pages: crawled by a real commercial crawler
• Structure-driven: crawled by the structure-driven crawler proposed in SIGIR'06
• Our method: crawled by a crawler using our traversal strategy


Evaluation Criteria
• Coverage
• Informativeness


Effectiveness and Efficiency
• Effectiveness


Effectiveness and Efficiency
• Efficiency


Evaluation of Page-Flipping Detection


Conclusions
• A complete solution is proposed to automatically explore an appropriate traversal strategy for a given target forum site
  – Skeleton link identification
  – Page-flipping link detection
• Future work directions
  – Incremental crawling
  – Forum page segmentation
