Exploring Traversal Strategy for Web Forum Crawling
Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai
Microsoft Research Asia, Beijing
SIGIR 08
2009. 1. 14
Summarized and Presented by Gihyun Gong, IDS Lab.
Copyright 2008 by CEBT
Outline
• Motivation
• Web Forum Structure
• Our System
  • Crawling Process
  • Traversal Strategy
    – Skeleton link identification
    – Page-flipping link detection
• Evaluation
What is the Web Forum?
• Forum structures
  • User Groups
    – Posters
    – Moderators
    – Admin
  • Post
  • Thread
Characteristics
• Structure
  • Thread-based (not page-based)
  • Many related/shortcut links
  • Pre-defined templates
  • Links with permission control
• Contents
  • User-created content
  • Changes frequently
  • Weight and quality of 'Reply'
  • Suitable for discussing a topic
Why do we crawl the Web Forum?
• Web forums are a huge resource of human knowledge
  • Over 20% of search results are from web forums
  • Can lead to related outer links (sites)
  • Topic-based structure
[Figure: a Web Forum organized into Topics, each containing Posts]
The Limitation of Generic Crawlers
• In general crawling, each page is treated independently, and each link is treated indiscriminately
  • Leads to more than 50% useless pages
  • Ignores the relationships between pages from the same thread
• Forum crawling needs a site-level perspective and a careful selection of links
Related Work and Limitations
• Basic strategy
  • Breadth-first
    – Hard to select a proper crawling depth for each site
  • Shallow/deep crawling strategy
    – Shallow crawling cannot ensure access to all valuable content; deep crawling may result in too many duplicated and invalid pages
• Enhanced strategy
  • On-line page importance computation
    – Calculates a partial PageRank to estimate the importance of a page
    – Not suitable for forum sites
  • Deep web crawling
    – Forums are a kind of deep Web, but this kind of crawler focuses on how to prepare appropriate queries to probe
  • Focused crawling
    – Semantic topics in forums are too diverse to be simply characterized with a list of terms
Web Forum Structure – Sitemap
• The sitemap is a directed graph
  • A set of vertices and links
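The directed-graph view of a sitemap can be sketched as a small data structure. This is only an illustration: the class names `Link` and `Sitemap`, their fields, and the example URL patterns are assumptions made for this sketch, not taken from the paper. It assumes that each vertex stands for one kind of page layout and that each link carries a URL pattern and a location on the page, as the Sitemap Construction slide describes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A typed edge between two layout vertices (field names are illustrative)."""
    source: str       # layout vertex the link appears on
    target: str       # layout vertex the link leads to
    url_pattern: str  # generalized URL form, e.g. "/t/{id}" (hypothetical)
    location: str     # where on the page the link sits (hypothetical label)


@dataclass
class Sitemap:
    """Directed graph: vertices are page layouts, edges are typed links."""
    vertices: set = field(default_factory=set)
    out_links: dict = field(default_factory=lambda: defaultdict(list))

    def add_link(self, link: Link) -> None:
        # Register both endpoints and store the edge under its source vertex.
        self.vertices.update((link.source, link.target))
        self.out_links[link.source].append(link)

    def successors(self, vertex: str) -> set:
        # Layout vertices reachable from `vertex` in one step.
        return {link.target for link in self.out_links.get(vertex, [])}
```

Keeping the URL pattern and location on the edge, rather than on the vertex, lets several different link types connect the same pair of layouts.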
Web Forum Structure – Skeleton Link
• The most important links supporting the structure
Web Forum Structure – Page Flipping
• A kind of loop-back link in the sitemap
Site-Level Perspective
• Understand the organization structure
• Find an optimal traversal strategy
[Figure: The site-level perspective of "forums.asp.net", with vertices Entry, Login Portal, List-of-Board, List-of-Thread, Post-of-Thread, Browse-by-Tag, Search Result, and Digest]
Crawling Process
Random Sampling → Sitemap Construction → Traversal Strategy Exploring → Crawling
Random Sampling
• Adopts a combined strategy of breadth-first and depth-first using a double-ended queue
• Tries to cover as many unseen URL patterns as possible
Random Sampling
• Randomly sample some pages from a given site
  • Push as many unseen URLs as possible to the queue, and randomly pop a URL from the front or the end of the queue
  • Aim to cover as many unseen URL patterns as possible
• 1,000 pages will be enough
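The double-ended queue sampling described above can be sketched as follows. This is a minimal illustration under assumptions: `sample_pages` and its `fetch_links` callback are hypothetical names, the 50/50 choice between the two ends of the queue is a guess (the slides only say the pop is random), and real fetching, politeness delays, and URL normalization are left out.

```python
import random
from collections import deque


def sample_pages(entry_url, fetch_links, max_pages=1000, rng=None):
    """Sample up to max_pages URLs by mixing breadth-first and depth-first order.

    fetch_links(url) is a caller-supplied function (hypothetical) that
    returns the out-link URLs found on the page at `url`.
    """
    rng = rng or random.Random()
    queue = deque([entry_url])
    seen = {entry_url}
    sampled = []
    while queue and len(sampled) < max_pages:
        # New URLs are appended at the end, so popping from the front takes
        # the oldest URL (breadth-first) and popping from the end takes the
        # newest URL (depth-first).
        url = queue.popleft() if rng.random() < 0.5 else queue.pop()
        sampled.append(url)
        for link in fetch_links(url):
            if link not in seen:  # only unseen URLs enter the queue
                seen.add(link)
                queue.append(link)
    return sampled
```

Mixing the two orders keeps the sample from staying near the entry page (pure breadth-first) or disappearing down a single branch (pure depth-first), which fits the stated goal of seeing many different URL patterns within a small budget.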
Sitemap Construction
• Utilizes the repetitive regions to characterize the content layout of each page
• Represents links with their location and URL patterns
Traversal Strategy Exploring
• Skeleton Link Identification
• Page-Flipping Link Detection
Skeleton Links
• Skeleton links are the most important links supporting the structure of a forum site
  • Crawlers crawl as many unique pages as possible in a given forum site by following skeleton links
  • Skeleton links point to all valuable pages without introducing redundant and valueless ones
• How to identify?
  • Aim at all unique pages without duplicates
  • An optimal set of skeleton links leads to the most unique pages and few duplicates
  • Search skeleton links for each valuable vertex
    – Level by level: inspired by user browsing behavior
    – Find an optimal combination of links
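The idea of choosing links that reach many unique pages with few duplicates can be illustrated with a simple greedy selection. This is not the paper's optimization procedure: the function `pick_skeleton_links`, the `min_gain_ratio` threshold, and the link-type names in the usage note are all assumptions made for this sketch. It assumes the sampled pages have already been reduced to content IDs, so that two URLs showing the same content share one ID.

```python
from typing import Dict, List, Set


def pick_skeleton_links(reach: Dict[str, List[str]],
                        min_gain_ratio: float = 0.5) -> List[str]:
    """Greedily pick link types that add many unique pages and few duplicates.

    reach maps a link type to the content IDs of the pages it led to in the
    sample, duplicates included.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(reach)
    while remaining:
        # Keep only link types whose fetches would mostly be new content.
        candidates = {}
        for name, pages in remaining.items():
            new_pages = set(pages) - covered
            if pages and len(new_pages) / len(pages) >= min_gain_ratio:
                candidates[name] = new_pages
        if not candidates:
            break
        # Take the candidate that contributes the most unseen pages.
        best = max(candidates, key=lambda name: len(candidates[name]))
        chosen.append(best)
        covered |= candidates[best]
        del remaining[best]
    return chosen
```

For example, if a hypothetical `thread_title` link and a `last_post` shortcut both lead to the same four threads, only one of them is kept, because the second adds no unseen content once the first has been chosen.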
How to Identify Skeleton Links
• Coverage
• Informativeness
Page-Flipping Links
• Crawlers can completely download a long discussion thread divided into several pages by following page-flipping links
• Page-flipping links are a kind of loop-back link in the sitemap. However, not all loop-back links are page-flipping ones
• How to detect?
  • For page-flipping links, if there is a path from page A to B, there must be a path following the same type of links from B to A
  • Page-flipping links have a larger connectivity score
[Figure: Page A and Page B connected by hyperlinks]
An illustration of the characteristics of page-flipping links
Connectivity = 722 / 890 = 0.81
Connectivity = 108 / 1153 = 0.09
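One way to read the connectivity score is as the fraction of links of a given type for which the way back exists through links of the same type. The sketch below implements that reading; it is an assumption, since the slides do not state exactly what the two counts in each ratio measure. The function name `connectivity` and the edge-list input format are hypothetical.

```python
from collections import defaultdict, deque


def connectivity(edges):
    """Fraction of links A -> B for which B can reach A again.

    edges is a list of (source_page, target_page) pairs, all belonging to
    one link type; the return path may only use edges from this same list.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)

    def reachable(start, goal):
        # Breadth-first search restricted to this link type's edges.
        seen, todo = {start}, deque([start])
        while todo:
            node = todo.popleft()
            if node == goal:
                return True
            for nxt in graph[node] - seen:
                seen.add(nxt)
                todo.append(nxt)
        return False

    if not edges:
        return 0.0
    returning = sum(1 for a, b in edges if reachable(b, a))
    return returning / len(edges)
```

Under this reading, a "previous/next page" link type scores close to 1 because every step can be undone, while a link type such as "related threads" scores low. A link type would then be treated as page-flipping when its score is high, like the 0.81 case on the slide, and rejected when it is low, like the 0.09 case.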
Crawling
• Map a new page to an existing layout vertex
• Follow the traversal strategy for out-links
Crawling
• From the given entry page
  • Map a new page to an existing layout vertex
  • Follow the explored traversal strategy for out-links from that page
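The crawling step can be sketched as a loop that classifies each fetched page and then follows only the link types the strategy allows for that layout. All names here are hypothetical: `crawl`, the `fetch` and `classify` callbacks, and the shape of `strategy` (a mapping from layout vertex to the set of link types to follow) are assumptions for illustration, and the plain first-in-first-out order is a simplification.

```python
from collections import deque


def crawl(entry_url, fetch, classify, strategy, max_pages=10000):
    """Crawl from entry_url, following only links the strategy selects.

    fetch(url)           -> list of (link_type, target_url) out-links (hypothetical)
    classify(url, links) -> name of the layout vertex the page maps to (hypothetical)
    strategy             -> dict: layout vertex -> set of link types to follow,
                            i.e. its skeleton and page-flipping links
    """
    queue = deque([entry_url])
    seen = {entry_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        links = fetch(url)
        vertex = classify(url, links)      # map the new page to a layout vertex
        visited.append((url, vertex))
        allowed = strategy.get(vertex, set())
        for link_type, target in links:
            # Out-links outside the strategy (sort orders, profiles, ...) are dropped.
            if link_type in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

Because the strategy is looked up per layout vertex, the same link type can be followed on a thread list and ignored on a post page, which is what a site-level perspective makes possible.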
Experimental Setup
• Conducted experiments on eight forums from diverse categories
  • Mirror pages: crawled by a real commercial crawler
  • Structure-driven: crawled by the structure-driven crawler proposed in SIGIR'06
  • Our method: crawled by a crawler using our traversal strategy
Evaluation Criteria
• Coverage
• Informativeness
Effectiveness and Efficiency
• Effectiveness
Effectiveness and Efficiency
• Efficiency
Evaluation of Page-Flipping Detection
Conclusions
• A complete solution is proposed to automatically explore an appropriate traversal strategy for a given target forum site
  • Skeleton link identification
  • Page-flipping link detection
• Future work directions
  • Incremental crawling
  • Forum page segmentation