iRobot: An Intelligent Crawler for Web Forums
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang
Microsoft Research, Asia
April 10, 2023
Outline
• Motivation & Challenge
• iRobot – Our Solution
  – System Overview
  – Module Details
• Evaluation
Why Web Forums Are Important
• Forums are a huge resource of human knowledge
  – Popular all over the world
  – Cover almost any conceivable topic and issue
• Forum data can benefit many applications
  – Improve the quality of search results
  – Enable various kinds of data mining on forum data
• Collecting forum data
  – Is the basis of all forum-related research
  – Is not a trivial task
Why Forum Crawling Is Difficult
• Duplicate pages
  – Forums have a complex in-site structure
  – Many shortcut links exist for browsing
• Invalid pages
  – Most forums have access control
  – Some pages can only be visited after registration
• Page-flipping
  – A long thread is shown across multiple pages
  – Navigation levels are deep
The Limitations of Generic Crawlers
• In general crawling, each page is treated independently
  – Crawling depth is fixed
  – Duplicates cannot be avoided before downloading
  – Many invalid pages, such as login prompts, are fetched
  – Relationships between pages from the same thread are ignored
• Forum crawling needs a site-level perspective!
Statistics on Some Forums
• Around 50% of crawled pages are useless
• This wastes both bandwidth and storage
What is a Site-Level Perspective?
• Understand the organization structure
• Find an optimal crawling strategy
[Figure: The site-level perspective of "forums.asp.net". Nodes: Entry, List-of-Board, List-of-Thread, Post-of-Thread, Browse-by-Tag, Search Result, Digest, Login Portal]
iRobot: An Intelligent Forum Crawler
[Figure: system overview. Labels: Crawler, General Web Crawling, Sitemap Construction, Traversal Path Selection, Forum Crawling, Segmentation & Archiving, Raw Pages Meta, Restart]
Outline
• Motivation & Challenge
• Our Solution – iRobot
  – System Overview
  – Module Details
    • How many kinds of pages? (Sitemap Construction)
    • How do these pages link with each other? (Sitemap Construction)
    • Which pages are valuable? (Traversal Path Selection)
    • Which links should be followed? (Traversal Path Selection)
• Evaluation
Page Clustering
• Forum pages are generated from a database and templates
• Layout is a robust way to describe a template
  – Repetitive regions are everywhere on forum pages
  – Layout can be characterized by repetitive regions
[Figure: panels (a), (b), (c), (d)]
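The idea of characterizing layout by repetitive regions can be sketched roughly as follows. This is a minimal illustration and not the clustering algorithm used by iRobot: it assumes each page has already been reduced to a list of tag-path strings, treats any path that repeats at least `min_repeats` times as a repetitive region, and groups pages greedily by Jaccard similarity of those regions. The function names, the tag-path representation, and both thresholds are assumptions made for this example.

```python
from collections import Counter


def repetitive_regions(tag_paths, min_repeats=3):
    """Return the tag paths that occur at least min_repeats times on one page."""
    counts = Counter(tag_paths)
    return frozenset(path for path, n in counts.items() if n >= min_repeats)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering.

    pages maps a page id to its list of tag paths. A page joins the first
    cluster whose representative signature is similar enough; otherwise it
    starts a new cluster.
    """
    clusters = []  # list of (representative signature, member page ids)
    for page_id, paths in pages.items():
        signature = repetitive_regions(paths)
        for representative, members in clusters:
            if jaccard(signature, representative) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((signature, [page_id]))
    return [members for _, members in clusters]
```

Pages rendered from the same template tend to share the same repeated paths, so they fall into one cluster even when the number of rows or posts differs.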
Link Analysis
• URL patterns can distinguish links, but are not reliable on all sites
• Location can also distinguish links
[Figure: example links on a page, labeled 1. Login, 4. Thread List, 5. Thread]
A Link = URL Pattern + Location
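One way to picture "A Link = URL Pattern + Location" is a composite key, sketched below. The slides do not say how URL patterns or locations are represented, so both are assumptions here: the pattern is the URL path with digit runs replaced by `*` plus the sorted query parameter names, and the location is an opaque string such as a DOM path supplied by the caller. The example.com URLs used with it are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Generalize a URL: digit runs in the path become '*', query values are dropped."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "*", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    return path + ("?" + "&".join(keys) if keys else "")


def link_key(url, location):
    """Identify a link type by its URL pattern together with where it sits on the page."""
    return (url_pattern(url), location)
```

With this key, two links that share a URL pattern but sit in different page regions (for example a sidebar shortcut and the main thread list) stay distinct, which is the case where the URL pattern alone is not reliable.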
Informativeness Evaluation
• Which kinds of pages (nodes) are valuable?
• Some heuristic criteria
  – A larger node is more likely to be valuable
  – Pages with a large size are more likely to be valuable
  – A diverse node is more likely to be valuable
    • Based on content de-duplication
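The three heuristics can be turned into simple per-node statistics, as in the sketch below. It is only an illustration under stated assumptions: a node is given as a list of page texts, diversity is measured as the fraction of distinct pages after exact-hash de-duplication (a stand-in for whatever content de-duplication iRobot uses), and the thresholds in `is_valuable` are arbitrary placeholder values.

```python
import hashlib


def node_stats(pages):
    """Compute node size, average page size, and diversity for one sitemap node.

    pages is a list of page text strings sampled for the node.
    """
    n = len(pages)
    if n == 0:
        return {"count": 0, "avg_size": 0.0, "diversity": 0.0}
    # Exact-hash de-duplication: identical texts collapse to one digest.
    unique = {hashlib.sha1(p.encode("utf-8")).hexdigest() for p in pages}
    return {
        "count": n,
        "avg_size": sum(len(p) for p in pages) / n,
        "diversity": len(unique) / n,
    }


def is_valuable(stats, min_count=3, min_avg_size=100, min_diversity=0.5):
    """Apply the three heuristics with hypothetical thresholds."""
    return (
        stats["count"] >= min_count
        and stats["avg_size"] >= min_avg_size
        and stats["diversity"] >= min_diversity
    )
```

A login-prompt node, whose pages are short and nearly identical, scores low on size and diversity, while a node of thread pages scores high on all three.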
Traversal Path Selection
• Clean the sitemap
  – Remove valueless nodes
  – Remove duplicate nodes
  – Remove links to valueless / duplicate nodes
• Find an optimal path
  – Construct a spanning tree
  – Use depth as cost
• User browsing behaviors
  – Identify page-flipping links
    • Number, Prev/Next
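The cleaning and spanning-tree steps could look roughly like the sketch below. It reads "use depth as cost" as "reach every remaining node at minimum depth from the entry page", which a breadth-first search provides; this reading, the function names, and the adjacency-list representation of the sitemap are assumptions for illustration, and page-flipping detection is left out.

```python
from collections import deque


def clean_sitemap(links, drop):
    """Remove the nodes in drop and every link that points to them.

    links maps a node name to the list of node names it links to.
    """
    return {
        src: [dst for dst in dsts if dst not in drop]
        for src, dsts in links.items()
        if src not in drop
    }


def shortest_path_tree(links, entry):
    """Breadth-first spanning tree rooted at entry.

    Returns (parent, depth): parent[n] is the node from which n is first
    reached, and depth[n] is the number of links followed from the entry.
    """
    parent = {entry: None}
    depth = {entry: 0}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return parent, depth
```

The tree edges are then the links worth following; any link outside the tree either leads to a dropped node or reaches an already covered node by a longer route.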
Evaluation Criteria
• Duplicate ratio
• Invalid ratio
• Coverage ratio
[Figure: three bar charts over the forums Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, Hoopchina: (1) Mirrored Pages vs. iRobot, y-axis 0% to 70%; (2) Mirrored Pages vs. iRobot, y-axis 0% to 25%; (3) Coverage ratio, y-axis 0% to 100%]
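The slides name the three criteria without defining them, so the sketch below fixes one plausible set of definitions purely as assumptions: duplicate and invalid ratios are fractions of all crawled pages, and coverage is the fraction of valuable pages in a full mirror that the crawl also retrieved. The label strings and the function name are hypothetical.

```python
from collections import Counter


def crawl_ratios(crawled, valuable_in_mirror):
    """Compute duplicate, invalid, and coverage ratios for one crawl.

    crawled maps a page id to one of "valuable", "duplicate", "invalid".
    valuable_in_mirror is the set of valuable page ids found in a full mirror.
    """
    total = len(crawled)
    if total == 0:
        raise ValueError("no crawled pages")
    counts = Counter(crawled.values())
    retrieved = {p for p, label in crawled.items() if label == "valuable"}
    if valuable_in_mirror:
        coverage = len(retrieved & valuable_in_mirror) / len(valuable_in_mirror)
    else:
        coverage = 0.0
    return {
        "duplicate_ratio": counts["duplicate"] / total,
        "invalid_ratio": counts["invalid"] / total,
        "coverage_ratio": coverage,
    }
```

Under these definitions a good forum crawler drives the first two ratios toward zero while keeping coverage close to one.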
Effectiveness and Efficiency
• Effectiveness
• Efficiency
[Figure: two pairs of stacked bar charts showing Invalid / Duplicate / Valuable pages per forum (Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, Hoopchina; one panel lists Gentoo in place of Hoopchina). Each pair has panels (a) A Generic Crawler and (b) iRobot; one pair uses a y-axis of 0 to 6000, the other 0 to 20000]
Performance vs. Sampled Page#
[Figure: coverage ratio, duplicate ratio, and invalid ratio (y-axis 0% to 90%) against the number of sampled pages (10, 20, 50, 100, 500, 1000)]
Preserved Discussion Threads
Forums        Mirrored   Crawled by iRobot   Correctly Recovered
Biketo        1584       1313                1293
Asp           600        536                 536
Baidu         −          −                   −
Douban        62         60                  37
CQZG          1393       1384                1311
Tripadvisor   326        272                 272
Hoopchina     2935       2829                2593
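To compare forums of different sizes, the thread counts can be turned into fractions, as in this small sketch. The counts are copied from the table; the two derived quantities and their names are choices made for this example, not metrics defined in the slides.

```python
# Thread counts per forum: (mirrored, crawled by iRobot, correctly recovered).
# None marks the forum for which the table reports no values.
THREADS = {
    "Biketo": (1584, 1313, 1293),
    "Asp": (600, 536, 536),
    "Baidu": None,
    "Douban": (62, 60, 37),
    "CQZG": (1393, 1384, 1311),
    "Tripadvisor": (326, 272, 272),
    "Hoopchina": (2935, 2829, 2593),
}


def recovery_rates(threads):
    """Return, per forum, the crawled share of mirrored threads and the
    correctly recovered share of crawled threads (None if no data)."""
    rates = {}
    for forum, row in threads.items():
        if row is None:
            rates[forum] = None
            continue
        mirrored, crawled, recovered = row
        rates[forum] = {
            "crawled_of_mirrored": crawled / mirrored,
            "recovered_of_crawled": recovered / crawled,
        }
    return rates
```

Calling `recovery_rates(THREADS)` gives one dictionary entry per forum, which makes it easy to see where crawled threads were recovered completely and where recovery lagged behind crawling.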
Conclusions
• An intelligent forum crawler based on site-level structure analysis
  – Identify page templates / valuable pages / link analysis / traversal path selection
• Some modules can still be improved
  – More automated & mature algorithms in SIGIR’08
• More future work directions
  – Queue management
  – Refresh strategies
Thanks!