
Page 1

iRobot: An Intelligent Crawler for Web Forums

Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang

Microsoft Research, Asia

Page 2

Outline

• Motivation & Challenge

• iRobot – Our Solution

– System Overview

– Module Details

• Evaluation

Page 3

Outline

• Motivation & Challenge

• iRobot – Our Solution

– System Overview

– Module Details

• Evaluation

Page 4

Why Web Forums Are Important

• Forums are a huge resource of human knowledge
  – Popular all over the world
  – Cover any conceivable topic and issue

• Forum data can benefit many applications
  – Improve the quality of search results
  – Enable various kinds of data mining on forum data

• Collecting forum data
  – Is the basis of all forum-related research
  – Is not a trivial task

Page 5

Why Forum Crawling Is Difficult

• Duplicate pages
  – Forums have complex in-site structures
  – Many shortcuts for browsing

• Invalid pages
  – Most forums have access control
  – Some pages can only be visited after registration

• Page-flipping
  – Long threads are split across multiple pages
  – Deep navigation levels

Page 6

The Limitation of Generic Crawlers

• In general crawling, each page is treated independently

– Fixed crawling depth

– Cannot avoid duplicates before downloading

– Fetch many invalid pages, such as login prompts

– Ignore the relationships between pages from the same thread

• Forum crawling needs a site-level perspective!

Page 7

Statistics on Some Forums

• Around 50% of crawled pages are useless

• Waste of both bandwidth and storage

Page 8

Outline

• Motivation & Challenge

• Our Solution – iRobot

– System Overview

– Module Details

• Evaluation

Page 9

What Is a Site-Level Perspective?

• Understand the organizational structure of the site

• Find out an optimal crawling strategy

[Figure: the site-level perspective of "forums.asp.net": a sitemap whose node types include Entry, List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, and Digest]

Page 10

iRobot: An Intelligent Forum Crawler

[System diagram: the crawler first performs general web crawling to sample pages; sitemap construction and traversal path selection learn the site structure; the crawler then restarts for forum crawling, and fetched pages go through segmentation & archiving into raw pages and metadata]
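To make the diagram concrete, here is a hypothetical driver loop mirroring the pipeline; every function name (generic_crawl, construct_sitemap, select_traversal_paths, forum_crawl, segment_and_archive) is a placeholder for a module described on the following pages, not an actual API.

```python
def crawl_forum(entry_url, sample_budget=1000):
    """Two-stage crawl: sample generically, learn the site structure,
    then restart and crawl only along the selected traversal paths."""
    # Stage 1: general web crawling gathers a small sample of pages.
    sample = generic_crawl(entry_url, limit=sample_budget)

    # Stage 2: learn the site-level structure from the sample.
    sitemap = construct_sitemap(sample)      # page clustering + link analysis
    paths = select_traversal_paths(sitemap)  # informativeness + path selection

    # Stage 3: restart the crawler and follow only the learned paths,
    # segmenting and archiving each fetched page (raw page + metadata).
    for page in forum_crawl(entry_url, paths):
        segment_and_archive(page)
```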

Page 11

Outline

• Motivation & Challenge

• Our Solution – iRobot
  – System Overview
  – Module Details
    • How many kinds of pages? (Sitemap Construction)
    • How do these pages link with each other? (Sitemap Construction)
    • Which pages are valuable? (Traversal Path Selection)
    • Which links should be followed? (Traversal Path Selection)

• Evaluation

Page 12

Page Clustering

• Forum pages are generated from a database and templates

• Layout is a robust way to describe a template (a clustering sketch follows below)

– Repetitive regions are everywhere on forum pages

– Layout can be characterized by repetitive regions

[Figure: four example forum pages (a)–(d) illustrating repetitive regions]
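A minimal sketch of clustering by layout, assuming each page is summarized by the tag paths that repeat within it; the feature choice, weighted Jaccard similarity, and 0.8 threshold are illustrative assumptions, not the paper's exact algorithm.

```python
from collections import Counter
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    """Record how often each root-to-element tag path occurs in a page."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def layout_signature(html):
    """Keep only tag paths occurring more than once: repetitive regions."""
    parser = TagPathCollector()
    parser.feed(html)
    return Counter({p: n for p, n in parser.paths.items() if n > 1})

def jaccard(a, b):
    """Weighted Jaccard similarity between two layout signatures."""
    keys = set(a) | set(b)
    union = sum(max(a[k], b[k]) for k in keys)
    if union == 0:
        return 0.0
    return sum(min(a[k], b[k]) for k in keys) / union

def cluster_pages(pages, threshold=0.8):
    """Greedy clustering: join the first cluster whose representative
    layout is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (representative_signature, member_ids)
    for page_id, html in pages:
        sig = layout_signature(html)
        for rep, members in clusters:
            if jaccard(rep, sig) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((sig, [page_id]))
    return clusters
```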

Page 13

Page Clustering

Page 14

[Figure: the resulting page clusters: List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, and Digest]

Page 15

Link Analysis

• URL patterns can distinguish links, but are not reliable on all sites

• Location can also distinguish links

[Figure: a forum page with its links annotated by destination, e.g. 1. Login, 4. Thread List, 5. Thread]

A Link = URL Pattern + Location
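As a small illustration of this representation, a link can be keyed by a generalized URL pattern plus the DOM location of its anchor; the regex-based generalization below is an assumed scheme for the sketch, not iRobot's exact one.

```python
import re
from urllib.parse import urlparse, parse_qsl

def url_pattern(url):
    """Generalize a URL by wildcarding numeric parts, e.g.
    /viewthread.php?tid=123&page=2 -> /viewthread.php?tid=*&page=*"""
    parts = urlparse(url)
    path = re.sub(r"\d+", "*", parts.path)
    query = "&".join(f"{k}=*" if v.isdigit() else f"{k}={v}"
                     for k, v in parse_qsl(parts.query))
    return f"{path}?{query}" if query else path

def link_key(url, dom_location):
    """A link = URL pattern + location (e.g. the anchor's tag path)."""
    return (url_pattern(url), dom_location)

# Two page-flipping links of different threads map to the same key:
assert link_key("http://ex.com/viewthread.php?tid=1&page=2", "body/div/a") \
    == link_key("http://ex.com/viewthread.php?tid=9&page=5", "body/div/a")
```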

Page 16

[Figure: the sitemap with links identified between node types Entry, List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, and Digest]

Page 17

Informativeness Evaluation

• Which kind of pages (nodes) are valuable?

• Some heuristic criteria

– A larger node is more likely to be valuable

– Pages with larger sizes are more likely to be valuable

– A more diverse node is more likely to be valuable
  • Diversity is measured based on content de-duplication
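A toy scoring function in the spirit of these heuristics, where node size, average page size, and shingle-based content diversity all raise the score; the exact weighting and the de-dup method are illustrative assumptions.

```python
import hashlib

def shingles(text, k=8):
    """Hash k-word shingles of a page's text for cheap de-duplication."""
    words = text.split()
    return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(1, len(words) - k + 1))}

def node_informativeness(page_texts):
    """Score a sitemap node (a cluster of pages): larger nodes, larger
    pages, and more diverse content all indicate a more valuable node."""
    if not page_texts:
        return 0.0
    size = len(page_texts)                              # node size
    avg_len = sum(len(t) for t in page_texts) / size    # page size
    union = set().union(*(shingles(t) for t in page_texts))
    total = sum(len(shingles(t)) for t in page_texts)
    diversity = len(union) / max(1, total)              # 1.0 = no duplication
    return size * avg_len * diversity
```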

Page 18

[Figure: the sitemap with informativeness evaluated for each node type (Entry, List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Digest)]

Page 19

Traversal Path Selection

• Clean the sitemap
  – Remove valueless nodes
  – Remove duplicate nodes
  – Remove links to valueless / duplicate nodes

• Find an optimal path (sketched below)
  – Construct a spanning tree
  – Use depth as the cost, reflecting user browsing behavior
  – Identify page-flipping links (page numbers, Prev/Next)
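A minimal sketch of this step, assuming the sitemap is a dict from node to outgoing (target, link-pattern) edges; BFS from the entry node yields a spanning tree that reaches every valuable node at minimum depth, and page-flipping links are always kept so long threads stay whole.

```python
from collections import deque

def select_traversal_path(sitemap, entry, valuable, flipping_links):
    """sitemap: {node: set of (target_node, link_pattern)}.
    Returns spanning-tree edges to follow plus page-flipping links."""
    # Clean the sitemap: keep only valuable nodes and links among them.
    graph = {n: {(t, l) for t, l in edges if t in valuable}
             for n, edges in sitemap.items() if n in valuable}

    # BFS builds a spanning tree in which each node is reached at
    # minimum depth; depth is the cost because it models browsing effort.
    tree, seen, queue = [], {entry}, deque([entry])
    while queue:
        node = queue.popleft()
        for target, link in sorted(graph.get(node, ())):
            if target not in seen:
                seen.add(target)
                tree.append((node, target, link))
                queue.append(target)

    # Page-flipping links within a reached node are always followed.
    kept_flips = [(n, l) for n, l in flipping_links if n in seen]
    return tree, kept_flips
```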

Page 20

[Figure: the selected traversal path over the cleaned sitemap (node types Entry, List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Digest)]

Page 21

Outline

• Motivation & Challenge

• iRobot – Our Solution

– System Overview

– Module Details

• Evaluation

Page 22

Evaluation Criteria

• Duplicate ratio

• Invalid ratio

• Coverage ratio
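A plausible reading of these criteria (the paper gives the formal definitions):

$$\text{duplicate ratio}=\frac{\#\,\text{duplicate pages}}{\#\,\text{pages crawled}},\qquad \text{invalid ratio}=\frac{\#\,\text{invalid pages}}{\#\,\text{pages crawled}},\qquad \text{coverage ratio}=\frac{\#\,\text{valuable pages retrieved}}{\#\,\text{valuable pages on the site}}$$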

[Charts: duplicate ratio (up to ~70%) and invalid ratio (up to ~25%) for mirrored pages vs. iRobot, and iRobot's coverage ratio, on Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, and Hoopchina]

Page 23

Effectiveness and Efficiency

• Effectiveness

• Efficiency

[Charts: counts of invalid, duplicate, and valuable pages fetched by (a) a generic crawler and (b) iRobot on Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, and Hoopchina (one panel lists Gentoo); effectiveness panels range to ~6,000 pages, efficiency panels to ~20,000]

Page 24

Performance vs. Sampled Page#

[Chart: coverage, duplicate, and invalid ratios as the number of sampled pages grows from 10 to 1,000]

Page 25

Preserved Discussion Threads

Forums        Mirrored   Crawled by iRobot   Correctly Recovered
Biketo        1584       1313                1293
Asp           600        536                 536
Baidu         −          −                   −
Douban        62         60                  37
CQZG          1393       1384                1311
Tripadvisor   326        272                 272
Hoopchina     2935       2829                2593

Overall, 94.5% of the threads crawled by iRobot were correctly recovered (87.6% of all mirrored threads).

Page 26

Conclusions

• An intelligent forum crawler based on site-level structure analysis
  – Page template identification, valuable-page evaluation, link analysis, and traversal path selection

• Some modules can still be improved
  – More automated & mature algorithms in SIGIR'08

• More future-work directions
  – Queue management
  – Refresh strategies

Page 27

Thanks!
