Geographically Focused Collaborative Crawling
Hyun Chul Lee, University of Toronto & Genieknows.com
Joint work with Weizheng Gao (Genieknows.com) and Yingbo Miao (Genieknows.com)
Outline
Introduction/Motivation
Crawling Strategies
Evaluation Criteria
Experiments
Conclusion
Evolution of Search Engines
Because a large number of web users have diverse search needs, current search engines cannot satisfy them all.
Search engines are evolving! Some possible evolution paths:
Personalization of search engines. Examples: MyAsk, Google Personalized Search, My Yahoo Search, etc.
Localization of search engines. Examples: Google Local, Yahoo Local, Citysearch, etc.
Specialization of search engines. Examples: Kosmix, IMDB, Scirus, CiteSeer, etc.
Others (Web 2.0, multimedia, blogs, etc.)
Search Engine Localization
Yahoo estimates that 20-25% of all search queries have a local component either stated explicitly (e.g. Home Depot Boston; Washington acupuncturist) or implicitly (e.g. flowers, doctors).
Source: SearchEngineWatch, Aug 3, 2004
[Chart: classification of search queries — roughly 20% local vs. 80% non-local]
Local Search Engine
Objective: allow the user to search by keyword as well as by the geographical location of interest. The location can be:
A city, state, or country. E.g., find restaurants in Los Angeles, CA, USA.
A specific address. E.g., find Starbucks near 100 Milam Street, Houston, TX.
A point of interest. E.g., find restaurants near LAX.
Web Based Local Search Engine
A precise definition of "local search engine" is not possible, as numerous Internet Yellow Pages (IYP) claim to be local search engines.
Certainly, a true local search engine should be Web based.
Crawling geographically relevant deep-web data is also desirable.
Motivation for geographically focused crawling
The first step toward building a local search engine is to collect/crawl geographically sensitive pages. There are two possible approaches:
1. Target geographically sensitive pages during crawling.
2. Crawl the Web in general, then filter out pages that are not geographically sensitive.
We study this problem as part of building the Genieknows local search engine.
Outline
Introduction/Motivation
Crawling Strategies
Evaluation Criteria
Experiments
Conclusion
Problem Description
Geographically Sensitive Crawling: Given a set of targeted locations (e.g. list of cities), collect as many pages as possible that are geographically-relevant to the given locations.
For simplicity, in our experiments we assume that the targeted locations are given in the form of city-state pairs.
Basic Assumption (Example)
[Diagram: pages about Boston, pages about Houston, and non-relevant pages, with links tending to connect pages about the same city]
Basic Strategy (Single Crawling Node Case)
Exploit features that might potentially lead to the desired geographically sensitive pages, and guide the crawling behavior using those features.
Note that similar ideas are used for topical focused crawling.
[Example: target location Boston. From a given page, two URLs are extracted: www.restaurant-boston.com and www.restaurant.com. Crawl www.restaurant-boston.com first.]
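The single-node strategy above can be sketched as a priority frontier that crawls location-matching URLs first. A minimal sketch, assuming a naive substring match on the URL (the scoring rule is illustrative, not the paper's exact heuristic):

```python
import heapq

def location_score(url: str, location: str) -> int:
    """Score 1 if the URL string mentions the target location, else 0.
    (Hypothetical scoring rule for illustration.)"""
    return 1 if location.lower() in url.lower() else 0

class GeoFrontier:
    """A crawl frontier that pops location-matching URLs first."""
    def __init__(self, location: str):
        self.location = location
        self._heap = []      # entries: (-score, insertion order, url)
        self._counter = 0

    def push(self, url: str) -> None:
        score = location_score(url, self.location)
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = GeoFrontier("boston")
frontier.push("http://www.restaurant.com")
frontier.push("http://www.restaurant-boston.com")
first = frontier.pop()  # the Boston URL is crawled first
```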
Extension to the multiple crawling nodes case
[Diagram: geographically sensitive crawling nodes (assigned to Boston, Chicago, and Houston) and general crawling nodes all fetch from the Web. URL 1, which is about Chicago, is routed to the node responsible for Chicago.]
Extension to the multiple crawling nodes case (cont.)
[Diagram: URL 2, which is not geographically sensitive, is routed to a general crawling node.]
Extension to the multiple crawling nodes case (cont.)
[Diagram: a URL about Houston extracted from Page 1 is routed to the node responsible for Houston.]
Crawling Strategies (URL Based)
If the considered URL contains the targeted city-state pair A, assign the URL to the crawling node responsible for city-state pair A.
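A minimal sketch of this routing rule, assuming hypothetical node identifiers and a substring match for the city name in the URL; non-matching URLs fall back to a hash over general nodes:

```python
import hashlib

# Hypothetical assignment of target cities to crawling node ids
TARGET_CITIES = {"boston": "C1", "chicago": "C2", "houston": "C3"}
GENERAL_NODES = ["G1", "G2"]

def assign_by_url(url: str) -> str:
    """Route a URL to the geographically sensitive node whose city
    appears in the URL; otherwise hash it to a general node."""
    lowered = url.lower()
    for city, node in TARGET_CITIES.items():
        if city in lowered:
            return node
    # Deterministic hash so every crawler process routes identically
    digest = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return GENERAL_NODES[digest % len(GENERAL_NODES)]

geo_node = assign_by_url("http://www.restaurant-boston.com")
other_node = assign_by_url("http://www.example.com")
```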
Crawling Strategies (Extended Anchor Text)
Extended anchor text refers to the set of prefix and suffix tokens surrounding the link text.
When multiple city-state pairs are found, choose the one closest to the link text.
If the extended anchor text contains the targeted city-state pair A, assign the corresponding URL to the crawling node responsible for city-state pair A.
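A sketch of choosing the city nearest the anchor text, assuming a fixed token window around the link and single-token city names for simplicity (the window size and tokenization are illustrative assumptions, not the paper's exact parameters):

```python
def nearest_city(tokens, link_start, link_end, cities, window=5):
    """Scan `window` tokens before and after the anchor span
    [link_start, link_end) and return the known city closest to it,
    or None if no city is found."""
    best, best_dist = None, None
    lo = max(0, link_start - window)
    hi = min(len(tokens), link_end + window)
    for i in range(lo, hi):
        if link_start <= i < link_end:
            continue  # skip the anchor text itself
        tok = tokens[i].lower().strip(",.")
        if tok in cities:
            # Token distance to the nearest edge of the anchor span
            dist = link_start - i if i < link_start else i - link_end + 1
            if best_dist is None or dist < best_dist:
                best, best_dist = tok, dist
    return best

tokens = "best restaurants in Boston click here for Houston deals".split()
# Anchor text "click here" occupies token indices 4 and 5
pair = nearest_city(tokens, 4, 6, {"boston", "houston"})
```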
Crawling Strategies (Full Content Based)
Consider the number of times each city-state pair is found in the content, and the number of times the city name alone is found.
1. Compute the probability that a page is about a city-state pair using the full content.
2. Assign all extracted URLs to the crawling node responsible for the most probable city-state pair.
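A simplified count-based version of this estimate: full "city, state" matches count more than bare city-name mentions, and the scores are normalized into probabilities. The weight `alpha` is an illustrative assumption, not the paper's exact model:

```python
import re
from collections import Counter

def city_probabilities(text, city_states, alpha=0.5):
    """Estimate P(page is about a city-state pair) from occurrence
    counts; `alpha` down-weights bare city names (hypothetical weight)."""
    text = text.lower()
    scores = Counter()
    for city, state in city_states:
        # Full "city, st" (or "city st") matches
        pairs = len(re.findall(rf"\b{city},?\s+{state}\b", text))
        # Bare city-name mentions, excluding those inside full matches
        bare = text.count(city) - pairs
        scores[(city, state)] = pairs + alpha * max(bare, 0)
    total = sum(scores.values()) or 1
    return {k: v / total for k, v in scores.items()}

probs = city_probabilities(
    "Visit Boston, MA today. Boston rocks. Houston too.",
    [("boston", "ma"), ("houston", "tx")],
)
```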
Crawling Strategies (Classification Based)
Shown to be useful for topical collaborative crawling (Chung et al., 2002).
A Naïve Bayes classifier was used for simplicity; training data were obtained from DMOZ.
1. The classifier determines whether a page is relevant to a given city-state pair.
2. Assign all extracted URLs to the crawling node responsible for the most probable city-state pair.
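A minimal multinomial Naïve Bayes over word counts, as a sketch of the classification step. The toy training documents stand in for the DMOZ data; this is not the exact classifier used in the paper:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (sketch)."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, lab in zip(docs, labels):
            toks = doc.lower().split()
            self.word_counts[lab].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n_docs = sum(self.class_counts.values())
        for c in self.classes:
            lp = math.log(self.class_counts[c] / n_docs)  # class prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in doc.lower().split():
                # Laplace-smoothed token likelihood
                lp += math.log((self.word_counts[c][tok] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

clf = TinyNaiveBayes().fit(
    ["fenway park boston red sox", "boston harbor seafood",
     "houston rockets nasa space", "houston texas rodeo"],
    ["boston, ma", "boston, ma", "houston, tx", "houston, tx"])
label = clf.predict("red sox game near the harbor")
```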
Crawling Strategies (IP-Address Based)
For IP-address mapping, the hostip.info API was employed.
1. Associate the IP address of the web host with a city-state pair.
2. Assign all extracted URLs to the crawling node responsible for the obtained city-state pair.
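A sketch of the lookup step, replacing the hostip.info call with a hypothetical in-memory prefix table (the prefixes below are documentation-only TEST-NET ranges, not real geolocations):

```python
# Hypothetical IP-prefix table standing in for the hostip.info lookup
IP_PREFIX_TO_CITY = {
    "203.0.113.": ("houston", "tx"),
    "198.51.100.": ("boston", "ma"),
}

def city_for_ip(ip: str):
    """Map an IP address to a city-state pair, or None if unmapped."""
    for prefix, pair in IP_PREFIX_TO_CITY.items():
        if ip.startswith(prefix):
            return pair
    return None

houston = city_for_ip("203.0.113.7")
unknown = city_for_ip("192.0.2.1")
```

In a real crawler the table would be replaced by a DNS resolution of the host followed by a geolocation query; the prefix match here only illustrates the routing decision.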
Normalization and Disambiguation of City Names
From previous research (Amitay et al. 2004, Ding et al. 2000):
Aliasing: different names for the same city. The United States Postal Service (USPS) database was used to normalize aliases.
Ambiguity: city names with multiple meanings. For the full content based strategy, resolve it through analysis of the other city-state pairs found within the page; for the remaining strategies, simply assume the city with the largest population is meant.
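A sketch of both steps, with made-up alias and population tables standing in for the USPS database and real population figures:

```python
# Hypothetical alias and population tables (illustrative values only)
ALIASES = {"nyc": "new york", "la": "los angeles"}
POPULATION = {("portland", "or"): 630_000, ("portland", "me"): 68_000}

def normalize_city(name: str) -> str:
    """Resolve a known alias to the canonical city name."""
    name = name.lower().strip()
    return ALIASES.get(name, name)

def disambiguate(city: str, states_in_page=None):
    """Pick the state for an ambiguous city name: prefer a state seen
    elsewhere in the page, else fall back to the largest population."""
    candidates = [(c, s) for (c, s) in POPULATION if c == city]
    if states_in_page:
        for c, s in candidates:
            if s in states_in_page:
                return (c, s)
    return max(candidates, key=lambda p: POPULATION[p], default=None)
```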
Outline
Introduction/Motivation
Crawling Strategies
Evaluation Criteria
Experiments
Conclusion
Evaluation Model
Standard metrics (Cho et al. 2002):
Overlap: (N − I)/N, where N is the total number of pages downloaded by the overall crawler and I is the number of unique pages downloaded.
Diversity: S/N, where S is the number of unique domain names among the downloaded pages and N is the total number of downloaded pages.
Communication Overhead: exchanged URLs per downloaded page.
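The three standard metrics can be written directly as functions:

```python
def overlap(total_pages: int, unique_pages: int) -> float:
    """(N - I) / N: fraction of duplicate downloads across the crawler."""
    return (total_pages - unique_pages) / total_pages

def diversity(domains, total_pages: int) -> float:
    """S / N: unique domain names per downloaded page."""
    return len(set(domains)) / total_pages

def communication_overhead(exchanged_urls: int, total_pages: int) -> float:
    """URLs exchanged between crawling nodes per downloaded page."""
    return exchanged_urls / total_pages
```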
Evaluation Model (Cont.)
Geographically sensitive metrics use extracted geo-entities (address information):
Geo-Coverage: number of retrieved pages with at least one geo-entity.
Geo-Focus: number of retrieved pages containing at least one geo-entity pertinent to the assigned city-state pairs of the crawling node.
Geo-Centrality: how central the retrieved pages are relative to the pages that contain geo-entities.
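The first two metrics can be sketched as follows, normalized to fractions of pages (the slide reports raw counts; the normalization is an assumption for readability). Each page is represented by the set of city-state pairs found in it:

```python
def geo_coverage(num_with_geo_entity: int, total_pages: int) -> float:
    """Fraction of downloaded pages containing at least one geo-entity."""
    return num_with_geo_entity / total_pages

def geo_focus(node_pages, assigned_pairs) -> float:
    """Fraction of a node's pages whose geo-entities include one of the
    node's assigned city-state pairs. `node_pages` is a list of sets."""
    hits = sum(1 for entities in node_pages if entities & assigned_pairs)
    return hits / len(node_pages)

focus = geo_focus(
    [{("boston", "ma")}, {("houston", "tx")}, set()],
    {("boston", "ma")},
)
```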
Outline
Introduction/Motivation
Crawling Strategies
Evaluation Criteria
Experiments
Conclusion
Experiment
Objective: crawl pages pertinent to the top 100 US cities.
Crawling nodes: 5 geographically sensitive crawling nodes and 1 general node.
Hardware: 2 servers (3.2 GHz dual P4s, 1 GB RAM).
Around 10 million pages crawled per strategy.
Standard hash-based crawling was also considered for comparison.
Result (Geo-Coverage)
[Chart: geo-coverage (0–10%) across crawling strategies: URL-hash based, URL based, extended anchor text based, full content based, classification based, IP-address based]
Result (Geo-Focus)
[Chart: geo-focus (0–100%) across crawling strategies: URL based, extended anchor text based, full content based, classification based, IP-address based]
Result (Communication-Overhead)
[Chart: communication overhead (0–70 exchanged URLs per downloaded page) across crawling strategies: URL-hash based, URL based, extended anchor text based, full content based, classification based, IP-address based]
Outline
Introduction/Motivation
Crawling Strategies
Evaluation Criteria
Experiments
Geographic Locality
Geographic Locality
Question: how close (in graph-based distance) are geographically sensitive pages to each other? Compare two probabilities:
The probability that a pair of linked pages, chosen uniformly at random, is pertinent to the same city-state pair under the considered collaborative strategy.
The probability that a pair of random (unlinked) pages, chosen uniformly at random, is pertinent to the same city-state pair under the considered collaborative strategy.
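Both probabilities can be computed exactly on a toy link graph, as a sketch of the measurement (assuming each page is labeled with at most one city-state pair via a `page_city` mapping):

```python
import itertools

def same_pair_prob_linked(edges, page_city):
    """P(a linked pair of pages pertains to the same city-state pair)."""
    same = sum(1 for u, v in edges
               if page_city.get(u) is not None
               and page_city.get(u) == page_city.get(v))
    return same / len(edges)

def same_pair_prob_random(pages, page_city):
    """Same probability over all unordered page pairs, linked or not."""
    pairs = list(itertools.combinations(pages, 2))
    same = sum(1 for u, v in pairs
               if page_city.get(u) is not None
               and page_city.get(u) == page_city.get(v))
    return same / len(pairs)

# Toy graph: pages 1,2 are about Boston; pages 3,4 about Houston.
edges = [(1, 2), (3, 4), (1, 3)]
page_city = {1: ("boston", "ma"), 2: ("boston", "ma"),
             3: ("houston", "tx"), 4: ("houston", "tx")}
p_linked = same_pair_prob_linked(edges, page_city)
p_random = same_pair_prob_random([1, 2, 3, 4], page_city)
```

On this toy graph, linked pairs agree on the city far more often than random pairs, which is the locality effect the table below measures.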
Results

Crawling Strategy      P(linked pair, same city-state)   P(random pair, same city-state)
URL Based              0.41559                           0.02582
Classification Based   0.044495                          0.008923
Full Content Based     0.26325                           0.01157
Conclusion
We showed the feasibility of the geographically focused collaborative crawling approach for targeting geographically sensitive pages.
We proposed several evaluation criteria for geographically focused collaborative crawling.
Extended anchor text and URL are valuable features to exploit for this type of crawling.
It would be interesting to look at other, more sophisticated features.
Many problems related to local search remain open, including ranking, indexing, retrieval, and crawling.