
Language Specific Crawling based on Web Pages Features

Masomeh Azimzadeh, Alireza Yari, Mohammad Javad Kargar
Iran Telecommunication Research Center

{azim_ma, a-yari}@itrc.ac.ir

Abstract

Since the World Wide Web contains a large set of data in different languages, retrieving language specific information creates a new challenge in information retrieval, called language specific crawling. In this paper, a new approach is proposed for language specific crawling in which a combination of selected content and context features of web documents is applied. The approach has been implemented for the Persian language and evaluated in the Iranian web domain. The evaluation results show how this approach can improve the performance of crawling from the speed and coverage points of view.

1. Introduction

Focused crawling is an automated mechanism to efficiently find pages relevant to a topic on the web. Focused crawlers [1,3,4,6,7,9,10,11,14,19] are proposed to traverse and retrieve only the part of the web that is relevant to a particular topic, starting from a set of pages usually referred to as the seed set. This makes efficient use of network bandwidth and storage capacity, because the crawler follows relevant paths and stops crawling on irrelevant paths. Focused crawling can be adapted for crawling language specific resources, which is known as language specific crawling. The idea of language specific crawling is based on language locality in the web: the pages of one language tend to be linked by other web pages of the same language.

Since there is no comprehensive research in the context of focused crawling for the Persian language, in this paper the different approaches to focused crawling have been studied and a new approach for language specific crawling has been proposed. The study of language specific crawling approaches shows that some of them are based on web page features such as the meta tag [15] or the content [18]. In this research, a combination of these two features is used in a way that achieves better performance. The experimental results on the Persian language show that the proposed approach can improve the retrieval of Persian documents from the speed and coverage points of view.

In section 2, the related works are reviewed, and the proposed language specific crawling approach is explained in section 3. Then, in section 4, the implementation is presented, and the test results are shown in section 5. Finally, a short summary of the paper comes next.

2. Related Works

Many works in the field of focused crawling have been done, and many approaches have been proposed. In this section, we give a general review of some of the important works, such as learnable or feedback-based crawling, inheritance-based crawling, and crawling with different information resources. Note that some of the mentioned works use a combination of these features. Feedback-based crawling methods utilize knowledge extracted from previous crawling cycles [8]; these methods are also called learnable crawling methods. The inheritance-based crawling methods [10, 14] estimate the relevance of neighbor pages before they are actually fetched and analyzed. A crawler can also use different information resources: for example, keywords and queries [6, 7], or complementary knowledge such as a thesaurus or an ontology [4, 11].

Language specific crawling is a special kind of focused crawling which concentrates on language rather than concept. Only a few works have been completed in the context of language specific crawling [15, 17, 18]. These approaches are based on language locality in the web and gather pages which belong to a specific language or have the same linguistic identity. Language specific web crawling [17] is one of these works, which creates web archives for countries with linguistic identities. This algorithm uses the linguistic meta tag information and an n-gram [20] approach for finding the relevant pages. In a similar work [15], the language of web pages is determined from the character encoding schema. The main idea of [18] is to crawl a large domain on the internet in order to build language specific corpora from the resources available on the web.

3. Proposed Language Specific Crawling

In this section, the requirements of a language specific crawler are presented. A language specific crawler, as a focused crawler, must cover two sets of requirements. First, it must have a suitable method for identifying related pages, that is, pages belonging to the language. Second, it should have an intelligent policy for following the relevant URLs.

3.1. Language Identification Methods

To have a dynamic language specific crawler, it is important to select proper features and apply them in such a way that the crawling speed is preserved and the performance improved. In the proposed method, a combination of the meta tag and stop-word features is applied to improve the crawling performance. Since the meta tag feature can be applied simply for detecting the language of web pages, it is the first priority in our approach. In cases where the meta tag feature does not exist in a web document, the other combined feature takes effect: the content of the web page is the second feature suggested for language detection. For this purpose, the content of a web document is compared with a list of stop words; the occurrence of words from this list in the document indicates the language of the document.
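The following sketch illustrates this two-step identification. It is a minimal reconstruction under stated assumptions: the meta tag pattern, the stop-word list, and the threshold are illustrative choices, not the authors' actual values.

```python
# A minimal sketch of the two-step language identification described above.
# The meta tag pattern, the stop-word list, and the threshold are illustrative
# assumptions, not the values used in the paper.
import re

# A few example Persian stop words; the real list would be much longer.
PERSIAN_STOP_WORDS = {"از", "به", "که", "با", "این", "را", "در"}

META_LANG_RE = re.compile(
    r'<meta[^>]+(?:http-equiv=["\']content-language["\']|name=["\']language["\'])'
    r'[^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE)

def is_persian(html: str, text: str) -> bool:
    # Priority 1: a language meta tag, when present, decides directly.
    match = META_LANG_RE.search(html)
    if match:
        return match.group(1).strip().lower().startswith("fa")
    # Priority 2: otherwise compare the page content with the stop-word list.
    words = text.split()
    hits = sum(1 for word in words if word in PERSIAN_STOP_WORDS)
    # Assumed threshold: a small fraction of stop-word hits marks the page Persian.
    return bool(words) and hits / len(words) > 0.02
```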

3.2. The Policy of Crawling

Since the proposed language specific crawling is not a subject-related approach, using the logical structure of the web is a suitable mechanism for identifying the language of web pages, especially because of the language locality of web pages in the web graph. In our work, based on the language locality assumption, the language of a child page, i.e. a fetched URL, is considered to be the same as the language of its parent page. To rank Persian web pages, a weighting mechanism is used to crawl the target web pages: the fetched URLs are weighted based on the language of their parent page. The weighting formula is presented below:

URL weight = PAGERANK_WEIGHT * pagerank
           + WLSCORE_WEIGHT * pagerank
           + HITS_HUB_WEIGHT * hubrank
           + HITS_AUTHORITY_WEIGHT * authrank
           + SITERANK_WEIGHT * siterank
           + QUEUESIZE_WEIGHT * queuesize
           + DEPTH_WEIGHT * depth_score
           + STATIC_WEIGHT * static_score
           + RANDOM_WEIGHT * random_score
           + PERSIAN_WEIGHT * ispersian


In the above formula, different parameters, such as pagerank, depth, and random score, and their associated weights are considered. The Persian weight is used in this formula to rank Persian pages; its associated parameter is ispersian, which distinguishes Persian from non-Persian pages.
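Transcribed into code, the formula reads as follows. This is only a sketch: the weight values are placeholders, except for PERSIAN_WEIGHT, for which section 5 reports 500 as the optimum, and the second term repeats pagerank exactly as printed above.

```python
# A sketch of the URL weighting formula; weight values are illustrative
# placeholders, except PERSIAN_WEIGHT = 500, the optimum reported in section 5.
PAGERANK_WEIGHT = 1.0
WLSCORE_WEIGHT = 1.0        # paired with pagerank, as printed in the paper
HITS_HUB_WEIGHT = 1.0
HITS_AUTHORITY_WEIGHT = 1.0
SITERANK_WEIGHT = 1.0
QUEUESIZE_WEIGHT = 1.0
DEPTH_WEIGHT = 1.0
STATIC_WEIGHT = 1.0
RANDOM_WEIGHT = 1.0
PERSIAN_WEIGHT = 500.0

def url_weight(pagerank, hubrank, authrank, siterank, queuesize,
               depth_score, static_score, random_score, ispersian):
    # ispersian is 1 when the parent page was identified as Persian, else 0.
    return (PAGERANK_WEIGHT * pagerank +
            WLSCORE_WEIGHT * pagerank +
            HITS_HUB_WEIGHT * hubrank +
            HITS_AUTHORITY_WEIGHT * authrank +
            SITERANK_WEIGHT * siterank +
            QUEUESIZE_WEIGHT * queuesize +
            DEPTH_WEIGHT * depth_score +
            STATIC_WEIGHT * static_score +
            RANDOM_WEIGHT * random_score +
            PERSIAN_WEIGHT * ispersian)
```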

Although the URLs are ranked based on the language of their parent pages, the language of the associated page itself is later checked and determined by the language identification methods. The crawling policy of the proposed approach therefore uses the information of only the node one level up in the web graph; this policy protects the crawler from error propagation to lower levels.

4. Implementation

Designing a crawler from scratch is a complex and time-consuming process. To implement the components of the language specific crawler, an open source crawler called WIRE [21], which is under the GPL license, has been used. This crawler has four modules, called seeder, manager, harvester and gatherer. The manager module generates the list of URLs that should be downloaded by the harvester in each cycle. Downloaded pages are sent to the gatherer module, which parses them and finds new URLs for the next cycles. The seeder module searches for new URLs that have not been seen before and also checks for URLs that should not be crawled because of the robots.txt exclusion protocol. To implement the language specific crawler, we added the language detection component to the gatherer, where the content of web pages is parsed, and set the policy component with a weighting factor in the manager module. To implement a proper crawling policy, a weighting mechanism has been developed in the manager module: every URL in this module is assigned a score based on different parameters such as crawling depth and ranking parameters. To prioritize downloaded URLs, a parent score feature has been added to the metadata of each web page, and this feature is given its value during the language detection process. If the value of this field matches the specified language, the URL is prioritized for the fetching process.
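To make the prioritization concrete, the sketch below models a frontier in which a fetched URL inherits a score from its parent page's language. The class and field names are assumptions for illustration; WIRE itself is a C++ system with its own queue structures.

```python
# A hedged sketch of the parent-score prioritization added to the manager
# module. Class and field names are assumptions; WIRE's real queue is in C++.
import heapq

PERSIAN_WEIGHT = 500.0  # language class weight, as tuned in section 5

class Frontier:
    def __init__(self):
        self._heap = []  # (negated weight, url): highest weight pops first

    def enqueue(self, url: str, base_weight: float, parent_is_persian: bool):
        # The parent score: a child URL inherits its parent page's language.
        weight = base_weight + (PERSIAN_WEIGHT if parent_is_persian else 0.0)
        heapq.heappush(self._heap, (-weight, url))

    def next_url(self) -> str:
        # URLs whose parents were identified as Persian are fetched first.
        _, url = heapq.heappop(self._heap)
        return url
```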

5. Experiment and Analysis

5.1. Initial Conditions

Because the behaviour of the crawler is strongly affected by the list of initial URLs, or seed set, two categories of seed sets have been used from the DMOZ directory. DMOZ is a reference directory of different countries' web sites which is updated manually. The first category of seeds is selected from some Persian directories, and the second category is selected from the directories of different country domains.

5.3. Evaluation

To measure the performance of the language specific crawler, two metrics have been considered: coverage and speed. Coverage is evaluated as the number of relevant pages relative to the total number of parsed pages, and speed is measured as the number of pages per crawling cycle. Figure 1 presents experiment case one, seeded with some Persian directories from DMOZ. In this first case, the crawler has run for 6 cycles and each experiment has taken 1:30 hours.
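As a minimal illustration of the two metrics defined above, the following sketch computes them from per-cycle counters; the counter names are illustrative assumptions, not part of the WIRE implementation.

```python
# A minimal sketch of the two evaluation metrics described above.
# The counter names are illustrative assumptions.

def coverage(relevant_pages: int, parsed_pages: int) -> float:
    """Fraction of parsed pages identified as belonging to the target language."""
    return relevant_pages / parsed_pages if parsed_pages else 0.0

def speed(fetched_pages: int, cycles: int) -> float:
    """Average number of pages retrieved per crawling cycle."""
    return fetched_pages / cycles if cycles else 0.0
```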

[Figure 1: coverage versus cycle number, one curve per Persian-class weight (0, 250, 500); only the legend and axis labels are recoverable from the extraction.]
Figure 1. The coverage rate in different crawling cycles in the first case

As figure 1 shows, the higher the weight of the language detection priority, the more related pages are covered by the crawler.

The maximum value for the language class weight depends on the configuration of the crawler. In this case, the optimum value was 500; beyond it, no improvement was observed in the performance of the crawler. Therefore, in this paper the experiments are shown up to the value 500. This issue is presented in figure 2, which represents experiment case two, seeded with some Persian and non-Persian directories from DMOZ. In this case, the Persian crawler has been run for 16 cycles and each test has taken 2:30 hours. The behavior of the crawler with different weights for Persian pages shows that the crawler with a suitable Persian weight has better performance from the coverage point of view.

[Figure 2: coverage versus cycle number (cycles 1-16), one curve per weight (0, 250, 500, 1000).]
Figure 2. The coverage rate in different crawling cycles in the second case

To compare the performance of case one and case two, the average coverage rate is calculated for the two cases and presented in figure 3.

[Figure 3: average coverage versus cycle number for case 1 and case 2.]
Figure 3. A comparison of the experiment cases one and two

As figure 3 demonstrates, when the seeds are selected from the Persian domain, the speed and coverage improve significantly. In the last case, we demonstrate the effect of the feature selection approaches on the gathering rate of the Persian web crawler.


[Figure 4: number of gathered Persian pages (2000-16000) versus cycle number (cycles 1-25), with curves for the meta tag, content, and meta tag & content methods.]
Figure 4. The rate of gathered Persian pages in different crawling cycles

Figure 4 demonstrates how the combination of the content and meta tag features affects the rate of Persian web page gathering. As figure 4 shows, the meta tag curve grows more slowly because many Persian web pages do not contain such a tag. Using the content feature improves the result, because a portion of the web pages without a language meta tag are also covered. According to figure 4, combining the two features extends the rate of gathered Persian pages significantly. This is because, when only the content feature is used, Persian web pages with encodings other than Windows-1256 and UTF-8 are not processed by the algorithm.

Figure 5 demonstrates the effect of feature selection on the coverage rate. As observed in figure 5, the crawler also gathers the related web pages faster, because in the same time slices it crawls and gathers a larger fraction of related web pages than the meta-tag-based and content-based methods do.

[Figure 5: coverage versus cycle number (cycles 1-27), with curves for the meta tag, content, and meta tag and content methods.]
Figure 5. The coverage rate in different crawling cycles

As illustrated by the above figures, the nature of web crawling prevents linear growth of the coverage rate. Moreover, the number of fetched pages in each cycle depends on the connectivity of Persian web pages, that is, on the Persian web graph. The performance of the crawler is also affected by the different crawling parameters, in particular the linguistic parameter.

6. Conclusion

In this work, a language specific crawling approach based on web page features was presented. The approach combines content and context features to obtain faster crawling and better performance. The results of our experiments demonstrate how the coverage of the crawler can be increased with regard to the total number of pages. The experiments also showed how important the selection of the initial URL addresses (seeds) is, and how it can affect the result of the performance evaluation. Moreover, increasing the Persian language detection weight up to a specified value improves the evaluation indexes. Owing to the improvement in gathering Persian web pages, the language specific crawler can be used in different applications such as search and information retrieval systems.

References

[1] G. Pant, P. Srinivasan, and F. Menczer, Crawling the Web, in Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, Springer Verlag, pp. 153-178, 2004.
[2] M.P.S. Bhatia, D. Gupta, Discussion on Web Crawlers of Search Engine, in Proceedings of the 2nd National Conference on Challenges & Opportunities in Information Technology (COIT-2008), Mandi Gobindgarh, March 29, pp. 227-230, 2008.
[3] G. Almpanidis, C. Kotropoulos, I. Pitas, Combining text and link analysis for focused crawling - an application for vertical search engines, Information Systems 32(6), pp. 886-908, 2007.
[4] A. Badia, T. Muezzinoglu, O. Nasraoui, Focused crawling: experiences in a real world project, in Proceedings of the 15th International Conference on World Wide Web, Edinburgh, pp. 1043-1044, 2006.
[5] S. Chakrabarti, M. van den Berg, B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery, in Proceedings of the 8th International World Wide Web Conference, Toronto, 1999.
[6] F. Menczer, G. Pant, P. Srinivasan and M. Ruiz, Evaluating Topic-Driven Web Crawlers, in Proceedings of the 24th Annual International ACM/SIGIR Conference, New Orleans, USA, 2001.
[7] J. Cho, H. Garcia-Molina, L. Page, Efficient Crawling Through URL Ordering, in Proceedings of the 7th International World-Wide Web Conference, 1998.
[8] N. Angkawattanawit, A. Rungsawang, Learnable Crawling: An Efficient Approach to Topic-specific Web Resource Discovery, in Proceedings of the 2nd International Symposium on Communications and Information Technology (ISCIT), 2002.
[9] P. De Bra, G.-J. Houben, Y. Kornatzky, R. Post, Information retrieval in distributed hypertexts, in Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[10] M. Hersovici, M. Jacovi, Y.S. Maarek, D. Pelleg, M. Shtalheim, and S. Ur, The Shark-Search algorithm - an application: tailored Web site mapping, in Proceedings of the 7th World-Wide Web Conference, Brisbane, Australia, 1998.
[11] M. Ehrig, A. Maedche, Ontology-Focused Crawling of Web Documents, in Proceedings of the ACM Symposium on Applied Computing, 2003.
[12] K. Yang, Combining text- and link-based retrieval methods for Web IR, in The Ninth Text REtrieval Conference (TREC 9), 2001.
[13] S. Raghavan, H. Garcia-Molina, Crawling the hidden web, in Proceedings of VLDB '01, pp. 129-138, 2001.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: Bringing order to the web, 1998.
[15] K. Somboonviwat, M. Kitsuregawa, and T. Tamura, Simulation Study of Language Specific Web Crawling, in Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p. 1254, 2005.
[16] B. Novak, A Survey of Focused Web Crawling Algorithms, SIKDD 2004 Multi-Conference IS 2004, pp. 12-15, 2004.
[17] K. Somboonviwat, T. Tamura, and M. Kitsuregawa, Finding Thai web pages in foreign web spaces, in ICDE Workshops, p. 135, 2006.
[18] G. Botha and E. Barnard, Two approaches to gathering text corpora from the World Wide Web, in Proceedings of the 16th Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, p. 194, November 2005.
[19] C. Castillo, Effective Web Crawling, Ph.D. Thesis, University of Chile, Department of Computer Science, 2004.
[20] W.B. Cavnar, J.M. Trenkle, N-gram-based text categorization, in Symposium on Document Analysis and Information Retrieval, Las Vegas, pp. 161-175, 1994.
[21] http://www.cwr.cl/projects/WIRE/, Oct. 2006.