Technical University of Crete

Department of Electronic and Computer

Engineering

DESIGN AND EVALUATION OF TOPIC

DRIVEN FOCUSED CRAWLERS

FOR THE WORLD WIDE WEB

By

BATSAKIS SOTIRIOS

A Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Computer Engineering

Chania, November 2007


Design and evaluation of topic driven

focused crawlers for the World Wide Web

Batsakis Sotirios

Abstract

Focused crawlers are programs designed to browse the Web and download pages on a specific topic. They are used for answering user queries or for building digital libraries on a topic specified by the user. They are divided into classic, semantic and learning focused crawlers. Classic focused crawlers estimate the relevance of Web pages to the topic by computing the similarity of Web pages with a user-provided list of keywords that describe the topic of interest. Semantic crawlers are a variation of classic focused crawlers that use conceptual relations between terms (e.g. retrieved from an ontology) for estimating the relevance of a Web page to the topic. Learning crawlers employ a training process that guides the crawler towards pages related to the topic.

This work addresses issues related to the design and implementation of classic, semantic and learning focused crawlers. Several variants of classic focused crawlers relying upon web page content and link anchor text for estimating the relevance of web pages to a given topic are examined and implemented. A novelty of this work is the introduction of a new category of semantic crawlers making use of WordNet as the underlying ontology for obtaining terms conceptually related (but not necessarily lexicographically similar) to the topic. Learning crawlers based on the Hidden Markov Model (HMM), which learn not only the content of relevant pages but also paths leading to relevant pages following a certain number of routing hops, are examined as well. An additional contribution of this work is the introduction of a new category of hybrid crawlers combining the strengths of both classic and learning focused crawlers.

The crawlers referred to above are all implemented and a comparative analysis of their performance is presented. All crawlers achieve their maximum performance when a combination of web page content and anchor text is used for assigning download priorities to web pages. Semantic similarity methods combined with a general purpose ontology source such as WordNet do not actually improve performance, except for the implementation that restricts semantic similarity to synonym terms. Hybrid crawlers improved the performance of state of the art HMM crawlers, yielding very promising results.


Contents

Chapter 1. Introduction
    1.1 Background
    1.2 Present work
    1.3 Contribution of the current thesis
    1.4 Thesis outline

Chapter 2. Related Work
    2.1 Introduction
    2.2 Non Focused Crawlers
    2.3 Classic Focused Crawlers
    2.4 Semantic Crawlers
    2.5 Learning Crawlers
    2.6 Summary

Chapter 3. Crawler Design
    3.1 Introduction
    3.2 Classic Crawlers
        3.2.2 Best First Crawler with anchor text similarity
        3.2.3 Best First Crawler with page content and anchor text
    3.3 Semantic Crawlers
        3.3.1 Ehrig Crawler
        3.3.2 SSRM Crawler
        3.3.3 Semantic Crawler with synonym set expansion
    3.4 Learning Crawlers
        3.4.1 Hidden Markov Model Crawler
        3.4.2 Hybrid Crawlers
    3.5 Summary

Chapter 4. Experimental Results
    4.1 Introduction
    4.2 Performance measures
    4.3 Experiment setup
    4.4 Classic Focused Crawlers
    4.5 Semantic Crawlers
    4.6 Learning Crawlers
    4.7 Discussion

Chapter 5. Conclusions and future work

References


Chapter 1. Introduction

The World Wide Web is a huge information source with billions of web pages on every conceivable subject. General purpose search engines such as Google [5], Yahoo [7], MSN [8] and Ask [9] have appeared in order to assist users in finding information on the Web. These search engines are very complicated and sizable systems [1, 2], but they do not achieve full coverage of the Web. Google achieves up to 76% and Yahoo up to 69% coverage, while other search engines index an even smaller percentage of the entire Web [3]. Information searches issued through Web search engines are not propagated over the Web in real time. Instead, search engines index, analyze and categorize Web information accumulated locally in data repositories, and this information is then used for answering user queries. The general purpose search engine approach effectively addresses the need of the end user to find specific information in real time.

Crawlers (also known as Robots or Spiders [20]) are tools for assembling information from the Web locally. Focused crawlers in particular have been introduced for satisfying the need of individuals (e.g. domain experts) or organizations to create and maintain digital libraries on a subject locally, or for answering complicated queries (for which a web search engine would yield limited or no satisfactory results). Typical requirements of such application users are high quality, up-to-date results, while minimizing the amount of resources dedicated to the search task. Focused crawlers download as many pages relevant to the subject as they can, while keeping the amount of irrelevant data downloaded to a minimum [30]. Besides the creation of specialized digital libraries, applications of focused crawlers also include guiding intelligent agents on the Web for locating specialized information (e.g. flight schedules and ticket prices for a voyage planning agent). As the importance and the size of the Web grow, so does the importance of focused crawlers.

1.1 Background

Crawlers are given a starting set of web pages (seed pages) as their input, extract the outgoing links appearing in the seed pages and determine which links to visit next based on certain criteria. Next, the web pages pointed to by these links are downloaded, and those satisfying certain selection criteria are stored in a local repository. Crawlers continue visiting Web pages until a predefined number of pages have been downloaded or until local resources (such as storage) are exhausted.

The crawlers used by general purpose search engines retrieve Web pages massively regardless of their topic. Methods for implementing such crawlers include:

a) Breadth First Crawlers: The outgoing links from the given set of pages are extracted and inserted in a First In First Out (FIFO) queue, and their corresponding web pages are downloaded. The process continues similarly with the new pages.

b) Page importance Crawlers: They assign higher visit priority to web pages (i.e. to their corresponding URLs) linked to from more important pages. Page importance estimation criteria for assigning priorities to extracted URLs include Backlink count (i.e. number of web pages containing links to a given page) [22] and PageRank (the importance estimation method used in the Google search engine) [6].
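The two non focused strategies above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the in-memory link graph TOY_WEB and the function names are hypothetical stand-ins for real page downloads, and the stopping rule is simply a fixed page budget.

```python
from collections import deque

# Hypothetical toy link graph standing in for the Web: page -> outgoing links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": [],
    "d": ["a"],
}


def breadth_first_crawl(web, seeds, max_pages):
    """Visit pages in FIFO order until max_pages have been downloaded."""
    frontier = deque(seeds)  # FIFO queue of URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, so no page is visited twice
    downloaded = []
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        downloaded.append(url)  # stands in for downloading and storing the page
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


def backlink_counts(web):
    """Count, for every page, how many distinct pages link to it."""
    counts = {page: 0 for page in web}
    for page, links in web.items():
        for target in set(links):
            counts[target] = counts.get(target, 0) + 1
    return counts


order = breadth_first_crawl(TOY_WEB, ["seed"], max_pages=4)
counts = backlink_counts(TOY_WEB)
```

A page importance crawler could use values such as those in counts as visit priorities instead of the plain FIFO order.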

Although simple, Breadth First Crawlers achieve good performance (measured as the average quality of downloaded pages using the PageRank criterion) [19], and are effective for implementing non-focused crawlers. The major disadvantage of Breadth First Crawlers (and of the other non topic driven crawlers) is that they use only the link structure of the web and not web page content in assigning visit priorities to URLs; consequently they fail to focus on pages on a topic. Because pages on a specific topic are a minor fraction of the overall Web, crawling on that topic using non focused crawlers will result in downloading a large number of irrelevant pages, thus quickly exhausting the available resources. Therefore building a specialized digital library calls for focused crawlers.

Focused crawlers work by combining both the content of the retrieved Web pages and the link structure of the Web for assigning higher visiting priority to pages relevant to the topic. They are divided into the following categories:

a) Classic Focused Crawlers [26] take as input a user query that describes the topic and a set of starting URLs (seeds). The crawling starts from the user provided seed URLs. The crawlers assign a priority value to visited pages according to their relevance to the topic. The web pages are ordered by relevance and the crawlers proceed by visiting the most relevant web pages first. The most common criterion for relevance estimation between a retrieved page and a user query is the similarity between the text of the visited page and the query (topic). Typically this is computed using a text similarity model such as the Boolean or the Vector Space Model [12]. Focused crawlers using the Vector Space Model for relevance estimation (Best First Crawlers [25]) are the most effective classic focused crawling method so far [26]. Existing work on classic focused crawlers is presented in section 2.3. Our proposed variants and implementations of classic focused crawlers are discussed in section 3.2.

b) Semantic Crawlers are a variation of classic focused crawlers. Page visit priority is assigned to pages using their content and by applying semantic criteria for computing page-to-topic relevance. A page and the query can be relevant if they share conceptually similar (but not necessarily lexically similar) terms. Conceptual relations between terms are defined using an underlying topic specific or general purpose ontology. Thus, semantic crawlers differ from classic focused crawlers in the way content relevance is computed. To the best of our knowledge semantic crawlers have not been compared with state-of-the-art classic focused crawlers such as those referred to above, nor have they been combined with modern semantic similarity methods (such as those presented in [11]) so as to achieve their full potential. The present work addresses all these issues (section 3.3).

c) Learning Crawlers [33] apply a training process for assigning visit priorities to Web pages and for guiding the crawling process. They are characterized by the way relevant web pages, or paths through web links for reaching relevant pages, are learned by the crawler (typically by machine learning or other processes) so that the crawler can distinguish between relevant and non relevant pages. Building upon this idea, a number of approaches for learning Web pages relevant to the topic have appeared in the literature and include:

1. Approaches based on machine learning: The crawler is supplied with a training set consisting of relevant and non relevant Web pages which is used to train the learning crawler [33, 34]. During crawling higher visit priority is assigned to web pages classified as relevant to the topic.

2. Approaches that take into account not only the page content and the corresponding classification of web pages as relevant or non relevant to the topic, but also the link structure of the Web and the probability that a given page (which can be non relevant to the topic) will lead to a relevant page within the minimum number of steps (hops). Methods based on Context Graphs [31] and Hidden Markov Models (HMM) [16] are examples of this category of methods. Section 2.5 contains a detailed description of these methods and Section 3.4 presents the enhancements proposed in this work.

3. Hybrid methods that combine learning crawlers with ideas of classic focused crawlers [35]. Our work focuses on hybrid crawlers and proposes an approach that combines the strengths of classic focused crawlers (variations of Best First Crawlers) with Hidden Markov Models for learning not only how to distinguish between relevant and non relevant Web pages based on content, but also how to guide the search for such relevant Web pages through a sequence of routing hops between Web pages (sometimes through non relevant pages). This method is described in section 3.4 and the experimental results obtained (Section 4.6) indicate that it is a very effective crawling method.
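The notion of a page leading to a relevant page within a number of hops, used by the learning crawlers above, can be made concrete with a short Python sketch. This is only an illustration on a hypothetical toy graph (LINKS and hops_to_relevant are assumed names); it computes exact hop distances on a fully known graph, whereas a learning crawler has to estimate such information from training data.

```python
from collections import deque

# Hypothetical toy graph: page -> outgoing links.
LINKS = {
    "home": ["news", "about"],
    "news": ["sports"],
    "about": [],
    "sports": ["target"],
    "target": [],
}


def hops_to_relevant(links, relevant):
    """Return the minimum number of link hops from each page to a relevant page."""
    # Reverse the edges so a single BFS from the relevant pages reaches
    # every page that can lead to them.
    reverse = {page: [] for page in links}
    for page, outs in links.items():
        for out in outs:
            reverse.setdefault(out, []).append(page)
    dist = {page: 0 for page in relevant}
    queue = deque(relevant)
    while queue:
        page = queue.popleft()
        for parent in reverse.get(page, []):
            if parent not in dist:
                dist[parent] = dist[page] + 1
                queue.append(parent)
    return dist


distances = hops_to_relevant(LINKS, ["target"])
```

Pages that never reach a relevant page (here "about") are simply absent from the result; pages such as "news" are non relevant themselves but lie on a path to the relevant page.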

Crawlers
    Non topic oriented crawlers
    Focused crawlers
        Classic focused crawlers
        Semantic crawlers
        Learning crawlers

Fig. 1: Crawler Classification

1.2 Present work

This work deals with the design and evaluation of focused crawlers. State of the art approaches for building topic driven focused crawlers are considered, including classic, semantic and learning crawlers. Several variants of these approaches are also proposed and evaluated. The emphasis of this work is on hybrid crawlers combining text and link information for reaching promising pages on the topic of interest faster.

The first crawler implemented is the Breadth First Crawler. This is a classic non topic-oriented crawler which is used as a reference in all comparisons with focused crawlers. Several variants of the Best First Crawler [25] are also implemented and evaluated. The Best First Crawler works by estimating the relevance of the retrieved page to the user query (both represented using term vectors) using the Vector Space Model (VSM) [12]; then it visits the links extracted from the most relevant page. A URL can be represented either by the term vector of the Web page it was extracted from, or by the term vector of its corresponding anchor text (the text that appears on the link pointing to that URL). All solutions (using page content, anchor text or their combination) are implemented and evaluated. These methods are described in section 3.2.
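The Best First strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the PAGES dictionary is a hypothetical stand-in for downloaded pages, term vectors use raw term frequencies (a real system would typically use tf-idf weights), and the use_anchor flag switches between scoring a URL by the page it was extracted from and scoring it by its anchor text.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy pages: url -> (page text, {outgoing url: anchor text}).
PAGES = {
    "seed": ("travel portal", {"p1": "cheap flight tickets", "p2": "football news"}),
    "p1": ("flight tickets and flight schedules", {"p3": "flight schedules"}),
    "p2": ("football scores", {}),
    "p3": ("airline flight schedules", {}),
}


def term_vector(text):
    """Raw term-frequency vector of a text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_first_crawl(pages, seeds, query, max_pages, use_anchor=False):
    """Always visit the URL with the highest estimated relevance next."""
    q = term_vector(query)
    frontier = [(0.0, url) for url in seeds]  # heapq is a min-heap: store -score
    heapq.heapify(frontier)
    seen, visited = set(seeds), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        text, links = pages[url]
        page_score = cosine(term_vector(text), q)
        for target, anchor in links.items():
            if target in seen or target not in pages:
                continue
            seen.add(target)
            # Score the URL by its anchor text or by the page it came from.
            score = cosine(term_vector(anchor), q) if use_anchor else page_score
            heapq.heappush(frontier, (-score, target))
    return visited


by_page = best_first_crawl(PAGES, ["seed"], "flight schedules", max_pages=3)
by_anchor = best_first_crawl(PAGES, ["seed"], "flight schedules", 3, use_anchor=True)
```

In both modes the crawl budget is spent on the flight-related pages and the football page is left unvisited.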

The second category of methods includes Semantic Crawlers that estimate the conceptual (semantic) relevance of a Web page to the query. The method by Ehrig et al. [13] combines focused crawlers and semantic relations from an ontology (in [13] topic specific ontologies were used) for assigning visit priorities to pages. In our implementation of semantic crawlers, term vectors are enhanced with synonyms and semantically similar terms from WordNet [4] (thus making our implementation the first general purpose semantic crawler implementation). Topic relevance can then be computed by VSM [12], by the Semantic Similarity Retrieval Model (SSRM) [14] or by the method of Mihalcea et al. [15].
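The idea of enhancing term vectors with synonyms can be sketched as follows. This is a simplified illustration only: the SYNONYMS table is a hypothetical stand-in for WordNet synsets (no real WordNet data or API is used), the weight of 0.5 given to added synonyms is an assumed value, and the overlap score is a deliberately crude substitute for VSM or SSRM.

```python
from collections import Counter

# Hypothetical synonym table standing in for WordNet synsets.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "automobile": {"car", "auto"},
    "auto": {"car", "automobile"},
}


def expand_with_synonyms(terms, synonyms, weight=0.5):
    """Add each term's synonyms to the vector with a reduced (assumed) weight."""
    vector = Counter()
    for term in terms:
        vector[term] += 1.0
        for syn in synonyms.get(term, ()):
            vector[syn] += weight
    return vector


def overlap_score(query_vec, page_terms):
    """Sum of the query weights of the terms that occur in the page."""
    page = set(page_terms)
    return sum(w for term, w in query_vec.items() if term in page)


expanded = expand_with_synonyms(["car", "rental"], SYNONYMS)
plain = Counter({"car": 1.0, "rental": 1.0})
page = ["automobile", "rental", "offers"]
expanded_score = overlap_score(expanded, page)
plain_score = overlap_score(plain, page)
```

The page never mentions "car", so the plain query only matches "rental", while the expanded query also gives partial credit for "automobile".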

Our proposed approach to Learning Crawlers is influenced by work on HMM Crawlers [16, 18] for learning paths leading to relevant pages in addition to the content of the desired web pages. The user of an HMM Crawler provides a training set of pages (both relevant and non relevant to the topic of interest). These pages are clustered according to their content. Transition probabilities between the resulting clusters, representing relevant or non relevant pages (leading to relevant ones), are computed and are used to estimate, given the cluster to which a Web page is assigned, the probability that it will lead to relevant pages. The higher this probability is, the higher the visit priority given to the page's extracted links will be. K-Means [47] and X-Means [17] can be applied for the clustering of Web pages. K-Means clustering is an algorithm for grouping objects into K groups based on their attributes/features (K is a predefined positive integer). The grouping is done by minimizing the sum of squares of distances between the data and the corresponding cluster centroid. X-Means is an extension of K-Means with dynamic estimation of the number of clusters depending on the data. In this work focused crawlers based on both clustering approaches are implemented and evaluated.
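The K-Means procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the clustering code used in this work: points are small tuples rather than page term vectors, and seeding the centroids with the first K points is an assumption made only to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; the first k points seed the centroids (assumed choice)."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


def within_cluster_ss(points, centroids, assignment):
    """Sum of squared distances of points to their centroid (the K-Means objective)."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))


POINTS = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids, assignment = kmeans(POINTS, k=2)
objective = within_cluster_ss(POINTS, centroids, assignment)
```

X-Means would additionally search over the value of k instead of taking it as a fixed input.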

Based on the HMM Crawler, Hybrid Crawlers are proposed that combine classic focused crawlers, for assigning priorities to URLs based on topic relevance, with learning crawlers, for learning access paths for reaching relevant pages (possibly through non relevant ones). Two hybrid crawlers combining HMM with page text, or with both page and anchor text, are implemented and evaluated. Our proposed approach to hybrid crawlers is presented in section 3.4.

The crawlers referred to above (and their variations) are all implemented and their performance is compared based on results obtained from the web using several different topics and seed (starting) pages. Section 4 presents a comparative study of the performance of all crawler variants by category along with a critical analysis of their performance.

1.3 Contribution of the current thesis

The contributions of this work are summarized below:

a) This thesis presents a critical evaluation of state of the art approaches to Web crawling, including Classic, Semantic and Learning Focused Crawlers. To our knowledge a similar evaluation has not appeared in the literature before.

b) It proposes several variants of existing crawling methodologies based on recent semantic relevance estimation methods and compares their performance with classic focused crawling methods.

c) It proposes a novel hybrid approach to learning crawling, combining classic focused crawlers for assigning priorities to URLs with ideas from learning crawlers for learning paths for reaching web pages relevant to the topic.

1.4 Thesis outline

The work in this thesis is organized as follows: Related work on focused crawling is presented in Section 2. It is organized in six subsections; the first is the introduction, the second subsection (2.2) presents non topic driven crawlers, the third subsection (2.3) classic implementations of focused crawlers, the fourth subsection (2.4) the preliminary related work on semantic crawlers, the fifth subsection (2.5) presents previous work on Learning Crawlers and the sixth is a summary of the above.

Issues related to the design and implementation of Web crawlers are presented in section 3. Subsection 3.1 is an introduction to the topic, subsection 3.2 provides a detailed description of the classic crawlers implemented in this work and subsection 3.3 deals with issues related to the design of semantic crawlers. In subsection 3.4 particular emphasis is given to learning crawlers and to the subsequent design of hybrid crawlers.

Section 4 provides a description of the experimental results. Subsection 4.1 presents the purpose of the experiments; in the second part (subsection 4.2) the performance measures used to evaluate the crawlers are described. The experimental setup is discussed in subsection 4.3. Experimental results on Classic Crawlers are presented in subsection 4.4, followed by results obtained by semantic and learning crawlers in subsections 4.5 and 4.6 respectively. Subsection 4.7 presents a critical analysis of the performance of the various crawling methods considered in this work. Finally, conclusions and issues for further research are discussed in Section 5.

CHAPTER 2. RELATED WORK

Chapter 2. Related Work

2.1 Introduction

Related work on crawlers includes contributions regarding both classic (non-topic oriented) and focused (topic-oriented) crawlers. Existing work on the design and implementation of non focused crawlers and of focused (classic, semantic and learning) crawlers proposed in the literature is presented in this chapter.

Classic non focused crawlers (e.g. crawlers used by web search engines for assembling web pages into local repositories) download Web pages massively, regardless of content, in order to create vast page repositories. Focused crawlers, on the other hand, are more selective, downloading only pages related to a known (user provided) topic. Issues related to the design and implementation of classic as well as of focused crawlers are discussed in the following and include:

a) Search strategy: The crawler can browse the web in a breadth first order or select links to follow using importance estimation criteria. Focused crawlers assign visiting priorities to pages according to the relevance of the page to a topic specified by a user.

b) Refreshing policy: Due to the dynamic nature of the Web, pages must be revisited in order to keep page repositories up-to-date. The optimal page refresh policy, one that keeps page repositories up-to-date without unnecessarily downloading pages that are not out-dated, is a very important issue in crawler design [21]. Also, satisfying the conflicting demands for a high downloading rate without putting excessive load on the visited Web sites is a major concern when designing a crawler for a search engine.

c) Synchronization: Crawlers used by commercial search engines use multiple parallel processes that massively retrieve Web pages, regardless of their topic. These processes must be synchronized in order to avoid duplicated page downloading [20].

2.2 Non Focused Crawlers

Typically, non focused crawlers are used by general purpose search engines for assembling Web information locally. Methods for implementing such crawlers include:

a) Breadth First Crawler: After downloading the initial pages (called seed pages), the outgoing links extracted from these pages are put in a FIFO queue. The links extracted first point to pages that are given the highest priority for downloading and further crawling. Breadth first crawling is one of the most commonly used crawling approaches for assembling Web content locally for use by Web search engines. GoogleBot [5], Slurp [7], MSNBot [8] and Teoma [9] are examples of crawler implementations used by commercial search engines. Page refresh policy, synchronization, and optimal downloading rate are important issues here [20, 21]. Technical issues such as the supported file formats, file size limitations and the visiting policy are also of great importance. Breadth first crawlers are capable of crawling a large part of the Web [3]. The downloaded pages are then analyzed (e.g. by content, type), indexed and subsequently stored in data repositories composed of thousands of computers and Terabytes of data [1, 2]. This approach requires huge resources which are available only to large companies or organizations such as Google or Yahoo.

Crawlers such as Mercator [45] and Larbin [10] are examples of Breadth First Crawlers which are freely available to programmers for testing and system development. When limited resources are available they can crawl a small part of the indexed web and retrieve web content for further processing. Breadth first crawlers yield high quality pages [19] but aren't topic oriented.

b) Page importance Crawlers: They assign higher visit priority to URLs retrieved from more important pages. Typically, page importance for assigning priorities to extracted URLs is computed by Backlink count (where higher priority is given to pages pointed to by many other Web pages) and PageRank [6]. Other criteria, such as the position of the page within the Web site hierarchy (e.g. low depth, as indicated by fewer or no slashes in the page URL, leads to higher priority) or the number of outgoing links of that page (Outlink count), can be used as well. Cho et al. [22] provide a survey of this type of crawler. Page importance criteria have been shown to improve the quality of downloaded pages [22].
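The two strategies above can be illustrated with a minimal Python sketch. It is not taken from any of the cited crawlers: the in-memory `TOY_WEB` link graph and the function names are hypothetical stand-ins for real page downloads and link extraction. The first function visits pages in FIFO (breadth first) order, and the second computes the Backlink count that a page importance crawler could use as a visit priority.

```python
from collections import deque


def breadth_first_order(link_graph, seeds, max_pages=None):
    """Return pages in the order a FIFO frontier would visit them."""
    frontier = deque(seeds)   # FIFO queue of URLs waiting to be visited
    seen = set(seeds)         # avoids queueing the same URL twice
    visited = []
    while frontier and (max_pages is None or len(visited) < max_pages):
        url = frontier.popleft()
        visited.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


def backlink_counts(link_graph):
    """Count, for every page, how many distinct other pages link to it."""
    counts = {}
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical toy Web: each key is a page, each value its outgoing links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "b"],
    "b": ["c"],
    "c": ["seed"],
}

print(breadth_first_order(TOY_WEB, ["seed"]))
print(backlink_counts(TOY_WEB))
```

In this toy graph, pages "b" and "c" each have two backlinks, so a Backlink count crawler would prefer them over "a", whereas the breadth first crawler simply follows extraction order.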

2.3 Classic Focused Crawlers

Crawlers used by search engines (such as those referred to in section 2.2) are designed to maximize the total number and probably the quality of downloaded web pages. Topic oriented or Focused Crawlers take as input a user query (Classic Focused Crawlers), or example pages provided by the user as a training set (Learning Crawlers), and focus the crawling process on pages relevant to the topic. Focused crawlers keep the overall number of downloaded Web pages to a minimum while maximizing the percentage of relevant pages.

The performance of a focused crawler depends on the selection of good starting pages (seed pages). Good seed pages can be either web pages relevant to the query topic or pages from which relevant pages can be accessed through a small number of routing hops. For example, if the topic is scientific publications, a good seed page can be the publications page of an author, lab or department or, alternatively, the web page of the author, lab or department respectively (although the latter may contain no publications at all, it is known to lead to pages containing publications). Seed pages should also be important (where importance is defined using link analysis methods such as HITS [46] and PageRank [6]). The rationale behind this requirement is that important Web pages, when used as starting pages (seeds) for crawling, may guide the crawling process to other important Web pages faster, thus improving the quality of the results. The seed pages are often selected by submitting the query that describes the topic of interest to a search engine and by using the top search engine results.

Early approaches to Focused Crawling include, among others, the Fish-Search algorithm [23]. The basic idea of the algorithm is that when several pages are candidates for link browsing and downloading, priority is given to pages relevant to the topic (a page is labeled as relevant if it contains the query text). Every candidate page is assigned a Boolean value derived by a simple lexicographic rule (and it is downloaded by a separate application thread). Threads corresponding to relevant pages create new threads for their outgoing links, while threads corresponding to irrelevant pages are stopped. This work examined the separate use of the anchor text in assigning priorities to URLs.

The main disadvantage of the Fish-Search algorithm is that priorities take Boolean values; therefore all relevant pages are assigned the same priority. The Shark-Search algorithm [24] is a direct successor of Fish-Search, where the Vector Space Model (VSM) [12] is used for assigning non Boolean priority values to candidate pages. This improved the results of crawling [24]. The Vector Space Model has been the basis of classic focused crawlers ever since.

According to VSM, documents are represented as term vectors and the weight $w_{ij}$ of a term j in document i is computed as:

$$w_{ij} = tf_{ij} \cdot idf_j, \qquad tf_{ij} = \frac{f_{ij}}{\max f_i}, \qquad idf_j = \log\frac{N}{n_j} \qquad (1)$$

where $tf_{ij}$ is the term frequency of term j in document i, $idf_j$ is the inverse document frequency of term j, $f_{ij}$ is the frequency of appearance of term j in document i, $\max f_i$ is the maximum frequency of all terms in document i, $N$ is the total number of documents and $n_j$ is the number of documents containing term j.
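As an illustration of equation (1), the following Python sketch computes the term weights for a small collection of tokenized documents. The function name and the toy documents are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # n_j: number of documents containing term j
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        freq = Counter(tokens)                      # f_ij
        max_freq = max(freq.values()) if freq else 1  # max f_i
        vectors.append({
            term: (f / max_freq) * math.log(n_docs / doc_freq[term])
            for term, f in freq.items()
        })
    return vectors


docs = [["web", "crawler", "web"], ["web", "search"]]
for vector in tf_idf_vectors(docs):
    print(vector)
```

A term that occurs in every document (here "web") receives weight 0, because its idf is log(1) = 0.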

Recent approaches to focused crawling include InfoSpiders and the Best-First Crawler [25]. InfoSpiders use Neural Networks, while Best First Crawlers use text similarity by VSM for assigning priority values to candidate pages.

Given a query and a Web page, the priority of the Web page is computed by Best First Crawlers as the cosine similarity between their document vectors, where $w_{qi}$, $w_{pi}$ are the term weights of the query and the web page respectively:

$$similarity(query, page) = \frac{\sum_{i=1}^{n} w_{qi} \cdot w_{pi}}{\sqrt{\sum_{i=1}^{n} w_{qi}^{2}} \, \sqrt{\sum_{i=1}^{n} w_{pi}^{2}}} \qquad (2)$$

where $n$ is the total number of terms in the query and page content.
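Equation (2) can be sketched in Python as follows, assuming that the query and the page are given as sparse dictionaries mapping terms to weights (a representation chosen here only for illustration). Returning 0 when either vector is empty is also an assumption, made to avoid a division by zero.

```python
import math


def cosine_similarity(query_weights, page_weights):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    shared = set(query_weights) & set(page_weights)
    dot = sum(query_weights[t] * page_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    p_norm = math.sqrt(sum(w * w for w in page_weights.values()))
    if q_norm == 0 or p_norm == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (q_norm * p_norm)


query = {"web": 1.0, "crawler": 1.0}
print(cosine_similarity(query, {"web": 1.0}))
print(cosine_similarity(query, {"football": 2.0}))
```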

In fact, the Best First Crawler is a simplified version of the Shark-Search crawler: it doesn't combine link anchor text and the scores of previously visited pages into the page priority function, as Shark-Search does. Also, Best First Crawlers use only term frequency (tf) vectors for computing topic relevance. The use of inverse document frequency (idf) values (as suggested by VSM) in the case of focused crawling is problematic, since this might require recalculation of all term vectors at every crawling step. In addition, idf values are highly inaccurate at the early stages of crawling because of the small number of retrieved documents. Best First Crawlers have been shown to outperform InfoSpiders and Shark-Search, as well as non-focused crawling approaches such as Breadth First and PageRank [26]. Best first crawling is considered to be the most established approach to focused crawling due to its simplicity and efficiency. The N-Best First Crawler is a generalized version of the Best First Crawler: at each step, instead of choosing one Web page for link extraction and downloading of the pages pointed to by these links, the N pages with the highest priority are chosen [27].
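The crawling loop described above can be sketched as follows. This is a simplified illustration and not the implementation of [25] or [27]: the `TOY_WEB` dictionary is a hypothetical in-memory stand-in for HTTP downloads, pages are tokenized by whitespace, and priorities are cosine similarities between term frequency (tf) vectors. Setting `n=1` gives the Best First behaviour, while larger values give the N-Best First variant.

```python
import heapq
import math
from collections import Counter


def cosine(tf_a, tf_b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def n_best_first_crawl(web, seeds, query, n=1, max_pages=10):
    """web maps url -> (text, outgoing_links). Returns the download order."""
    query_tf = Counter(query.lower().split())
    downloaded, seen = [], set()
    frontier = []  # heap of (-priority, url): best page is popped first

    def download(url):
        if url in seen or url not in web or len(downloaded) >= max_pages:
            return
        seen.add(url)
        priority = cosine(query_tf, Counter(web[url][0].lower().split()))
        downloaded.append(url)
        heapq.heappush(frontier, (-priority, url))

    for url in seeds:
        download(url)
    while frontier and len(downloaded) < max_pages:
        # Pick the N downloaded pages with the highest priority ...
        batch = [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]
        # ... and download the pages their links point to.
        for _, url in batch:
            for link in web[url][1]:
                download(link)
    return downloaded


# Hypothetical toy Web used only for this example.
TOY_WEB = {
    "seed": ("web directory", ["sports", "crawl"]),
    "sports": ("football scores", ["more_sports"]),
    "crawl": ("focused web crawler design", ["paper"]),
    "more_sports": ("tennis scores", []),
    "paper": ("focused crawler evaluation", []),
}

print(n_best_first_crawl(TOY_WEB, ["seed"], "focused crawler", n=1, max_pages=4))
```

With a budget of four pages, the links of the relevant page "crawl" are expanded before those of "sports", so "paper" is downloaded and "more_sports" is not. A breadth first crawler with the same budget would have downloaded "more_sports" instead.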

Along the same lines, an approach referred to as “intelligent crawling” [28] suggests combining page content, URL string and statistics about relevant/irrelevant pages and sibling pages for assigning priorities to candidate URLs. These statistics are updated and combined during crawling for guiding the selection of the next links to follow, yielding a highly effective crawling algorithm that learns to crawl without direct user training.

2.4 Semantic Crawlers

Semantic Crawlers are implemented by combining an ontology with semantic similarity measures [14] for detecting topic relevance between retrieved Web pages and user queries. Semantic similarity plays an important role here: it can be used to detect topic relevance by associating terms in a query and the Web page using the ontology, and by assigning a degree of relevance to each such term association.

Ehrig et al. [13] propose the use of topic oriented ontologies for finding pages relevant to the topic of interest. Every term in a Web page is examined and contributes positively to the assigned priority score if it is a query term or if it is semantically related to the user query terms. The following variations for evaluating semantic relations of page terms with query terms were used:

a) If a term is directly connected (distance 1) to a query term, then it is considered relevant (distance is defined as the length of the shortest path connecting two terms represented as vertices in the ontology graph, where edges represent relations between adjacent terms).

b) If a term is close to a query term (distance 2 or less) using only IS-A relations, then it is relevant to the query term.

c) Every page term appearing in the ontology graph is assigned a relevance value depending on its distance from the query terms. The greater the distance, the lower the relevance value will be. Specifically, using a topic specific underlying ontology, the semantic similarity between terms is computed as:

$$Sim(t_1, t_2) = \delta^{\,dist(t_1, t_2)} \qquad (3)$$

where $\delta$ is a decreasing factor (0.5 in this work) and $dist(t_1, t_2)$ is the length of the shortest path connecting terms $t_1$ and $t_2$ in the ontology graph (0 if the terms belong to the same synonym set). The longer the distance between the terms in the graph, the smaller their similarity is. This method is a variation of the shortest path semantic similarity method.
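Variation (c) and equation (3) can be sketched in Python as follows. The ontology is modelled here as an undirected adjacency dictionary, and the small `TOY_ONTOLOGY` fragment is purely illustrative. Assigning similarity 0 to terms that are not connected in the graph is an assumption of this sketch, since that case is not covered by equation (3).

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth first search; returns the number of edges or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def distance_similarity(graph, t1, t2, delta=0.5):
    """Equation (3): delta raised to the shortest path length between the terms."""
    dist = shortest_path_length(graph, t1, t2)
    return 0.0 if dist is None else delta ** dist  # assumption: unreachable -> 0


# Hypothetical ontology fragment (undirected adjacency lists).
TOY_ONTOLOGY = {
    "vehicle": ["craft"],
    "craft": ["vehicle", "aircraft", "vessel"],
    "aircraft": ["craft", "airplane"],
    "vessel": ["craft"],
    "airplane": ["aircraft"],
}

print(distance_similarity(TOY_ONTOLOGY, "airplane", "aircraft"))
print(distance_similarity(TOY_ONTOLOGY, "airplane", "vessel"))
```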

The last approach has the best performance for computing the conceptual similarity between terms and was also used in our work for comparison with other semantic relation methods and state-of-the-art classic focused crawling approaches. Another state-of-the-art term similarity method used in the present work is the Li et al. method [42]:

The semantic similarity between two terms $t_1$ and $t_2$ is computed as a function of the length of the path connecting the terms in the underlying ontology graph and the depth of the terms in the taxonomy:

$$Sim_{Li}(t_1, t_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}} \qquad (4)$$

where $L$ is the shortest path length between $t_1$ and $t_2$, $H$ is the depth of the most specific common concept of $t_1$, $t_2$ in the taxonomy and $\alpha$, $\beta$ are constants ($\alpha = 0.2$ and $\beta = 0.6$ in our implementation).
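Equation (4) can be evaluated with a few lines of Python. The sketch below assumes that the path length L and the depth H have already been obtained from the taxonomy, and the function name is illustrative. It uses the fact that the fraction in equation (4) is the hyperbolic tangent of βH.

```python
import math


def li_similarity(path_length, depth, alpha=0.2, beta=0.6):
    """Equation (4): exp(-alpha * L) * tanh(beta * H)."""
    return math.exp(-alpha * path_length) * math.tanh(beta * depth)


# Similarity decreases with path length and increases with depth.
print(li_similarity(path_length=1, depth=3))
print(li_similarity(path_length=2, depth=3))
print(li_similarity(path_length=2, depth=4))
```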

According to results reported in [14], this method has been proven to be fast and accurate (achieving accuracy up to 82% compared to results obtained by humans).

General purpose taxonomies such as WordNet can also be applied to focused crawling. WordNet is an online lexical reference system developed at Princeton University. WordNet attempts to model the lexical knowledge of a native speaker of English. WordNet can also be seen as an ontology for natural language terms. It contains around 100,000 terms, organized into taxonomic hierarchies. Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets). The synsets are also organized into senses (i.e. corresponding to different meanings of the same term or concept). The synsets (or concepts) are related to other synsets higher or lower in the hierarchy defined by different types of relationships. The most common relationships are the Hyponym/Hypernym (i.e., Is-A relationships), and the Meronym/Holonym (i.e., Part-of relationships). There are nine noun and several verb Is-A hierarchies (adjectives and adverbs are not organized into Is-A hierarchies). Figure 2 illustrates a fragment of the WordNet Is-A hierarchy.

Fig. 2 WordNet hypernym/hyponym synset relations example. (Figure: fragment of the Is-A hierarchy containing the synsets Vehicle; Craft; Aircraft; Vessel, watercraft; Rocket, projectile; Sled, sledge; Spacecraft; Hovercraft; Airplane, aeroplane, plane; Airship; Drone; Glider; Airliner; Amphibian; Jet; Fighter; Bomber; Biplane; Monoplane.)

To the best of our knowledge, a comparative study between semantic and other focused crawling approaches hasn't been reported in the literature before. The implementations in [13] are compared only with a basic focused crawler (assigning each page a simple binary priority value depending on the presence of query terms) rather than with the widely used Best First Crawlers, which make use of VSM for estimating topic relevance [29]. The proposed work deals with exactly this issue and presents a comparative study between classic and several variants of semantic crawling approaches (including Ehrig et al. [13]).

2.5 Learning Crawlers

Early approaches to developing learning crawlers applied a learning classifier (that relied on web taxonomies such as Yahoo [7]) for distinguishing between relevant and non relevant pages [30]. Every page containing links that are candidates for downloading is classified as relevant or not relevant and assigned a priority value according to that classification (higher priority was assigned to relevant pages). This work is considered to be one of the first contributions in the field of Learning Crawlers. Recent approaches involving machine learning methods for focused crawling include decision trees [34], Neural Networks and Support Vector Machines [33].

Building upon similar ideas, the crawler in [31] introduced the concept of Context Graphs: for every relevant page, a search engine's backlink service is applied to retrieve its predecessor pages. Then, a classifier is built according to the distance of pages (Level) to the relevant pages set. Download priorities are estimated using this classifier: the closer a candidate page is to a relevant one, the greater the priority of that page will be.

Fig. 3 Context graph: Pages are classified according to their distance (Level) from target pages. (Figure legend: Target page, Level 1 page, Level 2 page.)

An extension to the Context Graph method was the Hidden Markov Model (HMM) crawler [16]. The user browses the Web and indicates whether a downloaded page is relevant to the topic or not. The visiting sequence is also recorded and is used for training the algorithm to identify paths leading to relevant pages. The downloaded pages are clustered and a Hidden Markov Model [44] is created: every page is characterized by two states, (a) the visible state, corresponding to the cluster that the page belongs to according to its content, and (b) the hidden state, corresponding to the distance of the page from a relevant page (0 if the page is a target/relevant page). During crawling, every page is assigned a value equal to the probability that, given the cluster the page belongs to, crawling will lead to a target page. This probability is computed using the Hidden Markov Model.

Specifically, all pages are represented by their term vectors according to VSM and they are clustered. Thus every page in the training set is characterized by the cluster it belongs to and by its distance (level) from a target page (Fig. 4).

Fig. 4 Representation of the HMM training set using distance from target pages (Level) and clusters of pages in the training set. (Figure legend: L0 page, L1 page, L2 page, L3 page.)

In figure 4, green pages indicate target or level 0 pages, yellow pages are level 1 pages (1 link distance from target pages), orange pages are level 2 pages (2 links away from target pages), and red pages are 3 or more links away from target pages. Labels on pages represent the cluster the page belongs to (e.g. the C0, C1 and C2 labels correspond to Cluster 0, Cluster 1 and Cluster 2 respectively). Notice that pages within the same cluster can belong to different levels, and that pages in the same level can belong to different clusters. Every page is characterized by its level or hidden state $L_i$, where i is the level, and by the cluster $C_j$ it belongs to (or visible state). That set of pages with hidden and visible states forms a Hidden Markov Model [44]. The following summarizes the parameters and notation used by the HMM crawler:

I. Initial probability matrix:

$$\pi = \{P(L_0), \ldots, P(L_{states-1})\}$$

where $states$ denotes the number of hidden states and $P(L_i)$ represents the probability of being at hidden state i at time 1. This probability is estimated as the percentage of pages in the training set that have hidden state i.

II. Transition Probabilities Matrix A:

$$A = [a_{ij}], \qquad 0 \le i < states, \; 0 \le j < states$$

where $a_{ij}$ represents the probability of being at state $L_j$ at time t+1 if at state $L_i$ at time t. This probability is estimated by counting the corresponding transitions from state $L_i$ to $L_j$ in the user training set, and by normalizing by the overall number of transitions from state $L_i$.

III. Emission Probabilities Matrix B:

$$B = [b_{ij}], \qquad 0 \le i < states, \; 0 \le j < clusters$$

where $b_{ij}$ represents the probability of being at cluster $C_j$ given state $L_i$ and $clusters$ is the number of clusters of pages. Probabilities are computed by counting the number of pages in cluster $C_j$ with hidden state $L_i$ and normalizing by the overall number of pages with hidden state $L_i$.
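The counting procedure described in items I to III can be sketched in Python as follows. The input format (a list of user visiting sequences, each a list of `(level, cluster)` pairs) and the function name are assumptions made for this illustration. Rows with no observations are left as zeros, which is also an assumption.

```python
def estimate_hmm_parameters(paths, n_states, n_clusters):
    """paths: list of visiting sequences, each a list of (level, cluster) pairs."""
    state_counts = [0] * n_states
    transitions = [[0] * n_states for _ in range(n_states)]
    emissions = [[0] * n_clusters for _ in range(n_states)]

    for path in paths:
        for level, cluster in path:
            state_counts[level] += 1          # pages per hidden state
            emissions[level][cluster] += 1    # pages per (state, cluster)
        for (level_from, _), (level_to, _) in zip(path, path[1:]):
            transitions[level_from][level_to] += 1

    def normalise(row):
        total = sum(row)
        return [count / total if total else 0.0 for count in row]

    pi = normalise(state_counts)
    A = [normalise(row) for row in transitions]
    B = [normalise(row) for row in emissions]
    return pi, A, B


# Two hypothetical training paths that end at a target (level 0) page.
paths = [[(2, 0), (1, 1), (0, 1)], [(1, 0), (0, 1)]]
pi, A, B = estimate_hmm_parameters(paths, n_states=3, n_clusters=2)
print(pi)
print(A)
print(B)
```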

During crawling, page content is processed and the HMM crawler assigns the page to a cluster using the K-Nearest Neighbors algorithm [43]. Given the page cluster and the Hidden Markov Model parameters (the π, A and B matrices), the probability that the next page visited will be a target page is computed using the Viterbi algorithm [40]. That probability also represents the visit priority of the link. The Viterbi algorithm computes a prediction of the state in the next time step given the sequence of web pages observed thus far. In order to calculate the prediction value, each visited page is associated with values $a(L_j, t)$, $j = 0, 1, \ldots, states-1$. Value $a(L_j, t)$ is the probability that the system is in hidden state $L_j$ at time t, based on observations made thus far. Given the values $a(L_j, t-1)$ of parent pages, the values $a(L_j, t)$ are computed using the following recursion:

$$a(L_j, t) = b_{j c_t} \sum_{i=0}^{states-1} a(L_i, t-1) \cdot a_{ij} \qquad (5)$$

where $a_{ij}$ is the transition probability from state $L_i$ to $L_j$ from matrix A and $b_{j c_t}$ is the emission probability of cluster $c_t$ from hidden state $L_j$ from matrix B. Values $a(L_j, 0)$ at the final recursion step are taken from the initial probability matrix π. Given the values $a(L_j, t)$, the probability that the system will be in state $L_j$ at the next time step is computed as follows:

$$a(L_j, t+1) = \sum_{i=0}^{states-1} a(L_i, t) \cdot a_{ij} \qquad (6)$$

The probability of being at state $L_0$ (relevant page) in the next step is the priority assigned to pages.
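Equations (5) and (6) can be sketched in Python as follows. The recursion sums over the previous hidden states, so it has the form of the HMM forward recursion. The function name and the toy parameters are hypothetical, and the values are left unnormalized, exactly as the equations are written, which is a simplification of this sketch.

```python
def next_step_priority(pi, A, B, observed_clusters):
    """Priority of a link: predicted value of being in state L_0 at the next step."""
    n_states = len(pi)
    a = list(pi)  # a(L_j, 0), taken from the initial probability matrix
    for cluster in observed_clusters:
        # Equation (5): emission of the observed cluster times the transition sum.
        a = [
            B[j][cluster] * sum(a[i] * A[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    # Equation (6): one more transition step without a new observation.
    predicted = [
        sum(a[i] * A[i][j] for i in range(n_states)) for j in range(n_states)
    ]
    return predicted[0]


# Hypothetical model with two hidden states (levels) and two clusters.
pi = [0.5, 0.5]
A = [[0.5, 0.5], [1.0, 0.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
print(next_step_priority(pi, A, B, observed_clusters=[0]))
print(next_step_priority(pi, A, B, observed_clusters=[1]))
```

Because the values are not normalized, they are meaningful as relative priorities among candidate links with observation sequences of the same length, rather than as calibrated probabilities.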

Chakrabarti et al. [32] proposed a two classifier approach. The open directory (DMOZ) [39] Web taxonomy is used to classify downloaded pages as relevant or not, and to feed a second classifier which is trained using these pages. The second classifier is used to evaluate the probability that the given page will lead to a target page. An extensive study of Learning Crawlers and the evaluation of several classifiers used to assign visit priority values to pages is presented in [33]. Classifiers based on Support Vector Machines (SVM) [38] seem to outperform Bayes Classifiers and classifiers based on Neural Networks on that task.

Recent contributions to the field of learning crawling
include Hybrid crawlers [35], which combine ideas from learning
and classic focused crawlers. In [35] a Hybrid Crawler is
proposed: the crawler works by alternating between acting as a
learning crawler guided by genetic algorithms (for learning
the link sequence leading to target pages) and acting as a breadth first
crawler. In our work, we apply a hybrid method for
improving the performance of learning crawlers. However,
instead of alternating between two modes of
operation (Learning or Breadth First crawler), we combine the
page priority functions computed by a Hidden Markov Model
Crawler and by the Best First Crawler in order to
evaluate the overall priority value of a Web page.

2.6 Summary

Related work on focused crawlers includes classic, semantic
and learning approaches. The Best First Crawler and
variations of this method (e.g. N-Best First Crawler) form a
common and effective approach for focused crawling [26].
Semantic crawlers presented in [13] are not well studied, and
a comparison with state of the art classic focused crawlers
such as Best-First has not appeared in the literature before.
Learning crawlers form a distinctive category of focused
crawlers based on a training set provided by the user for
topic description. Learning crawlers based on SVM
classifiers for assigning page visiting priorities achieve good
performance [33], while methods that learn paths leading to
pages relevant to the topic, such as the Context Graph method
[31] and Hidden Markov Model Crawlers [16, 18], are of great
importance. Also, the newly proposed hybrid methods [35] are
a very promising approach to focused crawling.

Chapter 3. Crawler Design

3.1 Introduction

Issues related to the design and implementation of focused
crawlers are discussed next. Given an application (general
purpose Web search engine or topic specific digital library),
the appropriate type of Web crawler has to be determined
first. For the first application type, a breadth first crawler is
a reasonable solution. Focused crawlers (classic, semantic or
learning crawlers) are best suited for the latter application
type.

[Figure 5 panels: Breadth First Crawler; Focused Crawlers]

Green circles denote pages relevant to the topic and arcs denote
links between Web pages. Arrows denote the visit sequence
using different crawlers. Focused Crawlers assign higher
visit priorities to links contained in pages relevant to the
topic.

Fig. 5 Crawler Operation

Fig. 5 demonstrates the search stages of a crawler. Web
pages are denoted by circles (green circles correspond to
pages relevant to the topic at hand) while arcs denote
outgoing links from a page. The crawler retrieves pages from
the Web starting with a seed page shown at the root of the
tree. As discussed in the introduction, the outgoing links
(URLs) of each visited page are placed in a queue from
which the Web page to visit next is selected in some order.
The crawler gets the URL, downloads the page and places the
URLs extracted from the downloaded page in the queue. This
process is repeated until the crawler decides to stop (e.g.
disk space exhausted, time elapsed or the user is satisfied
with the results). Focused crawlers introduce a number of
criteria (e.g. page importance, relevance to topic) for
assigning priorities to Web pages in the queue and for
selecting which page to visit next. Fig. 6 illustrates the
operation stages of a crawler:

[Flowchart stages: User input; Page downloading; Content processing; Priority assignment; Crawling termination criteria satisfied? (Yes / No); Output: Web pages satisfying user needs]

Fig. 6 Overview of Crawler operation

a) Input: Crawlers take as input a number of starting
(seed) URLs and (in the case of focused crawlers) the
topic description. This description can be a list of
keywords for classic and semantic focused crawlers or
a training set for learning crawlers.

b) Page downloading: Pages from the queue are downloaded
in some order. Focused crawlers may decide to
exclude pages not satisfying the topic criteria from
further investigation. Pages are stored locally in a
page repository for further processing.

c) Content processing: The page content is lexically
analyzed and reduced into term vectors (all terms are
reduced to their morphological roots by applying
Porter's stemming algorithm [48] and stop words are
removed). Each term in a vector is represented by its
term frequency-inverse document frequency (tf-idf) weight
according to VSM. The outgoing links of the page are
also extracted and placed in the priority queue.

d) Priority assignment: URLs extracted from
downloaded pages are placed in a priority queue where
priorities are determined based on the type of crawler
and user preferences. They range from simple criteria
such as page importance or relevance to the query topic
(computed by matching the query with page or anchor
text) to more involved criteria (e.g. criteria
determined by a learning process).

e) Expansion: URLs are selected for further expansion
and steps (b)-(e) are repeated until some criteria
(e.g. the desired number of pages have been
downloaded) are satisfied or system resources are
exhausted.
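Steps (a)-(e) can be sketched as a priority-queue loop. The Python sketch below is illustrative only: the `fetch` and `score` callables, the toy in-memory "Web" and the stopping rule (a page count) are assumptions made for the example and are not taken from the implementation described in this work.

```python
import heapq
from itertools import count


def crawl(seeds, fetch, score, max_pages):
    """Priority-driven crawl loop following steps (a)-(e).

    fetch(url) -> (text, [(link_url, anchor_text), ...]) is a hypothetical
    interface supplied by the caller; score(text, anchor) -> float is the
    priority function of the chosen crawler type.
    """
    tie = count()                        # FIFO order among equal priorities
    queue = [(0.0, next(tie), url) for url in seeds]    # (a) seed URLs
    heapq.heapify(queue)
    seen = set(seeds)
    repository = {}                      # local page repository: url -> text

    while queue and len(repository) < max_pages:
        _, _, url = heapq.heappop(queue)             # (e) select next URL
        text, links = fetch(url)                     # (b) download the page
        repository[url] = text
        for link, anchor in links:                   # (c) extract outgoing links
            if link not in seen:
                seen.add(link)
                # (d) heapq is a min-heap, so the priority is negated
                heapq.heappush(queue, (-score(text, anchor), next(tie), link))
    return list(repository)


# Toy in-memory "Web" used only to exercise the loop.
WEB = {
    "seed": ("start page", [("a", "cooking recipes"), ("b", "web crawler")]),
    "a": ("recipes", []),
    "b": ("crawler design", [("c", "focused crawler")]),
    "c": ("focused crawling", []),
}
order = crawl(["seed"], WEB.__getitem__,
              lambda text, anchor: float("crawler" in anchor), max_pages=3)
```

With a constant `score` the tie counter makes the queue first-in first-out, so the same loop behaves as a breadth first crawler.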

All Crawlers follow the above design. The Breadth First Crawler
requires only seed pages as input. Best-First and Semantic
Crawlers take the seed pages and a user query as input, while
Learning Crawlers accept a training set of URLs of pages
instead of a query. Crawlers also differ in the way priorities
are assigned to extracted URLs. This is the most crucial part
in the implementation of focused crawlers.

All Crawlers in this work are implemented in Java [36]
using Eclipse [37]. The downloaded pages must be of
text/html format and their content size must not exceed
100KB. Restrictions are also imposed on connection timeout
and downloading times for performance reasons. Those
restrictions apply to all implemented crawlers. The crawling
process is repeated until the predefined number of pages is
retrieved (Fig. 6). In our experiments this number is set equal
to 1000 Web pages.

3.2 Classic Crawlers

The Breadth First Crawler forms the baseline for
implementing Best First, Semantic and Learning Crawlers. It
is a simple program that gets one or more seed pages as input
and follows the links in a breadth first way until the desired
number of Web pages is downloaded. Fig. 7 illustrates the
interface of the Breadth First crawler implemented. It
accepts one or more seed pages as input. Downloaded pages
are shown below.

Fig. 7 Screenshot of Breadth First Crawler

3.2.1 Best First Crawler with page content criteria

The second classic Crawler (and the first focused one)
implemented is the Best First Crawler using page content
for prioritizing candidate URLs. When a Web page is
downloaded, its content is lexically analyzed and represented
by term vectors. Each term in such a vector is represented by
its tf-idf weight according to VSM [12]. The priority assigned to
a link equals the cosine similarity (Eq. 2) of the page
containing the link and the user query.
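A minimal Python sketch of this priority computation follows. It assumes whitespace tokenization, no stemming, and an optional `idf` dictionary supplied by the caller; these simplifications and all names are assumptions made for the example.

```python
import math
from collections import Counter


def tf_vector(text, stop_words=frozenset()):
    """Term-frequency vector of a text (no stemming in this sketch)."""
    return Counter(w for w in text.lower().split() if w not in stop_words)


def cosine(u, v, idf=None):
    """Cosine similarity of two sparse term vectors, optionally idf-weighted."""
    weight = (lambda t: idf.get(t, 1.0)) if idf else (lambda t: 1.0)
    dot = sum(u[t] * v[t] * weight(t) ** 2 for t in u if t in v)
    norm_u = math.sqrt(sum((c * weight(t)) ** 2 for t, c in u.items()))
    norm_v = math.sqrt(sum((c * weight(t)) ** 2 for t, c in v.items()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = tf_vector("focused web crawler")
page = tf_vector("a focused crawler downloads web pages", stop_words={"a"})
link_priority = cosine(page, query)   # shared by every link found on this page
```

Because the value depends only on the containing page, all links extracted from that page enter the queue with the same priority.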

Notice that using inverse document frequency (idf)
weights can be problematic because idf weights need to be
updated at every crawling step; for this reason idf weights
can be inaccurate at the initial steps of crawling, when the
number of retrieved pages is small [25]. Most Best First
Crawler implementations use only term frequency (tf)
weights. In this work idf weights are provided by the
IntelliSearch Web search engine [41], which holds idf statistics
for English terms. At the next step the link with the highest
priority is selected for downloading.

3.2.2 Best First Crawler with anchor text similarity

The second variation of the Best First Crawler is the Best First
Crawler using anchor text similarity. The anchor text of a
URL is the clickable text that appears on the link inside a
Web page pointing to that URL. In this work we implemented
a variant of the above Best First Crawler which, instead of
page content, uses the anchor text of a URL as the representation of
page content and for assigning download priorities. Notice
that links from the same page may be assigned different
priority values, as opposed to the first implementation (which uses
page text content for assigning priorities), where all links
in the same page are given the same priority. As will be
shown in the results, selection of anchor text for assigning
priority values improved the general performance of the
crawler, using both the harvest ratio and average similarity
criteria (section 4.3).

3.2.3 Best First Crawler with page content and anchor text

The third variation of the Best First Crawler combines the
previous two implementations using page content and link
anchor text respectively. Each URL is assigned a priority
value defined as:

$$priority(i) = \frac{similarity(p_i, q) + similarity(a_i, q)}{2} \qquad (7)$$

where priority(i) is the priority value assigned to link i,
similarity(p_i, q) is the similarity of query q and p_i (the content
of the page where link i is located) and similarity(a_i, q) is
the similarity of the anchor text a_i of link i and query q.

The idea behind the Best First Crawler with page
content only is that a page relevant to the topic is more
likely to point to a relevant page than to a non relevant one.
Thus, the higher the relevance of the page containing the
link, the higher the probability that the link will point to a
relevant page.

The second implementation (Best First Crawler using
anchor text similarity) tries to overcome a disadvantage of
the Best First Crawler with page content only: all links
within a page have the same priority regardless of anchor
text. Anchor text may be regarded as a summary of the
content of the page that the link points to. Therefore it is
reasonable to use this descriptor for assigning priorities to
pages. However, anchor text is not always descriptive of page
contents, and by ignoring the page content useful information
may go unused. For this reason the third Best First Crawler
implementation uses both anchor text and page content.

3.3 Semantic Crawlers

Best First crawlers estimate the relevance between the page
content or anchor text and a user query. There may exist
conceptually related terms in both the query and the page (or
anchor text), indicating relevance to the topic. However, if
these terms are not lexicographically similar their relevance
will be ignored, because VSM computes text similarity as a
function of similarities between identical terms found in the
vectors which are compared. This can be resolved using
ontologies or term taxonomies. In ontologies, conceptually
similar terms are related by virtue of IS-A links. All terms
conceptually similar to the user query terms are retrieved from
the ontology and used for enhancing the description of the
topic (e.g. by adding synonym terms to the topic keywords)
and for computing the similarity between the query and
candidate pages. For this, various methods have been
proposed including, among others, the Semantic Similarity
Retrieval Model (SSRM) [14] and Mihalcea et al. [15]. The
most important representatives of this category of methods
are implemented within Best First crawlers, forming what are
hereafter called Semantic Crawlers.

In this work, the WordNet [4] term taxonomy is used as an
ontology for retrieving conceptually similar terms. WordNet
was selected because it provides vast coverage of the
English vocabulary, so it can be used for focused crawling on
almost every topic, making our implementation the first
general purpose Semantic Crawler. The general design
remains similar to that of Classic Focused Crawlers (Fig. 6),
but the priorities assigned to links are evaluated using
methods such as SSRM [14] and Ehrig et al. [13]. Other parts
of the system, such as downloading, link and anchor text
extraction, preprocessing and representing texts using Vector
Space Model term vectors, remain the same.

In the following, candidate links for downloading are
represented by their anchor texts. Each candidate link is
assigned a priority value which is computed as the semantic
similarity between its anchor text and the topic [14, 15].
In turn, semantic text similarity is computed as a function of
the semantic (conceptual) similarities between the terms they
contain. This can be defined in many different ways [11],
leading to the implementation of three semantic crawlers.

3.3.1 Ehrig Crawler

In this implementation Web pages are represented by the
anchor text of the links pointing to them (instead of page
content as in [13]). Anchor texts and the user query are
represented by term vectors using tf weights [13]. Page
priorities are computed as:

$$priority_{Ehrig}(a) = \sum_{i=1}^{n} \sum_{j=1}^{n} sim(t_i, t_j) \cdot w_{a,i} \cdot w_{q,j} \qquad (8)$$

where n is the total number of terms in the anchor text and the
query, w_{a,i} and w_{q,j} are the tf weights of term t_i in the anchor
text a and of term t_j in the query q, and sim(t_i, t_j) is the term
semantic similarity computed using equation 3. Note that only tf
weights are used, without normalizing by vector length (as is
recommended for short topic and page descriptions), and that
WordNet is used instead of topic specific ontologies as in [13].
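The double sum of equation 8 can be sketched in Python as follows. The `toy_sim` function and its similarity table are hypothetical stand-ins for the WordNet-based term similarity of equation 3, and tokenization is by whitespace only; both are assumptions made for the example.

```python
from collections import Counter


def ehrig_priority(anchor_text, query, term_sim):
    """Equation 8: sum over term pairs of sim(t_i, t_j) * w_ai * w_qj (tf weights)."""
    w_a = Counter(anchor_text.lower().split())
    w_q = Counter(query.lower().split())
    # Terms absent from a text have weight 0, so only present terms are visited.
    return sum(term_sim(ti, tj) * w_a[ti] * w_q[tj] for ti in w_a for tj in w_q)


# Hypothetical similarity table standing in for the WordNet-based measure.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_sim(t1, t2):
    """1.0 for identical terms, a table value for known pairs, else 0.0."""
    return 1.0 if t1 == t2 else TOY_SIM.get(frozenset((t1, t2)), 0.0)


score = ehrig_priority("used automobile dealer", "car dealer", toy_sim)
```

In this example the anchor text shares only "dealer" with the query, yet the pair (automobile, car) also contributes, which a purely lexical VSM comparison would miss.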

3.3.2 SSRM Crawler

SSRM [14] is used for assigning visit priorities to Web
pages. Specifically, the priority of a URL (represented by its
anchor text) is defined as follows:

$$priority_{SSRM}(a) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} sim(t_i, t_j) \cdot w_{a,i} \cdot w_{q,j}}{\sqrt{\sum_{j=1}^{n} w_{a,j}^{2}} \, \sqrt{\sum_{j=1}^{n} w_{q,j}^{2}}} \qquad (9)$$

where n is the total number of terms in the anchor text
and the query. The term similarity method of Li et al. [42] is
used in our implementation. The URL with the highest priority
value is downloaded first.

3.3.3 Semantic Crawler with synonym set expansion

An obvious improvement is to expand text vectors with
synonym sets in WordNet and use both anchor text and page
content for computing text similarity and assigning
priorities:

$$priority_{synset\ expansion}(i) = \frac{similarity(p_i, \hat{q}) + similarity(\hat{a}_i, \hat{q})}{2} \qquad (10)$$

where priority_{synset expansion}(i) is the priority value assigned to link i,
similarity(p_i, \hat{q}) is the cosine similarity of the expanded query \hat{q}
(using WordNet synonym sets) and p_i (the content of the page
where link i is located) and similarity(\hat{a}_i, \hat{q}) is the cosine
similarity of the expanded anchor text \hat{a}_i of link i and the expanded
query \hat{q}.
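Equation 10 can be sketched in Python with a small synonym table in place of WordNet. The `SYNSETS` dictionary, the rule that a synonym inherits the count of its source term, and the whitespace tokenization are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical synonym sets standing in for WordNet synsets.
SYNSETS = {"car": {"automobile", "auto"}, "automobile": {"car", "auto"}}


def expand(text):
    """Term vector of `text` with the synonyms of every term added."""
    base = Counter(text.lower().split())
    vec = Counter(base)
    for term, n in base.items():
        for syn in SYNSETS.get(term, ()):
            vec[syn] += n        # assumption: a synonym inherits the term's count
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def synset_priority(page_text, anchor_text, query):
    """Equation 10: mean of page and expanded-anchor similarity to the expanded query."""
    q = expand(query)
    page = Counter(page_text.lower().split())
    return (cosine(page, q) + cosine(expand(anchor_text), q)) / 2


p = synset_priority("auto sales", "automobile", "car")
```

Without expansion both similarities in this example would be zero, since neither the page nor the anchor text contains the query term "car".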

3.4 Learning Crawlers

The main idea behind Learning Crawlers is that the crawler
learns user preferences on the topic from a set of example
pages (the training set). Training may involve learning the
path leading to the desired content. In most cases the
training set consists of relevant and irrelevant pages. Every
downloaded page is classified (based on the results of
learning) as relevant or irrelevant and is assigned a priority.

The Context Graph method [31] works not only by
classifying the crawled pages as relevant or not relevant, but
also by learning the distance (in number of routing hops) that
may lead from an irrelevant page to a relevant one (Fig. 3).
The non relevant pages in the training set were downloaded
by recursively using Google's backlink service, starting from
relevant pages, in order to compute their distance (level)
from the relevant or target pages. During crawling, pages
similar to those closer to target pages are given higher
priority.

The Hidden Markov Model Crawler [16, 18] extends the
previous idea by categorizing pages not only by their
distance from a target page but also by their content,
thus estimating a relation between page content and the path
leading to relevant pages. Initially, a user browses a
sequence of pages, labeling them as relevant or not. As pages
are downloaded, the visiting sequence is recorded and a
context graph is created without the need for a backlink
service as in [31].

[Figure components: User training module; Hidden Markov Model Initialization; Crawling Component]

Fig. 8 Outline of learning focused crawling

Figure 8 illustrates the functional components of the HMM
crawler implemented:

I. Training component: The first component records the
URLs visited by the user and the page view sequence.
Then it downloads the pages and computes the tf-idf vectors
representing their content. Finally, pages are clustered
using a clustering algorithm. In our implementation
K-Means and X-Means [17] were used for clustering.

II. HMM initialization: The second component takes the
HMM representation of the user training set (as in Fig. 4) as
input and computes the Hidden Markov Model
parameters (i.e. the π, A and B matrices). This component is
applied during the initialization phase, before crawling.

III. Crawling component: It downloads selected pages,
extracts content and links, processes the page content and
assigns the page to a cluster using the K-Nearest Neighbors
algorithm [43]. Given the page cluster and the Hidden
Markov Model parameters (the π, A and B matrices), the
probability that the next page visited will be a target
page is computed using the Viterbi algorithm [40]. That
probability also represents the visit priority of the link. If
two clusters yield almost identical probabilities (i.e. the
difference of the probabilities is below a predefined
threshold ε), then higher priority is assigned to the
cluster leading with higher probability to target pages in
two steps (also computed by applying the Viterbi
algorithm).
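The tie-breaking rule of the crawling component can be sketched in Python as a pairwise comparison. The threshold value, the three-state transition matrix and the reading of "two steps" as two applications of the next-step prediction formula are assumptions made for this example; state 0 stands for the relevant (target) state.

```python
EPSILON = 0.01   # hypothetical threshold; the text does not state its value


def step(dist, trans):
    """Propagate state probabilities one time step through the transition matrix."""
    n = len(dist)
    return [sum(dist[i] * trans[i][j] for i in range(n)) for j in range(n)]


def lookahead(a_t, trans):
    """Probability of the relevant state (index 0) after one and after two steps."""
    one = step(a_t, trans)
    two = step(one, trans)
    return one[0], two[0]


def prefer(a_x, a_y, trans, eps=EPSILON):
    """Return 0 if the first candidate should be visited first, otherwise 1."""
    x1, x2 = lookahead(a_x, trans)
    y1, y2 = lookahead(a_y, trans)
    if abs(x1 - y1) >= eps:          # clear winner on the one-step probability
        return 0 if x1 > y1 else 1
    return 0 if x2 >= y2 else 1      # near tie: compare two-step probabilities


# Hypothetical transition matrix over three hidden states.
A = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.1, 0.9]]
choice = prefer([0.0, 1.0, 0.0], [1.0, 0.0, 0.0], A)
```

Both candidates reach the relevant state with probability 0.5 in one step, so the decision falls to the two-step probabilities (0.25 against 0.5) and the second candidate is preferred.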

Three Learning crawlers have been implemented: the first is
the Hidden Markov Model Crawler (variants proposed in [16] and
[18]). The next two variants (Hybrid Crawlers) are proposed
in this thesis. They combine the page priority functions
computed by the Hidden Markov Model Crawler and by
the Best First Crawler in order to evaluate the overall
priority value of a Web page.

3.4.1 Hidden Markov Model Crawler

Two variants of this crawler have been implemented:

a) The first Hidden Markov Model implementation uses
the K-Means algorithm for clustering as described in
[16]. In this work the dimensionality reduction step is
omitted. K was set to 5, and the last (fifth) cluster
holds the relevant pages. Page priorities (priority_hmm)
are computed using the Viterbi algorithm [40] (Fig. 9).

b) The second variant is almost identical to the previous
one, but X-Means [17] is used instead of K-Means.
Other minor modifications are that (a) idf weights are not
precomputed (as in the previous variant) but are
computed using the training set, and (b) the relevant
pages do not form a separate cluster but may
belong to the same cluster as non relevant pages.

As will be shown in the experiments, the two variants
demonstrated identical performance. The first variant
was used for comparisons with the Hybrid Crawlers
proposed in this work.

Figure 9 summarizes the HMM Crawler's priority assignment
procedure:

Fig. 9 HMM Crawler priority estimation algorithm

Input: Training set, candidate page (p).

Output: priority value priority_hmm(p) assigned to candidate page p.

1. Cluster the training set using the K-Means or X-Means algorithm.

2. Compute the π, A, B matrices.

3. Classify candidate page p to a cluster c_t using the K-Nearest Neighbor algorithm.

4. Compute the hidden state probabilities for the current step using the Viterbi formula:

$$a(L_j, t) = b_{j c_t} \sum_{i=0}^{states} a(L_i, t-1) \cdot a_{ij}$$

5. Compute the hidden state probabilities estimation for the next step using the formula:

$$a(L_j, t+1) = \sum_{i=0}^{states} a(L_i, t) \cdot a_{ij}$$

6. Assign priority priority_hmm(p) = a(L_0, t+1) to page p.

3 . 4 . 2 H yb r id C r aw l e rs

T w o v a r i an t s o f h yb r id c r a wl e r s a re im p l em e n t ed :

a ) H yb rid M a rko v M o de l C r aw l e r : T h e Hi dd en M a rk ov

M od e l C r aw l e r su f f e r s f ro m a t l e a s t tw o d r aw b a ck s : ( a ) i t

d o es n ’ t a s s i gn d i f fe r e n t p r io r i t i e s t o p a ge s b e lo n g in g to

t h e s a me c lu s t e r an d ( b ) i t i s v e r y d i f f i c u l t t o r e p re s en t

t h e s e t o f W e b p a ge s n o t r e l e v an t t o t h e t op i c b y c l u s t e r s

( i t i s a v e r y h e t e r oge n e o us s e t ) .

A h yb r i d ap p ro a ch c o mbi n i n g th e t ex t s imi l a r i t y o f a

p a ge w i th t h e c e n t ro i d o f t h e c lus t e r c o n t a i n i n g the

p os i t i v e ex am pl e pa ge s ( us i n g VS M) i s p ro po s ed h e r e fo r

d e a l in g wi th t h es e t wo p r ob l e ms . Th e c e n t ro id i s

c o mp ut ed a s t h e ave r a ge v e c to r o f t he p a ge s b e l on g i n g to

t h e c lus t e r . T ex t s i mi l a r i t y b e t w e en c a n d i da t e p a ge s wi t h

t h e c en t r o id m a y d i f f e r ev en i f p a ge s b e lo n g to t h e s ame

c l us t e r t hu s d e a l in g w i t h t h e f i r s t p r ob l em m e n t i one d .

S i mi l a r i t y w i t h t he c e n t r o i d o f r e l e v a n t p a ges i s no t

a f f e c t e d b y t h e wa y n o n r e l e v an t pa ge s a r e r e p r es e n t e d

t hu s d e a l i n g w i th t h e s e c on d p r ob l em a s we l l .

The Hybrid Markov Model Crawler differs from the HMM Crawler in the way priorities are assigned to candidate pages. It computes a priority score for a page using the Hidden Markov Model ($priority_{hmm}$) and also computes the text similarity of the page content with the centroid of the cluster containing the relevant pages from the user training set using equation 2. Finally, the priority of page $p_i$ is computed as follows:

$$priority_{hybrid}(p_i) = \frac{similarity(p_i, c_r) + priority_{hmm}(p_i)}{2} \qquad (11)$$

Where $c_r$ is the centroid of relevant pages in the training set, $similarity(p_i, c_r)$ is the cosine similarity of the content of page $p_i$ with the centroid $c_r$ of relevant pages, $priority_{hmm}(p_i)$ is the priority assigned to links into page $p_i$ using the Hidden Markov Model and $priority_{hybrid}(p_i)$ is the priority assigned to links in page $p_i$ by the Hybrid Crawler.
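The centroid and the priority of equation 11 can be sketched in a few lines of Python. This is only an illustrative sketch: the plain term-frequency vectors, the whitespace tokenizer and the function names (`tf_vector`, `centroid`, `cosine_similarity`, `priority_hybrid`) are assumptions made for the example and are not taken from the thesis implementation, which uses the VSM weighting of equation 2.

```python
import math
from collections import Counter


def tf_vector(text):
    """Hypothetical stand-in for the VSM page vector: raw term frequencies."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average vector of the relevant (positive example) pages."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds term weights component-wise
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def priority_hybrid(page_vector, c_r, priority_hmm):
    """Equation 11: average of the centroid similarity and the HMM priority."""
    return (cosine_similarity(page_vector, c_r) + priority_hmm) / 2.0
```

Two pages with the same HMM priority (same cluster) receive different hybrid priorities as soon as their similarity to the centroid differs, which is the property the hybrid approach relies on.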

b) Hybrid HMM Crawler with page content and anchor text: An obvious extension to method (a) is to use both anchor and page text in the computation of page priorities. This leads to the following equation:

$$priority_{hybrid,anchor}(a_i) = \frac{priority_{hmm}(p_i) + \dfrac{similarity(p_i, c_r) + similarity(a_i, c_r)}{2}}{2} \qquad (12)$$

Where $c_r$ is the centroid of relevant pages in the training set, $similarity(p_i, c_r)$ is the cosine similarity of the content of page $p_i$ with the centroid $c_r$ of relevant pages, $similarity(a_i, c_r)$ is the cosine similarity of the link anchor text $a_i$ with the centroid $c_r$ of relevant pages, $priority_{hmm}(p_i)$ is the priority value assigned to links into page $p_i$ using the Hidden Markov Model and $priority_{hybrid,anchor}(a_i)$ is the priority assigned to the link with anchor text $a_i$ by the Hybrid HMM Crawler with page content and anchor text.
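As a rough illustration of how such a priority could drive the crawl frontier, the sketch below assumes that equation 12 first averages the two text similarities and then averages the result with the HMM priority. The function names, the tuple layout and the use of a binary heap as the frontier are assumptions made for this example, not details given in the thesis.

```python
import heapq


def priority_hybrid_anchor(priority_hmm, sim_page, sim_anchor):
    """Assumed reading of equation 12.

    sim_page   -- cosine similarity of the page content with the centroid
    sim_anchor -- cosine similarity of the link anchor text with the centroid
    """
    text_similarity = (sim_page + sim_anchor) / 2.0
    return (priority_hmm + text_similarity) / 2.0


def rank_links(links):
    """Order candidate links by decreasing priority.

    links -- iterable of (url, priority_hmm, sim_page, sim_anchor) tuples
             (a hypothetical layout chosen for this sketch).
    """
    heap = []
    for url, p_hmm, s_page, s_anchor in links:
        score = priority_hybrid_anchor(p_hmm, s_page, s_anchor)
        # heapq is a min-heap, so the score is negated to pop the best link first.
        heapq.heappush(heap, (-score, url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With this form, two links found in the same page (same HMM priority, same page similarity) are still ordered by how close their anchor text is to the centroid.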

The priority function of equation 12 improves the performance of the hybrid Crawler. As will be shown in the experiments, when anchor text is used the crawler is even more focused on the topic. Figure 10 illustrates the operation of hybrid crawlers:


[Fig. 10 diagram: clusters 0, 1 and 2 with pages labelled L0 to L3, the centroid of cluster 0, and the candidate pages.]

Fig. 10 Hybrid crawlers operation.

In figure 10 two pages (blue circles) are candidates for downloading. The HMM Crawler will assign higher priority to candidate page p1 belonging to cluster 1, since this cluster leads with higher probability to target pages (cluster 0) in two link steps (the probability of leading to cluster 0 in one step is identical for clusters 1 and 2). Instead, a Hybrid crawler will select for expansion the page p2 belonging to cluster 2 because of its proximity (similarity) to the centroid of cluster 0 (the cluster containing the relevant pages from the training set).
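The situation of figure 10 can be reproduced with a toy calculation. All numbers below are hypothetical and chosen only to mirror the description (p1 has the higher HMM priority, p2 is closer to the centroid); the averaging rule assumes the form of equation 11.

```python
def priority_hybrid(similarity_to_centroid, priority_hmm):
    """Assumed form of equation 11: average of the two scores."""
    return (similarity_to_centroid + priority_hmm) / 2.0


# Hypothetical scores for the two candidate pages of figure 10.
candidates = {
    "p1": {"priority_hmm": 0.6, "similarity": 0.1},  # cluster 1
    "p2": {"priority_hmm": 0.4, "similarity": 0.7},  # cluster 2
}

# The HMM crawler looks only at the cluster-based priority.
hmm_choice = max(candidates, key=lambda name: candidates[name]["priority_hmm"])

# The hybrid crawler also takes the similarity to the centroid into account.
hybrid_choice = max(
    candidates,
    key=lambda name: priority_hybrid(
        candidates[name]["similarity"], candidates[name]["priority_hmm"]
    ),
)
```

Under these made-up values the HMM crawler expands p1, while the hybrid crawler expands p2 (hybrid scores 0.35 for p1 and 0.55 for p2).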

3.5 Summary

Classic crawlers, including the well-known Breadth-First crawler and variations of the Best-First Crawler presented in this chapter, have been implemented in the current thesis. Semantic crawlers, including a variation of the Ehrig crawler using WordNet and the novel SSRM and Synonym set expansion crawlers, have been implemented and compared with state-of-the-art Best First Crawlers. Finally, a set of state-of-the-art HMM crawlers including [16, 18] and the hybrid crawlers proposed here have also been implemented and their performance is evaluated.


Chapter 4. Experimental Results

4.1 Introduction

The following set of experiments is designed to:

a) Provide a critical evaluation of the various types of crawlers examined in this work, including classic (Breadth-First), topic driven (Best-First and its variants, including Semantic crawlers), Learning and Hybrid crawlers.

b) Demonstrate the superiority of the new Hybrid crawler proposed in this work over state-of-the-art HMM learning crawlers such as [16, 18].

Six different topics were used (“linux”, “asthma”, “robotics”, “dengue fever”, “java programming” and “first aid”) and the ability of the crawlers to download pages on the above topics was measured. Their performance was computed using two well established measures referred to as harvest ratio and average similarity. Each crawler downloaded 1000 pages and its average performance (over all topics) was computed using both criteria. Pages judged relevant were provided by the user, who manually inspected results obtained by the Google search engine on each topic. These results were used as ground truth and compared with the results obtained by the crawlers. The more similar (to the ground truth) the results of a crawler are, the more successful the crawler is (the higher the probability that the crawler retrieves results similar to the topic). Page to topic relevance is computed by VSM in all cases.


4.2 Performance measures

Two different evaluation criteria were used:

a) Harvest ratio: For every page its cosine similarity with all pages judged as relevant by the user is computed and the maximum of these cosine similarities is taken. If the maximum similarity is greater than a predefined threshold (0.75 in this work) then the page is marked as relevant (otherwise the page is marked as irrelevant). The harvest ratio is defined as the percentage of downloaded pages with similarity greater than the threshold (in this thesis the number of relevant pages was used instead of their fraction among the total number of downloaded pages).

b) Average similarity: The maximum similarity of each downloaded page with all pages marked as relevant is computed. The average similarity is defined as the average value of these similarities for all downloaded pages.

The first criterion is more selective than the second. Harvest ratio can be adjusted (by using a higher threshold) to measure the ability of the crawler to download pages highly relevant to the topic. An application called “evaluator” was developed for automating the evaluation process. It receives as input the positive pages set (50 relevant pages on every topic in our experiment) and the 1000 evaluated pages downloaded by the crawler, and computes the performance of the crawler at hand with both criteria.
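The two criteria can be computed together, since both start from the maximum similarity of each downloaded page to the relevant set. The following is a minimal sketch of what such an evaluator might look like, not the "evaluator" application itself: pages are assumed to be already converted to sparse term-weight vectors (dicts), and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_similarity(page, relevant_pages):
    """Maximum cosine similarity of a page with the (non-empty) relevant set."""
    return max(cosine_similarity(page, r) for r in relevant_pages)


def evaluate(downloaded, relevant_pages, threshold=0.75):
    """Return (number of relevant pages, average similarity).

    The first value is the harvest measure as used in the thesis (a count,
    not a fraction); the second is the mean of the per-page maximum
    similarities over all downloaded pages.
    """
    sims = [max_similarity(p, relevant_pages) for p in downloaded]
    relevant_count = sum(1 for s in sims if s > threshold)
    average_similarity = sum(sims) / len(sims) if sims else 0.0
    return relevant_count, average_similarity
```

Raising `threshold` makes the harvest count stricter while leaving the average similarity unchanged, which is why the first criterion is the more selective of the two.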


4.3 Experiment setup

The following crawlers are compared:

1) Non Focused Crawlers:

a) Breadth First Crawler

2) Classic Focused Crawlers:

b) Best First Crawler with page content

c) Best First Crawler with anchor text

d) Best First Crawler with page content & anchor text

3) Semantic Crawlers:

e) Semantic Crawler using Ehrig et al. [13] method for text similarity estimation.

f) Semantic Crawler using SSRM [14] method for text similarity estimation.

g) Semantic Crawler with Synset Expansion.

4) Learning Crawlers:

h) Hidden Markov Model Crawler

i) Hybrid Hidden Markov Model Crawler

j) Hybrid Hidden Markov Model Crawler with page content & anchor text.

All Crawlers were evaluated using the following topics and seed pages:

Query               Seed
Linux               http://dir.yahoo.com/Computers_and_Internet/Software/Operating_Systems/UNIX/Linux
Asthma              http://dir.yahoo.com/Health/Diseases_and_Conditions/Asthma/
Robotics            http://dir.yahoo.com/Science/Computer_Science/
Dengue Fever        http://health.yahoo.com/
Java programming    http://dir.yahoo.com/Computers_and_Internet/
First Aid           http://dir.yahoo.com/Health/

Fig. 11 Experiment setup

1000 pages were downloaded for each crawler and for each topic. Notice that in four out of the six topics the seed page doesn't directly link to target pages.


The experiments in this section are organized by crawler type, showing a comparison between various implementations of crawlers of the same type. Specifically, the experiments are organized as follows:

a) Classic Focused Crawlers Experiment

Crawlers (a)-(d) were evaluated using the six topics of Fig. 11.

b) Semantic Crawlers Experiment

Crawlers (e)-(g), and (c)-(d) for comparison, were evaluated using the six topics of Fig. 11.

c) Learning Crawlers Experiment

Crawlers (h)-(j) were evaluated using four topics (“Robotics”, “Dengue Fever”, “Java Programming” and “First Aid”).

In the experiments below each method is represented by a plot showing the number of relevant pages (or the average similarity) on the Y axis as a function of the total number of pages retrieved. Each point in a plot corresponds to the harvest ratio or the average similarity, respectively, measured after that many pages.
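The points of such a plot can be derived from the per-page maximum similarities in crawl order. The sketch below is an assumption about how the curves could be produced: the checkpoint spacing of 50 pages is inferred from the plot axes, and the function name is hypothetical.

```python
def harvest_curve(max_similarities, threshold=0.75, step=50):
    """Return (pages_crawled, relevant_so_far) pairs every `step` pages.

    max_similarities -- per-page maximum similarity to the relevant set,
                        listed in the order the pages were downloaded.
    """
    points = []
    relevant = 0
    for i, s in enumerate(max_similarities, start=1):
        if s > threshold:
            relevant += 1
        if i % step == 0:
            points.append((i, relevant))
    return points
```

For example, `harvest_curve([0.9, 0.1, 0.8, 0.2], step=2)` yields `[(2, 1), (4, 2)]`. An average similarity curve can be built the same way by keeping a running sum instead of a running count.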

Notice that Learning Crawlers have different input (the training set) than the Classic and Semantic focused Crawlers (that have the user query as input), so direct comparisons between the performance of learning crawlers and other categories of crawlers are not really meaningful.


4.4 Classic Focused Crawlers

Fig. 12 Harvest ratio for classic crawlers

The comparison in Fig. 12 indicates the poor performance of the Breadth First Crawler, as expected for a non-focused crawler. The fact that the Best First Crawler using anchor text only outperforms the crawler using only page content indicates the value of anchor text for computing page to topic relevance.

The crawler combining page and anchor text demonstrated superior performance. This result indicates that Web content relevance cannot be reliably estimated from page content or anchor text alone. Instead, the combination of page content and anchor text forms a more reliable page description.

[Fig. 12 plot: relevant pages (y-axis, 0 to 350) versus crawled pages (x-axis, 50 to 1000). Series: Breadth First; Best First-page content; Best First-anchor text; Best First-content & anchor text.]

Fig. 13 Average similarity for classic focused crawlers

Fig. 13 confirms the results of the previous comparison. Overall, a best first crawler combining page and anchor text achieves superior performance over all its competitors with both criteria.

4.5 Semantic Crawlers

The second experiment measures the performance of semantic crawlers using the six topics of Fig. 11 (as in the previous experiment).

[Fig. 13 plot: average similarity (y-axis, 0% to 70%) versus crawled pages (x-axis, 50 to 1000). Series: Breadth First; Best First-page content; Best First-anchor text; Best First-content & anchor text.]

Fig. 14: Harvest Ratio for Semantic Crawlers.

Fig. 14 illustrates only marginal performance improvements of semantic crawlers over best first crawlers. It is conjectured that the poor performance of semantic crawlers should not be regarded as a failure of semantic crawlers but rather as a failure of WordNet to provide terms conceptually similar to the topic. WordNet is a general taxonomy for English terms and not all linked terms are actually very similar, implying that the results can be improved by using topic specific ontologies. Such topic specific ontologies on several diverse topics were not available to us for these experiments.

[Fig. 14 plot: relevant pages (y-axis, 0 to 350) versus crawled pages (x-axis, 50 to 1000). Series: Semantic Crawler Ehrig method; Semantic Crawler SSRM method; Best First-anchor text; Best First-content & anchor text; Semantic Crawler with synset expansion.]

Fig. 15: Average Similarity for Semantic Crawlers

Results with average similarity confirmed the results of Fig. 14. Here semantic crawlers again improved the results of best first crawlers, but only marginally, indicating that average similarity (as a less strict criterion) is more tolerant to relaxed interpretations of conceptual similarity as provided by WordNet and term similarity measures (such as Li et al. [42]).

4.6 Learning Crawlers

[Fig. 15 plot: average similarity (y-axis, 0% to 70%) versus crawled pages (x-axis, 50 to 1000). Series: Semantic Crawler Ehrig method; Semantic Crawler SSRM method; Best First-anchor text; Best First-content & anchor text; Semantic Crawler with synset expansion.]

The results below are taken on four topics (“robotics”, “dengue fever”, “java programming” and “first aid”) and measured on the first 1000 web pages returned by each crawler on each topic. Only Learning crawlers were evaluated in this experiment: two variants of HMM Crawlers were tested, corresponding to different implementations of the clustering component (with K-Means and X-Means respectively). The results indicate that the K-Means (using K=5 as suggested in [16]) and X-Means Hidden Markov Model Crawlers have identical performance. Both crawlers demonstrated poor performance (Figs. 16-17) and this can be attributed to several reasons: both variants don't assign different priorities to pages falling into the same cluster, or to links within the same page. Both variants must be provided with a training set very similar in content and link structure to the part of the Web that will be crawled (something not always achievable). Because the two HMM Crawlers (using X-Means and K-Means) have identical performance, the first variant (HMM Crawler using K-Means) was chosen for comparison with the other Learning Crawlers.

In Fig. 16 the performance of the HMM crawler is compared with the performance of the new Hybrid crawlers (using a combination of page content and anchor text) proposed in this work. The first (Hybrid HMM using page content) prioritizes links using equation 11 (similarity of the page containing the links with the centroid of the relevant pages in the training set). In addition to that, the second implementation (Hybrid HMM Crawler with anchor text) also combines the similarity of the centroid with the anchor text of links pointing to candidate pages for priority assignment, as suggested by equation 12.


Fig. 16: Harvest Ratio for HMM & Hybrid Crawlers

[Fig. 16 plot: relevant pages (y-axis, 0 to 45) versus crawled pages (x-axis, 50 to 1000). Series: HMM Crawler; Hybrid HMM Crawler with page content; Hybrid HMM with page content & anchor text.]

Fig. 17: Average Cosine Similarity for HMM & Hybrid Crawlers

[Fig. 17 plot: average similarity (y-axis, 0% to 50%) versus crawled pages (x-axis, 50 to 1000). Series: HMM Crawler; Hybrid HMM Crawler with page content; Hybrid HMM with page content & anchor text.]

The Hybrid crawlers outperform the Hidden Markov Model crawler using both criteria. The use of the centroid of the positive examples as a query clearly increases performance because it overcomes the problems of HMM crawlers. As Fig. 16 and Fig. 17 indicate, the results obtained by Hybrid crawlers are promising and may lead to further research in this direction.

4.7 Discussion

Classic Focused Crawler results show that combining page content and anchor text (Best First Crawler - page content and anchor text) yields the best results. Together, page content and anchor text form a representative content descriptor for web pages. Semantic Crawlers, when combined with a general purpose ontology, performed poorly compared to Best First crawlers. By restricting semantic relations to synonym sets (Semantic Crawler - Synset expand method) the performance was improved marginally. Synonyms, although not lexically similar, succeed in identifying pages with content similar to the topic, indicating that it is possible to expect further performance improvements by using topic specific ontologies rich in terms very similar to the terms of the topic. At this point, ontologies of this type are not available to us. Both Hybrid Crawlers achieve better performance than the Hidden Markov Model Crawler. The results obtained indicate that positive examples are more important than the negative ones during training in an environment such as the World Wide Web. Using only positive examples, the performance of learning crawlers is expected to improve.


Chapter 5. Conclusions and future work

In the present thesis, several variants of focused crawlers were implemented and evaluated using common evaluation criteria. First the Breadth First Crawler and variants of the Best First Crawler using page content, anchor text or both were compared. Then semantic relations were used in the implementation of three Semantic Crawlers that were compared with classic focused crawlers (variations of the best first crawler). Finally, based on the Hidden Markov Model learning crawler, two novel hybrid crawlers combining elements from learning and classic focused crawlers were implemented and evaluated.

The experimental results indicate that the implementation of focused crawlers is a process where minor changes in the crawler design have a great effect on performance. The combination of anchor text and page content yields a great performance improvement in the case of classic, semantic and learning focused crawlers. The addition of semantic relations didn't improve performance, with the exception of expansion with synonyms, where semantic relations are restricted to synonym terms. Performance is expected to improve by using application specific ontologies (related to the topic), instead of general purpose ontologies such as WordNet.

Learning Crawlers take as input user selected pages, not described by a simple query. Not only do Learning crawlers receive different input than other focused crawlers, but they are also intended to perform a very difficult task: they attempt to learn web crawling patterns leading to relevant pages, possibly through other non-relevant pages, thus increasing the probability of failure (since web structures cannot always be modeled by such link patterns). However, the idea looks promising overall and may lead to even more successful implementations of learning crawlers in the future. The present work can be regarded as a contribution towards that direction.

Another direction for future work would be to do more elaborate tests with semantic crawlers, making use of topic specific ontologies (e.g. medical ontologies for applications related to health care). The positive results obtained by hybrid crawlers indicate that the relevance of a candidate page to the set of positive examples only is an effective way of assigning priorities to candidate pages. Using only positive examples (instead of positive and negative) might improve the performance of learning crawlers in terms of speed and accuracy.


References:

[1] “Web Search for a Planet: The Google Cluster Architecture” LA Barroso, J Dean, U Holzle - Micro, IEEE, 2003.

[2] “Very Large Scale Retrieval and Web Search” D Hawking, N Craswell, In E. Voorhees and D. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.

[3] “The Indexable Web is More than 11.5 Billion Pages” A Gulli, A Signorini - International World Wide Web Conference, 2005.

[4] http://wordnet.princeton.edu

[5] http://www.google.com

[6] “The Anatomy of a Large-Scale Hypertextual Web Search Engine” S Brin, L Page WWW7 / Computer Networks, 1998.

[7] http://www.yahoo.com

[8] http://www.msn.com

[9] http://www.ask.com


[10] http://larbin.sourceforge.net/index-eng.html

[11] “Information Retrieval by Semantic Similarity” Angelos Hliaoutakis, Giannis Varelas, Epimenidis Voutsakis, Euripides G. M. Petrakis, Evangelos Milios, International Journal on Semantic Web and Information Systems (IJSWIS), Special Issue of Multimedia Semantics, Vol. 3, No. 3, July/September, 2006, pp. 55-73.

[12] “A Vector Space Model for Automatic Indexing” G Salton, A Wong, CS Yang - Communications of the ACM, 1975.

[13] “Ontology-Focused Crawling of Documents and Relational Metadata” Alexander Maedche, Marc Ehrig, Siegfried Handschuh, Raphael Volz, and Ljiljana Stojanovic. Proceedings of the Eleventh International World Wide Web Conference WWW-2002.

[14] “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web” Varelas G., Voutsakis E., Raftopoulou P., Petrakis E., Milios E. In: 7th ACM International Workshop on Web Information and Data Management (WIDM 2005), Bremen, Germany (2005).

[15] “Measuring the Semantic Similarity of Texts.” Corley, C., Mihalcea, R., Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. Ann Arbor, June 2005.


[16] “Focused Crawling by Learning HMM from User’s Topic-Specific Browsing.” H. Liu, E. Milios, and J. Janssen. In Proceedings of 2004 IEEE/WIC/ACM International Conference on Web Intelligence, pages 732-735, Beijing, China, September 20-24, 2004.

[17] “X-means: Extending K-means with Efficient Estimation of the Number of Clusters.” D. Pelleg and A. Moore. In Proceedings of the 17th International Conf. on Machine Learning, pages 727-734. Morgan Kaufmann, San Francisco, CA, 2000.

[18] “Using HMM to Learn User Browsing Patterns for Focused Web Crawling” H Liu, J Janssen, E Milios - Data & Knowledge Engineering, 2006.

[19] “Breadth-First Search Crawling Yields High-Quality Pages.” M. Najork and J. L. Wiener. In Proc. 10th International World Wide Web Conference, 2001.

[20] “Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data.” Cho, J. 2001. Ph.D. thesis, Stanford University.

[21] “Searching the Web.” Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Transactions on Internet Technology, 2001.

[22] “Efficient Crawling Through URL Ordering.” Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.

[23] “Information Retrieval in Distributed Hypertexts” P. De Bra, G.-J. Houben, Y. Kornatzky, and R. Post, in: Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.

[24] “The Shark-Search Algorithm - An Application: Tailored Web Site Mapping” Hersovici, M., Jacovi, M., Maarek, Y.S., Pelleg, D., Shtalhaim, M. and Ur, S. (1998), Computer Networks and ISDN Systems, Vol. 30 No. 1-7, pp. 317-26.

[25] “Evaluating Topic-Driven Web Crawlers” F. Menczer, G. Pant, M. Ruiz, P. Srinivasan, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, ACM Press, New York, NY, 2001.

[26] “Topical Web Crawlers: Evaluating Adaptive Algorithms” F Menczer, G Pant, P Srinivasan – ACM Transactions on Internet Technology (TOIT), 2004.

[27] “A General Evaluation Framework for Topical Crawlers” P Srinivasan, F Menczer, G Pant – Information Retrieval, 2005 – Springer.

[28] “Intelligent Crawling on the World Wide Web with Arbitrary Predicates.” C. Aggarwal, F. Al-Garawi, and P. Yu. In Proc. 10th Intl. World Wide Web Conference, pages 96–105, 2001.

[29] “A Survey of Focused Web Crawling Algorithms.” Novak, B. Proceedings of the 7th International multi-conference Information Society IS-2004, Ljubljana: Institut “Jožef Stefan”, 2004.

[30] “Focused Crawling: A New Approach for Topic Specific Resource Discovery” S Chakrabarti, M van den Berg, B Dom - WWW Conference, 1999.

[31] “Focused Crawling Using Context Graphs.” M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. In Proc. 26th International Conference on Very Large Databases (VLDB 2000), pages 527–534, Cairo, Egypt, 2000.

[32] “Accelerated Focused Crawling through Online Relevance Feedback” Chakrabarti, S., Punera, K., and Subramanyam, M., In Proceedings of the eleventh international conference on World Wide Web (WWW2002), 2002, pp. 148-159.

[33] “Learning to Crawl: Comparing Classification Schemes” G Pant, P Srinivasan – ACM Transactions on Information Systems (TOIS), 2005.

[34] “Focused Crawling by Exploiting Anchor Text Using Decision Tree” Li Jun, Furuse K, Yamaguchi K. C, Proceedings of the 14th International World Wide Web Conference. 2005: 1190-1191.

[35] “A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections” Y Chen, PhD thesis – 2007.

[36] http://java.sun.com/

[37] http://www.eclipse.org/

[38] “A Tutorial on Support Vector Machines for Pattern Recognition” C J C Burges - Data Mining and Knowledge Discovery, 1998.

[39] http://www.dmoz.org/

[40] “The Viterbi Algorithm” G D Forney - Proceedings of the IEEE, 1973.

[41] “IntelliSearch: Intelligent Search for Images and Text on the Web” E Voutsakis, E G M Petrakis, E Milios. 3rd Intern. Conference on Image Analysis and Recognition (ICIAR 2006), pp. 697-708, Sept. 18-20, 2006, Povoa de Varzim, Portugal.

[42] “An Approach for Measuring Semantic Similarity between words using Multiple Information Sources” Y Li, Z Bandar - IEEE Transactions on Knowledge and Data Engineering, 2003.

[43] “Nearest Neighbor Pattern Classification” T Cover, P Hart - Information Theory, IEEE Transactions on, 1967.

[44] “An Introduction to Hidden Markov Models” L Rabiner, B Juang - ASSP Magazine, 1986.

[45] “Mercator: A Scalable, Extensible Web Crawler” A Heydon, M Najork – World Wide Web, 1999 – Springer.

[46] “Mining the Link Structure of the World Wide Web” Soumen Chakrabarti, Byron E. Dom, David Gibson, Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. IEEE Computer, 32(8):60-67, 1999.

[47] “Data Clustering: a Review” A K Jain, M N Murty, P J Flynn - ACM Computing Surveys (CSUR), 1999.

[48] “An Algorithm for Suffix Stripping” Porter, M.F. (1980) Program, 14(3): 130-137.
