  • Technical University of Crete

    Department of Electronic and Computer

    Engineering

    DESIGN AND EVALUATION OF TOPIC

    DRIVEN FOCUSED CRAWLERS

    FOR THE WORLD WIDE WEB

    By

    BATSAKIS SOTIRIOS

    A Thesis submitted in partial fulfillment

    of the requirements for the degree of

    Master of Computer Engineering

    Chania, November 2007

    Abstract

    Focused crawlers are programs designed to browse the Web and download
    pages on a specific topic. They are used for answering user queries or
    for building digital libraries on a topic specified by the user. They
    are divided into classic, semantic and learning focused crawlers.
    Classic focused crawlers estimate the relevance of Web pages to the
    topic by computing the similarity of Web pages with a user-provided
    list of keywords that describe the topic of interest. Semantic crawlers
    are a variation of classic focused crawlers that use conceptual
    relations between terms (e.g. retrieved from an ontology) for
    estimating the relevance of a Web page to the topic. Learning crawlers
    employ a training process that guides the crawler towards pages related
    to the topic.
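
The keyword-based relevance estimate used by classic focused crawlers can be illustrated with a small sketch. It assumes plain term-frequency vectors and cosine similarity; the abstract does not state which similarity measure or term weighting the thesis uses, and the function names below are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (a term-frequency vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_relevance(page_text, topic_keywords):
    """Score a page against the user-provided keyword list (assumed measure)."""
    topic_vector = term_vector(" ".join(topic_keywords))
    return cosine_similarity(term_vector(page_text), topic_vector)
```

A page that shares no term with the keyword list scores 0.0, and the score grows as topic keywords make up a larger share of the page's text.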

    This work addresses issues related to the design and implementation of
    classic, semantic and learning focused crawlers. Several variants of
    classic focused crawlers relying upon web page content and link anchor
    text for estimating the relevance of web pages to a given topic are
    examined and implemented. A novelty of this work is the introduction of
    a new category of semantic crawlers making use of WordNet as the
    underlying ontology for obtaining terms conceptually related (but not
    necessarily lexicographically similar) to the topic. Learning crawlers
    based on the Hidden Markov Model (HMM), which learn not only the content
    of relevant pages but also paths leading to relevant pages following a
    certain number of routing hops, are examined as well. An additional
    contribution of this work is the introduction of a new category of
    hybrid crawlers combining the strengths of both classic and learning
    focused crawlers.
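
The idea of matching terms that are conceptually related but not lexicographically similar can be sketched as follows. This is only an illustration: the synonym table is a hypothetical hand-written stand-in for an ontology such as WordNet, and the scoring rule (share of distinct page terms that hit the expanded topic) is an assumption, not the measure used in the thesis.

```python
import re

# Hypothetical toy table standing in for an ontology lookup; a real
# semantic crawler would query a resource such as WordNet instead.
TOY_SYNONYMS = {
    "car": {"automobile", "auto"},
    "doctor": {"physician"},
}


def expand_topic(keywords, synonyms=TOY_SYNONYMS):
    """Return the topic keywords together with their listed synonyms."""
    expanded = set()
    for keyword in keywords:
        keyword = keyword.lower()
        expanded.add(keyword)
        expanded |= synonyms.get(keyword, set())
    return expanded


def semantic_relevance(page_text, keywords, synonyms=TOY_SYNONYMS):
    """Fraction of the page's distinct terms that match the expanded topic."""
    page_terms = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    if not page_terms:
        return 0.0
    topic_terms = expand_topic(keywords, synonyms)
    return len(page_terms & topic_terms) / len(page_terms)
```

With the toy table, a page containing "automobile" matches the topic keyword "car" even though the two strings share no characters in common order; with an empty table the same page scores zero.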

    The crawlers referred to above are all implemented and a comparative
    analysis of their performance is presented. All crawlers achieve their
    maximum performance when a combination of web page content and anchor
    text is used for assigning download priorities to web pages. Semantic
    similarity methods combined with a general-purpose ontology source such
    as WordNet do not actually improve performance, except for the
    implementation that restricts semantic similarity to synonym terms.
    Hybrid crawlers improved on the performance of state-of-the-art HMM
    crawlers, yielding very promising results.
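
Assigning download priorities from a combination of page content and anchor text can be sketched with a best-first queue. The class name, the linear mix, and the default `anchor_weight` of 0.5 are assumptions made for illustration; the abstract does not say how the two scores are combined.

```python
import heapq
import itertools


class Frontier:
    """Best-first queue of URLs: the highest-priority link is downloaded next."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, page_score, anchor_score, anchor_weight=0.5):
        """Queue a link found on a page.

        page_score: relevance of the page that contains the link.
        anchor_score: relevance of the link's anchor text.
        anchor_weight: assumed mixing weight, not a value from the thesis.
        """
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        priority = (1 - anchor_weight) * page_score + anchor_weight * anchor_score
        # heapq is a min-heap, so the negated priority is stored.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        """Return the URL with the highest priority, or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A crawler loop would pop a URL, download and score the page, then add each outgoing link with the page's score and the score of that link's anchor text.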

    Contents

    Chapter 1. Introduction
    1.1 Background
    1.2 Present work
    1.3 Contribution of the current thesis
    1.4 Thesis outline

    Chapter 2. Related Work
    2.1 Introduction
    2.2 Non Focused Crawlers
    2.3 Classic Focused Crawlers
    2.4 Semantic Crawlers
    2.5 Learning Crawlers
    2.6 Summary

    Chapter 3. Crawler Design
    3.1 Introduction
    3.2 Classic Crawlers
    3.2.2 Best First Crawler with anchor text similarity
    3.2.3 Best First Crawler with page content and anchor text
    3.3 Semantic Crawlers
    3.3.1 Ehrig Crawler
    3.3.2 SSRM Crawler
    3.3.3 Semantic Crawler with synonym set expansion
    3.4 Learning Crawlers
    3.4.1 Hidden Markov Model Crawler
    3.4.2 Hybrid Crawlers
    3.5 Summary

    Chapter 4. Experimental Results
    4.1 Introduction
    4.2 Performance measures
    4.3 Experiment setup
    4.4 Classic Focused Crawlers
    4.5 Semantic Crawlers
    4.6 Learning Crawlers
    4.7 Discussion

    Chapter 5. Conclusions and future work

    Referen
