Genetic Algorithms and Simulated Annealing for Internet Search. Veljko Milutinović, Department of Electrical Engineering, School of Electrical Engineering, University of Belgrade, POB 35-54, 11120 Belgrade, Serbia, Yugoslavia. vm@etf.bg.ac.yu, http://galeb.etf.bg.ac.yu/~vm


Page 1: Genetic Algorithms and Simulated Annealing  for Internet Search

Genetic Algorithms and Simulated Annealing for Internet Search

Veljko Milutinović

Department of Electrical Engineering
School of Electrical Engineering

University of Belgrade
POB 35-54, 11120 Belgrade, Serbia, Yugoslavia

vm@etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/~vm

Page 2: Genetic Algorithms and Simulated Annealing  for Internet Search

GAAS

Support for links-based search agents (Spiders), as an alternative to index-based search (Altavista)

Genetic algorithms: invented and developed in AI; may be efficient if properly applied to Internet search

Simulated annealing: introduced in mathematics; applicable to Internet search, but reported as not too efficient

Industry leader in packages for EBI-related strategic planning: Comshare, Inc.

Page 3: Genetic Algorithms and Simulated Annealing  for Internet Search

Genetic Algorithm for Internet Search

1. Select the initial WWW presentation or a set thereof
2. Extract all URLs and fetch the corresponding WWW presentations
3. Measure the fitness value for each newly fetched WWW presentation
4. Continue with a subset of the most promising WWW presentations, while occasionally mutating the extracted URLs

[Chen+Chung+Ramsey+Yang+Ma+Yen97].
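The four steps of the genetic algorithm can be sketched as a short loop. This is only an illustration under stated assumptions: `TOY_WEB`, the keyword-overlap `fitness`, and the parameters `keep` and `mutation_rate` are hypothetical stand-ins, and a real spider would fetch pages over HTTP instead of reading a dictionary.

```python
import random

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("genetic search agents", ["b", "c"]),
    "b": ("genetic algorithms for search", ["d"]),
    "c": ("cooking recipes", ["e"]),
    "d": ("genetic search spiders", []),
    "e": ("garden tools", []),
    "f": ("internet search with genetic agents", []),
}


def fitness(url, keywords):
    """Toy fitness (assumption): fraction of query keywords found in the page."""
    words = set(TOY_WEB[url][0].split())
    return len(words & keywords) / len(keywords)


def genetic_search(start_urls, keywords, generations=3, keep=2,
                   mutation_rate=0.3, seed=0):
    rng = random.Random(seed)
    population = list(start_urls)      # step 1: initial presentation(s)
    visited = set(population)
    for _ in range(generations):
        # Step 2: extract all URLs and "fetch" the linked presentations.
        candidates = {link for url in population for link in TOY_WEB[url][1]}
        # Occasional mutation: inject a random URL from the whole toy web.
        if rng.random() < mutation_rate:
            candidates.add(rng.choice(sorted(TOY_WEB)))
        candidates -= visited
        if not candidates:
            break
        visited |= candidates
        # Steps 3 and 4: measure fitness, continue with the most promising.
        ranked = sorted(candidates, key=lambda u: (-fitness(u, keywords), u))
        population = ranked[:keep]
    # Return everything seen, best first.
    return sorted(visited, key=lambda u: (-fitness(u, keywords), u))


result = genetic_search(["a"], {"genetic", "search"})
```

Page "f" has no incoming links in the toy web, so it can only be reached through mutation, which is the role mutation plays in step 4.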


Page 4: Genetic Algorithms and Simulated Annealing  for Internet Search

Issues of Importance

1. Representation of genes (URL is a numerically-encoded string)
2. Crossover (one parent: WWW page; the other: selection function)
3. Fitness function (typically, Jaccard's score)
4. Number of offspring (limited to a subset of the "best")
5. Mutation type (typically, DB-based)

[Mirković+Kraus+Milutinović97].

Page 5: Genetic Algorithms and Simulated Annealing  for Internet Search

Possible Solutions

Possible representation approaches:
1. String
2. Array of strings, etc.

Possible crossover approaches:
1. Link crossover (one explicit parent and one implicit parent)
2. Classical (two explicit parents), etc.

Possible fitness functions:
1. Jaccard's function
2. Evaluation function, etc.

Possible number of offspring:
1. Limited
2. Unlimited, etc.

Possible mutation types:
1. DB-based
2. Semantics-based, etc.

Page 6: Genetic Algorithms and Simulated Annealing  for Internet Search

Representation of Genomes

Types of genome representation:
1. String
2. Array of strings
3. Numerical (integer encoded or bit-string encoded)

Example: String representation

http://www.altavista.com EOS
http://galeb.etf.bg.ac.yu/~vm/tutorial.html EOS

Explanation: A URL is represented as a string, terminated by the End-Of-String (EOS) character.
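The string and integer-encoded representations can be illustrated with a few lines of code. This is a minimal sketch: the choice of the NUL character as the EOS marker and the use of character codes as the integer encoding are assumptions made here, not details given on the slide.

```python
EOS = "\0"  # assumed End-Of-String marker


def encode_string(url):
    """String representation: the URL followed by the EOS character."""
    return url + EOS


def encode_integers(url):
    """Integer-encoded representation: one character code per symbol."""
    return [ord(ch) for ch in encode_string(url)]


def decode_integers(codes):
    """Rebuild the URL, stopping at the EOS marker."""
    chars = []
    for code in codes:
        ch = chr(code)
        if ch == EOS:
            break
        chars.append(ch)
    return "".join(chars)


genome = encode_integers("http://www.altavista.com")
```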

Page 7: Genetic Algorithms and Simulated Annealing  for Internet Search

Crossover Operator

Types of crossover operator: classical, parent, link, overlapping links, link pre-evaluation.

Example: Link pre-evaluation

[Figure: parents, potential offspring with fitness values 0.1, 0.3, 0.9, 0.93, 0.2, 0.8, 0.9, and the selected offspring.]

Explanation: Fitness function values are calculated for all links, and the best ones are selected as offspring.
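Link pre-evaluation amounts to scoring every candidate link and keeping the top few. The sketch below assumes the fitness values are already known; the link names `l1` to `l7` are hypothetical, and the scores are the ones shown in the figure.

```python
def link_pre_evaluation(links, fitness, n_offspring):
    """Score every link, then keep the n_offspring best ones."""
    scored = [(fitness(link), link) for link in links]
    # Stable sort: links with equal fitness keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in scored[:n_offspring]]


# Hypothetical links carrying the fitness values from the figure.
scores = {"l1": 0.1, "l2": 0.3, "l3": 0.9, "l4": 0.93,
          "l5": 0.2, "l6": 0.8, "l7": 0.9}
selected = link_pre_evaluation(list(scores), scores.get, n_offspring=3)
```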

Page 8: Genetic Algorithms and Simulated Annealing  for Internet Search

Fitness Function

Types of fitness function:
•Simple keyword evaluation
•Jaccard's score
•Link evaluation

Example: Jaccard's score

[Figure: best first search over a tree of documents. Legend: input documents, potential offspring, output documents, iteration number, Jaccard's score (0-1).]

Explanation: The best first search algorithm is performed using Jaccard's score as the fitness evaluation function.

Page 9: Genetic Algorithms and Simulated Annealing  for Internet Search

Degree of the Crossover

Types: limited, unlimited.

Example: Unlimited crossover

[Figure: parents and offspring are merged into a list of ranked solutions with fitness values from 0.1 to 0.9, from which the selected offspring are taken.]

Explanation: Parents and all offspring are ranked according to their fitness function values; the best ones among them are selected.

Page 10: Genetic Algorithms and Simulated Annealing  for Internet Search

Mutation Operator

Types of mutation operator:
1. Generational
2. Selective
3. DB-based (unsorted, topic sorted, indexed)
4. Semantic (spatial locality, temporal locality, site-type)

Example: Topic sorted, DB-based mutation

in: topic (education)
URLs stored for the topic: http://www.santafe.edu, http://www.etf.bg.ac.yu, http://www.cmu.edu
out: offspring (http://www.etf.bg.ac.yu)

Explanation: One URL is randomly selected from the set of URLs that cover a certain topic.
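Topic-sorted, DB-based mutation reduces to a random draw from a per-topic URL list. In this minimal sketch the dictionary `TOPIC_DB` stands in for the database (an assumption); its contents are the URLs from the example.

```python
import random

# Stand-in for the topic-sorted database; the URLs come from the example.
TOPIC_DB = {
    "education": [
        "http://www.santafe.edu",
        "http://www.etf.bg.ac.yu",
        "http://www.cmu.edu",
    ],
}


def mutate(topic, rng):
    """Return one randomly selected URL that covers the given topic."""
    return rng.choice(TOPIC_DB[topic])


offspring = mutate("education", random.Random(0))
```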

Page 11: Genetic Algorithms and Simulated Annealing  for Internet Search

Generation of the Output Set

Types: interactive, post-generation.

Example: Interactive generation

Population    Fitness values     Selected for the output set
1.            0.1, 0.2, 0.4      0.4
2.            0.3, 0.5, 0.7      0.7
3.            0.8, 0.85, 0.9     0.9

Explanation: The best individuals from each generation are selected for the output set.
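Interactive generation of the output set can be written in a few lines. The sketch assumes that each individual is represented only by its fitness value and that one individual per generation is kept; the parameter `best_per_generation` is hypothetical.

```python
def interactive_output_set(populations, best_per_generation=1):
    """Collect the best individuals of every generation, in generation order."""
    output = []
    for population in populations:
        ranked = sorted(population, reverse=True)
        output.extend(ranked[:best_per_generation])
    return output


# Fitness values of the three populations from the example.
populations = [[0.1, 0.2, 0.4], [0.3, 0.5, 0.7], [0.8, 0.85, 0.9]]
output_set = interactive_output_set(populations)
```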

Page 12: Genetic Algorithms and Simulated Annealing  for Internet Search

Selected Papers About Intelligent Agents on Internet

Chen, H., Chung, Y.-M., Ramsey, M., Yang, C., Ma, P.-C., Yen, J., "Intelligent Spider for Internet Searching," Proceedings of the HICSS-97, Maui, Hawai'i, USA, pp. 178-188.

This paper introduces a new interactive genetic search algorithm, which is better than traditional genetic search without on-line adjustments (the worst case of which is the best-search algorithm).

The number of home pages doubles every 6 months! Consequently, searching is a challenge! Intelligent searching agents are called "spiders."

Major problems:
(a) Information overload
(b) Vocabulary differences (synonyms, different languages, ...)

Page 13: Genetic Algorithms and Simulated Annealing  for Internet Search

Main information retrieval mechanisms:
(a) Keyword search (Lycos at CMU and Yahoo at Stanford)
(b) Hypertext browsing (Mosaic and Netscape)

Two main approaches to Internet searching:
(a) Client-based searching spider
(b) Server-based on-line database indexing and searching

Client-based searching spiders:
(a) TueMosaic based on the Best First Search algorithm
(b) TueMosaic v2.42 based on the Fish Search algorithm
(c) WebCrawler based on an Improved Fish Search algorithm

The Best First Search elements:
(a) Current homepage (one or a set)
(b) User-specified set of keywords
(c) Depth and width of search for links contained in the current homepage

Page 14: Genetic Algorithms and Simulated Annealing  for Internet Search

The Fish Search - a modification of the Best First Search:
(a) Each URL corresponds to a fish
(b) After the document is retrieved, the fish spawns children (URLs)
(c) These URLs are "produced" only if relevant (not unconditionally)

Drawbacks of the Best First and the Fish Search:
(a) Potentially relevant homepages which do not connect with the current one are inaccessible!
(b) The search cost grows exponentially with the increase of depth and width.

The Crawler Search - a modification of the Fish Search:
(a) Search is initiated using an index
(b) Links are followed in an intelligent order:
    - Relevance of a link is evaluated using the anchor test
    - The anchor test measures similarity between the anchor text and the user query
    - Anchor text is the set of words describing the link to another document
    - Anchor text is a small subset of the document
    - Trade-off: search speed versus search quality
    - Essence: if weak links are avoided, more strong links can be followed per unit of time!
(c) Used by America Online since January 1995
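The anchor test described above can be sketched as a similarity score between anchor text and query, used to order the links before they are followed. The word-overlap measure used here is an assumption chosen for illustration; the slides do not say which similarity measure the crawler applies.

```python
def anchor_score(anchor_text, query):
    """Word-overlap similarity between anchor text and query (assumed measure)."""
    anchor_words = set(anchor_text.lower().split())
    query_words = set(query.lower().split())
    if not anchor_words or not query_words:
        return 0.0
    return len(anchor_words & query_words) / len(anchor_words | query_words)


def order_links(links, query):
    """Sort (url, anchor_text) pairs so that the strongest links come first."""
    return sorted(links, key=lambda link: anchor_score(link[1], query),
                  reverse=True)


# Hypothetical links: (URL, anchor text).
links = [
    ("u2", "cooking recipes"),
    ("u3", "internet search"),
    ("u1", "genetic search agents"),
]
ordered = [url for url, _ in order_links(links, "genetic search")]
```

Only the short anchor text is inspected, not the linked document, which is where the speed gain over fetching every page comes from.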

Page 15: Genetic Algorithms and Simulated Annealing  for Internet Search

On-line database searching and indexing:
(a) Entire WWW documents are retrieved and temporarily stored on the host server
(b) All relevant information is indexed on the host server
(c) This creates a server-based replica of all information on the WWW
(d) Index is used as a search key

Examples:
(a) WWWW - World Wide Web Worm
(b) AliWeb
(c) Harvest Information Discovery and Access System
(d) University of Arizona WWW Lab
(e) Lycos
(f) Excite
(g) Yahoo
(h) Alta Vista

Architecture of an intelligent spider (5 components):
(a) Requests and control
(b) Graphical user interface
(c) Search engine
(d) Home page fetching
(e) Indexing source

Page 16: Genetic Algorithms and Simulated Annealing  for Internet Search

Requests and control (#1):
(a) Users submit queries with information such as
    1. Starting URL(s)
    2. Keywords
    3. Number of URLs to be returned
    4. Category of the searching space
(b) When a query is submitted, the appropriate searching space is invoked in the available database

Graphical user interface (#2):
(a) A link between the submitted query and the searching engine
(b) Important that users can view intermediate results

Search engine (#3):
(a) Genetic algorithm
(b) Simulated annealing

Page 17: Genetic Algorithms and Simulated Annealing  for Internet Search

Homepage fetching (#4):
(a) Public fetching machines (Lynx and HtmlGobble)
(b) Custom fetching machines (Arizona and Serbia)

Indexing score (#5):
(a) Major goal of indexing is to identify the contents of a WWW document
(b) Major procedures of indexing are
    1. Word identification (ignored: case and punctuation)
    2. Word filter (extracted: common function, pure, and general words)
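The two indexing procedures can be sketched as follows. The `STOP_WORDS` set is a tiny hypothetical stand-in for the real lists of common function words and general words, which the slides do not give.

```python
import re

# Hypothetical stop list; a real indexer would use much larger word lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def identify_words(text):
    """Word identification: ignore case and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_words(words):
    """Word filter: drop the words found in the stop list."""
    return [word for word in words if word not in STOP_WORDS]


def index_terms(text):
    return filter_words(identify_words(text))


terms = index_terms("The Spider, and the WEB!")
```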

Comparing the similarity of homepages - Jaccard's score:
(a) A homepage with a higher Jaccard's score has a higher fitness with respect to the input homepage
(b) Score computed from links or from indexing

Page 18: Genetic Algorithms and Simulated Annealing  for Internet Search

Score From Links:
Homepages are x and y.
Their links are $X = \{x_1, x_2, \ldots\}$ and $Y = \{y_1, y_2, \ldots\}$.

Jaccard's score between x and y is equal to:

$$JS_{links}(x, y) = \frac{\#(X \cap Y)}{\#(X \cup Y)}$$

If X = Y, then J = 1.

If X and Y have no links in common, then J = 0.
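The links-based score maps directly onto set operations. A minimal sketch follows; returning 0 when both pages have no links at all is an assumption made here to avoid a division by zero.

```python
def jaccard_links(links_x, links_y):
    """Jaccard's score from links: #(X intersect Y) / #(X union Y)."""
    X, Y = set(links_x), set(links_y)
    union = X | Y
    if not union:
        return 0.0  # assumption: two link-free pages score 0
    return len(X & Y) / len(union)


same = jaccard_links(["a", "b"], ["b", "a"])
disjoint = jaccard_links(["a", "b"], ["c"])
partial = jaccard_links(["a", "b"], ["b", "c"])
```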

Reference: Goldberg, D.E., "Genetic Algorithms in Search, Optimization, and Machine Learning," Addison-Wesley, Reading, Massachusetts, 1989.

Page 19: Genetic Algorithms and Simulated Annealing  for Internet Search

Score From Indexing:
#1. The total number of homepages is counted: $N$
#2. The terms of a homepage are identified: set $t$
#3. The total number of terms is counted: $L$
#4. The number of words in term j is calculated: $w_j$
#5. Term frequency, the number of occurrences of term j in homepage x: $tf_{xj}$
#6. Homepage frequency, the number of homepages in set N where term j occurs: $df_j$
#7. Combined weight of term j in homepage x: $d_{xj}$

$$d_{xj} = tf_{xj} \times \log\left(\frac{N}{df_j} \times w_j\right)$$

Jaccard's score between homepages x and y:

$$J = \frac{A}{B + C - A}$$

$$A = \sum_{j=1}^{L} d_{xj}\, d_{yj}, \qquad B = \sum_{j=1}^{L} (d_{xj})^2, \qquad C = \sum_{j=1}^{L} (d_{yj})^2$$
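The indexing-based score can be computed with a short script. This is a sketch under assumptions: each homepage is given as a list of already extracted terms, the logarithm is the natural one (the base cancels out in J, since A, B, and C scale by the same factor), and a zero denominator is mapped to a score of 0.

```python
import math
from collections import Counter


def term_weights(docs):
    """docs: one list of terms per homepage. Returns one {term: d_xj} per page."""
    n = len(docs)                                                # N
    df = Counter(term for doc in docs for term in set(doc))      # df_j
    weights = []
    for doc in docs:
        tf = Counter(doc)                                        # tf_xj
        weights.append({
            # w_j is the number of words in term j.
            term: tf[term] * math.log(n / df[term] * len(term.split()))
            for term in tf
        })
    return weights


def jaccard_index(dx, dy):
    """J = A / (B + C - A) over the union of the terms of both pages."""
    terms = set(dx) | set(dy)
    a = sum(dx.get(t, 0.0) * dy.get(t, 0.0) for t in terms)
    b = sum(dx.get(t, 0.0) ** 2 for t in terms)
    c = sum(dy.get(t, 0.0) ** 2 for t in terms)
    denominator = b + c - a
    return a / denominator if denominator else 0.0


# Hypothetical term lists for four homepages.
docs = [
    ["genetic", "search", "search"],
    ["genetic", "search", "search"],
    ["cooking"],
    ["garden"],
]
w = term_weights(docs)
identical = jaccard_index(w[0], w[1])
unrelated = jaccard_index(w[0], w[2])
```

A term that occurs in every homepage and consists of one word gets weight log(1) = 0, so it does not contribute to the score.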

Page 20: Genetic Algorithms and Simulated Annealing  for Internet Search

Details of the Best First Search Implementation:

Essence:
(a) Looking for the best homepage in each iteration
(b) The number of iterations is equal to the number of required homepages

Algorithm:
(a) Initialization and input
    #1. Initialize k to 1
    #2. Obtain the initial set of homepages from the user(s)
    #3. Homepages from the initial set are fetched
    #4. Linked homepages of the input set are saved in H = {h1, h2, ...}

Page 21: Genetic Algorithms and Simulated Annealing  for Internet Search

(b) Determining the best homepage in H:
    #1. Determine the Jaccard's score for all elements of set H
    #2. Score is computed as:

$$JS_{links}(h_i) = \frac{1}{N} \sum_{j=1}^{N} JS_{links}(input_j, h_i)$$

$$JS_{index}(h_i) = \frac{1}{N} \sum_{j=1}^{N} JS_{index}(input_j, h_i)$$

$$JS(h_i) = \frac{1}{2}\left(JS_{links}(h_i) + JS_{index}(h_i)\right)$$

(c) Fetch the homepage from H with the highest JS
    #1. Save it as OUTPUT(k)
    #2. Increase k by 1

(d) Repeat until all output homepages are obtained
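Steps (a) through (d) can be put together in a compact sketch. The `PAGES` dictionary is hypothetical, and for brevity both the links score and the index score are computed here as plain set-overlap Jaccard scores, which simplifies the weighted index score defined earlier. The set H is not extended with the links of the selected homepage, since the slides do not state such a step.

```python
# Hypothetical pages: URL -> (set of links, set of index terms).
PAGES = {
    "in1": ({"p1", "p2", "p3"}, {"genetic", "search"}),
    "p1": ({"p2", "p3"}, {"genetic", "search", "agent"}),
    "p2": ({"x"}, {"cooking"}),
    "p3": ({"p2"}, {"genetic"}),
}


def jaccard(s, t):
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def score(h, inputs):
    """JS(h) = 1/2 (JS_links(h) + JS_index(h)), each averaged over the inputs."""
    links = sum(jaccard(PAGES[i][0], PAGES[h][0]) for i in inputs) / len(inputs)
    index = sum(jaccard(PAGES[i][1], PAGES[h][1]) for i in inputs) / len(inputs)
    return 0.5 * (links + index)


def best_first_search(inputs, n_outputs):
    # (a) Collect the homepages linked from the input set into H.
    H = set()
    for i in inputs:
        H |= PAGES[i][0]
    H = {h for h in H if h in PAGES and h not in inputs}
    output = []
    # (b)-(d) Repeatedly move the best-scoring homepage from H to the output.
    while H and len(output) < n_outputs:
        best = max(sorted(H), key=lambda h: score(h, inputs))
        output.append(best)
        H.remove(best)
    return output


found = best_first_search(["in1"], n_outputs=2)
```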

Page 22: Genetic Algorithms and Simulated Annealing  for Internet Search

Simulated Annealing for Internet Search

A discrete method for finding the global minimum of a function.

Unlike in continuous methods, no calculation of derivatives is needed.

By simulating the slow cooling process, a local minimum can be found.

By simulating a sudden heating process and stochastic crossover, conditions are generated for a potentially more efficient minimum search.

As the number of iterations increases, with a certain probability, the smallest minimum found can be declared equal to the global minimum.
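The slow cooling part of the method can be sketched on a toy discrete function. Everything concrete here is an assumption chosen for illustration: the function `f`, the neighborhood, the geometric cooling schedule, and the parameter values. Sudden heating and stochastic crossover are not included in this sketch.

```python
import math
import random


def simulated_annealing(f, start, neighbors, t0=10.0, cooling=0.95,
                        steps=500, seed=0):
    rng = random.Random(seed)
    current = best = start
    temperature = t0
    for _ in range(steps):
        candidate = rng.choice(neighbors(current))
        delta = f(candidate) - f(current)
        # Always accept improvements; accept a worse candidate with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if f(current) < f(best):
                best = current
        temperature *= cooling  # slow cooling
    return best


def f(x):
    """Toy discrete function with several local minima on 0..20."""
    return (x - 7) ** 2 + 3 * (x % 3)


def neighbors(x):
    return [max(0, x - 1), min(20, x + 1)]


best = simulated_annealing(f, 0, neighbors)
```

No derivative of `f` is ever computed; only function values at neighboring points are compared.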

Page 23: Genetic Algorithms and Simulated Annealing  for Internet Search

Explanation: Symbolic Representation of Simulated Annealing
SlC - Slow Cooling
SuH - Sudden Heating
StC - Stochastic Crossover

[Figure: f1(t) plotted against time (t), with t from 0 to 100 and f1(t) from 0 to 90; the curve is annotated with SuH, SlC, and StC, and with the points #1 and #2.]

Page 24: Genetic Algorithms and Simulated Annealing  for Internet Search

Explanation: Simulated Annealing on a MIMD Machine
The speed-up is considerably lower than linear [Green90], due to quasi-minima created by partitioning and interprocess communications.

[Figure: f2(t) plotted against time (t), with t from 0 to 25 and f2(t) from 0 to 100.]

Page 25: Genetic Algorithms and Simulated Annealing  for Internet Search

Important Difference Between SA and GA

SA - A single solution being modified over time (inherently serial)

GA - A population of candidate solutions maintained (inherently parallel)

Single search (typical of SA) is inherently serial and therefore difficult to parallelize.

However, it potentially offers better performance, since no edge effects and interprocess communications are present [Chen98].

Consequently, a hybrid approach may be a solution!

Page 26: Genetic Algorithms and Simulated Annealing  for Internet Search

Essence of the Hybrid Approaches

One solution is maintained per processing element (PE).

Each PE accepts a solution from other PEs for crossover and mutation.

If the best solution from the neighborhood is selected, convergence is potentially faster, but serialization gets limited.

If all PEs receive the visiting solution from the same direction, convergence gets slower, but parallelization gets easier, which may overcompensate for the slower convergence.

Page 27: Genetic Algorithms and Simulated Annealing  for Internet Search

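GSA Algorithm Running on Each PE of a MasPar Machine

    begin
      temperature := _initial_temperature();
      r := _random_solution();
      for i := 1 to max_iteration do
      begin
        direction := random(0, 7, random_seed);
        distance := random(1, max_distance, random_seed);
        v := XNet_direction(distance).r;
        [n0, n1] := crossover_mutation(r, v);
        r := select(r, n0, n1, temperature);
        temperature := temperature * ;   { the cooling factor is missing in the source }
      end
      return r
    end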
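The per-PE loop above can be imitated sequentially. The sketch below is not the MasPar implementation: it simulates a small toroidal grid of PEs in plain Python, uses one shared direction and distance per iteration, minimizes a toy objective (the number of ones in a bit list), and applies a Metropolis-style `select`. The cooling factor `alpha`, the grid size, and all parameter values are assumptions.

```python
import math
import random

# The eight neighbor directions (row step, column step), numbered 0..7.
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def cost(bits):
    """Toy objective: the number of ones (to be minimized)."""
    return sum(bits)


def crossover_mutation(r, v, rng):
    """One-point crossover of r with visitor v, plus one bit flip in n0."""
    cut = rng.randrange(1, len(r))
    n0, n1 = r[:cut] + v[cut:], v[:cut] + r[cut:]
    i = rng.randrange(len(r))
    n0 = n0[:i] + [1 - n0[i]] + n0[i + 1:]
    return n0, n1


def select(r, n0, n1, temperature, rng):
    """Keep the better child if it improves r, or with annealing probability."""
    child = min((n0, n1), key=cost)
    delta = cost(child) - cost(r)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return child
    return r


def gsa(size=4, length=16, max_iteration=200, max_distance=2,
        t0=5.0, alpha=0.97, seed=0):
    rng = random.Random(seed)
    grid = [[[rng.randint(0, 1) for _ in range(length)]
             for _ in range(size)] for _ in range(size)]
    temperature = t0
    for _ in range(max_iteration):
        d_row, d_col = DIRS[rng.randint(0, 7)]
        distance = rng.randint(1, max_distance)
        new_grid = []
        for i in range(size):
            row = []
            for j in range(size):
                # Visiting solution from the PE at the given direction/distance.
                v = grid[(i + d_row * distance) % size][(j + d_col * distance) % size]
                n0, n1 = crossover_mutation(grid[i][j], v, rng)
                row.append(select(grid[i][j], n0, n1, temperature, rng))
            new_grid.append(row)
        grid = new_grid
        temperature *= alpha  # assumed geometric cooling
    return min((r for row in grid for r in row), key=cost)


best = gsa()
```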

1 begin2 temperature:=_initial_temperature();3 r:=_random_solution();4 for i:=1 to max_iteration do5 begin6 direction:=random(0, 7, random_seed);7 distance:=random(1, max_distance, random_seed);8 v:=XNet_direction(distance).r;9 [n0, n1]:=crossover_mutation(r, v);10 r:=select(r, n0, n1, temperature);11 temperature:= temperature * ;12 end13 return r14 end