Genetic Algorithms and Simulated Annealing for Internet Search
Veljko Milutinović
Department of Electrical Engineering
School of Electrical Engineering
University of Belgrade
POB 35-54, 11120 Belgrade, Serbia, Yugoslavia
[email protected]
http://galeb.etf.bg.ac.yu/~vm
GAAS
Support for links-based search agents (spiders), as an alternative to index-based search (AltaVista)
Genetic algorithms invented and developed in AI; may be efficient if properly applied to Internet search
Simulated annealing introduced in mathematics; applicable to Internet search, but reported as not too efficient
Industry leader in packages for EBI-related strategic planning: Comshare, Inc.
Genetic Algorithm for Internet Search
1. Select the initial WWW presentation or a set thereof
2. Extract all URLs and fetch the corresponding WWW presentations
3. Measure the fitness value for each newly fetched WWW presentation
4. Continue with a subset of the most promising WWW presentations, while occasionally mutating the extracted URLs
[Chen+Chung+Ramsey+Yang+Ma+Yen97].
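The four steps above can be sketched as a short loop. This is a minimal illustration, not the implementation from the cited work: the `WEB` dictionary is a hypothetical in-memory stand-in for fetching real pages, the fitness is a plain keyword-set Jaccard overlap, and the parameters `keep` and `mutation_rate` are assumptions.

```python
import random

# Hypothetical in-memory stand-in for the WWW: URL -> (page keywords, outgoing links).
WEB = {
    "a": ({"genetic", "search"}, ["b", "c"]),
    "b": ({"genetic", "algorithm", "search"}, ["d"]),
    "c": ({"cooking"}, ["d"]),
    "d": ({"genetic", "algorithm"}, []),
    "e": ({"genetic", "algorithm", "search"}, []),  # reachable only through mutation
}

def fitness(url, query):
    """Jaccard overlap between the page keywords and the query keywords."""
    words = WEB[url][0]
    return len(words & query) / len(words | query)

def genetic_search(start, query, generations=3, keep=2, mutation_rate=0.3, seed=0):
    rng = random.Random(seed)
    population = list(start)                  # step 1: initial presentation(s)
    seen = set(population)
    for _ in range(generations):
        # Step 2: extract all URLs linked from the current population.
        candidates = {link for url in population for link in WEB[url][1]}
        # Step 4 (mutation): occasionally inject a URL picked at random.
        if rng.random() < mutation_rate:
            candidates.add(rng.choice(sorted(WEB)))
        candidates -= seen
        if not candidates:
            continue
        seen |= candidates
        # Steps 3 and 4: measure fitness and continue with the most promising subset.
        ranked = sorted(candidates, key=lambda u: fitness(u, query), reverse=True)
        population = ranked[:keep]
    return sorted(seen, key=lambda u: fitness(u, query), reverse=True)

result = genetic_search(["a"], {"genetic", "algorithm", "search"})
```

Mutation is what lets the search reach pages such as `"e"` that no visited page links to.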
Issues of Importance
1. Representation of genes (URL is a numerically-encoded string)
2. Crossover (one parent: WWW page; the other: selection function)
3. Fitness function (typically, Jaccard's score)
4. Number of offspring (limited to a subset of the "best")
5. Mutation type (typically, DB-based)
[Mirković+Kraus+Milutinović97].
Possible Solutions
Possible representation approaches:
1. String
2. Array of strings, etc…
Possible crossover approaches:
1. Link crossover (one explicit parent and one implicit parent)
2. Classical (two explicit parents), etc…
Possible fitness functions:
1. Jaccard's function
2. Evaluation function, etc…
Possible number of offspring:
1. Limited
2. Unlimited, etc…
Possible mutation types:
1. DB-based
2. Semantics-based, etc…
Representation of Genomes
Representation of genomes: string; array of strings; numerical (integer encoded or bit-string encoded)
Example: String representation
http://www.altavista.com EOS
http://galeb.etf.bg.ac.yu/~vm/tutorial.html EOS
Explanation: A URL is represented as a string, terminated by the End-Of-String (EOS) character.
Crossover Operator
Crossover operator types: classical; parent link; overlapping links; link pre-evaluation
Example: Link pre-evaluation
[Figure: parents, potential offspring with fitness values (0,1 0,3 0,9 0,93 0,2 0,8 0,9), and selected offspring]
Explanation: Fitness function values are calculated for all links, and the best ones are selected as offspring.
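Link pre-evaluation can be sketched as scoring every candidate link and keeping the top few. The link names and scores below are hypothetical, and `n_offspring` is an assumed parameter.

```python
def link_pre_evaluation(links, fitness, n_offspring):
    """Evaluate every candidate link and keep the n_offspring fittest ones."""
    return sorted(links, key=fitness, reverse=True)[:n_offspring]

# Hypothetical fitness values for five links extracted from the parents.
scores = {"u1": 0.1, "u2": 0.3, "u3": 0.9, "u4": 0.2, "u5": 0.8}
selected = link_pre_evaluation(list(scores), scores.get, 2)
print(selected)  # ['u3', 'u5']
```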
Fitness Function
• Simple keyword evaluation
• Jaccard's score
• Link evaluation
Example: Jaccard's score
[Figure: input documents, potential offspring, and output documents for iterations 1., 2., and 3., each labeled with its 0-1 Jaccard's score]
Explanation: The best first search algorithm is performed using Jaccard's score as the fitness evaluation function.
Degree of the Crossover
Degree of crossover: limited or unlimited
Example: Unlimited crossover
[Figure: parents and offspring merged into ranked solutions (0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9), from which the selected offspring are taken]
Explanation: Parents and all offspring are ranked according to their fitness function values; the best ones among them are selected.
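Unlimited crossover, as explained on this slide, ranks parents and offspring in one pool. A minimal sketch follows; the fitness values and the `size` parameter are hypothetical.

```python
def unlimited_crossover_selection(parents, offspring, fitness, size):
    """Rank parents and all offspring together and keep the `size` best."""
    pool = list(dict.fromkeys(parents + offspring))  # merge, dropping duplicates
    return sorted(pool, key=fitness, reverse=True)[:size]

# Hypothetical fitness values.
scores = {"p1": 0.9, "p2": 0.2, "o1": 0.8, "o2": 0.7, "o3": 0.1}
survivors = unlimited_crossover_selection(["p1", "p2"], ["o1", "o2", "o3"], scores.get, 3)
print(survivors)  # ['p1', 'o1', 'o2']
```

A fit parent such as `p1` survives alongside its offspring, while the weak parent `p2` is dropped.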
Mutation Operator
Mutation operator types: generational; selective; DB-based (unsorted, topic sorted, indexed); semantic (spatial locality, temporal locality, site-type)
Example: Topic sorted, DB-based mutation
in: topic (education)
URLs covering the topic: http://www.santafe.edu, http://www.etf.bg.ac.yu, http://www.cmu.edu
out: offspring (http://www.etf.bg.ac.yu)
Explanation: One URL is randomly selected from the set of URLs that cover a certain topic.
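A topic-sorted, DB-based mutation amounts to a random draw from a per-topic URL list. The sketch below uses the three education URLs from the example; the `TOPIC_DB` structure and the `mutate` function name are assumptions made for illustration.

```python
import random

# Topic-sorted URL database (structure assumed; URLs taken from the example).
TOPIC_DB = {
    "education": [
        "http://www.santafe.edu",
        "http://www.etf.bg.ac.yu",
        "http://www.cmu.edu",
    ],
}

def mutate(topic, rng=random):
    """Return one URL chosen at random among those covering the topic."""
    return rng.choice(TOPIC_DB[topic])

offspring = mutate("education", random.Random(1))
```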
Generation of the Output Set
Generation of the output set: interactive or post-generation
Example: Interactive generation
1. population: 0,1 0,2 0,4
2. population: 0,3 0,5 0,7
3. population: 0,8 0,85 0,9
The output set: 0,4 0,7 0,9
Explanation: The best individuals from each generation are selected for the output set.
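Interactive generation of the output set can be sketched with the three populations from the example. The `per_generation` parameter is an assumption; the slide takes one individual per generation.

```python
def interactive_output(generations, per_generation=1):
    """Collect the best individual(s) of every generation into the output set."""
    output = []
    for population in generations:
        output.extend(sorted(population, reverse=True)[:per_generation])
    return output

generations = [[0.1, 0.2, 0.4], [0.3, 0.5, 0.7], [0.8, 0.85, 0.9]]
print(interactive_output(generations))  # [0.4, 0.7, 0.9]
```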
Selected Papers About Intelligent Agents on Internet
Chen, H., Chung, Y.-M., Ramsey, M., Yang, C., Ma, P.-C., Yen, J., "Intelligent Spider for Internet Searching," Proceedings of the HICSS-97, Maui, Hawai'i, USA, pp. 178-188.
This paper introduces a new interactive genetic search algorithm, which is better than traditional genetic search without on-line adjustments (the worst case of which is the best first search algorithm).
The number of home pages doubles every 6 months!
Consequently, searching is a challenge!
Intelligent searching agents are called "spiders."
Major problems:
(a) Information overload
(b) Vocabulary differences (synonyms, different languages, ...)
Main information retrieval mechanisms:
(a) Keyword search (Lycos at CMU and Yahoo at Stanford)
(b) Hypertext browsing (Mosaic and Netscape)
Two main approaches to Internet searching:
(a) Client-based searching spider
(b) Server-based on-line database indexing and searching
Client-based searching spiders:
(a) TueMosaic based on the Best First Search algorithm
(b) TueMosaic v2.42 based on the Fish Search algorithm
(c) WebCrawler based on an Improved Fish Search algorithm
The Best First Search elements:
(a) Current homepage (one or a set)
(b) User-specified set of keywords
(c) Depth and width of search for links contained in the current homepage
The Fish Search - a modification of the Best First Search:
(a) Each URL corresponds to a fish
(b) After the document is retrieved, the fish spawns children (URLs)
(c) These URLs are "produced" only if relevant (not unconditionally)
Drawbacks of the Best First and the Fish Search:
(a) Potentially relevant homepages which do not connect with the current one are inaccessible!
(b) The search is exponential, with the increase of depth and width.
The Crawler Search - a modification of the Fish Search:
(a) Search initiated using index
(b) Links followed in an intelligent order:
Relevance of a link is evaluated using the anchor test
The anchor test measures similarity between anchor text and user query
Anchor text is the words describing the link to another document
Anchor text is a small subset of the document
Search speed versus search quality
Essence: If weak links are avoided, more strong links are followed per unit of time!
(c) Used by America Online since January 1995
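The anchor test can be sketched as a similarity between the words of the anchor text and the words of the query. The slides do not say which similarity measure is used, so the word-set Jaccard overlap below is an assumption, as are the function names and sample links.

```python
import re

def words(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def anchor_score(anchor_text, query):
    """Assumed similarity measure: Jaccard overlap of anchor and query words."""
    a, q = words(anchor_text), words(query)
    return len(a & q) / len(a | q) if a | q else 0.0

def order_links(links, query):
    """Sort (url, anchor_text) pairs so the most promising links come first."""
    return sorted(links, key=lambda link: anchor_score(link[1], query), reverse=True)

links = [
    ("u1", "Cooking recipes"),
    ("u2", "Genetic algorithms tutorial"),
    ("u3", "Genetic search"),
]
ordered = order_links(links, "genetic search")
print([url for url, _ in ordered])  # ['u3', 'u2', 'u1']
```

Only the short anchor text is inspected, so no page has to be fetched to rank its link; this is the speed-versus-quality trade-off named above.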
On-line database searching and indexing:
(a) Entire WWW documents are retrieved and temporarily stored on the host server
(b) All relevant information is indexed on the host server
(c) This creates a server-based replica of all information on the WWW
(d) Index is used as a search key
Examples:
(a) WWWW - World Wide Web Worm
(b) ALIWEB
(c) Harvest Information Discovery and Access System
(d) University of Arizona WWW Lab
(e) Lycos
(f) Excite
(g) Yahoo
(h) Alta Vista
Architecture of an intelligent spider (5 components):
(a) Requests and control
(b) Graphical user interface
(c) Search engine
(d) Home page fetching
(e) Indexing source
Requests and control (#1):
(a) Users submit queries with information such as
1. Starting URL(s)
2. Keywords
3. Number of URLs expected to return
4. Category of the searching space
(b) When a query is submitted, the appropriate searching space is invoked in the available database
Graphical user interface (#2):
(a) A link between the submitted query and the searching engine
(b) Important that users can view intermediate results
Search engine (#3):
(a) Genetic algorithm
(b) Simulated annealing
Homepage fetching (#4):
(a) Public fetching machines (Lynx and HtmlGobble)
(b) Custom fetching machines (Arizona and Serbia)
Indexing score (#5):
(a) Major goal of indexing is to identify the contents of a WWW document
(b) Major procedures of indexing are
1. Word identification (ignored: case and punctuation)
2. Word filter (extracted: common function, pure, and general words)
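The two indexing procedures can be sketched as tokenization followed by a filter. The stop-word list below is a small hypothetical sample; the actual filter lists are not given in the slides.

```python
import re

# Hypothetical sample of common function words to filter out.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def index_terms(text):
    """Word identification (case and punctuation ignored), then word filter."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(index_terms("The Spider, and the WEB!"))  # ['spider', 'web']
```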
Comparing the similarity of homepages - Jaccard's score:
(a) A homepage with a higher Jaccard score has a higher fitness with the input homepage
(b) Score computed from links or from indexing
Score From Links:
Homepages are x and y
Their links are X = {x1, x2, ...} and Y = {y1, y2, ...}
Jaccard's score between x and y is equal to:
$$JS_{links}(x, y) = \frac{\#(X \cap Y)}{\#(X \cup Y)}$$
If X = Y then J = 1
If X and Y have no links in common then J = 0
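The links-based score is the size of the intersection of the two link sets divided by the size of their union. A minimal sketch follows; returning 0 when both pages have no links is an assumption, since the slide leaves that case undefined.

```python
def jaccard_links(x_links, y_links):
    """#(X intersect Y) / #(X union Y) for the link sets of two homepages."""
    X, Y = set(x_links), set(y_links)
    if not X and not Y:
        return 0.0  # assumption: two pages without links are treated as dissimilar
    return len(X & Y) / len(X | Y)

print(jaccard_links(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```

Two of the four distinct links are shared, hence 0.5; identical sets give 1 and disjoint sets give 0.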
Reference:
Goldberg, D.E.,
"Genetic Algorithms in Search, Optimization, and Machine Learning,"
Addison-Wesley, Reading, Massachusetts, 1989.
Score From Indexing:
#1. Total number of homepages is counted: N
#2. Terms of a homepage are identified: set t
#3. Total number of terms is counted: L
#4. The number of words in term j is calculated: w_j
#5. Term frequency, the number of occurrences of term j in homepage x: tf_{xj}
#6. Homepage frequency, the number of homepages in set N where term j occurs: df_j
#7. Combined weight of term j in homepage x: d_{xj}
$$d_{xj} = tf_{xj} \cdot \log\left(\frac{N}{df_j} \cdot w_j\right)$$
$$J = \frac{A}{B + C - A}$$
$$A = \sum_{j=1}^{L} d_{xj} \, d_{yj}, \qquad B = \sum_{j=1}^{L} d_{xj}^{2}, \qquad C = \sum_{j=1}^{L} d_{yj}^{2}$$
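The indexing-based score can be sketched in two functions: one computing the combined term weights, one computing J = A / (B + C - A). The natural logarithm is an assumption (the slide does not state the base), w_j is taken as the number of space-separated words in the term, and the sample frequencies are hypothetical.

```python
import math
from collections import Counter

def term_weights(terms, df, n_pages):
    """Combined weight d_xj = tf_xj * log(N / df_j * w_j) for each term of one page."""
    weights = {}
    for term, tf in Counter(terms).items():
        w = len(term.split())            # w_j: number of words in term j
        weights[term] = tf * math.log(n_pages / df[term] * w)
    return weights

def jaccard_indexing(dx, dy):
    """J = A / (B + C - A) over the weight vectors of homepages x and y."""
    a = sum(dx[t] * dy[t] for t in dx.keys() & dy.keys())
    b = sum(v * v for v in dx.values())
    c = sum(v * v for v in dy.values())
    denom = b + c - a
    return a / denom if denom else 0.0

# Hypothetical collection: N = 10 homepages, with homepage frequencies df_j.
df = {"genetic": 2, "search": 5, "web": 5}
dx = term_weights(["genetic", "search", "genetic"], df, 10)
dy = term_weights(["web", "search"], df, 10)
score = jaccard_indexing(dx, dy)
```

A term that occurs in every homepage gets weight log(1) = 0 when it is a single word, so it does not contribute to the score.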
Details of the Best First Search Implementation:
Essence:
(a) Looking for the best homepage in each iteration
(b) The number of iterations is equal to the number of required homepages
Algorithm:
(a) Initialization and input
#1. Initialize k to 1
#2. Obtain the initial set of homepages from the user(s)
#3. Homepages from the initial set are fetched
#4. Linked homepages of the input set are saved in H = {h1, h2, ...}
(b) Determining the best homepage in H:
#1. Determine the Jaccard's score for all elements of set H
#2. Score is computed as:
$$JS_{links}(h_i) = \frac{1}{N} \sum_{j=1}^{N} JS_{links}(input_j, h_i)$$
$$JS_{index}(h_i) = \frac{1}{N} \sum_{j=1}^{N} JS_{index}(input_j, h_i)$$
$$JS(h_i) = \frac{1}{2}\left(JS_{links}(h_i) + JS_{index}(h_i)\right)$$
(c) Fetch the homepage from H with the highest JS
#1. Save it as OUTPUT(k)
#2. Increase k by 1
(d) Repeat until all output homepages are obtained
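The algorithm above can be sketched as follows. Several simplifications are assumptions: pages live in a hypothetical in-memory dictionary, the indexing score is replaced by a plain Jaccard overlap of term sets, and the links of each fetched homepage are added to H, which the slide does not state.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_first_search(inputs, pages, n_outputs):
    """pages: url -> {"links": [...], "terms": [...]} (hypothetical page store)."""
    def js(h):
        # Average the scores against every input homepage, then combine them.
        links = sum(jaccard(pages[x]["links"], pages[h]["links"]) for x in inputs) / len(inputs)
        index = sum(jaccard(pages[x]["terms"], pages[h]["terms"]) for x in inputs) / len(inputs)
        return 0.5 * (links + index)

    # (a) Linked homepages of the input set are saved in H.
    H = {l for x in inputs for l in pages[x]["links"] if l in pages} - set(inputs)
    output = []
    # (d) Repeat until all output homepages are obtained.
    while H and len(output) < n_outputs:
        best = max(sorted(H), key=js)    # (b) best homepage in H
        H.remove(best)
        output.append(best)              # (c) save it as OUTPUT(k)
        # Assumption: links of the fetched homepage extend H.
        H |= {l for l in pages[best]["links"] if l in pages} - set(inputs) - set(output)
    return output

pages = {
    "in": {"links": ["p1", "p2", "p3"], "terms": ["genetic", "search"]},
    "p1": {"links": ["p2", "p3"], "terms": ["genetic", "search"]},
    "p2": {"links": ["x"], "terms": ["cooking"]},
    "p3": {"links": ["p2"], "terms": ["genetic"]},
}
print(best_first_search(["in"], pages, 2))  # ['p1', 'p3']
```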
Simulated Annealing for Internet Search
A discrete method for finding the global minimum of a function.
Unlike in continuous methods, no calculation of derivatives is needed.
By simulating the slow cooling process, a local minimum can be found.
By simulating a sudden heating process and stochastic crossover, conditions are generated for a potentially more efficient minimum search.
As the number of iterations increases, with a certain probability, the smallest minimum found can be declared equal to the global minimum.
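The slow-cooling part can be sketched with the standard Metropolis acceptance rule. This sketch does not include sudden heating or stochastic crossover, and the objective function, neighborhood, and all parameter values are hypothetical.

```python
import math
import random

def simulated_annealing(f, x0, neighbors, t0=10.0, cooling=0.95, steps=500, seed=0):
    """Minimize f over a discrete space; no derivatives are needed."""
    rng = random.Random(seed)
    x = best = x0
    t = t0
    for _ in range(steps):
        candidate = rng.choice(neighbors(x))
        delta = f(candidate) - f(x)
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = candidate
            if f(x) < f(best):
                best = x
        t *= cooling                     # slow cooling
    return best

# Hypothetical objective on the integers 0..100 with several local minima.
def f(x):
    return (x - 70) ** 2 / 50 + 10 * math.cos(x / 3)

def neighbors(x):
    return [max(0, x - 1), min(100, x + 1)]

best = simulated_annealing(f, x0=5, neighbors=neighbors)
```

At high temperature, worse moves are accepted often, which lets the search leave a local minimum; as the temperature drops, the search settles into the nearest minimum.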
Explanation: Symbolic Representation of Simulated Annealing
SlC - Slow Cooling
SuH - Sudden Heating
StC - Stochastic Crossover
[Figure: f1(t) versus time (t), with minima #1 and #2 and the phases SlC, SuH, and StC marked]
Explanation: Simulated Annealing on a MIMD Machine
The speed-up is considerably lower than linear [Green90], due to quasi-minima created by partitioning and interprocess communications.
[Figure: f2(t) versus time (t)]
Important Difference Between SA and GA
SA - Single solution being modified over time (inherently serial)
GA - A population of candidate solutions maintained (inherently parallel)
Single search (typical of SA) is inherently serial and therefore difficult to parallelize.
However, it potentially offers better performance, since no edge effects and interprocess communications are present [Chen98].
Consequently, a hybrid approach may be a solution!
Essence of the Hybrid Approaches
One solution is maintained per processing element (PE).
Each PE accepts a solution from other PEs for crossover and mutation.
If the best solution from the neighborhood is selected, convergence is potentially faster, but serialization gets limited.
If all PEs receive the visiting solution from the same direction, convergence gets slower, but parallelization gets easier, and enables an overcompensation.
GSA Algorithm Running on Each PE of a MasPar Machine
begin
  temperature := _initial_temperature();
  r := _random_solution();
  for i := 1 to max_iteration do
  begin
    direction := random(0, 7, random_seed);             { one of 8 neighbor directions }
    distance := random(1, max_distance, random_seed);
    v := XNet_direction(distance).r;                    { visiting solution from the neighbor PE }
    [n0, n1] := crossover_mutation(r, v);
    r := select(r, n0, n1, temperature);
    temperature := temperature * alpha;                 { alpha: cooling factor, 0 < alpha < 1 (assumed name) }
  end
  return r
end
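The per-PE loop can be simulated sequentially to see how the pieces fit together. This is only a sketch under several assumptions: the PEs form a small toroidal grid, solutions are bit strings, crossover is one-point with a single bit-flip mutation, `select` uses a Boltzmann acceptance rule, and all parameter values are hypothetical. None of these details are specified in the slides.

```python
import math
import random

# The 8 neighbor directions of a PE on the grid.
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def select(current, children, fitness, t, rng):
    """Assumed rule: take the better child, or keep a worse one with Boltzmann probability."""
    child = max(children, key=fitness)
    delta = fitness(current) - fitness(child)   # positive when the child is worse
    if delta <= 0 or rng.random() < math.exp(-delta / t):
        return child
    return current

def gsa(fitness, n_bits=16, grid=4, iterations=50, t0=2.0, alpha=0.9, max_distance=2, seed=0):
    rng = random.Random(seed)
    # One random solution per PE.
    sols = [[[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(grid)]
            for _ in range(grid)]
    t = t0
    for _ in range(iterations):
        dr, dc = DIRS[rng.randrange(8)]         # all PEs receive from the same direction
        dist = rng.randint(1, max_distance)
        new = [[None] * grid for _ in range(grid)]
        for r in range(grid):
            for c in range(grid):
                mine = sols[r][c]
                visitor = sols[(r + dr * dist) % grid][(c + dc * dist) % grid]
                cut = rng.randrange(1, n_bits)  # one-point crossover
                children = [mine[:cut] + visitor[cut:], visitor[:cut] + mine[cut:]]
                for child in children:
                    child[rng.randrange(n_bits)] ^= 1   # single bit-flip mutation
                new[r][c] = select(mine, children, fitness, t, rng)
        sols = new
        t *= alpha                              # cooling
    return max((s for row in sols for s in row), key=fitness)

best = gsa(sum)   # maximize the number of 1 bits
```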