Genetic Learning for Information Retrieval
Andrew Trotman
Computer Science
365 * 24 * 60 / 40 = 13,140
Genetic Learning
• The Core Algorithm
  • Crossover, Mutation, Reproduction
  • Fitness proportionate selection
• Genetic Algorithms
  • Chromosome is an array
• Genetic Programming
  • Chromosome is an abstract syntax tree
[Figure: {A B C D E F} X {1 2 3 4 5 6}]
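The three operators named on this slide can be sketched for an array chromosome as follows. This is a minimal illustration, not the implementation used in the talk: the function names, the one-point form of crossover, and the per-gene mutation rate are all assumptions.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Fitness proportionate selection: pick one individual with
    probability proportional to its (non-negative) fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off

def crossover(parent_a, parent_b, point):
    """One-point crossover (assumed variant): swap the tails of two arrays."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, rate, rng, new_gene):
    """Replace each gene with probability `rate` by a freshly drawn gene."""
    return [new_gene(rng) if rng.random() < rate else g for g in chromosome]

# Crossing the two chromosomes from the slide at position 3.
children = crossover(list("ABCDEF"), list("123456"), 3)
```

Reproduction, the third operator, is simply copying a selected individual unchanged into the next generation.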
Information Retrieval (Text)
• Online Systems
– Dialog, LexisNexis, etc.
• Web Systems
– Alta Vista, Excite, Google, etc.
• Scientific Literature Systems
– CiteSeer, PubMed, BioMedNet, etc.
• Question:
– How should scientific literature be ranked?
  • Less time searching / More time researching
  • Higher exposure for “good” work
How Google Works
• PageRank
– Document ranking from PageRank
– A document’s PageRank is some factor (d, written w in the formula) of the rank of incoming citations
– A document’s influence is some factor of its rank and its outgoing citations
\[ Rank_h = (1 - w) + w \sum_{t=1}^{n} \frac{Rank_t}{Outbound_t} \]
• Characteristics of Scientific Literature
– Citations unidirectional (backwards in time)
– 12 month publication cycle
– Scientific citation “cliques”
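The PageRank formula on this slide can be iterated to a fixed point. The sketch below assumes a value of w = 0.85 and a fixed number of iterations, neither of which is given in the slides; the function name and the `links` input format are also illustrative.

```python
def pagerank(links, w=0.85, iterations=50):
    """Iterate Rank_h = (1 - w) + w * sum(Rank_t / Outbound_t) over all
    documents t that cite h. `links` maps a document to the documents it cites.
    w=0.85 and iterations=50 are assumed defaults, not values from the talk."""
    docs = set(links) | {d for targets in links.values() for d in targets}
    rank = {d: 1.0 for d in docs}
    for _ in range(iterations):
        new_rank = {}
        for h in docs:
            incoming = sum(rank[t] / len(links[t])
                           for t in docs
                           if h in links.get(t, ()))
            new_rank[h] = (1 - w) + w * incoming
        rank = new_rank
    return rank

# Two papers that both cite a third, older paper.
ranks = pagerank({"a": ["c"], "b": ["c"]})
```

Because scientific citations only point backwards in time, a citation graph like this one has no cycles, which is one reason the slide questions how well the model suits scientific literature.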
How IR works
• Indexing
– Build the dictionary
– Construct the Postings (<d,f> pairs)
• Searching
– Look up terms in dictionary
– Boolean resolution
– Rank on density (probability, vector space, etc.)
• Performance
– Recall and precision
Record1: Of Otago
Record2: Otago University
Record3: Otago
Record4: Of

dictionary    postings
OF            <1,1><4,1>
OTAGO         <1,1><2,1><3,1>
UNIVERSITY    <2,1>
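The indexing step can be sketched on the four example records. This is a minimal illustration assuming whitespace tokenisation and upper-casing of terms; the function name `build_index` is hypothetical.

```python
from collections import Counter, defaultdict

def build_index(records):
    """Return {term: [(d, f), ...]}: for each dictionary term, the postings
    list of <document id, term frequency> pairs. Document ids start at 1."""
    index = defaultdict(list)
    for doc_id, text in enumerate(records, start=1):
        # Assumed tokenisation: split on whitespace and fold to upper case.
        for term, freq in sorted(Counter(text.upper().split()).items()):
            index[term].append((doc_id, freq))
    return dict(index)

records = ["Of Otago", "Otago University", "Otago", "Of"]
index = build_index(records)
```

Searching then reduces to looking up each query term in `index` and combining the postings lists.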
Structured-IR
• Sci-Lit documents have structure
  • Title, abstract, conclusions, etc.
• <d,f> becomes <d,p,f>
<doc><docid>1</docid><place><name>University of Otago</name></place><cntry>New Zealand</cntry></doc>
<doc><docid>3</docid><place><name>University of Otago</name><rank>top</rank></place></doc>
<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>
doc:1 docid:2 place:3 name:4 cntry:5 sport:6 rank:7
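One way to produce <d,p,f> postings from the three example documents is sketched below. It assumes that p is the number of the innermost tag containing the term, that tags are numbered in order of first appearance when documents are read in docid order, and that the docid text itself is not indexed; the slides do not spell out these details.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOCS = [  # the slide's documents, listed here in docid order
    "<doc><docid>1</docid><place><name>University of Otago</name></place>"
    "<cntry>New Zealand</cntry></doc>",
    "<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>",
    "<doc><docid>3</docid><place><name>University of Otago</name>"
    "<rank>top</rank></place></doc>",
]

def index_structured(docs):
    """Return (tag_ids, postings) where postings maps term -> [(d, p, f)]."""
    tag_ids = {}        # tag name -> structure number p
    counts = Counter()  # (term, d, p) -> f
    for xml in docs:
        root = ET.fromstring(xml)
        doc_id = int(root.findtext("docid"))
        for element in root.iter():  # document order, root first
            p = tag_ids.setdefault(element.tag, len(tag_ids) + 1)
            if element.tag == "docid" or not element.text:
                continue
            for term in element.text.upper().split():
                counts[(term, doc_id, p)] += 1
    postings = {}
    for (term, d, p), f in sorted(counts.items()):
        postings.setdefault(term, []).append((d, p, f))
    return tag_ids, postings

tag_ids, postings = index_structured(DOCS)
```

Under these assumptions the tag numbering reproduces the one shown on the slide.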
Using Structure in Ranking
• Documents have structure
– Title, Abstract, Conclusions, etc.
• Weight each structure on “importance”
– Title higher than Abstract higher than …
• How to choose the weights
– Specified in the query (XIRQL)
– Query feedback
– Learn with a Genetic Algorithm
• Adapt ranking model to use structure
• Each tree node is a locus
• Weights are genes
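The idea of treating one weight per structure as a gene can be sketched with a small genetic algorithm. Everything here is illustrative: the simple weighted-frequency score stands in for the adapted ranking model, the elitism step and default parameters are assumptions, and in a real experiment the `fitness` callable would be a retrieval measure computed over training queries rather than a toy function.

```python
import random

def weighted_score(doc_postings, weights):
    """Score one document as the sum of f * weight[p] over its (p, f) pairs.
    A stand-in for a structure-aware ranking function."""
    return sum(weights[p] * f for p, f in doc_postings)

def evolve(fitness, n_genes, pop_size=20, generations=25, rate=0.1, seed=0):
    """Evolve a weight vector (n_genes >= 2). `fitness` must return a
    positive number. Returns the fittest chromosome and its fitness history."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        next_gen = [best[:]]  # reproduction of the fittest (assumed elitism)
        while len(next_gen) < pop_size:
            # Fitness proportionate selection of two parents.
            a, b = rng.choices(population, weights=scores, k=2)
            point = rng.randrange(1, n_genes)  # one-point crossover
            child = a[:point] + b[point:]
            # Mutation: redraw each weight with probability `rate`.
            child = [rng.random() if rng.random() < rate else g for g in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history
```

Because the fittest individual is always copied forward, the recorded fitness never decreases from one generation to the next.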
Experiment
• 50 training queries
• 50 evaluation queries
• 25 generations
• Probabilistic IR
• Vector Space IR
PROBABILISTIC IR
• 75.5% queries improved
• 6.7% increase in MAP (8.8% max)
VECTOR SPACE IR
• 61% queries improved
• 4.7% increase in MAP (5.4% max)
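The results are reported as mean average precision (MAP). As a reminder of how that measure is computed, here is a short sketch using the standard definition; the function names and the toy rankings are illustrative and not taken from the experiment.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs,
    one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries: relevant documents at ranks 1 and 3, then at rank 2.
runs = [(["a", "b", "c", "d"], {"a", "c"}), (["b", "a"], {"a"})]
map_score = mean_average_precision(runs)
```

A percentage increase in MAP, as quoted above, compares this figure for the learned weights against the unweighted baseline on the same queries.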
Results
[Figure: Weighted Probability Model Learning. x-axis: Generation (1 to 25); y-axes: Training Set MAP (0.1950 to 0.1980) and Evaluation Set MAP (0.1775 to 0.1805). Series: Training, Evaluation.]
[Figure: Weighted Probability Model MAP Improvement by Topic. x-axis: Topic; y-axis: Average Precision (-0.06 to 0.14).]
Ranking Algorithms
• Multitude exist
– Probability, vector space, Boolean
– Several published nomenclatures
• Over 100,000 “published” algorithms
• Purpose
– Put relevant documents first
– Sorting
– Performance measures with precision
• Sources
– Some guy thought it up
Experiment
• 50 training queries
• 50 evaluation queries
• 31 runs
• Weekend time limit
• Compare to Probabilistic
• 67% queries improved
• 15% increase in MAP
Results
[Figure: Improvement in MAP for each Query in Fittest of Best Run. x-axis: Query; y-axis: Improvement over Probability Method (-100% to 400%).]
[Figure: Fittest Individual's MAP by Generation for Best Run. x-axis: Generation (1 to 121); y-axis: Mean Average Precision (0.14 to 0.22). Series: Run 15 Evaluation Set, Run 15 Training Set, Probability Evaluation Set, Probability Training Set.]
Function Comparison
Vector Space
\[ w_{dq} = \sum_{t \in q} \left( tf_{td} \times \log_2 \frac{N}{n_t} \right) \times \left( tf_{tq} \times \log_2 \frac{N}{n_t} \right) \]
Probability
\[ w_{dq} = \sum_{t \in q} \left( C + \log_2 \frac{N - n_t + 1}{n_t} \right) \times \left( K + (1 - K) \times \frac{tf_{td}}{m_d} \right) \]
Learned
wdq = Σ t∈q (((((((((U / sqrt(sqrt(nt))) / (mq / sqrt((((Lq / (sqrt(sqrt(Ld)) / sqrt((U / nc)))) * min(mq, N)) / sqrt(((((((Tmax / sqrt(U)) / sqrt((((log2(sqrt(nt)) / sqrt(nt)) / sqrt(Umax)) / (M / nc)))) / sqrt((U / nc))) - uq) / mq) / sqrt(nt))))))) / sqrt((log(Tmax) / nc))) / sqrt(nt)) / sqrt(nt)) / sqrt((Lq / sqrt(((sqrt((sqrt(sqrt(Ld)) / sqrt((min(mq, sqrt((((log(Tmax) / nc) / sqrt(Umax)) / (mq / sqrt(((N * min((sqrt(nc) / sqrt(U)), Ld)) / sqrt(N))))))) / sqrt(Ld))))) / sqrt((Tmax / nc))) / sqrt(nt)))))) / sqrt((min(mq, N) / nc))) / sqrt((log(Tmax) / nc))) / sqrt(nt))
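A learned function like this is the printed form of a genetic programming chromosome, that is, an abstract syntax tree. The sketch below shows one way such a tree could be represented and evaluated. The nested-tuple encoding, the protected division and logarithm, and the variable values are all assumptions made for illustration; only the sub-expression U / sqrt(sqrt(nt)) is taken from the learned function.

```python
import math

# Operator table. Division and log2 are "protected" so that any tree can
# be evaluated without raising an error (an assumed GP convention).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 1.0,
    "min": lambda a, b: min(a, b),
    "sqrt": lambda a: math.sqrt(abs(a)),
    "log2": lambda a: math.log2(abs(a)) if a != 0 else 0.0,
}

def evaluate(node, variables):
    """Evaluate a tree given as nested tuples (operator, child, ...).
    Strings are variable names; anything else is a numeric constant."""
    if isinstance(node, tuple):
        op, *args = node
        return OPS[op](*(evaluate(arg, variables) for arg in args))
    if isinstance(node, str):
        return variables[node]
    return float(node)

# The innermost sub-expression of the learned function: U / sqrt(sqrt(nt)).
tree = ("/", "U", ("sqrt", ("sqrt", "nt")))
value = evaluate(tree, {"U": 8.0, "nt": 16.0})  # made-up variable values
```

Crossover in this representation swaps subtrees between two parents, and mutation replaces a subtree with a randomly generated one.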
Conclusions
• Using document structure improved ranking
• Structure weights can be learned with a GA
• GP can be used to learn ranking functions
Speculation
• Combining GA and GP to learn a structure ranking algorithm will do better than GA or GP alone
Questions?
Random Numbers
Are your results an artifact of your random number generator?