Genetic Learning for Information Retrieval

Andrew Trotman, Computer Science

365 * 24 * 60 / 40 = 13,140

Genetic Learning
• The Core Algorithm
• Crossover, Mutation, Reproduction
• Fitness proportionate selection
• Genetic Algorithms
• Chromosome is an array
• Genetic Programming
• Chromosome is an abstract syntax tree

[Figure: crossover (X) of the chromosomes {A B C D E F} and {1 2 3 4 5 6}]
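The core loop named on this slide (fitness proportionate selection, crossover, mutation, reproduction) can be sketched as follows for an array chromosome. This is a minimal illustration, not the implementation used in the talk: the population size, mutation rate, single-point crossover, and the use of elitism as the reproduction step are all assumptions.

```python
import random

def select(population, fitnesses, rng):
    """Fitness proportionate (roulette wheel) selection of one parent."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two array chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rng, rate=0.05):
    """Replace each gene with a fresh random value with probability `rate`."""
    return [rng.random() if rng.random() < rate else g for g in chromosome]

def evolve(fitness, length=6, size=20, generations=25, seed=0):
    """Evolve arrays of `length` genes in [0, 1); fitness must be positive."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(size)]
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        # Reproduction (assumed here to be elitism): copy the fittest unchanged.
        next_gen = [max(population, key=fitness)]
        while len(next_gen) < size:
            a = select(population, fitnesses, rng)
            b = select(population, fitnesses, rng)
            child1, child2 = crossover(a, b, rng)
            next_gen.extend([mutate(child1, rng), mutate(child2, rng)])
        population = next_gen[:size]
    return max(population, key=fitness)
```

Calling `evolve(sum)` uses the sum of the genes as a toy fitness function; a real run would substitute a retrieval measure computed over the training queries.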

Information Retrieval (Text)
• Online Systems
– Dialog, LexisNexis, etc.
• Web Systems
– Alta Vista, Excite, Google, etc.
• Scientific Literature Systems
– CiteSeer, PubMed, BioMedNet, etc.
• Question:
– How should scientific literature be ranked?
• Less time searching / More time researching
• Higher exposure for “good” work

How Google Works
• PageRank
– Document ranking from PageRank
– A document’s PageRank is some factor (d) of the rank of incoming citations
– A document’s influence is some factor of its rank and its outgoing citations
• Characteristics of Scientific Literature
– Citations unidirectional (backwards in time)
– 12 month publication cycle
– Scientific citation “cliques”

$$Rank_h = (1 - w) + w \sum_{t=1}^{n} \frac{Rank_t}{Outbound_t}$$
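Reading the slide's formula as Rank_h = (1 - w) + w * sum over citing documents t of Rank_t / Outbound_t, the ranks can be approximated by repeated substitution. The sketch below is an illustration under assumptions: the value w = 0.85, the fixed iteration count, and the three-document citation graph are not from the talk.

```python
def pagerank(outlinks, w=0.85, iterations=50):
    """outlinks maps each document to the list of documents it cites."""
    docs = list(outlinks)
    rank = {d: 1.0 for d in docs}
    for _ in range(iterations):
        new_rank = {}
        for h in docs:
            # Sum Rank_t / Outbound_t over every document t that cites h.
            incoming = sum(rank[t] / len(outlinks[t])
                           for t in docs if h in outlinks[t])
            new_rank[h] = (1 - w) + w * incoming
        rank = new_rank
    return rank

# Hypothetical citation graph: a cites b and c, b cites c, c cites a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph document c collects the most rank because both other documents cite it. Scientific citations point only backwards in time, so a cycle like this one would be rare in real literature.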

How IR works
• Indexing
– Build the dictionary
– Construct the Postings (<d,f> pairs)
• Searching
– Look up terms in dictionary
– Boolean resolution
– Rank on density (probability, vector space, etc.)
• Performance
– Recall and precision

Record1: Of Otago
Record2: Otago University
Record3: Otago
Record4: Of

dictionary    postings
OF            <1,1><4,1>
OTAGO         <1,1><2,1><3,1>
UNIVERSITY    <2,1>
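The indexing and Boolean resolution steps can be sketched for the four example records. This is a minimal in-memory illustration; the function names and the choice to upper-case terms and split on whitespace are assumptions.

```python
from collections import Counter, defaultdict

def build_index(records):
    """Map each dictionary term to its postings list of (d, f) pairs."""
    index = defaultdict(list)
    for doc_id, text in sorted(records.items()):
        for term, freq in Counter(text.upper().split()).items():
            index[term].append((doc_id, freq))
    return dict(index)

def boolean_and(index, *terms):
    """Return the documents that contain every one of the given terms."""
    doc_sets = [{d for d, _ in index.get(t.upper(), [])} for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

records = {1: "Of Otago", 2: "Otago University", 3: "Otago", 4: "Of"}
index = build_index(records)
matches = boolean_and(index, "of", "otago")
```

Only Record1 contains both OF and OTAGO, so the Boolean AND returns document 1 alone.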

Structured-IR
• Sci-Lit documents have structure
• Title, abstract, conclusions, etc.
• <d,f> becomes <d,p,f>

<doc><docid>1</docid><place><name>University of Otago</name></place><cntry>New Zealand</cntry></doc>

<doc><docid>3</docid><place><name>University of Otago</name><rank>top</rank></place></doc>

<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>

doc:1
  docid:2
  place:3
    name:4
    rank:7
  cntry:5
  sport:6
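One way to obtain <d,p,f> postings from the three example documents is to give each tag an id the first time it is seen and record that id with every term. The sketch below assumes the documents are processed in docid order, which reproduces the ids on the slide (doc:1 through rank:7). Using the enclosing tag's id as p, rather than a full path, is also an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

docs = [
    "<doc><docid>1</docid><place><name>University of Otago</name></place>"
    "<cntry>New Zealand</cntry></doc>",
    "<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>",
    "<doc><docid>3</docid><place><name>University of Otago</name>"
    "<rank>top</rank></place></doc>",
]

def index_structure(xml_docs):
    """Return (tag name -> id, Counter mapping (term, d, p) -> f)."""
    tag_ids = {}
    postings = Counter()
    for xml in xml_docs:
        root = ET.fromstring(xml)
        doc_id = int(root.findtext("docid"))
        for element in root.iter():  # document order, root included
            # A new tag gets the next free id; a known tag keeps its id.
            p = tag_ids.setdefault(element.tag, len(tag_ids) + 1)
            for term in (element.text or "").upper().split():
                postings[(term, doc_id, p)] += 1
    return tag_ids, postings

tag_ids, postings = index_structure(docs)
```

With this scheme a search for OTAGO can tell that the term occurs inside `name` (p = 4) in documents 1 and 3, which is the information a structure-weighted ranking function needs.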

Using Structure in Ranking
• Documents have structure
– Title, Abstract, Conclusions, etc.
• Weight each structure on “importance”
– Title higher than Abstract higher than …
• How to choose the weights
– Specified in the query (XIRQL)
– Query feedback
– Learn with a Genetic Algorithm
• Adapt ranking model to use structure
• Each tree node is a locus
• Weights are genes
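To make the locus and gene idea concrete, the sketch below scores documents from <d,p,f> postings using a chromosome of per-node weights. It substitutes a plain weighted term frequency for the probabilistic and vector space models adapted in the talk, and the postings, loci, and weights are hypothetical.

```python
def rank(query_terms, postings, weights):
    """Rank documents by structure-weighted term frequency.

    postings: term -> list of (d, p, f) triples
    weights:  the chromosome; weights[p] is the gene at locus p
    """
    scores = {}
    for term in query_terms:
        for d, p, f in postings.get(term, []):
            scores[d] = scores.get(d, 0.0) + weights[p] * f
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical loci: 0 = title, 1 = abstract.
postings = {"GENETIC": [(1, 0, 1), (2, 1, 3)]}
title_heavy = rank(["GENETIC"], postings, [5.0, 1.0])
unweighted = rank(["GENETIC"], postings, [1.0, 1.0])
```

With the title weighted five times higher, the single title hit in document 1 outranks three abstract hits in document 2; with equal weights the order reverses. A GA would search over such weight arrays, using a retrieval measure on the training queries as the fitness.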

Experiment

• 50 training queries

• 50 evaluation queries

• 25 generations

• Probabilistic IR

• Vector Space IR

PROBABILISTIC IR

• 75.5% queries improved

• 6.7% increase in MAP (8.8% max)

VECTOR SPACE IR

• 61% queries improved

• 4.7% increase in MAP (5.4% max)
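The results are reported as mean average precision (MAP). As a reminder of how that measure is usually computed, here is a small sketch with invented rankings and relevance judgments. It uses the standard non-interpolated definition, which may differ in detail from the evaluation tool used for the talk.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at rank i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: one (ranked list, set of relevant documents) pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries.
runs = [([1, 2, 3, 4], {1, 3}), ([5, 6], {6})]
map_score = mean_average_precision(runs)
```

For the first query the relevant documents appear at ranks 1 and 3, giving (1/1 + 2/3) / 2; for the second the only relevant document is at rank 2, giving 1/2. MAP is the mean of the two.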

Results

[Figure: Weighted Probability Model Learning. x-axis: Generation (1 to 25); left y-axis: Training Set MAP (0.1950 to 0.1980); right y-axis: Evaluation Set MAP (0.1775 to 0.1805); series: Training, Evaluation.]

[Figure: Weighted Probability Model MAP Improvement by Topic. x-axis: Topic; y-axis: Average Precision (-0.06 to 0.14).]

Ranking Algorithms
• Multitude exist
– Probability, vector space, Boolean
– Several published nomenclatures
• Over 100,000 “published” algorithms
• Purpose
– Put relevant documents first
– Sorting
– Performance measures with precision
• Sources
– Some guy thought it up

Experiment

• 50 training queries

• 50 evaluation queries

• 31 runs

• Weekend time limit

• Compare to Probabilistic

• 67% queries improved

• 15% increase in MAP

Results

[Figure: Improvement in MAP for each Query in Fittest of Best Run. x-axis: Query; y-axis: Improvement over Probability Method (-100% to 400%).]

[Figure: Fittest Individual's MAP by Generation for Best Run. x-axis: Generation (1 to 121); y-axis: Mean Average Precision (0.14 to 0.22); series: Run 15 Evaluation Set, Run 15 Training Set, Probability Evaluation Set, Probability Training Set.]

Function Comparison

Vector Space

$$w_{dq} = \sum_{t \in q} \left( tf_{td} \times \log_2 \frac{N}{n_t} \right) \times \left( tf_{tq} \times \log_2 \frac{N}{n_t} \right)$$

Probability

$$w_{dq} = \sum_{t \in q} \left( C + \log_2 \frac{N - n_t + 1}{n_t} \right) \times \left( K + (1 - K) \times \frac{tf_{td}}{m_d} \right)$$

Learned

w_dq = Σ_{t∈q} (((((((((U / sqrt(sqrt(nt))) / (mq / sqrt((((Lq / (sqrt(sqrt(Ld)) / sqrt((U / nc)))) * min(mq, N)) / sqrt(((((((Tmax / sqrt(U)) / sqrt((((log2(sqrt(nt)) / sqrt(nt)) / sqrt(Umax)) / (M / nc)))) / sqrt((U / nc))) - uq) / mq) / sqrt(nt))))))) / sqrt((log(Tmax) / nc))) / sqrt(nt)) / sqrt(nt)) / sqrt((Lq / sqrt(((sqrt((sqrt(sqrt(Ld)) / sqrt((min(mq, sqrt((((log(Tmax) / nc) / sqrt(Umax)) / (mq / sqrt(((N * min((sqrt(nc) / sqrt(U)), Ld)) / sqrt(N))))))) / sqrt(Ld))))) / sqrt((Tmax / nc))) / sqrt(nt)))))) / sqrt((min(mq, N) / nc))) / sqrt((log(Tmax) / nc))) / sqrt(nt))
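In genetic programming the chromosome is an abstract syntax tree, so a learned function such as the one on this slide is evaluated by walking the tree. The sketch below represents a tree as nested tuples and evaluates the fragment `U / sqrt(sqrt(nt))` from the learned function. The tuple encoding, the protected operators (a common GP convention for avoiding division by zero and invalid roots), and the variable values are assumptions, not details from the talk.

```python
import math

# Operator table; the protected variants never raise on zero or negatives.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 1.0,
    "min": min,
    "sqrt": lambda a: math.sqrt(abs(a)),
    "log2": lambda a: math.log2(abs(a)) if a != 0 else 0.0,
}

def evaluate(tree, env):
    """Evaluate a tree of (operator, child, ...) tuples, names, and constants."""
    if isinstance(tree, tuple):
        op, *args = tree
        return OPS[op](*(evaluate(a, env) for a in args))
    if isinstance(tree, str):
        return env[tree]  # a variable such as "U" or "nt"
    return tree           # a numeric constant

# The fragment U / sqrt(sqrt(nt)) with made-up values for U and nt.
fragment = ("/", "U", ("sqrt", ("sqrt", "nt")))
value = evaluate(fragment, {"U": 8.0, "nt": 16.0})
```

Crossover in this representation swaps subtrees between two parents, and mutation replaces a subtree with a freshly generated one, which is how expressions as deeply nested as the learned function can arise.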

Conclusions
• Using document structure improved ranking
• Structure weights can be learned with a GA
• GP can be used to learn ranking functions

Speculation
• Combining GA and GP to learn a structure ranking algorithm will outperform either GA or GP alone

Questions?

Random Numbers
Are your results an artifact of your random number generator?