Presentation for the GECCO conference
A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams
Lourdes Araujo, JJ Merelo
[email protected], [email protected]
Dpto. Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
Dpto. Arquitectura y Tecnología de Computadores
Universidad de Granada
Spain
A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams – p.1/24
Why
• Document metadata, such as arrival time, helps organize document streams.
• Temporal information helps make sense of document streams such as e-mails and news items.
• Its study combines content analysis and time series modelling.
Showing interest
• Hypothesis: Explosions in interest match points in time where arrival intensity increases sharply.
• In general, arrival time is quite irregular.
[Figure: number of document arrivals against time]
Regularizing irregularity
• A cost function is introduced that reflects how difficult it is to hike from one state to another.
• Intervals of similar frequency should be grouped in a single state, so changes of state are penalized. But we shouldn't overdo it.
Kleinberg's model
• The document stream is modeled as an infinite-state automaton, $A$, which emits messages with different frequencies.
• Each state has a frequency assigned.
• Bursts are indicated by transitions from a lower to a higher state.
• Frequency changes are controlled by assigning costs to state changes, which avoids small explosions and makes real explosions easier to identify.
Infinite state automaton model
• The time sequence is generated from an exponential distribution.
• The time interval $x$ between messages $i$ and $i + 1$ follows an exponential distribution with density $f(x) = \alpha e^{-\alpha x}$, for $\alpha > 0$.
• The expected value of the interval is $\alpha^{-1}$.
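A minimal Python sketch of the exponential gap model, useful for checking the stated mean numerically. The function names, the seed and the sample size are illustrative choices and do not come from the slides.

```python
import math
import random


def exp_density(alpha, x):
    """Density f(x) = alpha * exp(-alpha * x) of a gap of length x."""
    return alpha * math.exp(-alpha * x)


def sample_intervals(alpha, n, seed=0):
    """Draw n gaps between consecutive messages at emission rate alpha."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return [rng.expovariate(alpha) for _ in range(n)]


if __name__ == "__main__":
    gaps = sample_intervals(alpha=2.0, n=20000)
    # The sample mean should be close to 1 / alpha = 0.5.
    print(sum(gaps) / len(gaps))
```

A higher rate $\alpha$ gives shorter expected gaps, which is why a state with a larger rate stands for a burst.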
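First things first: two state model
• Basic model: a 2-state probabilistic automaton $A$ with states $q_0$ (low emission rate) and $q_1$ (high).
[Figure: two-state automaton with states $q_0$ and $q_1$]
• $n + 1$ messages, $n$ intervals: a Bayes procedure is used to fit a conditional probability of a state sequence $q = (q_{i_1}, \cdots, q_{i_n})$:
$$c(q|x) = b \ln\left(\frac{1 - p}{p}\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right)$$
where $b$ is the number of state transitions. The first term favours a low number of transitions; the second favours states that fit the sequence.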
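The two-state cost can be evaluated directly, as in the sketch below. Several things here are assumptions rather than content from the slides: $p$ is read as the probability of changing state (as in Kleinberg's model), transitions are counted only between consecutive intervals, states are numbered 0 and 1, and the rates in the example are made up.

```python
import math


def neg_log_density(alpha, x):
    """-ln f(x) for the exponential density f(x) = alpha * exp(-alpha * x)."""
    return alpha * x - math.log(alpha)


def two_state_cost(states, intervals, alphas, p):
    """c(q|x) = b * ln((1 - p) / p) + sum_t -ln f_{i_t}(x_t).

    states    -- one state index (0 or 1) per interval
    intervals -- the gaps x_1 .. x_n
    alphas    -- emission rate of each state (illustrative values)
    p         -- assumed probability of a state change, 0 < p < 1
    """
    if len(states) != len(intervals):
        raise ValueError("need exactly one state per interval")
    # b: number of positions where the state differs from the previous one.
    b = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    fit = sum(neg_log_density(alphas[q], x) for q, x in zip(states, intervals))
    return b * math.log((1 - p) / p) + fit


if __name__ == "__main__":
    gaps = [1.0, 1.0, 0.1, 0.1]
    rates = (1.0, 10.0)
    print(two_state_cost([0, 0, 0, 0], gaps, rates, p=0.1))  # never switch
    print(two_state_cost([0, 0, 1, 1], gaps, rates, p=0.1))  # switch once
```

In this toy example the sequence that switches to the high state for the two short gaps is cheaper, even after paying for one transition.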
To the infinite and beyond
• Given a sequence of intervals $x = (x_1, x_2, \cdots, x_n)$, a sequence $q = (q_{i_1}, \cdots, q_{i_n})$ that minimizes
$$c(q|x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$
must be found.
• $f$ is related to the resolution of discrete rates within continuous emission rates, and $\tau$ to the ease of changing state.
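For a finite number of states this cost can be minimized exactly by dynamic programming (the Viterbi algorithm). The sketch below is not the authors' implementation. It borrows the concrete choices of Kleinberg's burst model, which the slides do not spell out: rates $\alpha_i = (n/T)\,s^i$, transition cost $\tau(i, j) = (j - i)\,\gamma \ln n$ when moving up and 0 when moving down, and a start in state 0. The defaults `s=2.0` and `gamma=1.0` are illustrative.

```python
import math


def min_cost_states(intervals, k, s=2.0, gamma=1.0):
    """Return (state sequence, cost) minimizing c(q|x) over k states."""
    n = len(intervals)
    total = sum(intervals)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive interval")
    base = n / total                      # average arrival rate
    alphas = [base * s ** i for i in range(k)]

    def tau(i, j):
        # Assumed transition cost: pay only when moving to a higher state.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    def fit(i, x):
        # -ln f_i(x) for the exponential density with rate alphas[i].
        return alphas[i] * x - math.log(alphas[i])

    # best[j]: cheapest cost of any state sequence so far that ends in j.
    best = [tau(0, j) + fit(j, intervals[0]) for j in range(k)]
    back = []                             # back[t][j]: best predecessor of j
    for x in intervals[1:]:
        prev = best
        choice = [min(range(k), key=lambda i, j=j: prev[i] + tau(i, j))
                  for j in range(k)]
        best = [prev[choice[j]] + tau(choice[j], j) + fit(j, x)
                for j in range(k)]
        back.append(choice)

    state = min(range(k), key=lambda j: best[j])
    cost = best[state]
    path = [state]
    for choice in reversed(back):         # walk the predecessors backwards
        state = choice[state]
        path.append(state)
    path.reverse()
    return path, cost


if __name__ == "__main__":
    gaps = [10.0] * 5 + [0.1] * 10 + [10.0] * 5
    print(min_cost_states(gaps, k=3))
```

The running time grows with the number of intervals times the square of the number of states, which gives a rough idea of why a cheaper search becomes attractive on long streams.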
Infinite is a bit too much
• $A^*_{s,\gamma}$ that minimizes $c(q|x)$ is restricted to $A^k_{s,\gamma}$ with $k$ states.
• We will use an evolutionary algorithm to find $A^k_{s,\gamma}$.
• Finally!
Individual representation
• A sequence of $n$ integers, $1 < q_{ij} < E$, each representing an automaton state, together with the id $i$ of the last document in the sequence.
• Document $i$ arrives at $0 \le t_i \le T$ (intervals $x_i = t_i - t_{i-1}$).
$t_1 \quad t_2 \quad \cdots \quad t_n$
$| \; q_{t_1}, t_{k_1} \; | \; q_{t_{k_1}+1}, t_{k_2} \; | \; \cdots \; | \; q_{t_f}, t_n \; |$
• Fitness function = cost function.
• Initial population: documents chosen at random that split the document stream into intervals, with random states.
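One way to read this representation in code is sketched below. A gene is modelled as a pair `(state, last)`, where `last` is the index of the last interval the gene covers. The 0-based state numbering, the helper names and the simplified fitness (a flat `change_cost` per state change instead of $\tau$) are assumptions made for illustration only.

```python
import math
import random


def random_individual(n, num_states, num_genes, rng):
    """Split intervals 1..n into num_genes runs, each with a random state."""
    cuts = sorted(rng.sample(range(1, n), num_genes - 1)) + [n]
    return [(rng.randrange(num_states), last) for last in cuts]


def expand(individual):
    """Turn the genes into one state per interval."""
    states, start = [], 0
    for state, last in individual:
        states.extend([state] * (last - start))
        start = last
    return states


def fitness(individual, intervals, alphas, change_cost):
    """Simplified cost: exponential fit term plus a flat cost per change."""
    states = expand(individual)
    fit = sum(alphas[q] * x - math.log(alphas[q])
              for q, x in zip(states, intervals))
    changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return fit + change_cost * changes


if __name__ == "__main__":
    rng = random.Random(1)
    individual = random_individual(n=20, num_states=3, num_genes=4, rng=rng)
    print(individual)
    print(expand(individual))
```

Storing only the run boundaries keeps the chromosome short when the stream stays in one state for long stretches.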
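Crossover
Parent 1
  genes: $g_{11} \cdots g_{1i} \cdots g_{1f_1}$
  contents: $q_{11}, (t_1, \cdots) \;\cdots\; q_{1i}, (t - n_1, \cdots, t, \cdots, t + m_1) \;\cdots\; q_{1f_1}, (\cdots, t_n)$
Parent 2
  genes: $g_{21} \cdots g_{2j} \cdots g_{2f_2}$
  contents: $q_{21}, (t_1, \cdots) \;\cdots\; q_{2j}, (t - n_2, \cdots, t, \cdots, t + m_2) \;\cdots\; q_{2f_2}, (\cdots, t_n)$
Offspring 1
  genes: $g_{11} \cdots g_{1i-1}$ c.p. $g_{2j+1} \cdots g_{2f_2}$
  states: $q_{11} \cdots q_{1i-1}$, $q_{2j+1} \cdots q_{2f_2}$
  documents: $(t_1, \cdots) \cdots (\cdots, t - n_1 - 1)$ ? $(t + m_2 + 1, \cdots) \cdots (\cdots, t_n)$
Offspring 2
  genes: $g_{21} \cdots g_{2j-1}$ c.p. $g_{1i+1} \cdots g_{1f_1}$
  states: $q_{21} \cdots q_{2j-1}$, $q_{1i+1} \cdots q_{1f_1}$
  documents: $(t_1, \cdots) \cdots (\cdots, t - n_2 - 1)$ ? $(t + m_1 + 1, \cdots) \cdots (\cdots, t_n)$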
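A possible reading of this diagram as code follows. Genes are again modelled as `(state, last)` pairs, which is an assumption. The slide leaves the gene at the crossover point open (the "?"), so this sketch fills it with a gene that covers the gap and takes its state at random from one of the two genes containing $t$. That choice is hypothetical and may differ from what the authors did.

```python
import bisect
import random


def gene_index(genes, t):
    """Index of the gene whose run contains document t (1-based)."""
    lasts = [last for _, last in genes]
    return bisect.bisect_left(lasts, t)


def crossover(parent1, parent2, t, rng):
    """One-point crossover at document t, returning two offspring."""
    i, j = gene_index(parent1, t), gene_index(parent2, t)

    def middle_state():
        # Hypothetical rule for the "?" gene: copy one parent's state.
        return rng.choice((parent1[i][0], parent2[j][0]))

    # The middle gene ends where the other parent's gene around t ends,
    # so the offspring still covers every document exactly once.
    child1 = parent1[:i] + [(middle_state(), parent2[j][1])] + parent2[j + 1:]
    child2 = parent2[:j] + [(middle_state(), parent1[i][1])] + parent1[i + 1:]
    return child1, child2


if __name__ == "__main__":
    p1 = [(0, 5), (1, 12), (0, 20)]
    p2 = [(2, 8), (0, 15), (1, 20)]
    print(crossover(p1, p2, t=10, rng=random.Random(0)))
```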
Mutation
• Several mutation operators:
  • Increment state by one
  • Merge two genes, state taken randomly
  • Split a gene in two: one with the original state, another ±1.
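The three operators could look as follows on a list of `(state, last)` genes. The gene layout and several details are assumptions, since the slide does not specify them: states are clamped to `0 .. num_states - 1`, a merged gene takes the state of one of its two parts, and when splitting, the first half keeps the original state.

```python
import random


def increment_state(genes, idx, num_states):
    """Raise the state of gene idx by one (clamped at the top state)."""
    state, last = genes[idx]
    out = list(genes)
    out[idx] = (min(state + 1, num_states - 1), last)
    return out


def merge_genes(genes, idx, rng):
    """Merge gene idx with its right neighbour; pick one of their states."""
    (s1, _), (s2, last2) = genes[idx], genes[idx + 1]
    return genes[:idx] + [(rng.choice((s1, s2)), last2)] + genes[idx + 2:]


def split_gene(genes, idx, cut, rng, num_states):
    """Split gene idx after document cut; the second half gets state +/- 1."""
    state, last = genes[idx]
    start = genes[idx - 1][1] if idx > 0 else 0
    if not start < cut < last:
        raise ValueError("cut must fall strictly inside the gene")
    shifted = state + rng.choice((-1, 1))
    new_state = min(max(shifted, 0), num_states - 1)  # clamping may undo the shift
    return genes[:idx] + [(state, cut), (new_state, last)] + genes[idx + 1:]


if __name__ == "__main__":
    genes = [(0, 5), (2, 12), (1, 20)]
    rng = random.Random(0)
    print(increment_state(genes, 0, num_states=3))
    print(merge_genes(genes, 0, rng))
    print(split_gene(genes, 1, cut=8, rng=rng, num_states=3))
```

Each function returns a new list, so the parent chromosome stays untouched and can remain in the population.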
Effect of crossover
[Figure: generation number against crossover rate (%) for streams a, b and c]
Effect of mutation
[Figure: generation number against mutation rate (%) for streams a, b and c]
Effect of population size
[Figure: generation number against population size for streams a, b and c]
Effect of number of generations
[Figure: cost function against generation number for streams a, b and c]
Time results
State n.   Viterbi                  Evo. Alg.
           Ex. time (s)   Cost      Ex. time (s)   Cost     (Av. cost, Std. dev.)
15         2319.36        277402    1678.61        277712   (279385.6, 980.11)
20         3117.28        277306    2182.12        277528   (278980.4, 1114.91)
25         3835.37        277260    2033.81        277270   (279472.6, 1116.03)
[Figure: Time comparison. Time (s.) against number of states for the evolutionary algorithm and Viterbi]
Predicting the state of new arrivals
• Main point of this work: to predict whether buzz is going up or down.
• Several possible approaches: using the Viterbi algorithm over the whole sequence, and reusing evolutionary algorithms.
• Easy approach for a single state: assume the current trend continues.
Local approximation: results
Previous substream      A. T.   Old s.   New s.   Trend
⋯ 38 38 39 41 49 49     52      12       0        ↓
⋯ 41 49 49 52 68 69     69      3        4        ↑
⋯ 88 89 90 90 91 92     95      0        0        →
But it breaks down after a while
date               GA              approx.
0 (2004-04-02)     7 (0.694669)
⋯                  ⋯
74 (2004-06-15)    14 (0.797281)
75 (2004-06-16)    24 (0.970706)
76 (2004-06-17)    19 (0.87973)
77 (2004-06-18)    19 (0.87973)    19 (0.87973)
78 (2004-06-19)    0 (0.605263)    19 (0.87973)
79 (2004-06-20)    0 (0.605263)    19 (0.87973)
Fast GA for modelling new arrivals
• Using results of previous fitting
• Chromosome extended, and last gene mutation probability higher.
[Figure: frequency against time for the GA fit and the approx. fit]
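A small sketch of how the seeding idea could be expressed, under assumptions. Genes are `(state, last)` pairs, the chromosome is extended by stretching its last gene over the new documents, and the last gene is made more likely to mutate through a sampling weight. The helper names and the weight of 5 are hypothetical; the slides do not say how the extension or the higher probability were implemented.

```python
import random


def extend_chromosome(genes, new_n):
    """Stretch the last gene so the chromosome covers documents up to new_n."""
    state, last = genes[-1]
    if new_n < last:
        raise ValueError("the extended stream cannot be shorter")
    return genes[:-1] + [(state, new_n)]


def pick_gene_to_mutate(genes, rng, last_gene_weight=5.0):
    """Choose a gene index, favouring the last gene by last_gene_weight."""
    weights = [1.0] * (len(genes) - 1) + [last_gene_weight]
    return rng.choices(range(len(genes)), weights=weights, k=1)[0]


if __name__ == "__main__":
    fitted = [(0, 5), (2, 12), (1, 20)]   # best individual of a previous fit
    seed = extend_chromosome(fitted, new_n=30)
    rng = random.Random(0)
    picks = [pick_gene_to_mutate(seed, rng) for _ in range(1000)]
    print(seed, picks.count(len(seed) - 1) / len(picks))
```

Seeding the population with such an extended individual means the search mostly has to adjust the tail of the chromosome, where the new documents are.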
Fast GA: Results
Subst. len.   New subs. len.   T. w/out seed   T. w/ seed
219900        100              3895.28         141.45 (79.09)
219000        1000             3895.28         144.75 (81.96)
210000        10000            3895.28         166.73 (79.32)

Subst. len.   New subs. len.   T. w/out seed   T. w/ seed
3032          100              5048.49         54.6
2632          500              5048.49         92.247
2132          1000             5048.49         294.97
1132          2000             5048.49         570.41
Conclusions
• The presented system dynamically detects changes in the trends of interest in a document stream.
• An EA makes it possible to deal with very large sequences of documents in a reasonable time.
• Extending this EA allows fitting a stream that is an extension of a previously fitted substream in a very short time.
• We plan to study correlations among document streams, to automatically detect the occurrence of new topics composed of multi-word concepts.
The end
• Thanks for your attention
• Any questions?