Chair of Entrepreneurial Risks at the Department of Management, Technology and Economics

Master-Thesis for the Degree of “Master of Science ETH in Physics”
Course Number: 402-0900-00L

Shocks and Self-Excitation in Twitter ∼ Response Function Characterization in Epidemic Processes in Social Media

Author: Stefan Rustler
Supervisors: Prof. Dr. Didier Sornette, Dr. Vladimir Filimonov

Zurich, Switzerland, November 2014
This work aims to pave the way towards applying the self-excited Hawkes model to Twitter messaging, more concretely to the epidemic process of tweets triggering re-tweets for a certain topic or hashtag. One of the key ingredients of the Hawkes model is the response function, i.e. the timing of re-tweets after their original or mother tweet. Due to the heterogeneity in topics and users, a categorization of response functions was conducted for hashtags of both “endogenous” and “exogenous” single events, such as movie releases or bombings, as well as for phases of extended events, in this work social protests.
The specific kind of re-tweeting (one-click re-tweets) investigated here always refers back to the most original tweet, allowing one to robustly sample the renormalized memory kernel. Furthermore, due to the clear definition of original, exogenous mother tweets and endogenous re-tweets, the branching ratio can be easily calculated. A simple numerical method to reconstruct the bare memory kernel from the renormalized memory kernel was developed, which may give insights into the individual as opposed to the collective human behavior in responding to Twitter content. With the help of synthetic data this could be used to demonstrate the theoretically predicted anomalous scaling, where a longer renormalized memory results from a shorter bare memory. Further, the very cause of this behavior could be empirically confirmed, namely the earlier triggering of events resulting from a shorter bare kernel.
Having confirmed the methods and metrics on synthetic data, the empirical Twitter data could be addressed. Generally, a long memory could be observed across different hashtags. When a power law could be concluded, the exponent was found to be θ = 0.7±0.1. The two methods of extracting the power law exponent, namely plotting the Hill estimation versus the lower bound and a maximum-likelihood estimation combined with Kolmogorov-Smirnov tests, yielded similar results. Endogenous single events could not be classified according to a common power law exponent because they are in fact composed of two subclasses: pre-burst-focussed and after-burst-focussed events. In contrast, exogenous single events had a similar exponent of θ ∈ [0.706; 0.749].
For the extended events power law behavior could not be concluded for the individual phases, though it could for both of the entire social protests. While predicting real-life criticality by means of the branching ratio turns out to be very cumbersome, the first protest (#Ferguson) could be classified as a protest dominated by an initially triggering exo-burst with diminishing self-excitation, the second protest (#OccupyCentral) as a well-maintained and centrally organized protest with constant criticality across all phases.
The results of this work show that a larger-scale analysis with
more hashtags is necessary in order to more robustly determine
hashtag classes and response functions and thus be able to
calibrate the Hawkes model on Twitter data for a higher predictive
power.
Key words: Self-excited Hawkes process, renormalized memory kernel,
anomalous scaling, social epidemics.
Acknowledgements
First and foremost I would like to express my gratitude towards Dr. Vladimir Filimonov for always having an open ear and mind for my questions, ideas and results. It has been a pleasure working with him. Further, I thank Prof. Didier Sornette for his guiding insights and for connecting my efforts to other works and results. I am also grateful to Alexandru Sava for his inspirations for mathematical derivations. Last but not least I want to thank Daniel Philipp, Pawel Morzywolek, Joel Bloch and the other master students for the good spirit and the motivating atmosphere in the office both during the week and the weekend.
Contents

1 Introduction
2 Theoretical Background
  2.1 Self-Excited Hawkes Processes
  2.2 Power Laws
  2.3 Bare and Renormalized Memory Kernel
  2.4 Anomalous Scaling
3 Methods
  3.1 Data Simulation
  3.2 Adaptive Kernel Density Estimation
  3.3 Numerical Reconstruction of Bare Memory Kernel
  3.4 Extracting Highest Generation over Time
  3.5 Characterizing Response Functions
4 Empirical Data
  4.1 Scraping Twitter Data
  4.2 Hashtags of Single Events
  4.3 Hashtags of Extended Events
5 Results
  5.1 Synthetic Data
    5.1.1 Cluster Characteristics
    5.1.2 Performance of Numerical Reconstruction of Bare Kernel
    5.1.3 Effects of Error Sources
    5.1.4 Demonstration of Anomalous Scaling
  5.2 Empirical Data
    5.2.1 Hashtags of Endogenous Bursts
    5.2.2 Hashtags of Exogenous Bursts
    5.2.3 Hashtags over Protest Phases
6 Conclusion
7 References
8 Appendix
  8.1 Normalization of Resolvent
  8.2 Individual Figures on Protest Phases
1 Introduction
In complex systems, like human societies or neuronal networks, a large number of strongly interacting subsystems or agents is crucial for the emergence of counter-intuitive phenomena from within the system. These internal feedback mechanisms are complemented by external impacts on the system that allow the system to leave its state of equilibrium, adapt to those external impacts, and undergo
impacts, and undergo phase transitions. This interplay of internal
and external or endogenous and exogenous forces is of great
interest in understanding the dynamics of information in complex
systems. One of the crucial questions is whether it is possible to
separate and quantify the influence of those endo- and exo-factors.
Being able to quantify the former, gives insight into the
robustness of a system to external influences. A system in which
external excitations are coupled to and enhanced by internal
self-excitation is likely to be in a more fragile or critical
state. Hence, the degree of self-excitation can be regarded as a
measure of the criticality of a system. An early example of this
can be found in the physics of earthquakes. Here aftershocks
(endogenous) are triggered by shocks (exogenous). If sufficiently
many aftershocks per shock are set off, the system is in a critical
state: a series of sudden, strong shocks, i.e. an earthquake, can
occur.
In [1] V. Filimonov and D. Sornette proposed a method of
quantitative estimation of the impact of self-excitation in
financial markets, based on the so-called “self-excited Hawkes
process”, a generalization of a Poisson process which effectively
combines exogenous influences with endogenous dynamics and thus may
be applied to a variety of different contexts. As shocks trigger
aftershocks, in the financial context, for example, price changes
can trigger price changes. A critical state would then be
characterized by sufficiently many internally triggered price
changes, i.e. a high volatility of the market, which may lead to
financial crashes. The degree of self-excitation may serve as
precursor for such crashes. The two key ingredients for the
self-excited Hawkes process are the unconditional arrival rate of
shocks, capturing the exogenous dynamics, and the arrival rate of
aftershocks conditional on the history of the process, which
captures the endogenous mechanisms. The main component of the
latter is the so-called memory kernel that determines the influence
of past shocks on future shocks. For example, in a system with long memory, current shocks may be triggered by or happen in response to long-past shocks. In this sense the memory kernel is also referred to as response function. Both ingredients need to be adapted to the specific system considered, which is a non-trivial task.
This thesis aims to apply this concept to epidemic processes, i.e. the spread of information, in social media. Concretely, we apply the method proposed in [1] to data from the micro-blogging service Twitter. Since Twitter has become one of the most widely used social media platforms, commercial and academic interest in it is substantial. Marketing companies perform big data analyses to understand what is popular and trending. Research has, for example, focused on trying to predict election results [2]. Recently it has been shown that Twitter can be regarded as an excitable medium in the physical sense, being subject to increased activity fluctuations close to critical points such that phase transitions
are in principle predictable [3]. The short messages or tweets can be labelled via so-called hashtags and may be re-posted and thus spread by other users as so-called “re-tweets.” In allusion to the terminology used in the Hawkes model, we will refer to this re-posting as “mother tweets” triggering re-tweets. The timing of mother tweets and their re-tweets provides direct and robust insight into the afore-mentioned response function. In contrast to earthquake systems, the collective memory of Twitter is highly heterogeneous since it is used by more than 271 million active users [4], discussing a variety of topics. It is therefore sensible to categorize content via hashtags and determine the respective response functions. Inspired by the work in [5] about classes of response functions, we will investigate hashtags referring to announced versus unexpected single events, which are likely to follow different mechanisms. Prime examples here are movie releases versus shootings or bombings. However, other works have shown that hashtag classes are more numerous and not so clearly defined, e.g. in [6]. Elucidating the response function will not only pave the way for applying the Hawkes model but also provide insights into how the activity in Twitter is leading up to such events and/or how interest in certain topics behaves over time and may be manipulated.
Since Twitter is frequently used as an important tool for political mobilization, especially during riots and unstable periods, we will also investigate two recent protests in 2014 that have distinct hashtags: the social unrest in Ferguson, USA and the student protests in Hong Kong. These prolonged episodes are a natural extension to considering single events with an individual burst in tweet activity. Here it is of particular interest whether different phases of the protest can be characterized by means of the response function and the degree of self-excitation. In analogy to financial crashes or earthquakes, in the context of social unrest the manifestation of criticality predicted by the model may be the outbreak of a protest on the street, i.e. outside of the system of Twitter. By investigating the triggering of re-tweets over time, the question whether criticality in the Hawkes model matches or even precedes criticality in reality will be addressed.
Due to the high number of Twitter users, the question arises how individual activity aggregates to collective human behavior, resulting in large-scale phenomena such as blockbusters, election results or, as already mentioned, protests. In order to shed light on this, we develop a simple method to reconstruct the individual response function from the aggregate response. In the framework of the Hawkes model these two are referred to as bare versus renormalized memory kernel. For this we conduct a detailed analysis with synthetic data for both short-memory (exponential) and long-memory (power law) kernels. In particular, we examine the theoretically predicted paradox of anomalous scaling, where a longer renormalized memory kernel may result from a shorter bare memory kernel. For the characterization of the memory kernels different metrics will be applied and different common methods for the estimation of the power law exponent compared. Furthermore, the effect of known error sources arising from the limitations of the empirical data will be analyzed as well. With the insights gained from the synthetic data, the analysis will be extended to the empirical data of Twitter.
In summary, this thesis intends to set the stage for applying the self-excited Hawkes model to large-scale Twitter data by characterizing both individual and aggregate response functions for different handpicked hashtags and periods of hashtags. The focus of this work will hence be on the methodology and its systematic test using synthetic data; the results from the empirical data should be understood as first test runs.
2 Theoretical Background

2.1 Self-Excited Hawkes Processes
The process of tweeting can be regarded as a discontinuous point
process, in which events, i.e. tweets, arrive with a certain rate.
This arrival rate is generally referred to as the intensity λ of
the process and takes different functional forms. The simplest
point process is the Poisson process where the intensity is a
simple constant. This means that events occur homogeneously, i.e.
independent of each other. However, the intensity of an inhomogeneous Poisson process has further dependencies, for example on time. The self-excited Hawkes process is a generalization of an
inhomogeneous Poisson process, which not only depends on time but
also on the history of the process [7]. The intensity of the
self-excited Hawkes process can be written as
λ(t) = µ(t) + ∫_{−∞}^{t} h(t − s) dN(s). (2.1)
Here the first summand µ(t) denotes the background or unconditional
intensity describing the arrival rate of exogenous events
(exo-events). Due to the time-dependence, µ(t) alone would make a
simple inhomogeneous Poisson process. The second addend refers to
the intensity conditional on the history of the process, ranging from −∞ to the present time t via the integration variable s. The
integrand h(t − s) denotes the memory kernel, which describes the
positive direct effect of a past event at s on the current
instantaneous intensity at t of the endogenous events (endo-events)
[8]. Usually, the further events at t and s are apart, the smaller
h will be, which corresponds to more distant events being
“remembered” less by the system. The conditional intensity can be
regarded as a sum of the instantaneous number of events having
happened at s, i.e. dN(s), weighted by their influence on the
present events at t via the memory kernel.
Assuming stationarity of the process as well as requiring that µ(t)
= µ, an average total intensity Λ can be calculated by taking the
expectation value on both sides of equation (2.1) [9].
Λ = ⟨λ(t)⟩ = µ + Λ ∫_{0}^{∞} h(τ) dτ. (2.2)
Here the fact that dN(s)/ds = λ(s) was used. To obtain the last
equality the substitution τ := t − s was performed. The
interpretation of τ would then be the time it took for an
endo-event to be triggered after the arrival of the triggering
event, often referred to as
8
waiting time. In our case of Twitter, this would correspond to the
time between tweet and retweet, i.e. the response time.1 Now using
the definition,
n := ∫_{0}^{∞} h(τ) dτ, (2.3)
the average intensity can finally be written as
Λ = µ / (1 − n). (2.4)
Since we required that the history of events must have a positive effect on the present intensity, this equation is
meaningful only for the case of n < 1, such that Λ ≥ µ. When n =
0, the process becomes independent of history and thus a simple
inhomogeneous Poisson process. For the case of n > 1, the
intensity as defined by equation (2.1) may explode [10]. The case
of n = 1 can therefore be regarded as the critical case, separating
the subcritical and the supercritical regime. A complete
pedagogical classification of the different regimes can be found in
[11].
Now what is the meaning of n? As equation (2.3) shows, a memory
kernel with long memory would result in a larger value for n. As
explained before such a memory kernel results in an increased
instantaneous intensity. The parameter n must therefore be related
to how many events a single event is likely to trigger in the
future. The entirety of events triggering further events can be
regarded as a branching process. In fact, it is well known that
Hawkes processes can be mapped into such a branching process, in
which all events are either exogenous (“mother”) or endogenous
(“daughter”) events of different generations [8]. In this language
a mother event would trigger a daughter event and so on, forming a
“family tree” or cluster. How the point process governed by the
Hawkes intensity (2.1) behaves with regards to cluster formation is
visualized in figure 1. Now n would indeed be the average number of
first-generation daughters of a single mother [1]. With this
interpretation and the stationarity assumption, we can calculate an
average endogenous (or self-excited or conditional) intensity
Λendo. The following calculation is mostly adopted from [12].
Generally, the intensity of i-th order events can be written
as
λ_i(t) = ∫_{−∞}^{t} λ_{i−1}(s) h(t − s) ds, i ∈ ℕ, (2.5)
where λ_0(t) ≡ µ(t). Now if we require µ(t) = µ, as we have already done in equation (2.2), equation (2.5) would for the case of the intensity of first-generation daughter events simplify to λ_1 = nµ. The intensity of the second-generation daughter events would likewise be given by λ_2 = nλ_1 = n²µ. Continuing this to infinity, under the additional requirement n < 1, we obtain the following.
1The terms response time and waiting time will be used
interchangeably in this work.
Figure 1: Instantaneous intensity of a 1D self-excited Hawkes
process in comparison with cluster formation and time stamps. The source of event clustering is the increased probability (via n) for events to happen right after the occurrence of an event as given by
the Hawkes intensity (cyan). In this sense a mother event can
trigger a daughter event and so on, such that family trees or
clusters with the most original mother as cluster center (dark
circles) are formed. While the occurrence of the independent
cluster centers (blue, red and green) is governed by the
unconditional intensity µ(t), the occurrence of descendants (light
squares) is governed by the memory kernel h(t), which can be
directly seen in the drop of the Hawkes intensity after an event
has happened. This picture corresponds to a branching ratio equal
to n = 0.89.
Λ_endo = µ n / (1 − n) (2.6)
In equation (2.4) we have calculated the total average intensity,
for which Λ = µ + Λendo must hold. Taking the result from equation
(2.6), this is indeed the case. Now we can calculate the overall
proportion of endogenous events to all events.
Λ_endo / Λ = [n/(1 − n)] / [1/(1 − n)] = n (2.7)
Therefore in the subcritical regime (n < 1) and in the case of a
constant unconditional intensity [µ(t) = µ], the branching ratio n
as average number of first-generation daughters per mother can be
interpreted as the proportion of endogenous events among all
events. The branching ratio thus is a direct quantitative estimate
of the degree of endogeneity, i.e. self-excitation.
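The interpretation of n as the fraction of endogenous events can be checked directly in simulation. The following is a minimal sketch (not code from the thesis), assuming an exponential kernel h(t) = (n/τ) e^{−t/τ} and Ogata-style thinning; the helper name and parameter values are illustrative only.

```python
import numpy as np

# Minimal sketch (hypothetical helper): simulate a self-excited Hawkes process
# with constant background mu and exponential kernel h(t) = (n/tau) exp(-t/tau)
# via Ogata's thinning, labelling each event as exogenous or endogenous.
rng = np.random.default_rng(0)

def simulate_hawkes(mu=0.5, n=0.6, tau=1.0, T=20000.0):
    t, lam_endo = 0.0, 0.0          # current time, endogenous part of intensity
    events, endo = [], []
    while True:
        lam_bar = mu + lam_endo     # valid bound: intensity only decays between events
        w = rng.exponential(1.0 / lam_bar)
        lam_endo *= np.exp(-w / tau)    # decay the endogenous intensity over w
        t += w
        if t > T:
            break
        if rng.uniform() < (mu + lam_endo) / lam_bar:   # thinning: accept candidate
            events.append(t)
            # the event is endogenous with probability lam_endo / (mu + lam_endo)
            endo.append(rng.uniform() < lam_endo / (mu + lam_endo))
            lam_endo += n / tau         # kernel jump h(0) = n/tau at the new event
    return np.array(events), np.array(endo, dtype=bool)

events, endo = simulate_hawkes()
# endo.mean() estimates the branching ratio n, cf. eq. (2.7)
```

For a long realization, endo.mean() converges to the branching ratio n, illustrating eq. (2.7) without any knowledge of the family trees themselves.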
For the memory kernel h(τ) there are in principle many functional
forms, the most common being an exponential kernel of shorter memory, as used in [1], and a power law kernel of longer memory, as used in [10]. The latter form will be elaborated
on in the next section, the former briefly addressed afterwards. In
order to determine the memory kernel, τ must be extracted and
sampled from the data. We need to get these response times by
connecting “mother tweets” and re-tweets:
τ_k = t_i − t_j, (2.8)

where t_i and t_j with i, j ∈ [1; N_t(T)] denote the time stamps of the re-tweet and, respectively, its mother tweet; k ∈ [1; N_t(T) − M] is the index of responses. The counting process N_t(T) equals the number of events after the total time T, and M the number of mother tweets. In the context of financial markets or earthquakes
it is highly impractical if not impossible to make a causal link
between mother and daughter event, such that the functional form
has to be estimated for example via maximum likelihood estimation.
In contrast, in the context of Twitter there actually is the
possibility to make this causal link for certain types of
responses, most notably re-tweets, as will be explained in section
4. We are therefore in a very good position to determine a
well-founded memory kernel.
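Because one-click re-tweets reference the most original tweet, eq. (2.8) reduces to a simple lookup. A toy sketch with hypothetical identifiers and time stamps (not real data):

```python
# Toy sketch (hypothetical ids and time stamps): since one-click re-tweets
# reference the most original tweet, the response times of eq. (2.8) follow
# from a direct lookup of the mother tweet's time stamp.
mother_times = {"tw1": 100, "tw2": 250}                 # tweet id -> t_j (seconds)
retweets = [("tw1", 160), ("tw1", 400), ("tw2", 251)]   # (mother id, t_i)

response_times = [t_i - mother_times[mid] for mid, t_i in retweets]
# response_times == [60, 300, 1]: waiting times tau_k sampled from the
# renormalized kernel, since descendants of any generation point to the immigrant
```

Note that precisely because every re-tweet points back to the immigrant, these waiting times sample the renormalized rather than the bare kernel.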
2.2 Power Laws
In [1] the memory kernel was chosen to be an exponential
distribution. However, in this work we will also be dealing with
power law (PL) distributions, which generally underlie long memory
processes. In principle there are many different kinds of power
laws and other distributions that are heavy-tailed. However,
focussing on the tails of the distribution, the
11
power law distribution is a very common and simple one, giving a
good first idea for further distributions to be tested. In this
section we introduce the notation and particular form for the power
laws in this work. Furthermore, an approximation of this power law
as a sum of exponentials will be explained in detail as this has
computational and algebraic advantages.
Though the variable whose distribution we are interested in, i.e.
the waiting time, is discrete due to the precision of measurement
(seconds), we treat it as continuous variable for three reasons.
First, response time is in fact continuous and only the measurement
makes it discrete. Second, we are considering very large response
times up to the order of months, in comparison to which seconds are
very small. Thirdly, the mathematical treatment is much simpler,
though still providing a good approximation.
Now, a continuous power law distribution of non-negative x is
described by a probability density function (PDF) p(x) such
that
p(x) dx = Pr(x ≤ X < x + dx) = A (x + c)^{−α} dx, (2.9)
where α is the scaling parameter, X is the random variable with observed values x, and A is the normalization constant. Shifting the distribution to the left by c will ensure that the density does not diverge at x → 0.
Using the normalization condition of ∫_{0}^{∞} p(x) dx = 1 as well as requiring α > 1, the functional form of the power law investigated here turns out to be

p(x) = (α − 1) c^{α−1} (x + c)^{−α}, (2.10)

Φ(t) = θ c^{θ} / (t + c)^{1+θ}. (2.11)
Here we introduced the notation to match our case of waiting times
t sampled from the bare memory kernel Φ(t).2 It is related to the
memory kernel h(t) introduced in equation (2.3) as
Φ(t) = h(t)/n. (2.12)
This power law form is often referred to as Omori law. Note that
the notation of the power law exponent as α is often used in cases
with finite lower bound in x such that c is rather denoted as xmin,
e.g. in [13]. In our and many other cases, the density needs to remain finite right above zero. Further, the exponent expressed as θ = α − 1 turns
out to be more practical when dealing with a renormalization of the
exponent as will be shown in section 2.3. We therefore stick to the
notation with θ as in equation (2.11).
2Note that since we will be dealing almost exclusively with waiting
times as opposed to time stamps, the letter t will be used as
waiting time, whereas τ will be used as characteristic time in the
exponential kernel to be introduced in the next section.
12
Generally, a larger θ corresponds to a faster decay in t, in the context of the bare memory kernel a shorter bare memory. Note that in our case 0 < θ ≤ 1, whereas a well-defined mean exists only if θ > 1.
Oftentimes it is useful to consider the complementary cumulative distribution function (CCDF) of a PDF. The CCDF, which will be denoted as P(x), is defined as

P(x) = Pr(X ≥ x) = 1 − Pr(X < x). (2.13)
In our case the CCDF turns out to be

P(x) = 1 − ∫_{0}^{x} p(x′) dx′ = ( c / (x + c) )^{θ}. (2.14)
For a given c the scaling parameter θ can be estimated via maximum likelihood, which has a closed-form solution of

θ_N = N [ Σ_{i=1}^{N} ln( (t_i + c_N) / c_N ) ]^{−1}, (2.15)
where t_i are the observed values of the response times. A higher lower bound c will reduce the number of sample points N that can be used for the estimation. Hence the lower bound occurs with a subscript as c_N. The estimation becomes better for a larger number of data points N, such that lim_{N→∞} θ_N = θ. Equation (2.15) is equivalent to the well-known Hill estimator for the estimation of θ [14].
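A minimal numerical check of eq. (2.15): synthetic waiting times are drawn by inverting the CCDF (2.14), and the closed-form estimate recovers θ. The parameter values are illustrative only.

```python
import numpy as np

# Sketch: sample waiting times from the shifted power law by inverting the
# CCDF (2.14), then recover theta with the closed-form ML estimate (2.15).
rng = np.random.default_rng(1)
theta, c = 0.7, 10.0                     # illustrative values
u = rng.uniform(size=100_000)
t = c * (u ** (-1.0 / theta) - 1.0)      # inverse-CCDF sampling: P(t) = u
theta_hat = len(t) / np.sum(np.log((t + c) / c))   # eq. (2.15) with c_N = c
```

The standard error of θ̂ scales like θ/√N, so for 10^5 samples the estimate lies within roughly ±0.01 of the true exponent.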
When producing synthetic data via simulating a Hawkes process, a
power law kernel as expressed above comes with the big disadvantage
that it cannot be analytically transformed into its renormalized
memory kernel (explained in the following section 2.3). This can be
overcome by expressing the power law kernel as a sum of
exponentials, which approximates the power law behavior well enough
for finite waiting times. As a sum of exponentials the
transformation from bare to renormalized kernel becomes exactly
solvable.3 The power law approximation is adopted from [10] with an additional c-shift, expressed as follows.
Φ(t) ≈ (1/Z) Σ_{i=0}^{M−1} (1/ξ_i)^{1+θ} e^{−t/ξ_i}, ξ_i = c m^{i}, (2.16)

where Z is the normalization constant ensuring ∫_{0}^{∞} Φ(t) dt = 1, and an additional correction term (with weight S, involving the shift c and the ratio m) adjusts the behavior of the kernel near t = 0.

3For legibility the term power law will be used in this work when actually we are using its approximation.
The choice of M and m will determine how well the power law is approximated. In [10] M and m are chosen to be 15 and 5, respectively, such that the most extreme term in the sum will be of the order m^{M−1} = 5^{14} ≈ 10^9, which is sufficient for our purposes in the case of synthetic data. Here we choose maximal process durations of the order 10^6. The goodness of the approximation is visualized in figure 2.
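The quality of the construction can be checked numerically. The following sketch uses the log-spaced scales ξ_i = c m^i with weights ∝ ξ_i^{−(1+θ)} (the small-t correction term is omitted here) and compares the result against the Omori law (2.11):

```python
import numpy as np

# Sketch (small-t correction term omitted): approximate the Omori law (2.11)
# by M exponentials with log-spaced time scales xi_i = c * m**i, as in [10].
theta, c, M, m = 0.7, 1.0, 15, 5
xi = c * m ** np.arange(M)               # scales up to m**(M-1) = 5**14 ~ 1e9
w = xi ** -(1.0 + theta)                 # weights producing a t**-(1+theta) tail
Z = np.sum(w * xi)                       # normalize: each term integrates to w_i * xi_i
t = np.logspace(0, 6, 50)
approx = (w * np.exp(-t[:, None] / xi)).sum(axis=1) / Z
exact = theta * c**theta / (t + c) ** (1 + theta)    # Omori law, eq. (2.11)
# approx/exact stays of order one, oscillating slightly around the true curve
```

Fitting the log-log slope of the approximation over the scaling regime recovers −(1+θ), and the ratio to the true kernel exhibits the slight “oscillation” visible in figure 2.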
2.3 Bare and Renormalized Memory Kernel
In the previous section only the bare memory kernel Φ(t) was
explained in detail, which considering equation (2.12) turns out to
be the probability of a mother event triggering a daughter event
directly. In some systems we do not observe such first-order
responses but the responses of descendants with respect to their
earliest mother, i.e. the immigrant. As the generations of the descendants are generally unknown, the bare response function is
hidden and only the renormalized response function R(t) or memory
kernel can be determined.4 The two are related via the integral
equation
Φ(t) = R(t) − n ∫_{0}^{t} Φ(t − t′) R(t′) dt′, (2.17)
as shown in [15]. A standard technique to solve this so-called Volterra integral equation of the second kind is via the Laplace transformation, which is defined as follows.
f(s) ≡ L{f(t)}(s) := ∫_{0}^{∞} f(t) e^{−st} dt, (2.18)
where s ∈ C. More concretely, this is the one-sided Laplace
transformation, defined for non-negative values of t ≥ 0. This is
suitable for causal systems, i.e. systems in which the past influences the present and not vice versa, which is the case for
response functions. Inverse-transforming the function from the
complex frequency domain back to the real time domain is defined as
follows.
4Renormalized (memory) kernel and response function both refer to
R(t). The former expression will be used to discern it from the
bare (memory) kernel, whereas the latter will be used when talking
about response times.
Figure 2: Head and tail of a sample power law bare kernel. The green curve depicts the ideal, true power law kernel (OL stands for Omori law), the red curve the approximation as given by equation (2.16). An artifact of this approximation is the slight “oscillation” around the true curve. The approximation is good until about 10^9, which is given by the choice of M and m determining the most extreme term in the sum, m^{M−1} = 5^{14} ≈ 10^9.
L^{−1}{f(s)}(t) := (1/2πi) ∫_{γ−i∞}^{γ+i∞} f(s) e^{st} ds for t ≥ 0, and 0 for t < 0, (2.19)
where γ must be larger than the constant of absolute convergence of
the integral. The causal system with f(t < 0) = 0 is retrieved.
Now expressing equation (2.17) in its frequency domain via the Laplace transformation (2.18) yields

Φ(s) = R(s) − n Φ(s) R(s) ⟺ Φ(s) = R(s) / (1 + n R(s)). (2.20)
Figure 3: Bare memory kernel in comparison with the respective renormalized memory kernel for different branching ratios. The bare kernel (green) is of the most simple form suitable for our case, (1/τ) e^{−t/τ}, which analytically transforms into (1/τ) e^{−t(1−n)/τ} for the renormalized kernel. A low branching ratio of n = 0.1 (magenta, dashed) and a high branching ratio of n = 0.9 (red) are depicted. What is striking is that the memory is dramatically increased near criticality. In fact, for n = 1 the renormalized kernel becomes a constant of 1/τ. For even higher, i.e. supercritical, n the system’s memory is no longer decreasing but increasing over time, which might result in an explosion of events.
So in order to retrieve Φ(t) from R(t), the latter must be Laplace transformed such that equation (2.20) can be applied and Φ(s) obtained, which then must be inverse-Laplace transformed. Not surprisingly, the Laplace transformation is straightforward for exponential or trigonometric functions, but analytically not solvable for power law resolvents [15]. In order to illustrate the relationship between bare and renormalized kernel, simple functions for different n are plotted in figures 3 to 6.
A higher branching ratio n will generally result in a longer effective memory of the system. Intuitively this makes sense, as a higher number of offspring increases the overall lifetime of the cluster or the “family tree” the offspring are part of. The cluster center or the mother of the family tree is thus able to indirectly trigger events that are further in the future, i.e. the effective or renormalized memory is larger. Consequently, the renormalized kernel cannot be regarded as a PDF because it does not integrate to unity like the bare kernel but to a value K that must be greater than unity and dependent on n. This value K turns out to be

K = 1 / (1 − n). (2.21)
A detailed derivation of this result can be found in the appendix
8.1.
Figure 4: Bare memory kernel in comparison with the respective
renormalized memory kernel for different branching ratios. The bare
kernel (green) resembles a power law of the form (2.11)
approximated by a sum of exponentials as in [10] expressed as a
shifted version in equation (2.16). A low branching ratio of n =
0.1 (magenta, dashed) and a high branching ratio of n = 0.9 (red)
are depicted. The effect of renormalizing a kernel via n becomes
clearer when looking at the mode: The memory is increased in such a
way that the PDF is stretched right-upwards.
For the simplest case of an exponential bare kernel, i.e.

Φ(t) = exp(−t/τ)/τ, (2.22)

the behavior of its renormalized kernel, which turns out to be

R(t) = exp(−t(1 − n)/τ)/τ, (2.23)

illustrates very well why the case n = 1 separates the sub-critical from the super-critical regime, as mentioned in section 2.1. For n ≥ 1 the exponential changes from decaying to growing. That is, in the supercritical case the system's memory is no longer decreasing but increasing over time, which may result in an explosion of events. Clearly, R(t) does not integrate to unity for nonzero n, nor even to a finite value for n ≥ 1. A near-critical case is depicted in figure 3.
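The normalization statements above can be checked numerically for the exponential pair (2.22)/(2.23). The sketch below (plain NumPy; the function names are ours, not from the thesis code) integrates both kernels on a truncated grid and compares the renormalized mass with K = 1/(1 − n) from (2.21):

```python
import numpy as np

def trapz(y, t):
    """Explicit trapezoidal rule (kept self-contained for portability)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def bare_exp_kernel(t, tau=1.0):
    """Bare memory kernel Phi(t) = exp(-t/tau)/tau, a PDF on t >= 0 (eq. 2.22)."""
    return np.exp(-t / tau) / tau

def renormalized_exp_kernel(t, n, tau=1.0):
    """Renormalized kernel R(t) = exp(-t(1-n)/tau)/tau for n < 1 (eq. 2.23)."""
    return np.exp(-t * (1.0 - n) / tau) / tau

t = np.linspace(0.0, 200.0, 500_001)
print("bare kernel mass:", round(trapz(bare_exp_kernel(t), t), 3))  # -> 1.0
for n in (0.1, 0.9):
    mass = trapz(renormalized_exp_kernel(t, n), t)
    # eq. (2.21): the renormalized kernel carries total mass K = 1/(1-n)
    print(f"n = {n}: mass = {mass:.3f}, 1/(1-n) = {1 / (1 - n):.3f}")
```

For n = 0.9 the mass grows to 10, illustrating how strongly near-critical renormalization inflates the effective memory.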
Another strictly decaying bare kernel is the power-law kernel, which carries substantial probability mass in its long tail, making it a long-memory kernel. Since the transformation to the corresponding renormalized kernel is analytically possible only asymptotically [15], we approximate the power law by a sum of exponentials as described at the end of section 2.2, such that an analytical transformation becomes possible. The kernel behavior for this case is depicted in figure 4. Here the effect of a finite n becomes clearer: the memory is increased in such a way that the PDF is stretched right-upwards.
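How well a sum of exponentials mimics a power-law tail can be checked numerically. The sketch below uses a geometric ladder of timescales with power-law weights, an illustrative construction in the spirit of equation (2.16) rather than the thesis' exact parametrization, and measures the effective tail exponent:

```python
import numpy as np

theta = 0.5                          # target tail: Phi(t) ~ t^-(1+theta)
taus = 2.0 ** np.arange(45)          # geometrically spaced timescales
weights = taus ** (-(1.0 + theta))   # power-law weights (illustrative choice)

def kernel_sum_exp(t):
    """Sum-of-exponentials approximation of a power-law kernel (unnormalized)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]
    return (weights * np.exp(-t / taus)).sum(axis=1)

# The local log-log slope well inside the ladder should be close to 1 + theta.
t1, t2 = 1e3, 1e5
slope = -np.log(kernel_sum_exp(t2)[0] / kernel_sum_exp(t1)[0]) / np.log(t2 / t1)
print(f"measured exponent ≈ {slope:.2f} (target {1 + theta:.2f})")
```

Because every summand is an exponential, the Laplace transform of such an approximation is a simple rational expression, which is exactly what makes the analytical renormalization tractable.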
Figure 5: Bare memory kernel in comparison with the respective renormalized memory kernels for different branching ratios, on log-log axes. The bare kernel (green) and the renormalized kernels for different branching ratios (magenta for low, red for high) are the same as in figure 4, expressed by equation (2.16). After an initial phase with a lower exponent (up to t ≈ 100), the renormalized kernel decays with the same exponent as the bare kernel. The situation is quite different for the critical (dotted purple) and supercritical (dotted blue) cases. In the former, the exponent is renormalized from 1 + θ to 1 − θ, i.e. a shorter bare memory results in a longer renormalized memory; this paradoxical behavior will be discussed in detail in section 2.4. In the latter, the power-law decay transitions to an exponential increase, which reproduces the results in [11].
In figure 5 we investigate the behavior in the long tail by plotting the previous case on log-log axes. For a subcritical branching ratio, the memory is temporarily increased but then transitions to the same exponent; in a log-log plot this is reflected by parallel curves. For a critical branching ratio, on the contrary, the situation is quite different: the curves are no longer parallel. In fact, the exponent is renormalized from 1 + θ to 1 − θ, i.e. a shorter bare memory results in a longer renormalized memory. This paradoxical behavior, called "anomalous scaling", will be explained in detail in section 2.4. For completeness, a supercritical branching ratio is also included; here the power-law decay transitions to an exponential increase. Both transitions occur at

t ≈ t∗ = c (nΓ(1 − θ)/|1 − n|)^(1/θ),

a result from [11]. Note that this result is correctly reproduced also for a power law approximated by a sum of exponentials as in (2.16).
Now we look at a more elaborate case in figure 6, with a bare kernel of the form Φ(t) ∼ e^(−at) cos²(bt) + e^(−ct), where a, b, c ∈ R+. Again, the stretching is right-upwards to the positive region. This becomes most evident when looking at the first and second mode: the second mode has become the largest. Increasing the branching ratio even further will put more mass on the third mode, and so on. Despite the supercritical n, the overall memory is decreasing, such that an explosion of events is very unlikely.
Figure 6: Bare memory kernel in comparison with the respective renormalized memory kernels for different branching ratios. The bare kernel (green) is of the form e^(−at) cos²(bt) + e^(−ct), where a, b, c ∈ R+, chosen to provide a simple multimodal PDF for which the Laplace transform and its inverse can be performed analytically. Renormalized kernels for a low branching ratio of n = 0.1 (magenta, dashed) and a high branching ratio of n = 0.9 (red) are depicted. A supercritical branching ratio of n = 2.6 (blue, dotted) was chosen to illustrate the stretching of the bare kernel even more clearly. Again, the stretching is right-upwards to the positive region. This becomes most evident when looking at the first and second mode: the second mode has become the largest. Increasing the branching ratio even further will put more mass on the third mode, and so on. Despite the supercritical n, the stretching of the modes decreases such that an explosion of events seems unlikely.
All the relations above were obtained by solving for Φ(t) via the Laplace transformation as described in equations (2.18)-(2.20). However, this method relies on knowledge of the functional form of the resolvent R(t), which is analytically not viable even for the common, relatively simple power-law case. Furthermore, since resolvents sampled from complex systems will most likely not follow a clear, simple functional, i.e. parametric, form, non-parametric numerical techniques have to be employed. In such methods, sampled data points are not fitted to parametric models but processed as they are, which is more faithful to the data themselves. The detailed methodology is described in section 3.3.
2.4 Anomalous Scaling
While the general relationship between bare and renormalized kernel was discussed in the previous section, this section focuses on the special case at criticality, where for a power-law kernel a larger θ results in a shorter bare memory but a longer memory of the response function (see figure 8, bottom). Intuitively, one would actually expect that the renormalized kernel, describing the response of the aggregate system of all generations, is simply an "up-scaled" version of the bare kernel, describing the response on the single-generational level (see figure 8, top). For this reason this paradoxical behavior is referred to as anomalous scaling. For the case of 0 < θ < 1 a number of authors have shown that R(t) ∼ 1/t^(1−θ) for n = 1 (synthesis in [16]). But also in the subcritical case there is a phase with renormalized exponent, namely when

t < t∗ = c (nΓ(1 − θ)/|1 − n|)^(1/θ), (2.24)

as derived in [11] and visualized in figure 7. Note that in this work the case 1 < θ is not examined. Note also that anomalous scaling can be observed even for exponential memory kernels, however only in the supercritical regime; this can easily be derived from equation (2.23). This section largely reproduces the intuitive derivation of the above paradox in [15]. However, more detail, in particular with regard to the assumptions made, is added in order to be able to actually test the argumentation with synthetic data.
We start by realizing that the total activity A(t) after one exo-event at t = 0 can be decomposed into the contributions per generation k, and is directly proportional to the response function R(t) via the branching ratio n:

A(t) = Σ_{k=1}^{K(t)} A_k(t), (2.25)

A(t) = nR(t). (2.26)

Here K(t) denotes the highest generation triggered until time t, such that 0 < k ≤ K(t). The latter equality holds for the general case of a sudden impulse happening at t = 0 (see
Figure 7: 3D plot of the crossover time t∗ as given by equation (2.24). The variable c has been set to 1, as it enters the equation as a simple prefactor. Generally, a choice of both n and θ close to 1 results in a higher crossover time. For θ → 0 the crossover becomes a sudden transition at n = 0.5. For θ → 1 the crossover becomes independent of n. For intermediate values of θ the branching ratio should also be chosen rather close to criticality for the crossover to be clearly visible.
equation 6 in [15]) and the response of the system to this impulse. Now we can look into the activity per generation A_k(t), which is analogous to equation (2.26),

A_k(t) = n^k Φ_k(t), (2.27)

and from there determine what A(t) would look like given equation (2.25), for which K(t) has to be determined as well. The activity per generation in equation (2.27) is composed of the number of offspring n^k and the bare memory kernel Φ_k(t). Since we are on the first-order, single-generational level, it only makes sense to take the bare kernel. Here the subscript k refers to the fact that the short-time cutoff will not be c as in (2.11) but the only characteristic time scale associated with generation k, i.e. the typical waiting time t_k in that generation. The bare memory kernel per generation is hence expressed as

Φ_k(t) = θ t_k^θ / (t + t_k)^(1+θ). (2.28)
To determine the relationship between t_k and k, it is easier to first look into the analogous aggregate case of K(t) depending on t. For this, consider the chain of events leading up to the occurrence of the highest generation,

t = t_1 + t_2 + ... + t_{K(t)} ≈ K(t)⟨t_k⟩, (2.29)

where t_2 denotes the waiting time between immigrant and second-order descendant, for
Figure 8: Bare (green) and renormalized (red) memory kernels for different exponents. Top: exponential kernel with a short characteristic time τ = 0.5 (dashed) and a long characteristic time τ = 2 (thick) at near-criticality (n = 0.9). The scaling behavior can be checked by comparing the two: decreasing τ decreases both the bare and the renormalized memory (lower slope from thick to dashed). Bottom: power-law kernel with a low exponent θ = 0.4 (thick) and a high exponent θ = 0.9 (dashed) at criticality. Comparing these two exponents reveals anomalous scaling: increasing the exponent decreases the bare memory (lower slope from thick to dashed) but increases the renormalized memory (higher slope from thick to dashed). Anomalous scaling is therefore a fundamental property of long-memory processes.
example. The approximation becomes an equality when considering infinitely many independent realizations for a given K(t) and t. The time average ⟨t_k⟩ is calculated as

⟨t_k⟩ = ∫_0^{t_max(t)} t′ Φ(t′) dt′ ≈ (θ c^θ/(1 − θ)) t_max(t)^(1−θ). (2.30)

Note that here we take advantage of the fact that there is no waiting time beyond t_max(t) in a finite set of K(t) generations, such that the integration goes only until t_max(t) and thus stays finite. In the last approximation we assumed that c ≪ t_max(t). To continue we have to determine t_max(t), which can be done by checking the probability of finding a t_k equal to or larger than t_max(t) in the chain of K(t) events above. By consistency this probability should be roughly 1:

1 ≈ K(t) P[X ≥ t_max(t)] = K(t) ∫_{t_max(t)}^∞ Φ(t′) dt′ = K(t) (c/(t_max(t) + c))^θ. (2.31)

By again assuming that c ≪ t_max(t), t_max(t) can finally be written as

t_max(t) ≈ c K(t)^(1/θ), (2.32)

which, plugged into equation (2.30) and using equation (2.29), results in

t ≈ (θc/(1 − θ)) K(t)^(1/θ), (2.33)

or, inverted,

K(t) ∼ (t/c)^θ. (2.34)

The same reasoning applied per generation gives

t_k ∼ c k^(1/θ). (2.35)
Now, returning to equation (2.27) and requiring that t_k ≪ t, we can write the activity per generation as

A_k(t) ∼ n^k k θ c^θ / t^(1+θ), (2.36)

which, plugged into equation (2.25) for the total activity, results in

A(t) ∼ (θ c^θ / t^(1+θ)) Σ_{k=1}^{K(t)} k ∼ θ c^θ K(t)² / t^(1+θ) (2.37)
for the critical case of n = 1. The quadratic relationship between A(t) and K(t) will be dramatically less pronounced in the subcritical case. Now, by including the behavior of K(t) given in equation (2.34), the behavior of the total activity becomes

A(t) ∼ K(t)²/t^(1+θ) ∼ 1/t^(1−θ), (2.38)

such that also R(t) ∼ 1/t^(1−θ). Therefore, the larger θ is, the shorter the memory of 1/t^(1+θ), but the faster the number of generations triggered up to time t grows. The resulting, slower-decaying renormalized response function ∼ 1/t^(1−θ) thus stems from the fact that the number of triggered generations grows sufficiently fast, according to equation (2.34), to more than compensate the shorter memory associated with a larger θ. This anomalous scaling results fundamentally from the property that the expected waiting time is infinite for 0 < θ < 1 [15]. For a large θ the probability mass of the bare kernel is concentrated at short times, resulting in more early offspring.
To summarize, for a power-law kernel at criticality we need to ensure the following in a simulation in order to reproduce anomalous scaling.
1. t_k ≪ t, which is fulfilled for sufficiently large t such that K(t) spans several decades.
2. There is at least one t_k in the highest-generation cluster that is of the order of t_max (assumption for (2.31)).
3. We need to consider intact chains of events such that t = K(t)⟨t_k⟩.
4. For simulations only at near-criticality we further need to ensure that t < t∗.
3.1 Data Simulation
In order to empirically investigate the relationship between bare
and renormalized kernel, the synthetic data needs to be such that
we know the corresponding immigrant for every descendant event.
With this information it is possible to directly sample the waiting
times for the renormalized kernel in order to determine the shape
of R(t). Via the knowledge of n, which will simply be the ratio of
descendants over all events, the shape of Φ(t) can in principle be
determined as described in section 2.3. How this is realized will
be described in section 3.3. Of course conventional methods such as
the maximum likelihood estimation (MLE) can be applied to directly
determine the shape of Φ(t). However, as already concluded in the
previous section, non-parametric methods, i.e. not MLE, are more
suitable in cases where the underlying laws are unknown.
How the simulation of the synthetic data, as well as its sampling for the above-mentioned purpose, is realized will be briefly outlined in this section. In the following section 3.2, the determination of the shape of R(t) via adaptive kernel density estimation (AKDE) is described in detail.
There are two commonly used algorithms for simulating a Hawkes process. The first is the thinning procedure, in which each event is simulated according to the intensity as given in equation (2.1). The disadvantage for our purposes is that we can infer neither which event is an immigrant nor which event triggered which subsequent event. Moreover, the numerical complexity goes as O(N²), such that the simulation of long time series at near-criticality would be prohibitively expensive [17]. For a more detailed description of this method refer to [9]. The second algorithm simulates immigrants and descendants separately: the former as a Poisson process, the latter via the bare memory kernel. Obviously, it is straightforward to determine n here. This option is also well suited for recording which event triggered which event, as briefly outlined in the following.
First, the immigrants are simulated as a homogeneous Poisson process via the aforementioned thinning procedure, except that the intensity is simply λ(t) = µ = const. (cf. (2.1)), which needs to be specified. As a Poisson process by definition produces independent events, these immigrants act as centers of independent clusters. Second, these clusters are simulated by filling them with descendants. For this, all immigrants with k = 0 are taken and their descendants with k = 1 are simulated in parallel via another thinning procedure with the intensity λ_i(t) = nΦ(t − t_i) = h(t − t_i), which needs to be specified. Here t_i is the time stamp of the mother or immigrant i ∈ [1; M]. This is repeated in parallel for all clusters until the cessation of the last offspring, i.e. the youngest daughter without children. In this way the branching structure of the entire process can be reconstructed perfectly, and the numerical complexity only goes as O(µTK), where T is the simulation time window and K the number of generations in this window [17].
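The cluster-wise algorithm can be sketched in a few lines of Python (our own illustration with hypothetical function names, not the thesis' code). For simplicity we assume an exponential bare kernel Φ(t) = exp(−t/τ)/τ and, equivalently for the cluster representation, draw a Poisson(n)-distributed number of children per event instead of thinning:

```python
import math
import random

def poisson(rng, lam):
    """Poisson variate via Knuth's product method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def simulate_hawkes_clusters(mu, n, tau, T, seed=0):
    """Cluster-based Hawkes simulation. Immigrants (generation k = 0) arrive
    as a homogeneous Poisson process with rate mu on [0, T]; every event
    triggers Poisson(n) children whose waiting times follow the exponential
    bare kernel with mean tau. Returns a sorted list of (t, k, cluster_id)."""
    rng = random.Random(seed)
    events, frontier = [], []
    t, cid = 0.0, 0
    while True:                        # 1) immigrants
        t += rng.expovariate(mu)
        if t > T:
            break
        events.append((t, 0, cid))
        frontier.append((t, 0, cid))
        cid += 1
    while frontier:                    # 2) descendants, generation by generation
        next_frontier = []
        for (ti, k, ci) in frontier:
            for _ in range(poisson(rng, n)):
                tc = ti + rng.expovariate(1.0 / tau)  # waiting time ~ Phi
                if tc <= T:
                    ev = (tc, k + 1, ci)
                    events.append(ev)
                    next_frontier.append(ev)
        frontier = next_frontier
    return sorted(events)

# The branching ratio is then simply the fraction of descendants:
events = simulate_hawkes_clusters(mu=1.0, n=0.5, tau=1.0, T=20_000.0, seed=1)
n_hat = sum(1 for (_, k, _) in events if k > 0) / len(events)
print(f"estimated branching ratio ≈ {n_hat:.2f}")
```

Because every event keeps its generation and cluster ID, the branching structure is available for free, which is exactly what the sampling of renormalized waiting times below requires.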
Now each simulated event is defined by its time stamp t, its generation k and its
unique cluster-ID i. Now, by taking the time stamp of the earliest event of each cluster, i.e. the smallest t or equivalently the event with k = 0, and subtracting it from the time stamp of each remaining event in that cluster, one obtains the positive waiting times for the renormalized kernel R(t), fulfilling the requirement t ↦ R(t), t ∈ R+. Hence in this work we refer to those waiting times as renormalized waiting times, or rwt in the code. The code get_rwt.m can be found at https://github.com/SRustler/Shocks_and_Self-Excitation_in_Twitter.
With this detailed knowledge of the clusters, many other dependencies can be investigated, such as the number of events per generation N_k(k) or the cluster size distribution N_s(s). The further processing of the renormalized waiting times sampled in this way is discussed in the following section.
3.2 Adaptive Kernel Density Estimation5
The most straightforward way of obtaining the underlying PDF of a sampled random variable X is to make a normalized histogram. However, this has several disadvantages: first, the estimate is very crude; second, the resulting PDF is not smooth, which may pose difficulties for further numerical treatment. A method to overcome these is kernel density estimation (KDE), which for each sampled data point stacks a smooth kernel instead of a discrete box of one bin width. The kernel needs to be specified in type and bandwidth. Most commonly it is a kernel following a normal distribution centered at the data point X, with the standard deviation σ as bandwidth. Obviously, a smaller bandwidth allows for more detail in the KDE. It may, however, be sensible to choose different kernels, e.g. an exponential kernel for an exponential memory kernel.
In our case of strictly decaying PDFs, with very dense points at the head and very scarce points in the tail of the distribution, a simple KDE reaches its limits: a small bandwidth is appropriate for the many data points at the head but allows for too much detail in the tail. Conversely, a large bandwidth is appropriate for the few data points in the tail but allows for too little detail at the head: at the head, for t = 0, there is an abrupt drop to zero, which requires a very small bandwidth to be captured well. Clearly, the bandwidth needs to vary along the distribution.
The method to realize this bandwidth variation is adaptive kernel density estimation (AKDE). In principle, there are many ways to determine a dynamic optimal bandwidth; it could be linked directly to the scarcity of the data or to the distance to the nearest neighbors, for example. Since we are dealing with strictly decaying PDFs in the case of synthetic data, it is sensible to go a faster and simpler route and vary the bandwidth linearly with the abscissa; in this work we chose σ = X/2, such that our AKD-estimated PDF with normal kernel can be written as

p(t) = (1/N) Σ_{i=1}^{N} 1/(√(2π) σ_i) exp(−(t − x_i)²/(2σ_i²)), with σ_i = x_i/2, (3.1)

5 "Kernel" in the context of AKDE is not to be confused with the memory kernel of the previous sections on the theoretical background.
where X ≡ x_i. Strictly speaking, though, this is not adaptive to the data. For empirical data of response times it may happen that the responses peak at certain times, in which case the bandwidth needs to be smaller at those times. The easiest way to realize this is to link the bandwidth to the probability of the response time. However, as will be shown later in section 5.1.2, allowing for this much detail in the determination of the renormalized memory kernel might impair the numerical calculation of the bare memory kernel.
A common feature of the memory kernels used for data simulation is a sharp peak right at t = 0, arising from the requirement t ↦ R(t), t ∈ R+. As a consequence, for finite bandwidths there will always be an underestimation at the head. By employing a sharp exponential or power-law kernel (finite for t > 0 only) instead of a normal kernel (finite for any t), this underestimation can in principle be cured. However, the very fact that the kernel is finite only on the positive half-axis introduces an upward bias in the distribution.
A better method to cure the underestimation at the head of the distribution is to apply a post-AKDE reflection of negative probability mass, or "post-reflection" for short, about the ordinate. In this approach we allow probability mass to occur for negative waiting times, i.e. we at first calculate the PDF also for t < 0. Once the calculation for each sampled data point is completed, the negative probability mass is reflected with respect to the ordinate. Obviously the PDF decays towards 0−, such that the head at 0+ is lifted up most strongly by this procedure, whereas the tail is barely affected. As a result the underestimation can be partially cured without introducing a bias in the rest of the distribution.
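The AKDE with the linear bandwidth choice σ_i = x_i/2 and the post-reflection step can be sketched as follows (our own illustration; the estimate is evaluated on a grid of non-negative times):

```python
import numpy as np

def akde_reflected(samples, t_grid):
    """Adaptive KDE with normal kernels of bandwidth sigma_i = x_i / 2,
    followed by 'post-reflection': probability mass placed at t < 0 is
    mirrored about the ordinate, lifting the head of the estimate."""
    x = np.asarray(samples, dtype=float)   # waiting times, all > 0
    sig = x / 2.0                          # linear bandwidth choice
    def p(t):
        t = np.asarray(t, dtype=float)[:, None]
        k = np.exp(-0.5 * ((t - x) / sig) ** 2) / (np.sqrt(2.0 * np.pi) * sig)
        return k.mean(axis=1)              # average of the per-point kernels
    return p(t_grid) + p(-t_grid)          # add the reflected mass

rng = np.random.default_rng(0)
samples = rng.exponential(1.0, size=2000)
t = np.linspace(0.0, 20.0, 2001)
pdf = akde_reflected(samples, t)
mass = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(t)))
print(f"total mass on [0, 20]: {mass:.3f}")   # close to 1
```

Since each normal kernel integrates to one over the whole real line, folding the negative half-axis back preserves total probability mass on t ≥ 0.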
As described earlier in this work, we will be sampling renormalized waiting times constituting the renormalized memory kernel, normalized to 1/(1 − n), as opposed to the bare memory kernel, normalized to 1. The distribution estimated by the above-described AKDE therefore has to be scaled as

R_AKDE(t) = p(t)/(1 − n). (3.2)
3.3 Numerical Reconstruction of Bare Memory Kernel
As described in the previous section, the AKDE results in a set of data points that constitute the renormalized memory kernel R_AKDE(t). In order to obtain the bare memory kernel Φ(t), these data points could in principle be fitted to a parametric form that could be solved analytically via Laplace transformations. This, however, introduces errors and is not sensible, as we do not know the underlying law of the data points. Moreover, analytical solutions do not exist even for the simple power-law case. It is more faithful to the data if a non-parametric processing is performed. Concretely, equation (2.17) can be solved numerically by exploiting the fact that such Volterra integral equations are causal, i.e. they integrate only up to the current time t. Since we expect the shape of both the renormalized and the bare kernel to be very mild, i.e. free of singularities and very smooth, standard methods of numerical integration should be able to solve for the bare kernel [18]. How this is performed is described in detail in this section.
Changing variables in equation (2.17) as x = t − t′ results in a single-variable Φ(x) in the integral:

Φ(t) = R(t) − n ∫_0^t Φ(x)R(t − x) dx, (3.3)

which we can approximate by rectangles as

Φ(t_i) = R(t_i) − n Σ_{j=0}^{i−1} Φ(t_j)R(t_i − t_j)(t_{j+1} − t_j). (3.4)

In this form it becomes evident that we can easily take the values of Φ(t_j) that have been calculated prior to Φ(t_i), since j < i and Φ(t_0) = R(t_0) if t_0 ≈ 0. With the knowledge of the bare kernel at t_0, it can be solved successively for increasing i:

Φ(t_1) = R(t_1) − nΦ(t_0)R(t_1 − t_0)∆t
Φ(t_2) = R(t_2) − nΦ(t_0)R(t_2 − t_0)∆t − nΦ(t_1)R(t_2 − t_1)∆t
Φ(t_3) = R(t_3) − nΦ(t_0)R(t_3 − t_0)∆t − nΦ(t_1)R(t_3 − t_1)∆t − nΦ(t_2)R(t_3 − t_2)∆t
...
Φ_N = R_N − nΦ_0 R_{N,0}∆t − ... − nΦ_{N−1} R_{N,N−1}∆t

and so on until the maximum waiting time is reached. In the last line an abbreviated notation was used. Also note that since the t_i are equispaced, t_{i+1} − t_i = ∆t, which speeds up the calculation.
Moreover, the computation is sped up by the fact that for all i, j the respective R_{i,j} are known. The rectangular approximation of equation (3.3) can be improved by employing a trapezoidal approximation:

Φ_i = R_i − n∆t ( Φ_0 R_{i,0}/2 + Σ_{j=1}^{i−2} Φ_j R_{i,j} + Φ_{i−1} R_{i,i−1}/2 ). (3.5)

A quadratic approximation is usually better and is achieved by implementing the composite Simpson's rule:

Φ_i = R_i − (n∆t/3) ( Φ_0 R_{i,0} + 4 Σ_{j odd} Φ_j R_{i,j} + 2 Σ_{j even, 0<j<i−1} Φ_j R_{i,j} + Φ_{i−1} R_{i,i−1} ). (3.6)
Again, the fact that only Φ_j that have been calculated prior to the current Φ_i enter is exploited. The first and last term in the parentheses of both methods correspond to the values j = 0 and j = i − 1, respectively. Obviously, for all the numerical integration methods above the computational effort goes as O(N²/2).
3.4 Extracting Highest Generation over Time
Section 2.4 gives an intuitive theoretical explanation of anomalous scaling; the approach for its empirical confirmation is explained here.
The most straightforward way to empirically demonstrate anomalous scaling via synthetic data is to compare the generating bare kernel and the resulting renormalized kernel for different values of θ. For the sake of completeness this is included in the results section under 5.1.4. However, to demonstrate the actual cause of anomalous scaling, namely the over-compensating number of offspring with increasing θ, the relationship (2.34) has to be confirmed. By extracting the highest generation over time, K(t), we can test for this relationship. The naive approach is to store the highest generation at any point in time. Obviously, this needs to be repeated for several simulations, because for a single simulation the resulting K(t)-curve is a very crude "staircase" function. However, this approach will in fact result in a weaker growth of K(t), as it violates a key assumption made in the derivation in section 2.4: the necessity of intact chains of events at time t, as expressed in equation (2.29). In order to adhere to this requirement, we exploit the fact that per definition the chain of events is ongoing within a cluster. Moreover, clusters are independent of each other, such that we can average over all clusters while performing only one long simulation. The second equality of equation (2.29) holds more accurately for lower K, because there are more of those events in the system. For the equality to also hold for larger K, a high unconditional intensity µ and a high θ can be chosen. Of course, with this approach t in K(t) will not be the system's time but the cluster's time, i.e. exactly the renormalized waiting time. For this reason extracting K(t) is performed via the function get_rwt.m, explained in section 3.1.
When considering clusters individually, we strictly take one event and waiting-time value per cluster and K-generation in order to ensure proper averaging. The question, however, is which event of generation K to choose for any cluster. Assumption 2 at the end of section 2.4 provides a criterion here: in order to make it more likely for t_max to occur among the different t_k for a given K > k > 0, we choose the latest occurring event of generation K.
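The selection of the latest event per cluster and generation can be sketched as follows (our own Python illustration operating on the (t, k, cluster-ID) triples produced in section 3.1; function names are hypothetical):

```python
from collections import defaultdict

def highest_generation_points(events):
    """For each cluster and each generation k > 0 keep the LATEST event of
    that generation (assumption 2 of section 2.4) and return sorted
    (rwt, k) pairs, rwt being the time since the cluster's immigrant."""
    origin, latest = {}, defaultdict(dict)
    for t, k, cid in events:
        if k == 0:
            origin[cid] = t
        else:
            latest[cid][k] = max(latest[cid].get(k, t), t)
    return sorted((t - origin[cid], k)
                  for cid, gens in latest.items() if cid in origin
                  for k, t in gens.items())

# Toy cluster: immigrant at t = 0; generation 1 occurs at t = 1 and t = 2, so
# the later event (t = 2) is kept; generation 2 occurs once, at t = 5.
sample = [(0.0, 0, 0), (1.0, 1, 0), (2.0, 1, 0), (5.0, 2, 0)]
print(highest_generation_points(sample))  # -> [(2.0, 1), (5.0, 2)]
```

Averaging the resulting pairs over many clusters yields the K(t)-curve against which relationship (2.34) can be tested.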
3.5 Characterizing Response Functions
In order to compare the different non-parametric memory kernels, several metrics have to be applied. One of these metrics is the power-law exponent θ, assuming that the tail starting from the short-time cutoff c follows a power law. Equation (2.15) provides a straightforward way to determine θ = α − 1 for a given c. However, the parameter c may not be easily determined. In this case the estimated θ_N needs to be plotted against c_N, and a plausible choice of both values has to be made. For high values of c_N, i.e. looking at the tip of the tail only, the estimate of θ_N will be far off, as the number of data points N is very low. With decreasing c_N and hence increasing N, the estimate improves, which manifests itself in the curve of θ_N against c_N converging to θ. However, this only holds as long as the PDF is a straight line in the log-log plot. For a PDF of the form (2.11), bending will occur in the region of the true c. Figure 9 shows this approach applied to sample data for both power-law kernels of the form (2.11) and non-shifted power-law kernels, i.e. without lower bound (red curve, right column). When sampling the renormalized kernel, bending will occur at even higher values, i.e. at the crossover time as defined in equation (2.24), which decreases the region of stable θ_N even more. In fact, the determined c_N actually needs to be interpreted as the crossover time t∗ rather than the short-time cutoff, from which the actual c can be extracted via equation (2.24). Consequently, determining θ_N and c_N is a non-trivial task.
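The Hill estimation of θ_N for a varying lower bound c_N, as used for figure 9, can be sketched on synthetic non-shifted power-law data with known θ (our own illustration; plfit.m/plvar.m are not reproduced here):

```python
import numpy as np

def hill_theta(x, c):
    """Hill estimate of the tail exponent theta (alpha = theta + 1) from
    all samples at or above the lower bound c."""
    tail = x[x >= c]
    return len(tail) / float(np.sum(np.log(tail / c)))

# Non-shifted power-law samples with theta = 0.8 via inverse-CDF sampling:
# survival function P[X >= x] = x^(-theta) for x >= 1.
rng = np.random.default_rng(42)
theta = 0.8
x = (1.0 - rng.uniform(size=100_000)) ** (-1.0 / theta)
for c in (1.0, 3.0, 10.0):
    print(f"c_N = {c:4.1f}: theta_N ≈ {hill_theta(x, c):.3f}")
```

For a pure power law the estimate stays stable as c_N increases (only the statistical error grows); it is the bending below the true c, or below t∗, that destabilizes the curve in the shifted and renormalized cases.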
For this reason, in addition to our estimates we employ the power-law test as implemented in plfit.m and plvar.m from [19], which estimate x_min and α, and their respective errors, as described in [13]. The procedure combines MLEs with goodness-of-fit tests based on the Kolmogorov-Smirnov statistic and likelihood ratios. The results of these tests will be plotted in the same figure as the Hill estimation in order to enable a direct comparison. The description of the codes from [19] reads: "The fitting procedure works as follows:
• For each possible choice of xmin, we estimate alpha via the method of maximum likelihood, and calculate the Kolmogorov-Smirnov goodness-of-fit statistic D.
• We then select as our estimate of xmin, the value that gives the minimum value D over all values of xmin."
Here x_min denotes the lower bound of the random variable above which the power law is fitted; in our case this is c. α denotes the power-law exponent or scaling parameter as already used in section 2.2. Recall that α = θ + 1.
We further characterize the response-time distribution by simple metrics that focus on the head and the body of the distribution. Concretely, we provide the median, i.e. the time by which 50% of the responses have happened, as well as the 90%- and 99%-quantiles. All three are directly visible in the CCDF of the response-time distribution.
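These quantile metrics are read directly off the sorted sample; as a check, for an exponential response-time distribution with mean 1 the theoretical q-quantile is −ln(1 − q), i.e. about 0.69, 2.30 and 4.61 for the three quantiles (a minimal sketch with our own function name):

```python
import numpy as np

def response_quantiles(samples, qs=(0.5, 0.9, 0.99)):
    """Head/body metrics of a response-time distribution: the time by which
    a fraction q of all responses has happened (where the CCDF equals 1 - q)."""
    x = np.asarray(samples, dtype=float)
    return {q: float(np.quantile(x, q)) for q in qs}

rng = np.random.default_rng(7)
q_vals = response_quantiles(rng.exponential(1.0, size=200_000))
print({q: round(v, 2) for q, v in q_vals.items()})
```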
Figure 9: Test of the performance of the Hill estimator on known sample data. The top subplots show the theoretical PDF and the respective histogram for two different distributions with θ1 = 0.8 (blue) and θ2 = 0.992 (red). The bottom subplots show the respective curves of the Hill estimator θ_N against the lower bound c_N. Left: the two known distributions are of the form (2.11), however without the c-shift in t, such that the curve diverges at t → 0 (in this case the PDF only starts being finite at t = c, as opposed to t = 0). As a result the distribution follows a straight line on [c; ∞) in the log-log plot. This favors the Hill estimation, as can be seen in the bottom plot, where at approximately t = 10 a stable region is reached and the power-law exponents are well recovered (green). The fact that the region of stability extends down to the smallest t-value indicates that there is indeed no c-shift. Right: now the second distribution (red) is exactly of the form (2.11). As a result, a bending of the PDF at around t = c occurs, affecting the Hill estimation. This can be seen in the bottom subplot, where the region of stability ends at around t = c = 2. In fact, looking at the first distribution, the region of stability only starts at t = 1, i.e. after the bending of the PDF sets in. This can heavily affect the method.
4 Empirical Data
The tools and methods above are tailored to non-parametric analysis, which is appropriate for empirical data whose underlying laws are unknown. This section elaborates on the empirical data examined in this work, namely Twitter data. While the self-excited Hawkes process has been successfully applied in the context of earthquakes, financial markets and urban burglaries ([20], [1], [21]), where some kind of shock may trigger aftershocks, in the context of Twitter a tweet may trigger re-tweets. As expressed in equation (2.1), making predictions with the help of Twitter data requires knowledge of the unconditional intensity µ(t) and the memory kernel h(t). As mother tweets are exogenous and re-tweets endogenous, Twitter data is well suited for the Hawkes model, which effectively combines both types of mechanisms. In line with the work of [1], determining a branching ratio n over time may potentially serve as a precursor for manifestations of criticality. As Twitter is frequently used as an important tool for political mobilization, especially during riots and unstable periods, such a manifestation of criticality could be the actual outbreak of a protest or riot on the street [22]. In this work we therefore analyze two recent social protests, elaborated on later in this section. In contrast, the characterization of the memory kernels may provide insights into how social interest - be it in a protest or a movie - can be sustained and fuelled over time.
As the “SMS of the internet”6, Twitter is designed to enable users to share short messages and links very efficiently, allowing them to label tweets via so-called hashtags, i.e. keywords preceded by a hash “#”, and to cite and reply to other users’ tweets [23]. These hashtags provide an effective way of obtaining and contributing relevant content. In this work we filter content via hashtags in order to obtain tweets on certain topics. The next section explains how those tweets were obtained; the subsequent two sections specify the choice of hashtags.
4.1 Scraping Twitter Data
There are many different ways to obtain different amounts of
Twitter data. Getting 100% of tweets fulfilling certain criteria,
generally involves high costs. For example, the big data service
“DataSift” that collaborates with Twitter charges on average 0.10
USD per 1,000 tweets on top of a monthly subscription fee of 5,000
- 45,000 USD.7 Another option is using Twitter’s “Spritzer”
streaming API, which is publicly accessible and free of charge
but does not provide access to historical data and streams only
about 1% of all tweets (tweet-ID mod 100). Using the “REST
API” (REpresentational State Transfer) allows for past data but is
limited to about 1 request per second [24]. Since requests are not
constrained by the tweet-ID as for the streaming API, a higher
percentage of tweets is attainable. How the REST API was used to
obtain a larger number of tweets will be explained in the following.
6http://www.business-standard.com/article/technology/swine-flu-s-tweet-tweet-causes-online-flutter-109042900097_1.html
Obtaining the tweets of a certain hashtag from the REST API
requires knowing their tweet-IDs. When a tweet is displayed
in the browser, its tweet-ID can be found in the source code. Hence
we use Twitter’s advanced web search to have the tweets of a
certain hashtag and date displayed in the browser. In order to
increase the number of search results, one has to manually scroll
down until the query limit is reached, which currently is 400
tweets. On the results page one can toggle between “Top” and “All”
results. “Top” tweets are comprised of tweets that received or are
likely to receive wider attention, which is calculated from the
number of re-tweets, followers of the tweeter, favorites and
replies [25]. “All” tweets are a fixed random sample of all tweets
fulfilling the search criteria, i.e. repeating the search would not
increase the sample. The query limit of 400 tweets holds for both
types. In this work we exclusively consider top tweets, as they
have more responses available and thus better statistics. The
source code of the search results, including the tweet-IDs, is
saved and further processed by the Python script t_get.py, which
parses the source code to extract the tweet-IDs and communicates
via an associated Twitter account with the REST API of Twitter.
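The parsing step can be sketched as follows. This is a minimal illustration, assuming (as in Twitter's web markup at the time) that each tweet in the saved source code carries a `data-tweet-id` attribute; the actual t_get.py may differ in detail:

```python
import re

def extract_tweet_ids(html):
    """Extract unique tweet-IDs from a saved Twitter search-results page.

    Assumes each tweet is marked with a data-tweet-id attribute,
    as in Twitter's web markup at the time of writing.
    """
    ids = re.findall(r'data-tweet-id="(\d+)"', html)
    # Preserve order while dropping duplicates
    # (embedded re-tweets may repeat the same ID).
    seen, unique = set(), []
    for tid in ids:
        if tid not in seen:
            seen.add(tid)
            unique.append(tid)
    return unique
```

The extracted IDs can then be passed one by one to the REST API, respecting its rate limit of roughly one request per second.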
There are many types of tweets triggered by different mechanisms.
First and foremost there are original tweets that may or may not be
re-posted by other users as so-called “re-tweets (RTs)” [26]. In
allusion to the terminology used in the Hawkes model with exogenous
“mother events” and endogenous “daughter events,” we will call
original tweets “mother tweets (MTs).” For each mother tweet-ID the
100 most recent re-tweets along with a host of meta-data can be
obtained. Currently the following is stored for further
processing:
• Unique tweet-ID (variable digit number)
• Unique tweet-ID of mother tweet (if applicable)
• Time stamp with precision of seconds
• Unique user-ID (variable digit number)
• Number of re-tweets (if mother tweet, otherwise −1)
Via the reference to each tweet’s mother tweet (if it is not a
mother tweet itself) it is straightforward to extract the response
times, i.e. the time needed for a re-tweet to occur after its
mother tweet, and construct the response function. So unlike in
[1], with Twitter data we are able to causally link mother and
daughter events, which puts us in a good position to robustly
determine the memory kernel. It is important to note that this
procedure only allows for the detection of a certain type of
re-tweeting, i.e. the one-click or automatic re-tweeting as
opposed to the manual re-tweeting, where the content of the mother
tweet is copied and pasted with own comments and a reference added
in
the re-tweet. Obviously, Twitter cannot draw a link between such a
manual re-tweet and its mother tweet. Yet there are ways to
reconstruct this connection as shown for example in [27]. However,
since it is likely that different ways of re-tweeting follow a
different mechanism of self-excitation in the system, this work is
limited to automatic re-tweeting only. The defining feature of the
one-click re-tweets is the fact that they will always refer back to
their most original mother, i.e. the immigrant. Therefore, when
sampling the response times in Twitter, they will strictly
constitute the renormalized memory kernel.
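Given the stored fields listed above, extracting the response times is straightforward. The following is a minimal sketch (the field names are illustrative, not the actual storage format):

```python
def response_times(records):
    """Compute response times (in seconds) of one-click re-tweets.

    `records` is a list of dicts with the stored per-tweet fields:
    'id', 'mother_id' (None for mother tweets) and 'timestamp'
    (seconds). Because one-click re-tweets always refer back to the
    original mother tweet, these response times directly sample the
    renormalized memory kernel.
    """
    # Timestamps of all mother tweets, keyed by their tweet-ID.
    t_mother = {r['id']: r['timestamp']
                for r in records if r['mother_id'] is None}
    times = []
    for r in records:
        mid = r['mother_id']
        if mid is not None and mid in t_mother:
            times.append(r['timestamp'] - t_mother[mid])
    return sorted(times)
```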
As derived in section 2.3, determining the renormalized memory
kernel requires knowing the branching ratio such that normalization
to 1/(1 − n) is possible. Clearly, re-tweets are by definition
endogenous events. In our framework mother tweets will be treated
as exogenous events. Strictly speaking, though, mother tweets do not
necessarily all have to be exogenous events. They themselves can in
principle be triggered by truly exogenous events, such as relevant
news releases in the media. However, within the system of Twitter
they constitute the exogenous events such that the branching ratio
can be directly calculated as
n = Nendo/(NRT + NMT) = NRT/(NRT + NMT). (4.1)
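Equation (4.1) amounts to a one-line computation, sketched here with the endogenous events identified with the re-tweets:

```python
def branching_ratio(n_retweets, n_mothers):
    """Branching ratio n = N_RT / (N_RT + N_MT), equation (4.1):
    within the Twitter system the re-tweets are the endogenous
    events and the mother tweets the exogenous ones."""
    return n_retweets / (n_retweets + n_mothers)
```

For example, 70 re-tweets triggered by 30 mother tweets give n = 0.7, i.e. a strongly self-excited but still subcritical regime.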
Two factors that were already mentioned can affect the calculation
of n and hence the correct scaling of the renormalized memory kernel,
which in turn affects the estimation of the bare memory kernel.
First, top tweets consist of tweets that tend to have many
re-tweets, such that the branching ratio will be higher and thus
not representative of the entire system. Second,
in case some mother tweets have more than 100 re-tweets, only the
most recent ones can be obtained from Twitter, which affects the
calculation of the head of the response function. Also, the
branching ratio will be underestimated as some re-tweets are
disregarded. However, in each mother tweet’s meta-data the number
of re-tweets, even beyond 100, is included, allowing for a correct
calculation of n. Yet the effect on the renormalized kernel of not
knowing the response times of some early re-tweets is non-trivial
and remains to be investigated. The effects of both of those
factors will be addressed in section 5.1.3.
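The corrected calculation of n from the meta-data can be sketched as follows, using the per-mother re-tweet counts (which include re-tweets beyond the 100 retrievable ones) rather than the actually fetched re-tweets; the variable names are illustrative:

```python
def branching_ratio_from_metadata(mother_retweet_counts):
    """Branching ratio per equation (4.1), computed from the
    re-tweet counts stored in each mother tweet's meta-data.

    These counts include re-tweets beyond the 100 most recent ones
    that can actually be retrieved, so n is not underestimated.
    """
    n_rt = sum(mother_retweet_counts)  # all re-tweets, even uncollected
    n_mt = len(mother_retweet_counts)  # one entry per mother tweet
    return n_rt / (n_rt + n_mt)
```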
4.2 Hashtags of Single Events
Single events are events that are so confined in duration that the
response to them in (social) media can be regarded as a
burst with declining activity in the aftermath. For anticipated or
announced events there will also be pre-burst activity. This
activity, increasing towards the event, is mostly endogenously driven
by word-of-mouth and online sharing, for example. The decreasing
activity following the event is both exogenously and endogenously
driven. Both the event itself as well as people talking about it
generate activity in the aftermath. In terms of the Hawkes model,
activity directly triggered by the event itself would be
first-order responses, while activity triggered by other people
talking about the event would be higher-order responses.
In [5] Crane and Sornette treat YouTube video views as activity and
categorize the videos according to their foreshock and aftershock
dynamics. Events with after-burst activity only are categorized as
exogenous because the activity is initially triggered by the
external burst. This category is further divided into the
subcritical (n < 1) and critical case (n = 1). So there is a
certain degree of endogeneity in this category. As has been shown
in section 2.3, in the former case the response function will
eventually fall off as the bare kernel (∼ 1/t^(1+θ)), whereas in the
latter case the decrease will be renormalized to 1/t^(1−θ). On the
contrary, events with after-burst and pre-burst activity are
categorized as endogenous because the pre-burst activity can only
be triggered by endogenous mechanisms. As shown in [28] the
activity around the burst will be symmetric and behave as 1/t^(1−2θ)
in the critical case. In the subcritical case the activity will
follow a random noise process.
Inspired by this classification of bursts we chose hashtags of
events that are a priori exogenous or endogenous, i.e. single
events that are unexpected versus announced. The sub-categories
with regards to criticality remain to be determined ex post.
However, we choose events which received significant attention and
whose hashtags have been widely used, such that criticality may
actually be reached. The hashtags and events chosen are as
follows.
A. Hashtags of endogenous events:
#Election2012 Presidential elections of Barack Obama versus Mitt
Romney in the USA held on November 6, 2012.
#IndyRef Historical referendum on whether Scotland should become
independent from the UK held on September 18, 2014.
#ManOfSteel One of the most anticipated movies in 2013,
internationally released on June 14, 2013.8
#Transformers One of the most anticipated movies in 2014,
internationally released on June 27, 2014.8
B. Hashtags of exogenous events:
#SandyHook Shooting at Sandy Hook Elementary School in Newtown, USA
on December 14, 2012. With 26 people killed, 20 of them children,
it was one of the deadliest shootings of the past 20 years.
#BostonMarathon Bombing at the annual Boston Marathon held on April
15, 2013, with 5 deaths and 280 people injured. The incident
gained national and international attention as US-grown Islamist
terrorism.
#Mandela Death of South Africa’s first black president, Nelson
Mandela, on December 5, 2013.
#RIPRobinWilliams Death of Hollywood actor and comedian Robin
Williams on August 11, 2014.
8According to www.movieinsider.com
For each of those events the mother tweets are generally sampled
from an approximate time window of three months around the event,
while the re-tweets may reach up to the present. It is therefore to
be expected that the more recent events will exhibit finite size
effects, which will be most clearly visible in the CCDFs.
4.3 Hashtags of Extended Events
Extended events are events that are spread over a longer period of
time and composed of a series of sub-events. While the beginning of
such an extended event may be well-defined, for example due to an
exogenous shock, the ending need not be. The prime
example of such events are social protests triggered by an external
event and carried on long after it, generally exceeding the
aftershock activity of single events. Other examples include other
conflicts, wars, political terms of office, and ongoing debates on
social issues. Scheduled longer events, such as sports events, are
not suitable in this context because their extension is mostly
externally defined as opposed to internally driven.
As stated initially, this work focuses on social protests that
rely to a significant degree on social media for political
mobilization. We have chosen two very recent protests, both of
which have a widely-used hashtag clearly associated with the
conflict:
#OccupyCentral Student protests in Hong Kong, which started at the
end of August 2014 after the government's announcement of
restrictions on upcoming major elections and are ongoing at the
time of writing. The movement, also referred to as “Umbrella
Revolution” in the media, was initiated and is carried out mostly
by students, who generally tend to use social media more.
#Ferguson Social unrest in the city of Ferguson in the USA. The
triggering external shock here is very distinct, i.e. the shooting
of the African-American Michael Brown by the white police officer
Darren Wilson on August 9, 2014. Protests peaked right afterwards
but are likewise still ongoing.
Because these protests are extended and composed of multiple
bursts, it is sensible to consider and investigate different phases
of the protests. This necessitates a case study which separates the
protest into its phases and individual major bursts. Since both
protests received wide national and international media attention,
they are well-documented and allow for a separation into phases
that generally consist of a burst with succeeding and
sometimes preceding activity, much like for the single-event
hashtags. Although there are many such bursts, there are a few
major ones that can also be seen in the activity over time for the
respective protests as depicted in figure 10. The phases will not
always be very clear-cut, in which case gaps between phases of a
few days may arise. This is to prevent a mixing of potentially
different mechanisms per phase.
The social unrest in Ferguson can be roughly separated into five
phases:9
1. Initial uproar after the shooting of Michael Brown: August 9 -
13.
2. Re-escalation after demilitarization of police: August 15 -
18.
3. De-escalation after initial uproar: August 19 - 26.
4. Re-escalation after burning of M. Brown’s memorial: September 23
- October 1.
5. Peaceful protests and “Ferguson October”: October 2 - 24.
The student protests in Hongkong can be roughly separated into five
phases:10
1. Initial uproar after the announcement of electoral reform:
August 31 - September 13.
2. First major student assembly and protests: September 22 -
30.
3. Eruption of violence and de-escalation: October 3 - 8.
4. Re-escalation after cancellation of meeting with student
leaders: October 9 - 18.
5. First round of talks and de-escalation: October 21 - 28.
9http://en.wikipedia.org/wiki/2014_Ferguson_unrest More phases are
listed there, some of which, however, may be combined.
10http://en.wikipedia.org/wiki/Occupy_Central_with_Love_and_Peace.
The phases are not completely objective here but can be reasonably
inferred from the article.
[Figure 10 here: Google Trends of “Ferguson” (top) and of
“OccupyCentral” (bottom), normalized activity (0 to 1) over the
days from early September to November 1.]
Figure 10: Activity over time for the two protests investigated.
The activity of the tweets (blue bars) with the hashtags #Ferguson
(top subplot) and #OccupyCentral (bottom subplot), respectively, as
well as the activity of the same search terms in Google (cyan) are
plotted over time. The approximate phases for #Ferguson and
#OccupyCentral can also be seen in the activity.
Sometimes GoogleTrends resembles the phases better, sometimes
Twitter.
5.1 Synthetic Data
The general goal of working with synthetic data first is to
confirm the correct working of the developed algorithms and tools
before applying them to empirical data whose underlying laws are
unknown. In the first section, general observations of the
simulated data will be made in order to get a first corroboration
of the correctness of the simulation algorithm. The second section
evaluates the performance of the numerical reconstruction of the
bare kernel from the renormalized kernel. The third section
addresses known error sources arising from the limitation in the
empirical data. The fourth section demonstrates anomalous scaling
by applying this numerical reconstruction and by confirming its
very cause.
5.1.1 Cluster Characteristics
This section is meant to present first general results on the data
produced by the simulation algorithm as described in section 3.1.
The algorithm, which was extended with cluster information, i.e. the
cluster affiliation and the generation of each event, allows for a
quick preliminary check of whether the simulation is correct.
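The cluster representation underlying such a simulation can be sketched as follows. This is a generic illustration of the branching construction (immigrants plus generation-wise offspring), not the actual code of section 3.1; function and parameter names are made up:

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson random number (Knuth's method, small means)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_hawkes_clusters(mu, n, T, draw_waiting_time, seed=None):
    """Simulate a Hawkes process on [0, T] via its cluster (branching)
    representation: immigrants arrive as a homogeneous Poisson process
    of rate mu; each event of generation k spawns Poisson(n) children
    at waiting times drawn from the normalized bare kernel.

    Returns a time-sorted list of (time, cluster_id, generation)
    tuples, i.e. events carrying their cluster affiliation and
    generation as described in the text.
    """
    rng = random.Random(seed)
    events, queue = [], []
    # Immigrant (generation-0) events, one cluster each.
    t, cluster = rng.expovariate(mu), 0
    while t < T:
        queue.append((t, cluster, 0))
        cluster += 1
        t += rng.expovariate(mu)
    # Generate offspring recursively until no event spawns children.
    while queue:
        t, c, k = queue.pop()
        events.append((t, c, k))
        for _ in range(poisson(rng, n)):
            tc = t + draw_waiting_time(rng)
            if tc < T:
                queue.append((tc, c, k + 1))
    events.sort()
    return events
```

For an exponential bare kernel with τ = 2 one would pass, e.g., `lambda rng: rng.expovariate(0.5)` as the waiting-time sampler.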
Figure 11 visualizes the simulated counting process and the
respective behavior of event generations for both an exponential
memory kernel of the form (2.22) with τ = 2 and a power law memory
kernel of the form (2.11) with c = 0.1 and θ = 0.8. First
considering only the top subplots, the overall counting process
exhibits the expected irregular jumps arising from the clustering
of events. Further, its average increase roughly retrieves the
theoretical value of the branching ratio n = 0.7, when comparing it
to the average counting process of the empirical unconditional
intensity. This intensity µemp = Nexo/T can be directly calculated
from the knowledge of the cluster centers, i.e. the immigrant
events with k = 0, and compared to the theoretical value µtru that
was used to generate the data. Both values agree very well,
considering the rather short simulation time of T = 1,000. The
disagreements for the exponential and the power law case are 4.0%
and 8.0%, respectively. Doing the same for the overall intensities
λemp = N/T and λtru results in disagreements of 2.0% and 3.4% for
the exponential and the power law case, respectively.
Now comparing the counting processes to the bottom subplot of the
generation k for each event over time, one notices that the cluster
growth, proxied by the increase in k, goes along with higher slopes,
i.e. the clustering of events, in the counting process. In the
short simulation time T there is no major qualitative difference
visible between the exponential and the power law bare kernel. The
larger maximum value of k for the power law case, though, already
hints at the longer memory connected to power law kernels.
Figure 12 shows the distribution of the sampled renormalized
waiting times and the theoretically renormalized memory kernel that
was obtained by analytically transforming the data-generating bare
kernel of the form (2.16) via the Laplace transformations as
[Figure 11 here: counting process of the complete process and of
the unconditional process (top) and event generations k(t) (bottom)
over t ∈ [0, 1000], for the exponential bare kernel (τ = 2, left)
and the power law bare kernel (right).]
Figure 11: Cluster statistics over time for both an exponential
(left) and power law memory kernel (right). Top: Counting process
Nt(t) of complete process (blue) and unconditional process
(dashed). The true and empirical values agree well in the short
simulation time of T = 10^3, with a disagreement of 2.0% and 4.0%,
respectively. Further, the theoretical branching ratio
n = 0.7 ≈ Nendo/N = 250/350 is retrieved. Bottom: Respective
behavior of event generations k(t). Cluster growth, proxied by the
increase in k(t), goes along with the higher slopes, i.e. the
clustering of events, in the counting process N(t), as expected.
The true and empirical values agree well with a disagreement of
3.4% and 8.0%, respectively. There is no major qualitative
difference between the exponential and power law case visible,
except that the longer-memory power law case already reaches a
larger maximum k = 14, as opposed to k = 10, at this short
simulation time.
Figure 12: Loglog-plot of the renormalized memory kernel with a
power law bare kernel approximated as a sum of exponentials. The
approximation is given by equation (2.16) and can be analytically
transformed via Laplace transformations into the respective
renormalized kernel for n = 0.4 according to equation (2.17). The
fact that both the empirical renormalized waiting times (cyan) and
the analytically calculated renormalized kernel agree to a degree
where even the “waviness” of the power law approximation is visible
for both (see inset for more detail) shows that both the algorithm
for data simulation and the transformation from generating bare
kernel to renormalized kernel are correct.
outlined in the beginning of section 2.3. The fact that both the
renormalized waiting times and the renormalized kernel agree to a
degree where even the “waviness” of the power law approximation is
visible for both shows that both the algorithm for data simulation
and the transformation from generating bare kernel to renormalized
kernel are correct.
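A density estimate like the empirical waiting-time curve in figure 12 can be obtained by logarithmic binning of the sampled waiting times, which is the natural binning for power-law tails (straight lines in a loglog plot). The following is a sketch; the actual binning used for the figures may differ:

```python
import math

def log_binned_density(times, n_bins=30):
    """Estimate a probability density from positive waiting times
    using logarithmically spaced bins.

    Returns (bin_centers, densities), where each density is the
    bin count divided by the sample size and the linear bin width.
    """
    t_min, t_max = min(times), max(times)
    # Geometrically spaced bin edges between t_min and t_max.
    edges = [t_min * (t_max / t_min) ** (i / n_bins)
             for i in range(n_bins + 1)]
    counts = [0] * n_bins
    log_span = math.log(t_max / t_min)
    for t in times:
        # Bin index from the log position; clamp t_max into last bin.
        i = min(int(n_bins * math.log(t / t_min) / log_span), n_bins - 1)
        counts[i] += 1
    total = len(times)
    centers = [math.sqrt(edges[i] * edges[i + 1]) for i in range(n_bins)]
    dens = [counts[i] / (total * (edges[i + 1] - edges[i]))
            for i in range(n_bins)]
    return centers, dens
```

Plotted on a loglog scale, the resulting points can then be compared directly to the analytically renormalized kernel.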
Figure 13 plots further statistics that can be directly calculated
from the additional cluster information for both memory kernels
with the same parameters but for T = 10^5. The top subplot compares
the distribution of generations Nk(k). For both memory kernel types
the distribution seems to follow an exponential decay of the form
Nk(k) ∼ exp(−k/6), which agrees well with the theoretical
prediction of Nk(k) ∼ n^k = exp(log(0.7)k). The bottom subplot
compares the distribution of cluster sizes Ns(s). Again, both
memory kernel types seem to follow the same underlying law of
Ns(s) ∼ s^(−2.5). It is interesting to see that there are 5,000
clusters of size s = 1, i.e. clusters composed of a single
immigrant without descendants (Ms=1 = 5,000). The entirety of
descendants (≈ 23,333 for n = 0.7) is triggered by the other
Ms>1 = 5,000 immigrants, amounting to M = 10^4 immigrants in total,
given T = 10^5 and µ = 0.1.
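Both distributions follow directly from the stored cluster information; a minimal sketch (the event-tuple layout is illustrative):

```python
from collections import Counter

def cluster_statistics(events):
    """Given events as (time, cluster_id, generation) tuples, compute
    the distribution of generations N_k(k) and of cluster sizes
    N_s(s). Theory predicts N_k(k) ~ n^k; for the simulations here
    the size distribution is observed to behave roughly as s^(-2.5).
    """
    gen_counts = Counter(k for _, _, k in events)
    sizes = Counter(c for _, c, _ in events)   # events per cluster
    size_counts = Counter(sizes.values())      # clusters of size s
    return dict(gen_counts), dict(size_counts)
```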
[Figure 13 here: distributions of generations (top) and cluster
sizes (bottom) for the exponential bare kernel (τ = 2) and the
power law bare kernel (c = 0.1, θ = 0.8).]
Figure 13: Cluster statistics for both an exponential (green) and
power law memory kernel (red). The respective parameters are the
same as in figure 11 but the simulation time is longer (T = 10^5).
Top: Distribution of generations, which seems to behave as
Nk(k) ∼ exp(−k/6) for both kernel types. However, for large
generations the tail of the power law case seems to become heavier.
Bottom: Distribution of cluster sizes, which seems to behave as
Ns(s) ∼ s^(−2.5) for both kernel types. For the given µ and T it
can be inferred that all descendants are triggered by only 50% of
the immigrants, while the other half does not trigger any
descendants, constituting single-event clusters with s = 1.
A major difference between the two memory kernels, however, be