Chair of Entrepreneurial Risks at the Department of Management, Technology and Economics

Master-Thesis for the Degree of "Master of Science ETH in Physics"
Course Number: 402-0900-00L

Shocks and Self-Excitation in Twitter:
Response Function Characterization in Epidemic Processes in Social Media

Author: Stefan Rustler
Supervisors: Prof. Dr. Didier Sornette, Dr. Vladimir Filimonov

Zurich, Switzerland, November 2014


Abstract
This work aims to pave the way towards applying the self-excited Hawkes model to Twitter messaging, more concretely to the epidemic process of tweets triggering re-tweets for a certain topic or hashtag. One of the key ingredients of the Hawkes model is the response function, i.e. the timing of re-tweets after their original or mother tweet. Due to the heterogeneity in topics and users, a categorization of response functions was conducted for hashtags of both "endogenous" and "exogenous" single events, like movie releases or bombings, as well as for phases of extended events, in this work social protests.
The specific kind of re-tweeting (one-click re-tweets) investigated here always refers back to the most original tweet, allowing one to robustly sample the renormalized memory kernel. Furthermore, due to the clear definition of original, exogenous mother tweets and endogenous re-tweets, the branching ratio can be easily calculated. A simple numerical method to reconstruct the bare memory kernel from the renormalized memory kernel was developed, which may give insights into the individual as opposed to the collective human behavior in responding to Twitter content. With the help of synthetic data this could be used to demonstrate the theoretically predicted anomalous scaling, where a longer renormalized memory results from a shorter bare memory. Further, the very cause of this behavior could be empirically confirmed, namely the earlier triggering of events resulting from a shorter bare kernel.
Having confirmed the methods and metrics on synthetic data, the empirical Twitter data could be addressed. Generally, a longer memory could be observed across different hashtags. When a power law could be concluded, the exponent was found to be θ = 0.7 ± 0.1. The two methods of extracting the power law exponent, namely plotting the Hill estimation versus the lower bound and a maximum-likelihood estimation combined with Kolmogorov-Smirnov tests, yielded similar results. Endogenous single events could not be classified according to a common power law exponent because they are in fact composed of two sub-classes: pre-burst-focussed and after-burst-focussed events. On the contrary, exogenous single events had a similar exponent of θ ∈ [0.706; 0.749].
For the extended events, power law behavior could not be concluded for the individual phases, though it could for both social protests taken as a whole. While predicting real-life criticality by means of the branching ratio turns out to be very cumbersome, the first protest (#Ferguson) could be classified as a protest dominated by an initially triggering exo-burst with diminishing self-excitation, the second protest (#OccupyCentral) as a well-maintained and centrally organized protest with constant criticality across all phases.
The results of this work show that a larger-scale analysis with more hashtags is necessary in order to more robustly determine hashtag classes and response functions and thus be able to calibrate the Hawkes model on Twitter data for a higher predictive power.
Key words: Self-excited Hawkes process, renormalized memory kernel, anomalous scaling, social epidemics.
Acknowledgements
First and foremost I would like to express my gratitude towards Dr. Vladimir Filimonov for always having an open ear and mind for my questions, ideas and results. It has been a pleasure working with him. Further, I thank Prof. Didier Sornette for his guiding insights and for connecting my efforts to other works and results. I am also grateful for Alexandru Sava's inspirations for mathematical derivations. Last but not least I want to thank Daniel Philipp, Pawel Morzywolek, Joel Bloch, and the other master students for the good spirit and the motivating atmosphere in the office both during the week and the weekend.
Contents
1 Introduction
2 Theoretical Background
  2.1 Self-Excited Hawkes Processes
  2.2 Power Laws
  2.3 Bare and Renormalized Memory Kernel
  2.4 Anomalous Scaling
3 Methods
  3.1 Data Simulation
  3.2 Adaptive Kernel Density Estimation
  3.3 Numerical Reconstruction of Bare Memory Kernel
  3.4 Extracting Highest Generation over Time
  3.5 Characterizing Response Functions
4 Empirical Data
  4.1 Scraping Twitter Data
  4.2 Hashtags of Single Events
  4.3 Hashtags of Extended Events
5 Results
  5.1 Synthetic Data
    5.1.1 Cluster Characteristics
    5.1.2 Performance of Numerical Reconstruction of Bare Kernel
    5.1.3 Effects of Error Sources
    5.1.4 Demonstration of Anomalous Scaling
  5.2 Empirical Data
    5.2.1 Hashtags of Endogenous Bursts
    5.2.2 Hashtags of Exogenous Bursts
    5.2.3 Hashtags over Protest Phases
6 Conclusion
7 References
8 Appendix
  8.1 Normalization of Resolvent
  8.2 Individual Figures on Protest Phases
1 Introduction
In complex systems, like human societies or neuronal networks, a large number of strongly interacting subsystems or agents is crucial for the emergence of counter-intuitive phenomena from within the system. These internal feedback mechanisms are complemented by external impacts on the system, which allow the system to leave its state of equilibrium, adapt to those external impacts, and undergo phase transitions. This interplay of internal and external, or endogenous and exogenous, forces is of great interest in understanding the dynamics of information in complex systems. One of the crucial questions is whether it is possible to separate and quantify the influence of these endo- and exo-factors. Being able to quantify the former gives insight into the robustness of a system to external influences. A system in which external excitations are coupled to and enhanced by internal self-excitation is likely to be in a more fragile or critical state. Hence, the degree of self-excitation can be regarded as a measure of the criticality of a system. An early example of this can be found in the physics of earthquakes, where aftershocks (endogenous) are triggered by shocks (exogenous). If sufficiently many aftershocks per shock are set off, the system is in a critical state: a series of sudden, strong shocks, i.e. an earthquake, can occur.
In [1] V. Filimonov and D. Sornette proposed a method for the quantitative estimation of the impact of self-excitation in financial markets, based on the so-called "self-excited Hawkes process", a generalization of a Poisson process which effectively combines exogenous influences with endogenous dynamics and thus may be applied to a variety of different contexts. As shocks trigger aftershocks, in the financial context, for example, price changes can trigger price changes. A critical state would then be characterized by sufficiently many internally triggered price changes, i.e. a high volatility of the market, which may lead to financial crashes. The degree of self-excitation may serve as a precursor for such crashes. The two key ingredients of the self-excited Hawkes process are the unconditional arrival rate of shocks, capturing the exogenous dynamics, and the arrival rate of aftershocks conditional on the history of the process, which captures the endogenous mechanisms. The main component of the latter is the so-called memory kernel, which determines the influence of past shocks on future shocks. For example, in a system with long memory, current shocks may be triggered by or happen in response to long-past shocks. In this sense the memory kernel is also referred to as a response function. Both ingredients need to be adapted to the specific system considered, which is a non-trivial task.
This thesis aims to apply this concept to epidemic processes, i.e. the spread of information, in social media. Concretely, we apply the method proposed in [1] to data from the micro-blogging service Twitter. Since Twitter has become one of the most widely used social media platforms, commercial and academic interest in it is substantial. Marketing companies perform big data analyses to understand what is popular and trending. Research has, for example, focused on trying to predict election results [2]. Recently it has been shown that Twitter can be regarded as an excitable medium in the physical sense, being subject to increased activity fluctuations close to critical points such that phase transitions
are in principle predictable [3]. The short messages or tweets can be labelled via so-called hashtags and may be re-posted and thus spread by other users as so-called "re-tweets." In allusion to the terminology used in the Hawkes model, we will refer to this re-posting as "mother tweets" triggering re-tweets. The timing of mother tweets and their re-tweets provides direct and robust insight into the aforementioned response function. In contrast to earthquake systems, the collective memory of Twitter is highly heterogeneous since it is used by more than 271 million active users [4], discussing a variety of topics. It is therefore sensible to categorize content via hashtags and determine the respective response functions. Inspired by the work in [5] about classes of response functions, we will investigate hashtags referring to announced versus unexpected single events, which are likely to follow different mechanisms. Prime examples here are movie releases versus shootings or bombings. However, other works have shown that there are more numerous and less clearly defined hashtag classes, e.g. in [6]. Elucidating the response function will not only pave the way for applying the Hawkes model but also provide insights into how the activity on Twitter leads up to such events and/or how interest in certain topics behaves over time and may be manipulated.
Since Twitter is frequently used as an important tool for political mobilization, especially during riots and unstable periods, we will also investigate two recent protests in 2014 that have distinct hashtags: the social unrest in Ferguson, USA and the student protests in Hong Kong. These prolonged episodes are a natural extension to considering single events with an individual burst in tweet activity. Here it is of particular interest whether different phases of the protest can be characterized by means of the response function and the degree of self-excitation. In analogy to financial crashes or earthquakes, in the context of social unrest the manifestation of criticality predicted by the model may be the outbreak of a protest on the street, i.e. outside of the system of Twitter. By investigating the triggering of re-tweets over time, the question whether criticality in the Hawkes model matches or even precedes criticality in reality will be addressed.
Due to the high number of Twitter users, the question arises how individual activity aggregates to collective human behavior, resulting in large-scale phenomena such as blockbusters, election results or, as already mentioned, protests. In order to shed light on this, we develop a simple method to reconstruct the individual response function from the aggregate response. In the framework of the Hawkes model these two are referred to as the bare versus the renormalized memory kernel. For this we conduct a detailed analysis with synthetic data for both short-memory (exponential) and long-memory (power law) kernels. In particular, we examine the theoretically predicted paradox of anomalous scaling, where a longer renormalized memory kernel may result from a shorter bare memory kernel. For the characterization of the memory kernels different metrics will be applied and different common methods for the estimation of the power law exponent compared. Furthermore, the effect of known error sources arising from the limitations of the empirical data will be analyzed as well. With the insights gained from the synthetic data, the analysis will be extended to the empirical data of Twitter.
In summary, this thesis intends to set the stage for applying the self-excited Hawkes model to large-scale Twitter data by characterizing both individual and aggregate response functions for different handpicked hashtags and periods of hashtags. The focus of this work will hence be on the methodology and its systematic test using synthetic data; the results from the empirical data should be understood as first test runs.
2 Theoretical Background
2.1 Self-Excited Hawkes Processes
The process of tweeting can be regarded as a discontinuous point process, in which events, i.e. tweets, arrive with a certain rate. This arrival rate is generally referred to as the intensity λ of the process and can take different functional forms. The simplest point process is the Poisson process, where the intensity is a simple constant. This means that events occur homogeneously, i.e. independently of each other. However, the intensity of an inhomogeneous Poisson process has further dependencies, for example time. The self-excited Hawkes process is a generalization of an inhomogeneous Poisson process, which not only depends on time but also on the history of the process [7]. The intensity of the self-excited Hawkes process can be written as
λ(t) = µ(t) + ∫_{−∞}^{t} h(t − s) dN(s). (2.1)
Here the first summand µ(t) denotes the background or unconditional intensity describing the arrival rate of exogenous events (exo-events). Due to the time-dependence, µ(t) alone would make a simple inhomogeneous Poisson process. The second summand refers to the intensity conditional on the history of the process, ranging from −∞ to the present time t via the integration variable s. The integrand h(t − s) denotes the memory kernel, which describes the positive direct effect of a past event at s on the current instantaneous intensity at t of the endogenous events (endo-events) [8]. Usually, the further apart the events at t and s are, the smaller h will be, which corresponds to more distant events being "remembered" less by the system. The conditional intensity can be regarded as a sum of the instantaneous number of events having happened at s, i.e. dN(s), weighted by their influence on the present events at t via the memory kernel.
Assuming stationarity of the process as well as requiring that µ(t) = µ, an average total intensity Λ can be calculated by taking the expectation value on both sides of equation (2.1) [9].
Λ = ⟨λ(t)⟩ = µ + Λ ∫_0^∞ h(τ) dτ. (2.2)
Here the fact that ⟨dN(s)⟩ = λ(s) ds was used. To obtain the last equality the substitution τ := t − s was performed. The interpretation of τ would then be the time it took for an endo-event to be triggered after the arrival of the triggering event, often referred to as
waiting time. In our case of Twitter, this would correspond to the time between tweet and retweet, i.e. the response time.1 Now using the definition,
n := ∫_0^∞ h(τ) dτ, (2.3)

the average intensity can finally be written as

Λ = µ / (1 − n). (2.4)
Since we were requiring that the history of events must have a positive effect on the present intensity, this equation is meaningful only for the case of n < 1, such that Λ ≥ µ. When n = 0, the process becomes independent of its history and thus a simple Poisson process. For the case of n > 1, the intensity as defined by equation (2.1) may explode [10]. The case of n = 1 can therefore be regarded as the critical case, separating the subcritical from the supercritical regime. A complete pedagogical classification of the different regimes can be found in [11].
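The subcritical stationary rate Λ = µ/(1 − n) from equation (2.4) can be checked numerically. The following sketch (a minimal Ogata-style thinning simulation with illustrative parameter values, not part of the thesis code) simulates a Hawkes process with the exponential kernel h(t) = (n/τ) e^{−t/τ} and compares the empirical event rate with the prediction:

```python
import math
import random

random.seed(42)

def simulate_hawkes_exp(mu, n, tau, T):
    """Ogata-style thinning for a Hawkes process with bare kernel
    h(t) = (n/tau) * exp(-t/tau).  Between events the self-excited part
    A(t) of the intensity only decays, so mu + A is a valid upper bound."""
    t, A, events = 0.0, 0.0, []
    while True:
        lam_bar = mu + A                          # upper bound on the intensity
        w = random.expovariate(lam_bar)           # waiting time to candidate event
        t += w
        if t > T:
            break
        A *= math.exp(-w / tau)                   # decay excitation to candidate time
        if random.random() * lam_bar <= mu + A:   # accept with prob lambda(t)/lam_bar
            events.append(t)
            A += n / tau                          # each accepted event adds a jump n/tau
    return events

mu, n, tau, T = 0.5, 0.6, 1.0, 20000.0
events = simulate_hawkes_exp(mu, n, tau, T)
print(len(events) / T)    # empirical rate, close to...
print(mu / (1 - n))       # ...the predicted Lambda = 1.25 of equation (2.4)
```

The exponential kernel's Markov property (the excitation A decays by a single factor between events) keeps the simulation linear in the number of events; a generic kernel would require summing over the full history at each step.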
Now what is the meaning of n? As equation (2.3) shows, a memory kernel with long memory results in a larger value of n. As explained before, such a memory kernel results in an increased instantaneous intensity. The parameter n must therefore be related to how many events a single event is likely to trigger in the future. The entirety of events triggering further events can be regarded as a branching process. In fact, it is well known that Hawkes processes can be mapped onto such branching processes, in which all events are either exogenous ("mother") or endogenous ("daughter") events of different generations [8]. In this language a mother event would trigger a daughter event and so on, forming a "family tree" or cluster. How the point process governed by the Hawkes intensity (2.1) behaves with regard to cluster formation is visualized in figure 1. Now n would indeed be the average number of first-generation daughters of a single mother [1]. With this interpretation and the stationarity assumption, we can calculate an average endogenous (or self-excited or conditional) intensity Λ_endo. The following calculation is mostly adopted from [12]. Generally, the intensity of i-th order events can be written as
λ_i(t) = ∫_{−∞}^{t} λ_{i−1}(s) h(t − s) ds,   i ∈ ℕ, (2.5)
where λ_0(t) ≡ µ(t). Now if we require µ(t) = µ, as we have already done in equation (2.2), equation (2.5) would for the case of the intensity of first-generation daughter events simplify to λ_1 = nµ. The intensity of the second-generation daughter events would likewise be given by λ_2 = nλ_1 = n²µ. Continuing this to infinity, we get, under the additional requirement n < 1, the following.
1The terms response time and waiting time will be used interchangeably in this work.
Figure 1: Instantaneous intensity of a 1D self-excited Hawkes process in comparison with cluster formation and time stamps. The source of event clustering is the increased probability (via n) for events to happen right after the occurrence of an event, as given by the Hawkes intensity (cyan). In this sense a mother event can trigger a daughter event and so on, such that family trees or clusters with the most original mother as cluster center (dark circles) are formed. While the occurrence of the independent cluster centers (blue, red and green) is governed by the unconditional intensity µ(t), the occurrence of descendants (light squares) is governed by the memory kernel h(t), which can be directly seen in the decay of the Hawkes intensity after an event has happened. This picture corresponds to a branching ratio equal to n = 0.89.
Λ_endo = µ n / (1 − n). (2.6)
In equation (2.4) we have calculated the total average intensity, for which Λ = µ + Λ_endo must hold. Taking the result from equation (2.6), this is indeed the case. Now we can calculate the overall proportion of endogenous events among all events.
Λ_endo / Λ = [n/(1 − n)] / [1/(1 − n)] = n. (2.7)
Therefore, in the subcritical regime (n < 1) and in the case of a constant unconditional intensity [µ(t) = µ], the branching ratio n, as the average number of first-generation daughters per mother, can be interpreted as the proportion of endogenous events among all events. The branching ratio is thus a direct quantitative estimate of the degree of endogeneity, i.e. self-excitation.
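The double role of n, as the mean number of first-generation daughters per event and as the endogenous fraction of all events, can be illustrated with a toy branching simulation (a sketch with illustrative parameters, independent of the thesis code; waiting times are irrelevant for the counting and therefore omitted):

```python
import math
import random

random.seed(7)

def poisson_knuth(lam):
    """Sample from Poisson(lam) via Knuth's method (adequate for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def cluster_size(n):
    """Total events in one family tree: 1 mother plus all descendants,
    each event spawning Poisson(n) first-generation daughters."""
    total, current = 1, 1
    while current:
        births = sum(poisson_knuth(n) for _ in range(current))
        total += births
        current = births
    return total

n, immigrants = 0.6, 20000
sizes = [cluster_size(n) for _ in range(immigrants)]
total = sum(sizes)
endo_fraction = (total - immigrants) / total
print(endo_fraction)        # close to n = 0.6, cf. equation (2.7)
print(total / immigrants)   # close to the mean cluster size 1/(1-n) = 2.5
```

The mean cluster size 1/(1 − n) is the same geometric series 1 + n + n² + … that produced equation (2.4), which is why the empirical endogenous fraction and the empirical mean cluster size are tied together.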
For the memory kernel h(τ) there are in principle many functional forms, the most common being an exponential kernel of shorter memory, as used in [1], and a power law kernel of longer memory, as used in [10]. The latter form will be elaborated on in the next section; the former will be briefly addressed afterwards. In order to determine the memory kernel, τ must be extracted and sampled from the data. We need to get these response times by connecting "mother tweets" and re-tweets:
τ_k = t_i − t_j, (2.8)

where t_i and t_j with i, j ∈ [1; N_t(T)] denote the time stamps of the re-tweet and, respectively, its mother tweet; k ∈ [1; N_t(T) − M] is the index of responses. The counting process N_t(T) equals the number of events after the total time T, and M the number of mother tweets. In the context of financial markets or earthquakes it is highly impractical if not impossible to make a causal link between mother and daughter event, such that the functional form has to be estimated for example via maximum likelihood estimation. In contrast, in the context of Twitter there actually is the possibility to make this causal link for certain types of responses, most notably re-tweets, as will be explained in section 4. We are therefore in a very good position to determine a well-founded memory kernel.
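As a minimal illustration of equation (2.8), suppose each tweet record carries its own id, a timestamp, and the id of the original tweet it re-tweets (this record layout is hypothetical, chosen for illustration, and not the actual Twitter API format):

```python
# Hypothetical records: (tweet_id, unix_timestamp, id_of_retweeted_original or None).
# One-click re-tweets always point back to the most original (mother) tweet.
tweets = [
    (1, 1000, None),   # mother tweet
    (2, 1030, 1),      # re-tweet of tweet 1
    (3, 1000, None),   # another mother tweet
    (4, 1500, 1),      # re-tweet of tweet 1
    (5, 1700, 3),      # re-tweet of tweet 3
]

timestamp = {tid: ts for tid, ts, _ in tweets}

# response times tau_k = t_i - t_j as in equation (2.8)
waiting_times = [ts - timestamp[mid] for _, ts, mid in tweets if mid is not None]
print(waiting_times)   # [30, 500, 700]

# share of re-tweets among all tweets: a direct branching-ratio estimate, cf. (2.7)
n_hat = len(waiting_times) / len(tweets)
print(n_hat)           # 0.6
```

Because every re-tweet refers back to the most original tweet, these waiting times sample the renormalized rather than the bare kernel, which is exactly the situation described in section 2.3.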
2.2 Power Laws
In [1] the memory kernel was chosen to be an exponential distribution. However, in this work we will also be dealing with power law (PL) distributions, which generally underlie long memory processes. In principle there are many different kinds of power laws and other distributions that are heavy-tailed. However, focussing on the tails of the distribution, the
power law distribution is a very common and simple one, giving a good first idea for further distributions to be tested. In this section we introduce the notation and particular form for the power laws in this work. Furthermore, an approximation of this power law as a sum of exponentials will be explained in detail as this has computational and algebraic advantages.
Though the variable whose distribution we are interested in, i.e. the waiting time, is discrete due to the precision of measurement (seconds), we treat it as a continuous variable for three reasons. First, response time is in fact continuous and only the measurement makes it discrete. Second, we are considering very large response times up to the order of months, in comparison to which seconds are very small. Third, the mathematical treatment is much simpler, while still providing a good approximation.
Now, a continuous power law distribution of non-negative x is described by a probability density function (PDF) p(x) such that
p(x) dx = Pr(x ≤ X < x + dx) = A (x + c)^{−α} dx, (2.9)
where α is the scaling parameter, X denotes the random variable whose observed value is x, and A is the normalization constant. Shifting the distribution to the left by c ensures that the density does not diverge as x → 0. Using the normalization condition ∫_0^∞ p(x) dx = 1 as well as requiring α > 1, the functional form of the power law investigated here turns out to be

p(x) = (α − 1) c^{α−1} / (x + c)^α. (2.10)

In terms of waiting times this becomes

Φ(t) = θ c^θ / (t + c)^{1+θ}. (2.11)
Here we introduced the notation to match our case of waiting times t sampled from the bare memory kernel Φ(t).2 It is related to the memory kernel h(t) introduced in equation (2.3) as
Φ(t) = h(t)/n. (2.12)
This power law form is often referred to as the Omori law. Note that the notation of the power law exponent as α is often used in cases with a finite lower bound in x, such that c is rather denoted as xmin, e.g. in [13]. In our case and many others, the density needs to remain finite as x approaches zero. Further, the exponent expressed as θ = α − 1 turns out to be more practical when dealing with a renormalization of the exponent, as will be shown in section 2.3. We therefore stick to the notation with θ as in equation (2.11).
2Note that since we will be dealing almost exclusively with waiting times as opposed to time stamps, the letter t will be used as waiting time, whereas τ will be used as characteristic time in the exponential kernel to be introduced in the next section.
Generally, a larger θ corresponds to a faster decay in t, in the context of the bare memory kernel a shorter bare memory. Note that in this work 0 < θ ≤ 1 and that a well-defined mean only exists if θ > 1. Oftentimes it is useful to consider the complementary cumulative distribution function (CCDF) of a PDF. The CCDF will be denoted as P(x) and is defined as
P(x) = Pr(X ≥ x) = 1 − Pr(X < x). (2.13)
In our case the CCDF turns out to be
P (x) = 1− ∫ x
x+ c
)θ . (2.14)
For a given c, the scaling parameter θ can be estimated via maximum likelihood, which has the closed-form solution

θ_N = N [ Σ_{i=1}^{N} ln( (t_i + c) / c ) ]^{−1}, (2.15)

where t_i are the observed values of the response times. A higher lower bound c will reduce the number of sample points N that can be used for the estimation; hence the lower bound occurs with a subscript as c_N. The estimation becomes better for larger N, i.e. more data points, such that lim_{N→∞} θ_N = θ. Equation (2.15) is equivalent to the well-known Hill estimator for the estimation of θ [14].
When producing synthetic data via simulating a Hawkes process, a power law kernel as expressed above comes with the major disadvantage that it cannot be analytically transformed into its renormalized memory kernel (explained in the following section 2.3). This can be overcome by expressing the power law kernel as a sum of exponentials, which approximates the power law behavior well enough for finite waiting times. For a sum of exponentials the transformation from bare to renormalized kernel becomes exactly solvable.3 The power law approximation is adapted from [10] with an additional c-shift, expressed as follows.
Φ(t) = (1/Z) [ Σ_{i=0}^{M−1} (1/ξ_i)^{1+θ} e^{−(t+c)/ξ_i} − S e^{−(t+c)m/c} ],   ξ_i = c m^i,   (2.16)

where Z is the normalization constant ensuring that Φ(t) integrates to unity, and the correction term with amplitude S suppresses the contribution of time scales below ξ_0 = c.

3For legibility the term power law will be used in this work when actually we are using its approximation.
The choice of M and m will determine how well the power law is approximated. In [10] M and m are chosen to be 15 and 5, respectively, such that the most extreme term in the sum will be of the order m^{M−1} = 5^{14} ≈ 10^9, which is sufficient for our purposes in the case of synthetic data. Here we choose maximal process durations of the order 10^6. The goodness of the approximation is visualized in figure 2.
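The essential property behind (2.16), independent of the normalization details, is that exponentials with geometrically spaced time scales ξ_i = c m^i and weights (1/ξ_i)^{1+θ} sum to an effective power law decay t^{−(1+θ)} for c ≪ t ≪ ξ_{M−1}. A stripped-down sketch (omitting the c-shift and the constants Z and S of (2.16)) checks the log-log slope:

```python
import math

theta, c, M, m = 0.5, 1.0, 15, 5
xis = [c * m ** i for i in range(M)]   # geometric ladder of time scales, xi_i = c*m^i

def f(t):
    """Unnormalized sum of exponentials; behaves like t^-(1+theta)
    in the intermediate range c << t << xi_(M-1)."""
    return sum(xi ** (-(1 + theta)) * math.exp(-t / xi) for xi in xis)

# effective log-log slope between two intermediate times
t1, t2 = 1e3, 1e5
slope = (math.log(f(t2)) - math.log(f(t1))) / (math.log(t2) - math.log(t1))
print(slope)   # close to -(1 + theta) = -1.5
```

The residual "oscillation" around the true power law mentioned for figure 2 shows up here as a small log-periodic modulation of the slope with period ln m in ln t.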
2.3 Bare and Renormalized Memory Kernel
In the previous section only the bare memory kernel Φ(t) was explained in detail, which, considering equation (2.12), turns out to be the probability density of a mother event directly triggering a daughter event. In some systems we do not observe such first-order responses but the responses of descendants with respect to their earliest mother, i.e. the immigrant. As the generation of the descendants is generally unknown, the bare response function is hidden and only the renormalized response function R(t) or memory kernel can be determined.4 The two are related via the integral equation
Φ(t) = R(t) − n ∫_0^t Φ(t − t′) R(t′) dt′, (2.17)
as shown in [15]. A standard technique to solve this so-called Volterra integral equation of the second kind is via the Laplace transformation, which is defined as follows.
f(s) ≡ L{f(t)}(s) := ∫_0^∞ f(t) e^{−st} dt, (2.18)
where s ∈ C. More concretely, this is the one-sided Laplace transformation, defined for non-negative values of t ≥ 0. This is suitable for causal systems, i.e. systems in which the past influences present and not vice-versa, which is the case for response functions. Inverse-transforming the function from the complex frequency domain back to the real time domain is defined as follows.
4Renormalized (memory) kernel and response function both refer to R(t). The former expression will be used to discern it from the bare (memory) kernel, whereas the latter will be used when talking about response times.
Figure 2: Head and tail of a sample power law bare kernel. The green curve depicts the ideal, true power law kernel (OL stands for Omori law), the red curve the approximation as given by equation (2.16). An artifact of this approximation is the slight "oscillation" around the true curve. The approximation is good until about 10^9, which is given by the choice of M and m, determining the most extreme term in the sum m^{M−1} = 5^{14} ≈ 10^9.
L^{−1}{f(s)}(t) := { (1/(2πi)) ∫_{γ−i∞}^{γ+i∞} f(s) e^{st} ds, for t ≥ 0;   0, for t < 0 }, (2.19)
where γ must be larger than the constant of absolute convergence of the integral. The causal system with f(t < 0) = 0 is retrieved. Now expressing equation (2.17) in the frequency domain via the Laplace transformation (2.18) yields
Φ(s) = R(s) − n Φ(s) R(s)   ⟺   Φ(s) = R(s) / (1 + n R(s)). (2.20)
Figure 3: Bare memory kernel in comparison with the respective renormalized memory kernel for different branching ratios. The bare kernel (green) is of the simplest form suitable for our case, (1/τ) e^{−t/τ}, which analytically transforms into (1/τ) e^{−t(1−n)/τ} for the renormalized kernel. A low branching ratio of n = 0.1 (magenta, dashed) and a high branching ratio of n = 0.9 (red) are depicted. What is striking is that the memory is dramatically increased near criticality. In fact, for n = 1 the renormalized kernel becomes a constant of 1/τ. For even higher, i.e. supercritical, n the system's memory is no longer decreasing but increasing over time, which might result in an explosion of events.
So in order to retrieve Φ(t) from R(t), the latter must be Laplace transformed such that equation (2.20) can be applied and Φ(s) obtained, which then must be inverse-Laplace transformed. Not surprisingly, the Laplace transformation is straightforward for exponential or trigonometric functions, but analytically not solvable for power law resolvents [15]. In order to illustrate the relationship between bare and renormalized kernel, simple functions for different n are plotted in figures 3 to 6.
A higher branching ratio n will generally result in a longer effective memory of the system. Intuitively this makes sense, as a higher number of offspring increases the overall lifetime of the cluster or "family tree" the offspring are part of. The cluster center or the mother of the family tree is thus able to indirectly trigger events that lie further in the future, i.e. the effective or renormalized memory is longer. Consequently, the renormalized kernel cannot be regarded as a PDF, because unlike the bare kernel it does not integrate to unity but to a value K that must be greater than unity and dependent on n. This value K turns out to be
K = 1 / (1 − n). (2.21)
A detailed derivation of this result can be found in the appendix 8.1.
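Equation (2.21) is easy to verify numerically for the exponential example of figure 3, whose renormalized kernel is (1/τ) e^{−t(1−n)/τ} (a quick sketch with illustrative parameters):

```python
import math

tau, n = 1.0, 0.9

def R(t):
    # renormalized kernel of the exponential bare kernel (cf. figure 3)
    return math.exp(-t * (1 - n) / tau) / tau

# trapezoidal rule far into the tail (the decay scale is tau/(1-n) = 10)
dt, T = 0.001, 200.0
vals = [R(i * dt) for i in range(int(T / dt) + 1)]
K = dt * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
print(K)   # close to 1/(1-n) = 10
```

Note how sensitive K is to n: doubling the distance to criticality from 1 − n = 0.1 to 0.2 halves the total expected response per immigrant.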
Figure 4: Bare memory kernel in comparison with the respective renormalized memory kernel for different branching ratios. The bare kernel (green) is a power law of the form (2.11), approximated by a sum of exponentials as in [10] and expressed in its shifted version in equation (2.16). A low branching ratio of n = 0.1 (magenta, dashed) and a high branching ratio of n = 0.9 (red) are depicted. The effect of renormalizing a kernel via n becomes clearer when looking at the mode: the memory is increased in such a way that the PDF is stretched right-upwards.
For the simplest case of an exponential bare kernel, i.e.
Φ(t) = exp(−t/τ)/τ, (2.22)
the behavior of its renormalized kernel, which turns out to be
R(t) = exp(−t(1 − n)/τ)/τ, (2.23)
illustrates very well why the case of n = 1 separates a sub-critical from a super-critical regime, as mentioned in section 2.1. For n ≥ 1 the exponential changes from decaying to growing. That is, in the supercritical case the system's memory is no longer decreasing but increasing over time, which may result in an explosion of events. Clearly, R(t) does not integrate to unity for finite n > 0, nor even to a finite value for n ≥ 1. A near-critical case is depicted in figure 3.
Another strictly decaying bare kernel is the power law kernel, which has substantial probability mass in its long tail, making it a long-memory kernel. Since the transformation to the corresponding renormalized kernel is analytically possible only asymptotically [15], we approximate the power law as a sum of exponentials as described at the end of section 2.2, such that an analytical transformation is possible. The kernel behavior for this case is depicted in figure 4. Here the effect of a finite n becomes clearer: the memory is increased in such a way that the PDF is stretched right-upwards.
Figure 5: Bare memory kernel in comparison with the respective renormalized memory kernels for different branching ratios. The bare kernel (green) and the renormalized kernels of different branching ratios (magenta for low, red for high) are the same as in figure 4, expressed by equation (2.16): after an initial phase with a lower exponent (t ≈ 100), the exponent becomes the same as for the bare kernel. The situation is quite different for the critical (dotted purple) and supercritical (dotted blue) case. In the former case the exponent is renormalized from 1+θ to 1−θ, i.e. a shorter bare memory results in a longer renormalized memory. This paradoxical behavior will be discussed in detail in section 2.4. In the latter case the power law decay transitions to an exponential increase, which reproduces the results in [11].
In figure 5 we investigate the behavior in the long tail by plotting the previous case on log-log-axes. For a subcritical branching ratio, the memory is temporarily increased but then transitions to the same exponent. In a log-log plot this is reflected by parallel curves. For a critical branching ratio, on the contrary, the situation is quite different. The curves are not parallel anymore. In fact, the exponent is renormalized from 1 + θ to 1− θ, i.e. a shorter bare memory results in a longer renormalized memory. This paradoxical behavior called “anomalous scaling” will be explained in detail in section 2.4. For completeness also a supercritical branching ratio is included. Here the power law decay transitions to an
exponential increase. Both transitions occur for t ≈ t∗ = c (n Γ(1 − θ) / |1 − n|)^{1/θ}, a result from [11].
Note that also for a power law approximated by a sum of exponentials as in (2.16) this result is correctly reproduced.
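For concrete parameter values the crossover time is easily evaluated; a minimal sketch (the function name t_star is ours):

```python
import math

def t_star(n, theta, c=1.0):
    """Crossover time t* = c * (n * Gamma(1 - theta) / |1 - n|)**(1/theta)."""
    return c * (n * math.gamma(1.0 - theta) / abs(1.0 - n)) ** (1.0 / theta)

print(round(t_star(0.9, 0.5), 1))            # ≈ 254.5 for n = 0.9, theta = 0.5, c = 1
print(t_star(0.99, 0.5) > t_star(0.9, 0.5))  # True: closer to criticality, later crossover
```

As the formula suggests, pushing n towards criticality postpones the transition back to the bare exponent, which is why the renormalized phase is best observed near n = 1.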
Now we look at a more elaborate case in figure 6, with a bare kernel of the form Φ(t) ∼ e^{−at} cos²(bt) + e^{−ct}, where a, b, c ∈ R⁺. Again, the stretching is right-upwards to the positive region. This becomes most evident when looking at the first and second mode: the second mode has become the largest. Increasing the branching ratio even more would put more mass on the third mode, and so on. Despite the supercritical n, the overall memory is decreasing, such that an explosion of events is very unlikely.
Figure 6: Bare memory kernel in comparison with the respective renormalized memory kernels for different branching ratios. The bare kernel (green) is of the form e^{−at} cos²(bt) + e^{−ct}, where a, b, c ∈ R⁺, in order to provide a simple multimodal PDF for which the Laplace transform and its inverse can be performed analytically. Renormalized kernels of a low branching ratio of n = 0.1 (magenta, dashed) and a high branching ratio of n = 0.9 (red) are depicted. A supercritical branching ratio of n = 2.6 (blue, dotted) was chosen to illustrate even more clearly the stretching of the bare kernel. Again, the stretching is right-upwards to the positive region. This becomes most evident when looking at the first and second mode: the second mode has become the largest. Increasing the branching ratio even more would put more mass on the third mode, and so on. Despite the supercritical n, the stretching of the modes decreases, such that an explosion of events seems unlikely.
All the relations above were obtained by solving for Φ(t) via the Laplace transformation as described in equations (2.18) - (2.20). However, this method relies on the knowledge of the functional form of the resolvent R(t), which even for the common, relatively simple power law case is no longer analytically viable. Furthermore, since resolvents sampled from complex systems will most likely not follow a clear simple functional, i.e. parametric, form, non-parametric numerical techniques have to be employed. Generally in such methods, sampled data points are not fitted to parametric models but processed as such, which is more faithful to the data themselves. The detailed methodology is described in section 3.3.
2.4 Anomalous Scaling
While the general relationship between bare and renormalized kernel was discussed in the previous section, this section focuses on the special case at criticality, where for a power law kernel a larger θ results in a shorter bare memory but a longer memory of the response function (see figure 8, bottom). Intuitively, one would actually expect that the renormalized kernel describing the response of the aggregate system of all generations is simply an “up-scaled” version of the bare kernel describing the response on a single-generational level (see figure 8, top). For this reason this paradoxical behavior is referred to as anomalous scaling. For the case of 0 < θ < 1 a number of authors have shown that R(t) ∼ 1/t^{1−θ} for n = 1 (synthesis in [16]). But also in the subcritical case there is a phase with renormalized exponent, i.e. when
t < t∗ = c (n Γ(1 − θ) / |1 − n|)^{1/θ} (2.24)
as derived in [11] and visualized in figure 7. Note that in this work the case of θ > 1 is not examined. Note also that even for exponential memory kernels anomalous scaling can be observed, however only in the supercritical regime; this can be easily derived from equation (2.23). This section is largely a reproduction of the intuitive derivation of the above paradox in [15]. However, more detail, in particular with regard to the assumptions made, is added in order to be able to actually test the argumentation with synthetic data.
We start by realizing that the total activity A(t) after one exo-event at t = 0 can be decomposed into the contributions per generation k and is directly proportional to the response function R(t) via the branching ratio n:
A(t) = ∑_{k=1}^{K(t)} A_k(t), (2.25)

A(t) = nR(t). (2.26)
Here K(t) denotes the highest generation triggered until time t such that 0 < k ≤ K(t). The latter equality holds for the general case of a sudden impulse happening at t = 0 (see
Figure 7: 3D plot of the crossover time t∗ as given by equation (2.24). The variable c has been set to 1 as it enters the equation as a simple prefactor. Generally, a choice of both n and θ close to 1 results in a higher crossover time. For θ → 0 the crossover becomes a sudden transition at n = 0.5. For θ → 1 the crossover becomes independent of n. For intermediate values of θ the branching ratio should also be chosen rather close to criticality for the crossover to be clearly visible.
equation 6 in [15]) and the response of the system to this impulse. Now we can look into the activity per generation Ak(t) which is analogous to equation (2.26),
A_k(t) = n^k Φ_k(t), (2.27)
and from there determine what A(t) would look like given equation (2.25), for which K(t) has to be determined as well. The activity per generation in equation (2.27) is composed of the number of offspring nk and the bare memory kernel Φk(t). Since we are on the first-order, single-generational level, it only makes sense to take the bare kernel. Here the subscript k refers to the fact that the short time cutoff will not be c as in (2.11) but the only characteristic time scale associated with generation k instead, i.e. the typical waiting time in that generation tk. The bare memory kernel per generation is hence expressed as
Φ_k(t) = θ t_k^θ / (t + t_k)^{1+θ}. (2.28)
To determine the relationship between t_k and k, it is easier to first look into the analogous aggregate case of K(t) depending on t. For this, consider the chain of events leading up to the occurrence of the highest generation
t = t_1 + t_2 + ... + t_{K(t)} ≈ K(t)⟨t⟩, (2.29)
where t2 denotes the waiting time between immigrant and second-order descendant, for
Figure 8: Bare (green) and renormalized (red) memory kernels for different exponents. Top: Exponential kernel with low exponent of τ = 0.5 (dashed) and high exponent of τ = 2 (thick) at near-criticality (n = 0.9). The scaling behavior can be checked by comparing those two exponents: Decreasing the exponent will decrease both the bare and renormalized memory (lower slope from thick to dashed). Bottom: Power law kernel with low exponent of θ = 0.4 (thick) and high exponent of θ = 0.9 (dashed) at criticality. Comparing these two exponents reveals an anomalous scaling: Increasing the exponent will decrease the bare memory (lower slope from thick to dashed) but increase the renormalized memory (higher slope from thick to dashed). Anomalous scaling therefore is a fundamental property of long memory processes.
example. The approximation becomes an equality when considering infinitely many independent realizations for a given K(t) and t. The time average ⟨·⟩ is calculated as

⟨t⟩ = ∫_0^{t_max(t)} t Φ(t) dt ≈ (θ c^θ / (1 − θ)) t_max(t)^{1−θ}. (2.30)
Note that here we take advantage of the fact that there is no waiting time beyond t_max(t) in a finite set of K(t) generations, such that the integration goes only until t_max(t) and becomes finite itself. In the last approximation we assumed that c ≪ t_max(t). To continue we have to determine t_max(t), which can be done by checking the probability of finding a t_k equal to or larger than t_max(t) in the chain of K(t) events above. By consistency this probability should be roughly 1:
1 ≈ K(t) P[X ≥ t_max(t)] = K(t) ∫_{t_max(t)}^∞ Φ(t) dt = K(t) (c / (t_max(t) + c))^θ. (2.31)
By again assuming that c ≪ t_max(t), t_max(t) can finally be written as

t_max(t) ≈ c K(t)^{1/θ}, (2.32)
which plugged into equation (2.30) and using equation (2.29) results in

t ≈ (θc/(1 − θ)) K(t)^{1/θ}, (2.33)

and hence

K(t) ∼ (t/c)^θ. (2.34)

Analogously, the typical waiting time in generation k scales as

t_k ∼ c k^{1/θ}. (2.35)
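The truncated mean ⟨t⟩ ≈ θc^θ/(1−θ) · t_max^{1−θ} used in this chain can be checked by direct numerical integration of the power law kernel (2.11); a small sketch with illustrative parameters:

```python
import numpy as np

# Check of the truncated mean <t> = int_0^tmax t Phi(t) dt
# ≈ theta * c^theta / (1 - theta) * tmax^(1 - theta), with
# Phi(t) = theta * c^theta / (t + c)^(1 + theta) and c << tmax.
theta, c, tmax = 0.5, 1.0, 1.0e4

t = np.logspace(-6, np.log10(tmax), 200_000)
f = t * theta * c ** theta / (t + c) ** (1.0 + theta)
mean_num = np.sum(0.5 * (t[1:] - t[:-1]) * (f[1:] + f[:-1]))   # trapezoid rule

mean_approx = theta * c ** theta / (1.0 - theta) * tmax ** (1.0 - theta)
ratio = mean_num / mean_approx
print(round(ratio, 2))  # ≈ 0.98, i.e. the approximation is good for c << tmax
```

For c four decades below t_max the approximation is already accurate to about 2%, which is the level of precision the scaling argument requires.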
Now returning to equation (2.27) and requiring that t_k ≪ t, we can write the activity per generation as
A_k(t) ∼ n^k θ k / t^{1+θ}, (2.36)
which plugged into the equation (2.25) for the total activity results in

A(t) ∼ θ K(t)^2 / t^{1+θ} (2.37)
for the critical case of n = 1. The quadratic relationship between A(t) and K(t) will be dramatically less pronounced in the subcritical case. Now by including the behavior of K(t) given in equation (2.34), the behavior of the total activity becomes
A(t) ∼ K(t)^2 / t^{1+θ} ∼ 1/t^{1−θ}, (2.38)
such that also R(t) ∼ 1/t^{1−θ}. Therefore, the larger θ is, the shorter the memory of 1/t^{1+θ}, but the faster growing is the number of generations that are triggered up to time t. Thus the resulting slower decaying renormalized response function ∼ 1/t^{1−θ} results from the fact that the number of triggered generations grows sufficiently fast according to equation (2.34) so as to more than compensate the shorter memory associated with a larger θ. This anomalous scaling results fundamentally from the property that the expected waiting time is infinite for 0 < θ < 1 [15]. For a large θ the probability mass of the bare kernel is concentrated at short times, resulting in more early offspring.
To summarize, for a power law kernel at criticality we need to ensure the following in a simulation in order to reproduce anomalous scaling.
1. t_k ≪ t, which is fulfilled for sufficiently large t such that K(t) spans several decades.

2. There is at least one t_k in the highest-generation cluster that is of the order of t_max (assumption for (2.31)).

3. We need to consider intact chains of events such that t ≈ K(t)⟨t⟩.
4. For simulations only at near-criticality we further need to ensure that t < t∗.
3.1 Data Simulation
In order to empirically investigate the relationship between bare and renormalized kernel, the synthetic data need to be such that we know the corresponding immigrant for every descendant event. With this information it is possible to directly sample the waiting times for the renormalized kernel in order to determine the shape of R(t). Via the knowledge of n, which will simply be the ratio of descendants over all events, the shape of Φ(t) can in principle be determined as described in section 2.3; how this is realized will be described in section 3.3. Of course, conventional methods such as maximum likelihood estimation (MLE) can be applied to directly determine the shape of Φ(t). However, as already concluded in the previous section, non-parametric methods, i.e. not MLE, are more suitable in cases where the underlying laws are unknown.
This section briefly outlines how the synthetic data are simulated and sampled for the above-mentioned purpose. In the following section 3.2, the determination of the shape of R(t) via the adaptive kernel density estimation (AKDE) will be described in detail.
There are two commonly used algorithms for simulating a Hawkes process. The first is the thinning procedure, in which each event is simulated according to the intensity as given in equation (2.1). The disadvantage for our purposes is that we can infer neither which event is an immigrant nor which event triggered which subsequent event. Moreover, the numerical complexity goes as O(N²), such that the simulation of long time series at near-criticality would be prohibitively expensive [17]. For a more detailed description of this method refer to [9]. The second algorithm simulates immigrants and descendants separately: the former as a Poisson process, the latter via the bare memory kernel. Obviously, it is straightforward to determine n here. This option is also well-suited for recording the information on which event triggered which event, as briefly outlined in the following.
First, the immigrants are simulated as a homogeneous Poisson process via the aforementioned thinning procedure, except that the intensity is simply λ(t) = µ = const. ((2.1)), which needs to be specified. As a Poisson process by definition produces independent events, these immigrants act as centers of independent clusters. Second, these clusters are filled with descendants: all the immigrants with k = 0 are taken and their descendants with k = 1 simulated in parallel via another thinning procedure, with the intensity being λ_i(t) = nΦ(t − t_i) = h(t − t_i), which needs to be specified. Here t_i is the time stamp of the mother or immigrant i ∈ [1;M]. This is repeated in parallel for all clusters until the cessation of the last offspring, i.e. the youngest daughter without children. In this way the branching structure of the entire process can be reconstructed perfectly, and the numerical complexity goes only as O(µTK), where T is the simulation time window and K the number of generations in this window [17].
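The two-step cluster algorithm can be sketched as follows. This is a simplified illustration, not the thesis code: it draws offspring counts directly from a Poisson distribution with mean n instead of thinning, uses an exponential bare kernel, and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_clusters(mu, n, tau, T):
    """Cluster-based simulation: immigrants as a homogeneous Poisson process
    on [0, T]; each event then spawns a Poisson(n) number of children whose
    waiting times are drawn from an exponential bare kernel with scale tau.
    Returns a list of (time, generation, cluster_id) tuples."""
    num_imm = rng.poisson(mu * T)
    events = [(t, 0, i) for i, t in enumerate(np.sort(rng.uniform(0.0, T, num_imm)))]
    queue = list(events)
    while queue:
        t, k, cid = queue.pop()
        for _ in range(rng.poisson(n)):          # offspring count, mean n
            child = (t + rng.exponential(tau), k + 1, cid)
            events.append(child)
            queue.append(child)
    return events

events = simulate_clusters(mu=5.0, n=0.5, tau=1.0, T=200.0)
frac_desc = sum(1 for _, k, _ in events if k > 0) / len(events)
print(round(frac_desc, 3))  # close to n = 0.5
```

Because every event carries its cluster ID and generation, the branching ratio follows directly as the fraction of descendants, exactly as described in the text.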
Now each simulated event is defined by its time stamp t, its generation k, and its
unique cluster-ID i. By taking the time stamp of the earliest event of each cluster, i.e. the smallest t or equivalently the event with k = 0, and subtracting it from the time stamp of each remaining event in that cluster, one obtains the positive waiting times for the renormalized kernel R(t), fulfilling the requirement t ↦ R(t), t ∈ R⁺. Hence in this work we refer to those waiting times as renormalized waiting times, or rwt in the code. The code get_rwt.m can be found at https://github.com/SRustler/Shocks_and_Self-Excitation_in_Twitter.
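A minimal Python sketch of this extraction (a stand-in for the thesis's get_rwt.m routine; the tuple layout (time, generation, cluster_id) is our assumption):

```python
def get_rwt(events):
    """Renormalized waiting times: each descendant's delay relative to its
    cluster's immigrant (the k = 0 event).
    Assumed event layout: (time, generation, cluster_id)."""
    root = {cid: t for t, k, cid in events if k == 0}
    return [t - root[cid] for t, k, cid in events if k > 0]

events = [(0.0, 0, 0), (1.5, 1, 0), (2.5, 2, 0), (10.0, 0, 1), (10.25, 1, 1)]
print(get_rwt(events))  # [1.5, 2.5, 0.25]
```

Note that every waiting time refers back to the cluster's most original event, mirroring the one-click re-tweet mechanism discussed in the abstract.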
With this detailed knowledge of the cluster structure, many other dependencies can be investigated, such as the number of events per generation N_k(k) or the cluster size distribution N_s(s). The further processing of the waiting times sampled in this way is discussed in the following section.
3.2 Adaptive Kernel Density Estimation5
The most straightforward way of obtaining the underlying PDF of a sampled random variable X is to make a normalized histogram. However, this has several disadvantages: first, the estimate is very crude; second, the resulting PDF is not smooth, which may pose difficulties for further numerical treatment. A method to overcome these is kernel density estimation (KDE), which for each sampled data point stacks a smooth kernel as opposed to a discrete box of the bin width. The kernel needs to be specified in type and bandwidth. Most commonly it is a kernel following a normal distribution centered at µ = X with the standard deviation σ as bandwidth. Obviously, a smaller bandwidth will allow for more detail in the KDE. It may, however, be sensible to choose different kernels, e.g. an exponential kernel for an exponential memory kernel.
In our case of strictly decaying PDFs, with very dense points at the head and very scarce points in the tail of the distribution, a simple KDE reaches its limits: a small bandwidth is appropriate for the many data points at the head but allows for too much detail in the tail. Conversely, a large bandwidth is appropriate for the few data points in the tail but allows for too little detail at the head: at the head, for t = 0, there is an abrupt drop to zero, which requires a very small bandwidth in order to be captured well. Clearly, the bandwidth needs to vary along the distribution.
The method to realize this bandwidth variation is adaptive kernel density estimation (AKDE). In principle, there are many ways to determine a dynamic optimal bandwidth; it could be directly linked to the scarcity of data or the distance to the nearest neighbors, for example. Since we are dealing with strictly decaying PDFs in the case of synthetic data, it is sensible to take a faster and simpler route and vary the bandwidth linearly with the abscissa. In this work we chose σ = X/2, such that our AKD-estimated PDF with normal kernel can be written as
5“Kernel” in the context of AKDE is not to be confused with the memory kernel of the previous section on the theoretical background.
p(t) = (1/N) ∑_{i=1}^N 1/(√(2π) σ_i) · exp(−(t − x_i)² / (2σ_i²)), σ_i = x_i/2, (3.1)
where X ≡ x_i. Strictly speaking, though, this is not adaptive to the data. For empirical data of response times it may happen that the responses peak at certain times, in which case the bandwidth needs to be smaller at those times. The easiest way to realize this is to link the bandwidth to the probability of the response time. However, as will be shown later in section 5.1.2, allowing for this much detail in the determination of the renormalized memory kernel might impair the numerical calculation of the bare memory kernel.
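Equation (3.1) translates directly into code; a sketch follows. Dropping the few smallest samples is our implementation convenience, not part of (3.1): their bandwidths x_i/2 would fall below the evaluation-grid resolution.

```python
import numpy as np

def akde(sample, t):
    """Adaptive KDE with Gaussian kernels and per-point bandwidth
    sigma_i = x_i / 2, as in equation (3.1)."""
    x = np.asarray(sample)[:, None]                 # shape (N, 1)
    sigma = x / 2.0
    kern = np.exp(-((t[None, :] - x) ** 2) / (2.0 * sigma ** 2))
    kern /= np.sqrt(2.0 * np.pi) * sigma
    return kern.mean(axis=0)

rng = np.random.default_rng(0)
sample = rng.exponential(1.0, 500)
sample = sample[sample > 0.1]   # drop points whose bandwidth the grid cannot resolve
t = np.linspace(-10.0, 30.0, 4001)
p = akde(sample, t)
mass = p.sum() * (t[1] - t[0])
print(round(mass, 2))  # ≈ 1.0
```

Each Gaussian integrates to one, so the estimate remains normalized; for renormalized waiting times it is subsequently rescaled according to (3.2).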
A common feature of the memory kernels used for data simulation is a sharp peak right at t = 0, arising from the requirement t ↦ R(t), t ∈ R⁺. As a consequence, for finite bandwidths there will always be an underestimation at the head. By employing a sharp exponential or power law kernel (finite for t > 0 only) as opposed to a normal kernel (finite for any t), this underestimation can in principle be cured. However, the very fact that the kernel is finite only on the positive half-axis introduces an upward bias in the distribution.
A better method to cure the underestimation at the head of the distribution is to apply a post-AKDE reflection of negative probability mass, or “post-reflection” for short, about the ordinate. In this approach we allow probability mass to occur for negative waiting times, i.e. we calculate the PDF also for t < 0 at first. Once the calculation for each sampled data point is completed, the negative probability mass is reflected with respect to the ordinate. Obviously the PDF will be decaying towards 0−, such that the head at 0+ is lifted up most strongly by this procedure, whereas the tail is barely affected. As a result the underestimation can be partially cured without introducing a bias in the rest of the distribution.
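The post-reflection step can be sketched as follows (the function name is ours; np.interp performs the sign-flipped lookup of the negative branch):

```python
import numpy as np

def post_reflect(t, p):
    """Post-AKDE reflection: fold the probability mass computed for t < 0
    back onto the positive half-axis (reflection about the ordinate)."""
    pos = t >= 0
    t_pos, p_pos = t[pos], p[pos].copy()
    p_pos += np.interp(-t_pos, t, p, left=0.0, right=0.0)  # add p(-t) onto p(t)
    return t_pos, p_pos

# Toy check with a standard normal density: after reflection all mass sits
# on t >= 0 and still sums to ~1.
t = np.linspace(-5.0, 5.0, 1001)
p = np.exp(-t ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
t_pos, p_ref = post_reflect(t, p)
mass_ref = p_ref.sum() * (t[1] - t[0])
print(round(mass_ref, 2))  # ≈ 1.0
```

As described in the text, the folded mass decays towards 0−, so the correction is strongest right at the head and negligible in the tail.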
As described earlier in this work, we will be sampling renormalized waiting times constituting the renormalized memory kernel normalized to 1/(1 − n), as opposed to the bare memory kernel normalized to 1. The distribution estimated by the above-described AKDE therefore has to be scaled as
RAKDE(t) = p(t)/(1− n). (3.2)
3.3 Numerical Reconstruction of Bare Memory Kernel
As described in the previous section, the KDE results in a set of data points that constitute the renormalized memory kernel R_AKDE(t). In order to obtain the bare memory kernel Φ(t), these data points could in principle be fitted to a parametric form that could be analytically solved via Laplace transformations. This, however, introduces errors and is not sensible, as we do not know the underlying law of the data points. Moreover, analytical solutions do not exist even for the simple power law case. It is more faithful to the data if a
non-parametric processing of the data is performed. Concretely, equation (2.17) can be numerically solved by exploiting the fact that such Volterra integral equations are causal, i.e. they integrate only up to the current time t. Since we expect the shape of both the renormalized and the bare kernel to be very mild, i.e. free of singularities and very smooth, standard methods of numerical integration should be able to solve for the bare kernel [18]. How this is performed is described in detail in this section.
Changing variables in equation (2.17) as x = t − t′ results in a single-variable Φ(x) in the integral:

Φ(t) = R(t) − n ∫_0^t Φ(x) R(t − x) dx, (3.3)
which we can approximate by rectangles as
Φ(t_i) = R(t_i) − n ∑_{j=0}^{i−1} Φ(t_j) R(t_i − t_j) (t_{j+1} − t_j). (3.4)
In this form it becomes evident that we can now easily take the values of Φ(tj) that have been calculated prior to Φ(ti) since j < i and Φ(t0) = R(t0) if t0 ≈ 0. With the knowledge of the bare kernel at t0 it can be solved successively for increasing i:
Φ(t_1) = R(t_1) − nΦ(t_0)R(t_1 − t_0)Δt
Φ(t_2) = R(t_2) − nΦ(t_0)R(t_2 − t_0)Δt − nΦ(t_1)R(t_2 − t_1)Δt
Φ(t_3) = R(t_3) − nΦ(t_0)R(t_3 − t_0)Δt − nΦ(t_1)R(t_3 − t_1)Δt − nΦ(t_2)R(t_3 − t_2)Δt
...
Φ_N = R_N − nΦ_0 R_{N,0}Δt − ... − nΦ_{N−1} R_{N,N−1}Δt

and so on until the maximum waiting time is reached. In the last line the abbreviated notation R_{i,j} ≡ R(t_i − t_j) was used. Also note that since the t_i are equispaced, t_{i+1} − t_i = Δt, which speeds up the calculation. Moreover, the computation is sped up by the fact that for all i, j the respective R_{i,j} are known. The rectangular approximation of equation (3.3) can be improved by employing a trapezoidal approximation:
Φ_i = R_i − nΔt ( (Φ_0 R_{i,0} + Φ_{i−1} R_{i,i−1})/2 + ∑_{j=1}^{i−2} Φ_j R_{i,j} ). (3.5)
A quadratic approximation is usually better and is achieved by implementing the composite Simpson's rule:

Φ_i = R_i − n (Δt/3) ( Φ_0 R_{i,0} + 4Φ_1 R_{i,1} + 2Φ_2 R_{i,2} + 4Φ_3 R_{i,3} + ... + Φ_{i−1} R_{i,i−1} ). (3.6)
Again, the fact that only Φ_j already calculated prior to the current Φ_i are needed is exploited. The first and last term in the parentheses of both methods correspond to the values j = 0 and j = i − 1, respectively. Obviously, for all the numerical integration methods above the computational effort goes as O(N²/2).
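The successive scheme is easy to implement and can be validated against the exponential pair (2.22)/(2.23), for which both kernels are known in closed form. A sketch using the rectangular rule (the function name is ours):

```python
import numpy as np

def bare_from_renormalized(R, n, dt):
    """Successive solution of the rectangular scheme (3.4):
    Phi_i = R_i - n * dt * sum_{j<i} Phi_j * R(t_i - t_j)."""
    Phi = np.zeros_like(R)
    Phi[0] = R[0]
    for i in range(1, len(R)):
        Phi[i] = R[i] - n * dt * np.dot(Phi[:i], R[i:0:-1])
    return Phi

# Validation against the exponential pair (2.22)/(2.23) with tau = 1, n = 0.5:
# feeding R(t) = exp(-t(1-n)) should recover Phi(t) = exp(-t).
n, dt = 0.5, 0.005
t = np.arange(0.0, 10.0, dt)
Phi_rec = bare_from_renormalized(np.exp(-t * (1.0 - n)), n, dt)
err = np.max(np.abs(Phi_rec - np.exp(-t)))
print(err < 0.05)  # True
```

The recursion only ever uses values of Φ computed at earlier grid points, which is exactly the causality property of the Volterra equation exploited in the text.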
3.4 Extracting Highest Generation over Time
Section 2.4 gave an intuitive theoretical explanation of anomalous scaling; this section explains the approach to confirming it empirically.
The most straightforward way to empirically demonstrate anomalous scaling via synthetic data is to compare the generating bare kernel and the resulting renormalized kernel for different values of θ. For the sake of completeness this will be included in the results section under 5.1.4. However, to demonstrate the actual cause of anomalous scaling, namely the over-compensating number of offspring with increasing θ, the relationship (2.34) has to be confirmed. By extracting the highest generation over time K(t), we can test for this relationship. The naive approach is to store the highest generation at any point in time. Obviously, this needs to be repeated for several simulations, because for one simulation the resulting K(t)-curve will be a very crude “staircase” function. However, this approach will in fact result in a weaker growth of K(t), as it violates a key assumption made in the derivation in 2.4: the necessity of intact chains of events at time t as expressed in equation (2.29). In order to adhere to this requirement, we exploit the fact that per definition the chain of events is ongoing within a cluster. Moreover, clusters are independent of each other, such that we can average over all clusters, performing only one long simulation. The second equality of equation (2.29) holds better for lower K, because there are more of those events in the system. For the equality to also hold for larger K, a high unconditional intensity µ and a high θ can be chosen. Of course, with this approach t in K(t) will not be the system's time but the cluster's time, i.e. exactly the renormalized waiting time. For this reason extracting K(t) is performed via the function get_rwt.m, explained in section 3.1.
When considering clusters individually, we strictly take one event and one waiting-time value per cluster and per generation K in order to ensure proper averaging. The question, however, is which event of generation K to choose for any cluster. Assumption number 2 at the end of section 2.4 provides a criterion here: in order to make it more likely for t_max to occur amongst the different t_k for a given K > k > 0, we choose the latest-occurring event of generation K.
3.5 Characterizing Response Functions
In order to compare the different non-parametric memory kernels, several metrics have to be applied. One of those metrics is the power law exponent θ assuming that the tail starting from the short time cutoff c follows a power law. Equation (2.15) provides a straightforward
way to determine θ = α − 1 for a given c. However, the parameter c may not be easily determined. In this case the estimated θN needs to be plotted against cN and a plausible choice of both values has to be made. For high values of cN, i.e. looking at the tip of the tail only, the estimate of θN will be far off, as the number of data points N is very low. With decreasing cN and hence increasing N, the estimate improves, which manifests in the curve of θN against cN converging to θ. However, this will only go as far as the PDF is a straight line in the log-log plot. For a PDF of the form (2.11), bending will occur in the region of the true c. Figure 9 shows this approach applied to sample data for both power law kernels of the form (2.11) as well as non-shifted power law kernels, i.e. without lower bound (red curve, right column). When sampling the renormalized kernel, bending will occur at even higher values, i.e. at the crossover time as defined in equation (2.24), which shrinks the region of stable θN even more. In fact, the determined cN then needs to be interpreted as the crossover time t∗ as opposed to the short time cutoff, from which the actual cN can be extracted via equation (2.24). Consequently, determining θN and cN is a non-trivial task.
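The Hill estimator itself is a one-liner; the following sketch recovers a known θ from a synthetic non-shifted power law sample generated by inverse-transform sampling (variable and function names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def hill(x, c_n):
    """Hill estimator of the tail exponent theta for the tail above c_n."""
    tail = x[x >= c_n]
    return len(tail) / np.sum(np.log(tail / c_n))

theta, c = 0.8, 2.0
u = 1.0 - rng.random(100_000)      # u in (0, 1]
x = c * u ** (-1.0 / theta)        # non-shifted power law sample, PDF ~ t^-(1+theta)
print(round(hill(x, c), 3))  # close to the true theta = 0.8
```

For a non-shifted sample the estimate is stable all the way down to c, which is exactly the behavior shown in the left column of figure 9; the c-shift of (2.11) destroys this stability near the cutoff.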
For this reason, in addition to our estimates we will employ the power law test as implemented in plfit.m and plvar.m from [19], which estimate xmin and α, and their respective errors, as described in [13]. The procedure combines MLEs with goodness-of-fit tests based on the Kolmogorov-Smirnov statistic and likelihood ratios. The results of these tests will be plotted in the same figure as the Hill estimation in order to enable a direct comparison. The description of the codes from [19] reads as follows: “The fitting procedure works as follows:
• For each possible choice of xmin, we estimate alpha via the method of maximum likelihood, and calculate the Kolmogorov-Smirnov goodness-of-fit statistic D.
• We then select as our estimate of xmin, the value that gives the minimum value D over all values of xmin.”
Here xmin denotes the lower bound of the random variable above which the power law is fitted; in our case this is c. α denotes the power law exponent or scaling parameter as already used in section 2.2. Recall that α = θ + 1.
We further characterize the response time distribution by simple metrics that focus on the head and the body of the distribution. Concretely, we provide the median, i.e. the time until which 50% of the responses have happened, as well as the 90%-quantile and the 99%-quantile. All of these are directly visible in the CCDF of the response time distribution.
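These quantile metrics follow directly from the sampled response times; a minimal sketch (the function name is ours):

```python
import numpy as np

def response_metrics(waiting_times):
    """Median, 90%- and 99%-quantile of the sampled response times."""
    q50, q90, q99 = np.quantile(waiting_times, [0.5, 0.9, 0.99])
    return {"median": float(q50), "q90": float(q90), "q99": float(q99)}

m = response_metrics(np.arange(1, 101))
print({k: round(v, 2) for k, v in m.items()})  # {'median': 50.5, 'q90': 90.1, 'q99': 99.01}
```

Unlike the tail exponent, these metrics make no parametric assumption at all, which makes them robust companions to the Hill and MLE estimates.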
Figure 9: Test of the performance of the Hill estimator on known sample data. The top subplots show the theoretical PDF and the respective histogram for two different distributions of θ1 = 0.8 (blue) and θ2 = 0.992 (red). The bottom subplots show the respective curves of the Hill estimator
θN against the lower bound cN. Left: The two known distributions are of the form (2.11), however without the c-shift in t, such that the curve diverges at t = 0 (with the shift, the PDF would only start being finite at t = c as opposed to t = 0). As a result the distribution follows a straight line on [c; ∞) in the log-log plot. This favors the Hill estimation, as can be seen in the bottom plot, where at approximately t = 10 a stable region is reached and the power law exponents are well recovered (green). The fact that the region of stability persists down to the smallest t-value indicates that there is indeed no c-shift. Right: Now the second distribution (red) is exactly of the form (2.11). As a result a bending of the PDF at around t = c occurs, affecting the Hill estimation. This can be seen in the bottom subplot, where the region of stability ends at around t = c = 2. In fact, looking at the first distribution, the region of stability only starts at t = 1, i.e. after the bending of the PDF sets in. This can heavily affect the method.
4 Empirical Data
The tools and methods above are tailored to analyzing non-parametric data, which is appropriate for empirical data whose underlying laws are unknown. This section elaborates on the empirical data examined in this work, namely Twitter data. While the self-excited Hawkes process has been successfully applied in the context of earthquakes, financial markets and urban burglaries ([20],[1],[21]), where some kind of shock may trigger aftershocks, in the context of Twitter a tweet may trigger re-tweets. As expressed in equation (2.1), making predictions with the help of Twitter data requires the knowledge of the unconditional intensity µ(t) and the memory kernel h(t). As mother tweets are exogenous and re-tweets endogenous, Twitter data are well-suited to the Hawkes model, which effectively combines both types of mechanisms. In line with the work of [1], determining a branching ratio n over time may serve as a precursor for manifestations of criticality. As Twitter is frequently used as an important tool for political mobilization, especially during riots and unstable periods, such a manifestation of criticality could be the actual outbreak of a protest or riot on the street [22]. In this work we therefore analyze two recent social protests, elaborated on later in this section. In contrast, the characterization of the memory kernels may provide insights into how social interest - be it in a protest or a movie - can be sustained and fuelled over time.
As the “SMS of the internet”6, Twitter is designed to enable users to share short messages and links very efficiently, allowing one to label tweets via so-called hashtags, i.e. keywords preceded by a hash “#”, and to cite and reply to other users' tweets [23]. Those hashtags provide an effective way of obtaining and contributing relevant content. In this work we filter content via hashtags in order to obtain tweets on certain topics. The next section explains how those tweets were obtained; the subsequent two sections specify the choice of hashtags.
4.1 Scraping Twitter Data
There are many different ways to obtain different amounts of Twitter data. Getting 100% of the tweets fulfilling certain criteria generally involves high costs. For example, the big-data service “DataSift”, which collaborates with Twitter, charges on average 0.10 USD per 1,000 tweets on top of a monthly subscription fee of 5,000 - 45,000 USD.7 Another option is Twitter’s “Spritzer” streaming API, which is publicly accessible and free of charge but does not allow one to acquire historical data and further streams only about 1% of all tweets (tweet-ID mod 100). Using the “REST API” (REpresentational State Transfer) allows for past data but is limited to about 1 request per second [24]. Since requests are not constrained by the tweet-ID as for the Stream API, a higher percentage is possible. How
6http://www.business-standard.com/article/technology/swine-flu-s-tweet-tweet-causes-online-flutter-109042900097_1.html
the REST API was used to obtain a larger number of tweets will be explained in the following.
Obtaining the tweets of a certain hashtag from the REST API requires knowledge of the tweet-IDs. When a tweet is displayed in the browser, its tweet-ID can be found in the source code. Hence we use Twitter’s advanced web search to display the tweets of a certain hashtag and date in the browser. In order to increase the number of search results, one has to manually scroll down until the query limit is reached, which is currently 400 tweets. On the results page one can toggle between “Top” and “All” results. “Top” tweets are tweets that received or are likely to receive wider attention, which is calculated from the number of re-tweets, followers of the tweeter, favorites and replies [25]. “All” tweets are a fixed random sample of all tweets fulfilling the search criteria, i.e. repeating the search would not increase the sample. The query limit of 400 tweets holds for both types. In this work we exclusively consider top tweets, as more responses are available for better statistics. The source code of the search results, including the tweet-IDs, is saved and further processed by the Python script t_get.py, which parses the source code to extract the tweet-IDs and communicates via an associated Twitter account with the REST API of Twitter.
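The parsing step can be sketched as follows. The attribute name `data-tweet-id` is an assumption about the search-results markup at the time of writing; the actual pattern used by t_get.py may differ:

```python
import re

def extract_tweet_ids(html):
    """Pull unique tweet-IDs, in order of appearance, out of a saved
    Twitter search-results page. The 'data-tweet-id' attribute name is an
    assumption about the page markup, not taken from the thesis code."""
    ids = re.findall(r'data-tweet-id="(\d+)"', html)
    seen, unique = set(), []
    for i in ids:
        if i not in seen:       # drop duplicates, keep first occurrence
            seen.add(i)
            unique.append(i)
    return unique

sample = ('<div data-tweet-id="530000000000000001"></div>'
          '<div data-tweet-id="530000000000000001"></div>')
extract_tweet_ids(sample)  # ['530000000000000001']
```

The extracted IDs can then be fed one by one to the rate-limited REST API to retrieve each tweet and its re-tweets.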
There are many types of tweets triggered by different mechanisms. First and foremost there are original tweets that may or may not be re-posted by other users as so-called “re-tweets (RTs)” [26]. In allusion to the terminology used in the Hawkes model with exogenous “mother events” and endogenous “daughter events,” we will call original tweets “mother tweets (MTs).” For each mother tweet-ID the 100 most recent re-tweets along with a host of meta-data can be obtained. Currently the following is stored for further processing:
• Unique tweet-ID (variable digit number)
• Unique tweet-ID of mother tweet (if applicable)
• Time stamp with precision of seconds
• Unique user-ID (variable digit number)
• Number of re-tweets (if mother tweet, otherwise −1)
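The stored fields above can be represented as a simple record; the class and field names below are illustrative, not the thesis data format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TweetRecord:
    """One stored tweet, mirroring the field list above (names illustrative)."""
    tweet_id: int             # unique tweet-ID (variable digit number)
    mother_id: Optional[int]  # tweet-ID of the mother tweet, None if mother
    timestamp: int            # Unix time with precision of seconds
    user_id: int              # unique user-ID (variable digit number)
    n_retweets: int           # number of re-tweets if mother tweet, else -1

    @property
    def is_mother(self) -> bool:
        return self.mother_id is None
```

A mother tweet then carries its re-tweet count in `n_retweets`, while a re-tweet carries the reference to its mother in `mother_id`.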
Via the reference to each tweet’s mother tweet (if it is not a mother tweet itself) it is straightforward to extract the response times, i.e. the time needed for a re-tweet to occur after its mother tweet, and to construct the response function. So unlike in [1], with Twitter data we are able to causally link mother and daughter events, which puts us in a good position to robustly determine the memory kernel. It is important to note that this procedure only allows for the detection of a certain type of re-tweeting, i.e. one-click or automatic re-tweeting, as opposed to manual re-tweeting, where the content of the mother tweet is copied and pasted with own comments and a reference added in
the re-tweet. Obviously, Twitter cannot draw a link between such a manual re-tweet and its mother tweet. Yet there are ways to reconstruct this connection as shown for example in [27]. However, since it is likely that different ways of re-tweeting follow a different mechanism of self-excitation in the system, this work is limited to automatic re-tweeting only. The defining feature of the one-click re-tweets is the fact that they will always refer back to their most original mother, i.e. the immigrant. Therefore, when sampling the response times in Twitter, they will strictly constitute the renormalized memory kernel.
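Since one-click re-tweets always reference the most original mother, the sampled waiting times estimate the renormalized kernel directly. A minimal sketch of this sampling, assuming records of the form (tweet_id, mother_id, timestamp) with mother_id = None for mother tweets:

```python
import numpy as np

def response_times(records):
    """Waiting times of one-click re-tweets relative to their mother tweets.
    Because such re-tweets always refer back to the immigrant, these times
    sample the renormalized memory kernel."""
    t_mother = {tid: ts for tid, mid, ts in records if mid is None}
    return np.array([ts - t_mother[mid]
                     for tid, mid, ts in records
                     if mid is not None and mid in t_mother])

def kernel_histogram(dt, bins=50):
    """Logarithmically binned density estimate of the renormalized kernel."""
    edges = np.logspace(np.log10(max(dt.min(), 1.0)), np.log10(dt.max()), bins)
    counts, edges = np.histogram(dt, bins=edges)
    widths = np.diff(edges)
    return edges[:-1], counts / (widths * len(dt))

records = [(1, None, 0), (2, 1, 5), (3, 1, 20)]
response_times(records)  # array([ 5, 20])
```

Logarithmic binning is the natural choice here because the waiting times span several orders of magnitude and the kernels of interest are power laws.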
As derived in section 2.3, determining the renormalized memory kernel requires knowing the branching ratio such that the normalization to 1/(1 − n) is possible. Clearly, re-tweets are by definition endogenous events. In our framework mother tweets will be treated as exogenous events. Strictly speaking, though, mother tweets do not necessarily all have to be exogenous: they themselves can in principle be triggered by truly exogenous events, such as relevant news releases in the media. However, within the system of Twitter they constitute the exogenous events, such that the branching ratio can be directly calculated as
n = N_endo / N = N_RT / (N_RT + N_MT). (4.1)
Two factors already mentioned can affect the calculation of n and hence the correct scaling of the renormalized memory kernel, which in turn affects the estimation of the bare memory kernel. First, top tweets consist of tweets that tend to have many re-tweets, such that the branching ratio will be higher and thus not representative of the entire system. Second, in case some mother tweets have more than 100 re-tweets, only the most recent ones can be obtained from Twitter, which affects the calculation of the head of the response function. Also, the branching ratio will be underestimated if some re-tweets are disregarded. However, each mother tweet’s meta-data includes its number of re-tweets, even beyond 100, allowing for a correct calculation of n. Yet the effect on the renormalized kernel of not knowing the response times of some early re-tweets is non-trivial and remains to be investigated. The effects of both factors will be addressed in section 5.1.3.
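Equation (4.1) together with the meta-data re-tweet counts can be sketched as follows; using the counts from the meta-data rather than the retrieved re-tweets avoids the 100-re-tweet truncation:

```python
def branching_ratio(n_mother_tweets, retweet_counts):
    """Branching ratio n = N_RT / (N_RT + N_MT) as in equation (4.1).
    `retweet_counts` are taken from each mother tweet's meta-data, so
    re-tweets beyond the 100 retrievable ones are still counted."""
    n_rt = sum(retweet_counts)
    return n_rt / (n_rt + n_mother_tweets)

# Example: 50 mother tweets with 150 re-tweets in total
branching_ratio(50, [3] * 50)  # 150 / 200 = 0.75
```

Note that for top tweets this estimate is biased upward relative to the whole system, as discussed above.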
4.2 Hashtags of Single Events
Single events are events so confined in duration that the response to them in (social) media can be regarded as a burst with declining activity in the aftermath. For anticipated or announced events there will also be pre-burst activity. This activity, increasing towards the event, is mostly endogenously driven, for example by word-of-mouth and online sharing. The decreasing activity following the event is both exogenously and endogenously driven: both the event itself and people talking about it generate activity in the aftermath. In terms of the Hawkes model, activity directly triggered by the event itself constitutes first-order responses, while activity triggered by other people talking about the event constitutes higher-order responses.
In [5] Crane and Sornette treat YouTube video views as activity and categorize the videos according to their foreshock and aftershock dynamics. Events with after-burst activity only are categorized as exogenous because the activity is initially triggered by the external burst. This category is further divided into the subcritical (n < 1) and the critical case (n = 1), so there is a certain degree of endogeneity in this category. As has been shown in section 2.3, in the former case the response function will eventually fall off as the bare kernel (1/t^(1+θ)), whereas in the latter case the decay will be renormalized to 1/t^(1−θ). On the contrary, events with both after-burst and pre-burst activity are categorized as endogenous because the pre-burst activity can only be triggered by endogenous mechanisms. As shown in [28], the activity around the burst will be symmetric and behave as 1/t^(1−2θ) in the critical case. In the subcritical case the activity will follow a random noise process.
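In practice, assigning a burst to one of these classes amounts to estimating the power-law decay exponent of the post-burst activity and comparing it to 1 + θ, 1 − θ and 1 − 2θ. A simple sketch of such an estimate via a log-log least-squares fit (not the classification procedure of [5] itself):

```python
import numpy as np

def decay_exponent(t, activity):
    """Least-squares slope of log(activity) vs log(t), i.e. the exponent p
    in activity ~ 1/t^p. Comparing p against 1+theta, 1-theta and 1-2*theta
    is one way to assign a burst to the categories discussed above."""
    mask = (t > 0) & (activity > 0)   # log requires positive values
    slope, _ = np.polyfit(np.log(t[mask]), np.log(activity[mask]), 1)
    return -slope

t = np.arange(1.0, 100.0)
decay_exponent(t, t ** -1.2)  # recovers p = 1.2
```

For noisy empirical activity a maximum-likelihood estimator would be more robust than a naive log-log fit, but the principle is the same.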
Inspired by this classification of bursts, we chose hashtags of events that are a priori exogenous or endogenous, i.e. single events that are unexpected versus announced. The sub-categories with regard to criticality remain to be determined ex post. However, we chose events which received significant attention and whose hashtags have been widely used, such that criticality may actually be reached. The hashtags and events chosen are as follows.
A. Hashtags of endogenous events:
#Election2012 Presidential elections of Barack Obama versus Mitt Romney in the USA held on November 6, 2012.
#IndyRef Historical referendum on whether Scotland should become independent from the UK held on September 18, 2014.
#ManOfSteel One of the most anticipated movies in 2013, internationally released on June 14, 2013.8
#Transformers One of the most anticipated movies in 2014, internationally released on June 27, 2014.8
B. Hashtags of exogenous events:
#SandyHook Shooting at Sandy Hook Elementary School in Newtown, USA on December 14, 2012. With 26 children killed, it was one of the most fatal shootings in the last 20 years.
#BostonMarathon Bombing at the annual Boston Marathon held on April 15, 2013, with 5 deaths and 280 people injured. The incident gained national and international attention as US-grown Islamist terrorism.
#Mandela Death of South Africa’s first black president, Nelson Mandela, on December 5, 2013.
#RIPRobinWilliams Death of Hollywood actor and comedian, Robin Williams, on August 11, 2014.
8According to www.movieinsider.com
For each of these events the mother tweets are generally sampled from an approximate time window of three months around the event, while the re-tweets may reach up to the present. It is therefore to be expected that the more recent events will exhibit finite-size effects, which will be most clearly visible in the CCDFs.
4.3 Hashtags of Extended Events
Extended events are events that spread over a longer period of time and are composed of a series of sub-events. While the beginning of such an extended event may be well-defined, for example due to an exogenous shock, the ending need not be. The prime example of such events are social protests triggered by an external event and carried on long after it, generally exceeding the aftershock activity of single events. Other examples include other conflicts, wars, political terms of office, and ongoing debates on social issues. Scheduled longer events such as sports events are not suitable in this context because their extension is mostly externally defined as opposed to internally driven.
As stated initially, this work focuses on social protests that rely to a significant degree on social media for political mobilization. We have chosen two very recent protests, both of which have a widely-used hashtag clearly associated with the conflict:
#OccupyCentral Student protests in Hong Kong, which started at the end of August 2014 after the government’s announcement of limiting upcoming major elections and are ongoing at the time of writing this paper. The movement, also referred to as the “Umbrella Revolution” in the media, was initiated and is carried out mostly by students, who generally tend to use social media more.
#Ferguson Social unrest in the city of Ferguson in the USA. The triggering external shock here is very distinct, i.e. the shooting of the African-American Michael Brown by the white police officer Darren Wilson on August 9, 2014. Protests peaked right afterwards but are likewise still ongoing.
Because these protests are extended and composed of multiple bursts, it is sensible to investigate different phases of the protests. This necessitates a case study that separates each protest into its phases and individual major bursts. Since both protests received wide national and international media attention, they are well-documented and allow for a separation into phases that generally consist of a burst with succeeding and sometimes preceding activity, much like for the single-event hashtags. Although there are many such bursts, a few major ones can also be seen in the activity over time for the respective protests as depicted in figure 10. The phases will not always be clear-cut, in which case gaps of a few days between phases may arise. This is to prevent a mixing of potentially different mechanisms per phase.
The social unrest in Ferguson can be roughly separated into five phases:9
1. Initial uproar after the shooting of Michael Brown: August 9 - 13.
2. Re-escalation after demilitarization of police: August 15 - 18.
3. De-escalation after initial uproar: August 19 - 26.
4. Re-escalation after burning of M. Brown’s memorial: September 23 - October 1.
5. Peaceful protests and “Ferguson October”: October 2 - 24.
The student protests in Hongkong can be roughly separated into five phases:10
1. Initial uproar after the announcement of electoral reform: August 31 - September 13.
2. First major student assembly and protests: September 22 - 30.
3. Eruption of violence and de-escalation: October 3 - 8.
4. Re-escalation after cancellation of meeting with student leaders: October 9 - 18.
5. First round of talks and de-escalation: October 21 - 28.
9http://en.wikipedia.org/wiki/2014_Ferguson_unrest. More phases are listed there, some of which, however, may be combined.
10http://en.wikipedia.org/wiki/Occupy_Central_with_Love_and_Peace. The phases are not completely objective here but can be reasonably guessed from the article.
Figure 10: Activity over time for the two protests investigated. The activity of tweets (blue bars) with the hashtags #Ferguson (top subplot) and #OccupyCentral (bottom subplot), respectively, as well as the activity of the same search terms in Google (cyan), are plotted over time. The five approximate phases of each protest can also be seen in the activity. Sometimes Google Trends resembles the phases better, sometimes Twitter.
5.1 Synthetic Data
The general goal of working with synthetic data first is to confirm that the developed algorithms and tools work correctly before applying them to empirical data whose underlying laws are unknown. In the first section, general observations of the simulated data are made in order to get a first corroboration of the correctness of the simulation algorithm. The second section evaluates the performance of the numerical reconstruction of the bare kernel from the renormalized kernel. The third section addresses known error sources arising from limitations in the empirical data. The fourth section demonstrates anomalous scaling by applying this numerical reconstruction and by confirming its very cause.
5.1.1 Cluster Characteristics
This section presents first general results on the data produced by the simulation algorithm described in section 3.1. The algorithm, which was extended with cluster information, i.e. the cluster affiliation and the generation of each event, allows for a quick preliminary check of whether the simulation is correct.
Figure 11 visualizes the simulated counting process and the respective behavior of event generations for both an exponential memory kernel of the form (2.22) with τ = 2 and a power law memory kernel of the form (2.11) with c = 0.1 and θ = 0.8. Considering first only the top subplots, the overall counting process exhibits the expected irregular jumps arising from the clustering of events. Further, its average increase roughly retrieves the theoretical value of the branching ratio n = 0.7 when compared to the average counting process of the empirical unconditional intensity. This intensity µ_emp = N_exo/T can be directly calculated from the knowledge of the cluster centers, i.e. the immigrant events with k = 0, and compared to the theoretical value µ_tru that was used to generate the data. Both values agree very well, considering the rather short simulation time of T = 1,000. The disagreement for the exponential and the power law case is 4.0% and 8.0%, respectively. Doing the same for the overall intensities λ_emp = N/T and λ_tru results in a disagreement of 2.0% and 3.4% for the exponential and the power law case, respectively.
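The comparison of µ_emp and λ_emp with their theoretical values can be reproduced with a minimal cluster (branching) simulation. The sketch below follows the general scheme such simulations are based on, with an exponential bare kernel; it is not the thesis implementation of section 3.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hawkes_clusters(mu, n, tau, T):
    """Branching simulation of a Hawkes process with exponential bare kernel
    h(t) = exp(-t/tau)/tau: Poisson immigrants of rate mu, each event spawning
    Poisson(n) daughters with exponential waiting times. Returns event times
    and generations k (k = 0 for immigrants)."""
    immigrants = rng.uniform(0, T, rng.poisson(mu * T))
    events, todo = [], [(t, 0) for t in immigrants]
    while todo:
        t, k = todo.pop()
        events.append((t, k))
        for dt in rng.exponential(tau, rng.poisson(n)):
            if t + dt < T:                # truncate at the simulation horizon
                todo.append((t + dt, k + 1))
    events.sort()
    return np.array([t for t, _ in events]), np.array([k for _, k in events])

t, k = simulate_hawkes_clusters(mu=0.1, n=0.7, tau=2.0, T=1000)
mu_emp = np.sum(k == 0) / 1000   # compare to mu = 0.1
lam_emp = len(t) / 1000          # compare to mu / (1 - n) = 1/3
```

For T = 1,000 the empirical values fluctuate around the theoretical ones at the few-percent level, consistent with the disagreements quoted above.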
Comparing the counting processes to the bottom subplots of the generation k of each event over time, one notices that the cluster growth, proxied by the increase in k, goes along with higher slopes, i.e. the clustering of events, in the counting process. In the short simulation time T there is no major qualitative difference visible between the exponential and the power law bare kernel. The higher maximum value of k for the power law case, though, already hints at the longer memory connected to power law kernels.
Figure 12 shows the distribution of the sampled renormalized waiting times and the theoretically renormalized memory kernel that was obtained by analytically transforming the data-generating bare kernel of the form (2.16) via the Laplace transformations as
Figure 11: Cluster statistics over time for both an exponential (left) and a power law memory kernel (right). Top: Counting process N(t) of the complete process (blue) and the unconditional process (dashed). The true and empirical values agree well in the short simulation time of T = 10^3, with a disagreement of 2.0% and 4.0%, respectively. Further, the theoretical branching ratio n = 0.7 ≈ N_endo/N = 250/350 is retrieved. Bottom: Respective behavior of event generations k(t). Cluster growth, proxied by the increase in k(t), goes along with the higher slopes, i.e. the clustering of events, in the counting process N(t), as expected. The true and empirical values agree well, with a disagreement of 3.4% and 8.0%, respectively. There is no major qualitative difference visible between the exponential and the power law case, except that the longer-memory power law case already reaches a higher maximum generation (k = 14 as opposed to k = 10) at this short simulation time.
Figure 12: Log-log plot of the renormalized memory kernel with a power law bare kernel approximated as a sum of exponentials. The approximation is given by equation (2.16) and can be analytically transformed via Laplace transformations into the respective renormalized kernel for n = 0.4 according to equation (2.17). The fact that the empirical renormalized waiting times (cyan) and the analytically calculated renormalized kernel agree to a degree where even the “waviness” of the power law approximation is visible in both (see inset for more detail) shows that both the algorithm for data simulation and the transformation from generating bare kernel to renormalized kernel are correct.
outlined in the beginning of section 2.3. The fact that both the renormalized waiting times and the renormalized kernel agree to a degree where even the “waviness” of the power law approximation is visible in both shows that both the algorithm for data simulation and the transformation from generating bare kernel to renormalized kernel are correct.
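The renormalized kernel sums the bare kernel over all generations, h_R = h + n·h∗h + n²·h∗h∗h + …, equivalently the fixed point of h_R = h + n·(h ∗ h_R). A minimal numerical sketch of this forward transformation on a uniform grid (the inverse direction is the reconstruction developed in this work; this is not the thesis code):

```python
import numpy as np

def renormalize_kernel(h, n, dt):
    """Numerically renormalize a bare kernel sampled on a uniform grid by
    fixed-point iteration of the resolvent equation h_R = h + n*(h conv h_R),
    which sums the contributions of all generations. Converges for n < 1."""
    h_R = h.copy()
    for _ in range(200):
        conv = np.convolve(h, h_R)[:len(h)] * dt   # discretized convolution
        h_R_new = h + n * conv
        if np.max(np.abs(h_R_new - h_R)) < 1e-12:
            break
        h_R = h_R_new
    return h_R

dt = 0.01
t = np.arange(0, 50, dt)
h = np.exp(-t / 2) / 2                 # exponential bare kernel, tau = 2
h_R = renormalize_kernel(h, n=0.4, dt=dt)
# For an exponential bare kernel the renormalized kernel is again exponential,
# with decay rate (1 - n)/tau and total mass 1/(1 - n).
```

The exponential case provides a convenient check, since its Laplace transform gives the renormalized kernel in closed form; the same routine applies unchanged to the sum-of-exponentials approximation of the power law kernel.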
Figure 13 plots further statistics that can be directly calculated from the additional cluster information for both memory kernels with the same parameters but for T = 10^5. The top subplot compares the distribution of generations N_k(k). For both memory kernel types the distribution seems to follow an exponential decay of the form N_k(k) ∼ exp(−k/6), which agrees well with the theoretical prediction of N_k(k) ∼ n^k = exp(log(0.7)k). The bottom subplot compares the distribution of cluster sizes N_s(s). Again, both memory kernel types seem to follow the same underlying law N_s(s) ∼ s^(−2.5). It is interesting to see that there are 5,000 clusters of size s = 1, i.e. clusters composed of a single immigrant without descendants (M_{s=1} = 5,000). The entirety of descendants (≈ 23,333 for n = 0.7) is triggered by the other M_{s>1} = 5,000 immigrants, amounting to M = 10^4 immigrants in total, given T = 10^5 and µ = 0.1.
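The counts quoted above follow from the mean total progeny of a subcritical branching process: each immigrant triggers on average n/(1 − n) descendants, so a cluster has mean size 1/(1 − n). A quick consistency check of the quoted numbers:

```python
# Expected counts for mu = 0.1, T = 1e5, n = 0.7, as quoted in the text.
mu, T, n = 0.1, 1e5, 0.7

M = mu * T                   # expected number of immigrants: 10^4
N_endo = M * n / (1 - n)     # expected number of descendants: ~23,333
N_total = M / (1 - n)        # expected total number of events: ~33,333
```

Note that the mean descendant count per immigrant, n/(1 − n) = 7/3, is an average over all clusters, including the roughly half of immigrants that trigger no descendants at all.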
Figure 13: Cluster statistics for both an exponential (green) and a power law memory kernel (red). The respective parameters are the same as in figure 11 but the simulation time is longer (T = 10^5). Top: Distribution of generations, which seems to behave as N_k(k) ∼ exp(−k/6) for both kernel types. However, for large generations the tail of the power law case seems to become heavier. Bottom: Distribution of cluster sizes, which seems to behave as N_s(s) ∼ s^(−2.5) for both kernel types. For the given µ and T it can be inferred that all descendants are triggered by only 50% of the immigrants, while the other half does not trigger any descendants, constituting single-event clusters with s = 1.
A major difference between the two memory kernels, however, be