
Complex Event Processing on MongoDB


Bachelor’s Thesis

Author : Tim Bauerle

Advisors : Alessandro Berti, M.Sc.
Dr.-Ing. Merih Seran Uysal

Supervisors : Univ.-Prof. Prof. h. c. Dr. h. c. Dr. ir. Wil van der Aalst
Univ.-Prof. Dr.-Ing. Ulrik Schroeder

Registration date : 2019-08-13

Submission date : 2019-12-11

This work is submitted to the i9 Process and Data Science (PADS) Chair, RWTH Aachen University


Acknowledgments

First of all, I would like to thank Alessandro Berti for his advice and guidance throughout the period of this thesis. He helped me to constantly improve my thesis with his valuable feedback and made this work possible.

The feedback provided by Dr. Merih Seran Uysal is greatly appreciated as well.

I would also like to thank Prof. van der Aalst and Prof. Schroeder for supporting this thesis.

Finally, I would like to thank my friends and family, especially my parents, for always supporting and encouraging me over the last months.


Abstract

Complex Event Processing (CEP) aims to detect predefined patterns in large sets of events. It can be particularly useful when it comes to analyzing streams, for example in process monitoring or conformance checking in an online setting. Difficulties arise from large amounts of data or from events arriving out of order, for example due to distributed event producers. Most CEP applications are designed for high throughput, but not for storing the data permanently or for fault tolerance. In this thesis, the possibilities of CEP implemented on top of the MongoDB database are investigated. To this end, several CEP approaches are presented, implemented and evaluated. Furthermore, we provide an implementation of an online conformance checking algorithm, namely the technique of footprints. The proposed architecture is modular, easy to deploy and provides direct database access. Finally, the implemented techniques are evaluated by comparing them to the widely used CEP framework Esper. The results indicate better performance of the in-memory Esper framework.


Contents

Acknowledgments

Abstract

1 Introduction
1.1 The Relationship to Process Science
1.2 Use Cases of Complex Event Processing
1.3 MongoDB
1.4 Thesis Structure

2 Related Work
2.1 CEP Languages and Tools
2.2 CEP and Business Process Monitoring
2.3 CEP and IoT
2.4 Performance Evaluation of MongoDB

3 Preliminaries
3.1 The Notion of Complex Events
3.2 Formalization
3.2.1 Update functions applied on event arrival
3.2.2 Continuously applied update functions
3.2.3 Batch window update functions

4 Processing Events using CEP
4.1 Considerations
4.2 Data Windows
4.2.1 Time Windows
4.2.2 Length Windows
4.2.3 Other Windows
4.3 Footprints Conformance Checking

5 Implementation
5.1 Requirements
5.2 How to use the Implementation
5.2.1 Configure the Statistics
5.2.2 Configure the Alerts
5.2.3 Register the Event Streams
5.2.4 Start the Data Windows
5.2.5 Start the Alert Listener

6 Assessment
6.1 Set-up
6.1.1 Esper (Comparative Setting)
6.1.2 Implementation Set-up
6.2 Results
6.2.1 Correctness
6.2.2 Performance
6.2.3 Summary

7 Conclusion

Bibliography

A Full Assessment Results


Chapter 1

Introduction

Today’s information systems produce vast amounts of event data. Event data arise in many settings, involving for example sensor data or user interactions in a social network. Processing this data could provide interesting insights on the one hand, or could even be necessary in order to react to some pattern, which also requires processing in at least near real time. Consider for example an algorithmic trading system for a stock. To determine a trading strategy, it is necessary to track the current events (the orders) and then act according to the observed event patterns. Such an event pattern is a predefined occurrence of events, for example that at least 80% of the orders in the last hour are about selling the stock. This pattern is also called a complex event. There are different notions of complex events that are presented in a later chapter. Complex event processing (CEP) deals with the computations related to complex events, such as detecting or transforming them [1].
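To make the trading example concrete, here is a minimal Python sketch of detecting this complex event. It is not taken from the thesis; the event fields `side` and `time` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def sell_pressure_detected(orders, now, threshold=0.8, window=timedelta(hours=1)):
    """Return True if at least `threshold` of the orders observed in the
    last `window` before `now` are sell orders (the complex event)."""
    recent = [o for o in orders if now - window <= o["time"] <= now]
    if not recent:
        return False
    sells = sum(1 for o in recent if o["side"] == "sell")
    return sells / len(recent) >= threshold

# Five hypothetical orders within the last hour; four of them are sells.
orders = [
    {"side": "sell", "time": datetime(2019, 5, 27, 13, 10)},
    {"side": "sell", "time": datetime(2019, 5, 27, 13, 20)},
    {"side": "sell", "time": datetime(2019, 5, 27, 13, 30)},
    {"side": "sell", "time": datetime(2019, 5, 27, 13, 40)},
    {"side": "buy",  "time": datetime(2019, 5, 27, 13, 50)},
]
print(sell_pressure_detected(orders, datetime(2019, 5, 27, 14, 0)))  # 4/5 = 0.8 → True
```

A real CEP engine would evaluate such a condition continuously over the stream rather than over a stored list; the sketch only illustrates the pattern itself.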

1.1 The Relationship to Process Science

Complex event processing is strongly related to process science, since events usually happen in processes that also provide a context in which to analyze them. Thus, various types of models exist to describe processes. The definitions presented here are derived from van der Aalst [2]. A process can be described as a set of activities whose order is modelled by causal dependencies. Furthermore, a single instance of a process is referred to as a case, and the ordered sequence of events corresponding to a case is referred to as a trace. Each event in a trace corresponds to one activity, though an activity can occur multiple times in a trace. Consider for example the process of withdrawing money at a cash machine from a customer’s view, pictured in Figure 1.1. The process could be described, slightly simplified, as follows: the customer first inserts the card, then selects the amount of money, enters the PIN, takes the card and finally takes the money.

Figure 1.1: Process model example in Business Process Model Notation (BPMN)

Case ID | Activity      | Timestamp               | Account ID
--------|---------------|-------------------------|-----------
001     | Insert Card   | 2019-05-27 13:14:16.539 | 2015569380
001     | Select Amount | 2019-05-27 13:14:20.287 | 2015569380
001     | Enter PIN     | 2019-05-27 13:14:39.138 | 2015569380
001     | Take Card     | 2019-05-27 13:15:05.264 | 2015569380
001     | Take Money    | 2019-05-27 13:15:30.657 | 2015569380
002     | Insert Card   | 2019-05-28 21:38:16.053 | 1903674423
002     | ...           | ...                     | ...

Table 1.1: An example of an event log

The model in Figure 1.1 uses the Business Process Model Notation (BPMN), which is a widespread notation for process models and self-explanatory for this simple example. For a detailed description of the notation, see [3]. In this example, each block, such as entering the PIN, corresponds to an activity in the process model. The events corresponding to a process are captured in an event log, shown in Table 1.1. A single row in the event log represents an event that is matched to a case (the particular process instance). Furthermore, there are attributes such as timestamps and the corresponding activity assigned to an event. The events of a certain case ID, ordered by their timestamp, represent the trace. Note that the number of events in the traces of two distinct cases is not necessarily equal. The process in the example could also be modelled in a more complex way. For example, the possibility to abort the process at any step could be included, which would cause many additional branches in the model.
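The grouping of event-log rows into traces described above can be sketched as follows. This is a hedged illustration using a few rows in the spirit of Table 1.1; the tuple layout is an assumption, not the thesis implementation.

```python
from collections import defaultdict

# Hypothetical event-log rows: (case id, activity, timestamp),
# deliberately not in timestamp order.
log = [
    ("001", "Insert Card",   "2019-05-27 13:14:16.539"),
    ("001", "Select Amount", "2019-05-27 13:14:20.287"),
    ("002", "Insert Card",   "2019-05-28 21:38:16.053"),
    ("001", "Enter PIN",     "2019-05-27 13:14:39.138"),
]

def traces(log):
    """Group events by case id and order each trace by timestamp
    (the ISO-like timestamps sort correctly as strings)."""
    cases = defaultdict(list)
    for case, activity, ts in log:
        cases[case].append((ts, activity))
    return {case: [a for _, a in sorted(events)] for case, events in cases.items()}

print(traces(log)["001"])  # ['Insert Card', 'Select Amount', 'Enter PIN']
```

Note that the two cases yield traces of different length, matching the remark that traces of distinct cases need not contain the same number of events.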

Process mining is a subject that comprises several aspects related to modelling and analysing processes. There are three main fields within process mining. The first one is process discovery, which aims to generate process models from event logs. This is particularly useful if no process model exists, but will not be dealt with in this thesis. The second field is conformance checking and will be partly dealt with in this thesis. Conformance checking requires both a process model and an event log to exist. Based on the process model and the event log capturing the execution data of the process, conformance checking detects deviations from the process model in the process execution. To this end, the occurrence of events captured in an event log is checked to match the given model. The results can be used to judge the quality of the process model. In some aspects, conformance checking and CEP are similar. Obviously, both techniques rely on event data, and both aim to detect certain patterns, or rather deviations from a given pattern in the case of conformance checking. On the other hand, CEP focuses on an event stream rather than the event log that is used in conformance checking. Furthermore, the aspect of real time is present in CEP, while conformance checking is not necessarily done in real time.

The last field of process mining is process enhancement, which is again not dealt with in this thesis. Given a process model and an event log, the aim is to improve the model according to the execution data gathered in the event log. There is repair on the one hand, to adapt the model according to the event log, and extension on the other hand, to add additional information to the model. According to [2], process mining in general and CEP complement each other. In particular, CEP can be useful as a preprocessing step to infer higher-level events, which are easier to analyze, from a stream of comparatively meaningless events.

1.2 Use Cases of Complex Event Processing

Complex event processing has various use cases, some of which are presented here.

• Process monitoring is a part of the Business Process Management lifecycle and a use case for CEP. Business Process Management (BPM) encompasses methods of supervising and improving business processes, from the process design to its implementation and further adjustments [4]. Process monitoring focuses on the execution of processes and on gaining insights into the correctness and efficiency of the processes. Based on these insights, improvements to the process can be made, reducing costs or execution time.

• The Internet of Things (IoT) also has several use cases for CEP. Equipping various physical devices with sensors leads to large amounts of generated data, including event data. Often a real-time analysis of this data is required, which can be done using CEP techniques. As the amount of data generated by IoT devices grows rapidly while a high bandwidth is not always guaranteed in wide area networks, there is a demand to process the event data near or inside the device generating the data. When it is required to process events of geographically distributed devices, this reduces the size of the data to be streamed to the central CEP engine. As the computations are conducted at the edge of the network, this is also called edge analytics [5]. The IoT is present in different domains, such as transportation and smart cities. Gao et al. [6] introduced the Automated Complex Event Implementation System (ACEIS), which is built to discover and integrate heterogeneous sensor data streams in the context of smart cities. It serves as a middleware between sensor data streams and applications. The user can specify requirements as complex event requests in order to retrieve composed data streams according to the request. A possible usage is the processing of public transport data for timetable information.

• Another domain is health care, where wearable sensors enable the monitoring of patients to detect and prevent diseases. For example, CEP techniques can be used to monitor patients suffering from cardiovascular diseases in order to detect heart failures as early as possible [7]. This example is of course also related to the IoT.

1.3 MongoDB

MongoDB [8] is one of the most popular NoSQL database systems. In terms of Big Data, the need for handling large amounts of possibly unstructured data is constantly growing, which applies to event processing as well. Big Data was characterized by Laney in 2001 with the following properties: volume, velocity and variety [9]. Meanwhile, the model has been extended by veracity. The attribute volume describes the large amount of data that has to be analyzed, transferred or stored. Similarly, velocity refers to the frequency of newly arising data and the need to process these in real time as much as possible. Variety describes the different data types and structures, which vary from text files to images.


Veracity outlines a possible uncertainty of the data, for example sensor data that might be imprecise. Sometimes even more attributes are mentioned to describe Big Data, such as value, describing the business value of the data, but volume, velocity, variety and veracity are commonly used. Moreover, these four properties do not only apply to Big Data, but also to complex event processing.

In particular, MongoDB has certain advantages compared to the classical relational database approach with respect to Big Data.

• One of the most crucial advantages addresses the variety of the data. MongoDB does not require its records to have a certain schema, in contrast to a relational database. This approach allows semi-structured and unstructured data to be handled more efficiently and stored just as they come up. More specifically, MongoDB is based on documents storing the data as key-value pairs. Furthermore, structures such as arrays or subdocuments are also allowed within documents. The documents are defined individually in a JSON-like format, and each document can have different attributes varying in type and number. To provide some more structure, documents can be organized in collections.

• Speaking of volume, MongoDB supports horizontal scaling, allowing the capacity of a system to be adapted by adding extra servers to increase performance. This involves the possibility of distributing the data geographically, called sharding. Although a sharded system is more complex to handle, sharding, or horizontal scaling respectively, is generally cheaper than so-called vertical scaling based on upgrading the hardware of single machines. On the other hand, the same data can also be stored several times using data replication to increase the fault tolerance of the system.

• Regarding the velocity of the data, there is a need to analyze data as soon as possible after they arise and to store data as fast as possible. MongoDB therefore allows documents to be inserted quickly. On the other hand, modifications of the data might take more time, but are usually not required in CEP.

• Veracity is an intrinsic property of the data and more or less independent of the data storage architecture; it rather affects the way the data are analyzed.
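To illustrate the schema flexibility from the first point above, consider two differently shaped event documents, shown here as plain Python dicts in MongoDB’s JSON-like style. All field names and values are illustrative assumptions; with a driver such as pymongo, both dicts could be inserted into the same collection.

```python
# Two event documents without a shared schema, as MongoDB would accept
# them in one collection. Field names are illustrative only.
sensor_event = {
    "type": "temperature",
    "timestamp": "2019-05-27T13:14:16.539",
    "value": 21.5,           # numeric attribute
    "unit": "celsius",
}
process_event = {
    "type": "activity",
    "timestamp": "2019-05-27T13:14:20.287",
    "case_id": "001",
    "activity": "Select Amount",
    "tags": ["atm", "withdrawal"],       # array attribute
    "context": {"terminal": "ATM-42"},   # nested subdocument
}
# The documents differ in attribute names, number and nesting,
# yet both could be stored in the same events collection.
assert set(sensor_event) != set(process_event)
```

In a relational schema, these two shapes would require either two tables or a table full of nullable columns; the document model stores each event as it arrives.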

However, CEP tools like Esper, which will be presented in the next chapter, focus on a high throughput of the event data but do not save them. Furthermore, the possibilities for fault tolerance are also limited. Therefore, a database is required, and as events in a real-world scenario may not necessarily arise as structured data, it seems reasonable to use MongoDB. So it is interesting to investigate the possibilities of complex event processing directly on top of MongoDB.

1.4 Thesis Structure

After this short introduction to complex event processing, process mining and MongoDB, Chapter 2 provides an overview of efforts that have already been made in the field of complex event processing and similar fields. In this context, we do not limit ourselves to research approaches, but also have a look at existing commercial and open-source technologies. After that, we will have a closer look at the notion of complex events and introduce the concepts of different data windows to evaluate streams in Chapter 3. Building upon this, the data windows are illustrated in a more practical way in Chapter 4, and a conformance checking algorithm, namely the technique of footprints, is introduced. As the aforementioned concepts are implemented within the scope of this thesis, instructions on how to use the implemented techniques are given in Chapter 5. Chapter 6 provides an evaluation of the implemented techniques, which is done by a comparison to the popular CEP software Esper; performance and accuracy are compared on equal test data. Chapter 7 finally summarizes the efforts of this thesis and gives an outlook on future work.


Chapter 2

Related Work

CEP was originally developed in order to analyze and abstract events in distributed information systems [10]. Luckham describes the foundations of CEP in [11], including the development of CEP applications using the Rapide event pattern language.

Etzion and Niblett present an overview of event processing in general in [12]. Starting with basic concepts, such as the notion of events, more complicated concepts are discussed, like event-driven architectures, while the implementation of these systems is also considered. Although the book is not on CEP, the concepts presented often overlap with the concepts of complex event processing.

Luckham furthermore describes the difference between CEP and event stream processing in [13] as follows. He differentiates between event streams and event clouds. The main difference between those is the underlying assumption about the ordering of events. In event stream processing, it is usually assumed that events arrive ordered by their time of occurrence. An event cloud, in contrast, combines multiple sources of events and assumes that events may arrive out of order, which is computationally more expensive. Generally speaking, CEP deals with so-called event clouds, while event stream processing is based on a stream, as the name suggests.

However, the term event stream is often used to describe event clouds as well, as the term stream outlines the property of data coming up sequentially over time. In this thesis, we also refer to event streams, but without the assumption that events arrive in the order of their occurrence.

2.1 CEP Languages and Tools

A majority of complex event processing languages are so-called stream-oriented languages or data stream query languages, but there are also other types, such as logic languages or composition operators. A more detailed survey of different language styles is provided in [14]. The stream-oriented languages are often based on SQL to query the event stream [12]. As there are always new events arriving on a stream, only particular windows of the stream, for example time windows, are considered and queried. On the one hand, it is difficult to query the whole stream due to events that arise during the query computation; on the other hand, the past events up to a certain point in time are probably not relevant any more. A window is treated similarly to a relation and an event similarly to a tuple in the sense of relational databases.

An implementation in the stream-oriented style is Esper [15], a widespread open-source tool for CEP and streaming analytics that incorporates a language, a compiler and a runtime. The language that Esper offers is a declarative language named Event Processing Language (EPL), extending the SQL standard. Esper is built to have low latency and high throughput, as the data are mainly analyzed and filtered but not necessarily saved. The EPL therefore provides various types of data windows, such as sliding windows, to define a certain part of the event streams. Furthermore, many functions are provided to analyze the events, such as aggregations or patterns. Since Esper is written in Java without any dependencies on external services, it can be integrated into any Java application. There is also a corresponding implementation for .NET called NEsper, which is not used in this thesis, though.
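For illustration, a simple EPL statement might look as follows. This is a hedged sketch rather than a query from the thesis: the event type `OrderEvent` and its `price` attribute are assumptions, and the exact window syntax (`#time(...)`) depends on the Esper version in use.

```
// average order price over a sliding 30-second time window,
// assuming an event type OrderEvent with a numeric price attribute
select avg(price) from OrderEvent#time(30 sec)
```

The SQL-like shape of the statement shows the stream-oriented style described above: the time window plays the role of a relation that is continuously re-evaluated as events arrive and expire.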

Other popular commercial CEP systems are, for example, Oracle CEP [16] and Siddhi [17], each also offering its own event query language based on SQL.

Apache Spark [18] offers a fault-tolerant streaming library called Spark Streaming. The main difference to Esper is that it uses micro-batching as an execution model instead of a continuous streaming model.

Rapide [19], which has already been mentioned, is one of the first CEP systems, developed at Stanford University. It was developed in order to model and simulate system architectures that can then be analyzed using CEP. Rapide also offers its own pattern language and is capable of modelling timing or causal relationships.

Cayuga [20] is another research CEP system, developed at Cornell University. It is based on the publish/subscribe paradigm and enables monitoring applications. To this end, it has its own event query language and a query processing engine.

2.2 CEP and Business Process Monitoring

As already mentioned, business process monitoring is a use case for CEP. In particular, it is a challenge for process monitoring if events that arise during process execution are not captured properly in an event log. Therefore, Bülow et al. [21] presented an approach to monitor processes with a given process model, using only the execution data, which may consist of unstructured events. Using CEP techniques, the properties of the events are enriched with process-related information, such as the activity corresponding to an event.

Baouab et al. [22] presented an approach to monitor violations of cross-organizational business processes. Using an underlying choreography model, Esper queries are automatically generated to evaluate message exchange events and monitor deviations from this model.

2.3 CEP and IoT

Wu et al. presented a complex event processing system for monitoring applications in the context of RFID technology [23]. To this end, they developed a language to query event streams. Furthermore, they used optimization techniques to deal with large sliding windows and large intermediate results in order to achieve high performance for their language.

An approach using CEP and statistical analysis for heart failure prediction (CEP4HFP) was introduced by Mdhaffar et al. in [7]. The system can be used to monitor patients suffering from cardiovascular diseases and to alert patients and cardiologists when a heart failure stroke is predicted. Relevant health parameters, such as heart rate and blood pressure, are measured via several sensors. The observed data are stored in a MongoDB database and processed by the Siddhi [17] CEP engine to determine complex events. To this end, rules are defined by cardiologists in order to identify important complex events. A challenge is the individuality of parameter thresholds for each patient. Hence, a statistical analysis of the collected data is performed so that thresholds are updated at runtime. When a complex event is detected, for example when a parameter exceeds a certain threshold, an alert is triggered.

2.4 Performance Evaluation of MongoDB

Chang et al. conducted a performance comparison of MongoDB and MySQL on semi-structured social media data [24]. The time was measured for insert, select, update and delete operations on data sets of different sizes. Furthermore, performance tuning was done for both databases before the measurement. Overall, MongoDB performed better in the majority of cases on all operations, outlining its advantages for semi-structured data. Similar results are presented in [25, 26]. Good results compared to other NoSQL databases are also presented in [27, 28].


Chapter 3

Preliminaries

3.1 The Notion of Complex Events

As there exist slightly different definitions of complex events and related terms, an attempt at standardization was made by the Event Processing Technical Society. To this end, the Event Processing Glossary [1] was published, providing the definitions of the concepts relevant to this thesis.

Definition 3.1 (Event). An event denotes anything that happens, or is considered tohappen. [1]

Corresponding to the example of the cash machine mentioned in Section 1.1, an event could be a particular execution of an activity, such as entering the PIN, which is associated with a particular case and execution time. Note that the representations of events used by computer systems are called event objects. In addition to the type of event represented, the event object can store further information in the form of attributes, such as timestamps. In the implementation done in this thesis, an event object corresponds to a document in MongoDB.
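As a hedged illustration of this correspondence, the "Enter PIN" event from Table 1.1 could be represented as a document like the following (a plain Python dict in MongoDB’s JSON-like format; the field names are assumptions, not prescribed by the implementation):

```python
# An event object represented as a MongoDB-style document. Besides the
# event type (the activity), further attributes such as the timestamp
# are stored as additional key-value pairs.
enter_pin_event = {
    "case_id": "001",
    "activity": "Enter PIN",
    "timestamp": "2019-05-27 13:14:39.138",
    "account_id": "2015569380",
}
print(sorted(enter_pin_event))  # the attribute identifiers of this event
```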

Furthermore, events can be viewed either as simple or as complex events.

Definition 3.2 (Simple Event). A simple event is an event that is not considered to be a compound of a set of other events. [1]

Definition 3.3 (Complex Event). A complex event is an event that is a summary, a representation, or a definition of a set of other events. [1]

Distinguishing between simple and complex events always depends on the context. In the cash machine example, a particular event of the activity "enter the PIN" could be seen as a simple event. However, from another point of view, it could also be seen as a complex event, since it summarizes multiple events corresponding to pressing single keys on the machine. Moreover, there are also classifications for different types of complex events.

Definition 3.4 (Derived Event, Synthesized Event). A complex event is called a derived or synthesized event if it is inferred from one or more events. [1]

For instance, from a sequence of events corresponding to withdrawing money, an event could be derived reporting that a customer has increased his amount of cash.


Figure 3.1: Different kinds of event objects

Definition 3.5 (Composite Event). A composite event is a derived event that is generated by a specific set of constructors, like disjunction and conjunction, and furthermore includes the events from which it is derived, called the base events. [1]

The derived event reporting that a customer has increased his amount of cash could also be a composite event, if sequences belong to the particular set of event constructors and the underlying base events are included.

Note that complex events in general are sometimes referred to as composite events, which is not in the sense of the presented definitions. A composite event is always a derived event, and a derived event is always a complex event, while not all complex events are derived events and not all derived events are composite events. Figure 3.1 provides an overview of the relationships between the different kinds of events mentioned. Note that the intersection of simple and complex events is empty. This seems to contradict the aforementioned remark that distinguishing between simple and complex events is always relative. From a single point of view, though, any event is either simple or complex.

More importantly, there is the term complex event processing, which refers to the set of computations on complex events, such as their generation, transformation or abstraction [1].

Eckert et al. [29] distinguish between complex events as a-priori known patterns and complex events as unknown patterns. The former can be processed efficiently using event processing languages; the latter require the application of machine learning and data mining techniques. In this thesis, the focus is primarily on a-priori known patterns that need to be checked.

3.2 Formalization

In order to describe the implemented techniques more precisely, a formalization is given in the following, adapting relevant notions from [30] and [2]. First of all, E is referred to as the universe of events. In particular, e ∈ E denotes an event object here, represented by a tuple of certain attribute values. Additionally, let A denote the set of attribute identifiers. The value of an attribute a ∈ A can be accessed by a projection function πa ∶ E → D ∪ {⊥}, where D is the universe of data values. Note that when a particular attribute a is not defined for an event e ∈ E, then πa(e) = ⊥. Let furthermore T ⊂ D be the universe of timestamp values.

A sequence σ is a function σ ∶ D → C, where the domain D ⊆ N is a finite or infinite subset of the positive integers and the codomain C is an arbitrary set. It can also be written as σ = ⟨c1, c2, ..., cn⟩ or σ = (ci)i∈D. Let moreover C∗ denote the set of all finite or infinite sequences over the set C. Furthermore, for a sequence σ, we write ∣σ∣ for ∣domain(σ)∣ and ⟨⟩ for the empty sequence with ∣⟨⟩∣ = 0. To keep things simple in the following, we define the timestamp projection of the empty sequence πt(⟨⟩) as the current time for t ∈ T. Given two sequences σ, σ′, we define the concatenation of σ and σ′ as σ ○ σ′ = ⟨c1, ..., c∣σ∣, c′1, ..., c′∣σ′∣⟩.

Furthermore, σ is called a subsequence of σ′, written as σ ⊆∗ σ′, if and only if there exists a mapping f so that σ(i) = σ′(f(i)) and ∀i, j ∈ N (i < j → f(i) < f(j)). σ is a strict subsequence of σ′ if and only if σ is a subsequence of σ′ and ∀x ∈ [1, ∣σ∣ − 1] (f(x + 1) = f(x) + 1), where f is the required mapping. In addition, a subsequence σ′ of a sequence σ can be defined as the restriction σ∣D′ to a smaller domain D′ ⊆ domain(σ).

The notions of sequences allow us to define event streams and event stores as follows.
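Operationally, the two subsequence relations can be read as follows. This is a small Python sketch over lists, not part of the thesis implementation.

```python
def is_subsequence(sigma, sigma_prime):
    """sigma ⊆* sigma_prime: sigma embeds into sigma_prime in order,
    but not necessarily contiguously (the mapping f is increasing)."""
    it = iter(sigma_prime)
    return all(any(c == d for d in it) for c in sigma)

def is_strict_subsequence(sigma, sigma_prime):
    """Strict subsequence: sigma occurs contiguously in sigma_prime
    (f additionally maps adjacent positions to adjacent positions)."""
    n = len(sigma)
    return any(sigma == sigma_prime[i:i + n]
               for i in range(len(sigma_prime) - n + 1))

print(is_subsequence(["a", "c"], ["a", "b", "c"]))         # True
print(is_strict_subsequence(["a", "c"], ["a", "b", "c"]))  # False
```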

Definition 3.6 (Event Stream [30]). An event stream S ∈ E∗ is a possibly infinite sequence of events (ei)i∈N0 so that ∀ei, ej ∈ E (ei = ej ⇒ i = j), which implies that each event occurs at most once in the sequence.

A general assumption about event streams is that a whole event stream is not accessible at any time. On the one hand, events happen over time and thus some events might not have happened yet. On the other hand, a limited amount of memory makes it impossible to store a possibly infinite number of events. Thus, when it comes to storing events of a stream, usually only a subsequence of the actual stream is stored. As new events arrive on the stream, the event store is updated from time to time. This leads to the following definition of an event store:

Definition 3.7 (Event Store [30]). Let S ∈ E∗ be an event stream and n ∈ N0. An event store Φ^i_n of size n when i events have been observed on S characterizes a finite subsequence of S, hence Φ^i_n ∶ E∗ → E∗ so that Φ^i_n(S) ∈ {σ ∈ E∗ ∣ σ ⊆∗ S∣[1,i] ∧ ∣σ∣ ≤ n}.

Definition 3.8 (Event Store Update Function [30]). An event store update function φn is a function φn ∶ E∗ × E → E∗ if φn is applied only at event arrival, or a function φn ∶ E∗ × (E ∪ {⟨⟩}) → E∗ if φn is applied continuously. Furthermore, we require that given a maximum size n ∈ N0 of the event store, a stored sequence σ ∈ E∗ and a new event e ∈ E or e ∈ E ∪ {⟨⟩} respectively, φn(σ, e) ∈ {σ′ ∣ σ′ ⊆∗ σ ○ ⟨e⟩ ∧ ∣σ′∣ ≤ n}.

In the following, event store update functions are presented for different types of windows. First, the window update functions are presented that only act when a new event arrives. Second, the window update functions are presented that are continuously applied, i.e. the events stored by the window can change even when no new event arrives. After that, the batch window update functions are introduced, which are a different type of update function. These are also distinguished between those functions that are continuously applied and those that are not.

3.2.1 Update functions applied on event arrival

Definition 3.9 (Length Window Update Function). An event store update function φ^lw_n ∶ E∗ × E → E∗ is a length window update function if n is the maximum size of the window and, given a stored sequence σ ∈ E∗ and a new event e ∈ E:

φ^lw_n(σ, e) =
  σ ○ ⟨e⟩,                if ∣σ∣ < n
  ⟨e2, ..., e∣σ∣⟩ ○ ⟨e⟩,  otherwise.

The length window update function is a clear example of an update function that only changes the content of the underlying event store when a new event arrives. The following first length window update function differs in that once an event is stored in the corresponding event store, it is never removed, and the sequence is fixed after it reaches the length n.

Definition 3.10 (First Length Window Update Function). An event store update function φ^flw_n ∶ E∗ × E → E∗ is a first length window update function if n denotes the maximum size of the window and, given a new event e ∈ E and a stored sequence σ ∈ E∗:

φ^flw_n(σ, e) =
  σ ○ ⟨e⟩,  if ∣σ∣ < n
  σ,        otherwise.
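The two update functions differ only in which end of the stored sequence is frozen. A minimal Python sketch (our own rendering, with the stored sequence as a list and events as arbitrary objects):

```python
def length_window_update(sigma, e, n):
    """phi^lw_n: append e; if the window is full, evict the oldest event."""
    if len(sigma) < n:
        return sigma + [e]
    return sigma[1:] + [e]

def first_length_window_update(sigma, e, n):
    """phi^flw_n: append e only while fewer than n events are stored."""
    if len(sigma) < n:
        return sigma + [e]
    return sigma
```

Feeding the events 1, 2, 3, 4 into both windows with n = 3 yields [2, 3, 4] for the length window and [1, 2, 3] for the first length window.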

Definition 3.11 (Keep-All Window Update Function). An event store update function φ^kw ∶ E∗ × E → E∗ is a keep-all window update function if, given a new event e ∈ E and a stored sequence σ ∈ E∗:

φ^kw(σ, e) = σ ○ ⟨e⟩

The underlying concept of the keep-all window is to store every arriving event, which is modelled by the presented keep-all window update function. It is not applicable in practice though, as an event stream is unbounded and the memory limited. Thus, the constraint to keep all events will eventually be violated. One option to deal with that is to remove no events from the window, which is similar to the definition of the first length window update function. Another option is to always keep the newest events, leading to the length window update function. Hence, a perfect keep-all window is not possible, so that depending on the context, a length window or a first length window of the size of the available memory is the closest practical approximation of a keep-all window. In general, this does not only apply to the keep-all window. When too many events arrive in a stream, each data window will eventually act like a length or first length window. The functions presented in the following describe the way the windows work as long as memory is available.

Definition 3.12 (Sorted Window Update Function). An event store update function φ^sw_sort,n ∶ E∗ × E → E∗ is a sorted window update function if n denotes the maximum size of the window, sort denotes criteria by which to sort the events and, given a new event e ∈ E and a stored sequence σ ∈ E∗:

φ^sw_sort,n(σ, e) =
  ⟨e1, ..., ei, e, ei+1, ..., e∣σ∣⟩,  if ∣σ∣ < n ∧ πsort(ei) ≤ πsort(e) ≤ πsort(ei+1)
  ⟨e2, ..., ei, e, ei+1, ..., e∣σ∣⟩,  if ∣σ∣ = n ∧ πsort(ei) ≤ πsort(e) ≤ πsort(ei+1)
  σ,                                  otherwise.

Note that πsort is a projection function to a tuple of attribute values specified by the sort criteria, and <, ≤ describe the order of these tuples as specified by the sort criteria. For the following unique window update function, πuni is also a projection function to a tuple of attribute values, while an order is not defined.
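A Python sketch of Definition 3.12 (our own rendering): the stored sequence is kept in ascending order of a `key` function standing in for πsort; when the window is full, the head of the order is evicted, and an event sorting below everything stored is discarded. Descending criteria can be modelled by negating the key.

```python
def sorted_window_update(sigma, e, n, key):
    """phi^sw_{sort,n}: keep at most n events, ordered ascending by `key`."""
    if len(sigma) == n:
        if key(e) < key(sigma[0]):
            return sigma          # sorts below everything stored: discard
        sigma = sigma[1:]         # evict the head of the ascending order
    out = sigma + [e]
    out.sort(key=key)             # stable, so equal keys keep arrival order
    return out
```

With n = 3 and plain integers as events, inserting 5, 3, 7 yields [3, 5, 7]; a subsequent 6 evicts 3, while 1 is discarded.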

Definition 3.13 (Unique Window Update Function). An event store update function φ^uw_uni ∶ E∗ × E → E∗ is a unique window update function if uni denotes criteria according to which every event has to be unique in the window and, given a new event e ∈ E and a stored sequence σ ∈ E∗:

φ^uw_uni(σ, e) =
  ⟨e1, ..., ei, e, ei+2, ..., e∣σ∣⟩,  if πuni(ei+1) = πuni(e)
  σ ○ ⟨e⟩,                            otherwise.

Similar to the keep-all window update function, the presented unique window update function does not specify the behaviour when the memory limit is exceeded. Again, there are the options to act like a length or first length window update function, but a perfect unique window is not possible for a fixed amount of memory and arbitrary uniqueness criteria.
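The replacement semantics of Definition 3.13 can be sketched as follows (a `uni` function stands in for πuni; naming is ours):

```python
def unique_window_update(sigma, e, uni):
    """phi^uw_uni: keep at most one event per value of the projection `uni`;
    a new event replaces the stored event carrying the same uniqueness key."""
    for i, stored in enumerate(sigma):
        if uni(stored) == uni(e):
            return sigma[:i] + [e] + sigma[i + 1:]
    return sigma + [e]
```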

Definition 3.14 (Externally-timed Window Update Function). An event store update function φ^etw_∆t,te ∶ E∗ × E → E∗ is an externally-timed window update function if ∆t is the maximum difference between the timestamp attribute te of the newest and oldest event in the window and, given a new event e ∈ E and a stored sequence σ ∈ E∗:

φ^etw_∆t,te(σ, e) =
  ⟨e1, ..., ei, e, ei+1, ..., e∣σ∣⟩,  if πte(e∣σ∣) − πte(e) ≤ ∆t ∧ πte(ei) ≤ πte(e) ≤ πte(ei+1)
  ⟨ei, ..., e∣σ∣, e⟩,                 if πte(e) − πte(ei) ≤ ∆t ∧ πte(e) − πte(ei−1) > ∆t

Note that πte is a projection to a timestamp value which is not necessarily the occurrence or arrival time of the event.
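A sketch of the externally-timed update (names are ours; a `ts` function stands in for πte): the sequence is kept ordered by the external timestamp, and events whose timestamp lies more than ∆t before the newest stored timestamp are evicted, matching the tail-time implementation described in Chapter 4.

```python
def externally_timed_update(sigma, e, dt, ts):
    """phi^etw_{dt,te}: order by the external timestamp `ts`; evict events
    older than the newest stored timestamp minus dt."""
    sigma = sorted(sigma + [e], key=ts)
    tail = ts(sigma[-1]) - dt       # tail time: greatest timestamp minus the interval
    return [ev for ev in sigma if ts(ev) >= tail]
```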

Definition 3.15 (First Time Window Update Function). An event store update function φ^ftw_t ∶ E∗ × E → E∗ is a first time window update function if t ∈ T is the time until which the window inserts new events, so given a new event e ∈ E and a stored sequence σ ∈ E∗:

φ^ftw_t(σ, e) =
  σ ○ ⟨e⟩,  if πt(⟨⟩) < t
  σ,        otherwise.

The event store update functions presented in the following also define changes of the event store that are not dependent on event arrival. To ensure that the event store stores events correctly, the following update functions have to be applied continuously.

3.2.2 Continuously applied update functions

Definition 3.16 (Time Window Update Function). An event store update function φ^tw_∆t ∶ E∗ × (E ∪ {⟨⟩}) → E∗ is a time window update function if ∆t is the maximum difference between the current time and any event in the window, so given a new event e ∈ E ∪ {⟨⟩} and a stored sequence σ ∈ E∗:

φ^tw_∆t(σ, e) = ⟨ei, ..., e∣σ∣⟩ ○ ⟨e⟩, for i such that πt(e) − πt(ei) < ∆t ∧ πt(e) − πt(ei−1) ≥ ∆t

Differently to the windows mentioned above, the time window is the first window that can change without a newly arriving event. Thus, the case may occur that e is the empty sequence, representing no arriving event. As previously defined, πt(e) is the current time then, to avoid a lot more distinct cases. When the window is updated, regardless of whether a new event arrives or not, all events that arrived earlier than the current time minus the time period of the window are removed.

Just like the length and first length window update functions, the time and first time window update functions have the main difference that the first time window update function does not remove any events. Hence, it is sufficient to update the first time window only on event arrival.
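The continuous update of the time window can be sketched as follows (our own rendering; `e` is None when the function is applied with no arrival, and `now` is injected instead of reading the system clock, so the eviction logic is testable):

```python
def time_window_update(sigma, e, dt, now):
    """phi^tw_dt: insert the arriving event tagged with its arrival time, then
    evict every event that arrived more than dt ago."""
    if e is not None:
        sigma = sigma + [dict(e, _arrival=now)]  # `_arrival` is our own bookkeeping field
    return [ev for ev in sigma if now - ev["_arrival"] < dt]
```

For ∆t = 4 and arrivals at times 1 and 2, a continuous update at time 5 evicts the first event (difference exactly 4) and keeps the second.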

Definition 3.17 (Time-Accumulating Window Update Function). An event store update function φ^taw_∆t ∶ E∗ × (E ∪ {⟨⟩}) → E∗ is a time-accumulating window update function if ∆t is a time interval and, given a new event e ∈ E ∪ {⟨⟩} and a stored sequence σ ∈ E∗:

φ^taw_∆t(σ, e) =
  σ ○ ⟨e⟩,  if πt(e) − πt(e∣σ∣) < ∆t
  ⟨e⟩,      otherwise.

Definition 3.18 (Time-Order Window Update Function). An event store update function φ^tow_∆t ∶ E∗ × (E ∪ {⟨⟩}) → E∗ is a time-order window update function if ∆t is the maximum time difference between the current time and any event in the window and, given a new event e ∈ E ∪ {⟨⟩} and a stored sequence σ ∈ E∗:

φ^tow_∆t(σ, e) =
  ⟨ei, ..., e∣σ∣⟩,                    if (πt(e) ≤ πt(⟨⟩) − ∆t ∨ e = ⟨⟩) ∧ πt(ei) > πt(⟨⟩) − ∆t ∧ πt(ei−1) ≤ πt(⟨⟩) − ∆t
  ⟨ei, ..., ej, e, ej+1, ..., e∣σ∣⟩,  if πt(e) > πt(⟨⟩) − ∆t ∧ πt(ej) ≤ πt(e) ≤ πt(ej+1) ∧ πt(ei) > πt(⟨⟩) − ∆t ∧ πt(ei−1) ≤ πt(⟨⟩) − ∆t

Let in the following πte(⟨⟩) also denote the current time for an attribute te providing a timestamp value.

Definition 3.19 (Time-To-Live Window Update Function). An event store update function φ^ttlw_te ∶ E∗ × (E ∪ {⟨⟩}) → E∗ is a time-to-live window update function if te is an attribute providing an expiration time and, given a new event e ∈ E ∪ {⟨⟩} and a stored sequence σ ∈ E∗:

φ^ttlw_te(σ, e) =
  ⟨ei, ..., e∣σ∣⟩,                    if (πte(e) < πte(⟨⟩) ∨ e = ⟨⟩) ∧ πte(ei) > πte(⟨⟩) ∧ πte(ei−1) ≤ πte(⟨⟩)
  ⟨ei, ..., ej, e, ej+1, ..., e∣σ∣⟩,  if πte(e) > πte(⟨⟩) ∧ πte(ej) ≤ πte(e) ≤ πte(ej+1) ∧ πte(ei) > πte(⟨⟩) ∧ πte(ei−1) ≤ πte(⟨⟩)

The following window update functions are defined for batch windows, which work differently to the previous windows.

3.2.3 Batch window update functions

Basically, a batch window consists of two event stores. One event store corresponds to the window output and the other event store is used to buffer events internally. New events can be inserted in both event stores, while events can be moved from the internal event store to the output store, but not the other way around. In general, a batch window update function accumulates events in the internal event store until a condition is satisfied, upon which the events are moved from the internal event store to the output event store. Note that all events of the output event store are replaced each time it is updated.

Definition 3.20 (Event Store Update Function for Batch Windows). An event store update function for batch windows φn is a function φn ∶ E∗ × E∗ × E → E∗ × E∗ if φn is applied only at event arrival, or a function φn ∶ E∗ × E∗ × (E ∪ {⟨⟩}) → E∗ × E∗ if φn is applied continuously. Furthermore, we require that given a maximum size n ∈ N0 of both event stores, a stored sequence σout ∈ E∗ of the output event store, a stored sequence σin ∈ E∗ of the internal event store and a new event e ∈ E or e ∈ E ∪ {⟨⟩} respectively, φn(σout, σin, e) ∈ {(σ′out, σ′in) ∣ (σ′out = σout ∧ σ′in = σin ○ ⟨e⟩ ∧ ∣σ′in∣ ≤ n) ∨ (σ′out = σin ○ ⟨e⟩ ∧ σ′in = ⟨⟩)}.

Definition 3.21 (Length Batch Window Update Function). An event store update function φ^lbw_n ∶ E∗ × E∗ × E → E∗ × E∗ is a length batch window update function if n denotes the batch size, i.e. the maximum size of each event store, and, given a new event e ∈ E, the content of the output event store σout ∈ E∗ and the content of the internal event store σin ∈ E∗:

φ^lbw_n(σout, σin, e) =
  (σin ○ ⟨e⟩, ⟨⟩),    if ∣σin ○ ⟨e⟩∣ = n
  (σout, σin ○ ⟨e⟩),  otherwise.
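The two-store mechanics of Definition 3.21 can be sketched directly (our naming; both stores are plain lists):

```python
def length_batch_update(sigma_out, sigma_in, e, n):
    """phi^lbw_n: buffer events internally; once n have been collected they
    replace the output store as a batch and the buffer is cleared."""
    sigma_in = sigma_in + [e]
    if len(sigma_in) == n:
        return sigma_in, []
    return sigma_out, sigma_in
```

With n = 2, the events 1, 2, 3 leave [1, 2] in the output store and 3 in the buffer.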

While the length batch window update function only changes the corresponding event stores when a new event arrives, the following update functions may also apply changes when no event arrives.

Definition 3.22 (Time Batch Window Update Function). An event store update function φ^tbw_∆t,tref ∶ E∗ × E∗ × (E ∪ {⟨⟩}) → E∗ × E∗ is a time batch window update function if ∆t is the batch time interval, tref is the reference point for the first batch release and, given a new event e ∈ E ∪ {⟨⟩}, the content of the previous output event store σout ∈ E∗ and the content of the internal event store σin ∈ E∗:

φ^tbw_∆t,tref(σout, σin, e) =
  (σin ○ ⟨e⟩, ⟨⟩),    if (πt(⟨⟩) − tref) mod ∆t = 0
  (σout, σin ○ ⟨e⟩),  otherwise.

Definition 3.23 (Time-Length Combination Batch Window Update Function). An event store update function φ^tlw_n,∆t,tref ∶ E∗ × E∗ × (E ∪ {⟨⟩}) → E∗ × E∗ is a time-length combination batch window update function if n denotes the maximum batch size, i.e. the maximum size of each event store, ∆t is the batch time interval, tref denotes the reference point for the next batch release and, given a new event e ∈ E ∪ {⟨⟩}, the content of the output event store σout ∈ E∗ and the content of the internal event store σin ∈ E∗:

φ^tlw_n,∆t,tref(σout, σin, e) =
  (σin ○ ⟨e⟩, ⟨⟩),    if πt(⟨⟩) = tref ∨ ∣σin ○ ⟨e⟩∣ = n
  (σout, σin ○ ⟨e⟩),  otherwise.

After this formal framework for describing event streams and event stores with their corresponding update functions, we will now take a more practical look at the previously described techniques in order to implement them.


Chapter 4

Processing Events using CEP

4.1 Considerations

The general idea of the implemented CEP techniques is to capture one or many event streams in distinct event logs. Newly arriving events are not only inserted into an event log but also processed by custom data windows, where each data window belongs to exactly one event log, while multiple data windows can belong to the same event log. Each event log and each data window is represented by a single collection in the MongoDB database. The functionalities of the data windows are described later on in Section 4.2.

• When the events captured in a data window change, the data window is evaluatedand the results are stored in a result collection.

• When the results match a predefined rule, an alert is triggered and saved to the alertcollection.

• Furthermore there is the possibility of writing alerts into the Windows Event Logusing a Python script.

The example in Figure 4.1 shows a possible structure of the approach. There are two different event streams that are each saved to an event log. The first event log has a corresponding time window and length window. The second event log has a corresponding length window. The results obtained from the two length windows are saved in the same collection, while the results of the time window are saved in a different collection. Both result collections are then used to trigger alerts. In general, all components are loosely coupled to make an easy exchange, for example of a data window, possible.

Figure 4.1: Representation of some MongoDB collections supporting the techniques described in this thesis

There are two conceptually different kinds of collections used in the framework. On the one hand, there are the collections of the event logs, saving all incoming events of an event stream. As an event stream is assumed to run infinitely, the number of events in the event logs is always finite but unbounded, or only bounded indirectly by the amount of available memory. On the other hand, there are the collections of the data windows, which usually satisfy a condition bounding the number of events to a certain size or time period. Therefore the data windows need to keep satisfying the condition and require not only inserting new events, but also removing expired events.

To insert the new events, each data window maintains a change stream that notifies it of insertions of events into the corresponding event log. Change streams are a feature implemented by MongoDB and track the changes applied either to a specified collection or to a whole database. Furthermore, a change stream can be configured to notify only particular changes, for example the insertion of events having a certain attribute. For instance, this is relevant for externally-timed windows requiring the events to have a given timestamp attribute. When the insertion of an event into an event log is notified to a data window, the data window is updated. Length and time windows then always insert the new event and, if necessary, remove the oldest event. The updates of the time window furthermore do not only take place when a new event is inserted: if an event exceeds the predefined time span, it is also removed from the data window. Note that there are also data windows, such as the time-order window, which do not necessarily insert every arriving event. In the following, the windows listed below will be introduced and their behavior on a stream explained.

• Time-bounded Windows (Windows considering a particular time period)

– Time Window

– First Time Window

– Time Batch Window

– Time-Accumulating Window

– Externally-Timed Window

– Time-Order Window

– Time-To-Live Window

• Length-bounded Windows (Windows considering a particular number of events)

– Length Window

– First Length Window

– Length Batch Window

– Sorted Window

• Other Windows (unbounded windows or windows bounded in time and size)


– Unique Window

– Keep-All Window

– Time-Length Combination Batch Window

4.2 Data Windows

In the following, the different implemented data windows are presented. For practical reasons, it is more interesting to classify the windows by their bounding condition instead of by whether the window is updated only on event arrival or continuously. Thus, we distinguish between windows bounded by a number of events, by a time period, or otherwise. Generally, every data window has two main functions.

• First, there is an initialization function setting up the window according to the given parameters. For example, indices used by the window are built within the initialization function.

• Second, there is the update function, which was introduced in the last chapter, deciding which events are inserted into the window and which are removed.

In order to get notified whenever a new event is inserted into an event log, every data window deploys a change stream on the corresponding log, filtering for changes caused by the insertion of an event.
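Deploying such a change stream with pymongo might look like the following sketch. The function names are ours; `db` is assumed to be a pymongo `Database` backed by a replica set, since MongoDB change streams require one.

```python
def insert_only_pipeline():
    """Aggregation pipeline restricting a change stream to insert operations."""
    return [{"$match": {"operationType": "insert"}}]

def watch_event_log(db, log_name, on_event):
    """Forward every event inserted into the log collection to `on_event`.
    Requires a live MongoDB connection; blocks while waiting for changes."""
    with db[log_name].watch(insert_only_pipeline()) as stream:
        for change in stream:
            on_event(change["fullDocument"])
```

Additional `$match` conditions, e.g. on `fullDocument.<attribute>`, restrict the notifications to events carrying a required attribute, as needed by the externally-timed windows.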

4.2.1 Time Windows

Time Window

Given a time interval, the time window keeps only the events that arrived within the interval from the current time into the past. Conceptually, this window slides continuously, meaning that the events kept by the window arrived exactly within the time interval. In practice though, the window slides only when it is updated, which happens in discrete time intervals. Note that the updates do not only take place when a new event arrives: if the window were only updated on event arrival, it would eventually become inconsistent once no more events arrived. Figure 4.2 depicts an example of a time window for a time interval of 4 seconds. When event e1 arrives at t = 1, it is inserted into the window. Similarly, events e2 and e3 are inserted at t = 2 and t = 4 respectively. When the window is updated at t = 5, e1 is removed after being in the window for 4 seconds. Equally, e2 is removed at t = 6 and e3 is removed at t = 8. When e4 is removed at t = 11, the window is empty again.

To implement the time window, new events stored by the window get an additional attribute that denotes the event's arrival time, which is not necessarily the time when the event took place. Besides the event insertion, the update function of the window has to check continuously whether events have to be removed. In order to make querying the events more efficient, an index is created on the added arrival time attribute.
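A pymongo sketch of this update step (the attribute name `_arrival` and the function names are our assumptions, not the thesis code):

```python
from datetime import datetime, timedelta, timezone

def expired_filter(dt_seconds, now=None):
    """MongoDB query selecting events whose arrival time fell out of the window."""
    now = now or datetime.now(timezone.utc)
    return {"_arrival": {"$lt": now - timedelta(seconds=dt_seconds)}}

def update_time_window(window, event, dt_seconds):
    """Insert the new event tagged with its arrival time, then purge expired events.
    `window` is assumed to be a pymongo Collection with an index on `_arrival`."""
    window.insert_one(dict(event, _arrival=datetime.now(timezone.utc)))
    window.delete_many(expired_filter(dt_seconds))
```

The `delete_many` call would also be issued periodically without an insertion, implementing the continuous part of the update.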

First Time Window

Given a time interval, the first time window keeps all events that arrive within the interval, beginning when the window is started. Thus the window does not slide at all. Figure 4.3


Time Window (time interval ∆t)
  Description: Retains all events having arrived within the last ∆t seconds.
  Use case: Computing a moving average, such as the average stock price of the last hour.

First Time Window (time interval ∆t)
  Description: Retains the events arriving in the first ∆t seconds and ignores following events.
  Use case: Store all registrations for a product until a deadline.

Time Batch Window (time interval ∆t, [reference point r])
  Description: From the reference point r, which is 0 seconds by default, every ∆t seconds all events that arrived within this interval are output at once.
  Use case: Data reduction in large networks [31] by summarizing events.

Time-Accumulating Window (time interval ∆t)
  Description: Retains all events and is cleared when no events arrive for ∆t seconds.
  Use case: Detecting the absence of events when events should arrive continuously. The aggregation results of the window are null then.

Externally-Timed Window (time interval ∆t, timestamp expression expr)
  Description: Retains the newest events according to expr within the interval of ∆t seconds.
  Use case: Failure recovery in event processing applications [32].

Time-Order Window (time interval ∆t, timestamp expression expr)
  Description: Retains the events with a timestamp provided by expr that is within the last ∆t seconds.
  Use case: Processing events arriving asynchronously according to the system clock.

Time-To-Live Window (timestamp expression expr)
  Description: Retains events until the point in time provided by expr.
  Use case: In a fee-based WiFi network, monitor the number of paid memberships that are currently active.

Table 4.1: Time-bounded windows


[Figure: timeline of event arrivals e1–e4 and the window instances after each update]
Figure 4.2: Time Window

shows a first time window for a time interval of 5 seconds. t = 0 marks the point in time when the window is started, so the window stops gathering events at t = 5. Hence, until event e3, every arriving event is stored in the window. Event e4 is disregarded though, as it arrives after t = 5.

In the implementation, the first time window does not build an index on its collection, unlike the time window. Furthermore, the underlying algorithm of the first time window terminates after the given time. Being a collection in the MongoDB database, the events of the window are still available then, but are not changed further. The update function of the first time window hence only inserts new events.

[Figure: timeline of event arrivals e1–e4 and the window instances after each update]
Figure 4.3: First Time Window

Time Batch Window

The time batch window is given a time interval and optionally a reference point for outputting the first batch. When no reference point is given, the time batch window waits for the first event in order to set the reference point. The window then gathers events for the given time interval and outputs them at once in a batch afterwards.

Figure 4.4 shows a time batch window for an interval of 5 seconds and no reference point. Hence, the first time interval starts with the first event arrival at t = 1. Then the events arriving within the next 5 seconds are gathered and output at t = 6. Afterwards, the following events are gathered and output at t = 11. In general, only the last batch of events can be accessed. For example, event e1 is not accessible before t = 6 although it


arrived already at t = 1.

In the implementation, the arriving events are buffered until the next release of the batch. In particular, the events are released by inserting them into the corresponding MongoDB collection, overwriting the previous batch of events.
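The release step might be sketched as follows with pymongo (function name ours; `window` is assumed to be a pymongo `Collection` on a live server):

```python
def release_batch(window, buffer):
    """Replace the window collection's previous batch with the buffered events."""
    window.delete_many({})        # drop the previous batch
    if buffer:
        window.insert_many(buffer)
    return []                     # the internal buffer restarts empty
```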

[Figure: timeline of event arrivals e1–e4 and the batches output after each interval]
Figure 4.4: Time Batch Window

Time-Accumulating Window

The time-accumulating window is given a time interval, so that after no events have arrived for the length of the interval, the window is cleared.

Figure 4.5 shows a time-accumulating window for an interval of 3 seconds. Thus, if no events arrive for 3 seconds, the window is cleared, as at t = 6 and t = 10. To implement the time-accumulating window, it is necessary to store the time of the most recently inserted event. Any arriving event is then inserted into the window, while the time of the most recent insertion is updated. Apart from that, the elapsed time since the last insertion is continuously checked, so that when the predefined interval is exceeded, the window is cleared.

[Figure: timeline of event arrivals e1–e4 and the window instances after each update]
Figure 4.5: Time-Accumulating Window

Externally-Timed Window

The externally-timed window takes a timestamp expression and a time interval and stores the most recent events within that time interval according to the expression. A timestamp expression is an attribute that is evaluated as a timestamp value. The main idea of the window is to enable the processing of events according to timestamp data that do not refer to the system time, for example past event data where the arrival time matters. Differently to the windows presented before, the externally-timed window does not follow the runtime and hence is only updated when a new event arrives. In order to compare events, all events stored by the window need to have the specified timestamp attribute, as other events are disregarded.

Figure 4.6 shows an externally-timed window with a time interval of 4 seconds. The positions of the events below the time axis mark the timestamp values indicated by the corresponding expression according to an external clock. Thus, event e1 has timestamp 1, e2 has timestamp 2, e3 has timestamp 5 and e4 has timestamp 7. The dotted lines then indicate the arrival time according to the runtime, at which the window updates take place. At t = 3, the two arriving events e1 and e2 are inserted into the window, as the time difference of the external time equals 1 second. When event e3 arrives at t = 5, e1 is removed as its difference to e3 is greater than 3 seconds, while e2 remains, having a difference of exactly 3 seconds to e3. At t = 10, e4 arrives, so e2 has a difference greater than 3 seconds to e4 and is therefore removed. In contrast to the time window, an event could possibly remain longer than the specified time period in the externally-timed window. Assuming the external clock would coincidentally equal the system clock, event e3 would remain more than 3 seconds in the window.

The implemented update function for the externally-timed window considers the tail time of the window, which is the greatest timestamp of any event minus the given time interval. Any event having a timestamp smaller than the tail time of the window is then removed.

[Figure: timeline of event occurrences e1–e4 (external timestamps) and the window instances after each update]
Figure 4.6: Externally-Timed Window

Time-Order Window

The time-order window takes a time interval and a timestamp expression, similar to the externally-timed window. The key difference is that the time-order window follows the system clock and is therefore updated even if no event arrives.

Figure 4.7 shows a time-order window with a 4-second time interval. The events depicted below the time axis mark the occurrences according to the runtime. The dotted lines then determine the arrival time of the events. Differently to the time window, the events are removed according to their occurrence and not their arrival. Thus, e1 is removed from the window at t = 5 after remaining only 2 seconds in the window.

In the implementation, the time-order window maintains the tail time of the window similar to the externally-timed window. Thus, all events not providing a timestamp value greater than the tail time are constantly removed from the window. Independently of the event removal, a new event is inserted whenever it possesses the required timestamp attribute and fits into the window.

[Figure: timeline of event occurrences e1–e4 and the window instances after each update]
Figure 4.7: Time-Order Window

Time-To-Live Window

The time-to-live window requires events to have a certain timestamp expression marking the point in time at which to remove the event from the window. Basically, the time-to-live window works the same way as the time-order window before. In principle, the tail time of the window here equals the current time, so that events are required to have a timestamp value greater than the current time, as the corresponding event timestamps are interpreted as an expiration date in this window.

Figure 4.8 shows a time-to-live window. The positions of the events below the time axis correspond to the expiration time provided by the required timestamp, while the dotted lines mark the event arrival. For example, event e1 arrives at t = 1 and is removed again at t = 4. Event e4 arrives exactly at its expiration time and is therefore not inserted into the window.

[Figure: timeline of event expirations e1–e4 and the window instances after each update]
Figure 4.8: Time-To-Live Window

4.2.2 Length Windows

Length Window


Length Window (size n)
  Description: Retains the n events that arrived most recently.
  Use case: Computing a moving average bounded in size, such as the average stock price of the last 100 buy orders.

First Length Window (size n)
  Description: Retains the first n arriving events and ignores the following events.
  Use case: For a limited amount of possible registrations for a product, store the first ones.

Length Batch Window (size n)
  Description: Batches events and outputs them at once when n events have been collected.
  Use case: Summarize events, e.g. if there are too many 'small' events.

Sorted Window (size n, sort criteria cr)
  Description: Retains the n maximum or minimum events according to the criteria cr.
  Use case: Store the largest amounts of stocks that are sold or bought in a day.

Table 4.2: Length-bounded windows

The length window is given a certain size, bounding the number of events kept by the window. Similar to the time window, events are removed from the length window in the same order in which they arrive, while the length window is only updated on event arrival.

Figure 4.9 shows a length window of size 3. Thus, any arriving event is inserted into the window and removed again when 3 more events have arrived. For example, e1 is the event arriving first and is removed when e4 arrives.

To implement the length window, MongoDB's concept of capped collections is used to bound a collection in size. A capped collection allocates an amount of memory and has a maximum number of documents. If either the size or the memory limit is reached and a new document should be inserted, the oldest documents are removed until the new document fits into the collection. Thus, a capped collection works similar to a circular buffer. To use a capped collection as a length window, the amount of memory has to be sufficiently large so that the given number of events can be held by the collection. Then the events only need to be inserted by an update function, as the removal is performed automatically by MongoDB.
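Creating such a capped collection with pymongo might look like this sketch (function names and the `avg_event_bytes` estimate are our assumptions):

```python
def capped_window_options(n, avg_event_bytes=1024):
    """Creation options for a capped collection acting as a length window of size n.
    The byte size is chosen generously so that the document limit `max`, not the
    byte limit `size`, is what evicts old events."""
    return {"capped": True, "size": max(4096, n * avg_event_bytes), "max": n}

def init_length_window(db, name, n):
    """`db` is assumed to be a pymongo Database with a live connection."""
    return db.create_collection(name, **capped_window_options(n))
```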

First Length Window

The first length window is given a size which determines the number of first-arriving events to be stored by the window. The difference to the length window is that the first length window does not slide. Thus, after the given number of events has occurred, the window is not updated anymore. Figure 4.10 shows a first length window of size 3. Hence, the first 3 events arriving on a stream are stored, while all following events are disregarded.

In the implementation, the first length window is not making use of capped collections,


Chapter 4. Processing Events using CEP


Figure 4.9: Length Window

as a removal of events is not required. Instead, the number of inserted events is counted until the predefined size is reached.


Figure 4.10: First Length Window

Length Batch Window

The length batch window is given a certain batch size. The window gathers events and, when the given number of events is collected, outputs them in a batch. Afterwards, events are gathered again and, when the size is reached, the previous batch of events is replaced by the new events. Figure 4.11 shows a length batch window of size 2. Thus, when two events are buffered, they are released as a batch. Note that, similar to the time batch window, only the most recent batch of events is stored by the window.

Similar to the length window, the length batch window uses a capped collection, initialized to the batch size, in the implementation. Hence, when a batch is output, the previous batch is just overwritten.
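The batching semantics can be sketched with the following Python model (again a simplified in-memory sketch, not the actual mongo shell code): events are buffered and released in batches of the given size, and only the most recent batch is kept.

```python
class LengthBatchWindow:
    """In-memory model of a length batch window.

    Events are buffered until `size` events have been collected; the full
    buffer is then released as a batch, replacing the previous batch
    (the window stores only the most recent batch).
    """

    def __init__(self, size):
        self.size = size
        self.buffer = []
        self.last_batch = []

    def insert(self, event):
        self.buffer.append(event)
        if len(self.buffer) == self.size:
            # Release the batch; the previous batch is overwritten, just
            # as the capped collection is overwritten in the implementation.
            self.last_batch = self.buffer
            self.buffer = []
            return self.last_batch  # batch is output
        return None  # still gathering

window = LengthBatchWindow(2)
for event in ["e1", "e2", "e3", "e4"]:
    batch = window.insert(event)
    if batch is not None:
        print(batch)  # ['e1', 'e2'], then ['e3', 'e4'] (cf. Figure 4.11)
```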

Sorted Window

The sorted window is given a size and some sort criteria; it stores the given number of events with the highest priority according to the sort criteria, such as the highest value of a certain attribute.

Figure 4.12 shows a sorted window of size 3, sorted ascending by an attribute. The attribute value is represented by the right number



Figure 4.11: Length Batch Window

of each event. After the first three events are inserted, the maximum size of the window is reached. Hence, if a new event has a higher priority than an old event, the old event is removed when the new event is inserted. Event e4 has an attribute value of 2, which has a higher priority than the value 5 of event e1, the event with the least priority in the window. Thus, e1 is replaced by e4. Event e5 has the same priority as the least prioritized event e3 and therefore replaces e3, as in case of equal priority, the newest event is stored by the window. Event e6 has a smaller priority than all events in the window and is hence not inserted.

To implement the sorted window, an index for the given sort criteria is created on the corresponding collection to handle the comparisons of events more efficiently.
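The replacement logic can be sketched in Python as follows. This is an in-memory model of the ascending sorted window from Figure 4.12; the actual implementation relies on a MongoDB index instead of a sorted Python list.

```python
class SortedWindow:
    """In-memory model of a sorted window (ascending sort).

    Retains the `size` events with the smallest attribute values. On a
    tie, the newest event replaces the older one, matching the behavior
    shown in Figure 4.12.
    """

    def __init__(self, size, key):
        self.size = size
        self.key = key
        self.events = []  # kept sorted ascending by key

    def insert(self, event):
        if len(self.events) < self.size:
            self.events.append(event)
        else:
            # The least prioritized event is the one with the largest key.
            worst = max(self.events, key=self.key)
            # '<=' implements the tie-breaking rule: on equal priority
            # the newest event is stored.
            if self.key(event) <= self.key(worst):
                self.events.remove(worst)
                self.events.append(event)
        self.events.sort(key=self.key)

# The stream of Figure 4.12: (event name, attribute value)
stream = [("e1", 5), ("e2", 3), ("e3", 4), ("e4", 2), ("e5", 4), ("e6", 8)]
window = SortedWindow(3, key=lambda ev: ev[1])
for event in stream:
    window.insert(event)

print(window.events)  # [('e4', 2), ('e2', 3), ('e5', 4)]
```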


Figure 4.12: Sorted Window

4.2.3 Other Windows

Unique Window

The unique window is given some uniqueness criteria, which is a set of attributes; the events stored by the window must have distinct values in these attributes. All arriving events are inserted into the window. If there is already an event in the window having the same attribute values for the specified unique attributes, it is replaced by the new event. Note that the unique window is not directly bounded in size, but should only be used on attributes with finite domains and especially not with timestamps such as the arrival time.


Unique Window (unique criteria cr): retains the newest unique events according to cr.
  Use case: keeping the most recent event for each case in conformance checking.

Keep-All Window (): retains all events and is therefore equivalent to an event log.
  Use case: storing all incoming events in an event log.

Time-Length Combination Batch Window (time interval ∆t, size n): outputs the last collected events when either n events have been collected or ∆t seconds have passed.
  Use case: a time batch window avoiding too large batches in case of a high event arrival rate.

Table 4.3: Other windows

Figure 4.13 shows a unique window requiring a unique color for the events. In case of e3 and e5, an existing event with the same color is replaced in the window by the new event. The other events are inserted without replacement.

For the implementation, an index is created on the attributes specified by the unique criteria. Before a new event is inserted, an event having the same values for the unique attributes is deleted in case it exists.
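The delete-then-insert step can be modeled compactly by keying the stored events on the values of the unique attributes, as the following Python sketch shows (an in-memory model of Figure 4.13, not the MongoDB code):

```python
class UniqueWindow:
    """In-memory model of a unique window.

    Events are keyed by the values of the unique attributes; a new event
    replaces an existing event with the same key, mirroring the
    delete-then-insert step of the MongoDB implementation.
    """

    def __init__(self, unique_attrs):
        self.unique_attrs = unique_attrs
        self.events = {}  # key: tuple of unique attribute values

    def insert(self, event):
        key = tuple(event[attr] for attr in self.unique_attrs)
        # Replaces any previous event with the same unique values.
        self.events[key] = event

window = UniqueWindow(["color"])
for event in [{"id": "e1", "color": "red"},
              {"id": "e2", "color": "blue"},
              {"id": "e3", "color": "red"},    # replaces e1
              {"id": "e4", "color": "green"},
              {"id": "e5", "color": "blue"}]:  # replaces e2
    window.insert(event)

print(sorted(e["id"] for e in window.events.values()))  # ['e3', 'e4', 'e5']
```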


Figure 4.13: Unique Window

Keep-All Window

The keep-all window is also a data window implemented by Esper. As all events are stored within an event log in this implementation, extra code for this window is not necessary. To establish a keep-all window, the event log can be evaluated in the same way as the other windows are.

Time-Length Combination Batch Window

The time-length combination batch window combines the properties of the time batch window and the length batch window described previously. It takes a maximum batch size



Figure 4.14: Keep-All Window

and a maximum batch time as input. When either the maximum number of events has been buffered since the last batch, or the maximum time has passed, the buffered events are released in a batch. Similar to the other batch windows, the window only stores the most recent batch. Unlike the time batch window, this window does not take a reference point for the first batch as an input.

Figure 4.15 shows a time-length combination window of size 3 and time 5 seconds. When the window starts at t = 0, the next reference point in time for a batch is t = 5. However, three events arrive before t = 5, reaching the maximum size of a batch. Thus, the first batch is output at t = 4 when e3 arrives. The next reference point in time is then t = 9, which is the time of the most recent batch plus the given time. As fewer than three events arrive until t = 9, the next batch, containing only e4, is output then.

For the implementation, using a capped collection is not possible, as its documents cannot be deleted but only overwritten. Thus, a variable batch size would not be possible without deleting and recreating the whole collection, which is why a normal collection is used here.
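The combined trigger logic can be sketched as follows. This Python model replays the example of Figure 4.15 with explicitly passed timestamps to keep it deterministic; the real implementation uses the system clock and a normal MongoDB collection.

```python
class TimeLengthBatchWindow:
    """In-memory model of the time-length combination batch window.

    A batch is released when either `size` events have been buffered or
    `interval` seconds have passed since the last batch.
    """

    def __init__(self, size, interval, start=0):
        self.size = size
        self.interval = interval
        self.deadline = start + interval  # next time-based batch point
        self.buffer = []
        self.last_batch = []
        self.batches = []  # kept here only to trace the example

    def _release(self, now):
        self.last_batch = self.buffer        # only the newest batch is kept
        self.batches.append(self.buffer)
        self.buffer = []
        self.deadline = now + self.interval  # reference: time of last batch

    def insert(self, event, now):
        self.tick(now)
        self.buffer.append(event)
        if len(self.buffer) == self.size:
            self._release(now)

    def tick(self, now):
        # Time-driven release, also used when no event arrives.
        while now >= self.deadline:
            self._release(self.deadline)

window = TimeLengthBatchWindow(size=3, interval=5)
window.insert("e1", 1)
window.insert("e2", 2)
window.insert("e3", 4)   # size reached: batch [e1, e2, e3] at t = 4
window.insert("e4", 6)
window.tick(9)           # interval expired: batch [e4] at t = 9
print(window.batches)    # [['e1', 'e2', 'e3'], ['e4']]
```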


Figure 4.15: Time-Length Combination Batch Window

4.3 Footprints Conformance Checking

In this section, we will have a look at the technique of footprints, a conformance checking technique on streams that provides a useful extension of the presented


CEP techniques. Conformance checking is a field in process mining that aims to detect deviations between a given process model, such as a Petri net, and the corresponding execution data.

The footprint of a process describes the relationship between its activities, meaning that it specifies the allowed successors for each activity. In this case, we assume that a process model is known and that it is given manually in the form of the allowed directly-follows arcs, i.e. all pairs of activities where the first activity may be followed by the second activity. Thus, for a given directly-follows relation, we can check whether the events of an event stream correspond to the modelled behavior. The setting differs from the classical one in conformance checking, as the execution data is usually given in an offline event log, while here a stream provides the events, creating an 'online' scenario. When an event is not an allowed successor of the last event of the same case, an alert is triggered, indicating that the model is violated.

Case 1: A B C D
Case 2: A E C D
Case 3: A C D E

Figure 4.16: Technique of Footprints process example

Figure 4.16 shows an example of three different cases and the order of their activities. We assume that the directly-follows relation of the underlying process is given as A → B, B → C, C → D, A → E and E → B. For simplicity, we will look at the behavior of the algorithm by case instead of by the arrival time of the events.

• The first event of case 1 is of activity A. As there is no previous event for case 1, the algorithm just stores this event as the most recent event for case 1. When the second event, of activity B, arrives, the previous event of activity A is considered and it is checked whether A → B is contained in the directly-follows relation. This transition is contained, so the algorithm proceeds and stores B as the most recent event for case 1. The same procedure is applied to the next two events. As B → C and C → D are also contained in the directly-follows relation, no alert is triggered.

• For case 2, the first event is also of activity A. The next event, of activity E, is allowed to follow activity A. After that, an event of activity C follows. As E → C is not contained in the directly-follows relation, an alert is triggered that an event of activity C is not allowed. Still, the event is stored as the most recent event of case 2, so when the next event, of activity D, arrives, C → D is contained in the directly-follows relation and therefore allowed.

• For case 3, the first event is of activity A again. As A → C is not contained in the directly-follows relation, an alert is triggered when the event of activity C arrives. C → D is contained in the directly-follows relation, so no alert is triggered when the event of activity D arrives. When the event of activity E arrives, D → E


is checked, which is not in the directly-follows relation, so an alert is triggered.

In the implementation of the algorithm, the directly-follows relation is represented as an array of sets. This means that for each possible activity, the set of allowed following activities is stored. An alert can then be triggered when the current activity of a case is not in the set of allowed successors of the previous activity of the same case.
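A minimal Python sketch of this check, using the directly-follows relation from the example above (a dictionary of sets stands in for the array of sets):

```python
# Directly-follows relation of the model: for each activity the set of
# allowed successors (A → B, B → C, C → D, A → E, E → B).
allowed = {
    "A": {"B", "E"},
    "B": {"C"},
    "C": {"D"},
    "E": {"B"},
}

def check_stream(events, allowed):
    """Footprint-based conformance check on an event stream.

    `events` is a sequence of (case_id, activity) pairs. For each case
    the most recent activity is stored; an alert is raised when the
    current activity is not an allowed successor of the previous one.
    """
    last = {}    # most recent activity per case
    alerts = []
    for case, activity in events:
        if case in last and activity not in allowed.get(last[case], set()):
            alerts.append((case, last[case], activity))
        # The event is stored even if it violated the model, so the
        # check continues from the actual behavior.
        last[case] = activity
    return alerts

# The three cases of Figure 4.16, interleaved as they would be on a stream.
stream = [(1, "A"), (2, "A"), (3, "A"),
          (1, "B"), (2, "E"), (3, "C"),
          (1, "C"), (2, "C"), (3, "D"),
          (1, "D"), (2, "D"), (3, "E")]
print(check_stream(stream, allowed))
# [(3, 'A', 'C'), (2, 'E', 'C'), (3, 'D', 'E')]
```

As in the walkthrough above, case 1 raises no alert, case 2 raises one alert (E → C) and case 3 raises two (A → C and D → E).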

Often, footprints are illustrated in a matrix [2]. Table 4.4 shows the footprint of the aforementioned directly-follows relation, which is the underlying model, and Table 4.5 shows the footprint of the example process corresponding to the actual execution data.

   A  B  C  D  E
A  #  →  #  #  →
B  ←  #  →  #  ←
C  #  ←  #  →  #
D  #  #  ←  #  #
E  ←  →  #  #  #

Table 4.4: Footprint of the directly-follows relation

   A  B  C  D  E
A  #  →  →  #  →
B  ←  #  →  #  #
C  ←  ←  #  →  ←
D  #  #  ←  #  →
E  ←  #  →  ←  #

Table 4.5: Footprint of the process example

The footprint matrix representation can be inferred from the directly-follows relation. This way, the differences between the footprints of the model and of the execution data are visualized. In the footprint matrix, the ordering of activities is denoted by the sequence relation (→, ←), the parallel relation (∣∣) and no relation (#). A → B, or B ← A respectively, means that activity A is followed by activity B, but activity A never follows B. A # B describes that A is never followed by B and B is never followed by A. The third relation, A ∣∣ B, which is not present in the footprint matrices above, means that A is followed by B and B is followed by A as well.
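This inference step can be sketched directly in Python. The snippet below (an illustrative sketch, not part of the implementation) derives the footprint matrix of Table 4.4 from the model's directly-follows relation:

```python
def footprint_matrix(activities, df):
    """Derive the footprint matrix from a directly-follows relation.

    `df` is a set of (x, y) pairs meaning "x is directly followed by y".
    For each ordered pair of activities the relation is '||' if both
    directions occur, '→'/'←' if only one does, and '#' otherwise.
    """
    matrix = {}
    for a in activities:
        for b in activities:
            forward, backward = (a, b) in df, (b, a) in df
            if forward and backward:
                matrix[(a, b)] = "||"
            elif forward:
                matrix[(a, b)] = "→"
            elif backward:
                matrix[(a, b)] = "←"
            else:
                matrix[(a, b)] = "#"
    return matrix

# Directly-follows relation of the model from the example above.
df_model = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "B")}
m = footprint_matrix("ABCDE", df_model)
print([m[("A", x)] for x in "ABCDE"])  # ['#', '→', '#', '#', '→']
```

The printed row matches row A of Table 4.4; the remaining rows follow the same rule.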

In general, the technique of footprints is a very rough conformance checking algorithm.More insights can be provided for example using token-based replay or alignments [2].


Chapter 5

Implementation

5.1 Requirements

The implemented techniques require a running MongoDB deployment using a replica set with at least two nodes in order to use change streams. In particular, MongoDB version 4.0.4 has been used for the tests. Detailed guides to install MongoDB can be found at [33] and to deploy a replica set at [34]. A replica set can be deployed on a single machine for testing purposes. To do so, the 'mongod' process, which is the main process in MongoDB, has to be executed in multiple instances on different ports and with different data directories. By specifying these running processes as members, the replica set can be deployed. Furthermore, the event streams to be processed are required to be stored in MongoDB and thus have to be encoded in JSON. Multiple streams can be stored in a single collection as well as in distinct collections.

5.2 How to use the Implementation

The GitLab repository can be reached at https://git.rwth-aachen.de/tim.bauerle/cep_on_mongodb. Recall the example architecture of the implementation in Figure 5.1. In the following, we will configure the implementation step by step to match the shown architecture.

Figure 5.1: Implementation and assessment architecture

5.2.1 Configure the Statistics

Before the data windows can be started, a suitable callback function must be provided in order to evaluate the events in the data window. That function is called from the window-specific script each time the window is updated. When the given architecture is used with the given event streams, a suitable function that aggregates the current events of the data window is already provided. Otherwise, a JavaScript function with the signature aggregate(dataWindow, statsCollection) has to be created. The dataWindow parameter is the identifier of the data window to be evaluated, and the statsCollection parameter specifies where the result of the function should be stored. The file containing the function has to be specified when the data window is started. Each data window can use a different callback function, which requires the functions to be in separate files. In the presented architecture, each window uses the same function in the same file.

5.2.2 Configure the Alerts

As the complex events that should be detected have to be defined as well, a callback function has to be provided that defines the rules for the desired complex events. This function is called each time the window is updated, after the previously mentioned aggregate function has been called. In particular, these rules query the results computed by the previous function. Similar to the previous section, a function with the signature checkAlerts(windowName, statCollection, alertCollection) has to be provided. For the presented architecture, this is already the case. The statCollection parameter specifies the collection to read the current statistics from. The windowName parameter can be used to query the statistics for a particular data window. In case a rule is matched, the alert is inserted into the collection specified by alertCollection.

5.2.3 Register the Event Streams

After the callback functions have been provided, the actual streams can be started. In the presented architecture, we need two event streams so that incoming events are stored in two distinct event logs. For demonstration or testing purposes, we can easily generate event streams for event log 1 and event log 2 using the streamStart.bat file. After entering the connection parameters (example parameters are shown in Table 5.1) and naming the event log to store the events, a stream of one event per second is generated and stored in the specified collection. Note that the MongoDB collection representing the event log is created automatically in case it does not exist yet.

Connecting a stream that is not generated by the script provided here takes a little more effort. To insert an event into MongoDB, it has to be specified in the JSON format. Thus, if the events arriving in the stream are not in JSON, the relevant attributes have to be extracted and provided as event = { [Key1] : [Value1], ..., [KeyN] : [ValueN] }. To insert a new event into eventLog1, for example, we can then use db.eventLog1.insert(event).
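For illustration, the extraction step might look as follows in Python. The CSV layout and the attribute names symbol, price and time are assumptions chosen for this example, not part of the actual streams; any attribute names matching the real feed can be used instead.

```python
import json

def csv_line_to_event(line, keys):
    """Turn one line of a hypothetical CSV stock feed into a JSON event."""
    values = line.strip().split(",")
    event = dict(zip(keys, values))
    # Keep numeric fields numeric so that aggregations such as an
    # average price work on numbers rather than strings.
    event["price"] = int(event["price"])
    return event

event = csv_line_to_event("ACME,7,2019-11-02T10:15:00",
                          ["symbol", "price", "time"])
print(json.dumps(event))
# {"symbol": "ACME", "price": 7, "time": "2019-11-02T10:15:00"}
```

The resulting document can then be inserted with db.eventLog1.insert(event) in the mongo shell, as described above.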

In the following, we assume that two event streams exist, each stored in a collection named "eventLog1" and "eventLog2" respectively. Note that it is also possible to set up


the streams at the very end, for example when a particular event sequence should be tested. However, setting up the data windows requires that the event log collections already exist in MongoDB.

5.2.4 Start the Data Windows

After the event streams are set up, we would like to process them using time and length windows. To deploy a data window, first of all the corresponding batch file has to be started, which is timeWinStart.bat for the time window and lenWinStart.bat for the length windows. The batch file opens a command line for user interaction, so we can enter the required parameters for the window. From the batch file, a JavaScript routine that handles the insertion and deletion of events in the MongoDB database is started with the entered parameters. When the window is updated, the previously defined callback functions are also called.

The parameters listed in Table 5.1 are required for all scripts in the implementation. Table 5.2 shows the parameters required by the data windows. In order to recreate the architecture, we have to set "eventLog1" as the event log for the time window and "timeWindow" as the collection to store the current window. Furthermore, we set "timeWindowResults" as the statistics collection and "alertCollection" as the alert collection. To start the two length windows, we also have to start the batch file twice. The first time, we set "eventLog1" as the event log and "lengthWindow1" as the collection to store the current window. The second time, we use "eventLog2" and "lengthWindow2" instead. For both windows, we set "lengthWindowResults" as the statistics collection and "alertCollection" as the alert collection. All other parameters can be set arbitrarily, but here we assume the default values are used.

After that, the data windows are listening for new events inserted into the corresponding event logs.

5.2.5 Start the Alert Listener

Using the alertListenerStart.bat file, the alerts can also be shown in the Windows Event Viewer, which is more noticeable than a document in MongoDB. Note that this requires a Python installation including the pymongo and pywin32 packages, which can be installed using pip. According to the package specifications, Python versions 2.7 and 3.5+ are compatible. The code has been tested with Python 3.7. Starting the alert listener requires the usual connection parameters, as well as the name of the MongoDB collection the alerts are written to, which is "alertCollection" in our case. After that, new alerts also appear in the Windows Event Viewer.


Parameter   Description                                        Default
Hostname    Name of the primary node of the replica set        localhost
            to connect to
Port        Port of the host to connect to                     27018
Database    Name of the database the stream is written to      cepdb

Table 5.1: Connection parameters

Parameter              Description                                         Default
Window Collection      Name of the collection storing the data             e.g. lengthWindow for
                       window in MongoDB                                   length windows
Event Log              Name of the collection the stream is written to     eventLog1
Statistics Collection  Name of the collection to store the                 statsCollection1
                       aggregation results
Alert Collection       Name of the collection to store the alerts          alertCollection
Aggregation Function   File path to the function aggregate(dataWindow,     aggregate.js
                       statsCollection); ideally in the same directory
                       as the window batch files
Alert Function         File path to the function checkAlerts(dataWindow,   checkAlerts.js
                       statsCollection, alertCollection); ideally in
                       the same directory as the window batch files
Window Size            Number of events for windows bounded in size        20
Window Batch Size      Size of a batch for length batch windows            20
Window Time            Time interval in seconds                            60
Sort Expression        1 = ascending, -1 = descending; syntax:             price 1,time -1
                       [Field1] [1/-1],...,[FieldN] [1/-1]
Unique Expression      Combination of attributes required to be unique;    price,time
                       syntax: [Field1],...,[FieldN]
Timestamp Expression   Attribute to be evaluated as timestamp              time

Table 5.2: Window specific parameters


Chapter 6

Assessment

In order to validate the implemented techniques, an evaluation is conducted in this chapter, comparing the presented approach to the Esper framework. On the one hand, correctness is evaluated, taking Esper, which is community tested, as the ground truth. On the other hand, the performance of both solutions is compared.

The main goal here is to show the correctness of the implemented techniques. As MongoDB uses the disk to store the events, Esper, which stores the events in main memory, is expected to be many times faster. All tests are performed on a commercial laptop with an Intel Core i5-8250U CPU using 4 cores, each running at 1.6 GHz, an M.2 SSD and 16 GB RAM. The test data used here are synthetic and described in the following section. The data can be accessed at https://git.rwth-aachen.de/tim.bauerle/cep_on_mongodb_test_data.git

6.1 Set-up

As testing all kinds of data windows is beyond the scope here, we restrict ourselves to testing only the length and the time window. We use the same architecture as in Chapter 5, with two event streams, a time window and a length window on the first stream, named StockTickA, and a length window on the second stream, named StockTickB. The length windows have size 10 and the time window retains events for two seconds. The two streams simulate stock market feeds with slightly different events. To ensure that the results are comparable, the test data have to be equal and are therefore generated first.

The test data have different sizes, i.e. 100, 1000 and 10000 events, where each stream is simulated with one half of the test data. For each size, there are 50 test data sets to provide reliable results. Each data set consists of two files, one for each stream. An event stream is stored in a file as a JSON array. For a single test, we take a file for each stream, parse both files and alternately emit an event. We then measure the processing time for each data set.

Furthermore, we define three complex events that are notified by an alert, so we can compare the outcome of both implementations. The first alert is triggered if the average price of the first length window is greater than or equal to 6. The second one is triggered,


Figure 6.1: Esper set-up

if the average price of the time window is greater than or equal to 5. The third one is triggered if the average price of the second length window is greater than or equal to 6. The events emitted by both streams have a price attribute that is an integer value between 0 and 9. Hence, 4.5 is the expected average value of multiple events. As the length windows are limited to ten events, the time window is expected to contain more events most of the time. Thus, the average price of the events in a time window is likely to deviate less from the expected average price than the average price of both length windows. The number of alerts then indicates whether the different implementations act equally on the streams.
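The three rules can be summarized in a short Python model. This is only an illustration of the rule logic; in the actual architecture the checks are implemented in the checkAlerts JavaScript callback querying the statistics collections.

```python
def average_price(window_events):
    """Average of the price attribute over the events of a window."""
    prices = [event["price"] for event in window_events]
    return sum(prices) / len(prices)

def check_alerts(len_win1, time_win, len_win2):
    """The three alert rules used in the assessment.

    Returns, for the current window contents, which alerts fire.
    """
    return {
        "alert1": average_price(len_win1) >= 6,  # first length window
        "alert2": average_price(time_win) >= 5,  # time window
        "alert3": average_price(len_win2) >= 6,  # second length window
    }

# Example window contents with prices in the range 0-9.
len_win1 = [{"price": p} for p in (9, 8, 4)]  # average 7.0 -> alert
time_win = [{"price": p} for p in (4, 5, 3)]  # average 4.0 -> no alert
len_win2 = [{"price": p} for p in (5, 6, 4)]  # average 5.0 -> no alert
print(check_alerts(len_win1, time_win, len_win2))
# {'alert1': True, 'alert2': False, 'alert3': False}
```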

6.1.1 Esper (Comparative Setting)

The central component in Esper is the EPServiceProvider interface, which represents an instance of the Esper engine. It provides access to the EPAdministrator interface, used to create EPL statements, and to the EPRuntime interface, used to send events to the engine for processing.

To define what the events will look like, we first define the types of events that are input by the streams. The events are represented by the TickA Java class for one stream and by the TickB Java class for the other stream. These classes are used to configure the engine.

We process the test data grouped by size. Therefore, we iterate over the data and take one file for each stream at a time. For each alert mentioned above, we create an EPL statement. As we do not want to evaluate events from previous files, we create new instances of the statements in each iteration and destroy them afterwards to avoid overhead. Furthermore, we provide a listener class for callbacks by the engine. For each statement, we register an instance of the listener class that is called when the statement is matched. We use the listener to count the alerts distinctly for each statement.

Afterwards, we parse a file for the StockTickA stream and one for the StockTickB stream. For each stream, the events from the files are stored as objects of TickA and TickB respectively in an ArrayList.


Figure 6.2: Implementation set-up

Then the events are ready to be processed. We store the current time as the start date. Next, we iterate over the ArrayLists and send one event of each type alternately to the runtime. Each time one of the statements is matched, the listener is called and the calls are counted. When every event of the ArrayLists has been sent to the runtime, the current time is taken again and the difference to the start time denotes the processing time.

6.1.2 Implementation Set-up

The architecture of the implementation used here is basically the same as the one described in the previous chapters. From a batch file, we iterate over the test data and restart the data windows for each set of event data. Furthermore, the database is cleared after each run in order to avoid past data influencing the current test.

In a single iteration, the database is initialized first, which means that the event logs are created after the previous ones have been dropped. Then, each data window is started in a distinct mongo shell process with the parameters described above. Another mongo shell process is used to insert the test data into the database, i.e. this process simulates the streams. Before the test starts and the time is actually measured, the test data are loaded into the JavaScript scope of this process as a JSON array for each stream. Next, the events of each stream are inserted alternately into the corresponding event log. As the data windows are executed in different mongo shell processes, they act independently of each other, which allows measuring the time for each window separately. The time measurement for a data window starts when the first event is notified and ends when the last event is processed. This is possible as the number of events is known a priori in this setting. When a complex event is detected, an alert denoting the data window and the number of the test data set is inserted into the alert collection. This allows extracting the number of alerts for each window and data set from the database after all data sets are processed.

When the data windows have finished processing the current data, the measured computation times are stored and the collections corresponding to the data windows, statistics and event logs are deleted.

6.2 Results

The full results of the assessment are given in the appendix.


LenWin1:  100% (100 events), 100% (1000 events),  94% (10000 events)
LenWin2:  100% (100 events), 100% (1000 events),  96% (10000 events)
TimeWin:  100% (100 events),  96% (1000 events),  86% (10000 events)

Figure 6.3: Matching alert count

6.2.1 Correctness

To evaluate the correctness of the presented approach, we compare Esper to the presented approach regarding the number of alerts emitted. Here, we differentiate between the alerts counted for each data window in each test run. The ratio of identical alert counts for each data window and size of the data set is shown in Figure 6.3. For example, the value for Length Window 1 at a data size of 100 events per data set indicates that for each data set of this size, the number of alerts emitted by Esper and by the presented approach is equal. This can also be seen in more detail in Table A.1, as the numbers in the columns corresponding to Length Window 1 are identical.

Hence, for all data sets consisting of 100 events, the results of the presented approach are correct under the assumption that Esper works correctly.

For the data sets consisting of 1000 events, the results of the length windows are correct as well, while the number of alerts of the time window deviates in two tests. The particular numbers are shown in Table 6.1. As we can see, more alerts have been counted by the presented approach than by Esper. If we look at the runtime of the time window in Table A.5, we see that the time window in the presented approach took around 2.8 seconds of computation time, while Esper generally needed only around 0.005 seconds. Considering that the time window is set to a time interval of 2 seconds, the content of both windows should differ, as events are removed from the time window after 2 seconds. So a difference in the number of alerts counted does not imply a malfunction of the time window in the presented approach, as it could be explained by the execution time.

For the data sets of size 10000, we can also see a difference in the number of counted alerts for both length windows. For Length Window 1 there are three tests, depicted in Table 6.2, and for Length Window 2 there are two tests, depicted in Table 6.3, where the


Data Set   # Alerts Thesis   # Alerts Esper
8          2                 0
41         13                6

Table 6.1: Different results of the Time Windows for 1000 Events

number of alerts in the presented approach is less than the number of alerts counted by Esper. This deviation is not explainable by the behavior caused by different computation times. Running the single tests again produces equal numbers of alerts instead. This suggests a non-systematic error, which could possibly be caused by stopping the process of the corresponding window and clearing the database too early. Nevertheless, we cannot explain the error with certainty.

Data Set   # Alerts Thesis   # Alerts Esper
7          221               225
34         286               287
46         325               326

Table 6.2: Different Results of Length Window 1 for 10000 Events

Data Set   # Alerts Thesis   # Alerts Esper
20         307               308
27         382               390

Table 6.3: Different Results of Length Window 2 for 10000 Events

We can see the largest difference in the number of counted alerts at the Time Window for a data size of 10000 events in Table 6.4. As already observed previously, the Time Window in the presented approach emitted a larger number of alerts than Esper in each test. Considering the computation times again, we see that Esper took between 0.009 and 0.042 seconds while the presented approach took around 18 seconds. Hence, the content of the Time Windows is expected to differ again between both implementations, which might explain the different numbers.

All in all, given these results, the presented approach seems to work fairly accurately.

Data Set   # Alerts Thesis   # Alerts Esper
0          32                5
1          54                11
7          24                16
8          18                10
18         7                 4
19         4                 1
42         40                31

Table 6.4: Different Results of the Time Window for 10000 Events


Data Size   Esper   Thesis
100         1.38    1085.9
1000        4.1     2700.5
10000       12.36   18781.3

Table 6.5: Average runtime in milliseconds

6.2.2 Performance

Table 6.5 compares the average runtime of both implementations. The runtime denotes the time period between the first notification of an event by a data window and the time when the last event is processed. As we measured the time for each window distinctly in the presented approach, only the window with the maximum runtime, which is always the Time Window, is considered for the average time. In Esper, the processing times of the data windows are not measurable independently, so the overall time for the test data is considered. The results demonstrate a significantly better performance of the Esper framework compared to the implementation. As seen in Table 6.5, the average runtime of the two implementations differs by a factor of roughly between 650 and 1500.

There are a couple of reasons that might explain this behavior. First of all, Esper processes all incoming events in main memory. In contrast, MongoDB stores the events in collections that reside on the hard drive, which has a greater latency for reading and writing operations than main memory. Hence, Esper has a performance advantage both in reading and writing the data. Furthermore, Esper supports incremental aggregations, which means that results such as an average value do not have to be recomputed from scratch. If the event removal is performed by MongoDB itself, for example in capped collections, incremental aggregation is impossible, as the removal is noticed only after it has been performed, by which time the data are lost.
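To illustrate the difference, the following sketch (illustrative Python, not part of the thesis implementation) contrasts an incremental average, which updates in O(1) per event, with a full recomputation over the window contents:

```python
class IncrementalAverage:
    """Maintain a running sum and count so the aggregate updates in O(1)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def remove(self, value):
        # Only possible if the removed value is still known; with capped
        # collections the removal happens silently, so this step is lost.
        self.total -= value
        self.count -= 1

    def value(self):
        return self.total / self.count if self.count else 0.0


def recomputed_average(window):
    # Full recomputation: O(n) per update, the only option if removals
    # are not observable (e.g. in a MongoDB capped collection).
    return sum(window) / len(window) if window else 0.0


avg, window = IncrementalAverage(), []
for v in [4.0, 8.0, 6.0]:
    avg.add(v)
    window.append(v)
avg.remove(4.0)          # oldest event leaves the window
window.remove(4.0)
assert avg.value() == recomputed_average(window)  # both yield 7.0
```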

Apart from the comparison to Esper, it is also interesting to compare the average runtimes of the individual data windows in the presented approach. Figure 6.4 shows these runtimes grouped by data size. The two length windows do not differ significantly in their runtimes, which is not surprising, as they work the same way. The difference between time and length windows is more interesting, though: both length windows need considerably less time than the time window. A possible explanation is that the length windows are directly bounded in size, while the time window can grow almost arbitrarily. Thus, at a high event arrival rate, a time window may have more events to aggregate, which clearly takes more time.
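This contrast can be sketched as follows (illustrative Python, not the thesis code): a length window is bounded by construction, while a time window holds every event whose timestamp falls within the interval, so its size depends on the arrival rate:

```python
from collections import deque

# Length window: bounded by construction; the oldest event is evicted
# as soon as the (maxlen + 1)-th event arrives.
length_window = deque(maxlen=3)

def time_window(events, now, span):
    # Time window: keeps every event younger than `span` time units, so
    # its size is bounded only by the event arrival rate.
    return [(ts, v) for ts, v in events if now - ts < span]

# Ten events arriving within one time unit: (timestamp, payload).
events = [(t / 10, t) for t in range(10)]
for _, v in events:
    length_window.append(v)

assert len(length_window) == 3                          # fixed size
assert len(time_window(events, now=1.0, span=5)) == 10  # grows with rate
```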

6.2.3 Summary

The results show that the presented approach works almost correctly. Moreover, re-running the test cases that were inaccurate in the first place separately delivers correct results. The results also demonstrate a better performance of Esper compared to the presented approach.

In general, the assessment was conducted on a few particular test cases that may be better suited to one implementation or the other. Also, the performance results may vary


[Bar chart — runtime in seconds per data size: LenWin1 and LenWin2 each took 0.13 s (100 events), 1.29 s (1000 events), and 13.5 s (10000 events); TimeWin took 1.09 s, 2.7 s, and 18.78 s, respectively.]

Figure 6.4: Average runtime per window

on different systems, but at least the performance results give an indication of how the two implementations compare in general. One can certainly say that Esper has advantages on very large event streams, given the presented performance results. Although Esper offers far more features and is tested by its community, the approach presented here also has certain advantages. First of all, it is designed in a more modular way than Esper and thus requires less custom code in order to detect complex events. Data windows, for example, can be started from a batch file by entering parameter values instead of writing code.

Furthermore, the presented approach is built upon the MongoDB database, so the events are permanently stored in a database, which Esper does not do. Moreover, using data replication in MongoDB also provides better fault tolerance.


Chapter 7

Conclusion

In this thesis, we have proposed several techniques for complex event processing that have been implemented on top of the MongoDB database. CEP has many use cases, with process monitoring being one of the most important. A variety of CEP software has also evolved, both in research and in industry.

After presenting basic terms in CEP and related fields, we introduced a formal framework to describe the relevant CEP techniques, in particular different data windows that capture an excerpt of an event stream. These were furthermore described from a more practical point of view to outline the different behaviors of the data windows. Subsequently, instructions were provided on how to use the implemented techniques. Afterwards, some of the implemented techniques were compared to the popular CEP software Esper, which showed a better performance in the presented test scenario. The approach presented here may nevertheless have advantages on smaller data streams, as less code is required and direct access to the MongoDB database is given.

Future Work

As performance is a clear benefit of the Esper software, it would be interesting to see whether the performance of the implemented approach could be increased. While incremental aggregation is impossible with the current approach, which is assumed to be a performance issue, processing events in memory might lead to a better performance. For this, the "In-Memory Storage Engine" provided in the MongoDB Enterprise edition could be a promising option.

It would also be interesting to further investigate the properties of the approach regarding fault tolerance and failure recovery in distributed applications, as this is an advantage of MongoDB.


Bibliography

[1] David Luckham and Roy W. Schulte. Event processing glossary - version 2.0. Event Processing Technical Society, July 2011.

[2] Wil M. P. van der Aalst. Process Mining: Data Science in Action. Springer, Heidelberg, 2nd edition, 2016.

[3] OMG. Business Process Model and Notation (BPMN), Version 2.0. Object Management Group, January 2011. URL http://www.omg.org/spec/BPMN/2.0.

[4] Marlon Dumas, Marcello La Rosa, Jan Mendling, and Hajo A. Reijers. Fundamentals of Business Process Management. Springer Berlin Heidelberg, 2nd edition, 2018.

[5] Miyuru Dayarathna and Srinath Perera. Recent advancements in event processing. ACM Comput. Surv., 51(2):33:1–33:36, February 2018.

[6] Feng Gao, Muhammad Intizar Ali, and Alessandra Mileo. Semantic discovery and integration of urban data streams. In Proceedings of the Fifth International Conference on Semantics for Smarter Cities - Volume 1280, S4SC'14, pages 15–30, Aachen, Germany, 2014. CEUR-WS.org.

[7] A. Mdhaffar, I. Bouassida Rodriguez, K. Charfi, L. Abid, and B. Freisleben. CEP4HFP: Complex event processing for heart failure prediction. IEEE Transactions on NanoBioscience, 16(8):708–717, Dec 2017.

[8] MongoDB Inc. MongoDB server. http://www.mongodb.com, 2007-2019.

[9] Doug Laney. 3D data management: Controlling data volume, velocity and variety. META Group research note, 6(70):1, 2001.

[10] David C. Luckham and Brian Frasca. Complex event processing in distributed systems. Computer Systems Laboratory Technical Report CSL-TR-98-754. Stanford University, Stanford, 28:16, 1998.

[11] David C. Luckham. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001. ISBN 0201727897.

[12] Opher Etzion and Peter Niblett. Event Processing in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2010.

[13] David Luckham. What's the difference between ESP and CEP? URL http://www.complexevents.com/2019/07/15/whats-the-difference-between-esp-and-cep-2/. Last updated: July 15, 2019.

[14] Michael Eckert, François Bry, Simon Brodt, Olga Poppe, and Steffen Hausmann. A CEP Babelfish: Languages for Complex Event Processing and Querying Surveyed, pages 47–70. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.

[15] EsperTech Inc. Esper. http://www.espertech.com, 2006-2019.

[16] Oracle Corporation. Oracle CEP. https://docs.oracle.com/cd/E12839_01/doc.1111/e14476/toc.htm, 2008-2019.

[17] WSO2 Inc. Siddhi. https://siddhi.io/, 2015-2019.

[18] Apache Software Foundation. Apache Spark. https://spark.apache.org/, 2014-2019.

[19] David C. Luckham. Rapide: A language and toolset for simulation of distributed systems by partial orderings of events. Technical report, Stanford, CA, USA, 1996.

[20] Cornell University. Cayuga. http://www.cs.cornell.edu/database/cayuga/, 2007-2013.

[21] Susanne Bülow, Michael Backmann, Nico Herzberg, Thomas Hille, Andreas Meyer, Benjamin Ulm, Tsun Yin Wong, and Mathias Weske. Monitoring of business processes with complex event processing. In Business Process Management Workshops, pages 277–290. Springer International Publishing, 2014.

[22] Aymen Baouab, Olivier Perrin, and Claude Godart. An optimized derivation of event queries to monitor choreography violations. Volume 7636, November 2012.

[23] Eugene Wu, Yanlei Diao, and Shariq Rizvi. High-performance complex event processing over streams. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 407–418, New York, NY, USA, 2006. ACM. ISBN 1-59593-434-0.

[24] Ming-Li Emily Chang and Hui Na Chua. SQL and NoSQL database comparison. In Kohei Arai, Supriya Kapoor, and Rahul Bhatia, editors, Advances in Information and Communication Networks, pages 294–310, Cham, 2019. Springer International Publishing.

[25] S. Chickerur, A. Goudar, and A. Kinnerkar. Comparison of relational database with document-oriented database (MongoDB) for big data applications. In 2015 8th International Conference on Advanced Software Engineering & Its Applications (ASEA), pages 41–47, Nov 2015.

[26] Zachary Parker, Scott Poe, and Susan V. Vrbsky. Comparing NoSQL MongoDB to an SQL DB. In Proceedings of the 51st ACM Southeast Conference, ACMSE '13, pages 5:1–5:6. ACM, 2013. ISBN 978-1-4503-1901-0.

[27] KB Sundhara Kumar, S Mohanavalli, et al. A performance comparison of document oriented NoSQL databases. In 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP), pages 1–6. IEEE, 2017.

[28] Yishan Li and Sathiamoorthy Manoharan. A performance comparison of SQL and NoSQL databases. In 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pages 15–19, Aug 2013.

[29] Michael Eckert and François Bry. Complex event processing (CEP). Informatik-Spektrum, 32(2):163–167, Apr 2009.

[30] Sebastiaan J. van Zelst. Process mining with streaming data. PhD thesis, Technische Universiteit Eindhoven, 2019.

[31] Bartosz Balis, Bartosz Kowalewski, and Marian Bubak. Leveraging complex event processing for grid monitoring. Pages 224–233, January 2010.

[32] Konstantinos Vandikas, Paris Carbone, and Farjola Peco. Recovery of operational state values for complex event processing based on a time window defined by an event query, June 2016.

[33] MongoDB Inc. Install MongoDB, 2008-2019. URL https://docs.mongodb.com/manual/installation/#install-mongodb.

[34] MongoDB Inc. Replication, 2008-2019. URL https://docs.mongodb.com/manual/replication/index.html#replication.


Appendix A

Full Assessment Results


Data No.  LW1 Thesis  LW1 Esper  LW2 Thesis  LW2 Esper  TW Thesis  TW Esper
0  11  11  1  1  18  18
1  6  6  13  13  1  1
2  4  4  3  3  18  18
3  1  1  5  5  1  1
4  0  0  0  0  0  0
5  3  3  3  3  28  28
6  0  0  5  5  4  4
7  6  6  1  1  30  30
8  12  12  14  14  12  12
9  11  11  14  14  26  26
10  0  0  5  5  6  6
11  12  12  8  8  38  38
12  5  5  1  1  7  7
13  10  10  1  1  43  43
14  10  10  0  0  14  14
15  5  5  6  6  10  10
16  3  3  1  1  10  10
17  3  3  0  0  6  6
18  3  3  3  3  17  17
19  0  0  0  0  2  2
20  3  3  1  1  8  8
21  5  5  14  14  7  7
22  1  1  1  1  27  27
23  9  9  3  3  4  4
24  20  20  8  8  49  49
25  2  2  1  1  15  15
26  0  0  0  0  1  1
27  0  0  5  5  0  0
28  3  3  2  2  21  21
29  6  6  7  7  17  17
30  0  0  1  1  0  0
31  0  0  8  8  2  2
32  2  2  4  4  4  4
33  7  7  1  1  43  43
34  4  4  12  12  13  13
35  10  10  0  0  20  20
36  15  15  5  5  30  30
37  3  3  1  1  8  8
38  1  1  2  2  3  3
39  3  3  0  0  0  0
40  0  0  0  0  2  2
41  0  0  0  0  0  0
42  0  0  2  2  29  29
43  1  1  7  7  1  1
44  1  1  3  3  0  0
45  8  8  6  6  17  17
46  8  8  1  1  34  34
47  7  7  4  4  15  15
48  3  3  7  7  7  7
49  4  4  0  0  1  1

Table A.1: Alert Counts for Data Size 100


Data No.  LW1 Thesis  LW1 Esper  LW2 Thesis  LW2 Esper  TW Thesis  TW Esper
0  28  28  30  30  7  7
1  25  25  13  13  11  11
2  18  18  23  23  6  6
3  70  70  25  25  0  0
4  41  41  27  27  17  17
5  40  40  19  19  22  22
6  16  16  34  34  1  1
7  19  19  20  20  4  4
8  22  22  48  48  2  0
9  15  15  49  49  7  7
10  40  40  20  20  70  70
11  20  20  49  49  0  0
12  26  26  17  17  27  27
13  61  61  16  16  98  98
14  14  14  23  23  37  37
15  38  38  17  17  6  6
16  29  29  29  29  35  35
17  10  10  28  28  1  1
18  22  22  30  30  12  12
19  58  58  54  54  125  125
20  45  45  27  27  18  18
21  29  29  44  44  43  43
22  46  46  21  21  16  16
23  13  13  22  22  11  11
24  11  11  33  33  0  0
25  39  39  30  30  0  0
26  28  28  22  22  1  1
27  32  32  59  59  47  47
28  25  25  25  25  3  3
29  43  43  51  51  0  0
30  18  18  31  31  67  67
31  18  18  28  28  1  1
32  31  31  21  21  4  4
33  13  13  10  10  14  14
34  29  29  33  33  0  0
35  13  13  12  12  1  1
36  24  24  17  17  17  17
37  22  22  21  21  29  29
38  11  11  34  34  1  1
39  31  31  45  45  30  30
40  33  33  32  32  4  4
41  47  47  22  22  13  6
42  20  20  22  22  0  0
43  29  29  24  24  17  17
44  26  26  46  46  4  4
45  20  20  42  42  17  17
46  23  23  31  31  21  21
47  20  20  31  31  1  1
48  42  42  22  22  48  48
49  23  23  33  33  0  0

Table A.2: Alert Counts for Data Size 1000


Data No.  LW1 Thesis  LW1 Esper  LW2 Thesis  LW2 Esper  TW Thesis  TW Esper
0  306  306  254  254  32  5
1  298  298  187  187  54  11
2  332  332  339  339  4  4
3  238  238  271  271  0  0
4  256  256  309  309  11  11
5  292  292  289  289  64  64
6  270  270  337  337  17  17
7  221  225  322  322  24  16
8  296  296  283  283  18  10
9  270  270  273  273  0  0
10  299  299  250  250  38  38
11  233  233  260  260  4  4
12  230  230  325  325  0  0
13  330  330  254  254  49  49
14  222  222  318  318  8  8
15  345  345  214  214  12  12
16  220  220  250  250  4  4
17  294  294  215  215  106  106
18  266  266  294  294  7  4
19  292  292  225  225  4  1
20  247  247  307  308  9  9
21  276  276  276  276  1  1
22  212  212  266  266  2  2
23  232  232  277  277  26  26
24  328  328  248  248  0  0
25  269  269  293  293  2  2
26  230  230  262  262  0  0
27  281  281  382  390  90  90
28  282  282  298  298  14  14
29  270  270  218  218  43  43
30  265  265  233  233  0  0
31  237  237  271  271  42  42
32  296  296  338  338  1  1
33  310  310  232  232  11  11
34  286  287  279  279  12  12
35  183  183  238  238  33  33
36  255  255  260  260  7  7
37  226  226  278  278  0  0
38  302  302  221  221  56  56
39  310  310  357  357  27  27
40  260  260  274  274  44  44
41  271  271  309  309  0  0
42  287  287  319  319  40  31
43  263  263  314  314  1  1
44  322  322  286  286  65  65
45  269  269  317  317  1  1
46  325  326  225  225  1  1
47  264  264  305  305  4  4
48  273  273  228  228  3  3
49  211  211  270  270  112  112

Table A.3: Alert Counts for Data Size 10000


Data No.  LW1 Thesis Time  LW2 Thesis Time  TW Thesis Time  Total Time Esper
0  222  164  1165  8
1  131  129  1062  4
2  132  128  1072  3
3  126  127  1085  2
4  126  124  1077  2
5  126  126  1089  3
6  124  122  1080  1
7  128  125  1078  4
8  131  130  1086  1
9  130  129  1058  2
10  124  127  1096  1
11  132  128  1106  1
12  128  125  1073  2
13  132  125  1102  1
14  131  125  1084  1
15  130  130  1072  1
16  128  125  1113  1
17  128  124  1095  1
18  129  126  1104  0
19  123  122  1054  0
20  130  127  1065  1
21  124  126  1077  1
22  128  127  1093  1
23  133  127  1098  1
24  137  129  1105  1
25  130  127  1109  1
26  126  126  1073  1
27  134  130  1027  1
28  128  127  1068  2
29  130  127  1106  1
30  125  124  1087  1
31  126  127  1086  1
32  129  126  1096  1
33  128  124  1082  2
34  130  133  1090  1
35  133  126  1088  1
36  132  125  1084  1
37  129  125  1080  1
38  127  125  1091  1
39  128  122  1153  0
40  125  125  1063  1
41  126  126  1068  1
42  129  125  1073  1
43  129  130  1084  1
44  136  126  1071  1
45  131  130  1087  1
46  133  128  1084  0
47  128  125  1090  1
48  129  128  1086  1
49  129  125  1080  1

Table A.4: Computation times [ms] for data size 100


Data No.  LW1 Thesis Time  LW2 Thesis Time  TW Thesis Time  Total Time Esper
0  1326  1323  2843  24
1  1296  1285  2678  5
2  1306  1306  2689  4
3  1324  1294  2687  6
4  1315  1306  2751  4
5  1303  1296  2728  4
6  1302  1322  2733  4
7  1303  1309  2736  4
8  1314  1327  2742  5
9  1300  1325  2708  8
10  1305  1309  2757  5
11  1297  1317  2709  5
12  1302  1296  2738  5
13  1323  1292  2755  7
14  1304  1307  2798  6
15  1311  1299  2746  7
16  1319  1320  2736  3
17  1301  1327  2732  5
18  1319  1310  2744  4
19  1334  1328  2674  3
20  1290  1281  2719  4
21  1335  1335  2831  5
22  1411  1385  2814  4
23  1382  1387  2866  2
24  1247  1271  2643  3
25  1266  1267  2645  2
26  1271  1279  2651  3
27  1263  1273  2672  3
28  1254  1253  2657  3
29  1271  1278  2631  2
30  1277  1290  2748  4
31  1259  1262  2651  3
32  1266  1260  2617  3
33  1256  1258  2669  2
34  1275  1278  2644  3
35  1266  1252  2642  3
36  1258  1246  2677  3
37  1260  1267  2654  3
38  1261  1275  2626  2
39  1263  1271  2687  3
40  1263  1260  2639  2
41  1274  1258  2591  2
42  1261  1258  2634  2
43  1221  1225  2816  4
44  1280  1282  2648  2
45  1249  1265  2658  3
46  1261  1274  2686  2
47  1262  1270  2629  3
48  1273  1258  2657  4
49  1262  1285  2638  3

Table A.5: Computation times [ms] for data size 1000


Data No.  LW1 Thesis Time  LW2 Thesis Time  TW Thesis Time  Total Time Esper
0  12901  12837  18020  42
1  13142  12986  18221  21
2  12673  12629  17806  20
3  12696  12706  17806  14
4  12771  12752  18850  13
5  12922  12945  18960  29
6  12852  12907  18749  19
7  13013  13250  18299  15
8  13179  13101  18310  11
9  13261  13236  18482  15
10  13287  13285  18512  16
11  13501  13470  18688  12
12  13211  13291  18530  18
13  13350  13343  18497  11
14  13323  13398  18497  13
15  13343  13269  18405  11
16  13186  13184  18367  9
17  13409  13331  18479  10
18  13417  13408  18646  11
19  13253  13174  18387  9
20  13309  13416  18357  11
21  13572  13690  18750  12
22  13474  13414  18499  10
23  13635  13654  18748  12
24  13487  13461  18519  10
25  13640  13671  18817  10
26  13436  13389  18594  11
27  13357  13298  18271  10
28  13465  13519  18631  11
29  13960  13971  19120  9
30  13878  13682  18725  12
31  13271  13542  18453  8
32  13538  13556  18487  10
33  13640  13710  18727  10
34  13623  13712  18538  15
35  13674  13684  18528  9
36  13722  13681  18817  11
37  13630  13679  18817  9
38  13946  13883  19189  9
39  14280  14391  19501  9
40  14859  14786  20444  9
41  14551  14641  19724  9
42  14160  14231  19323  10
43  13927  13916  19034  9
44  13456  13344  19495  9
45  13617  13585  19783  9
46  13717  13826  19758  9
47  13494  13536  19718  10
48  14071  13960  19317  9
49  13711  13745  19870  8

Table A.6: Computation times [ms] for data size 10000
