
D7.4 Final Project Report v1.0 - FERARI Project


FRONT PAGE

PROJECT FINAL REPORT Grant Agreement number: 619491

Project acronym: FERARI Project title: Flexible Event pRocessing for big dAtaaRchItectures Funding Scheme: STREP

Period covered: from 1st of February, 2014 to 31st of January 2017

Name of the scientific representative of the project's co-ordinator, Title and Organisation:

PD Dr. Michael Mock, Fraunhofer IAIS Tel: +49 2241 14 2576 Fax: +49 2241 14 2324 E-mail: [email protected]

Project website address: www.ferari-project.eu


Final publishable summary report

Executive Summary

FERARI provides a general-purpose architecture for flexible, communication-efficient distributed complex event processing on massively distributed streams of data. The FERARI architecture is implemented in the open source FERARI software platform provided at https://bitbucket.org/sbothe-iais/ferari. As the FERARI architecture is general purpose, the FERARI software platform is distributed middleware that can be configured to support a wide range of distributed streaming applications. Specific use cases are implemented in the architecture by instantiating and configuring an application as a combination of distributed streaming operators, including CEP (Complex Event Processing) operators, communication-efficient in-situ processing operators, and machine learning operators. Applications are defined in an abstract, user-friendly and flexible way on a logical level, hiding specific aspects of distribution and locality. The FERARI CEP Optimizer takes the application specification and breaks it down into a communication-efficient, fully distributed execution specification, which is instantiated and executed at distributed FERARI sites. The FERARI architecture has been demonstrated on various occasions, including demo presentations at SIGMOD 2016 and DEBS 2016.

As part of the FERARI development, we extended IBM's open source CEP runtime environment PROTON (IBM's PROactive Technology ONline) to run on top of STORM in order to integrate it into the FERARI architecture. PROTON on STORM is again open source and provides the PROTON functionality in a distributed, scalable manner. It is being used in a number of EU projects and supports the flexible Event Model (TEM) developed in FERARI and presented at DEBS 2016. IBM recently recognized PROTON and its scalable version PROTON on STORM as a fundamental scientific contribution with the corporation's prestigious Accomplishment Award.

While the architecture is general purpose and hence not restricted to specific use cases and application domains, it is instantiated and validated in the FERARI project in use cases working with real-world data. The instantiation of the FERARI architecture in these use cases and its evaluation on real-world data are addressed below. For this purpose, HT has collected CDR (Call Detail Record) data for evaluation purposes; this data has been anonymized for analysis in the project. Fraud detection rules have been implemented using the distributed FERARI CEP implementation. The implementation has been evaluated both in terms of accuracy and performance, showing 100% detection of fraudsters compared with the current HT solution while achieving much higher performance.

One key aspect for achieving scalability at very large scale is the provision of communication-efficient in-situ processing operators in FERARI. The main goal of these operators is to process data locally, where it is generated, and to communicate with a centralized (or dynamically placed) coordinator only when needed. FERARI developed in-situ processing methods that go beyond the state of the art and make it possible to efficiently monitor, in a distributed way, complex functions over the data collected by remote sites. Their role is to (a) provide communication-efficient methods that integrate the sensor layer, (b) provide low latency methods for complex event processing, and (c) using these novel advancements, provide efficient distributed machine learning techniques. The basic so-called Geometric Monitoring technique, developed by FERARI partners in preceding projects, has been employed and extended in various directions: sampling techniques have been introduced to reduce communication with probabilistic guarantees (SIGMOD 2016), integration with distributed online learning was achieved (ECML 2014/16), and convex bounding techniques have been developed (VLDB 2015, KDD 2016) to widen the applicability of the methods to new settings, for example LDA classification and the monitoring of large distributed graphs (IPDPS 2017).
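The communication pattern behind such in-situ monitoring can be illustrated with a small sketch. The following Python snippet is not project code and greatly simplifies the actual Geometric Monitoring method: each site checks a purely local condition (here, a simple slack bound around the last synchronized value, standing in for the geometric safe zone) and contacts the coordinator only on a violation. All class names and the slack parameter are invented for illustration.

```python
# Illustrative sketch (not FERARI code): sites report only when their local
# drift from the last synchronized state exceeds a slack, mimicking the
# communication pattern of in-situ threshold monitoring.

class Site:
    def __init__(self, slack):
        self.slack = slack
        self.ref = 0.0      # local value at the last synchronization
        self.value = 0.0    # current local value

    def update(self, x):
        """Apply a local stream update; return True if the coordinator
        must be contacted (local safe-zone condition violated)."""
        self.value = x
        return abs(self.value - self.ref) > self.slack

class Coordinator:
    def __init__(self, sites):
        self.sites = sites
        self.messages = 0

    def handle_violation(self):
        """Collect all local values and redistribute the reference point."""
        for s in self.sites:
            s.ref = s.value
        self.messages += len(self.sites)

    def estimate(self):
        # Between synchronizations the coordinator knows only the references.
        return sum(s.ref for s in self.sites) / len(self.sites)
```

A naive system would send one message per local update; here, updates that stay inside the slack cost no communication, at the price of a bounded error in the coordinator's estimate.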


Context and Objectives

The data revolution. A number of recent technological developments have started to change our world forever: (1) the rise of the internet, (2) the ever growing amount of activity in social networks, (3) the widespread adoption of smart phones and other mobile devices, and (4) the instrumentation of the world with sensors. This is accompanied by dropping prices for computers, networks, and storage. In consequence, exponentially growing amounts of data arising from these sources can be stored and processed. This trend to capture, store, process, analyse, and use the new information sources for making better business decisions, for better public planning, and for scientific purposes is known as Big Data. It is characterized by the 3 V's of Volume (the size of the data), Velocity (the speed of incoming data and data processing), and Variety (the heterogeneous nature of many data sources, including structured and unstructured information). Often, an additional condition, Veracity (the uncertain or imprecise nature of data), is added. Many of today's Big Data technologies were built on the tacit assumption of web-based systems processing data ultimately generated by humans: e.g. analyzing incoming Twitter streams, managing social networks at Facebook, or indexing web pages at Google. This data is unstructured or semi-structured at best. Human-generated data is predominantly persistent, i.e. it is required to be stored for relatively long periods of time. As a result, Big Data technologies to date mainly focus on batch processing of data stored on distributed file systems; implementations are designed to be as general as possible, employing simple and general key-value semantics. As Big Data finds its way to other business areas, this design decision becomes limiting. Areas with great future potential are machine-to-machine (M2M) interaction and the Internet of Things.
A few examples include smart energy grids, car-to-car communication, mobile network quality monitoring, optimizing the operation of large, complicated systems through mining machine and service logs, fault detection in clouds, and automated negotiation systems – all have been identified as important hot use cases for Big Data. Data volumes generated from M2M interaction surpass by far the amount of data generated by humans. M2M data is required to be processed in real-time as it is produced, it is predominantly transient (does not need to be, and is too large to be, stored for future reference), and is much more structured in nature. Consequently, current Big Data technologies are inadequate for processing contemporary and expected amounts of M2M data since they are:

• not smart enough for autonomous operation – most analytical tasks are of a rather simple nature;
• not flexible enough – it requires expert skills to set up an application;
• not sufficiently scalable – they still lack scalability to large scale services because of too much data shuffling;
• not fast enough – storing the data prior to processing it involves very high latency, preventing real-time knowledge extraction;
• expensive and resource consuming – storing all the data and shuffling it necessitates enormous infrastructures and power investments which are mostly redundant and superfluous.

The FERARI vision. The goal of the FERARI project is to address these bottlenecks and to pave the way for efficient and timely processing of Big Data. We intend to exploit the structured nature of M2M data while retaining the flexibility required for handling unstructured data elements. Taking into account the structured nature of the data will enable business users to express complex tasks, such as efficiently identifying sequences of events over distributed sources with complex relations, or learning and monitoring sophisticated abstract models of the data. These tasks will be expressed in a high-level declarative language (as opposed to programming them manually, as is the case in current streaming systems). The system will be able to perform these tasks in an efficient and timely manner. Importantly, this systematic approach will enable leveraging recent advances in in-situ processing algorithms, which perform much of the processing at the source where the data is generated. Instead of transporting all the data to a data center for centralized storing and processing, the data is processed in place, and a centralized location is only required to coordinate the processing efforts and to receive final results. The advantages of in-situ processing are especially important for M2M data, where any transportation of data is truly wasteful since there is no need to store the data. In-situ processing is a crucial component for achieving truly large-scale and geographically distributed scalability: avoiding sending all the data to a centralized location for storage and processing simultaneously addresses both communication and computational scalability issues. By diminishing the need for large centralized infrastructures, huge data transfers, and the corresponding energy, in-situ processing lowers the cost and environmental ramifications of Big Data stream processing systems by orders of magnitude. Similarly, huge acceleration is obtained in performing real-time knowledge extraction and monitoring. FERARI will implement its overall vision for building large scale distributed systems by decomposing it into a number of specific objectives:

Objective 1: Provide support for large scale services by making the sensor layer a first class citizen in Big Data architectures. FERARI advocates in-situ processing (right where the data is generated) as a first choice for scalable processing of future large scale services. This is the most principled and, in the long run, the only realistic approach for keeping up with exponentially growing information at the sensors, and the only approach that can avoid unnecessary data shuffling between nodes. Doing so requires a rethinking of current assumptions in state-of-the-art Big Data architectures, e.g. adding control flow mechanisms to the data flow capabilities.

Objective 2: Provide support for Complex Event Processing technology for business users in Big Data architectures. The goal is to bring stream processing much closer to the business world by extending simple stream processing of numeric or textual data to the much more powerful realm of Complex Event Processing (CEP). One of the open challenges for Big Data is to transfer the approach from its original application areas – large scale web data processing – to other areas of business and industry; only if this can be achieved will it live up to its economic promises. Providing a seamless model that applies CEP as part of a Big Data application in a way that is easily consumable by business users positions CEP as a great step towards bridging this gap.

Objective 3: Provide support for integrating machine learning tasks in the architecture. For implementing automated observe-decide-act cycles, it is not enough to analyse incoming information piece-by-piece, doing simple data transformations, aggregations and statistics. Instead it is necessary to learn sophisticated models based on machine learning techniques, which then form the basis for decisions and actions. Current Big Data architectures are mostly data flow driven and lack some of the functionality required for supporting efficient learning. The goal is to use the added control flow capabilities (Objective 1) to support such learning algorithms.

Objective 4: Provide an open source architecture for flexible and adaptive analytics workflows. Current data flows are difficult to set up and to maintain. One goal (taking advantage of Objectives 2 and 3) is to support workflows that are more adaptable to changing data distributions and changes in either the environment or the requirements by supporting adaptive event-driven workflows using machine learning techniques. A second direction is to provide tools that simplify the set-up of such processes.

Objective 5: Exemplify the potential of the new architecture in the telecommunication and the cloud domain. To show the potential of the approach, FERARI has selected two scenarios in challenging, high-impact areas of industry where communication bottlenecks are currently severe limiting factors. These scenarios are (1) the analysis of mobile phone fraud in telecommunication networks and (2) real-time health monitoring in clouds and large data centers, an area where already today the high volume of data severely limits the optimization and monitoring of IT systems. Jointly, Objectives (1)-(3) provide the basis for supporting the complex analytics tasks that will address the requirements of the application scenarios (Objective (5)). Objective (4) supports these objectives by simplifying the set-up and maintenance of workflows.

Figure 1 shows how the objectives are interrelated: the architecture extensions at the middleware layer (control flow, sensor layer integration, push based protocols and alarm resolution) provide the necessary primitives for innovations in the application layer (integration of machine learning, drift and change monitoring, and distributed CEP); this in turn is the pre-condition for implementing new and innovative use cases, two of which are investigated in depth in the project (mobile phone fraud detection, cloud health monitoring). The FERARI architecture will be made available as open source and be evaluated on real-world data in the fraud-detection and cloud health monitoring use cases.

Main S&T results / foregrounds

Overview of the FERARI architecture

FERARI provides a general-purpose architecture for flexible, communication-efficient distributed complex event processing on massively distributed streams of data. The FERARI architecture is implemented in the open source FERARI software platform provided at https://bitbucket.org/sbothe-iais/ferari. As the FERARI architecture is general purpose, the FERARI software platform is distributed middleware that can be configured to support a wide range of distributed streaming applications. Specific use cases are implemented in the architecture by instantiating and configuring an application as a combination of distributed streaming operators, including CEP (Complex Event Processing) operators, communication-efficient in-situ processing operators, and machine learning operators. Applications are defined in an abstract, user-friendly and flexible way on a logical level, hiding specific aspects of distribution and locality. The FERARI CEP Optimizer takes the application specification and breaks it down into a communication-efficient, fully distributed execution specification, which is instantiated and executed at distributed FERARI sites. The FERARI architecture has been demonstrated on various occasions, including demo presentations at ACM SIGMOD 2016 and ACM DEBS 2016.

While the architecture is general purpose and hence not restricted to specific use cases and application domains, it is instantiated and validated in the FERARI project in use cases working with real-world data. The instantiation of the FERARI architecture in those use cases and its evaluation on real-world data in the scope of these use cases are addressed below.

Figure 1. FERARI innovations across layers. The project addresses the full technology stack, adding innovation at every layer.

Figure 2 summarizes the main characteristics of WP2 and WP1, showing the differences and relationship between these two work packages. The architecture developed in WP2 is made available as an open source platform in the FERARI open source repository https://bitbucket.org/sbothe-iais/ferari. It is intended to be general purpose and usable in any application domain that works on distributed streaming data. It provides flexible mechanisms for complex event processing, as well as libraries and run-time components for the distributed execution of applications, making use of the communication-efficient protocols for in-situ processing and the distributed execution of complex event processing runtimes provided by the FERARI open source platform. The development of large-scale Big Data streaming applications is supported and significantly simplified by the FERARI open source platform. In addition, we provide a simple example application with algorithms for fraud detection on a small data sample as part of the open source platform. Full applications, validated against existing, non-distributed applications, are developed and described in WP1. These specific use-case related applications, running on top of the general purpose FERARI platform, work on real-world data provided by HT (Hrvatski Telekom).

Figure 2: Relationship between WP2 and WP1

These applications are instantiated in a test bed on cluster hardware installed at HT. As they are very specific, contain HT-related information, and are closely tied to the HT data, they are not open sourced.

The FERARI architecture integrates many components and methodologies that have been developed in the respective work packages of the FERARI project. The methodology for defining general processing operators with Complex Event Processing is described in deliverable D4.3, and the methodology for communication-efficient in-situ operators is described in D3.3. The placement of these operators in the distributed system is achieved via the CEP optimizer, which is described in detail in deliverable D5.3.

Challenges and Design Decisions

The major design goal of the FERARI architecture is to support in-situ processing of distributed, continuous data streams in a flexible, scalable, and communication-efficient way. The following design challenges have been addressed in particular:

• in-situ processing: FERARI considers the paradigm of in-situ processing a key element for achieving scalability at large scale. In-situ processing refers to the ability to execute the computational processing of continuous data streams as close as possible to the source of the data stream, at best on the data generating sensor itself. This ability avoids communication overheads by achieving communication reduction, and exploits the distributed processing power of the sensors in the system. Both aspects lead to scalability at large scale, as overloading of crucial communication and processing resources is avoided. In the rare case that data streams can be processed completely independently, achieving in-situ processing is algorithmically rather trivial. However, in most cases, the functionality of the system requires a global view on distributed streams. For example, in the fraud detection use case, global conditions on all calls issued by a specific SIM have to be monitored and evaluated, and as the cell phone is mobile, these calls can be issued from many different base stations in the telecommunication network. The system must be able to monitor global conditions, to evaluate global functions and to learn global models, while the raw events generating input to these conditions, functions and models occur in a distributed fashion. The challenge is to provide operators for global CEP, function monitoring and learning which can be divided into in-situ and global parts in a communication-efficient manner. While these operators are developed in the respective work packages (WP3 develops in-situ operators for function monitoring, WP4 develops distributed CEP operators, and WP5 distributed machine learning operators), the architecture must provide a means to instantiate and place these operators at the correct position in the distributed system (in-situ close to the sensors, or further upstream for global parts of the operators). Furthermore, up- and downstream communication must be supported between the local and global parts of the system.

• flexible Complex Event Processing (CEP): Complex Event Processing is a high-level methodology to define and execute stream processing applications. Instead of coding applications via low-level programming, as required for instance in Big Data streaming platforms such as Storm or Spark Streaming, application functionality is expressed at a high level in terms of rules and conditions. In particular, details regarding the instantiation and placement of operators in a distributed system are hidden from the user. Apart from providing suitable high-level abstractions for CEP, an issue which is addressed and explained in detail in deliverable D4.2, combining CEP with in-situ processing also implies architectural challenges: on the one hand, the distribution of the application operators to different sites should be optimized for communication efficiency; on the other hand, the application designer should not be burdened with distribution issues. Hence, the system must provide mechanisms to automatically generate a “good” distribution plan from a purely functional CEP application specification.
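To illustrate the separation between the functional specification and its execution, the following hypothetical Python sketch states a CEP rule purely as data (a sequence of event types within a time window deriving a new event) and evaluates it with a generic engine. The rule name, the field names, and the deliberately trivial evaluator are invented for illustration and do not reflect the actual PROTON rule format.

```python
# Illustrative sketch: a CEP rule stated declaratively as data and evaluated
# by a generic engine -- the application author writes no operator plumbing.

rule = {
    "name": "SuspiciousCallBurst",               # hypothetical rule name
    "sequence": ["call_start", "call_start", "call_start"],
    "window": 60.0,                              # seconds
    "derive": "possible_fraud",                  # derived (complex) event
}

def evaluate(rule, events):
    """events: list of (timestamp, type) in time order. Emits a derived
    event whenever the rule's sequence completes inside the time window."""
    derived = []
    matched = []                                 # timestamps of the partial match
    for ts, etype in events:
        if etype == rule["sequence"][len(matched)]:
            matched.append(ts)
            # discard prefix events that no longer fit inside the window
            while matched and ts - matched[0] > rule["window"]:
                matched.pop(0)
            if len(matched) == len(rule["sequence"]):
                derived.append((ts, rule["derive"]))
                matched = []
    return derived
```

A real engine handles many rules, partitions, and out-of-order events; the point here is only that the rule itself is data, so an optimizer is free to decide where and how it is evaluated.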

• adaptation to changing environments: assume that the issue mentioned above has been solved, i.e., the FERARI architecture provides a solution that allows generating and instantiating communication-efficient distributed execution plans for operators that implement a functional CEP specification. Such a distribution plan depends on assumptions regarding event frequencies, network latencies and the computing power available to the applications. What happens if these conditions change over time? The architecture provides mechanisms to monitor current executions, detect changes compared to current assumptions, and dynamically adapt to new situations.
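The adaptation trigger can be sketched as a simple check of observed runtime statistics against the assumptions the current plan was optimized for. The function below is an illustrative stand-in with an invented relative-deviation criterion, not the project's actual change detector.

```python
# Illustrative sketch: the optimizer's planning assumptions are revisited when
# observed event rates drift too far from the rates the current plan assumed.

def needs_replanning(assumed_rates, observed_rates, tolerance=0.5):
    """assumed_rates/observed_rates: stream name -> events per second.
    Returns True if any stream's observed rate deviates from the planning
    assumption by more than `tolerance` (relative), signalling that the
    current distributed plan may no longer be communication-efficient."""
    for stream, assumed in assumed_rates.items():
        observed = observed_rates.get(stream, 0.0)
        if assumed == 0.0:
            if observed > 0.0:
                return True                    # traffic on a stream assumed idle
        elif abs(observed - assumed) / assumed > tolerance:
            return True
    return False
```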

• integration with use cases: the FERARI software platform is intended to be general purpose, providing a set of CEP operators, in-situ operators and machine learning operators for defining distributed streaming applications. However, in a specific use case such as fraud detection, the end user expects an integrated end-user application that is easy to use and tailored to the specific use case.


• open source software platform: last, but not least, FERARI is not limited to architectural design, but also provides an implementation of the architecture as open source code.

The following basic design decisions have been taken in order to cope with the challenges mentioned above.

• re-using an existing Big Data streaming platform: in order to arrive quickly at first results and to have good chances of system integration, we decided to implement the FERARI architecture on top of an existing lower level Big Data streaming platform. To this end, we evaluated different options (namely Storm and Spark Streaming). For the reasons discussed in detail in D1.3, we chose STORM as the primary underlying system and worked on overcoming its inherent limitations regarding large-scale scalability. Storm is basically intended to run in a cluster of computers and uses a Zookeeper installation as backbone for configuration parameters. Zookeeper itself provides strong consistency via a consensus protocol and does not scale to a large number of physically distributed sites. Hence we decided to support scalability in FERARI in a hierarchical approach: a complete FERARI system consists of a scalable number of FERARI sites, each of which runs on top of a full STORM system. FERARI supports intra-site scalability on top of STORM, and inter-site scalability by allowing for multiple FERARI sites. Communication inside a FERARI site is handled via STORM mechanisms; communication between FERARI sites is handled via the communication protocol provided by FERARI, described in detail in D2.3.
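The two-level communication structure can be made concrete with a small sketch: nodes inside the same site communicate via STORM, nodes in different sites via the FERARI overlay. The function below is purely illustrative; the site and node names and the returned route table are invented.

```python
# Illustrative sketch of the hierarchical scalability model: a deployment is a
# set of self-contained sites (each a full Storm cluster); the transport for a
# pair of nodes depends only on whether they share a site.

def build_routes(sites):
    """sites: mapping site name -> list of worker node names. Returns, for
    each ordered pair of distinct nodes, which transport connects them:
    'storm' inside a site, 'ferari' (overlay protocol) across sites."""
    routes = {}
    for site_a, nodes_a in sites.items():
        for site_b, nodes_b in sites.items():
            transport = "storm" if site_a == site_b else "ferari"
            for na in nodes_a:
                for nb in nodes_b:
                    if na != nb:
                        routes[(na, nb)] = transport
    return routes
```

The design consequence is that Zookeeper's consensus traffic stays confined within a site, while only the (much lighter) FERARI protocol crosses site boundaries.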

• re-using and extending the existing PROTON CEP engine: PROTON (IBM's PROactive Technology ONline) is an open source CEP runtime environment developed by the FERARI partner IBM Research Haifa, and is in use in a number of research projects at European level. We decided to extend PROTON to run on top of STORM in the FERARI project in order to integrate it into the FERARI architecture. PROTON on STORM is again open source and provides the PROTON functionality in a distributed, scalable manner. We use the PROTON interface for specifying the global CEP application, and run PROTON on STORM instances in each FERARI site, handling the site-local CEP operators. With this approach, we achieve three fundamental benefits: 1) we avoid re-coding CEP processing by re-using existing PROTON code, 2) we achieve scalability by porting PROTON from thread-based shared memory systems to STORM, and 3) we profit from any extension developed for PROTON, such as the flexible Event Model (TEM) as described in D4.3. IBM recently recognized PROTON and its scalable version PROTON on STORM as a fundamental scientific contribution with the corporation's prestigious Accomplishment Award.

• integration of event buffering and push-pull communication: as described above, distributed CEP, in-situ operators and machine learning make use of specific mechanisms for reducing communication. The basic goal is to avoid that every single event occurring on a sensor or a local site always has to be communicated over the network. However, these mechanisms partly require that events which are not communicated (pushed upstream) per se are buffered for the (rare) case in which they have to be retrieved by the global site. Hence, the FERARI architecture comprises on each site a buffering component, denoted the “Time Machine”, and mechanisms to support communication either in “push mode” or “pull mode”. In “push mode”, events are not buffered locally but pushed upstream immediately; in “pull mode”, events are retrieved from the site-local buffers. This distinction is hidden from the application designer, who only deals with the logical combination of operators.
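A minimal sketch of such a buffering component, assuming an invented interface (the real Time Machine described in the deliverables is considerably richer), might look as follows:

```python
# Illustrative sketch of a per-site event buffer in the spirit of the FERARI
# "Time Machine": in push mode events are forwarded immediately; in pull mode
# they are retained locally until the coordinator requests a time range.

class TimeMachine:
    def __init__(self, mode="pull"):
        self.mode = mode
        self.buffer = []        # (timestamp, event) retained on the site
        self.pushed = []        # stands in for the upstream channel

    def on_event(self, ts, event):
        if self.mode == "push":
            self.pushed.append((ts, event))    # forward upstream immediately
        else:
            self.buffer.append((ts, event))    # keep for a possible pull

    def pull(self, t_from, t_to):
        """Coordinator-side retrieval of buffered events in [t_from, t_to]."""
        return [(ts, e) for ts, e in self.buffer if t_from <= ts <= t_to]
```

The application designer never selects the mode; in the architecture it is the optimizer's output that configures each channel as push or pull.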

• Global CEP optimizer: the CEP optimizer addresses the challenge of hiding distribution from the application designer while achieving an optimal allocation of operators to the distributed sites. The application is designed at a global, functional level as a combination of CEP operators, in-situ operators and machine learning operators. This specification, together with initial system parameters and optimization goals, is handed over to the CEP optimizer, which generates a distributed execution plan for the application. This execution plan is shipped to the respective FERARI sites, each of which instantiates its local part of the global set of operators. Corresponding communication paths between the FERARI sites are set up automatically and are automatically set to the right communication mode (“push” or “pull”), according to the CEP optimizer's output. Furthermore, runtime statistics are gathered and sent to the CEP optimizer, such that it can initiate the generation of an alternative distributed execution plan in order to adapt to changes in the environment.
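The kind of trade-off the optimizer evaluates can be illustrated with a toy cost model: for each operator, compare shipping all raw input to the coordinator against evaluating in-situ and shipping only derived events. The operator names, the cost model, and the two-way "local"/"central" choice below are invented simplifications of the actual optimizer.

```python
# Illustrative sketch of a placement step in the spirit of the CEP optimizer:
# pick, per operator, the communication-cheaper of two placements.

def place_operators(operators):
    """operators: list of dicts with 'name', 'input_rate' (events/s) and
    'selectivity' (derived events per input event). Returns a plan mapping
    each operator to 'local' (in-situ evaluation, ship derived events only)
    or 'central' (push all raw input to the coordinator)."""
    plan = {}
    for op in operators:
        cost_local = op["input_rate"] * op["selectivity"]   # derived events only
        cost_central = op["input_rate"]                     # every raw event
        plan[op["name"]] = "local" if cost_local < cost_central else "central"
    return plan
```

Highly selective operators (e.g. a fraud filter emitting one alarm per ten thousand calls) end up in-situ, while operators that emit roughly as much as they consume gain nothing from local placement.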

• communication and adaptation via the FERARI communication system: in the FERARI architecture, each FERARI site runs the site-local part of the operators on top of STORM. While each FERARI site consists of a scalable number of computing nodes that use STORM mechanisms to communicate among each other, communication between FERARI sites is achieved outside of STORM via the FERARI communication components. They implement an overlay network between FERARI sites, supporting bi-directional communication in push and pull mode. The FERARI communication components support a further important case in the FERARI architecture: if the CEP optimizer decides to generate a new distributed execution plan in order to adapt to changes in the environment, the communication system distributes the new plan and supports running the new and old distributed plans in parallel. This is needed in order to avoid losing state that might have been accumulated in the FERARI operators executing the old plan. The communication system handles distributing events among multiple parallel plans and harmonizing event output from parallel plans.
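The parallel-plan hand-over can be sketched as follows, assuming both plans are fed every event while output is taken from the old plan until the new one has warmed up. The class and its event-count warm-up criterion are invented for illustration; the actual harmonization logic is more involved.

```python
# Illustrative sketch of plan migration: during a switch, events are fed to
# both the old and the new plan in parallel, and a single harmonized output
# stream is produced, so state accumulated in the old plan is never lost.

class PlanMigrator:
    def __init__(self, old_plan, new_plan, warmup_events):
        self.old_plan = old_plan          # callables: event -> output
        self.new_plan = new_plan
        self.warmup_events = warmup_events
        self.seen = 0

    def process(self, event):
        out_old = self.old_plan(event)    # both plans see every event...
        out_new = self.new_plan(event)
        self.seen += 1
        # ...but only one output is emitted: old plan stays authoritative
        # until the new plan has observed enough history.
        return out_old if self.seen <= self.warmup_events else out_new
```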

• use case front-end integration: for use case integration, FERARI adopts the design pattern found in the speed layer of the Lambda Architecture: FERARI can be considered a large scale distributed stream processing engine that processes massively distributed input streams and generates only the application-relevant output events, such as alerts or fraud alarms. These output events are streamed again to an application-specific view that can be freely adapted to the needs of the end user.

The next section gives an overview of how these design decisions are incorporated in the global FERARI architecture.

FERARI Architecture: Global Picture

FERARI supports a hierarchical approach to scalability at large scale. A FERARI site consists of a scalable number of computing nodes running the FERARI platform on top of STORM. A FERARI application is defined via CEP expressions combining CEP operators, in-situ operators and machine learning operators. These operators are distributed among a scalable number of FERARI sites. FERARI sites are connected via the FERARI communication components. At run-time, a FERARI application is executed on top of a number of STORM topologies, each running in a FERARI site and connected to the other STORM topologies via the FERARI communication components. Demos of the FERARI architecture have been presented at ACM SIGMOD 2016 and ACM DEBS 2016.


The FERARI architecture is depicted in Figure 3, while the architecture of each site is detailed in Figure 4.


Figure 3: FERARI overall architecture

Figure 3 depicts the components and the overall workflow for designing and running a FERARI application. The workflow comprises the following steps:

1. The application designer describes the business logic of the application using the abstract event model TEM (1). TEM allows describing the logic of an application by filling application tables, in which the rows describe the logical conditions under which events are monitored and aggregated, and when derived events are to be generated. In addition, non-functional requirements on performance and latencies can be expressed, and the use of special operators for in-situ processing and machine learning can be indicated. More details on TEM are found in deliverable D4.3.

2. The TEM translates the abstract CEP model provided by the application designer into a lower-level representation, suitable to serve as input for a CEP execution runtime engine. This representation, which in case of the PROTON CEP engine is a text file in JSON format, expresses the application logic in terms of operators and could directly be executed by the PROTON or PROTON on STORM engine. For achieving large-scale scalability over different FERARI sites, we extended the low-level CEP representation with annotations regarding latency and performance requirements, and with information regarding network topologies, latencies, and initial event frequencies. This information can be provided by the administrator of the system and is not required from the business end user. The annotated low-level CEP model serves as input


for the FERARI CEP optimizer. The optimizer breaks down the logical global CEP expression into a distributed execution plan, consisting of interrelated sub-expressions, that is implemented by distributing CEP operators, in-situ operators, and machine learning operators on the different FERARI sites. The operators are placed so as to fulfil the indicated optimization goals. As a result, a so-called site configuration file is generated and shipped to each FERARI site.

3. Each FERARI site receives its site configuration file as part of the global execution plan from the CEP optimizer. The site configuration file describes which CEP operators, in-situ operators and machine learning operators have to be instantiated on the site, to which other FERARI sites input and output connections have to be established via the FERARI communication components, and in which mode (push/pull) the communication should be organized for each event type. All operators run on top of STORM. Each site starts a STORM topology, running its part of the application, and connects the inputs and outputs of the STORM topology via the FERARI communication system with the other FERARI sites, according to the application-specific connection topology described in the site configuration file. A detailed description of the CEP optimizer can be found in D5.3.

4. Each FERARI site monitors the execution of its part of the application and generates statistics for event frequencies. In addition, global event frequency estimations are learned as described in D5.3. The runtime statistics are transmitted via the FERARI communication components to the CEP optimizer, which may decide to generate a new global execution plan in order to adapt to changes in the environment. This plan is then again distributed and instantiated on the FERARI sites. In order to cope with the state that might have been accumulated in the operators running the previous plan, both plans run in parallel for an application-dependent period of time, under supervision of the FERARI communication components. The communication components duplicate input to multiple parallel plans and consolidate output from different plans consistently, such that the application is fed with outputs from one plan only at a time.

5. At runtime, streams of input data are continuously fed into the FERARI system. Input data can be generated on computing nodes containing sensors, or, for example, be sent from remote nodes via Kafka. The interconnected FERARI sites, each running an automatically generated STORM topology, process the incoming data as raw events, filtering, aggregating and generating derived events according to the abstract application specification provided in step 1. Derived events of relevance for the application are forwarded via the FERARI communication components to an application-specific system, where they can be persisted and/or visualized. Note that the FERARI CEP offers the option to deliver the causing raw input events in an aggregated form together with the derived event, which is useful, for example, for further analysis of the derived events. In the use case example of fraud detection on CDRs, the CDRs causing a fraud alarm may be inspected by a human operator in order to finally decide whether a SIM card should be suspended or not.
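As an illustration of the site configuration file of step 3, a minimal Python sketch is given below. All field, operator and site names are invented for illustration; they are not the actual format produced by the CEP optimizer.

```python
# Hypothetical shape of a site configuration file: which operators to
# instantiate, and which connections to establish in which mode.
site_config = {
    "site_id": "site-3",
    "operators": [
        {"type": "cep",     "name": "DetectCallBurst"},
        {"type": "in_situ", "name": "LocalThresholdMonitor"},
        {"type": "ml",      "name": "FrequencyEstimator"},
    ],
    "connections": [
        # peers to talk to, and the communication mode per event type
        {"peer": "site-1", "direction": "out", "event_type": "CallBurst",   "mode": "push"},
        {"peer": "site-7", "direction": "in",  "event_type": "RoamingCall", "mode": "pull"},
    ],
}

def pull_event_types(cfg):
    """Event types this site must buffer because a peer may pull them later."""
    return sorted({c["event_type"] for c in cfg["connections"] if c["mode"] == "pull"})
```

A site would use such a description to start its STORM topology and to wire the FERARI communication components accordingly.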

A FERARI application runs distributedly on a scalable number of FERARI sites, each of which runs a part of the application on top of a STORM topology. Figure 4 describes the internal components of a FERARI site.
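The parallel-plan adaptation of step 4 can be sketched as follows; this is a minimal model that treats plans as callables, with class and method names invented for illustration.

```python
# Minimal sketch of running an old and a new plan in parallel during plan
# adaptation: input is duplicated to both plans so the new plan can build up
# operator state, but the application only sees output from the active plan.
class PlanMigrator:
    def __init__(self, old_plan, new_plan):
        self.plans = {"old": old_plan, "new": new_plan}
        self.active = "old"          # outputs are taken from one plan at a time

    def feed(self, event):
        # duplicate the input to every running plan
        outputs = {name: plan(event) for name, plan in self.plans.items()}
        return outputs[self.active]  # consolidate: forward the active plan's output only

    def switch_over(self):
        # once the new plan has accumulated enough state, retire the old one
        self.active = "new"
```

For example, after `switch_over()` the same `feed()` call delivers the new plan's output, while the old plan's state can be discarded.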



Figure 4: FERARI site architecture (running on Storm)

Communication between a FERARI site and other FERARI sites, as well as with external systems (input of streaming data, output of derived events for the end-user application), is organized via the FERARI communication components that “surround” the FERARI site, as depicted in Figure 4. The CEP box in Figure 4 stands for the PROTON on STORM implementation, executing CEP operators in a distributed and scalable manner on top of STORM. These CEP operators filter out and aggregate events, generating derived events that are forwarded to the “Time Machine” component, which acts as a buffer for site-local derived events. This event buffer is involved in the communication-efficient protocols implemented in the push/pull mechanism, the in-situ processing operators (for the resolution of so-called local violations) and the machine learning operators. The “Gate Keeper” component executes these specific protocols. For example, for in-situ operators, it checks site-local monitoring conditions and only forwards derived events in case these local conditions are violated.
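The Gate Keeper's filtering behaviour for in-situ operators can be sketched as follows; the threshold condition on a single value is a hypothetical stand-in for the actual site-local monitoring conditions.

```python
# Sketch of the Gate Keeper idea: derived events from the local buffer are
# only forwarded to the coordinator when a site-local condition is violated.
class GateKeeper:
    def __init__(self, local_threshold):
        self.local_threshold = local_threshold
        self.forwarded = []

    def on_derived_event(self, event):
        # check the site-local condition; stay silent while it holds
        if event["value"] > self.local_threshold:   # local violation
            self.forwarded.append(event)            # escalate to the coordinator
        # otherwise the event remains only in the local Time Machine buffer
```

In this model, most events never leave the site, which is exactly the communication saving that in-situ processing aims for.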

The FERARI distributed CEP Optimizer

The query optimizer is an essential part of the overall FERARI architecture. It enables the generation of optimal plans that efficiently balance communication cost and complex event detection latency. The basic optimization goal of in-situ processing is realized through the deployment of those plans in the underlying distributed network and through the exploitation of the push and pull paradigm. Each site of the network is responsible for monitoring its own stream of events and the optimizer is responsible for providing plans that manage inter-site communication for the timely detection of complex events.


Additionally, in order to facilitate in-situ processing, the optimizer selects micro-coordinators that are responsible for answering specific parts of the query through push and pull requests among all involved sites. Depending on the query set and the network parameters (inter-site connectivity and local event frequencies), the optimizer is able to run the algorithms presented in Deliverable D5.3 and provide the necessary files that specify the way the queries are answered and the way event detection is communicated through the network.

The basic focus of the optimizer is to provide a set of Pareto optimal plans and subsequently supervise the plans’ performance. If at some point in time the performance of the currently chosen plan deteriorates, the optimizer will rerun its plan generation algorithms and decide whether the obsolete plan should be replaced. The produced plans are Pareto optimal with respect to the detection latency and communication cost for detecting complex events. In order to facilitate such optimality, Non-deterministic Finite Automata (NFA) are used for representing those plans. Each state of the resulting NFA represents the monitoring of a set of events (push mode). With the detection of all involved events within the specified time window, a state transition occurs, which marks the monitoring of a new set of events (pull mode). When the final state of the NFA is reached, a complex event is generated, meaning that the complex event query is satisfied.
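The NFA-shaped plans described above can be sketched as follows. This is a simplified model, assuming one event-type set per state and a single shared time window; class, event and method names are illustrative.

```python
# Sketch of an NFA-shaped query plan: each state monitors a set of event
# types; once all of them are seen inside the time window, the plan advances
# (the next set would be switched from push to pull); reaching the final
# state emits a complex event.
class NfaPlan:
    def __init__(self, states, window):
        self.states, self.window = states, window   # states: list of event-type sets
        self.reset()

    def reset(self):
        self.idx, self.seen, self.start = 0, set(), None

    def observe(self, event_type, ts):
        if self.start is not None and ts - self.start > self.window:
            self.reset()                             # window expired: start over
        if event_type in self.states[self.idx]:
            self.start = ts if self.start is None else self.start
            self.seen.add(event_type)
            if self.seen == self.states[self.idx]:   # state complete -> transition
                self.idx, self.seen = self.idx + 1, set()
                if self.idx == len(self.states):     # final state reached
                    self.reset()
                    return "COMPLEX_EVENT"
        return None
```

For example, a plan `NfaPlan([{"A", "B"}, {"C"}], window=10)` first monitors A and B in push mode, and only after both are seen does it start looking for C.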

The Geometric Monitoring operator plans are also represented as NFAs but exhibit different functionality than ordinary CEP operators. The first state always represents the threshold monitoring of a function, and when a local violation occurs, a series of states of the NFA mark the resolution of that violation. Thus, the first state is always in push mode and, when a violation occurs, depending on the number of states, a series of pull requests are generated to a set of involved sites, stated by each state of the NFA, to resolve the threshold crossing. With all the above in mind, several architectural injections are needed for the optimizer to work properly and to create the conditions for optimality in detection latency and communication bandwidth usage. The intra-site architecture must also be adapted to store statistics for monitoring plan effectiveness and to handle the necessary pull requests for plan execution. We will elaborate on the plan generation workflow and focus on how the generated plans affect the overall architecture and the implementation injections that made all the above possible. The plan generation algorithms are discussed in Deliverable D5.3.

The query optimizer is a distinct entity in the FERARI architecture. It may or may not reside in a specific site of the distributed network but must be constantly running. It must, however, be able to communicate with every site in the network, and every site in the network must be able to communicate with the optimizer, as illustrated in Figure 5. For conceptual reasons the optimizer is decoupled from the underlying distributed architecture and created as a distinct object (not as part of a specific site).


The optimizer has its own Apache Storm topology that consists of the OptimizerSpout and the OptimizerBolt, as illustrated in Figure 5, which depicts the location and interconnections of the optimizer in the MobileFraud FERARI use case (with the towers depicting the location of the remote source sites). The OptimizerSpout is responsible for receiving incoming communication, and the OptimizerBolt is responsible for executing the algorithms that produce the plans that will subsequently be executed by the various sites in the network, as will be thoroughly discussed below. All the generated configuration files are communicated to every site by the OptimizerBolt after their generation.

The optimizer’s work, though, is not over at this point, as the plans’ performance must be monitored and actions must be taken should the performance deteriorate, as analyzed below and in Deliverable D5.3.


Figure 5: Placement of the FERARI optimizer


Extended Distributed Architecture

The basic architectural enrichment that is implemented to facilitate the execution of the generated plans by the optimizer is the potential existence of multiple coordinating sites (referred to as micro-coordinators). The micro-coordinators are selected by the optimizer and are responsible for the execution of a query plan and all necessary communication needed for detecting every complex event, irrespective of which source generated it.


Thus our architecture can be illustrated as a multi-star topology where each site may be a micro-coordinator having full coordinating functionality for executing its query plan. In Figure 6 we can see such an architectural scheme with two micro-coordinators (marked as CEP Coordinator) that are linked with every site (marked as CEP Source) needed for the execution of the optimizer’s plan. The above functionality is a direct result of the optimizer’s configuration files for each site. The communication links, depicted by the red arrows in Figure 6, can be either directed or undirected, depending on the network’s actual communication links, and each introduces a certain amount of latency to the pairwise communication.

Each site may simultaneously be a source site, a CEP micro-coordinator, a Geometric Monitoring source site and a Geometric Monitoring micro-coordinator, depending on the optimizer’s generated plan. To facilitate this functionality, additional adaptations are made inside each site’s topology (intra-site):

• Event buffering, for each relevant event that may be pulled from this site or another (primitive or derived).

• Pull requests, generation and handling of event requests for executing communication/latency efficient plans.

• Push mode, incorporation of automatic event shipment when an event is monitored by another micro-coordinator.

• Statistics storage, for all events participating in each site’s plans for efficiency monitoring.


Figure 6: FERARI extended distributed architecture with multiple micro-coordinators


These are some of the injections needed to support such an architectural scheme. These architectural injections are thoroughly addressed in the following sections.

The Optimizer Workflow

The query optimizer needs two basic sets of information as input parameters in order to execute the plan generation algorithms: the network parameters and the set of queries. These two pieces of information, which constitute a priori knowledge of our network and its event streams, are fed into the query optimizer through external communication and read by the OptimizerSpout. The external communication can take whichever form is suitable to our needs, from sockets to REST communication, and represents the basic understanding of our complex event processing system, which will be created by both an expert user (network parameters) and a non-expert/business user (set of queries). These two files can be the result of two different programs running outside of the desired network to facilitate these necessary processes and sent to the optimizer with the aforementioned external communication. They constitute the output of Work Package 4, which is then input to the optimizer.

Once the two files are read by the OptimizerSpout, they are sent, through Storm messages, to the OptimizerBolt of the query optimizer, where all the algorithms for plan generation are executed. As a result, each site receives a set of files with specific instructions on how to operate. Now, the whole system is ready to go online and each site may start receiving events from its own stream, producing results with respect to the given queries and reporting them to the end user through the FERARI dashboard.

During the execution of submitted query plans, the optimizer monitors their performance. The statistics are monitored in a distributed and communication-efficient fashion (described in Deliverable D5.3) and transmitted by each site to the optimizer only when they are believed to have changed significantly. These statistics are then again read by the OptimizerSpout and forwarded to the OptimizerBolt for evaluation. The optimizer then uses the updated statistics and evaluates whether it is beneficial to initiate an adaptation process in order to replace a currently executed plan with a new one. The adaptation process should offer gains in terms of communication cost and should be performed in a manner that no complex event will be missed. The optimizer workflow is illustrated in Figure 7.
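The two decisions described above (when a site should report statistics, and when the optimizer should replace a plan) might be sketched as follows; the thresholds are invented for illustration, and the actual criteria are given in Deliverable D5.3.

```python
# Sketch of the adaptation trigger: a site transmits an event-frequency
# statistic only when it has changed significantly, and the optimizer then
# decides whether replanning is worthwhile.
def significantly_changed(old_freq, new_freq, rel_threshold=0.2):
    """Report a frequency only if it moved by more than rel_threshold."""
    if old_freq == 0:
        return new_freq > 0
    return abs(new_freq - old_freq) / old_freq > rel_threshold

def should_replan(current_cost, candidate_cost, min_gain=0.1):
    """Replace the plan only if the estimated gain justifies the migration."""
    return candidate_cost < (1.0 - min_gain) * current_cost
```

Coupling both checks keeps the monitoring itself communication-efficient: statistics travel to the optimizer only on significant change, and plans are replaced only when the expected saving exceeds the migration overhead.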


Figure 7: Query optimizer workflow.

After the system setup, all the sites can start processing events. These events may cause communication among the sites so as to evaluate a query or an in-situ GM (geometric monitoring) or machine learning operator.


The full operational workflow of the Geometric Monitoring Operator is also presented in detail below.

Complex Event Processing Workflow

The push and pull incorporation in the architecture is illustrated in Figures 8 and 9. The dashed arrows depict inter-site communication, whereas the normal arrows depict intra-site messages. Typically, events arrive in each site as a stream in the InputSpout. They are processed by the CEP module, which is the ProtonOnStorm, and are subsequently forwarded to the TimeMachine module for buffering purposes. This process is continuous and occurs in every site regardless of possibly assigned coordinating tasks.


The micro-coordinating site, depicted in the top left of both Figures 8 and 9, is responsible for executing the query plan. If the query plan has more than one state, the first state is always in push mode (Figure 9) and the later states are activated upon detection of all events included in previous steps within the time window. Upon a state transition, a pull request is generated by the ProtonOnStorm module and emitted to the Communicator module. The Communicator is then responsible for informing all sites (including itself) that produce this event type that it requires all relevant events produced within a given time window. The pull request is then sent from the Communicator of the coordinating site to the CommunicationSpout of all sites that produce that type of event (Figure 8). Upon arrival, the pull request is sent to the buffering module (i.e. the TimeMachine), which switches to push mode for the requested events for the requested timespan. Once in push mode, the TimeMachine checks whether the requested events have already occurred within the specified timespan and forwards them back to the coordinating site’s InputSpout through the source’s Communicator (Figure 9). At the same time, the TimeMachine remains in push mode until the pull request’s timespan expires, and whenever an event of interest is detected it is immediately forwarded back to the micro-coordinator.
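The fan-out of a pull request from the coordinating site's Communicator to all producing sites can be sketched as follows; this is a hypothetical helper, and the producer map would in practice come from the optimizer's site configuration files.

```python
# Sketch: upon an NFA state transition, build one pull request per site that
# produces any event type monitored by the next state.
def make_pull_requests(next_state_events, producers, window):
    """producers: event type -> list of site ids that generate it."""
    return [{"site": site, "event_type": ev, "window": window}
            for ev in sorted(next_state_events)
            for site in producers.get(ev, [])]
```

For instance, a transition into a state monitoring event type "B" would yield one request per site known to produce "B", each carrying the requested timespan.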


Figure 8: CEP micro-coordinator pull request.


The aforementioned functionality introduces some requirements in order for it to work properly. First of all, derived events, which are output from the CEP module, must be stored in the TimeMachine. Thus the TimeMachine is equipped with buffers (HashMap structures) for each event type, in which events are stored along with their detection timestamps. Additionally, since the push mode is the responsibility of the TimeMachine, a time-based mechanism is employed for recovering past events within a timespan, along with the forwarding process for shipping events of interest to coordinating sites.
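A minimal sketch of this TimeMachine behaviour is given below: per-event-type buffers, recovery of past events for a requested timespan, and a push mode that lasts until the timespan expires. Method names are illustrative, not the actual FERARI API.

```python
from collections import defaultdict

class TimeMachine:
    def __init__(self):
        self.buffers = defaultdict(list)   # event type -> [(ts, event), ...]
        self.active_pulls = {}             # event type -> (t_from, t_to)

    def on_pull(self, event_type, t_from, t_to):
        self.active_pulls[event_type] = (t_from, t_to)   # enter push mode
        # recover events of that type that already occurred in the timespan
        return [e for ts, e in self.buffers[event_type] if t_from <= ts <= t_to]

    def store(self, event_type, ts, event):
        self.buffers[event_type].append((ts, event))
        window = self.active_pulls.get(event_type)
        if window and window[0] <= ts <= window[1]:
            return event                                 # push mode: forward immediately
        if window and ts > window[1]:
            del self.active_pulls[event_type]            # timespan expired
        return None
```

A pull request thus first returns the buffered past, after which newly stored matching events are forwarded as they arrive, until the requested timespan runs out.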

As previously described, pull requests are generated upon state transition of the NFA query plan by the ProtonOnStorm module and emitted to the Communicator. The Communicator is then responsible for forwarding the pull request and is equipped with a structure registering the pull requests in order to avoid sending time-overlapping pull requests for the same event types.
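The Communicator's bookkeeping against overlapping pull requests might be sketched as follows; the structure is hypothetical, standing in for the actual implementation inside the FERARI communication components.

```python
# Sketch: suppress a pull request whose timespan overlaps an already-pending
# request for the same event type.
class PullRegistry:
    def __init__(self):
        self.pending = {}   # event type -> list of (t_from, t_to)

    def try_register(self, event_type, t_from, t_to):
        for a, b in self.pending.get(event_type, []):
            if t_from <= b and a <= t_to:    # overlapping timespan: suppress
                return False
        self.pending.setdefault(event_type, []).append((t_from, t_to))
        return True
```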

Inside the ProtonOnStorm module there is a bolt, called the RoutingBolt, which is responsible for routing all events to the processing bolts (EPAs). Since all events are routed through the RoutingBolt, it is fitting that event statistics are gathered there. Those statistics are periodically pushed to the TimeMachine module for storage (see the light blue arrow in Figure 10). The supported statistics vary from simple counters to more complicated structures like histograms, sampling, ECM-Sketches (see Deliverable D3.2) or even robust Kernel Density Estimators (see Deliverable D5.3).

The RoutingBolt presents itself as the most fitting candidate for handling the pull requests as well, since all intermediate derived events (state transitions) are also routed through this bolt inside the ProtonOnStorm topology. Simple structures are used for alerting the RoutingBolt when a derived event that is also a state transition is generated by the CEP engine, in order for a pull request to be spawned. The request is then emitted to the Communicator as previously described. This process is illustrated with the red arrow in Figure 10, where the ProtonOnStorm’s internal structure is visualised.


Figure 9: CEP micro-coordinator push request.



Geometric Monitoring Operator Workflow

The Geometric Monitoring (GM) operator presents some differences from ordinary complex event queries due to its more complex functionality. The basic architectural scheme, though, remains as described above. The GM operator is placed at a specific site (GM micro-coordinator) with a specific plan decided by the optimizer in order to reduce communication and latency. Another major difference lies in where the plans are executed inside the coordinating sites and where the computation of the threshold monitoring functions is performed in every site that is involved in the GM operator. All the above will be addressed below.

Geometric Monitoring plans differ from CEP plans, since the system handles the monitoring of more complex functions. The GM plans are divided into two steps. The first step continuously monitors the assigned thresholds throughout the designated sites and pushes the local statistics vector to the coordinating site whenever a local violation occurs. The second step is the resolution phase, which is activated when a local violation occurs and aims to determine whether the local violation was a global one as well, in order to create a complex event marking that global violation. The second step can be further divided into multiple steps, each involving a subset of the sites, in order to lighten the communication overhead that a full synchronization requires; this is the key objective of the optimizer when dealing with such queries.
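The two GM steps can be sketched as follows, for an invented monitored function (the mean of the local statistics vectors) and threshold; the real functions and resolution algorithms are those described in Deliverable D5.3.

```python
# Step 1: each site checks the monitored function on its local statistics
# vector and reports a local violation. Step 2: the coordinator pulls vectors
# from a subset of sites and checks whether the violation is also global.
def f(vector):
    # the monitored function; a simple mean is used purely for illustration
    return sum(vector) / len(vector)

def local_violation(local_vector, threshold):
    # step 1: site-local threshold check
    return f(local_vector) > threshold

def resolve(violating_vector, pulled_vectors, threshold):
    # step 2: average the violating vector with the pulled ones; if the
    # averaged statistics respect the threshold, the violation was local only
    vectors = [violating_vector] + pulled_vectors
    avg = [sum(col) / len(vectors) for col in zip(*vectors)]
    return "GLOBAL_VIOLATION" if f(avg) > threshold else "LOCAL_ONLY"
```

Pulling only a subset of sites first, as in the multi-step resolution above, often cancels the violation without the full synchronization that contacting every site would require.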

The basic component inside each site that handles the task of monitoring the threshold for the designated function is the GateKeeper. All computation and decision making regarding the GM operator takes place in that component of our Storm topology, whether the site in question is simply a source or a micro-coordinator. Additionally, for coherence with the CEP engine, all buffering is the responsibility of the TimeMachine. Whenever the Local Statistics Vector (LSV) is updated in the GateKeeper, it is also pushed to the TimeMachine for buffering. Similarly, for computing the Local Statistics Vector the GateKeeper takes its input from the TimeMachine, which in turn takes its input from the ProtonOnStorm module, as illustrated in every sub-figure of Figure 2.8 by the light blue arrows.


Figure 10: CEP Proton on Storm integration.

The FERARI architecture was evaluated against use cases running on real-world data. For this purpose, HT collected CDR (Call Detail Record) data, which was anonymized for analysis in the project. Fraud detection rules have been implemented using Proton on Storm, following its methodology based on EPNs (Event Processing Networks) and EPAs (Event Processing Agents). The implementation has been evaluated both in terms of accuracy and performance.

Fraud Detection Use Case

The goal in fraud mining is to identify users which use a network service without the intention to pay for that use. Many fraud mining systems in telecommunications use some form of rules, often defined by fraud experts or automatically by some software, to raise alarms. These alarms are checked by fraud investigators on a case-by-case basis. During night times, when no fraud investigators are present, the software may automatically block certain calls to prevent damage. During day times the fraud investigators take actions after they have investigated a case. It is their duty to decide whether a suspicious behavior is fraudulent or legal. This depends on the current


call, the call history, the customer history and the subscription plan of the customer. The focus within FERARI lies on the identification of suspicious calls and users and the design of distributed, communication-efficient systems for this task.

Within this coarse definition of telecommunication fraud several well-known patterns exist, each with its own characteristics. We briefly describe some of the patterns addressed in the FERARI project.

Premium rate service fraud. A premium rate service is a service with additional charges beyond the regular cost for a call. In this fraud type the fraudster cooperates with a premium rate service provider. The fraudster generates calls to the premium rate service and gets a share of the profits from the premium rate service provider. The fraudster and the premium rate service provider cooperate across networks. The premium rate service provider charges the terminating network, which charges the originating network in turn. The originating network charges the customer. If the customer happens to be a fraudster, the originating network operator has a financial loss. The spread over different networks and countries makes fighting this fraud challenging.

Roaming fraud. While the techniques used by fraudsters in roaming fraud might be considered an extension of the fraud techniques used in the homeland, they take longer to detect, longer to respond to, and as a result of both can cause higher losses. The main reason for the delay in detection is a delay in the data delivery from the roaming carrier to the homeland carrier of the user.

Sim cloning. Sim cloning is a type of superimposed fraud, where fraudsters "take over" a legitimate account. Sim cloning may happen with older sim cards, if the fraudster has physical access to such a sim card. He may then clone it and use the cloned card. While the normal user will follow his regular usage pattern, the fraudster may simultaneously use the cloned card to superimpose his use on the legitimate user. The combination of legitimate and illegal use may prevent the discovery of the fraudster for a while, as the usage pattern may not show the typical fraudster pattern.

Fraud discovery systems rely on some form of rules, which are either human generated or learned by software. Human generated rules consist of several elements. Please note that the thresholds we indicate here do not correspond to the thresholds used in the actual fraud detection system of HT. However, the structure of the rules is equivalent, and we also evaluated the accuracy of the FERARI architecture using the real thresholds against the actual detection rate of the system used by HT.

• Pattern (e.g. a call made to the Maldives in night hours lasting longer than 40 minutes)

• Threshold (e.g. 10 calls made to the Maldives lasting longer than 60 minutes during the last 24 hours)

• Accumulation (e.g. usage higher than 100 kn made to premium services during the last six hours)

• Timespan (e.g. more than 500 calls during the last 12 hours)

As shown in the examples, a typical rule has several, but not necessarily all, of these elements.
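As an illustration, the rule elements above can be written as predicates over a call-history window and combined into a complex rule. All field names and thresholds here are invented (mirroring the report's note that the real HT values cannot be disclosed), and the timespan element is folded into the choice of the history window.

```python
# Illustrative rule elements; destinations, field names and thresholds
# are invented, not the real HT values.
def pattern(call):
    # Pattern element: a night call to a premium destination over 40 minutes.
    return (call["destination"] == "Maldives"
            and (call["hour"] >= 19 or call["hour"] < 7)
            and call["duration_min"] > 40)

def threshold(history, limit=10):
    # Threshold element: at least `limit` pattern-matching calls in the window.
    return sum(1 for c in history if pattern(c)) >= limit

def accumulation(history, limit=100_000):
    # Accumulation element: summed premium-service charges in the window.
    return sum(c["charge"] for c in history if c["premium"]) > limit

def fraud_alarm(history):
    # Boolean (AND/OR) combination of the elements into a complex rule.
    return threshold(history, limit=3) or accumulation(history)
```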

Rule based alarm systems allow the definition of more complex criteria using Boolean logic (AND/OR/NOT) to combine several of these elements to form more complex rules. The examples shown here are invented, and more complex rules cannot be disclosed due to the company policy


of Hrvatski Telekom. Computer generated patterns may take similar forms but can also include artificial neural networks or other non-human-readable forms. A good fraud discovery system should satisfy the following properties:

• Discover new fraudsters fast (i.e. short detection time)

• Discover new fraud patterns (i.e. good generalization)

• Have a very low false alarm rate (i.e. high accuracy)

• Discover all fraudsters

These properties may not be satisfiable all at once. Imagine a system that would report every call as fraud. It would obviously have a very high false alarm rate, but it would definitely discover new fraudsters, new fraud types and all fraudsters. This imaginary system would be entirely useless, as it requires too much manual effort. Consider an imaginary system on the extreme opposite, i.e. a system that never generates a false alarm. A system that never generates an alarm will never generate a false alarm; however, such a system is useless for obvious reasons.

Any real fraud discovery system must meet all of the properties to some extent. The exact extent to which each property is covered may vary from system to system and company to company. While some company may pay more attention to the fast discovery of new fraudsters, another company may prefer a lower false alarm rate at the expense of missing some fraudsters. Whenever automatically discovered rules are used as a supplement to manually maintained rule sets, additional requirements must be met. The rule set should be small, to be manageable by human experts, and each rule in such a rule set should cover a different behavior of a fraudster.

Fraud Detection with FERARI CEP

The overarching aim of the CEP component in this use case is to detect potential mobile fraud incidents. To this end, a first EPN (Event Processing Network) has been created in collaboration with the use case owner, with the goal of having something meaningful and representative, yet achievable in the first year of the project. The outcome is an EPN consisting of five EPAs (Event Processing Agents), shown in Figure 11 and detailed in the following sections. For the sake of simplicity we only show the EPAs and the event flows in the network.

In the current EPN we want to fire situations in the following cases:

• A long call to a premium destination is made during night hours (EPA1, LongCallAtNight).

• As before, but this time we are looking for at least three of these "long distance calls" per calling number (EPA2, FrequentLongCallsAtNight).

• Multiple long distance calls per calling number that cost more than a certain threshold value (EPA3, FrequentLongCalls).

• Same as before, but the cost of each occurrence exceeds the threshold (EPA4, FrequentEachLongCall).

• We are looking for high usage of a line for long distance calls (EPA5, ExpensiveCalls).

In the current process, potential fraud situations are (automatically) marked and inspected afterwards by a human operator, who decides whether it is fraud or not. Therefore, the situations described above and depicted in Figure 5 will be marked as potential indications of fraud incidents and will be checked by humans afterwards.

Page 22: D7.4 Final Project Report v1.0 - FERARI Project · One key aspect for achieving scalability at very large scale is the provision on communication ... Doing so requires a rethinking

22

[Figure 11 diagram: the Calls stream and five EPAs (EPA1 LongCallAtNight, EPA2 FrequentLongCallsAtNight, EPA3 FrequentLongCalls, EPA4 FrequentEachLongCall, EPA5 ExpensiveCalls) deriving Situations.]

Figure 11: Mobile fraud use case initial EPN
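A minimal way to picture the EPN is as a mapping from each EPA to its input and output event types. The event names follow Figure 11 and Table 1, but the wiring of EPA3-EPA5 directly to the raw Call stream is our assumption for illustration.

```python
# The EPN as a mapping from EPA to input/output event types. Event
# names follow the report; the EPA3-EPA5 input wiring is assumed.
EPN = {
    "EPA1": {"in": ["Call"], "out": ["LongCallAtNight"]},
    "EPA2": {"in": ["LongCallAtNight"], "out": ["FrequentLongCallsAtNight"]},
    "EPA3": {"in": ["Call"], "out": ["FrequentLongCalls"]},
    "EPA4": {"in": ["Call"], "out": ["FrequentEachLongCall"]},
    "EPA5": {"in": ["Call"], "out": ["ExpensiveCalls"]},
}

def consumers(event_type):
    # EPAs that consume `event_type`, i.e. the event-flow edges.
    return [agent for agent, io in EPN.items() if event_type in io["in"]]
```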

Note the following:

• Due to privacy issues, the values chosen for specific variables and the thresholds selected are not the correct ones. In reality, the EPN is implemented applying the correct values. However, this does not alter the logic of the rules, just the assignment of the different variables and threshold values.

• "Premium location services" is a closed list of potential far locations/destinations for which the rules are relevant. We have opted for "Maldives" as a code name for these locations. In practice, the same pattern will be duplicated for each of the locations in this list.

• In this use case night hours are considered between 19:00 and 7:00, and 24 hours are considered from 24:00 to 23:59 the day after.

• We are only interested in outgoing calls (incoming calls are not relevant to fraud detection), indicated whenever the call_direction field equals 1 (refer to Table 1).

Five event types have been defined so far that comprise the event inputs, outputs/derived events, and situations, as shown in Table 1. For the sake of simplicity we only show the user-defined attributes, i.e. the event payload, and not the metadata.

Although the names of concepts in the application can be determined freely by the application designer in PROTON, we use some naming conventions for the sake of clarity. We denote event types with capital letters. Built-in/metadata attributes start with a capital letter, as do payload attributes that hold operator values, while other payload attributes start with a lowercase letter.

Note that the Call raw event includes more fields or attributes. We defined only the ones required for pattern detection in the current EPN implementation. When running in the FERARI architecture, PROTON will ignore event attributes not specified in its JSON.


Event name: Call
Payload: object_id; billed_msisdn; call_start_date; calling_number; called_number; other_party_tel_number; call_direction; tap_related; conversation_duration; total_call_charge_amount

Event name: LongCallAtNight
Payload: calling_number; conversation_duration; other_party_tel_number

Event name: FrequentLongCallsAtNight
Payload: calling_number; other_party_tel_number; CallsCount

Event name: FrequentLongCalls
Payload: calling_number; other_party_tel_number; CallsCount; CallsLengthSum

Event name: FrequentEachLongCall
Payload: calling_number; other_party_tel_number; CallsCount

Event name: ExpensiveCalls
Payload: calling_number; other_party_tel_number; CallsCostSum

Table 1: Initial EPN for the mobile phone fraud use case

Henceforth, we describe the EPAs in the following order: event name; motivation; event recognition process; contexts along with temporal context policy; and pattern policies.

In the event recognition process we only show the steps that take place, i.e. are relevant, in the specific EPA, while the others are greyed out. For the filtering step we show the filtering expression; for the matching step we denote the pattern variables; and for the derivation step we denote the value assignments and calculations. Please note that for the sake of simplicity we only show the assignments that are not copies of values (all other derived event attribute values are copied from the input events). For attributes, we just denote their names without the prefix 'attribute_name.'

LongCallAtNight

Motivation: Check for "long" calls (defined as more than 40 minutes) to premium locations during


night hours (limited from 19:00 to 7:00).

Event recognition process

other_party_tel_number = "Maldives" AND call_direction = 1 AND (call_start_date > 19:00 OR call_start_date < 7:00) AND conversation_duration > 40 minutes

[Figure 12 diagram: a Call event enters the Event Processing Agent within context; only the filtering step is active and derives LongCallAtNight.]

Figure 12: Event recognition process for the Filtering EPA

Note that Filter agents are used to eliminate uninteresting events. A Filter agent takes an incoming event object and applies a test to decide whether to discard it or whether to pass it on for processing by subsequent agents. The Filter agent test is therefore stateless, in other words, a test based solely on the content of the event instance. Therefore, both pattern and context policies are not applicable to this type of EPA.
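A stateless filter test in the spirit of EPA1 can be sketched as a plain predicate. The dict-based event shape and the pre-extracted call_start_hour field are simplifying assumptions, not PROTON's actual API.

```python
# Illustrative stateless filter for EPA1 (LongCallAtNight). Assumption:
# events are dicts and the start hour has already been extracted from
# call_start_date as an integer hour of day.
def long_call_at_night(event):
    return (event["other_party_tel_number"] == "Maldives"
            and event["call_direction"] == 1                  # outgoing only
            and (event["call_start_hour"] >= 19
                 or event["call_start_hour"] < 7)             # night hours
            and event["conversation_duration"] > 40)          # minutes
```

Events failing the test are discarded; passing events are forwarded unchanged, which is exactly what makes the agent stateless.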

EPA2: FrequentLongCallsAtNight

Motivation: Same as before, but we are seeking at least 3 calls made to premium locations during night hours lasting longer than "40 minutes" per calling number.

Event recognition process

[Figure 13 diagram: a LongCallAtNight event enters the Event Processing Agent within context; the matching step applies the COUNT pattern with assertion count > 2 and derives FrequentLongCallsAtNight with CallsCount := count.]

Figure 13: Event recognition process for the FrequentLongCallsAtNight EPA

Note that the pattern COUNT sums the number of input event occurrences, while count is the assertion value for the COUNT pattern. Also note that the input event for this EPA is the LongCallAtNight


event, which is derived from EPA1 (see Figure 11).
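The COUNT pattern with the assertion count > 2 can be sketched as a windowed group-by. The explicit window arguments and dict events are simplifications of PROTON's temporal context handling.

```python
from collections import defaultdict

# Sketch of a COUNT pattern EPA: derive FrequentLongCallsAtNight per
# calling number when more than two LongCallAtNight events fall inside
# the temporal context window.
def count_epa(events, window_start, window_end, min_count=3):
    counts = defaultdict(int)
    for e in events:
        if window_start <= e["timestamp"] < window_end:
            counts[e["calling_number"]] += 1
    return [{"calling_number": n, "CallsCount": c}
            for n, c in counts.items() if c >= min_count]
```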

The fraud rules implemented as CEP were tested on a dataset comprising 80 GB of masked CDRs over a time span of 30 days corresponding to July 2016. We counted 18,426,138 events per 24 hours on a random day, meaning an injection rate of 213 events per second. Extrapolating the number of records per day, we arrive at approximately 553 million records per month.
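These figures follow from simple arithmetic:

```python
# Reproducing the stated rates from the measured daily event count.
events_per_day = 18_426_138
events_per_sec = events_per_day / (24 * 3600)   # about 213 events/second
events_per_month = events_per_day * 30          # about 553 million records
```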

In order to assess the accuracy of our event-driven application, we performed two types of analysis: calculation of recall and precision, and of latency and throughput.

We would like to note that the event processing speed currently in place is tuned to actual business needs. The centralized system in HT could be tuned to higher performance if requested. Therefore, the results presented below are compared to the actual production status of the platform and do not represent the maximal performance of the fraud management system used in HT.

In order to assess the accuracy of our event-driven application in terms of recall and precision, we compared PROTON's results to the results given by HT's fraud detection system MEGS on the same dataset (July 2016) for a number of fraud alarm situations. That is, we compared the calling numbers in the derived events from PROTON to the ones detected by MEGS, for each of the event rules. MEGS-flagged calling numbers were provided to us by partners HT and PI in a CSV file that contained the flagged calling numbers along with the detection time by MEGS. We found a recall of 100% (we identified all fraudulent numbers marked by MEGS), while the precision is 91% (assuming MEGS is the ground truth, we have 10 true positives and 1 false positive).
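The recall and precision figures follow the usual set-based definitions; the flagged-number sets below are toy stand-ins for the real CSV data.

```python
# Set-based recall/precision against MEGS as ground truth.
def recall_precision(proton_flagged, megs_flagged):
    tp = len(proton_flagged & megs_flagged)   # fraudsters found by both
    fp = len(proton_flagged - megs_flagged)   # flagged only by PROTON
    fn = len(megs_flagged - proton_flagged)   # missed by PROTON
    return tp / (tp + fn), tp / (tp + fp)
```

With the reported 10 true positives, 1 false positive and no misses, this yields a recall of 1.0 and a precision of 10/11, i.e. about 91%.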

One of the main characteristics of complex event processing systems is their ability to derive situations in real time. This is extremely important in use cases such as fraud detection, since the sooner an operator blocks a fraudulent mobile phone number, the higher the savings can be. To check whether we were able to detect fraudulent calls before MEGS, we compared the timestamps of PROTON derived events to their counterparts in the original data.

Latency is defined as the elapsed time between the detection time of the last input event required for a pattern matching and the corresponding detection time of the output event. For example, assuming that the (derived) event D is defined as a sequence of events (E1, E2, E3), then the latency is measured as the time period from when an instance of E3 arrives into the system until the emission of the corresponding instance of D.

In order to correlate the derived event with the latest input event that triggered the derivation, we leverage the feature of PROTON that allows attaching to a derived event the matching set of contributing input events. Thus, given an instance of the derived event, one can easily obtain the list of the events that contributed to the event pattern, along with their timestamps, so it is straightforward to compute the latency as defined above.
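With the matching set attached, the latency computation reduces to a one-liner; the dict shape of the events below is an assumed simplification of PROTON's derived-event representation.

```python
# Latency from the matching set: derived-event detection time minus the
# detection time of the latest contributing input event.
def latency_ms(derived_event):
    last_input = max(e["DetectionTime"] for e in derived_event["MatchingSet"])
    return derived_event["DetectionTime"] - last_input
```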

Latency tests were performed on a cluster with:

• CPU: [email protected], 16 threads (8 cores)

• RAM: 12 GB ECC


• Disks: 2 x 1 TB (RAID 1)

• Network Interface Controller: 4 x 1 Gbps

• OS:Debian8

Table 2 below states the different configurations tested. As can be seen in the table, configurations that involve multiple Kafka partitions and brokers were not tested in this version, because initial test results demonstrated that the messaging layer in its minimal configuration (single broker, single executor for the kafka-storm spout, 1-2 executors for the kafka-bolt) was operating significantly below its capacity, so a further increase in messaging power would not improve the performance of the entire system. As can be seen from the table, adding more workers improves latency.


The topology of the Mesos cluster is shown in Figure 12 below:

Figure 12: Mesos cluster topology


Config | Workers | CEP parallelization factor | End-to-end latency (ms) at 50 events/sec | End-to-end latency (ms) at 500 events/sec
1 | 1 | 1 | 31 | 189
2 | 1 | 2 | 86 | 253
3 | 2 | 4 | 25 | 111
4 | 4 | 8 | 14 | 44.7
5 | 4 | 16 | 16 | 87
6 | 8 | 16 | 11 | 16

Table 2: Performance Results Summary (90th percentile values)

Since the injection rate required by our use case is around 200 events/second, we can see that with the specified configuration we can reach a latency of no more than 16 ms (90th percentile value, probably much less for the average latency). If the rate of events increases, or faster processing is required, we can add more STORM workers to linearly reduce the processing latency.


Throughput is defined as the maximum rate at which events can be processed. The complete HT mobile network generates approximately 500 million call/SMS events per month, which implies that the required average throughput for the complete HT network is less than 200 events per second. Nevertheless, the required real-time throughput might be higher, since in a "peak" minute the HT network may generate up to 340 events per second, and on some occasions even more. The highest recorded rate was about 1750 events per second.

Experiments on the current FERARI server show that we were able to run a stable system with an injection rate of 500 events per second, which is more than enough to handle the current load. Setting up the system in a distributed manner in HT's infrastructure will be considered once it goes into production use. Cell towers do not allow additional data processing and cannot be used to process call events, but HT already has a set of 80 probe servers in 6 locations across Croatia that can be used to run the FERARI sites in a distributed manner. Those servers are used for network monitoring and generate aggregated information per call, including A and B numbers, timestamps and additional call attributes needed for fraud detection. Using these servers, we can scale out our architecture and improve throughput far beyond what would ever be needed.

The FERARI architecture has been successfully adapted to work with the data used in fraud detection in HT, namely CDRs and the alarms currently in use by HT to detect fraud; also, the FERARI architecture


has been adapted to process data related to network equipment in use by HT. Additionally, a GUI has been created to enable users to monitor data processing and to drill down into the data. As such, the final FERARI architecture should enable easy integration into the fraud detection process in HT.

Fraud prevention and detection is an important issue for telecommunications operators. Experience shows that in cases where no effective anti-fraud controls exist, within a certain time every operator is struck by a major fraud problem. Due to a growing number of subscribers and an ever increasing number of distribution channels, new services and more complex tariff structures, an effective fraud management system is necessary to prevent or minimize fraud losses.

Equipment failure impacts network performance and service levels towards HT's customers. Therefore, it is important for HT to detect potential equipment failure as soon as possible in order to avoid unnecessary downtime and the consequent impact on customer experience, equipment maintenance costs and revenues.

The FERARI architecture works out-of-the-box with the CDR format used in HT, the fields monitored and used in fraud detection, and the network parameters used for monitoring the network. After successful testing of the FERARI architecture in HT, it has the potential to become the prime tool used for timely and correct fraud detection and for monitoring the network for anomalies.

The FERARI Distributed Online Learning Framework

To integrate machine learning in the FERARI architecture, a distributed online learning framework has been developed. The goal of our approach is to provide distributed, in-stream online learning algorithms that are able to synchronize their models within the system in a communication-efficient manner. For this purpose, we have developed specific geometric monitoring operators that trigger a synchronization only when it is beneficial for the overall learning performance. This way, the communication network usage is reduced significantly. That is, the framework can be employed to solve distributed online prediction problems on multiple connected high-frequency data streams where one is interested in minimizing predictive error and communication at the same time.

The approach, described in (Kamp et al., 2014a,b, 2016)1, assumes that each learner receives data from a high-frequency data stream. Furthermore, at this stage we assume that the data at each learner is generated by the same time-variant distribution. That is, while the target concept may vary over time, at any given point in time it is the same for all learners. This allows using averaging of model parameters as an aggregation operator.

The framework is general in the sense that it supports a wide range of online learning algorithms as well as a range of synchronization operators. We denote any combination of a valid online learning algorithm A and synchronization operator σ a distributed online learning protocol. The state-of-the-art synchronization operator is called static averaging: every b ∈ ℕ update steps it centralizes all local models at a coordinator, averages them and sets the local models to this average. We improved this operator by allowing a data-dependent, adaptive synchronization that communicates and averages the models only if their divergence, i.e., the variance of the models, is larger than a predefined parameter ∆.
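In miniature, the two operators differ only in their synchronization condition; models are plain weight vectors here and the function names are illustrative, not the framework's actual classes.

```python
# Static vs. dynamic averaging: static averaging syncs every b update
# steps; dynamic averaging syncs only when the model variance exceeds
# the threshold delta.
def average(models):
    return [sum(col) / len(col) for col in zip(*models)]

def variance(models):
    # Divergence: mean squared distance of the models from their average.
    avg = average(models)
    return sum(sum((w - a) ** 2 for w, a in zip(m, avg)) for m in models) / len(models)

def should_sync_static(step, b):
    return step % b == 0            # every b update steps

def should_sync_dynamic(models, delta):
    return variance(models) > delta  # only on sufficient divergence
```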

1 Michael Kamp, Mario Boley, Daniel Keren, Assaf Schuster, and Izchak Sharfman. Communication-efficient distributed online prediction by dynamic model synchronization. In Machine Learning and Knowledge Discovery in Databases, pages 623–639. Springer, 2014a.
Michael Kamp, Mario Boley, Michael Mock, Daniel Keren, Assaf Schuster, and Izchak Sharfman. Adaptive communication bounds for distributed online learning. In Proceedings of the Optimization for Machine Learning Workshop (OPT 2014) at NIPS. NIPS, 2014b.
Michael Kamp, Sebastian Bothe, Mario Boley, and Michael Mock. Communication-efficient distributed online learning with kernels. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 805–819. Springer, 2016.


We implemented both synchronization protocols, as well as various online learning algorithms. In particular, algorithms for classification, regression, outlier detection and kernel density estimation are available using stochastic gradient descent and passive-aggressive update rules. The framework allows employing linear models as well as non-linear models using kernel functions. For the latter, various techniques to reduce or limit the model size have been implemented.

In the following, we describe how the distributed online learning framework integrates with the architecture and how it maps to the FERARI interfaces. We then provide implementation details and guidelines on how to extend the current framework. The architecture description provided here is self-contained, i.e., it recapitulates the overall architecture as described in previous deliverables D2.1 and D2.2. The query optimizer requires accurate and timely estimations of the event frequencies at each site. To achieve this in a communication-efficient manner, we employ kernel density estimation (KDE). We describe our approach to selectivity estimation using KDE in deliverable D5.3.

For that, we include an additional module for univariate and multivariate online distributed kernel density estimation, based on the paper "Multivariate Online Kernel Density Estimation with Gaussian Kernels". The scheme relies on Gaussian mixture models, which are known for their capability of estimating distributions even when the underlying probability function is of a different type of distribution. The goal of kernel density estimation is to provide online distributed estimation of probability density functions while maintaining a low complexity of the KDE using an adaptive compression scheme. The online KDE algorithm provided therein maintains only a compressed mixture model of the samples, using sophisticated techniques that minimize the approximation error. The algorithm does not provide support for distributed KDE, and in order to fit it into the framework, we adapted the implementation such that it supports model aggregation of multiple distributed models.
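The aggregation step we added can be sketched for two sites as pooling the Gaussian components and renormalising their weights. This shows only the aggregation idea under an equal-site-weighting assumption, not the compression scheme of the underlying paper.

```python
# Merge two Gaussian mixtures, each a list of (weight, mean, variance)
# components whose weights sum to one. Equal site weighting is assumed.
def merge_mixtures(m1, m2):
    pooled = [(w / 2.0, mu, var) for (w, mu, var) in m1 + m2]
    total = sum(w for w, _, _ in pooled)   # renormalise against rounding
    return [(w / total, mu, var) for (w, mu, var) in pooled]
```

In practice the merged mixture would then be re-compressed to keep the component count bounded.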

In this section, we describe the different modules that we implemented to provide the aforementioned functionality in a modular way. To achieve modularity, the framework has been divided into several sub-modules and components, allowing seamless extension and the addition of new algorithms. Additionally, it offers complete flexibility for the user to define her own learning task using different learning types and functions and to configure the learning parameters.

The framework is fully integrated with the FERARI architecture, allowing to set up learning tasks where models are trained distributedly at each site with features that are generated (distributedly) by the CEP engine. The predictions of the learners can be sent to the output bolt in order to be used in a user-defined application, or the predictions can be fed back into the system as events.

Moreover, we derived a stand-alone version of the learning framework that runs in a single STORM topology. This stand-alone version allows quickly setting up a learning system in a single cluster or cloud. The stand-alone version is also available as open source (https://bitbucket.org/Michael_Kamp/distributedonlinelearningframework). In case one is interested only in setting up a real-time service based on online learning, using the stand-alone version is advantageous: it can be set up quickly and it is built for a single distributed system like a cluster. This also makes it suitable for research, when the focus lies on distributed learning.

Parameter tuning is an essential process in machine learning for accomplishing the learning task successfully. To that end, we implemented offline parameter tuning using a small subset of the target data, for instance to estimate the learning rate for stochastic gradient descent or the bandwidth for the radial


function in kernel learning.

We also describe various valid configurations of a learning topology for different learning modes, for instance, regression/classification learning using a linear model with both dynamic and static synchronization schemes, or regression learning of linear and non-linear functions using kernels with different kernel functions such as the linear or Gaussian kernel.

The most basic and crucial component in the framework is the model interface provided by ILearningModel. For instance, we provide a linear model for regression/classification, a kernel model for regression/classification and a kernel density estimation model for probability density functions. In the case of regression/classification learning, the model maintains an array of weights w0, ..., wn, where n is the number of attributes, that serves as the model representation. In the case of kernel learning, the model is represented by a set of support vectors and their weights. The model representation is used to provide a prediction service for an incoming instance with an unknown label. A model object is also responsible for updating the model after the true label is revealed.

To determine how the update is performed, a learning model needs an update rule. The base class for the update rule is given by UpdateRuleBase. We have included two update rules in the framework: stochastic gradient descent for the linear model (StochasticGradientDescent) and stochastic gradient descent for kernels (KernelStochasticGradientDescent). An update rule also needs a loss function, whose gradient it follows in order to minimize the value of the loss function. We provide three loss functions in the framework: square loss (SquaredLoss), hinge loss (HingeLoss), and epsilon-insensitive loss (EpsilonInsensitiveLoss). Note that the hinge loss function is suitable only for classification learning.

In kernel learning, seeing that the number of support vectors grows with the number of samples, it is imperative to reduce the complexity of models so that the prediction time does not grow too large. To achieve that, we include a truncation operator that removes some of the support vectors while still bounding the error caused by the truncation. The truncation interface is represented by KernelTruncator, and its method truncateModel() is invoked periodically from the model class. We provide one such implementation in the class EpsilonKernelTruncator, which removes kernels whose weight is smaller than a given threshold ε. We note that the longer support vectors exist in the model, the smaller their weights grow, given that upon the arrival of a new instance all existing weights are scaled by a factor smaller than one. In the case of kernel density estimation, the model provides an update function based on an observed unlabeled data sample.
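The split between model, update rule, loss and truncation can be sketched as follows: a minimal SGD step on the hinge loss and an ε-truncation of support-vector weights. The function names are ours, not the framework's actual Java classes.

```python
# Illustrative counterparts of the model/update-rule split: a linear
# model, one SGD step on the hinge loss, and epsilon truncation.
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_hinge_update(w, x, y, lr=0.1):
    """One SGD step on the hinge loss max(0, 1 - y * <w, x>), y in {-1, +1}."""
    if y * predict(w, x) < 1:                 # margin violated: non-zero gradient
        return [wi + lr * y * xi for wi, xi in zip(w, x)]
    return list(w)                            # zero gradient: model unchanged

def epsilon_truncate(support_vectors, eps=0.01):
    """Drop (weight, support_vector) pairs whose weight decayed below eps."""
    return [(alpha, sv) for alpha, sv in support_vectors if abs(alpha) >= eps]
```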

So far, we have only discussed the design of the learning framework from a single learner's point of view. Now we discuss how to connect all these parts together to realize a static/dynamic synchronization scheme in a distributed environment. A local learner in a distributed environment is represented by a communication-efficient learner class named ComEffLearner. This class is used in the communication layer between a local learner and other remote entities. Incoming data is processed by this class for learning. For this purpose, it maintains a model object and invokes its methods for the prediction and model update. Violations, which in our context necessitate cross-learner synchronization, are checked by a synchronization operator represented by the class ILocalSyncOp. An implementation of this class needs to include the rules of what constitutes a violation in the learning algorithm. We provide two implementations of this interface: the first is StaticLocalSyncOp, whose violation rule simply states that the number of samples since the last synchronization phase has reached the batch size, a predefined parameter given by the user. The second is DynamicLocalSyncOp, whose violation rule is more complex but more communication-efficient, and is based on the paper (Kamp et al., 2014a). The violation rule in this case is the deviation of a local model from a global reference model by a given threshold parameter value. The deviation is


calculated based on the Euclidean metric.

In dynamic synchronization, when a violation is reported to the coordinator, the coordinator polls the learners for their models until the global convergence condition is satisfied. We provide two resolution strategies that the coordinator relies on to decide which and how many of the nodes are polled. The first is a full synchronization strategy, in which all the nodes are polled for their models by the coordinator concurrently. This strategy is implemented in the class FullSyncResolution. The second strategy is an incremental polling strategy, in which at first only one learner node is selected randomly and polled for its model. If this learner's model does not satisfy the global condition, defined by the divergence of the global model from the averaged model of the polled set (called the balancing set), then the coordinator increases the number of polled nodes exponentially and checks the global condition after each such polling. The resolution strategy is specified in DynamicCoordSyncOp by IResolutionProtocol. Figure 14 shows (part of) the integration of the distributed online learning framework in the FERARI architecture.
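The incremental strategy can be sketched as follows, under the simplifying assumptions that models are weight vectors and the divergence is the Euclidean distance of the coordinator's reference model from the balancing-set average; the function name is illustrative.

```python
import random

# Sketch of the incremental resolution strategy: poll one random learner,
# then double the balancing set until the divergence drops below delta,
# or all learners are polled (full synchronization).
def incremental_resolution(reference_model, learner_models, delta, rng=random):
    order = list(range(len(learner_models)))
    rng.shuffle(order)
    polled, k = [], 1
    while order:
        take, order = order[:k], order[k:]
        polled.extend(learner_models[i] for i in take)
        avg = [sum(col) / len(col) for col in zip(*polled)]
        div = sum((r - a) ** 2 for r, a in zip(reference_model, avg)) ** 0.5
        if div <= delta:
            return avg, len(polled)   # resolved with a partial poll
        k *= 2                        # exponential growth of the balancing set
    return avg, len(polled)           # all learners polled: full sync
```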

Figure 6.3: Extension of the GateKeeper: routing mechanism and LocalSyncOps. The GateKeeper receives the latest local statistics vector for the Learning Framework from the TimeMachine. In case the local condition is violated, a dynamic model synchronization is triggered.

Exploitable FERARI Assets and Potential Impact

Overview and Relation among Assets

The FERARI project starts from the need of telecommunication providers to combat fraud, which is becoming an increasing problem in the industry. Also, because of the large number of physical and virtual machines in today's cloud services, providers need to face the task of monitoring the hardware of these services. In the case of mobile phone fraud detection, the goal is to take current manual and time-consuming approaches to a whole new level using automation and Big Data analytics. Many fraud detection systems use the empirical knowledge of fraud experts or dedicated software to raise alarms, which are checked by fraud investigators on a case-by-case basis. At night, when no fraud investigators are present, the software may automatically block certain calls to prevent damage. During the day, the fraud investigators take action after they have investigated a case. It is their duty to decide whether a suspicious behaviour is fraudulent or legal. This depends on the current call, the call history, the customer history and the subscription plan of the customer. The focus within FERARI lies on the identification of suspicious calls and users, and on the design of distributed, communication-efficient systems for this task.

The project provides support for Complex Event Processing technology for business users in Big Data architectures. This way, stream processing is moved much closer to the business world by extending simple stream processing of numerical data to the much more powerful realm of Complex Event Processing (CEP). The project also provides support for integrating machine learning tasks in the architecture. Current Big Data architectures are mostly data-flow driven and lack some of the functionality required for supporting efficient learning. Such learning algorithms are supported using the added control-flow capabilities. In the past decade, there has been an increasing demand to process continuously flowing data from external sources at unpredictable rates, to obtain timely responses to complex queries and to derive conclusions from them. These results may be of interest in the telecommunication and cloud domains.
Our FERARI software platform is released as an open source project. The platform contains the FERARI interface core and the Proton on Storm CEP integration (distributed together with the PROTON CEP engine) in the master branch on Bitbucket, where we host the source code of the software modules we develop, as well as instructions on how to build and use them. To give a potential new user a good idea of the capabilities provided by the FERARI platform, we created an example application. This application demonstrates a basic mobile fraud detection scenario, modelled after our mobile fraud use case. The example uses the same underlying technical platform as the prototype.

During the project duration, HT has reviewed the project results and further analysed them with respect to their potential for business exploitation. The FERARI architecture has been successfully adapted to work with the data used for fraud detection in HT, namely the CDRs and alarms currently in use by HT to detect fraud, and it has also been adapted to process data related to network equipment in use by HT. Additionally, a GUI has been created to supply users with analytic results. As such, the final FERARI architecture should enable easy integration into the fraud detection process in HT and/or system health monitoring, of which the latter is currently non-existent in HT, and should yield real business benefits in the form of savings due to preventive maintenance and increased customer satisfaction.

Asset: In-situ Monitoring Technologies

Modern information systems are inherently distributed. In many cases, such as telecommunication systems, content management systems, and financial systems, data is processed in physically distributed locations. Even when data is processed at a central location, modern data centres comprise thousands (or even tens of thousands) of computation and storage units, which in practice process data in a distributed manner. Consequently, monitoring the data processed by these systems to detect global events of interest in real time is becoming increasingly challenging. This challenge is clearly evident in the FERARI use cases. The first use case deals with telecom fraud. Fraudsters exploit the distributed nature of telecom systems to devise ever more sophisticated methods of fraud. The second use case is system health monitoring, which consists of the real-time detection of abnormal performance of system components so that they can be serviced before they fail, thus enhancing the availability and robustness of the system while reducing costs. The central goal of the FERARI project is to perform most of the processing of distributed data in a local manner, and thus detect complex patterns of interest (e.g. fraudulent activity or abnormal performance) without collecting data at a centralized location. This technology is referred to as in-situ processing. The FERARI project has introduced several novel innovations in the application of in-situ methods to real-world applications. The innovations introduced by FERARI to in-situ processing are within the geometric monitoring framework2, which enables defining complex events by applying any desired function to data from distributed sources. Geometric monitoring solutions enable breaking such monitoring tasks into individual constraints on the data generated at the local sites. The idea is that as long as none of the local constraints is violated, it is guaranteed that the global event of interest has not occurred. Distributing the detection among the local sites reduces the time delays and bottlenecks associated with collecting data at a central location, and thus enables detecting such events in real time. A major drawback of geometric monitoring, however, is that applying the types of constraints proposed thus far may be very demanding in terms of computational resources at the local sites. The first innovation introduced by FERARI is a novel type of local constraint for geometric monitoring, known as convex bounds.
While existing methods for determining local constraints rely on solutions to complex optimization problems, convex bounds employ functional-analysis methods to select a convex function that bounds the function used to define the event of interest. While specifying such a bound for a given function may require some mathematical expertise, once it has been specified, applying the bound to data at run time is very efficient. In addition, while this was not the original motivation, empirical evidence has shown that these constraints are also very efficient in the sense that they produce a very small number of false alarms (i.e. instances where a constraint was violated but the global event had not occurred) in comparison to existing methods. A second innovation introduced by FERARI is the adaptation of geometric monitoring methods to dynamic large-scale industrial settings. So far, geometric monitoring has been applied in laboratory settings as part of academic research. Using geometric monitoring in an industrial production environment requires extending the method with several non-trivial adaptations. Laboratory setups assume that the number of sites participating in the monitoring task is fixed and known in advance. Typically the algorithms are run on at most several hundred sites. Applying the method to a distributed counting task over the CDR records produced by the cell towers of a telecom provider, as described in Deliverables D3.2 and D3.3, involves running the monitoring task on over 18,000 cell towers. Additionally, the set of active subscribers is not known in advance, and every subscriber interacts only with a small subset of these cell towers. The efficient application of geometric monitoring in this case required extending the monitoring framework to simultaneously run concurrent monitoring tasks for an unknown and dynamic set of subscribers. Furthermore, the local constraints were modified to support a different and dynamic set of sites for each subscriber.

2 Sharfman, Izchak, Assaf Schuster, and Daniel Keren. "A geometric approach to monitoring threshold functions over distributed data streams." ACM Transactions on Database Systems (TODS) 32.4 (2007): 23.


Exploitation Potential

The in-situ processing methods developed as part of FERARI have been successfully applied to the FERARI use cases. As described in Deliverables D3.2 and D3.3, in-situ processing was applied to perform a distributed counting task for identifying highly active subscribers in real time. Since the cell serving a subscriber changes constantly (especially when the subscriber is mobile), this problem is distributed by nature. Currently, telecom data is scanned for fraud only after usage data has been collected from the cells serving subscribers to a centralized billing centre. As a result, the current minimal time required to detect fraudulent activity may exceed an hour. By this time, the telecom operator may already have incurred heavy losses due to fraud. Simulations on real-world data collected from the HT cellular network demonstrate that the in-situ processing methods identify highly active subscribers in real time while reducing the usage of network resources by close to 45%. In-situ processing methods, however, are not limited to the FERARI use cases. Since geometric monitoring imposes very few restrictions on the function used to define events, it has been employed for a wide variety of tasks. In addition to the distributed counter example described above, geometric monitoring has been employed for detecting when the variance among the readings of a set of distributed sensors has exceeded a threshold3, for monitoring distributed streams of emails to determine when a certain word has become indicative of spam4, and for monitoring the join size of a set of distributed vectors5,6. In-situ processing methods also play an important role in the online distributed learning algorithms developed in FERARI (which are a separate exploitable asset).
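The communication savings of such a distributed counting task come from giving each site a local budget so that most increments cost no messages at all. A minimal sketch (hypothetical class names and a naive equal slack split; the actual FERARI algorithms additionally rebalance slack between sites):

```python
class LocalCounter:
    """Per-site slack counter: reports to the coordinator only when its
    local budget is exhausted, instead of streaming every increment."""
    def __init__(self, budget):
        self.budget = budget      # increments allowed before syncing
        self.count = 0

    def increment(self):
        self.count += 1
        if self.count >= self.budget:
            spent, self.count = self.count, 0
            return spent          # message to the coordinator
        return None               # no communication needed

class Coordinator:
    def __init__(self, sites, global_threshold):
        self.total = 0
        self.threshold = global_threshold
        # naive equal split of slack across sites, with headroom factor 2
        self.budget = max(1, global_threshold // (2 * sites))

    def on_report(self, spent):
        self.total += spent
        return self.total >= self.threshold   # has the global event fired?
```

With a budget of b increments per message, the message count drops by roughly a factor of b compared to forwarding every event, which is the source of the bandwidth reductions reported above.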

Asset: ProtonOnStorm

According to Gartner, two forms of stream processing software have emerged in the past 15 years. The first were CEP systems: general-purpose development and runtime tools that are used by developers to build custom event-processing applications without having to re-implement the core algorithms for handling event streams, as they provide the necessary building blocks for event-driven applications. Modern commercial CEP platform products even include adapters to integrate with event sources, development and testing tools, dashboard and alerting tools, and administration tools. More recently the second form — distributed stream computing platforms (DSCPs) such as Amazon Web Services Kinesis7 and open source offerings including Apache Samza8, Spark9 and Storm10 — was developed. DSCPs are general-purpose platforms without full native CEP analytic functions and associated accessories, but they are highly scalable and extensible and usually offer an open programming model, so developers can add the logic to address many kinds of stream processing applications, including some CEP solutions. Therefore, they are not considered "real" complex event processing platforms. In particular, the Apache open source projects (Storm, Spark, and recently Samza) have gained a fair amount of attention and interest, and these may well mature into commercial offerings in the future and/or get embedded in existing commercial product sets. DSCPs are designed to cope with Big Data requirements, making them an essential component of any organization's infrastructure. Today, there are already some implementations that take advantage of the pattern recognition capability of CEP systems along with the scalability that DSCPs offer, resulting in a holistic architecture. ProtonOnStorm is one example. Project partner IBM has implemented its open source complex event processing research tool IBM PROactive Technology ONline (PROTON) on top of Storm in the scope of the FP7 EU FERARI project, thus making it a distributed and scalable CEP engine. ProtonOnStorm has been released as open source under the Apache 2.0 license. The source code along with manuals can be accessed online11. ProtonOnStorm is currently applied in two other EU projects for different purposes: in SPEEDD (FP7) it is extended to cope with certain aspects of uncertainty, while in Psymbiosys (H2020) it is applied to different scenarios of manufacturing intelligence that use IoT devices. Both projects deal with Big Data. The ProtonOnStorm programming model serves as the basis for the optimization plans produced in the scope of the project. The derived events output by the system in the form of fraud alerts feed the dashboard of the project. ProtonOnStorm is the underlying CEP engine in the FERARI demo that was accepted at SIGMOD 2016 and was also shown as an invited demo at DEBS 2016. SIGMOD constitutes the top-rated conference in the database research community, while DEBS constitutes the mainstream publishing venue for researchers in the CEP community.

3 Sharfman, Izchak, Assaf Schuster, and Daniel Keren. "Aggregate threshold queries in sensor networks." 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2007.
4 Keren, D., Sharfman, I., Schuster, A., & Livne, A. (2012). Shape sensitive geometric monitoring. IEEE Transactions on Knowledge and Data Engineering, 24(8), 1520-1535.
5 Garofalakis, Minos, Daniel Keren, and Vasilis Samoladas. "Sketch-based geometric monitoring of distributed stream queries." Proceedings of the VLDB Endowment 6.10 (2013): 937-948.
6 Lazerson, A., Sharfman, I., Keren, D., Schuster, A., Garofalakis, M., & Samoladas, V. (2015). Monitoring distributed streams using convex decompositions. Proceedings of the VLDB Endowment, 8(5), 545-556.
7 http://aws.amazon.com/kinesis/
8 http://samza.incubator.apache.org/
9 http://spark.apache.org/streaming/
10 https://storm.apache.org/

Exploitation Potential

ProtonOnStorm targets the Big Data market, which will top $84B in 2026 according to a recent Forbes analysis12, and therefore its potential is very large. Furthermore, it has been released as open source, which according to a recent Forrester report "is increasingly ubiquitous as application development professionals look to trim enterprise software costs while simultaneously creating new, modern applications". Moreover, one of the takeaways from the same report is that open source projects drive innovation in key technology areas including mobile, cloud, big data, and the Internet of Things. One of its main recommendations is to use open source as a key part of a software innovation strategy, especially in concert with scale-out hardware architecture. Being a generic asset, ProtonOnStorm can be applied in any industry and any sector that requires Big Data real-time streaming analytics.

Asset: The Event Model (TEM)

The Event Model (TEM) provides a new way to model, develop, validate, maintain, and implement event-driven applications. In TEM, the event derivation logic is expressed in a high-level declarative language as a collection of normalized tables (in an Excel-spreadsheet-like fashion). These tables can be validated and then used for code generation. TEM is based on a set of well-defined principles and building blocks and does not require substantial programming skills; it therefore targets non-technical people.
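The table-driven idea can be illustrated with a heavily simplified stand-in (the rule schema and all names below are our own invention, not the actual TEM specification): each row of a normalized table states when a derived event follows from a raw event, and a generic interpreter, or generated code, evaluates the rows.

```python
# Hypothetical, much-simplified stand-in for a TEM-style normalized table:
# each row states when a derived event is emitted from a raw event.
RULES = [
    # derived event,   source event,  attribute,  operator, value
    ("SuspiciousCall", "CallStarted", "duration", ">",      3600),
    ("RoamingBurst",   "CallStarted", "country",  "!=",     "home"),
]

OPS = {">": lambda a, b: a > b, "!=": lambda a, b: a != b}

def derive(event_type, attributes):
    """Evaluate each table row against an incoming event; emit derived events.
    Assumes the attributes referenced by matching rules are present."""
    derived = []
    for name, source, attr, op, value in RULES:
        if event_type == source and OPS[op](attributes.get(attr), value):
            derived.append(name)
    return derived
```

Because the logic lives in data rather than code, the same table can be validated for consistency and then translated to different target engines, which is the essence of the TEM approach.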

11 https://github.com/ishkin/Proton/ 12 http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/


This idea has already been proven successfully in the domain of business rules by The Decision Model (TDM). TDM groups rules into natural logical groups to create a structure that makes the model relatively simple to understand, communicate, and manage. The Event Model follows the Model Driven Engineering approach and can be classified as a CIM (Computation Independent Model), providing independence from the physical data representation and omitting details which are obvious to the designer. This model can be translated directly into an execution model (PSM – Platform Specific Model in Model Driven Architecture terminology) through an intermediate generic representation (PIM – Platform Independent Model). In the course of the first two years of FERARI, we developed the CIM model for the functional and non-functional requirements of event-driven applications, extending the basic model we had and completing all the functional aspects as well as the non-functional requirements. For the third year, the plan was to articulate the translation of TEM (the CIM) into an event processing network (PIM). The event processing network can then be converted into a JSON file and eventually into a running application (PSM). The work so far has been theoretical and has been reported through the FERARI WP4 deliverables13. The model parts for the functional requirements are described in the paper "A Model Driven Approach for Event Processing Applications", accepted at DEBS 2016.

Exploitation Potential

As aforementioned, the idea of defining functional requirements targeted at business people has already been proven successfully in the domain of business rules by The Decision Model. TDM first appeared in 2009 and has since revolutionized the business rules world. TDM has been implemented in the Decision Suite software product by SAPIENS Decision14. The software can be delivered on-premise or via SaaS. It has been deployed in the business systems of some key Global 1000 companies, in areas such as financial services, insurance, manufacturing and retail15. Our vision is to reach the same level of success in the world of complex event processing by applying a similar approach. Complex event processing has evolved rapidly in recent years and is expected to grow significantly in the coming ones. The CEP market is forecast to reach 5.6Bn Euros in 201916. Still, one of the main barriers to the adoption of CEP tools is the relative complexity of existing tools, making them impractical and inaccessible for business users. TEM directly addresses this gap. As TEM is agnostic to the run-time engine, the model can be translated to any of the existing CEP products. IBM has filed two patents regarding TEM, which emphasizes both its innovation and its business potential.

Asset: CEP Optimizer

The asset, namely the FERARI CEP optimizer, was developed within WP5. Its design principles and current implementation details were reported in Deliverables D5.2 and D5.3. In FERARI, we designed solutions capable of scaling at a planetary level, thus retaining the potential of being applicable to arbitrarily large businesses and data volumes. Since in vast-scale applications (event) data of interest are produced or collected at remote sources, they need to be combined to respond to global application inquiries. In this context, collecting voluminous data at a central computing point (termed site/cloud) first and then processing it is infeasible, because such a solution would overload the available network resources of the underlying infrastructure and would

13 Available at: http://www.ferari-project.eu/key-deliverables/ 14 http://www.sapiensdecision.com/ 15 http://www.sapiensdecision.com/customers/case-studies/#sthash.Bukwz3xc.dpuf 16 Companies and Markets (2014). Complex Event Processing (CEP) Market by Algorithmic Trading: Global forecast to 2019


cause a bottleneck at that central point. Data of interest arriving at multiple, potentially geographically dispersed, sites should first be efficiently processed in-situ (where possible) and then wisely combined at an inter-cloud level to provide holistic results to the final stakeholders. Efficiency in such a setting calls for reduced communication, avoiding congested network links among sites without affecting the timely delivery of notifications and application reports. These goals are at the core of our CEP optimizer, which produces execution plans for application queries that orchestrate site interactions in a way that ensures both: a) optimal network utilization, and b) compliance with application Quality-of-Service (QoS) requirements related to the time horizon, from the occurrence of interesting situations, within which corresponding reports should be made available to end users. Our algorithms and design principles are generic enough to support a wide range of application query functionality. They can be employed on top of any CEP engine selected as the software responsible for intra-cloud data processing and query execution at the site level. Therefore, our approach can be fostered as a paradigm for any similar implementation, irrespective of the CEP engine or specific application demands. Having clarified that, in the context of FERARI the currently developed software is built to support - in terms of application query operators and respective functionality - the IBM Proactive Technology Online (Proton) CEP engine and, in particular, its streaming cloud extension, namely ProtonOnStorm. This is not too great a restriction, as Proton and ProtonOnStorm are open source platforms with an already substantial user base, as will be explained below.
Moreover, the TEM specification, which is another important FERARI asset for letting business users express inquiries in a declarative way (see details above), can be mapped to an Event Processing Network (EPN) conceptualization, which is supported by ProtonOnStorm at the technical level. Therefore, the TEM specification is transitively supported by our approach. In addition, the CEP optimizer outcomes incorporate and exploit the FERARI assets related to in-situ processing at the site level. The CEP optimizer, as a FERARI asset, has already been pushed to the scientific community by publishing corresponding results - in close collaboration with the rest of the FERARI consortium - in the top database conference and the most esteemed conference within the event processing community:

FERARI: A Prototype for Complex Event Processing over Streaming Multi-cloud Platforms, I. Flouris, V. Manikaki, N. Giatrakos, A. Deligiannakis, M. Garofalakis, M. Mock, S. Bothe, I. Skarbovsky, F. Fournier, T. Krizan, M. Stajcer, J. Yom-Tov, T. Curin, in: SIGMOD, 2016

Complex Event Processing over Streaming Multi-cloud Platforms – The FERARI Approach, I. Flouris, V. Manikaki, N. Giatrakos, A. Deligiannakis, M. Garofalakis, M. Mock, S. Bothe, I. Skarbovsky, F. Fournier, T. Krizan, M. Stajcer, J. Yom-Tov, M. Volarevic, in: DEBS, 2016
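At its core, the plan selection problem the optimizer addresses can be caricatured as constrained minimization: among candidate operator placements, pick the one with the lowest network cost whose estimated detection latency still meets the QoS bound. A deliberately tiny sketch under that assumption (not the actual optimizer, which also handles operator trees and richer cost models):

```python
def choose_plan(plans, latency_bound):
    """plans: list of (name, network_cost, estimated_latency) tuples.
    Return the minimum-network-cost plan that satisfies the QoS latency bound."""
    feasible = [p for p in plans if p[2] <= latency_bound]
    if not feasible:
        raise ValueError("no candidate plan meets the QoS requirement")
    return min(feasible, key=lambda p: p[1])   # cheapest feasible plan
```

For example, a fully centralized plan may have the lowest latency but the highest bandwidth cost, while a fully in-situ plan is cheapest but slowest; the optimizer's job is to find the sweet spot allowed by the application's QoS bound.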

Exploitation Potential

Within the FERARI project, the approach adopted by the CEP optimizer was applied to real use case scenarios from the telecommunication industry. However, these scenarios from one specific industry are only representative examples of the FERARI optimizer's potential. For starters, its design allows the FERARI optimizer to be directly applicable to generic fraud detection and network health monitoring scenarios beyond the telecommunication industry. The former covers the very popular case of transaction fraud detection in the banking industry, while the latter allows its exploitation in the network management industry. Beyond the above, because our current implementation of the CEP optimizer is tailored to ProtonOnStorm, the FERARI optimizer can be exploited in the broader domains where Proton is already well suited and where distributed monitoring over a network of data sources is of the essence. The major domains in which Proton


has already been successfully integrated include17, but are not limited to, fleet management, supply chain monitoring and logistics, multi-sensor diagnostic systems, systems management, network management, active services in wireless environments, location-based decision support systems, maintenance management, and command and control systems. Our inter-cloud optimization approaches achieve an appropriate balance between event detection latency and bandwidth consumption in the underlying networking infrastructure. From a business viewpoint, reduced latency enables timely event detection, so that event notifications reach the final stakeholders in time to activate decision-making processes – such as terminating a suspicious call or blocking a fraudulent transaction. Reduced bandwidth consumption, on the one hand, also serves this goal by not congesting network links, allowing for the timely transfer of event notification data; on the other hand, it saves tremendous equipment upgrade and maintenance costs for networking facilities which would otherwise be required. It is important to note that the benefits of our approach increase with the size of the business, as the demands for modern, high-powered infrastructure facilities also increase. In addition, since TUC is primarily a research organization, another form of exploitation of the CEP optimizer is by students at both graduate and postgraduate levels who carried out, or are currently carrying out, their research in the context of the work proposed in the project. The current status of the FERARI optimizer itself is the outcome of such research, with one postdoctoral researcher, one PhD student and one MSc student having worked on relevant topics.

Asset: Open Software Repository and Distributed Online Learning Framework

The goal of the FERARI project is to build a framework that allows for efficient and timely processing of Big Data streams. This framework includes a distributed complex event processing (CEP) engine, a query planner that dynamically optimizes CEP processing across geographically dispersed sites/clouds, and a distributed online learning framework. These components are tied together by a powerful, modular and elastic architecture. This architecture strictly separates the components of the FERARI framework from the underlying distributed computation system. The framework can be adapted to any Big Data streaming platform by exchanging its runtime adaptation layer; after a careful evaluation of existing Big Data streaming platforms, we found Apache Storm (http://storm.apache.org/) to be the best-fitting platform and implemented the runtime adaptation accordingly. The framework has been released as open source (available at https://bitbucket.org/sbothe-iais/ferari). The open source release of our framework allows anyone, from the scientific community to industry, to explore and use it. Since we provide Docker containers for the software, it can be easily installed on any machine, from a personal computer to a cluster or cloud system. To further facilitate getting involved with the framework, we provide a guide that explains its installation and usage in simple steps, as well as a running example. The software allows users to set up high-performance, communication-efficient distributed stream processing systems by plugging together the provided building blocks – including algorithms based on the in-situ processing methods described above. The open source release also includes the distributed online learning framework, which is likewise implemented as a building block within the architecture. It can be used to set up large distributed machine learning systems for providing real-time services on distributed, dynamic data streams.
The employed novel communication-efficient distributed online learning algorithms are based on scientific publications at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) and the Annual Conference on Neural Information Processing Systems (NIPS). Currently, most machine learning on Big Data is performed on distributed batches of data. In contrast, the FERARI approach enables real-time learning on streaming Big Data. Thus, the learning framework has the potential for substantial impact, since it allows scaling learning applications to the data volumes and velocities generated by machine-to-machine (M2M) interaction and the Internet of Things.

17 https://github.com/ishkin/Proton/blob/master/documentation/ProtonUserGuidev4.4.1.pdf

Exploitation Potential

The open source framework has the potential to be exploited in industry as well as in the scientific domain. Its building-block architecture, together with the fact that it is built upon the established and powerful streaming platform Storm, allows for rapid prototyping in industry, especially in the context of machine-to-machine communication in Industry 4.0 settings. Since it is easy to install and run on various systems, it allows the scientific community to quickly set up distributed stream processing and learning systems, enabling scientists to easily evaluate their ideas on real distributed systems. The distributed learning framework allows companies to set up efficient real-time services and large in-stream learning systems. The need for such frameworks is apparent and has been partly fulfilled by existing approaches like Spark MLlib, Apache SAMOA, and Google's Parameter Server. The FERARI learning framework, however, for the first time combines in-stream learning (which is not possible in Spark MLlib) with communication-efficient in-situ methods (which are not available on any other platform). The open source release of the FERARI framework ensures contributions from the Fraunhofer side as well as from external contributors, further pushing and promoting the framework.

SWOT Analysis

Strengths: No market competitor. We know our software best, and we know how to tune, adjust, and develop our methods.

Weaknesses: Big data software is rapidly changing; selected base components may become outdated, or better base systems may become available.

Opportunities: Our system is strongly modularized, so we can offer to implement the required adaptations.

Threats: The open source code may be used and extended without our involvement; however, we sell trainings and consulting services, and wide adoption of the FERARI software increases our potential customer base. The software stack installed at a customer's site may be incompatible, or the customer may not want to use the entire FERARI system; however, different parts of FERARI, for instance the distributed learning framework, can be used decoupled from the remaining system.

Market demand overview (demand, FERARI potential, competitor, target sector, point of contact):

1. Demand: Massive amounts of car sensor data need to be analysed for predictive maintenance; centralization of CAN bus car data is infeasible with current mobile networks, so in-situ analysis is required. FERARI potential: disruptive; no system capable of learning a model of the normal state under constrained communication is available on the market. Competitor: none (to the best of our knowledge). Target sector: automotive. Point of contact: management.

2. Demand: Fast detection of mobile phone fraud. FERARI potential: improvement of detection time due to deployment closer to distributed data sources. Target sector: telecommunication. Point of contact: CTO, technical units lead.

3. Demand: Faster detection of fraudulent credit card transactions. FERARI potential: expected improvement of detection time. Target sector: finance. Point of contact: CTO, technical units lead.

4. Demand: Early detection of latent machine faults. Target sector: telco.
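The combination of in-stream learning with communication-efficient in-situ methods described above can be conveyed by a minimal sketch of threshold-based model synchronization, in the spirit of the project's publications on communication-efficient distributed online learning. The class and function names are illustrative assumptions, not the FERARI API: each node learns on its local stream and communicates only when its model drifts beyond a threshold from the last synchronized state.

```python
import numpy as np

class LocalLearner:
    """Online linear learner that requests a sync only when its model
    drifts too far from the last commonly agreed reference model."""

    def __init__(self, dim, threshold, lr=0.1):
        self.w = np.zeros(dim)      # current local model
        self.ref = np.zeros(dim)    # reference model from the last sync
        self.threshold = threshold  # allowed squared drift before syncing
        self.lr = lr

    def update(self, x, y):
        """One online least-squares step; returns True if a sync is needed."""
        grad = (self.w @ x - y) * x
        self.w -= self.lr * grad
        return float(np.sum((self.w - self.ref) ** 2)) > self.threshold

def synchronize(learners):
    """Coordinator step: average all local models and broadcast the result."""
    avg = np.mean([l.w for l in learners], axis=0)
    for l in learners:
        l.w = avg.copy()
        l.ref = avg.copy()
    return avg
```

Compared with synchronizing after every stream element, a scheme of this shape keeps most updates local and triggers communication only on significant drift, which is the source of the communication savings claimed above.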


Asset: Data Masking Algorithms

The open source data masking framework has the potential to be exploited in industry as well as in the scientific domain. Within the data masking framework, three main concepts can be identified as exploitable. The first is test data management, which can be used to produce test data. Data masking lets organizations make sure employees do not mishandle corporate data, for example by making private data sets public or by moving production data to insecure test environments. In reality, masking data for testing and sharing is almost a trivial subset of the full customer requirement. The real goal is administration of the entire data security lifecycle, including locating, moving, managing, and masking data. The mature version of today's simpler use case is a set of enterprise data management capabilities that control the flow of data to and from hundreds of different databases or flat files. This capability answers many of the most basic security questions we hear customers ask, such as "Where is my sensitive data?", "Who is using it?", and "How can we effectively reduce the risks to that information?". Compliance is the second concept and the major reason users state for needing masking products. Early customers came specifically from finance, but adoption is well distributed across different segments, particularly retail, telecom, health care, energy, education, and government. The diversity of customer requirements makes it difficult to pinpoint any one regulatory concern that stands out from the rest. During discussions we hear about all the usual suspects, including PCI, NERC, GLBA, FERPA, and HIPAA, and in some cases multiple requirements at the same time.
These days we hear about masking being deployed as a more generic control, in the form of protection of Personally Identifiable Information (PII), health records, and general customer records, among other concerns; we no longer see every company focused on only one specific regulation or requirement. Masking is now perceived as addressing a general need to avoid unwanted data access, or to reduce exposure as part of an overall compliance posture. The third concept is production database protection. While replacement of sensitive data, specifically through ETL-style deployments, is by far the dominant model, it is not the only way to protect data in a database. At some firms protection of the production database is the primary goal for masking, with test data secondary. Masking can do both, which makes it attractive in these scenarios. Within a masking solution, many masking methods can be used for data obfuscation, such as data shuffling, data substitution, randomization, nullifying, and encryption.

Exploitation potential

The data masking framework can be used in many industries, particularly retail, banking, telecom, health care, education, and government. It provides a security layer which obfuscates data, removing sensitive information (customer balances, personal information, etc.). A potential benefit of exploiting test data management is administration of the entire data security lifecycle, including locating, moving, managing, and masking data. The data masking framework saves all information regarding the input and output data (file path, database connection string, specific table info, etc.), so this can easily be tracked through the masking application developed as part of this project. Input data is used by the application, through which users can define the specific parts of the data that should be masked. In this way a bank can mask client data so that its users cannot compromise clients' confidential information. In the telecom industry, masking can be used to mask each call record (calling number, called number) so that data for the test environment never reveals sensitive client information (client names, usernames, passwords); only sensitive data is masked, while business logic is kept intact (sales sums, calling averages, churn ratios, etc.).
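The masking methods named above (substitution, shuffling, nullifying) can be sketched as follows. This is an illustrative sketch only, not the FERARI masking application's API; the record fields, salt, and helper names are assumptions made for the example:

```python
import hashlib
import random

def substitute(value, salt="s3cret"):
    """Deterministic pseudonym: the same input always maps to the same token."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "MASK-" + digest[:10]

def shuffle_column(records, field, seed=42):
    """Shuffle one field across all records, breaking its link to each row."""
    values = [r[field] for r in records]
    random.Random(seed).shuffle(values)
    return [dict(r, **{field: v}) for r, v in zip(records, values)]

def nullify(record, fields):
    """Blank out sensitive fields entirely."""
    return {k: (None if k in fields else v) for k, v in record.items()}

# Toy call detail records; field names are illustrative.
calls = [
    {"calling": "+385911111111", "called": "+385922222222",
     "client": "Alice", "duration_s": 65},
    {"calling": "+385933333333", "called": "+385944444444",
     "client": "Bob", "duration_s": 120},
]

# Mask: drop client names, pseudonymize phone numbers, keep durations
# so that aggregates (sums, averages) computed on test data stay valid.
masked = [dict(nullify(c, {"client"}),
               calling=substitute(c["calling"]),
               called=substitute(c["called"])) for c in calls]
```

Deterministic substitution keeps joins between masked tables consistent, while nullifying and shuffling trade referential integrity for stronger protection; the business-logic columns are left untouched, as described above.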

Overall Potential Impact

While there are more and more big data technologies today, most of them work with web-based, human-generated data, which is usually unstructured. These technologies are typically based on batch processing and store data in distributed file systems. However, using the data generated through machine-to-machine (M2M) communication in business and industry remains a fairly uncharted area in the big data domain. Contrary to human-generated data such as that on Facebook or Google, this type of data is more structured, but its massive stream volumes require innovative, more adequate solutions for effective real-time processing. Since M2M data has a significant role in crucial business use cases, new solutions in this area are of general interest and will undoubtedly help Europe answer the challenges of tomorrow and remain competitive in this area in the future. The project presents a new big data architecture: a system able to perform complex tasks in an efficient and timely manner. It exploits the benefits of in-situ processing and eliminates unnecessary centralized data storage when detecting complex patterns of interest. Section 6.1.2 describes several innovations in the application of in-situ processing introduced by the FERARI project. Furthermore, the project provides support for sophisticated machine learning tasks and analytics workflows for business users. Most importantly, it implements a scalable distributed CEP engine, ProtonOnStorm, implemented for this purpose by IBM: an open source tool for processing complex events in real time, in a distributed manner. This technology targets both the big data and the Internet of Things markets, where there is no real competition for such technologies. Therefore, the FERARI CEP engine is a unique solution and has great potential in this rapidly growing industry. Although the prospective benefits of the project are presented through telecommunications use cases, the functionalities shown can be applied to other business areas as well.
The first use case addresses telecom fraud, a topic relevant to many industries, including the banking and finance sector. Fraudulent activities are nowadays a common and dangerous problem and can cause a company to lose significant amounts of money. Fraud is a rising type of crime worldwide and requires improved methods and tools for preventing or mitigating attacks. For this reason, progress on advanced fraud solutions that help companies protect their valuable information and assets is in everyone's interest. The project's results will influence both scientific and business aspects of big data. They are expected to advance theoretical knowledge and motivate further research in related fields, but also to offer a step forward on concrete problems in business and industry. Some business solutions developed for the purposes of the project, such as the data masking tool, can easily be used as standalone products. The need for this kind of product in companies working with big data is evident, and it will surely help them run their business more successfully. This new era of big data still lacks good solutions able to respond to all the challenges the future brings. There is a big gap between theoretical insights and concrete solutions, especially for specific business cases. The FERARI project is expected to make a significant scientific contribution, offer innovative solutions for big data business cases, and, in a nutshell, lead the way in finding new ways of efficient, real-time processing of big data.
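The kind of pattern detection the CEP engine performs can be conveyed by a minimal, self-contained sketch. This is an illustrative stand-in, not the ProtonOnStorm API, and the event fields and thresholds are assumptions: it derives a composite "suspected fraud" event when one caller places several short calls within a sliding time window, a simplified version of the fraud patterns in the telecom use case.

```python
from collections import defaultdict, deque

def detect_fraud(events, window_s=60, min_calls=3, max_duration_s=10):
    """Emit a complex event when one caller makes >= min_calls short calls
    within window_s seconds (a toy stand-in for a CEP fraud pattern)."""
    recent = defaultdict(deque)   # caller -> timestamps of recent short calls
    alerts = []
    for ev in events:             # events are assumed sorted by timestamp
        if ev["duration_s"] > max_duration_s:
            continue              # only short calls participate in the pattern
        q = recent[ev["caller"]]
        q.append(ev["ts"])
        while q and ev["ts"] - q[0] > window_s:
            q.popleft()           # slide the window forward
        if len(q) >= min_calls:
            alerts.append({"type": "SuspectedFraud",
                           "caller": ev["caller"], "ts": ev["ts"]})
            q.clear()             # reset the pattern after it fires
    return alerts
```

In the real engine such patterns are declared on a logical level and the optimizer decides where each operator runs; the sketch only shows the windowed pattern semantics, not the distribution.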


o Use and dissemination of foreground


Section A (public)

LIST OF SCIENTIFIC (PEER REVIEWED) PUBLICATIONS, STARTING WITH THE MOST IMPORTANT ONES

1. "Beating Human Analysts in Nowcasting Corporate Earnings by using Publicly Available Stock Price and Correlation Features". Michael Kamp, Mario Boley, Thomas Gärtner. SDM, SIAM, Philadelphia, USA, 2014, pp. 641-649. DOI 10.1137/1.9781611973440.74. Open access: yes.

2. "Communication-Efficient Distributed Online Prediction". Michael Kamp, Mario Boley, Michael Mock, Daniel Keren, Izchak Sharfman, Assaf Schuster. ECML PKDD, Springer, Nancy, France, 2014, pp. 623-639. DOI 10.1007/978-3-662-44848-9_40. Open access: yes.

3. "Lazy Evaluation Methods for Detecting Complex Events". Ilya Kolchinsky, Izchak Sharfman, Assaf Schuster, Daniel Keren. DEBS, ACM, New York, USA, 2015, pp. 34-45. DOI 10.1145/2675743.2771832. Open access: yes.

4. "Monitoring Distributed Streams using Convex Decompositions". Arnon Lazerson, Izchak Sharfman, Daniel Keren, Assaf Schuster, Minos Garofalakis, Vasilis Samoladas. VLDB, VLDB Endowment, 2015, pp. 545-556. DOI 10.14778/2735479.2735487. Open access: yes.

5. "Adaptive Communication Bounds for Distributed Online Learning". Michael Kamp, Mario Boley, Michael Mock, Daniel Keren, Assaf Schuster, Izchak Sharfman. OPT workshop at NIPS, Montreal, Canada, 2014. Open access: yes.

6. "Distributed Geometric Query Monitoring Using Prediction Models". Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis, Izchak Sharfman, Assaf Schuster. ACM Transactions on Database Systems (TODS), ACM, New York, USA, 2014. DOI 10.1145/2602137. Open access: yes.

7. "Issues in Complex Event Processing Systems". Ioannis Flouris, Nikos Giatrakos, Minos Garofalakis, Antonios Deligiannakis. IEEE Trustcom/BigDataSE/ISPA, IEEE, 2015. DOI 10.1109/Trustcom.2015.590. Open access: yes.

8. "In-Stream Frequent Itemset Mining With Output Proportional Memory Footprint". Daniel Trabold, Mario Boley, Michael Mock, Tamás Horváth. Proceedings of the LWA Workshops, 2015.

9. "Towards Flexible Event Processing in Distributed Data Streams". Sebastian Bothe, Vasiliki Manikaki, Antonios Deligiannakis, Michael Mock. Proceedings of the Workshops of the EDBT/ICDT Joint Conference, 2015.

10. "Latent Fault Detection With Unbalanced Workloads". Moshe Gabel, Kento Sato, Daniel Keren, Satoshi Matsuoka, Assaf Schuster. EDBT Workshop, 2015.

11. "In-situ Anonymization of Big Data". Tomislav Križan, Marko Brakus, Davorin Vukelić. MIPRO, IEEE, 2015. DOI 10.1109/MIPRO.2015.7160282. Open access: yes.

12. "Sketching Distributed Sliding-Window Data Streams". Odysseas Papapetrou, Minos Garofalakis, Antonios Deligiannakis. VLDB, 2015. DOI 10.1007/s00778-015-0380-7. Open access: yes.

13. "Approximate Geometric Query Tracking over Distributed Streams". Minos Garofalakis. IEEE Data Engineering Bulletin, 2015. Open access: no.

14. "Parallelizing Randomized Convex Optimization". Michael Kamp, Mario Boley, Thomas Gärtner. OPT workshop at NIPS, 2015. Open access: no.

15. "The Event Model" for Situation Awareness. Opher Etzion, Fabiana Fournier, Barbara von Halle. Data Engineering Bulletin, 2015. Open access: yes.

16. "FERARI: A Prototype for Complex Event Processing over Streaming Multi-Cloud Platforms". Ioannis Flouris, Vasiliki Manikaki, Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis, Michael Mock, Sebastian Bothe, Inna Skarbovsky, Fabiana Fournier, Marko Štajcer, Tomislav Križan, Jonathan Yom-Tov, Taji Ćurin. SIGMOD, 2016. Open access: no.

17. "Complex Event Processing over Streaming Multi-Cloud Platforms - The FERARI Approach". Ioannis Flouris, Vasiliki Manikaki, Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis, Michael Mock, Sebastian Bothe, Inna Skarbovsky, Fabiana Fournier, Marko Štajcer, Tomislav Križan, Jonathan Yom-Tov, Marijo Volarevic. DEBS, 2016. Open access: no.

18. "Scalable Approximate Query Tracking over Highly Distributed Data Streams". Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis. SIGMOD, ACM, 2016, pp. 1497-1512. DOI 10.1145/2882903.2915225. Open access: no.

19. "Issues in complex event processing: Status and prospects in the Big Data era". Ioannis Flouris, Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis, Michael Kamp, Michael Mock. Journal of Systems and Software, Elsevier, 2016. DOI 10.1016/j.jss.2016.06.011. Open access: no.

20. "Lightweight Monitoring of Distributed Streams". Arnon Lazerson, Daniel Keren, Assaf Schuster. KDD, ACM, New York, USA, 2016, pp. 1685-1694. DOI 10.1145/2939672.2939820. Open access: yes.

21. "Communication-Efficient Distributed Online Learning with Kernels". Michael Kamp, Sebastian Bothe, Mario Boley, Michael Mock. ECML PKDD, Springer, Riva del Garda, Italy, 2016, pp. 805-819. DOI 10.1007/978-3-319-46227-1_50. Open access: yes.

22. "Distributed Query Monitoring through Convex Analysis: Towards Composable Safe Zones". Minos Garofalakis, Vasilis Samoladas. ICDT, 2017. Open access: no.

23. "Monitoring Properties of Large, Distributed, Dynamic Graphs". Gal Yehuda, Daniel Keren, Islam Akaria. IPDPS, 2017. Open access: no.

LIST OF DISSEMINATION ACTIVITIES

1. Presentation, posters, fact sheets and templates (PI): The European Data Forum, June 29-30, 2016, Eindhoven, Netherlands.

2. Presentation (PI): 39th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2016), May 30 - June 3, 2016, Opatija, Croatia.

3. Presentation (PI): Data Science Monetization, April 13-14, 2016, Zagreb, Croatia.

4. Presentation (PI): Seize the Data, April 12-13, 2016, Zagreb, Croatia.

5. Presentation (PI): 10th ACM International Conference on Distributed and Event-Based Systems (DEBS), June 20-24, 2016, Irvine, USA.

6. Presentation (PI): ACM SIGMOD/PODS, June 26 - July 1, 2016, San Francisco, USA.

7. Workshop (PI): Data Science Summer School, September 13-15, 2016, Varazdin, Croatia.

8. Workshop (PI): Big Data Summer School, July 4-8, 2016, Zagreb, Croatia.