
SecureIoT Project Deliverable, Version: v1.00 - Final, Date 28/09/2018
Project Title: SecureIoT, Contract No. 779899, Project Coordinator: INTRASOFT International S.A.



Project Acronym: SecureIoT

Grant Agreement number: 779899 (H2020-IoT03-2017 - RIA)

Project Full Title: Predictive Security for IoT Platforms and Networks of Smart Objects

DELIVERABLE

Deliverable Number: D4.1
Deliverable Name: Security Monitoring and Knowledge Inference
Dissemination level: PU
Type of Document: R
Contractual date of delivery: M9
Deliverable Leader: Inria
Status & version: V0.21
WP / Task responsible: Inria

Keywords: CRISP-DM, data-mining, machine-learning, analytics, knowledge discovery

Abstract (few lines): This deliverable mainly introduces the CRISP-DM method applied to SecureIoT data, the identified algorithms (normalization, clustering, learning) and the MindSphere Sinalytics platform.

Deliverable Leader: Jérôme François (Inria)

Contributors: Jérôme François (Inria), Abdelkader Lahmadi (Inria), Adrien Hemmer (Inria), Septimiu Nechifor (SIE), Nikos Kefalakis (INTRA), Soline Blanc (Inria)

Reviewers: Daniel Calvo Alonso (ATOS), David Evans (IDIADA)

Approved by: George Koutalieris (INTRA)


Executive Summary

The scope of this deliverable is to introduce methods, techniques, tools and platforms that can be used in the SecureIoT project to extract knowledge from the numerous and heterogeneous sources of data that an IoT environment can provide. Unlike a classical setting with a single source and a single format, we first need a systematic approach to understand the business needs and the types of data we deal with before preparing and modelling them. This is the goal of the CRISP-DM method. For the different steps of this method, we then need appropriate techniques, especially for preparing and modelling data. In this deliverable, based on the data analysis, we selected different normalization, refinement and clustering techniques; some of them have been implemented and tested on datasets provided by the use-cases. In addition, the MindSphere Sinalytics platform includes many tools and methods from which the project can benefit for real-time analysis of SecureIoT data.


Document History

Version | Date | Contributor(s) | Description
0.1 | 23/05/2018 | Jérôme François (Inria), Abdelkader Lahmadi (Inria) | Initial ToC
0.2 | 28/06/2018 | Jérôme François (Inria), Abdelkader Lahmadi (Inria) | Inputs on data collection methodology
0.3 | 18/07/2018 | Nikos Kefalakis (Intrasoft) | CRISP-DM methodology initial inputs
0.4 | 20/07/2018 | Jérôme François (Inria), Abdelkader Lahmadi (Inria) | Candidate machine learning algorithms description (initial description)
0.5 | 20/08/2018 | Jérôme François (Inria) | Features extraction, application port metric
0.6 | 21/08/2018 | Jérôme François (Inria) | Machine learning process described and architecture mapping
0.7 | 22/08/2018 | Abdelkader Lahmadi (Inria) | TDA and process mining description
0.8 | 24/08/2018 | Adrien Hemmer (Inria) | Normalization, data clustering and data analysis
0.9 | 03/09/2018 | Adrien Hemmer (Inria), Soline Blanc (Inria) | Review of data, scaling, clustering. Integration of TDA application
0.10 | 02/09/2018 | Septimiu Nechifor (SIE) | Description of the Sinalytics platform
0.11 | 05/09/2018 | Nikos Kefalakis (Intrasoft) | CRISP-DM update
0.12 | 06/09/2018 | Adrien Hemmer (Inria), Soline Blanc (Inria), Jérôme François (Inria) | Review of data, scaling, clustering and TDA. Reorganization and reference cleaning, requirements
0.13 | 07/09/2018 | Jérôme François (Inria) | Reorganization of the content and highlighting relations between CRISP-DM and other parts
0.14 | 12/09/2018 | Nikos Kefalakis (Intrasoft) | Corrections and updates of existing sections
0.15 | 13/09/2018 | Adrien Hemmer (Inria) | Results about clustering
0.16 | 13/09/2018 | Septimiu Nechifor (SIE) | Update of the Sinalytics platform
0.17 | 14/09/2018 | Jérôme François (Inria) | Editing, abstract, executive summary
0.18 | 18/09/2018 | Jérôme François (Inria) | Review, editing
0.19 | 18/09/2018 | Jérôme François (Inria) | Introduction, conclusion
0.20 | 26/09/2018 | Jérôme François (Inria) | Integration of reviews
0.21 | 26/09/2018 | Nikos Kefalakis (INTRA) | Addressed reviewers' comments
1.00 | 28/09/2018 | Mariza Konidi (INTRA) | Final version to be submitted


Table of Contents

Executive Summary ......................................................................................................................... 2

Table of Contents ............................................................................................................................ 5

Table of Figures ............................................................................................................................... 7

List of Tables ................................................................................................................................... 7

Definitions, Acronyms and Abbreviations ...................................................................................... 8

1 Introduction ............................................................................................................................. 9

2 CRISP-DM methodology ........................................................................................................ 11

2.1 Cross-Industry Standard Process for Data Mining (CRISP-DM) .................................... 11

2.2 CRISP-DM reference model .......................................................................................... 11

2.2.1 Business understanding ................................................................................................ 13

2.2.2 Data understanding ...................................................................................................... 14

2.2.3 Data preparation .......................................................................................................... 15

2.2.4 Modelling ...................................................................................................................... 16

2.2.5 Evaluation ..................................................................................................................... 17

2.2.6 Deployment .................................................................................................................. 18

2.3 Application to SecureIoT Datasets ................................................................................ 19

2.3.1 Business Understanding ............................................................................................ 19

2.3.2 Data understanding .................................................................................................. 20

2.3.3 Data preparation .......................................................................................................... 24

2.3.4 Modelling .................................................................................................................. 24

2.3.5 Model evaluation .......................................................................................................... 36

2.3.6 Deployment .................................................................................................................. 36

3 Requirements ........................................................................................................................ 37

3.1 Identified requirements in D2.2 ................................................................................... 37

3.2 Mapping requirements ....................................................................................................... 37

4 Data analysis architecture and processing pipeline .............................................................. 38

4.1 Analysis process model ................................................................................................. 38

4.2 SecureIoT architecture .................................................................................................. 39

5 Data pre-processing ............................................................................................................... 41

5.1 Feature scaling .............................................................................................................. 41


5.1.1 Min-Max scaling ............................................................................................................ 41

5.1.2 Mean-centred scaling ................................................................................................... 41

5.1.3 IQR-scaling .................................................................................................................... 42

5.1.4 Median and median absolute deviation (MAD) scaling ............................................... 42

5.2 Features extraction from network traces ........................................................................... 42

5.2.1 Motivation ................................................................................................................. 43

5.2.2 Rationale ....................................................................................................................... 44

5.2.3 Darknet ......................................................................................................................... 44

5.2.4 Methodology overview................................................................................................. 44

5.2.5 Inter-port similarity ...................................................................................................... 46

6 Candidate machine learning algorithms ............................................................................... 49

6.1 Clustering ...................................................................................................................... 49

6.1.1 Overview ....................................................................................................................... 49

6.1.2 Topological Data Analysis ............................................................................................. 50

6.1.3 Implementation of the TDA (Mapper).......................................................................... 54

6.1.4 Application to SecureIoT Data ...................................................................................... 59

6.2 Process mining .............................................................................................................. 64

6.2.1 Overview ....................................................................................................................... 64

6.2.2 Process mining algorithms............................................................................................ 64

7 Sinalytics platform ..................................................................................................................... 67

7.1 Authentication & Authorization .......................................................................................... 69

7.2 MindSphere IoT Data Services ............................................................................................ 70

7.3 MindSphere Gateway .......................................................................................................... 70

8 Conclusion ............................................................................................................................. 71

References .................................................................................................................................... 72

Annex A: Data collection template ............................................................................................... 74

Annex B: Data Schemata ............................................................................................................. 77

Annex C: IDIADA data summary ................................................................................................... 83


Table of Figures

Figure 1: Phases of the CRISP-DM reference model [1] ... 12
Figure 2: CRISP-DM – phases, generic tasks and outputs [14] ... 13
Figure 3: Registry coordination example ... 26
Figure 4: OpenO&M Common Interoperability Registry data model ... 27
Figure 5: SecureIoT data model ... 28
Figure 6: SecureIoT data kind ... 29
Figure 7: SecureIoT observed platform ... 30
Figure 8: SecureIoT probe ... 31
Figure 9: SecureIoT live data set ... 32
Figure 10: SecureIoT data observation ... 34
Figure 11: SecureIoT location entity ... 35
Figure 12: Additional information entity ... 36
Figure 13: Anatomy of the security intelligence layer ... 40
Figure 14: Methods to extract port similarity measure ... 46
Figure 15: Edge weights distribution ... 48
Figure 16: Processing steps of the Mapper algorithm to extract patterns from IP network traffic ... 51
Figure 17: From left to right, simplices of dimension zero, one, two and three ... 53
Figure 18: Example of persistence barcode and diagram of a points cloud representing a circle ... 54
Figure 19: Variation coefficients using DBSCAN and the modified tanh normalization (iSprint dataset) ... 60
Figure 20: Clusters found using DBSCAN (ε = 0.0025) and the modified tanh normalization (iSprint dataset) ... 60
Figure 21: Variation coefficients using DBSCAN and the min-max normalization (iSprint dataset) ... 60
Figure 22: Clusters found using DBSCAN (ε = 0.25) and the min-max normalization (iSprint dataset) ... 60
Figure 23: Variation coefficients using DBSCAN and the z-score normalization (iSprint dataset) ... 61
Figure 24: Clusters found using DBSCAN (ε = 0.35) and the z-score normalization ... 61
Figure 25: Variation coefficients using k-means and the modified tanh normalization (iSprint dataset) ... 62
Figure 26: Clusters found using k-means (k = 4) and the modified tanh normalization (iSprint dataset) ... 62
Figure 27: Variation coefficients using TDA and the modified tanh normalization (iSprint dataset) ... 62
Figure 28: Clusters found using TDA (ε = 0.0035) and the modified tanh normalization (iSprint dataset) ... 62
Figure 29: Variation coefficients using DBSCAN and the modified tanh normalization (LuxAI dataset) ... 63
Figure 30: Clusters found using DBSCAN (ε = 0.0055) and the modified tanh normalization (LuxAI dataset) ... 63
Figure 31: Variation coefficients using DBSCAN and the modified tanh normalization (IDIADA dataset) ... 63
Figure 32: Clusters found using DBSCAN (ε = 0.0005) and the modified tanh normalization (IDIADA dataset) ... 63
Figure 33: Different process behaviours and their respective event logs ordering relations. Extracted from [11] ... 65
Figure 34: MindSphere conceptual architecture ... 69
Figure 35: MindSphere IoT data services concept ... 70

List of Tables

Table 1: LuxAI data understanding ... 22
Table 2: IDIADA data understanding ... 23
Table 3: iSprint data understanding ... 24
Table 4: Associations between data types and filtering function ... 55
Table 5: Associations between data types and distance functions ... 56
Table 6: SecureIoT DataKind XML schema ... 78
Table 7: SecureIoT data models XML schema ... 82


Definitions, Acronyms and Abbreviations

Acronym Title

AWS Amazon Web Services

BIRCH Balanced Iterative Reducing and Clustering using Hierarchies

CE Contextualization Engine

CIR Common Interoperability Registry

CRISP-DM Cross Industry Standard Process for Data Mining

DBSCAN Density-Based Spatial Clustering of Applications with Noise

DFA Deterministic Finite Automata

DM Data Mining

Dx Deliverable (where x defines the deliverable identification number, e.g. D1.1.1)

IQR Inter-Quartiles Range

ISKB IoT Security Knowledge Base

ISTE IoT Security Templates Extraction

ML Machine Learning

Mx Month (where x defines a project month e.g. M10)

PCA Principal Component Analysis

PLM Product Lifecycle Management

PM Process Mining

ROS Robot Operating System

TDA Topological Data Analysis

TEE Template Execution Engine

WP Work Package


1 Introduction

A major objective of the SecureIoT project is to develop predictive security for IoT, i.e. to predict future threats or attacks against devices, services and applications. This will then support decision-making algorithms to deploy countermeasures. Although the predictive techniques themselves are the focus of task T4.2, task T4.1 aims at extracting knowledge from monitored data. This is an essential stage to consolidate the view or state of the systems on which predictive techniques can rely.

In this deliverable, the CRISP-DM methodology is first introduced to guide our analysis because it is recognized to be very effective. Indeed, one of its advantages is to consider business requirements from the early design stage and to compare them with the results obtained at the end of the full process. In the first months of the project, it serves as a good template to create preliminary datasets that are necessary for data scientists to understand the format and semantics of the handled data before proposing applicable algorithms. Based on the CRISP-DM method, we have thus defined a data model to be used by the SecureIoT platform for monitoring purposes, e.g. to describe the probes. It provides a uniform and documented approach to gather and access monitored data. From the analysis perspective, this deliverable describes the processing pipeline to be integrated in the architecture of the project while taking into consideration the requirements defined in WP2. Based on the first insights from the data we extracted, different techniques for pre-processing and analysis, as well as combinations of both, have been selected. We also propose a refinement technique to include TCP/UDP port information from traffic into ML or DM models, with enhanced semantics derived from previously observed attacks. The Sinalytics platform of SIEMENS is also introduced, as it integrates analytics components that could serve future needs for the analysis envisioned in this project.
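As a purely illustrative sketch of the kind of pre-processing/analysis combination considered in this deliverable (min-max scaling followed by DBSCAN clustering, both discussed in later sections), the following Python snippet applies them to synthetic data with scikit-learn. The feature values, parameters and dataset are hypothetical stand-ins; the actual SecureIoT features, datasets and parameter values differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two synthetic "device behaviour" groups whose features live on very
# different scales, standing in for heterogeneous IoT measurements.
group_a = rng.normal(loc=[10.0, 1000.0], scale=[0.5, 20.0], size=(50, 2))
group_b = rng.normal(loc=[20.0, 5000.0], scale=[0.5, 20.0], size=(50, 2))
X = np.vstack([group_a, group_b])

# Min-max scaling brings both features into [0, 1] so that the
# large-valued feature does not dominate the distance computation.
X_scaled = MinMaxScaler().fit_transform(X)

# Density-based clustering on the scaled data; eps and min_samples
# are illustrative values, not those used in the project.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})  # -1 marks noise points in DBSCAN
print(n_clusters)
```

On this toy data the two behaviour groups are recovered as two clusters; without the scaling step, the eps threshold would have to be tuned against the raw (and much larger) second feature instead.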

This deliverable is structured as follows:

• Section 2 reviews the CRISP-DM approach, detailing the tasks to be performed at each stage. It is then applied for data understanding and modelling, and refers to the different sections that detail DM and ML algorithms.

• Section 3 summarizes the requirements from WP2 related to security monitoring and knowledge inference.

• Section 4 introduces the monitoring and data analysis pipeline and maps its different components to the SecureIoT architecture.

• Section 5 highlights the importance of data normalization as an initial processing stage by reviewing predominant techniques in the state of the art. A new method is also introduced to automatically compute similarities between port numbers, originally considered as categorical data.

• Section 6 focuses on candidate DM and ML algorithms, with a particular focus on TDA, which has been implemented and tested on the first datasets.

• Section 7 describes the Sinalytics platform.


• Section 8 gives a conclusion and introduces the next steps.


2 CRISP-DM methodology

The "Big Data Revolution" refers less to the volume of data itself than to the capability of doing something with the data and making sense of it. To build a capability that can achieve beneficial data targets, enterprises need to understand the data lifecycle and its challenges at the different stages. The best-known methodology for data mining is CRISP-DM. In this section, we describe the CRISP-DM methodology and, based on the different SecureIoT scenarios, generate an appropriate data model for hosting the information collected by the project's probes.

2.1 Cross-Industry Standard Process for Data Mining (CRISP-DM)

The Cross-Industry Standard Process for Data Mining, named CRISP-DM, provides a structured approach to plan a data mining project. The CRISP-DM methodology defines a hierarchical process model which consists of a set of tasks. Each task is described at four abstraction levels (from general to specific) [14]:

• Level 1: At the first level, the data mining process is organized into several phases; each phase consists of several second-level generic tasks.

• Level 2: This level defines the generic tasks, which cover all possible data mining situations.

• Level 3: This level defines specialized tasks that describe how actions in the generic tasks should be carried out in specific situations.

• Level 4: "The process instance is a record of actions, decisions, and results of an actual data mining engagement. Each process instance is organized according to the tasks defined at the higher levels, but represents what actually happened in a particular engagement, rather than what happens in general." [14]

2.2 CRISP-DM reference model


Figure 1: Phases of the CRISP-DM reference model [1]

The reference model provides an overview of the data mining life cycle. This life cycle is divided into phases, and each phase is divided into tasks. There are six phases, as shown in Figure 1:

• Business understanding

• Data understanding

• Data preparation

• Modelling

• Model evaluation

• Deployment

The sequence of phases is not strict; it is often necessary to move back to earlier phases. The output of each phase determines the next phase or task to perform. Figure 2 presents the six phases with their contained tasks and outputs.
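The iterative, non-linear character of the life cycle can be made concrete with a small, purely hypothetical encoding of the phase graph; the transition table mirrors the arrows of Figure 1 (forward steps plus the usual feedback loops), but CRISP-DM itself prescribes no particular data structure.

```python
# Illustrative only: the six CRISP-DM phases as a tiny state machine.
# Each phase maps to the phases that may legitimately follow it,
# including the backward transitions shown in the reference model.
TRANSITIONS = {
    "business_understanding": ["data_understanding"],
    "data_understanding": ["data_preparation", "business_understanding"],
    "data_preparation": ["modelling"],
    "modelling": ["evaluation", "data_preparation"],
    "evaluation": ["deployment", "business_understanding"],
    "deployment": [],
}

def is_valid_step(current: str, nxt: str) -> bool:
    """Return True if moving from `current` to `nxt` follows the sketched model."""
    return nxt in TRANSITIONS.get(current, [])

# An unsatisfactory evaluation legitimately sends the project back
# to business understanding rather than forward to deployment.
print(is_valid_step("evaluation", "business_understanding"))
```

This also captures the remark above that the output of each phase determines what comes next: the next phase is chosen among the allowed successors rather than fixed in advance.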


Figure 2: CRISP-DM – Phases, generic tasks and outputs [14]

2.2.1 Business understanding

This phase deals with the business view of the project. Business understanding initially focuses on understanding the project objectives and requirements, then on converting this knowledge into a data mining problem definition and, finally, into a preliminary plan designed to achieve the objectives. The business understanding phase consists of four generic tasks [14]:

• Determine business objectives

• Assess situation

• Determine data mining goals

• Produce a project plan

Determine business objectives [14]

This task depicts what the customer really wants to accomplish from a business point of view.

Outputs:

• Background: record the known information about the business situation.

• Business objectives: describe the customer's primary business objectives.

• Business success criteria: from a business point of view, describe the criteria for a successful or useful outcome of the project. These should be specific enough to be measured objectively.


Assess situation[14]

This task involves more detailed investigation of the resources, constraints, assumptions, and other factors that affect data analysis goal and project plan.

Outputs

Inventory of resources List of available resources such as personnel, data, computing resources and software.

Requirements, assumptions, and constraints

List of project requirements, such as completion schedule, quality of results, security and legal issues. Make sure that you can use the data.

Risks and contingencies

List the risks or events that might delay the project or cause it to fail, together with the plans and actions that will be taken if these risks occur.

Terminology Define a glossary of terminology relevant to the project:

• A glossary relevant to business terminology

• A glossary relevant to data mining terminology.

Costs and benefits Construct a cost-benefit analysis for the project: compare the project costs with the potential benefits to the business.

Determine data mining goals[14]

Translate business goals to data mining goals.

Outputs

Data mining goals Describe the intended outputs of the project that achieve the business objectives.

Data mining success criteria

Define the criteria for a successful outcome of the project in technical terms.

Produce a project plan

Define a plan for achieving the data mining goals. The plan should specify the steps to be performed during the project, including the initial selection of tools and techniques.

Outputs

Project plan List the stages to be executed in the project, including their duration, required resources, inputs, outputs and dependencies. Analyse dependencies between time schedule and risks.

Initial assessment of tools and techniques

An initial assessment of the tools and techniques to be used in the project.

2.2.2 Data understanding

The data understanding phase starts with an initial data collection and proceeds with activities

that enable you to become familiar with the data, identify data quality problems, discover first

insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden

information[14].


Collect initial data[14]

Acquire the data listed in the project resources. Also include data loading, if necessary for data understanding.

Outputs

Initial data collection report

List the acquired datasets with their locations, the acquisition methods, the problems encountered and the resolutions to these problems.

Describe data[14]

Examine the “gross” or “surface” properties of the acquired data and report on the results.

Outputs

Data description report

Acquired data description, including the data format, quantity, the identities of the fields, and any other discovered surface feature.

Explore data[14]

“This task addresses data mining questions using querying, visualization and reporting techniques. These include distribution of key attributes, relationships, results of simple aggregations, properties of significant sub-populations and simple statistical analyses.” [14]

Outputs

Data exploration report

Description of the results, such as the first findings or the initial hypothesis. If appropriate, graphs and plots can be included.

Verify data quality[14]

Examine the quality of the data, addressing appropriate questions (Is the data complete and correct? Does it contain errors and, if so, how common are they?).

Outputs

Data quality report List the results of the data quality verification. If there is a quality problem, define possible solutions.
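To make this task concrete, a minimal data-quality check can be sketched as follows; the records, field names and plausible ranges are invented for illustration and are not taken from a SecureIoT dataset:

```python
# Minimal data-quality verification sketch: reports, for each field, the
# fraction of missing values and of values outside a plausible range.
# Field names and ranges are illustrative only.

records = [
    {"temperature": 21.5, "humidity": 40},
    {"temperature": None, "humidity": 38},
    {"temperature": 19.8, "humidity": 300},  # humidity out of range
]

plausible = {"temperature": (-40, 60), "humidity": (0, 100)}

def quality_report(records, plausible):
    report = {}
    for field, (lo, hi) in plausible.items():
        values = [r.get(field) for r in records]
        missing = sum(v is None for v in values) / len(values)
        present = [v for v in values if v is not None]
        out_of_range = sum(not (lo <= v <= hi) for v in present) / len(values)
        report[field] = {"missing": missing, "out_of_range": out_of_range}
    return report

print(quality_report(records, plausible))
```

Such a report directly feeds the "define possible solutions" part of the output, e.g. deciding whether to drop or impute the affected fields.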

2.2.3 Data preparation

The data preparation phase covers all activities needed to construct the final dataset from the initial raw data. The tasks of this phase may be performed multiple times and in no predefined order. This phase produces a list of datasets and their descriptions [14].

Select data[14]

This task decides which data will be used for analysis based on criteria such as relevance to the data mining goals, quality, technical restrictions (volume or datatype limitations).

Outputs

Rationale for inclusion/exclusion

List the data to be included/excluded and the reasons for these decisions.


Clean data[14]

This task raises the data quality to the level required by the selected analysis techniques. This can involve the selection of clean subsets, the definition of default values, or other techniques such as the estimation of missing data.

Outputs

Data cleaning report Describe what actions were taken to address the data quality problems reported during the “Verify Data Quality” task, including any data transformations performed for cleaning purposes. The impact on the analysis results should be considered.

Construct data[14]

This task covers constructive data preparation operations: the production of derived attributes, of entirely new records, or of transformed values for existing attributes.

Outputs

Derived attributes Derived attributes are new attributes that are constructed from one or more existing attributes in the same record.

Generated records Describe the creation of completely new records.
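As a small illustration of a derived attribute (the field names here are hypothetical):

```python
# Derived-attribute sketch (Construct data task): a new attribute computed
# from existing attributes in the same record. Field names are illustrative.
record = {"distance_m": 120.0, "duration_s": 10.0}
record["speed_m_s"] = record["distance_m"] / record["duration_s"]  # derived
print(record["speed_m_s"])  # 12.0
```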

Integrate data[14]

This task covers methods for combining information from multiple tables or records to create new records or values.

Outputs

Merged data Merging tables, aggregations.

Format data[14]

Formatting transformations refer to syntactic modifications that do not change the meaning of the data but might be required by the tool.

Outputs

Reformatted data Reformatted data.

2.2.4 Modelling

In this phase, the appropriate modelling techniques are selected and applied, and their parameters are calibrated to optimal values. Usually, there are several techniques for the same type of data mining problem. Some techniques have specific requirements on the form of the data, so stepping back to the data preparation phase is often necessary [14].


Select modelling technique[14]

Select the modelling technique that is going to be used. Although the tool may have already been selected during the Business Understanding phase, this task refers to specific modelling techniques. If multiple techniques are applied, this task is repeated for each technique.

Outputs

Modelling technique Document the actual modelling technique that is to be used.

Modelling assumptions

Record any assumption made by a modelling technique.

Generate test design[14]

Generate a procedure or mechanism to test the model’s quality and validity before it is built.

Outputs

Test design Define and describe a plan for training, testing, and evaluating the models.

Build a model[14]

Run the modelling tool on the prepared dataset to create one or more models.

Outputs

Parameter settings List the necessary parameters and their chosen values.

Models The actual models produced by the modelling tool

Model descriptions Report on the interpretation of the models and document any difficulties encountered with their meanings

Assess model[14]

The data mining engineer interprets the models according to their domain knowledge, the data mining success criteria, and the desired test design, and technically judges the success of the application of the modelling and discovery techniques. The models are then assessed against the evaluation criteria, ranked, and compared with each other.

Outputs

Model assessment Summarize results of this task, list qualities of generated models (e.g., in terms of accuracy), and rank their quality in relation to each other.

Revised parameter settings

According to the model assessment, revise parameter settings and tune them for the next run in the Build Model task.

2.2.5 Evaluation

This phase focuses on the evaluation and the review of the created model, to be certain that it

achieves the business objectives. A key objective is to determine if there is an important business


issue that has not been sufficiently considered. At the end of this phase, a decision on the use of

the data mining results should be reached [14].

Evaluate results[14]

This task assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Moreover, “the evaluation also assesses other data mining results generated. Data mining results involve models that are necessarily related to the original business objectives and all other findings that are not necessarily related to the original business objectives, but might also unveil additional challenges, information, or hints for future directions.” [14]

Outputs[14]

Assessment of data mining results with respect to business success criteria

“Summarize assessment results in terms of business success criteria, including a final statement regarding whether the project already meets the initial business objectives.” [14]

Approved models The generated models that meet the selected criteria become the approved models.

Review process[14]

Review the data mining process to determine whether an important factor or task has somehow been overlooked and whether quality assurance issues have been covered.

Outputs

Review of process Summarize the process review and highlight activities that have been missed and those that should be repeated.

Determine next steps[14]

This task defines the next steps based on the input of the assessment and the process review. These steps include finishing the project and moving to deployment, initiating further iterations, or setting up new data mining projects. The task also includes an analysis of the budget and remaining resources.

Outputs

List of possible actions List of the potential further actions and the reasons for each option.

Decision Description of the decision as to how to proceed.

2.2.6 Deployment

Depending on the requirements, the deployment phase can be as simple as generating a report

or as complex as implementing a repeatable data mining process across the enterprise. In many

cases, the customer, and not the data analyst, carries out the deployment steps. In any case, it is

important for the customer to understand up front what actions need to be carried out in order

to actually make use of the created models[14].


Plan deployment[14]

This task takes the evaluation results and determines a strategy for deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment.

Outputs

Deployment plan Summarize the deployment strategy, including the necessary steps and how to perform them.

Plan monitoring and maintenance[14]

This task defines a plan for monitoring and maintenance. The maintenance strategy helps to avoid unnecessary and/or incorrect usage of the data mining results.

Outputs

Monitoring and maintenance plan

Summarize the monitoring and maintenance strategy, including the necessary steps and how to perform them.

Produce final report[14]

In this task, the project team writes the final report. This report may be only a summary or a final and comprehensive presentation of the data mining result.

Outputs

Final report This is the final written report of the data mining engagement. It includes all the previous deliverables, summarizing and organizing the results.

Final presentation Usually, this is a meeting at the end of the project at which the results are presented to the customer.

Review project[14]

This task assesses what went right and what went wrong, what was done well and what needs to be improved.

Outputs

Experience documentation

Summarize important experience gained during the project.

2.3 Application to SecureIoT Datasets

2.3.1 Business Understanding

As mentioned above, this phase deals with the business view of the project. The main volume of

this work has been conducted in D2.1 & D6.1 for understanding the business scenarios and in

D2.2 for analysing the stakeholder requirements. The requirements referring to knowledge

inference are listed and analysed in section 3 of this document named “Requirements”.


2.3.2 Data understanding

2.3.2.1 Methodology for collecting data characteristics

One major objective of SecureIoT is to analyse data, especially real-time data from systems in production, in order to predict security issues and so prepare countermeasures. From the raw data of systems to prediction there is a long path, which starts by automatically extracting knowledge from data, i.e. aggregating raw data into understandable facts or events. This requires understanding the content of the data, i.e. having context, in order to properly fit the algorithms to be used, in particular Machine Learning (ML) and Data Mining (DM) algorithms. Data is thus the starting point of our analysis and everything will be built around it. Evidently, feedback can also be provided to the use cases if the analysis reveals that further data should be collected.

In the first iteration, WP3 and WP4 jointly prepared a document template for describing the initial data sources collected that can be exploited in this project (Collect and Describe tasks from CRISP-DM). It is a living document, in the sense that it may be updated with further details about the data to be used. The document first provides general information about the dataset, such as a general description, in order to give a first insight into its content. A second section is dedicated to explaining how the data has been collected. Indeed, data of the same type can vary considerably when collected on a small-scale testbed compared to real-life systems. This information is also very important in order to avoid building analytics biased by non-representative data, while such data can still be used to perform first tests of the techniques. This section is thus expected to contain details about the topology, the number and types of devices, etc. Two further sections cover the technical use of the data. First, access to the data must be documented, for instance to know whether a standard API will be available to access it. Secondly, the format of the data and its model are described so that the algorithms analysing it can be properly prepared; for example, textual, categorical and numerical data cannot be handled in the same manner. Finally, a section expresses the restrictions on the use of the data. For instance, the openness of data is highly important, especially for the academic partners and the community in general, but datasets may also contain sensitive or personal data. DWF particularly helps in reviewing this section to take the legal aspects into account.

While the previous paragraphs highlight the rationale behind the composition of the template, the interested reader can refer to Annex A (Data collection template), which provides the full template.

2.3.2.2 Qualitative Data Analysis

Data has to be qualitatively assessed and characterized (Explore task of CRISP-DM) in order to select the normalization and clustering methods to use. More specifically, two critical properties of the data are assessed: its type and its variability.


For example, unbounded values are harder to integrate than bounded ones: normalization parameters computed on a small set of unbounded data may have to change drastically to fit a new entry. Moreover, a distinction has to be made between discrete and continuous values, because this can influence the distance calculation and the results of the clustering algorithms. Finally, some datasets contain categorical and text-based values that need special treatment before they can be used in a clustering algorithm, because many algorithms, like K-Means or DBSCAN, only use numerical values or features that can be mapped to a metric space.
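To make these two points concrete, here is a minimal sketch (with invented values) of min-max normalization, whose parameters depend on the observed bounds, and of one-hot encoding, which maps categorical values into a metric space usable by K-Means or DBSCAN:

```python
# Min-max normalization: parameters (min, max) computed on a small set of
# unbounded data may shift drastically when a new extreme value arrives.
def min_max(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # fixed-value feature
    return [(v - lo) / (hi - lo) for v in values]

# One-hot encoding: map each categorical value to a binary vector so that
# distance-based clustering algorithms can operate on it.
def one_hot(values):
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

print(min_max([10, 20, 15, 30]))          # [0.0, 0.5, 0.25, 1.0]
print(one_hot(["happy", "sad", "happy"])) # [[1, 0], [0, 1], [1, 0]]
```

The normalization techniques actually retained for SecureIoT are reviewed in section 5; this sketch only illustrates why the data type matters for the choice.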

The type of data is not the only point to check; the variability of each feature has to be considered as well. Indeed, the frequency at which data changes may affect the clustering results (because of outliers) or occupy memory unnecessarily. For example, a fixed-value feature that is identical for each data point is useless to sample at high frequency. On the other hand, if multiple values of a feature exist, they have to be reported in the dataset, otherwise some clusters could be missed.

The purpose of clustering these data is to find clusters of normal behaviour for which no alert needs to be triggered. At the moment, the datasets do not contain any attack and hold too little data for an exhaustive cluster search, but it may already be possible to find some groups of normal behaviour.

Therefore, data is grouped into five different types:

1. Numerical data

1.1. Discrete values

1.1.1. Bounded

1.1.2. Unbounded

1.2. Continuous values

1.2.1. Bounded

1.2.2. Unbounded

2. Boolean data

3. Text-based data (infinite possibilities)

4. Categorical data (finite number of categories)

5. Aggregate

Aggregate refers to grouped features, like user profiles, where multiple pieces of information on a single individual can be found. Regarding variability, it is interesting to know whether the data never changes, changes on particular events, or changes continuously.
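The intended use of the normal-behaviour clusters mentioned above can be sketched as follows: once cluster centres are known, a new observation raises an alert when it lies far from every centre. The centroids and threshold below are invented for illustration; the actual clustering techniques are evaluated in section 6.

```python
import math

# Alerting against "normal behaviour" clusters: a new observation triggers
# an alert when it is far from every known cluster centre.
# Centroids and threshold are illustrative values.

centroids = [(0.2, 0.3), (0.8, 0.7)]   # normal-behaviour cluster centres
threshold = 0.25                        # maximum distance considered normal

def is_anomalous(point, centroids, threshold):
    nearest = min(math.dist(point, c) for c in centroids)
    return nearest > threshold

print(is_anomalous((0.25, 0.35), centroids, threshold))  # near a cluster: False
print(is_anomalous((0.5, 0.0), centroids, threshold))    # far from both: True
```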

LuxAI

The data included in the LuxAI dataset comes from a demo on QTrobot of about 2 minutes and 30 seconds, in which the robot uses its camera to recognize the emotions of people standing in front of it and reacts accordingly.


So, the dataset contains the recognized emotions of the subject and the robot reactions, like the sound it plays or its gestures. Moreover, some additional data can be found in the network traffic, like packet sizes or the protocols used. A table of categorized data is in Table 1.

Table 1 LuxAI data understanding

IDIADA

The dataset provided by IDIADA comes from a simulation tool that generates vehicle performance data. There are two data sources: the first is CAN data, and the second is provided by the V2X module. The CAN data is low level, like braking pressure or vehicle RPM, while the data from the V2X module is higher level, such as GPS coordinates. The V2X.log document is easier to understand than the V2X traffic in the PCAP file. A more detailed presentation of the data is available in Annex C (IDIADA_Data_Summary).

Some of the data below cannot be directly found in an understandable application format but are in a dump file, like the vehicle door status. Some other data were in the previous dataset, like the type of vehicle or the user email, but they might have been additionally introduced by the simulation tool at that time. A table of categorized data is in Table 2.

(The cell layout of Table 1 could not be recovered from the source. Its columns are: Categorical value, Grouped features, Continuous change, and Change on particular event. Its rows are: method called, robot gesture, sound played, recognized human emotion, and video and sound stream; each feature is marked with one data type and one variability class.)


Table 2 IDIADA data understanding

iSprint (CC2U)

The data included in the iSprint dataset comes from a simulation of CloudCare2U, a solution allowing chronic disease patients to live a life as normal as possible. Information in the dataset comes from room sensors, like temperature and illuminance, or from particular devices used by the patients, like heartbeat monitors. The collected data is used to infer further knowledge about the patient; for example, hunger is estimated that way. For each room, a file presents the sensors’ data. Additional information can be found in other files, like medical measures or a history of the patient’s actions. All these files are in JSON format. A table of categorized data is available in Table 3.

(The cell layout of Table 2 could not be recovered from the source. Its columns are: Discrete numerical value (bounded), Numerical value (bounded), Categorical value, Grouped features, Fixed value, Continuous change, and Change on particular event. Its rows are: engine rpm, distance travelled, steering wheel angle, GPS coordinates, vehicle speed, braking, throttle, vehicle heading, vehicle altitude, vehicle length, vehicle width, vehicle type, service provided, station type, status seat belts, status doors, status light, and user profile; each feature is marked with one data type and one variability class.)


Table 3 iSprint data understanding

(The cell layout of Table 3 could not be recovered from the source. Its columns are: Discrete numerical value (unbounded), Numerical value (unbounded), Numerical value (bounded), Boolean value, Text-based value, Categorical value, Grouped features, Fixed value, Continuous change, and Change on particular event. Its rows are: age, systolic blood pressure, SPO2, diastolic blood pressure, heart rate, steps, humidity, person boredom, person tiredness, person toilet need, hunger, position confidence, gender confidence, age confidence, emotion confidence, position in the room, person width, person height, IMA, appliance power consumption, appliance daily energy consumption, heart rate variability, illuminance of the room, temperature, doors status, movement, bed pressure, sofa pressure, fall, object name, state, gender, emotion, function, room, type of appliance, body, physical activity, and previous state; each feature is marked with one data type and one variability class.)

2.3.3 Data preparation

In this document, we propose a dedicated method to enhance the knowledge that can be extracted from network traffic (Format data task from CRISP-DM), and we also review some existing techniques for data normalization in section 5 (Construct data task from CRISP-DM). These are then applied and tested in section 6, where they are used conjointly with clustering techniques. At this stage, our objective is focused on data exploration and initial testing of algorithms, so no data records are excluded, cleaned or integrated (Select, Clean and Integrate tasks from CRISP-DM).

2.3.4 Modelling

Based on the scenarios described in D2.2 and D6.1 and the data analysed in section 2.3.2, we realized that dynamic data modelling (Select modelling technique task of CRISP-DM) should be


applied to cover the diverse domains of the different IoT platforms that need to be monitored. For this reason, we decided to use an object identification registry. This registry will enable SecureIoT to store the information observed by the probes deployed on the different IoT platforms in a SecureIoT database and, in parallel, to use the third-party database of the IoT platform to enrich the observations with the scenario semantics.

The object identification registry will:

• Support the digital twins concept of the SecureIoT reference architecture by enabling simulation and data processing applications to map their internal object descriptions to the appropriate data sources to retrieve data.

• Enhance the Probes registry with additional security/business context coming from the

use case owner databases.

o The abovementioned mapping will facilitate the Security Template Extraction as

well.

• Finally, it will enable the linking of data streams and Execution Engine templates with the

security knowledge base objects.

For the objects registry, we have identified the OpenO&M Common Interoperability Registry (CIR) [2] as capable of covering our needs. For the SecureIoT observed data, we specified a data model which provides the types of captured information, as stream or batch, and, at the same time, identifies their sources. Both models are described in the sections below.

Figure 3 below shows an example of the interactions between the SecureIoT DB systems. Each object introduced to one of the SecureIoT DB systems (Probe Registry, Security Templates, Security Knowledge Base, etc.) with the same context is also registered with a unique identification number in the CIR. In this way, all the different DB systems can be combined to provide business context to the data produced and stored in the SecureIoT data storage. This procedure should be managed by a Management Console, which would be responsible for binding the objects from the different DB systems to the CIR.


Figure 3 Registry coordination example

2.3.4.1 OpenO&M Common Interoperability Registry

The OpenO&M Common Interoperability Registry (CIR) provides a standards-based, vendor-neutral method to map object entities belonging to different systems/databases which share a common business context. Additionally, it:

• Enables the discoverability and relation of the registered objects and helps third party

applications to combine the information provided from these systems/databases.

• Provides a globally unique identifier (in UUID format) for the registered objects.

The CIR provides an XML schema and a relational DB which describe the specification. Figure 4 shows the structure of the root elements of the Common Interoperability Registry. The OpenO&M CIR is open source and the latest version of the CIR schema can be found at GitHub1.

1 https://github.com/mimosa-org/ws-cir/tree/master/xsd


Figure 4 OpenO&M Common Interoperability Registry Data Model

CIR main entities[3]:

• Registry: The container object for a set of categories.

• ID: The user-defined identifier of the registry.

• Description: The description and expected use of the registry.

• Category: A Category object is the container object for a set of entries. Categories define

sets of related or potentially related entries. For example, a Category may be defined for

equipment hierarchy level names (Enterprise, Site, Area, Work Centre, Work Unit), which


have alternate names on different systems. The combination of ID and SourceID must be

unique within a Registry.
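The mapping role of the CIR can be illustrated with a minimal registry sketch that assigns one global UUID per business object and records the local identifiers used by each system; the class and method names below are invented and do not follow the CIR XML schema:

```python
import uuid

# Minimal sketch of a CIR-like registry: each business object receives a
# global UUID, and local identifiers from different systems/databases are
# mapped onto it. Names are illustrative, not taken from the CIR schema.

class Registry:
    def __init__(self):
        self.entries = {}  # global_id -> {system_name: local_id}

    def register(self):
        gid = str(uuid.uuid4())   # globally unique identifier
        self.entries[gid] = {}
        return gid

    def map_local_id(self, gid, system, local_id):
        self.entries[gid][system] = local_id

    def resolve(self, system, local_id):
        for gid, local_ids in self.entries.items():
            if local_ids.get(system) == local_id:
                return gid
        return None

reg = Registry()
gid = reg.register()
reg.map_local_id(gid, "ProbeRegistry", "probe-42")
reg.map_local_id(gid, "SecurityKnowledgeBase", "object-7")
print(reg.resolve("ProbeRegistry", "probe-42") == gid)  # True
```

This is the behaviour the Management Console described above would rely on when binding objects from the different DB systems to the CIR.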

2.3.4.2 SecureIoT Data Model

As mentioned above, for storing the observed data we specify a data model which provides the type of information captured and its source. To achieve this, we have specified the SecureIoT-DM, which is depicted in Figure 5.

Figure 5 SecureIoT Data Model

SecureIoT DM consists of four main entities:

• DataKind (DK): The DataKind entity describes the type of data observed by the SecureIoT platform. It may provide the data format along with the semantics of the measurements.

• Platform: The Platform entity provides an identification and description of the observed

IoT platform.

• Probe: The Probe entity provides an identification and description of the probes deployed

to an IoT platform. Each Probe instance is bound to an IoT platform.

• LiveDataSet: The LiveDataSet entity provides the structure of the captured observations. It supports batch as well as stream data. Each LiveDataSet is associated with a SecureIoT probe, which is responsible for the generated data.
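The four entities and their relations can be sketched as follows; the attribute names are simplified for illustration, and the normative schemas remain those provided in Annex B:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the four SecureIoT-DM entities and their relations.
# Attribute names are simplified; the full schemas are in Annex B.

@dataclass
class DataKind:
    id: str          # URI
    name: str
    format: str      # e.g. JSON, XML

@dataclass
class Platform:
    id: str          # URI
    name: str

@dataclass
class Probe:
    id: str
    platform: Platform          # each Probe is bound to one IoT platform

@dataclass
class LiveDataSet:
    probe: Probe                # the probe responsible for the data
    data_kind: DataKind
    observations: List[dict] = field(default_factory=list)

p = Platform("urn:secureiot:platform:1", "DemoPlatform")
probe = Probe("urn:secureiot:probe:1", p)
ds = LiveDataSet(probe, DataKind("urn:secureiot:dk:temp", "temperature", "JSON"))
ds.observations.append({"t": 0, "value": 21.5})
```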


Data Kind

The Data Kind entity provides information on the kind of data captured, along with its types and formats. It is required in order not only to understand the observation semantics but also to know how to consume the observations. The root element of the DK model is depicted in Figure 6. The XSD schema of the “DK” is provided in Table 6 of Annex B.

Figure 6 SecureIoT Data Kind

As depicted in Figure 6 “DataKind” has:

• id: Uniquely identifies the DataKind as a URI

• name: A human-readable name which uniquely identifies the DataKind

• description: Provides an optional description of the DataKind

• modelType: Specifies the model type of the data (e.g. SenML, OM, ...)

• format: Specifies the format of the data (e.g. JSON, XML, ...)

• quantityKind: A QuantityKind is an abstract classifier that represents the concept of "kind

of quantity". A QuantityKind represents the essence of a quantity without any numerical

value or unit (e.g. a sensor sensor1 measures temperature: sensor1 has quantityKind temperature).

Platform

The “Platform” entity provides a high-level description and unique identification of the IoT platform monitored in each use case. It is the parent of the “Probe” entity described below. The root element of the “Platform” model is depicted in Figure 7. The XSD schema of the “Platform” is provided in Table 7 of Annex B.

Figure 7 SecureIoT observed Platform

As depicted in Figure 7 “Platform” has:

• id: Uniquely identifies the Platform as a URI

• name: A human-readable name which uniquely identifies the IoT Platform

• namespace: For scope hierarchy

• description: Textual description for the IoT Platform

• Location: The Platform's location

• AdditionalInformation: Optional auxiliary field that may contain any additional

information.

Probe

The “Probe” entity provides a high-level description and unique identification of a probe, which

provides security measurements to the SecureIoT system. Each probe is associated with one

Platform by referencing the IoT platform’s ID. The root element of the “Probe” model is depicted

in Figure 8. The XSD schema of the “Probe” is provided in Table 7 of Annex B.

Figure 8 SecureIoT Probe

As depicted in Figure 8 “Probe” has:

• id: Uniquely identifies the Probe as a URI

• name: A human-readable name which uniquely identifies the Probe.

• PlatformReferenceID: The ID of the Platform this probe is deployed to.

• namespace: For scope hierarchy.

• Description: Textual description for the SecureIoT Probe

• Location: The Probe's location

• AdditionalInformation: Optional auxiliary field that may contain any additional

information

LiveDataSet

The “LiveDataSet” is structured in two levels. The first level provides the information of the probe (e.g. a packet-sniffing probe) and when it generated/grouped the data. The second level provides the measurements of the probe (which can be a single measurement or a list of measurements). The root element of the “LiveDataSet” model is depicted in Figure 9. The XSD schema of the “DeviceLD” is provided in Table 7 of Annex B.

Figure 9 SecureIoT Live Data Set

As depicted in Figure 9 “LiveDataSet” has:

• id: which is the assigned ID of the dataset when entering the SecureIoT system.

• ProbeReferenceID: the ID of the probe the captured observations are coming from.

• mobile: Identifies if the SecureIoT Probe is mobile or not. If it is mobile, the location field

within the observation entity should be provided as well.

• timestamp: is the dateTime of when the batch of recorded observations was generated.

If a data stream is observed (or only single values are observed) this field is not used and

only the timestamp within the observation entity is used.

• observations (unlimited list): which provides the value entity. This entity is analysed in more detail in the sections below.

Observation

The root element of the Observation model is the “Observation” and is depicted in Figure 10. The

XSD schema of the “Observation” is provided in Table 7 of Annex B.

Figure 10 SecureIoT Data Observation

As depicted in Figure 10 “Observation” has:

• id: which is the ID of the Observation instance.

• name: which is a human-readable name of the SecureIoT Observation Model instance.

• DataKindReferenceID: which references the DataKind supported by the Observation.

• timestamp: which provides information about the dateTime timestamp an observation

was recorded.

• Location: which provides the geographical or virtual location where an incident took place.

• value: which provides the value a probe observes.

Other Entities

Location

The root element of the Location model is the “Location” and is depicted in Figure 11. The XSD

schema of the “Location” is provided in Table 7 of Annex B.

Figure 11 SecureIoT Location entity

As depicted in Figure 11 “Location” has:

• geolocation: which provides the coordinates (longitude and latitude) of a physical

location.

• virtualLocation: which provides information about a virtual location (it could be the ID of

a resource or subsystem).

Additional Information

AdditionalInformation is a generic entity which allows the extension of the existing data model with additional attributes that may be required. The root element of the “AdditionalInformation” model is depicted in Figure 12. The XSD schema of the “AdditionalInformation” is provided in Table 7 of Annex B.

Figure 12 Additional Information entity

2.3.4.3 Data Mining Models

In section 2.3.4.2 we introduced data models to ease data collection and access. In section 6, we have selected different clustering and machine learning techniques (the “select modelling technique” task of CRISP-DM). To perform the tests (the “generate test design” task), we have run an extensive set of experiments on the initial datasets while varying the different parameters. As no anomalies or attacks are available at this stage, we decided to cluster the normal activities and to use particular attributes of the data as labels, in order to check their consistency within an identified cluster. Thanks to these tests, which include training the models (the “build model” task), we are able to identify the strengths and drawbacks of the different algorithms.

2.3.5 Model evaluation

This step is currently left for future work, as an integrated platform is first needed to analyse data from a real system with potential threats.

2.3.6 Deployment

Deployment requires the use-case owners to carry out the deployment steps and, at this stage of the project (month 9), the use cases are not mature enough to proceed with it. For this reason, we cannot provide a deployment report in this first version of the deliverable; it will be available in the second version of the deliverable in month 24.

3 Requirements

3.1 Identified requirements in D2.2

In deliverable D2.2, the analysis led to the identification of particular requirements related to each task. Four are provided for T4.1, which relates to this deliverable. We can divide them into two groups. The first group contains requirements that impact the design and definition of the methods to analyse the data, e.g. algorithms and processes. The second group focuses on how the latter are integrated within the SecureIoT platform to support secure and controlled access to the inferred knowledge.

Requirements for design and definition of the methods

• R4.1.1 Determine security monitoring and knowledge information in a timely, scalable,

consistent and automated manner

• R4.1.4 Support monitoring and knowledge data lifetime management

Requirements for secure integration

• R4.1.2 Monitoring and knowledge data should be protected in transit and at repositories

• R4.1.3 Support data access control and partitioning methods

3.2 Mapping requirements

We have used the CRISP-DM methodology for data preparation to fulfil points of requirement R4.1.1. In this respect, we have designed a data model capable of ensuring that the security knowledge captured by the different sources (probes) of the SecureIoT system will be consistent, by storing it based on a standard model able to depict its semantics and origin. Additionally, the captured information can be enriched from multiple data sources by utilizing an object registry.

Moreover, in order to provide a consolidated view of the data (R4.1.1), we introduce in this deliverable different techniques to pre-process data (normalization and enrichment) and to cluster data. The goal is to support further analysis, in particular detecting or predicting attacks. The process-mining technique, also detailed in this document, has the great advantage of naturally dealing with the heterogeneity of data by representing everything as events.

Regarding R4.1.2, the currently usable datasets represent small samples, sufficient to identify potentially usable techniques but not to support a time-based analysis. As mentioned, the second group of requirements is related to the integration with the SecureIoT platform and is thus related to WP5 development. At this stage of the project, WP5, in relation with WP2, is focused on specifications that integrate access control and data management considerations.

4 Data analysis architecture and processing pipeline

4.1 Analysis process model

In the project, different techniques are planned for enhancing the security of IoT, including making predictions about issues so that they can be resolved in advance. While the first set of algorithms is described in this deliverable, updates may occur along the project lifetime based on the nature and expressiveness of the datasets.

This section describes the logical view of the data processing. First, it is important to highlight that there are different types of data that can be used:

• Dynamic data collected from a system in production. It represents data collected while the system runs. It includes general data like network traffic but also operational data. Even if this type of data is dynamic and would ideally be collected in a streaming manner, the first iterations of our research development are focused on offline analysis (from stored data), until all the necessary tools are developed and integrated to interact with the live system.

• Assets or static contextual data. It represents general information about the observed systems which tends not to change frequently over time, for example the version of the software in use. Infrequent updates can happen and have to be considered, but this cannot be treated as streaming/live data. The goal of this data is rather to infer a profile of the system and to adjust the algorithms if necessary, depending on a particular context.

• External data. External knowledge such as descriptions of attacks or vulnerabilities is necessary, especially for assessing the security of a system. It may also help to verify whether a suspicious behaviour has been observed somewhere else or whether there is evidence of similar attacks over the Internet.

All of them are inputs to the algorithms in charge of analysing them. Different types of machine learning algorithms will be used:

• Unsupervised algorithms. This class of algorithms assumes no prior knowledge about the data and thus requires no labelled data samples (from the dynamic dataset). In this category fall the clustering algorithms, which automatically group data samples into distinct classes. From a security point of view, this is particularly useful to profile the different types of behaviours and to detect the appearance of new, potentially abnormal ones, over time.

• Supervised algorithms. In this case, the system is provided with labelled data to learn signatures corresponding to each label; this corresponds to the learning phase. Very often the goal is then to predict the unknown label of new data samples (testing phase).

Therefore, this type of algorithm works in two passes: first, a model is trained or learned (usually the most computationally expensive part), then the data is matched against this model. In the most drastic cases, only a single class or label can be used. This type of algorithm, known as one-class classification, has been successfully applied in network security to detect divergent behaviours.

This is a very short description of the types of algorithms in the literature; for instance, reinforcement learning is another type of machine learning approach. However, this limited description aims solely to support the description of the algorithms selected in this deliverable: TDA Mapper for clustering data (unsupervised) and Process Mining to automatically create profiles of behaviours (supervised). This may evolve in the next versions of this deliverable. It is worth mentioning that these algorithms are designed to be applied to a single dataset: they cannot deal jointly with the different sources of data (dynamic, contextual and external). This will require further investigation in our future work.

4.2 SecureIoT architecture

Figure 13 refers to D2.4, which provides the architecture of the project. This figure is focused on WP4 integration and highlights how the analytics process will be integrated. Hereafter, we summarize the different components of the architecture and then map them to the data analytics process. First, data comes from, and is processed by, different modules:

• Dynamic data is provided by the Global Storage module. The latter is in charge of storing all

data sent by the monitoring probes of the live system. It will rely on the modelling described

in section 2.3.4.2.

• Context data is provided to the Contextualization Engine (CE): the role of this engine is

actually to fit the algorithms or pre-established learned models to a specific context, i.e. a

specific IoT deployment.

• External data is collected within IoT Security Knowledge Base (ISKB): This knowledge base

comprises external IoT security knowledge, including for example knowledge about known

threats, attacks, incidents and vulnerabilities.

Security analysis is then performed at different levels:

• IoT Security Templates Extraction (ISTE): this module aims at creating models of the anomalies, attacks or, more generally, events we want to catch. While they can be manually created, we mainly envision automated extraction using machine learning. Indeed, this constitutes the first phase of the supervised algorithms. Unsupervised algorithms do not require this stage, and thus do not need this component.

• IoT Security Templates Database: models previously built by the ISTE are stored in a

persistent manner in this database. From a logical point of view, a template or model is a set

of descriptions of the data model to be matched against. In practice, as highlighted in D2.4,

this can take various forms such as a decision tree, a neural network or even a program.

• Template Execution Engine (TEE): in our context, the role of this engine is to apply either the unsupervised algorithm or the testing phase of the supervised algorithm. Hence, it will automatically extract knowledge from new data, in particular dynamic data.

Figure 13: Anatomy of the Security Intelligence Layer

In the following sections of this deliverable, the scope is to infer security knowledge rather than to make predictions (D4.3), the latter also involving the deployment of countermeasures such as security policies enforced by the Security Policy Enforcement Point (SPEP). The other components, the Management and Configuration Tools and the Visualization, are out of the scope of this deliverable.

5 Data pre-processing

5.1 Feature scaling

When analysed, data is represented by several features, which are usually all taken together for the analysis. They can be, for example, projected into a metric space to compute the distances between data points. However, a problem arises if the data is not properly scaled or normalized. For instance, a feature ranging between 0 and 100 and another between 1 million and 10 million cannot be used directly for computing a distance, as one will have more impact than the other. Re-scaling techniques aim at easing the integration of features, with the primary objective of making data features more comparable.

5.1.1 Min-Max scaling

Min-max normalization is the simplest normalization technique. It rescales each feature so that its values fit within the [0, 1] interval. Thus, for each feature k of each data point s in the dataset, the new value s_k' is given by:

s_k' = (s_k − min) / (max − min), where min is the minimal value of the feature and max its maximal value.

The strong point of this method is that features keep the same distribution as before while their influence is harmonized. However, this technique lacks robustness: a single outlier in the dataset is enough to distort clustering results.
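As a minimal sketch (illustrative NumPy code, not the project's implementation), min-max scaling of two badly mismatched features looks like:

```python
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Rescale each feature (column) into the [0, 1] interval."""
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn)

# One feature ranges over [0, 100], the other over [1e6, 1e7]:
# unusable as-is for distance computations.
X = np.array([[0.0, 1_000_000.0],
              [50.0, 5_500_000.0],
              [100.0, 10_000_000.0]])
X_scaled = min_max_scale(X)
# After scaling, both features span [0, 1] and contribute comparably.
```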

5.1.2 Mean-centred scaling

5.1.2.1 Z-score normalization

Z-score normalization rescales the values of each feature so that they have a mean of 0 and a standard deviation of 1. Thus, for each feature k of each data point s in the dataset, the new value s_k' is given by:

s_k' = (s_k − μ) / σ, where μ is the mean and σ the standard deviation of feature k.

Z-score standardization is one of the most used normalizations, especially in machine learning. However, this method is not robust because its parameters, the mean and the standard deviation, are sensitive to outliers.

Standardization can be used to harmonize the same types of data as min-max normalization, such as GPS coordinates and vehicle speed (IDIADA use case). It is not trivial to know which method will give the most exploitable results for the clustering algorithms; that is why the results obtained with the different normalization methods have to be compared.
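A corresponding sketch of the z-score transformation (again illustrative, not the project's code):

```python
import numpy as np

def z_score(x: np.ndarray) -> np.ndarray:
    """Standardize each feature (column) to mean 0 and standard deviation 1."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# After standardization, the feature has mean 0 and unit standard deviation.
Z = z_score(np.array([[1.0], [2.0], [3.0], [4.0], [5.0]]))
```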

5.1.2.2 Modified tanh normalization [4]

This method applies the same transformation as tanh normalization [5], but without using the influence function and Hampel estimators. Thus, for each feature k of each data point s in the dataset, the new value s_k' is given by:

s_k' = (1/2) [tanh(0.01 (s_k − μ) / σ) + 1], where μ is the mean and σ the standard deviation of feature k.

This method is derived from a robust one; the modified tanh normalization is therefore expected to be robust or, at least, more robust than min-max and z-score. Tanh normalization gives results close to z-score; hence, just like the previous methods, modified tanh normalization can be applied to dataset features such as TV energy consumption and the degree of boredom.
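A sketch of this transformation, reusing the same μ and σ as z-score (illustrative only):

```python
import numpy as np

def modified_tanh(x: np.ndarray) -> np.ndarray:
    """Modified tanh normalization: squashes the z-score into the (0, 1) interval."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    return 0.5 * (np.tanh(0.01 * z) + 1.0)

# Values close to the mean land near 0.5; extreme values saturate towards 0 or 1.
S = modified_tanh(np.array([[1.0], [2.0], [3.0]]))
```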

5.1.3 IQR scaling

IQR (inter-quartile range) scaling consists in rescaling the data based on its most representative part, between the first and third quartiles. It thus limits the influence of the more extreme data samples.

Assuming a distribution A of n data samples, A = {a_1, a_2, …, a_n}, the first and third quartiles are denoted Q1 and Q3. Each data point a_i is rescaled as follows:

a_i' = (a_i − Q1) / (Q3 − Q1)

The IQR-based rescaling method thus relies on the data distribution rather than on the raw data values, and gives a wider range to the most represented values.
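An illustrative sketch, using linear interpolation for the quartiles (an implementation choice not specified in the text):

```python
import numpy as np

def iqr_scale(x: np.ndarray) -> np.ndarray:
    """Rescale each feature by its inter-quartile range: (a - Q1) / (Q3 - Q1)."""
    q1 = np.percentile(x, 25, axis=0)
    q3 = np.percentile(x, 75, axis=0)
    return (x - q1) / (q3 - q1)

# Values between Q1 and Q3 map into [0, 1]; more extreme samples fall outside.
A = iqr_scale(np.array([[0.0], [1.0], [2.0], [3.0], [4.0]]))
```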

5.1.4 Median and median absolute deviation (MAD) scaling

The principle of this method is comparable to z-score, but it works with the median instead of the mean and the median absolute deviation instead of the standard deviation. Thus, for each feature k of each data point s in the dataset, the new value s_k' is given by:

s_k' = (s_k − median) / MAD, where MAD = median(|s_k − median|).

This method is more robust than z-score, which uses the standard deviation, but it has several drawbacks: the input distribution is changed and, moreover, the result is not bounded within a common interval. Median and MAD normalization could be useful if there are outliers in the datasets but, in the literature, this method seems to perform less well than the others described above.
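A sketch illustrating the robustness to a single outlier (illustrative only):

```python
import numpy as np

def median_mad_scale(x: np.ndarray) -> np.ndarray:
    """Robust scaling: (s - median) / MAD, with MAD = median(|s - median|)."""
    med = np.median(x, axis=0)
    mad = np.median(np.abs(x - med), axis=0)
    return (x - med) / mad

# The value 100 is an outlier: the median (3) and MAD (1) barely react to it,
# whereas the mean and standard deviation would be strongly distorted.
M = median_mad_scale(np.array([[1.0], [2.0], [3.0], [4.0], [100.0]]))
```

Note that, as stated above, the result is not bounded: the outlier is rescaled to a large value instead of being squashed into a common interval.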

5.2 Feature extraction from network traces

Network captures or traffic traces are a common type of data that can be provided by many systems, including the different use cases. This can be viewed as a generic type of data. There are different formats, including full packet captures and aggregated data such as NetFlow/IPFIX [15]. They

have been widely used for security analysis, in order to detect anomalies or attacks. While generic in format, they can still include context-specific information, in particular full packet captures, which include the payload of transmitted packets or frames and thus application-specific data.

Even if network traffic is well known, the features it provides are not simple ones: IP addresses or port numbers, for instance, are not easy to interpret and thus to include in automated techniques. While they are, or can be, represented as numerical values, their projection into a metric space is not meaningful. This is particularly true for TCP or UDP ports, since the IP address space, in contrast, is at least organized in subnetworks.

Hence, this section defines a similarity metric to be used when port numbers have to be taken into account in the analysis. This avoids considering port numbers as categorical data with no correlation among them. In a nutshell, the objective is to integrate the rich semantics of the services associated with port numbers into our further analysis.

5.2.1 Motivation

TCP and UDP are the major transport protocols of the Internet. Port numbers allow end-hosts to differentiate flows and forward them to the right sockets, and so to the right services. Encoded on 16 bits, there are 65,536 possible ports for both TCP and UDP, divided into different segments: system or well-known ports (0-1023), ports registered for specific applications or vendors (1024-49151) and dynamic ports (49152-65535). Although the dynamic ports are mainly used as ephemeral ports, for instance as source ports when establishing a connection, the other ports are associated with a specific use or service. Their numbering is managed by the Internet Assigned Numbers Authority (IANA).

Even if this does not prevent a user from using a registered port for any purpose, using the assigned port numbers eases access to the service. In addition, some ports are commonly diverted from their normal usage, such as 443, originally reserved for HTTPS but often used by VPN services to avoid filtering.
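The three IANA segments described above translate directly into a coarse classifier; a minimal sketch (the helper name `port_range` is ours, for illustration only):

```python
def port_range(port: int) -> str:
    """Classify a TCP/UDP port into its IANA-defined segment."""
    if not 0 <= port <= 65535:
        raise ValueError("port numbers are encoded on 16 bits")
    if port <= 1023:
        return "well-known"   # system ports
    if port <= 49151:
        return "registered"   # specific applications or vendors
    return "dynamic"          # mostly ephemeral source ports
```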

Therefore, port numbers are representative of the provided services, but there is no one-to-one mapping between a port number and a service. They are a valuable source of information for managing and operating a network, for instance to perform traffic engineering for QoS purposes or to detect anomalies. In many cases, in particular for security purposes, packets or flows need to be compared to support machine learning or data-mining algorithms. For example, NetFlow records can be analysed to detect anomalies, but not all of the data they carry can be represented in a metric space and thus easily compared. While using longest common prefixes can solve the problem for IP addresses, the problem remains for port numbers. Considering only the three possible ranges is an option, but results in too coarse a granularity.

We thus propose a fine-grained approach, which catches and quantifies two types of similarity between port numbers:

• Service-semantic similarity: this links port numbers supporting services considered to be of the same type. For instance, TCP ports 80 and 443 are semantically close to each other (Web).

However, TCP ports 443 and 22 are also semantically close because they both provide a secure connection.

• Context-semantic similarity: this abstracts the relations between ports that are often present together (on the same machine or, more generally, in a close vicinity, e.g. the same subnetwork). As an example, a medium-scale enterprise network often provides both a web and an email server.

It is worth mentioning that two ports can be similar from both perspectives, such as ports 443 and 80, which are both used for web services and usually co-located on the same server.

5.2.2 Rationale

When performing an attack, the first stage usually consists in fingerprinting the potential target, all the more so with Advanced Persistent Threats. It is a critical step for crafting an attack that is as specific as possible. Discovering accessible machines and services often relies on IP sweeping and/or scanning TCP and UDP ports. A naive approach testing all port numbers and all IP addresses of a targeted subnetwork is time-consuming and has a large footprint, which can be easily detected. Smarter attackers instead look for specific ports in search of particular services with potential vulnerabilities. For example, if they look for web servers, TCP/443, TCP/80 and TCP/8080 will be targeted first, which can reveal a service-semantic similarity. Similarly, attackers may target a particular type of environment with various services that are close from a context-semantic point of view. For example, a web service usually relies on a web server and a database, which are regularly co-located in a close network vicinity, or even on the same host.

As a result, observing the strategy of the port scans performed by attackers is helpful to derive the semantics between port numbers, which can then be reused for other analyses, such as traffic classification or network security monitoring. Our observations on the darknet hosted by Inria confirm the existence of relationships among targeted ports. This motivates and guides the definition of two metrics, each defined to catch both types of similarity (service- and context-semantic) in a single value.

5.2.3 Darknet

A darknet, also known as a “network telescope” or “Internet blackhole”, is an entire reachable subnetwork with no active hosts that collects all incoming traffic, neither sending any packets nor responding to any request. Such traffic can be considered the noise of the Internet, or Internet Background Radiation (IBR) [6]. It has been proved to contain valuable information for understanding major security threats like DDoS attacks and scanning activities [7]. Our goal is therefore to analyse the vast quantity of network traffic targeting a darknet, model the attacker behaviour when the latter performs a port scan, and then summarise this information into a single port similarity or distance.

5.2.4 Methodology overview

We thus use the darknet to observe attacker strategies when scanning successive ports, while discarding naïve strategies (massive horizontal or vertical scans) which do not bring any valuable


information for our metrics. All sequences must be aggregated into a unique representation. In addition, even ports that are supposed to be similar are not targeted in the same order by different attackers. Hence, no global order or sequence should be constructed; instead, a graph represents all of them in a compact format. We thus transform the sequences of ports successively probed by attackers into a graph before extracting the semantics between port numbers.

The Figure below illustrates the whole process to infer a distance metric between port numbers:

• Multiple attacker behaviours, e.g. targeted ports, are collected. In order to avoid a bias, such behaviours must be collected on a massive scale. In our case, we use a darknet or telescope.

• Scan extraction: since the collected data can embed some noise, filtering is necessary and directly depends on the collecting process. The goal is to extract only the attackers’ behaviours. For example, large vertical scans running over all ports do not contain an extractable semantic and should therefore be omitted from graph construction.

• Graph: the graph of scans is created from the filtered data gathered in the previous step. In this graph, nodes represent network ports, and a directed edge between two ports means that they have been probed sequentially at least once.

• Distance: from the network port scan graph, several algorithms have been defined and assessed to extract a semantic distance between pairs of ports based on attacker behaviour. Actually, only one of the two proposed metrics is a distance; the other is a dissimilarity measure.


Figure 14 Methods to extract port similarity measure

5.2.5 Inter-port similarity

Darknet observations are raw data with a lot of noise, in particular scans targeting many ports or even all of them. Such scans are not representative of a particular strategy and so do not help to reveal any semantics about the associated ports.

Therefore, we first process the raw data to limit our analysis to targeted scans. Two steps are applied:

• First, we extract TCP SYN packets. Hence, our method is only applied to TCP. The same could be applied to UDP, but we do not yet have a sufficient amount of data, since TCP scans are more popular than UDP scans.

• Secondly, we only keep scans targeting between 3 and 30 distinct ports, i.e. targeted scans. These bounds have been selected based on a preliminary statistical analysis.

Once relevant scans have been extracted, they are transformed into a graph model. A scan graph is thus derived as 𝐺 = (𝑁, 𝐸, 𝜔):

• N: The set of nodes of the graph. Each of them represents a unique TCP port.

• E: The set of edges of the graph. The existence of an edge 𝑒𝑖𝑗 from port 𝑝𝑖 to port 𝑝𝑗 indicates that port 𝑝𝑗 has been probed (by the same scanner) just after 𝑝𝑖 on the same destination IP address.


• 𝜔: the weight function that returns the weight of an edge. The weight 𝜔(𝑒𝑖𝑗) of 𝑒𝑖𝑗 is the number of times that 𝑝𝑗 has followed 𝑝𝑖 over all scan sequences.
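As a minimal sketch, this weighted graph can be built by counting consecutive port pairs in the extracted scan sequences (the input sequences below are hypothetical; the real ones come from the filtered darknet traffic):

```python
from collections import defaultdict

def build_scan_graph(scan_sequences):
    """Build the weighted directed scan graph G = (N, E, w).

    scan_sequences: one port sequence per (scanner, destination IP) pair,
    e.g. [[80, 443, 3306], [22, 443]] -- hypothetical input.
    """
    weights = defaultdict(int)  # (p_i, p_j) -> times p_j directly followed p_i
    for seq in scan_sequences:
        for p_i, p_j in zip(seq, seq[1:]):
            weights[(p_i, p_j)] += 1
    return weights

g = build_scan_graph([[80, 443, 3306], [80, 443], [22, 443]])
# edge (80, 443) has weight 2: port 443 followed port 80 in two scans
```

The node set N is implicitly the set of ports appearing as edge endpoints.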

The defined graph intuitively contains the desired semantics. If two ports are connected by an edge with a high weight, they have been probed successively many times. By generalization, the graph also captures ports that are near each other through transitivity. For example, if scans go repeatedly from 80 to 443 and from 443 to 3306, the graph contains a transitive link between ports 80 and 3306, which reveals a semantic similarity between these ports, albeit lower than between 80 and 443 (connected by a direct edge).

The intuition is to swap (or invert) the weights of the edges in the graph so as to reduce the shortest-path length between ports which are regularly scanned in the same sequence (i.e. those connected together with heavy weights). Then, the shortest path 𝑠𝑝(𝑛𝑖 , 𝑛𝑗) between a pair of nodes (port numbers) is computed; it is the sequence of edges from source 𝑛𝑖 to destination 𝑛𝑗 with the smallest total inverted weight. The length 𝑙(𝑠𝑝(𝑛𝑖 , 𝑛𝑗)) of this shortest path is then used as a dissimilarity measure between the two ports.

𝑑𝑠𝑝(𝑖, 𝑗) = 𝑙(𝑠) = ∑𝑒∈𝑠 𝜔′(𝑒),  where 𝑠 = 𝑠𝑝(𝑛𝑖, 𝑛𝑗)

Finding shortest paths in a graph is a common problem, and well-defined methods exist, such as Dijkstra's algorithm. The main challenge resides in defining a correct rescaling and swapping method for the edge weights, i.e. deriving 𝜔′. For the sake of brevity, we simplify the 𝜔(𝑒𝑖𝑗) weight notation to 𝜔𝑖𝑗.

In order to take the distribution of the data into account, we use a rescaling technique based on the IQR method presented in section 4.1.3. Indeed, Figure 15 shows an unbalanced distribution with most of the values concentrated between 𝑄1 = 299 and 𝑄3 = 4082. The data therefore needs to be rescaled to avoid bias due to extreme values.


Figure 15 Edge weights distribution

It is worth mentioning that rescaled values originally below 𝑄1 become negative. Prior to swapping the weights, we shift the rescaled data to positive values by subtracting the minimal value: 𝜔𝑖𝑗^𝑖𝑞𝑟 − 𝜆, where 𝜆 = min𝑖𝑗 𝜔𝑖𝑗^𝑖𝑞𝑟.

Finally, the weights can be swapped with respect to the maximum value (which is also shifted):

𝜔′𝑖𝑗 = (max𝑘𝑙(𝜔𝑘𝑙^𝑖𝑞𝑟) − 𝜆) − (𝜔𝑖𝑗^𝑖𝑞𝑟 − 𝜆)

This data-driven scaling and swapping technique avoids using an arbitrary factor when inverting the edge weights.
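The shift-and-swap inversion followed by a Dijkstra shortest-path computation can be sketched as follows; for brevity the IQR rescaling of section 4.1.3 is omitted, so the raw weights stand in for the rescaled 𝜔^𝑖𝑞𝑟 values:

```python
import heapq

def invert_weights(weights):
    """Shift-and-swap inversion: w'_ij = (max - lam) - (w_ij - lam).

    `weights` maps (p_i, p_j) -> w_ij. The IQR rescaling step is omitted
    here (sketch only); heavily used edges end up with small inverted weights.
    """
    lam = min(weights.values())
    top = max(weights.values())
    return {e: (top - lam) - (w - lam) for e, w in weights.items()}

def dissimilarity(weights, src, dst):
    """Shortest-path length d_sp(src, dst) over inverted weights (Dijkstra)."""
    inv = invert_weights(weights)
    adj = {}
    for (i, j), w in inv.items():
        adj.setdefault(i, []).append((j, w))
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if n == dst:
            return d
        if d > dist.get(n, float("inf")):
            continue  # stale queue entry
        for m, w in adj.get(n, []):
            nd = d + w
            if nd < dist.get(m, float("inf")):
                dist[m] = nd
                heapq.heappush(heap, (nd, m))
    return float("inf")  # dst unreachable from src
```

With this inversion, frequently co-scanned ports (heavy edges) become close, matching the intuition described above.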


6 Candidate machine learning algorithms

At this stage of the project, the analysis is limited to datasets without attacks. We thus select algorithms which aim either to infer common patterns in the data (clustering) or to verify the normality of learnt patterns (supervised methods).

6.1 Clustering

6.1.1 Overview

Clustering is a vast domain of data mining whose goal is to group similar data instances of a dataset. It thus helps in identifying common features over grouped instances and, ultimately, in creating signatures. The many existing clustering techniques have been summarized in a survey [8]:

6.1.1.1 Partitioning-based clustering

The goal of partitioning-based clustering is to minimize a criterion, most of the time the distance between data points and cluster representatives. These algorithms iteratively assign data points to clusters, then update the cluster representatives until convergence, i.e. until two consecutive steps lead to the same clusters. They have two mandatory characteristics. First, from the beginning to the end, each cluster must contain at least one data point. Secondly, each data point must belong to exactly one cluster.

K-Means algorithm

K-Means is one of the simplest clustering algorithms. First, it takes K data points of the dataset as cluster centers. There is no rule for this selection: it can be random or made by the user. Then, each object in the dataset is assigned to the cluster whose center is at minimal distance. After that, the new cluster centers are computed as the mean value of the data points they contain. Finally, the previous steps are repeated until the clusters no longer change.

As for most clustering algorithms, the data has to be normalized beforehand. Indeed, without normalization, if the tiredness degree of the test subject and the illuminance of the room (iSprint dataset) are used, the first feature will be ignored because its values are negligible compared to the second one. The major drawbacks of K-Means are that the selection of the initial clusters highly influences the result, and its sensitivity to outliers, which can easily be associated with the wrong cluster.
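A minimal sketch of K-Means with the normalization step discussed above (min-max scaling, so that a small-range feature is not crushed by a large-range one):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-Means sketch with min-max normalization of the features."""
    X = np.asarray(X, dtype=float)
    # min-max normalization: without it, a feature with a small range (e.g. a
    # 0-1 tiredness degree) is dominated by a large-range one (e.g. lux values)
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(span == 0, 1, span)
    rng = np.random.default_rng(seed)
    centers = Xn[rng.choice(len(Xn), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centre
        labels = np.argmin(((Xn[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        new = np.array([Xn[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centers):  # convergence: clusters unchanged
            break
        centers = new
    return labels
```

The initial centers are drawn at random from the data, which illustrates the drawback noted above: a different seed may yield a different result.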

6.1.1.2 Hierarchical-based clustering

The result of a hierarchical clustering algorithm can most of the time be presented as a dendrogram where each leaf is a data point. There are two ways to obtain this result. In the first one, all data points initially belong to a single cluster and, at each step, the existing cluster with the most dissimilarity is cut into two more homogeneous clusters. Algorithms using this method are qualified as divisive. In the second one, the dendrogram is constructed from the other side: at first, every data point is a cluster and, at each step, the most similar clusters are


grouped together until a single cluster contains all the data. That is the agglomerative method.

BIRCH algorithm (Balanced Iterative Reducing and Clustering using Hierarchies)

BIRCH is a hierarchical clustering algorithm designed for large datasets. First, it builds a tree for which the user has to define parameters such as the maximum number of children of a non-leaf node (branching factor) and the maximum diameter or radius of a cluster (threshold). Once the tree is built, it can be rebuilt to limit its size by increasing the threshold if needed. After that, another clustering algorithm such as K-Means is used on the leaf nodes to determine the clusters. Finally, an optional last step applies the K-Means algorithm once again on all the data, using the previously found groups as the initial K clusters.

One of the major advantages of the BIRCH algorithm is that it scans the data only once and can dynamically insert new points into the tree. BIRCH therefore does not need the whole dataset to be applied, which could be interesting in the SecureIoT case. Its drawbacks are quite similar to those of K-Means: it only works with numerical data and may not work properly if the groups to be found are not spherical. Moreover, where the K-Means result depends on the choice of initial clusters, the BIRCH result depends on the order in which the data are added to the tree.

6.1.1.3 Density-based clustering

In the data space, there are areas where data are more concentrated than elsewhere. The idea

behind density-based methods is to choose these areas as clusters.

DBSCAN is certainly the most common density-based clustering algorithm. First, it randomly picks a point of the dataset that has not been handled before. Then, it counts how many data points lie in its neighbourhood. If this number is higher than the minimum parameter, the point forms a cluster; otherwise, it is considered an outlier. Once the algorithm finds a new cluster, it tries to expand it by adding all its neighbours and reachable points. DBSCAN repeats the previous steps until every data point has been checked.

The major advantages of this method are its resistance to noise and its ability to find clusters with non-spherical shapes. These cases can easily be encountered in complex datasets such as those of SecureIoT. Its drawbacks are the difficulty of determining the parameters and its inability to handle varying densities.
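The steps above can be sketched in a minimal DBSCAN, with the two parameters (eps and minpts) made explicit:

```python
import numpy as np

def dbscan(X, eps, minpts):
    """Minimal DBSCAN sketch: returns one label per point, -1 marking noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)

    def neighbours(i):
        # indices within eps of point i (the point itself included)
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours(i)) < minpts:
            continue  # noise for now; may later join a cluster as a border point
        labels[i] = cluster
        seeds = list(neighbours(i))
        while seeds:  # expand the cluster through all reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                nb = neighbours(j)
                if len(nb) >= minpts:  # j is a core point: keep expanding
                    seeds.extend(nb)
        cluster += 1
    return labels
```

The drawback noted above is visible here: eps and minpts must be fixed globally, so regions of different densities cannot be separated well.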

6.1.2 Topological Data Analysis

In SecureIoT, we are considering relying on Topological Data Analysis (TDA) techniques for clustering and extracting patterns from the available monitoring data. This choice is motivated by the nature of the data we will handle in the project. First, the data is very heterogeneous and thus exhibits a large dimensionality, which TDA can handle by design. Secondly, the collected data can be noisy as it might contain different levels of interaction. This is recurrent in IoT environments, which can be deployed on top of existing infrastructures, e.g. domestic networks or the Internet.


Using this technique is mainly motivated by two reasons. First, its unsupervised pattern detection is appropriate for the analysis and classification of these data since we have no a priori knowledge about them. Second, its visualization capabilities make the output understandable and interpretable by human experts.

The goal is to infer the main perceived and persistent activities in the data. TDA is actually a set of techniques; in our case, we rely on two of them: the Mapper algorithm and persistent homology.

6.1.2.1 The Mapper algorithm

The Mapper algorithm [9], coupled with another clustering algorithm such as DBSCAN, is very efficient for clustering large sets of data by decomposing the space into subspaces, and it can also be partially executed in parallel if needed. The processing steps of the technique are described in Figure 16, assuming a dataset of IP traffic as an illustrative example.

Figure 16 Processing steps of the Mapper algorithm to extract patterns from IP network traffic.

As a pre-processing step, we extract the required features from each log record, for example, for each network packet: the timestamp, the source and destination IP addresses and ports, and the protocol (TCP, UDP or ICMP). Each log record is then mapped to a vector in Rn by converting the selected n features to a numerical format. In the illustrative example, IP addresses are represented by their respective integer values between 0 and 2³², the integer values of the ports are used, and the protocol names are encoded as numerical values. First, a filter function is chosen and applied to associate a value with each data point. This filter function may be the minimum, the maximum, the average, a result of the PCA algorithm, a single component or the entire feature vector itself, etc. The dataset is then divided into smaller subsets by dividing the filter's range into a set of smaller overlapping intervals. Hence, the original n-dimensional hypercube containing all the data is divided into multiple overlapping m-dimensional hypercubes, where m is the dimension of the filtering values. This step relies on two parameters: the first is the resolution, which represents the length of the intervals over the filter function range, and the


second is the overlap parameter, which represents the percentage of overlap between successive intervals. The second step consists of applying the clustering algorithm, DBSCAN, within each individual hypercube. It takes two parameters, ε and minpts, along with the data. It marks as belonging to the same cluster the points having at least minpts neighbours at a distance lower than ε, and propagates the search to those neighbours. A point is considered as noise if it has fewer than minpts neighbours. To compute the similarity between the feature vectors when using the DBSCAN clustering algorithm, we have to define a distance metric, which may be a traditional metric, such as Euclidean, Minkowski, Manhattan, Chebyshev, Hamming or Jaccard, or any function that defines a distance between two vectors. In our illustrative example, we defined a metric as follows: for the timestamp attribute and the destination IP addresses, the difference between their respective values is a suitable metric since the former is a measurable quantity and the latter are within the same subnetwork. For the source IP addresses, we also used the difference between their respective values as a metric.

The final output of the algorithm is a topological graph where each node is a cluster of points within one hypercube. If two clusters (from different hypercubes) have one or more points in common with respect to the specified overlapping intervals, their nodes are linked together.

The steps of the Mapper algorithm are as follows:

Input: feature vectors

Parameters: number of intervals (resolution), overlapping percentage (zoom), minpts

Output: clustering graph

1. Apply the filter function f: Rn -> Rm
2. Put the data into overlapping intervals: f-1(a1,…,am)
3. Cluster each bin using DBSCAN and a distance function
4. Create a graph: each vertex is a cluster and edges are non-empty intersections between clusters
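These steps can be sketched in a minimal one-dimensional Mapper, where a naive single-linkage grouping (threshold eps) stands in for DBSCAN inside each bin; the parameter names and toy input are illustrative only:

```python
import numpy as np
from itertools import combinations

def _components(X, idx, eps):
    """Connected components of the eps-neighbourhood graph on X[idx]."""
    remaining = set(idx.tolist())
    while remaining:
        comp, stack = set(), [remaining.pop()]
        while stack:
            i = stack.pop()
            comp.add(i)
            near = {j for j in remaining if np.linalg.norm(X[i] - X[j]) <= eps}
            remaining -= near
            stack.extend(near)
        yield comp

def mapper_1d(X, f, n_intervals=4, overlap=0.5, eps=1.0):
    """Minimal 1-D Mapper sketch: filter, overlapping bins, per-bin clustering,
    then a graph whose vertices are clusters and whose edges are overlaps."""
    X = np.asarray(X, dtype=float)
    fx = np.array([f(x) for x in X])          # step 1: apply the filter
    lo, hi = fx.min(), fx.max()
    length = (hi - lo) / (n_intervals * (1 - overlap) + overlap)
    nodes, edges = [], []
    for k in range(n_intervals):              # step 2: overlapping intervals
        a = lo + k * length * (1 - overlap)
        idx = np.flatnonzero((fx >= a) & (fx <= a + length))
        for comp in _components(X, idx, eps):  # step 3: cluster each bin
            nodes.append(comp)
    for i, j in combinations(range(len(nodes)), 2):
        if nodes[i] & nodes[j]:               # step 4: link overlapping clusters
            edges.append((i, j))
    return nodes, edges
```

A full implementation would use DBSCAN and an arbitrary distance function inside each bin, as described above.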

6.1.2.2 Persistent homology

Persistent homology [10] is another TDA technique able to quantify the topological features of a point cloud defined as a finite metric space. The methodology assigns to this space and to a non-negative integer k a multi-scale visual representation of a variety of topologies by varying a filtration parameter, such as a radius or a level-set function. The integer k specifies the dimension of a topological feature, where zero-dimensional denotes a cluster, one-dimensional denotes a loop, etc. These features are represented as intervals: the left-hand endpoint denotes the birth of the feature at a value of the filtration parameter and the right-hand endpoint denotes the value at which the feature dies. Often, barcodes are used to visualize these intervals, where horizontal line segments or bars are the homologies generated over the filtration scales. Using these barcodes, we are able to extract and identify long-lived topological features which persist over a certain parameter range (the persistence parameter). Persistence diagrams are another, equivalent representation of the topological features.


To represent such persistence barcodes, we first need to build a topological structure using simplicial complexes, i.e. collections of simplices closed under the face relation. A simplex of dimension zero is a point, a simplex of dimension one is a line, a two-dimensional simplex is a triangle, a three-dimensional simplex is a tetrahedron, and so on. Figure 17 depicts an example of geometric simplices.

Figure 17 From left to right, simplices of dimension zero, one, two and three.

The standard way of building such a simplicial complex is to compute a Rips graph (or neighbourhood graph) over the dataset. The computed simplicial complex is the Vietoris-Rips complex, which exists whenever we have the distances between pairs of points. The Rips graph is constructed as follows: given a threshold 𝜀 > 0, the edge set of the graph is built such that each pair of connected vertices is at distance at most 𝜀 from each other. Persistent homology groups are then calculated for a range of values of 𝜀. To each topological feature we assign an interval that represents the difference between the value of 𝜀 at which it appears and the value destroying it. The set of intervals of the different features is mapped to horizontal lines to build the persistence barcodes. As an illustrative example, let us consider a point cloud representing a circle of 100 points. Figure 18 depicts the barcode and persistence diagram of the data, obtained using a Rips-based filtration.
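As a small illustration of zero-dimensional features, the sketch below builds the Rips graph for a given ε and counts connected components; as ε grows, components merge, which corresponds to the death of bars in the dimension-zero barcode (the point set is illustrative, and a real pipeline would use a dedicated persistent-homology library):

```python
import numpy as np

def rips_edges(points, eps):
    """Edges of the Rips (neighbourhood) graph: pairs at distance <= eps."""
    P = np.asarray(points, dtype=float)
    n = len(P)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(P[i] - P[j]) <= eps]

def n_components(n, edges):
    """Number of connected components (0-dimensional features) via union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```

For four corners of a unit square, ε = 0.5 yields four isolated components, while ε = 1.0 connects the sides into a single component: two component bars die between these two filtration values.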


Figure 18 Example of persistence barcode and diagram of a points cloud representing a circle.

In the barcode sub-figure, the black lines represent connected components and the red line represents a loop, i.e. the circle shape. The length of this line represents the lifetime of the topological feature, its left and right endpoints being respectively its creation and destruction values. The same topological feature, the loop, is represented as a red point in the persistence diagram.

The persistent homology technique and its associated diagram and barcode visualization capabilities are useful for building topological signatures of high-dimensional datasets. Recently², these topological features have been used as a persistence-based clustering technique that emphasizes the structures present in datasets.

6.1.3 Implementation of the TDA (Mapper)

6.1.3.1 Data model and format for TDA

The SecureIoT project – and more generally the IoT domain – deals with heterogeneous data. To be able to extract knowledge from it using TDA, we were looking for a way to handle it as homogeneously as possible. So far we have received various data such as PCAP, SVG or TXT files. Moreover, for the ROS robot, we have three other files (CSV, XML and JSON) describing the same data. These three formats are modular; among them, we temporarily chose to work on the JSON one as a first step towards homogenization. Indeed, this choice was made because this format is also used to present the data of the AAL environment.

² F. Chazal, L. Guibas, S. Oudot, and P. Skraba. Persistence-based clustering in Riemannian manifolds. In Proc. 27th Annual ACM Symposium on Computational Geometry, pages 97-106, 2011.


A structure will be associated with each data file. This structure can be described in a JSON file or deduced, as is the case for a PCAP file for instance (these two cases are described in part 3.1.3.2.1). The main goal of such a structure is to know which filter functions or distance functions can be applied to the different elements described by the file. Therefore, each data element is associated with a type, which is itself associated with an array of filter functions and an array of distance functions. Table 4 and Table 5 describe these associations.

Type        Filter functions
            identity
boolean     X
float       X
int         X
IP          X
proto       X
str         X
timestamp   X

Table 4 Associations between data types and filter functions

Type        Distance functions
            Equality  Euclidean  Max  Min  AND  OR  XOR  NAND  NOR  XNOR
boolean     X                              X    X   X    X     X    X
float       X         X          X    X
int         X         X          X    X
IP                    X
proto       X
str         X
timestamp             X          X    X

Table 5 Associations between data types and distance functions.

6.1.3.2 TDA web interface

In the first iteration, the Topological Data Analysis (TDA) is implemented in Python as a standalone web application; it thus includes two parts: the data file management and the TDA application. Only the Mapper algorithm is fully implemented and in the scope of this section.

As explained in section 3.1.2, the TDA process is divided into five actions:

• Extract the required features (Pre-processing)

• Apply the filter function (Step 1)

• Put the data into overlapping intervals (Step 2)

• Cluster each bin using DBSCAN and a distance function (Step 3)

• Create a graph (Step 4)

We will describe the TDA web implementation by following the same steps.

(Pre-processing) Extract the required features

The web interface allows users to submit data files identified by a name and optionally described by a structure file. As explained previously, the definition of a structure is required for a data file in order to associate each data element with a set of filter functions and distance functions. For now, two file formats are handled: JSON and PCAP. While a structure file is required when a user loads a JSON data file, it is not when the uploaded file is a PCAP: the backend already contains an inherent structure for PCAP files.

A structure file contains a JSON dictionary with two main elements:

• the value associated with the key « data_location » is an array containing the list of keys to follow to reach the array of data rows in the JSON file.

• the value associated with the key « data_structure » is a dictionary presenting the same structure as a data row, but with data types instead of values.
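A hypothetical example of such a structure file, and of how the « data_location » keys are followed to reach the data rows, is sketched below (all keys, types and nesting are illustrative only, not the project's actual schema):

```python
import json

# Hypothetical structure file: rows of the data file live under
# ["capture", "records"], and each row has the listed typed fields.
structure = json.loads("""
{
  "data_location": ["capture", "records"],
  "data_structure": {"timestamp": "timestamp",
                     "src": "IP",
                     "dst": "IP",
                     "dstport": "int"}
}
""")

def locate_rows(document, structure):
    """Follow the data_location keys down to the array of data rows."""
    node = document
    for key in structure["data_location"]:
        node = node[key]
    return node

# Matching (hypothetical) data file content:
doc = {"capture": {"records": [
    {"timestamp": 1538124000, "src": "10.0.0.1",
     "dst": "10.0.0.2", "dstport": 443}]}}
rows = locate_rows(doc, structure)
```

The types in « data_structure » would then be looked up in Table 4 and Table 5 to select the applicable filter and distance functions.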

For now, the inherent PCAP structure contains the elements « src » of type « ip » (source IP), « dst » of type « ip » (destination IP), « srcport » of type « int » (source port), « dstport » of type « int » (destination port) and « proto » of type « proto » (protocol used). If one of these values is missing in a data row, its value is left empty.

The file management and reading are handled by the files module. It provides functions such as:

• get_file(file_id): to get the metadata of a file identified by its file_id.

Page 57: DELIVERABLE - SecureIoT Project · Version: v1.00 - Final, Date 28/09/2018 Page | 2 Project Title: SecureIoT Contract No. 779899 Project Coordinator: INTRASOFT International S.A

D4.1 Security Monitoring and Knowledge Inference,

Version: v1.00 - Final, Date 28/09/2018

Page | 57

Project Title: SecureIoT Contract No. 779899 Project Coordinator: INTRASOFT International S.A.

• get_structure_elements(file_id): to get the labels and types of the file’s data.

• get_data_raw_vectors(file_id): to get an array of vectors presenting the data rows of the

file identified by file_id.

• get_file_content(file_id): to get the data rows of a data file identified by its file_id.

• get_lines(lines, file_id): to get a set of data rows.

The algorithm thereby extracts the required features by calling the get_data_raw_vectors function. Its result is then used as the “file_content” argument of the function:

• compute_filter_file(file_content, filters, output_file) of the filters module.

This function browses the data rows: for each row, it takes the data elements to be used with the selected filter functions and stores them in an output file. The same procedure is applied to the data elements used by the distance functions, with the function:

• compute_distance_file(file_content, distances, output_file) of the distances module.

Hence, two files are created. These files can be reused for new TDA computations if the requested data elements and the filter and distance functions are the same, so this step is optional when the corresponding files already exist.

(Step 1) Apply filter function

The first step is the application of the filter. For each data row extracted from the filter file obtained during the previous step, each element is filtered with its associated filter function. Once all the elements have been filtered, the global filter function is applied to the resulting vector. This global filter function allows the definition of a filter which can combine several features of a data row. It is important to note that the output of the global filter function can be a vector of several elements (which is not the case for the distance function, whose result must belong to R⁺).

(Step 2) Put data into overlapping intervals

Then, the data rows are gathered into hypercubes. These hypercubes are defined as follows:

for each combination of data_row_length elements which values are in [0,nb_intervals]:

bound_inf = combination * (interval_length – overlap_length)

bound_sup = bound_inf + interval_length

hypercube = all data rows verifying the inequalities:

data_row >= bound_inf

data_row <= bound_sup


where:

• data_row_length = the number of dimensions of each data vector obtained from the application of the global filter function.

• nb_intervals = the number of intervals in the result.

• interval_length = a vector giving, for each data dimension, the length of an interval. It can be obtained by the computation described below.

• overlap_length = a vector giving, for each data dimension, the product of the interval length and the overlap percentage.

To compute interval_length from the number of intervals (nb_intervals), the process is as follows: for each dimension of the filtered vector, the minimal and maximal values of the corresponding elements in the rows are taken, then this range is split into the requested number of intervals. The intervals overlap each other at a specified rate (the « overlap » parameter), which enters the interval length computation as follows:

interval_length = |max − min| / ((nb_intervals − 1) × (1 − overlap/100) + 1)

In other words, a hypercube is the union of all the data rows whose filtered elements are each included in the range set for their dimension. In this way, if the filtered vectors are three-dimensional and the number of ranges requested by the user is 2, this step results in the creation of 2³ = 8 hypercubes.
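The interval computation and the hypercube assignment above can be sketched as follows. Offsetting the bounds by the per-dimension minimum is an assumption made for data that does not start at 0:

```python
from itertools import product

import numpy as np

def interval_length(data, nb_intervals, overlap):
    """Per-dimension interval length, following the formula above.

    data: (n_rows, n_dims) array of filtered values.
    overlap: overlap percentage in [0, 100).
    """
    span = np.abs(data.max(axis=0) - data.min(axis=0))
    return span / ((nb_intervals - 1) * (1 - overlap / 100.0) + 1)

def hypercubes(data, nb_intervals, overlap):
    """Assign row indices to overlapping hypercubes (Step 2)."""
    length = interval_length(data, nb_intervals, overlap)
    step = length * (1 - overlap / 100.0)  # interval_length - overlap_length
    origin = data.min(axis=0)
    cubes = {}
    # one combination per cell of the nb_intervals^n_dims grid
    for combo in product(range(nb_intervals), repeat=data.shape[1]):
        bound_inf = origin + np.array(combo) * step
        bound_sup = bound_inf + length
        inside = np.all((data >= bound_inf) & (data <= bound_sup), axis=1)
        rows = np.flatnonzero(inside)
        if rows.size:
            cubes[combo] = rows.tolist()
    return cubes
```

For one-dimensional data in [0, 3] with nb_intervals = 2 and no overlap, this produces the two cells [0, 1.5] and [1.5, 3].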

(Step 3) Cluster each bin using DBSCAN and a distance function

Finally, for each hypercube, a DBSCAN clustering is applied. The distance function used in the

DBSCAN clustering is always the same:

• compute_distance (global_distance_name, x, y) of the module distances_config.

The distance functions and data which will be used are previously set by calling the functions:

• set_distances(distances) of the module distances_config.

• set_distance_data(data) of the module distances_config.

Thereby, the function compute_distance receives as x and y parameters the line numbers of the data rows to compare. The function then applies the requested distance functions through the distances module.
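A minimal sketch of this index-based distance mechanism, using scikit-learn's DBSCAN with a callable metric. The distances_config stand-ins and the Euclidean distance are illustrative assumptions, not the project's actual modules:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-ins for the distances_config module described above.
_distance_data = None

def set_distance_data(data):
    global _distance_data
    _distance_data = np.asarray(data)

def compute_distance(x, y):
    """x and y carry the line numbers of the rows to compare."""
    a, b = _distance_data[int(x[0])], _distance_data[int(y[0])]
    return float(np.linalg.norm(a - b))  # Euclidean, as an example

def cluster_hypercube(row_ids, eps, min_samples=1):
    """DBSCAN over one hypercube, comparing rows through compute_distance."""
    ids = np.array(row_ids).reshape(-1, 1)  # pass indices, not values
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric=compute_distance).fit_predict(ids)
    return dict(zip(row_ids, labels.tolist()))
```

Passing row indices instead of values lets the metric look up any previously stored representation of the rows, as the text describes.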

(Step 4) Create a graph

The TDA result (Mapper) is composed of vertices and edges between them. A vertex groups one (if the minpts parameter is « 1 ») or several data rows of the same hypercube which have been clustered together by DBSCAN (all the rows can be linked by a path in which the distance, computed with the distance function, between two consecutive rows stays below the threshold). An edge exists between two vertices if they share at least one common row.
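The vertex and edge construction above can be sketched as follows; the input format (per-hypercube DBSCAN labels) is an assumption for this sketch:

```python
from itertools import combinations

def mapper_graph(clusters):
    """Build the Mapper vertices and edges from per-hypercube DBSCAN results.

    clusters: dict mapping hypercube id -> {row_id: cluster_label}.
    A vertex "<cube>_<label>" groups the rows of one DBSCAN cluster;
    an edge links two vertices sharing at least one row.
    """
    vertices = {}
    for cube, labels in clusters.items():
        for row, lab in labels.items():
            vertices.setdefault(f"{cube}_{lab}", set()).add(row)
    edges = [(u, v) for u, v in combinations(sorted(vertices), 2)
             if vertices[u] & vertices[v]]
    return vertices, edges
```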

The TDA result file is a JSON file presenting three main elements:

• « intervals »: describes the bounds of each interval in each dimension. Concretely, it is an array of intervals, each being an array of two elements: the interval's bounds, which are vectors of one or more elements (of the same size within one result, depending on the number of elements selected by the user for the filter functions).

• « vertices »: describes the vertices. It is an array of dictionaries with three main elements:

◦ « name »: a string made of the hypercube number, an underscore, and the cluster number of the data rows in this hypercube (i.e. the format « hypercubeNumber_clusterNumber »).

◦ « class »: a string with the format « class-node-id », where id is the hypercube number. It is used for the display of the graph.

◦ « lines »: an array of ints, each being the identifier of a data row, so that « lines » lists all the data rows of the vertex.

• « edges »: describes the edges. Each edge has two main elements, « source » and « target », each referring to a vertex name.
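A hypothetical instance of this layout, with made-up sample values, can be built and serialized as follows:

```python
import json

# Illustrative construction of the result-file layout described above;
# the field names follow the text, the values are made-up sample data.
result = {
    "intervals": [[[0.0], [1.5]], [[1.5], [3.0]]],  # [lower, upper] bounds
    "vertices": [
        {"name": "0_0", "class": "class-node-0", "lines": [1, 2]},
        {"name": "1_0", "class": "class-node-1", "lines": [2, 3]},
    ],
    "edges": [{"source": "0_0", "target": "1_0"}],
}
print(json.dumps(result, indent=2))
```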

6.1.4 Application to SecureIoT Data

Clustering algorithms and normalization methods have been applied on a subpart of the dataset. Three features have been selected. For example, in the iSPRINT dataset, these features are the time since the beginning of the day, the living-room illuminance and the degree of tiredness. The latter actually serves as the label to be assessed (and is therefore not used for the clustering itself). First, clusters have been found by applying the DBSCAN algorithm and normalization methods on the first two features. Then, for each cluster found, the variation coefficient of the third feature is computed:

cv = σ / μ, where μ is the feature mean and σ its standard deviation
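The per-cluster coefficient can be computed as follows; skipping DBSCAN's noise label (-1) is an assumption of this sketch:

```python
import numpy as np

def variation_coefficients(labels, third_feature):
    """Coefficient of variation cv = sigma / mu of the label feature,
    computed per cluster (noise points, labelled -1, are skipped)."""
    labels = np.asarray(labels)
    values = np.asarray(third_feature, dtype=float)
    cvs = {}
    for c in np.unique(labels):
        if c == -1:
            continue
        v = values[labels == c]
        cvs[int(c)] = float(v.std() / v.mean())
    return cvs
```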

This coefficient reflects the correlation between the first two features and the third one: the smaller the value, the stronger the correlation. However, artificially increasing the number of clusters would lead to smaller coefficients. Hence, the objective is to reach low coefficients together with a low number of clusters. First, the modified tanh normalization has been applied. The heat map in Figure 19 gives the coefficient of variation (cv) for each cluster found. DBSCAN has been run several times with different values of the maximal distance for two data points to be


considered as neighbours (epsilon). After normalization, these data have a standard deviation of 1/100, so the values are closer to each other than with another normalization.

Figure 20 shows the result of the DBSCAN algorithm with a specific parameter ε on the chosen iSPRINT data.

Then, min-max normalization has been used on the same data and the results can be found in Figure 21 and Figure 22.

Finally, the results for z-score normalization are presented in Figure 23 and Figure 24:

Figure 19 Variation coefficients using DBSCAN and the modified tanh normalization (iSPRINT dataset)

Figure 20 Clusters found using DBSCAN (ε = 0.0025) and the modified tanh normalization

(iSPRINT dataset)

Figure 21 Variation coefficients using DBSCAN and the min-max normalization (iSPRINT dataset)

Figure 22 Clusters found using DBSCAN (ε = 0.25) and the min-max normalization (iSPRINT

dataset)


For this example, the results of DBSCAN using z-score or modified tanh are quite similar. The clusters found are exactly the same when the following formula is used for the modified tanh normalization:

sk′ = tanh(0.01 × (sk − μ) / σ)

The modified tanh normalization seems to produce interesting clusters with the DBSCAN algorithm; it has to be tested with other algorithms. The results shown in Figure 25 and Figure 26 are based on the K-Means algorithm with the modified tanh normalization applied to the first two features. The results in Figure 27 and Figure 28 highlight the outcomes obtained with the TDA algorithm. For K-Means, the heat map gives the coefficient of variation for each cluster found; a few values for the number of clusters, K, are tested.
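The three normalizations compared in this section can be sketched as follows, following the formulas given above:

```python
import numpy as np

def modified_tanh(s):
    """Modified tanh normalization used above: s' = tanh(0.01 * (s - mu) / sigma)."""
    s = np.asarray(s, dtype=float)
    return np.tanh(0.01 * (s - s.mean()) / s.std())

def min_max(s):
    """Rescale to [0, 1]."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def z_score(s):
    """Center on the mean, scale by the standard deviation."""
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / s.std()
```

Because of the 0.01 factor inside the tanh, the modified tanh output is roughly the z-score divided by 100 for small values, which explains the very small epsilon values used with it.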

Figure 23 Variation coefficients using DBSCAN and the z-score normalization (iSPRINT dataset)

Figure 24 Clusters found using DBSCAN (ε = 0.35) and the z-score

normalization (iSPRINT dataset)


For this dataset and the chosen features, K-Means is less efficient than the two other algorithms, with higher coefficients of variation. It may also be biased by the small number of clusters. TDA performs better than K-Means, but its variation coefficients are higher than DBSCAN's; however, its number of clusters is lower. In a practical case, this limits the scattering of data samples with the same or similar labels: by increasing the number of clusters, as for example with DBSCAN, a similar level of tiredness could be scattered among different clusters. DBSCAN and TDA Mapper are actually very similar since TDA relies on DBSCAN. The advantage of TDA is its ability to cluster data per hypercube before merging intermediate clusters; it is thus better suited to larger datasets.

Figure 25 Variation coefficients using K-means and the modified tanh normalization (iSPRINT dataset)

Figure 26 Clusters found using K-Means (K = 4) and the modified tanh normalization

(iSPRINT dataset)

Figure 27 Variation coefficients using TDA and the modified tanh normalization (iSPRINT dataset)

Figure 28 Clusters found using TDA (ε = 0.0035) and the modified tanh normalization (iSPRINT dataset)


We then apply DBSCAN with the modified tanh normalization to the two other datasets, which are relatively small. For the one from LuxAI, the two features used in the clustering algorithm are the packet size and the time between the appearance of an event in the ROS file and in the PCAP. The feature playing the role of the label indicates whether the event is a human emotion recognition or a robot action. The results can be found in Figure 29 and Figure 30. In that case, the clusters are very specific to the label, but 19 clusters have been constructed while only two labels exist, so identically labelled data can end up in different clusters.

For the IDIADA dataset, the first two features used in the clustering algorithm are the vehicle speed and the steering wheel angle. The value considered as the label is the engine RPM. The heat map of the variation coefficients is shown in Figure 31. Clusters are reported in Figure 32; since there are many of them, only those with more than 10 points are considered.

Figure 29 Variation coefficients using DBSCAN and the modified tanh normalization (LuxAI dataset)

Figure 30 Clusters found using DBSCAN (ε = 0.0055) and the modified tanh

normalization (LuxAI dataset)

Figure 31 Variation coefficients using DBSCAN and the modified tanh normalization (IDIADA Dataset)

Figure 32 Clusters found using DBSCAN (ε = 0.0005) and the modified tanh

normalization (IDIADA Dataset)


6.2 Process mining

6.2.1 Overview

Process mining techniques are extensively used to learn dependencies and causality between

traces of events observed on a system. They have been applied in multiple application domains,

including business process monitoring, reverse engineering and software process modelling.

The goal of these techniques is to discover a model of a system, verify its conformance, or extend it. A set of events generated from recorded logs that depict the process executions of a system is analysed using mining algorithms to build a discrete model of the process. Multiple mining algorithms exist and have been developed to build these dependencies and infer models represented by different formalisms: Petri nets, transition systems or DFA (Deterministic Finite Automata).

6.2.2 Process mining algorithms

We identified three candidate process mining algorithms to be evaluated on SecureIoT data. The

three algorithms are able to take as input an event log and generate models represented as Petri

nets.

6.2.2.1 The α-algorithm

This algorithm [1] tracks particular precedence patterns in the event log. For example, if an event a is followed by an event b, but b is never followed by a, then a causal dependency is detected between a and b. This dependency is reflected by a place connecting a to b. The algorithm relies on four ordering relations aiming to capture precedence patterns in the log:

• a > b if and only if there is an event trace σ = (e1, e2, ..., en) and i ∈ {1, ..., n−1} such that ei = a and ei+1 = b;

• a → b if and only if a > b and b ≯ a;

• a ⋕ b if and only if a ≯ b and b ≯ a;

• a ∥ b if and only if a > b and b > a.

a > b refers to a "directly follows" relation and a → b refers to a "causality" relation between the pair of events a and b. a ∥ b is used when each event can follow the other (sometimes a follows b and sometimes the opposite happens). a ⋕ b states that there is no immediate precedence relation between the pair of events. These ordering relations are then used to discover the patterns that form the process model. The typical process behaviours and their respective event log patterns are depicted in Figure 33. If a → b, then a sequence is created in the model. If a → b and a → c and b ⋕ c, then a choice between b and c is created after a, represented by the XOR-split pattern. If a → c and b → c and a ⋕ b, then c is preceded by either a or b (XOR-join pattern). If a → b, a → c and b ∥ c, then b and c can be executed in parallel just after a (AND-split pattern). If a → c, b → c and a ∥ b, then c needs to synchronize a and b (AND-join pattern).
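The four ordering relations above can be derived from an event log as follows. This is a sketch; the marker for the inverse causality direction is our own notation:

```python
from itertools import product

def ordering_relations(log):
    """Derive the alpha-algorithm ordering relations from an event log
    (a list of traces, each a sequence of event names)."""
    events = {e for trace in log for e in trace}
    # "directly follows": b appears immediately after a in some trace
    follows = {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}
    rel = {}
    for a, b in product(events, repeat=2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and not ba:
            rel[(a, b)] = "->"   # causality
        elif ab and ba:
            rel[(a, b)] = "||"   # parallel
        elif not ab and not ba:
            rel[(a, b)] = "#"    # no immediate precedence
        else:
            rel[(a, b)] = "<-"   # inverse causality (our notation)
    return rel
```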


Figure 33 Different process behaviours and their respective event logs ordering relations. Extracted from [11].

6.2.2.2 The inductive miner algorithm

The inductive miner algorithm [12] builds models able to replay the log of events used in the mining phase, reaching the final state from the initial state without errors. The first step of the algorithm is to build the directly-follows graph of the event log. This graph is composed of arcs a → b where a and b are events from the log and b directly follows a in the log. From this graph, a process tree of the log can be deduced. A process tree is an abstract and compact representation of a graph, for example a Petri net. It is a tree whose leaves are events and whose nodes are operators describing interactions. There are four operators: × for an exclusive choice, → for a sequence, ↺ for a loop and ∧ for parallelism. During these phases, a special type of transition can appear: so-called τ transitions, specific to this algorithm, which can be traversed in the graph without any event in the log. The algorithm also has a noise parameter defining several methods to filter out infrequent events. After these steps, the Petri net corresponding to the process tree can be created.
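The first step, the directly-follows graph, can be sketched as a simple arc counter over the traces:

```python
from collections import Counter

def directly_follows_graph(log):
    """First step of the inductive miner: count a -> b arcs where b
    directly follows a in some trace of the event log."""
    dfg = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg
```

The arc counts are also what the noise parameter would act on, dropping infrequent arcs before the process tree is derived.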

6.2.2.3 The transition system miner algorithm

The transition system miner [13] is mainly composed of two steps. First, a transition system is created, constructed around the notion of state. Three data structures can represent a state: a sequence (list of ordered events), a set (list of unordered events without frequency) and a multiset (list of unordered events with frequency). Besides, a state can be constructed from the past events of the current position, its future events, or both, and from all events or only a subset of them. Two simplification approaches exist: removing loops and merging states with the same inputs and/or outputs.
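The past-based state abstractions can be sketched as follows; the parameter names are illustrative:

```python
def trace_states(trace, abstraction="set", horizon=None):
    """States seen while replaying one trace, built from past events.

    abstraction: "sequence" (ordered), "set" (unordered, no frequency)
    or "multiset" (unordered with frequency); horizon optionally limits
    how many past events are kept.
    """
    states = []
    for i in range(len(trace) + 1):
        start = max(0, i - horizon) if horizon else 0
        past = trace[start:i]
        if abstraction == "sequence":
            states.append(tuple(past))
        elif abstraction == "set":
            states.append(frozenset(past))
        else:  # "multiset": pairs (event, frequency)
            states.append(frozenset((e, past.count(e)) for e in set(past)))
    return states
```

With the set abstraction, repeating an already-seen event does not create a new state, which is one way the transition system stays small.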


The second step consists in creating the Petri net, built using the notion of region. A set of states S is a region if, for every event e of the transition system, one of the following conditions is true:

• All arcs s1 —e→ s2 enter S: s1 is not in S and s2 is.

• All arcs s1 —e→ s2 exit S: s1 is in S and s2 is not.

• All arcs s1 —e→ s2 do not cross S: s1 and s2 are either both in S or both outside.

All minimal regions (regions containing no sub-region) are discovered; these regions are then transformed into places of the Petri net, and events into transitions.
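The region condition can be checked as follows; this is a sketch over an explicit list of arcs:

```python
def is_region(S, transitions):
    """Check the region condition: for every event, all of its arcs
    must either enter S, exit S, or not cross S.

    S: set of states.
    transitions: iterable of (s1, event, s2) arcs of the transition system.
    """
    by_event = {}
    for s1, e, s2 in transitions:
        by_event.setdefault(e, []).append((s1 in S, s2 in S))
    for arcs in by_event.values():
        entering = all(not a and b for a, b in arcs)
        exiting = all(a and not b for a, b in arcs)
        non_crossing = all(a == b for a, b in arcs)
        if not (entering or exiting or non_crossing):
            return False
    return True
```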


7 Sinalytics platform

Sinalytics is the outcome of Siemens' Industrial Data Analytics work, developed over recent years starting in 2013 and focused on generating digital service offerings on top of data generated by Siemens assets installed in various business settings. Those digital services have been custom designed and implemented to provide a full-cycle digitalization strategy along with a comprehensive Big Data infrastructure and analytics support.

The designed infrastructures have been internally tested on large-scale use cases, from energy production facilities up to remote diagnostics systems for medical equipment. The key concept is the transformation of Big Data into Smart Data, to which added-value services exposed by analytics are attached: descriptive, predictive and prescriptive. These analytics aspects have been offered both as domain-oriented services and as consulting for services tailored to customers.

Looking at application areas, we can mention successful applications of descriptive analytics for complex infrastructures like train control systems, and of predictive analytics as a relevant contributor to the remote maintenance of gas turbines for energy production. The manufacturing shop floor benefits from predictive maintenance through increased use of production facilities and a better understanding of raw material usage, leading to lower stocks and better planning of acquisitions.

Since Sinalytics solutions have been focused on data centre deployments, most security-related features concern data access and safe transfer from process premises to the data centres hosting the processing. The large majority of services rely on large amounts of historical data and on deep knowledge of OT for the served industries and monitored assets. Sinalytics provided the Identity and Access Management infrastructures used to connect to installed Siemens systems around the world, enabling their remote monitoring and servicing. Cutting-edge algorithms and analytical methods support domain engineers to predict and prevent faults and identify, for example, the best time windows to perform maintenance, with a consistent increase in up-time and associated cost savings. For a complete PLM cycle, Siemens uses the analysed data to further improve its own products and services.

The Sinalytics infrastructure is now in production deployments owned by Siemens for a range of long-term customers, but without future plans for additional development.

Based on the success of Sinalytics and the customers' feedback it received, MindSphere has been developed. This step was needed to move data centre-based deployments towards full-scale cloud processing for data-related services, and to have one continuous model towards the new edge and fog computing. MindSphere provides four basic elements to connect customer infrastructure:


- MindAccess – the capability, offered at both user and developer level, to configure the necessary libraries.

- MindConnect – connects machines, multi-asset factories and IT systems to MindSphere. It contains a wealth of connectors for the various protocols involved in data exchange.

- Mind Apps – a catalogue of applications, both web-based and mobile, facilitating real-time analysis of collected data and the launch of analytic functions towards specialized backends.

- Mind Services – the services associated with the previously listed elements, including customer-based assistance.

MindSphere provides application developers with an open Data Services model where data can be simulated before being connected to real-life systems3.

MindSphere is built as a PaaS, facilitating the development of specialized apps on top of specialized APIs. While previous versions of the APIs were offered using infrastructure from SAP HANA and Cloud Foundry, current plans are to offer seamless access on AWS (Amazon Web Services) and MS Azure. From an architectural point of view (https://developer.mindsphere.io/concepts/concept-architecture.html), MindSphere is structured in three layers of interest, corresponding to the customer offering described earlier:

• MindSphere Application Platform – a managed PaaS able to host apps directly in the MindSphere space.

• MindSphere Services Platform – designed to expose public APIs to be used in customer-specific solutions.

• MindConnect Elements – the “shopfloor” connectivity level providing plug&play capabilities for both hardware and software add-ons in the running platform.

Figure 34 depicts this conceptual split.

3 https://developer.mindsphere.io/concepts/concept-iot/index.html


Figure 34 MindSphere conceptual architecture

In this report, we will detail a limited number of components that are of specific interest for the SecureIoT project, namely Authentication & Authorization, the MindSphere Gateway and the MindSphere IoT Data Services.

7.1 Authentication & Authorization

These aspects address the access to MindSphere data and resources, with a specific focus on API access from applications, API calls from thin-client applications, and the more complex security of special-purpose applications (e.g. calls received from certain terminals or geographical areas, or constrained by data volume or time interval).

To provide these aspects, all exposed endpoints are secured and can only be accessed after authentication, with token-based access in place4.

4 https://developer.mindsphere.io/concepts/concept-authentication.html
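As a generic illustration of such token-based access, the following sketch builds the HTTP authorization headers involved in an OAuth2-style exchange. The client id, secret and token values are hypothetical, and actual MindSphere endpoints and scopes are not shown here:

```python
import base64

def bearer_header(token):
    """Every call to a secured endpoint carries the access token in the
    Authorization header (OAuth2-style bearer token)."""
    return {"Authorization": f"Bearer {token}"}

def basic_auth_header(client_id, client_secret):
    """Client credentials sent when requesting a token from the
    authorization server (hypothetical id/secret values)."""
    raw = f"{client_id}:{client_secret}".encode()
    return {"Authorization": "Basic " + base64.b64encode(raw).decode()}
```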


7.2 MindSphere IoT Data Services

Figure 35 below depicts the conceptual structure of the Data Services provided by MindSphere.

Figure 35 MindSphere IoT Data Services concept

Services implemented in this area are oriented towards time series collected from field devices, performing on-the-go annotation and correlation, and preparing data to be consumed by analytics functions. These functions are accessed via REST APIs.

For the current development horizon, there are three APIs addressing functional areas: the IoT Time Series Service – used to extract and operate over dynamic time series data; the IoT TS Aggregates Service – for parameterized queries over previously extracted time series data; and the IoT File Service – offering user-friendly access to stored log files.

7.3 MindSphere Gateway

This component acts as a gateway to the cloud backend, mostly used by web clients and edge device applications. It can also be made available to domain-specific internal and external legacy applications and services. It is worth mentioning that the Gateway integrates OAuth 2.0 for the exchange of tokens with the outside world.

A key naming convention is in place to describe the access methods to exposed resources.


8 Conclusion

In this deliverable, we introduced different approaches for data modelling and analysis that are integrated within the different stages of the CRISP-DM methodology. Data modelling from monitored elements is essential to synchronize the data in a unique logical repository. Some algorithms for analysing data have been presented and implemented, in particular to explore the initial datasets.

This is the first iteration of this deliverable; another is planned for M24. Until then, we will continue to apply the CRISP-DM method, in particular for the stages that cannot be fulfilled yet, e.g. evaluation and deployment. We will also continue to collaborate with use-case partners to construct new datasets supporting an in-depth analysis, as well as some pre-processing on data, such as data verification. The refinement of algorithms will be pursued, in particular by considering the detection of particular states or events, such as anomalies, that can then serve predictive security. Hence, supervised learning techniques will be the focus of the next months in order to establish signatures of the events to be monitored. The capabilities of the Sinalytics platform will also be investigated and matched against the SecureIoT requirements and data analysis to identify components to be used.


References

[1] “CRISP-DM” [Online], Available: http://crisp-dm.eu/ [Accessed: 27-Jun-2018]

[2] Ken Bever, “The OpenO&M Information Service Bus and Common Interoperability

Registry”, Friday, 16 October 2009, available at:

http://www.mimosa.org/presentations/openom-information-service-bus-and-common-

interoperability-registry

[3] Avin Mathew, Ken Bever, Dennis Brandl, “Web Service Common Interoperability Registry

1.0”, OpenO&M Candidate Standard 19 June 2015, available at:

http://www.openoandm.org/ws-cir/1.0/ws-cir.html

[4] Latha, L., & Thangasamy, S. (2011). Efficient approach to normalization of multimodal

biometric scores. International Journal of Computer Applications, 32(10), 57-64.

[5] F.R. Hampel, P.J. Rousseeuw, E.M. Ronchetti, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions, Wiley, New York, 1986.

[6] R. Pang, V. Yegneswaran, P. Barford, V. Paxson, and L. Peterson, “Characteristics of

internet background radiation,” in SIGCOMM Conference on Internet Measurement

(IMC). ACM, 10 2004, pp. 27 – 40.

[7] C. Fachkha and M. Debbabi, “Darknet as a source of cyber intelligence: Survey, taxonomy,

and characterization,” Communications Surveys Tutorials, vol. 18, no. 2, pp. 1197–1227,

2016.

[8] A. Fahad et al., "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical

Analysis," in IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-

279, Sept. 2014.

[9] Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson, Topological Methods for the

Analysis of High Dimensional Data Sets and 3D Object Recognition, Eurographics

Symposium on Point Based Graphics, European Association for Computer Graphics, 2007,

pp. 91–100

[10] H. Edelsbrunner and J. L. Harer. Computational topology. American Mathematical

Society, Providence, RI, 2010.

[11] W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer Publishing Company, Incorporated, 1st edition, 2011.

[12] S.J.J. Leemans, D. Fahland and W.M.P. van der Aalst, “Discovering block-structured process models from event logs - a constructive approach,” Application and Theory of Petri Nets and Concurrency, International Conference, PETRI NETS 2013, pp. 311-329, June 2013.

[13] W.M.P. van der Aalst, V. Rubin, B.F. van Dongen, E. Kindler and C.W. Günther, “Process mining: A two-step approach using transition systems and regions,” p. 35, 2007.


[14] Pete Chapman, Julian Clinton, Randy Kerber, et al. “CRISP-DM 1.0: Step-by-step

data mining guide”, 1999-2000, available at: https://www.the-modeling-

agency.com/crisp-dm.pdf

[15] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow

Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC

7011, DOI 10.17487/RFC7011, September 2013, <https://www.rfc-

editor.org/info/rfc7011>.


Annex A: Data collection template

1. General information

Ref. No Sequence Number (1)

Title Somfy Gateway Traffic Dataset

Version 1.0

Description (briefly describe what the data represents): This dataset consists of traffic captured between IoT gateways and the Internet

Type of data Performance, usage, alert

Dataset availability

Data already existing OR date to be released: June 2018

Future revisions anticipated (Yes/No): No

2. Environment / Context

Directly observable device types

Sensor, robot, vehicle board, monitor device, edge node, gateway

• Somfy gateway

Directly observable software

IoT application, gateway software, cloud service app…

• IoT service provider cloud service (Somfy)

Indirectly observable device

Sensor, robot, vehicle board, monitor device, edge node, gateway (devices which are not directly monitored, be exhaustive to the extent possible)

• 4 smart plugs

• 2 smart bulbs

• 1 Android phone to access cloud service

Indirectly observable software

Architecture/Topology description and communication protocols

Figure showing where the monitoring probes are (some uncertainty may remain): Smartphone --- Web? ---- Cloud service --- UDP + HTTPS - (monitoring probe) --- Somfy Gateway ---- RTS --- IoT devices

3. Data access


There are three cases:

• Data is already retrieved and stored as data files

• Monitoring data can be retrieved through an interface

• Data is present in sw/hw but no means exists yet to access it remotely; a probe needs to be developed

The first two may coincide

Dataset provided as data file(s)

Yes/No Yes

Remote accessibility

Yes/No No

Protocol SNMP, CMIP, CoAP, NETCONF

Message format

Protocol specific, JSON, XML (* use extra space if needed *)

Pull/Push Pull, push

Provided interface

URI + interface specification (* use extra space if needed *)

If data is not yet accessible, how can they be retrieved?

Describe the architecture and where the probe can be deployed

Detail where a probe could be instantiated in the current architecture of the IoT system (in sensors, edge nodes, gateways…)

Probe development requirements

Programming language, framework

Usable software API on device

Are there usable APIs? (if yes, describe them and add reference to the documentation)

Data description

Data format (NetFlow, pcap, syslog, JSON; when an interface is used, the format of the embedded data also needs to be described): pcap file + text file

Encryption Is the data encrypted? (explain) Yes, communications between the gateway and cloud service are encrypted with HTTPS (some partially clear-text UDP messages are visible)

Data format description

Syntax and semantics of data (very important for non-standard formats, e.g. describe the columns of a csv file, or the structure and semantics of what contains a JSON file) Full pcap file including payload


Text file describing the timeline of actions performed with the Android app (time: action)

For unusual format, tool to read it

Provide link

Dataset generation Was the data monitored in a system with real users?

Yes/No No

If no, how was the data generated?

Actions triggered /performed/simulated, how many of them, methodology

• Test 1: trigger no action

• Test 2: trigger one action on one sensor (with the Android App)

• Test 3: trigger several actions simultaneously (with the Android App)

Attack Does the dataset contain attacks?

Yes/No No

If yes, are the attacks labeled?

Yes/No

If yes, what is the granularity of the labels?

Per packet, per flow, timeline of anomalies

Dataset statistics: duration, size(s) in appropriate units (MB, packets), number of packets broken down per IP address, protocol… (be as exhaustive as possible)

Sample of data Provide a sample of data here or a link to it

Data restrictions

Is the data open publicly? Yes/No No

If no, is there a plan to make data open? Yes/No Yes

If no, will the data be accessible to the consortium, or to specific partner(s)?

Yes/No Yes, whole consortium

If yes, for how long? End of project

Can the data be used for public dissemination (without revealing the full content of the data, e.g. an aggregated view)?

Yes/No No

Who owns the data? Inria


Legal issues Please use the guidelines provided by DWF.5 Flags:

☐ data may be “personal data”

☐ we plan to combine/merge the data with this other data source: _________________

☐ data may be “telecommunication metadata”

☐ data may be “telecommunication content”

☐ data is encrypted

☐ data may contain business secrets

Explanation (only if data is flagged above; refer to the DWF guidelines on what additional information to include): No legal issues, as the data is collected in an environment isolated from the rest of the lab network, with no real users

Other

Existing documentation / references

Provide link

Other comments (provide any information you think useful, e.g. limitations of the collected data, presence of erroneous data): Data was collected in an artificial environment with no real users; a bias may exist in the data variation
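Since the dataset described above is distributed as a pcap file, a minimal sketch of iterating over its packet records with only the Python standard library may be useful. This assumes the classic little-endian pcap format with microsecond timestamps (24-byte global header, 16-byte per-record headers); the synthetic one-packet buffer at the end is invented purely to exercise the parser.

```python
import struct

def iter_pcap_packets(buf):
    """Yield (timestamp, raw_bytes) pairs from a classic little-endian pcap buffer."""
    magic, = struct.unpack_from("<I", buf, 0)
    if magic != 0xA1B2C3D4:
        raise ValueError("not a classic little-endian pcap")
    off = 24  # skip the 24-byte global header
    while off + 16 <= len(buf):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from("<IIII", buf, off)
        off += 16
        yield ts_sec + ts_usec / 1e6, buf[off:off + incl_len]
        off += incl_len

# Synthetic single-packet capture (hypothetical, for illustration only).
global_hdr = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
record = struct.pack("<IIII", 1, 500000, 3, 3) + b"abc"
packets = list(iter_pcap_packets(global_hdr + record))
```

Real captures may instead use the nanosecond-timestamp magic (0xA1B23C4D) or pcapng, in which case a dedicated library is preferable.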

Annex B: Data Schemata

5 https://rid-redmine.intrasoft-intl.com/dmsf/files/14645/view

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           targetNamespace="eu:secureiot:data:dk"
           xmlns:dk="eu:secureiot:data:dk">
  <xs:element name="DK">
    <xs:annotation>
      <xs:documentation>SecureIoT Data Kind</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="description" minOccurs="0">
          <xs:annotation>
            <xs:documentation>Provides an optional description of the Data Kind</xs:documentation>
          </xs:annotation>
        </xs:element>
        <xs:element type="xs:string" name="modelType" minOccurs="0">
          <xs:annotation>
            <xs:documentation>Specifies the model type of the data (i.e. SenML, OM, ...)</xs:documentation>
          </xs:annotation>
        </xs:element>
        <xs:element name="format" type="xs:string" minOccurs="0">
          <xs:annotation>
            <xs:documentation>Specifies the format of the data (i.e. JSON, XML, ...)</xs:documentation>
          </xs:annotation>
        </xs:element>
        <xs:element name="quantityKind" minOccurs="0" type="xs:anyURI">
          <xs:annotation>
            <xs:documentation>A QuantityKind is an abstract classifier that represents the concept of "kind of quantity". A QuantityKind represents the essence of a quantity without any numerical value or unit. (e.g. a sensor -sensor1- measures temperature: sensor1 has quantityKind temperature)</xs:documentation>
          </xs:annotation>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="id" type="xs:anyURI">
        <xs:annotation>
          <xs:documentation>Uniquely identifies the Data Kind as a URI</xs:documentation>
        </xs:annotation>
      </xs:attribute>
      <xs:attribute name="name" type="xs:string">
        <xs:annotation>
          <xs:documentation>A human readable name which uniquely identifies the Data Kind.</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>

Table 6: SecureIoT DataKind XML schema
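As an illustration of the schema in Table 6, the following sketch builds a hypothetical DataKind instance with Python's standard library. The id, name and quantityKind values are invented for the example and are not taken from the project.

```python
import xml.etree.ElementTree as ET

NS = "eu:secureiot:data:dk"
ET.register_namespace("dk", NS)

# Hypothetical DataKind instance: a temperature reading modelled in SenML/JSON.
dk = ET.Element(f"{{{NS}}}DK",
                {"id": "urn:example:dk:temperature", "name": "Temperature"})
for tag, text in [
    ("description", "Ambient temperature reading"),   # optional elements, in
    ("modelType", "SenML"),                           # the order required by
    ("format", "JSON"),                               # the xs:sequence
    ("quantityKind", "urn:example:quantity:temperature"),
]:
    ET.SubElement(dk, f"{{{NS}}}{tag}").text = text

xml_str = ET.tostring(dk, encoding="unicode")
```

The element order matters: the schema uses an xs:sequence, so description, modelType, format and quantityKind must appear in that order when present.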

<?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"


targetNamespace="eu:secureiot:global:dm" elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:dm="eu:secureiot:global:dm" xmlns:dk="eu:secureiot:data:dk"> <xs:import namespace="eu:secureiot:data:dk" schemaLocation="DataKind.xsd" /> <xs:element name="SecIoT-DM"> <xs:annotation> <xs:documentation>SecureIoT Data Model</xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence maxOccurs="1"> <xs:element name="DataDefinitions"> <xs:annotation> <xs:documentation>Data Definitions</xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence> <xs:element maxOccurs="unbounded" minOccurs="0" ref="dk:DK" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element maxOccurs="unbounded" minOccurs="0" ref="dm:Platform" /> <xs:element ref="dm:Probe" maxOccurs="unbounded" minOccurs="0" /> <xs:element ref="dm:LiveDataSet" maxOccurs="unbounded" minOccurs="0" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="Platform"> <xs:annotation> <xs:documentation>An observed IoT platform description</xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence> <xs:element name="namespace" type="xs:anyURI"> <xs:annotation> <xs:documentation>For scope hierarchy.</xs:documentation> </xs:annotation> </xs:element> <xs:element name="description" minOccurs="0" type="xs:string"> <xs:annotation> <xs:documentation>Textual description for the SecureIoT Data Model </xs:documentation> </xs:annotation> </xs:element> <xs:element minOccurs="0" ref="dm:Location"> <xs:annotation> <xs:documentation>The Platform's location</xs:documentation> </xs:annotation> </xs:element> <xs:element maxOccurs="unbounded" minOccurs="0" ref="dm:AdditionalInformation" />


</xs:sequence> <xs:attribute name="id" type="xs:anyURI" /> <xs:attribute name="name" type="xs:string" /> </xs:complexType> </xs:element> <xs:element name="Probe"> <xs:annotation> <xs:documentation>The SecureIoT Probe</xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence> <xs:element name="PlatformReferenceID" type="xs:anyURI"> <xs:annotation> <xs:documentation>The ID of the Platform this probe is deployed to. </xs:documentation> </xs:annotation> </xs:element> <xs:element name="baseURL" type="xs:anyURI" /> <xs:element name="namespace" type="xs:anyURI"> <xs:annotation> <xs:documentation>For scope hierarchy.</xs:documentation> </xs:annotation> </xs:element> <xs:element name="description" minOccurs="0" type="xs:string"> <xs:annotation> <xs:documentation>Textual description for the SecureIoT Probe </xs:documentation> </xs:annotation> </xs:element> <xs:element minOccurs="0" ref="dm:Location"> <xs:annotation> <xs:documentation>The SecureIoT Probe location</xs:documentation> </xs:annotation> </xs:element> <xs:element maxOccurs="unbounded" minOccurs="0" ref="dm:AdditionalInformation" /> </xs:sequence> <xs:attribute name="id" type="xs:anyURI" /> <xs:attribute name="name" type="xs:string" /> </xs:complexType> </xs:element> <xs:element name="LiveDataSet"> <xs:annotation> <xs:documentation>A generic type of Data Source Live measurements (physical or virtual)</xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence> <xs:element name="ProbeReferenceID" type="xs:anyURI"> <xs:annotation> <xs:documentation>The ID of the Probe (physical or virtual) these observations refer to.</xs:documentation> </xs:annotation>


</xs:element> <xs:element default="false" name="mobile" type="xs:boolean"> <xs:annotation> <xs:documentation>Identifies if the SecureIoT Probe is mobile or not. If it is mobile, the location field within the observation entity should be provided as well.</xs:documentation> </xs:annotation> </xs:element> <xs:element minOccurs="0" name="timestamp" type="xs:dateTime" /> <xs:element maxOccurs="unbounded" minOccurs="0" ref="dm:observation" /> </xs:sequence> <xs:attribute name="id" type="xs:anyURI" /> </xs:complexType> </xs:element> <xs:element name="observation"> <xs:annotation> <xs:documentation>The Node Measurement</xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence> <xs:element name="DataKindReferenceID" type="xs:anyURI"> <xs:annotation> <xs:documentation>The ID of the Data Kind this observation refers to.</xs:documentation> </xs:annotation> </xs:element> <xs:element form="qualified" name="timestamp" type="xs:dateTime" minOccurs="0"> <xs:annotation> <xs:documentation>The timestamp indicating the instant in which a measurement was acquired </xs:documentation> </xs:annotation> </xs:element> <xs:element ref="dm:Location" minOccurs="0"> <xs:annotation> <xs:documentation>The location at which the measurement was taken if SecureIoT Probe mobile attribute is true </xs:documentation> </xs:annotation> </xs:element> <xs:element name="value" type="xs:anyType" minOccurs="1"> <xs:annotation> <xs:documentation>Element providing the value of a measurement. </xs:documentation> </xs:annotation> </xs:element> </xs:sequence> <xs:attribute name="id" type="xs:anyURI" /> <xs:attribute name="name" type="xs:string" /> </xs:complexType> </xs:element> <xs:element name="Location">


<xs:annotation> <xs:documentation>Indicating the location at which something took place </xs:documentation> </xs:annotation> <xs:complexType> <xs:choice> <xs:element maxOccurs="1" minOccurs="0" name="geoLocation"> <xs:annotation> <xs:documentation>Specifying a physical location (a pair of coordinates) </xs:documentation> </xs:annotation> <xs:complexType> <xs:sequence> <xs:element name="latitude" type="xs:string" /> <xs:element name="longitude" type="xs:string" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element minOccurs="0" name="virtualLocation" type="xs:anyURI"> <xs:annotation> <xs:documentation>Specifying a virtual location (it could be the ID of a resource or subsystem)</xs:documentation> </xs:annotation> </xs:element> </xs:choice> </xs:complexType> </xs:element> <xs:element name="AdditionalInformation" type="xs:anyType"> <xs:annotation> <xs:documentation>Optional auxiliary field that may contain any additional information </xs:documentation> </xs:annotation> </xs:element> </xs:schema>

Table 7: SecureIoT Data Models XML schema


Annex C: IDIADA data summary

Project Code SecureIoT

Data-Sets Introduction v0.1

IDIADA Automotive Technology UK Ltd. For the SecureIoT project

Cecil TC Building
Cambridge Road
Milton
CB24 6AZ
United Kingdom
T +44 1223 441434
[email protected]
www.idiada.com

For the attention of: SecureIoT Data Partners

Author: David Evans, Connected Vehicle R&D Engineer, Electronics

Issue date: 24Jul2017

This report contains 15 pages including this cover and the annexes.


If Applus+ IDIADA can be identified as the author of the text, its permission is required for the inclusion of this information in other documents (reports, articles, publicity, etc.)

INTRODUCTION AND SCOPE

This document aims to provide context to the data logs provided by IDIADA.

Document version 0.1

1. TERMS AND ABBREVIATIONS

CAN Controller Area Network

CAM Cooperative Awareness Message

ITS Intelligent Transport System

OBU Onboard Unit

RPM Revolutions Per Minute

V2V Vehicle to Vehicle

V2X Vehicle to Everything

2. REFERENCES

[Ref 1] CAN Message definition support

\IDIADA_CAN_Data_Definitions.docx

[Ref 2] ITS V2X CAM Data definitions (https://www.etsi.org/deliver/etsi_ts/102800_102899/10289402/01.02.01_60/ts_10289402v010201p.pdf)

[Ref 3] CANDump file (candump-2018-07-24_113439.log)

[Ref 4] CAN Monitor file (context)

(2018_07_24_123439_simulation_CAN_Monitor.csv)

[Ref 5] CAN Application Level (2018-07-24 11-34-39.681344-CAN.log)

[Ref 6] V2XProbe file (v2x_probe.pcap)


[Ref 7] V2X Application Level (2018-07-24 11-34-39.703073-V2X.log)

[Ref 8] ITS CAM Technical Specification (https://www.etsi.org/deliver/etsi_ts/102600_102699/10263702/01.02.01_60/ts_10263702v010201p.pdf)

[Ref 9] ITS CAM Message Specification (https://www.etsi.org/deliver/etsi_en/302600_302699/30263702/01.03.01_30/en_30263702v010301v.pdf)

3. SETUP/CONFIGURATION

The setup involves a steering wheel and pedals connected to a simulation tool generating

vehicle performance data. The output from the simulation tool is used as input to the IDAPT

OBU to generate the respective CAN data and feed that back into the system as if it was coming

from a real vehicle.

Although the CAN messages are generated by the IDAPT OBU, they are generated on a separate CAN interface and fed into the primary CAN interface, which is the one that is logged.

In the context of this exercise, there is no difference between the integrity of the data being

generated by a traditional CAN bus and the simulation tool.


4. CAN

Three files are provided for the CAN messaging logs: one probe-level log, one probe-context log and one application-level log.

1. Probe Level (see [Ref 3])

2. Probe Context (see [Ref 4])

3. Application Level (see [Ref 5])

It should be noted that the 'Probe Context' log is provided purely for interpreting the CAN message structure; a supporting document providing further detail is also available (see [Ref 1]). This log is not part of the data deliverable itself, it is supporting data to help understand the low-level CAN data.

4.1. Candump

As per the CAN messages and signals described in [Ref 1], this is the CAN log produced from the simulation. It is the 'probe'-level data collected by the IDAPT OBU and the source of the data that the application uses.
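Assuming the candump file follows the usual can-utils log layout of `(timestamp) interface id#hexdata` (this should be checked against [Ref 3]; the sample line below is invented), one line can be parsed as follows:

```python
import re

# Assumed can-utils "candump -l" style line layout: (timestamp) iface id#hexdata
LINE_RE = re.compile(r"\((\d+\.\d+)\)\s+(\S+)\s+([0-9A-Fa-f]+)#([0-9A-Fa-f]*)")

def parse_candump_line(line):
    """Return (timestamp, interface, can_id, payload bytes), or None if unmatched."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts, iface, can_id, payload = m.groups()
    return float(ts), iface, int(can_id, 16), bytes.fromhex(payload)

sample = "(1532431234.567890) can0 1A5#0011223344556677"  # hypothetical line
ts, iface, can_id, data = parse_candump_line(sample)
```

The decoded payload bytes can then be interpreted using the message and signal definitions of [Ref 1].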

4.2. CAN_Monitor

This CSV file contains one CAN message per row. Bytes 0 to 7 are shown in columns 'H' to 'O' respectively; supporting context as detailed in [Ref 1] can be found in columns 'P' onwards.
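A sketch of extracting the eight payload bytes (columns H to O, i.e. zero-based indices 7 to 14) from such a row with Python's csv module; the sample row below is invented, and real rows carry the context columns described in [Ref 1]:

```python
import csv
import io

# Hypothetical row: columns A-G are message context, H-O hold bytes 0-7 in hex.
sample = io.StringIO("a,b,c,d,e,f,g,00,11,22,33,44,55,66,77,context\n")

for row in csv.reader(sample):
    # Columns H..O are indices 7..14 when counting from zero.
    payload = bytes(int(v, 16) for v in row[7:15])
```

The remaining columns (P onwards) can be kept alongside the payload for interpretation.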

4.3. Application Level

This log uses a JSON-like format with the following structure:

Element | Description | Unit | Data Type
1 – Brake pressure | Brake pressure applied by the braking pedal | bar | Float
2 – Element counter | Counter for each logged entry | - | Unsigned integer
3 – RPM | RPM (revolutions per minute) of the vehicle | rpm | Unsigned integer
4 – Speed | Speed of the vehicle in kilometres per hour | km/h | Float
5 – Steering angle | Angle of the steering wheel in degrees, controlled by the steering wheel connected to the simulation tool | degrees | Float
6 – Throttle | Throttle position percentage, controlled by the accelerator pedal connected to the simulation tool | % | Float
7 – Timestamp | Timestamp of the capture | - | -

4.3.1. Example

The following excerpt is taken from [Ref 5], line 2364; Table 1 gives a detailed explanation of the data:

{"bra": "0.0", "element": "2364", "rpm": "6427", "speed": "204.61", "str_ang": "9.4", "throttle": "50.0", "timestamp": "2018-07-24 11:38:36.363892"}

Tag | Element | Unit | Value
"bra" | Brake pressure | bar | 0.0 bar
"element" | Entry number | - | 2364th element
"rpm" | Vehicle RPM | rpm | 6427 rpm
"speed" | Vehicle speed | km/h | 204.61 km/h
"str_ang" | Vehicle steering angle | degrees | +9.4
"throttle" | Vehicle throttle position | % | 50.0 %
"timestamp" | Entry timestamp | - | 2018-07-24 11:38:36…

Table 1: CAN Application Data
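Note that every value in a log entry, including numeric ones, is encoded as a JSON string, so numeric fields must be converted explicitly when processing the log. Using the excerpt above:

```python
import json

# Line 2364 of [Ref 5], reproduced from the example above.
line = ('{"bra": "0.0", "element": "2364", "rpm": "6427", "speed": "204.61", '
        '"str_ang": "9.4", "throttle": "50.0", '
        '"timestamp": "2018-07-24 11:38:36.363892"}')

rec = json.loads(line)
speed_kmh = float(rec["speed"])  # numeric fields arrive as strings
rpm = int(rec["rpm"])
```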


5. V2X

The scope of the V2X communication of the vehicle is limited to a subset of ITS CAM (Cooperative Awareness Message) messaging, specifically the following elements:

1. ITS PDU Header (see [Ref 2] A.114 DF_ItsPduHeader)

a) Protocol version

b) Message ID

c) Station ID (see [Ref 2] A.77 DE_StationID)

2. Latitude (see [Ref 2] A.41 DE_Latitude)

3. Longitude (see [Ref 2] A.44 DE_Longitude)

4. Altitude (see [Ref 2] A.9 DE_AltitudeValue)

5. Heading (see [Ref 2] A.35 DE_HeadingValue)

6. Speed (see [Ref 2] A.74 DE_SpeedValue)

7. Steering Angle (see [Ref 2] A.80 DE_SteeringWheelAngleValue)

8. Vehicle Length (see [Ref 2] A.92 DE_VehicleLengthValue)

9. Vehicle Width (see [Ref 2] A.95 DE_VehicleWidth)

5.1. V2X System

In the IDAPT OBU, the main microcontroller (NVIDIA) handles the communication with the V2X module (a separate microcontroller) connected via an internal network (see Table 2 for IP address context).

Currently, only transmitted V2X messages appear in the logs.

With respect to the V2X probe ([Ref 6]), the following IP addresses correspond to each device:

Device | IP Address
NVIDIA | 192.168.1.1
V2X Module | 192.168.1.2

Table 2: IP Addresses

As the V2X module implements the ETSI ITS standard, consecutive CAM messages are transmitted at most every 100 ms (10 Hz) and at least every 1000 ms (1 Hz). Further detail can be found in section 6.1.3 of [Ref 9].

The transmission rate is dictated by the movement of the vehicle (latitude, longitude, heading, and speed).

The (NVIDIA) application passes a CAM-like data structure to the V2X module every 50 ms (20 Hz). Every time the module transmits a V2X message, a copy of its content is passed back to the (NVIDIA) application, i.e. every 100 ms (10 Hz) to 1000 ms (1 Hz) depending on the movement of the vehicle.

5.1.1. Application to V2X module Message

The following structure is passed from the application to the V2X module. The excerpt below is taken from [Ref 6] (frame no. 12920, source 192.168.1.1); Table 3 gives a detailed explanation of the data.

0000 e7 ec c5 01 02 ca fe ba be 00 00 03 05 1f 4d 01

0010 45 ff 4e 18 5f 00 00 00 00 00 00 00 00 00 8e 00

0020 00 00 04 24 00 07 c4 00 00 00 28 00 11 00 00 00

0030 00 00 00 00 00 00 00 00 00 ff ab 00 00 00 00 00

0040 00 00 00

Note: Data larger than 1 byte can be assumed as Big Endian.

Byte | Element | Hex Value | Data Value
[0-1] | 16-bit CRC | 0xE7EC | 59,372
[2] | 8-bit sequence counter | 0xC5 | 197
[3] | ITS Protocol Version | 0x01 | 1
[4] | ITS Message ID (see [Ref 2] A.114) | 0x02 | 2
[5-8] | ITS Station ID (see [Ref 2] A.77) | 0xCAFEBABE | 3405691582
[9-12] | Out of scope | - | -
[13-16] | ITS Latitude (see [Ref 2] A.41) | 0x1F4D0145 | 525140293
[17-20] | ITS Longitude (see [Ref 2] A.44) | 0xFF4E185F | -11659169
[21-26] | Out of scope | - | -
[27-30] | ITS Altitude (see [Ref 2] A.9) | 0x0000008E | 142
[31-33] | Out of scope | - | -
[34-35] | ITS Heading (see [Ref 2] A.35) | 0x0424 | 1060
[36] | Out of scope | - | -
[37-38] | ITS Speed (see [Ref 2] A.74) | 0x07C4 | 1,988
[39-40] | Out of scope | - | -
[41-42] | ITS Vehicle Length (see [Ref 2] A.92) | 0x0028 | 40
[43] | Out of scope | - | -
[44] | ITS Vehicle Width (see [Ref 2] A.95) | 0x11 | 17
[45-56] | Out of scope | - | -
[57-58] | ITS Steering Wheel Angle (see [Ref 2] A.80) | 0xffab | -85
[59-66] | Out of scope | - | -

Table 3: Application to V2X module payload
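Given the big-endian note above, the fields of Table 3 can be recovered from the raw payload with standard big-endian unpacking. The sketch below uses the exact bytes of the excerpt and reproduces the decoded values from the table:

```python
import struct

# Raw 67-byte payload from [Ref 6], frame 12920 (hex dump above).
payload = bytes.fromhex(
    "e7ecc50102cafebabe000003051f4d01"
    "45ff4e185f0000000000000000008e00"
    "000004240007c4000000280011000000"
    "000000000000000000ffab0000000000"
    "000000"
)

# Offsets follow Table 3; all multi-byte fields are big-endian.
station_id, = struct.unpack_from(">I", payload, 5)   # ITS Station ID
latitude,   = struct.unpack_from(">i", payload, 13)  # signed 32-bit
longitude,  = struct.unpack_from(">i", payload, 17)  # signed 32-bit
speed,      = struct.unpack_from(">H", payload, 37)  # unsigned 16-bit
steering,   = struct.unpack_from(">h", payload, 57)  # signed 16-bit
```

Conversion of the decoded integers into physical units (degrees, m/s) follows the ETSI data-element definitions in [Ref 2].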

5.2. V2X Probe (TX)

As per [Ref 6], payloads received from the V2X module's IP address (see Table 2) are exactly what the V2X module has transmitted.

The payload sent from the NVIDIA's IP address (see Table 2) constantly updates the V2X module with the latest available information and, depending on the movement of the vehicle, the transmission rate of the V2X module is adjusted.


As per [Ref 6], at the start of the log the vehicle is not moving. Note that, as the vehicle begins to move, the interval between V2X module transmissions decreases from 1000 ms (1 Hz) to 100 ms (10 Hz):

Entry | Time | Delta since last TX | Source
11 | 0.336308 | - | 192.168.1.2 (V2X Module)
41 | 1.337331 | 1 second | 192.168.1.2 (V2X Module)
69 | 2.338357 | 1 second | 192.168.1.2 (V2X Module)
100 | 3.339407 | 1 second | 192.168.1.2 (V2X Module)
128 | 4.240326 | 1 second | 192.168.1.2 (V2X Module)
148 | 4.941326 | 0.7 seconds | 192.168.1.2 (V2X Module)
163 | 5.442316 | 0.5 seconds | 192.168.1.2 (V2X Module)
173 | 5.743340 | 0.3 seconds | 192.168.1.2 (V2X Module)
180 | 5.994362 | 0.25 seconds | 192.168.1.2 (V2X Module)
187 | 6.145764 | 0.15 seconds | 192.168.1.2 (V2X Module)
191 | 6.246337 | 0.1 seconds | 192.168.1.2 (V2X Module)
194 | 6.347336 | 0.1 seconds | 192.168.1.2 (V2X Module)

Table 4: V2X Transmission Delta
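The deltas of Table 4 can be reproduced directly from the probe timestamps; a sketch using the timestamp column above:

```python
# Timestamps (seconds) of transmissions from the V2X module, taken from Table 4.
tx_times = [0.336308, 1.337331, 2.338357, 3.339407, 4.240326, 4.941326,
            5.442316, 5.743340, 5.994362, 6.145764, 6.246337, 6.347336]

# Interval between consecutive transmissions: it shrinks from ~1.0 s (1 Hz,
# standstill) towards ~0.1 s (10 Hz, moving), matching the CAM generation rules.
deltas = [b - a for a, b in zip(tx_times, tx_times[1:])]
```

In a real pipeline the timestamps would first be filtered by source address (192.168.1.2) when reading [Ref 6].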

5.3. V2X Probe (RX)

Reception has been implemented in the overall system, but not with the simulation tool used to generate this data.

The structure (Table 3) will be the same for V2X messages received from surrounding physical (or simulated) vehicles.


5.4. V2X Application (TX)

When the NVIDIA receives a copy of a transmitted V2X message from the V2X module (see Section 5.2), a V2X TX JSON log entry is generated with the following structure; Table 5 gives a further explanation of the elements.

“{"element": "22", "msg_id": "2", "proto_ver": "1", "sta_id": "3405691582", "sta_ty": "5",

"timestamp": "2018-07-24 11:34:46.935055", "veh_alt": "51", "veh_hea": "549",

"veh_lat": "524913081", "veh_len": "40", "veh_lon": "-11320424", "veh_spe": "1127",

"veh_str": "4", "veh_wid": "17"}”

Element | Description | Data Type
1 – Element Counter | Counter for each logged entry | Unsigned integer
2 – Msg ID | Message ID of the V2X message (see [Ref 2] A.114 DF_ItsPduHeader) | Refer to [Ref 2] A.114
3 – Protocol Version | Protocol version used (see [Ref 2] A.114 DF_ItsPduHeader) | Refer to [Ref 2] A.114
4 – Station ID | Station ID of the V2X device (see [Ref 2] A.77 DE_StationID) | Refer to [Ref 2] A.77
5 – Station Type | Station type of the V2X device (see [Ref 2] A.78 DE_StationType) | Refer to [Ref 2] A.78
6 – Timestamp | Timestamp of when the log entry was generated | Timestamp
7 – Vehicle Altitude | Altitude value of the vehicle (see [Ref 2] A.9 DE_AltitudeValue) | Refer to [Ref 2] A.9
8 – Vehicle Heading | Heading value of the vehicle (see [Ref 2] A.35 DE_HeadingValue) | Refer to [Ref 2] A.35
9 – Vehicle Latitude | GPS latitude value of the vehicle (see [Ref 2] A.41 DE_Latitude) | Refer to [Ref 2] A.41
10 – Vehicle Length | Length of the vehicle (see [Ref 2] A.92 DE_VehicleLengthValue) | Refer to [Ref 2] A.92
11 – Vehicle Longitude | GPS longitude value of the vehicle (see [Ref 2] A.44 DE_Longitude) | Refer to [Ref 2] A.44
12 – Vehicle Speed | Speed of the vehicle (see [Ref 2] A.74 DE_SpeedValue) | Refer to [Ref 2] A.74
13 – Vehicle Steering Wheel Angle | Steering wheel angle value of the vehicle (see [Ref 2] A.80 DE_SteeringWheelAngleValue) | Refer to [Ref 2] A.80
14 – Vehicle Width | Width of the vehicle (see [Ref 2] A.95 DE_VehicleWidth) | Refer to [Ref 2] A.95

Table 5: V2X TX JSON Log Entry Table
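As with the CAN application log, all values in a V2X TX entry are JSON strings and use the raw ETSI integer encodings. A sketch of parsing the example entry above and converting to physical units; the scaling factors (0.1 microdegree for latitude/longitude, 0.01 m/s for speed) follow the data-element definitions referenced in [Ref 2] and should be verified against them:

```python
import json

# Example V2X TX log entry from Section 5.4.
entry = json.loads(
    '{"element": "22", "msg_id": "2", "proto_ver": "1", "sta_id": "3405691582", '
    '"sta_ty": "5", "timestamp": "2018-07-24 11:34:46.935055", "veh_alt": "51", '
    '"veh_hea": "549", "veh_lat": "524913081", "veh_len": "40", '
    '"veh_lon": "-11320424", "veh_spe": "1127", "veh_str": "4", "veh_wid": "17"}'
)

# ETSI ITS encodings: latitude/longitude in 0.1 microdegree, speed in 0.01 m/s
# (see [Ref 2] DE_Latitude, DE_Longitude, DE_SpeedValue).
lat_deg = int(entry["veh_lat"]) / 1e7
lon_deg = int(entry["veh_lon"]) / 1e7
speed_ms = int(entry["veh_spe"]) / 100.0
```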

5.5. V2X Application (RX)

Reception has been implemented in the overall system, but not with the simulation tool used to generate this data.

The structure (Table 5) will be similar for V2X messages received from surrounding physical (or simulated) vehicles.