
Deliverable D1.1 – Whitepaper

Large-Scale Smart Grid Application Roll-Out

Version 1.0

Deliverable

Filip Pröstl Andrén, Jawad Kazmi, Catalin Gavriluta, Ewa Piatkowska and Paul Smith (AIT)
Eric Veith, Stephan Balduin, Mana Azamat and Mathias Uslar (OFFIS)

Florian Kintzler and Stephan Cejka (Siemens)
Marco Mittelsorf and Jörn Schumann (Fraunhofer ISE)

Henrik Sandberg, David Umsonst and Michelle Chong (KTH)

2020-05-29

ERA-Net Smart Grids Plus | From local trials towards a European Knowledge Community

This project has received funding in the framework of the joint programming initiative ERA-Net Smart Grids Plus, with support from the European Union’s Horizon 2020 research and innovation programme.


INTERNAL REFERENCE

• Deliverable No.: D 1.1

• Deliverable Name: Large-Scale Smart Grid Application Roll-Out

• Lead Partner: AIT

• Task No. & Name: T 1.2

• Work Package No.: WP 1

• Document (File): D1_1_LarGo_Whitepaper.pdf

• Issue (Save) Date: 2020-05-29

DOCUMENT SENSITIVITY

• [ ] Not Sensitive: Contains only factual or background information; contains no new or additional analysis, recommendations or policy-relevant statements.

• [x] Moderately Sensitive: Contains some analysis or interpretation of results; contains no recommendations or policy-relevant statements.

• [ ] Sensitive: Contains analysis or interpretation of results with policy-relevance and/or recommendations or policy-relevant statements.

• [ ] Highly Sensitive / Confidential: Contains significant analysis or interpretation of results with major policy-relevance or implications, contains extensive recommendations or policy-relevant statements, and/or contains policy-prescriptive statements. This sensitivity requires SB decision.

DOCUMENT STATUS

• Author(s), 2020-05-27: Filip Pröstl Andrén, Jawad Kazmi, Catalin Gavriluta, Ewa Piatkowska, Paul Smith, Eric Veith, Stephan Balduin, Mana Azamat, Mathias Uslar, Florian Kintzler, Stephan Cejka, Marco Mittelsorf, Jörn Schumann, Henrik Sandberg, David Umsonst, Michelle Chong (AIT, OFFIS, Siemens, Fraunhofer ISE & KTH)

• Verification by, 2020-05-29: Filip Pröstl Andrén (AIT)

• Approval by, 2020-05-29: Filip Pröstl Andrén (AIT)


CONTENTS

1 Introduction
2 Large-Scale System Analysis
  2.1 Smart Grid Application Roll-Out Concept
  2.2 Evaluation through Co-Simulation
3 Secure and Resilient System Design
  3.1 Identification of Security and Safety Critical Points and Threat Risk Assessment
  3.2 Development of Resilient Monitoring & Control Applications
4 Unified Deployment Process
  4.1 iSSN Software Management System
  4.2 BEMS Software Management
5 Roll-Out Analysis and Validation
  5.1 Large-Scale Application Analysis
    5.1.1 Overview
    5.1.2 Test Cases and Results
  5.2 Lab-Based Roll-Out and Integration Tests
    5.2.1 Overview
    5.2.2 Test Cases and Results
  5.3 Test-Bed Roll-Out Analysis
    5.3.1 Customer Domain Field Tests in Fellbach
    5.3.2 Grid Domain Field Tests in Aspern
6 Discussions and Lessons Learned
7 Conclusions
Abbreviations


Disclaimer

The content and views expressed in this material are those of the authors and do not necessarily reflect the views or opinion of the ERA-Net SG+ initiative. Any reference given does not necessarily imply the endorsement by ERA-Net SG+.

About ERA-Net Smart Grids Plus

ERA-Net Smart Grids Plus is an initiative of 21 European countries and regions. The vision for Smart Grids in Europe is to create an electric power system that integrates renewable energies and enables flexible consumer and production technologies. This can help to shape an electricity grid with a high security of supply, coupled with low greenhouse gas emissions, at an affordable price. Our aim is to support the development of the technologies, market designs and customer adoptions that are necessary to reach this goal. The initiative is providing a hub for the collaboration of European member-states. It supports the coordination of funding partners, enabling joint funding of RDD projects. Beyond that, ERA-Net SG+ builds up a knowledge community, involving key demo projects and experts from all over Europe, to organise the learning between projects and programs from the local level up to the European level.

www.eranet-smartgridsplus.eu


1 Introduction

The ongoing digitalization in the power system domain has resulted in a changing role for Information and Communication Technologies (ICTs) in electric distribution grids. Not only are new networking technologies being deployed, but software applications that operate on field data or even perform real-time control are also required. Consequently, these new systems and their software need to be maintained and kept up-to-date, which puts new requirements on software roll-outs. As one might expect, there are potentially very large numbers of highly distributed software components that constitute the smart grid.

For several reasons it may be necessary to update software and services in the smart grid on a large scale. This is the case when new services are being deployed, for example. Other motivations for the large-scale roll-out of software services in the smart grid include updates, e.g., to enable new functions or adjust the behaviour and configuration of existing ones, and computer security-related patches that fix vulnerable software. The LarGo! project is tasked with developing novel approaches to the large-scale roll-out of software services in the smart grid.

Electricity distribution networks are a critical infrastructure; failures can result in significant financial and societal cost. Consequently, it is important that software roll-outs do not fail, or that they fail gracefully, resulting in minimal impact to the infrastructure. There are a multitude of reasons this could happen. Benign (non-malicious) reasons include human error (e.g., the accidental deployment of incorrect software versions and set-points), system failures, and poor communication network conditions. Conversely, cyber-attacks on critical infrastructures are becoming more prevalent and sophisticated. Cyber-attacks have been observed in which the attacker's goal is to disrupt or damage operational (physical) systems in the energy sector, as was seen in Ukraine in December 2015 [26].

LarGo! enables the mass roll-out of smart grid applications by defining a seamless, safe and secure application deployment process for the grid and customer domain. The project poses the hypothesis that ICT maintenance cannot be conducted independently of the runtime operation of a smart grid. For example, on a utility scale, the time required for deployment and ICT maintenance processes overlaps significantly with operational periods (i.e., these two aspects cannot be readily separated). Furthermore, the exchange of operational data will use the same communication channels as that used for ICT maintenance. Therefore, the critical challenge of stable and resilient system operation is addressed in a setting where communication systems are used for both smart grid run-time operation, including monitoring and control, and ICT maintenance, such as application deployment and remote configuration.

To assess possible large-scale effects of application deployment, system maintenance and operations, the project uses combined emulations and simulations of the required ICT systems and power systems. Furthermore, two testbeds are used to evaluate selected smart grid application deployments at the substation and customer levels. The LarGo! project has three main goals:

• To prepare the mass roll-out of smart grid software applications, facilitating energy-related deployment services in the customer domain (energy management, aggregation, flexibility) and renewable integration in the grid domain (monitoring and control, efficiency, hosting capacity increase) for Distribution System Operators (DSO).

• To analyze the technical side-effects of roll-out, updating, patching and operations over common communication infrastructure, using highly accurate, but large-scale, system emulations. Results are verified by Hardware-In-the-Loop experiments.

• To support the adoption of smart grid approaches by designing a secure infrastructure and robust applications that enable fail-safe system operation. Based on previously published software applications, LarGo! analyses how they scale up towards a utility/large-scale deployment and proposes improvements to handle future large-scale roll-out scenarios.

The rest of the document is organized as follows: Section 2 presents the initial conceptual modelling made in the project and ideas on how co-simulation can be used as a validation approach. In Section 3, a more theoretical approach to designing secure and resilient smart grid systems is presented, which is followed by implementations of a unified deployment approach in Section 4. These sections are followed by Section 5, where different validation cases are analysed to evaluate the LarGo! approach. In Section 6, the main lessons learned from the project are presented together with guidelines for how large-scale roll-outs in smart grids should be realized. The document is concluded in Section 7.

2 Large-Scale System Analysis

The purpose of project LarGo! is to prepare the mass roll-out of smart grid software applications facilitating energy-related service marketplaces in the customer domain and renewable integration in the grid domain. To this effect, the major artifact of LarGo! itself is a software roll-out process description. The LarGo! project is exceptional in its multi-domain approach for creating a process for large-scale software roll-outs in the smart grid. It does not just focus on one domain, but analyses several domains and their interdependencies in order to assess the impact of a software roll-out on the power grid and the Information and Communication Technology (ICT) infrastructure.

2.1 Smart Grid Application Roll-Out Concept

Figure 1 gives an overview of the main idea behind LarGo!. Beginning with an initial analysis, the operator or an engineer is assisted with creating a roll-out plan. In order to create this plan, multiple domains need to be considered. Side-effects of roll-out, updating, patching and operations over common communication infrastructure may have impacts on the underlying power system if the roll-out is not carried out correctly. Once software is used as an integral part throughout the whole power system, software updates can no longer be considered independently. Therefore, the assisted roll-out planning in LarGo! supports the engineer in considering all the affected domains and also provides simulation-based pre-evaluation of the roll-out plan. Once a plan has been compiled it can be executed. In LarGo!, the deployment to two domains was analysed in detail: deployment of software to substations (intelligent Secondary Substation Node (iSSN)) and deployment of software to the Building Energy Management System (BEMS).


Figure 1: General concept for the smart grid application roll-outs in LarGo!.

2.2 Evaluation through Co-Simulation

As already mentioned above, the LarGo! project considers multiple domains and their interdependencies. However, there is no single simulation software that is able to cover evaluation over the different domains in the project, i.e., to couple power grid analysis/power flow study, ICT simulation, and to integrate existing software and algorithms into the environment. For LarGo!, this means specifically that in addition to the power grid and ICT infrastructure model, three separate software packages needed to be integrated in the simulation environment to interact with each other. Because the goal of LarGo! is to study the interaction of different system domains and the side-effects of a software roll-out, separate simulation runs that are then analyzed accordingly are not feasible.

Instead, the approach of co-simulation was chosen. Here, each domain can be modelled and simulated with the tool that is best suited for the job, e.g., PowerFactory for the power grid. Each simulator connects to a co-simulation framework that coordinates the execution of each simulator and allows them to exchange data. The co-simulation software mosaik [28], provided by OFFIS for LarGo!, acts as the central control and information interchange hub in this specific project. Mosaik allows controlling different simulation software over a network protocol based on TCP and JSON. It is a time-discrete co-simulator that coordinates setup, start, and execution of each simulation software. To connect a simulator, a bridge module needs to be written that implements both the simulator's API as well as mosaik's JSON protocol API. The LarGo! project has initiated or driven the development of three such bridges.
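To make the bridge concept concrete, the following is a minimal sketch of such a bridge in Python against the mosaik 2.x API (mosaik_api package). The model name, attributes and step size are illustrative assumptions and do not reproduce the actual LarGo! bridges.

```python
# Hypothetical mosaik bridge sketch (mosaik 2.x style API).
import mosaik_api

META = {
    'models': {
        'Household': {
            'public': True,
            'params': [],
            'attrs': ['P', 'Q', 'V'],   # active/reactive power, voltage
        },
    },
}

class HouseholdBridge(mosaik_api.Simulator):
    def __init__(self):
        super().__init__(META)
        self.entities = {}
        self.step_size = 60  # seconds, assumed resolution

    def init(self, sid, **sim_params):
        return self.meta

    def create(self, num, model, **model_params):
        new_entities = []
        for i in range(len(self.entities), len(self.entities) + num):
            eid = 'Household_%d' % i
            self.entities[eid] = {'P': 0.0, 'Q': 0.0, 'V': 230.0}
            new_entities.append({'eid': eid, 'type': model})
        return new_entities

    def step(self, time, inputs):
        # Forward received set-points/voltages to the wrapped application here.
        for eid, attrs in inputs.items():
            for attr, values in attrs.items():
                self.entities[eid][attr] = sum(values.values())
        return time + self.step_size

    def get_data(self, outputs):
        return {eid: {attr: self.entities[eid][attr] for attr in attrs}
                for eid, attrs in outputs.items()}

if __name__ == '__main__':
    mosaik_api.start_simulation(HouseholdBridge())
```

The bridge translates between mosaik's stepped, attribute-based data exchange and whatever native interface the wrapped simulator or application exposes.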

For LarGo!, the following models have been developed or bridges created: The model of the power grid has been created in PowerFactory. A corresponding software bridge to couple PowerFactory and mosaik has been initiated by and completed for the project. The ICT infrastructure simulation is handled by OMNEST, which is the commercial variant of OMNeT++ [29]. These two simulations serve as the major environment in which the roll-out process takes place. In LarGo!, two applications are considered for the software roll-out/update process. The first one is the OpenMUC framework, developed at Fraunhofer ISE. OpenMUC provides a means to control smart homes, and therefore models the household domain as a typical application that might see deployment on smart metering devices. The power grid level is covered by the iSSN, which is provided by Siemens AG Österreich. Figure 2 gives an overview of the co-simulation setup and also contains details about the general data paths between the software components. The figure distinguishes between logical data paths, i.e., the flow of data as perceived by the co-simulated applications, and physical data paths that describe the actual flow of data in the co-simulation environment.


Figure 2: Data paths of the co-simulation

The applications OpenMUC and iSSN are existing software that needed to be coupled with mosaik. Since they are existing software packages that were not developed with a co-simulation in mind, they lack the specific interface that makes coupling possible. To avoid unwanted changes, all software is packaged in virtual environments using the Docker software. Docker is a freely available, open-source, general-purpose containerization software that has become an industry standard in recent years. The virtual environments usually communicate with the rest of the world over virtual network interfaces. We use this fact in LarGo! by providing a special virtual network interface. This virtual network interface, called a tun device, uses Linux kernel facilities in order to treat every IP packet (i.e., all standard network traffic) in user space. The tun device is created in parallel to the Docker-provided interface. The containers are set up in such a fashion that the tun device receives all traffic, except for the dedicated communication with mosaik.

A secondary piece of software is then installed in each Docker container. This piece of software, mosaik-vif, has been created specifically for LarGo!. It reads the network traffic from and to the virtualized software packages and encodes it as data for mosaik. mosaik, in turn, has been set up to connect each container's tun device with the corresponding node in the ICT simulation. mosaik injects the traffic into the OMNEST simulation environment. OMNEST is able to read, decode and handle IP packets; it then routes the real, virtualized network traffic generated by the existing applications through the simulation environment. At no time are the actual packet contents changed, nor do they need any adaptation for this to work. This allows for elegant integration of unmodified software into an ICT simulation testbed.
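As an illustration of the kernel facility this builds on, the sketch below opens a Linux tun device and reads raw IP packets from it in user space. The interface name is an assumption, and the actual mosaik-vif encoding towards mosaik is omitted.

```python
# Minimal sketch: open a Linux tun device and read raw IP packets.
# Requires root privileges (or CAP_NET_ADMIN) inside the container.
import fcntl
import os
import struct

TUNSETIFF = 0x400454ca       # ioctl to configure the tun/tap device
IFF_TUN = 0x0001             # raw IP packets (no Ethernet header)
IFF_NO_PI = 0x1000           # no extra packet-information header

def open_tun(name: str = "tun0") -> int:
    fd = os.open("/dev/net/tun", os.O_RDWR)
    ifr = struct.pack("16sH", name.encode(), IFF_TUN | IFF_NO_PI)
    fcntl.ioctl(fd, TUNSETIFF, ifr)
    return fd

if __name__ == "__main__":
    tun_fd = open_tun()
    while True:
        packet = os.read(tun_fd, 65535)   # one IP packet per read
        # Here mosaik-vif would serialize the packet and hand it to mosaik,
        # which injects it into the OMNEST network model.
        print(f"captured {len(packet)} bytes")
```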

Later in the project, during the evaluation phase, the development of the mosaik-vif software turned out to be a major challenge. This was due to a number of unforeseeable reasons, such as spanning network stacks, IP packet parsing, container orchestration, simulation timing, and kernel/userland context switch handling. More information about these issues is reported in Section 5.1.

3 Secure and Resilient System Design

Failures of a service roll-out process for the smart grid – with benign or malicious causes – can result in operational physical consequences; the smart grid is a cyber-physical system (of systems). Such consequences can include reduced power quality, voltage safety-limit violations, blackouts, safety-related incidents, and equipment damage. Furthermore, incorrect roll-outs can result in grids being operated with reduced efficiency, potentially negating any environmental or financial benefits that are associated with smart grid services. (Ultimately, this may lead to a loss of trust in smart grids and reduced investment and uptake.) Therefore, the approaches to the large-scale roll-out of services that are developed in the LarGo! project must be secure and resilient.

3.1 Identification of Security and Safety Critical Points and Threat Risk Assessment

Enabling a secure and resilient service roll-out process in the smart grid requires a method to identify and analyze the consequences of failure and their root causes. The results from applying such a method can be used to guide the organization of a software roll-out process, such that the identified failures are significantly less likely to occur. Moreover, the findings from an analysis can be used to inform the configuration of online monitoring and detection systems, whose purpose is to detect the onset of failures. In this deliverable, we present a method to achieve this objective, which includes several steps.

A hazard analysis is performed to identify the high-level hazards and accidents that we wish to avoid causing with a service roll-out. These include the operational consequences discussed earlier. A further analysis identifies and examines relevant control systems and loops, enumerating the ways that hazards and accidents could occur because incorrect control actions are taken (so-called hazardous control actions). Subsequently, factors are identified that could cause these hazardous control actions to be taken. Example causal factors include delayed or incorrect feedback to a controller. For this hazard analysis, we use the System Theoretic Process Analysis (STPA) approach that has been proposed by Leveson [21] and is shown in Figure 3.


Figure 3: Methodology for identification of security and safety critical points.

Finally, misuse-case scenarios are specified that describe the root causes of failures in the software roll-out process that can result in the causal factors that have been identified. Misuse-cases include a narrative that describes, in this case, the root causes of smart grid service roll-out failures. As discussed earlier, these can be benign or malicious. By describing attack scenarios in misuse-case templates, possible attacks can be found by taking known vulnerabilities into account. The probability of occurrence is then estimated, and accordingly it is determined what damage might occur and what the impact would be. Taking implemented countermeasures into account may reduce the likelihood of unfavorable events in the future. Based on the threat and the probability of the attack on certain assets, the risk can be assessed, as seen in Figure 4.

Figure 4: Assessment Steps
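As a purely illustrative aid, the following sketch rates hypothetical threats with a generic likelihood-times-impact scheme; the scales, thresholds and threat entries are invented for the example and are not the project's assessment method.

```python
# Generic likelihood x impact risk rating (illustrative values only).
LIKELIHOOD = {"rare": 1, "possible": 2, "likely": 3}
IMPACT = {"minor": 1, "moderate": 2, "severe": 3}

def risk_level(likelihood: str, impact: str) -> str:
    score = LIKELIHOOD[likelihood] * IMPACT[impact]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"

# Hypothetical threats derived from misuse-case narratives.
threats = [
    ("tampered software image pushed during roll-out", "possible", "severe"),
    ("roll-out server temporarily unavailable", "likely", "minor"),
]
for name, likelihood, impact in threats:
    print(f"{name}: {risk_level(likelihood, impact)}")
```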

3.2 Development of Resilient Monitoring & Control Applications

Whereas the previously described methods, STPA and the misuse-case scenarios, are offline tools for identifying security and safety hazards during roll-out, the focus here is rather on online tools. In the following, the roll-out methodology proposed in the LarGo! project is outlined, together with how the tools (or applications) described in this report contribute to a resilient roll-out.

A large-scale roll-out of new software for devices in a smart distribution grid is a complex task. As already discussed, the roll-out could fail in various ways, and it is important to establish a procedure that is resilient against such possible events. In the LarGo! project, we have devised and analyzed the following five-step procedure for a resilient roll-out:

Step 0: Offline design and verification of new software

Step 1: Offline calculation of safe minimum-time roll-out schedule

Step 2: Software and schedule passed to roll-out manager, which executes the roll-out (see Section 4)

Step 3: Online monitoring, anomaly detection, and possible abortion

Step 4: Rollout complete (or rollback initiated).

Step 0 concerns the design of new device software to be rolled out. This step is not time critical and can be carried out offline, well ahead of the actual roll-out. Although this step is not the main concern of the LarGo! project, it is important already at this stage to design software that is easy, safe, and secure to maintain and update. As an example of how such a design could look, a new droop control algorithm for BEMS was studied in LarGo!. A novel so-called voltage droop control law was developed that comes with theoretical safety guarantees. In particular, it ensures that if all BEMS implement the algorithm, the voltage levels at all nodes will remain within safety bounds, subject to load and demand fluctuations within specified limits. As such, this is an example of software that comes with a high level of resilience against failure in future roll-outs. This work has been presented in detail by Chong et al. in [9,10].

Step 1 concerns the computation of a time schedule for the order in which the devices should be updated. For several reasons it is not a good idea to update all devices simultaneously. One reason may be that the available bandwidth is limited, so that it is not possible to communicate with all devices in parallel. Another reason is that during the update phase, some or all devices may fail and adversely interact to destabilize the grid, threatening safety. In LarGo! an optimization framework was developed that generates an update schedule covering all devices. The schedule terminates in minimum time, while ensuring that possible device failures will not threaten voltage safety. In particular, it is shown how the problem can be understood as a multi-resource bin packing problem, which is NP-hard, and that greedy bin packing heuristics can be adapted to the problem. The approach performs well on networks of realistic size, and the problem can also be solved exactly using an Integer Linear Program (ILP). This work was presented by Medeiro et al. in [12]. An example of a roll-out schedule for a distribution network is shown in Figure 5; for this example distribution grid, six deployment steps are needed. Just as in Step 0, the computation of the schedule is not necessarily time critical and can be performed offline. The following steps are supposed to execute in real time, however, and are hence considered online procedures.
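The following is a hypothetical first-fit-decreasing heuristic in the spirit of the multi-resource bin-packing formulation referenced above; the capacity model (one bandwidth budget per step plus a per-feeder limit on simultaneously updated devices) is a simplification for illustration, not the published algorithm of [12].

```python
# Greedy bin-packing sketch: each "bin" is one deployment step with
# assumed capacities (bandwidth budget, per-feeder redundancy limit).
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    feeder: str       # devices on the same feeder should not all update at once
    bandwidth: float  # expected download load during its update

@dataclass
class Step:
    bandwidth_left: float
    per_feeder_limit: int
    devices: list = field(default_factory=list)

    def fits(self, d: Device) -> bool:
        same_feeder = sum(1 for x in self.devices if x.feeder == d.feeder)
        return d.bandwidth <= self.bandwidth_left and same_feeder < self.per_feeder_limit

    def add(self, d: Device) -> None:
        self.devices.append(d)
        self.bandwidth_left -= d.bandwidth

def schedule(devices, bandwidth_budget=10.0, per_feeder_limit=2):
    steps = []
    # First-fit decreasing: place the largest transfers first.
    for d in sorted(devices, key=lambda x: x.bandwidth, reverse=True):
        for s in steps:
            if s.fits(d):
                s.add(d)
                break
        else:
            s = Step(bandwidth_budget, per_feeder_limit)
            s.add(d)
            steps.append(s)
    return steps

if __name__ == "__main__":
    devs = [Device(f"bems-{i}", feeder=f"F{i % 3}", bandwidth=3.0) for i in range(9)]
    for i, s in enumerate(schedule(devs), start=1):
        print(f"step {i}: {[d.name for d in s.devices]}")
```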

Step 2 concerns the actual execution of the roll-out schedule and the online update of the devices. This step is described in detail in Section 4.

Step 3 concerns monitoring of the update process and detection of anomalies, and runs in parallel with the execution of the roll-out. While the roll-out is carried out, the devices are tracked so that they do not start to misbehave before, during, or after they are updated. The Anomaly Detection (AD) is based on a root-cause analysis tool, developed in LarGo! to assist in determining the cause of an anomaly. This is important because it helps to react properly in the face of an update anomaly or failure. If the cause is deemed less severe, one resort could be a re-computation of the roll-out schedule in Step 1, whereas a more serious incident may trigger a complete roll-out abortion.

Figure 5: Optimal roll-out schedule for a distribution network, with deployment in six steps.

Situation awareness plays an important role in securing critical infrastructures, such as the Smart Grid. An inherent part of situation awareness is online monitoring of the system state, which is achieved by the collection of monitoring logs and information from distributed sensors, both security and process related. In LarGo! a microservice and event-driven architecture is proposed, which integrates different security sensors and enables more holistic reasoning about the monitored system state (situational awareness). A microservice architecture facilitates the integration of independent (standalone) applications that can be written in different programming languages. An event-driven architecture enables dynamic updates of the system whenever something changes, without the necessity to periodically query the state of all subcomponents. Figure 6 presents an outline of the framework's architecture.


Figure 6: REASENS Framework Architecture

Applications, such as (security) sensors, are distributed at strategically critical points in the system to monitor the system's health (behaviour). In cyber-physical systems we consider distributed sensors to monitor the ICT infrastructure (hosts, applications, network), as well as the underlying physical processes. The former is usually understood as a task of intrusion detection, whereas the latter is referred to as Fault Detection Isolation and Recovery (FDIR) [19]. Examples of security sensors include AD deployed locally at the voltage control level (physical process sensor measurements); network intrusion detection to check if the network has been manipulated; and host intrusion detection systems to alert on suspicious activity within the hosts, such as building energy management systems or similar.

Sensors communicate over the Message Queue Telemetry Transport (MQTT) protocol with a more centrally located component (server), which performs high-level alert correlation based on the input from the security sensors. Sensors publish events to the MQTT broker, which are then normalized by the parser and published further to the reasoning and complex event processing components, e.g., Evidential Networks, Recurrent Neural Networks, or other alert correlation algorithms. The normalised events, as well as raw messages from the sensors, can be logged into the database and accessed via the graphical user interface (GUI). The GUI offers a web interface (dashboard) with a presentation of the system state, logging of the security events, alarms (generated by the engines), the possibility to register and configure sensors, etc. The framework architecture allows for different reasoning/correlation engines: they can subscribe to messages sent from selected security sensors and perform a variety of processing (with different goals). The correlation engine is also integrated in the architecture as a microservice. In the current implementation of the REASENS framework we deploy an Evidential Network (EN) for performing inference about the system and roll-out state, and root-cause analysis of potential failures. The method developed in LarGo! is based on work by Friedberg et al. [16].
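A minimal sketch of a security sensor publishing an event to such an MQTT event bus is shown below (paho-mqtt 1.x client API); the broker address, topic layout and event schema are assumptions, not the REASENS definitions.

```python
# Hypothetical security sensor publishing one event to an MQTT broker.
import json
import time
import paho.mqtt.client as mqtt

BROKER_HOST = "broker.example.local"            # placeholder broker address
TOPIC = "sensors/anomaly-detection/events"      # placeholder topic layout

client = mqtt.Client(client_id="ad-sensor-01")
client.connect(BROKER_HOST, 1883)
client.loop_start()

event = {
    "sensor": "voltage-anomaly-detector",
    "timestamp": time.time(),
    "severity": "warning",
    "message": "Voltage deviation at node N23 exceeds configured threshold",
}
info = client.publish(TOPIC, json.dumps(event), qos=1)
info.wait_for_publish()   # block until the broker has acknowledged the event

client.loop_stop()
client.disconnect()
```

On the server side, the parser and correlation engines would subscribe to the same topic hierarchy and process the normalized events as described above.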

In the final step, Step 4, the status of the updated devices is determined. In case there was a severe roll-out failure, a rollback mechanism is initiated. Tools for this step have not been investigated in the LarGo! project.

4 Unified Deployment Process

There are several definitions of the term software roll-out or software deployment and its process. In the LarGo! project the following definition of software deployment has been used throughout the activities: software deployment is defined as a process which organizes and schedules a set of activities in order to make software available for use and to keep it up-to-date and operational. According to this definition, software deployment includes not only the activity of distribution, but also the continuous maintenance of the software and its deinstallation.

In order to enable and prepare the mass roll-out of smart grid applications, a dedicated and enhanced deployment process is needed that can support power system engineers during the roll-out of software. The aim is to derive a smart grid roll-out process which not only covers mass roll-out, but also dependency management, automated configuration and other relevant functions under consideration of degraded ICT and power system states. In the end, the result should be a seamless, safe and secure deployment process for smart grid applications that provides answers to the following main questions:

1. What software applications should be deployed?

2. When should they be deployed?

3. How can they be deployed?

4. How are deployment exceptions handled?

The answers to the first two questions should help the system engineer to create a software roll-out plan or campaign and are also answers to Step 0 and Step 1 discussed in Section 3.2. The third question is related to Step 2 and is described in detail in Section 4.1 and in Section 4.2. In case there are problems during the software roll-out, the answer to question four helps the engineers to take qualified decisions on how to proceed. For example, a roll-back to the previous software version might be the right choice, but in case of a minor exception it might also be better to continue with the roll-out plan.

The development of a unified deployment process in LarGo! was initiated with an identification of requirements and use cases as well as an analysis of existing software management systems. Existing software management systems are designed to roll out software to one or more devices and cover different levels and areas of dependency management [2–4, 7, 8, 14, 15, 27]. For complex Cyber-Physical Systems (CPSs) there are, however, dependencies that go beyond what state-of-the-art software roll-out systems support. These interface and functional dependencies range from the device level (dependencies on the applications' runtime environment like drivers, configured sensors, etc.) via the system level (protocols, services, etc.) up to the domain level (functional dependencies with respect to the controlled physical system; for example, two controller applications on separate devices should not try to control the same physical parameters). To ensure a given level of dependability of the CPS, during and after the software roll-out, it is important to cover dependencies on all of these levels, rather than focusing on one of these levels alone.

When a diverse set of devices is used in CPSs, it is very likely that there will also be the need for a diverse set of software deployment systems, since there are differing requirements for device types with divergent hardware. For example, a resource-restricted Internet of Things (IoT) device which sends sensor measurements via a wireless protocol does not provide the resources to run containerized applications and thus cannot be managed using a Kubernetes backend, but rather with a solution like SWUpdate [27]. However, managing a state-of-the-art edge device using a firmware management tool that flashes the whole device every time one of the software components needs an update leads to an unnecessary outage of functionality which should not be affected by the software change. In addition, none of the systems analysed is able to include knowledge about the physical environment and the functions of the software into the software roll-out planning and execution.

To resolve the described limitations, the LarGo! project proposes an additional layer of software management which generates and executes software roll-out plans using existing software management systems. The Knowledge Based Software Management (KBSM) components, as shown in Figure 7, utilize knowledge about the setup of the CPS and its components for the planning process, which includes knowledge about the underlying software deployment processes themselves.


Figure 7: KBSM Framework – Using distinct system knowledge, an optimal deployment campaign (i.e., schedule) is created, taking multiple factors into account, such as device parameters, power system state, ICT system state, consumer behavior, etc.

The KBSM framework is controlled by the engineer through a User Interface (UI), seen at the top of Figure 7. In a first step, a Resilience Analysis is made, where knowledge gathered from the STPA method is used (see Section 3). This is combined with live system monitoring data and system knowledge provided by the Knowledge Service. In a second step, this knowledge is used for Assisted Planning and creation of exception scenarios (e.g., roll-back to previous software). With these initial steps, the roll-out architecture can suggest a plan that makes the most sense based on the available knowledge. It is still up to the operator to approve the plan or adjust it if needed.

Once a plan has been confirmed by the operator, it can be tested in a Plan Evaluation. The multi-domain approach in LarGo! for creating a process for large-scale software roll-outs in the smart grid focuses not only on one domain but analyses several domains and their interdependencies in order to assess the impact of a software roll-out. Thus, an evaluation of a software deployment plan must also evaluate these different aspects. To handle this, the simulation framework described in Section 2 is used.

After the roll-out plan has been evaluated, it is handed over to the Execution. In LarGo!, the deployment has been investigated for two domains: deployment for the iSSN, and deployment for the BEMS. Although based on different technologies, the deployment for both domains follows the same approach. A software management component is needed on the devices as well as in the backend. During a software deployment the backend downloads the software package to the device software management, which installs and starts the software on the device. How this is handled for the grid domain (i.e., the iSSN) is presented in Section 4.1, and for the customer domain (i.e., the BEMS) in Section 4.2.

The proposed KBSM framework is based on a layer of Monitoring and Action Services, which provide information about, and the possibility to interact with, the different domains. The Monitoring Services can be configured to monitor states in the sub-systems that must hold before changes can be made to the systems. The domain-specific Action Services provide means (actions) to apply changes to all parts of the CPS. Thereby, the Action Services only execute commands; the successful execution is ensured by monitoring the systems. The definition of a successful installation of an algorithm for voltage level stabilization can thus include the voltage level itself and not only the successful start of the software.

Tests using ontologies and graph databases to store and process all dependencies on all levels, from the physical to the logical level, showed that this is not only challenging but may require a depth of knowledge about the system that is too complex to be manageable. In addition, the knowledge-based approach is currently constrained with respect to the system's dynamic properties. However, in combination with other modeling approaches and the usage of a digital twin to test the software roll-out plan against predefined scenarios, it is expected that the presented approach increases the dependability of the system during and after the software roll-out in comparison to the current state of the art.

The inter-domain roll-out process thus consists of the following steps (a minimal sketch of plan execution with rollback is given after the list):

1. Describe static knowledge about the system, its components, its stakeholders and all known (mis-)use cases.

2. Define standardized sets of failure- and attack-scenarios that can occur during software roll-out.


3. Monitor the system to update knowledge about dynamic properties.

4. Add new software and information about the software to the system.

5. Deduce knowledge about security issues, possible failures, and possible mis-use-cases.

6. Plan the roll-out using the (domain specific) system knowledge:

(a) Define goal of roll-out.

(b) Deduce all constraints (from different domains).

(c) Minimize risks and interferences with normal operation (inform the user about non-resolvable risks).

(d) Derive safe-points (safe and secure intermediate steps).

(e) Include rollback scenarios (to safe-points) in case a constraint is violated.

7. Evaluate dynamic properties of planned campaign by using a digital twin (simulation):

(a) Execute standardized sets of failure- and attack-scenarios.

8. Rate results and report possible incidents to user.

9. Execute roll-out plan using domain specific services (Actions & Monitoring)

10. Perform rollback to safe-point in case of a constraint violation.
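The sketch below illustrates steps 9 and 10 only: executing a plan step by step, checking a monitoring constraint after each action, and rolling back to the last safe-point on a violation. The callables stand in for the domain-specific Action and Monitoring Services and are placeholders, not the KBSM interfaces.

```python
# Minimal plan-execution sketch with safe-points and rollback (hypothetical).
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PlanStep:
    description: str
    action: Callable[[], None]      # e.g. trigger a deployment via the backend
    constraint: Callable[[], bool]  # e.g. voltages within limits (Monitoring Service)
    rollback: Callable[[], None]    # restore the previous safe-point

def execute_plan(steps: List[PlanStep]) -> bool:
    completed: List[PlanStep] = []
    for step in steps:
        print(f"executing: {step.description}")
        step.action()
        if not step.constraint():
            print("constraint violated, rolling back to last safe-point")
            for done in reversed(completed + [step]):
                done.rollback()
            return False
        completed.append(step)  # this step becomes the new safe-point
    return True
```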

4.1 iSSN Software Management System

For the software management of the iSSN, LarGo! follows an application lifecycle management approach normally used for Industrial Internet of Things (IIoT) use cases [18]. The architecture of this system is shown in Figure 8, with an optional App Store included. In any case, by design there is no direct communication between the App Store and the device.


Figure 8: Generic Application Lifecycle Management for IIoT use cases

In contrast to consumer IoT solutions, in IIoT use cases it must be avoided that external systems have any access to field devices, especially since in this architecture the App Store is not within the sphere of the operator. Thus, purchased applications are first downloaded by the Application Lifecycle Management Service (ALMS) to the local application repository, all situated within the operator's backend. Proper handling of licenses may also be done at this layer.

The operator manages the field devices through a UI on the backend side, where the enumerated application lifecycle tasks are issued. Tasks are communicated to the Application Lifecycle Management Agent (ALMA) running on the device, which executes specific shell scripts based on the file type and the task. ALMA additionally calls scripts periodically to check whether the apps are still up and running (health/life check). As shell scripts can be specified for any file type, this implementation proves to be very flexible and extendable. Furthermore, interfaces and communication mechanisms between the systems are kept exchangeable. With the approach used for the iSSN application lifecycle management, software vendors are not restricted to certain programming languages or platforms. By executing use-case-specific shell scripts, the generic ALMA can be adjusted for various use cases.
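To illustrate the dispatch idea (not the actual ALMA implementation), the sketch below selects a script by file type and task and runs it; the script directory layout and naming are assumptions.

```python
# Hypothetical agent-side dispatch: pick a shell script by file type and
# task, run it, and report success/failure.
import subprocess
from pathlib import Path

SCRIPT_DIR = Path("/opt/agent/scripts")   # assumed location, not ALMA's

def run_task(file_type: str, task: str, package: Path) -> bool:
    """Run e.g. /opt/agent/scripts/docker/install.sh <package>."""
    script = SCRIPT_DIR / file_type / f"{task}.sh"
    result = subprocess.run([str(script), str(package)],
                            capture_output=True, text=True)
    return result.returncode == 0

def health_check(app: str) -> bool:
    """Periodic life check: a per-app script returns 0 when the app is running."""
    script = SCRIPT_DIR / "health" / f"{app}.sh"
    return subprocess.run([str(script)]).returncode == 0
```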

4.2 BEMS Software Management

In addition to the secondary substation, the BEMS also plays an important role regarding grid stability and the integration of renewable energies. The BEMS acts as the communication center within the building and is able to perform load management tasks. Such tasks can be motivated by local optimization goals (e.g., maximization of its own Photovoltaic (PV) consumption) or triggered by remote actors like the distribution system operator or the iSSN to stabilize the grid. Thus, the BEMS needs to support various communication protocols to interact with devices in the building and with remote systems. It also needs to be extensible for customized applications. The iSSN Application Lifecycle Management could also be used in the customer domain. However, there is more than one way to fulfill the listed requirements and support the defined tasks. Furthermore, in non-academic setups we expect a mixture of software management backends with varying properties to be the standard, and not the exception. Thus, a second approach using the Open Service Gateway Initiative (OSGi) Service Platform was chosen for managing devices in the customer domain.

The OSGi Service Platform is a module system for the Java programming language which allows dynamic integration and remote management of software components. The OSGi specification is driven by the OSGi Alliance, a worldwide consortium with key members like IBM, Oracle, Deutsche Telekom, Huawei Technologies and further businesses from the open source software area. The OSGi Alliance was founded in 1999 and its mission is to create open specifications for network delivery of managed services to local networks and devices. OSGi has been adopted for solutions in IoT, Machine-to-Machine (M2M), Smart Home, Energy Management, Smart Meters, Healthcare, Automotive and various other domains.

There are different OSGi-based frameworks which could be used as a basis for the BEMS, such as OpenMUC [24], openHAB [23] and OpenEMS [22]. Due to the popularity of OSGi in the customer domain and the lack of a satisfying solution, we propose an OSGi-based deployment process as depicted in Figure 9. The approach consists of a server and a client part with different orchestrations of standardized OSGi services defined in the OSGi Compendium Release 7 [25]. The server part represents the repository server with released OSGi bundles and project-specific property files. A target device is a connected node with its own running OSGi framework.

Figure 9: OSGi infrastructure for repository server and target device


5 Roll-Out Analysis and Validation

In the LarGo! project, there has been an extensive investigation into defining, testing and validating a seamless, safe and secure application deployment process for both the grid and customer domains. Such a process will enable the mass roll-out of smart grid applications. These applications can be useful not only for grid and energy management, monitoring and control, but also for better serving the customers and improving the quality of service. The critical challenge of stable and resilient system operation is addressed in a setting where communication systems are used for both smart grid run-time operation, including monitoring and control, and ICT maintenance, such as application deployment and remote configuration. Such settings make this research project a large and complex multi-domain mosaic of engineering and scientific challenges.

The roll-out analysis and validation in the LarGo! project is mandated with testing and validating the roll-out process in different domains and at different scales. The four main activities in this regard are:

1. Large-scale application analysis

2. Lab-based roll-out and integration tests

3. Customer domain test-bed roll-out analysis

4. Grid domain test-bed roll-out analysis.

To perform these activities, several test-beds were designed, in some cases integrating components developed in other work packages. Additionally, the use and misuse cases along with the design of experiments were specified. Each test-bed was developed for a specific validation and verification activity, conforming with the required level of detail, scale, and method: on the one hand, the large-scale co-simulation-based validation and the lab-based real-time validations of a zoomed-in area; on the other hand, the field validations in both the customer and grid domains. The major findings and results from the four activities are summarized below.

The validation tasks have the same basic design of experiment, as visualized in Figure 10. The first step for each validation task is to execute a set of use cases. The use cases are cases where the roll-out of new software is successful. In a second validation step, misuse cases are applied to the validation tasks. These represent cases where the software roll-out is not successful, for different reasons. By comparing the results of the successful use cases with the results of the misuse cases it is possible to assess the effectiveness of the proposed recovery solutions. At the moment, all test activities are ongoing. The test-beds have mostly been set up and configured, and the first tests are expected to be executed soon.

Figure 10: The DoE Rationale.

5.1 Large-Scale Application Analysis

This task deals with developing the large-scale co-simulation environment that enables analysis across the domains: power grid, ICT, household applications, and substation applications. With the simulation environment it is possible to analyze utility-scale scenarios, studying the mutual influence between components, together with the system consistency and behavior under different sub-optimal grid operation states.

5.1.1 Overview

The development of the co-simulation test bed for Software In the Loop (SIL), as shown in Figure 2, turned out to be a major challenge. The reason for this lies in the complexity of the interacting components, spanning network stacks, IP packet parsing, container orchestration, simulation timing, and kernel/userland context switch handling.

The vif virtual interface must set up and tear down routes correctly. As the container starts with a standard network interface (usually eth0), the vif must take care to still have this interface and the corresponding route to mosaik available for communication with the vif-sim. The default route, however, must be set to the tun device. mosaik-vif achieves this by creating a high-priority defined route to the Internet Protocol (IP) address of the vif-sim, explicitly assigning it to the existing network device, before replacing the current default route with the new one. Setting up the correct route and restoring the default route in any case (i.e., during normal shutdown and in case of crash) is one aspect of the test bed validation.
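The equivalent route handling can be expressed with iproute2 commands, as in the sketch below; the vif-sim address and interface names are placeholders, and mosaik-vif performs these steps internally rather than by shelling out.

```python
# Route handling sketch using iproute2 (requires root inside the container).
import subprocess

VIF_SIM_IP = "203.0.113.10"   # placeholder address of the vif-sim endpoint

def setup_routes():
    # Keep an explicit route to the vif-sim over the original interface ...
    subprocess.run(["ip", "route", "add", VIF_SIM_IP, "dev", "eth0"], check=True)
    # ... then send everything else through the tun device.
    subprocess.run(["ip", "route", "replace", "default", "dev", "tun0"], check=True)

def teardown_routes():
    # Restore the original default route on shutdown or crash recovery.
    subprocess.run(["ip", "route", "replace", "default", "dev", "eth0"], check=True)
    subprocess.run(["ip", "route", "del", VIF_SIM_IP, "dev", "eth0"], check=True)
```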

An IP packet in a SIL setup traverses several Transmission Control Protocol (TCP)/IP stacks: that of the container when it enters the tun device, then again the host's stack when the serialized packet is transmitted to mosaik, afterwards the network stack implementation of OMNEST that handles packet parsing during its traversal of the simulated ICT environment, and finally the host's network stack again (twice) on the way back to another container. The assumption that the simulated ICT environment can be treated just like any other real network does not hold true. In theory, these network stacks are compatible; in practice, however, OMNEST needs to attach additional information to the packet internally that subtly changes the packet's treatment. For example, the packet's time-to-live (TTL) value is reset to OMNEST's default during serialization, which means that the TTL needs to be tracked separately during the packet's traversal of the ICT environment. Moreover, OMNEST requires internal calculation and checking of an IP packet's checksum; it does not trust the IP packet's values (as they can be arbitrary). OMNEST instead attaches a tag to indicate the validity of the CRC, which must be set explicitly. Overall, ensuring correct parsing and injection constitutes another part of the validation.

Incorrect packet de-serialization and re-injection into the container can also expose subtle bugs in application code. For example, an incorrect TCP segment size stemming from a parsing error can lead to incorrect buffer allocations, crashing the application.

Furthermore, for a realistic test bed, timing is crucial. TCP's timeout timers work in the range of seconds, so there are no phantom segment drops that lead to re-transmission, but the perceived round-trip time influences the congestion window scaling and, hence, the bulk transmission speed. Timers in the container can be influenced by adjusting container time through fake time libraries; however, not all applications are similarly affected by this. The Java Virtual Machine uses its own spin-down system for timeouts; in addition, spinlocks use kernel jiffies as a measurement. Therefore, the optimization of code paths is important; when soft real-time guarantees need to be given, the less time the system spends in I/O, the better. The most contended part of the execution is the number of context switches between kernel and user space. Code optimization therefore translates into the number of container hosts that can be accommodated in the simulated ICT environment.

5.1.2 Test Cases and Results

LarGo! has succeeded in creating a large-scale co-simulation test bed, overcoming many of the hurdles that have kept similar projects from reaching the same state. Developing a co-simulation that includes SIL components is unique. However, as a result, the test cases and simulation verification needed to concentrate on the test bed itself, verifying the correct operation of all components.

The ICT SIL test cases therefore were standard tests; we defined the round-trip time of packets as well as the average throughput of a bulk TCP transmission as quality metrics. As one can assume that a higher number of hosts communicating via the simulated ICT environment causes slower overall packet delivery, we increased the number of hosts in pairs of two and measured response times and throughput over 100 simulation runs with each configuration.


Figure 11: Experimental Measurements of Average Ping Times and Throughput

In order to establish the feasibility of the SIL co-simulation and quantify the development, we focused prominently on the ICT SIL simulations. For this, we have set up a co-simulation with a number of containers in which the iPerf3 [13] application was running. We have deployed pairs of clients and servers so that an iPerf client can send and receive data from a dedicated iPerf server container. We used this setup to test both the average round-trip times (i.e., ICMP echo request/echo reply timings) and TCP bulk transfer speeds. All data was routed through the simulated ICT environment, so that the flow of data was as follows: vif → vif-sim → mosaik → OMNeT++ → mosaik → vif-sim' → vif'. The simulated ICT environment does not impose additional artificial delays in its network model.
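A single client-side measurement of the bulk throughput metric could be driven and parsed as sketched below, relying on iPerf3's JSON output mode; the server host name is a placeholder and the actual test harness is not shown.

```python
# Sketch: run an iperf3 client against a server container and parse the
# JSON report to extract the received TCP bulk throughput.
import json
import subprocess

def bulk_throughput_kbytes_per_s(server_host: str) -> float:
    out = subprocess.run(["iperf3", "-c", server_host, "-J"],
                         capture_output=True, text=True, check=True).stdout
    report = json.loads(out)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 8 / 1000   # convert to kBytes/s

if __name__ == "__main__":
    print(bulk_throughput_kbytes_per_s("iperf-server.example.local"))
```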

Figure 11 shows the behavior for both metrics given a rising number of nodes. Each data point represents a different number of nodes and the average over 100 repeated simulation runs. Delays rise sharply as the number of nodes rises, but not exponentially. With ping times in the range of 23 ms to 447 ms, we assume that applications that do not rely on real-time or, in general, low-latency communication can be accommodated by this SIL setup. However, the bulk throughput rate between 6102 kB/s and 3654 kB/s is far below the characteristic data rate normally achieved by standard Ethernet connections.

We have investigated the reason for the low data rate and have identified three major points. First, mosaik currently uses non-compressed JavaScript Object Notation (JSON) messages in a request-reply pattern for data exchange with out-of-process co-simulators. As both the vif-sim and the mosaik-OMNeT++ adapter are written in C++, an additional network round-trip is introduced, even if the simulation runs locally. In addition, mosaik's single-threaded request-response communication pattern with its associated simulators means that dependent simulators experience a delay when other simulators are being stepped or queried for data.

Furthermore, mosaik currently has no facility for simulators to signal that they need to be stepped; simulator control lies completely in the hands of mosaik. This means that mosaik must poll all vif-sims as often as possible, since the co-simulator has no other way of knowing when data is available from a SIL container. In contrast to the ICT simulation, data from the applications arrives non-deterministically. In general, we have observed delays in message processing stemming from the context switches between kernel space and user space that frequently occur as data from the containerized applications travels through several network stacks.
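To make the polling pattern concrete, the following sketch shows a minimal out-of-process mosaik simulator based on the mosaik API (version 2). The model and attribute names are hypothetical and the actual vif-sim adapter in LarGo! is written in C++; the sketch only illustrates that mosaik drives step() and get_data(), so the adapter has to poll its SIL container on every step.

import mosaik_api

META = {
    'models': {
        'VifSim': {
            'public': True,
            'params': ['container_id'],
            'attrs': ['rx_bytes', 'tx_bytes'],
        },
    },
}


class VifSimAdapter(mosaik_api.Simulator):
    def __init__(self):
        super().__init__(META)
        self.entities = {}   # eid -> container_id
        self.cache = {}      # eid -> latest data polled from the container

    def init(self, sid, **sim_params):
        self.step_size = sim_params.get('step_size', 1)
        return self.meta

    def create(self, num, model, container_id=None):
        new_entities = []
        for _ in range(num):
            eid = f'vif_{len(self.entities)}'
            self.entities[eid] = container_id
            new_entities.append({'eid': eid, 'type': model})
        return new_entities

    def step(self, time, inputs):
        # mosaik controls the stepping; the adapter cannot signal that new
        # container data is available, so it simply polls every step.
        for eid, container_id in self.entities.items():
            self.cache[eid] = self._poll_container(container_id)
        return time + self.step_size

    def get_data(self, outputs):
        return {eid: {attr: self.cache[eid].get(attr, 0) for attr in attrs}
                for eid, attrs in outputs.items()}

    def _poll_container(self, container_id):
        # Placeholder: in the real setup this would read from the virtual
        # network interface (vif) of the given container.
        return {'rx_bytes': 0, 'tx_bytes': 0}


if __name__ == '__main__':
    mosaik_api.start_simulation(VifSimAdapter())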

Moreover, we currently launch one vif-sim process per container, as this is the easiest option from a software engineering perspective. However, this means a separate TCP connection, a new process, and a new data stream per container. We therefore plan to implement a multiplexing architecture in the vif-sim part in order to reduce the number of processes and, hence, the number of task and context switches.

We believe that this approach offers great flexibility and ease in modelling ICT networks with SIL. As the development of mosaik is open source and already aimed at providing higher throughput and lower delays in the communication with external simulators (for example, a ZeroMQ (https://zeromq.org/) implementation to replace the socket API already exists), and as the co-simulator is being extended to allow for event-discrete, non-deterministic simulators such as the ones in this scenario, we expect an increase in throughput in the near future.

For the co-simulation, WP6 has mostly served to detail the requirements and issues encountered in a SIL co-simulation of software roll-outs in the power grid, and to overcome the hurdles towards a successful test-bed setup. We have shown how the interaction of ICT and the power grid can be simulated and how complete containerized software stacks can be embedded into this co-simulation.

In the future, we expect optimizations on the implementation level, e.g., more efficient transports and serialization techniques, as well as zero-copy primitives to reduce the number of copy operations and context switches. From a broader research perspective, we expect that abstracting parts of the system through surrogate models [6] will provide a way to simulate large-scale roll-out procedures.

5.2 Lab-Based Roll-Out and Integration Tests

The laboratory-based validation and testing focuses on developing a realistic emulation of the field situation in the laboratory, using CHIL to test the application deployment processes. The use case selected for the laboratory implementation revolves around the residential feeder of the Conseil International des Grands Réseaux Électriques (CIGRÉ, https://www.cigre.org/) low-voltage European benchmark grid.

5.2.1 Overview

The electrical network used for the use case is based on the residential feeder of the CIGRÉ low-voltage European benchmark grid, shown in Figure 12. The main elements of interest for our application are highlighted in gray in Figure 12 and detailed below together with the rest of the network components. Black arrows are used in the figure to denote loads connected at a specific bus, while green arrows denote PV generation units connected at different buses.

The interface of each PV generator to the electrical network is a power-electronics converter. The main function of the inverter is to inject the active power produced by the PV into the grid. However, as required by most European grid codes, the converter participates in the voltage control of the network as a secondary function, by consuming or injecting reactive power according to a predefined characteristic. Several approaches exist for achieving this; the most popular at the moment is the Q(U) droop characteristic [1].
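As an illustration of such a characteristic, the following sketch implements a piecewise-linear Q(U) droop with a dead band. The dead band, voltage limits, and reactive power limit are example values only, not the parameters used in the LarGo! controllers.

def q_u_droop(v_pu: float,
              q_max: float = 10.0,   # maximum reactive power [kvar]
              v_dead_lo: float = 0.97, v_dead_hi: float = 1.03,
              v_min: float = 0.92, v_max: float = 1.08) -> float:
    """Return the reactive power set-point [kvar] for a voltage in per unit.

    Positive values mean injection (support at under-voltage),
    negative values mean absorption (at over-voltage).
    """
    if v_dead_lo <= v_pu <= v_dead_hi:
        return 0.0                               # dead band: no contribution
    if v_pu < v_dead_lo:
        slope = q_max / (v_dead_lo - v_min)
        return min(q_max, slope * (v_dead_lo - v_pu))
    slope = q_max / (v_max - v_dead_hi)
    return max(-q_max, -slope * (v_pu - v_dead_hi))


if __name__ == "__main__":
    for v in (0.90, 0.95, 1.00, 1.05, 1.10):
        print(f"U = {v:.2f} p.u. -> Q = {q_u_droop(v):+.1f} kvar")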


Figure 12: The Physical Layer. Schematic representation of the residential feeder of the CIGRÉ low-voltage European benchmark grid (substation node S1 and buses R1 to R18).

Node          R1     R11    R15    R16    R17    R18
Smax [kVA]    200    15     52     55     35     47
pf            0.95   0.95   0.95   0.95   0.95   0.95

Table 1: Rated load power and power factor for each node of the network.

The electrical network was simulated in a real-time simulator, an Opal-RT simulator (https://www.opal-rt.com/). Models for Opal-RT need to be developed using Matlab/Simulink; therefore, the first step was to model the electrical network introduced earlier in Figure 12 using the required tools.

On top of the power system layer there are two layers with software components. An overview of the different layers and how the components are connected to each other is given in Figure 13. The Application Layer contains the control components deployed on the iSSN and the BEMS. The top layer is the Security Layer, containing the ADs and the REASENS framework.

The software components used for the lab validations are the following:

• iSSN, as the name suggests, the intelligent and secure substation node is a software ecosystem that is intended to run in the substation. For our CIGRÉ network described above, it runs at node S1. As previously described in Section 4.1, several applications can run at the same time inside the iSSN, handling the operation of different devices.

• DP-i, the iSSN software deployment platform, is another software module included in our use case. This module serves two purposes: on the one hand it monitors the status of all the apps currently running on the iSSN, and on the other it can remotely install new apps as well as provide updates or patches to the already running software. This module would typically not run in the substation itself, but rather on a centralized platform from where it can target and monitor multiple iSSNs. A bidirectional communication layer is required between the DP-i and the iSSN, as indicated by the blue dotted line in Figure 13.

• BEMS, the building energy management system application, runs in every node of the electrical network where a residential load is connected, namely nodes R11, R15, R16, R17, and R18. The BEMS also runs as a software ecosystem, which incorporates multiple modules and which can be extended at any time with new modules.

• DP-B, the software deployment platform for the BEMS, serves a similar purpose as DP-i in the case of the iSSN. It runs as a central component and can trigger a new software deployment or a software update for the entire fleet of BEMS that it monitors and manages. Dedicated bidirectional communication, as indicated by the green dashed line in Figure 13, is required between the DP-B and the BEMS instances that it manages.

• AD, the anomaly detection component, is intended to detect voltage anomalies at customer premises (BEMS). The component comprises two methods: Kalman residual-based anomaly detection and Cumulative Sum (CUSUM). The first monitors the difference between the expected voltage value (estimated with a discrete Kalman filter) and the observed one; once the difference exceeds a threshold, an alert is generated. The Kalman residual-based anomaly detection is configured to detect abrupt changes in voltage levels. In addition, changes in the mean of the voltage distribution are detected using CUSUM. By setting the detection threshold higher, CUSUM is configured to detect voltage drifts rather than drastic changes. A minimal sketch of these two detectors is given after this list.

• E-Net, the evidential network, is used to perform root cause analysis based on the input from the distributed anomaly detection components and runs in the REASENS framework described in Section 3.2. An evidential network is a graph structure for knowledge representation and state inference. The nodes of the network represent variables, such as distributed sensors, reported anomalies, and assumed system states. The connections between the nodes encode the causal relationships between them. For instance, detected voltage anomalies imply, with some level of confidence, that the system is in an erroneous state. The output of the evidential network is configured to estimate the probability of the system and the roll-out process being in a normal, erroneous, or malicious state.

• KBSM, the inter-domain software management system used to create and execute roll-out plans. This component interfaces with the AD for status monitoring during a roll-out, with DP-B to initiate changes of the software setup in the BEMS, and with DP-i to initiate changes of the software setup in the iSSN.

Figure 13: Overview of all the layers involved in the laboratory use case (physical layer, application layer, and security layer).
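As referenced in the AD item above, the following sketch shows the two detector types in their simplest scalar form: a residual check on a Kalman filter and a two-sided CUSUM. The process model (a random walk), the noise covariances, and the thresholds are illustrative assumptions, not the parameters of the LarGo! implementation.

class KalmanResidualDetector:
    """Flags abrupt voltage changes when the innovation exceeds a threshold."""

    def __init__(self, v0=400.0, q=0.01, r=0.5, threshold=10.0):
        self.x, self.p = v0, 1.0   # state estimate and its variance
        self.q, self.r = q, r      # process and measurement noise variances
        self.threshold = threshold

    def update(self, v_measured: float) -> bool:
        # Predict (random-walk model), then compute the innovation (residual).
        p_pred = self.p + self.q
        residual = v_measured - self.x
        alert = abs(residual) > self.threshold
        # Correct the estimate with the standard Kalman gain.
        k = p_pred / (p_pred + self.r)
        self.x += k * residual
        self.p = (1.0 - k) * p_pred
        return alert


class CusumDetector:
    """Flags slow drifts in the mean of the voltage (two-sided CUSUM)."""

    def __init__(self, mean=400.0, slack=2.0, threshold=30.0):
        self.mean, self.slack, self.threshold = mean, slack, threshold
        self.s_pos = self.s_neg = 0.0

    def update(self, v_measured: float) -> bool:
        dev = v_measured - self.mean
        self.s_pos = max(0.0, self.s_pos + dev - self.slack)
        self.s_neg = max(0.0, self.s_neg - dev - self.slack)
        return self.s_pos > self.threshold or self.s_neg > self.threshold


if __name__ == "__main__":
    kalman, cusum = KalmanResidualDetector(), CusumDetector()
    for v in [400.0, 399.5, 400.2, 338.0, 339.0]:  # abrupt drop at sample 4
        print(v, kalman.update(v), cusum.update(v))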

One of the challenges when using the real-time simulator for validating the ideas in LarGo! was interfacing the software components to the simulator, i.e., emulating the connections between the physical layer and the application layer in Figure 13. For the LarGo! case there are no hard constraints regarding the refresh rate of the data exchange between the simulation and the applications. Instead, there is a large number of data signals that need to be exchanged, which is typically not handled very well by real-time simulators. For example, each BEMS requires measurements of the voltage magnitude, the active and reactive power consumption of the loads, and the power production of the PV. At the same time, each BEMS needs to send back power curtailment commands and controller settings for the PV inverters. As the number of BEMS increases, these signals stack up and conventional interfacing methods cannot be used. In order to circumvent these aspects, we interfaced the Opal-RT simulator to Lablink. Lablink is a middleware package developed at AIT which is based on a simulation message bus approach [17]. Lablink allows quick and easy coupling of software and hardware components in a lab context.
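The message-bus pattern used here can be illustrated with a generic MQTT client. The following sketch is not the Lablink API (Lablink ships its own client libraries); it only shows the pattern of consuming simulator measurements and publishing set-points back, using paho-mqtt with hypothetical topic names and a local test broker.

import json
import paho.mqtt.client as mqtt

BROKER = "localhost"                     # assumption: local test broker
MEAS_TOPIC = "sim/opal/bems/R15/meas"    # hypothetical topic names
SETPOINT_TOPIC = "bems/R15/setpoint"


def on_message(client, userdata, msg):
    meas = json.loads(msg.payload)       # e.g. {"v": 398.2, "p_load": 12.3, "p_pv": 4.1}
    # Trivial stand-in for the BEMS logic: curtail PV above 420 V.
    setpoint = {"pv_curtail": meas["v"] > 420.0, "q_set_kvar": 0.0}
    client.publish(SETPOINT_TOPIC, json.dumps(setpoint))


client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER, 1883)
client.subscribe(MEAS_TOPIC)
client.loop_forever()                    # blocks and dispatches callbacks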

In a real-life setup, the various software components that run in the application and security layers would be distributed across embedded computers placed at various locations in the electrical grid; e.g., the iSSN and all of its components would run on an embedded device located in the substation, while the BEMS would run on devices located at the end-consumer level. A dedicated or public communication network would enable all of these devices to exchange data.


In order to emulate this setup in the lab, we used a combination of single-board computers, servers, and communication hardware, in conjunction with the real-time simulator. For emulating the embedded computers distributed across the electrical grid, we used several Raspberry Pi (RPI) single-board computers. Using a dedicated communication network, the RPIs were connected to the real-time simulator. The iSSN, the BEMSs, and the ADs were each deployed to one of the RPIs, as shown in Figure 14.

Figure 14: Mapping of the CIGRÉ use case components to the laboratory setup (the OPAL-RT real-time process is coupled via the Lablink/Node-RED message bus to Raspberry Pis 1 to 6 hosting the iSSN S1 and the BEMSs R11, R15, R16, R17, and R18; Raspberry Pis 7 to 12 host the E-Net and the ADs; the deployment platforms DP-i and DP-B are connected via the iSSN and BEMS management buses).

5.2.2 Test Cases and Results

The three-layer setup presented in Figure 13 and detailed in the previous sections was used as a use case for the project. Progressively, as the different layers were developed and included in the laboratory setup, tests of varying degrees of complexity were performed throughout the WP execution. In the following we present a subset of these tests, mainly those which include all three layers of the use case, i.e., the physical, application, and security layers.

The results from the normal operation of the grid are shown in Figure 15. The test starts with all the voltages in the network at the nominal value of 400 V. Also, in the beginning, the BEMS has the power consumption of the building, i.e., Pload, as well as the active power injection of the PV installations, i.e., Ppv, set to zero. Since there is no voltage deviation, the droop voltage controllers do not change the reactive power injection of the PV installations, i.e., Qpv. During this time, as the voltage gradient is very smooth, the Kalman anomaly detection components (AD_kalman alert signal) report no alerts to the evidential network, which in turn labels the current system state, i.e., EN_system_state, as normal.

Starting from around the 13:37:00 mark, the power consumption of the building, Pload, is increased at all nodes with a considerably large step size. At the physical layer this causes the voltage to drop significantly, resulting in voltages lower than 340 V at the nodes towards the end of the feeder. However, since the PV power injection is still zero, and since the standard droop control implemented in the BEMS has a power factor limitation, the reactive power injection is still zero. At the security layer, the ADs are activated by the large voltage deviation and start sending alerts to the evidential network, which starts to label the system state as erroneous. As can be seen in Figure 15, there is a slight delay between the moment when the anomaly detectors start emitting alerts and the evidential network changing its reasoning about the system state. This is due to the different sampling rates at which these components report data to the database: while the AD components report data every 2 seconds, the EN component reports data every 10 seconds.

Finally, starting around the 13:39:00 mark, the power injection of the PV at all nodes is also increased and, as can be seen from Figure 15, this causes the voltage in the network to rise. Since there is still a considerable voltage deviation, the droop controllers are activated and start injecting reactive power (Qpv), thus contributing to the voltage restoration. Once the voltage stabilizes, the anomaly detectors stop emitting alerts and the evidential network returns to the normal system state.

The power changes used in this test case were intentionally selected to be of extra large magnitude in order to exemplify the behaviour of the system.


Figure 15: Normal operation test case. From top to bottom the figure shows: the voltage at the different nodes (R11, R15, R16, R17, R18) in V, the active power of the PVs installed at the different nodes in kW, the reactive power injected by the droop controllers in kVar, the active power consumed by the loads in kW, the output of the evidential network (EN_system_state: normal, erroneous, malicious), and the anomaly detection alerts (AD_kalman).

In the second test case, the BEMS deployment platform (DP-B) triggers an update of the droop controllers inside the BEMS. The results of this experiment are shown in Figure 16. The standard droop controller used in the previous test case, labeled "A" in the first plot of Figure 16, is updated with a new controller "B" at around the 14:55:00 time mark.

Controller B is the optimized droop control algorithm presented in Section 3.2. However, an error was introduced in the design and integration phase, which is detected by the ADs: instead of measuring the voltage from line to ground, it is measured from line to line. This can be seen from the plots displayed in Figure 16. As soon as the new controller is deployed, the PV inverters completely stop injecting active power and prioritize reactive power injection in order to compensate the erroneously perceived over-voltage. The anomaly detection components immediately start emitting alerts of suspicious voltage behavior and the evidential network increases the probability of an erroneous roll-out state, EN_rollout_state. Unlike in the normal operation case, the voltage oscillations introduced by the erroneous controller do not go away until the controller is stopped. The network operator or the person responsible for the software roll-out will notice that the erroneous state is not transient and can decide to take corrective actions.
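The effect of this integration error can be illustrated numerically: a controller expecting a line-to-ground (phase) voltage that is fed the line-to-line voltage sees a value larger by a factor of sqrt(3). The nominal values and the simple over-voltage check below are illustrative only, not the LarGo! controller logic.

import math

V_NOMINAL_LN = 230.0                 # nominal line-to-ground voltage [V]
OVERVOLTAGE_LIMIT = 1.10 * V_NOMINAL_LN

v_line_to_ground = 231.0             # healthy grid voltage [V]
v_line_to_line = math.sqrt(3) * v_line_to_ground   # roughly 400 V

for label, v in [("correct (line-to-ground)", v_line_to_ground),
                 ("erroneous (line-to-line)", v_line_to_line)]:
    perceived_pu = v / V_NOMINAL_LN
    overvoltage = v > OVERVOLTAGE_LIMIT
    print(f"{label}: {v:6.1f} V = {perceived_pu:.2f} p.u. "
          f"-> over-voltage reaction: {overvoltage}")

# The erroneous measurement appears as roughly 1.73 p.u., so the controller
# prioritizes reactive power to compensate the perceived over-voltage, as
# observed in Figure 16.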


Figure 16: Erroneous droop controller deployment test case. From top to bottom the figure shows: the type of droop controller (A: standard droop controller, B: badly designed droop controller), the voltage at the different nodes (R11, R15, R16, R17, R18) in V, the active power of the PVs installed at the different nodes in kW, the reactive power injected by the droop controllers in kVar, the active power consumed by the loads in kW, the output of the evidential network (EN_rollout_state: normal, erroneous, malicious), and the anomaly detection alerts (AD_kalman).

Once the operator is alerted about the erroneous system state, a rollback is issued. After the 14:58:00 time mark, a manual rollback command is sent to the BEMS deployment platform, as seen in Figure 17. The standard droop controllers are deployed to the PV inverters, the system returns to a normal operating point, and the anomaly detection components stop emitting alerts.

Figure 17: Rollback to standard droop controller test case. From top to bottom the figure shows: the type of droop controller (A: standard droop controller, B: badly designed droop controller), the voltage at the different nodes (R11, R15, R16, R17, R18) in V, the active power of the PVs installed at the different nodes in kW, the reactive power injected by the droop controllers in kVar, the active power consumed by the loads in kW, the output of the evidential network (EN_rollout_state: normal, erroneous, malicious), and the anomaly detection alerts (AD_kalman).

The activities in the lab validation task achieved the promised goals: they emulated all the layers involved in the smart grid software deployment approach proposed by LarGo!. While only a zoomed-in use case was tackled in this task, the proposed approach has great potential for scaling to larger scenarios and is currently being further developed by AIT. Although most of the required components representing the various layers were integrated into the test-bed, it was still not feasible to coordinate a test involving all of the developed components, due to the increased level of complexity. This complexity arises from the interaction and coordination of the multiple layers, including management, deployment, coordination, monitoring, rollback, and security, that have to work together in a coherent way.

5.3 Test-Bed Roll-Out Analysis

In the project, two field tests were carried out. In the first field test, the focus is put on the testing and validation of the customer-domain-specific application deployment process in the Fellbach test-bed in Germany. In the second field test, a grid-domain test-bed roll-out analysis is made, where a measurement system application for a substation is deployed and validated in the Smart City Aspern test-bed in Vienna.

5.3.1 Customer Domain Field Tests in Fellbach

The customer-domain test-bed is part of the project C/sells [11] and is located in the city of Fellbach in Germany. It consists of three buildings, each of which is equipped with a 10 kWp PV system, a building energy management system (BEMS), and a 22 kW AC charging infrastructure. The goal of the C/sells project is to demonstrate congestion management, grid-friendly charging of electric vehicles, and PV self-consumption maximization for private households. In LarGo!, the test-bed was connected to our deployment server to remotely install and update applications on the BEMS target devices. The overall setup for the test cases is shown in Figure 18.

Figure 18: Overview of the test-bed setting.

As preparation for the validation, two of the C/sells applications were adapted to the LarGo! deployment process. They were extended with a specific interface and encapsulated in deployment packages, which are the deployable units of the OSGi deployment process. More details on the deployment architecture are presented in Section 4.2. For each of the applications, an initial deployment package for installation and a deployment package for the update were created. The updates were intentionally misconfigured in order to test the rollback mechanism and anomaly detection for resilient system behaviour.

In normal operation, the BEMS receives a PV forecast from an external service provider. Together with the local load forecast, energy demand, and departure time, the PV forecast serves as an input parameter for the optimization of the charging schedule of electric vehicles. The following test case shows how runtime errors might occur some time after an update was made. To show how this can be handled, an intentionally misconfigured update was rolled out to validate the AD. Based on historical measurements, the load forecast application normally calculates the load forecast for the next 48 hours. For the update, a wrong building model was chosen, leading to a much higher load prediction. Since the load forecast is one of the parameters for the PV self-consumption maximisation, the system runs in a sub-optimal state. This could cause financial losses to the user and a higher grid consumption.

The test starts with the transfer of the deployment package csells-loadforecast-1.0.1.dp from the server to the target. Once the DeviceAgent has received the deployment package, it uninstalls version 1.0.0 of the csells-loadforecast application. Afterwards, the new version of csells-loadforecast is installed. This time, no error occurred during the update process. After some time, a new load forecast was generated and the local anomaly detection module detected differences in the load profile. Thus, it informed the deployment server about the anomaly.
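A plausibility check of this kind, which produces the "increased load profile" result visible in Listing 1, can be sketched as follows. The baseline construction and the 50 % deviation threshold are assumptions for illustration; the actual anomaly agent is part of the OSGi-based BEMS.

from statistics import mean


def validate_forecast(new_forecast_kw, historical_profiles_kw, rel_threshold=0.5):
    """Compare a new 48 h forecast against the average historical profile.

    Returns 'OK', 'INCREASED', or 'DECREASED'.
    """
    baseline = [mean(values) for values in zip(*historical_profiles_kw)]
    mean_new, mean_base = mean(new_forecast_kw), mean(baseline)
    if mean_new > (1.0 + rel_threshold) * mean_base:
        return "INCREASED"
    if mean_new < (1.0 - rel_threshold) * mean_base:
        return "DECREASED"
    return "OK"


if __name__ == "__main__":
    history = [[1.0, 1.2, 0.8, 1.1]] * 7          # last 7 days (toy resolution)
    wrong_model_forecast = [2.6, 2.9, 2.4, 2.7]   # output of the wrong building model
    print(validate_forecast(wrong_model_forecast, history))  # -> INCREASED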

Listing 1 shows the log output of the load forecast test case. The lines start with either S for the deployment server or B for the BEMS. The test case was successfully executed and the anomaly was detected as expected.

The new deployment system was successfully put into operation at the demonstration site. For this purpose, a deployment server was set up at Fraunhofer ISE in Freiburg and pre-existing applications from the C/sells project [11] were adapted to the new deployment process. The tests conducted confirmed that the developed deployment infrastructure works under real conditions. The deployment server was able to request the current state of the applications as well as meta information from the target device, such as free memory or the current CPU load. Software updates were manually triggered on the server, transferred to the target, and installed during runtime. The installation process was performed and monitored by a local deployment service on the target. This service also takes care of errors which might occur during the installation process. In case of an error, it informs the deployment server and performs the necessary steps for a rollback to the previous version. Runtime errors related to the updated application, which might occur after some time, can be detected by sensors for anomaly detection. A prototype of such a sensor was used for the second test case; it forwards detected anomalies to the deployment server, where a human operator can take further steps. These mechanisms and the fallback strategies implemented in the apps ensure a resilient system operation for this test site.

S:18:43:07.096 d.f.i.d.b.c.DeployShellCommand - start deployment: dp/csells-loadforecast-1.0.1.dp on device BEMS-1
B:18:43:07.525 o.o.l.b.depl.BemsLocalDeployer - start deployment: dp/csells-loadforecast-1.0.1.dp
B:18:43:07.584 o.o.deploy.RegistrationHandler - service changed o.o.f.c.hems.api.forecast.LoadForecastService
B:18:43:07.637 o.o.f.c.h.LoadForecastComponent - Load forecast component is going down!
B:18:43:08.264 o.o.deploy.RegistrationHandler - service changed o.o.f.c.hems.api.forecast.LoadForecastService
B:18:43:08.338 o.o.f.c.h.LoadForecastComponent - LoadForecastComponent started!
S:18:43:08.515 d.f.i.d.b.c.DeployShellCommand - deployment successful: dp/csells-loadforecast-1.0.1.dp on device BEMS-1
B:19:00:00.001 o.o.f.c.h.LoadForecastComponent - generated new forecast
B:19:00:00.863 o.o.r.a.AnomalyAgent - event: load forecast changed
B:19:00:01.592 o.o.r.a.AnomalyAgent - forecast validation - result: increased load profile
B:19:00:01.834 o.o.l.b.depl.BemsLocalDeployer - AnomalyNotification: Load forecast component - State: INCREASED
S:19:00:02.029 d.f.i.d.b.c.DeployShellCommand - AnomalyService alert: device BEMS-1 loadforecast State: INCREASED

Listing 1: Log messages for an erroneous update of the load forecast algorithm.

5.3.2 Grid Domain Field Tests in Aspern

The second set of field tests in LarGo! was related to the grid domain. This section describes these field tests and how they were used to show the usability of the LarGo! deployment architecture within the ICT system of a Distribution System Operator (DSO).

The main objective of the field validation for the grid domain is to validate how the application deployment process can be installed and utilized within the Aspern test-bed in Vienna. The Aspern test-bed (https://www.ascr.at/en/) is part of Wiener Netze's Vienna distribution grid and implements an integrated system in the building, energy grid, ICT, and user domains at three sites with mixed use (student residences, residential buildings, and a school), and includes a distribution grid with twelve substations. Thus, although classified in LarGo! as a field test-bed, Aspern is subject to the same strict rules that apply to all distribution grids in Austria. Therefore, this test gives a good overview of how the LarGo! deployment architecture fits into existing distribution systems.

Due to the increasing effort of maintaining and deploying the hardware and software needed for the currently ongoing large-scale smart meter roll-out in Austria, a validation scenario was selected to show how the LarGo! deployment solution can automate and support this roll-out. When new smart meters are rolled out, a corresponding infrastructure is also needed in order to collect data and to configure and update the meters. For the LarGo! partner Wiener Netze, one of the central components of the smart meter roll-out is the Substation GateWay (SGW). It is located at substation level, collects and aggregates data from the underlying smart meters, and sends this data onward to the control center. Since the SGW is connected to a large number of smart meters, its installation and calibration can be an error-prone process, especially when done manually, as is the current state of practice.

For the grid-domain field validation, the configuration of a new SGW was used as the main motivation. The idea for how the LarGo! deployment platform can be used for the installation and configuration of a new SGW can be summarized with the following steps:

1. A new SGW is installed in a substation and connected to a back-end layer, where the deployment management is running, as well as to the smart meters installed at household level and associated with the substation.

2. The correct SGW configuration is automatically installed (App-1, App-2, ...) depending on the nameplates of the connected smart meters.

3. An erroneous configuration is detected by the Anomaly Detection (AD) algorithm, which reports this to the control center.

4. The correct configuration is confirmed by the engineer or can be manually edited by installing new apps in case of errors.


Figure 19: Validation setup for the field tests in Aspern (App Store and App Lifecycle Management Service on the control-center back-end; SGW1 and SGW2 with firmware, the App Lifecycle Management Agent, and Modbus-connected sensors in the substation).

As indicated above, the configuration of the SGW is done by deploying the correct apps depending on the connected sensors or smart meters. In other words, the configuration is contained within an app, and by deploying it to the SGW the configuration process is executed.

With this motivation in mind, the following two goals were defined for the field validations in the Aspern test-bed:

• Goal 1: Show that the deployment process can be successfully integrated into Wiener Netze's distribution ICT system.

• Goal 2: Show how the LarGo! deployment process can be used to automate and support the engineers with the configuration process of the smart meter infrastructure.

In order to be able to deploy apps to the SGW, the deployment architecture described in Section 4 was installed at the Aspern test site. Due to access restrictions for addresses outside Wiener Netze's own network, it was not possible to install the App Store in the cloud. Instead, the App Store was included in conjunction with the ALMS on the back-end in the control center, as shown in Figure 19. Furthermore, only one device was used, shown as SGW2 in Figure 19, which was installed in addition to the SGW already present in the substation. The reason for this was that the kernel used in the existing SGW was not compatible with the ALMA software and the apps. Therefore, a Ruggedcom RX1400 was used instead. On this device, the ALMA was installed, which allowed for remote deployment of apps from the control center. Furthermore, possible interferences with the grid operation could be minimized by using this approach.

Using the solution shown in Figure 19, it is possible for the operator to select the required app from the App Store and download it to the Ruggedcom (SGW2), where the ALMA installs it and monitors it during its lifecycle.

The goal of the main test case was to show that the LarGo! deployment process can be used to support engineers with the deployment and configuration process of the ICT infrastructure needed for the smart meter roll-out. The test case shows how the LarGo! infrastructure seen in Figure 19 is used to install a new configuration (app) on a Ruggedcom. In the end, the correct configuration is also confirmed by the ALMS/ALMA services. Since the Ruggedcom was not installed directly in the substation, it was not possible to access measurement sensors in the field. Therefore, an app that emulates measurements was installed instead. Furthermore, since no real measurement equipment was used, the automatic detection of the meter nameplate was skipped (see Step 2 above). The test case includes the following steps:

1. Acquisition of the measurement emulation app in the App Store.
2. Configuration of the acquired app to fit the needs of the DSO environment.
3. Deployment of the app to the Ruggedcom.
4. Confirmation that the app was successfully installed.

In the first step, the engineer from the DSO chooses to purchase an app from the App Store, as seen in Figure 20. For the configuration of the SGW/Ruggedcom, an app bought from the App Store can be used to configure multiple devices, so the purchase only has to be done once.


Figure 20: App Store

Once the app is bought, it is downloaded to the DSO's back-end, where it can be further configured and finally deployed. An example of what this back-end application can look like is shown in Figure 21. In order to distinguish between the public domain and the DSO's domain, the App Store uses a light background, whereas the back-end of the DSO has a dark background color. In the Application Configuration view, seen in Figure 21, all the apps that have been acquired are shown to the engineer. Different configurations can be made depending on the type of the app, e.g., IP addresses. The Application Configuration view also shows the required resources and capabilities of the app to the engineer.

Figure 21: Application Configuration

After the app has been properly configured, it can be deployed to a single device or to a group of devices. In most cases, the same app is used on more than one device, but with a slightly different configuration. For such cases, device groups can help: by assigning an app to a device group, it will automatically be deployed to all the devices in the group. Configurations, as shown in the previous step, can be made group-wise or device-wise. In Figure 22, two device groups are shown, where DeviceGroup 01 contains only one device, the Ruggedcom/SGW (here labeled LarGo device). On the right side of the figure, the available apps are also shown. By selecting a device group and then pressing Install for an app, the app will be installed on all the devices belonging to the selected group. When Install is pressed, the ALMS running on the back-end connects to the ALMA services running on the devices within the group and issues a download and installation of the selected app. Once an ALMA service receives the app, it installs and starts it on the device.

The final step of the deployment for the engineer is to confirm that the installation of the app was successful. Once the ALMA service has successfully installed the new app on the device, it sends a confirmation back to the ALMS service running on the back-end. That way, direct feedback on whether the deployment was successful is available to the engineer. Furthermore, after installation, the ALMA services on the devices continuously monitor the app. The monitoring can detect if the app crashes or has stopped working. If this happens, the engineer is immediately informed, and it is also possible to configure the ALMA to automatically restart the app if needed. Figure 23 shows how the user interface informs the engineer of a successful installation of the app on the SGW. As can be seen, it is also possible for the engineer to manually start or stop the app from the user interface.


Figure 22: Application Deployment

Figure 23: Result of a successful app deployment to the SGW.

Although it was not possible to implement all the steps mentioned in the motivation, the grid-domain field evaluation was still a success. One of the most challenging aspects of the grid-domain field evaluation was bringing the LarGo! deployment architecture into the ICT system of Wiener Netze. Although already modern, the ICT system of Wiener Netze is mainly intended to be configured using manual processes, such as the installation of new software, the configuration of devices, and updates of existing software, which are done on a needs basis. It was therefore very interesting to see how the partly automated processes used in the LarGo! system could be integrated.

The installation and integration of the LarGo! deployment framework into Wiener Netze's ICT infrastructure required much more time and resources than the actual test case. This is quite common when new processes are integrated into existing ICT systems. For the LarGo! project, it was also a very good indicator of how important the results from the project can be to improve the efficiency of future smart grid roll-outs and the adoption potential of new smart grid solutions. The steps that were taken in order to integrate the LarGo! solution are the same manual steps that are often needed to install new software into ICT systems. Thus, this also gives an overview of how complicated and resource-demanding this can be. However, if this process can be automated using the LarGo! deployment system, these resources can be saved and new smart grid solutions can be adopted faster. This was also demonstrated by the test case described above, where the time to install the app was no more than 30 minutes, compared to the installation and integration of the framework, which required multiple days.

6 Discussions and Lessons Learned

The LarGo! project has shown that the deployment of software in smart grids is a topic that needs to be considered with care in future system implementations. The more software is included, on different levels, in the power system, and the more dependent an optimal operation of the power system is on software applications, the more critical it becomes to keep all systems up to date. One of the innovative concepts of LarGo! is that software updates cannot be considered independently from the underlying power system. The smart grid is indeed a CPS, and as such, the whole system should always be considered for any actions taken on the cyber part.

In LarGo!, the problem was approached from two directions: from a theoretical direction, considering solutions to design secure and resilient applications, and from a more practical direction, with implementations of deployment frameworks for the grid domain and the customer domain as the goal. In this section, the lessons learned from these two approaches are summarized, together with some finishing remarks and a discussion of future opportunities.


Resilient Rollout Scheduling
Important assumptions in the software update roll-out problem considered in Section 3 were the presence of a failure detection system and a known nominal operating state. In practice, these assumptions may be hard to satisfy exactly. Complicated faults may not present themselves until some time after the update, and may be triggered by interaction with devices that have not yet been updated. Also, because of the fluctuating state in LV grids with high renewable penetration, any assumptions made on the state may change very fast. These complications could possibly be handled by solving our problem in a receding horizon manner (cf. [5]). Then only the first few slots of the schedule are actually implemented before the schedule is recomputed, taking the latest network state and new constraints into account. Such online execution requires a fast implementation, which is an interesting topic for future research. A minimal sketch of such a receding-horizon loop is given below.
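The following conceptual sketch illustrates the receding-horizon execution mentioned above. The functions measure_network_state, compute_schedule, and apply_slots are hypothetical placeholders; in practice, compute_schedule would wrap the minimum-time roll-out scheduling problem (cf. [12]) and return an ordered list of update slots.

COMMIT_SLOTS = 3   # only the first few slots of each schedule are executed


def receding_horizon_rollout(devices, measure_network_state,
                             compute_schedule, apply_slots):
    remaining = set(devices)
    while remaining:
        state = measure_network_state()                 # latest grid state and constraints
        schedule = compute_schedule(remaining, state)   # full roll-out schedule
        committed = schedule[:COMMIT_SLOTS]             # implement only the head of the schedule
        updated = apply_slots(committed)                # perform the updates, with monitoring
        remaining -= set(updated)
        # On the next iteration the schedule is recomputed with the new
        # network state and the devices that still await the update.
    return True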

Root Cause Analysis and Adaptive Software Rollout
In the REASENS framework, we have performed root cause analysis with an Evidential Network (EN). Although the EN has proved very suitable for the proposed task, there are still some shortcomings that could be addressed in future research. First of all, the structure and configuration of the EN is based on expert knowledge, which is specified in the form of IF-THEN clauses. The relationships between variables (the structure of the network) and their assigned values are quantified using a predefined linguistic scale (e.g., probable, very likely, possible, etc.), as explained in detail in Friedberg et al. [16]. An alternative approach would be to quantify the belief functions using a data-driven approach, i.e., to perform supervised learning of belief functions based on a labeled set of samples. Secondly, the EN reasons about the current state given the evidence from the sensors that occurred recently (within a predefined time window). We could explore the possibility of an adaptive time window and weighted evidence (with weights based on proximity in time). For instance, old events would have very low weights and therefore a low impact on the overall result. Finally, it would be worth investigating methods that account for temporal aspects of events, such as sequences of events that might be meaningful (e.g., the fact that an error occurred after a roll-out might imply a causal relation between those two events). Methods such as Hidden Markov Models (HMM) or Recurrent Neural Networks (RNN) could be applied.

The set of applications that is needed to fulfill today's and future needs of customers and grid operators is large and diverse. Thus, the implementation and the design of the interfaces between these applications is diverse too. An application that controls a tap changer has different timing constraints than the voltage control in the BEMS or the PV inverter. In addition, there are multiple protocols and interface types used in the customer and grid domains, which can neither be easily replaced by "the one" solution, nor is it likely that the stakeholders will agree on such a solution. Thus, designing a template for future applications and their interfaces in the targeted domains is near impossible. However, there are design rules that can make the integration and development of applications for mass roll-out easier. Based on the experience gained in the LarGo! project, the following design rules should be applied for applications in a CPS like the smart grid:

Increased Modularity
Modularity sometimes adds complexity to the design process, since the functionality to be implemented must be broken down into modules for which clear interfaces need to be designed and implemented. However, the result is a system in which components can easily be replaced, extended, or updated to new versions without the need to change the complete system. This reduces the effort for a mass roll-out and reduces the amount of data that has to be transferred for the update process. In addition, this enables the operator to adapt the configuration of the used modules to the needs of the environment. The Fraunhofer BEMS is based on OSGi, while the iSSN implements a microservice-based architecture. Both approaches thus support the design and implementation of modular systems.

Usage of Digital-Twins
In order to support automatic or semi-automatic planning and validation of the system, the planning and validation processes need information about the applications that are to be rolled out (i.e., application metadata). This information includes the definition of interfaces, resource requirements, and information about the app lifecycle, and could even include models of the implemented functionality. Models of the application's functionality can be used during the planning process to verify the correctness of the desired functionality and even to dynamically monitor the behaviour of the application after the roll-out. A post-condition of the installation could therefore be that the output of the app complies with the model for a given amount of time.

Prepare for Changes in Configuration at Runtime
Applications should be designed in a manner that supports reconfiguration during runtime. This is needed to avoid or reduce downtime of the applications that need to be changed. For example, a service that sends measurement data to an MQTT broker may need to be re-configured to send its data to a different broker. In case the address of the broker changes, the application could buffer the sensor values during the switch-over, so that loss of data is avoided. A minimal sketch of such a buffered switch-over is given below.
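The sketch below illustrates the buffered broker switch-over described in this design rule. Topic names and broker addresses are assumptions, and paho-mqtt is used purely for illustration; it is not the implementation used in the project.

import collections
import json
import paho.mqtt.client as mqtt


class BufferedPublisher:
    """Publishes measurements; buffers them while the broker is being changed."""

    def __init__(self, host, topic="bems/measurements"):
        self.topic = topic
        self.buffer = collections.deque(maxlen=10000)
        self.paused = False
        self.client = self._connect(host)

    def _connect(self, host):
        client = mqtt.Client()
        client.connect(host, 1883)
        client.loop_start()              # background network loop
        return client

    def publish(self, measurement: dict):
        if self.paused:
            self.buffer.append(measurement)   # hold data during reconfiguration
        else:
            self.client.publish(self.topic, json.dumps(measurement))

    def switch_broker(self, new_host):
        self.paused = True                    # start buffering
        self.client.loop_stop()
        self.client.disconnect()
        self.client = self._connect(new_host)
        while self.buffer:                    # flush everything collected meanwhile
            self.client.publish(self.topic, json.dumps(self.buffer.popleft()))
        self.paused = False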

Security by Design
Most applications in the customer domain and the grid domain either handle sensitive data or control parts of the CPS, and thus have to be protected against attacks and faults of any kind. Resilience against attacks and faults in a CPS can only be ensured when all components in the system contribute to its security. A breach at the application level or in the roll-out system can be an entry point to the rest of the otherwise protected system. Thus, security has to be part of the application's design and include:

1. Mechanisms for the runtime environment to verify the integrity and origin of the application.
2. Authorization mechanisms to grant access to the system only as far as needed.
3. Authentication mechanisms to ensure that access is only granted to the correct software components.

Follow Best Practices
In addition to the security mechanisms to be supported by the applications (see "Security by Design" above), an application in the grid and customer domain should be designed and implemented following best practices for the environment the application is designed for. Since the Fraunhofer BEMS is based on OSGi, the best practices for developing and working with OSGi applications published by IBM are a good starting point [20].

Enable Monitoring during Runtime
The performance of an application should be monitorable in order to ensure that the application operates according to its functional and non-functional requirements. This includes logging as well as interfaces to request the lifecycle state of the application and the connection status of the application's interfaces.

Built-in Support for Hot Standby
Even though the LarGo! project proposed a system that provides proxy modules and can thus be used to implement a hot-standby or warm-standby mode for an application, it would be beneficial if the applications in the smart grid and customer domain supported these modes natively. Switching over from one version to another could then be supported without the need for additional software, which would reduce the complexity of the roll-out process. On the other hand, implementing these modes in the applications could increase the complexity and size of the application unnecessarily, in which case it would be preferable to add the standby modes using external components that are added and removed during the roll-out process.

Especially towards the end of the LarGo! project, a certain way of approaching large-scale roll-outs crystallized. This approach has already been mentioned in Section 3.2, where it was used to deploy new droop control algorithms to the BEMS. A more generalized version can be formulated as follows:

Step 0: Identification of the software that should be deployed: The more interconnected software and power system components become, the more important it is to consider whether a planned update is actually needed. Once it is clear that the software needs to be rolled out, it is also important that it undergoes extensive testing before it is deployed. This way, many human errors and unnecessary bugs can be avoided. In LarGo!, one example was the identification of new droop control parameters, as described in Section 3.2.

Step 1: Plan when the software should be deployed: As already mentioned, the tight interconnection between the ICT system and the power system creates situations where software errors can lead to severe power system failures. Therefore, software updates should be planned such that possible failures have as little effect on the power system as possible. For example, noon on a sunny day may not be the best moment to roll out a new droop control function to all PV inverters in the system. In LarGo!, a novel method to calculate minimum-time roll-out schedules was developed; see Section 3.2.

Step 2: Software should be deployed in a unified manner: Although different applications are used in different domains, and the preferred way of developing software also differs between domains, the domains are not independent from each other. Therefore, if new software needs to be deployed to components in different domains (e.g., iSSN and BEMS), this should be done in a unified manner. In LarGo!, the unified deployment is handled by the KBSM, as described in Section 4.

Step 3: Online monitoring: Once the new software is deployed, there must be means to monitor whether the roll-out was successful. Since errors can occur in a delayed manner, and not only directly when the new software is started, the monitoring should be done continuously online. It is also necessary to be able to differentiate between errors that occur due to an erroneous update and other errors in the system. In LarGo!, the REASENS framework and the AD algorithm warn the operator if errors occur due to the software update; see Section 3.2.

Step 4: Handling of roll-out failures: If anomalies are detected that can be associated with the new update, it must also be possible to react. This can either be a reconfiguration or a rollback of the update. It is also important that it is possible to react in real time, meaning it should not be necessary to wait until all devices have been updated before a rollback can be initiated. Although conceptually considered, this step was not researched very extensively in LarGo! and provides opportunities for future work.

In the best case, these steps should be followed whenever new software is to be deployed to the smart grid. Of course, they are also general enough to be applied to other domains as well. However, in some or even many cases it may not be possible to carry out all steps. Nevertheless, the steps can still be considered useful guidelines for discussions and for the planning of software roll-outs.


7 Conclusions

In LarGo!, we have shown how the increased digitalization of the power system also introduces new problems in areas that were previously not considered. In a distribution system where the majority of the substations are digitalized, poorly managed large-scale software roll-outs can lead to critical failures in the power system. This has been shown in a number of validations, from pure simulations to tests in the field. Not only did LarGo! show that these new problems may arise, it also introduced and developed multiple tools to plan for and handle software roll-outs for future smart grid systems. The knowledge acquired throughout the project has also been summarized in this document as lessons learned and guidelines. These can be used not only by DSOs and other utility operators, but also by other research projects where new algorithms and software components are tested in smart grid systems.

The activities in the LarGo! project have also shown how important the results from the project can be to improve the efficiency of future smart grid roll-outs and the adoption potential of new smart grid solutions. At the moment, the integration of new software into the smart grid involves many steps that are error prone. However, if these manual steps can be automated using the LarGo! deployment system, the corresponding time and resources can be saved and new smart grid solutions can be adopted faster. In this way, LarGo! can enable large-scale roll-outs with a seamless and secure deployment process and can have a strong impact on the efficiency of future smart grid roll-outs.

During the LarGo! project, we have made important contributions toward realizing large-scale, resilient software roll-outs in smart grids. Nevertheless, challenges remain. One part we did not really touch upon are tools for a resilient rollback in case of a failed roll-out. This is a topic that should be on the future research agenda. Further future topics include better tools to support deployment across multiple domains and better usage of digital twins for the deployed applications.


Abbreviations

AD . . . . . . . . . . . . . . . . . . Anomaly Detection

ALMA . . . . . . . . . . . . . . . . Application Lifecycle Management Agent

ALMS . . . . . . . . . . . . . . . . Application Lifecycle Management Service

BEMS . . . . . . . . . . . . . . . . Building Energy Management System

CIGRÉ . . . . . . . . . . . . . . . . Conseil International des Grands Réseaux Électriques

CPS . . . . . . . . . . . . . . . . . Cyber-Physical System

CUSUM . . . . . . . . . . . . . . . Cumulative Sum

DSO . . . . . . . . . . . . . . . . . Distribution System Operator

EN . . . . . . . . . . . . . . . . . . Evidential Network

FDIR . . . . . . . . . . . . . . . . . Fault Detection Isolation and Recovery

ICT . . . . . . . . . . . . . . . . . . Information and Communication Technology

IIoT . . . . . . . . . . . . . . . . . Industrial Internet of Things

ILP . . . . . . . . . . . . . . . . . . Integer Linear Program

IoT . . . . . . . . . . . . . . . . . . Internet of Things

IP . . . . . . . . . . . . . . . . . . Internet Protocol

iSSN . . . . . . . . . . . . . . . . . intelligent Secondary Substation Node

JSON . . . . . . . . . . . . . . . . JavaScript Object Notation

KBSM . . . . . . . . . . . . . . . . Knowledge Based Software Management

MQTT . . . . . . . . . . . . . . . . Message Queue Telemetry Transport

OSGi . . . . . . . . . . . . . . . . . Open Service Gateway Initiative

PV . . . . . . . . . . . . . . . . . . Photo Voltaic

RPI . . . . . . . . . . . . . . . . . Raspberry PI

SGW . . . . . . . . . . . . . . . . . Substation GateWay

SIL . . . . . . . . . . . . . . . . . . Software In the Loop

STPA . . . . . . . . . . . . . . . . System Theoretic Process Analysis

TCP . . . . . . . . . . . . . . . . . Transmission Control Protocol

UI . . . . . . . . . . . . . . . . . . User Interface


References

[1] F. Andren, B. Bletterie, S. Kadam, P. Kotsampopoulos, and C. Bucher. On the stability of local voltage control in distribution networks with a high penetration of inverter-based generation. IEEE Transactions on Industrial Electronics, 62(4):2519–2529, 2015.

[2] Apache ACE. https://ace.apache.org/.

[3] Apache Felix OBR. http://felix.apache.org/documentation/subprojects/apache-felix-osgi-bundle-repository.html/.

[4] Apache Karaf. https://karaf.apache.org//.

[5] Michéle Arnold and Göran Andersson. Model predictive control of energy storage including uncertain forecasts. In 17th Power Systems Computation Conference (PSCC 2011), 2011.

[6] Stephan Balduin. Surrogate models for composed simulation models in energy systems. Energy Informatics, 1(1):30, 2018.

[7] S. Cejka, F. Kintzler, L. Muellner, F. Knorr, M. Mittelsdorf, and J. Schumann. Application lifecycle management for industrial IoT devices in smart grid use cases. In 5th International Conference on Internet of Things, Big Data and Security (IoTBDS), May 2020.

[8] Stephan Cejka, Alexander Hanzlik, and Andreas Plank. A framework for communication and provisioning in an intelligent secondary substation. In IEEE Int. Conf. on Emerging Technologies and Factory Automation, Sept 2016.

[9] M. S. Chong, H. Sandberg, and A. M. H. Teixeira. A tutorial introduction to security and privacy for cyber-physical systems. In 2019 18th European Control Conference (ECC), pages 968–978, June 2019.

[10] M. S. Chong, D. Umsonst, and H. Sandberg. Voltage regulation of a power distribution network in a radial configuration with a class of sector-bounded droop controllers. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3515–3520, Dec 2019.

[11] C/sells. https://www.csells.net/de/.

[12] Marcial Guerra de Medeiro, Kin Cheong Sou, and Henrik Sandberg. Minimum-time secure rollout of software updates for controllable power loads. In 21st Power Systems Computation Conference (PSCC 2020), 2020.

[13] Jon Dugan, Seth Elliott, Bruce A. Mah, Jeff Poskanzer, Kaustubh Prabhu, Mark Ashley, Aaron Brown, Aeneas Jaißle, Susant Sahani, Bruce Simpson, and Brian Tierney. iperf3. [Retrieved: 2020-05-13].

[14] Eclipse hawkBit. https://projects.eclipse.org/projects/iot.hawkbit.

[15] Eclipse Virgo. https://www.eclipse.org/virgo/.

[16] Ivo Friedberg, Xin Hong, Kieran McLaughlin, Paul Smith, and Paul C Miller. Evidential network modeling for cyber-physical system state inference. IEEE Access, 5:17149–17164, 2017.

[17] C. Gavriluta, G. Lauss, T. I. Strasser, J. Montoya, R. Brandl, and P. Kotsampopoulos. Asynchronous integration of real-time simulators for HIL-based validation of smart grids. In IECON 2019 - 45th Annual Conference of the IEEE Industrial Electronics Society, volume 1, pages 6425–6431, 2019.

[18] Tobias Gawron-Deutsch, Konrad Diwold, Stephan Cejka, Martin Matschnig, and Alfred Einfalt. Industrial IoT für Smart Grid-Anwendungen im Feld. e & i Elektrotechnik und Informationstechnik, 135(3):256–263, 6 2018.

[19] Inseok Hwang, Sungwan Kim, Youdan Kim, and Chze Eng Seah. A survey of fault detection, isolation, and reconfiguration methods. IEEE Transactions on Control Systems Technology, 18(3):636–653, 2009.

[20] IBM. Best practices for developing and working with OSGi applications. https://www.ibm.com/developerworks/websphere/techjournal/1007_charters/1007_charters.html#sec11.

[21] Nancy Leveson. A new accident model for engineering safer systems. Safety Science, 42(4):237–270, 2004.

[22] OpenEMS. https://www.openems.io/.

[23] openHAB. https://www.openhab.org/.

[24] OpenMUC. https://www.openmuc.org/.

[25] OSGi™ Alliance. OSGi Release 7, 2018.

[26] Robert M Lee, Michael J Assante, and Tim Conway. Analysis of the Cyber Attack on the Ukrainian Power Grid. White Paper, SANS ICS and E-ISAC, March 2016.

[27] Software Update for Embedded Systems (swupdate). https://github.com/sbabic/swupdate.

[28] Cornelius Steinbrink, Marita Blank-Babazadeh, André El-Ama, Stefanie Holly, Bengt Lüers, Marvin Nebel-Wenner, Rebeca P. Ramírez Acosta, Thomas Raub, Jan Sören Schwarz, Sanja Stark, Astrid Nieße, and Sebastian Lehnhoff. CPES testing with mosaik: Co-simulation planning, execution and analysis. Applied Sciences, 9(5), 2019.

[29] András Varga and OMNeT++ contributors. OMNeT++ discrete event simulator. http://www.omnetpp.org/, 2018.
