

PUBLIC DELIVERABLE

H2020 Project: Smart Resilience Indicators for Smart Critical Infrastructure 

D3.9 ‐ Report on RapidMiner 

Coordinator: Aleksandar Jovanovic EU‐VRi 

Project Manager: Bastien Caillard EU‐VRi 

European Virtual Institute for Integrated Risk Management 

Haus der Wirtschaft, Willi‐Bleicher‐Straße 19, 70174 Stuttgart 

Contact: smartResilience‐CORE@eu‐vri.eu


SMART RESILIENCE INDICATORS FOR SMART CRITICAL INFRASTRUCTURES

© 2016-2019 This document and its content are the property of the SmartResilience Consortium. All rights relevant to this document are determined by the applicable laws. Access to this document does not grant any right or license to the document or its contents. This document and its contents are not to be used or treated in any manner inconsistent with the rights or interests of the SmartResilience Consortium, or to the Partners' detriment, and are not to be disclosed externally without prior written consent from the SmartResilience Partners. Each SmartResilience Partner may use this document in conformity with the SmartResilience Consortium Grant Agreement provisions. The research leading to these results has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 700621.

The views and opinions in this document are solely those of the authors and contributors, not those of the European Commission.

Report on RapidMiner: Data Science-Driven Resilience Analytics with RapidMiner

Report Title: Report on RapidMiner

Author(s): T. Knape, B. Allen

Responsible Project Partner: AIA

Contributing Project Partners: n/a

Document data:

File name / Release: D3.9_Report on RapidMiner_v13sm25092019.docx Release No.: 4

Pages: 70 No. of annexes: 3

Status: Amended acc. to the EC review

Dissemination level: PU

Project title: SmartResilience: Smart Resilience Indicators for Smart Critical Infrastructures

Grant Agreement No.: 700621

Project No.: 12135

WP title:

The SmartResilience indicator-based methodology for assessing, predicting & monitoring the resilience of SCIs for optimized multi-criteria decision-making

Deliverable No: D3.9

Due date: September 30, 2019

Submission date: September 30, 2019

Keywords: RapidMiner, Resilience Analytics

Reviewed by:

Knut Øien Review date: March 21, 2019

Peter Klimek Review date: March 21, 2019

Frank Fiedrich Review date: March 24, 2019

Approved by Coordinator:

A. Jovanović Approval date: September 30, 2019

Dublin, September 2019


SmartResilience: Indicators for Smart Critical Infrastructures

page i

Release History

Release No.

Date Description / Change

1 March 20, 2019 Draft version.

2 April 11, 2019 Updated version based on comments from reviewers.

3 April 23, 2019 Final version, based on comments during SC meeting on April 16, 2019.

4 September 24, 2019 Revised version prepared to address European Commission review comments.


Project Contact

EU-VRi – European Virtual Institute for Integrated Risk Management
Haus der Wirtschaft, Willi-Bleicher-Straße 19, 70174 Stuttgart, Germany
Visiting/Mailing address: Lange Str. 54, 70174 Stuttgart, Germany
Tel: +49 711 410041 27, Fax: +49 711 410041 24 – www.eu-vri.eu – [email protected]
Registered in Stuttgart, Germany under HRA 720578

SmartResilience Project

Modern critical infrastructures are becoming increasingly smart (e.g. smart cities). Making infrastructures smarter usually means making them smarter in normal operation and use: more adaptive, more intelligent, etc. But will these smart critical infrastructures (SCIs) behave smartly and be smartly resilient also when exposed to extreme threats, such as extreme weather disasters or terrorist attacks? If making existing infrastructure smarter is achieved by making it more complex, would that also make it more vulnerable? Would this affect the resilience of an SCI, that is, its ability to anticipate, prepare for, adapt to and withstand, respond to, and recover from threats? And which resilience indicators (RIs) should one look at?

These are the main questions tackled by the SmartResilience project.

The project envisages answering the above questions in several steps: (#1) by identifying existing indicators suitable for assessing the resilience of SCIs; (#2) by identifying new smart resilience indicators, including those derived from Big Data; (#3) by developing a new advanced resilience assessment methodology based on smart RIs and the resilience indicators cube, including the resilience matrix; (#4) by developing the interactive SCI Dashboard tool; and (#5) by applying the methodology and tools in 8 case studies, integrated under one virtual, smart-city-like, European case study. The SCIs considered (in 8 European countries!) deal with energy, transportation, health, and water.

This approach will allow benchmarking best-practice solutions and identifying early warnings, improving the resilience of SCIs against new threats, cascading effects and ripple effects. The benefits/savings to be achieved by the project will be assessed by the participating reinsurance company. The consortium involves seven leading end-users/industries in the area and seven leading research organizations, supported by academia and led by a dedicated European organization. World-leading external resilience experts are included in the Advisory Board.


Executive Summary

This D3.9 report describes our use of RapidMiner technology for resilience analytics applications; it relates to work in D3.3, D3.4, D3.7 and D4.6. The main objectives addressed are:

- Predictive resilience analytics
- Multi-criteria decision making
- Enterprise integration for resilience assessment applications
- New data-driven indicators to update the project database

We have collaborated with Cork City Council on the application of predictive resilience analytics and multi-criteria decision making for the use case scenario on urban flood resilience.

Our research has been supported by the following agencies in the Irish Government:

- Office of Public Works
- Ordnance Survey Ireland, Ireland's National Geographic Service
- National Transport Authority

We thank them for the support they provided, through discussions, meetings and data provision, for the development of resilience analytics and data science applications.

We have built predictive models for forecasting flood water levels using available datasets in the GOLF case study and evaluated their effectiveness.

We have further implemented multi-criteria decision making by the example of a flood-protection investment use case supported by the Office of Public Works.

Work carried out in D3.9 relates to the GOLF case study. Contributions under T3.3 concern a predictive model that, given location height data and location statistics data, can predict in real time future functionality levels along the FL-t curve; longer forecasting horizons, however, come at the expense of forecasting accuracy. Recovery is also linked to the severity of the flood water level impact and the likely structural flood damage. Contributions under T3.4 lie in the use of RapidMiner and the application of the MCA approach, which is broadly applicable across a range of government decisions and fulfils important criteria for application in government, such as the ability to provide an audit trail, transparency and ease of use. It does not follow the MCDM approach described in D3.4.

Concerning T3.7, we discuss RapidMiner enterprise integration options and the re-use of RapidMiner analytics processes, which can be integrated with most database systems but will likely require alterations to fit a particular business use case.

With regard to T4.6, we report on several new data-driven indicators. These indicators are based on the predictive water level model for the GOLF use case.


Table of Contents

Purpose of Document .......................................... 10
D3.9 in the SmartResilience project .......................... 10
Intended Audience ............................................ 12
Impact on Stakeholders ....................................... 12
Flood risk management methods ................................ 12

Background Predictive Analytics .............................. 14
CRISP-DM Industry-standard methodology for data analytics projects .. 15
Conclusions .................................................. 17

Introduction ................................................. 18
Algorithmic approach ......................................... 19
Data ingested by the model ................................... 20
Data cleaning & pre-processing process ....................... 21
Relevance of input data attributes - weighting by information gain .. 23
Implementation of the forecasting model - RapidMiner Predictive Analytics .. 25
3.6.1 Option 1 Simple Learning with Naïve Bayes .............. 25
3.6.2 Option 2 ARIMA ......................................... 27
3.6.3 Option 3 Deep Learning ................................. 32
3.6.4 Model performance (unseen data) discussion & conclusions .. 36
Predictive Functionality Levels & Charting ................... 38
Conclusions .................................................. 41

Introduction ................................................. 42
Flood Protection Investments Options GOLF .................... 42
MCA benefit score approach ................................... 44
Implementation in RapidMiner ................................. 48
Conclusions .................................................. 52

Conclusions .................................................. 54

Conclusions .................................................. 55


Annex 1 Summary of the input data ............................ 60
Annex 2 Charts ............................................... 62
Annex 3 Review process ....................................... 64


List of Figures

Figure 1: Complete structure for functionality assessment in a smart city ..... 11
Figure 2: Common steps in a predictive analytics project ..... 14
Figure 3: Phases of the CRISP-DM reference model [7] ..... 16
Figure 4: Predictive resilience analytics ..... 18
Figure 5: Geo Locations Sensors - on OSI ITM Digital Globe Aerial Imagery ..... 21
Figure 6: Geo Locations Sensors - on OSI ITM basemap ..... 21
Figure 7: Data cleaning & pre-processing process ..... 22
Figure 8: Executive Process containing the data pre-processing ..... 23
Figure 9: Naïve Bayes Model with integrated Weighting using Information Gain ..... 26
Figure 10: Validation sub-process ..... 27
Figure 11: Naïve Bayes confusion matrix on training & test data ..... 27
Figure 12: Arima Top level Predictive Model View ..... 28
Figure 13: Arima Executive Process Sub-Process View ..... 29
Figure 14: Arima Model Copy-Time-column Sub-Process View ..... 29
Figure 15: Arima Model Handle-missing Sub-Process View ..... 30
Figure 16: Arima Model Find ExtremesInLabel Sub-Process View ..... 30
Figure 17: Arima Model find-name-of-label Sub-Process View ..... 30
Figure 18: Arima Model rename-label-to-a-standard-name Sub-Process View ..... 31
Figure 19: Arima Model Optimize Parameters (Grid) ..... 31
Figure 20: ARIMA: the blue line is the prediction ..... 32
Figure 21: Holt-Winters: the red line is actual, the blue line is the prediction ..... 32
Figure 22: Deep Learning Toplevel Predictive Model View ..... 33
Figure 23: Deep Learning Results ..... 33
Figure 23: Performance Vector Deep Learning - Training Dataset ..... 33
Figure 25: Deep Learning forecasting accuracy chart - actual values are in red, predicted values in blue ..... 34
Figure 26: Shorter-term Deep Learning forecasting accuracy chart (1) - actual values are in red, predicted values in blue ..... 34
Figure 27: Shorter-term Deep Learning forecasting accuracy chart (2) - actual values are in red, predicted values in blue – on unseen data ..... 35
Figure 28: Deep Learning forecasting accuracy chart (window size of 48) - actual values are in red, predicted values in blue ..... 35
Figure 29: Shorter-term Deep Learning forecasting accuracy chart (1) (window size of 48) - actual values are in red, predicted values in blue ..... 36
Figure 30: Shorter-term Deep Learning forecasting accuracy chart (2) (window size of 48) - actual values are in red, predicted values in blue ..... 36
Figure 31: Naïve Bayes confusion matrix on unseen data ..... 37
Figure 32: Performance Vector Deep Learning – Unseen Dataset ..... 37
Figure 32: Predicted number of jobs at business locations either affected and severely affected by flooding ..... 38
Figure 33: Ratio of jobs at business locations either affected and severely affected by flooding ..... 39
Figure 34: Percentage of non-affected jobs & non-affected jobs severely ..... 39


Figure 35: Predictive modelling of FL-t curve ..... 40
Figure 36: RapidMiner process ..... 48
Figure 37: Review of data in the RapidMiner data editor ..... 48
Figure 39: RapidMiner functions - MCDM formula for GOLF ..... 50
Figure 39: Defining the sum over all weighted-scores calculated for the criteria for a specific option ..... 50
Figure 40: Defining the grouping by flood protection measure option ..... 51
Figure 41: Result set with values for each option ..... 51
Figure 42: Visual representation of MCDM result set ..... 51
Figure 43: Example operators supporting Integration of RapidMiner Resilience Analytics applications ..... 53
Figure 44: Webservices integration for analytics processes running on RapidMiner Server in, e.g. a cloud deployment scenario ..... 54
Figure 45: Insured businesses either affected or severely affected by the predicted water level ..... 62
Figure 46: Ratio of the number of jobs at business locations either affected or severely affected by the predicted flood water level ..... 62
Figure 47: Value of stock levels held at business locations either affected or severely affected by the predicted flood water level ..... 63


List of Tables

Table 1: Functionality assessment levels GOLF test scenario ..... 11
Table 2: Weight by information gain ..... 24
Table 3: Global weighting for flood protection investment options ..... 44
Table 4: Local weighting, importance scoring ..... 45
Table 5: Scoring - General Approach ..... 45
Table 6: Scoring - Technical & Economic Criteria ..... 46
Table 7: Scoring - Social & Environmental ..... 46
Table 8: Scoring - Other Criteria ..... 46
Table 9: MCA benefit calculation for a data-driven social indicator (partial, reviewing one data-driven indicator) ..... 47
Table 10: Example calculation of MCA value per option ..... 47
Table 11: Example MCDM calculation ..... 49
Table 12: Data summary – Roches Point Weather Station – every hour ..... 60
Table 13: Data summary – Tidal Station NMCI Ringaskiddy Data – every 15 mins ..... 61
Table 14: Data summary – Water Level Station Lee Road – every 5 mins ..... 61


List of Acronyms

Acronym Definition

AIA Applied Intelligence Analytics

APSR Areas of Potentially Significant Risk

ARIMA AutoRegressive Integrated Moving Average

AU Assessment Unit

CCC Cork City Council

CSV Comma-separated values (file format)

FRM Flood Risk Map

GW Global-weighting

ITM Irish Transverse Mercator

LW Local-weighting

MCA Multi-Criteria Analysis

MCDM Multi-Criteria Decision Making

NTA National Transport Authority

OPW Office of Public Works

OSI Ordnance Survey Ireland, Ireland’s National Geographic Service


Introduction

Purpose of Document

This report relates to several project tasks (T3.3, T3.4, T3.7 and T4.6):

1. Contributing to modelling the impact and recovery phase using the predictive analytics capabilities of RapidMiner with available data in the context of the GOLF case study
2. Contributing to developing multi-criteria decision analysis tools based on, e.g. RapidMiner
3. Supporting the development of an integrated resilience assessment tool based on expertise with RapidMiner technology
4. Contributing to updating the resilience indicators database through expertise on databases and risk/resilience tools

The report addresses the above in the following sections:

All – Chapter 2: RapidMiner Analytics

1 – Chapter 3: Resilience Analytics with RapidMiner Predictive Analytics

2 – Chapter 4: Multi-Criteria Decision Making

3 – Chapter 5: Enterprise Integration with RapidMiner

4 – Chapter 6: New data-driven indicators

D3.9 in the SmartResilience project

D3.9 focuses on the technical aspects of predictive modelling and MCDM implementation using RapidMiner technology. We describe how we use predictive analytics and MCDM to build data-driven indicators that help assess resilience, by example, using available datasets in the context of the GOLF case study.

GOLF – Urban Flooding case study

Cork City, located at the head of a tidal estuary and the downstream end of a large river catchment, is prone to both tidal and fluvial flooding. Cork City is the second-largest city in the Republic of Ireland with a population of 125,622 as per the 2016 census. Flooding is the main threat in the GOLF case study. For predictive impact modelling, we selected tidal flooding as the most frequent flooding and calculated the predicted impact/recovery for the city using available economic statistics data, location height data and environmental data.

D3.9 describes predictive modelling of the impact and recovery phase using RapidMiner technology and thereby relates to Deliverable D3.3: Report on the SmartResilience Methodology for Assessing Resilience of SCIs Based on RIs (Resilience Indicators) [1].

D3.9 further describes the implementation of MCDM using RapidMiner by the example of a practical use case supported by the OPW and relates to Deliverable D3.4: Report on the SmartResilience MCDM Methodology Serving as the Basis for the SCIs Dashboard [2].


Also, D3.9 describes the reuse of RapidMiner processes and integration options with other systems, such as the SmartResilience database, and thereby relates to Deliverable D3.7: The SCIs Dashboard containing the module on Dynamic Intelligent Checklists [3].

It further lists new data-driven indicators which relate to Deliverable D4.6: New Release of the RI-Database [4].

The figure below illustrates the complete structure for functionality assessment in a smart city according to the SmartResilience methodology [1]. It relates to chapters 3 and 4 (predictive resilience analytics for the impact and recovery phases).

Figure 1: Complete structure for functionality assessment in a smart city

The table below relates the resilience analytics described in chapters 3 and 4 to the SmartResilience methodology.

Table 1: Functionality assessment levels GOLF test scenario

SmartResilience methodology structure | Test scenario: Urban Flooding Resilience Predictive Analytics

Level 1. Functionality level of the city | Cork City: building a threat impact prediction model, such as the water level prediction model for Cork City discussed in chapter 4, using the Cork City height dataset and environmental sensor data

Level 2. Functionality level (FL) of the infrastructure, corresponding to the SCIs in the project | Economy: employment statistics

Level 3. Functionality elements (FEs) | e.g. jobs, buildings, insurance; location and height attributes in stakeholder databases

Level 4. Functionality indicators (FIs) | e.g. % or number of jobs affected by flooding, or severely affected by flooding; impact and recovery indicator calculation based on prediction models and data assets at levels 1, 2 and 3
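The Level 4 indicator calculation can be sketched in a few lines of code. The following is an illustrative example only, not project code: the location names, heights and job counts are hypothetical, and the project itself performs this calculation with RapidMiner processes rather than Python.

```python
# Illustrative sketch: computing a Level-4 functionality indicator
# (percentage of jobs at business locations NOT affected by flooding)
# from a predicted water level and location height data.
# All names and figures below are hypothetical.

def functionality_indicator(locations, predicted_level_m):
    """Percentage of jobs at locations not affected by flooding.

    A location counts as affected when its ground height is below
    the predicted water level.
    """
    total_jobs = sum(loc["jobs"] for loc in locations)
    affected_jobs = sum(
        loc["jobs"] for loc in locations if loc["height_m"] < predicted_level_m
    )
    return 100.0 * (total_jobs - affected_jobs) / total_jobs

locations = [
    {"name": "Quay A", "height_m": 2.1, "jobs": 120},
    {"name": "Quay B", "height_m": 2.8, "jobs": 80},
    {"name": "Hill C", "height_m": 9.5, "jobs": 200},
]
print(functionality_indicator(locations, predicted_level_m=2.5))  # 70.0
```

Analogous indicators (insured vs. uninsured locations, stock value at affected locations) follow the same pattern with a different attribute in place of the job count.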


Intended Audience

The intended audience of this report, which mostly focuses on data science and the use of RapidMiner for data-driven indicator development, comprises the actors involved in a data science-driven resilience assessment project:

- End users (needs / requirements / validation / data)
- Business analysts (concept analysis)
- Data scientists (data preparation / modelling / operationalisation)

End users:

Readers who are interested in forecasting applications that help assess the impact of a threat, such as flood water levels, in their work environment, e.g. emergency coordination.

Business analysts:

Readers who are interested in the domain of predictive modelling and MCDM, and in delivering projects that create decision support tools using predictive analytics or MCDM applications.

Data scientists:

Readers who are interested in the technical aspects of data analytics.

Impact on Stakeholders

As per the DRS-14-2015 call topic's prospective project impact requirements, the funding agency requested the action "to proactively target the needs and requirements of public bodies." The assessment of end-user needs and various resilience analytics options has been carried out in close collaboration with Cork City Council and Cork City Fire Brigade, supported by the following stakeholders in the Irish Government:

- Office of Public Works
- Ordnance Survey Ireland
- National Transport Authority

We have assessed various datasets from the above agencies. We argue that our focus on prototyping resilience analytics concepts where data assets are readily available in public bodies warrants the highest impact requested by the funding agency.

Flood risk management methods

Different methods to assess the risk and vulnerability of areas to flooding have been developed over the last few decades. Two of the more widely used are deterministic, physically-based hydraulic modelling approaches to risk assessment and parametric approaches to assessing flood vulnerability [5]. Deterministic approaches use physically-based hydraulic models to estimate the flood hazard/probability of particular events and rely on a significant amount of detailed topographic, hydrographic and economic information about the area studied. Where this information is available, reasonably accurate estimates of the potential flood risk to an area can be achieved. Parametric approaches, introduced in the 1980s by Little and Rubin [6], aim to use only a few readily available pieces of information to build a picture of the vulnerability of an area. Parametric approaches focus on vulnerability assessments to minimise the impacts of flooding and to increase the resilience of the affected system. This report presents a hybrid approach: deterministic flood prediction modelling using available environmental data sources, combined with parameters such as the number of jobs registered at businesses affected by flooding as determined by the water level prediction model.


RapidMiner Analytics

Background Predictive Analytics

As one of the main parts of this report focuses on predictive modelling, this section gives a primer on predictive analytics.

Predictive analytics is about building predictive models that can provide accurate assessments of what will happen in the future. Using data, statistical algorithms and machine learning techniques, predictive analytics identifies the likelihood of future outcomes based on historical data.

Predictive analytics is the process of identifying patterns in historical data to estimate values for future data we do not yet have. An example discussed further in this report is the use of past location-based water level data, tidal data and weather data to build a model that predicts future water levels, allowing better preparation for a flooding disaster. With an accurate estimate of a future water level, we can evaluate how many locations at certain heights will be affected and, using location-based statistics, calculate the likely impact of a flood: for example, how many jobs at those locations are endangered, how many locations have no insurance cover, and so forth. Comparing the predicted water level to location height data also gives a clearer picture of how many locations will be severely affected, for instance those at least 0.5 m below the predicted water level, which are unlikely to recover quickly due to the required repair work. These locations and their location statistics will likely not show a recovery to 100% soon after the disaster impact.
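The affected / severely affected distinction described above can be expressed as a simple classification rule. The sketch below is illustrative only; the 0.5 m severity margin is the threshold mentioned in the text, while the function name and example values are hypothetical.

```python
# Hypothetical sketch of the impact logic described above: compare a
# predicted water level to a location's ground height and flag whether
# the location is affected, or severely affected (at least 0.5 m below
# the predicted level) and hence likely to face a slower recovery.

SEVERE_MARGIN_M = 0.5  # severity threshold from the text

def classify_location(height_m, predicted_level_m):
    if height_m >= predicted_level_m:
        return "unaffected"
    if predicted_level_m - height_m >= SEVERE_MARGIN_M:
        return "severely affected"
    return "affected"

# With a predicted level of 3.0 m:
print(classify_location(3.2, 3.0))  # unaffected
print(classify_location(2.8, 3.0))  # affected (0.2 m below)
print(classify_location(2.3, 3.0))  # severely affected (0.7 m below)
```

Aggregating this classification over all business locations, together with their job, insurance and stock statistics, yields the impact figures discussed later in the report.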

Predictive modelling

It is essential to understand that a predictive analytics problem cannot be solved by simply loading data into a predictive modelling tool such as RapidMiner and hoping it will return the required results. Model creation is greatly supported by a visual modelling tool, but for the model to be effective, we need to carry out an in-depth review of the problem to be solved and the available data. Often it turns out that the available data is not enough to support a predictive model, and quite often the strategy for predictive modelling changes shape during the pre-modelling phases.

A predictive modelling project is often very detailed and complex; however, all such projects share some high-level tasks. The following illustrates the main steps for building an effective predictive model. These steps are taken iteratively when aiming to build a highly accurate model that benefits the participating project end user. Data preparation concerns data access, exploration, blending and cleansing. Data modelling concerns model building and validation. Operationalisation concerns deployment and maintenance as well as embedding. These steps are explained in more detail below, as they are referenced in the predictive model building discussion in chapter 3.

Figure 2: Common steps in a predictive analytics project


Data preparation

In the first step, we identified relevant data sources that allow us to build a predictive resilience analytics model in an urban flood resilience context. We discussed and evaluated data sources with:

Cork City Council
National Transport Authority
Ordnance Survey Ireland – Ireland's National Geographic Service
Office of Public Works

We explored the available datasets for use in predictive resilience analytics applications. For the datasets supporting a predictive resilience model, we blended data using transformations, data parsing, type conversions, filtering, sorting, set operations such as joins and unions, aggregations, rotations, feature selection, feature creation, feature extraction, sampling and partitioning.

In the next step, we cleansed the data using anomaly and outlier detection, duplicate detection, binning, dimensionality reduction, missing value handling and normalisation.
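A minimal sketch of several of these cleansing steps, assuming pandas and hypothetical sensor columns (`level`, `rain`); this illustrates the operations named above, not the project's actual RapidMiner process.

```python
import pandas as pd

# Hypothetical sensor readings; the column names and values are invented.
df = pd.DataFrame({
    "level": [0.80, 0.90, 0.85, 0.82, 0.88, 0.84, 0.86, 0.83, 0.87, 0.85, 9.90, 0.85],
    "rain":  [0.0,  1.2,  0.4,  0.3,  0.2,  0.5,  0.1,  0.0,  0.6,  0.2,  0.3,  0.2],
})

df = df.drop_duplicates()                               # duplicate detection
df["level"] = df["level"].fillna(df["level"].median())  # missing value handling

# Simple outlier detection: flag values beyond mean +/- 2 standard deviations.
mu, sigma = df["level"].mean(), df["level"].std()
outliers = df[(df["level"] - mu).abs() > 2 * sigma]

# Min-max normalisation of a feature to [0, 1].
df["rain_norm"] = (df["rain"] - df["rain"].min()) / (df["rain"].max() - df["rain"].min())
```

In RapidMiner the same steps map onto dedicated operators; the sketch only shows the underlying computations.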

We built a number of predictive analytics models and scored them to validate their performance.

Modelling

We used different modelling techniques to build predictive models. These include machine learning algorithms for regression and classification; association mining, frequent item set mining and similarity computation; feature weighting, segmentation and clustering; and ensemble and hierarchical models. We further used algorithms with loops and branches to find optimal actions.

We used cross-validation to validate the performance of the models and several interactive charts to gain visual insight into model performance. We also used numerical, nominal and categorical model performance criteria as well as significance tests, optimal threshold cut-offs for binomial classes, cost-sensitive learning and performance measures.
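The cross-validation scheme can be sketched with scikit-learn as a stand-in for RapidMiner's Cross Validation operator; the dataset here is synthetic, not the project data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data stands in for the project datasets.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# 10-fold cross-validation: each fold is used once for testing while the
# remaining nine folds train the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean and standard deviation across folds are the same quantities reported for the Naïve Bayes model in chapter 3.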

Operationalisation

We used the scoring engine, running on server- and cloud-based infrastructures, with triggered or scheduled execution.

CRISP-DM

Industry-standard methodology for data analytics projects

Utilising a standard methodology can help ensure quality outcomes for predictive analytics. The Cross-Industry Standard Process for Data Mining (CRISP-DM) [7] is a widely followed standard process for analytics projects. It is composed of six steps, which we discuss in the following with comments on where each step sits in relation to SmartResilience data-driven indicators and predictive modelling.


Figure 3: Phases of the CRISP-DM reference model [7]

1. Business understanding

In this step, we spend time understanding the reasons for predictive modelling from a business perspective. In SmartResilience this phase concerns levels 1 to 4 shown in Figure 1.

2. Data understanding

In this step, we review the data and its potential promises and shortcomings, and we begin to generate hypotheses. We then reassess the business understanding (step 1) if needed. In SmartResilience we identified data sources, reviewed their potential for predictive modelling and analysed legal data access modalities. This second step in CRISP-DM relates to level 4 in the SmartResilience functionality level hierarchy: it concerns the data of a data-driven indicator that informs about a functionality level.

3. Data preparation

In this step, we carry out data selection, integration, transformation and pre-processing. CRISP-DM does not prescribe the order in which these tasks are done.

4. Modelling

In this step, we apply the algorithms to the data to discover the patterns. We may have to reassess the data preparation step (step 3) if the modelling step requires it.

5. Evaluation

Here we evaluate the model and discovered patterns for their value in answering the business problem such as location-based water level prediction. We may have to revisit the business understanding (step 1) if necessary.

6. Deployment

We present the discovered knowledge and models and put them into production to solve the business problem. In SmartResilience this would be a new data-driven indicator.

The strength of CRISP-DM is its built-in iteration: we are expected to check that the current step is still in agreement with previous steps. Another strength is that we are explicitly reminded to keep the business problem at the centre of all steps, including evaluation. The SmartResilience case studies give a broad spectrum of predictive modelling use cases. Data understanding and the availability of relevant historical datasets are key to predictive modelling and data-driven indicators.

RapidMiner Studio [8]


RapidMiner Studio is a visual design environment for rapidly building complete predictive analytic workflows. It provides an extensive library of machine learning algorithms, data preparation and exploration functions, and model validation tools to support data science projects and use cases.

Data science teams can easily re-use existing R and Python code, and new functionality via a vast marketplace of pre-built extensions.

Key features are:

Visual Programming Environment
Guided Analytics
Reusable Building Blocks & Processes
1500+ Machine Learning & Data Prep Functions
Integration of R & Python Scripts
Correct Model Validation Methods
Access All Types of Data

RapidMiner Studio is a Java-based application that facilitates GUI-driven development of predictive and descriptive models. It is possible to directly run the training of models as well as their application, which facilitates both modelling and the scoring of unseen data. In this project, we dealt with reasonably small datasets even when using larger window sizes, using a reasonably powerful workstation with an i7-7700HQ quad-core CPU and 16 GB RAM. Alternative implementation options include C, C++ or Python, which can yield fast performance but do not provide the advantage of rapid prototyping in a GUI-driven development environment.

RapidMiner Server [9]

RapidMiner Server allows for fast and straightforward collaboration on large-scale enterprise data science projects. Users across an organisation can easily access, reuse and share models and processes in a version-controlled, secure and centrally managed environment. RapidMiner Server easily integrates analytic results into business processes and applications with its rich set of connectors, BI integration and web-service APIs. Key features are:

Optimised enterprise data science teamwork
Seamlessly operationalise, leverage enterprise infrastructure
Highly scalable, distributed architecture
Cloud deployment

The next section discusses the use of RapidMiner for predictive modelling addressing flood water level prediction and location-based impact and recovery calculations.

Conclusions

This chapter gave an overview of RapidMiner and the typical steps followed in a data analytics project. CRISP-DM is a widely used reference model for such projects. Predictive modelling has been described as a process of identifying patterns in historical data to estimate values for future data we do not have. A predictive analytics business problem cannot be solved by loading data into a predictive modelling tool such as RapidMiner and hoping it will return the required results. It requires project-specific consultancy in which end users, business analysts, data scientists and developers, as key stakeholders, collaborate to refine the business understanding following the six phases of the CRISP-DM model, in order to build decision-support predictive indicators that reliably address the interests of the end-user stakeholders.

In the next chapter, we discuss the development of predictive data-driven indicators we carried out going through the various CRISP-DM steps iteratively.


Resilience Analytics with RapidMiner Predictive Analytics

Introduction

Predictive analytics empowers resilience assessment applications with data-science-driven accuracy for flood impact and recovery, and its results are straightforward enough that anyone can form a meaningful understanding of the predicted flood impact and recovery. Predictive analytics can be used to create accurate forecasting applications, and end users can use them to substantiate decisions for mitigating negative flood impact in an urban environment: more effective disaster preparation, response actions, and the analysis of future resilience improvement strategies in lessons-learned post-disaster assessments. Predictive analytics creates data insight from relevant data sources for resilience assessment applications and supports officers in public authorities in understanding the impact of the predicted next disaster. With an understanding of the predicted disaster extent, officers can plan actions to reduce the disaster's impact on society, for example by checking how many temporary flood protection options, such as flood bags, are available to protect locations that are estimated to be flooded. Predictive resilience analytics is hence a powerful tool for focusing the thoughts and actions of disaster management staff on reducing the disaster impact.

The figure below illustrates the approach for predicting the impact of a threat via a threat target parameter, utilising location intelligence via various databases and location height data.

With predictive modelling, we obtain an indicator for predicting future water levels, which is in turn used as input into the resilience assessment.

Figure 4: Predictive resilience analytics

Figure 4 contains various variables. The following explains them in more detail:



Time series data from sensors etc.: symbolic data from sensor sources such as described in Annex 1.

Predictive Threat Target Parameter: the threat variable, such as the water level for an urban area prone to flooding hazards, which we are aiming to forecast.

Influencing attributes in time series data: for the threat of a high water level, there can be several influencing factors such as described in Table 12. Information gain analysis (see 3.5) then determines which of the attributes have an influence.

Location Intelligence: the data described here is used to calculate the impact of a predicted threat target parameter. For instance, a predicted water level x could mean that y locations at or below that level are impacted. Any statistical data associated with the impacted locations helps calculate the predicted threat impact, such as the number of jobs at business locations (see section 3.7).

Resilience Indicators: these are calculated by associating the predicted threat with the impacted locations and location-specific statistics, such as an employment database, for indicators such as the predicted number of jobs at business locations either affected or severely affected by flooding (see section 3.7).

Algorithmic approach

In this section, we present the predictive analytics process in pseudocode. A predictive model can be either a regression model with continuous output or a classification model. The model performance evaluation is different for each approach and is discussed in section 3.6.4. A predictive classification model for flooding events outputs either 0 or 1 (no flooding or flooding). A predictive regression model for flooding outputs a continuous value such as the expected water level and can be implemented through various algorithmic approaches, for instance ARIMA, Holt-Winters or deep learning.

Generic

for a given natural disaster environment do
    identify the multivariate parameters of system inoperability
    collect relevant data
    evaluate candidate analytics approaches
    assess model accuracy and create performance plots and visualisations
    if accuracy >= acceptable threshold set by the stakeholder (e.g., errors < 10%) then
        identify the key predictors of the multivariate inoperability
        assess the (non-linear) influence of the key predictors on the response
    else
        improve data collection, further model tuning
    end if
end for

Test / validation (Naïve Bayes / Deep Learning)

for a given dataset do
    training phase:
        split training data (80/20)
        preprocess data
        cross-validation
        feature weighting
    testing phase:
        test algorithm (20% of unseen data)
        if accuracy >= threshold then
            save the model for application to unseen data
        end if
end for

Model application (classification system, Naïve Bayes)

for every new data-point do
    classify the flow according to the saved model
    if data-point is a flooding incident then
        categorise as flooding
    else
        categorise as no flooding
    end if
end for

Model application (deep learning model)

for every new data-point do
    predict the next water levels as defined in the window size according to the saved model
end for
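The test/validation pseudocode above can be sketched in Python. Scikit-learn and a synthetic dataset are assumptions standing in for RapidMiner and the flooding data; the acceptance threshold is illustrative.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for the windowed flooding dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Training phase: 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# Testing phase: test the algorithm on the 20% of unseen data.
accuracy = accuracy_score(y_test, model.predict(X_test))

THRESHOLD = 0.80  # illustrative acceptance threshold set by the stakeholder
saved = None
if accuracy >= THRESHOLD:
    saved = pickle.dumps(model)  # save the model for application to unseen data
```

In the project itself, the equivalent of `saved` is a RapidMiner model artefact stored in the repository and later applied to unseen data.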

Data ingested by the model

Annex 1 presents a description of the data we used for predictive modelling. The data is from Lee Road Water Level Station, NMCI Ringaskiddy tidal station and Roches Point weather station. Training and test data covered 17 Sep 2015 to 31 Dec 2018, and unseen data covered 1 Jan 2019 to 28 Feb 2019.

The data allowed us to build a model that had an acceptable outcome. The location of Lee Road Station is not ideal, and the modelling would benefit from water level measurements closer to the city centre.


Figure 5: Geo Locations Sensors - on OSI ITM Digital Globe Aerial Imagery

Figure 6: Geo Locations Sensors - on OSI ITM basemap

Data cleaning & pre-processing process

Pre-processing is responsible for the bulk of the effort in any data science project. It is critical for any supervised learning algorithm to obtain the correct data structures, not just in terms of the correct data types but also in relation to maximising the 'value' of the data.

For this project, as we are looking to predict water levels, it is vital to incorporate previous water level data points for each record. We implemented this using a windowing approach that consists of setting the window size, the horizon and the offset. The pre-processing process then produces datasets, based on the input data, that are optimised for the following modelling processes:

Naïve Bayes classification: this process also uses the windowing approach, but, in the absence of a clear flooding threshold, we created a class label that is 'High' for all water levels higher than the average plus two times the standard deviation. The remaining records are classed as 'OK'.

ARIMA and Holt-Winters: this process aims to predict the actual height. We also use the windowing approach, tidy up dates and remove attributes that have no relevance.

Deep learning: this process aims to predict the actual height. We also use the windowing approach, tidy up dates and remove attributes that have no relevance.

The pre-processing mainly concerns creating the windows and rearranging the data so that it is useful for the algorithms.
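The 'High'/'OK' label construction used for the Naïve Bayes process can be sketched as follows; the water level series is invented for illustration.

```python
import pandas as pd

# Invented water level series with one extreme value at the end.
levels = pd.Series([0.80, 0.85, 0.90, 0.82, 0.88, 0.84, 0.86, 0.83, 0.87, 2.90])

# 'High' for levels at or above mean + 2 * standard deviation, else 'OK'.
threshold = levels.mean() + 2 * levels.std()
label = levels.apply(lambda v: "High" if v >= threshold else "OK")
```

On the real data, this derived label replaces the missing official flooding threshold as the classification target.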

Figure 7: Data cleaning & pre-processing process

The Execute Process operator can run other processes and accepts input parameters. The input parameters are the window size, horizon and offset values. The pre-processing process creates data for all three model inputs.

Window size: the number of values in one window, i.e. the number of past values used as attributes for each example.
Offset: the gap between the end of the window and the first value we predict; we start with the following value.
Horizon: the number of future values we are predicting, i.e. how far we look into the future.
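A windowing transformation with these three parameters can be sketched as below. The semantics approximate RapidMiner's Windowing operator (past values become attributes, a future value becomes the label); the function and column names are our own.

```python
import pandas as pd

def windowing(series, window_size, horizon, offset=0):
    """Turn a series into examples of `window_size` past values plus a
    label located `offset + horizon` steps after the window."""
    rows = []
    for t in range(window_size, len(series) - horizon - offset + 1):
        window = series[t - window_size:t]        # past values (attributes)
        label = series[t + offset + horizon - 1]  # future value to predict
        rows.append(list(window) + [label])
    cols = [f"level-{i}" for i in range(window_size)] + ["label"]
    return pd.DataFrame(rows, columns=cols)

levels = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4]
print(windowing(levels, window_size=3, horizon=1))
```

Each resulting row is one training example: three past levels and the next level as label, mirroring the attribute naming (`level - 0`, `level - 1`, …) seen in Table 2.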


Figure 8: Executive Process containing the data pre-processing

Discussion of data cleaning & pre-processing process

This process concerns the data cleaning and pre-processing ingesting the data from Table 12, Table 13 and Table 14. The following lists the most important parts of the pre-processing process.

1. Retrieve operator “Retrieve river”: accesses stored information in the repository and loads it into the process. The river2 CSV file with the data outlined in Table 12, Table 13 and Table 14 is accessed.
2. Sub-process “Copy Time column”:
   a. Operator “generate attributes”: generate attribute MeasurementTime = [date-measurement]
   b. Operator “reorder attributes”: ensure the order of columns matches the table view.
3. Sub-process “Handle missing”:
   a. Sub-process (2): “unify column types”
   b. Operator “select attributes”: remove low-quality columns ind, ind 1, ind 2, ind 3, ind 4
   c. Sub-process “replace missing values”
   d. Operator “reorder attributes”: ensure the order of columns matches the table view.
   e. Operator “filter examples”
4. Sub-process “FindExtremesInLabel”
5. Operator “Select attributes”
6. Operator “Set role”
7. Operator “Fill data gaps”: finds all the possible dates and times, allows gaps to be seen and ensures that the window is based on a fixed time-based number of steps. This leads to some blank rows, but these get deleted later.
8. Operator “Join”
9. Operator “Select attributes”
10. Sub-process “Find name of label”
11. Operator “Multiply”
12. Operator “Windowing”

Relevance of input data attributes - weighting by information gain

We calculated the relevance of the data input attributes for predicting the water level based on information gain and assigned weights to them accordingly. The attributes with the largest weights are the values of the previous window. Temperature was also among these predictors; this is most likely linked to warmer weather being advantageous when dealing with flooding, or to temperatures usually dropping with high winds. We assume this occurrence is more prevalent in spring, winter and autumn.

The table below lists the weights; the higher the weight, the more important the attribute. Not surprisingly, the previous water levels are the most important predictors. The tide does not seem to have much of an impact, likely because the only water level data available is from a station located outside the city, away from the sea.
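Attribute weighting of this kind can be sketched with mutual information, an information-gain-style criterion, on synthetic data in which the previous water level drives the flood label, rainfall contributes weakly and the tide is independent by construction; the attribute names and relationships are assumptions, not the project data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 400
prev_level = rng.normal(1.0, 0.3, n)  # previous water level (strong signal)
rain = rng.normal(5.0, 2.0, n)        # rainfall (weak signal)
tide = rng.normal(2.0, 0.5, n)        # tide (no signal, by construction)
y = (prev_level + 0.05 * rain > 1.3).astype(int)  # synthetic flood label

X = np.column_stack([prev_level, rain, tide])
weights = mutual_info_classif(X, y, random_state=0)
for name, w in zip(["prev_level", "rain", "tide"], weights):
    print(f"{name}: {w:.3f}")
```

As in Table 2, the strongly predictive attribute receives by far the largest weight while the unrelated one receives a weight near zero.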


Table 2: Weight by information gain

Attribute Weight Attribute Weight

Level Station 1 (Target) - 0 0.274635562 rain - 11 0.010423

Level Station 1 (Target) - 1 0.274151314 rain - 10 0.010393

Level Station 1 (Target) - 2 0.273220028 rain - 9 0.010325

Level Station 1 (Target) - 3 0.272367715 rain - 8 0.010271

Level Station 1 (Target) - 4 0.272121079 rain - 7 0.010223

Level Station 1 (Target) - 5 0.271289444 rain - 6 0.010191

Level Station 1 (Target) - 6 0.270810944 rain - 5 0.010158

Level Station 1 (Target) - 7 0.270113386 rain - 4 0.010085

Level Station 1 (Target) - 8 0.269516878 rain - 3 0.01005

Level Station 1 (Target) - 9 0.268648041 rain - 2 0.010027

Level Station 1 (Target) - 10 0.268085498 rain - 1 0.009987

Level Station 1 (Target) - 11 0.267156002 rain - 0 0.009961

date-measurement - 0 0.069757884 dewpt - 1 0.009692

Date tide - 0 0.069757884 dewpt - 6 0.009692

date weather - 0 0.069737766 dewpt - 4 0.009692

temp - 10 0.024932777 dewpt - 3 0.009692

temp - 11 0.024932348 dewpt - 2 0.009692

temp - 9 0.024931063 dewpt - 0 0.009692

temp - 8 0.024929777 dewpt - 5 0.009692

temp - 7 0.024927635 dewpt - 7 0.00969

temp - 6 0.024925064 dewpt - 8 0.009689

temp - 5 0.024922922 dewpt - 10 0.009687

temp - 4 0.024918209 dewpt - 9 0.009687

temp - 3 0.024913496 dewpt - 11 0.009686

temp - 2 0.02490964 vappr - 6 0.009619

temp - 1 0.0249045 vappr - 5 0.009619

temp - 0 0.024897647 vappr - 4 0.009619

msl - 11 0.021713452 vappr - 3 0.009618

msl - 10 0.021658996 vappr - 2 0.009618

msl - 9 0.021595849 vappr - 1 0.009618

msl - 8 0.02153266 vappr - 7 0.009618

msl - 7 0.021483744 vappr - 0 0.009617

msl - 6 0.021453685 vappr - 8 0.009617

msl - 5 0.021413337 vappr - 10 0.009615

msl - 4 0.021362273 vappr - 9 0.009615

msl - 3 0.021310359 vappr - 11 0.009614

msl - 2 0.021258522 wddir - 11 0.007876

msl - 1 0.021221679 wddir - 10 0.007859

msl - 0 0.021200811 wddir - 9 0.007842

wdsp - 11 0.018346989 wddir - 8 0.007823

wdsp - 10 0.018333758 wddir - 7 0.007805

wdsp - 3 0.018321383 wddir - 6 0.007786

wdsp - 4 0.018320681 wddir - 5 0.007769


wdsp - 5 0.018304282 wddir - 4 0.00775

wdsp - 9 0.018299837 wddir - 3 0.007732

wdsp - 7 0.01829721 wddir - 2 0.007713

wdsp - 6 0.018296667 wddir - 1 0.007694

wdsp - 8 0.018292706 wddir - 0 0.007673

wdsp - 2 0.018291939 rhum - 10 0.006186

wdsp - 1 0.018260656 rhum - 11 0.006185

wdsp - 0 0.018219316 rhum - 9 0.006185

wetb - 0 0.016216825 rhum - 8 0.006171

wetb - 1 0.016200562 rhum - 7 0.006155

wetb - 2 0.016186195 rhum - 6 0.006114

wetb - 3 0.016170341 rhum - 5 0.006074

wetb - 4 0.016154881 rhum - 4 0.006022

wetb - 5 0.016140189 rhum - 3 0.005983

wetb - 6 0.016125888 rhum - 2 0.005956

wetb - 7 0.016111604 rhum - 1 0.005927

wetb - 8 0.016099211 rhum - 0 0.005884

wetb - 9 0.016086459 Tide - 11 0.001764

wetb - 11 0.016074096 Tide - 10 0.001762

wetb - 10 0.016072597 Tide - 9 0.00176

Tide - 8 0.001758

Tide - 7 0.001755

Tide - 6 0.001754

Tide - 5 0.001753

Tide - 4 0.001752

Tide - 3 0.001751

Tide - 2 0.00175

Tide - 1 0.001749

Tide - 0 0.001748

Implementation of the forecasting model - RapidMiner Predictive Analytics

In this section, we discuss several predictive modelling approaches. We used 2015-2018 data for modelling/training and cross-validation, and Jan/Feb 2019 data to test the cross-validated models. We developed separate analytics processes for building/testing and for applying the models to unseen data.

3.6.1 Option 1 Simple Learning with Naïve Bayes

The Naïve Bayes operator in RapidMiner Studio generates a Naive Bayes classification model.

RapidMiner Studio provides the following summary for Naïve Bayes [10]:

Naïve Bayes is a high-bias, low-variance classifier, and it can build a good model even with a small data set. It is simple to use and computationally inexpensive. Typical use cases involve text categorisation, including spam detection, sentiment analysis, and recommender systems. The fundamental assumption of Naïve Bayes is that, given the value of the label (the class), the value of any attribute is independent of the value of any other Attribute. Strictly speaking, this assumption is rarely true (it's "naive"!), but experience shows that the Naive Bayes classifier often works well. The independence assumption vastly simplifies the calculations needed to build the Naive Bayes probability model.


To complete the probability model, it is necessary to make some assumption about the conditional probability distributions for the individual Attributes, given the class. This Operator uses Gaussian probability densities to model the Attribute data.

In our implementation, we fit a model that creates a decision boundary between the binary labels ('OK' and 'High'), where 'High' denotes values equal to or greater than the average plus two times the standard deviation. The process works as follows:

Execute the pre-processing process and output an example set of the windowed data.
Apply a weight-by-information-gain algorithm, which calculates the relevance of the attributes based on information gain and assigns weights to them accordingly.
Keep the attributes with the 15 highest weights (an optimised parameter value).
Feed the data into a 10-fold cross-validation, where the data are split into ten parts using stratified sampling. Iteratively, each of these ten parts is used once for testing while the remainder train the algorithm. Inside the cross-validation, we apply a Naïve Bayes algorithm and measure the performance of each iteration.
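The steps above can be sketched with scikit-learn as a stand-in for the RapidMiner operators: an information-gain-style weighting keeps the 15 best attributes, and Naïve Bayes is evaluated with stratified 10-fold cross-validation. The data is synthetic, not the flood dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Imbalanced synthetic data, loosely mimicking rare 'High' water levels.
X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           weights=[0.9, 0.1], random_state=1)

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=15),  # keep the 15 highest-weighted attributes
    GaussianNB(),
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # stratified sampling
scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Placing the feature selection inside the pipeline ensures it is re-fitted within each fold, matching the validation setup shown in Figure 10.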

Figure 9: Naïve Bayes Model with integrated Weighting using Information Gain


Figure 10: Validation sub-process

The confusion matrix gives close to 100% precision for the 'OK' class (no flood) and 75.42% precision for the 'High' class (flood level reached). Of the 4168 misclassifications in total, only 11 are cases where an actual 'High' level was predicted as 'OK', which is what gives the 'OK' class a precision of almost 100%; the rest are cases where we predict a future river level as 'High' when in fact it stays 'OK'. This asymmetry is acceptable: a false alarm (a predicted flood that does not occur) is preferable to predicting no flood when a flood actually occurs.

The misclassifications are thus almost always in the desired quadrant, where we predict a high water level but the water level is still normal. Frequently these misclassifications are linked to border values, where the water level is just at the edge between 'OK' and 'High'. The recall indicates what percentage of the actual instances of each class were correctly classified and, at 98.24% and 99.91% for the two classes, is more than acceptable. As 10-fold validation was used, a standard deviation of 0.07% could be calculated around an average overall accuracy of 98.33%. The small standard deviation indicates that the model is likely to be robust and will generalise well when scoring unseen future data.

Figure 11: Naïve Bayes confusion matrix on training & test data
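The per-class precision and recall figures discussed above follow directly from the confusion matrix counts. A sketch with small invented counts (not the report's actual figures):

```python
# Invented confusion matrix counts (rows: actual class, columns: predicted).
#                   predicted 'OK'  predicted 'High'
counts = {"OK":   {"OK": 950,       "High": 40},   # 40 false alarms
          "High": {"OK": 2,         "High": 8}}    # 2 missed 'High' events

# Precision: of all 'High' predictions, how many were actually 'High'?
precision_high = counts["High"]["High"] / (counts["High"]["High"] + counts["OK"]["High"])
# Recall: of all actual 'High' cases, how many were predicted 'High'?
recall_high = counts["High"]["High"] / (counts["High"]["High"] + counts["High"]["OK"])

print(f"precision(High) = {precision_high:.3f}")
print(f"recall(High)    = {recall_high:.3f}")
```

The same formulas applied per class to Figure 11 yield the precision and recall values quoted in the text.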

3.6.2 Option 2 ARIMA

The ARIMA operator in RapidMiner Studio trains an ARIMA model for a selected time series attribute.

RapidMiner Studio provides the following summary for ARIMA [10]:

ARIMA stands for Autoregressive Integrated Moving Average. Typically, an ARIMA model is used for forecasting time series. An ARIMA model is defined by its three order parameters, p, d, q. p specifies the number of Autoregressive terms in the

Page 30: Coordinator: Aleksandar Jovanovic EU‐VRi Bastien Caillard EU‐VRi … · 2019. 10. 8. · Project Manager: Bastien Caillard EU‐VRi European Virtual Institute for Integrated Risk

SmartResilience: Indicators for Smart Critical Infrastructures

page 28

model. d specifies the number of differencing operations applied to the time series values. q specifies the number of Moving Average terms in the model. An ARIMA model is an integrated ARMA model. The ARMA model describes a time series by a weighted sum of lagged time series values (the Autoregressive terms) and a weighted sum of lagged residuals (the Moving Average terms). These residuals originate from a normally distributed noise process. The "integrated" indicates that the values of the ARMA model are integrated, which is equivalent to saying that the original time series the ARMA model describes is differenced. The ARIMA operator fits an ARIMA model with given p, d, q to a time series by finding the p+q coefficients (and, if estimate constant is true, the constant) that maximize the conditional log-likelihood of the model describing the time series. For the optimization the L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm is used. When choosing values for p, d, q, note that the conditional log-likelihood is only a good approximation of the exact log-likelihood if the number of parameters (the sum of p, d, q) is not of the order of the length of the time series. Hence the number of parameters should be much smaller than the length of the time series. How well a trained ARIMA model describes a given time series is often measured with the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) or the corrected Akaike Information Criterion (AICc). The ArimaTrainer operator calculates these performance measures and outputs a Performance Vector containing the calculated values. An ARIMA model that describes a time series well has small information criteria. This operator is similar to other modelling operators but is specifically designed to work on time series data. One implication of this is that the forecast model should be applied to the same data it was trained on.
The Apply Forecast operator receives a trained ARIMA model and creates the forecast for the time series it was trained on.
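As an illustration of the mechanics described above, the following sketch fits a simplified AR(p) model to a d-times differenced series by ordinary least squares and computes an AIC. It is not the RapidMiner ARIMA operator (which also fits MA terms by conditional maximum likelihood); the synthetic series and orders are assumptions for demonstration only.

```python
import numpy as np

def fit_ar(series, p, d=0):
    """Fit an AR(p) model to a d-times differenced series by least squares.

    Simplified stand-in for ARIMA(p, d, 0): no moving-average terms, and
    ordinary least squares instead of conditional maximum likelihood.
    Returns (coefficients, constant, aic).
    """
    x = np.asarray(series, dtype=float)
    for _ in range(d):                      # "integrated": difference d times
        x = np.diff(x)
    lagged = [x[i - p:i][::-1] for i in range(p, len(x))]
    X = np.column_stack([np.array(lagged), np.ones(len(lagged))])
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = len(y), p + 1
    sigma2 = float(np.mean(resid ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood
    return beta[:p], float(beta[p]), 2 * k - 2 * loglik   # AIC = 2k - 2 log L

# Illustrative: fit a noisy AR(1) series; the estimate should be near 0.7.
rng = np.random.default_rng(0)
x = [0.0]
for _ in range(300):
    x.append(0.7 * x[-1] + rng.normal(scale=0.1))
coef, const, aic = fit_ar(x, p=1)
print(coef[0])  # close to 0.7
```

As in the operator description above, a smaller AIC indicates a better trade-off between fit and the number of parameters.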

Figure 12: Arima Top level Predictive Model View


Figure 13: Arima Executive Process Sub-Process View

Figure 14: Arima Model Copy-Time-column Sub-Process View


Figure 15: Arima Model Handle-missing Sub-Process View

Figure 16: Arima Model Find ExtremesInLabel Sub-Process View

Figure 17: Arima Model find-name-of-label Sub-Process View


Figure 18: Arima Model rename-label-to-a-standard-name Sub-Process View

Figure 19: Arima Model Optimize Parameters (Grid)

3.6.2.1 ARIMA results discussion

We reviewed ARIMA as one of the main methods for predicting time series. Flooding events are linked to wind, temperature, and tidal forecasts and are not isolated events. Neither the ARIMA nor the Holt-Winters algorithm performed very well; both were considerably inferior to the deep learning algorithm.

ARIMA worked reasonably well for normal series but could not predict deviation from the norm.


Figure 20: ARIMA: the blue line is the prediction

Holt-Winters

Figure 21: Holt-Winters: the red line is actual, the blue line is the prediction

3.6.3 Option 3 Deep Learning

The Deep Learning operator in RapidMiner Studio executes the Deep Learning algorithm using H2O 3.8.2.6.

RapidMiner Studio provides the following summary for Deep Learning [10]:

Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout and L1 or L2 regularization enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network.


As the deep learning algorithm is computationally expensive, we opted for a 60:40 split between training and test data. The algorithm was set up with a rectifier activation function and ten epochs. The results, as outlined later, are promising.
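The windowing and 60:40 chronological split used above can be sketched in plain Python/NumPy. This is not the RapidMiner Windowing operator itself, and the sine series is a stand-in assumption for the water level data:

```python
import numpy as np

def make_windows(series, window_size, horizon=1):
    """Turn a univariate series into (features, label) pairs: each row holds
    `window_size` consecutive values; the label is the value `horizon`
    steps after the window ends."""
    x = np.asarray(series, dtype=float)
    n = len(x) - window_size - horizon + 1
    X = np.stack([x[i:i + window_size] for i in range(n)])
    y = x[window_size + horizon - 1:window_size + horizon - 1 + n]
    return X, y

series = np.sin(np.linspace(0, 20, 500))   # stand-in for the water level series
X, y = make_windows(series, window_size=12)

# 60:40 chronological split between training and test data, as in the report.
# A chronological (not shuffled) split avoids leaking future values into training.
split = int(0.6 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
print(X_train.shape, X_test.shape)
```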

Figure 22: Deep Learning Toplevel Predictive Model View

The detailed Deep Learning results can be seen below. A window size of 12 results in a Mean Square Error (MSE) of 0.0009293002 and a correlation of R^2 = 0.9896514, both of which indicate that the model worked rather well.

Figure 23: Deep Learning Results

Below, Figure 24 shows the Root Mean Squared Error (RMSE) of 0.031 and the Squared Error of 0.001, which we will use in section 3.6.4 when comparing the performance of the Deep Learning model on training and unseen data.

Figure 24: Performance Vector Deep Learning - Training Dataset


Figure 25: Deep Learning forecasting accuracy chart - actual values are in red, predicted values in blue

Figure 26: Shorter-term Deep Learning forecasting accuracy chart (1) - actual values are in red, predicted values in blue


Figure 27: Shorter-term Deep Learning forecasting accuracy chart (2) - actual values are in red, predicted values in blue – on unseen data

We ran both in and out of sample, and the model worked well on unseen data. Figure 27 shows the prediction performance on unseen data.

A window size of 48 still performs very well, with a Mean Square Error of 0.0066463803 and a correlation of R^2 = 0.92412037.

Figure 28: Deep Learning forecasting accuracy chart (window size of 48) - actual values are in red, predicted values in blue


Figure 29: Shorter-term Deep Learning forecasting accuracy chart (1) (window size of 48) - actual values are in red, predicted values in blue

Figure 30: Shorter-term Deep Learning forecasting accuracy chart (2) (window size of 48) - actual values are in red, predicted values in blue

3.6.4 Model performance (unseen data) discussion & conclusions

As discussed in section 3.2, a predictive model can be either a regression model with continuous output or a classification model. We have developed separate analytics processes for training and applying a trained model to unseen data. In this section, we discuss the model performance for Naïve Bayes (a predictive classification model) and Deep Learning (a regression model with continuous output).


As new weather data arrive continually, in production we would need to retrain the model frequently to ensure accurate predictions of future values, whether predicted flooding events (Naïve Bayes) or predicted water levels (Deep Learning).

Naïve Bayes

Below, Figure 31 shows the confusion matrix which describes the performance of the Naïve Bayes model, i.e. how well it predicts whether the water level is high or ok on unseen data.

High was defined as any value greater than the mean + 2 * the standard deviation. No high-water event occurred in the timeframe of the unseen data, and the model did not forecast any high-water level, which is an excellent result as the model performance was 100% accurate. Figure 11 shows that the precision of the model, using 10-fold cross-validation, in predicting a flood is over 70%; on closer inspection, the misclassified examples mainly consisted of values that were close to the threshold.
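A minimal sketch of this threshold rule, using illustrative level readings rather than the project's gauge data:

```python
import statistics

def high_water_threshold(levels):
    """'High' threshold: mean + 2 * standard deviation of historical levels."""
    return statistics.mean(levels) + 2 * statistics.pstdev(levels)

# Illustrative gauge readings in metres, not the project's data.
levels = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.1]
t = high_water_threshold(levels)
label = "high" if 2.6 > t else "ok"
print(round(t, 3), label)
```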

The sensitivity or recall for true-negative is 100% and for false-negative 0%. Class-recall refers to how many events of a class (no flood (ok) or flood (high)) are selected correctly, and class-precision refers to how many of the predictions for a class are actually correct.
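These class-precision and class-recall figures can be reproduced from a confusion matrix; the labels below are invented for illustration, not the project's results:

```python
from collections import Counter

def confusion_stats(actual, predicted, positive="high"):
    """Count the confusion matrix cells and derive precision/recall for
    a binary 'ok'/'high' water-level classifier."""
    cells = Counter(zip(actual, predicted))
    tp = cells[(positive, positive)]
    fp = sum(v for (a, p), v in cells.items() if p == positive and a != positive)
    fn = sum(v for (a, p), v in cells.items() if a == positive and p != positive)
    tn = sum(cells.values()) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else None   # of predicted 'high', how many were right
    recall = tp / (tp + fn) if tp + fn else None      # of actual 'high', how many were found
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# Illustrative labels, not the project's data.
actual    = ["ok", "ok", "high", "ok", "high", "ok"]
predicted = ["ok", "ok", "high", "high", "ok", "ok"]
stats = confusion_stats(actual, predicted)
print(stats)
```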

Figure 31: Naïve Bayes confusion matrix on unseen data

Deep Learning also showed excellent results, with a very low mean square error of 0.0009293002 when using a window size of 12 future measurements. Deep Learning model performance is measured with the error loss, which is the difference between the predicted values and the values actually observed. The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model and the values actually observed. For a good model, the RMSE on the training and test (unseen) datasets should be very similar; if the RMSE for unseen data is much higher than that for the training data, this could indicate overfitting. The RMSE for unseen data is 0.040 and for the training data it is 0.031, which is a good result.
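The overfitting check described above boils down to comparing two RMSE values. A minimal sketch, using the figures reported in this section:

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between observed and predicted values."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def overfit_ratio(rmse_train, rmse_test):
    """Ratio > 1 means the model does worse on unseen data; a ratio far
    above 1 suggests overfitting."""
    return rmse_test / rmse_train

# The report's values: training RMSE 0.031, unseen-data RMSE 0.040.
print(round(overfit_ratio(0.031, 0.040), 2))  # a modest gap
```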

Figure 32: Performance Vector Deep Learning – Unseen Dataset

The learning time for the Deep Learning model is considerably longer than for the Naïve Bayes model, and so is the scoring of unseen data. However, it is still at an acceptable level. We used different window sizes; as expected, the larger the window size, the less accurate the predictions.


Predictive Functionality Levels & Charting

In this section, we discuss the predicted functionality levels. We differentiate the severity of the flooding impact. Business locations affected by a predicted flood water level of up to 50 cm are labelled as 'affected'. Business locations predicted to be flooded by more than 50 cm of water are labelled as 'affected severely', which suggests significant flood damage.

Business locations labelled as 'affected severely' are less likely to recover back to 'business as usual' than business locations labelled as 'affected'. We set the threshold of 50 cm as an example; it can be freely adapted based on the experience of subject matter experts.

The chart below shows the predicted functionality level for the number of jobs either affected or severely affected by a predicted water level of 3.77 m, which we chose as an example. It can be interpreted as how many jobs are very likely either affected or severely affected by the predicted water level.
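Assuming ground elevations are available for each business location, the 50 cm rule can be sketched as follows (the elevations are hypothetical; the threshold is the report's example value and can be adapted):

```python
def classify_location(ground_level_m, predicted_water_level_m, threshold_m=0.5):
    """Label a business location from the predicted flood depth at that point.

    depth <= 0 m           -> 'not affected'
    0 < depth <= threshold -> 'affected'
    depth > threshold      -> 'affected severely'
    The 0.5 m threshold mirrors the report's example.
    """
    depth = predicted_water_level_m - ground_level_m
    if depth <= 0:
        return "not affected"
    return "affected" if depth <= threshold_m else "affected severely"

# Hypothetical ground elevations (metres) against the example 3.77 m prediction.
for ground in (4.0, 3.5, 3.0):
    print(ground, classify_location(ground, 3.77))
```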

Figure 33 shows the absolute number of jobs predicted to be affected or affected severely at their employment locations.

Figure 33: Predicted number of jobs at business locations either affected or severely affected by flooding

Figure 34 shows the ratios of jobs affected or affected severely by a predicted water level from the predictive model.


Figure 34: Ratio of jobs at business locations either affected or severely affected by flooding

Figure 35 shows the predicted functionality level 'employment' with the percentage of jobs either not affected or not severely affected by the forecasted water level at their employment locations.

Figure 35: Percentage of jobs not affected & not severely affected

Figure 36 illustrates the functionality level results in relation to the FL-t curve.


Figure 36: Predictive modelling of FL-t curve

Impact & recovery phase discussion

Figure 34 shows the ratio of jobs at business locations either affected or severely affected by a predicted flooding incident, based on the total number of jobs in the location database. It shows that 5.4% of jobs will be affected, and 32.9% of jobs will be affected severely, by the predicted water level of 3.77 m. Figure 36 shows the resulting employment functionality level based on a value from the water level forecasting window. We define 'severe' flooding as an incident that causes severe structural damage to flooded locations. The percentage from the predictive indicator corresponds to the predicted water level value from the rolling forecasting window of the predictive model.

With the predictive model, we predict all values within the window size (see section 3.4), which is our forecast horizon. The larger the window size, the more likely our predictions are to be incorrect. That means we can increase the window size, at the cost of accuracy for values far in the future, to cover a full impact and recovery phase for the predicted water level.

Naïve Bayes forecasting only predicts the event of a flooding incident; deep learning, however, predicts the future values of the water level. The deep learning model has been tested with window sizes between 12 and 48. Given that the model is built with 5-minute intervals, with a window size of 48 the model forecasts 4 hours of future water level values, which, combined with the height data and location statistics, gives us the real-time FL-t curve.
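The forecast horizon follows directly from the sampling interval and the window size; a trivial check using the report's values:

```python
def horizon_hours(window_size, interval_minutes=5):
    """Forecast horizon covered by `window_size` future values
    sampled every `interval_minutes`."""
    return window_size * interval_minutes / 60

# Window sizes 12 and 48 at 5-minute intervals.
print(horizon_hours(12), horizon_hours(48))  # 1.0 and 4.0 hours
```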

The indicator illustrated in Figure 33 above estimates (with the input of a predicted water level value from the current model forecasting window) that, in the event of the water level returning to normal, 32.9% of jobs are still affected due to severe flood damage at the respective employment locations. As the current predicted water level gives insight into the severity of the flooding disaster and allows us to estimate how many locations will suffer severe flood damage, it in turn gives insight into the likelihood of bouncing back to 100%. Locations exposed to severe flooding may not bounce back to 100%, i.e. returning to normalcy or resuming pre-flood behaviours such as business-as-usual. In that context, there may be a need to think about regional adaptation, rethink resilience and see an individual flood-affected region as a complex adaptive system. An adaptive system is able to change or adapt to stresses rather than merely striving for a return to normalcy or a resumption of pre-challenge behaviours or outcomes [11]. In a complex adaptive system, resilience is not related to equilibrium, a return to 'normal', or even to resilient outcomes; instead, it is a dynamic attribute associated with a process of continual development [12]. Severe flooding, or a series of severe flooding incidents, can contribute to a discussion of resettlement.


There have been a number of severe flooding incidents in Europe in recent years. For example, regions in Saxony, Germany were flooded up to three times in 2002, 2006 or 2010, and 2013. According to a study on the impact of flooding on households in Saxony [13], households affected by flooding up to three times in recent years perceived the impacts of flooding incidents as more severe than households that had been affected by flooding for the first time in 2013. Further, the study found that households that suffered flood damage several times thought considerably more often about resettlement. Hence, flood-prone communities that do not get flood protection may face severe consequences and may not bounce back to 100% of the normalcy measured before the incident.

Annex II shows the charting for other predicted functionality levels.

Conclusions

We have discussed several approaches for building a predictive model for flood water level forecasting, which helps flood emergency coordinators see the impact of, and recovery from, an imminent flooding event with the associated functionality level impact & recovery metrics. For the target parameter, the predicted water level, we have used a water level gauge that is not close to the city and only has historical recordings starting in 2015.

The deep learning approach yielded the best forecasting accuracy and can be adapted to other forecasting applications for a threat target parameter 'x' to be predicted with high accuracy, provided relevant longer-term time series data are available. The use of more extensive historical data for the threat target parameter, recorded at a location close to the expected threat impact, is essential for building a robust predictive model. By training a model with longer time series data, it can tune in to a wide range of patterns that occurred over time and become more robust for forecasting applications. Shorter historical time series likely contain fewer patterns that a deep learning approach can adapt to, making the predictive model less able to predict future events.

We have been using the data available to us (weather, tidal and river levels), and we have shown that deep learning algorithms are quite accurate in forecasting future water levels. The model itself cannot be applied to other cities due to the variance in infrastructure (flood prevention, dams, walls, river system, etc.) and weather/tidal data. However, it would be reasonably straightforward to fit a model to data obtained from other cities. We have shown that for Cork, the deep learning approach is the most suitable algorithm. Whether this holds for other cities would need to be investigated for each case.

We have further discussed that with a severe impact from a natural flooding disaster there may not be a bounce-back recovery to pre-incident normalcy.

In addition, we discussed the limitations of our predictive model given the available input data used to build it. We tested the model (also on unseen data) with a window size of up to 4 hours of future water level values. This window is likely too short to carry out flood mitigation actions for the locations identified with severe flood impact, such as local businesses that have a high number of employees registered at their location and are important drivers of the local economy.


Multi-Criteria Decision Making

Introduction

The previous chapters provided an overview of the capabilities of RapidMiner. In this chapter, we discuss the use of RapidMiner for multi-criteria decision making (MCDM). The test scenario is closely related to the appraisal process for investing in flood protection measures that the Office of Public Works (OPW) applies in Ireland. The OPW is the lead agency for the delivery of flood protection infrastructure in Ireland and, as an external stakeholder, kindly supported the SmartResilience project through the GOLF case study on urban flooding. The OPW provided relevant multi-criteria analysis (MCA) data and advice based on OPW MCA work on flood protection measures for Cork City.

RapidMiner is a data science software platform that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. For MCDM we are mostly using data access and blending as well as statistics visualisation features.

There are many MCA methods which each have useful features justifying their application. The OPW MCA approach chosen is broadly applicable across a range of government decisions and fulfils important criteria for application in government such as the ability to provide an audit trail, transparency and ease of use.

The MCA approach discussed relies on objectives that clarify what is intended to be achieved regarding flood risk reduction and related benefits. Accordingly, the objectives defined focus on the adverse consequences of flooding on human health, the environment, cultural heritage and economic activity.

Each flood protection investment option is scored on its performance against each objective in turn, looking at available indicators and the change the investment option brings to the indicator. This score is then multiplied by the global and local weightings. These weighted scores for each objective are then added up to give the overall MCA benefit score for the option, which represents the overall benefits and impacts of the option across the full range of objectives.

Flood Protection Investment Options (GOLF)

In this section, we describe the OPW MCA problem statement and options for flood protection investments in Cork City.

The core problem statement has been defined as the identification of the best investment option for improving flood protection in Cork City.

The investment options identified are:

Option 1: Develop a flood forecasting system combined with individual property protection and a targeted public awareness and education campaign

Option 2: Improved channel conveyance combined with permanent flood walls

Option 3: Proactive maintenance of existing informal defences

Option 4: Provision of demountable defences combined with some permanent defences with a flood forecasting system

Option 5: Provision of permanent flood walls/embankments

Discussion of investment options

Option 1: Develop a flood forecasting system combined with individual property protection and a targeted public awareness and education campaign.


Baseline option:

No modelling and no change in flow regime. Consider individual property protection for all properties where flood depth <600mm

Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.

Option 2: Improvement in channel conveyance combined with the provision of flood walls/ embankments.

Baseline option:

This option is the same as the permanent defences option (option 5) except modification of footbridges has been considered in order to improve channel conveyance. Modelling showed that some of the footbridges on the south channel do have a small effect on water levels, reducing them by approximately 100mm in some areas. Modifying the footbridges reduces the height of the defences required.

Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.

Option 3: Proactive maintenance of existing informal defences.

Baseline option:

'with defences' and 'without defences' models of Cork demonstrate the impact of the existing informal defences. In general, the defences provide little protection against anything greater than minor floods. The reason for this is that many are not designed as flood defences and have openings for pedestrian access etc.

There are more than 40 km of existing assets within Cork. Of these, approximately 3% are in good or very good condition, 54% in fair condition, and 43% in poor or very poor condition.

Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.

Option 4: Provision of demountable defences combined with some permanent defences with a flood forecasting system.

Baseline option:

This option is very similar to the previous option in terms of areas defended but wherever possible demountable defences have been used.

Demountable defences are appropriate wherever there is good access to the defence location for installation and removal. In many areas of the city centre, it would be possible to install demountable defences slightly set back from the river bank. This reduces the need to provide new walls along the bank and would make the demountables easier and safer to install. Where existing bridges are below the flood level, it has been assumed that the demountable defences will continue past the bridge. These sections of defence could be left open to allow traffic movement until floods reach a critical level.

Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.

Option 5: Provision of permanent flood walls/embankments.

Baseline option:

For this option assessment, flood walls and embankments are considered for all the areas where there are significant numbers of properties at risk within the APSR which can be protected without excessive cost. If this option is taken forward, the location and type of defences would need optimising.

The defences option has been modelled, and it was found that the defences raised water levels in the North and South Channel by approximately 300-400mm at the upstream (west) part of the river but had little effect on levels in the downstream (east) end of the river.

Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.

MCA benefit score approach

The core of the approach features the following:

Global weightings (economic, environmental, social, technical) represent the four core objectives as the basis for the assessment criteria

Local weightings (from 'international importance' to 'not relevant')

Scoring system developed in consultation with the Lee Catchment Flood Risk Assessment and Management project steering group:
o OPW
o Cork City Council
o Cork County Council
o Environmental Protection Agency

Global weightings

The global weightings have been developed by the OPW and are fixed nationally. They are unchanged for each assessment unit. This level of weighting recognises the key drivers behind flood risk management (FRM) options and gives higher weightings to risk to human health and life and economic return on options.

An assessment unit defines the spatial scale at which flood risk management options are assessed. Units are defined on spatial scales ranging in size from largest to smallest as follows: catchment scale; Assessment Unit (AU) scale, a large sub-catchment, e.g. the Lower Lee AU; and Areas of Potentially Significant Risk (APSR), e.g. Cork City.

Table 3: Global weighting for flood protection investment options

Criterion Objective Global weighting

Technical Operationally Robust 5

Technical Health & Safety Risk 5

Technical Adaptability 5

Economic Economic Return 25

Economic Transport Infrastructure 15

Economic Utility Infrastructure 15

Economic Agriculture 5

Social Risk to Human Health 30

Social Community Risk 10

Social Risk to Social Amenity 5


Environmental Ecological Status 5

Environmental Pollution Sources 15

Environmental Habitats 10

Environmental Fisheries 5

Environmental Landscape Character 5

Environmental Cultural Heritage 5

Local weighting

The local weighting of each objective varies for each assessment unit depending on the level of applicability of that objective to that unit. For some objectives, the local weighting could be 0, since the objective does not apply to that part of the catchment.

Table 4: Local weighting, importance scoring

Importance Local Weight

Major / International importance 5

Significant / National importance 4

Medium / Regional importance 3

Minor / Local importance 2

Negligible importance 1

Not relevant 0

Scoring

The flood protection measures applicable to an analysis unit (large sub-catchment, e.g. Lower Lee AU) or areas of potential significant risk (APSR), e.g. Cork City, Ballincollig, Crookstown are scored based on the core criteria. The baseline indicator data relevant to each core criterion (e.g. the presence of a sensitive environmental designation) were used to inform this preliminary evaluation. The scoring system was developed in consultation with the project steering group.

Table 5 to Table 8 show the scorings.

Table 5: Scoring - General Approach

Impact Score

Achieving aspirational target 5.0

Partly achieving the aspirational target 3.0

Exceeding minimum target 1.0

Meeting minimum target 0.0

Just failing minimum target -1.0

Partly failing minimum target -3.0


Fully failing minimum target -999.0

Table 6: Scoring - Technical & Economic Criteria

Core Criteria Basis for scoring Score

Technical Technically impossible or difficult -2, -1

Technically possible 0

Technically straightforward 1

Unacceptable -999

Economic Prohibitive / excessive cost; estimated BC ratio <<1 -2

Reasonable cost; estimated BC ratio 0.5 - 1 -1

Estimated BC ratio 1 0

Estimated BC ratio 1 - 2 1

Low cost or potential for income; estimated BC ratio > 2 2

Unacceptable -999

Table 7: Scoring - Social & Environmental

Core Criteria Basis for scoring Score

Social Significant negative impact on people -1

Neutral impact on people 0

Positive impact on people 1

Unacceptable -999

Environmental Overall negative environmental impact -1

Overall neutral environmental impact 0

Overall positive environmental impact 1

Unacceptable -999

Table 8: Scoring - Other Criteria

Core Criteria Basis for scoring Score

Other Significant negative issue -1

No other significant issues 0

Significant positive issue 1

Unacceptable -999


The elements for the calculation of the MCA benefit score for a flood protection investment option (e.g. build a tidal barrier) include:

Applicable (yes/no) (no if, for instance, the option is about the improvement of existing tidal flood defences, but none exist)

Technical (weighted score)

Economic (weighted score)

Social (weighted score)

Environmental (weighted score)

Other (weighted score)

Score looking at data-driven indicators

Overall score (weighted score, i.e. the MCA benefit score)

Decision to carry forward to option development (yes/no), e.g. yes, build a tidal barrier

MCA benefit calculation for a data-driven social indicator (partial, reviewing one data-driven indicator from one objective)

We use the location-based employment statistics database discussed in chapter 3 and the indicator "Predicted number of jobs at business locations either affected or severely affected by flooding" as shown in Figure 33. This indicator relates to the social/political dimension listed in the Resilience matrix in the SmartResilience project [1] (table 1, page 14). In the MCA approach discussed, it falls under the social objective, sub-objective 'minimise risk to community (employment)'.

Table 9: MCA benefit calculation for a data-driven social indicator (partial, reviewing one data-driven indicator)

Objective: minimise risk to community (employment)

Weighting: global (10), local (5)

Baseline indicator: total of 31,123 jobs at risk in city locations severely affected by flood water levels of 3.77 m

Option: Provision of permanent flood walls

Score: 3

Explanation of option assessment (simplified for this report): Flood walls and embankments are considered for all the areas where there are significant numbers of properties at risk within the APSR which can be protected without excessive cost.

MCA benefit score (partial): 150

Table 10 shows an example of the MCA approach where the score links to the review of the relevant data-driven indicators. It illustrates the calculation for an MCA option ‘x’, for which we have selected three criteria, each assigned its respective global and local weights and a score as discussed earlier in this section. The significance, or individual benefit, of each criterion for option ‘x’ is calculated in the last column as a weighted score, and the sum of that column gives the MCA benefit score for option ‘x’, which can then be compared with other options for decision making. With an MCA benefit score of -50, option ‘x’ will not be carried forward to option development.

Table 10: Example calculation of MCA value per option


Criteria (GW = global weighting, LW = local weighting, S = score, WS = weighted score = GW × LW × S):

Technical - Ensure Flood Risk Management options are operationally robust.             GW 5, LW 5, S 0,  WS 0
Technical - Minimise Health and Safety risk of flood risk management options.          GW 5, LW 5, S 1,  WS 25
Technical - Ensure flood risk is managed effectively and sustainably into the future.  GW 5, LW 5, S -3, WS -75

MCA benefit score: -50

Implementation in RapidMiner

We first prepared the data in a CSV file storing the criteria, global weighting, local weighting and score for each option. We then loaded the CSV into a RapidMiner process using the Retrieve operator.

Figure 37: RapidMiner process

We can use the data editor in RapidMiner to make changes in the underlying dataset to review the direct changes in the outcome of the MCDM calculation.

Figure 38: Review of data in the RapidMiner data editor

Following the approach of global & local weights and scores taken by the OPW, we calculate the weighted sum for each criterion and define the formula expression using the RapidMiner expressions editor.


We first calculate the weighted score as (global-weighting × local-weighting × score) for each row in the dataset. We then calculate the sum of the weighted scores for each option. The table below shows the calculation for option 5.

Table 11: Example MCDM calculation

Criteria (GW = global weighting, LW = local weighting, S = score, WS = weighted score):

Technical - Ensure Flood Risk Management options are operationally robust.              GW 5,  LW 5, S 3,        WS 75
Technical - Minimise Health and Safety risk of flood risk management options.           GW 5,  LW 5, S 0,        WS 0
Technical - Ensure flood risk is managed effectively and sustainably into the future.   GW 5,  LW 5, S 0,        WS 0
Economic - Optimise economic return on flood risk management investment.                GW 25, LW 5, S 0.164489, WS 20.5611015
Economic - Minimise risk to infrastructure.                                             GW 15, LW 4, S 3,        WS 180
Economic - Minimise risk to agricultural land.                                          GW 5,  LW 2, S 0,        WS 0
Social - Minimise risk to human health and life.                                        GW 30, LW 5, S 3,        WS 450
Social - Minimise risk to the community.                                                GW 10, LW 5, S 3,        WS 150
Social - Minimise risk to, or enhance, social amenity.                                  GW 5,  LW 4, S 3,        WS 60
Environmental - Support the achievement of good ecological status/potential (GES/GEP) under the WFD.   GW 5,  LW 5, S -1, WS -25
Environmental - Minimise risk to sites with pollution potential.                        GW 15, LW 0, S 0,        WS 0
Environmental - Avoid damage to, and where possible enhance, the flora and fauna of the catchment.     GW 10, LW 5, S -1, WS -50
Environmental - Avoid damage to, and where possible enhance, fisheries within the catchment.           GW 5,  LW 4, S -1, WS -20
Environmental - Protect, and where possible enhance, landscape character and visual amenity within the catchment.   GW 5, LW 4, S -3, WS -60
Environmental - Avoid damage to or loss of features of cultural heritage importance, their setting and heritage value within the catchment.   GW 5, LW 4, S 0, WS 0

MCA benefit score: 780.561102
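The two-step calculation above (weighted score per row, then the sum per option) can be sketched in a few lines of plain Python. This is a minimal illustration with a handful of hypothetical rows, not the RapidMiner process itself:

```python
# Each row: (criterion, global_weight, local_weight, score, option)
rows = [
    ("Technical - operationally robust",  5, 5,  3, "option 5"),
    ("Economic - risk to infrastructure", 15, 4,  3, "option 5"),
    ("Environmental - GES/GEP",           5, 5, -1, "option 5"),
    ("Technical - operationally robust",  5, 5,  0, "option x"),
]

# Step 1: weighted score per row = global-weighting x local-weighting x score
weighted = [(option, gw * lw * s) for _, gw, lw, s, option in rows]

# Step 2: MCA benefit score per option = sum of its weighted scores
benefit = {}
for option, ws in weighted:
    benefit[option] = benefit.get(option, 0) + ws

print(benefit)  # {'option 5': 230, 'option x': 0}
```

The same grouping-and-summing is what the Aggregate operator performs in the RapidMiner process described below.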


Figure 39: RapidMiner functions - MCDM formula for GOLF

To get the sum of all weighted scores for each flood protection option, we use the sum aggregate function with the weighted score as the attribute, grouped by option.

Figure 40: Defining the sum over all weighted-scores calculated for the criteria for a specific option


Figure 41: Defining the grouping by flood protection measure option

After running the process, RapidMiner shows the results in the ExampleSet view for each flood protection option.

Figure 42: Result set with values for each option

We use RapidMiner visualisation and configure it to present the MCDM results as a bar chart.

Figure 43: Visual representation of MCDM result set


Based on the weights and scores, the provision of permanent flood walls/embankments, i.e. Option 5 (red bar in Figure 43), is the most favourable investment option in this example. It received the highest score, 780.561102, according to the calculation described in Table 11.

Conclusions

This chapter has described the use of RapidMiner for an MCDM use case, using the example of the OPW investment options MCA appraisal approach, which is used for analysing infrastructure investment options such as flood protection measures. We have discussed the MCA approach for flood protection investment options for Cork City with an example dataset.

The MCA approach discussed is broadly applicable across a range of government decisions and fulfils important criteria for application in government, such as the ability to provide an audit trail, transparency and ease of use. It is an alternative to the SmartResilience MCDM approach described in the D3.4 report.

We have discussed the MCA application using RapidMiner, following the OPW MCA appraisal process adapted in the context of the GOLF case study, which consists of global and local weights and a scoring. The global weights are defined for four core criteria (technical, economic, social and environmental), each with several objectives; for the economic criterion, for example, economic return and transport infrastructure carry the global weights 25 and 15 respectively. Local weighting then puts a value on the importance of the area affected by flooding, ranging from major/international importance through national and local importance down to no importance. The scoring is a measure developed by an expert steering group for indicators for the flood-affected locations: for example, if a flood protection investment option partly achieves the aspirational target for reducing the risk of flooding to a number of business locations, it receives a score of 3. In other words, if a flood protection investment option partly reduces the employment-related indicator illustrated in Figure 33, it will receive a score of 3.

RapidMiner offers a large number of customisable operators that can be combined in a RapidMiner analytics process, and process results can be explored through various charting features. The RapidMiner MCA implementation discussed can be reused for other use cases that follow a similar process of global/local weights and scoring. We used the following data structure for the data fed into the MCA process:

- criteria (text)
- global-weighting (number)
- local-weighting (number)
- score (number)
- option (text)

In order to utilise the RapidMiner MCA process for other use cases, a CSV data file with the same structure of the five data attributes above can be used.
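For illustration, such a CSV file could look like the following; the rows shown here are hypothetical example values, not project data:

```csv
criteria,global-weighting,local-weighting,score,option
Technical - Ensure Flood Risk Management options are operationally robust.,5,5,3,option 5
Economic - Minimise risk to infrastructure.,15,4,3,option 5
Social - Minimise risk to human health and life.,30,5,3,option 5
```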


Enterprise Integration with RapidMiner

Reuse of RapidMiner Analytics Processes

Development efforts in RapidMiner focus on RapidMiner analytics processes, which are stored in RMP XML files. These processes can be reused for solving analytics problems in similar environments; an example is the MCDM process, which can be reused with other datasets supplied in the same data format.

Predictive analytics processes can be reused to a certain degree. As discussed in chapter 2, predictive modelling is a problem- and solution-specific development effort. However, in the case of predictive modelling for forecasting flood water levels, the predictive model can to a certain degree be reused for other cities with similar environments. It is not possible to directly apply the Cork model to other cities without alterations: at a minimum, the model would need to be retrained, assuming the data are identical in terms of available attributes.

Integration options

RapidMiner has several integration options. A wide range of data access and management features can be used to access, load and analyse any type of data, both traditional structured data and unstructured data like text, images and media. It can also extract information from these types of data and transform unstructured data into structured data.

Integration is possible via files (e.g. in Excel or CSV format), via database connectivity, or by publishing models through web services on RapidMiner Server, for instance in a cloud deployment scenario.
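To sketch the web-service route, a client application would call a published process over HTTP with the process inputs passed as parameters. The server name, endpoint path and parameter names below are purely illustrative; the actual URL depends on how the process is published on the RapidMiner Server deployment:

```python
from urllib.parse import urlencode

def build_service_url(server, service, params):
    """Build the query URL for a published analytics web service.

    `server`, `service` and `params` are illustrative placeholders;
    the real path depends on the RapidMiner Server deployment.
    """
    return f"{server}/api/rest/process/{service}?{urlencode(params)}"

url = build_service_url(
    "https://analytics.example.org",   # hypothetical server
    "mca-benefit",                     # hypothetical published process name
    {"option": "option 5"},
)
print(url)
```

A dashboard or other client could then fetch this URL and consume the process result.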

Figure 44: Example operators supporting Integration of RapidMiner Resilience Analytics applications


Figure 45: Webservices integration for analytics processes running on RapidMiner Server in, e.g. a cloud deployment scenario

Conclusions

RapidMiner applications can be integrated into an operational environment using mainstream enterprise integration options. As discussed in section 4.5, RapidMiner analytics processes can be reused provided the underlying business analytics approach and data input structures are the same. For predictive modelling, as explained in section 2.1, a predictive model can be reused to a certain degree if the underlying business problem and data inputs are very similar. To remain accurate, a predictive model needs to be retrained so that it continues to meet the accuracy performance criteria defined by the end-user.

We have discussed the CRISP-DM process for data analytics projects in section 2.2, which describes the steps widely used in industry to implement a data analytics project. The first two steps, business understanding and data understanding, are crucial enablers for a successful analytics project: in these steps, end-users and business analysts need to clarify the interests of the end-user in order to formalise the business analytics problem, and to identify the data sources required to address it. These are followed by the more technical steps of data preparation and modelling, which in turn are reviewed in a fifth step, evaluation, that checks whether the technical implementation has met the business problem description articulated in step 1. In SmartResilience, the business description was carried out in T5.2, in which each case study leader described their business problem with a scenario description and available datasets; T5.2 also covered the data understanding step of CRISP-DM. In the GOLF case, we explored business problem descriptions, reviewed a number of datasets, carried out data preparation and data analysis tasks, and evaluated the results with a view to robust, accurate predictive indicators.

Many datasets have been provided in ESRI [14] shapefile format. We have used ESRI tools to explore:

- street-level transport data for Cork City
- Lidar height data for Cork City

We have further explored various CCC datasets, such as fire brigade call-outs and a location-based employment statistics database.

The strength of CRISP-DM is in its built-in iteration: we went through the GOLF business understanding, data review, data preparation, analysis and evaluation phases in several iterations during the course of the project. We also reviewed other SmartResilience case studies and access to historical time-series datasets. The predictive model for GOLF water level forecasting and city location-based predictive impact and recovery assessment was identified as the business use case that was supported by available datasets. Assembling datasets for resilience analytics in a multi-stakeholder environment such as flood disaster emergency coordination often involves time-consuming data scouting in various stakeholder organisations and negotiating terms for data access, frequently in the form of an NDA.


New data-driven indicators

The predictive resilience analytics prototyping has generated several new data-driven indicators. These indicators are based on the predictive model for the water level in Cork City, which is discussed in chapter 3. By combining it with existing databases available to Cork City Council and stakeholders, using RapidMiner advanced analytics tooling, we derived several new data-driven indicators, including:

- Predicted impact on employment in the city based on the predicted flood water level:
  - jobs affected by the predicted flood water level
  - jobs severely affected by the predicted flood water level (structural flood damage to a business location)
- Predicted impact on businesses without flood damage insurance based on the predicted flood water level:
  - businesses without insurance affected by the predicted flood water level
  - businesses without insurance severely affected by the predicted flood water level
- Predicted damage to stock held in businesses based on the predicted flood water level:
  - euro amount of stock held at all businesses that will be affected by the predicted flood water level
  - euro amount of stock held at all businesses that will be severely affected by the predicted flood water level

These indicators are available as an update to the SmartResilience database.
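The derivation of such an indicator can be illustrated with a small sketch. The location names, ground heights, job counts and the severity margin below are hypothetical; the real calculation uses the Lidar height data and the location-based employment statistics database:

```python
# Hypothetical business locations: (name, ground height in metres, jobs)
locations = [
    ("Quay A", 2.9, 120),
    ("Quay B", 3.5,  60),
    ("Hill C", 6.1, 300),
]

SEVERE_MARGIN = 0.5  # assumed extra depth that counts as structural (severe) damage

def jobs_affected(predicted_level):
    """Sum jobs at locations at or below the predicted flood water level."""
    affected = sum(jobs for _, h, jobs in locations if h <= predicted_level)
    severe = sum(jobs for _, h, jobs in locations
                 if h <= predicted_level - SEVERE_MARGIN)
    return affected, severe

print(jobs_affected(3.77))  # (180, 120)
```

Re-running the function with different predicted levels supports the what-if exploration discussed later in this report.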

Conclusions

We developed several new predictive data-driven indicators. These indicators are the result of many iterations through the data analytics steps described in the CRISP-DM reference model for data analytics projects.

The indicators presented above are based on:

- a predictive flood water level model
- Lidar location height data
- location statistics databases

To be valuable for the end-user, a predictive indicator needs to be reliable. Reliability comes with predictive model accuracy, and model accuracy depends on identifying the influential historical data input streams from which predictive modelling can discover patterns that help forecast future values.

The SmartResilience methodology builds on several macro indicators and depends on these indicators providing a reliable indication of the underlying business understanding, as referred to in CRISP-DM step 1. This D3.9 report has provided a practical description of how to build reliable predictive data-driven indicators.


It is important to note that the extension of any data-driven indicator database, such as the SmartResilience resilience tool database, requires hands-on consultancy work to ensure that end-users benefit from predictive indicators that provide reliable, validated insight in line with their business understanding.

The predictive modelling approach for resilience analytics discussed here has several advantages over existing risk analytics approaches, mainly in its combination of data science-driven predictive analytics with location-based intelligence. It helps stakeholders understand the impact of a predicted threat by linking the threat prediction value with location-based metrics such as employment at business locations. Firstly, this approach allows calculating the impact of a currently predicted threat (following a change process such as climate change) using predictive models that the data scientist can train and test in RapidMiner, with, for example, sensor-sourced weather data like wind speed or wind direction and environmental data such as water levels from water level gauges in real time. Secondly, it allows exploring the threat impact by manually changing the value of the threat parameter and reviewing its impact using the location-specific metrics.

By exploring the impact of a threat with available location-associated metrics, we can assess the impact in a what-if style without historical data. Section 1.5 discussed deterministic vs parametric flood risk estimation approaches, the latter requiring only a few parameters. This links with the idea of the SmartResilience indicator database, which provides a case study such as urban flooding or cyber-attacks with a large number of indicators to choose from, and is similar to the concept of parametric vulnerability assessment described by Balica et al. [5]. For example, the availability of flood bags increases flood resilience and reduces the impact of flood events. In the cyber domain, the term cyber resilience has recently been coined to denote specifically “the ability to continuously deliver the intended outcome despite adverse cyber events” [15]. Cyber resilience indicators can help describe the baseline distribution of a cyber-physical system; such a baseline can be learned from process control log data recorded during secure operation. Statistically significant deviations in resilience indexes for a wastewater facility can be produced by, for example, a faulty pump. In contrast to naturally occurring faults, cyber-attacks do not exhibit a predictable pattern. Therefore, insights about the cause of an anomaly could come from a comparison between several indicators, including those obtained by simulating possible faults [16].

RapidMiner is aimed at the data science community, who have undergone extensive training in data science. It is rich in predictive methods that rely on time series data, and it also supports what-if explorations, with this functionality accessible through its interface aimed at a data science audience. Given this focus as a data science solutions development tool, it is not suitable for direct use by resilience-tasked end-users, but rather for technical staff who help build indicators for use by the end-users. Therefore, to benefit from its quantitative capabilities, its results can be fed into other systems via various integration options. One option is storing the results of a RapidMiner analytics process in an Excel file for use by another application, such as the SmartResilience indicator dashboard, as was done through the Excel import feature of the big data uploader described in D4.2. However, for RapidMiner, we did not reach this level of integration with the project's integrated tool for several reasons. RapidMiner was installed in the SmartResilience web portal at an early stage in order to explore the possibilities of performing (SmartResilience) resilience assessments using a COTS development tool for data science solutions as an alternative to the custom-made tools in SmartResilience. Two main challenges emerged. One was that the custom-made tools (and most of the methods) were at an early development stage, meaning it was not exactly clear which assessments, or parts of assessments, should be performed using RapidMiner; this led, for example, to the development of a RapidMiner MCDM approach that differed from the one later developed and included in the integrated tool. The second challenge was the recognition, noted above, that RapidMiner is aimed at users with extensive data science training.
Attempts were made to explore the potential use of RapidMiner for the different use cases, mainly by data analytics experts analysing datasets received from the case studies, but it turned out to be difficult to obtain relevant (open and non-sensitive) datasets from them. One main reason was that the case studies other than GOLF did not focus on the prediction of events and did not possess this type of (non-sensitive) dataset.


Summary

We have discussed several approaches for building a predictive model for flood water level forecasting, which helps flood emergency coordinators see the impact and recovery of an imminent flooding event with the associated functionality levels (cf. the D3.3 report). We have used the available data (weather, tidal and river levels) and shown that a deep learning algorithm is quite accurate in forecasting future water levels.

The model itself cannot be applied to other cities due to the variance in infrastructure (flood prevention, dams, walls, river system, etc.) and weather/tidal data. However, it would be reasonably straightforward to fit a model to data obtained from other cities. We have shown that for Cork the deep learning approach is the best-suited algorithm; whether this holds for other cities would need to be investigated case by case.

D3.9 has focused mostly on the GOLF case, given the availability of datasets and the dialogue with external stakeholders in Ireland. The contribution to D3.3 (modelling impact and recovery) is a predictive model for forecasting water levels in real time, which we combined with location height data and location statistics data to calculate new indicators that show the respective functionality levels based on the water level values from the predictive model. The predictive time window can be configured via the window size. Modelling a whole FL-t curve is different for each incident, as incidents have individual durations. The forecasting time length of the predictive model can be increased, however, at the cost of forecasting accuracy.

We have tested a deep learning predictive model with a window size of up to 4 hours into the future, with acceptable results. We have connected the predictive model with location-centric statistics data using location height data. In that way, we were able to identify all location statistics at or below a predicted water level and could calculate data-driven indicators such as jobs affected, or severely affected, by flooding. We have calculated the bounce-back functionality level considering the number of locations severely affected by a predicted flood water level, which will likely suffer structural damage. We argue that there may not be a return to pre-disaster normalcy in the case of a severe flooding disaster. In that context, there is a need to think about regional adaptation, rethink resilience and see an individual flood-affected region as a complex adaptive system. An adaptive system is able to change or adapt to stresses rather than merely striving for a return to normalcy or a resumption of pre-challenge behaviours or outcomes [11].
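The windowing idea behind such a forecasting model can be sketched as follows. This is a generic illustration of turning a univariate series into supervised (window, target) examples, with hypothetical water level values, not the exact RapidMiner windowing configuration:

```python
def make_windows(series, window_size, horizon):
    """Turn a time series into (input window, target) pairs.

    `window_size` past values are used to predict the value `horizon`
    steps ahead, mirroring the windowing step before model training.
    """
    examples = []
    for t in range(len(series) - window_size - horizon + 1):
        past = series[t:t + window_size]
        target = series[t + window_size + horizon - 1]
        examples.append((past, target))
    return examples

levels = [1.2, 1.4, 1.9, 2.3, 2.1, 1.8]  # hypothetical water levels (m)
pairs = make_windows(levels, window_size=3, horizon=2)
print(pairs[0])  # ([1.2, 1.4, 1.9], 2.1)
```

Increasing `horizon` extends the forecasting time length, which, as noted above, comes at the cost of forecasting accuracy.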

It is essential to understand that a predictive analytics problem cannot be solved by simply loading data into a predictive modelling tool such as RapidMiner and hoping it will return the required results. Model creation is greatly supported by a visual modelling tool, but for the model to be effective we need to carry out an in-depth review of the problem to be solved and the available data. It often turns out that the available data are not sufficient to support a predictive model, and the strategy for predictive modelling frequently changes shape in the pre-modelling phases.

The MCA approach used in the GOLF case study is an alternative to the SmartResilience MCDM approach described in the D3.4 report. It is broadly applicable across a range of government decisions and fulfils important criteria for application in government, such as the ability to provide an audit trail, transparency and ease of use. We have discussed the MCA application using RapidMiner, following the OPW MCA appraisal process adapted in the context of the GOLF case study. The OPW MCA is based on global and local weights and a scoring. The global weights are defined for the four core criteria (technical, economic, social and environmental), each with several objectives; for the economic criterion, for example, economic return and transport infrastructure carry the global weights 25 and 15 respectively. Local weighting then puts a value on the importance of the area affected by flooding, ranging from major/international importance through national and local importance down to no importance. The scoring is a measure developed by an expert steering group for indicators for the flood-affected locations: for example, if a flood protection


investment option partly achieves the aspirational target for reducing the risk of flooding to a number of business locations, it receives a score of 3. In other words, if a flood protection investment option partly reduces the employment-related indicator illustrated in Figure 33, it will receive a score of 3. The RapidMiner MCA implementation can be directly adapted to other MCDM use cases if they follow the MCA approach we used.

RapidMiner applications can be integrated in various ways: via files (Excel, CSV, etc.), via database connectivity with most database systems, or exposed as web services in a cloud deployment scenario.

Concerning T3.7, we have discussed RapidMiner enterprise integration options and the reuse of RapidMiner analytics processes, which can be integrated with most database systems, including Microsoft SQL Server, which is used for the SmartResilience tool. RapidMiner Studio was also installed and made accessible via Remote Desktop in the hosting environment of the SmartResilience tool, and T3.7 task members were engaged in discussions on tool integration.

D3.9 has also created several new data-driven indicators as a contribution to T4.6. These indicators are based on the predictive water level model for the GOLF use case, which is discussed in chapter 3. By combining it with OSI height data and location statistics data using RapidMiner advanced analytics tooling, we derived several new data-driven indicators available for the SmartResilience database.

We have given an overview of CRISP-DM, a widely used reference model for data analytics projects. CRISP-DM gives guidance on the typical steps of a data analytics project leading to data-driven indicators and decision support tools. Assembling datasets for resilience analytics in a multi-stakeholder environment concerned with flood disaster resilience often involves time-consuming data scouting across various stakeholder organisations and negotiating terms for data access, frequently in the form of an NDA. Data understanding, data availability, data preparation, modelling, evaluation and deployment are part of steps 2-6 in CRISP-DM and can lead to a review and change of the business analytics problem description in step 1. Such reviews and changes may be necessary to ensure that the decision support tools built address the interests of the end-user stakeholder.


References

[1] SmartResilience Consortium, “Deliverable D3.3: Report on the ‘SmartResilience Methodology for Assessing Resilience of SCIs based on RIs (resilience indicators)’,” 2018.

[2] SmartResilience Consortium, “Deliverable D3.4: Report on the SmartResilience MCDM Methodology serving as the basis for the ‘SCIs Dashboard’,” 2018.

[3] SmartResilience Consortium, “Deliverable D3.7: The ‘SCIs Dashboard’ containing the module on Dynamic Intelligent Checklists,” 2018.

[4] SmartResilience Consortium, “Deliverable D4.6: New release of the RI-database,” 2018.

[5] S. F. Balica, I. Popescu, L. Beevers, and N. G. Wright, “Parametric and physically based modelling techniques for flood risk and vulnerability assessment: A comparison,” Environ. Model. Softw., vol. 41, pp. 84–92, 2013.

[6] R. Little and D. Rubin, “On Jointly Estimating Parameters and Missing Data by Maximizing the Complete-Data Likelihood,” Am. Stat., vol. 37, p. 218, 1983.

[7] P. Chapman et al., “CRISP-DM 1.0: Step-by-step data mining guide,” 2000.

[8] RapidMiner, “RapidMiner Studio Datasheet with Feature List,” 2017.

[9] RapidMiner, “RapidMiner Server,” 2017.

[10] RapidMiner, “RapidMiner Studio.” RapidMiner, Inc., 2019.

[11] S. Carpenter, F. Westley, and M. Turner, “Surrogates for Resilience of Social–Ecological Systems,” vol. 8, 2005.

[12] R. Pendall, K. A. Foster, and M. Cowell, “Resilience and Regions: Building Understanding of the Metaphor,” vol. 3, 2009.

[13] C. Kuhlicke, C. Begg, M. Beyer, I. Callsen, A. Kunath, and N. Löster, “Hochwasservorsorge und Schutzgerechtigkeit – Erste Ergebnisse einer Haushaltsbefragung zur Hochwassersituation in Sachsen” [Flood preparedness and protection equity – first results of a household survey on the flood situation in Saxony], Helmholtz Centre for Environmental Research (UFZ), Leipzig, 15/2014, May 2014.

[14] Environmental Systems Research Institute Inc., “Esri: GIS Mapping Software, Spatial Data Analytics & Location Intelligence.” [Online]. Available: https://www.esri.com/en-us/home. [Accessed: 02-Feb-2017].

[15] F. Björck, M. Henkel, J. Stirna, and J. Zdravkovic, “Cyber Resilience – Fundamentals for a Definition,” Adv. Intell. Syst. Comput., vol. 353, pp. 311–316, 2015.

[16] G. Murino, A. Armando, and A. Tacchella, “Resilience of Cyber-Physical Systems: an Experimental Appraisal of Quantitative Measures,” in Proceedings of the 2019 11th International Conference on Cyber Conflict: Silent Battle, NATO CCD COE Publications, 2019, pp. 459–477.


Annex 1 Summary of the input data

The data for predictive modelling was obtained from the following data sources:

1. Roches Point weather station, hourly data:
   - rainfall
   - air/dew point temperature
   - relative humidity / vapour pressure
   - mean sea level pressure
   - wind speed/direction

   https://cli.fusio.net/cli/climate_data/webdata/hly1075.zip
   https://www.met.ie/weather-forecast/roches-point-weather-station-cork

2. Water level gauge, Lee Road station, 5-minute data:
   - water level in metres

   https://data.corkcity.ie/dataset/river-lee-levels

3. NMCI tidal station, 15-minute data:
   - tide in metres

   https://waterlevel.ie/hydro-data/stations/19069/Parameter/S/complete.zip
   https://waterlevel.ie/hydro-data/search.html?rbd=SOUTH%20WESTERN%20RBD

Table 12: Data summary – Roches Point Weather Station – hourly

ID      Element                                     Unit
date    Timestamp of the hourly measurement         dd/mm/yyyy hh:mm
ind     Indicator (for the weather measurement):
        0 satisfactory; 1 deposition; 2 trace or sum of precipitation;
        3 trace or sum of deposition; 4 estimated precipitation;
        5 estimated deposition; 6 estimated trace of precipitation
rain    Precipitation Amount                        mm
ind1    Indicator (for rain):
        0 positive; 1 negative; 2 positive, estimated;
        3 negative, estimated; 4 not available
temp    Air Temperature                             °C
ind2    Indicator:
        0 positive; 1 negative; 2 positive, estimated;
        3 negative, estimated; 4 not available; 5 frozen, negative
wetb    Wet Bulb Air Temperature                    °C
dewpt   Dew Point Air Temperature                   °C
vappr   Vapour Pressure                             hPa
rhum    Relative Humidity                           %
msl     Mean Sea Level Pressure                     hPa
ind3    Indicator:
        2 over 60 minutes; 4 over 60 minutes and defective;
        6 over 60 minutes and partially defective; 7 not available
wdsp    Mean Hourly Wind Speed                      kt
ind4    Indicator:
        2 over 60 minutes; 4 over 60 minutes and defective;
        6 over 60 minutes and partially defective; 7 not available
wddir   Predominant Hourly Wind Direction           degree

Table 13: Data summary – NMCI Ringaskiddy Tidal Station – every 15 minutes

ID      Element                                          Unit
date    Timestamp of the 15-minute interval measurement  yyyy/mm/dd hh:mm:ss
tide    Height of tide                                   metre

Table 14: Data summary – Lee Road Water Level Station – every 5 minutes

ID      Element                                                   Unit
date    Timestamp of the water level measurement (5-minute intervals)   yyyy-mm-ddThh:mm:ss
level   Level Station 1 (target)                                  metre

We used the Lee Road station: Lat 51.89464 / Long -8.51296.
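The three sources above arrive at different sampling rates (hourly, 15-minute and 5-minute), so they must be aligned to a common resolution before modelling. The sketch below shows one way to aggregate 5-minute water level readings to hourly means in plain Python; it is illustrative only, not the RapidMiner process used in the report, and the sample values are invented.

```python
from datetime import datetime
from collections import defaultdict

def hourly_mean(readings):
    """Aggregate (timestamp, value) readings to hourly means.

    readings: iterable of (datetime, float) pairs, e.g. 5-minute
    water level samples from the Lee Road station.
    Returns {hour_start: mean_value}.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}

# Example: three 5-minute level readings within the same hour
samples = [
    (datetime(2019, 1, 1, 10, 0), 1.20),
    (datetime(2019, 1, 1, 10, 5), 1.30),
    (datetime(2019, 1, 1, 10, 10), 1.40),
]
levels = hourly_mean(samples)
```

The hourly aggregates can then be joined to the Roches Point weather records on the hour timestamp.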


Annex 2 Charts

Figure 46: Insured businesses either affected or severely affected by the predicted water level

Figure 47: Ratio of the number of jobs at business locations either affected or severely affected by the predicted flood water level


Figure 48: Value of stock levels held at business locations either affected or severely affected by the predicted flood water level


Annex 3 Review process

The content of this annex has been submitted as part of the periodic review report to the PO/EU/Reviewers.

Review Response

Reviewer 1

General: Change page numbering (Roman numerals are used throughout the entire document).

Addressed

List of Acronyms: Provide a complete list (missing e.g. CSV, FRM, GW, LW, MCA, S, …).

Addressed


Section 1.3: Explain who the end users are. Is this a person that can perform FL assessment and MCDA using RapidMiner applications on his own, or must he use AIA as a consulting firm? How can the applications be made workable for all partners/users without having a RapidMiner expert?

We identified three D3.9 reader groups in section 1.3, one of them being data scientists. Extending the SmartResilience database with data-driven indicators for particular use cases (each with its own data needs) requires data science expertise. Anybody with a background in data science can use RapidMiner to develop the data-driven indicators discussed in D3.9, as well as the MCA processes.

This relates to my review comments as a reviewer for D3.6, which preceded this D3.9 report: "In the context of someone who is tasked with SCI resilience assessment, modelling, monitoring, analysing dependencies and interested in building decision support applications following a guideline, my comments are: An organisation tasked to implement a project that results in decision support applications for end users in their operational environment would require the following project stakeholders:
1. application end users
2. business analysts
3. developers, data scientists
Group one would be linked to application requirements and validation. Group two would be concerned with analysing the needs of application end users, defining their requirements, and working with data scientists and developers to build applications of value to group one. Group three would be concerned with preparing data from identified data sources, applying analytics methods to that data to calculate the values of smart indicators, updating the indicators registered in a database, and creating decision support applications that support SCI resilience assessment, modelling, monitoring and dependency analysis."

The above has been addressed in section 1.3 of this D3.9 report: the section has been extended with a stakeholder role description, and the report addresses all three reader groups. By addressing only the end users of indicators, and not business analysts and data scientists/developers, it would be impossible to understand how an indicator database can be extended with new indicators for different domains. The CRISP-DM process we describe in section 2.2, together with chapters 3, 4, 5 and 6, informs business analysts and data scientists on how to add new resilience indicators to the database that meet end users' requirements.

Section 1.3 first sentence and Section 3.1 last sentence on page v: Denoting the project as a "data science-driven resilience assessment project" indicates a lack of understanding of the overall project. Big data and data analytics only have a partial role in the project.

Addressed in section 1.3.

Section 4.3: Change order of bullets according to the subsections 4.4.1-4.4.3 or change the order of the sub-sections.

Addressed.

Sub-section 4.4.4, Figure 29: Can refer to Figure 10, since this is the same figure.

Figure 29 (now 31) has been updated with the confusion matrix on unseen data.


Section 4.5, Figure 30: Figure 30 is difficult to read and understand. Explain the results and provide understandable captions on the X- and Y-axis.

Addressed in section 4.7 (now 3.7), figures 32 and 33.

Section 5.1, third sentence: "The OPW supported the SmartResilience project and in particular the GOLF case study …". Did they support anything else apart from GOLF?

Addressed in section 4.1.

Section 5.2: Explain the method first, including complete list of criteria, and the ranges used for weights and scores. Then provide the example in Figure 31 (which should be reduced in size). Further, explain the results. What does the result -50 mean?

MCA is now explained; see section 5.3 (now 4.3), MCA benefit score approach. The explanation of figure 31 (now table 8) has been addressed, and the figure reduced in size.

Section 5.3, Figure 37: Show the full calculation of at least one of the options in Figure 37, e.g. how do you obtain 780.561 for option 5? This will make the method and analysis more transparent.

Implemented; see table 9, example MCDM calculation.

Section 5.3, Figure 38: Explain the results.

Implemented; an explanation has been added below the figure (now figure 40).

Section 5.4: What does it take for others to perform MCDM using RapidMiner (without data similar to OPW)? Is this readily available as an alternative to the MCDM in the Integrated Tool, or must AIA first create a new tailor-made application in RapidMiner? Could elaborate on this in Section 5.4.

Implemented. The RapidMiner MCA implementation can be adapted to other MCDM use cases with data similar to that the OPW has been using for decision making in the GOLF case study. The MCDM implementation discussed can be used for other use cases that follow a similar process of global/local weights and scoring. We used the following data structure for data to feed into the MCDM process:

• criteria (text)
• global-weighting (number)
• local-weighting (number)
• score (number)
• option (text)
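As an illustration of how such records combine into a per-option benefit score, the sketch below computes a weighted sum (score × global weight × local weight, summed per option). This is a sketch of the general MCA weighted-scoring idea, not the project's actual RapidMiner process; the criteria names and numbers are invented, and the field order mirrors the data structure above.

```python
from collections import defaultdict

def benefit_scores(rows):
    """Aggregate MCA rows into a benefit score per option.

    Each row follows the data structure listed above:
    (criteria, global_weighting, local_weighting, score, option).
    A criterion contributes score * global_weighting * local_weighting
    to its option's total benefit score.
    """
    totals = defaultdict(float)
    for criteria, g_w, l_w, score, option in rows:
        totals[option] += score * g_w * l_w
    return dict(totals)

rows = [
    ("flood protection", 0.5, 0.6, 80, "option 1"),
    ("cost",             0.5, 0.4, -50, "option 1"),
    ("flood protection", 0.5, 0.6, 40, "option 2"),
    ("cost",             0.5, 0.4, 20, "option 2"),
]
scores = benefit_scores(rows)
```

Ranking the options by their totals then yields the MCDM result; negative scores (such as costs) reduce an option's overall benefit.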

Chapter 6: How has this supported the development of the integrated resilience assessment tool in D3.7?

Addressed in the new section on reuse of RapidMiner analytics processes in chapter 6 (now 5).

Chapter 7: List all the new indicators.

Chapter 7 lists new indicators derived from the following datasets:
o OSI height data (under NDA)
o historical tidal data
o historical weather data
o location statistics data, such as the CCC employment statistics database, from which we extracted data attributes and created dummy values, given that the database was supplied under an NDA
Stock levels held were discussed during the workshop with participants. We did not succeed in accessing meaningful data, but data could be obtained from wholesalers like Musgraves supplying retail businesses in areas affected by flooding; data access would need to be negotiated on commercial and legal terms.

Reviewer 2


Executive Summary: No results and conclusions are mentioned in the executive summary; only the objectives are stated and some acknowledgments are made. Please, make sure that the summary provides also key info on (i) background of the work, (ii) methods used, (iii) results and (iv) discussion / conclusions / interpretations of these results.

Implemented

Introduction: Section 1.2 lists related deliverables but gives no further explanation on what this relation is. This might also be a good place to position the work in D3.9 to the other efforts in the project more explicitly, maybe by pulling it together with the current chapter 2.

Implemented

Chapter 2: The table mentions that chapter 3&4 provide "resilience analytics" for e.g. % jobs affected by flooding. I don't find any such results in chapter 3&4. Please, harmonize this table with the actual content of the report.

Chapters 2 and 6 address the development and use of data-driven predictive indicators, such as % jobs affected, for all three reader audiences of D3.9: end users, business analysts and data scientists/developers.

Section 3.1: Instead of a one-to-one copy of RapidMiner promotional material, a critical appraisal of strengths and weaknesses of this software solution would be more in place. For instance, such environments are typically known to be limited in terms of performance for resource-intensive tasks and in terms of flexibility of implementing minor changes in the predictive algorithms. Is this an issue here too?

Implemented, section 2.1 has been extended with a discussion on performance and alternative software implementation solution options.

Section 4.2: I do not find the data to be sufficiently described in order to make the analysis reproducible. How was the data measured? How many sensors where there? Where was the location of these sensors? Over which time were the observations taken? Has this data been described elsewhere? Please, ask yourself if given on the information that you provide here, a person with a similar background as you could reproduce your findings.

Implemented. Annex 1 summarises the data, including measurement intervals, sensor locations and observation time frames. It also provides links to the datasets (except the location statistics data, such as employment statistics, location height data and others we obtained under an NDA).

Section 4.2: The most important omission is: What were the (potential) predictor variables, and what the predictees? The only (indirect) mention of this I find in the conclusion section!

Section 4.5 (now 3.5) describes the predictor variables (historical data such as water level station readings, rainfall, etc.) and the predicted variable, which is the future water level.

Section 4.3: Describe more clearly what you mean by "windowing", "horizon", and "offset" here. These are not standard terms.

Implemented, see section 3.4
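For readers unfamiliar with the terms: windowing slices a time series into fixed-length input windows, the horizon is how many steps ahead of the window the target value lies, and the offset shifts where windowing starts. A minimal sketch of this transformation follows; it is illustrative only, not the RapidMiner windowing operator itself, and the sample series is invented.

```python
def window_series(series, window, horizon, offset=0):
    """Turn a series into (input_window, target) training examples.

    window:  number of past values used as predictors
    horizon: how many steps ahead of the window the target lies
             (horizon=1 means the value immediately after the window)
    offset:  index at which windowing starts
    """
    examples = []
    i = offset
    while i + window + horizon - 1 < len(series):
        inputs = series[i:i + window]
        target = series[i + window + horizon - 1]
        examples.append((inputs, target))
        i += 1
    return examples

levels = [1.0, 1.1, 1.2, 1.5, 1.9, 2.4]  # e.g. hourly water levels
pairs = window_series(levels, window=3, horizon=2)
# first example: inputs [1.0, 1.1, 1.2], target two steps past the window
```

Each pair becomes one training row for the learner: the window values are the predictor attributes and the target is the future water level.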

Section 4.4.1: It is not clearly described whether the model was evaluated (the "Table" in "Figure" 10) on the training or test data. Also, if, as you describe, you use cross-validation, how can you conclude that your model works well for unseen data? I know that Naive Bayes has the advantage of being easily transferable, but in my understanding, that would require the use of a holdout dataset which I don't find described in the text.

Implemented. Sections 3.2, 3.3, 3.6.1, 3.6.3 and 3.6.4 highlight the use of unseen data; the model performance discussion is based on unseen data from Jan/Feb 2019, while training/test data covered 2015–2018.
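The evaluation scheme described in the response (train and cross-validate on 2015–2018, then assess on a held-out Jan/Feb 2019 period the model never saw) amounts to a chronological split. A sketch, with the cut-off date taken from the response and everything else invented for illustration:

```python
from datetime import datetime

def chronological_split(records, cutoff):
    """Split timestamped records into training and holdout sets.

    records: list of (datetime, value) pairs
    cutoff:  records strictly before this date form the training set;
             the rest form the unseen holdout set
    """
    train = [(ts, v) for ts, v in records if ts < cutoff]
    holdout = [(ts, v) for ts, v in records if ts >= cutoff]
    return train, holdout

records = [
    (datetime(2018, 12, 31, 23), 1.4),
    (datetime(2019, 1, 1, 0), 1.6),
    (datetime(2019, 2, 1, 0), 2.1),
]
train, holdout = chronological_split(records, cutoff=datetime(2019, 1, 1))
```

Cross-validation then operates only within the training set; the holdout set is touched once, for the final performance figures.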


Section 4.4.2.1: I don't understand why one would use a standard ARIMA model to try to predict "deviations from the norm" If you want to predict such things from a timeseries model, consider the use of impulse response functions. In any case, it should be discussed why one would try to model extreme events (flood) through an autoregressive process, I don't get it...

We used ARIMA as one of the main methods for predicting time series, and we also concluded that it did not work, so it is unclear why the reviewer focuses on this. I also disagree a little with the reviewer: floods do not come out of nowhere and are linked to wind, temperature and tidal forecasts, which are not isolated from each other either, so it was the right decision to attempt to fit an ARIMA model. It is always easy afterwards to claim that it does not work. The reality is that we are using models that do things humans cannot, and the proof is in the pudding, so trying them out is paramount.
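To make the autoregressive idea concrete: an AR(1) model, the simplest building block of ARIMA, predicts the next value as a scaled copy of the current one, with the coefficient estimated from history by least squares. A toy sketch in pure Python (illustrative only; a real ARIMA(p, d, q) fit would use a statistics library, and the series here is invented):

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

def forecast(series, phi, steps):
    """Iterate the AR(1) recurrence to forecast future values."""
    values = [series[-1]]
    for _ in range(steps):
        values.append(phi * values[-1])
    return values[1:]

series = [8.0, 4.0, 2.0, 1.0]  # toy series that halves each step
phi = fit_ar1(series)          # exactly 0.5 for this series
preds = forecast(series, phi, steps=2)
```

The limitation raised in the discussion is visible here: a purely autoregressive recurrence extrapolates the recent pattern, which is why extreme events driven by external inputs (rainfall, tide, wind) are hard for it to anticipate.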

Section 4.4.3: Again, it is not clear whether the model quality is measured in or out-of-sample.

Again, this is addressed in sections 3.2, 3.3, 3.6.1, 3.6.3, 3.6.4.

Section 4.4.4: For which reason are false negatives more acceptable than false positives?

We prefer a false alarm, i.e. predicting a flood that then does not occur, over the other way around (a missed flood). This point is made in section 3.6.4.
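That preference can be read directly off a confusion matrix: for a flood/no-flood classifier, false positives are false alarms and false negatives are missed floods. A minimal sketch of counting the four cells (illustrative only, not the RapidMiner output; the labels are invented):

```python
def confusion_counts(actual, predicted, positive="flood"):
    """Count TP/FP/FN/TN for a binary flood / no-flood classifier.

    A false positive (FP) is a false alarm: flood predicted, none
    occurred. A false negative (FN) is a missed flood, which is the
    costlier error in this use case.
    """
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = ["flood", "no flood", "no flood", "flood"]
predicted = ["flood", "flood",    "no flood", "no flood"]
cells = confusion_counts(actual, predicted)
```

Tuning the decision threshold to trade FN for FP implements the stated preference for false alarms over missed floods.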

Figure 19 - …: Most of the time series figures have no or generic axis labels, here and in the following plots.

Implemented

Reviewer 3

Section 4: There is no description of the time series data other than a table with the excerpt from the database. No link to the SmartResilience indicator set is included. Also, no justification of the relevance of this data.

A description of the historical time series data is in Annex 1. We do not use SmartResilience indicators as input to RapidMiner predictive modelling; the aim is to create these indicators. A justification of the relevance of the input data is sufficiently covered in the report.

Section 4: The following work is described, but only with respect to the data for GOLF (no link to other work performed in the project):
o Preprocessing of data
o The use of different prediction models: Naïve Bayes classification, deep learning, ARIMA
o A lot of not really valuable screenshots are used to describe the performed work.

The work described aligns with the CRISP-DM industry reference model for data analytics projects and can therefore be applied to any SmartResilience case study, which will likely require the same time-consuming, iterative approach we went through in the GOLF case. I disagree with the statement that the screenshots are not valuable: they help the data scientist/business analyst readers of this report (ref. section 1.3) follow the approach and enable them to reproduce results for similar business use cases.

Section 4.4.4: In the analysis subsection 4.4.4 the model performance is discussed:
o Too brief, and includes only a comparison of the different modelling approaches, but the objective of this deliverable is not to compare these three prediction models.

I disagree with the review comment that "the objective of this deliverable is not to compare these three prediction models". As clarified in CRISP-DM, modelling (step 3) is followed by evaluation (step 4). It is essential to evaluate the performance of the modelling so that it meets end user stakeholder requirements (step 1). A model needs to forecast values reliably, and this reliability, in the form of prediction accuracy, is discussed in section 4.4.4 (now section 3.6.4).

Section 4.5: Subsection 4.5 includes only one rough figure about the predicted functionality level. This is not understandable without additional description. This is also the first time recovery is mentioned, but recovery aspects are described nowhere in the document.

Further charting of predicted impact and recovery levels has been added in Annex 2, with axis descriptions and self-explanatory figure captions. Recovery is now also mentioned in the executive summary and further in sections 1.1, 1.2, 2.1, 2.2, 3.1, 3.7, 3.8 and 5.1. Section 3.7 contains an extensive description of the predicted impact and recovery levels.

Section 5: There is no link to other parts of the SmartResilience methodology / tool.

Now chapter 4, further explanation has been added on the use of indicators


Section 5: The section describes the implementation of MCDM for GOLF largely via the use of RapidMiner screenshots. The focus is on the technical implementation alone.

A detailed description of the method has been added to chapter 5 (now 4) with a calculation example

Section 5: There is no documentation of how the different decision criteria were developed together with the local stakeholders.

The documentation has been added.

Section 5: Section 5.4. (adaption to other use cases) could be highly relevant, but it just 4 lines. No details are provided.

This is now covered in section 4.5 conclusions

Section 5: Overall, this section is also poor. It does not address the reviewer’s interpretation of the task at any time. There is no link to the SmartResilience methodology / tool. Even for the other possible interpretation (independent use of RapidMiner), discussions related to the potential use of RapidMiner for other scenarios are missing. Everything is very brief, and the focus is just on the technical implementation.

Again, I disagree with the "poor" rating. The chapter is in line with T3.4, in which AIA contributes to developing multi-criteria decision analysis tools based on, e.g., RapidMiner. Further explanation has been added on the use of indicators. The application of the RapidMiner implementation to other use cases is now addressed in section 4.5, conclusions. A non-technical, conceptual explanation of the MCA method has been added to chapter 5 (now 4).

Section 6: Section 6 consists of one page (1 figure and 1 paragraph text). No results are provided. This section is not acceptable.

Chapter 6 has been expanded and a conclusions section added.

Section 7: This seems to be somewhat unrelated to the objectives of this deliverable. It is a high level description of indicators from GOLF which are included in the SmartResilience database.

Again, this relates to a previous comment: As per the recent progress report D7.4, AIA is reporting on T4.6 within D3.9: “D3.9 is closely related to finalising AIA contributions in D4.6 and AIA will be in a position to update the Stuttgart database once D3.9 is completed, should D4.6 be submitted before D3.9 AIA plans to report on D4.6 in D3.9”