Name | Designation | Affiliation | Signature

Custodian
Mauro Dolci, TM.LMC Lead, INAF
Date:

Approved by
V. Sathe, Project Manager, TCS
Date:

Released by
Ray Brederode, TM Configuration Manager, SKA SA
Date:
SKA1 TM SER SOFTWARE ARCHITECTURE DOCUMENT
TM Number ................................. T0800-0000-AR-001
SKAO Number ............................... SKA-TEL-TM-0000247
Context ................................... TM-LMC-DD
Revision .................................. 02
Primary Author ............................ Matteo Di Carlo
Date ...................................... 2018-06-29
Document Classification ................... FOR PROJECT USE ONLY
Status .................................... Approved
Document No.: T0800-0000-AR-001    Revision: 02    Date: 2018-06-29
For Project use only    Author: Matteo Di Carlo
Author(s)
Name | Designation | Affiliation
Matteo Di Carlo, TM.LMC team member, INAF
Matteo Canzari, TM.LMC team member, INAF
Mauro Dolci, TM.LMC lead, INAF
Riccardo Smareglia, INAF

Contributor(s)
Name | Designation | Affiliation
Bruno Morgado, TM.LINFRA team member, IT/ENGAGE SKA
João Paulo Barraca, TM.LINFRA team member, IT/ENGAGE SKA
D. Barbosa, TM.LINFRA Lead, IT/ENGAGE SKA
DOCUMENT HISTORY

Revision | Date of Issue | Engineering Change Number | Comments

A   2017-07-31   CDR   Initial draft for peer review.

B   2017-10-31   CDR   Second draft for peer review:
    1. Combined ‘GUI C&C View’ into ‘Service C&C View’;
    2. Moved ‘Virtualization View’ and ‘TM Health Status and State Analysis View’ into the Appendix, since they are not really architectural;
    3. Modified section ‘How stakeholder can use the documentation’;
    4. Expanded section ‘How a view is documented’;
    5. Modified section ‘System overview’;
    6. Modified section ‘Mapping between views’;
    7. Modified section ‘Rationale’ for the quality attributes subsection;
    8. Moved scenarios (section ‘Use cases’) into the Appendix;
    9. Refactored ‘Abstract Data Model View’ into four view packets;
    10. Removed section ‘Prototype’, because it is part of the TM prototyping, and added a reference to the right document;
    11. Modified ‘Virtualization interface’;
    12. 88 other minor changes.

C   2017-12-07   -   Third draft for SKAO review:
    1. Modified glossary and list of abbreviations;
    2. Added reader guides to every view;
    3. Added rationale for agent-based versus agentless systems into the ‘Service C&C View’ section;
    4. Added rationale for failover mechanism and cloud selection into the ‘Allocation View’ section;
    5. Refactored the ‘Virtualization view’;
    6. 84 other minor changes.

D   2018-01-31   -   1. Document reviewed by TechCom;
    2. Added example on ‘TM Health Status and State Analysis View’.

01  2018-02-28   -   Approved for CDR submission.

1A  2018-06-04   -   Implemented CDR observations:
    TMCDR-78: clarified distinction between deployment and configuration;
    TMCDR-80: clarified generic monitoring scope;
    TMCDR-477: clarified allocation view;
    TMCDR-570: added compliance section 13.1.

02  2018-06-29   -   Approved for CDR closure.
DOCUMENT SOFTWARE

Package         | Version                    | File Name
Word processor  | Microsoft Word 2011, 2013  | T0800-0000-AR-001-02_TM_SER_SAD.docx
Block diagrams  | Cameo Systems Modeler 18.0 | SA Teamwork project TM Library
Other           | Google Docs                | https://drive.google.com/open?id=0B31xtq-7eI7hVzI1U1JmM1V4ZnM
ORGANISATION DETAILS
Name National Centre for Radio Astrophysics
Registered Address National Centre for Radio Astrophysics
Tata Institute of Fundamental Research,
Pune University Campus,
Post Bag 3, Ganeshkhind,
Pune – 411007,
Maharashtra,
India
Phone Tel: +91 20 25719000, +91 20 25719111
Fax: +91 20 25692149
Website www.ncra.tifr.res.in
TABLE OF CONTENTS

1 LIST OF ABBREVIATIONS .......................................... 10
2 GLOSSARY ....................................................... 11
3 INTRODUCTION ................................................... 14
3.1 Scope of the document ........................................ 14
3.2 Applicable and Reference Documents ........................... 14
3.2.1 Applicable Documents ....................................... 14
3.2.2 Reference Documents ........................................ 14
4 DOCUMENTATION ROADMAP .......................................... 16
4.1 How the documentation is organized ........................... 16
4.2 View Overview ................................................ 17
4.3 How stakeholder can use the documentation .................... 17
5 HOW A VIEW IS DOCUMENTED ....................................... 19
6 SYSTEM OVERVIEW ................................................ 20
7 MAPPING BETWEEN VIEWS .......................................... 23
8 RATIONALE ...................................................... 24
9 USE CASES ...................................................... 27
9.1 Monitoring use cases ......................................... 27
9.2 Fault Management use cases ................................... 27
9.3 Life-cycle Management use cases .............................. 28
9.3.1 Entity management use case ................................. 29
9.4 Logging use cases ............................................ 29
10 VIEWS ......................................................... 31
10.1 Uses Module View ............................................ 31
10.1.1 Primary Presentation ...................................... 32
10.1.2 Element Catalog ........................................... 33
10.1.2.1 Elements ................................................ 33
10.1.2.2 Relations ............................................... 37
10.1.2.3 Behaviour ............................................... 38
10.1.3 Rationale ................................................. 38
10.1.3.1 Interfaces .............................................. 38
10.2 Services C&C View ........................................... 39
10.2.1 Primary Presentation ...................................... 39
10.2.2 Element Catalog ........................................... 41
10.2.2.1 Elements ................................................ 41
10.2.2.2 Relations ............................................... 42
10.2.2.3 Interfaces .............................................. 44
10.2.2.4 Behaviour ............................................... 45
10.2.3 Variability Mechanisms .................................... 49
10.2.3.1 Agent/Agentless solution ................................ 49
10.2.3.2 Lifecycle Manager ....................................... 50
10.2.3.3 Monitoring .............................................. 50
10.2.3.4 Logging ................................................. 51
10.2.4 Rationale ................................................. 51
10.2.4.1 Lifecycle manager ....................................... 51
10.2.4.2 Monitoring System ....................................... 52
10.2.4.3 Failover mechanism ...................................... 52
10.2.4.4 Logging ................................................. 52
10.2.4.5 Service GUI ............................................. 52
10.3 Abstract Data Model ......................................... 53
10.3.1 Description ............................................... 53
10.3.2 Overview .................................................. 53
10.3.3 Entity decomposition View Packet .......................... 53
10.3.3.1 Primary Presentation .................................... 53
10.3.3.2 Element Catalog ......................................... 54
10.3.3.3 Rationale ............................................... 56
10.3.4 Monitoring View packet .................................... 56
10.3.4.1 Primary Presentation .................................... 56
10.3.4.2 Element Catalog ......................................... 56
10.3.4.3 Behavior ................................................ 58
10.3.4.4 Context Diagram ......................................... 59
10.3.4.5 Rationale ............................................... 59
10.3.5 Lifecycle View Packet ..................................... 59
10.3.5.1 Primary Presentation .................................... 60
10.3.5.2 Element Catalog ......................................... 60
10.3.5.3 Context Diagram ......................................... 63
10.3.5.4 Rationale ............................................... 63
10.3.6 Virtualization View Packet ................................ 63
10.3.6.1 Primary Presentation .................................... 64
10.3.6.2 Element Catalog ......................................... 65
10.3.6.3 Context Diagram ......................................... 73
10.3.6.4 Rationale ............................................... 73
10.4 Allocation View ............................................. 73
10.4.1 Primary Presentation ...................................... 74
10.4.2 Element Catalog ........................................... 74
10.4.2.1 Definitions ............................................. 74
10.4.2.2 Elements ................................................ 74
10.4.2.3 Relations ............................................... 76
10.4.3 Variability Mechanisms .................................... 76
10.4.4 Rationale ................................................. 77
10.4.4.1 Tactics ................................................. 77
11 INTERFACES .................................................... 78
11.1 Lifecycle Manager - TM Generic Application - Interface ...... 78
11.1.1 Interface identity ........................................ 78
11.1.2 Resources provided ........................................ 78
11.1.3 Error handling ............................................ 79
11.1.4 Rationale and design issues ............................... 79
11.2 SSM - Monitoring Activity Interface ......................... 79
11.2.1 Interface identity ........................................ 79
11.2.2 Resources provided ........................................ 79
11.2.3 Data types and constants .................................. 80
11.2.4 Error handling ............................................ 80
11.3 TM Monitor Interface ........................................ 80
11.3.1 Interface identity ........................................ 80
11.3.2 Resources provided ........................................ 80
11.3.3 Data types and constants .................................. 81
11.3.4 Error handling ............................................ 81
11.3.5 Quality attribute characteristics ......................... 82
11.3.6 Rationale and design issues ............................... 82
11.4 Virtualization Interface .................................... 82
11.4.1 Interface definition ...................................... 82
11.4.2 Template Actions .......................................... 82
11.4.3 vResource Internal Actions ................................ 86
11.4.4 Error handling ............................................ 88
11.4.5 Rationale and design issues ............................... 88
12 PROTOTYPES .................................................... 88
13 APPENDIX ...................................................... 89
13.1 Compliance statements for TM Service Requirements ........... 89
13.2 Detailed scenarios .......................................... 89
13.2.1 Monitoring scenarios ...................................... 90
13.2.1.1 Monitoring Resources .................................... 90
13.2.1.2 Monitoring Services (on network) ........................ 90
13.2.1.3 Asynchronous Monitoring Software component .............. 91
13.2.1.4 Synchronous Monitoring Software component ............... 91
13.2.1.5 Sending alarm ........................................... 92
13.2.2 Fault management scenarios ................................ 92
13.2.2.1 Insert Recovery procedure ............................... 92
13.2.2.2 Alarm notification ...................................... 92
13.2.3 Lifecycle Management Scenarios ............................ 93
13.2.3.1 Configure/Start Application ............................. 93
13.2.3.2 Kill Application ........................................ 93
13.2.3.3 Restart Application ..................................... 94
13.2.3.4 Add Application Version ................................. 94
13.2.3.5 Remove Application Version .............................. 94
13.2.3.6 Set on line Application version ......................... 95
13.2.3.7 Set off line Application version ........................ 95
13.2.3.8 Update Application ...................................... 96
13.2.3.9 List Applications ....................................... 96
13.2.3.10 Use Application ........................................ 96
13.2.4 Logging scenarios ......................................... 97
13.2.4.1 Store Log ............................................... 97
13.2.4.2 Search Log .............................................. 97
13.2.4.3 Extract Log File ........................................ 98
13.3 Other views ................................................. 98
13.3.1 TM Health Status and State Analysis View .................. 98
13.3.1.1 Primary Presentation .................................... 99
13.3.1.2 Element Catalog ........................................ 100
13.3.1.3 Context Diagram ........................................ 101
13.3.1.4 Related View ........................................... 101
13.3.1.5 Rationale .............................................. 102
13.3.2 Virtualization View ...................................... 109
13.3.2.1 Primary Presentation ................................... 109
13.3.2.2 Context Diagram ........................................ 114
13.3.2.3 Rationale .............................................. 114
LIST OF FIGURES

Figure 1: TM SER Context. ........................................ 21
Figure 2: Monitoring context. .................................... 21
Figure 3: TM SER use cases, monitoring and logging. .............. 27
Figure 4: TM SER use cases, lifecycle control. ................... 28
Figure 5: Uses module diagram. ................................... 32
Figure 6: SSM, LM, LS and Virtualization decomposition. Notation is SysML. ... 33
Figure 7: Service C&C View. ...................................... 40
Figure 8: Lifecycle Generic Execution. ........................... 46
Figure 9: Perform monitoring activity. ........................... 47
Figure 10: Perform Fault Management. ............................. 48
Figure 11: Store log. ............................................ 49
Figure 12: Entity and version. ................................... 54
Figure 13: Monitoring data model. ................................ 56
Figure 14: Monitoring context. ................................... 59
Figure 15: Lifecycle Data Model. ................................. 60
Figure 16: Lifecycle context. .................................... 63
Figure 17: Virtualization Data Model. ............................ 64
Figure 18: Deployment diagram. ................................... 74
Figure 19: TM Template Actions depending on the State. ........... 86
Figure 20: Health status calculation. ............................ 99
Figure 21: Mathematical representation. ......................... 100
Figure 22: For every TM Process there will be at least two monitoring activities, to retrieve the state and a measure of the performance. ... 101
Figure 23: Adapted from Figure 21. .............................. 106
Figure 24: Aggregated health status drill-down possibility and calculation. ... 107
Figure 25: Drill-down with aggregation levels. .................. 108
Figure 26: Overall Presentation of Execution Environment. ....... 110
Figure 27: Template and Instance presentation. .................. 110
Figure 28: Activity diagram for managing a template for a TM App. ... 113
LIST OF TABLES

Table 1: Sections overview. ...................................... 16
Table 2: View overview. .......................................... 17
Table 3: Mapping between views. .................................. 23
Table 4: ASRs. ................................................... 24
Table 5: Physical Resource File Descriptor. ...................... 66
Table 6: Product Execution File Descriptor. ...................... 67
Table 7: vResource File Descriptor. .............................. 68
Table 8: Physical Resource File Descriptor errors. ............... 71
Table 9: Product Execution File Descriptor Errors. ............... 71
Table 10: vResource File Descriptor Errors. ...................... 72
Table 11: Available actions for the Virtualization Service. ...... 83
Table 12: Allowed Actions according to the State. ................ 85
Table 13: vResources Internal Actions. ........................... 87
Table 14: Compliance statements for TM Service Requirements. ..... 89
1 LIST OF ABBREVIATIONS
TERM DESCRIPTION
AAA Authentication, Authorization and Auditing
ACK Acknowledge
ACL Access Control List
AGN Aggregation Node
API Application Programming Interface
ASR Architecturally Significant Requirement
ASRs Architecturally Significant Requirements
C&C Component and Connector
CP Control Point
CPF Common-Point Failure
CRUD Create, Read, Update, Delete
DSH Dish/Dishes
EDA Engineering Data Archive
EMS Engineering Management System
ERR Error
FM Fault Manager
FMECA Failure Modes, Effects and Criticality Analysis
GUI Graphical User Interface
IICD Internal Interface Control Document
INAF Istituto Nazionale di Astrofisica
LHC Large Hadron Collider
LINFRA Local Infrastructure and Power
LM Lifecycle Manager
LMC Local Monitoring and Control
LMS Lifecycle Manager Software
LS Logging Service
MeerKAT Karoo Array Telescope. SKA Precursor in South Africa
MP Monitoring Point
NRCC-DRAO National Research Council of Canada – Dominion Radio Astrophysical Observatory
OBSMGT Observation Management
OSO Observatory Science Operation
QoS Quality of Service
RAMS Reliability, Availability, Maintainability and Safety
REF Refused
REST REpresentational State Transfer
SAD Software Architecture Document
SEI Software Engineering Institute
SER TM Services
SKA Square Kilometre Array
SKA-SA SKA South Africa
SLA Service Level Agreement
SSM Software System Monitor
SysML Systems Modeling Language
TBC To Be Confirmed
TBD To Be Determined
TBJ To Be Justified
TELMGT Telescope Management
TM Telescope Manager
TMC Telescope Manager Control
TMO TM Observatory
TPZ Telespazio SpA – A Finmeccanica/Thales Company
UniRM2 Università degli Studi di Roma 2 – Tor Vergata
2 GLOSSARY
TERM DESCRIPTION
Access Control List A list of permissions attached to an object such as an operating system, a service, and so on. An ACL specifies which users or system processes are granted access to objects, as well as which operations are allowed on given objects.
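As an illustration only, an ACL of this kind can be reduced to a per-object map from principals to the operations they are granted. The object and principal names below are invented for this sketch and are not part of the TM design:

```python
# Hypothetical ACL sketch: each object maps principals to the set of
# operations they are allowed to perform. All names are illustrative.
acl = {
    "tm-service.log": {
        "operator": {"read"},
        "maintainer": {"read", "write"},
    },
}

def allowed(obj, principal, operation):
    """Return True if the ACL grants `principal` the `operation` on `obj`."""
    return operation in acl.get(obj, {}).get(principal, set())

print(allowed("tm-service.log", "operator", "read"))   # True
print(allowed("tm-service.log", "operator", "write"))  # False
```

Any principal or object absent from the map is denied by default, which matches the usual ACL convention.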
Agent A software service running on the host that measures usage and sends the results to the collector.
Aggregated MP An MP composed of a set of atomic MPs and/or lower-level aggregated MPs. (For example, eth1 is a network interface of a specific host; an aggregated MP can be bound to eth1 itself, composed of outbound network traffic, inbound network traffic, a counter of the packets that have been dropped, and so on.)
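The eth1 example can be sketched as a small composite structure; the class and field names below are assumptions made for illustration, not identifiers from the TM design:

```python
# Hypothetical sketch of an aggregated monitoring point (MP) composed
# of atomic MPs, as in the eth1 example above. Names are illustrative.

class AtomicMP:
    def __init__(self, name, value):
        self.name = name
        self.value = value

class AggregatedMP:
    def __init__(self, name, children):
        self.name = name          # e.g. the network interface "eth1"
        self.children = children  # atomic and/or lower-level aggregated MPs

    def readings(self):
        """Flatten all atomic readings reachable under this aggregated MP."""
        out = {}
        for child in self.children:
            if isinstance(child, AggregatedMP):
                out.update(child.readings())
            else:
                out[child.name] = child.value
        return out

eth1 = AggregatedMP("eth1", [
    AtomicMP("inbound_traffic_bps", 1_200_000),
    AtomicMP("outbound_traffic_bps", 800_000),
    AtomicMP("dropped_packets", 3),
])
print(eth1.readings()["dropped_packets"])  # 3
```

Because an aggregated MP may contain other aggregated MPs, `readings` recurses, mirroring the "and/or lower-level aggregated MPs" clause of the definition.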
Archive Permanent storage of data, including logs, monitoring data, list of monitoring points, monitoring configuration data, metering configuration data, events, alarms, failures, predictions, and so on. Note that the term Archive, without being preceded by other terms, includes ALL gathered data.
Configuration A set of information that affects (changes) the behaviour or status of a specific target.
Distribution model A model describing the expected roles of the components and the kinds of interaction that can be performed.
Error A deviation of a system from normal operation.
Failure The inability of a system or component to perform required functions within specified performance requirements.
Fault An inherent weakness of the design or implementation of a system or component, which might result in a failure. It appears normally as a lasting error or a warning condition.
Health Status An attribute to be assigned to a component or to a (physical or logical) group of components, both hardware (that is, fan, network interface,
host...) and software (that is, a computing process, a service...), to denote their health level. The attribute may be expressed, for example, in percentage, with 100% denoting perfect behaviour and <100% a degraded behaviour triggering a fault analysis on that component.
Logging The activity of a system to record a specific message when executing an action. Logs are normally stored in a log file.
Maintainer An Operator responsible for maintaining one or more software applications.
Monitoring data A subset of the metric samples on all TM sub-elements (hardware, software and communication), gathered periodically/on event/on request and stored in the monitor store. Monitoring data may include:
Failure detections, based on a FMECA analysis
Metric samples used for failure prediction and diagnosis
Metric samples required to identify faults (fault finding)
Near Real-time
The time delay, introduced by automated data processing or network transmission, between the occurrence of an event and the use of the processed data; the term implies that there are no significant delays.
Node In a computer network, a node is an end-point identified by an IP address or a name that can receive, create, store or send data along distributed network routes.
Polling agent An agent that collects measurements by polling some API or other tool, usually at a regular interval.
Process An instance of an executable running in sequential or parallel mode.
Push agent An agent that provides measurements by pushing them to some API or other tool, usually in an event-triggered fashion. The push agent is the only solution for fetching data from TM sub-elements that do not expose the required data in a remotely usable way. It is not the preferred method, as it makes deployment more complex by adding a component to each of the nodes that need to be monitored.
Resource A physical or virtual component of limited availability within the system, such as time of execution, resident memory, CPU cycles, number of hosts, bandwidth usage, power consumption, and so on.
Restart A process acting on a system whose initial, intermediate and final statuses are ON, OFF and ON, respectively.
Sample Data sample for a particular meter.
Server It is a computer program or a device that provides functionality for other programs or devices, called ‘clients’.
Service Level Agreement
It is an agreement between a service provider and a client where some aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. SLA can have a technical definition in mean time between failures (MTBF), mean time to repair or mean time to recovery (MTTR); identifying which party is responsible for reporting faults or paying fees; responsibility for various data rates; throughput; jitter; or similar measurable details.
Shutdown A process acting on a system whose initial and final statuses are ON and OFF, respectively.
Start-up A process acting on a system whose initial and final statuses are OFF and ON, respectively.
TM Sub-System OSO, TMC or SER
View A representation of a coherent set of architectural elements (a structure) needed to reason about the system. A view comprises software elements, relations among them, and properties of both.
3 Introduction
Scope of the document
This document defines the software architecture for the Telescope Manager (TM) Services (SER) sub-element of SKA. Throughout this document, the terms ‘TM.SER’ and ‘SER’ indicate the Telescope Manager Services sub-element for SKA, and ‘SKA’ refers to the SKA telescope system. The requirements related to TM.SER are reported in the corresponding document ([AD1], see the following).
Applicable and Reference Documents
Applicable Documents
The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.
[AD1] SKA-TEL-TM-0000252, T0800-0000-RS-001, SKA1 TM SERVICE REQUIREMENT SPECIFICATION, Rev 02
[AD2] T0000-0000-MP-003, SKA1 TM Maintenance Plan, Rev 03
[AD3] SKA-TEL-SKO-0000656, SKA Control System Guidelines (CS_Guidelines) - Volume 2: SKA LOGGING GUIDELINES, Rev A
[AD4] 000-000000-010, SKA Control System Guidelines (CS_Guidelines), Rev. 01
[AD5] SKA-TEL-TM-0000263, T0000-0000-AR-026, SKA1 TM Context Document, Rev. 02
[AD6] T0000-0000-RAM-001, SKA RAM AND ILS Report, Ver 01
[AD7] SKA-TEL-TM-0000270, Telescope Manager Product Breakdowns and Dictionary, Ver 01
[AD8] SKA-TEL-TM-0000253, SKA1 TM Services Test Specification, Ver. 01
Reference Documents
The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document shall take precedence.
[RD1] L. Bass, P. Clements, R. Kazman, Software Architecture in Practice, SEI Series
[RD2] Paul Clements, Felix Bachmann, Len Bass, David Garlan, James Ivers, Reed Little, Paulo Merson, Robert Nord, Judith Stafford, Documenting Software Architectures: Views and Beyond, Second Edition, SEI
[RD3] SKA-TEL-SKO-0000661, Fundamental SKA Software and Hardware description language standards, Rev 02
[RD4] T0000-0000-AR-007, SKA-TEL-TM-0000148, SKA1 TM Prototyping Report, Rev 01
[RD5] T0700-0000-DR-001, SKA-TEL-TM-0000037, SKA1 AAA Design Report, Rev 02
[RD6] Starter Device, http://tango-controls.readthedocs.io/en/latest/tools-and-extensions/astor/introduction.html
[RD7] Astor, http://www.tango-controls.org/community/projects/astor/
[RD8] TANGO Generic Web Application, https://github.com/tango-controls/tango-webapp
[RD9] TANGO features overview, http://tango-controls.readthedocs.io/en/latest/tools-and-extensions/astor/features_overview.html
[RD10] TANGO Kernel Documentation, http://tango-controls.readthedocs.io/en/latest/contents.html
[RD11] LogViewer, http://www.tango-controls.org/community/projects/log-viewer/
[RD12] Jive, http://www.tango-controls.org/community/projects/jive/
[RD13] Nagios redundancy, https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/redundancy.html
[RD14] Nagios Hardware requirements, https://assets.nagios.com/downloads/nagiosxi/docs/Nagios-XI-Hardware-Requirements.pdf
[RD15] ELK Cluster Documentation, https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
[RD16] ELK Failover Documentation, https://www.elastic.co/guide/en/elasticsearch/guide/current/_add_failover.html
[RD17] Heat, https://docs.openstack.org/heat/latest/
[RD18] OpenStack Networking, https://docs.openstack.org/liberty/networking-guide/intro-networking.html
[RD19] LHC, https://home.cern/topics/large-hadron-collider
[RD20] Elettra, www.elettra.trieste.it
[RD21] ELK, www.elastic.co
[RD22] Apache Lucene, lucene.apache.org
[RD23] AWS CloudFormation, http://docs.amazonwebservices.com/AWSCloudFormation/latest/APIReference/Welcome.html?r=7078
[RD24] Chef automation tool, www.chef.io
[RD25] Ansible automation tool, www.ansible.com
[RD26] Puppet automation tool, puppet.com
[RD27] TANGO cookbook, https://supermarket.chef.io/cookbooks/tango
[RD28] Turowski M., Lenk A. (2015) Vertical Scaling Capability of OpenStack. In: Toumani F. et al. (eds) Service-Oriented Computing - ICSOC 2014 Workshops. Lecture Notes in Computer Science, vol 8954. Springer, Cham
[RD29] Neutron, https://wiki.openstack.org/wiki/Neutron
[RD30] Cinder, https://wiki.openstack.org/wiki/Cinder
[RD31] Docker, www.docker.com
[RD32] KVM, https://www.linux-kvm.org/page/Main_Page
[RD33] Dockerfile, https://docs.docker.com/engine/reference/builder/
4 Documentation Roadmap
This section describes how this document is organized and how a reader can find the information of interest directly, without reading it cover-to-cover.
How the documentation is organized
This document is organized into the sections highlighted in the following table. Table 1: Sections overview.
Section Overview
Document History
It shows the revisions of the present document
Documentation Roadmap - How the documentation is organized
It explains what can be found in each section of the present document
Documentation Roadmap - View Overview
It gives a short description of the views included in the present SAD
Documentation Roadmap - How stakeholders can use the documentation
It explains how a stakeholder can read the documentation to address their concerns.
How a View is documented
It explains the organization for a generic view
System overview
It gives a short description of the main functions of the system, its users and the background needed to go through the various views
Mapping between views
Shows the main entities of the system and their presence in the views
Rationale - Architecturally Significant Requirements
It explains the ASRs that drove the development of the present architecture
Rationale - Decisions It explains the main decisions that drove the development of the present architecture, considering the ASRs as well
Rationale - Products It shows the main products that will be delivered from the development of the present architecture
Rationale - Quality Attributes It shows the qualities that drove the development of the present architecture
Glossary
It defines terms used in the architecture documentation
Acronym
It defines acronyms used in the architecture documentation
View Overview
The present section gives a short description of the views included in the present SAD. Each of them has been created with a specific purpose or to answer a specific question and, in general, they represent the decisions taken to solve the problem highlighted in [AD1].
Table 2: View overview.
View Overview
Uses Module View This is a module view as per [RD2]. It is a decomposition of the system into units of implementation and shows the distinction between off-the-shelf software and built software.
Service C&C View This is a C&C view as per [RD2]. It highlights the runtime components of the system and their relations. It also highlights the decisions for having an agent-based architecture (even if an agentless architecture is still possible) and how it works.
Abstract Data Model This view is the blueprint for the implementation of the data entities and a domain analysis of the main concepts the TM Services work with. Without a data model it would not be possible to achieve qualities such as modifiability, or to understand how the virtualization service works.
Allocation View This is an allocation view as per [RD2]. It shows the mapping between the runtime components of the system and the servers needed to run them, along with decisions such as the failover mechanism (that is, an active-passive cluster).
How stakeholders can use the documentation
The main stakeholders of the TM Services are the OBSMGT and TELMGT teams responsible for the OSO and TMC software architectures. Working with them has made it possible to
Define better requirements and
Avoid misunderstandings.
Together with them, the TM architects (who are stakeholders for every view), reviewers, developers and maintainers represent the stakeholders for the present documentation. The Uses Module View decomposes the system into units of implementation, distinguishing between off-the-shelf software and built software. It is primarily for architects, reviewers and managers who want to understand how the system is decomposed, for planning purposes for example. The Service C&C View shows how the system works at runtime, showing the types of connections of the active instances. It is primarily for developers, architects and reviewers looking to answer questions such as ‘what is the dynamic behaviour of the system?’ or ‘who starts the interaction of monitoring, lifecycle or logging?’. The Abstract Data Model view (see 10.3) is the blueprint for the implementation of the data entities and a domain analysis of the main concepts the TM Services work with. It is primarily for developers and maintainers who want to understand the data model of the system, such as what its main entities are. In particular, it is shown how a monitoring activity can be seen both
as data and as runnable entities of the TM SER, demonstrating how the system can perform different functions (new activities) without changing its architecture. It is also shown how a new version of an application can be added. The possibility to add versions and, for each of them, one or more monitoring activities is the core of the modifiability quality attribute for the TM SER. The Allocation View maps the runtime processes highlighted in the Service C&C View onto running servers (which can be virtual or real). It is primarily for maintainers, architects and reviewers who want to understand how many servers the Services will need at runtime and how the scalability of the system can be increased.
5 How a View is documented
Every view is documented using the standard TOC of the Views and Beyond approach of the SEI [RD1][RD2], which comprises the following sections:
Name of view
View description
Primary presentation: This section presents the elements and the relations among them that populate this view packet, using an appropriate language, notation, or tool-based representation.
Element catalog: Whereas the primary presentation shows the important elements and relations of the view packet, this section provides additional information needed to complete the architectural picture. It consists of subsections for (respectively) elements, relations, interfaces, behaviour, and constraints.
Context diagram: This section sets the context for the system represented by this view packet. It also designates the view packet’s scope with a distinguished symbol, and shows interactions with external entities in the vocabulary of the view.
Variability mechanisms: This section describes any variabilities that are available in the portion of the system shown in the view packet, along with how and when those mechanisms may be exercised.
The Abstract Data Model View (see 10.3) is an exception, since it required a number of "view packets". Each view packet is structured following the standard TOC, as specified above.
6 System overview
The SKA project is an international effort (10 member and 10 associated countries, with the involvement of 100 companies and research institutions) to build the world’s largest radio telescope. The SKA Telescope Manager (TM) is the core package of the SKA Telescope, aimed at scheduling observations, controlling their execution, monitoring the telescope and so on. To do that, TM directly interfaces with the Local Monitoring and Control systems (LMCs) of the other SKA Elements (for example, Dishes, Correlator and so on), exchanging commands and data with them by using the TANGO controls framework. TM in turn needs to be monitored and controlled in order to ensure its continuous and proper operation; this higher responsibility has been assigned to the TM SER package.
The problem of monitoring and controlling software can be framed as an artificial-intelligence problem, namely a search in a state space characterised by:
1. an initial state;
2. a set of possible actions that transform one state into another;
3. a path from one state to another (a list of actions).
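This search formulation can be illustrated with a minimal sketch. The states and actions below are invented for illustration only; they are not taken from the actual SER design:

```python
from collections import deque

def find_path(initial, goal, actions):
    """Breadth-first search for a list of actions leading from
    `initial` to `goal`. `actions` maps (state, action) -> next state."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for (src, action), dst in actions.items():
            if src == state and dst not in visited:
                visited.add(dst)
                frontier.append((dst, path + [action]))
    return None  # goal state not reachable from the initial state

# Illustrative action set: lifecycle-like commands transforming system states
ACTIONS = {
    ("FAULTY", "reconfigure"): "OFF",
    ("OFF", "start"): "ON",
    ("ON", "stop"): "OFF",
}

print(find_path("FAULTY", "ON", ACTIONS))  # ['reconfigure', 'start']
```

Here a path is simply the list of actions that transforms the initial state into the goal state, which is exactly what the SER needs: a set of actions plus monitoring data from which the current state can be calculated.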
This description of the problem helps in understanding what is needed to realize the architecture for the TM SER: a set of actions and a set of monitoring data from which a state can be calculated. From the requirements analysis (done with the help of the entire TM team in the numerous discussions held), the main functions of the system have been extracted; they are described in the use cases section (see 9).
Therefore, the main system’s functions (see 9) can be summarized in the following list:
TM generic monitoring and fault management to detect internal failure and gather TM performance;
TM lifecycle management to manage the versions of the TM and the TM applications which includes:
o Configuration of TM software applications; o Starting, stopping and restarting of TM software applications; o Update and downgrade of TM software applications;
TM Logging, which includes the control of the destination of log messages, the transformation of the message (if required) and the query GUI;
Controlling of the virtualization system, according to the interface provided by the LINFRA team.
It is worth noting that deployment is different from configuration: conceptually, deployment comes first and configuration follows. Update, upgrade and downgrade assume that the new/old version is already on the same machine, so an update or downgrade is basically a restart with a different version. The TM Services sit between the domain logic and the infrastructure. The following diagram explains this concept with a layered structure:
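As a hedged illustration of this idea (the class and method names below are assumptions, not the actual SER interface), an update or downgrade can be modelled as a stop followed by a start of a version that deployment has already placed on the machine:

```python
class AppLifecycle:
    """Minimal lifecycle sketch: deployment puts versions on the host;
    configuration and start/stop/update act only on what is already there."""

    def __init__(self, deployed_versions):
        self.deployed = set(deployed_versions)  # versions already on the machine
        self.running = None                     # currently running version
        self.config = {}

    def configure(self, **settings):
        self.config.update(settings)            # configuration follows deployment

    def start(self, version):
        if version not in self.deployed:
            raise ValueError(f"version {version} not deployed on this host")
        self.running = version

    def stop(self):
        self.running = None

    def update(self, version):
        # update/downgrade = restart with a different, already deployed version
        self.stop()
        self.start(version)

app = AppLifecycle({"1.0", "1.1"})
app.start("1.0")
app.update("1.1")
print(app.running)  # 1.1
```

In the real system these operations would be scripts run by an IT automation tool (see 10.1); the sketch only captures the deployment/configuration/update distinction.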
Figure 1: TM SER Context.
Domain/Business Layer: functional monitoring and control of the business logic performed by each application;
Services Layer: monitors and controls processes at a generic (non-functional) level, such as web services, database servers and custom applications;
Infrastructure Layer: monitors and controls virtualisation, servers, OS, network and storage.
The proposed architecture has been driven by the study of best practices and well-known solutions to the problems highlighted by the TM and SER requirements (see [AD1]), reserving new developments only for uncovered problems. Another important function of the system is the aggregation of the TM health status and the TM State (of the various TM applications) and reporting them to the Operator. This function can be considered an application of the current architecture and is described in the TM Health Status and State Analysis View (see 13.3.1). The TM generic monitoring, realized by a Software System Monitor (SSM), comprises periodic tests or measurements of network-related data, network devices and legacy services (DB, OS-level services) that are not directly monitored by the TANGO control system [RD10]. Usually, some of the monitoring data taken from the generic monitoring (such as CPU usage, memory, processes, threads, uptime and so on) are reported to the TANGO control system for historical archiving and correlation. This is the main reason for the development of the TM Monitor device (see 10.1 and 11.3). Users read information from both systems, so that if one of them is not working they can always perform one or more recovery actions according to the information taken from the other system. This concept is the monitoring context and it is expressed in the following figure.
Figure 2: Monitoring context.
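A minimal sketch of this monitoring context can make the idea concrete. Both readers below are illustrative stand-ins (not real SSM or TANGO APIs): the operator-facing code tries each independent system in turn, so a failure of one does not prevent reading the other.

```python
def read_monitoring_point(name, readers):
    """Try each independent monitoring system in turn; the first one
    that answers wins, so no single system is a point of failure."""
    errors = []
    for system, read in readers:
        try:
            return system, read(name)
        except Exception as exc:
            errors.append((system, exc))
    raise RuntimeError(f"all monitoring systems failed: {errors}")

def ssm_read(name):
    raise ConnectionError("SSM down")    # simulate the SSM being unavailable

def tango_read(name):
    return {"cpu_usage": 12.5}[name]     # stand-in for a TANGO attribute read

system, value = read_monitoring_point(
    "cpu_usage", [("SSM", ssm_read), ("TANGO", tango_read)])
print(system, value)  # TANGO 12.5
```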
7 Mapping between views
As a general rule, the mapping between views is made by the name of the elements. The following table shows a list of the main elements of the system and an indication of where to find them within the views.
Table 3: Mapping between views.
Uses Module View Service C&C View Allocation View
LS Engine (Figure 6) LS Engine Logging Node
Logging Service (Figure 5) - Logging Service
LS Data Repository (Figure 6) LS Data Repository Logging Node
LS Forwarder (Figure 6) LS Forwarder -
Software System Monitor (SSM) (Figure 5)
- Software System Monitor Node
SSM Core (Figure 6) SSM Core Software System Monitor Node
MonData Repository (Figure 6) MonData Repository Software System Monitor Node
FM Repository (Figure 6) FM Repository Software System Monitor Node
Fault Engine (Figure 6) Fault Engine Software System Monitor Node
Notification System (Figure 6) Notification System Software System Monitor Node
SSM Agent (Figure 6) SSM Agent
Lifecycle Manager (Figure 5) - Lifecycle Manager Node
Lifecycle Manager Core (Figure 6) LM Core Lifecycle Manager Node
Lifecycle Manager Data Repository (Figure 6)
LM Data Repository Lifecycle Manager Node
Lifecycle Manager Service (Figure 6) LM Service
Virtualization - Virtualization
Virtualization Orchestrator Virtualization Orchestrator
-
Service GUI Service GUI Service GUI
- Config DB -
8 Rationale
Architecturally Significant Requirements
From the requirement analysis of the TM and of the SER-derived requirements (see [AD1]), the requirements that can be considered architecturally significant are:
Table 4: ASRs
SER_REQ_2 Detect and generate internal Alarms
The SER shall generate an Internal Alarm based on information received from the TM sub-systems (including itself) signifying that a condition related to the TM’s functioning has occurred. This requires automatic and/or operator intervention and is based on one of the following states: 1. The system has detected a failure that
requires operator as well as maintainer intervention.
2. The system has detected a condition that reduces the ability of the TM to effectively perform its mission.
3. A safety hazard (based on a hazard analysis regarding the TM operations) has been realised.
TM_REQ_7 TMO_REQ_018
Test
SER_REQ_6 Control TM sub-system life cycle
The SER shall be able to control the life cycle of any TM sub-system by being able to command the TM sub-system to do at least one of the following:
Shut down
Start up
Configure
Update/Upgrade
Downgrade
TM_REQ_192 TM_REQ_195 TM_REQ_197 TM_REQ_198 TMO_REQ_020 TM_REQ_212 TM_REQ_220 TM_REQ_367 TM_REQ_210 TMO_REQ_058
Demonstration
SER_REQ_7 Operator control of life cycles
The SER shall give an operator the ability to send respective life cycle commands to each TM sub-system by an authenticated and authorised user.
TM_REQ_181 Demonstration
SER_REQ_8a Aggregate and Report TM Health Status
The SER shall aggregate the TM internal status and report it to the Operator in a structured health view based on the TM PBS. Note: in case of TMO, the SER shall report to the Operator and to the EMS.
TM_REQ_211 Demonstration
SER_REQ_8b Manage TM State
The SER shall manage the TM state (by sending signals for state transitions), which can assume, among others, the following values: start-up, shutdown, standby and operational. The following are the possible state transitions: 1. from standby to startup 2. from startup to operational 3. from operational to shutdown 4. from shutdown to standby
TM_REQ_201 TM_REQ_202 TM_REQ_342 TM_REQ_385 TM_REQ_386 TM_REQ_387
Demonstration
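As an illustrative sketch (not the actual SER implementation), the transition list of SER_REQ_8b can be encoded as a simple guard table that rejects any signal outside the four allowed transitions:

```python
# Allowed TM state transitions as listed in SER_REQ_8b
TRANSITIONS = {
    "standby": "startup",
    "startup": "operational",
    "operational": "shutdown",
    "shutdown": "standby",
}

def signal_transition(current, requested):
    """Accept a state-transition signal only if SER_REQ_8b allows it."""
    if TRANSITIONS.get(current) != requested:
        raise ValueError(f"illegal transition {current} -> {requested}")
    return requested

state = "standby"
state = signal_transition(state, "startup")
state = signal_transition(state, "operational")
print(state)  # operational
```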
Notes and decisions
The requirements are not intended for the online or offline system but for any generic TM Sub-System, which can be any sub-element, that is, any TM Application.
According to SER_REQ_2, there shall be a generic monitoring (performed, for instance, through Nagios, a Software System Monitor (SSM)).
Every TANGO-based system needs generic-level monitoring (as every system actually does). However, TANGO does not itself provide a generic monitoring system such as, for example, server monitoring of hard disk, CPU and so on. According to the analysis performed by the LMC team (analysis of the Elettra [RD20] control system generic monitoring), it is indeed a best practice to use a software system monitor (external to TANGO) rather than developing it within TANGO.
TANGO does not provide any storage system for logging (although it is possible to have viewing features as an instant-log consumer), which therefore must be developed as a service external to TANGO.
The SKA logging guideline [AD3] suggests a service based on the latest technology as the best choice (in particular for what concerns the ability to use full-text search capability, see Elasticsearch [RD21] and Apache Lucene [RD22]). The analysis performed by the LMC team revealed that this approach has been successfully adopted in other big and very complex projects currently running, such as the LHC (see [RD19]).
SER_REQ_11 defines the generic Logging Service.
SER_REQ_6 and SER_REQ_7 define the Lifecycle Management together with update/upgrade and downgrade; even in this case the architecture will be the same both online and offline. The only variability concerns Astor (that is, the TANGO framework tools).
Every reliable system must avoid common points of failure between the monitoring system and the monitored one. For this reason, the generic monitoring architecture, and more generally the TM Services architecture, cannot depend on any other TM sub-system.
Generic monitoring, the logging service and lifecycle management are needed in all systems at every site (GHQ, ZA and AUS).
Products
The products for the TM SER are defined in [AD7].
Quality Attributes
The main quality that drove the development of the present architecture was maintainability, intended as availability (reliability and recovery), modifiability, testability and, more in general, the ability of a system to cope with changes. Concerning availability, the present architecture enables many tactics [RD1], such as:
Detect faults:
o Ping: asynchronous request/response message to determine that the monitored component is alive and responding correctly;
o Heartbeat: periodic message exchange between the SSM (see 10.1 and 10.1.3.1) and the component (or host);
o Timestamp: every message has a timestamp so the correct order of messages can be rebuilt;
o Monitoring activities: processes that monitor an entity and produce monitoring data (see 10.1, 10.1.3.1 and 10.3.4);
o Timeout: a monitoring activity should complete within a predetermined amount of time;
Recover from faults: active redundancy, software upgrade and reconfiguration;
o Retry: in case of a faulty monitoring activity, the SSM retries to execute it;
o Redundancy (for every SER sub-system): in case of a fault or failure it is possible to switch to a passive (or redundant) node automatically, using an automatic action, or manually, by the intervention of the operator. See 10.4 for further information.
o Reconfiguration: the SER, through the Lifecycle Manager (see 10.1, 10.1.3.1 and 10.3.5), can re-configure a faulty component with a versioned configuration, automatically or with an operator command.
o Software upgrade or downgrade (see 10.1, 10.1.3.1 and 10.3.5);
o Exception handling: once an exception has been detected, the system must handle it. There are several ways to handle an exception; one possibility is to include with the exception an error code that contains information helpful for fault correlation.
Prevent faults: predictive model and transactions (when accessing repositories). See 10.2.2.4.4 for further information.
To prevent faults, one possibility is the use of a Fault Manager (see 10.1 and 10.1.3.1) component (usually included in many generic monitoring systems) to perform trend analysis and failure prediction of TM components, taking as input both generic monitoring data and logging data. This tactic is called ‘Predictive model’ [RD1]: according to the health status detected by the SSM, the predictive model ensures that the system is operating within its normal operating parameters and can potentially take corrective actions. Modifiability is achieved in the following areas of the system: monitoring activities, lifecycle scripts, logging rules and fault rules. In particular, cohesion has been increased and coupling reduced, so that it is easy to add a new version of an application and new monitoring activities. At the same time, testability is achieved by limiting the complexity of the system. In fact, if there is a new test to perform, it is possible to add a monitoring point for it that can represent a state, a measure or a simple message. Once the required monitoring point is available, it is easy to generate an event to intercept the particular problem raised by the test (see 10.3 for more information).
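A few of the tactics above (timestamp, timeout, retry) can be combined in a short sketch. This is illustrative only, not the SSM implementation, and the probe function is an invented example:

```python
import time

def run_monitoring_activity(activity, timeout_s=1.0, retries=2):
    """Run a monitoring activity, discarding samples that exceed the
    deadline (timeout tactic) and re-executing on failure (retry tactic)."""
    last_error = None
    for attempt in range(retries + 1):
        start = time.monotonic()
        try:
            sample = activity()
        except Exception as exc:          # faulty monitoring activity
            last_error = exc
            continue                      # retry tactic
        if time.monotonic() - start > timeout_s:
            last_error = TimeoutError("activity exceeded its deadline")
            continue
        # timestamp tactic: every sample carries a timestamp
        return {"value": sample, "timestamp": time.time()}
    raise RuntimeError(f"activity failed after {retries + 1} attempts") from last_error

# An invented flaky probe that fails once, then succeeds
calls = {"n": 0}
def flaky_cpu_probe():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("agent unreachable")
    return 42.0

print(run_monitoring_activity(flaky_cpu_probe)["value"])  # 42.0
```

A real SSM would interrupt a hung activity rather than check the elapsed time afterwards; the sketch only shows how the tactics compose.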
9 Use Cases
From the requirement analysis, the use cases highlighted in Figure 3 and Figure 4, and described in the following sections, have been identified (for the detailed monitoring scenarios, please consult 13.2.1).
Figure 3: TM SER use cases, monitoring and logging.
Monitoring use cases
Figure 3 shows the main functions of the monitoring system, which are:
Monitoring Network and Resources: SER monitors TM resources (defined as any physical component of limited availability of a computer, such as CPU load, memory usage and so on) and any TM service (defined as an application using a protocol such as TCP, HTTP, FTP and so on).
Monitoring Software Components: SER monitors the proper functioning of TM processes. This includes monitoring of the operational state and of failures thrown by such processes, performed by an SSM agent installed on the local machine. In particular, the SSM agent provides two communication modes: an asynchronous mode (for non-critical communication such as process status) and a synchronous mode (for instance, to communicate an exception to SER).
Reporting and sending TM internal alarms: SER provides an interface to the operator and a series of alarms. The interface shows information about network, resource and service status in different views, providing the possibility to summarize information, create graphs or custom views, drill down into component information and so on. An operator can handle an alarm directly or, if one exists, can use a procedure defined in the SER fault management.
Fault Management use cases
The monitoring of TM applications can be performed at three different levels of depth:
1. generic level (RAM allocation, CPU usage…);
2. generic level + process status (as defined in 13.3.1);
3. correctness of operations (for example, coordinate conversion by a devoted application).
SER is responsible for levels one and two; TMC/OSO applications are responsible for level three. The separation between levels two and three essentially defines the monitoring boundary between SER and TMC/OSO (in fact, the correctness of operations is a duty of each TMC/OSO application). The Fault Management uses the monitoring system (that is, the Software System Monitor and the specific activities that can be done through it) to perform its duty, which is:
1. Detection, which is the ability to determine whether there is a fault in the system;
2. Isolation, which is the ability to locate the fault;
3. Recovery, which is the ability to recover from it.
A monitoring activity together with alarm filtering (usually available in any software system monitor) realizes the detection activity. The same monitoring activity together with log information realizes the isolation, while the recovery is essentially a control operation that TM.SER can perform: for instance an online action, i.e. a lifecycle command (reconfigure, restart, and so on), or an offline activity such as raising a modification request for software maintenance. It is important to consider that lifecycle management is realized with the help of an IT automation tool such as Ansible, Puppet or Chef (see 10.1); this allows not only lifecycle operations on the application but in fact any kind of script that can be executed directly on the server machine (where the TMC/OSO application runs). Therefore, a TMC/OSO developer could create a specific recovery procedure (that is, a specific script stored in the Lifecycle Manager Repository) to be executed by the Fault Management application in case of a specific alarm condition.
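The mapping between alarm conditions and recovery procedures stored in the Lifecycle Manager Repository could be sketched as follows (the rule table and the `run_lifecycle_script` callable are illustrative assumptions, not an actual Ansible/Puppet/Chef API):

```python
# Hypothetical fault-rule table: (component, alarm) -> lifecycle script kept
# in the Lifecycle Manager Repository. Script names are made-up examples.
RECOVERY_RULES = {
    ("oso-app", "process_down"):   "restart_oso.yml",
    ("oso-app", "config_corrupt"): "reconfigure_oso.yml",
}

def handle_alarm(component, alarm, run_lifecycle_script):
    """Recovery step only (detection has already happened). Returns the
    script name that was run, or None when no automatic recovery exists,
    in which case the fault must be escalated offline (e.g. by raising a
    modification request for software maintenance)."""
    script = RECOVERY_RULES.get((component, alarm))
    if script is None:
        return None                     # no rule: escalate to maintenance
    run_lifecycle_script(script)        # executed via the Lifecycle Manager
    return script
```

In practice `run_lifecycle_script` would be whatever entry point the chosen IT automation tool exposes; the point of the sketch is the separation between the rule table (data) and the engine that evaluates it.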
Life-cycle Management use cases
Figure 4: TM SER use cases, lifecycle control.
Lifecycle management is the ability to manage a software application in the following phases of its lifetime:
Configuration
Start
Stop/Kill
Update, Upgrade or Downgrade (version control)
Having a lifecycle management software (LMS) leads to a uniform way of managing all TM applications, so that TM.SER can be the main entry point for TM. The use cases in Figure 4 summarize this concept.
There are two kinds of consumers for the LMS: the administrator and the user. The user is the one who works with TM for telescope operations, while the administrator (a special user with privileges) has the responsibility for the lifecycle from the start phase (which includes the
configuration) to entity management (CRUD operations in the configuration database), with specific attention to version control of the applications. Version control is an important aspect of TM maintenance and is related to the software maintenance process (see [RD6]). This means that when an administrator adds a new application version, he should relate that version to the specific modification requests identified and resolved during maintenance. Every TMC/OSO application or component has to specify its own lifecycle, highlighting the specific interface that realizes the use cases of the above figure. To do that, it is convenient to subdivide them
into typologies. Based on [RD1], the following application typologies compose TM:
OS service, a process that starts and stops with the operating system;
Web server, an information technology that processes requests via HTTP, the basic network protocol used to distribute information on the World Wide Web;
Web application, software running in a web server;
Desktop application, software running on the client computer;
Server application, software running on a server that usually does not have the same lifetime as the operating system;
DB server, a database management system (which can be an RDBMS or a NoSQL DB technology).
In addition, consider that many TM applications work together with other services: for instance, a TANGO device needs the TANGO database service, the corresponding RDBMS and perhaps other TANGO devices to work properly. For this reason, a list of dependencies is expected for each application. The LMS also has the responsibility to ease the job of the user in all phases of the application lifetime.
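The dependency lists mentioned above imply that the LMS must start applications in dependency order. A minimal sketch (the application names and data layout are hypothetical examples, not the actual configuration schema):

```python
# Each TM application declares its typology and its dependencies, e.g. a
# TANGO device that needs the TANGO database service, which in turn needs
# the RDBMS. A depth-first topological sort yields a valid start order.
APPLICATIONS = {
    "rdbms":        {"type": "db_server",          "depends_on": []},
    "tango_db":     {"type": "os_service",         "depends_on": ["rdbms"]},
    "tango_device": {"type": "server_application", "depends_on": ["tango_db"]},
}

def start_order(apps):
    """Return application names ordered so every dependency starts first;
    raises on circular dependencies."""
    order, visiting, done = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dep in apps[name]["depends_on"]:
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    for name in apps:
        visit(name)
    return order
```

Stopping would simply use the reverse of this order.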
Entity management use case
The ‘CRUD’ acronym denotes the four standard functions of persistent storage: create, read, update and delete. CRUD is also relevant at the user-interface level of most applications. As a bare minimum, the software must allow the user to:
Create or add new entries;
Read, retrieve, search or view existing entries;
Update or edit existing entries;
Delete or deactivate existing entries.
Without at least these four operations, the software is not complete. As these operations are so fundamental, it is important to document them under one comprehensive heading: entity management. In the TM.SER context, these operations are necessary to build the abstract data model described in 10.3.
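A bare-bones sketch of the four entity-management operations (an in-memory stand-in for the relational configuration database; the class and method names are illustrative only):

```python
import itertools

class EntityStore:
    """Minimal CRUD interface; a real implementation would sit on the
    relational Config DB, this dict only illustrates the four operations."""

    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def create(self, entity):                 # Create / add a new entry
        eid = next(self._ids)
        self._rows[eid] = dict(entity)
        return eid

    def read(self, eid):                      # Read / retrieve an entry
        return self._rows.get(eid)

    def update(self, eid, **changes):         # Update / edit an entry
        self._rows[eid].update(changes)

    def delete(self, eid):                    # Delete / deactivate an entry
        self._rows.pop(eid, None)
```

In the TM.SER context the stored entities would be the configuration items of the abstract data model described in 10.3.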
Logging use cases
Logging is an important component of the development cycle and offers several advantages. It provides precise context about a run of the application; once inserted into the code, the generation of logging output requires no human intervention; moreover, log output can be saved to a persistent medium to be studied later. Specifically, the logging use cases are the following (see Figure 3):
Archive logs: a software application or component may need to archive one or more log messages;
Search for log information (by words, by datetime, extract log files): maintainers, administrators or developers can search for one or more log messages by querying the system.
For more information about logging for SKA, please consult [AD3].
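The ‘search for log information’ use case could be sketched as building an Elasticsearch-style query, assuming an ELK-like repository as discussed in 10.1 (the field names `message` and `@timestamp` are common Logstash defaults, used here as assumptions rather than a fixed SKA schema):

```python
def build_log_query(words, since, until):
    """Build an Elasticsearch-style bool query matching `words` in the log
    message within the [since, until] datetime range (ISO 8601 strings).
    Submitting the query to the repository is left out of this sketch."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"message": " ".join(words)}},
                    {"range": {"@timestamp": {"gte": since, "lte": until}}},
                ]
            }
        }
    }

# Example: all messages mentioning 'timeout' during June 2018
q = build_log_query(["timeout"],
                    "2018-06-01T00:00:00Z", "2018-06-29T23:59:59Z")
```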
10 Views
Uses Module View
This is a module view as per [RD1]: a decomposition of the system into units of implementation, showing the distinction between off-the-shelf software and built software together with the functional responsibility assigned to each of them. It shows the dependencies among the modules and the modifiability related to changing their responsibilities.
Figure 5 shows the four main off-the-shelf modules: SSM, LM, LS and the TANGO framework. The choice of off-the-shelf software shapes the work to be done (for instance, choosing Nagios Core or Zabbix will influence the Monitoring Activities to write, such as the language or API used). Fault rules use both the SSM and the LM to perform automatic actions (if required), while the Lifecycle scripts are executed by the engine provided by the LM. Every module stores logging data with the help of the logging forwarding rules through the LS engine. The TM Monitor is the link between the TMC domain and the SER domain; it uses both the SSM (to read monitoring data) and the TANGO framework (to send monitoring data). The Service GUI allows the user to work with all the SER software, while the Virtualization Service allows the building of virtual platforms where all the software will run. Figure 6 shows the usual decomposition of the four main off-the-shelf modules, taken from some tested architectures: Nagios Core, Chef, ELK and OpenStack.
Primary Presentation
Figure 5: Uses module diagram.
Figure 6: SSM, LM, LS and Virtualization decomposition; notation is SysML.
Element Catalog
Elements
Element Description
SSM A software system monitor (SSM), for example Nagios Core, SolarWinds and so on, is a software component used to monitor resources and performance in a computer system. It is usually composed of a server and one or more agents distributed across the computer network that allow the execution of monitoring activities.
Off-the-Shelf
Figure 5, Figure 6
SSM Server It is composed of the SSM Core and the Mon Data Repository
Off-the-Shelf
Figure 6
SSM Core Collects data from every SSM Agent in the network for hardware, software and network monitoring; stores the collected data into the repository; schedules the monitoring activities
Off-the-Shelf
Figure 6
SSM Notification System
Software module that provides the functionality of delivering a message to one or more destinations
Off-the-Shelf
Figure 6
Mon Data Repository
Maintains the collected monitoring data in the repository
Figure 6
SSM Agent
Software daemon that manages the activities gathering monitoring data (aka system metrics) from the TM Application to monitor/control, groups them and sends them back to the server. The communication can be started from the server or directly from the application through the SSM Agent. The agent has a set of scripts (called monitoring activities) to perform the above operations. The advantage of having an agent is that, instead of calling every monitoring activity, the server calls only the agent, which does the job.
Off-the-Shelf
Figure 6
Fault Manager
Part of the SSM devoted to detecting, diagnosing and fixing faults, finally returning the system to normal operations.
Off-the-Shelf
Figure 6
Fault Engine Evaluates the rules and performs actions according to the mapping between rules and actions stored in the FM repository
Off-the-Shelf
Figure 6
FM repository
It contains failure definitions, rule definitions, fault definitions and the mapping with actions, if required
Off-the-Shelf
Figure 6
Monitoring Activities
A monitoring activity (for example, a Nagios Core check) is a software module that produces monitoring data. Each SSM Agent has more than one activity, so that a list of monitoring points is built for every application. See 10.3 for further details.
Built Figure 5
Lifecycle Manager
The Lifecycle Manager (LM) is an IT automation tool (for example, Chef, Puppet, Ansible and so on) that makes it possible to control a software application in every phase of its lifetime. It is usually composed of a server and many agents distributed across the computer network that allow the execution of lifecycle scripts. See 10.1.3.1 for further details.
Off-the-Shelf
Figure 5, Figure 6
Lifecycle Manager Engine
Server side of the Lifecycle Manager Off-the-Shelf
Figure 6
Lifecycle Manager Service
An agent that applies the configuration Off-the-Shelf
Figure 6
Lifecycle Manager Core
Part of the engine: an OS service that allows external components to interact with the Engine and supports requests for virtualization
Off-the-Shelf
Figure 6
Lifecycle Manager Data Repository
Part of the engine: repository of configuration items (for example, in Chef they are Ruby scripts)
Off-the-Shelf
Figure 6
Lifecycle scripts
Executable software scripts that allow starting, stopping and upgrading or downgrading a TM application. See 10.3 for further details.
Built Figure 5
Logging Service (LS)
The TM Logging Service is usually composed of three software entities: the forwarder, the repository (data center) and the query GUI (for example, the ELK1 stack). The repository (data center) is a cluster of databases (usually NoSQL), and potentially every TANGO Facility (element) may have a specific database cluster to collect log messages and increase query performance (the first choice for the data center is Elasticsearch). The log forwarder is an OS service that makes it possible to forward messages to a repository. The query GUI is designed for analytics/business-intelligence needs: to quickly investigate, analyse, visualize and ask ad-hoc questions on large amounts of data (millions or billions of records).
Off-the-Shelf
Figure 5, Figure 6
LS Server Composed of a data repository and an engine
Figure 6
LS Data Repository
NoSQL database (for example, Elasticsearch) to organize the information
Off-the-Shelf
Figure 6
LS Engine Entity responsible for collecting and organizing the log messages from every application
Off-the-Shelf
Figure 6
LS Forwarder Entity responsible for transferring and transforming the information to the logging server based on logging forwarding rules
Off-the-Shelf
Figure 6
Logging forwarding rules
A forwarding rule is basically a configuration item; it allows forwarding a message to a particular repository, usually with a transformation into a structured form (for example, in ELK, a JSON document). Log data are usually seen as log files and, even if they are initially just files, many modern logging systems (for example, ELK) transform those data into structured records. See [RD21].
Built Figure 5
TM Monitor The TM Monitor device is a TANGO Device that integrates into the TANGO facility the monitoring data from the SSM
Built Figure 5
1 https://www.elastic.co/webinars/introduction-elk-stack
(or from the local server, if required). It is a bridge between generic monitoring and telescope monitoring, so that the TMC can correlate some generic monitoring data, if required.
Service GUI The Service GUI is a graphical user interface that interacts with every SER component to access the available functionality. In particular it allows:
● checking the monitoring points (that is, the TM health status), alarms and any kind of event;
● querying the logging repository;
● performing control actions, that is, using the lifecycle manager.
Built Figure 5
Fault Rules A fault rule is a relation between a fault and a specific action that can be performed by the lifecycle manager. The rule takes information from the SSM and from the Logging Service and sends an action command through the Lifecycle Manager to recover from a fault (for example, if an alarm is raised, a mail has to be sent to a particular Operator)
Built Figure 5
Virtualization Service
Software made for creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices and computer network resources. It also includes the optimization and proper usage of the off-the-shelf vProviders available.
Built Figure 5, Figure 6
vProvider Generalization of the providers (vStorage, vNetwork, vMonitoring and vConf) managed by the Orchestrator
Off-the-Shelf
Figure 5, Figure 6
Virtualization Generic term indicating a virtualization service (for example, OpenStack, VMware and so on)
Off-the-Shelf
Figure 6
Orchestrator It provides a template-based way to describe a cloud application, then coordinates the OpenStack API calls needed to run it
Off-the-Shelf
Figure 6
vStorage It provides persistent storage to the managed virtual machines
Off-the-Shelf
Figure 6
vNetwork It provides an API that allows users to set up and define network connectivity and addressing in the cloud (that is, Openstack networking [RD18])
Off-the-Shelf
Figure 6
vMonitoring It provides monitoring functionality for the virtualization layer, reporting alarms, alerts and all information needed by the upper-level monitoring (SSM)
Off-the-Shelf
Figure 6
vConf It stores all the configuration information Off-the-Shelf
Figure 6
Relations
Part A Part B Description
Fault Rules SSM A fault rule takes information from the SSM to evaluate it
Fault Rules LM A fault rule takes information from the LM to evaluate it
Fault Rules Logging Service
A fault rule takes information from the Logging Service to evaluate it
Fault Rules Logging forwarding rules
A fault rule sends log information to the Logging Service through a Logging forwarding rule.
SSM Logging forwarding rules
The SSM sends log information to the Logging Service through a Logging forwarding rule.
SSM Monitoring Activities
The SSM performs its activity through a set of monitoring activities (see 10.3).
LM Logging forwarding rules
The LM sends log information to the Logging Service through a Logging forwarding rule.
LM Lifecycle Scripts
The LM performs its activity through a set of lifecycle scripts (see 10.3).
Lifecycle Scripts Logging forwarding rules
A Lifecycle script sends log information to the Logging Service through a Logging forwarding rule.
Monitoring Activities Logging forwarding rules
A Monitoring Activity sends log information to the Logging Service through a Logging forwarding rule.
Service GUI TANGO framework
The Service GUI interacts with the TANGO framework to send lifecycle action to the TM devices (if needed and commanded by the Operator)
Service GUI SSM The Service GUI interacts with the SSM to gather monitoring information
Service GUI LM The Service GUI interacts with the LM to send lifecycle
action to the TM application (if needed and commanded by the Operator).
Service GUI Logging Service
The Service GUI interacts with the LS to read log information from it (if needed and commanded by the Operator).
TM Monitor TANGO framework
The TM Monitor is a TANGO device (see [RD10])
TM Monitor SSM The TM Monitor reports monitoring information from the SSM to the TANGO facility
LM Virtualization Service
To perform lifecycle operations
Virtualization Service vProvider To realize the virtualization
Virtualization Service Logging forwarding rules
Virtualization Service sends log information to the Logging service through a Logging forwarding rule
vProvider Logging forwarding rules
vProvider sends log information to the Logging service through a Logging forwarding rule
Behaviour
For the dynamic behaviour of the entities depicted in the present view, please refer to 10.1.3.1.
Rationale
● Cost saving is the main reason to use the Off-the-Shelf software
● Monitoring activities, fault rules, lifecycle scripts and logging forwarding rules depend on the choice of the off-the-shelf software.

● Monitoring activities, fault rules, lifecycle scripts and logging forwarding rules are separated from the execution engine (SSM versus monitoring activities, SSM versus fault rules, Lifecycle Manager versus lifecycle scripts, Logging Service versus logging forwarding rules) to increase the modifiability of the system; the choice of the off-the-shelf software will not influence the work to be done (for example, agent-based versus agentless) but rather the details of how it will be done.
● The rationale for having the TM Monitor component is to report every monitoring point to the TMC to give the Operator a clear picture of the functional and non-functional monitoring points of the system.
Interfaces
Most of the interfaces highlighted in the primary presentation depend on the technology chosen, which
already provides an interface to interact with it. Therefore, if not stated in the following table, the interface is ‘ready to use’.
Interface Document link
SSM-Monitoring Activities SSM - Monitoring Activity Interface, 11.2
Service GUI-TANGO framework The Service GUI, when it needs to talk with the TANGO framework, will use the Starter (see [RD6])
TM Monitor-SSM TM Monitor Interface, 11.3
Virtualization interface Virtualization Interface , 11.4
Services C&C View
This view highlights the runtime components of the system and their relations.
Primary Presentation
The primary presentation shows the TM Services runtime components and their relations. The diagram is divided into three parts: Client, Server and Virtualization. The Client part represents the TM Services runtime components installed on a generic TM host, which is composed of a local data store and the various agents. The first (for instance the file system) is where any TM application can store its local data, such as log files or configuration files, while the agents are: the Logging Service (LS) Forwarder, the Software System Monitor (SSM) Agent and the Lifecycle Manager (LM) Service. Each of them sends data to and receives data from the respective server. Note that the same agents are also installed in the virtualization representation. The Server part shows the server components that are in relation with the various agents: the LS Forwarder sends data to the LS Engine, the SSM Agent to the SSM Core and the LM Service to the LM Core. The SSM Core has the responsibility to gather the monitoring data from the hosts, via the agents; these data are then used by the Fault Engine to evaluate fault rules. The visualization of data and the configuration of the server components is done by the Service Graphical User Interface (GUI), which allows the operator to access the information, based on the authorizations provided by the AAA.
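The transformation an LS Forwarder might apply before shipping a raw log line to the LS Engine can be sketched as follows (Logstash-style; the line format and field names are illustrative assumptions, not a fixed SKA schema):

```python
import json
import re

# Assumed plain-text line format: "<timestamp> <level> <source> <message>".
LINE_RE = re.compile(
    r"(?P<ts>\S+)\s+(?P<level>DEBUG|INFO|WARNING|ERROR)\s+"
    r"(?P<source>\S+)\s+(?P<message>.*)"
)

def to_document(raw_line, host):
    """Parse one plain-text log line into a JSON document ready for the
    repository; unparseable lines are kept whole so no information is lost."""
    doc = {"host": host}
    m = LINE_RE.match(raw_line)
    if m:
        doc.update(m.groupdict())
    else:
        doc["message"] = raw_line
    return json.dumps(doc)

doc = to_document("2018-06-29T10:00:00Z ERROR tm.oso Timeout on request", "vm-12")
```

A real forwarder (e.g. Logstash or rsyslog) applies such a transformation according to the configured logging forwarding rules; this sketch only shows the parse-and-structure step.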
Figure 7: Service C&C View
Element Catalog
Elements
Element Description
SSM Core Collects data from every SSM Agent in the network for hardware, software and network monitoring; receives asynchronous events from the Monitoring Agent; stores the collected data into the repository; provides the data to be visualized by the operator through the Service UI
SSM Agent Software daemon that manages the activities gathering monitoring data (aka system metrics) from the TM Application to monitor/control, groups them and sends them back to the server. The communication can be started from the server (C/S style) or directly from the application through the SSM Agent (pub/sub style). The agent has a set of scripts (called monitoring activities) to perform the above operations. The advantage of having an agent is that, instead of calling every monitoring activity, the server calls only the agent, which does the job.
MonData Repository Repository where the monitoring data are stored.
FM Repository The repository of the alarm/fault rules and actions
Fault Engine The rule engine that defines the mapping between monitoring data and alarms. It also provides an engine to perform fault prediction using monitoring and log data.
Notification System Software application that sends notifications to the Service GUI or to the operator via email or SMS, according to the fault rules
Service GUI Generic UI that allows an Operator to visualize all monitoring data, alarms, logs and so on. It also permits actions like configuring fault rules and taking lifecycle actions with the configuration already done
TM Application to Monitor/Control
A generic TM application, TANGO or non-TANGO based
LM Core Server side of the Lifecycle Manager. Allows sending a specific control action to a TM Application.
LM Data Repository Part of the engine: repository of configuration items (for example in chef they are Ruby scripts)
LM Service An agent that applies the configuration
LS Engine Entity responsible for collecting and organizing the log messages from every application; it furthermore translates queries to retrieve the data from the LS data repository.
LS data repository NoSql database (that is, Elasticsearch) to organize the information
LS Forwarder Entity responsible for the transfer and the transformation of the information to the logging server based on logging forwarding rules (see 10.1)
Local Data Store It can be based on simple files (or local syslog udp server based on logstash or rsyslog for instance): more information can be found on the SKA Logging Guideline (see [AD3]).
Config DB A relational database system that stores all the information that cannot be stored in other sub-systems. It is important to notice that this component can be a schema in another database sub-system (for instance the one in the online or offline system), so that it is possible to save licence costs (if any). The information to store is highlighted in 10.3.
AAA AAA exposes an API to provide authentication and authorization to the operator
Virtualization Orchestrator
The Virtualization Orchestrator component manages templates and instance resource allocation; it is the entry point for the Virtualization Service. See 13.3.2.
Relations
Part A Part B Description Multiplicity
SSM Agent TM application to monitor/control
The SSM agent gathers the monitoring data of the TM applications as per the pre-configured monitoring activities.
Multiple
LM Service TM application to monitor/control
Once the LM Service has retrieved the host configuration from the repository, it applies the configuration to the host where the application has to be deployed.
Multiple
TM application to monitor/control
LS Forwarder The TM application sends logging data to the LS Forwarder through this connection
Multiple
LS Forwarder LS Engine The LS forwarder sends logging data to the LS Engine through this connection, according to the configuration
Multiple
LS Engine LS Data Repository LS Engine stores and retrieves the logging data in the LS Data Repository
Single
Fault Engine FM Repository This corresponds to the fault management executor that receives a script from the Fault Management repository and executes it
Single
Fault Engine MonData Repository
This corresponds to the fault management client that retrieves the
Single
monitoring data from the software system monitor needed to perform the fault management analysis.
Fault Engine LS Engine This corresponds to the fault engine that retrieves the log data from the logging service data repository needed to perform the trend analysis.
Single
Fault Engine Notification System According to the rules, the Fault Engine sends a message to the Notification System, which notifies the operator (email or SMS) in case of failure or fault
Single
Fault Engine LM Core According to the rules, the Fault Engine can send a command to the LM Core, to perform a Lifecycle action in case of failure or fault
Single
SSM Core SSM Agent The SSM Core requests and downloads the monitoring data collected by the SSM Agent
Multiple
SSM Agent SSM Core SSM Agent sends an asynchronous message to the SSM Core
Multiple
LM Service LM Core The connection allows any client to retrieve the corresponding configuration item from the LM Core. If necessary, the LM Core can directly talk to the LM Service.
Multiple
LM Core LM Data Repository The LM Core retrieves configuration from the LM Data Repository
Single
Notification System
Service GUI The notification system sends an alert to the operator through the GUI, in case of fault or failure
Single
Service GUI SSM Core The Service GUI allows an Operator to visualize all monitoring data, notifications and/or specific groupings of data to check the status of TM at a generic level.
Single
Service GUI LM Core Specialized UI application for lifecycle actions with the configuration already done. The Service GUI interacts with the engine to manage
Single
(create/update/delete) the configuration items for the TM subsystems.
Service GUI LS Engine The Service GUI allows users to create queries to retrieve the logging data from the LS Engine
Single
Service GUI AAA To access the functions of the Lifecycle Manager (which allows configuring all the applications, including the service ones), the Operator must be authenticated and authorized. Through this relation the Service GUI allows the User to obtain the authentication token together with his groups (for authorization). Please consult [RD5] for further information.
Single
Service GUI Config DB The Service GUI retrieves its configuration (for instance subsystems information) from the local configuration DB.
Single
Service GUI Virtualization The Service GUI enables the Operator to create/modify/delete (and so on) a virtual host in the system, if required.
Single
Virtualization Orchestrator
SSM Core The Virtualization environment can access the monitoring data from the SSM Core
Multiple
SSM Core Virtualization Environment
The Software system monitor receives monitoring data from the Virtualization Environment.
Multiple
Virtualization Orchestrator
LS Engine The Virtualization Orchestrator sends logging data to the LS Engine through this connection, according to the logging forwarding rules
Multiple
Virtualization Orchestrator
LM Core The Virtualization Orchestrator retrieves its configuration from the LM Core
Multiple
Interfaces
Most of the interfaces highlighted in the primary presentation depend on the technology chosen, which already provides an interface to interact with it. Therefore, if not stated in the following table, it means
that it is a ‘ready to use’ interface.
Interface Document link
SSM-Monitoring Activities SSM - Monitoring Activity Interface, 11.2
Service GUI-TANGO framework The Service GUI, when needed to talk with the TANGO framework will use the Starter (see [RD6])
TM Monitor-SSM TM Monitor Interface, 11.3
Virtualization interface Virtualization Interface, 11.4
Behaviour
Interaction with AAA
Service GUI interacts with AAA to retrieve user information and user role. The interaction is described in the following pseudo-code.

AAA.Authenticate(Username, Password);
IF (Authentication.result == Successful)
{
    GUI.Store(Authentication.id, Authentication.name,
              Authentication.surname, Authentication.email);
    AAA.RequestListGroup(Authentication.id);
    GUI.Store(ListGroupResult[]);
}
ELSE
{
    THROW(AuthenticationFailedException);
}
Lifecycle Manager
10.2.2.4.2.1 Lifecycle Generic Execution Activity
Figure 8: Lifecycle Generic Execution.
Figure 8 shows the generic execution activity that the Lifecycle Manager implements for the specific needs of TM. The activity starts when the Lifecycle Manager Engine receives a lifecycle command (start, stop, restart and so on) from the user through the Service GUI. Inside the repository of the Lifecycle Manager Engine there is a list of (versioned) applications connected to a list of hosts. Based on this relationship, when the user selects an application to command, the repository knows where to apply the correct configuration, and it informs the service local to the virtual server. The lifecycle service has an internal clock and on every tick it applies the right configuration for its host; it is also possible to force the application of the configuration. The ‘Apply configuration item’ step does exactly this job, interacting with the TM application (in the figure, an OSO application) and asking support from the Virtualization Service (if needed). It is very important to highlight that the use of the timer to start the function in the Lifecycle Manager Service is possible because those functions do not change the system status if they are executed more than once (idempotency). After the configuration is over, the Lifecycle Manager Service checks that the lifecycle command has executed correctly and reports the result to the Operator through the Service GUI.
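The idempotency argument can be made concrete with a sketch of the ‘Apply configuration item’ step: each action first checks the current state, so running it on every clock tick leaves an already converged host unchanged (the host API below is hypothetical, not an actual Chef/Ansible interface):

```python
def apply_configuration(host, desired):
    """Converge `host` toward the `desired` {service: version} mapping.
    Safe to call on every clock tick: actions are taken only on drift,
    so repeated execution does not change the system state."""
    changed = []
    for service, version in desired.items():
        if host.installed_version(service) != version:
            host.install(service, version)    # act only when versions differ
            changed.append(service)
        if not host.is_running(service):
            host.start(service)               # act only when not running
            changed.append(service)
    return changed                            # empty list: nothing to do
```

On the first tick after a new configuration is pushed the function installs and starts the drifted services; on every subsequent tick it returns an empty list, which is exactly the property that makes the timer-driven design safe.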
Monitoring
Figure 9: Perform monitoring activity.
Figure 9 shows the SysML diagram for a generic execution of the monitoring activity. The activity starts when the list of all current Monitoring Points (MPs) is requested from the SSM Server (the list is updated during the 'Configure monitoring activity'). This list is stored in a local buffer and provided to the Software System Monitor Scheduler (SSM Scheduler) at predefined time periods (t_1 and t_2). The SSM Scheduler performs two activities: it tests the communication with the node and retrieves its monitoring data. During the communication test, the SSM Scheduler sends a keep-alive packet to the node (i.e. a Virtual Machine); if a response arrives within the timeout, the test is successful, otherwise it is an error. All the monitoring activities (hardware, software or network related) are executed locally every t_3 seconds; the monitoring data are initially stored in a local buffer and, as soon as possible, communicated to the SSM Server through the SSM Agent. The SSM Scheduler, with period t_2, downloads the aggregated data collected by the monitoring activities using the SSM Agent and empties the local buffer. It also uses standard network protocols like SNMP (Simple Network Management Protocol) to test the connectivity and the services of the TM node, as defined in the MP configuration. A TM application can also send an asynchronous message to the SSM Server. The aggregated data coming from the communication test and from the SSM Agent are matched against the rules defined in the configuration and, in case of a mismatch, an alarm is raised.
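The keep-alive test can be sketched as follows. This is a hedged illustration: `send_probe` is a hypothetical callable standing in for the real keep-alive packet exchange.

```python
import time

# Illustrative keep-alive check: "send_probe" is a hypothetical callable
# that blocks until the node answers (or raises OSError on a network error).
def keep_alive_ok(send_probe, timeout_s):
    """Return True only if the node replies within timeout_s seconds."""
    start = time.monotonic()
    try:
        send_probe()
    except OSError:
        return False  # a network error counts as a failed test
    elapsed = time.monotonic() - start
    return elapsed < timeout_s  # success only if the reply beat the timeout
```

A real scheduler would run this per node at period t_1 and feed the failures into the alarm-matching step.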
Fault Manager
Figure 10: Perform Fault Management.
Figure 10 shows the SysML diagram for a generic execution of the fault management activity. The Fault Manager (FM) analyses monitoring data coming from the SSM for the detection phase, and uses the log system to support the analysis. It also depends on the Lifecycle Manager to execute commands in the isolation and recovery phases. When the FM Engine starts, every t_1 seconds it retrieves the monitoring data from the SSM and stores them in a local buffer. To analyse the data, the FM Engine uses the engine rules stored in the FM Repository. These rules are defined by the operator or by developers, usually from a dependability analysis. The FM Engine compares (detection phase) the monitoring data with the engine rules and, in case of a fault or failure, starts the isolation phase. During the isolation phase, the FM Engine can use log data (retrieved from the Logging Server) to support the analysis. After this step, the FM Engine sends the isolation command to the Lifecycle Manager. The recovery phase follows the isolation phase and consists of retrieving the recovery procedure from the FM Repository and sending it to the Lifecycle Manager, to restore the normal behaviour of the component. Another important feature of the Fault Manager is trend analysis and failure prediction: it analyses the monitoring and logging data to discover possible future failures or trends, using for instance machine-learning algorithms or Markov chains. If it detects a possible fault or failure, it sends the data to the FM Engine, which starts the detection phase.
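The detection phase amounts to matching samples against rules. A minimal sketch, assuming rules are stored as predicates keyed by monitoring point (the real FM Repository format may differ):

```python
# Hypothetical sketch of the detection phase: each rule is a
# (monitoring point, predicate) pair; a predicate returns True when the
# sampled value is within its normal range.
def detect_faults(samples, rules):
    """Return the monitoring points whose latest sample violates a rule."""
    faults = []
    for point, predicate in rules.items():
        value = samples.get(point)
        if value is not None and not predicate(value):
            faults.append(point)  # these would be handed to isolation
    return faults
```

The returned list is what the FM Engine would pass on to the isolation phase.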
Logging
Figure 11: Store log.
Figure 11 shows the main function of the Logging Service, which is collecting log data. The most important aspect of the activity diagram is that there is no direct link between the TM application and the logging forwarder, which helps the general performance of the network. In fact, having log data immediately in the central server is not important (logs tell the story of an application; they cannot be used to actively monitor an application for bugs) compared to the importance of network performance. Therefore, the forwarder will be configured to avoid any network flooding and, in particular, to avoid sending large quantities of data when the network is very busy.
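The decoupling and throttling described above can be sketched as a buffered forwarder: applications only append to a local buffer, and shipping happens later in bounded batches that are deferred while the network is busy. Class and parameter names are illustrative assumptions.

```python
# Sketch of a throttled log forwarder (illustrative; not a real TM API).
# Applications never talk to the network directly: they append to a local
# buffer, and flush() ships bounded batches only when the network is free.
class ThrottledForwarder:
    def __init__(self, send_batch, max_batch=100):
        self._send_batch = send_batch   # callable shipping records upstream
        self._max_batch = max_batch     # upper bound per flush, anti-flooding
        self._buffer = []

    def collect(self, record):
        self._buffer.append(record)     # no direct app-to-network link

    def flush(self, network_busy=False):
        """Ship at most max_batch records; defer entirely if the net is busy."""
        if network_busy or not self._buffer:
            return 0
        batch = self._buffer[:self._max_batch]
        self._buffer = self._buffer[self._max_batch:]
        self._send_batch(batch)
        return len(batch)
```

A periodic flush with a busy-network check realizes the "avoid flooding" policy without losing any record.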
Variability Mechanisms
Agent/Agentless solution
In the architecture shown above, an agent-based solution is preferred over an agentless one. Even if an agentless architecture can appear simpler, it introduces some limits: without an agent installed on the client it is possible to monitor network devices (for instance using protocols like SNMP), but it is not easy to monitor other features such as CPU load. On the other hand, the architecture presented requires the extra work of installing and configuring the agent. Another important point is that, without an agent, some operations require a secure connection, which adds complexity to the system, introducing a potential security issue and network overhead. For example, a remote data collector must be allowed to communicate with the target system on different ports and may also need to be installed with domain administrator privileges to access the remote systems. Furthermore, agentless communication introduces additional network traffic, as the raw performance data is transported to a remote data collector. With an agent, instead, data is collected locally and only the processed results are transported to the server.
Lifecycle Manager
● In case of a TANGO based engine, it is worthwhile to configure the TANGO devices with an IT Automation tool (that is, Chef, Ansible and so on) together with the Starter Device Server (see [RD6]), so that it is possible to use Astor (see [RD7]).
● It is possible to have an agentless Lifecycle Manager, such as Ansible. The behaviour (described in 10.2.2.4.2) is the same, except that the communication starts from the server.
Astor
As the TANGO framework is an open source project, it provides many tools that can be used as a starting point for the development of the Service GUI. Among them there is an important tool called Astor (see [RD7]), which could be considered the Lifecycle Manager for a TANGO based control system, with some limitations. The architecture is based on a device called Starter that is able to control any device server in a remote host. The main limitations are related to:
● configuring and starting the Starter itself (the device must be called with a specific name)
● starting a non C++ device server
● upgrading/downgrading and in general, managing versions of devices
According to the documentation, Astor acts as a client of the Starter device deployed in each host and allows the user to:
● display the control system status and component status using coloured icons
● execute actions on components (any command defined within the device)
● execute diagnostics on components
● execute global analysis on a large number of hosts or databases
It is also an example of an integrated UI (see [RD9]), because from Astor it is possible to open other important tools such as the Access Control Panel (see [RD10], chapter Advanced Features), the LogViewer (see [RD11]), Jive (see [RD12]) and so on. According to the above considerations, starting from the Astor tool, there are two main lines of extension:
● include the engine of the Lifecycle Manager, based on (relatively) new technologies such as Puppet, Chef, Ansible and so on;
● include the links to the UIs of the SER sub-systems (the monitoring system, the Logging service and the Lifecycle Manager).
Considering the lifetime of the SKA project (and, as a consequence, of the TM project), it is recommended to start the development of new UIs based on more recent technologies such as REST and the Web. The TANGO community has already started a project for a TANGO REST app (see [RD8]).
Monitoring
● In a generic monitoring system, it is always possible to make the communications either asynchronous or synchronous and, in general, the synchronous one is preferred. In fact, following generic software monitoring best practices, in order to decide whether the trend of a monitoring point indicates a fault or not, it is necessary to gather more than one sample; a single sample in an asynchronous exchange of data may not be enough. Also, synchronous communication reduces the complexity of the monitoring point scripts and permits control of the network traffic and of the load on the server. With asynchronous communication, the whole monitoring activity is performed by the node, which monitors itself and raises an alert in case of fault, opening a communication with the server. In this way, the SSM Server cannot manage how many packets will be sent, so it is not possible to control the network traffic and the server load.
Logging
● In case of a TANGO based engine, it is possible to use the TANGO Logging Service directly, which is based on a specific TANGO Device Server (the Log Consumer Device) to view the log messages. More information can be found in the SKA Logging Guidelines (see [AD3]).
● According to the SKA Logging Guidelines (see [AD3]), the same pattern will be extended to every SKA Application, not only to the TM Applications.
Rationale
Lifecycle manager
● The lifecycle management is the ability to control a software application in the following
phases of its lifetime: Configuration, Start, Stop/Kill, Update, Upgrade or Downgrade.
● To realize lifecycle management, it is convenient to subdivide all the applications that compose TM by type. In particular, it is possible to distinguish the following types:
○ OS Service, a process that starts and stops with the operating system;
○ Web server, software that processes requests via HTTP, the basic network protocol used to distribute information on the World Wide Web;
○ Web application, software running in a web server;
○ Desktop application, software running on the client computer;
○ Server application, software running on a server that usually does not have the same lifetime as the operating system;
○ DB server, a database management system (which can be an RDBMS or a NoSQL DB technology).
● The configuration phase is the ability to set all the preconditions (needed libraries, DB, configuration files and so on) to make a software application ready to start (the configuration phase is the preparation of the start phase).
● All these activities can be done through an IT automation tool (like Puppet, Chef, Ansible and so on) in cooperation with the other sub-elements, since only they know the details of the applications.
● It is not worthwhile to document the interface between the client side (called Service) and the server side (called Engine) of the Lifecycle Manager, since it is Off-the-Shelf.
● To access a lifecycle action, an authentication token provided by AAA is necessary (if the operator has the proper role).
Monitoring System
A monitoring system is composed of a software system monitor and of sub-element specific monitors. Having such a piece of software in every node of the network can easily provide the ability to monitor other aspects of TM, namely network services (applications using protocols like TCP, HTTP, FTP and so on), host resources (processor load, disk usage, system logs) and every monitoring point chosen by the other teams.
A Software System Monitor (SSM) is a client-server application that realizes the ability to monitor a set of applications producing one or more monitoring points in one or more computer networks. There is also the possibility to have specific monitoring activities that produce a list of monitoring points for every application, so that it is possible to define a strategy (technique) for fault management (for instance black box/white box error handling). The most important activity of the System Monitor is the Perform monitoring activity described in 10.2.2.4.3, which is responsible for:
● Hardware monitoring: perform hardware monitoring, list MPs, atomic hardware MPs
● Software monitoring: perform software monitoring, list MPs, atomic software MPs
● Network monitoring: atomic network MPs, keep-alive monitoring
● Generic monitoring: security level, event attributes
● Data archive: monitoring data, monitoring configuration, metering configuration, failure modes.
Failover mechanism
An important consideration concerns a possible failover mechanism. Failover is an automatic action taken to recover from a specific situation and can happen at different levels. The boundaries for the levels are:
1. if the failover is needed at server level, it is a responsibility of the Virtualization;
2. if it can be solved with a lifecycle action, it is a responsibility of the Service;
3. otherwise it is at the level of the TANGO facility (for instance in case of a capability transfer) and it is a responsibility of the M&C Module.
Logging
● The proposed logging architecture is a best practice.
● There is no direct link between the TM application and the logging forwarder, which helps the general performance of the network.
● The forwarder will be configured to avoid any network flooding and, in particular, to avoid sending large quantities of data when the network is very busy.
● The growth of logging data should be controlled, to avoid storing unused information and to keep a sufficient amount of data persistent without the risk of data flooding. There are different possibilities; one of the most used is to keep a fixed size for the data and drop all the messages that exceed that size.
● A logging prototype has been developed based on the ELK stack (see [RD4]).
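The fixed-size retention policy mentioned above admits a one-line realization in Python. This sketch assumes a drop-oldest variant (the oldest records are evicted once the limit is reached); other policies, such as dropping new messages, are equally possible.

```python
from collections import deque

# Illustrative fixed-size log store: appends beyond max_records silently
# evict the oldest entries, bounding storage growth by construction.
def make_log_store(max_records):
    return deque(maxlen=max_records)
```

`deque(maxlen=...)` gives O(1) appends and never exceeds the configured size, which is exactly the bounded-growth property the rationale asks for.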
Service GUI
The Service GUI is the entry point for the TM Services software package and will allow the Operator to access all the functionalities provided from one single UI.
There are important indications from the work being done in the UI Team like:
○ Few steps: when working, the user should not need many steps to reach his various GUIs;
○ High integration: a user may need or want to compose his specific GUI.
As a consequence of the second point, it is crucial to take part in the construction of Feature Request #6 – TANGO web application (see [RD8]):
○ There is already an effort in refactoring the generic web app into an open platform (third parties will be able to implement plugins for the platform to address their needs);
○ The development of the Service GUI will correspond to the construction of some plugins, as per the primary presentation.
Abstract Data Model
Description
The present view is the blueprint for the implementation of the data entities and a domain analysis of the main concepts with which the TM Services work.
Overview
This view has been divided into the following view packets for convenience of presentation:
1. Entities, 10.3.3: it describes the types of entities managed by the TM SER.
2. Monitoring, 10.3.4: it describes the data types managed by the SSM.
3. Lifecycle, 10.3.5: it describes how it is possible to have multiple versions of the same application, and the data types managed by the Lifecycle Manager.
4. Virtualization, 10.3.6: it describes how TM is going to use the virtualization service, in particular the data model associated with it.
Entity decomposition View Packet
The present view packet highlights the entities managed by the TM SER package.
Primary Presentation
To correctly read the following diagram, it is important to start from the Entity block which is central. An entity can be an application, a monitored process, a monitoring activity, the virtualization, a virtual resource managed by the virtualization or a template (see also Virtualization view packet, 10.3.6).
Figure 12: Entity and version.
Element Catalog
Elements
Element Description
Entity It is the main data item to which every TM Service application refers. It can be a monitored process, an application (that is, a composition of monitored processes) or a monitoring activity.
Application It is an aggregation of MonitoredProcess blocks, each selected at a particular version through the LogicalComposition block
MonitoredProcess It is an OS process that needs to be monitored and controlled
MonitoringActivity It is a process (a runtime entity like a script or an OS service) that monitors an entity and produces monitoring data
LogicalComposition It is a composition of one MonitoredProcess and one Version. An application is a composition of those blocks, so that each MonitoredProcess can refer to a particular version.
Version Every entity is related to the system at a particular version, which indicates the particular composition of the software product.
Virtualization Please see the Virtualization view packet, 10.3.6
Template Please see the Virtualization view packet, 10.3.6
vResource Please see the Virtualization view packet, 10.3.6
Properties
Element Property Description
Entity Singleton This property indicates whether an entity can have multiple instances or not
Version Singleton This property indicates whether a version can have multiple instances or not
MonitoredProcess isProduct This property indicates whether the monitored process is a product (part of the PBS) or not
Application isProduct This property indicates whether the application is a product (part of the PBS) or not
vResource IP Address A floating IP address and a private IP address can be used at the same time on a single network interface. The private IP address is likely to be used for accessing the instance from other instances in private networks
vResource Floating IP Address The floating IP address would be used for accessing the instance from public networks
Relations
Part A Part B Type Multiplicity
Entity Element Specialization 1
Entity MonitoredProcess Specialization 1
Entity Application Specialization 1
Entity MonitoringActivity Specialization 1
Entity Virtualization Specialization 1
Entity Template Specialization 1
Entity vResource Specialization 1
Application LogicalComposition Composition 1..*
LogicalComposition MonitoredProcess Composition 1
LogicalComposition Version Composition 1
Entity Version Relationship 1..*
Rationale
● Desktop applications are not in the model because they will not be monitored. Any server application (being part of the TM network) is part of the model.
● The multiplicity of the Version–Configuration relation has to be 0..* because not all the entities will be managed with the Lifecycle Manager. Even if a script in an automation tool is always preferable, because it is replicable, it is also possible to configure an application outside the Lifecycle Manager (because the application can be too complicated, like the virtualization, or because there is a need for a manual configuration).
Monitoring View packet
The present view packet describes the data types managed by the Software System Monitor (SSM).
Primary Presentation
The starting point for reading the following diagram is the block MonitoringActivity. It is an entity like any other in the TM SER and refers to a particular version of another entity called 'entity2monitor'. It is composed of an ActivityType and a Criticality, and is related (produces) to one or more MonitoringPoint blocks. The MonitoringPoint block is composed of a type called 'MonitoringPointType' and produces MonitoringData in a certain mode (Asynchronous or Synchronous). A MonitoringActivity can also generate events, which can be an alarm, a warning, an information or an unknown event.
Figure 13: Monitoring data model.
Element Catalog
Elements
Element Description
Entity It is the main data item to which every TM Service application refers. It can be a monitored process, an application (that is, a composition of monitored processes) or a monitoring activity.
MonitoringActivity It is a process that monitors an entity and produces monitoring data
Version Every entity is related to the system at a particular version, which indicates the particular composition of the software product.
Event Message indicating that something has happened (for instance an alarm that requires a user interaction)
ActivityType There are two main types of monitoring activity: the measure monitoring activity (that is, reading the CPU utilization) and the message monitoring activity (communicating a particular piece of information, a message)
Criticality A monitoring activity has a level of criticality (High, Medium or Low) that allows configuring a priority for the monitoring data it produces.
MonitoringPoint Definition of a specific kind of data that is representative of an aspect of the system and that can be of interest to an operator or to a component
MonitoringPointType A monitoring point data value can be of a particular type, like Percentage, Elapsed time, Status or a simple number
MonitoringData A value for a monitoring point at a particular time stamp
SendMode A monitoring data value can be sent by an asynchronous or a synchronous message
Properties
Element Property Description
Entity Singleton This property indicates whether an entity can have multiple instances or not
Version Singleton This property indicates whether a version can have multiple instances or not
Relations
Part A Part B Type Multiplicity
Entity MonitoringActivity Specialization 1
Entity Version Relationship 1..*
Version Configuration Relationship 0..*
MonitoringActivity Entity Specialization 1
MonitoringActivity Entity Relationship 1
MonitoringActivity Event Relationship 0..*
MonitoringActivity MonitoringPoint Composition 1..*
MonitoringActivity ActivityType Composition 1
MonitoringActivity CriticalLevel Composition 1
Event EventType Relationship 1
MonitoringData MonitoringPoint Relationship 1..*
MonitoringPoint MonitoringPointType Composition 1
MonitoringData SendMode Composition 1
Interfaces
Monitoring Interface (see 11.2) - Interface for a generic monitoring activity.
Behavior
This section provides an example of a monitoring activity that measures the latency between two communicating processes as an indicator of communication problems. To do that, a SER user has to (assuming that the two processes are Entities managed by TM SER):
1. Create a MonitoringActivity (depending on the off-the-shelf solution selected, it can be a script, for instance) to store (in a log, for instance) the timestamp (MonitoringPointType: TIME) and data (only the ID) sent by the sending MonitoredProcess.
2. Create another MonitoringActivity to store the timestamp and data (only the ID) received by the other MonitoredProcess.
3. Create a LogicalComposition (see the Entities view packet, 10.3.3) composed of the two MonitoredProcess entities considered above.
4. Create another MonitoringActivity that:
a. retrieves the two MonitoringData items (corresponding to the same data id) coming from the first two monitoring activities,
b. calculates the difference of the timestamps and
c. raises an event of type alarm, if required.
Note that points 1 and 2 can be omitted if the timestamp and data id are already stored somewhere.
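Step 4 above can be sketched as follows. The record layout (dictionaries mapping data id to timestamp, in seconds) and the threshold parameter are illustrative assumptions for this example.

```python
# Sketch of step 4: match the two MonitoringData sets by data id, take the
# timestamp difference and flag an alarm when it exceeds a threshold.
# The id->timestamp dictionary layout is an assumption for illustration.
def check_latency(sent, received, threshold_s):
    """sent/received: dicts mapping data id -> timestamp (seconds).

    Returns the data ids whose sent-to-received latency exceeds threshold_s.
    """
    alarms = []
    for data_id, t_sent in sent.items():
        t_recv = received.get(data_id)
        if t_recv is None:
            continue  # not yet received; a real activity might time this out
        if t_recv - t_sent > threshold_s:
            alarms.append(data_id)  # would raise an Event of type alarm
    return alarms
```

Each returned id would trigger an Event of type alarm (step 4c) in the real monitoring activity.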
Context Diagram
Figure 14: Monitoring context.
The above figure (for the full picture of the SER C&C, please see the Service C&C View at 10.1.3.1) shows the context for the present view packet. The interaction between client and server can be started both from the SSM Agent and from the SSM Core, depending on the configuration made for the SSM. The SSM Agent executes one or more monitoring activities and sends their results as monitoring data to the SSM Core, which collects and stores them in the MonData Repository. One or more monitoring activities can also be executed from the SSM Core, on the server side.
Rationale
● A MonitoringActivity is a runnable process and an entity of the TM SER. It can generate one or more monitoring points, each of which has a type and generates monitoring data.
● The monitoring activities are both runnable executables (or scripts) and data for the SSM. Therefore, adding a new monitoring activity corresponds to inserting a script into a file folder (this is true for Nagios Core but, in general, it depends on the technology chosen).
● Since a monitoring activity is an entity of the TM SER (like an OSO/TMC application), it can be managed as any other entity with the lifecycle, allowing the possibility to create multiple versions of the same monitoring activity (even running them at the same time).
Lifecycle View Packet
The present view packet highlights the resources managed for a TM Application and, in particular, how an entity will be configured by the Lifecycle Manager component.
Primary Presentation
The starting point for reading the following diagram is the block Configuration. In fact, the key concept for starting an application (and therefore for the lifecycle) is the configuration phase. A Configuration represents everything needed to prepare an application to run, so it is composed of one or more ConfigurationResource blocks, which can be a Library (for instance a Python library), a Directory, a File or a LogConfigurationFile. A File can represent, for example, a specific configuration to apply, so it can have a TemplateFile that can be filled with the 'attributes' property of the configuration (in this model described as a Dictionary, but nothing prevents having a JSON string). A Configuration is also composed of a Script (to be executed locally to prepare the configuration or to start an Entity, see the Entity view packet, 10.3.3) that depends on the specific choice of the engine for the lifecycle (for instance an IT automation tool like Ansible or Chef, see the Service C&C View, 10.1.3.1). It also has a type 'ConfigurationType' and is related to a particular version of an entity.
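Filling a TemplateFile from the configuration's 'attributes' dictionary can be sketched with the standard library. `string.Template` is chosen here purely for illustration; the real engine (Ansible, Chef, ...) would use its own templating mechanism.

```python
from string import Template

# Illustrative sketch of rendering a TemplateFile: the placeholders in the
# template text are substituted with the configuration's 'attributes'
# dictionary. Placeholder syntax ($name) is a string.Template convention.
def render_config(template_text, attributes):
    return Template(template_text).substitute(attributes)
```

For example, a template `"host=$host port=$port"` filled with `{"host": "tm01", "port": 8080}` yields a concrete configuration file body for that host.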
Figure 15: Lifecycle Data Model.
Element Catalog
Elements
Element Description
Entity Please see 10.3.3
Version Every entity is related to the system with a particular version that indicates the particular composition of the software product.
Configuration It contains all the necessary information for the lifecycle management related to the particular version of an entity. For instance it maintains the reference to the host where the application runs or the relations with other objects (for instance an application could depend on another
application like the TANGO-controls framework depends on ZeroMq library)
ConfigurationType Each configuration has an associated type, which defines the particular behavior of the specific configuration
Script Program written for a special run-time environment (that is, Python or Ruby) that automates the execution of tasks that could alternatively be executed one by one by a human operator. Scripts depend on the SKA decision on the technology to adopt.
PreconditionCheck Script used to check the preconditions of the Entity (for instance, if it is a singleton it cannot be started twice). It will always be run before the main configuration script.
Bash Unix shell script
Ruby Ruby script
Specific IT Automation Tool script Most of the IT automation tools available on the market come with a specific scripting tool to help development
ConfigurationResource A generic configuration resource needed by a particular version of an entity
LogConfigurationFile Specialization of ConfigurationResource, that is a special kind of file used for logging configuration
Library Specialization of ConfigurationResource, needed by a specific entity
Directory Specialization of ConfigurationResource, that indicates a directory to be created
File Specialization of ConfigurationResource, that indicates a file to be created
TemplateFile Used to create a file as specific configuration item
Properties
Element Property Description
Entity Singleton This property indicates whether an entity can have multiple instances or not
Version Singleton This property indicates whether a version can have multiple instances or not
MonitoredProcess isProduct This property indicates whether the monitored process is a product (part of the PBS) or not
Application isProduct This property indicates whether the application is a product (part of the PBS) or not
Relations
Part A Part B Type Multiplicity
Entity Version Relationship 1..*
Version Configuration Relationship 0..*
Configuration ConfigurationType Composition 1
Configuration Script Relationship 1
Script Bash Specialization 1
Script Ruby Specialization 1
Script Specific IT Automation Tool Specialization 1
Script ConfigurationResource Composition 1..*
Script PreconditionCheck Specialization 1
ConfigurationResource Library Specialization 1
ConfigurationResource Directory Specialization 1
ConfigurationResource File Specialization 1
LogConfigurationFile File Specialization 1
File TemplateFile Composition 0..1
Interfaces
Lifecycle interface (see 11.1) - Interface for an application (at a specific version).
Context Diagram
Figure 16: Lifecycle context.
Figure 16 (for the full picture of the SER C&C, please see 10.1.3.1) shows the context for the present view packet. The lifecycle service takes the configuration item from the Lifecycle Engine and applies it to perform a lifecycle action on the TM application. The interaction between the lifecycle engine and the lifecycle service can start both from the engine (blue line) and from the service (black line), depending on the tool chosen as IT automation tool.
Rationale
● The present model shows that a configuration is a composition of a script (written in a specific language) and a set of configuration resources. The Lifecycle Manager will maintain these data and will execute the script whenever needed.
● The configurations are both runnable executables and data for the Lifecycle Manager. Therefore, adding a new application version corresponds to inserting a set of scripts into a dedicated repository (this is true for Chef but, in general, it depends on the technology chosen).
● Adding a new version of an application implies developing a new configuration item and uploading it into the Lifecycle repository. Nothing prevents having more than one version of the same application running (enabling modifiability).
Virtualization View Packet
The present view packet presents the data associated with a TM entity for the virtualization configuration. It basically depicts what data a TM application needs in order to use the virtualization service.
Primary Presentation
To read the model correctly, the starting point is the template, which is basically a file describing what a particular version of an entity needs for its lifecycle management. An application may need a certain number of resources (called vResources in the model), such as CPU, storage space and network constraints, which form the SLA (Service Level Agreement). Once the template is defined, it acts as a single unit (also called a stack) with a state (Template State) and is managed by the Virtualization entity through the virtualization interface. The managed resources (vResources) have a state as well (vResource State) that enables the work of the virtualization (for instance, a resource can be transferred to another template only if it is in a particular state). There are mainly three types of vResource: computational, network and storage (see the Element Catalog for more information).
Figure 17: Virtualization Data Model.
Element Catalog
Elements
Element Description
Template A Template (that is, Heat [RD17], AWS CloudFormation [RD23] and so on) is a file that describes a collection of resources (called vResources in the primary presentation) like VMs or containers as a single unit called a stack. The Virtualization Service (called only Virtualization in the primary presentation) manages the resources of the entire set so that it matches a certain SLA (in terms of CPU, memory and so on) and implements basic failover mechanisms (if required and specified in the template itself).
Physical Resource File Descriptor
Document containing all the commands needed to build an image from scratch including testing capabilities and access to volumes and network services (see 10.3.6.2.3 for further detail).
Product Execution File Descriptor
Document defining the configuration for multi-image container operations. It includes testing capabilities as well as describing access to hardware specifics like volumes and network (see 10.3.6.2.3 for further detail).
vResource Descriptor Document defining the runtime specifics and interoperability of various containers as well as their hardware access policies, hardware resource priorities, update mechanism, privileges, and restart strategies (see 10.3.6.2.3 for further detail).
vResource Generic term indicating a resource managed by the virtualization, including CPU, storage, and networking.
Virtualization Generic term indicating the entity that manages the virtualization service. It allows creating a virtual version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
Container An OS level container
Hardware Hardware entity (with a serial number to manage)
vResourceCompute Virtualized computational resource. It can be real Hardware, virtual hardware (called vHardware in the primary presentation), a Container or a VM.
vResourceNetwork Virtual network for a cloud application
vResourceStorage Virtual storage for a cloud application
VM Virtual Machine
vHardware Virtualized computational resources
Entity Please see 10.3.3
Configuration Please see 10.3.5
Properties
Element Property Description
Hardware SerialNumber This property indicates the serial number of the specific Hardware
vResource IP Address A floating IP address and a private IP address can be used at the same time on a single network-interface. The private IP address is likely to be used for accessing the instance by other instances in private networks while the floating IP address would be used for accessing the instance from public networks
Floating IP Address
Template Template State
Enumerative. It can assume one of the following values: INITIALIZED, ACTIVE, PAUSED, SUSPENDED, STOPPED, DELETED, ERROR
vResource vResource State
Enumerative. It can assume one of the following values: INITIALIZED, PAUSED, SUSPENDED, SOFT_DELETED, ERROR, RESCUED, STOPPED
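The two state enumerations above can be sketched in code. The transfer guard below is a hypothetical illustration of how a state could gate a virtualization operation (the actual rule set is not specified in this document):

```python
from enum import Enum

# Sketch of the Template and vResource state enumerations from the tables above.
class TemplateState(Enum):
    INITIALIZED = "INITIALIZED"
    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    STOPPED = "STOPPED"
    DELETED = "DELETED"
    ERROR = "ERROR"

class VResourceState(Enum):
    INITIALIZED = "INITIALIZED"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    SOFT_DELETED = "SOFT_DELETED"
    ERROR = "ERROR"
    RESCUED = "RESCUED"
    STOPPED = "STOPPED"

# Assumed set of states in which a vResource may be moved to another template;
# this set is illustrative, not taken from the document.
TRANSFERABLE = {VResourceState.PAUSED, VResourceState.STOPPED}

def can_transfer(state: VResourceState) -> bool:
    """Return True if a vResource in `state` may move to another template."""
    return state in TRANSFERABLE
```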
Descriptors
Table 5: Physical Resource File Descriptor.
instruction description arguments Syntax
from Base image of image Name from a standard set of low-level pre-made images (that is, debian) or another user-made image
FROM image_name
run Run shell command Valid shell command RUN <command>
label Set metadata for image to be used externally
List of label keys and their values
LABEL <key>=<value> <key>=<value> ...
expose Allow container to listen on specific port at runtime
Network port EXPOSE <port>
env Set environment variable Key name and value ENV <key> <value>
volume Creates a mount point with the specified name and marks it as holding externally mounted volumes from native host or other containers
List of mounting points VOLUME [‘/data’]
user Set the user name (or UID) and optionally the user group (or GID) to use when running the image
UID and optionally GID USER <UID>[:<GID>]
healthcheck Run an instruction and wait for specific output to ensure the image is running correctly
Optional options and command
HEALTHCHECK [OPTIONS] CMD command
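The instructions in Table 5 can be combined as in the following illustrative descriptor; the image, package and port choices are assumptions, not a prescribed TM configuration:

```dockerfile
# Illustrative Physical Resource File Descriptor (assumed values)
FROM debian:stable
LABEL maintainer="tm-ser" version="1.0"
RUN apt-get update && apt-get install -y curl
ENV APP_HOME /opt/tm-app
VOLUME ["/data"]
EXPOSE 8080
USER 1000:1000
HEALTHCHECK --interval=30s CMD curl -f http://localhost:8080/ || exit 1
```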
Table 6: Product Execution File Descriptor.
instruction description example
container_name Specify a custom container name, rather than a generated default name.
container_name: my-container
container_creator Creator of the container container_creator: ‘john doe’
version Version of the config file syntax version: 1
labels Specify labels for the container. labels: description: ‘Example Lbl’
restart_policy Configures if and how to restart containers when they exit.
● condition: One of none, on-failure or any (default: any).
● delay: How long to wait between restart attempts, specified as a duration (default: 0).
● max_attempts: How many times to attempt to restart a container before giving up (default: never give up).
● window: How long to wait before deciding if a restart has succeeded, specified as a duration (default: decide immediately).
restart_policy: condition: on-failure delay: 5s max_attempts: 3 window: 120s
devices List of device mappings. devices: - ‘/dev/ttyUSB0:/dev/ttyUSB0’
env_file Add environment variables from a file. Can be a single value or a list.
env_file: - ./common.env - ./apps/web.env - /opt/secrets.env
environment Add environment variables. environment: RACK_ENV: development SHOW: 'true' SESSION_SECRET:
expose Expose ports between containers without publishing them to the host machine
expose: - ‘3000’ - ‘8000’
healthcheck Configure a check that’s run to determine whether or not containers are ‘healthy’.
healthcheck: test: [‘CMD’, ‘curl’, ‘-f’, ‘http://localhost’] interval: 1m30s timeout: 10s retries: 3
image Specify the image to start the container from.
image: a4bc65fd
depends_on Express dependency between containers
depends_on: - db - redis
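Putting the instructions of Table 6 together, an illustrative Product Execution File Descriptor could look as follows; all names and values are assumptions for the sake of the example:

```yaml
# Illustrative Product Execution File Descriptor (assumed values)
version: 1
container_name: tm-logging
container_creator: 'jane doe'
labels:
  description: 'Logging forwarder container'
image: a4bc65fd
depends_on:
  - db
env_file:
  - ./common.env
environment:
  RACK_ENV: production
expose:
  - '8000'
restart_policy:
  condition: on-failure
  delay: 5s
  max_attempts: 3
  window: 120s
healthcheck:
  test: ['CMD', 'curl', '-f', 'http://localhost']
  interval: 1m30s
  timeout: 10s
  retries: 3
```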
Table 7: vResource File Descriptor.
instruction description example
service_name Specify a custom service name, rather than a generated default name.
service_name: my-service
service_creator Creator of the service service_creator: ‘john doe’
version Version of the config file syntax version: 1
labels Specify labels for the service. labels: description: ‘Example Lbl’
restart_policy Configures if and how to restart the service when it exits.
● condition: One of none, on-failure or any (default: any).
● delay: How long to wait between restart attempts, specified as a duration (default: 0).
● max_attempts: How many times to attempt to restart a container before giving up (default: never give up).
● window: How long to wait before deciding if a restart has succeeded, specified as a duration (default: decide immediately).
restart_policy: condition: on-failure delay: 5s max_attempts: 3 window: 120s
replicas If the container is replicated, specify the number of containers that should be running at any given time.
replicas: 6
resources Configures resource constraints. resources: limits: cpus: '0.001' memory: 50M reservations: cpus: '0.0001' memory: 20M priority: 1
update_config Configures how the service should be updated. Useful for configuring rolling updates.
● parallelism: The number of containers to update at a time.
● delay: The time to wait between updating a group of containers.
● failure_action: What to do if an update fails. One of continue, rollback, or pause (default: pause).
● monitor: Duration after each task update to monitor for failure (ns|us|ms|s|m|h) (default 0s).
● max_failure_ratio: Failure rate to tolerate during an update.
update_config: parallelism: 2 delay: 10s
devices List of device mappings. devices: - ‘/dev/ttyUSB0:/dev/ttyUSB0’
env_file Add environment variables from a file. Can be a single value or a list.
env_file: - ./common.env - ./apps/web.env - /opt/secrets.env
external_links Link to services started outside this one external_links: - redis_1 - project_db_1:mysql - project_db_1:postgresql
healthcheck Configure a check that’s run to determine whether or not containers for this service are ‘healthy’.
healthcheck: test: [‘CMD’, ‘curl’, ‘-f’, ‘http://localhost’] interval: 1m30s timeout: 10s retries: 3
container Specify the containers to start the service from.
container: - db - processing1
user Set the user name (or UID) and optionally the user group (or GID) to use when running the service
user: 1002:1001
networks Networks to join networks: - some-network - other-network
ipv4_address, ipv6_address
Specify a static IP address for containers for this service when joining the network.
networks: app_net: ipv4_address: 172.16.238.10 ipv6_address: 2001:3984:3989::10
ports Expose ports. ports: - ‘3000’ - ‘3000-3005’ - ‘8000:8000’ - ‘9090-9091:8080-8081’ - ‘49100:22’
volumes Mount host paths or named volumes, specified as sub-options to a service.
volumes: source: mydata target: /data options: nocopy: true
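The service-level keys of Table 7 (notably replicas, resources and update_config, which have no counterpart in the container-level descriptor) can be combined as in this illustrative example; names and values are assumptions:

```yaml
# Illustrative vResource File Descriptor (assumed values)
version: 1
service_name: tm-ssm-core
service_creator: 'jane doe'
container:
  - db
  - processing1
replicas: 2
resources:
  limits:
    cpus: '0.5'
    memory: 256M
  reservations:
    cpus: '0.1'
    memory: 64M
update_config:
  parallelism: 1
  delay: 10s
  failure_action: rollback
networks:
  - some-network
ports:
  - '8000:8000'
```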
Descriptors Errors
Error handling at the abstract data model level will be separated between layers as specified. All layers will provide testing capabilities through the healthcheck command to ensure their correct state.
● 1st level error handling will check whether all the variables in the virtualization service description for a TM Application are within range and/or exist on the host system where the virtualization will run.
● 2nd level error handling will ensure the normal boot-up of the TM Application image and can have error checking in place by the TM service to abort the process at the virtualization level if needed.
● 3rd level error handling will continuously monitor the virtualization of each TM Application and the remaining TM Applications running at the hardware level, to ensure that hardware availability is within specifications, balancing the resources allocated to each part of the Virtualization Service.
Table 8: Physical Resource File Descriptor errors.
Error Original instruction Description Extra info
Descriptor syntax error
- Syntax error in the virtual machine creator file descriptor
Line
Nonexistent image
from The image file doesn’t exist Image name
Cmd syntax error run Nonexistent command or syntax error
If it is a nonexistent command or syntax error
Health Check failed
healthcheck The healthcheck command doesn’t run or throws an exception
Specify the case and the test
Table 9: Product Execution File Descriptor Errors.
Error Original instruction Description Extra info
Descriptor syntax error
- Syntax error in the product execution file descriptor
Line
Device error devices Nonexistent or non-accessible device
Device and reason
Environment file error
env_file Nonexistent or syntax error in environment file
File reason and if necessary line in file
Health Check failed
healthcheck The healthcheck command doesn’t run or throws an exception
Specify the case and the test
Nonexistent image
from The image file doesn’t exist Image name
Nonexistent container
depends_on The container doesn’t exist Container name
Table 10: vResource File Descriptor Errors.
Error Original instruction Description Extra info
Descriptor syntax error
- Syntax error in the vResource file descriptor
Line
Device error devices Nonexistent or non-accessible device
Device and reason
Environment file error
env_file Nonexistent or syntax error in environment file
File reason and if necessary line in file
Health Check failed
healthcheck The healthcheck command doesn’t run or throws an exception
Specify the case and the test
Nonexistent container
container The container doesn’t exist Container name
Nonexistent network
networks The network doesn’t exist Network name
Nonexistent volume/path
volumes The service volume or host path doesn’t exist
Specify case and volume/path
Relations
Part A Part B Type Multiplicity
Entity Virtualization Specialization 1
Entity Template Specialization 1
Entity vResource Specialization 1
Version vResource Reference 0..1
Version Template Reference 0..1
Template Template Reference (depends on) 0..1
Virtualization vResource Manages 1
Virtualization Template Manages 1
vResource vResourceCompute Specialization 1
vResource vResourceNetwork Specialization 1
vResource vResourceStorage Specialization 1
vResourceCompute Hardware Specialization 1
vResourceCompute vHardware Specialization 1
vResourceCompute Container Specialization 1
vResourceCompute VM Specialization 1
vResource vResource Descriptor Composition 1
Container Container File Descriptor
Composition 1
VM Image Creator File Descriptor
Composition 1
Interfaces
Virtualization Interface at 11.4.
Context Diagram
When starting an application in a distributed (or even cloud) environment, it is important to start it in the correct place. This view packet aims to make the reader aware that, just like the lifecycle code, the infrastructure must be coded as well.
Rationale
To start a TM application correctly, there must be both lifecycle script code and an infrastructure description called a template.
Allocation View
This view shows the mapping between the runtime components of the system (see 10.1) and the (virtual or real) servers (see 13.3.2) needed to run them. The figure below shows how the Services products (server side) are deployed across the sites, in particular at the GHQ (SKA Headquarters in England) and at the Telescopes (SKA-MID in South Africa and SKA-LOW in Australia). Each site, represented as a geographical boundary, hosts an instance of every product of TM Services. In the diagram, different representations are used depending on the technologies. The technologies used are: Virtual Machines node, Cluster, External Cloud.
Primary Presentation
Figure 18: Deployment diagram.
Element Catalog
Definitions
Element Description
Node In a computer network, a node is an end-point identified by an IP address or a name that can receive, create, store or send data along distributed network routes.
VMs Node A node composed of at least two virtual machines, an active and a passive node. In case of failure or an abnormal situation, a failover mechanism switches from the active to the passive node.
Cluster A cluster is a collection of one or more nodes that together hold data and provide search capabilities across all nodes, to improve efficiency, availability and performance.
External Cloud
An external cloud hosts services that need to run in an environment external to the other TM products.
Elements
Element Description
Service GUI Node A VMs Node, composed of an active and a passive node, hosting the Service GUIs
Software System Monitor Node
A VMs Node, composed of an active and a passive node, hosting the components of the Software System Monitor
Logging Server Cluster A cluster hosting the components of the server part of the Logging service
Lifecycle Manager A cloud node that hosts the components of the IT automation software (like Chef, Puppet, Ansible and so on); it resides in an external cloud for two reasons: to avoid having three locations for it, and for cost savings.
Service GUI Generic UI that allows an Operator to visualize monitoring data, alarms, logs and so on. It also permits actions such as configuring fault rules and taking lifecycle actions with an existing configuration. See 10.1 and 10.1.3.1 for further details.
Software System Monitor
See 10.1 and 10.1.3.1 for further details.
SSM Core Collects data from every SSM Agent in the network for hardware, software and network monitoring; receives asynchronous events from the Monitoring Agent; stores the collected data into the repository; provides the data to be visualized by the operator through the Service UI.
MonData Repository Repository where the monitoring data are stored. See 10.1 and 10.1.3.1 for further details.
FM Repository A repository where rules and actions are installed. See 10.1 and 10.1.3.1 for further details.
Fault Engine The rule engine that defines the mapping between monitoring data and alarms. It also provides an engine to perform predictive fault analysis using monitoring and log data. See 10.1 and 10.1.3.1 for further details.
Notification System Software application that sends notifications to the Service GUI or to the operator via email or SMS, according to the fault rules. See 10.1 and 10.1.3.1 for further details.
Logging Service Entity responsible for collecting and organizing the log messages from every application. It is usually composed of three software entities: the forwarder, the repository (data center) and the query GUI (that is, the ELK stack). The repository (data center) is a hierarchy of databases (usually NoSQL), and potentially every domain TANGO Facility (element instance) may have a specific database cluster to collect log messages and increase the performance of the queries (the first choice for the data center is Elasticsearch). See 10.1 and 10.1.3.1 for further details.
LS Engine See 10.1 and 10.1.3.1 for further details.
LM Core Server side of the Lifecycle Manager. It allows sending a specific control action to a TM Application. See 10.1 and 10.1.3.1 for further details.
LM Data Repository Part of the engine: repository of configuration items (for example, in Chef they are Ruby scripts). See 10.1 and 10.1.3.1 for further details.
Relations
Part A Allocated To Description
Service GUI Node
GHQ and Telescopes
Service GUI is a VMs Node
Software System Monitor Node
GHQ and Telescopes
SSM Node is a VMs Node
Logging Server Cluster
GHQ and Telescopes
Logging Data is a cluster that does not need any TM failover mechanism because one is already present in the ELK stack.
Lifecycle Manager Cloud
GHQ and Telescopes
The Lifecycle Manager has been put outside the virtualization (specifically in a cloud environment) because every TM application (Virtualization included) depends on it for its lifecycle.
Variability Mechanisms
To reach a high level of reliability, software engineers usually use a failover mechanism. There are two main best practices for this: the active/active architecture (also known as a high-availability cluster) and the active/passive architecture (also known as simple failover).
An active/passive solution manages failover when the active (or primary) node crashes: its associated resources are relocated to (and restarted on) the passive (or secondary) node, ensuring the persistence of the services. Besides reliability, another advantage is the ability to deal with both planned and unplanned service outages: the administrator can update the passive node while the active one is still running, easily switch them after the update, and then do the same for the other node.
An active/active solution manages failover by keeping all the nodes working, so that if one fails the others can still receive requests. In this case, it is crucial to guarantee traffic balance by adding one or more load balancers to the network. It is also important to properly size the servers to avoid spare resources; in fact, a best practice is that the load balancers run the servers at near full capacity.
In the TM SER case, an active/passive solution has been preferred for two main reasons: the first is to minimize cost (there is no need for a load balancer, so it is easier to implement) and the second is that this solution fully satisfies the assigned requirements.
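The active/passive mechanism described above can be sketched as a heartbeat check that promotes the passive node when the active one stops responding. This is a minimal illustration with assumed names, not the actual TM SER implementation:

```python
# Minimal active/passive failover sketch; node names and the heartbeat
# predicate are illustrative assumptions.
class FailoverPair:
    def __init__(self, active: str, passive: str):
        self.active = active
        self.passive = passive

    def check(self, heartbeat_ok) -> str:
        """If the active node misses its heartbeat, promote the passive node."""
        if not heartbeat_ok(self.active):
            # relocate resources: the passive node becomes the new active one
            self.active, self.passive = self.passive, self.active
        return self.active
```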
Rationale
For the deployment of the TM SER components, the Nagios Core architecture (see [RD13]), the ELK architecture (see [RD15] and [RD16]) and the Chef architecture ([RD24]) have been consulted and taken as references. While the monitoring system needs only one server to manage up to 500 nodes [RD14], the logging service needs a cluster to store all the required data (so two servers are needed, one for logging and one for monitoring). The requirement for logging storage is to keep log data for 2 years [AD1]. After a test, it has been found that an application working heavily with logs can send up to 24 MB of logs per day. There are 17 products (11 for TMC, 6 for OSO) in the cost model, therefore the plan is for 150 GB per year; having 1 TB for each site would be enough. The Lifecycle Manager server has been placed in an external cloud environment instead of having an instance (that is, a VM node) at every site. Since the managed information is not going to be large (the TANGO cookbook [RD27] is 95.4 KB) and since this information is usually cached (that is, by Chef) on the client computer, there is no need to allocate specific resources within the sites. Even if the costs associated with a cloud-based solution are comparable to the classical solution (see also [RD24]), the cloud solution has been preferred for the better support associated with it.
Tactics
The SSM Agent, to detect a TM faulty component, implements the following mechanisms:
● Ping: asynchronous request/response message to determine that the component is alive and responding correctly.
● Heartbeat: periodic message exchange with the SSM Server.
● Timestamp: every message has a timestamp to rebuild the correct order of messages.
● Timeout: the monitoring activity (that is, a Nagios check) should complete within a predetermined amount of time.
● Retry: in case of a faulty monitoring activity, the SSM Agent retries to execute the activity.
To recover a TM SER faulty component, the SER sub-system implements the following mechanisms:
● Active redundancy (for every SER sub-system): in case of a fault or failure it is possible to switch to a passive (or redundant) node automatically, using an automatic action, or manually, by the intervention of the operator.
● Reconfiguration: the LMC, through the Lifecycle Manager, can re-configure a faulty component with a versioned configuration, automatically or with an operator command.
● Software upgrade or downgrade: patch after a software bug.
● Exception handling: once an exception has been detected, the system must handle it. There are several possibilities to handle an exception. A possible way is to include with the exception an error code that contains information helpful in fault correlation.
To prevent faults, the Fault Management component will perform trend analysis and failure prediction of TM components. This analysis has to be done with the help of the logging data coming from the TM Logging server. The tactics implemented by LMC Fault Management are:
● Predictive model: according to the health status detected by the Software System Monitor, the predictive model ensures that the system is operating within its normal operating parameters and takes corrective actions. A possible way to create a predictive model is to gather the data (from monitoring, logs and so on) and analyse it with artificial intelligence software.
11 Interfaces
Lifecycle Manager - TM Generic Application - Interface
Interface identity
Lifecycle Manager Main Interface
Resources provided
Configuration script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The configuration script prepares the host with all the necessary resources for running a TM Application (see 10.3);
o Every version of an application must have its own configuration;
o The script is idempotent (it can run multiple times leaving the host in the same state);
o It refers to a particular version of an application.
● error handling: permission not available
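The idempotency property of the configuration script can be illustrated with a small sketch (paths and file contents are assumptions; the actual scripts would be written in the chosen IT automation tool's language): only missing resources are created, so repeated runs leave the host in the same state.

```python
from pathlib import Path
import tempfile

# Sketch of an idempotent configuration step: it only creates what is
# missing, so repeated runs converge to the same host state.
def configure(app_home: Path) -> None:
    app_home.mkdir(parents=True, exist_ok=True)   # no error if already present
    config = app_home / "app.conf"
    if not config.exists():                       # write the default only once
        config.write_text("log_level=INFO\n")

# Running the configuration any number of times yields the same result.
root = Path(tempfile.mkdtemp()) / "tm-app"
configure(root)
configure(root)
```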
Start script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The start script runs after the configuration script and starts the specific TM application;
o The script is not idempotent;
o It refers to a particular version of an application.
● error handling: permission not available
Stop script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The stop script runs after the start script and stops the specific TM application;
o The script is not idempotent;
o It refers to a particular version of an application.
● error handling: permission not available
Kill script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The kill script runs after the start script and kills the process of the specific TM application;
o The script is not idempotent;
o It refers to a particular version of an application.
● error handling: permission not available
Error handling
● Permission not available: the configuration resources could need to write to a certain directory (for instance to create a particular service);
● Wrong configuration.
Rationale and design issues
● To start or stop an application there is the need for a configuration phase.
● The configuration phase can be done manually or through the use of an IT Automation Tool such as Chef, Puppet, Ansible and so on.
SSM - Monitoring Activity Interface
Interface identity
Monitoring activity Interface
Resources provided
Execute resource:
● syntax of the resource: a script that runs on the machine where the monitoring activity is hosted. It could be a compiled C program or a shell script.
● semantics of the resource: when the resource is called, the activity starts and returns:
o A code that can assume the following values: INFORMATION, WARNING, ALARM, UNKNOWN
o A value indicating the explanation of the code: Measure, Message, State
● error handling: the following errors must be handled:
o The resource represents a script that must be well written;
o The script has the correct permission to read the resources of the machine;
o The SSM Agent where the monitoring activity runs can communicate with the SSM Server.
Configure resources:
● syntax of the resource: a JSON-based file with information about the resource to monitor
● semantics of the resource:
o Version and information about the host
● error handling: the following errors must be handled:
o Wrong configuration
Data types and constants
The return data type is an array of {CODE, MESSAGE}.
Error handling
● Permission: the script or agent cannot read the resources of the host to control.
● Communication: communication problem between the SSM Server and the SSM Agent.
TM Monitor Interface
Interface identity
TM Monitor Device Interface
Resources provided
Add Monitoring Point resource:
● syntax: TANGO Controls Device command
o Input:
Name of the monitoring point, string
Url to retrieve the monitoring point, string
Query to get the monitoring point value, string
Type of the monitoring point, string
o Output:
Boolean that indicates the success of the operation
● semantics of the resource:
o The observable effect of the resource is the addition of a dynamic attribute, representing the monitoring point, in the device;
o The url to get the monitoring point depends on the technology chosen;
o The query to get the monitoring point value depends on the technology chosen.
● error handling:
o The Url to get the monitoring point is wrong or does not allow retrieving any value;
o The query to get the monitoring point value is wrong or does not allow retrieving any value;
o The name of the monitoring point already exists;
o The SSM is not active.
Remove Monitoring Point resource:
● syntax: TANGO Controls Device command
o Input:
Name of the monitoring point, string
o Output:
Boolean that indicates the success of the operation
● semantics of the resource:
o The observable effect of the resource is the removal of the dynamic attribute, representing the monitoring point, from the device.
● error handling:
o The name of the monitoring point does not exist;
o The SSM is not active.
Host List resource:
● syntax: TANGO Controls Device command
o Output:
List of the hosts configured for monitoring
● semantics of the resource:
o This resource retrieves all the hosts in the network that have been configured for monitoring.
● error handling:
o The SSM is not active and it is not possible to retrieve the list.
Get Monitoring Point Value resource:
● syntax: TANGO Controls Device command
o Input:
Url to retrieve the monitoring point, string
Query to get the monitoring point value, string
o Output:
Value of the monitoring point
● semantics of the resource:
o This resource retrieves the value of a particular monitoring point;
o The url to get the monitoring point depends on the technology chosen;
o The query to get the monitoring point value depends on the technology chosen.
● error handling:
o The Url to get the monitoring point is wrong or does not allow retrieving any value;
o The query to get the monitoring point value is wrong or does not allow retrieving any value;
o The SSM is not active.
Get Monitoring Points Process resource:
syntax: TANGO Controls Device command
o Input: Name of the process, string
o Output: List of monitoring points, string
semantics of the resource:
o Retrieve the list of generic monitoring points for a process
error handling:
o The process is not already configured in the SSM
o The SSM is not active
Data types and constants
This interface uses the types available for dynamic attributes within the TANGO Controls framework.
Error handling
If the SSM is not active, the TM Monitor will not work; to recover, ensure that the SSM is active.
Quality attribute characteristics
QA characteristics of the resources:
Modifiability: adding and removing monitoring points from the device is easy.
Rationale and design issues
This interface is a bridge between a generic SSM and the TANGO Controls world. It makes it possible to take advantage of the TANGO Controls mechanisms for storing values and generating alarms correlated with the functional monitoring points of the TMC.
Virtualization Interface
Interface definition
The interface to the Virtualization Service is divided into three configuration layers (see Figure 26 of the Virtualization View for an overview). The first layer, named Physical Resource Layer, consists of all the hardware available to support LINFRA computation. This layer is internal to LINFRA and is managed according to computational requirements, while observing the need for maintenance and support tasks. The second layer, named Product Execution Layer, consists of a distributed environment (a set of several virtual machines) operating towards the provisioning of highly available products. The third layer, named Virtualized Resource (vResource) Layer, consists of the virtual machines, containers and other hardware that have a logical representation to the Product Execution Layer and are either part of a template or available to be used by future templates. The interface is constrained to exposing template actions externally according to the three layers above, while internally relying on a state-based architecture that controls the Virtualization Orchestrator and its managed templates and instances. The objective is to provide high availability, abstracting the underlying hardware infrastructure and allowing software-defined failover and horizontal scalability. Externally, requesting services only need to define the template for the virtualization service; all the internal state, and the resources allocated out of the provided resource pool, are managed by the Virtualization Service itself, which exposes only logging capabilities to the outside.
These Virtualization Service internals are defined in the Virtualization View (see 13.3.2).
Template Actions
Actions may be issued according to the present state. A template corresponds to one or several instances of vResources (see 10.3.6), according to a given deployment plan and an SLA (Service Level Agreement). Actions may be exposed to the rest of TM, through the Lifecycle Manager, or be available only internally to the virtualization management processes. Outside the computation infrastructure, management is driven by each template, with the instances of vResources managed internally and locally. Depending on the effective technology used for the vResources, some instance states may be missing. The definition of
templates is vital so that the Virtualization Service allocates and manages resources adequately, considering the overall SLA and the dependencies of all instances of a template. That is, a TM template is allocated only if all of its instances can be allocated. Also, scaling, failover and migration can be handled by the Virtualization Service, mitigating failures due to maintenance and hardware, as well as avoiding redundant implementations in all other components (at the server level, of course), according to the Context of the Virtualization View (see 13.3.2.2).
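As a sketch of this state-based control, the transition table below encodes required and triggered states taken from Tables 11 and 12; the Python class and its method names are illustrative assumptions, not the Virtualization Service API, and only the happy path (no ERROR outcomes) is modelled:

```python
# Required-state / triggered-state pairs for a subset of the template
# actions, transcribed from Tables 11 and 12 (happy path only).
TEMPLATE_TRANSITIONS = {
    "create":  ({"INITIALIZED"}, "ACTIVE"),
    "pause":   ({"ACTIVE"}, "PAUSED"),
    "suspend": ({"ACTIVE"}, "SUSPENDED"),
    "stop":    ({"ACTIVE", "ERROR"}, "STOPPED"),
    "resume":  ({"SUSPENDED"}, "ACTIVE"),
    "unpause": ({"PAUSED"}, "ACTIVE"),
    "start":   ({"STOPPED"}, "ACTIVE"),
    "reboot":  ({"ACTIVE", "STOPPED"}, "ACTIVE"),
    "delete":  ({"INITIALIZED", "PAUSED", "SUSPENDED", "STOPPED", "ERROR"},
                "DELETED"),
}

class Template:
    """Hypothetical state holder enforcing the allowed-action rules."""

    def __init__(self):
        self.state = "INITIALIZED"

    def get_state(self):
        # get_state supersedes any other action and is allowed in ANY state.
        return self.state

    def apply(self, action):
        required, triggered = TEMPLATE_TRANSITIONS[action]
        if self.state not in required:
            raise RuntimeError(f"{action} not allowed in state {self.state}")
        self.state = triggered
        return self.state
```

A failed action would instead leave the template in ERROR, from which only delete (and stop, per Table 11) can be issued.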
Table 11: Available actions for the Virtualization Service.
Action | Description | Required State | Triggered State | Visibility
get_state | Retrieve the present state of the template. This action supersedes any other existing at every level, for monitoring/debugging purposes. | ANY | NONE | EXTERNAL
create | Create a new Template instantiation. | INITIALIZED | ACTIVE, ERROR | EXTERNAL
set_admin_password | Specify the Template administrator password. May enter an ERROR state if the complexity requirements are not met. | ACTIVE | ACTIVE, ERROR | EXTERNAL
live_migrate | Moves Template instances between hardware computing units without powering off the instances in general, so the instances do not suffer downtime. Virtualization engineers may use this to evacuate a physical server that needs to undergo maintenance tasks. | ACTIVE | ACTIVE, ERROR | INTERNAL
suspend | Suspend a Template if it is infrequently used, or to perform system maintenance. The instance state is stored on disk: all memory is written to disk and the instances are stopped. Some technologies (e.g. LXC) may not allow this. | ACTIVE | SUSPENDED, ERROR | INTERNAL
pause | Stores the state of the Template instances in RAM. Paused instances continue to run in a frozen state. | ACTIVE | PAUSED, ERROR | INTERNAL
resume | Resume suspended instances. | SUSPENDED | ACTIVE, ERROR | EXTERNAL
unpause | Returns a paused Template back to an active state. | PAUSED | ACTIVE, ERROR | EXTERNAL
stop | Power off the Template and its instances. | ACTIVE, ERROR | STOPPED, ERROR | EXTERNAL
backup | Store the Template's current state to the general archive. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR | INTERNAL
rebuild | Remove all data on the Template and rebuild it according to the latest specification. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR | INTERNAL
scale | Change the Template resource allocation, number of instances and QoS parameters. | ACTIVE, STOPPED | ACTIVE, ERROR | INTERNAL
start | Power on the Template. | STOPPED | ACTIVE, ERROR | EXTERNAL
reboot | Soft or hard reboot the Template instances. A soft reboot attempts a graceful shutdown and restart of the instances (powering them down normally before rebooting). A hard reboot power cycles the instances (shuts them down immediately). | ACTIVE, STOPPED | ACTIVE, ERROR | EXTERNAL
delete | Power off the given Template first, then detach all resources associated to the instances (such as network and volumes), then delete the instantiation of the Template. | INITIALIZED, PAUSED, SUSPENDED, STOPPED, ERROR | DELETED | INTERNAL
Table 12: Allowed Actions according to the State
State | Description | Allowed Actions
INITIALIZED | Template was just created but nothing exists yet. | create
ACTIVE | Template is running. | pause, stop, snapshot, reboot
PAUSED | Template is paused and all instances are also paused. | unpause, delete
SUSPENDED | Template is suspended and all instances are suspended. | resume, delete
STOPPED | Template is not running. | snapshot, backup, rebuild, reboot, resize, rescue, start
DELETED | From a quota perspective, the Template no longer exists. Instances still running on compute will eventually be destroyed, and their disk images too. | -
ERROR | Some unrecoverable error happened; only delete is allowed to be called on the Template. | delete
Figure 19: TM Template Actions depending on the State.
vResource Internal Actions
Instances (of vResources, see 10.3.6) are a construct managed internally by TM.LINFRA. TM engineers will have the capability of accessing an instance directly, through the SSO mechanisms defined in the AAA policies. The instances are managed internally by the virtualization provider and are not available to outside users.
Table 13: vResources Internal Actions.
Action | Description | Required State | Triggered State
create | Create a new Template instantiation. | INITIALIZED | ACTIVE, ERROR
live_migrate | Moves instances between hardware computing units without powering off the instances in general, so the instances do not suffer downtime. | ACTIVE | ACTIVE, ERROR
soft_delete | See delete, but the deleted instance is not deleted immediately; instead it is put into a queue and deleted according to the operational policy. | ACTIVE, STOPPED | SOFT_DELETED, ERROR
suspend | Suspend an instance if it is infrequently used, or to perform system maintenance. The VM state is stored on disk: all memory is written to disk and the virtual machine is stopped. | ACTIVE | SUSPENDED, ERROR
pause | Stores the state of the instance in RAM. A paused instance continues to run in a frozen state. | ACTIVE | PAUSED, ERROR
restore | Restores a soft-deleted instance. | SOFT_DELETED | ACTIVE, ERROR
resume | Resume a suspended instance. | SUSPENDED | ACTIVE, ERROR
unpause | Returns a paused set of instances back to an active state. | PAUSED | ACTIVE, ERROR
stop | Power off the instance. | ACTIVE | STOPPED, ERROR
snapshot | Store the current state of the instance root disk, to be saved and uploaded back into the glance image repository. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR
backup | Store the instance's current state. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR
rebuild | Remove all data on the application server and replace it with a specified instance image. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR
resize | Convert an existing application server to a different flavor, scaling the application server up or down. The original application server is saved for a period of time to allow rollback if there is a problem. | ACTIVE, STOPPED | RESIZED, ERROR
rescue | Start the application server in a special configuration whereby it is booted from a special root disk image, enabling an attempt to restore a broken guest system. | ACTIVE, STOPPED | RESCUED, ERROR
start | Power on the instance. | STOPPED | ACTIVE, ERROR
reboot | Soft or hard reboot an instance. A soft reboot attempts a graceful shutdown and restart of the instance; a hard reboot power cycles the instance. | ACTIVE, STOPPED, RESCUED | ACTIVE, ERROR
confirm_resize | See resize. | RESIZED | ACTIVE, ERROR
revert_resize | See resize. | RESIZED | ACTIVE, ERROR
delete | Power off the given instance first, then detach all resources associated to the instance (such as network and volumes), then delete the instantiation from the Template. | INITIALIZED, PAUSED, SUSPENDED, SOFT_DELETED, ERROR, RESCUED, STOPPED | DELETED, ERROR
unrescue | Reverse action of rescue; the instance spawned from the special root image is deleted. | RESCUED | ACTIVE
Error handling
Error handling exists at both the layer level and the interface level. At the interface level, errors are handled by the logging facilities described in the Virtualization View and by the states described in the Template Actions interface.
Rationale and design issues
● Completely abstract the hardware from the TM service, so that only LINFRA has to take into account hardware non-uniformity.
● Provide high-level access to all relevant hardware capabilities, to deduplicate the work needed by the TM services.
12 Prototypes
The TM SER prototypes are included in the TM Prototype report [RD4].
13 Appendix
Compliance statements for TM Service Requirements
The following table shows the analysis made for the compliance statements for the TM Service requirements.
Table 14: Compliance statements for TM Service Requirements
Id Addressed in
SER_REQ_1 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_2 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_3 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1), TM Monitor Prototype (see 11.3)
SER_REQ_4 Service C&C View (see 10.2)
SER_REQ_5 Service C&C View (see 10.2)
SER_REQ_6 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_6a Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_7 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_8a TM Health Status and State Analysis View (see 13.3.1)
SER_REQ_8b TM Health Status and State Analysis View (see 13.3.1)
SER_REQ_9 Allocation View (see 10.4)
SER_REQ_10 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11a Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11b Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11c Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11d Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_12 Service C&C View (see 10.2)
Detailed scenarios
This section contains some detailed scenarios that were not included in the use cases section (see 9).
Monitoring scenarios
Monitoring Resources
Name Monitor Resources
Description SER SSM monitors TMC/OSO resources, defined as any physical component of limited availability within a computer, such as CPU load, memory usage and so on.
Actor SER SSM
Pre-condition SER SSM has the configuration and authorization to access the resource monitoring data
Basic flow 1. SER SSM measures the performance of TMC/OSO resources
2. SER SSM compares current values against the nominal values for normal behaviour
3. SER SSM reports monitoring data and, in case of failure/error, raises an alarm
Alternative Flow
-
Post-condition
SER SSM receives and reports monitoring data
Monitoring Services (on network)
Name Monitoring Services (on network)
Description SER SSM monitors TMC/OSO services (defined as an application running at the network application layer, for instance TCP, HTTP, FTP and so on)
Actor SER SSM
Pre-condition The TMC/OSO service must be available from the network
Basic Flow 1. SER SSM periodically sends network packets (that is, requests the service) to the TMC/OSO component to be monitored
2. TMC/OSO receives the packets and replies to the SER request
3. SER SSM receives the acknowledgement from TMC/OSO
4. SER SSM compares the current acknowledgement against the nominal one for normal behaviour
5. SER SSM reports monitoring data and, if needed, raises an alarm
Alternative Flow
-
Post-condition
SER SSM receives and reports monitoring data
Asynchronous Monitoring Software component
Name Asynchronous Monitoring Software component
Description SER SSM monitors a software component in asynchronous mode and downloads monitoring data periodically.
Actor SER SSM Server, SER SSM Agent
Pre-condition SER SSM Server has the right configuration and authorization to access the SER SSM Agent. The remote host has the SER SSM Agent installed and working.
Basic flow 1. SER SSM Agent retrieves and collects data from the local machine
2. SER SSM Server connects to the SSM Agent and downloads the monitoring data
3. SER SSM Agent removes the data downloaded by the SER SSM Server
4. SER SSM Server compares current values against the nominal values for normal behaviour
5. SER SSM reports monitoring data and, if needed, raises an alarm
Alternative Flow
-
Post-condition -
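The asynchronous pull protocol above (the agent collects locally, the server downloads, and the agent discards delivered data) can be sketched as follows; all class and method names, and the threshold-based evaluation, are illustrative assumptions rather than the SER SSM API:

```python
class SSMAgent:
    """Collects samples on the remote host and hands them over on request."""

    def __init__(self):
        self._buffer = []

    def collect(self, name, value):
        self._buffer.append((name, value))

    def drain(self):
        # Step 3 of the flow: data handed to the server is removed locally.
        data, self._buffer = self._buffer, []
        return data

class SSMServer:
    """Periodically downloads agent data and compares it to nominal values."""

    def __init__(self, nominal):
        self.nominal = nominal  # name -> (low, high) acceptance range
        self.alarms = []

    def poll(self, agent):
        for name, value in agent.drain():
            low, high = self.nominal[name]
            if not (low <= value <= high):
                # In the real flow this would raise an alarm to SER.
                self.alarms.append((name, value))
```

The acceptance-range comparison stands in for whatever "evaluates current and nominal values" means in the chosen monitoring technology.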
Synchronous Monitoring Software component
Name Synchronous Monitoring Software component
Description A TMC/OSO application (remote process), through the SER SSM Agent, sends a synchronous message to SER
Actor SER SSM Server, SER SSM Agent
Pre-condition SER SSM Server has the right configuration and authorization to access the SER SSM Agent. The remote host has the SER SSM Agent installed and working.
Basic flow 1. A TMC/OSO application (remote process) sends a message to the SER SSM Agent
2. SER SSM Agent connects to the SER SSM Server and forwards the message
3. SER SSM Server receives the message
4. SER SSM Server evaluates the message and checks whether it reflects normal behaviour
5. SER SSM Server reports the monitoring message and, if needed, raises an alarm
Alternative Flow
-
Post-condition -
Sending alarm
Name Sending alarm
Description SER Fault Management sends an alarm to an operator.
Actor SER SSM Server, Operator, Fault Management
Pre-Condition Alarm is not handled by TMC/OSO
Basic flow 1. SER receives monitoring data that does not match normal behaviour
2. SER has no rule defined that matches the mismatching situation
3. SER sends an alarm to the operator
Alternative Flow -
Post-condition -
Fault management scenarios
Insert Recovery procedure
Name Insert Recovery procedure
Description A TMC/OSO Developer inserts a recovery procedure to handle a specific alarm
Actor TMC/OSO Developer, SER Administrator
Pre-condition
The corresponding alarm is defined
Principal flow
1. TMC/OSO Developer creates a procedure to handle a specific alarm
2. TMC/OSO Developer uploads the procedure into the Lifecycle Manager Repository
3. The SER Administrator tests the procedure and declares it runnable, or asks the TMC/OSO Developer to modify the procedure
Post-condition
Recovery procedure is stored in the Lifecycle Manager Repository
Alarm notification
Name Alarm notification
Description TMC/OSO Developer stores rules to send alarms
Actor TMC/OSO Developer, SER Administrator
Pre-Condition
The alarm does not exist
Principal flow 1. TMC/OSO Developer defines rules to send an alarm
2. TMC/OSO Developer uploads the rules into the Software System Monitor
3. The SER Administrator checks the rules and declares them correct, or asks the TMC/OSO Developer to modify them
Post-condition
A new alarm condition is stored
Lifecycle Management Scenarios
Configure/Start Application
Name Configure/Start Application
Description Start a TM application in a remote host or locally
Actor Administrator
Pre-condition
The administrator must have the right to start the application; the application has not been started or, if it has, the policy allows multiple instances of it.
Principal flow
1. Log in to TM.SER
2. Check the user's right to start the chosen application
3. Check the application start-up policy if the application is already started
4. Configure the application, or check that the correct configuration is loaded
5. Start the application
6. Start the corresponding monitoring activities, if available
7. Test the application
Post-condition
The application is started
Kill Application
Name Kill Application
Description Stop an application in a remote host or locally
Actor Administrator
Pre-condition The administrator must have the right for stopping an application. The application has started.
Principal flow 1. Log in to TM.SER
2. Check the user's right to kill the chosen application
3. Stop the corresponding monitoring activities, if available
4. Kill the application
Post-condition The application is stopped
Restart Application
Name Restart Application
Description Restart a specific instance of application in a remote host or locally
Actor Administrator
Pre-condition The user must have the right for restarting the application. The application has started.
Principal flow 1. Log in to TM.SER
2. Check the user's right to restart (kill and start) the application on the specific host
3. Stop the corresponding monitoring activities, if available
4. Kill the application
5. Configure/reconfigure the application
6. Start the application
7. Start the corresponding monitoring activities, if available
8. Test the application
Post-condition
The application is started
Add Application Version
Name Add Application Version
Description Add a specific version of an application into the system
Actor Administrator
Pre-condition Version not present
Principal flow 1. Log in to TM.SER
2. Check the user's right to add a new version of the application
3. Check that the version is not already in the system
4. Add the version entry
Post-condition Version present
Remove Application Version
Name Remove Application Version
Description Remove a specific version of an application from the system
Actor Administrator
Pre-condition Version present
Principal flow 1. Log in to TM.SER
2. Check the user's right to remove a version of the application
3. Check that the version is in the system and that it is offline; if it is online, set the application offline (see 4.2.7)
4. Remove the version entry
Post-condition
Version not present
Set on line Application version
Name Set on line Application version
Description Set a specific version available to be used by a User
Actor Administrator
Pre-condition Version off-line
Principal flow 1. Log in to TM.SER
2. Check the user's right to set a version of the application online
3. Check that the version is offline
4. Configure/reconfigure the application
5. Start the application
6. Start the corresponding monitoring activities, if available
7. Test the application
8. Set the specified version online
Post-condition Version on line
Set off line Application version
Name Set off line Application version
Description Set a specific version not available to be used by a User
Actor Administrator
Pre-condition Version on line
Principal flow 1. Log in to TM.SER
2. Check the user's right to set a version of the application offline
3. Check that the version is online and not used by any operator
4. Check the user's right to kill the chosen application
5. Stop the corresponding monitoring activities, if available
6. Kill the application
7. Set the specified version offline
8. Uninstall application
Post-condition Version off-line
Update Application
Name Update Application
Description Update a TM application without removing or adding a specific version. This case can happen if, for instance, there is a bug (or a specific user request) and there is the need to solve it as soon as possible.
Actor Administrator
Pre-condition
-
Principal flow
1. Log in to TM.SER
2. Check the user's right to update an application
3. Check that the version is online and not used by any operator; otherwise stop
4. Kill the application
5. Install/update the version
6. Configure/reconfigure the application
7. Start the application
8. Start the corresponding monitoring activities, if available
9. Test the application
Post-condition
-
List Applications
Name List Applications
Description List the applications available for the user
Actor User
Pre-condition -
Principal flow 1. Log in to TM.SER
2. List the applications
Post-condition -
Use Application
Name Use Application
Description Link the user to the right version of the application they want to work with
Actor User
Pre-condition -
Principal flow 1. Log in to TM.SER
2. List the applications available for the user
3. Choose an application from the list
4. Use the application
Post-condition -
Logging scenarios
Store Log
Name Store Log
Description A TMC/OSO Application stores a log
Actor TMC/OSO Application
Pre-condition The Logging service is active and the logging priority is high (for instance, there is a need to investigate a behaviour).
Principal flow (with log server connection)
1. TMC/OSO Application creates a log packet defining log details and log information
2. TMC/OSO Application sends log packet to SER Logging Service 3. Log Service stores log packet
Alternative flow (without log server connection)
1. TMC/OSO Application creates a log packet defining log details and log information
2. TMC/OSO Application maintains the log packet until the connection with SER Logging Service returns
3. TMC/OSO Application sends log packet to SER Logging Service 4. Log Service stores log packet
Post-condition Log is stored
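The two flows above amount to a store-and-forward scheme: the application buffers log packets while the SER Logging Service is unreachable and flushes them once the connection returns. A minimal sketch, with illustrative names and a pluggable send callable standing in for the network transport:

```python
class LoggingClient:
    """Hypothetical client side of the Store Log scenario.

    Packets are kept in order and only discarded once delivery succeeds,
    matching the alternative flow (no log server connection).
    """

    def __init__(self, send):
        self._send = send   # callable delivering one packet; may raise
        self._pending = []  # packets kept until the connection returns

    def log(self, details, message):
        # Step 1: create a log packet with log details and log information.
        packet = {"details": details, "message": message}
        self._pending.append(packet)
        self.flush()

    def flush(self):
        # Steps 2-3: send pending packets oldest first; on connection
        # failure keep everything and retry on the next call.
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return
            self._pending.pop(0)
```

In a real deployment the pending buffer would be bounded and persisted; that policy is out of scope here.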
Search Log
Name Search Log
Description A maintainer or developer needs to investigate a certain behaviour of the TMC/OSO Application
Actor A TMC/OSO (or TM) Developer (or Maintainer)
Pre-condition The Developer has the permission to search a log
Principal flow 1. The Developer logs in to the TM.SER Logging Service
2. The Developer searches for log messages using key search values (for instance, a word)
3. The Logging Service provides a list of matching log messages
Post-condition
-
Extract Log File
Name Extract Log File
Description A maintainer or developer needs to investigate a certain behaviour of the TMC/OSO Application
Actor A TMC/OSO (or TM) Developer (or Maintainer)
Pre-condition The Developer has the permission to extract a log
Principal flow 1. The Developer logs in to the TM.SER Logging Service
2. The Developer searches for log messages using key search values
3. The Developer downloads the log files they need
Post-condition
-
Other views
TM Health Status and State Analysis View
The present view is intended to satisfy the following requirements:
ID Name Description Source Verification method
SER_REQ_8a Aggregate and Report TM Health Status
The SER shall aggregate the TM internal status and report it to the Operator in a structured health view based on the TM PBS
TM_REQ_211 Demonstration
SER_REQ_8b Manage TM State
The SER shall manage the TM state (by sending signals for state transitions), which can assume, among others, the following values: start-up, shutdown, standby and operational. The possible state transitions are: 1. from standby to start-up; 2. from start-up to operational; 3. from operational to shutdown; 4. from shutdown to standby.
TM_REQ_201 TM_REQ_202 TM_REQ_342 TM_REQ_385 TM_REQ_386 TM_REQ_387
Demonstration
From the requirement analysis, there is a need for two distinct aggregated values expressing the TM State and the TM Health Status, respectively. The first is an enumerated value that has to contain at least the following values: start-up, shutdown, standby and operational. The second is a performance indication that must be structured so that an Operator can understand it (according to the TM PBS).
This view represents an analysis of a simple, systematic approach to the definition of an aggregation method and performance metrics.
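The four state transitions listed in SER_REQ_8b form a simple cycle; they can be sketched as a transition check applied before a state-change signal is sent (the function and table names are illustrative assumptions, not part of the SER interface):

```python
# The only legal successor of each TM state, per SER_REQ_8b.
TM_TRANSITIONS = {
    "standby": "start-up",
    "start-up": "operational",
    "operational": "shutdown",
    "shutdown": "standby",
}

def transition(state, target):
    """Validate a requested TM state transition before sending the signal."""
    if TM_TRANSITIONS.get(state) != target:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

The requirement says "among the others", so a real implementation would likely carry additional states; only the four mandated transitions are encoded here.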
Primary Presentation
Figure 20: Health status calculation.
Figure 21: Mathematical representation.
Element Catalog
Elements
Element Description Figure
i Monitoring point index (within a critical group). -
j Critical grouping index. It identifies the critical group (in Figure 21 j=2 for G2).
-
Mi,j Monitored item (see 13.3.1.3); for example, a process, an application, a CPU and so on. It is critical if its failure corresponds to a TM failure.
Figure 20
Gj Critical grouping, that is, a set of non-critical items that together form a critical item (for instance, a set of applications, or a set of monitoring points forming a server). It is composed of the minimum number of non-critical items whose collective failure corresponds to a TM failure; it becomes critical if and only if all composing items fail.
Figure 20
si,j State of the monitored entity corresponding to the i-th monitoring point inside the j-th critical group
-
(si,j) = 1 if si,j is an operating state
(si,j) = 0 if si,j is a non-operating state (for example, standby, switched off, and so on)
Figure 20
wi,j Weight of the i-th monitoring point inside the j-th critical group. It is a measure of the importance of the item in the TM (0 ≤ wi,j ≤ 1). Their values are the natural outcome of the FMECA analysis.
Figure 20
hi,j Normalized value of each monitoring point (0 ≤ hi,j ≤ 1). It can be thought of as a sort of "health status of Mi,j".
Figure 20, Figure 21
aj Aggregated health status for the j-th critical group. Figure 20, Figure 21
t Time at which the monitoring point values have been acquired and the system health computed and provided as an aggregated value.
Figure 21 as ‘time’
Pin Starting time of the aggregation process. It can also be thought of as t − Δt, Δt being the duration of the aggregation process.
Figure 21
Pout Ending time of the aggregation process. It can also be thought of as t. Figure 21
nj Number of components of the j-th critical group -
∑ It represents the mathematical operation expressed in Eq. 5 Figure 21
∏ It represents the mathematical operation expressed in Eq. 2 Figure 21
Interfaces
Interface Document link
SSM-Monitoring Activities SSM - Monitoring Activity Interface at 11.2
Context Diagram
This view is an application of the current architecture, and its context is the same as that of the monitoring system (see 10.3.4.4). In addition, it is important to consider that a generic monitoring activity will produce either a state value or a monitoring point value. Therefore, for each TM application (aka monitored item) it must be possible to consider at least one performance value and one state value. The following diagram summarizes this.
Figure 22: For every TM Process there will be at least two monitoring activities to retrieve the state and a
measure of the performance.
Related View
Uses Module View, 10.1 Service C&C View, 10.2 Abstract Data Model, 10.3
Rationale
Health Status
The classical approach to RAM modelling has been followed, considering the fundamental difference between:
critical items, that is, those whose failure causes a failure of the overall system;
non-critical items, that is, those whose failure causes a degradation of the overall system, without blocking it;
and by also considering the fundamental concept of a specific grouping of non-critical items, which can become a critical (higher-level) group. According to this concept, a critical group is composed of the minimum number of (non-critical) items whose collective and simultaneous failure causes a failure of the whole system. For such a group, degraded behaviour is caused by the degraded performance of some or all of the components, or even by the failure of some (but not all) of them. Note, in addition, that this concept can be extended to the concept of a generalized critical group, which also includes:
1. critical groups composed of single critical items
2. critical groups composed of non-critical items which, however, will never fail together and simultaneously (so the groups are virtually critical, but never actually critical)
3. critical groups composed of non-critical items whose collective and simultaneous failure does not produce a failure of the overall system in any case. These groups can be built as essentially composed of N−1 real non-critical items, plus an N-th virtual item, properly added to the group and defined as "never failing": in this way the group will never have all items failing simultaneously, and will be equivalent to the groups mentioned in point (2) above.
It is therefore possible to define an aggregation procedure in terms of generalized critical groups only. We define aj, the aggregated health status of the j-th generalized group Gj (composed of nj lower-level items), as the weighted average of the health statuses hi,j (scalar quantities with 0 ≤ hi,j ≤ 1) of all the lower-level items Mi,j (with i ranging from 1 to nj) of which the group is composed:
(Eq. 1)   a_j = \sum_{i=1}^{n_j} w_{i,j} h_{i,j}   with   \sum_{i=1}^{n_j} w_{i,j} = 1
It is easy to prove that 0 ≤ aj ≤ 1, too. It is also quite evident that equation (1) becomes trivial when a single critical component is considered: if nj = 1, then w1,j = 1 and aj = h1,j. With these principles in mind, we can represent the correct operation of the system as a sort of process, starting at an initial time Pin and ending at a final time Pout by passing over a number of serial steps, each one representing the operation of a generalized critical group Gj, with j ranging from 1 to the number of generalized critical groups N. Critical components are reported as a series, in analogy with electrical engineering, where a circuit is open as soon as a single serial component is broken. It can be noted, incidentally, that such a scheme could also make it possible to define the performance degradation of the system, which occurs as soon as a single serial component shows a degraded behaviour (that is, performance below normal).
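As a minimal sketch, the weighted average of equation (1) could be computed as follows (the function name and the example values are illustrative only):

```python
def aggregated_health(healths, weights):
    """Aggregated health status a_j of a generalized critical group (Eq. 1).

    healths -- list of h_ij values, each in [0, 1]
    weights -- list of w_ij values, normalized so that they sum to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * h for w, h in zip(weights, healths))

# Two redundant, equally weighted items: one healthy, one degraded.
a_j = aggregated_health([1.0, 0.5], [0.5, 0.5])  # 0.75
```

For a single critical item the call reduces to the trivial case noted above: `aggregated_health([h], [1.0])` simply returns h.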
The overall aggregated health status aTOT is defined as a modified N-dimensional geometrical average of all the serial (that is, critical) health statuses aj:

(Eq. 2)   a_{TOT} = \prod_{j=1}^{N} a_j^{c/N}
This simple equation indeed meets all the fundamental requirements for aTOT:

1. 0 ≤ aj ≤ 1 ⇒ 0 ≤ aTOT ≤ 1
2. aTOT = 0 ⇒ ∃ j : aj = 0
3. aTOT = 1 (fully operational) ⇔ aj = 1 ∀ j

The use of a geometrical average, instead of a simple product, is dictated by the need to make the aggregation formula scalable as the number of components varies. Since each aj is less than 1, a simple product would produce very low values for aTOT as N becomes significant, making the overall value hard to handle. On the other hand, a generic factor c, called the severity coefficient, is introduced in the exponent; it allows the overall value to be enhanced or mitigated with respect to the input values. It is easy to check that for c > 1 or c < 1 the overall health status takes lower or higher values, respectively, than for c = 1. The value of c can be chosen according to a defined policy for the aggregated status.

Equations (1) and (2) do not, however, take into account the state si,j of the item Mi,j. The state cannot be neglected: a critical item Mi,j could be in a non-operating state (for example, standby or switched off) and therefore return aj = hi,j = 0 even if healthy, producing the false value aTOT = 0 with unwanted consequences. In such a situation, the value aj (critical) or hi,j (non-critical) should simply be excluded from the computation. The procedure to do so is, however, very different depending on whether the item Mi,j is critical or not. Since aj appears in a product, it can be excluded by setting it equal to 1. Since hi,j appears in a weighted average, it can be excluded by simply setting the corresponding weight to zero (wi,j = 0) and re-balancing the other weights so that the normalization condition is still fulfilled. Let us define a function σ of the state si,j in the following way:
σ(si,j) = 1 if si,j is an operating state
σ(si,j) = 0 if si,j is a non-operating state (for example, standby, switched off, and so on)

The expression for the health aj of a critical single item, to be inserted into definition (2), is then

(Eq. 3)   state-affected critical single-item health = aj + (1 − aj)[1 − σ(si,j)]

where it can be verified that aj assumes its normal value if σ(si,j) = 1, or the value aj = 1 if σ(si,j) = 0. The expression for the weight wi,j of a non-critical item, to be inserted into equation (1), is instead
state-affected non-critical weight = σ(si,j) wi,j

all other wk,j (k ≠ i) being changed to preserve normalization. A series of simple steps can lead to a more general expression for equation (1). First of all, let us consider the case in which a k-th item is in a non-operating state. This means forcing wk,j = 0 and therefore changing the normalization condition as

\sum_{i=1}^{n_j} w_{i,j}^{(new)} = \sum_{i \ne k} w_{i,j} = \sum_{i=1}^{n_j} w_{i,j} - w_{k,j} = 1 - w_{k,j} \ne 1
This condition can be restored, however, by re-computing the weights as follows:

w_{i,j}^{(new)} = \frac{w_{i,j}}{1 - w_{k,j}}

so that

\sum_{i=1}^{n_j} w_{i,j}^{(new)} = 1
(Note that the range of i values is unchanged, since a null value of wk,j produces a null re-computed weight and can therefore be kept inside the sum.) In general, if there are N0 items in a non-operating state, the weight distribution will be changed as

w_{i,j}^{(new)} = \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{N_0} w_{k,j}}

A general expression can then be found by noticing that the sum in the denominator can be written over the full range of values of k (1 to nj) in a different form:

w_{i,j}^{(new)} = \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{n_j} [1 - \sigma(s_{k,j})] w_{k,j}}
Therefore, by taking into account the state of each item, equation (1) can be rewritten in the more general form:

(Eq. 4)   a_j = \sum_{i=1}^{n_j} \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{n_j} [1 - \sigma(s_{k,j})] w_{k,j}} h_{i,j}   with   \sum_{i=1}^{n_j} w_{i,j} = 1
It is easy to verify that equation (4) returns to equation (1) when all σ(si,j) = 1 within the group Gj.

What happens, however, if all σ(si,j) within the group Gj are zero? In this case equation (4) would produce aj = 0 and therefore lead to a false result for equation (2). Essentially, even after excluding all non-critical Mi,j, equation (4) does not allow the whole group Gj to be excluded from the computation.
To solve this final issue, let us introduce a second function 𝜂(x) defined as follows:
η(x) = 1 if x = 1
η(x) = 0 if x > 1

Then equation (4) can finally be modified by introducing an additional term, which returns aj = 1 in case all σ(si,j) are zero, or the value of aj given by equation (4) in case at least one σ(si,j) is non-zero:
(Eq. 5)   a_j = \sum_{i=1}^{n_j} \left( \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{n_j} [1 - \sigma(s_{k,j})] w_{k,j}} \right) h_{i,j} + \eta\!\left(1 + \sum_{i=1}^{n_j} \sigma(s_{i,j})\right)   with   \sum_{i=1}^{n_j} w_{i,j} = 1
Equation (2) can therefore be written in its complete and general form to include both single critical items and critical groups as follows:
(Eq. 6)   a_{TOT} = \prod_{j=1}^{N} \left[ a_j + \eta(n_j)\,(1 - a_j)\,(1 - \sigma(s_{1,j})) \right]^{c/N}
In conclusion, an aggregated health status is completely defined by equations (5) and (6). It appears evident that the basis of the computation is the set of health statuses hi,j together with their associated weights wi,j. The definition of the values to be associated with the health statuses is clearly a matter of performance metrics. Generally speaking, the range [0, 1] can be decomposed into a number of adjacent, strictly separated intervals, each one representing a well-defined status. For example, one could imagine that

0 ≤ hi,j (or aj) ≤ 0.1 ⇒ faulty
0.1 < hi,j (or aj) < 0.8 ⇒ degraded performance
0.8 ≤ hi,j (or aj) ≤ 1 ⇒ fully operating

or make a different choice, increasing the 'resolution' as follows:

0 ≤ hi,j (or aj) ≤ 0.1 ⇒ faulty
0.1 < hi,j (or aj) ≤ 0.4 ⇒ severely degraded performance
0.4 < hi,j (or aj) ≤ 0.7 ⇒ normally degraded performance
0.7 < hi,j (or aj) ≤ 0.8 ⇒ slightly degraded performance
0.8 < hi,j (or aj) ≤ 1 ⇒ fully operating

and so on. The values for hi,j could even be discrete, the simplest assumption being 0 (faulty), 0.5 (degraded) and 1 (fully operating). These values can be inserted into equation (6) as well, producing floating values for both aj and aTOT, without invalidating these equations. A performance metric should be the result of a detailed performance analysis carried out on the final system, which in turn is closely related to the dependability analysis (FMECA, Fault Tree, and so on). It is essentially out of the scope of the system design at the Pre-Construction Phase.
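As an illustration of how equations (5) and (6) could be combined in software, the following Python sketch computes a state-affected group health and the overall aggregated value. The group contents, weights and severity coefficient are hypothetical; for simplicity, every aj (including single critical items) is computed through equation (5), which yields the same aj = 1 exclusion for non-operating items as the η-term of equation (6).

```python
def sigma(state):
    # sigma(s) = 1 for an operating item, 0 for a non-operating one
    return 1 if state == "operating" else 0

def eta(x):
    # eta(x) = 1 if x == 1, 0 if x > 1
    return 1 if x == 1 else 0

def group_health(items):
    """State-affected group health a_j (Eq. 5).

    items -- list of (h, w, state) tuples, with the weights w summing to 1.
    """
    denom = 1 - sum((1 - sigma(s)) * w for _, w, s in items)
    total = 0.0
    if denom > 0:
        total = sum(sigma(s) * w / denom * h for h, w, s in items)
    # If every item is non-operating, the eta term forces a_j = 1,
    # excluding the whole group from the computation.
    return total + eta(1 + sum(sigma(s) for _, _, s in items))

def overall_health(groups, c=1.0):
    """Overall aggregated health a_TOT as the geometrical average of Eq. (6)."""
    n = len(groups)
    a_tot = 1.0
    for items in groups:
        a_tot *= group_health(items) ** (c / n)
    return a_tot
```

For instance, a two-item group with one item in standby re-balances the remaining weight exactly as equation (5) prescribes, and a group entirely in standby contributes a neutral factor of 1 to the product.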
Dependability analysis is also the only way to compute the weights wi,j. In addition, since every group Gj is essentially identified by the corresponding set of weights wi,j, we can conclude that the dependability analysis will automatically also define the criteria for the specific grouping of non-critical items. Finally, it is certainly worth noting that the proposed approach makes the aggregation metric only weakly sensitive to the hierarchical organization of the system, since it is based on critical and non-critical items only. Internal dependencies of the system organization are reflected in the weights wi,j only.
13.3.1.5.1.1 Example
According to the components of the Telescope Manager Test Configurations highlighted in [AD8], consider the following TM applications:
1. Observation Execution Tool
2. Central Coordinator
3. EDA

Suppose that each of them is running on two redundant servers (virtual or not does not change the essence of the present example). Consider also the following parameters for the TM applications:

1. CPU usage (double-precision number from 0, no use, to 1, all CPU used),
2. Memory usage (double-precision number from 0, no use, to 1, all memory used).

And consider the following parameters for the two servers:

1. CPU usage (double-precision number from 0, no use, to 1, all CPU used),
2. Memory usage (double-precision number from 0, no use, to 1, all memory used),
3. Hard disk space usage (double-precision number from 0, no use, to 1, all disk space used).
Figure 23 shows how the mathematical representation is adapted for this specific example and indicates the monitoring activities to build, that is, the scripts (or software modules, according to 10.1) that calculate the health statuses aOET, aCC, aEDA, h1,SERVER_OET1, h2,SERVER_OET2, h1,SERVER_CC1, h2,SERVER_CC2, h1,SERVER_EDA1, h2,SERVER_EDA2.
Figure 23: Adapted from Figure 21.
There are endless possibilities to calculate those values: aOET could assume the three basic discrete values according to a look-up table based on CPU and memory usage, in the following way:

CPU ≥ 95% and Memory ≥ 95% ⇒ aOET = 0 ⇒ faulty,
55% ≤ CPU < 95% and 55% ≤ Memory < 95% ⇒ aOET = 0.5 ⇒ degraded performance,
CPU and Memory < 55% ⇒ aOET = 1 ⇒ fully operating.

Note that:
this option for the calculation was described in 13.3.1.5.1,
it is possible to create many different kinds of performance metrics, with the only constraint that the resulting health status must be a number between 0 and 1,
the servers are redundant, so the weights wi,j = 0.5 ∀ i, j.
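The look-up-table metric above could be sketched as a small script (the function name is illustrative; the behaviour for mixed cases not covered by the table, for example high CPU with low memory, is an assumption here and defaults to the degraded value):

```python
def oet_health(cpu, mem):
    """Discrete health status a_OET from CPU and memory usage (fractions 0..1).

    Thresholds follow the example look-up table: both at or above 95% is
    faulty, both below 55% is fully operating, anything else is degraded
    (the handling of mixed cases is an assumption, not part of the table).
    """
    if cpu >= 0.95 and mem >= 0.95:
        return 0.0    # faulty
    if cpu < 0.55 and mem < 0.55:
        return 1.0    # fully operating
    return 0.5        # degraded performance
```

The resulting value stays within [0, 1] and can therefore be fed directly into the aggregation of equations (5) and (6).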
Another possibility is computing a floating value, ranging between 0 and 1, for the selected component:
a_{CC} = \frac{(1 - P_{CPU}) + (1 - P_{MEM})}{2} = 1 - \frac{P_{CPU} + P_{MEM}}{2}
where PCPU and PMEM represent the percentage of CPU and memory used by that component, respectively. Once all the health statuses of Figure 23 are defined, the SSM automatically allows the Operator to check the overall health status of the considered applications. Figure 24 shows the final calculation for a degraded system and the (one-level-only) drill-down capability that shows the Operator the degraded performance of the OET.
Figure 24: Aggregated health status drill-down possibility and calculation.
It is very important to remark that the final aggregated health status cannot have labels, because this could lead to wrong conclusions. Although its absolute value can be interesting, what matters are its variations. The resulting value is indeed a performance value that a trained Operator can take as input in order to perform control actions, if required. In general, defining more aggregation levels, as shown in Figure 25, provides a full drill-down capability that allows the Operator to display the components responsible for the degraded performance.
Figure 25: Drill-down with aggregation levels.
A very important application of the present architecture is the Trend Analysis that can be performed on the system at any level. By looking at the values assumed by aTOT (and the drilled-down aj) over time, it is possible to predict faulty behaviours of subsystems, components or even the whole system, and to intervene in order to prevent them and preserve system availability. On the other hand, in case an alarm is raised at time t0, the trend of the aggregated health status up to the alarm should make it possible to find the correspondence with the specific event and to see whether there are, for instance, hidden dependencies.
TM State
While it is possible to aggregate a value for the TM Health Status (aka TM performance), the same does not appear true for the TM State. This is because the TM State defines an enumerative set of at least 4 values (start-up, shutdown, standby and operational) which cannot be ordered. Unless more complex definitions of the states are provided in the future, it appears that for any two different states sa and sb it is not possible to say that one and only one of the following relations is valid (numerical order rule): sa > sb, sa = sb, sa < sb; nor that one and only one of the following relations is valid (set inclusion order rule): sa ⊃ sb, sa ⊂ sb. As a consequence, an aggregation of states, at least in terms of algebraic rules, aimed at keeping or understanding the behaviour of the entire system, is not possible. The states can be managed with logical rules only.
Part of these rules are actually defined in the TM Requirements, while others can be derived from them. According to the requirement analysis we can indeed say that:

TM is in Operational state when all its sub-systems are ready for operational use (TM_REQ_201);
TM is in Standby state when all its sub-systems are shut down or in standby state (the standby state does not mean the power is down; the SER is still able to start the system up, TM_REQ_386);
The shutdown and start-up states are temporary states.

Currently the requirements define only four states, with clear logical aggregation rules for them. Therefore there should be no need to store this information for the TM SER. However, in case the number of states is increased in the future, it could become appropriate to store them in 10.3.
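A sketch of such logical aggregation rules follows (the state names mirror the four states above; reporting any other mixture as a transient state is an assumption of this sketch, not a requirement):

```python
def aggregate_tm_state(subsystem_states):
    """Derive the TM state from the sub-system states using logical rules only.

    Per TM_REQ_201, TM is Operational when all sub-systems are operational;
    per TM_REQ_386, TM is Standby when all sub-systems are standby or shut
    down. Any other mixture is reported here as transient (an assumption).
    """
    states = set(subsystem_states)
    if states == {"operational"}:
        return "operational"
    if states <= {"standby", "shutdown"}:
        return "standby"
    return "transient"
```

Note that, unlike the health statuses, no arithmetic is performed: the rules are purely set-membership checks.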
Virtualization View
This view describes the flow and the interaction of Services with the virtualization service provided by the LINFRA team.
Primary Presentation
This view is built on a virtualization orchestrator and a management system for the virtualization service, to ensure high-availability requirements, deduplication of effort by the different TM Services, and uniform hardware access. This allows the virtualization service to respond to the computing requirements of the different parts of the Telescope Manager, permitting full utilization of the available physical computing resources, and is therefore a fundamental module for enforcing the demanding high-availability requirements of TM services. The virtualization service also frees the other parts of the Telescope Manager from having to deal with different machines, since they will all be exposed under a unified interface by the Virtualization View, which implements this virtualization layer above the physical computing resources. Furthermore, it avoids duplication and reduces the complexity of part of the low-level requests pertaining to hardware access, since they will be handled directly by the virtualization service.

The proposed architecture accommodates several paradigms of computation and data acquisition as well as different technology stacks; still, it was built using the present working models of some existing projects and relies on off-the-shelf technology stacks. We therefore introduce a Virtualized Resource Layer, representing the Orchestrator, networking, storage and monitoring, based upon Heat [RD17], Neutron [RD29] and Cinder [RD30]; a Product Execution Layer, representing the TM App, based upon OpenStack itself; and a Physical Resource Layer, representing the Container, based upon Docker [RD31] or KVM [RD32]. There will be a hardware redundancy of the machines which, according to the best practices described in the RAM and ILS Report [AD6], should be 10%.

This redundancy makes it possible to provide an active-active failover system for all the virtualization services, enabled by the highly distributed nature of all their components and the agent-based nature of Heat. This architecture is the most adequate for a large-scale computing project: if the redundancy of the hardware available for Virtualized Resources is assured, then all TM Apps will continue working (though with more constrained computing resources) even if the computing hardware assigned to them suffers any failure.
Figure 26: Overall Presentation of Execution Environment.
Figure 26: an orchestrator manages access to monitoring, storage and networking, while being in charge of starting and allocating resources to the various TM Apps as needed and according to the priorities defined by the different TM Apps and services. The Execution platform manages the communication and hardware resources of each individual TM App and its allocated containers.
Figure 27: Template and Instance presentation.
Figure 27: The Virtualization Service will be based on a three-tier template definition. At the machine level there is the Physical Resource Template, which consists of all the hardware available to support computation in LINFRA. At the TM App level there is the Product Execution Layer, consisting of a distributed environment operating towards the provisioning of highly available products. And at the Orchestration level there is the Virtualized Resource Layer, consisting of the virtual machines, containers and other hardware that have a logical representation to the Product Execution Layer and are part of a template or available to be used by future templates.
Element Catalog
Elements
Element Description Shown
Orchestrator The Orchestrator component manages templates and instance resource allocation; it is the entry point for the Virtualization Service
Figure 26 Figure 27
Monitoring Receives the monitoring signals from the different virtualization components and logs them appropriately.
Figure 26 Figure 27
Storage Persistent storage for the different virtualization components. Figure 26 Figure 27
Network Outside networking control for all the layers (definition in 11.4) running under the orchestrator.
Figure 26 Figure 27
Virtual Machine A container or a VM. Figure 26 Figure 27
TM App A generic application running on one or more Virtual Machines. Figure 26 Figure 27
Template A TM Template (see [RD17]) composed of a set of instances (application servers, VMs or containers) with a Service Level Agreement (SLA), a user Access Control List (ACL) and a network ACL. The Virtualization Service manages the resources of the entire set so that it matches the SLA, and implements basic failover mechanisms.
Figure 26
Product Execution Platform
Platform managing the distributed execution of TM applications in virtualized environments.
Figure 26 Figure 27
Hardware Hardware computing resource. Figure 26 Figure 27
Virtualized Resource Template
Template defining virtual machines, containers and other hardware that has a logical representation to the Product Execution Layer (definition in 11.4).
Figure 27
Product Execution Template
Template defining the distributed Virtual Machine environment and its networking, storage and monitoring accesses used to deliver a TM App.
Figure 27
Product Execution Platform
Distributed environment (set of several virtual machines) operating towards the provisioning of highly available products.
Figure 26 Figure 27
Physical Resource Template
Template defining the Virtual Machine setup and its requirements that are available as physical machine resources (that is, a Dockerfile, [RD33]).
Figure 27
Interfaces
The interface is constrained to exposing template actions externally, while internally relying on a state-based architecture that controls the Virtualization Orchestrator and its managed templates and instances. Externally, the requesting services need only define the template for the virtualization service; all the internal state, and the resources allocated out of the provided resource pool, are managed by the virtualization service itself. The interface to the Virtualization Service is divided into three configuration layers. See figure 1 for an overview. The interface is described in greater detail at 11.4.
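As a rough, purely illustrative sketch (none of the field names below come from the actual interface, which is defined in 11.4), a template request could carry the elements listed in the catalogue above, namely a set of instances, an SLA, a user ACL and a network ACL:

```python
# Hypothetical shape of a TM template request; the real interface is
# defined in 11.4 and would map onto Heat templates.
tm_template = {
    "name": "tm-app-example",
    "instances": [
        {"kind": "container", "image": "tm-app", "replicas": 2},
    ],
    "sla": {"availability": "active-active"},
    "user_acl": ["tm-operators"],
    "network_acl": ["tm-internal"],
}

def validate_template(template):
    """Check that a template carries the catalogue elements (instances, SLA, ACLs)."""
    required = {"name", "instances", "sla", "user_acl", "network_acl"}
    return required <= set(template)
```

The point of the sketch is only that the requesting service describes the whole set declaratively; the allocation and the internal state remain hidden behind the Virtualization Service.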
Behaviour
Figure 28: Activity diagram for managing a template for a TM App.
Figure 28 notes:

1. The containers validate the correct startup processes, allocation of computing resources and networking.
2. The scripting defines a template; once the template finishes computing the deliverable, the orchestrator frees the computing resources.
3. The TM App is the reference for the template needed, as defined in the Abstract Data Model.
4. The TM App State is the state of the running template defined in the TM Template State table.
5. The TM Instance State uses the same states defined in the TM Template State table.
Context Diagram
The context of the present view of the TM SER can be summarized in the following diagram:
● Domain/Business Layer: functional monitoring and controlling of the business logic performed by each application,
● Services Layer: monitors and controls processes on a generic level (not functionality), such as web services, database servers and custom applications,
● Infrastructure Layer: monitors and controls virtualization, servers, OS, network and storage.
An important consideration concerns a possible failover mechanism. Failover is an automatic action to recover from a specific situation and can happen at different levels. The boundaries for the levels are:

1. if the failover is needed at the application-server level, then it is a responsibility of the Virtualization;
2. if it can be solved with a lifecycle action, then it is a responsibility of the Service;
3. otherwise it is at the level of the TANGO facility (for instance, in case of a capability transfer) and it is a responsibility of the M&C Module.
Rationale
● Ensure the availability of the entire TM (and the availability of the TM SER) by dynamic allocation of compute resources (see also [AD6]).
● By defining only a virtualization template for each service, there is no need to repeat work by managing each individual instance and resource allocation.
● Uniform hardware access by TM Services.
● The OS Services are not included in the OS distribution.
● The OS Services are not aware of the upper-layer containers.
● Every container will have its own configuration files per application.
● In figure 1, the upper layer can access the lower level and not vice versa.
● An example of Virtualization Orchestrator is Heat (see [RD17]).
Quality attribute characteristics
This design of the Virtualization Service focuses on the following quality attributes:
Maintainability: by creating a uniform layer above the hardware and by providing a descriptive way of creating the TM Application runtime environments, it becomes significantly simpler to maintain the service.
Reusability: the three-tier system for the description of the Virtualization Service allows each building block to be reused in creating more complex ones.
Availability: by defining the priority of the processes in the interface, thereby allowing the Virtualization Service to manage the allocation of resources, we are able to ensure high hardware fault tolerance and high concurrent availability of resources.
Interoperability: creating a layer above the hardware, with built-in interoperability of resource access and communication within the Virtual Machine Service and the containers, ensures the interoperability of the TM Applications.
Manageability: dividing the system into a three-tier rationale provides a balance between complexity and manageability of resources and systems.
Performance: the Virtual Machine Service's automatic allocation of resources according to their priority ensures that the available resources are maximized.
Reliability: containerizing all the components yields a fault-tolerant system capable of automatically redirecting computing resources to the available ones upon failure, which greatly increases reliability.
Scalability: the system is built to be scalable by design and from the ground up. An increase or decrease of the available computing resources is automatically managed by the Virtualization Service and immediately put to use as soon as it is added to the existing computing resource pool.
Testability: in this case it rests on two aspects: first, the in-built testing tools on the three layers provided by the health checks; second, the totally modular approach simplifies testing, since testing also becomes modular and contained.
Usability: as already stated, the three-tier rationale provides a balance between complexity and manageability; this allows the teams to focus on the building blocks of the TM Apps, providing a more intuitive approach to building and deploying the TM Applications.
For a more in depth analysis of the quality attribute characteristics of virtualization platforms, namely OpenStack see: [RD28].