Name | Designation | Affiliation | Signature

Custodian
Mauro Dolci, TM.LMC Lead, INAF
Date:

Approved by
V. Sathe, Project Manager, TCS
Date:

Released by
Ray Brederode, TM Configuration Manager, SKA SA
Date:
SKA1 TM SER SOFTWARE ARCHITECTURE DOCUMENT
TM Number ................................. T0800-0000-AR-001
SKAO Number ............................... SKA-TEL-TM-0000247
Context ................................... TM-LMC-DD
Revision .................................. 02
Primary Author ............................ Matteo Di Carlo
Date ...................................... 2018-06-29
Document Classification ................... FOR PROJECT USE ONLY
Status .................................... Approved
Document No.: T0800-0000-AR-001    Revision: 02    Date: 2018-06-29
For Project use only    Author: Matteo Di Carlo
Author(s)
Name | Designation | Affiliation
Matteo Di Carlo, TM.LMC team member, INAF
Matteo Canzari, TM.LMC team member, INAF
Mauro Dolci, TM.LMC lead, INAF
Riccardo Smareglia, INAF

Contributor(s)
Name | Designation | Affiliation
Bruno Morgado, TM.LINFRA team member, IT/ENGAGE SKA
João Paulo Barraca, TM.LINFRA team member, IT/ENGAGE SKA
D. Barbosa, TM.LINFRA Lead, IT/ENGAGE SKA
DOCUMENT HISTORY

Revision | Date of Issue | Engineering Change Number | Comments

A   2017-07-31   CDR   Initial draft for peer review.

B   2017-10-31   CDR   Second draft for peer review:
    1. Combined ‘GUI C&C View’ into ‘Service C&C View’;
    2. Moved ‘Virtualization View’ and ‘TM Health Status and State Analysis View’ into the Appendix, since they are not really architectural;
    3. Modified section ‘How stakeholder can use the documentation’;
    4. Expanded section ‘How a view is documented’;
    5. Modified section ‘System overview’;
    6. Modified section ‘Mapping between views’;
    7. Modified section ‘Rationale’ for the quality attributes subsection;
    8. Moved scenarios (section ‘Use cases’) into the Appendix;
    9. Refactored ‘Abstract Data Model View’ into four view packets;
    10. Removed section ‘Prototype’, because it is part of the TM prototyping, and added a reference to the right document;
    11. Modified ‘Virtualization interface’;
    12. 88 other minor changes.

C   2017-12-07   -   Third draft for SKAO review:
    1. Modified glossary and list of abbreviations;
    2. Added reader guides to every view;
    3. Added rationale for agent-based versus agentless systems into the ‘Service C&C View’ section;
    4. Added rationale for failover mechanism and cloud selection into the ‘Allocation View’ section;
    5. Refactored the ‘Virtualization view’;
    6. 84 other minor changes.

D   2018-01-31   -   1. Document reviewed by TechCom;
    2. Added example on ‘TM Health Status and State Analysis View’.

01  2018-02-28   -   Approved for CDR submission.

1A  2018-06-04   -   Implemented CDR observations:
    TMCDR-78: clarified distinction between deployment and configuration;
    TMCDR-80: clarified generic monitoring scope;
    TMCDR-477: clarified allocation view;
    TMCDR-570: added compliance section 13.1.

02  2018-06-29   -   Approved for CDR closure.
DOCUMENT SOFTWARE

Package         | Version                    | File Name
Word processor  | Microsoft Word 2011, 2013  | T0800-0000-AR-001-02_TM_SER_SAD.docx
Block diagrams  | Cameo Systems Modeler 18.0 | SA Teamwork project TM Library
Other           | Google Docs                | https://drive.google.com/open?id=0B31xtq-7eI7hVzI1U1JmM1V4ZnM
ORGANISATION DETAILS
Name National Centre for Radio Astrophysics
Registered Address National Centre for Radio Astrophysics
Tata Institute of Fundamental Research,
Pune University Campus,
Post Bag 3, Ganeshkhind,
Pune – 411007,
Maharashtra,
India
Phone Tel: +91 20 25719000, +91 20 25719111
Fax: +91 20 25692149
Website www.ncra.tifr.res.in
TABLE OF CONTENTS

1 LIST OF ABBREVIATIONS .......................................... 10
2 GLOSSARY ....................................................... 11
3 INTRODUCTION ................................................... 14
3.1 Scope of the document ........................................ 14
3.2 Applicable and Reference Documents ........................... 14
3.2.1 Applicable Documents ....................................... 14
3.2.2 Reference Documents ........................................ 14
4 DOCUMENTATION ROADMAP .......................................... 16
4.1 How the documentation is organized ........................... 16
4.2 View Overview ................................................ 17
4.3 How stakeholder can use the documentation .................... 17
5 HOW A VIEW IS DOCUMENTED ....................................... 19
6 SYSTEM OVERVIEW ................................................ 20
7 MAPPING BETWEEN VIEWS .......................................... 23
8 RATIONALE ...................................................... 24
9 USE CASES ...................................................... 27
9.1 Monitoring use cases ......................................... 27
9.2 Fault Management use cases ................................... 27
9.3 Life-cycle Management use cases .............................. 28
9.3.1 Entity management use case ................................. 29
9.4 Logging use cases ............................................ 29
10 VIEWS ......................................................... 31
10.1 Uses Module View ............................................ 31
10.1.1 Primary Presentation ...................................... 32
10.1.2 Element Catalog ........................................... 33
10.1.2.1 Elements ................................................ 33
10.1.2.2 Relations ............................................... 37
10.1.2.3 Behaviour ............................................... 38
10.1.3 Rationale ................................................. 38
10.1.3.1 Interfaces .............................................. 38
10.2 Services C&C View ........................................... 39
10.2.1 Primary Presentation ...................................... 39
10.2.2 Element Catalog ........................................... 41
10.2.2.1 Elements ................................................ 41
10.2.2.2 Relations ............................................... 42
10.2.2.3 Interfaces .............................................. 44
10.2.2.4 Behaviour ............................................... 45
10.2.3 Variability Mechanisms .................................... 49
10.2.3.1 Agent/Agentless solution ................................ 49
10.2.3.2 Lifecycle Manager ....................................... 50
10.2.3.3 Monitoring .............................................. 50
10.2.3.4 Logging ................................................. 51
10.2.4 Rationale ................................................. 51
10.2.4.1 Lifecycle manager ....................................... 51
10.2.4.2 Monitoring System ....................................... 52
10.2.4.3 Failover mechanism ...................................... 52
10.2.4.4 Logging ................................................. 52
10.2.4.5 Service GUI ............................................. 52
10.3 Abstract Data Model ......................................... 53
10.3.1 Description ............................................... 53
10.3.2 Overview .................................................. 53
10.3.3 Entity decomposition View Packet .......................... 53
10.3.3.1 Primary Presentation .................................... 53
10.3.3.2 Element Catalog ......................................... 54
10.3.3.3 Rationale ............................................... 56
10.3.4 Monitoring View packet .................................... 56
10.3.4.1 Primary Presentation .................................... 56
10.3.4.2 Element Catalog ......................................... 56
10.3.4.3 Behavior ................................................ 58
10.3.4.4 Context Diagram ......................................... 59
10.3.4.5 Rationale ............................................... 59
10.3.5 Lifecycle View Packet ..................................... 59
10.3.5.1 Primary Presentation .................................... 60
10.3.5.2 Element Catalog ......................................... 60
10.3.5.3 Context Diagram ......................................... 63
10.3.5.4 Rationale ............................................... 63
10.3.6 Virtualization View Packet ................................ 63
10.3.6.1 Primary Presentation .................................... 64
10.3.6.2 Element Catalog ......................................... 65
10.3.6.3 Context Diagram ......................................... 73
10.3.6.4 Rationale ............................................... 73
10.4 Allocation View ............................................. 73
10.4.1 Primary Presentation ...................................... 74
10.4.2 Element Catalog ........................................... 74
10.4.2.1 Definitions ............................................. 74
10.4.2.2 Elements ................................................ 74
10.4.2.3 Relations ............................................... 76
10.4.3 Variability Mechanisms .................................... 76
10.4.4 Rationale ................................................. 77
10.4.4.1 Tactics ................................................. 77
11 INTERFACES .................................................... 78
11.1 Lifecycle Manager - TM Generic Application - Interface ...... 78
11.1.1 Interface identity ........................................ 78
11.1.2 Resources provided ........................................ 78
11.1.3 Error handling ............................................ 79
11.1.4 Rationale and design issues ............................... 79
11.2 SSM - Monitoring Activity Interface ......................... 79
11.2.1 Interface identity ........................................ 79
11.2.2 Resources provided ........................................ 79
11.2.3 Data types and constants .................................. 80
11.2.4 Error handling ............................................ 80
11.3 TM Monitor Interface ........................................ 80
11.3.1 Interface identity ........................................ 80
11.3.2 Resources provided ........................................ 80
11.3.3 Data types and constants .................................. 81
11.3.4 Error handling ............................................ 81
11.3.5 Quality attribute characteristics ......................... 82
11.3.6 Rationale and design issues ............................... 82
11.4 Virtualization Interface .................................... 82
11.4.1 Interface definition ...................................... 82
11.4.2 Template Actions .......................................... 82
11.4.3 vResource Internal Actions ................................ 86
11.4.4 Error handling ............................................ 88
11.4.5 Rationale and design issues ............................... 88
12 PROTOTYPES .................................................... 88
13 APPENDIX ...................................................... 89
13.1 Compliance statements for TM Service Requirements ........... 89
13.2 Detailed scenarios .......................................... 89
13.2.1 Monitoring scenarios ...................................... 90
13.2.1.1 Monitoring Resources .................................... 90
13.2.1.2 Monitoring Services (on network) ........................ 90
13.2.1.3 Asynchronous Monitoring Software component .............. 91
13.2.1.4 Synchronous Monitoring Software component ............... 91
13.2.1.5 Sending alarm ........................................... 92
13.2.2 Fault management scenarios ................................ 92
13.2.2.1 Insert Recovery procedure ............................... 92
13.2.2.2 Alarm notification ...................................... 92
13.2.3 Lifecycle Management Scenarios ............................ 93
13.2.3.1 Configure/Start Application ............................. 93
13.2.3.2 Kill Application ........................................ 93
13.2.3.3 Restart Application ..................................... 94
13.2.3.4 Add Application Version ................................. 94
13.2.3.5 Remove Application Version .............................. 94
13.2.3.6 Set on line Application version ......................... 95
13.2.3.7 Set off line Application version ........................ 95
13.2.3.8 Update Application ...................................... 96
13.2.3.9 List Applications ....................................... 96
13.2.3.10 Use Application ........................................ 96
13.2.4 Logging scenarios ......................................... 97
13.2.4.1 Store Log ............................................... 97
13.2.4.2 Search Log .............................................. 97
13.2.4.3 Extract Log File ........................................ 98
13.3 Other views ................................................. 98
13.3.1 TM Health Status and State Analysis View .................. 98
13.3.1.1 Primary Presentation .................................... 99
13.3.1.2 Element Catalog ........................................ 100
13.3.1.3 Context Diagram ........................................ 101
13.3.1.4 Related View ........................................... 101
13.3.1.5 Rationale .............................................. 102
13.3.2 Virtualization View ...................................... 109
13.3.2.1 Primary Presentation ................................... 109
13.3.2.2 Context Diagram ........................................ 114
13.3.2.3 Rationale .............................................. 114
LIST OF FIGURES

Figure 1: TM SER Context. ........................................ 21
Figure 2: Monitoring context. .................................... 21
Figure 3: TM SER use cases, monitoring and logging. .............. 27
Figure 4: TM SER use cases, lifecycle control. ................... 28
Figure 5: Uses module diagram. ................................... 32
Figure 6: SSM, LM, LS and Virtualization decomposition. Notation is SysML. ... 33
Figure 7: Service C&C View. ...................................... 40
Figure 8: Lifecycle Generic Execution. ........................... 46
Figure 9: Perform monitoring activity. ........................... 47
Figure 10: Perform Fault Management. ............................. 48
Figure 11: Store log. ............................................ 49
Figure 12: Entity and version. ................................... 54
Figure 13: Monitoring data model. ................................ 56
Figure 14: Monitoring context. ................................... 59
Figure 15: Lifecycle Data Model. ................................. 60
Figure 16: Lifecycle context. .................................... 63
Figure 17: Virtualization Data Model. ............................ 64
Figure 18: Deployment diagram. ................................... 74
Figure 19: TM Template Actions depending on the State. ........... 86
Figure 20: Health status calculation. ............................ 99
Figure 21: Mathematical representation. ......................... 100
Figure 22: For every TM Process there will be at least two monitoring activities, to retrieve the state and a measure of the performance. ... 101
Figure 23: Adapted from Figure 21. .............................. 106
Figure 24: Aggregated health status drill-down possibility and calculation. ... 107
Figure 25: Drill-down with aggregation levels. .................. 108
Figure 26: Overall Presentation of Execution Environment. ....... 110
Figure 27: Template and Instance presentation. .................. 110
Figure 28: Activity diagram for managing a template for a TM App. ... 113
LIST OF TABLES

Table 1: Sections overview. ...................................... 16
Table 2: View overview. .......................................... 17
Table 3: Mapping between views. .................................. 23
Table 4: ASRs. ................................................... 24
Table 5: Physical Resource File Descriptor. ...................... 66
Table 6: Product Execution File Descriptor. ...................... 67
Table 7: vResource File Descriptor. .............................. 68
Table 8: Physical Resource File Descriptor errors. ............... 71
Table 9: Product Execution File Descriptor Errors. ............... 71
Table 10: vResource File Descriptor Errors. ...................... 72
Table 11: Available actions for the Virtualization Service. ...... 83
Table 12: Allowed Actions according to the State. ................ 85
Table 13: vResources Internal Actions. ........................... 87
Table 14: Compliance statements for TM Service Requirements. ..... 89
1 LIST OF ABBREVIATIONS
TERM DESCRIPTION
AAA Authentication, Authorization and Auditing
ACK Acknowledge
ACL Access Control List
AGN Aggregation Node
API Application Programming Interface
ASR Architecturally Significant Requirement
ASRs Architecturally Significant Requirements
C&C Component and Connector
CP Control Point
CPF Common-Point Failure
CRUD Create, Read, Update, Delete
DSH Dish/Dishes
EDA Engineering Data Archive
EMS Engineering Management System
ERR Error
FM Fault Manager
FMECA Failure Modes, Effects and Criticality Analysis
GUI Graphical User Interface
IICD Internal Interface Control Document
INAF Istituto Nazionale di Astrofisica
LHC Large Hadron Collider
LINFRA Local Infrastructure and Power
LM Lifecycle Manager
LMC Local Monitoring and Control
LMS Lifecycle Manager Software
LS Logging Service
MeerKAT Karoo Array Telescope. SKA Precursor in South Africa
MP Monitoring Point
NRCC-DRAO National Research Council of Canada – Dominion Radio Astrophysical Observatory
OBSMGT Observation Management
OSO Observatory Science Operation
QoS Quality of Service
RAMS Reliability, Availability, Maintainability and Safety
REF Refused
REST REpresentational State Transfer
SAD Software Architecture Document
SEI Software Engineering Institute
SER TM Services
SKA Square Kilometre Array
SKA-SA SKA South Africa
SLA Service Level Agreement
SSM Software System Monitor
SysML Systems Modeling Language
TBC To Be Confirmed
TBD To Be Determined
TBJ To Be Justified
TELMGT Telescope Management
TM Telescope Manager
TMC Telescope Manager Control
TMO TM Observatory
TPZ Telespazio SpA – A Finmeccanica/Thales Company
UniRM2 Università degli Studi di Roma 2 – Tor Vergata
2 GLOSSARY
TERM DESCRIPTION
Access Control List A list of permissions attached to an object such as an operating system, a service, and so on. An ACL specifies which users or system processes are granted access to objects, as well as which operations are allowed on given objects.
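As an illustration only, an ACL of this kind can be reduced to a per-object map from principals to the operations they are granted. The object and principal names below are invented for this sketch and are not part of the TM design:

```python
# Hypothetical ACL sketch: each object maps principals to the set of
# operations they are allowed to perform. All names are illustrative.
acl = {
    "tm-service.log": {
        "operator": {"read"},
        "maintainer": {"read", "write"},
    },
}

def allowed(obj, principal, operation):
    """Return True if the ACL grants `principal` the `operation` on `obj`."""
    return operation in acl.get(obj, {}).get(principal, set())

print(allowed("tm-service.log", "operator", "read"))   # True
print(allowed("tm-service.log", "operator", "write"))  # False
```

Any principal or object absent from the map is denied by default, which matches the usual ACL convention.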
Agent A software service running on the host that measures usage and sends the results to the collector.
Aggregated MP An MP composed of a set of atomic MPs and/or lower-level aggregated MPs. (For example, eth1 is a network interface of a specific host; an aggregated MP can be bound to eth1 itself, composed of outbound network traffic, inbound network traffic, a counter of the packets that have been dropped, and so on.)
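The eth1 example can be sketched as a small composite structure; the class and field names below are assumptions made for illustration, not identifiers from the TM design:

```python
# Hypothetical sketch of an aggregated monitoring point (MP) composed
# of atomic MPs, as in the eth1 example above. Names are illustrative.

class AtomicMP:
    def __init__(self, name, value):
        self.name = name
        self.value = value

class AggregatedMP:
    def __init__(self, name, children):
        self.name = name          # e.g. the network interface "eth1"
        self.children = children  # atomic and/or lower-level aggregated MPs

    def readings(self):
        """Flatten all atomic readings reachable under this aggregated MP."""
        out = {}
        for child in self.children:
            if isinstance(child, AggregatedMP):
                out.update(child.readings())
            else:
                out[child.name] = child.value
        return out

eth1 = AggregatedMP("eth1", [
    AtomicMP("inbound_traffic_bps", 1_200_000),
    AtomicMP("outbound_traffic_bps", 800_000),
    AtomicMP("dropped_packets", 3),
])
print(eth1.readings()["dropped_packets"])  # 3
```

Because an aggregated MP may contain other aggregated MPs, `readings` recurses, mirroring the "and/or lower-level aggregated MPs" clause of the definition.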
Archive Permanent storage of data, including logs, monitoring data, list of monitoring points, monitoring configuration data, metering configuration data, events, alarms, failures, predictions, and so on. Note that the term Archive, without being preceded by other terms, includes ALL gathered data.
Configuration A set of information that affects (changes) the behaviour or status of a specific target.
Distribution model A model describing the expected roles of the components and the kinds of interaction that can be performed.
Error A deviation of a system from normal operation.
Failure The inability of a system or component to perform required functions within specified performance requirements.
Fault An inherent weakness of the design or implementation of a system or component, which might result in a failure. It appears normally as a lasting error or a warning condition.
Health Status An attribute to be assigned to a component or to a (physical or logical) group of components, both hardware (that is, fan, network interface,
host...) and software (that is, a computing process, a service...), to denote their health level. The attribute may be expressed, for example, in percentage, with 100% denoting perfect behaviour and <100% a degraded behaviour triggering a fault analysis on that component.
Logging The activity of a system to record a specific message when executing an action. Logs are normally stored in a log file.
Maintainer An Operator responsible for maintaining one or more software applications.
Monitoring data A subset of the metric samples on all TM sub-elements (hardware, software and communication), gathered periodically/on event/on request and stored in the monitor store. Monitoring data may include:
Failure detections, based on a FMECA analysis
Metric samples used for failure prediction and diagnosis
Metric samples required to identify faults (fault finding)
Near Real-time
The time delay, introduced by automated data processing or network transmission, between the occurrence of an event and the use of the processed data; the term implies that there are no significant delays.
Node In a computer network, a node is an end-point identified by an IP address or a name that can receive, create, store or send data along distributed network routes.
Polling agent An agent that collects measurements by polling some API or other tool, usually at a regular interval.
Process An instance of an executable running in sequential or parallel mode.
Push agent An agent that provides measurements by pushing them to some API or other tool, usually in an event-triggered fashion. The push agent is the only solution for fetching data from TM sub-elements that do not expose the required data in a remotely usable way. It is not the preferred method, as it makes deployment more complex by adding a component to each of the nodes that need to be monitored.
Resource A physical or virtual component of limited availability within the system, such as time of execution, resident memory, CPU cycles, number of hosts, bandwidth usage, power consumption, and so on.
Restart A process acting on a system whose initial, intermediate and final statuses are ON, OFF and ON, respectively.
Sample Data sample for a particular meter.
Server It is a computer program or a device that provides functionality for other programs or devices, called ‘clients’.
Service Level Agreement
It is an agreement between a service provider and a client where some aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. SLA can have a technical definition in mean time between failures (MTBF), mean time to repair or mean time to recovery (MTTR); identifying which party is responsible for reporting faults or paying fees; responsibility for various data rates; throughput; jitter; or similar measurable details.
Shutdown A process acting on a system whose initial and final statuses are ON and OFF, respectively.
Start-up A process acting on a system whose initial and final statuses are OFF and ON, respectively.
TM Sub-System OSO, TMC or SER
View A representation of a coherent set of architectural elements (a structure) needed to reason about the system. A view comprises software elements, relations among them, and properties of both.
3 Introduction
Scope of the document
This document defines the software architecture for the Telescope Manager (TM) Services (SER) sub-element of SKA. Throughout this document, the terms ‘TM.SER’ and ‘SER’ indicate the Telescope Manager Services sub-element for SKA, and ‘SKA’ refers to the SKA telescope system. The requirements related to TM.SER are reported in the corresponding document ([AD1], see the following).
Applicable and Reference Documents
Applicable Documents
The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.
[AD1] SKA-TEL-TM-0000252, T0800-0000-RS-001, SKA1 TM SERVICE REQUIREMENT SPECIFICATION, Rev 02
[AD2] T0000-0000-MP-003, SKA1 TM Maintenance Plan, Rev 03
[AD3] SKA-TEL-SKO-0000656, SKA Control System Guidelines (CS_Guidelines) - Volume 2: SKA LOGGING GUIDELINES, Rev A
[AD4] 000-000000-010, SKA Control System Guidelines (CS_Guidelines), Rev. 01
[AD5] SKA-TEL-TM-0000263, T0000-0000-AR-026, SKA1 TM Context Document, Rev. 02
[AD6] T0000-0000-RAM-001, SKA RAM AND ILS Report, Ver 01
[AD7] SKA-TEL-TM-0000270, Telescope Manager Product Breakdowns and Dictionary, Ver 01
[AD8] SKA-TEL-TM-0000253, SKA1 TM Services Test Specification, Ver. 01
Reference Documents
The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document shall take precedence.
[RD1] L. Bass, P. Clements, R. Kazman, Software Architecture in Practice, SEI Series
[RD2] Paul Clements, Felix Bachmann, Len Bass, David Garlan, James Ivers, Reed Little, Paulo Merson, Robert Nord, Judith Stafford, Documenting Software Architectures: Views and Beyond, Second Edition, SEI
[RD3] SKA-TEL-SKO-0000661, Fundamental SKA Software and Hardware description language standards, Rev 02
[RD4] T0000-0000-AR-007, SKA-TEL-TM-0000148, SKA1 TM Prototyping Report, Rev 01
[RD5] T0700-0000-DR-001, SKA-TEL-TM-0000037, SKA1 AAA Design Report, Rev 02
[RD6] Starter Device, http://tango-controls.readthedocs.io/en/latest/tools-and-extensions/astor/introduction.html
[RD7] Astor, http://www.tango-controls.org/community/projects/astor/
[RD8] TANGO Generic Web Application, https://github.com/tango-controls/tango-webapp
[RD9] TANGO features overview, http://tango-controls.readthedocs.io/en/latest/tools-and-extensions/astor/features_overview.html
[RD10] TANGO Kernel Documentation, http://tango-controls.readthedocs.io/en/latest/contents.html
[RD11] LogViewer, http://www.tango-controls.org/community/projects/log-viewer/
[RD12] Jive, http://www.tango-controls.org/community/projects/jive/
[RD13] Nagios redundancy, https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/redundancy.html
[RD14] Nagios Hardware requirements, https://assets.nagios.com/downloads/nagiosxi/docs/Nagios-XI-Hardware-Requirements.pdf
[RD15] ELK Cluster Documentation, https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
[RD16] ELK Failover Documentation, https://www.elastic.co/guide/en/elasticsearch/guide/current/_add_failover.html
[RD17] Heat, https://docs.openstack.org/heat/latest/
[RD18] OpenStack Networking, https://docs.openstack.org/liberty/networking-guide/intro-networking.html
[RD19] LHC, https://home.cern/topics/large-hadron-collider
[RD20] Elettra, www.elettra.trieste.it
[RD21] ELK, www.elastic.co
[RD22] Apache Lucene, lucene.apache.org
[RD23] AWS CloudFormation, http://docs.amazonwebservices.com/AWSCloudFormation/latest/APIReference/Welcome.html?r=7078
[RD24] Chef automation tool, www.chef.io
[RD25] Ansible automation tool, www.ansible.com
[RD26] Puppet automation tool, puppet.com
[RD27] TANGO cookbook, https://supermarket.chef.io/cookbooks/tango
[RD28] Turowski M., Lenk A. (2015) Vertical Scaling Capability of OpenStack. In: Toumani F. et al. (eds) Service-Oriented Computing - ICSOC 2014 Workshops. Lecture Notes in Computer Science, vol 8954. Springer, Cham
[RD29] Neutron, https://wiki.openstack.org/wiki/Neutron
[RD30] Cinder, https://wiki.openstack.org/wiki/Cinder
[RD31] Docker, www.docker.com
[RD32] KVM, https://www.linux-kvm.org/page/Main_Page
[RD33] Dockerfile, https://docs.docker.com/engine/reference/builder/
4 Documentation Roadmap
This section describes how this document is organized and how a reader can find the information of interest directly, without reading it cover-to-cover.
How the documentation is organized
This document is organized into the sections highlighted in the following table. Table 1: Sections overview.
Section Overview
Document History
It shows the revisions of the present document
Documentation Roadmap - How the documentation is organized
It explains what can be found in each section of the present document
Documentation Roadmap - View Overview
It gives a short description of the views included in the present SAD
Documentation Roadmap - How stakeholders can use the documentation
It explains how a stakeholder can read the documentation to address their concerns.
How a View is documented
It explains the organization for a generic view
System overview
It gives a short description of the main functions of the system, its users and the background needed to go through the various views
Mapping between views
Shows the main entities of the system and their presence in the views
Rationale - Architecturally Significant Requirements
It explains the ASRs that drove the development of the present architecture
Rationale - Decisions It explains the main decisions that drove the development of the present architecture, considering the ASRs as well
Rationale - Products It shows the main products that will be delivered from the development of the present architecture
Rationale - Quality Attributes It shows the qualities that drove the development of the present architecture
Glossary
It defines terms used in the architecture documentation
Acronym
It defines acronyms used in the architecture documentation
View Overview
The present section gives a short description of the views included in the present SAD. Each of them has been created with a specific purpose or to answer a specific question and, in general, they represent the decisions taken to solve the problem highlighted in [AD1].
Table 2: View overview.
View Overview
Uses Module View This is a module view as per [RD2]. It is a decomposition of the system into units of implementation and shows the distinction between off-the-shelf software and built software.
Service C&C View This is a C&C view as per [RD2]. It highlights the runtime components of the system and their relations. It also highlights the decisions for having an agent-based architecture (even if an agentless architecture is still possible) and how it works.
Abstract Data Model This view is the blueprint for the implementation of the data entities and a domain analysis of the main concepts the TM Services work with. Without a data model it would not be possible to achieve qualities such as modifiability, or to understand how the virtualization service works.
Allocation View This is an allocation view as per [RD2]. It shows the mapping between the runtime components of the system and the servers needed to run them, along with decisions such as the failover mechanism (that is, an active-passive cluster).
How stakeholders can use the documentation
The main stakeholders of the TM Services are the OBSMGT and TELMGT teams responsible for the OSO and TMC software architectures. Working with them has made it possible to
Define better requirements and
Avoid misunderstandings.
Together with them, the TM architects (who are stakeholders for every view), reviewers, developers and maintainers represent the stakeholders for the present documentation. The Uses Module View decomposes the system into units of implementation, distinguishing between off-the-shelf software and built software. It is primarily for architects, reviewers and managers who want to understand how the system is decomposed, for planning purposes for example. The Service C&C View shows how the system works at runtime, showing the types of connections of the active instances. It is primarily for developers, architects and reviewers looking to answer questions such as ‘what is the dynamic behaviour of the system?’ or ‘who starts the interaction of monitoring, lifecycle or logging?’. The Abstract Data Model view (see 10.3) is the blueprint for the implementation of the data entities and a domain analysis of the main concepts the TM Services work with. It is primarily for developers and maintainers who want to understand the data model of the system, such as what its main entities are. In particular, it is shown how a monitoring activity can be seen both
as data and as runnable entities of the TM SER, demonstrating how the system can perform different functions (new activities) without changing its architecture. It is also shown how a new version of an application can be added. The possibility to add versions and, for each of them, one or more monitoring activities is the core of the modifiability quality attribute for the TM SER. The Allocation View maps the runtime processes highlighted in the Service C&C View onto running servers (which can be virtual or real). It is primarily for maintainers, architects and reviewers who want to understand how many servers the Services will need at runtime and how the scalability of the system can be increased.
5 How a View is documented
Every view is documented using the standard TOC of the Views and Beyond approach of the SEI [RD1][RD2], which comprises the following sections:
Name of view
View description
Primary presentation: This section presents the elements and the relations among them that populate this view packet, using an appropriate language, notation, or tool-based representation.
Element catalog: Whereas the primary presentation shows the important elements and relations of the view packet, this section provides additional information needed to complete the architectural picture. It consists of subsections for (respectively) elements, relations, interfaces, behaviour, and constraints.
Context diagram: This section sets the context for the system represented by this view packet. It also designates the view packet’s scope with a distinguished symbol, and shows interactions with external entities in the vocabulary of the view.
Variability mechanisms: This section describes any variabilities that are available in the portion of the system shown in the view packet, along with how and when those mechanisms may be exercised.
The Abstract Data Model View (see 10.3) is an exception, since it required a number of "view packets". Each view packet is structured following the standard TOC, as specified above.
6 System overview
The SKA project is an international effort (10 member and 10 associated countries, with the involvement of 100 companies and research institutions) to build the world’s largest radio telescope. The SKA Telescope Manager (TM) is the core package of the SKA Telescope, aimed at scheduling observations, controlling their execution, monitoring the telescope and so on. To do that, TM directly interfaces with the Local Monitoring and Control systems (LMCs) of the other SKA Elements (for example, Dishes, Correlator and so on), exchanging commands and data with them by using the TANGO controls framework. TM in turn needs to be monitored and controlled in order to ensure its continuous and proper operation; this higher responsibility has been assigned to the TM SER package.
The problem of monitoring and controlling software can be framed as an artificial-intelligence problem, namely a search in a state space characterised by:
1. an initial state;
2. a set of possible actions that transform one state into another;
3. a path from one state to another (a list of actions).
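This search formulation can be illustrated with a minimal sketch. The states and actions below are invented for illustration only; they are not taken from the actual SER design:

```python
from collections import deque

def find_path(initial, goal, actions):
    """Breadth-first search for a list of actions leading from
    `initial` to `goal`. `actions` maps (state, action) -> next state."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for (src, action), dst in actions.items():
            if src == state and dst not in visited:
                visited.add(dst)
                frontier.append((dst, path + [action]))
    return None  # goal state not reachable from the initial state

# Illustrative action set: lifecycle-like commands transforming system states
ACTIONS = {
    ("FAULTY", "reconfigure"): "OFF",
    ("OFF", "start"): "ON",
    ("ON", "stop"): "OFF",
}

print(find_path("FAULTY", "ON", ACTIONS))  # ['reconfigure', 'start']
```

Here a path is simply the list of actions that transforms the initial state into the goal state, which is exactly what the SER needs: a set of actions plus monitoring data from which the current state can be calculated.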
This description of the problem helps in understanding what is needed to realize the architecture for the TM SER: a set of actions and a set of monitoring data from which a state can be calculated. From the requirements analysis (done with the help of the entire TM team in the numerous discussions held), the main functions of the system have been extracted; they are described in the use cases section (see 9).
Therefore, the main system’s functions (see 9) can be summarized in the following list:
TM generic monitoring and fault management to detect internal failure and gather TM performance;
TM lifecycle management to manage the versions of the TM and the TM applications which includes:
o Configuration of TM software applications; o Starting, stopping and restarting of TM software applications; o Update and downgrade of TM software applications;
TM Logging, which includes the control of the destination of log messages, the transformation of the message (if required) and the query GUI;
Controlling of the virtualization system, according to the interface provided by the LINFRA team.
It is worth noting that deployment is different from configuration: conceptually, deployment comes first and configuration follows. Update, upgrade and downgrade assume that the new/old version is already on the same machine, so an update or downgrade is basically a restart with a different version. The TM Services sit between the domain logic and the infrastructure. The following diagram explains this concept with a layered structure:
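As a hedged illustration of this idea (the class and method names below are assumptions, not the actual SER interface), an update or downgrade can be modelled as a stop followed by a start of a version that deployment has already placed on the machine:

```python
class AppLifecycle:
    """Minimal lifecycle sketch: deployment puts versions on the host;
    configuration and start/stop/update act only on what is already there."""

    def __init__(self, deployed_versions):
        self.deployed = set(deployed_versions)  # versions already on the machine
        self.running = None                     # currently running version
        self.config = {}

    def configure(self, **settings):
        self.config.update(settings)            # configuration follows deployment

    def start(self, version):
        if version not in self.deployed:
            raise ValueError(f"version {version} not deployed on this host")
        self.running = version

    def stop(self):
        self.running = None

    def update(self, version):
        # update/downgrade = restart with a different, already deployed version
        self.stop()
        self.start(version)

app = AppLifecycle({"1.0", "1.1"})
app.start("1.0")
app.update("1.1")
print(app.running)  # 1.1
```

In the real system these operations would be scripts run by an IT automation tool (see 10.1); the sketch only captures the deployment/configuration/update distinction.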
Figure 1: TM SER Context.
Domain/Business Layer: functional monitoring and control of the business logic performed by each application;
Services Layer: monitors and controls processes at a generic (non-functional) level, such as web services, database servers and custom applications;
Infrastructure Layer: monitors and controls virtualisation, servers, OS, network and storage.
The proposed architecture has been driven by the study of best practices and well-known solutions to the problems highlighted by the TM and SER requirements (see [AD1]), reserving new developments only for uncovered problems. Another important function of the system is the aggregation of the TM health status and the TM State (of the various TM applications) and reporting them to the Operator. This function can be considered an application of the current architecture and is described in the TM Health Status and State Analysis View (see 13.3.1). The TM generic monitoring, realized by a Software System Monitor (SSM), comprises periodic tests or measurements of network-related data, network devices and legacy services (DB, OS-level services) that are not directly monitored by the TANGO control system [RD10]. Usually, some of the monitoring data taken from the generic monitoring (such as CPU usage, memory, processes, threads, uptime and so on) are reported to the TANGO control system for historical archiving and correlation. This is the main reason for the development of the TM Monitor device (see 10.1 and 11.3). Users read information from both systems, so that if one of them is not working they can always perform one or more recovery actions according to the information taken from the other system. This concept is the monitoring context and it is expressed in the following figure.
Figure 2: Monitoring context.
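A minimal sketch of this monitoring context can make the idea concrete. Both readers below are illustrative stand-ins (not real SSM or TANGO APIs): the operator-facing code tries each independent system in turn, so a failure of one does not prevent reading the other.

```python
def read_monitoring_point(name, readers):
    """Try each independent monitoring system in turn; the first one
    that answers wins, so no single system is a point of failure."""
    errors = []
    for system, read in readers:
        try:
            return system, read(name)
        except Exception as exc:
            errors.append((system, exc))
    raise RuntimeError(f"all monitoring systems failed: {errors}")

def ssm_read(name):
    raise ConnectionError("SSM down")    # simulate the SSM being unavailable

def tango_read(name):
    return {"cpu_usage": 12.5}[name]     # stand-in for a TANGO attribute read

system, value = read_monitoring_point(
    "cpu_usage", [("SSM", ssm_read), ("TANGO", tango_read)])
print(system, value)  # TANGO 12.5
```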
7 Mapping between views
As a general rule, the mapping between views is made by the name of the elements. The following table shows a list of the main elements of the system and an indication of where to find them within the views.
Table 3: Mapping between views.
Uses Module View Service C&C View Allocation View
LS Engine (Figure 6) LS Engine Logging Node
Logging Service (Figure 5) - Logging Service
LS Data Repository (Figure 6) LS Data Repository Logging Node
LS Forwarder (Figure 6) LS Forwarder -
Software System Monitor (SSM) (Figure 5)
- Software System Monitor Node
SSM Core (Figure 6) SSM Core Software System Monitor Node
MonData Repository (Figure 6) MonData Repository Software System Monitor Node
FM Repository (Figure 6) FM Repository Software System Monitor Node
Fault Engine (Figure 6) Fault Engine Software System Monitor Node
Notification System (Figure 6) Notification System Software System Monitor Node
SSM Agent (Figure 6) SSM Agent
Lifecycle Manager (Figure 5) - Lifecycle Manager Node
Lifecycle Manager Core (Figure 6) LM Core Lifecycle Manager Node
Lifecycle Manager Data Repository (Figure 6)
LM Data Repository Lifecycle Manager Node
Lifecycle Manager Service (Figure 6) LM Service
Virtualization - Virtualization
Virtualization Orchestrator Virtualization Orchestrator
-
Service GUI Service GUI Service GUI
- Config DB -
8 Rationale
Architecturally Significant Requirements
From the requirement analysis of the TM and of the SER-derived requirements (see [AD1]), the requirements that can be considered architecturally significant are:
Table 4: ASRs
SER_REQ_2 Detect and generate internal Alarms
The SER shall generate an Internal Alarm based on information received from the TM sub-systems (including itself) signifying that a condition related to the TM’s functioning has occurred. This requires automatic and/or operator intervention and is based on one of the following states: 1. The system has detected a failure that
requires operator as well as maintainer intervention.
2. The system has detected a condition that reduces the ability of the TM to effectively perform its mission.
3. A safety hazard (based on a hazard analysis regarding the TM operations) has been realised.
TM_REQ_7 TMO_REQ_018
Test
SER_REQ_6 Control TM sub-system life cycle
The SER shall be able to control the life cycle of any TM sub-system by being able to command the TM sub-system to do at least one of the following:
Shut down
Start up
Configure
Update/Upgrade
Downgrade
TM_REQ_192 TM_REQ_195 TM_REQ_197 TM_REQ_198 TMO_REQ_020 TM_REQ_212 TM_REQ_220 TM_REQ_367 TM_REQ_210 TMO_REQ_058
Demonstration
SER_REQ_7 Operator control of life cycles
The SER shall give an operator the ability to send respective life cycle commands to each TM sub-system by an authenticated and authorised user.
TM_REQ_181 Demonstration
SER_REQ_8a Aggregate and Report TM Health Status
The SER shall aggregate the TM internal status and report it to the Operator in a structured health view based on the TM PBS. Note: in case of TMO, the SER shall report to the Operator and to the EMS.
TM_REQ_211 Demonstration
SER_REQ_8b Manage TM State
The SER shall manage the TM state (by sending signals for state transitions), which can assume, among others, the following values: start-up, shutdown, standby and operational. The following are the possible state transitions: 1. from standby to startup 2. from startup to operational 3. from operational to shutdown 4. from shutdown to standby
TM_REQ_201 TM_REQ_202 TM_REQ_342 TM_REQ_385 TM_REQ_386 TM_REQ_387
Demonstration
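As an illustrative sketch (not the actual SER implementation), the transition list of SER_REQ_8b can be encoded as a simple guard table that rejects any signal outside the four allowed transitions:

```python
# Allowed TM state transitions as listed in SER_REQ_8b
TRANSITIONS = {
    "standby": "startup",
    "startup": "operational",
    "operational": "shutdown",
    "shutdown": "standby",
}

def signal_transition(current, requested):
    """Accept a state-transition signal only if SER_REQ_8b allows it."""
    if TRANSITIONS.get(current) != requested:
        raise ValueError(f"illegal transition {current} -> {requested}")
    return requested

state = "standby"
state = signal_transition(state, "startup")
state = signal_transition(state, "operational")
print(state)  # operational
```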
Notes and decisions
The requirements are not intended for the online or offline system but for any generic TM Sub-System, which can be any sub-element, that is, any TM Application.
According to SER_REQ_2, there shall be a generic monitoring (performed, for instance, through Nagios, a Software System Monitor (SSM)).
Every TANGO-based system needs generic-level monitoring (as every system actually does). However, TANGO does not itself provide a generic monitoring system such as, for example, server monitoring of hard disk, CPU and so on. According to the analysis performed by the LMC team (analysis of the Elettra [RD20] control system generic monitoring), it is indeed a best practice to use a software system monitor (external to TANGO) rather than developing it within TANGO.
TANGO does not provide any storage system for logging (although it is possible to have viewing features as an instant-log consumer), which therefore must be developed as a service external to TANGO.
The SKA logging guideline [AD3] suggests a service based on the latest technology as the best choice (in particular for what concerns the ability to use full-text search capability, see Elasticsearch [RD21] and Apache Lucene [RD22]). The analysis performed by the LMC team revealed that this approach has been successfully adopted in other big and very complex projects currently running, such as the LHC (see [RD19]).
SER_REQ_11 defines the generic Logging Service.
SER_REQ_6 and SER_REQ_7 define the Lifecycle Management together with update/upgrade and downgrade; even in this case the architecture will be the same both online and offline. The only variability concerns Astor (that is, the TANGO framework tools).
Every reliable system must avoid common points of failure between the monitoring system and the monitored one. For this reason, the generic monitoring architecture, and more generally the TM Services architecture, cannot depend on any other TM sub-system.
Generic monitoring, the logging service and lifecycle management are needed in all systems at every site (GHQ, ZA and AUS).
Products
The products for the TM SER are defined in [AD7].
Quality Attributes
The main quality that drove the development of the present architecture was maintainability, intended as availability (reliability and recovery), modifiability, testability and, more in general, the ability of a system to cope with changes. Concerning availability, the present architecture enables many tactics [RD1], such as:
Detect faults:
o Ping: asynchronous request/response message to determine that the monitored component is alive and responding correctly;
o Heartbeat: periodic message exchange between the SSM (see 10.1 and 10.1.3.1) and the component (or host);
o Timestamp: every message has a timestamp so the correct order of messages can be rebuilt;
o Monitoring activities: processes that monitor an entity and produce monitoring data (see 10.1, 10.1.3.1 and 10.3.4);
o Timeout: a monitoring activity should complete within a predetermined amount of time;
Recover from faults: active redundancy, software upgrade and reconfiguration;
o Retry: in case of a faulty monitoring activity, the SSM retries to execute it;
o Redundancy (for every SER sub-system): in case of a fault or failure it is possible to switch to a passive (or redundant) node automatically, using an automatic action, or manually, by the intervention of the operator. See 10.4 for further information.
o Reconfiguration: the SER, through the Lifecycle Manager (see 10.1, 10.1.3.1 and 10.3.5), can re-configure a faulty component with a versioned configuration, automatically or with an operator command.
o Software upgrade or downgrade (see 10.1, 10.1.3.1 and 10.3.5);
o Exception handling: once an exception has been detected, the system must handle it. There are several ways to handle an exception; one possibility is to include with the exception an error code that contains information helpful for fault correlation.
Prevent faults: predictive model and transactions (when accessing repositories). See 10.2.2.4.4 for further information.
To prevent faults, one possibility is the use of a Fault Manager (see 10.1 and 10.1.3.1) component (usually included in many generic monitoring systems) to perform trend analysis and failure prediction of TM components, taking as input both generic monitoring data and logging data. This tactic is called ‘Predictive model’ [RD1]: according to the health status detected by the SSM, the predictive model ensures that the system is operating within its normal operating parameters and can potentially take corrective actions. Modifiability is achieved in the following areas of the system: monitoring activities, lifecycle scripts, logging rules and fault rules. In particular, cohesion has been increased and coupling reduced, so that it is easy to add a new version of an application and new monitoring activities. At the same time, testability is achieved by limiting the complexity of the system. In fact, if there is a new test to perform, it is possible to add a monitoring point for it that can represent a state, a measure or a simple message. Once the required monitoring point is available, it is easy to generate an event to intercept the particular problem raised by the test (see 10.3 for more information).
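A few of the tactics above (timestamp, timeout, retry) can be combined in a short sketch. This is illustrative only, not the SSM implementation, and the probe function is an invented example:

```python
import time

def run_monitoring_activity(activity, timeout_s=1.0, retries=2):
    """Run a monitoring activity, discarding samples that exceed the
    deadline (timeout tactic) and re-executing on failure (retry tactic)."""
    last_error = None
    for attempt in range(retries + 1):
        start = time.monotonic()
        try:
            sample = activity()
        except Exception as exc:          # faulty monitoring activity
            last_error = exc
            continue                      # retry tactic
        if time.monotonic() - start > timeout_s:
            last_error = TimeoutError("activity exceeded its deadline")
            continue
        # timestamp tactic: every sample carries a timestamp
        return {"value": sample, "timestamp": time.time()}
    raise RuntimeError(f"activity failed after {retries + 1} attempts") from last_error

# An invented flaky probe that fails once, then succeeds
calls = {"n": 0}
def flaky_cpu_probe():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("agent unreachable")
    return 42.0

print(run_monitoring_activity(flaky_cpu_probe)["value"])  # 42.0
```

A real SSM would interrupt a hung activity rather than check the elapsed time afterwards; the sketch only shows how the tactics compose.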
9 Use Cases
From the requirement analysis, the use cases highlighted in Figure 3 and Figure 4, and described in the following sections, have been identified (for the detailed monitoring scenarios, please consult 13.2.1).
Figure 3: TM SER use cases, monitoring and logging.
Monitoring use cases
Figure 3 shows the main functions of the monitoring system, which are:
Monitoring Network and Resources: SER monitors TM resources (defined as any physical component of limited availability of a computer, such as CPU load, memory usage and so on) and any TM service (defined as an application using a protocol such as TCP, HTTP, FTP and so on).
Monitoring Software Components: SER monitors the proper functioning of TM processes. This includes monitoring of the operational state and of failures thrown by such processes, performed by an SSM agent installed on the local machine. In particular, the SSM agent provides two communication modes: an asynchronous mode (for non-critical communication such as process status) and a synchronous mode (for instance, to communicate an exception to SER).
Reporting and sending TM internal alarms: SER provides an interface to the operator and a series of alarms. The interface shows information about network, resource and service status in different views, providing the possibility to summarize information, create graphs or custom views, drill down into component information and so on. An operator can handle an alarm directly or, if one exists, can use a procedure defined in the SER fault management.
Fault Management use cases
The monitoring of TM applications can be performed at three different levels of depth:
1. generic level (RAM allocation, CPU usage…);
2. generic level + process status (as defined in 13.3.1);
3. correctness of operations (for example, coordinate conversion by a devoted application).
SER is responsible for levels one and two; TMC/OSO applications are responsible for level three. The separation between levels two and three essentially defines the monitoring boundary between SER and TMC/OSO (in fact, the correctness of operations is a duty of each TMC/OSO application). The Fault Management uses the monitoring system (that is, the Software System Monitor and the specific activities that can be done through it) to perform its duty, which is:
1. Detection, which is the ability to determine whether there is a fault in the system;
2. Isolation, which is the ability to locate the fault;
3. Recovery, which is the ability to recover from it.
A monitoring activity together with alarm filtering (usually available in any software system monitor) realizes the detection activity. The same monitoring activity together with log information realizes the isolation, while the recovery is essentially a control operation that TM.SER can perform: for instance an online action, i.e. a lifecycle command (reconfigure, restart, and so on), or an offline activity such as raising a modification request for software maintenance. It is important to consider that lifecycle management is realized with the help of an IT automation tool such as Ansible, Puppet or Chef (see 10.1); this allows not only lifecycle operations on the application but in fact any kind of script that can be executed directly on the server machine (where the TMC/OSO application runs). Therefore, a TMC/OSO developer could create a specific recovery procedure (that is, a specific script stored in the Lifecycle Manager Repository) to be executed by the Fault Management application in case of a specific alarm condition.
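The mapping between alarm conditions and recovery procedures stored in the Lifecycle Manager Repository could be sketched as follows (the rule table and the `run_lifecycle_script` callable are illustrative assumptions, not an actual Ansible/Puppet/Chef API):

```python
# Hypothetical fault-rule table: (component, alarm) -> lifecycle script kept
# in the Lifecycle Manager Repository. Script names are made-up examples.
RECOVERY_RULES = {
    ("oso-app", "process_down"):   "restart_oso.yml",
    ("oso-app", "config_corrupt"): "reconfigure_oso.yml",
}

def handle_alarm(component, alarm, run_lifecycle_script):
    """Recovery step only (detection has already happened). Returns the
    script name that was run, or None when no automatic recovery exists,
    in which case the fault must be escalated offline (e.g. by raising a
    modification request for software maintenance)."""
    script = RECOVERY_RULES.get((component, alarm))
    if script is None:
        return None                     # no rule: escalate to maintenance
    run_lifecycle_script(script)        # executed via the Lifecycle Manager
    return script
```

In practice `run_lifecycle_script` would be whatever entry point the chosen IT automation tool exposes; the point of the sketch is the separation between the rule table (data) and the engine that evaluates it.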
Life-cycle Management use cases
Figure 4: TM SER use cases, lifecycle control.
Lifecycle management is the ability to manage a software application in the following phases of its lifetime:
Configuration
Start
Stop/Kill
Update, Upgrade or Downgrade (version control)
Having a lifecycle management software (LMS) leads to a uniform way of managing all TM applications, so that TM.SER can be the main entry point for TM. The use cases in Figure 4 summarize this concept.
There are two kinds of consumers for the LMS: the administrator and the user. The user is the one who works with TM for telescope operations, while the administrator (a special user with privileges) has the responsibility for the lifecycle from the start phase (which includes the
configuration) to entity management (CRUD operations in the configuration database), with specific attention to version control of the applications. Version control is an important aspect of TM maintenance and is related to the software maintenance process (see [RD6]). This means that when an administrator adds a new application version, he should relate that version to the specific modification requests identified and resolved during maintenance. Every TMC/OSO application or component has to specify its own lifecycle, highlighting the specific interface that realizes the use cases of the above figure. To do that, it is convenient to subdivide them
into typologies. Based on [RD1], the following application typologies compose TM:
OS service, a process that starts and stops with the operating system;
Web server, an information technology that processes requests via HTTP, the basic network protocol used to distribute information on the World Wide Web;
Web application, software running in a web server;
Desktop application, software running on the client computer;
Server application, software running on a server that usually does not have the same lifetime as the operating system;
DB server, a database management system (which can be an RDBMS or a NoSQL DB technology).
In addition, consider that many TM applications work together with other services: for instance, a TANGO device needs the TANGO database service, the corresponding RDBMS and perhaps other TANGO devices to work properly. For this reason, a list of dependencies is expected for each application. The LMS also has the responsibility to ease the job of the user in all phases of the application lifetime.
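The dependency lists mentioned above imply that the LMS must start applications in dependency order. A minimal sketch (the application names and data layout are hypothetical examples, not the actual configuration schema):

```python
# Each TM application declares its typology and its dependencies, e.g. a
# TANGO device that needs the TANGO database service, which in turn needs
# the RDBMS. A depth-first topological sort yields a valid start order.
APPLICATIONS = {
    "rdbms":        {"type": "db_server",          "depends_on": []},
    "tango_db":     {"type": "os_service",         "depends_on": ["rdbms"]},
    "tango_device": {"type": "server_application", "depends_on": ["tango_db"]},
}

def start_order(apps):
    """Return application names ordered so every dependency starts first;
    raises on circular dependencies."""
    order, visiting, done = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dep in apps[name]["depends_on"]:
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    for name in apps:
        visit(name)
    return order
```

Stopping would simply use the reverse of this order.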
Entity management use case
The ‘CRUD’ acronym denotes the four standard functions of persistent storage: create, read, update and delete. CRUD is also relevant at the user-interface level of most applications. As a bare minimum, the software must allow the user to:
Create or add new entries;
Read, retrieve, search or view existing entries;
Update or edit existing entries;
Delete or deactivate existing entries.
Without at least these four operations, the software is not complete. As these operations are so fundamental, it is important to document them under one comprehensive heading: entity management. In the TM.SER context, these operations are necessary to build the abstract data model described in 10.3.
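A bare-bones sketch of the four entity-management operations (an in-memory stand-in for the relational configuration database; the class and method names are illustrative only):

```python
import itertools

class EntityStore:
    """Minimal CRUD interface; a real implementation would sit on the
    relational Config DB, this dict only illustrates the four operations."""

    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def create(self, entity):                 # Create / add a new entry
        eid = next(self._ids)
        self._rows[eid] = dict(entity)
        return eid

    def read(self, eid):                      # Read / retrieve an entry
        return self._rows.get(eid)

    def update(self, eid, **changes):         # Update / edit an entry
        self._rows[eid].update(changes)

    def delete(self, eid):                    # Delete / deactivate an entry
        self._rows.pop(eid, None)
```

In the TM.SER context the stored entities would be the configuration items of the abstract data model described in 10.3.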
Logging use cases
Logging is an important component of the development cycle and offers several advantages. It provides precise context about a run of the application; once inserted into the code, the generation of logging output requires no human intervention; moreover, log output can be saved to a persistent medium to be studied later. Specifically, the logging use cases are the following (see Figure 3):
Archive logs: a software application or component may need to archive one or more log messages;
Search for log information (by words, by datetime, extract log files): maintainers, administrators or developers can search for one or more log messages by querying the system.
For more information about logging for SKA, please consult [AD3].
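The ‘search for log information’ use case could be sketched as building an Elasticsearch-style query, assuming an ELK-like repository as discussed in 10.1 (the field names `message` and `@timestamp` are common Logstash defaults, used here as assumptions rather than a fixed SKA schema):

```python
def build_log_query(words, since, until):
    """Build an Elasticsearch-style bool query matching `words` in the log
    message within the [since, until] datetime range (ISO 8601 strings).
    Submitting the query to the repository is left out of this sketch."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"message": " ".join(words)}},
                    {"range": {"@timestamp": {"gte": since, "lte": until}}},
                ]
            }
        }
    }

# Example: all messages mentioning 'timeout' during June 2018
q = build_log_query(["timeout"],
                    "2018-06-01T00:00:00Z", "2018-06-29T23:59:59Z")
```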
10 Views
Uses Module View
This is a module view as per [RD1]: a decomposition of the system into units of implementation, showing the distinction between off-the-shelf software and built software together with the functional responsibility assigned to each of them. It shows the dependencies among the modules and the modifiability related to changing their responsibilities.
Figure 5 shows the four main off-the-shelf modules: SSM, LM, LS and the TANGO framework. The choice of off-the-shelf software shapes the work to be done (for instance, choosing Nagios Core or Zabbix will influence the Monitoring Activities to write, such as the language or API used). Fault rules use both the SSM and the LM to perform automatic actions (if required), while the Lifecycle scripts are executed by the engine provided by the LM. Every module stores logging data with the help of the logging forwarding rules through the LS engine. The TM Monitor is the link between the TMC domain and the SER domain; it uses both the SSM (to read monitoring data) and the TANGO framework (to send monitoring data). The Service GUI allows the user to work with all the SER software, while the Virtualization Service allows the building of virtual platforms where all the software will run. Figure 6 shows the usual decomposition of the four main off-the-shelf modules, taken from some tested architectures: Nagios Core, Chef, ELK and OpenStack.
Primary Presentation
Figure 5: Uses module diagram.
Figure 6: SSM, LM, LS and Virtualization decomposition; notation is SysML.
Element Catalog
Elements
Element Description
SSM A software system monitor (SSM), for example Nagios Core, SolarWinds and so on, is a software component used to monitor resources and performance in a computer system. It is usually composed of a server and one or more agents distributed across the computer network that allow the execution of monitoring activities.
Off-the-Shelf
Figure 5, Figure 6
SSM Server It is composed of the SSM Core and the Mon Data Repository
Off-the-Shelf
Figure 6
SSM Core Collects data from every SSM Agent in the network for hardware, software and network monitoring; stores the collected data into the repository; schedules the monitoring activities
Off-the-Shelf
Figure 6
SSM Notification System
Software module that provides the functionality of delivering a message to one or more destinations
Off-the-Shelf
Figure 6
Mon Data Repository
Maintains the collected monitoring data in the repository
Figure 6
SSM Agent
Software daemon that manages the activities gathering monitoring data (aka system metrics) from the TM Application to monitor/control, groups them and sends them back to the server. The communication can be started from the server or directly from the application through the SSM Agent. The agent has a set of scripts (called monitoring activities) to perform the above operations. The advantage of having an agent is that, instead of calling every monitoring activity, the server calls only the agent, which does the job.
Off-the-Shelf
Figure 6
Fault Manager
Part of the SSM devoted to detecting, diagnosing and fixing faults, finally returning the system to normal operations.
Off-the-Shelf
Figure 6
Fault Engine Evaluates the rules and performs actions according to the mapping between rules and actions stored in the FM repository
Off-the-Shelf
Figure 6
FM repository
It contains failure definitions, rule definitions, fault definitions and the mapping with actions, if required
Off-the-Shelf
Figure 6
Monitoring Activities
A monitoring activity (for example, a Nagios Core check) is a software module that produces monitoring data. Each SSM Agent has more than one activity, so that a list of monitoring points is built for every application. See 10.3 for further details.
Built Figure 5
Lifecycle Manager
The Lifecycle Manager (LM) is an IT automation tool (for example, Chef, Puppet, Ansible and so on) that makes it possible to control a software application in every phase of its lifetime. It is usually composed of a server and many agents distributed across the computer network that allow the execution of lifecycle scripts. See 10.1.3.1 for further details.
Off-the-Shelf
Figure 5, Figure 6
Lifecycle Manager Engine
Server side of the Lifecycle Manager Off-the-Shelf
Figure 6
Lifecycle Manager Service
An agent that applies the configuration Off-the-Shelf
Figure 6
Lifecycle Manager Core
Part of the engine: an OS service that allows external components to interact with the Engine and supports requests for virtualization
Off-the-Shelf
Figure 6
Lifecycle Manager Data Repository
Part of the engine: repository of configuration items (for example, in Chef they are Ruby scripts)
Off-the-Shelf
Figure 6
Lifecycle scripts
Executable software scripts that allow starting, stopping and upgrading or downgrading a TM application. See 10.3 for further details.
Built Figure 5
Logging Service (LS)
The TM Logging Service is usually composed of three software entities: the forwarder, the repository (data center) and the query GUI (for example, the ELK1 stack). The repository (data center) is a cluster of databases (usually NoSQL), and potentially every TANGO Facility (element) may have a specific database cluster to collect log messages and increase query performance (the first choice for the data center is Elasticsearch). The log forwarder is an OS service that makes it possible to forward messages to a repository. The query GUI is designed for analytics/business-intelligence needs: to quickly investigate, analyse, visualize and ask ad-hoc questions on large amounts of data (millions or billions of records).
Off-the-Shelf
Figure 5, Figure 6
LS Server Composed of a data repository and an engine
Figure 6
LS Data Repository
NoSQL database (for example, Elasticsearch) to organize the information
Off-the-Shelf
Figure 6
LS Engine Entity responsible for collecting and organizing the log messages from every application
Off-the-Shelf
Figure 6
LS Forwarder Entity responsible for transferring and transforming the information to the logging server based on logging forwarding rules
Off-the-Shelf
Figure 6
Logging forwarding rules
A forwarding rule is basically a configuration item; it allows forwarding a message to a particular repository, usually with a transformation into a structured form (for example, in ELK, a JSON document). Log data are usually seen as log files and, even if they are initially just files, many modern logging systems (for example, ELK) transform those data into structured records. See [RD21].
Built Figure 5
TM Monitor The TM Monitor device is a TANGO Device that integrates into the TANGO facility the monitoring data from the SSM
Built Figure 5
1 https://www.elastic.co/webinars/introduction-elk-stack
(or from the local server, if required). It is a bridge between generic monitoring and telescope monitoring, so that the TMC can correlate some generic monitoring data, if required.
Service GUI The Service GUI is a graphical user interface that interacts with every SER component to access the available functionality. In particular it allows:
● checking the monitoring points (that is, the TM health status), alarms and any kind of event;
● querying the logging repository;
● performing control actions, that is, using the lifecycle manager.
Built Figure 5
Fault Rules A fault rule is a relation between a fault and a specific action that can be performed by the lifecycle manager. The rule takes information from the SSM and from the Logging Service and sends an action command through the Lifecycle Manager to recover from a fault (for example, if an alarm is raised, a mail has to be sent to a particular Operator)
Built Figure 5
Virtualization Service
Software made for creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices and computer network resources. It also includes the optimization and proper usage of the off-the-shelf vProviders available.
Built Figure 5, Figure 6
vProvider Generalization of the providers (vStorage, vNetwork, vMonitoring and vConf) managed by the Orchestrator
Off-the-Shelf
Figure 5, Figure 6
Virtualization Generic term indicating a virtualization service (for example, OpenStack, VMware and so on)
Off-the-Shelf
Figure 6
Orchestrator It provides a template-based way to describe a cloud application, then coordinates the OpenStack API calls needed to run it
Off-the-Shelf
Figure 6
vStorage It provides persistent storage to the managed virtual machines
Off-the-Shelf
Figure 6
vNetwork It provides an API that allows users to set up and define network connectivity and addressing in the cloud (that is, Openstack networking [RD18])
Off-the-Shelf
Figure 6
vMonitoring It provides monitoring functionality for the virtualization layer, reporting alarms, alerts and all information needed by the upper-level monitoring (SSM)
Off-the-Shelf
Figure 6
vConf It stores all the configuration information Off-the-Shelf
Figure 6
Relations
Part A Part B Description
Fault Rules SSM A fault rule takes information from the SSM to evaluate it
Fault Rules LM A fault rule takes information from the LM to evaluate it
Fault Rules Logging Service
A fault rule takes information from the Logging Service to evaluate it
Fault Rules Logging forwarding rules
A fault rule sends log information to the Logging Service through a Logging forwarding rule.
SSM Logging forwarding rules
The SSM sends log information to the Logging Service through a Logging forwarding rule.
SSM Monitoring Activities
The SSM performs its activity through a set of monitoring activities (see 10.3).
LM Logging forwarding rules
The LM sends log information to the Logging Service through a Logging forwarding rule.
LM Lifecycle Scripts
The LM performs its activity through a set of lifecycle scripts (see 10.3).
Lifecycle Scripts Logging forwarding rules
A Lifecycle script sends log information to the Logging Service through a Logging forwarding rule.
Monitoring Activities Logging forwarding rules
A Monitoring Activity sends log information to the Logging Service through a Logging forwarding rule.
Service GUI TANGO framework
The Service GUI interacts with the TANGO framework to send lifecycle action to the TM devices (if needed and commanded by the Operator)
Service GUI SSM The Service GUI interacts with the SSM to gather monitoring information
Service GUI LM The Service GUI interacts with the LM to send lifecycle
action to the TM application (if needed and commanded by the Operator).
Service GUI Logging Service
The Service GUI interacts with the LS to read log information from it (if needed and commanded by the Operator).
TM Monitor TANGO framework
The TM Monitor is a TANGO device (see [RD10])
TM Monitor SSM The TM Monitor reports monitoring information from the SSM to the TANGO facility
LM Virtualization Service
To perform lifecycle operations
Virtualization Service vProvider To realize the virtualization
Virtualization Service Logging forwarding rules
Virtualization Service sends log information to the Logging service through a Logging forwarding rule
vProvider Logging forwarding rules
vProvider sends log information to the Logging service through a Logging forwarding rule
Behaviour
For the dynamic behaviour of the entities depicted in the present view, please refer to 10.1.3.1.
Rationale
● Cost saving is the main reason to use the Off-the-Shelf software
● Monitoring activities, fault rules, lifecycle scripts and logging forwarding rules depend on the choice of the off-the-shelf software.

● Monitoring activities, fault rules, lifecycle scripts and logging forwarding rules are separated from the execution engine (SSM versus monitoring activities, SSM versus fault rules, Lifecycle Manager versus lifecycle scripts, Logging Service versus logging forwarding rules) to increase the modifiability of the system; the choice of the off-the-shelf software will not influence the work to be done (for example, agent-based versus agentless) but rather the details of how it will be done.
● The rationale for having the TM Monitor component is to report every monitoring point to the TMC to give the Operator a clear picture of the functional and non-functional monitoring points of the system.
Interfaces
Most of the interfaces highlighted in the primary presentation depend on the technology chosen, which
already provides an interface to interact with it. Therefore, if not stated in the following table, the interface is ‘ready to use’.
Interface Document link
SSM-Monitoring Activities SSM - Monitoring Activity Interface, 11.2
Service GUI-TANGO framework The Service GUI, when it needs to talk with the TANGO framework, will use the Starter (see [RD6])
TM Monitor-SSM TM Monitor Interface, 11.3
Virtualization interface Virtualization Interface , 11.4
Services C&C View
This view highlights the runtime components of the system and their relations.
Primary Presentation
The primary presentation shows the TM Services runtime components and their relations. The diagram is divided into three parts: Client, Server and Virtualization. The Client part represents the TM Services runtime components installed on a generic TM host, which is composed of a local data store and the various agents. The first (for instance the file system) is where any TM application can store its local data, such as log files or configuration files, while the agents are: the Logging Service (LS) Forwarder, the Software System Monitor (SSM) Agent and the Lifecycle Manager (LM) Service. Each of them sends data to and receives data from the respective server. Note that the same agents are also installed in the virtualization representation. The Server part shows the server components that are in relation with the various agents: the LS Forwarder sends data to the LS Engine, the SSM Agent to the SSM Core and the LM Service to the LM Core. The SSM Core has the responsibility to gather the monitoring data from the hosts, via the agents; these data are then used by the Fault Engine to evaluate fault rules. The visualization of data and the configuration of the server components is done by the Service Graphical User Interface (GUI), which allows the operator to access the information, based on the authorizations provided by the AAA.
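The transformation an LS Forwarder might apply before shipping a raw log line to the LS Engine can be sketched as follows (Logstash-style; the line format and field names are illustrative assumptions, not a fixed SKA schema):

```python
import json
import re

# Assumed plain-text line format: "<timestamp> <level> <source> <message>".
LINE_RE = re.compile(
    r"(?P<ts>\S+)\s+(?P<level>DEBUG|INFO|WARNING|ERROR)\s+"
    r"(?P<source>\S+)\s+(?P<message>.*)"
)

def to_document(raw_line, host):
    """Parse one plain-text log line into a JSON document ready for the
    repository; unparseable lines are kept whole so no information is lost."""
    doc = {"host": host}
    m = LINE_RE.match(raw_line)
    if m:
        doc.update(m.groupdict())
    else:
        doc["message"] = raw_line
    return json.dumps(doc)

doc = to_document("2018-06-29T10:00:00Z ERROR tm.oso Timeout on request", "vm-12")
```

A real forwarder (e.g. Logstash or rsyslog) applies such a transformation according to the configured logging forwarding rules; this sketch only shows the parse-and-structure step.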
Figure 7: Service C&C View
Element Catalog
Elements
Element Description
SSM Core Collects data from every SSM Agent in the network for hardware, software and network monitoring; receives asynchronous events from the Monitoring Agent; stores the collected data into the repository; provides the data to be visualized by the operator through the Service UI
SSM Agent Software daemon that manages the activities gathering monitoring data (aka system metrics) from the TM Application to monitor/control, groups them and sends them back to the server. The communication can be started from the server (C/S style) or directly from the application through the SSM Agent (pub/sub style). The agent has a set of scripts (called monitoring activities) to perform the above operations. The advantage of having an agent is that, instead of calling every monitoring activity, the server calls only the agent, which does the job.
MonData Repository Repository where the monitoring data are stored.
FM Repository The repository of the alarm/fault rules and actions
Fault Engine The rule engine that defines the mapping between monitoring data and alarms. It also provides an engine to perform fault prediction using monitoring and log data.
Notification System Software application that sends notifications to the Service GUI or to the operator via email or SMS, according to the fault rules
Service GUI Generic UI that allows an Operator to visualize all monitoring data, alarms, logs and so on. It also permits actions like configuring fault rules and taking lifecycle actions with the configuration already done
TM Application to Monitor/Control
A generic TM application, TANGO or non-TANGO based
LM Core Server side of the Lifecycle Manager. Allows sending a specific control action to a TM Application.
LM Data Repository Part of the engine: repository of configuration items (for example in chef they are Ruby scripts)
LM Service An agent that applies the configuration
LS Engine Entity responsible for collecting and organizing the log messages from every application; it furthermore translates queries to retrieve the data from the LS data repository.
LS data repository NoSql database (that is, Elasticsearch) to organize the information
LS Forwarder Entity responsible for the transfer and the transformation of the information to the logging server based on logging forwarding rules (see 10.1)
Local Data Store It can be based on simple files (or local syslog udp server based on logstash or rsyslog for instance): more information can be found on the SKA Logging Guideline (see [AD3]).
Config DB A relational database system that stores all the information that cannot be stored in other sub-systems. It is important to notice that this component can be a schema in another database sub-system (for instance the one in the online or offline system), so that it is possible to save licence costs (if any). The information to store is highlighted in 10.3.
AAA AAA exposes an API to provide authentication and authorization to the operator
Virtualization Orchestrator
The Virtualization Orchestrator component manages templates and instance resource allocation; it is the entry point for the Virtualization Service. See 13.3.2.
Relations
Part A Part B Description Multiplicity
SSM Agent TM application to monitor/control
The SSM agent gathers the monitoring data of the TM applications as per the pre-configured monitoring activities.
Multiple
LM Service TM application to monitor/control
Once the LM Service has retrieved the host configuration from the repository, it applies the configuration to the host where the application has to be deployed.
Multiple
TM application to monitor/control
LS Forwarder The TM application sends logging data to the LS Forwarder through this connection
Multiple
LS Forwarder LS Engine The LS forwarder sends logging data to the LS Engine through this connection, according to the configuration
Multiple
LS Engine LS Data Repository LS Engine stores and retrieves the logging data in the LS Data Repository
Single
Fault Engine FM Repository This corresponds to the fault management executor that receives a script from the Fault Management repository and executes it
Single
Fault Engine MonData Repository
This corresponds to the fault management client that retrieves the
Single
monitoring data from the software system monitor needed to perform the fault management analysis.
Fault Engine LS Engine This corresponds to the fault engine that retrieves the log data from the logging service data repository needed to perform the trend analysis.
Single
Fault Engine Notification System According to the rules, the Fault Engine sends a message to the Notification System, which notifies the operator (email or SMS) in case of failure or fault
Single
Fault Engine LM Core According to the rules, the Fault Engine can send a command to the LM Core, to perform a Lifecycle action in case of failure or fault
Single
SSM Core SSM Agent The SSM Core requests and downloads the monitoring data collected by the SSM Agent
Multiple
SSM Agent SSM Core SSM Agent sends an asynchronous message to the SSM Core
Multiple
LM Service LM Core The connection allows any client to retrieve the corresponding configuration item from the LM Core. If necessary, the LM Core can directly talk to the LM Service.
Multiple
LM Core LM Data Repository The LM Core retrieves configuration from the LM Data Repository
Single
Notification System
Service GUI The notification system sends an alert to the operator through the GUI, in case of fault or failure
Single
Service GUI SSM Core The Service GUI allows an Operator to visualize all monitoring data, notifications and/or specific groupings of data to check the status of TM at a generic level.
Single
Service GUI LM Core Specialized UI application for lifecycle actions with the configuration already done. The Service GUI interacts with the engine to manage
Single
(create/update/delete) the configuration items for the TM subsystems.
Service GUI LS Engine The Service GUI allows users to create queries to retrieve the logging data from the LS Engine
Single
Service GUI AAA To access the functions of the Lifecycle Manager (which allows configuring all the applications, including the service ones), the Operator must be authenticated and authorized. Through this relation the Service GUI allows the User to obtain the authentication token together with his groups (for authorization). Please consult [RD5] for further information.
Single
Service GUI Config DB The Service GUI retrieves its configuration (for instance subsystems information) from the local configuration DB.
Single
Service GUI Virtualization The Service GUI enables the Operator to create/modify/delete (and so on) a virtual host in the system, if required.
Single
Virtualization Orchestrator
SSM Core The Virtualization environment can access the monitoring data from the SSM Core
Multiple
SSM Core Virtualization Environment
The Software system monitor receives monitoring data from the Virtualization Environment.
Multiple
Virtualization Orchestrator
LS Engine The Virtualization Orchestrator sends logging data to the LS Engine through this connection, according to the logging forwarding rules
Multiple
Virtualization Orchestrator
LM Core The Virtualization Orchestrator retrieves its configuration from the LM Core
Multiple
Interfaces
Most of the interfaces highlighted in the primary presentation depend on the technology chosen, which already provides an interface to interact with it. Therefore, if not stated in the following table, it means
that it is a ‘ready to use’ interface.
Interface Document link
SSM-Monitoring Activities SSM - Monitoring Activity Interface, 11.2
Service GUI-TANGO framework The Service GUI, when needed to talk with the TANGO framework will use the Starter (see [RD6])
TM Monitor-SSM TM Monitor Interface, 11.3
Virtualization interface Virtualization Interface, 11.4
Behaviour
Interaction with AAA
Service GUI interacts with AAA to retrieve user information and user role. The interaction is described in the following pseudo-code.

AAA.Authenticate(Username, Password);
IF (Authentication.result == Successful)
{
    GUI.Store(Authentication.id, Authentication.name,
              Authentication.surname, Authentication.email);
    AAA.RequestListGroup(Authentication.id);
    GUI.Store(ListGroupResult[]);
}
ELSE
{
    THROW(AuthenticationFailedException);
}
Lifecycle Manager
10.2.2.4.2.1 Lifecycle Generic Execution Activity
Figure 8: Lifecycle Generic Execution.
Figure 8 shows the generic execution activity that the Lifecycle Manager implements for the specific needs of TM. The activity starts when the Lifecycle Manager Engine receives a lifecycle command (start, stop, restart and so on) from the user through the Service GUI. Inside the repository of the Lifecycle Manager Engine there is a list of (versioned) applications connected to a list of hosts. Based on this relationship, when the user selects an application to command, the repository knows where to apply the correct configuration, and it informs the service local to the virtual server. The lifecycle service has an internal clock and on every tick it applies the right configuration for its host; it is also possible to force the application of the configuration. The ‘Apply configuration item’ step does exactly this job, interacting with the TM application (in the figure, an OSO application) and asking support from the Virtualization Service (if needed). It is very important to highlight that the use of the timer to start the function in the Lifecycle Manager Service is possible because those functions do not change the system status if they are executed more than once (idempotency). After the configuration is over, the Lifecycle Manager Service checks that the lifecycle command has executed correctly and reports the result to the Operator through the Service GUI.
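The idempotency argument can be made concrete with a sketch of the ‘Apply configuration item’ step: each action first checks the current state, so running it on every clock tick leaves an already converged host unchanged (the host API below is hypothetical, not an actual Chef/Ansible interface):

```python
def apply_configuration(host, desired):
    """Converge `host` toward the `desired` {service: version} mapping.
    Safe to call on every clock tick: actions are taken only on drift,
    so repeated execution does not change the system state."""
    changed = []
    for service, version in desired.items():
        if host.installed_version(service) != version:
            host.install(service, version)    # act only when versions differ
            changed.append(service)
        if not host.is_running(service):
            host.start(service)               # act only when not running
            changed.append(service)
    return changed                            # empty list: nothing to do
```

On the first tick after a new configuration is pushed the function installs and starts the drifted services; on every subsequent tick it returns an empty list, which is exactly the property that makes the timer-driven design safe.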
Monitoring
Figure 9: Perform monitoring activity.
Figure 9 shows the SysML diagram for a generic execution of the monitoring activity. The activity starts when the list of all current Monitoring Points (MPs) is requested from the SSM Server (the list is updated during the 'Configure monitoring activity'). This list is stored in a local buffer and provided to the Software System Monitor Scheduler (SSM Scheduler) at predefined time periods (t_1 and t_2). The SSM Scheduler performs two activities: it tests the communication with the node and retrieves its monitoring data. During the communication test, the SSM Scheduler sends a keep-alive packet to the node (i.e. a Virtual Machine); if a response arrives within the timeout, the test is successful, otherwise it is an error. All the monitoring activities (hardware, software or network related) are executed locally every t_3 seconds; the monitoring data are initially stored in a local buffer and, as soon as possible, communicated to the SSM Server through the SSM Agent. The SSM Scheduler, with period t_2, downloads the aggregated data collected by the monitoring activities using the SSM Agent and empties the local buffer. It also uses standard network protocols like SNMP (Simple Network Management Protocol) to test the connectivity and the services of the TM node, as defined in the MP configuration. A TM application can also send an asynchronous message to the SSM Server. The aggregated data coming from the communication test and from the SSM Agent are matched against the rules defined in the configuration and, in case of a mismatch, an alarm is raised.
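The keep-alive test can be sketched as follows. This is a hedged illustration: `send_probe` is a hypothetical callable standing in for the real keep-alive packet exchange.

```python
import time

# Illustrative keep-alive check: "send_probe" is a hypothetical callable
# that blocks until the node answers (or raises OSError on a network error).
def keep_alive_ok(send_probe, timeout_s):
    """Return True only if the node replies within timeout_s seconds."""
    start = time.monotonic()
    try:
        send_probe()
    except OSError:
        return False  # a network error counts as a failed test
    elapsed = time.monotonic() - start
    return elapsed < timeout_s  # success only if the reply beat the timeout
```

A real scheduler would run this per node at period t_1 and feed the failures into the alarm-matching step.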
Fault Manager
Figure 10: Perform Fault Management.
Figure 10 shows the SysML diagram for a generic execution of the fault management activity. The Fault Manager (FM) analyses monitoring data coming from the SSM for the detection phase, and uses the log system to support the analysis. It also depends on the Lifecycle Manager to execute commands in the isolation and recovery phases. When the FM Engine starts, every t_1 seconds it retrieves the monitoring data from the SSM and stores them in a local buffer. To analyse the data, the FM Engine uses the engine rules stored in the FM Repository. These rules are defined by the operator or by developers, usually from a dependability analysis. The FM Engine compares (detection phase) the monitoring data with the engine rules and, in case of a fault or failure, starts the isolation phase. During the isolation phase, the FM Engine can use log data (retrieved from the Logging Server) to support the analysis. After this step, the FM Engine sends the isolation command to the Lifecycle Manager. The recovery phase follows the isolation phase and consists of retrieving the recovery procedure from the FM Repository and sending it to the Lifecycle Manager, to restore the normal behaviour of the component. Another important feature of the Fault Manager is trend analysis and failure prediction: it analyses the monitoring and logging data to discover possible future failures or trends, using for instance machine-learning algorithms or Markov chains. If it detects a possible fault or failure, it sends the data to the FM Engine, which starts the detection phase.
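The detection phase amounts to matching samples against rules. A minimal sketch, assuming rules are stored as predicates keyed by monitoring point (the real FM Repository format may differ):

```python
# Hypothetical sketch of the detection phase: each rule is a
# (monitoring point, predicate) pair; a predicate returns True when the
# sampled value is within its normal range.
def detect_faults(samples, rules):
    """Return the monitoring points whose latest sample violates a rule."""
    faults = []
    for point, predicate in rules.items():
        value = samples.get(point)
        if value is not None and not predicate(value):
            faults.append(point)  # these would be handed to isolation
    return faults
```

The returned list is what the FM Engine would pass on to the isolation phase.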
Logging
Figure 11: Store log.
Figure 11 shows the main function of the Logging Service, which is collecting log data. The most important aspect of the activity diagram is that there is no direct link between the TM application and the logging forwarder, which helps the general performance of the network. In fact, having log data immediately in the central server is not important (logs tell the story of an application; they cannot be used to actively monitor an application for bugs) compared to the importance of network performance. Therefore, the forwarder will be configured to avoid any network flooding and, in particular, to avoid sending large quantities of data when the network is very busy.
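The decoupling and throttling described above can be sketched as a buffered forwarder: applications only append to a local buffer, and shipping happens later in bounded batches that are deferred while the network is busy. Class and parameter names are illustrative assumptions.

```python
# Sketch of a throttled log forwarder (illustrative; not a real TM API).
# Applications never talk to the network directly: they append to a local
# buffer, and flush() ships bounded batches only when the network is free.
class ThrottledForwarder:
    def __init__(self, send_batch, max_batch=100):
        self._send_batch = send_batch   # callable shipping records upstream
        self._max_batch = max_batch     # upper bound per flush, anti-flooding
        self._buffer = []

    def collect(self, record):
        self._buffer.append(record)     # no direct app-to-network link

    def flush(self, network_busy=False):
        """Ship at most max_batch records; defer entirely if the net is busy."""
        if network_busy or not self._buffer:
            return 0
        batch = self._buffer[:self._max_batch]
        self._buffer = self._buffer[self._max_batch:]
        self._send_batch(batch)
        return len(batch)
```

A periodic flush with a busy-network check realizes the "avoid flooding" policy without losing any record.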
Variability Mechanisms
Agent/Agentless solution
In the architecture shown above, an agent-based solution is preferred over an agentless one. Even if an agentless architecture can appear simpler, it introduces some limits: without an agent installed on the client it is possible to monitor network devices (for instance using protocols like SNMP), but it is not easy to monitor other features such as CPU load. On the other hand, the architecture presented requires the extra work of installing and configuring the agent. Another important point is that, without an agent, some operations require a secure connection, which adds complexity to the system, introducing a potential security issue and network overhead. For example, a remote data collector must be allowed to communicate with the target system on different ports and may also need to be installed with domain administrator privileges to access the remote systems. Furthermore, agentless communication introduces additional network traffic, as the raw performance data is transported to a remote data collector. With an agent, instead, data is collected locally and only the processed results are transported to the server.
Lifecycle Manager
● In case of a TANGO based engine, it is worthwhile to configure the TANGO devices with an IT Automation tool (that is, Chef, Ansible and so on) together with the Starter Device Server (see [RD6]), so that it is possible to use Astor (see [RD7]).
● It is possible to have an agentless Lifecycle Manager, such as Ansible. The behaviour (described in 10.2.2.4.2) is the same, except that the communication starts from the server.
Astor
As the TANGO framework is an open source project, it provides many tools that can be used as a starting point for the development of the Service GUI. Among them there is an important tool called Astor (see [RD7]), which could be considered the Lifecycle Manager for a TANGO based control system, with some limitations. The architecture is based on a device called Starter that is able to control any device server in a remote host. The main limitations are related to:
● configuring and starting the Starter itself (the device must be called with a specific name)
● starting a non C++ device server
● upgrading/downgrading and in general, managing versions of devices
According to the documentation, Astor acts as a client of the Starter device deployed in each host and allows the user to:
● display the control system status and component status using coloured icons
● execute actions on components (any command defined within the device)
● execute diagnostics on components
● execute global analysis on a large number of hosts or databases
It is also an example of an integrated UI (see [RD9]), because from Astor it is possible to open other important tools such as the Access Control Panel (see [RD10], chapter Advanced Features), the LogViewer (see [RD11]), Jive (see [RD12]) and so on. According to the above considerations, starting from the Astor tool, there are two main lines of extension:
● include the engine of the Lifecycle Manager, based on (relatively) new technologies such as Puppet, Chef, Ansible and so on;
● include the links to the UIs of the SER sub-systems (the monitoring system, the Logging service and the Lifecycle Manager).
Considering the lifetime of the SKA project (and, as a consequence, of the TM project), it is recommended to start the development of new UIs based on more recent technologies such as REST and the Web. The TANGO community has already started a project for a TANGO REST app (see [RD8]).
Monitoring
● In a generic monitoring system, it is always possible to make the communications either asynchronous or synchronous and, in general, the synchronous one is preferred. In fact, following generic software monitoring best practices, in order to decide whether the trend of a monitoring point indicates a fault or not, it is necessary to gather more than one sample; a single sample in an asynchronous exchange of data may not be enough. Also, synchronous communication reduces the complexity of the monitoring point scripts and permits control of the network traffic and of the load on the server. With asynchronous communication, the whole monitoring activity is performed by the node, which monitors itself and raises an alert in case of fault, opening a communication with the server. In this way, the SSM Server cannot manage how many packets will be sent, so it is not possible to control the network traffic and the server load.
Logging
● In case of a TANGO based engine, it is possible to use the TANGO Logging Service directly, which is based on a specific TANGO Device Server (the Log Consumer Device) to view the log messages. More information can be found in the SKA Logging Guidelines (see [AD3]).
● According to the SKA Logging Guidelines (see [AD3]), the same pattern will be extended to every SKA Application, not only to the TM Applications.
Rationale
Lifecycle manager
● The lifecycle management is the ability to control a software application in the following
phases of its lifetime: Configuration, Start, Stop/Kill, Update, Upgrade or Downgrade.
● To realize lifecycle management, it is convenient to subdivide all the applications that compose TM by type. In particular, it is possible to distinguish the following types:
○ OS Service, a process that starts and stops with the operating system;
○ Web server, software that processes requests via HTTP, the basic network protocol used to distribute information on the World Wide Web;
○ Web application, software running in a web server;
○ Desktop application, software running on the client computer;
○ Server application, software running on a server that usually does not have the same lifetime as the operating system;
○ DB server, a database management system (which can be an RDBMS or a NoSQL DB technology).
● The configuration phase is the ability to set all the preconditions (needed libraries, DB, configuration files and so on) to make a software application ready to start (the configuration phase is the preparation of the start phase).
● All these activities can be done through an IT automation tool (like Puppet, Chef, Ansible and so on) in cooperation with the other sub-elements, since only they know the details of the applications.
● It is not worthwhile to document the interface between the client side (called Service) and the server side (called Engine) of the Lifecycle Manager, since it is Off-the-Shelf.
● To access a lifecycle action, an authentication token provided by AAA is necessary (if the operator has the proper role).
Monitoring System
A monitoring system is composed of a software system monitor and of sub-element specific monitors. Having such a piece of software in every node of the network can easily provide the ability to monitor other aspects of TM, namely network services (applications using protocols like TCP, HTTP, FTP and so on), host resources (processor load, disk usage, system logs) and every monitoring point chosen by the other teams.
A Software System Monitor (SSM) is a client-server application that realizes the ability to monitor a set of applications producing one or more monitoring points in one or more computer networks. There is also the possibility to have specific monitoring activities that produce a list of monitoring points for every application, so that it is possible to define a strategy (technique) for fault management (for instance black box/white box error handling). The most important activity of the System Monitor is the Perform monitoring activity described in 10.2.2.4.3, which is responsible for:
● Hardware monitoring: perform hardware monitoring, list MPs, atomic hardware MPs
● Software monitoring: perform software monitoring, list MPs, atomic software MPs
● Network monitoring: atomic network MPs, keep-alive monitoring
● Generic monitoring: security level, event attributes
● Data archive: monitoring data, monitoring configuration, metering configuration, failure modes.
Failover mechanism
An important consideration concerns a possible failover mechanism. Failover is an automatic action taken to recover from a specific situation and can happen at different levels. The boundaries for the levels are:
1. if the failover is needed at server level, it is a responsibility of the Virtualization;
2. if it can be solved with a lifecycle action, it is a responsibility of the Service;
3. otherwise it is at the level of the TANGO facility (for instance in case of a capability transfer) and it is a responsibility of the M&C Module.
Logging
● The proposed logging architecture is a best practice.
● There is no direct link between the TM application and the logging forwarder, which helps the general performance of the network.
● The forwarder will be configured to avoid any network flooding and, in particular, to avoid sending large quantities of data when the network is very busy.
● The growth of logging data should be controlled, to avoid storing unused information and to keep a sufficient amount of data persistent without the risk of data flooding. There are different possibilities; one of the most used is to keep a fixed size for the data and drop all the messages that exceed that size.
● A logging prototype has been developed based on the ELK stack (see [RD4]).
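The fixed-size retention policy mentioned above admits a one-line realization in Python. This sketch assumes a drop-oldest variant (the oldest records are evicted once the limit is reached); other policies, such as dropping new messages, are equally possible.

```python
from collections import deque

# Illustrative fixed-size log store: appends beyond max_records silently
# evict the oldest entries, bounding storage growth by construction.
def make_log_store(max_records):
    return deque(maxlen=max_records)
```

`deque(maxlen=...)` gives O(1) appends and never exceeds the configured size, which is exactly the bounded-growth property the rationale asks for.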
Service GUI
The Service GUI is the entry point for the TM Services software package and will allow the Operator to access all the functionalities provided from one single UI.
There are important indications from the work being done in the UI Team like:
○ Few steps: when working, the user should not need many steps to reach his various GUIs;
○ High integration: a user may need or want to compose his specific GUI.
As a consequence of the second point, it is crucial to take part in the construction of Feature Request #6 – TANGO web application (see [RD8]):
○ There is already an effort in refactoring the generic web app into an open platform (third parties will be able to implement plugins for the platform to address their needs);
○ The development of the Service GUI will correspond to the construction of some plugins, as per the primary presentation.
Abstract Data Model
Description
The present view is the blueprint for the implementation of the data entities and a domain analysis of the main concepts with which the TM Services work.
Overview
This view has been divided into the following view packets for convenience of presentation:
1. Entities, 10.3.3: it describes the types of entities managed by the TM SER.
2. Monitoring, 10.3.4: it describes the data types managed by the SSM.
3. Lifecycle, 10.3.5: it describes how it is possible to have multiple versions of the same application, and the data types managed by the Lifecycle Manager.
4. Virtualization, 10.3.6: it describes how TM is going to use the virtualization service, in particular the data model associated with it.
Entity decomposition View Packet
The present view packet highlights the entities managed by the TM SER package.
Primary Presentation
To correctly read the following diagram, it is important to start from the Entity block which is central. An entity can be an application, a monitored process, a monitoring activity, the virtualization, a virtual resource managed by the virtualization or a template (see also Virtualization view packet, 10.3.6).
Figure 12: Entity and version.
Element Catalog
Elements
Element Description
Entity It is the main data item to which every TM Service application refers. It can be a monitored process, an application (that is, a composition of monitored processes) or a monitoring activity.
Application It is an aggregation of MonitoredProcess blocks, each selected at a particular version through the LogicalComposition block
MonitoredProcess It is an OS process that needs to be monitored and controlled
MonitoringActivity It is a process (a runtime entity like a script or an OS service) that monitors an entity and produces monitoring data
LogicalComposition It is a composition of one MonitoredProcess and one Version. An application is a composition of those blocks, so that each MonitoredProcess can refer to a particular version.
Version Every entity is related to the system at a particular version, which indicates the particular composition of the software product.
Virtualization Please see the Virtualization view packet, 10.3.6
Template Please see the Virtualization view packet, 10.3.6
vResource Please see the Virtualization view packet, 10.3.6
Properties
Element Property Description
Entity Singleton This property indicates whether an entity can have multiple instances or not
Version Singleton This property indicates whether a version can have multiple instances or not
MonitoredProcess isProduct This property indicates whether the monitored process is a product (part of the PBS) or not
Application isProduct This property indicates whether the application is a product (part of the PBS) or not
vResource IP Address A floating IP address and a private IP address can be used at the same time on a single network interface. The private IP address is likely to be used for accessing the instance from other instances in private networks
vResource Floating IP Address The floating IP address would be used for accessing the instance from public networks
Relations
Part A Part B Type Multiplicity
Entity Element Specialization 1
Entity MonitoredProcess Specialization 1
Entity Application Specialization 1
Entity MonitoringActivity Specialization 1
Entity Virtualization Specialization 1
Entity Template Specialization 1
Entity vResource Specialization 1
Application LogicalComposition Composition 1..*
LogicalComposition MonitoredProcess Composition 1
LogicalComposition Version Composition 1
Entity Version Relationship 1..*
Rationale
● Desktop applications are not in the model because they will not be monitored. Any server application (being part of the TM network) is part of the model.
● The multiplicity of the Version–Configuration relation has to be 0..* because not all the entities will be managed with the Lifecycle Manager. Even if a script in an automation tool is always preferable, because it is replicable, it is also possible to configure an application outside the Lifecycle Manager (because the application can be too complicated, like the virtualization, or because there is a need for a manual configuration).
Monitoring View packet
The present view packet describes the data types managed by the Software System Monitor (SSM).
Primary Presentation
The starting point for reading the following diagram is the block MonitoringActivity. It is an entity like any other in the TM SER and refers to a particular version of another entity called 'entity2monitor'. It is composed of an ActivityType and a Criticality, and is related (produces) to one or more MonitoringPoint blocks. The MonitoringPoint block is composed of a type called 'MonitoringPointType' and produces MonitoringData in a certain mode (Asynchronous or Synchronous). A MonitoringActivity can also generate events, which can be an alarm, a warning, an information or an unknown event.
Figure 13: Monitoring data model.
Element Catalog
Elements
Element Description
Entity It is the main data item to which every TM Service application refers. It can be a monitored process, an application (that is, a composition of monitored processes) or a monitoring activity.
MonitoringActivity It is a process that monitors an entity and produces monitoring data
Version Every entity is related to the system at a particular version, which indicates the particular composition of the software product.
Event Message indicating that something has happened (for instance an alarm that requires a user interaction)
ActivityType There are two main types of monitoring activity: the measure monitoring activity (that is, reading the CPU utilization) and the message monitoring activity (communicating a particular piece of information, a message)
Criticality A monitoring activity has a level of criticality (High, Medium or Low) that allows configuring a priority for the monitoring data it produces.
MonitoringPoint Definition of a specific kind of data that is representative of an aspect of the system and that can be of interest to an operator or to a component
MonitoringPointType A monitoring point data value can be of a particular type, like Percentage, Elapsed time, Status or a simple number
MonitoringData A value for a monitoring point at a particular time stamp
SendMode A monitoring data value can be sent by an asynchronous or a synchronous message
Properties
Element Property Description
Entity Singleton This property indicates whether an entity can have multiple instances or not
Version Singleton This property indicates whether a version can have multiple instances or not
Relations
Part A Part B Type Multiplicity
Entity MonitoringActivity Specialization 1
Entity Version Relationship 1..*
Version Configuration Relationship 0..*
MonitoringActivity Entity Specialization 1
MonitoringActivity Entity Relationship 1
MonitoringActivity Event Relationship 0..*
MonitoringActivity MonitoringPoint Composition 1..*
MonitoringActivity ActivityType Composition 1
MonitoringActivity CriticalLevel Composition 1
Event EventType Relationship 1
MonitoringData MonitoringPoint Relationship 1..*
MonitoringPoint MonitoringPointType Composition 1
MonitoringData SendMode Composition 1
Interfaces
Monitoring Interface (see 11.2) - Interface for a generic monitoring activity.
Behavior
This section provides an example of a monitoring activity that measures the latency between two communicating processes as an indicator of communication problems. To do that, a SER user has to (assuming that the two processes are Entities managed by TM SER):
1. Create a MonitoringActivity (depending on the off-the-shelf solution selected, it can be a script, for instance) to store (in a log, for instance) the timestamp (MonitoringPointType: TIME) and data (only the ID) sent by the sending MonitoredProcess.
2. Create another MonitoringActivity to store the timestamp and data (only the ID) received by the other MonitoredProcess.
3. Create a LogicalComposition (see the Entities view packet, 10.3.3) composed of the two MonitoredProcess entities considered above.
4. Create another MonitoringActivity that:
a. retrieves the two MonitoringData items (corresponding to the same data id) coming from the first two monitoring activities,
b. calculates the difference of the timestamps and
c. raises an event of type alarm, if required.
Note that points 1 and 2 can be omitted if the timestamp and data id are already stored somewhere.
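Step 4 above can be sketched as follows. The record layout (dictionaries mapping data id to timestamp, in seconds) and the threshold parameter are illustrative assumptions for this example.

```python
# Sketch of step 4: match the two MonitoringData sets by data id, take the
# timestamp difference and flag an alarm when it exceeds a threshold.
# The id->timestamp dictionary layout is an assumption for illustration.
def check_latency(sent, received, threshold_s):
    """sent/received: dicts mapping data id -> timestamp (seconds).

    Returns the data ids whose sent-to-received latency exceeds threshold_s.
    """
    alarms = []
    for data_id, t_sent in sent.items():
        t_recv = received.get(data_id)
        if t_recv is None:
            continue  # not yet received; a real activity might time this out
        if t_recv - t_sent > threshold_s:
            alarms.append(data_id)  # would raise an Event of type alarm
    return alarms
```

Each returned id would trigger an Event of type alarm (step 4c) in the real monitoring activity.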
Context Diagram
Figure 14: Monitoring context.
The above figure (for the full picture of the SER C&C, please see the Service C&C View at 10.1.3.1) shows the context for the present view packet. The interaction between client and server can be started both from the SSM Agent and from the SSM Core, depending on the configuration made for the SSM. The SSM Agent executes one or more monitoring activities and sends their results as monitoring data to the SSM Core, which collects and stores them in the MonData Repository. One or more monitoring activities can also be executed from the SSM Core, on the server side.
Rationale
● A MonitoringActivity is a runnable process and an entity of the TM SER. It can generate one or more monitoring points, each of which has a type and generates monitoring data.
● The monitoring activities are both runnable executables (or scripts) and data for the SSM. Therefore, adding a new monitoring activity corresponds to inserting a script into a file folder (this is true for Nagios Core but, in general, it depends on the technology chosen).
● Since a monitoring activity is an entity of the TM SER (like an OSO/TMC application), it can be managed as any other entity with the lifecycle, allowing the possibility to create multiple versions of the same monitoring activity (even running them at the same time).
Lifecycle View Packet
The present view packet highlights the resources managed for a TM Application and, in particular, how an entity will be configured by the Lifecycle Manager component.
Primary Presentation
The starting point for reading the following diagram is the block Configuration. In fact, the key concept for starting an application (and therefore for the lifecycle) is the configuration phase. A Configuration represents everything needed to prepare an application to run, so it is composed of one or more ConfigurationResource blocks, which can be a Library (for instance a Python library), a Directory, a File or a LogConfigurationFile. A File can represent, for example, a specific configuration to apply, so it can have a TemplateFile that can be filled with the 'attributes' property of the configuration (in this model described as a Dictionary, but nothing prevents having a JSON string). A Configuration is also composed of a Script (to be executed locally to prepare the configuration or to start an Entity, see the Entity view packet, 10.3.3) that depends on the specific choice of the engine for the lifecycle (for instance an IT automation tool like Ansible or Chef, see the Service C&C View, 10.1.3.1). It also has a type 'ConfigurationType' and is related to a particular version of an entity.
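Filling a TemplateFile from the configuration's 'attributes' dictionary can be sketched with the standard library. `string.Template` is chosen here purely for illustration; the real engine (Ansible, Chef, ...) would use its own templating mechanism.

```python
from string import Template

# Illustrative sketch of rendering a TemplateFile: the placeholders in the
# template text are substituted with the configuration's 'attributes'
# dictionary. Placeholder syntax ($name) is a string.Template convention.
def render_config(template_text, attributes):
    return Template(template_text).substitute(attributes)
```

For example, a template `"host=$host port=$port"` filled with `{"host": "tm01", "port": 8080}` yields a concrete configuration file body for that host.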
Figure 15: Lifecycle Data Model.
Element Catalog
Elements
Element Description
Entity Please see 10.3.3
Version Every entity is related to the system with a particular version that indicates the particular composition of the software product.
Configuration It contains all the necessary information for the lifecycle management related to the particular version of an entity. For instance it maintains the reference to the host where the application runs or the relations with other objects (for instance an application could depend on another
application like the TANGO-controls framework depends on ZeroMq library)
ConfigurationType Each configuration has an associated type, which defines the particular behavior of the specific configuration
Script Program written for a special run-time environment (that is, Python or Ruby) that automates the execution of tasks that could alternatively be executed one by one by a human operator. Scripts depend on the SKA decision on the technology to adopt.
PreconditionCheck Script used to check the preconditions of the Entity (for instance, if it is a singleton it cannot be started twice). It will always be run before the main configuration script.
Bash Unix shell script
Ruby Ruby script
Specific IT Automation Tool script Most of the IT automation tools available on the market come with a specific scripting tool to help development
ConfigurationResource A generic configuration resource needed by a particular version of an entity
LogConfigurationFile Specialization of ConfigurationResource, that is a special kind of file used for logging configuration
Library Specialization of ConfigurationResource, needed by a specific entity
Directory Specialization of ConfigurationResource, that indicates a directory to be created
File Specialization of ConfigurationResource, that indicates a file to be created
TemplateFile Used to create a file as specific configuration item
Properties
Element Property Description
Entity Singleton This property indicates whether an entity can have multiple instances or not
Version Singleton This property indicates whether a version can have multiple instances or not
MonitoredProcess isProduct This property indicates whether the monitored process is a product (part of the PBS) or not
Application isProduct This property indicates whether the application is a product (part of the PBS) or not
Relations
Part A Part B Type Multiplicity
Entity Version Relationship 1..*
Version Configuration Relationship 0..*
Configuration ConfigurationType Composition 1
Configuration Script Relationship 1
Script Bash Specialization 1
Script Ruby Specialization 1
Script Specific IT Automation Tool Specialization 1
Script ConfigurationResource Composition 1..*
Script PreconditionCheck Specialization 1
ConfigurationResource Library Specialization 1
ConfigurationResource Directory Specialization 1
ConfigurationResource File Specialization 1
LogConfigurationFile File Specialization 1
File TemplateFile Composition 0..1
Interfaces
Lifecycle interface (see 11.1) - Interface for an application (at a specific version).
Context Diagram
Figure 16: Lifecycle context.
Figure 16 (for the full picture of the SER C&C, please see 10.1.3.1) shows the context for the present view packet. The lifecycle service takes the configuration item from the Lifecycle Engine and applies it to perform a lifecycle action on the TM application. The interaction between the lifecycle engine and the lifecycle service can start both from the engine (blue line) and from the service (black line), depending on the tool chosen as IT automation tool.
Rationale
● The present model shows that a configuration is a composition of a script (written in a specific language) and a set of configuration resources. The Lifecycle Manager will maintain these data and will execute the script whenever needed.
● The configurations are both runnable executables and data for the Lifecycle Manager. Therefore, adding a new application version corresponds to inserting a set of scripts into a dedicated repository (this is true for Chef but, in general, it depends on the technology chosen).
● Adding a new version of an application implies developing a new configuration item and uploading it into the Lifecycle repository. Nothing prevents having more than one version of the same application running (enabling modifiability).
Virtualization View Packet
The present view packet presents the data associated with a TM entity for the virtualization configuration. It basically depicts what data a TM application needs in order to use the virtualization service.
Primary Presentation
To read the model correctly, the starting point is the template, which is basically a file describing what a particular version of an entity needs for its lifecycle management. An application may need a certain number of resources (called vResources in the model), such as CPU, storage space and network constraints, which form the SLA (Service Level Agreement). Once the template is defined, it acts as a single unit (also called a stack) with a state (Template State) and is managed by the Virtualization entity through the virtualization interface. The managed resources (vResources) have a state as well (vResource State) that enables the work of the virtualization (for instance, a resource can be transferred to another template only if it is in a particular state). There are mainly three types of vResource: computational, network and storage (see the Element Catalog for more information).
Figure 17: Virtualization Data Model.
Element Catalog
Elements
Element Description
Template A Template (that is, Heat [RD17], AWS CloudFormation [RD23] and so on) is a file that describes a collection of resources (called vResources in the primary presentation) like VMs or containers as a single unit called a stack. The Virtualization Service (called only Virtualization in the primary presentation) manages the resources of the entire set so that it matches a certain SLA (in terms of CPU, memory and so on) and implements basic failover mechanisms (if required and specified in the template itself).
Physical Resource File Descriptor
Document containing all the commands needed to build an image from scratch including testing capabilities and access to volumes and network services (see 10.3.6.2.3 for further detail).
Product Execution File Descriptor
Document defining the configuration for multi-image container operations. It includes testing capabilities as well as describing access to hardware specifics like volumes and network (see 10.3.6.2.3 for further detail).
vResource Descriptor Document defining the runtime specifics and interoperability of various containers as well as their hardware access policies, hardware resource priorities, update mechanism, privileges, and restart strategies (see 10.3.6.2.3 for further detail).
vResource Generic term indicating a resource managed by the virtualization, including CPU, storage, and networking.
Virtualization Generic term indicating the entity that manages the virtualization service. It allows creating a virtual version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
Container An OS level container
Hardware Hardware entity (with a serial number to manage)
vResourceCompute Virtualized computational resource. It can be real Hardware, virtual hardware (called vHardware in the primary presentation), a Container or a VM.
vResourceNetwork Virtual network for a cloud application
vResourceStorage Virtual storage for a cloud application
VM Virtual Machine
vHardware Virtualized computational resources
Entity Please see 10.3.3
Configuration Please see 10.3.5
Properties
Element Property Description
Hardware SerialNumber This property indicates the serial number of the specific Hardware
vResource IP Address A floating IP address and a private IP address can be used at the same time on a single network-interface. The private IP address is likely to be used for accessing the instance by other instances in private networks while the floating IP address would be used for accessing the instance from public networks
Floating IP Address
Template Template State
Enumerative. It can assume one of the following values: INITIALIZED, ACTIVE, PAUSED, SUSPENDED, STOPPED, DELETED, ERROR
vResource vResource State
Enumerative. It can assume one of the following values: INITIALIZED, PAUSED, SUSPENDED, SOFT_DELETED, ERROR, RESCUED, STOPPED
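The two state enumerations above can be sketched in code. The transfer guard below is a hypothetical illustration of how a state could gate a virtualization operation (the actual rule set is not specified in this document):

```python
from enum import Enum

# Sketch of the Template and vResource state enumerations from the tables above.
class TemplateState(Enum):
    INITIALIZED = "INITIALIZED"
    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    STOPPED = "STOPPED"
    DELETED = "DELETED"
    ERROR = "ERROR"

class VResourceState(Enum):
    INITIALIZED = "INITIALIZED"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    SOFT_DELETED = "SOFT_DELETED"
    ERROR = "ERROR"
    RESCUED = "RESCUED"
    STOPPED = "STOPPED"

# Assumed set of states in which a vResource may be moved to another template;
# this set is illustrative, not taken from the document.
TRANSFERABLE = {VResourceState.PAUSED, VResourceState.STOPPED}

def can_transfer(state: VResourceState) -> bool:
    """Return True if a vResource in `state` may move to another template."""
    return state in TRANSFERABLE
```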
Descriptors
Table 5: Physical Resource File Descriptor.
instruction description arguments Syntax
from Base image of image Name from a standard set of low-level pre-made images (that is, debian) or another user-made image
FROM image_name
run Run shell command Valid shell command RUN <command>
label Set metadata for image to be used externally
List of label keys and their values
LABEL <key>=<value> <key>=<value> ...
expose Allow container to listen on specific port at runtime
Network port EXPOSE <port>
env Set environment variable Key name and value ENV <key> <value>
volume Creates a mount point with the specified name and marks it as holding externally mounted volumes from native host or other containers
List of mounting points VOLUME [‘/data’]
user Set the user name (or UID) and optionally the user group (or GID) to use when running the image
UID and optionally GID USER <UID>[:<GID>]
healthcheck Run an instruction and wait for specific output to ensure the image is running correctly
Optional options and command
HEALTHCHECK [OPTIONS] CMD command
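The instructions in Table 5 can be combined as in the following illustrative descriptor; the image, package and port choices are assumptions, not a prescribed TM configuration:

```dockerfile
# Illustrative Physical Resource File Descriptor (assumed values)
FROM debian:stable
LABEL maintainer="tm-ser" version="1.0"
RUN apt-get update && apt-get install -y curl
ENV APP_HOME /opt/tm-app
VOLUME ["/data"]
EXPOSE 8080
USER 1000:1000
HEALTHCHECK --interval=30s CMD curl -f http://localhost:8080/ || exit 1
```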
Table 6: Product Execution File Descriptor.
instruction description example
container_name Specify a custom container name, rather than a generated default name.
container_name: my-container
container_creator Creator of the container container_creator: ‘john doe’
version Version of the config file syntax version: 1
labels Specify labels for the container. labels: description: ‘Example Lbl’
restart_policy Configures if and how to restart containers when they exit.
● condition: One of none, on-failure or any (default: any).
● delay: How long to wait between restart attempts, specified as a duration (default: 0).
● max_attempts: How many times to attempt to restart a container before giving up (default: never give up).
● window: How long to wait before deciding if a restart has succeeded, specified as a duration (default: decide immediately).
restart_policy: condition: on-failure delay: 5s max_attempts: 3 window: 120s
devices List of device mappings. devices: - ‘/dev/ttyUSB0:/dev/ttyUSB0’
env_file Add environment variables from a file. Can be a single value or a list.
env_file: - ./common.env - ./apps/web.env - /opt/secrets.env
environment Add environment variables. environment: RACK_ENV: development SHOW: 'true' SESSION_SECRET:
expose Expose ports between containers without publishing them to the host machine
expose: - ‘3000’ - ‘8000’
healthcheck Configure a check that’s run to determine whether or not containers are ‘healthy’.
healthcheck: test: [‘CMD’, ‘curl’, ‘-f’, ‘http://localhost’] interval: 1m30s timeout: 10s retries: 3
image Specify the image to start the container from.
image: a4bc65fd
depends_on Express dependency between containers
depends_on: - db - redis
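Putting the instructions of Table 6 together, an illustrative Product Execution File Descriptor could look as follows; all names and values are assumptions for the sake of the example:

```yaml
# Illustrative Product Execution File Descriptor (assumed values)
version: 1
container_name: tm-logging
container_creator: 'jane doe'
labels:
  description: 'Logging forwarder container'
image: a4bc65fd
depends_on:
  - db
env_file:
  - ./common.env
environment:
  RACK_ENV: production
expose:
  - '8000'
restart_policy:
  condition: on-failure
  delay: 5s
  max_attempts: 3
  window: 120s
healthcheck:
  test: ['CMD', 'curl', '-f', 'http://localhost']
  interval: 1m30s
  timeout: 10s
  retries: 3
```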
Table 7: vResource File Descriptor.
instruction description example
service_name Specify a custom service name, rather than a generated default name.
service_name: my-service
service_creator Creator of the service service_creator: ‘john doe’
version Version of the config file syntax version: 1
labels Specify labels for the service. labels: description: ‘Example Lbl’
restart_policy Configures if and how to restart the service when it exits.
● condition: One of none, on-failure or any (default: any).
● delay: How long to wait between restart attempts, specified as a duration (default: 0).
● max_attempts: How many times to attempt to restart a container before giving up (default: never give up).
● window: How long to wait before deciding if a restart has succeeded, specified as a duration (default: decide immediately).
restart_policy: condition: on-failure delay: 5s max_attempts: 3 window: 120s
replicas If the container is replicated, specify the number of containers that should be running at any given time.
replicas: 6
resources Configures resource constraints. resources: limits: cpus: '0.001' memory: 50M reservations: cpus: '0.0001' memory: 20M priority: 1
update_config Configures how the service should be updated. Useful for configuring rolling updates.
● parallelism: The number of containers to update at a time.
● delay: The time to wait between updating a group of containers.
● failure_action: What to do if an update fails. One of continue, rollback, or pause (default: pause).
● monitor: Duration after each task update to monitor for failure (ns|us|ms|s|m|h) (default 0s).
● max_failure_ratio: Failure rate to tolerate during an update.
update_config: parallelism: 2 delay: 10s
devices List of device mappings. devices: - ‘/dev/ttyUSB0:/dev/ttyUSB0’
env_file Add environment variables from a file. Can be a single value or a list.
env_file: - ./common.env - ./apps/web.env - /opt/secrets.env
external_links Link to services started outside this one external_links: - redis_1 - project_db_1:mysql - project_db_1:postgresql
healthcheck Configure a check that’s run to determine whether or not containers for this service are ‘healthy’.
healthcheck: test: [‘CMD’, ‘curl’, ‘-f’, ‘http://localhost’] interval: 1m30s timeout: 10s retries: 3
container Specify the containers to start the service from.
container: - db - processing1
user Set the user name (or UID) and optionally the user group (or GID) to use when running the service
user: 1002:1001
networks Networks to join networks: - some-network - other-network
ipv4_address, ipv6_address
Specify a static IP address for containers for this service when joining the network.
networks: app_net: ipv4_address: 172.16.238.10 ipv6_address: 2001:3984:3989::10
ports Expose ports. ports: - ‘3000’ - ‘3000-3005’ - ‘8000:8000’ - ‘9090-9091:8080-8081’ - ‘49100:22’
volumes Mount host paths or named volumes, specified as sub-options to a service.
volumes: source: mydata target: /data options: nocopy: true
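The service-level keys of Table 7 (notably replicas, resources and update_config, which have no counterpart in the container-level descriptor) can be combined as in this illustrative example; names and values are assumptions:

```yaml
# Illustrative vResource File Descriptor (assumed values)
version: 1
service_name: tm-ssm-core
service_creator: 'jane doe'
container:
  - db
  - processing1
replicas: 2
resources:
  limits:
    cpus: '0.5'
    memory: 256M
  reservations:
    cpus: '0.1'
    memory: 64M
update_config:
  parallelism: 1
  delay: 10s
  failure_action: rollback
networks:
  - some-network
ports:
  - '8000:8000'
```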
Descriptors Errors
Error handling at the abstract data model level will be separated between layers as specified. All layers will provide testing capabilities through the healthcheck command to ensure their correct state.
● 1st level error handling will check whether all the variables in the virtualization service description for a TM Application are within range and/or exist on the host system where the virtualization will run.
● 2nd level error handling will ensure the normal boot-up of the TM Application image and can have error checking in place by the TM service to abort the process at the virtualization level if needed.
● 3rd level error handling will continuously monitor the virtualization of each TM Application and the remaining TM Applications running at the hardware level, to ensure that hardware availability is within specifications, balancing the resources allocated to each part of the Virtualization Service.
Table 8: Physical Resource File Descriptor errors.
Error Original instruction Description Extra info
Descriptor syntax error
- Syntax error in the virtual machine creator file descriptor
Line
Nonexistent image
from The image file doesn’t exist Image name
Cmd syntax error run Nonexistent command or syntax error
If it is a nonexistent command or syntax error
Health Check failed
healthcheck The healthcheck command doesn’t run or throws an exception
Specify the case and the test
Table 9: Product Execution File Descriptor Errors.
Error Original instruction Description Extra info
Descriptor syntax error
- Syntax error in the product execution file descriptor
Line
Device error devices Nonexistent or non-accessible device
Device and reason
Environment file error
env_file Nonexistent or syntax error in environment file
File reason and if necessary line in file
Health Check failed
healthcheck The healthcheck command doesn’t run or throws an exception
Specify the case and the test
Nonexistent image
from The image file doesn’t exist Image name
Nonexistent container
depends_on The container doesn’t exist Container name
Table 10: vResource File Descriptor Errors.
Error Original instruction Description Extra info
Descriptor syntax error
- Syntax error in the vResource file descriptor
Line
Device error devices Nonexistent or non-accessible device
Device and reason
Environment file error
env_file Nonexistent or syntax error in environment file
File reason and if necessary line in file
Health Check failed
healthcheck The healthcheck command doesn’t run or throws an exception
Specify the case and the test
Nonexistent container
container The container doesn’t exist Container name
Nonexistent network
networks The network doesn’t exist Network name
Nonexistent volume/path
volumes The service volume or host path doesn’t exist
Specify case and volume/path
Relations
Part A Part B Type Multiplicity
Entity Virtualization Specialization 1
Entity Template Specialization 1
Entity vResource Specialization 1
Version vResource Reference 0..1
Version Template Reference 0..1
Template Template Reference (depends on) 0..1
Virtualization vResource Manages 1
Virtualization Template Manages 1
vResource vResourceCompute Specialization 1
vResource vResourceNetwork Specialization 1
vResource vResourceStorage Specialization 1
vResourceCompute Hardware Specialization 1
vResourceCompute vHardware Specialization 1
vResourceCompute Container Specialization 1
vResourceCompute VM Specialization 1
vResource vResource Descriptor Composition 1
Container Container File Descriptor
Composition 1
VM Image Creator File Descriptor
Composition 1
Interfaces
Virtualization Interface at 11.4.
Context Diagram
When starting an application in a distributed (or even cloud) environment, it is important to start it in the correct place. This view packet aims to make the reader aware that, just like the lifecycle code, the infrastructure must be coded as well.
Rationale
To start a TM application correctly, there must be both lifecycle script code and an infrastructure description called a template.
Allocation View
This view shows the mapping between the runtime components of the system (see 10.1) and the (virtual or real) servers (see 13.3.2) needed to run them. The figure below shows how the Services products (server side) are deployed across the sites, in particular at the GHQ (SKA Headquarters in England) and at the Telescopes (SKA-MID in South Africa and SKA-LOW in Australia). Each site, represented as a geographical boundary, hosts an instance of every product of TM Services. In the diagram, different representations are used depending on the technologies. The technologies used are: Virtual Machines node, Cluster, External Cloud.
Primary Presentation
Figure 18: Deployment diagram.
Element Catalog
Definitions
Element Description
Node In a computer network, a node is an end-point identified by an IP address or a name that can receive, create, store or send data along distributed network routes.
VMs Node A node composed of at least two virtual machines, an active and a passive node. In case of failure or an abnormal situation, a failover mechanism switches from the active to the passive node.
Cluster A cluster is a collection of one or more nodes that together hold data and provide search capabilities across all nodes, to improve efficiency, availability and performance.
External Cloud
An external cloud hosts services that need to run in an environment external to the other TM products.
Elements
Element Description
Service GUI Node A VMs Node, composed of an active and a passive node, hosting the Service GUIs
Software System Monitor Node
A VMs Node, composed of an active and a passive node, hosting the components of the Software System Monitor
Logging Server Cluster A cluster hosting the components of the server part of the Logging service
Lifecycle Manager A cloud node that hosts the components of the IT automation software (like Chef, Puppet, Ansible and so on); it resides in an external cloud for two reasons: to avoid having three locations for it, and for cost savings.
Service GUI Generic UI that allows an Operator to visualize monitoring data, alarms, logs and so on. It also permits actions such as configuring fault rules and taking lifecycle actions with an existing configuration. See 10.1 and 10.1.3.1 for further details.
Software System Monitor
See 10.1 and 10.1.3.1 for further details.
SSM Core Collects data from every SSM Agent in the network for hardware, software and network monitoring; receives asynchronous events from the Monitoring Agent; stores the collected data into the repository; provides the data to be visualized by the operator through the Service UI.
MonData Repository Repository where the monitoring data are stored. See 10.1 and 10.1.3.1 for further details.
FM Repository A repository where rules and actions are installed. See 10.1 and 10.1.3.1 for further details.
Fault Engine The rule engine that defines the mapping between monitoring data and alarms. It also provides an engine to perform predictive fault analysis using monitoring and log data. See 10.1 and 10.1.3.1 for further details.
Notification System Software application that sends notifications to the Service GUI or to the operator via email or SMS, according to the fault rules. See 10.1 and 10.1.3.1 for further details.
Logging Service Entity responsible for collecting and organizing the log messages from every application. It is usually composed of three software entities: the forwarder, the repository (data center) and the query GUI (that is, the ELK stack). The repository (data center) is a hierarchy of databases (usually NoSQL), and potentially every domain TANGO Facility (element instance) may have a specific database cluster to collect log messages and increase the performance of the queries (the first choice for the data center is Elasticsearch). See 10.1 and 10.1.3.1 for further details.
LS Engine See 10.1 and 10.1.3.1 for further details.
LM Core Server side of the Lifecycle Manager. It allows sending a specific control action to a TM Application. See 10.1 and 10.1.3.1 for further details.
LM Data Repository Part of the engine: repository of configuration items (for example, in Chef they are Ruby scripts). See 10.1 and 10.1.3.1 for further details.
Relations
Part A Allocated To Description
Service GUI Node
GHQ and Telescopes
Service GUI is a VMs Node
Software System Monitor Node
GHQ and Telescopes
SSM Node is a VMs Node
Logging Server Cluster
GHQ and Telescopes
Logging Data is a cluster that does not need any TM failover mechanism because one is already present in the ELK stack.
Lifecycle Manager Cloud
GHQ and Telescopes
The Lifecycle Manager has been put outside the virtualization (specifically in a cloud environment) because every TM application (Virtualization included) depends on it for its lifecycle.
Variability Mechanisms
To reach a high level of reliability, software engineers usually use a failover mechanism. There are two main best practices for this: the active/active architecture (also known as a high-availability cluster) and the active/passive architecture (also known as simple failover).
An active/passive solution manages failover when the active (or primary) node crashes: its associated resources are relocated to (and restarted on) the passive (or secondary) node, ensuring the persistence of the services. Besides reliability, another advantage is the ability to deal with both planned and unplanned service outages: the administrator can update the passive node while the active one is still running, easily switch them after the update, and then do the same for the other node.
An active/active solution manages failover by keeping all the nodes working, so that if one fails the others can still receive requests. In this case, it is crucial to guarantee traffic balance by adding one or more load balancers to the network. It is also important to properly size the servers to avoid spare resources; in fact, a best practice is that the load balancers run the servers at near full capacity.
In the TM SER case, an active/passive solution has been preferred for two main reasons: the first is to minimize cost (there is no need for a load balancer, so it is easier to implement) and the second is that this solution fully satisfies the assigned requirements.
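The active/passive mechanism described above can be sketched as a heartbeat check that promotes the passive node when the active one stops responding. This is a minimal illustration with assumed names, not the actual TM SER implementation:

```python
# Minimal active/passive failover sketch; node names and the heartbeat
# predicate are illustrative assumptions.
class FailoverPair:
    def __init__(self, active: str, passive: str):
        self.active = active
        self.passive = passive

    def check(self, heartbeat_ok) -> str:
        """If the active node misses its heartbeat, promote the passive node."""
        if not heartbeat_ok(self.active):
            # relocate resources: the passive node becomes the new active one
            self.active, self.passive = self.passive, self.active
        return self.active
```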
Rationale
For the deployment of the TM SER components, the Nagios Core architecture (see [RD13]), the ELK architecture (see [RD15] and [RD16]) and the Chef architecture ([RD24]) have been consulted and taken as references. While the monitoring system needs only one server to manage up to 500 nodes [RD14], the logging service needs a cluster to store all the required data (so two servers are needed, one for logging and one for monitoring). The requirement for logging storage is to keep log data for 2 years [AD1]. After a test, it has been found that an application working heavily with logs can send up to 24 MB of logs per day. There are 17 products (11 for TMC, 6 for OSO) in the cost model, therefore the plan is for 150 GB per year; having 1 TB for each site would be enough. The Lifecycle Manager server has been placed in an external cloud environment instead of having an instance (that is, a VM node) at every site. Since the managed information is not going to be large (the TANGO cookbook [RD27] is 95.4 KB) and since this information is usually cached (that is, by Chef) on the client computer, there is no need to allocate specific resources within the sites. Even if the costs associated with a cloud-based solution are comparable to the classical solution (see also [RD24]), the cloud solution has been preferred for the better support associated with it.
Tactics
The SSM Agent, to detect a TM faulty component, implements the following mechanisms:
● Ping: asynchronous request/response message to determine that the component is alive and responding correctly.
● Heartbeat: periodic message exchange with the SSM Server.
● Timestamp: every message has a timestamp to rebuild the correct order of messages.
● Timeout: the monitoring activity (that is, a Nagios check) should complete within a predetermined amount of time.
● Retry: in case of a faulty monitoring activity, the SSM Agent retries to execute the activity.
To recover a TM SER faulty component, the SER sub-system implements the following mechanisms:
● Active redundancy (for every SER sub-system): in case of a fault or failure it is possible to switch to a passive (or redundant) node automatically, using an automatic action, or manually, by the intervention of the operator.
● Reconfiguration: the LMC, through the Lifecycle Manager, can re-configure a faulty component with a versioned configuration, automatically or with an operator command.
● Software upgrade or downgrade: patch after a software bug.
● Exception handling: once an exception has been detected, the system must handle it. There are several possibilities to handle an exception. A possible way is to include with the exception an error code that contains information helpful in fault correlation.
To prevent faults, the Fault Management component will perform trend analysis and failure prediction of TM components. This analysis has to be done with the help of the logging data coming from the TM Logging server. The tactics implemented by LMC Fault Management are:
● Predictive model: according to the health status detected by the Software System Monitor, the predictive model ensures that the system is operating within its normal operating parameters and takes corrective actions. A possible way to create a predictive model is to gather the data (from monitoring, logs and so on) and analyse it with artificial intelligence software.
11 Interfaces
Lifecycle Manager - TM Generic Application - Interface
Interface identity
Lifecycle Manager Main Interface
Resources provided
Configuration script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The configuration script prepares the host with all the necessary resources for running a TM Application (see 10.3);
o Every version of an application must have its own configuration;
o The script is idempotent (it can run multiple times leaving the host in the same state);
o It refers to a particular version of an application.
● error handling: permission not available
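The idempotency property of the configuration script can be illustrated with a small sketch (paths and file contents are assumptions; the actual scripts would be written in the chosen IT automation tool's language): only missing resources are created, so repeated runs leave the host in the same state.

```python
from pathlib import Path
import tempfile

# Sketch of an idempotent configuration step: it only creates what is
# missing, so repeated runs converge to the same host state.
def configure(app_home: Path) -> None:
    app_home.mkdir(parents=True, exist_ok=True)   # no error if already present
    config = app_home / "app.conf"
    if not config.exists():                       # write the default only once
        config.write_text("log_level=INFO\n")

# Running the configuration any number of times yields the same result.
root = Path(tempfile.mkdtemp()) / "tm-app"
configure(root)
configure(root)
```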
Start script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The start script runs after the configuration script and starts the specific TM application;
o The script is not idempotent;
o It refers to a particular version of an application.
● error handling: permission not available
Stop script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The stop script runs after the start script and stops the specific TM application;
o The script is not idempotent;
o It refers to a particular version of an application.
● error handling: permission not available
Kill script resource:
● syntax: a script written in a specific IT Automation Tool scripting language (Ruby, Bash, Perl and so on)
● semantics of the resource:
o The observable effect of the resource takes place in the node (aka virtual or real host) where the script runs;
o The kill script runs after the start script and kills the process of the specific TM application;
o The script is not idempotent;
o It refers to a particular version of an application.
● error handling: permission not available
Error handling
● Permission not available: the configuration resources could need to write to a certain directory (for instance to create a particular service);
● Wrong configuration.
Rationale and design issues
● To start or stop an application there is the need for a configuration phase.
● The configuration phase can be done manually or through the use of an IT Automation Tool such as Chef, Puppet, Ansible and so on.
SSM - Monitoring Activity Interface
Interface identity
Monitoring activity Interface
Resources provided
Execute resource:
● syntax of the resource: a script that runs on the machine where the monitoring activity is hosted. It could be a compiled C program or a shell script.
● semantics of the resource: when the resource is called, the activity starts and returns:
o A code that can assume the following values: INFORMATION, WARNING, ALARM, UNKNOWN
o A value indicating the explanation of the code: Measure, Message, State
● error handling: the following errors must be handled:
o The resource represents a script that must be well written;
o The script has the correct permission to read the resources of the machine;
o The SSM Agent where the monitoring activity runs can communicate with the SSM Server.
Configure resources:
● syntax of the resource: a JSON-based file with information about the resource to monitor
● semantics of the resource:
o Version and information about the host
● error handling: the following errors must be handled:
o Wrong configuration
Data types and constants
The return data type is an array of {CODE, MESSAGE}.
Error handling
● Permission: the script or agent cannot read the resources of the host to control.
● Communication: communication problem between the SSM Server and the SSM Agent.
TM Monitor Interface
Interface identity
TM Monitor Device Interface
Resources provided
Add Monitoring Point resource:
● syntax: TANGO Controls Device command
o Input:
Name of the monitoring point, string
Url to retrieve the monitoring point, string
Query to get the monitoring point value, string
Type of the monitoring point, string
o Output:
Boolean that indicates the success of the operation
● semantics of the resource:
o The observable effect of the resource is the addition of a dynamic attribute, representing the monitoring point, in the device;
o The url to get the monitoring point depends on the technology chosen;
o The query to get the monitoring point value depends on the technology chosen.
● error handling:
o The Url to get the monitoring point is wrong or does not allow retrieving any value;
o The query to get the monitoring point value is wrong or does not allow retrieving any value;
o The name of the monitoring point already exists;
o The SSM is not active.
Remove Monitoring Point resource:
● syntax: TANGO Controls Device command
o Input:
Name of the monitoring point, string
o Output:
Boolean that indicates the success of the operation
● semantics of the resource:
o The observable effect of the resource is the removal of the dynamic attribute, representing the monitoring point, from the device.
● error handling:
o The name of the monitoring point does not exist;
o The SSM is not active.
Host List resource:
● syntax: TANGO Controls Device command
o Output:
List of the hosts configured for monitoring
● semantics of the resource:
o This resource retrieves all the hosts in the network that have been configured for monitoring.
● error handling:
o The SSM is not active and it is not possible to retrieve the list.
Get Monitoring Point Value resource:
● syntax: TANGO Controls Device command
o Input:
Url to retrieve the monitoring point, string
Query to get the monitoring point value, string
o Output:
Value of the monitoring point
● semantics of the resource:
o This resource retrieves the value of a particular monitoring point;
o The url to get the monitoring point depends on the technology chosen;
o The query to get the monitoring point value depends on the technology chosen.
● error handling:
o The Url to get the monitoring point is wrong or does not allow retrieving any value;
o The query to get the monitoring point value is wrong or does not allow retrieving any value;
o The SSM is not active.
Get Monitoring Points Process resource:
syntax: TANGO Controls Device command
o Input: Name of the process, string
o Output: List of monitoring points, string
semantics of the resource:
o Retrieve the list of generic monitoring points for a process
error handling:
o The process is not already configured in the SSM
o The SSM is not active
Data types and constants
This interface uses the types available for dynamic attributes within the TANGO Controls framework.
Error handling
If the SSM is not active, the TM Monitor will not work; to recover, ensure that the SSM is active.
Quality attribute characteristics
QA characteristics of the resources:
Modifiability: adding and removing monitoring points from the device is easy.
Rationale and design issues
This interface is a bridge between a generic SSM and the TANGO Controls world. It makes it possible to take advantage of the TANGO Controls mechanisms for storing values and generating alarms correlated with the functional monitoring points of the TMC.
Virtualization Interface
Interface definition
The interface to the Virtualization Service is divided into three configuration layers (see Figure 26 of the Virtualization View for an overview). The first layer, named Physical Resource Layer, consists of all the hardware available to support LINFRA computation. This layer is internal to LINFRA and is managed according to computational requirements, while observing the need for maintenance and support tasks. The second layer, named Product Execution Layer, consists of a distributed environment (a set of several virtual machines) operating towards the provisioning of highly available products. The third layer, named Virtualized Resource (vResource) Layer, consists of the virtual machines, containers and other hardware that have a logical representation to the Product Execution Layer and are either part of a template or available to be used by future templates. The interface is constrained to exposing template actions externally according to the three layers above, while internally relying on a state-based architecture that controls the Virtualization Orchestrator and its managed templates and instances. The objective is to provide high availability, abstracting the underlying hardware infrastructure and allowing software-defined failover and horizontal scalability. Externally, requesting services only need to define the template for the virtualization service; all the internal state, and the resources allocated out of the provided resource pool, are managed by the Virtualization Service itself, which exposes only logging capabilities to the outside.
These Virtualization Service internals are defined in the Virtualization View (see 13.3.2).
Template Actions
Actions may be issued according to the present state. A template corresponds to one or several instances of vResources (see 10.3.6), according to a given deployment plan and an SLA (Service Level Agreement). Actions may be exposed to the rest of TM, through the Lifecycle Manager, or be available only internally to the virtualization management processes. Outside the computation infrastructure, management is driven by each template, with the instances of vResources managed internally and locally. Depending on the effective technology used for the vResources, some instance states may be missing. The definition of
templates is vital so that the Virtualization Service allocates and manages resources adequately, considering the overall SLA and the dependencies of all instances of a template. That is, a TM template is allocated only if all of its instances can be allocated. Also, scaling, failover and migration can be handled by the Virtualization Service, mitigating failures due to maintenance and hardware, as well as avoiding redundant implementations in all other components (at the server level, of course), according to the Context of the Virtualization View (see 13.3.2.2).
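As a sketch of this state-based control, the transition table below encodes required and triggered states taken from Tables 11 and 12; the Python class and its method names are illustrative assumptions, not the Virtualization Service API, and only the happy path (no ERROR outcomes) is modelled:

```python
# Required-state / triggered-state pairs for a subset of the template
# actions, transcribed from Tables 11 and 12 (happy path only).
TEMPLATE_TRANSITIONS = {
    "create":  ({"INITIALIZED"}, "ACTIVE"),
    "pause":   ({"ACTIVE"}, "PAUSED"),
    "suspend": ({"ACTIVE"}, "SUSPENDED"),
    "stop":    ({"ACTIVE", "ERROR"}, "STOPPED"),
    "resume":  ({"SUSPENDED"}, "ACTIVE"),
    "unpause": ({"PAUSED"}, "ACTIVE"),
    "start":   ({"STOPPED"}, "ACTIVE"),
    "reboot":  ({"ACTIVE", "STOPPED"}, "ACTIVE"),
    "delete":  ({"INITIALIZED", "PAUSED", "SUSPENDED", "STOPPED", "ERROR"},
                "DELETED"),
}

class Template:
    """Hypothetical state holder enforcing the allowed-action rules."""

    def __init__(self):
        self.state = "INITIALIZED"

    def get_state(self):
        # get_state supersedes any other action and is allowed in ANY state.
        return self.state

    def apply(self, action):
        required, triggered = TEMPLATE_TRANSITIONS[action]
        if self.state not in required:
            raise RuntimeError(f"{action} not allowed in state {self.state}")
        self.state = triggered
        return self.state
```

A failed action would instead leave the template in ERROR, from which only delete (and stop, per Table 11) can be issued.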
Table 11: Available actions for the Virtualization Service.
Action | Description | Required State | Triggered State | Visibility
get_state | Retrieve the present state of the template. This action supersedes any other existing at every level, for monitoring/debugging purposes. | ANY | NONE | EXTERNAL
create | Create a new Template instantiation. | INITIALIZED | ACTIVE, ERROR | EXTERNAL
set_admin_password | Specify the Template administrator password. May enter an ERROR state if the complexity requirements are not met. | ACTIVE | ACTIVE, ERROR | EXTERNAL
live_migrate | Moves Template instances between hardware computing units without powering off the instances in general, so the instances do not suffer downtime. Virtualization engineers may use this to evacuate a physical server that needs to undergo maintenance tasks. | ACTIVE | ACTIVE, ERROR | INTERNAL
suspend | Suspend a Template if it is infrequently used, or to perform system maintenance. The instance state is stored on disk: all memory is written to disk and the instances are stopped. Some technologies (e.g. LXC) may not allow this. | ACTIVE | SUSPENDED, ERROR | INTERNAL
pause | Stores the state of the Template instances in RAM. Paused instances continue to run in a frozen state. | ACTIVE | PAUSED, ERROR | INTERNAL
resume | Resume suspended instances. | SUSPENDED | ACTIVE, ERROR | EXTERNAL
unpause | Returns a paused Template back to an active state. | PAUSED | ACTIVE, ERROR | EXTERNAL
stop | Power off the Template and its instances. | ACTIVE, ERROR | STOPPED, ERROR | EXTERNAL
backup | Store the Template's current state to the general archive. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR | INTERNAL
rebuild | Remove all data on the Template and rebuild it according to the latest specification. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR | INTERNAL
scale | Change the Template resource allocation, number of instances and QoS parameters. | ACTIVE, STOPPED | ACTIVE, ERROR | INTERNAL
start | Power on the Template. | STOPPED | ACTIVE, ERROR | EXTERNAL
reboot | Soft or hard reboot the Template instances. A soft reboot attempts a graceful shutdown and restart of the instances (powering them down normally before rebooting). A hard reboot power cycles the instances (shuts them down immediately). | ACTIVE, STOPPED | ACTIVE, ERROR | EXTERNAL
delete | Power off the given Template first, then detach all resources associated to the instances (such as network and volumes), then delete the instantiation of the Template. | INITIALIZED, PAUSED, SUSPENDED, STOPPED, ERROR | DELETED | INTERNAL
Table 12: Allowed Actions according to the State
State | Description | Allowed Actions
INITIALIZED | Template was just created but nothing exists yet. | create
ACTIVE | Template is running. | pause, stop, snapshot, reboot
PAUSED | Template is paused and all instances are also paused. | unpause, delete
SUSPENDED | Template is suspended and all instances are suspended. | resume, delete
STOPPED | Template is not running. | snapshot, backup, rebuild, reboot, resize, rescue, start
DELETED | From a quota perspective, the Template no longer exists. Instances still running on compute will eventually be destroyed, and their disk images too. | -
ERROR | Some unrecoverable error happened; only delete is allowed to be called on the Template. | delete
Figure 19: TM Template Actions depending on the State.
vResource Internal Actions
Instances (of vResources, see 10.3.6) are a construct managed internally by TM.LINFRA. TM engineers will have the capability of accessing an instance directly, through the SSO mechanisms defined in the AAA policies. The instances are managed internally by the virtualization provider and are not available to outside users.
Table 13: vResources Internal Actions.
Action | Description | Required State | Triggered State
create | Create a new Template instantiation. | INITIALIZED | ACTIVE, ERROR
live_migrate | Moves instances between hardware computing units without powering off the instances in general, so the instances do not suffer downtime. | ACTIVE | ACTIVE, ERROR
soft_delete | See delete, but the deleted instance is not deleted immediately; instead it is put into a queue and deleted according to the operational policy. | ACTIVE, STOPPED | SOFT_DELETED, ERROR
suspend | Suspend an instance if it is infrequently used, or to perform system maintenance. The VM state is stored on disk: all memory is written to disk and the virtual machine is stopped. | ACTIVE | SUSPENDED, ERROR
pause | Stores the state of the instance in RAM. A paused instance continues to run in a frozen state. | ACTIVE | PAUSED, ERROR
restore | Restores a soft-deleted instance. | SOFT_DELETED | ACTIVE, ERROR
resume | Resume a suspended instance. | SUSPENDED | ACTIVE, ERROR
unpause | Returns a paused set of instances back to an active state. | PAUSED | ACTIVE, ERROR
stop | Power off the instance. | ACTIVE | STOPPED, ERROR
snapshot | Store the current state of the instance root disk, to be saved and uploaded back into the glance image repository. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR
backup | Store the instance's current state. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR
rebuild | Remove all data on the application server and replace it with a specified instance image. | ACTIVE, STOPPED | ACTIVE, STOPPED, ERROR
resize | Convert an existing application server to a different flavor, scaling the application server up or down. The original application server is saved for a period of time to allow rollback if there is a problem. | ACTIVE, STOPPED | RESIZED, ERROR
rescue | Start the application server in a special configuration whereby it is booted from a special root disk image, enabling an attempt to restore a broken guest system. | ACTIVE, STOPPED | RESCUED, ERROR
start | Power on the instance. | STOPPED | ACTIVE, ERROR
reboot | Soft or hard reboot an instance. A soft reboot attempts a graceful shutdown and restart of the instance; a hard reboot power cycles the instance. | ACTIVE, STOPPED, RESCUED | ACTIVE, ERROR
confirm_resize | See resize. | RESIZED | ACTIVE, ERROR
revert_resize | See resize. | RESIZED | ACTIVE, ERROR
delete | Power off the given instance first, then detach all resources associated to the instance (such as network and volumes), then delete the instantiation from the Template. | INITIALIZED, PAUSED, SUSPENDED, SOFT_DELETED, ERROR, RESCUED, STOPPED | DELETED, ERROR
unrescue | Reverse action of rescue; the instance spawned from the special root image is deleted. | RESCUED | ACTIVE
Error handling
Error handling exists at both the layer level and the interface level. At the interface level, errors are handled by the logging facilities described in the Virtualization View and by the states described in the Template Actions interface.
Rationale and design issues
● Completely abstract the hardware from the TM service, so that only LINFRA has to take into account hardware non-uniformity.
● Provide high-level access to all relevant hardware capabilities, to deduplicate the work needed by the TM services.
12 Prototypes
The TM SER prototypes are included in the TM Prototype report [RD4].
13 Appendix
Compliance statements for TM Service Requirements
The following table shows the analysis made for the compliance statements for the TM Service requirements.
Table 14: Compliance statements for TM Service Requirements
Id Addressed in
SER_REQ_1 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_2 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_3 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1), TM Monitor Prototype (see 11.3)
SER_REQ_4 Service C&C View (see 10.2)
SER_REQ_5 Service C&C View (see 10.2)
SER_REQ_6 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_6a Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_7 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_8a TM Health Status and State Analysis View (see 13.3.1)
SER_REQ_8b TM Health Status and State Analysis View (see 13.3.1)
SER_REQ_9 Allocation View (see 10.4)
SER_REQ_10 Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11a Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11b Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11c Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_11d Abstract Data Model (see 10.3), Service C&C View (see 10.2), Uses Module View (see 10.1)
SER_REQ_12 Service C&C View (see 10.2)
Detailed scenarios
This section contains some detailed scenarios that were not included in the use cases section (see 9).
Monitoring scenarios
Monitoring Resources
Name Monitor Resources
Description SER SSM monitors TMC/OSO resources, defined as any physical component of limited availability within a computer, such as CPU load, memory usage and so on.
Actor SER SSM
Pre-condition SER SSM has the configuration and authorization to access the resource monitoring data
Basic flow 1. SER SSM measures the performance of TMC/OSO resources
2. SER SSM compares current values against the nominal values for normal behaviour
3. SER SSM reports monitoring data and, in case of failure/error, raises an alarm
Alternative Flow
-
Post-condition
SER SSM receives and reports monitoring data
Monitoring Services (on network)
Name Monitoring Services (on network)
Description SER SSM monitors TMC/OSO services (defined as an application running at the network application layer, for instance TCP, HTTP, FTP and so on)
Actor SER SSM
Pre-condition The TMC/OSO service must be available from the network
Basic Flow 1. SER SSM periodically sends network packets (that is, requests the service) to the TMC/OSO component to be monitored
2. TMC/OSO receives the packets and replies to the SER request
3. SER SSM receives the acknowledgement from TMC/OSO
4. SER SSM compares the current acknowledgement against the nominal one for normal behaviour
5. SER SSM reports monitoring data and, if needed, raises an alarm
Alternative Flow
-
Post-condition
SER SSM receives and reports monitoring data
Asynchronous Monitoring Software component
Name Asynchronous Monitoring Software component
Description SER SSM monitors a software component in asynchronous mode and downloads monitoring data periodically.
Actor SER SSM Server, SER SSM Agent
Pre-condition SER SSM Server has the right configuration and authorization to access the SER SSM Agent. The remote host has the SER SSM Agent installed and working.
Basic flow 1. SER SSM Agent retrieves and collects data from the local machine
2. SER SSM Server connects to the SSM Agent and downloads the monitoring data
3. SER SSM Agent removes the data downloaded by the SER SSM Server
4. SER SSM Server compares current values against the nominal values for normal behaviour
5. SER SSM reports monitoring data and, if needed, raises an alarm
Alternative Flow
-
Post-condition -
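The asynchronous pull protocol above (the agent collects locally, the server downloads, and the agent discards delivered data) can be sketched as follows; all class and method names, and the threshold-based evaluation, are illustrative assumptions rather than the SER SSM API:

```python
class SSMAgent:
    """Collects samples on the remote host and hands them over on request."""

    def __init__(self):
        self._buffer = []

    def collect(self, name, value):
        self._buffer.append((name, value))

    def drain(self):
        # Step 3 of the flow: data handed to the server is removed locally.
        data, self._buffer = self._buffer, []
        return data

class SSMServer:
    """Periodically downloads agent data and compares it to nominal values."""

    def __init__(self, nominal):
        self.nominal = nominal  # name -> (low, high) acceptance range
        self.alarms = []

    def poll(self, agent):
        for name, value in agent.drain():
            low, high = self.nominal[name]
            if not (low <= value <= high):
                # In the real flow this would raise an alarm to SER.
                self.alarms.append((name, value))
```

The acceptance-range comparison stands in for whatever "evaluates current and nominal values" means in the chosen monitoring technology.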
Synchronous Monitoring Software component
Name Synchronous Monitoring Software component
Description A TMC/OSO application (remote process), through the SER SSM Agent, sends a synchronous message to SER
Actor SER SSM Server, SER SSM Agent
Pre-condition SER SSM Server has the right configuration and authorization to access the SER SSM Agent. The remote host has the SER SSM Agent installed and working.
Basic flow 1. A TMC/OSO application (remote process) sends a message to the SER SSM Agent
2. SER SSM Agent connects to the SER SSM Server and forwards the message
3. SER SSM Server receives the message
4. SER SSM Server evaluates the message and checks whether it reflects normal behaviour
5. SER SSM Server reports the monitoring message and, if needed, raises an alarm
Alternative Flow
-
Post-condition -
Sending alarm
Name Sending alarm
Description SER Fault Management sends an alarm to an operator.
Actor SER SSM Server, Operator, Fault Management
Pre-Condition Alarm is not handled by TMC/OSO
Basic flow 1. SER receives monitoring data that does not match normal behaviour
2. SER has no rule defined that matches the mismatching situation
3. SER sends an alarm to the operator
Alternative Flow -
Post-condition -
Fault management scenarios
Insert Recovery procedure
Name Insert Recovery procedure
Description A TMC/OSO Developer inserts a recovery procedure to handle a specific alarm
Actor TMC/OSO Developer, SER Administrator
Pre-condition
The corresponding alarm is defined
Principal flow
1. TMC/OSO Developer creates a procedure to handle a specific alarm
2. TMC/OSO Developer uploads the procedure into the Lifecycle Manager Repository
3. The SER Administrator tests the procedure and declares it runnable, or asks the TMC/OSO Developer to modify the procedure
Post-condition
Recovery procedure is stored in the Lifecycle Manager Repository
Alarm notification
Name Alarm notification
Description TMC/OSO Developer stores rules to send alarms
Actor TMC/OSO Developer, SER Administrator
Pre-Condition
The alarm does not exist
Principal flow 1. TMC/OSO Developer defines rules to send an alarm
2. TMC/OSO Developer uploads the rules into the Software System Monitor
3. The SER Administrator checks the rules and declares them correct, or asks the TMC/OSO Developer to modify them
Post-condition
A new alarm condition is stored
Lifecycle Management Scenarios
Configure/Start Application
Name Configure/Start Application
Description Start a TM application in a remote host or locally
Actor Administrator
Pre-condition
The administrator must have the right to start the application; the application has not been started or, if it has, the policy allows multiple instances of it.
Principal flow
1. Log in to TM.SER
2. Check the user's right to start the chosen application
3. Check the application start-up policy if the application is already started
4. Configure the application, or check that the correct configuration is loaded
5. Start the application
6. Start the corresponding monitoring activities, if available
7. Test the application
Post-condition
The application is started
Kill Application
Name Kill Application
Description Stop an application in a remote host or locally
Actor Administrator
Pre-condition The administrator must have the right for stopping an application. The application has started.
Principal flow 1. Log in to TM.SER
2. Check the user's right to kill the chosen application
3. Stop the corresponding monitoring activities, if available
4. Kill the application
Post-condition The application is stopped
Restart Application
Name Restart Application
Description Restart a specific instance of application in a remote host or locally
Actor Administrator
Pre-condition The user must have the right for restarting the application. The application has started.
Principal flow 1. Log in to TM.SER
2. Check the user's right to restart (kill and start) the application on the specific host
3. Stop the corresponding monitoring activities, if available
4. Kill the application
5. Configure/reconfigure the application
6. Start the application
7. Start the corresponding monitoring activities, if available
8. Test the application
Post-condition
The application is started
Add Application Version
Name Add Application Version
Description Add a specific version of an application into the system
Actor Administrator
Pre-condition Version not present
Principal flow 1. Log in to TM.SER
2. Check the user's right to add a new version of the application
3. Check that the version is not already in the system
4. Add the version entry
Post-condition Version present
Remove Application Version
Name Remove Application Version
Description Remove a specific version of an application from the system
Actor Administrator
Pre-condition Version present
Principal flow 1. Log in to TM.SER
2. Check the user's right to remove a version of the application
3. Check that the version is in the system and that it is offline; if it is online, set the application offline (see 4.2.7)
4. Remove the version entry
Post-condition
Version not present
Set on line Application version
Name Set on line Application version
Description Set a specific version available to be used by a User
Actor Administrator
Pre-condition Version off-line
Principal flow 1. Log in to TM.SER
2. Check the user's right to set a version of the application online
3. Check that the version is offline
4. Configure/reconfigure the application
5. Start the application
6. Start the corresponding monitoring activities, if available
7. Test the application
8. Set the specified version online
Post-condition Version on line
Set off line Application version
Name Set off line Application version
Description Set a specific version not available to be used by a User
Actor Administrator
Pre-condition Version on line
Principal flow 1. Log in to TM.SER
2. Check the user's right to set a version of the application offline
3. Check that the version is online and not used by any operator
4. Check the user's right to kill the chosen application
5. Stop the corresponding monitoring activities, if available
6. Kill the application
7. Set the specified version offline
8. Uninstall application
Post-condition Version off-line
Update Application
Name Update Application
Description Update a TM application without removing or adding a specific version. This case can happen if, for instance, there is a bug (or a specific user request) and there is the need to solve it as soon as possible.
Actor Administrator
Pre-condition
-
Principal flow
1. Log in to TM.SER
2. Check the user's right to update an application
3. Check that the version is online and not used by any operator; otherwise stop
4. Kill the application
5. Install/update the version
6. Configure/reconfigure the application
7. Start the application
8. Start the corresponding monitoring activities, if available
9. Test the application
Post-condition
-
List Applications
Name List Applications
Description List the applications available for the user
Actor User
Pre-condition -
Principal flow 1. Log in to TM.SER
2. List the applications
Post-condition -
Use Application
Name Use Application
Description Link the user to the right version of the application they want to work with
Actor User
Pre-condition -
Principal flow 1. Log in to TM.SER
2. List the applications available for the user
3. Choose an application from the list
4. Use the application
Post-condition -
Logging scenarios
Store Log
Name Store Log
Description A TMC/OSO Application stores a log
Actor TMC/OSO Application
Pre-condition The Logging service is active and the logging priority is high (for instance, there is a need to investigate a behaviour).
Principal flow (with log server connection)
1. TMC/OSO Application creates a log packet defining log details and log information
2. TMC/OSO Application sends log packet to SER Logging Service 3. Log Service stores log packet
Alternative flow (without log server connection)
1. TMC/OSO Application creates a log packet defining log details and log information
2. TMC/OSO Application maintains the log packet until the connection with SER Logging Service returns
3. TMC/OSO Application sends log packet to SER Logging Service 4. Log Service stores log packet
Post-condition Log is stored
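The two flows above amount to a store-and-forward scheme: the application buffers log packets while the SER Logging Service is unreachable and flushes them once the connection returns. A minimal sketch, with illustrative names and a pluggable send callable standing in for the network transport:

```python
class LoggingClient:
    """Hypothetical client side of the Store Log scenario.

    Packets are kept in order and only discarded once delivery succeeds,
    matching the alternative flow (no log server connection).
    """

    def __init__(self, send):
        self._send = send   # callable delivering one packet; may raise
        self._pending = []  # packets kept until the connection returns

    def log(self, details, message):
        # Step 1: create a log packet with log details and log information.
        packet = {"details": details, "message": message}
        self._pending.append(packet)
        self.flush()

    def flush(self):
        # Steps 2-3: send pending packets oldest first; on connection
        # failure keep everything and retry on the next call.
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return
            self._pending.pop(0)
```

In a real deployment the pending buffer would be bounded and persisted; that policy is out of scope here.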
Search Log
Name Search Log
Description A maintainer or developer needs to investigate a certain behaviour of the TMC/OSO Application
Actor A TMC/OSO (or TM) Developer (or Maintainer)
Pre-condition The Developer has the permission to search a log
Principal flow 1. The Developer logs in to the TM.SER Logging Service
2. The Developer searches for log messages using key search values (for instance, a word)
3. The Logging Service provides a list of matching log messages
Post-condition
-
Extract Log File
Name Extract Log File
Description A maintainer or developer needs to investigate a certain behaviour of the TMC/OSO Application
Actor A TMC/OSO (or TM) Developer (or Maintainer)
Pre-condition The Developer has the permission to extract a log
Principal flow 1. The Developer logs in to the TM.SER Logging Service
2. The Developer searches for log messages using key search values
3. The Developer downloads the log files they need
Post-condition
-
Other views
TM Health Status and State Analysis View
The present view is intended to satisfy the following requirements:
ID Name Description Source Verification method
SER_REQ_8a Aggregate and Report TM Health Status
The SER shall aggregate the TM internal status and report it to the Operator in a structured health view based on the TM PBS
TM_REQ_211 Demonstration
SER_REQ_8b Manage TM State
The SER shall manage the TM state (by sending signals for state transitions), which can assume, among others, the following values: start-up, shutdown, standby and operational. The possible state transitions are: 1. from standby to start-up; 2. from start-up to operational; 3. from operational to shutdown; 4. from shutdown to standby.
TM_REQ_201 TM_REQ_202 TM_REQ_342 TM_REQ_385 TM_REQ_386 TM_REQ_387
Demonstration
From the requirement analysis, there is a need for two distinct aggregated values expressing the TM State and the TM Health Status, respectively. The first is an enumerated value that has to contain at least the following values: start-up, shutdown, standby and operational. The second is a performance indication that must be structured so that an Operator can understand it (according to the TM PBS).
This view represents an analysis of a simple, systematic approach to the definition of an aggregation method and performance metrics.
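The four state transitions listed in SER_REQ_8b form a simple cycle; they can be sketched as a transition check applied before a state-change signal is sent (the function and table names are illustrative assumptions, not part of the SER interface):

```python
# The only legal successor of each TM state, per SER_REQ_8b.
TM_TRANSITIONS = {
    "standby": "start-up",
    "start-up": "operational",
    "operational": "shutdown",
    "shutdown": "standby",
}

def transition(state, target):
    """Validate a requested TM state transition before sending the signal."""
    if TM_TRANSITIONS.get(state) != target:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

The requirement says "among the others", so a real implementation would likely carry additional states; only the four mandated transitions are encoded here.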
Primary Presentation
Figure 20: Health status calculation.
Figure 21: Mathematical representation.
Element Catalog
Elements
Element Description Figure
i Monitoring point index (within a critical group). -
j Critical grouping index. It identifies the critical group (in Figure 21 j=2 for G2).
-
Mi,j Monitored item (see 13.3.1.3); for example, a process, an application, a CPU and so on. It is critical if its failure corresponds to a TM failure.
Figure 20
Gj Critical grouping, that is, a set of non-critical items that together form a critical item (for instance, a set of applications, or a set of monitoring points forming a server). It is composed of the minimum number of non-critical items whose collective failure corresponds to a TM failure; it becomes critical if and only if all composing items fail.
Figure 20
si,j State of the monitored entity corresponding to the i-th monitoring point inside the j-th critical group
-
(si,j) = 1 if si,j is an operating state
(si,j) = 0 if si,j is a non-operating state (for example, standby, switched off, and so on)
Figure 20
wi,j Weight of the i-th monitoring point inside the j-th critical group. It is a measure of the importance of the item in the TM (0 ≤ wi,j ≤ 1). Their values are the natural outcome of the FMECA analysis.
Figure 20
hi,j Normalized value of each monitoring point (0 ≤ hi,j ≤ 1). It can be thought of as a sort of "health status of Mi,j".
Figure 20, Figure 21
aj Aggregated health status for the j-th critical group. Figure 20, Figure 21
t Time at which the monitoring point values have been acquired and the system health computed and provided as an aggregated value.
Figure 21 as ‘time’
Pin Starting time of the aggregation process. It can also be thought of as t − Δt, Δt being the duration of the aggregation process.
Figure 21
Pout Ending time of the aggregation process. It can also be thought of as t. Figure 21
nj Number of components of the j-th critical group -
∑ It represents the mathematical operation expressed in Eq. 5 Figure 21
∏ It represents the mathematical operation expressed in Eq. 2 Figure 21
Interfaces
Interface Document link
SSM-Monitoring Activities SSM - Monitoring Activity Interface at 11.2
Context Diagram
This view is an application of the current architecture, and its context is the same as that of the monitoring system (see 10.3.4.4). In addition, it is important to consider that a generic monitoring activity will produce either a state value or a monitoring point value. Therefore, for each TM application (aka monitored item) it must be possible to consider at least one performance value and one state value. The following diagram summarizes this.
Figure 22: For every TM Process there will be at least two monitoring activities to retrieve the state and a
measure of the performance.
Related View
Uses Module View, 10.1 Service C&C View, 10.2 Abstract Data Model, 10.3
Rationale
Health Status
The classical approach to RAM modelling has been followed, considering the fundamental difference between:
critical items, that is, those whose failure causes a failure of the overall system;
non-critical items, that is, those whose failure causes a degradation of the overall system, without blocking it;
and by also considering the fundamental concept of a specific grouping of non-critical items, which can become a critical (higher-level) group. According to this concept, a critical group is composed of the minimum number of (non-critical) items whose collective and simultaneous failure causes a failure of the whole system. For such a group, degraded behaviour is caused by the degraded performance of some or all of the components, or even by the failure of some (but not all) of them. Note, in addition, that this concept can be extended to the concept of a generalized critical group, which also includes:
1. critical groups composed of single critical items
2. critical groups composed of non-critical items which, however, will never fail together and simultaneously (so the groups are virtually critical, but never actually critical)
3. critical groups composed of non-critical items whose collective and simultaneous failure does not produce a failure of the overall system in any case. These groups can be built as essentially composed of N−1 real non-critical items, plus an N-th virtual item, properly added to the group and defined as "never failing": in this way the group will never have all items failing simultaneously, and will be equivalent to the groups mentioned in point (2) above.
It is therefore possible to define an aggregation procedure in terms of generalized critical groups only. We define aj, the aggregated health status of the j-th generalized group Gj (composed of nj lower-level items), as the weighted average of the health statuses hi,j (scalar quantities with 0 ≤ hi,j ≤ 1) of all the lower-level items Mi,j (with i ranging from 1 to nj) of which the group is composed:
(Eq. 1)   a_j = \sum_{i=1}^{n_j} w_{i,j} h_{i,j}   with   \sum_{i=1}^{n_j} w_{i,j} = 1
It is easy to prove that 0 ≤ aj ≤ 1, too. It is also quite evident that equation (1) becomes trivial when a single critical component is considered: if nj = 1, then w1,j = 1 and aj = h1,j. With these principles in mind, we can represent the correct operation of the system as a sort of process, starting at an initial time Pin and ending at a final time Pout by passing over a number of serial steps, each one representing the operation of a generalized critical group Gj, with j ranging from 1 to the number of generalized critical groups N. Critical components are reported as a series, in analogy with electrical engineering, where a circuit is open as soon as a single serial component is broken. It can be noted, incidentally, that such a scheme could also make it possible to define the performance degradation of the system, which occurs as soon as a single serial component shows a degraded behaviour (that is, performance below normal).
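As a minimal sketch, the weighted average of equation (1) could be computed as follows (the function name and the example values are illustrative only):

```python
def aggregated_health(healths, weights):
    """Aggregated health status a_j of a generalized critical group (Eq. 1).

    healths -- list of h_ij values, each in [0, 1]
    weights -- list of w_ij values, normalized so that they sum to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * h for w, h in zip(weights, healths))

# Two redundant, equally weighted items: one healthy, one degraded.
a_j = aggregated_health([1.0, 0.5], [0.5, 0.5])  # 0.75
```

For a single critical item the call reduces to the trivial case noted above: `aggregated_health([h], [1.0])` simply returns h.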
The overall aggregated health status aTOT is defined as a modified N-dimensional geometrical average of all the serial (that is, critical) health statuses aj:

(Eq. 2)   a_{TOT} = \prod_{j=1}^{N} a_j^{c/N}
This simple equation indeed meets all the fundamental requirements for aTOT:

1. 0 ≤ aj ≤ 1 ⇒ 0 ≤ aTOT ≤ 1
2. aTOT = 0 ⇒ ∃ j : aj = 0
3. aTOT = 1 (fully operational) ⇔ aj = 1 ∀ j

The use of a geometrical average, instead of a simple product, is dictated by the need to make the aggregation formula scalable as the number of components varies. Since each aj is less than 1, a simple product would produce very low values for aTOT as N becomes significant, making the overall value hard to handle. On the other hand, a generic factor c, called the severity coefficient, is introduced in the exponent; it allows the overall value to be enhanced or mitigated with respect to the input values. It is easy to check that for c > 1 or c < 1 the overall health status takes lower or higher values, respectively, than for c = 1. The value of c can be chosen according to a defined policy for the aggregated status.

Equations (1) and (2) do not, however, take into account the state si,j of the item Mi,j. The state cannot be neglected: a critical item Mi,j could be in a non-operating state (for example, standby or switched off) and therefore return aj = hi,j = 0 even if healthy, producing the false value aTOT = 0 with unwanted consequences. In such a situation, the value aj (critical) or hi,j (non-critical) should simply be excluded from the computation. The procedure to do so is, however, very different depending on whether the item Mi,j is critical or not. Since aj appears in a product, it can be excluded by setting it equal to 1. Since hi,j appears in a weighted average, it can be excluded by simply setting the corresponding weight to zero (wi,j = 0) and re-balancing the other weights so that the normalization condition is still fulfilled. Let us define a function σ of the state si,j in the following way:
σ(si,j) = 1 if si,j is an operating state
σ(si,j) = 0 if si,j is a non-operating state (for example, standby, switched off, and so on)

The expression for the health aj of a critical single item, to be inserted into definition (2), is then

(Eq. 3)   state-affected critical single-item health = aj + (1 − aj)[1 − σ(si,j)]

where it can be verified that aj assumes its normal value if σ(si,j) = 1, or the value aj = 1 if σ(si,j) = 0. The expression for the weight wi,j of a non-critical item, to be inserted into equation (1), is instead
state-affected non-critical weight = σ(si,j) wi,j

all other wk,j (k ≠ i) being changed to preserve normalization. A series of simple steps can lead to a more general expression for equation (1). First of all, let us consider the case in which a k-th item is in a non-operating state. This means forcing wk,j = 0 and therefore changing the normalization condition as

\sum_{i=1}^{n_j} w_{i,j}^{(new)} = \sum_{i \ne k} w_{i,j} = \sum_{i=1}^{n_j} w_{i,j} - w_{k,j} = 1 - w_{k,j} \ne 1
This condition can be restored, however, by re-computing the weights as follows:

w_{i,j}^{(new)} = \frac{w_{i,j}}{1 - w_{k,j}}

so that

\sum_{i=1}^{n_j} w_{i,j}^{(new)} = 1
(Note that the range of i values is unchanged, since a null value of wk,j produces a null re-computed weight and can therefore be kept inside the sum.) In general, if there are N0 items in a non-operating state, the weight distribution will be changed as

w_{i,j}^{(new)} = \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{N_0} w_{k,j}}

A general expression can then be found by noticing that the sum in the denominator can be written over the full range of values of k (1 to nj) in a different form:

w_{i,j}^{(new)} = \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{n_j} [1 - \sigma(s_{k,j})] w_{k,j}}
Therefore, by taking into account the state of each item, equation (1) can be rewritten in the more general form:

(Eq. 4)   a_j = \sum_{i=1}^{n_j} \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{n_j} [1 - \sigma(s_{k,j})] w_{k,j}} h_{i,j}   with   \sum_{i=1}^{n_j} w_{i,j} = 1
It is easy to verify that equation (4) returns to equation (1) when all σ(si,j) = 1 within the group Gj.

What happens, however, if all σ(si,j) within the group Gj are zero? In this case equation (4) would produce aj = 0 and therefore lead to a false result for equation (2). Essentially, even after excluding all non-critical Mi,j, equation (4) does not allow the whole group Gj to be excluded from the computation.
To solve this final issue, let us introduce a second function 𝜂(x) defined as follows:
η(x) = 1 if x = 1
η(x) = 0 if x > 1

Then equation (4) can finally be modified by introducing an additional term, which returns aj = 1 in case all σ(si,j) are zero, or the value of aj given by equation (4) in case at least one σ(si,j) is non-zero:
(Eq. 5)   a_j = \sum_{i=1}^{n_j} \left( \frac{\sigma(s_{i,j}) w_{i,j}}{1 - \sum_{k=1}^{n_j} [1 - \sigma(s_{k,j})] w_{k,j}} \right) h_{i,j} + \eta\!\left(1 + \sum_{i=1}^{n_j} \sigma(s_{i,j})\right)   with   \sum_{i=1}^{n_j} w_{i,j} = 1
Equation (2) can therefore be written in its complete and general form to include both single critical items and critical groups as follows:
(Eq. 6)   a_{TOT} = \prod_{j=1}^{N} \left[ a_j + \eta(n_j)\,(1 - a_j)\,(1 - \sigma(s_{1,j})) \right]^{c/N}
In conclusion, an aggregated health status is completely defined by equations (5) and (6). It appears evident that the basis of the computation is the set of health statuses hi,j together with their associated weights wi,j. The definition of the values to be associated with the health statuses is clearly a matter of performance metrics. Generally speaking, the range [0, 1] can be decomposed into a number of adjacent, strictly separated intervals, each one representing a well-defined status. For example, one could imagine that

0 ≤ hi,j (or aj) ≤ 0.1 ⇒ faulty
0.1 < hi,j (or aj) < 0.8 ⇒ degraded performance
0.8 ≤ hi,j (or aj) ≤ 1 ⇒ fully operating

or make a different choice, increasing the 'resolution' as follows:

0 ≤ hi,j (or aj) ≤ 0.1 ⇒ faulty
0.1 < hi,j (or aj) ≤ 0.4 ⇒ severely degraded performance
0.4 < hi,j (or aj) ≤ 0.7 ⇒ normally degraded performance
0.7 < hi,j (or aj) ≤ 0.8 ⇒ slightly degraded performance
0.8 < hi,j (or aj) ≤ 1 ⇒ fully operating

and so on. The values for hi,j could even be discrete, the simplest assumption being 0 (faulty), 0.5 (degraded) and 1 (fully operating). These values can be inserted into equation (6) as well, producing floating values for both aj and aTOT, without invalidating these equations. A performance metric should be the result of a detailed performance analysis carried out on the final system, which in turn is closely related to the dependability analysis (FMECA, Fault Tree, and so on). It is essentially out of the scope of the system design at the Pre-Construction Phase.
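As an illustration of how equations (5) and (6) could be combined in software, the following Python sketch computes a state-affected group health and the overall aggregated value. The group contents, weights and severity coefficient are hypothetical; for simplicity, every aj (including single critical items) is computed through equation (5), which yields the same aj = 1 exclusion for non-operating items as the η-term of equation (6).

```python
def sigma(state):
    # sigma(s) = 1 for an operating item, 0 for a non-operating one
    return 1 if state == "operating" else 0

def eta(x):
    # eta(x) = 1 if x == 1, 0 if x > 1
    return 1 if x == 1 else 0

def group_health(items):
    """State-affected group health a_j (Eq. 5).

    items -- list of (h, w, state) tuples, with the weights w summing to 1.
    """
    denom = 1 - sum((1 - sigma(s)) * w for _, w, s in items)
    total = 0.0
    if denom > 0:
        total = sum(sigma(s) * w / denom * h for h, w, s in items)
    # If every item is non-operating, the eta term forces a_j = 1,
    # excluding the whole group from the computation.
    return total + eta(1 + sum(sigma(s) for _, _, s in items))

def overall_health(groups, c=1.0):
    """Overall aggregated health a_TOT as the geometrical average of Eq. (6)."""
    n = len(groups)
    a_tot = 1.0
    for items in groups:
        a_tot *= group_health(items) ** (c / n)
    return a_tot
```

For instance, a two-item group with one item in standby re-balances the remaining weight exactly as equation (5) prescribes, and a group entirely in standby contributes a neutral factor of 1 to the product.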
Dependability analysis is also the only way to compute the weights wi,j. In addition, since every group Gj is essentially identified by the corresponding set of weights wi,j, we can conclude that the dependability analysis will automatically also define the criteria for the specific grouping of non-critical items. Finally, it is certainly worth noting that the proposed approach makes the aggregation metric only weakly sensitive to the hierarchical organization of the system, since it is based on critical and non-critical items only. Internal dependencies of the system organization are reflected in the weights wi,j only.
13.3.1.5.1.1 Example
According to the components of the Telescope Manager Test Configurations highlighted in [AD8], consider the following TM applications:
1. Observation Execution Tool
2. Central Coordinator
3. EDA

Suppose that each of them is running on two redundant servers (virtual or not does not change the essence of the present example). Consider also the following parameters for the TM applications:

1. CPU usage (double-precision number from 0, no use, to 1, all CPU used),
2. Memory usage (double-precision number from 0, no use, to 1, all memory used).

And consider the following parameters for the two servers:

1. CPU usage (double-precision number from 0, no use, to 1, all CPU used),
2. Memory usage (double-precision number from 0, no use, to 1, all memory used),
3. Hard disk space usage (double-precision number from 0, no use, to 1, all disk space used).
Figure 23 shows how the mathematical representation is adapted for this specific example and indicates the monitoring activities to build, that is, the scripts (or software modules, according to 10.1) that calculate the health statuses aOET, aCC, aEDA, h1,SERVER_OET1, h2,SERVER_OET2, h1,SERVER_CC1, h2,SERVER_CC2, h1,SERVER_EDA1, h2,SERVER_EDA2.
Figure 23: Adapted from Figure 21.
There are endless possibilities to calculate those values: aOET could assume the three basic discrete values according to a look-up table based on CPU and memory usage, in the following way:

CPU ≥ 95% and Memory ≥ 95% ⇒ aOET = 0 ⇒ faulty,
55% ≤ CPU < 95% and 55% ≤ Memory < 95% ⇒ aOET = 0.5 ⇒ degraded performance,
CPU and Memory < 55% ⇒ aOET = 1 ⇒ fully operating.

Note that:
this option for the calculation was described in 13.3.1.5.1,
it is possible to create many different kinds of performance metrics, with the only constraint that the resulting health status must be a number between 0 and 1,
the servers are redundant, so the weights wi,j = 0.5 ∀ i, j.
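The look-up-table metric above could be sketched as a small script (the function name is illustrative; the behaviour for mixed cases not covered by the table, for example high CPU with low memory, is an assumption here and defaults to the degraded value):

```python
def oet_health(cpu, mem):
    """Discrete health status a_OET from CPU and memory usage (fractions 0..1).

    Thresholds follow the example look-up table: both at or above 95% is
    faulty, both below 55% is fully operating, anything else is degraded
    (the handling of mixed cases is an assumption, not part of the table).
    """
    if cpu >= 0.95 and mem >= 0.95:
        return 0.0    # faulty
    if cpu < 0.55 and mem < 0.55:
        return 1.0    # fully operating
    return 0.5        # degraded performance
```

The resulting value stays within [0, 1] and can therefore be fed directly into the aggregation of equations (5) and (6).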
Another possibility is computing a floating value, ranging between 0 and 1, for the selected component:
a_{CC} = \frac{(1 - P_{CPU}) + (1 - P_{MEM})}{2} = 1 - \frac{P_{CPU} + P_{MEM}}{2}
where PCPU and PMEM represent the percentage of CPU and memory used by that component, respectively. Once all the health statuses of Figure 23 are defined, the SSM automatically allows the Operator to check the overall health status of the considered applications. Figure 24 shows the final calculation for a degraded system and the (one-level-only) drill-down capability that shows the Operator the degraded performance of the OET.
Figure 24: Aggregated health status drill-down possibility and calculation.
It is very important to remark that the final aggregated health status cannot have labels, because this could lead to wrong conclusions. Although its absolute value can be interesting, what matters are its variations. The resulting value is indeed a performance value that a trained Operator can take as input in order to perform control actions, if required. In general, defining more aggregation levels, as shown in Figure 25, provides a full drill-down capability that allows the Operator to display the components responsible for the degraded performance.
Figure 25: Drill-down with aggregation levels.
A very important application of the present architecture is the Trend Analysis that can be performed on the system at any level. By looking at the values assumed by aTOT (and the drilled-down aj) over time, it is possible to predict faulty behaviours of subsystems, components or even the whole system, and to intervene in order to prevent them and preserve system availability. On the other hand, in case an alarm is raised at time t0, the trend of the aggregated health status up to the alarm should make it possible to find the correspondence with the specific event and to see whether there are, for instance, hidden dependencies.
TM State
While it is possible to aggregate a value for the TM Health Status (aka TM performance), the same does not appear true for the TM State. This is because the TM State defines an enumerative set of at least 4 values (start-up, shutdown, standby and operational) which cannot be ordered. Unless more complex definitions of the states are provided in the future, it appears that for any two different states sa and sb it is not possible to say that one and only one of the following relations is valid (numerical order rule): sa > sb, sa = sb, sa < sb; nor that one and only one of the following relations is valid (set inclusion order rule): sa ⊃ sb, sa ⊂ sb. As a consequence, an aggregation of states, at least in terms of algebraic rules, aimed at keeping or understanding the behaviour of the entire system, is not possible. The states can be managed with logical rules only.
Part of these rules are actually defined in the TM Requirements, while others can be derived from them. According to the requirement analysis we can indeed say that:

TM is in Operational state when all its sub-systems are ready for operational use (TM_REQ_201);
TM is in Standby state when all its sub-systems are shut down or in standby state (the standby state does not mean the power is down; the SER is still able to start the system up, TM_REQ_386);
The shutdown and start-up states are temporary states.

Currently the requirements define only four states, with clear logical aggregation rules for them. Therefore there should be no need to store this information for the TM SER. However, in case the number of states is increased in the future, it could become appropriate to store them in 10.3.
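A sketch of such logical aggregation rules follows (the state names mirror the four states above; reporting any other mixture as a transient state is an assumption of this sketch, not a requirement):

```python
def aggregate_tm_state(subsystem_states):
    """Derive the TM state from the sub-system states using logical rules only.

    Per TM_REQ_201, TM is Operational when all sub-systems are operational;
    per TM_REQ_386, TM is Standby when all sub-systems are standby or shut
    down. Any other mixture is reported here as transient (an assumption).
    """
    states = set(subsystem_states)
    if states == {"operational"}:
        return "operational"
    if states <= {"standby", "shutdown"}:
        return "standby"
    return "transient"
```

Note that, unlike the health statuses, no arithmetic is performed: the rules are purely set-membership checks.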
Virtualization View
This view describes the flow and the interaction of Services with the virtualization service provided by the LINFRA team.
Primary Presentation
This view is built on a virtualization orchestrator and a management system for the virtualization service, to ensure high-availability requirements, deduplication of effort by the different TM Services, and uniform hardware access. This allows the virtualization service to respond to the computing requirements of the different parts of the Telescope Manager, permitting full utilization of the available physical computing resources, and is therefore a fundamental module for enforcing the demanding high-availability requirements of TM services. The virtualization service also frees the other parts of the Telescope Manager from having to deal with different machines, since they will all be exposed under a unified interface by the Virtualization View, which implements this virtualization layer above the physical computing resources. Furthermore, it avoids duplication and reduces the complexity of part of the low-level requests pertaining to hardware access, since they will be handled directly by the virtualization service.

The proposed architecture accommodates several paradigms of computation and data acquisition as well as different technology stacks; still, it was built using the present working models of some existing projects and relies on off-the-shelf technology stacks. We therefore introduce a Virtualized Resource Layer, representing the Orchestrator, networking, storage and monitoring, based upon Heat [RD17], Neutron [RD29] and Cinder [RD30]; a Product Execution Layer, representing the TM App, based upon OpenStack itself; and a Physical Resource Layer, representing the Container, based upon Docker [RD31] or KVM [RD32]. There will be a hardware redundancy of the machines which, according to the best practices described in the RAM and ILS Report [AD6], should be 10%.

This redundancy makes it possible to provide an active-active failover system for all the virtualization services, enabled by the highly distributed nature of all their components and the agent-based nature of Heat. This architecture is the most adequate for a large-scale computing project: if the redundancy of the hardware available for Virtualized Resources is assured, then all TM Apps will continue working (though with more constrained computing resources) even if the computing hardware assigned to them suffers any failure.
Figure 26: Overall Presentation of Execution Environment.
Figure 26: an orchestrator manages access to monitoring, storage and networking, while being in charge of starting and allocating resources to the various TM Apps as needed and according to the priorities defined by the different TM Apps and services. The Execution platform manages the communication and hardware resources of each individual TM App and its allocated containers.
Figure 27: Template and Instance presentation.
Figure 27: The Virtualization Service will be based on a three-tier template definition. At the machine level there is the Physical Resource Template, which consists of all the hardware available to support computation in LINFRA. At the TM App level there is the Product Execution Layer, consisting of a distributed environment operating towards the provisioning of highly available products. And at the Orchestration level there is the Virtualized Resource Layer, consisting of the virtual machines, containers and other hardware that have a logical representation to the Product Execution Layer and are part of a template or available to be used by future templates.
Element Catalog
Elements
Element Description Shown
Orchestrator The Orchestrator component manages templates and instance resource allocation; it is the entry point for the Virtualization Service
Figure 26 Figure 27
Monitoring Receives the monitoring signals from the different virtualization components and logs them appropriately.
Figure 26 Figure 27
Storage Persistent storage for the different virtualization components. Figure 26 Figure 27
Network Outside networking control for all the layers (definition in 11.4) running under the orchestrator.
Figure 26 Figure 27
Virtual Machine A container or a VM. Figure 26 Figure 27
TM App A generic application running on one or more Virtual Machines. Figure 26 Figure 27
Template A TM Template (see [RD17]) composed of a set of instances (application servers, VMs or containers) with a Service Level Agreement (SLA), a user Access Control List (ACL) and a network ACL. The Virtualization Service manages the resources of the entire set so that it matches the SLA, and implements basic failover mechanisms.
Figure 26
Product Execution Platform
Platform managing the distributed execution of TM applications in virtualized environments.
Figure 26 Figure 27
Hardware Hardware computing resource. Figure 26 Figure 27
Virtualized Resource Template
Template defining virtual machines, containers and other hardware that has a logical representation to the Product Execution Layer (definition in 11.4).
Figure 27
Product Execution Template
Template defining the distributed Virtual Machine environment and its networking, storage and monitoring accesses used to deliver a TM App.
Figure 27
Product Execution Platform
Distributed environment (set of several virtual machines) operating towards the provisioning of highly available products.
Figure 26 Figure 27
Physical Resource Template
Template defining the Virtual Machine setup and its requirements that are available as physical machine resources (that is, a Dockerfile, [RD33]).
Figure 27
Interfaces
The interface is constrained to exposing template actions externally, while internally relying on a state-based architecture that controls the Virtualization Orchestrator and its managed templates and instances. Externally, the requesting services need only define the template for the virtualization service; all the internal state, and the resources allocated out of the provided resource pool, are managed by the virtualization service itself. The interface to the Virtualization Service is divided into three configuration layers. See figure 1 for an overview. The interface is described in greater detail at 11.4.
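As a rough, purely illustrative sketch (none of the field names below come from the actual interface, which is defined in 11.4), a template request could carry the elements listed in the catalogue above, namely a set of instances, an SLA, a user ACL and a network ACL:

```python
# Hypothetical shape of a TM template request; the real interface is
# defined in 11.4 and would map onto Heat templates.
tm_template = {
    "name": "tm-app-example",
    "instances": [
        {"kind": "container", "image": "tm-app", "replicas": 2},
    ],
    "sla": {"availability": "active-active"},
    "user_acl": ["tm-operators"],
    "network_acl": ["tm-internal"],
}

def validate_template(template):
    """Check that a template carries the catalogue elements (instances, SLA, ACLs)."""
    required = {"name", "instances", "sla", "user_acl", "network_acl"}
    return required <= set(template)
```

The point of the sketch is only that the requesting service describes the whole set declaratively; the allocation and the internal state remain hidden behind the Virtualization Service.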
Behaviour
Figure 28: Activity diagram for managing a template for a TM App.
Figure 28 notes:

1. The containers validate the correct startup processes, allocation of computing resources and networking.
2. The scripting defines a template; once the template finishes computing the deliverable, the orchestrator frees the computing resources.
3. The TM App is the reference for the template needed, as defined in the Abstract Data Model.
4. The TM App State is the state of the running template defined in the TM Template State table.
5. The TM Instance State uses the same states defined in the TM Template State table.
Context Diagram
The context of the present view of the TM SER can be summarized in the following diagram:
● Domain/Business Layer: functional monitoring and controlling of the business logic performed by each application,
● Services Layer: monitors and controls processes on a generic level (not functionality), such as web services, database servers and custom applications,
● Infrastructure Layer: monitors and controls virtualization, servers, OS, network and storage.
An important consideration concerns a possible failover mechanism. Failover is an automatic action to recover from a specific situation and can happen at different levels. The boundaries for the levels are:

1. if the failover is needed at the application-server level, then it is a responsibility of the Virtualization;
2. if it can be solved with a lifecycle action, then it is a responsibility of the Service;
3. otherwise it is at the level of the TANGO facility (for instance, in case of a capability transfer) and it is a responsibility of the M&C Module.
Rationale
● Ensure the availability of the entire TM (and the availability of the TM SER) by dynamic allocation of compute resources (see also [AD6]).
● By defining only a virtualization template for each service, there is no need to repeat work by managing each individual instance and resource allocation.
● Uniform hardware access by TM Services.
● The OS Services are not included in the OS distribution.
● The OS Services are not aware of the upper-layer containers.
● Every container will have its own configuration files per application.
● In figure 1, the upper layer can access the lower level and not vice versa.
● An example of Virtualization Orchestrator is Heat (see [RD17]).
Quality attribute characteristics
This design of the Virtualization Service focuses on the following quality attributes:
Maintainability: by creating a uniform layer above the hardware and by providing a descriptive way of creating the TM Application runtime environments, it becomes significantly simpler to maintain the service.
Reusability: the three-tier system for the description of the Virtualization Service allows each building block to be reused in creating more complex ones.
Availability: by defining the priority of the processes in the interface, thereby allowing the Virtualization Service to manage the allocation of resources, we are able to ensure high hardware fault tolerance and high concurrent availability of resources.
Interoperability: creating a layer above the hardware, with built-in interoperability of resource access and communication within the Virtual Machine Service and the containers, ensures the interoperability of the TM Applications.
Manageability: dividing the system into a three-tier rationale provides a balance between complexity and manageability of resources and systems.
Performance: the Virtual Machine Service's automatic allocation of resources according to their priority ensures that the available resources are maximized.
Reliability: containerizing all the components yields a fault-tolerant system capable of automatically redirecting computing resources to the available ones upon failure, which greatly increases reliability.
Scalability: the system is built to be scalable by design and from the ground up. An increase or decrease of the available computing resources is automatically managed by the Virtualization Service and immediately put to use as soon as it is added to the existing computing resource pool.
Testability: in this case it rests on two aspects: first, the in-built testing tools on the three layers provided by the health checks; second, the totally modular approach simplifies testing, since testing also becomes modular and contained.
Usability: as already stated, the three-tier rationale provides a balance between complexity and manageability; this allows the teams to focus on the building blocks of the TM Apps, providing a more intuitive approach to building and deploying the TM Applications.
For a more in depth analysis of the quality attribute characteristics of virtualization platforms, namely OpenStack see: [RD28].