Name, Designation, Affiliation, Signature and Date:
Authored by: A. Magro, Subject Matter Expert, AADC. Date: 2018-11-01
Owned by: M. Waterson, Domain Specialist, SKAO. Date: 2018-11-01
Approved by: P. Gibbs, Engineering Project Manager, SKAO. Signed: Philip Gibbs. Date: 2018-11-01
Released by: J. G. Bij de Vaate, Consortium Lead, AADC. Date: 2018-11-01
MCCS ARCHITECTURE OVERVIEW
Document number ...................................................................... SKA-TEL-LFAA-0600050
Context ........................................................................................................................ DRE
Revision ......................................................................................................................... 01
Author ........................................................................................... A. Magro, A. DeMarco
Date ................................................................................................................. 2018-10-31
Document Classification ............................................................. FOR PROJECT USE ONLY
Status ................................................................................................................... Released
DOCUMENT HISTORY
Revision | Date of Issue | Engineering Change Number | Comments
A | 2018-06-04 | - | Draft Template version released within consortium
01 | 2018-10-31 | - | First Release
DOCUMENT SOFTWARE
Package | Version | Filename
Wordprocessor (MS Word) | Word 2016 | SKA-TEL-LFAA-0600050-01 MCCS Architecture Overview
Block diagrams | - | -
Other | - | -
ORGANISATION DETAILS
Name: Aperture Array Design and Construction Consortium
Registered Address: ASTRON, Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands
Tel. +31 (0)521 595100
Fax. +31 (0)521 595101
Website: www.skatelescope.org/lfaa/
Copyright/Document owner: Aperture Array Design and Construction Consortium. This document is written for internal use in the SKA project.
TABLE OF CONTENTS
1 INTRODUCTION ............................................................................................. 8
1.1 Purpose of the document ....................................................................................................... 8
1.2 Scope of the document ........................................................................................................... 8
1.3 Intended Audience .................................................................................................................. 8
1.4 Document Overview ............................................................................................................... 8
1.5 Document Tree ....................................................................................................................... 9
2 REFERENCES .............................................................................................. 10
2.1 Applicable documents........................................................................................................... 10
2.2 Reference documents ........................................................................................................... 10
3 MCCS ARCHITECTURE OVERVIEW ................................................................... 11
3.1 Telescope Overview .............................................................................................................. 11
3.2 LFAA Overview ...................................................................................................................... 12
3.3 Role of MCCS in LFAA ............................................................................................................ 14
3.4 Main MCCS Responsibilities .................................................................................................. 14
3.5 MCCS Top-Level Static Decomposition Diagram .................................................................. 18
3.6 Interfaces .............................................................................................................................. 18
External Entities ............................................................................................................ 18
Level 4 and Level 5 Components .................................................................................. 18
External Interfaces ........................................................................................................ 19
Internal Interfaces ......................................................................................................... 20
4 OPERATIONAL CONCEPTS .............................................................................. 23
Operational Environment ............................................................................................. 23
Operations................................................................................................................. 23
Maintenance ............................................................................................................. 24
Operator Role ............................................................................................................ 24
Support Environment .................................................................................................... 24
On-site Maintainer role ............................................................................................. 24
Off-site Maintainer role ............................................................................................ 24
Remote support ........................................................................................................ 24
States and Modes.......................................................................................................... 25
5 MCCS SOFTWARE OVERVIEW ........................................................................ 28
5.1 Overview of Software Architecture ...................................................................................... 28
5.2 Software Component List ..................................................................................................... 33
5.3 Software-Hardware Mapping ............................................................................................... 37
5.4 Software Life Cycle ................................................................................................................ 38
Agile Release Trains ...................................................................................................... 38
SAFe Implementation Overview ................................................................................... 38
Essential SAFe ............................................................................................................... 39
Software Development Process During Construction Iterations .............................. 40
The Test-First Approach to Construction .................................................................. 41
5.5 Commissioning ...................................................................................................................... 43
6 MCCS PHYSICAL OVERVIEW .......................................................................... 44
6.1 Compute Server .................................................................................................................... 44
6.2 Network ................................................................................................................................ 44
6.3 Rack Assembly ....................................................................................................................... 46
7 SCENARIOS ................................................................................................ 48
7.1 Application of power ............................................................................................................. 48
7.2 Transition to Low Power Mode ............................................................................................. 49
7.3 Transition to Off-line ............................................................................................................. 49
Controlled shutdown .................................................................................................... 49
Uncontrolled shutdown ................................................................................................ 50
7.4 Set up and Start Observation ................................................................................................ 50
7.5 Calibration ............................................................................................................................. 50
7.6 Stop Observing ...................................................................................................................... 51
7.7 MCCS Failures ....................................................................................................................... 51
7.8 Software Upgrades ............................................................................................................... 52
Software upgrades ........................................................................................................ 52
BIOS updates ................................................................................................................. 53
LRU firmware updates .................................................................................................. 53
LIST OF FIGURES
Figure 1-1 SKA1 LFAA Element Documentation Tree ............................................................. 9
Figure 3-1 SKA1 Telescope Overview .................................................................................................... 11
Figure 3-2 SKA1_Low Functional Diagram ............................................................................................ 12
Figure 3-3. LFAA overall architecture.................................................................................................... 13
Figure 3-4. LFAA observation organization ........................................................................................... 15
Figure 3-5. MCCS top-level static decomposition ................................................................................. 17
Figure 3-6. LFAA L3 context diagram .................................................................................................... 21
Figure 3-7. MCCS - Field interface ......................................................................................................... 21
Figure 3-8. MCCS - SPS interface ........................................................................................................... 22
Figure 4-1 MCCS Sub-Element top-level context diagram showing all external interfaces ................. 23
Figure 4-2. Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are
mandatory for each hardware and software component. ......................................................... 27
Figure 5-1. LFAA overall software architecture overview ..................................................................... 28
Figure 5-2. LFAA observation management overview .......................................................................... 30
Figure 5-3. LFAA local monitoring and control overview ...................................................................... 32
Figure 5-4. TANGO control structure .................................................................................................... 33
Figure 5-5. Software module decomposition diagram ......................................................................... 37
Figure 5-6. Mapping between array and software components .......................................................... 38
Figure 5-7: Essential SAFe configuration............................................................................................... 39
Figure 5-8: Software development process during a construction iteration. ...................................... 41
Figure 5-9: Test-first development approach. ...................................................................................... 42
Figure 5-10: Testing during construction iterations. ............................................................................ 42
Figure 6-1. Network links between MCCS and external entities .......................................................... 45
Figure 6-2. MCCS network diagram ...................................................................................................... 46
Figure 6-3. MCCS rack assembly ........................................................................................................... 47
LIST OF TABLES
Table 3-1. LFAA numbers ...................................................................... 14
Table 3-2. External interfaces ............................................................................................................... 19
Table 3-3. L2 interfaces to other LFAA sub-elements ........................................................................... 20
Table 4-1. MCCS states and modes ....................................................................................................... 25
Table 5-1. Link between hardware components as described in software and the physical
components as defined in the PBS ............................................................................................. 29
Table 5-2. List of elements in the Architecture System Overview ........................................................ 33
Table 5-3. Relationships between major elements in the Architecture System Overview .................. 36
Table 6-1. MCCS compute server configuration ................................................................................... 44
LIST OF ABBREVIATIONS
AADC ................................. Aperture Array Design and Construction Consortium
AAVS ................................. Aperture Array Verification System
ADC ................................... Analog to Digital Converter
Ad-n .................................. nth document in the list of Applicable Documents
APIU .................................. Antenna Power Interface Unit
AIV .................................... Assembly Integration and Verification
BIOS ................................. Basic Input/Output System
CDR ................................... Critical Design Review
CI ....................................... Configuration Item
CMB .................................. Cabinet Management Board
COTS ................................. Commercial Off The Shelf
CPF .................................... Central Processing Facility
CM .................................... Configuration Manager
CPU .................................. Central Processing Unit
CSP .................................... Central Signal Processing
DAQ .................................. Data Acquisition
DDD ................................... Detailed Design Document
DMS .................................. Document/Data Management System
ECP .................................... Engineering Change Proposal
EMI .................................... Electro Magnetic Interference
FN ..................................... Field Node
FoV .................................... Field of View
FPGA ................................. Field Programmable Gate Array
GPU ................................... Graphics Processing Unit
HW .................................... Hardware
ICD .................................... Interface Control Document
INFRAAUS ......................... Infrastructure Australia
ISO..................................... International Organisation for Standardisation
LFAA .................................. Low Frequency Aperture Array
LFAA-DN ............................ Low Frequency Aperture Array – Data Network
LMC ................................... Local Monitoring and Control
FQDN ................................ Fully Qualified Device Name
LNA ................................... Low Noise Amplifier
LRU .................................... Line Replaceable Unit
MCCS................................. Monitor, Control and Calibration subsystem
MRO .................................. Murchison Radio-astronomy Observatory
MWA ................................. Murchison Widefield Array
PBS .................................... Product Breakdown Structure
PPS .................................... Pulse Per Second
QA ..................................... Quality Assurance
RD-N .................................. nth document in the list of Reference Documents
RAM ................................. Random Access Memory
RMS .................................. Root Mean Square
RF ...................................... Radio Frequency
RFI ..................................... Radio Frequency Interference
RFoF .................................. Radio Frequency signal over Fibre
RPF .................................... Remote Processing Facility
SAD ................................... Software Architecture Document
SaDT .................................. Signal and Data Transport
SDP .................................... Science Data Processor
SKA .................................... Square Kilometre Array
SKA-LOW ........................... SKA low frequency part of the full telescope
SKAO ................................. SKA Office
S/N .................................... Signal to noise
SPS ................................... Signal Processing Subsystem
SRMB ................................ Sub-Rack Management Board
SSD ................................... Solid State Drive
SW ..................................... Software
TANGO .............................. TAco Next Generation Objects
TCP-IP ................................ Transmission Control Protocol – Internet Protocol
TBC .................................... To Be Confirmed
TBD ................................... To Be Determined
TM ..................................... Telescope Management
TPM ................................... Tile Processor Module
UPS.................................... Uninterruptible Power Supply
WBS .................................. Work Breakdown Structure
WP .................................... Work Package
1 Introduction
1.1 Purpose of the document
The purpose of this document is to describe the architecture of the Monitor, Control and
Calibration Sub-system (MCCS) for the Low Frequency Aperture Array (LFAA) of SKA Phase 1. It
references detailed design documents for the hardware and network setup, as well as a software
architecture document describing the software system that will run on the MCCS hardware.
Combined, these determine the operational concept, cost, power, equipment space, reliability,
availability and maintainability of the MCCS.
This document should be read after the LFAA Architectural Design and Analysis Document [AD3].
1.2 Scope of the document
This document describes how the LFAA MCCS architecture can meet the requirements within the
SKA LFAA Signal Processing Requirement Specification.
The level of detail in this document is sufficient to:
1. Define interfaces with other SKA Elements and LFAA Sub-elements.
2. Establish a reasonable baseline design at reasonably low perceived risk.
3. Estimate time, effort and cost to deliver the functionality specified in the LFAA Signal
Processing Sub-Element Requirements Specification [AD7].
In other words, the LFAA Sub-Element design is defined in enough detail to reduce the risk of
effort/time/cost overruns in the Construction Phase.
The current release (100% version) will support the Critical Design Review for the LFAA Element. The
level of detail is enough to have high confidence in the referenced design being compliant and able
to be constructed with low risk. This Architecture Design Document (ADD), with references to
supporting information and data, will provide a design artefact to support the Construction Phase
activities.
1.3 Intended Audience
This document is expected to be used by the LFAA Element Consortium Engineering and
Management Team, the SKAO System Engineering Team and the SKAO LFAA Project Manager. It is
also expected to be read by the external CDR review panel.
1.4 Document Overview
This document follows a template agreed between the SKAO and the LFAA Consortium.
It covers the key contents called out in the LFAA SOW [AD8].
Detailed information is contained in reference documents.
1.5 Document Tree
The overall document tree for the LFAA Element is shown in Figure 1-1. Level 1 (L1) is the SKA
System (telescope) level, L2 is the LFAA Element level and L3 is the LFAA sub-element level (where
MCCS resides).
Figure 1-1 SKA1 LFAA Element Documentation Tree
2 References
2.1 Applicable documents
The following documents are applicable to the extent stated herein. In the event of conflict between
the contents of the applicable documents and this document, the applicable documents shall take
precedence.
[AD1] SKA1 System Baseline Design, SKA-TEL-SKO-0000002, Issue 01
[AD2] Roll-out Plan for SKA1 Low, SKA-TEL-AIV-4410001 Issue 05
[AD3] LFAA Architectural Design Document, SKA-TEL-LFAA-0200028
[AD4] SKA1 TM to LFAA ICD, 100-000000-028, Issue 02
[AD5] SKA1 LFAA to INFRA AUS ICD, 100-000000-003, Issue 03
[AD6] SKA1 SADT to LFAA ICD, 100-000000-026, Issue 04
[AD7] SKA1 LFAA SPS Sub-Element Requirements Specification, SKA-TEL-LFAA-0400014
[AD8] SKA1 LFAA Element Statement of Work
2.2 Reference documents
The following documents are referenced in this document. In the event of conflict between the
contents of the referenced documents and this document, this document shall take precedence.
[RD1] SKA1 Control System Guidelines, 000-000000-010, Issue 01
[RD2] LFAA Internal Interface Control Document SKA-TEL-LFAA-0200030, Issue 01
[RD3] CISPR 22 Information technology equipment - Radio disturbance characteristics - Limits
and methods of measurement R2014
[RD4] CISPR 24 Information technology equipment - Immunity characteristics - Limits and
methods of measurement 2010
[RD5] CISPR 32 Electromagnetic compatibility of multimedia equipment - Emission
requirements 2015
[RD6] CISPR 35 Electromagnetic compatibility of multimedia equipment - Immunity
requirements
[RD7] MCCS Software Architecture Document, SKA-TEL-LFAA-0600052
[RD8] SPS Detailed Design Document, SKA-TEL-LFAA-0500035
[RD9] MCCS Detailed Design Document, SKA-TEL-LFAA-0600051
[RD10] MCCS Assembly Verification and Test Plan, SKA-TEL-LFAA-0600053
[RD11] SAFe Principles: https://www.scaledagileframework.com/safe-lean-agile-principles/
[RD12] Essential SAFe: https://www.scaledagileframework.com/essential-safe/
3 MCCS Architecture Overview
3.1 Telescope Overview
Figure 3-1 shows the major SKA1 Observatory entities: SKA1-Low in Australia, SKA1-Mid in South
Africa and the SKA Global Headquarters in the UK. The thick flow-lines show the unidirectional
transport of large amounts of digitised data from the antennas to the Central Processing Facilities
(CPF) on the sites, and from the CPFs to the Science Data Processor (SDP) and Archive facilities. The
thin blue dash-dot lines show the bidirectional transport of system monitor and control data.
The SKA1-Low telescope array includes 512 stations, each consisting of 256 dual-polarisation log-
periodic antennas. The stations are distributed over a distance of 65 km, with the greatest density of
stations in the central core. The Central Processing Facility is located on site, while the SDP and
archive are located in Perth. Additionally, each station can be divided into a number of smaller sub-stations
at reduced bandwidth.
A more detailed schematic of the SKA1-Low telescope, extracted from the SKA1 System Baseline V3
Description (in preparation), is shown in Figure 3-2. This figure shows the major SKA1-Low signal
flow components, as well as the areas of consortia responsibility (red boxes) and the key
technologies needed to implement the components. The green dashed line shows the bi-directional
flow of monitor, control and operational data, and the orange dot-dashed line shows the distribution
of synchronisation and timing signals.
Figure 3-1 SKA1 Telescope Overview
A schematic of the SKA1_Low Telescope, extracted from the Baseline Design [AD1], is shown below,
including the LFAA Element (product [101-000000]).
SKA1-Low operates in imaging and non-imaging modes concurrently, with between 1 and 16+
sub-arrays in operation at the same time. Each sub-array is programmable as a separate conceptual
telescope in terms of antenna pointing, band selection and the setting of configurable imaging and
non-imaging parameters. The only things that are not shared between sub-arrays are observation
time, communications links and some processing resources.
Figure 3-2 SKA1_Low Functional Diagram
3.2 LFAA Overview
The LFAA is primarily a hardware-centric element, such that hardware configuration, monitoring and
control is a central feature and architectural driver. The physical architecture is defined in Figure 3-3
and the system consists of the following major components:
1. Stations, consisting of Field Nodes, Antenna Power Interface Unit(s) and meshes
2. Digital System, consisting of:
a. Signal Processing Subsystem (SPS)
b. SPS Network
3. Monitor, Control and Calibration Sub-system (MCCS), including the MCCS network
LFAA is responsible for reception and digitization of low frequency band (50 MHz to at least 350
MHz) signals transmitted from astronomical objects. The architecture is built around a high-speed
switched network which is controlled by MCCS in a centralized and highly configurable system. The
SPS provides the infrastructure required to support signal conditioning, digitization and processing
functionalities of the TPMs. It consists of cabinets with internal cooling, power and clock distribution,
each receiving a 10 MHz reference and a 1 PPS signal from the Synchronisation and Timing (SAT)
system, which are distributed to each TPM. Each cabinet also includes the first-level data switches
(i.e. those directly connected to the TPMs), which allow tile beams to be formed by summing the
signals from sixteen antennas, and station beams to be formed by summing the tile beams within a
single station. Beamforming itself is performed within the TPMs.
Figure 3-3. LFAA overall architecture
TPMs are the primary components responsible for the processing of signals. They are located within
the processing facilities (the CPF is shown in the diagram) and are housed within Signal Processing
Sub-system (SPS) cabinets. Each TPM receives the analogue RF-over-fibre optical signals from 16
dual-polarisation antennas and converts them back to electrical RF signals. Each signal is then
filtered to limit the frequency bandwidth, amplified, digitized and channelized into ~1 MHz coarse
frequency channels (512 coarse frequency channels in total). Calibration coefficients are applied to
each frequency channel, whilst beamforming delays are applied to each antenna per beam. The
partial beam stream is sent to a digital switch to generate station beams: to generate a station beam,
the outputs of 16 TPMs are combined by making use of one of the data switches.
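As a rough illustration of the per-channel operations just described (a numpy sketch, not the TPM firmware; the coefficient values and channel grid are made up for the example), the calibration and pointing corrections reduce to complex per-antenna, per-channel weights applied before the sum over antennas:

```python
# Illustrative numpy sketch of the per-channel tile beamforming described
# above. Coefficient values and the channel grid are invented for the
# example; the real operations run in TPM firmware.
import numpy as np

N_ANT, N_CHAN, N_SAMP = 16, 512, 256     # antennas per TPM, coarse channels, samples

rng = np.random.default_rng(0)
volt = (rng.standard_normal((N_ANT, N_CHAN, N_SAMP))
        + 1j * rng.standard_normal((N_ANT, N_CHAN, N_SAMP)))

# Complex calibration coefficient per antenna and channel (from MCCS).
cal = np.exp(1j * rng.uniform(0, 2 * np.pi, (N_ANT, N_CHAN)))

# Pointing: a per-antenna delay becomes a per-channel phase ramp.
tau = rng.uniform(-50e-9, 50e-9, N_ANT)              # seconds, per antenna
freq = 50e6 + np.arange(N_CHAN) * 0.78125e6          # assumed channel grid, Hz
point = np.exp(-2j * np.pi * np.outer(tau, freq))    # shape (N_ANT, N_CHAN)

# Tile beam: weighted sum over the 16 antennas, per channel and sample.
tile_beam = ((cal * point)[:, :, None] * volt).sum(axis=0)
print(tile_beam.shape)                               # (512, 256)
```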
The data network is a standard high-speed (40 Gb/s or 100 Gb/s) network which will transport the
various data streams, i.e. control and monitoring information as well as signal data. This network
provides connectivity between the TPMs, the MCCS and the Low correlator and beamformer
(CBF-LOW), involving long-haul links from the TPMs that are located in the Remote Processing
Facilities (RPF).
There will be a total of 256 SPS cabinets, each containing four sub-racks, such that each cabinet is
responsible for two stations. Each sub-rack contains a Sub-rack Management Board which
distributes power, the 1 Gb network, and the 10 MHz and PPS signals. A cabinet-wide management
unit is responsible for distributing these signals to the sub-racks. A single 100 Gb switch connects the
TPMs to the LFAA Network. The Sub-rack Management Board also acts as a proxy for monitoring and
controlling APIUs. MCCS cabinets host at least 16 high-performance servers (plus one or two
additional servers for redundancy), such that each server is responsible for at most eight stations.
MCCS cabinets also host a number of 100 Gb switches to connect the SPS racks to the MCCS servers.
Table 3-1 provides a summary of the number of components in the LFAA and how they are spread
across cabinets (refer to [RD8] for a detailed SPS cabinet design).
Table 3-1. LFAA numbers
Total number of antennas: 131072
Total number of stations: 512
Antennas per station: 256
Antennas per TPM: 16
TPMs per station: 16
Total number of TPMs: 8192
Signals per TPM: 32
Frequency channels: 512
Maximum beams per station: 8
Total number of SPS cabinets: 256
Sub-racks per cabinet: 4
TPMs per sub-rack: 8
100Gb switches per MCCS cabinet: 2
Total number of MCCS cabinets: 4
Servers per cabinet: 17 (+ 0/1)
SDP-LFAA link speed: 100Gb/s
TM-LFAA link speed: 1Gb/s
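The entries in Table 3-1 are mutually consistent; a few lines of Python make the arithmetic explicit (this is just a cross-check of the figures above, not additional design data):

```python
# Cross-check of the headline numbers in Table 3-1.
stations, ants_per_station, ants_per_tpm = 512, 256, 16

tpms_per_station = ants_per_station // ants_per_tpm     # 16
total_tpms = stations * tpms_per_station                # 8192
total_antennas = stations * ants_per_station            # 131072

sps_cabinets = stations // 2                            # 2 stations per cabinet -> 256
assert total_tpms == sps_cabinets * 4 * 8               # 4 sub-racks of 8 TPMs each
assert 32 == 2 * ants_per_tpm                           # dual polarisation -> 32 signals per TPM

print(total_antennas, total_tpms, sps_cabinets)         # 131072 8192 256
```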
3.3 Role of MCCS in LFAA
The MCCS performs the local monitoring, control and calibration functions for the stations and
supporting products. It receives commands from, and reports the LFAA status to, TM. It comprises a
compute cluster (hardware resources composed of off-the-shelf high-performance servers), local
power and cooling distribution, a local network, and job management software to support the LFAA
monitor and control functions. The MCCS is connected to both the SPS and the LFAA Network. It also
calculates the beamforming and calibration coefficients. The MCCS controls the TPMs, the M&C and
data networks, as well as the supporting hardware in the cabinets. It is also responsible for
implementing the transient buffer and transmitting the buffer, when instructed, to SDP via a
dedicated 100 Gb link.
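At its core, the transient buffer is a ring buffer over recent beamformed data that can be dumped on request. The following is a minimal sketch of the idea only; the class name, buffer sizing and data layout are illustrative assumptions, not the MCCS design:

```python
# Minimal ring-buffer sketch for a station-beam transient buffer.
# Sizes, names and layout are illustrative only.
import numpy as np

class TransientBuffer:
    """Keep the most recent `depth` blocks of beam samples; on a trigger
    from TM, return the buffered data for forwarding to SDP."""

    def __init__(self, depth: int, block: int):
        self._data = np.zeros((depth, block), dtype=np.complex64)
        self._next = 0        # slot to overwrite next (oldest when full)
        self._full = False

    def write(self, samples: np.ndarray) -> None:
        self._data[self._next] = samples
        self._next = (self._next + 1) % len(self._data)
        self._full = self._full or self._next == 0

    def dump(self) -> np.ndarray:
        """Return the buffered blocks in time order, oldest first."""
        if not self._full:
            return self._data[:self._next].copy()
        return np.roll(self._data, -self._next, axis=0).copy()

buf = TransientBuffer(depth=8, block=1024)
for i in range(11):
    buf.write(np.full(1024, i, dtype=np.complex64))
assert buf.dump()[0][0] == 3        # blocks 0-2 have been overwritten
```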
3.4 Main MCCS Responsibilities
The two primary responsibilities of the MCCS sub-system are to:
1. Create and monitor observations, including calibration and buffering beamformed data
for transient detection
2. Provide monitoring and control capability for all the hardware and software components
The software architecture for the LFAA is primarily driven by these responsibilities, whilst the sizing
of the MCCS hardware is defined by the resource requirements for calibration, transient buffers and
supporting operations. Observation management is the primary use case for MCCS and defines the
primary functional requirements for the software system, whilst most of the remaining
requirements can be seen as features and specifications required to ensure that the primary use
case remains online, available and working properly, and meets the science cases to which the LFAA
should cater. The functional requirements which the MCCS should provide can be summarized as
follows:
• Create and manage observations, where an observation consists of one subarray containing
multiple stations, which in turn can be composed of multiple sub-stations
• Perform calibration, pointing and bandpass flattening coefficient calculation for running
observations
• Manage TPMs, including downloading firmware, initialising and synchronising the boards
and firmware, and updating required coefficients and configurations throughout the lifetime
of an observation
• Provide a transient buffer such that, when triggered by TM, buffered station beams can be
forwarded to SDP
• Expose maintenance functionality for fault finding, mitigation and correction
• Monitor and control TPMs, antennas and other hardware and software components, and
provide a mechanism for generating reports
• Provide a logging mechanism and store logs for a period of time, where said logs should be
queryable by external parties
• Raise alarms and events to inform internal and external entities of state and other changes
of LFAA components
• Routinely perform status and diagnostic checks
• Provide an inventory database recording labelled hardware components and cables, so that
issues within the CPF and RPFs can be localised easily
• Interact with external entities, including TM, SDP, CSP, operators, engineers and hardware
and software deployers
Observation creation and management, with the associated need to control TPMs, calibrate the
arrays and buffer station beams, is the main driving factor of the architecture, as well as for defining
the minimal performance requirements for sizing the MCCS hardware. The need to monitor all
hardware and software devices, including the need for an alarm and notification system, led to the
adoption of TANGO by the SKA community as the primary control system for the SKA. Through
TANGO, most of the purely LMC-related requirements are met by properly integrating TANGO within
the architecture.
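To make the TANGO approach concrete, below is a minimal PyTango sketch of how an LFAA hardware component could be exposed as a TANGO device. The class, attribute and command names are illustrative placeholders; the actual MCCS device interfaces are specified in [RD7].

```python
# Minimal PyTango sketch of a monitorable LFAA component as a TANGO
# device. Names are placeholders, not the MCCS device interface.
from tango import DevState
from tango.server import Device, attribute, command, run

class DemoTile(Device):
    def init_device(self):
        super().init_device()
        self._temperature = 35.0
        self.set_state(DevState.ON)

    @attribute(dtype=float, unit="degC")
    def boardTemperature(self):
        # A real device would poll the hardware here; TANGO alarm limits
        # configured on the attribute drive the ALARM state.
        return self._temperature

    @command
    def Initialise(self):
        # Placeholder for firmware download / synchronisation logic.
        self.set_state(DevState.ON)

if __name__ == "__main__":
    run((DemoTile,))
```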
Figure 3-4. LFAA observation organization
The primary use case of the LFAA is to generate station beams. Observation organization is shown
in Figure 3-4 and described below:
• A group of 16 Antennas (connected to a TPM) is called a Tile
• A Subarray is a set of Stations grouped together for a single observation scheduling block. A
Station is composed of 256 antennas (distributed across 16 Tiles). The LFAA uses the concept
of a Sub-array to conform with the SKA control guidelines, for grouping related Tiles and
storing Sub-array related metadata. There is no Sub-array specific operation performed in
the signal chain.
• The number of Subarrays which can be defined is configurable (there is no fixed limit). This
document assumes a maximum of 16 Subarrays; however, this can be changed
• A sub-station is defined as a specific instance of a station beam in which a subset of the
antennas does not contribute to the beam (a weight of 0 is applied to these antennas)
• Each Station can generate up to 8 Station Beams
• The Antennas within each Station need to be calibrated (gain and phase calibration). This is
performed on the MCCS servers. The calibration cycle is 10 minutes. During these 10
minutes, coarse frequency channels (from the channels in the Station Beams) are calibrated
in a round-robin fashion, such that each channel is calibrated in ~1 second
• For each Station Beam, given a pointing polynomial, the delay and delay rate per antenna
need to be calculated so that pointing coefficients can be generated (a simplified sketch of
this calculation is given after this list). Delays and delay rates per antenna are calculated on
the MCCS servers, whilst pointing coefficients per antenna/channel (given the delay and
delay rate) are calculated on the TPMs.
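The following sketch illustrates the per-antenna delay and delay-rate calculation for one station beam. The pointing polynomial is reduced here to a time-dependent unit vector and the antenna layout is randomly generated; both are assumptions made purely for illustration.

```python
# Illustrative per-antenna delay / delay-rate calculation for one
# station beam. Geometry and pointing model are invented for the example.
import numpy as np

C = 299_792_458.0                                  # speed of light, m/s

def beam_direction(t: float) -> np.ndarray:
    """Unit pointing vector at time t (stand-in for the TM polynomial)."""
    az, el = 0.1 + 1e-5 * t, 0.8 - 2e-6 * t        # radians, slowly drifting
    return np.array([np.cos(el) * np.sin(az),
                     np.cos(el) * np.cos(az),
                     np.sin(el)])

rng = np.random.default_rng(1)
antenna_xyz = rng.uniform(-20.0, 20.0, (256, 3))   # station antenna positions, m

def delays(t: float) -> np.ndarray:
    """Geometric delay per antenna in seconds: tau = -(r . s_hat) / c."""
    return -antenna_xyz @ beam_direction(t) / C

t0, dt = 0.0, 1.0
tau = delays(t0)                                   # per-antenna delays, s
tau_rate = (delays(t0 + dt) - delays(t0)) / dt     # per-antenna delay rates, s/s

# tau and tau_rate are what the MCCS servers would hand to the TPMs,
# which turn them into per-channel pointing coefficients.
print(tau.shape, tau_rate.shape)                   # (256,) (256,)
```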
Figure 3-5. MCCS top-level static decomposition
3.5 MCCS Top-Level Static Decomposition Diagram
MCCS Compute Processor and MCCS Software are the main components of the MCCS. A brief
architectural description of the software component is presented in Section 5, while an in-depth
analysis and description is provided in [RD7]. The MCCS Compute Processor component is composed
of four almost-identical MCCS cabinets deployed in the CPF. Each cabinet hosts at least 16 high-performance
servers which are connected to SPS via a data network and interconnected both
through the same data network and through a dedicated monitoring and control network. In two of
the four cabinets an LMC head node (one master, one shadow) hosts the core LMC software and
manages both MCCS and SPS. Figure 3-5 shows the top-level static decomposition of MCCS.
3.6 Interfaces
This section describes the external entities to MCCS, the level 4 and level 5 components composing
MCCS, as well as all internal and external interfaces to MCCS.
External Entities
There are no external entities in the MCCS static decomposition diagram. Verification and
maintenance support equipment is not described in detail in this document.
Level 4 and Level 5 Components
Level 4 decomposition has only three elements:
• 4 MCCS Compute Processors, which house the High-Performance Computing Units together
with the data network LRUs which connect the servers together as well as provide
connections to SPS
• MCCS Software, which encapsulates all the LMC and supporting software infrastructure for
MCCS
• LMC infrastructure hardware, comprising one master node and a shadow master node
which is used as a failover in the event that the master node becomes compromised
An MCCS Compute Processor is composed of:
• The cabinet chassis, holding all other hardware components
• 17 high-performance computing units, one of which is a spare (kept in low-power mode
until needed)
• Four 100 Gb 32-port Ethernet switches, implementing a single 100 Gb Ethernet network for
science and LMC data
• One 1 Gb/s 32-port Ethernet switch for control and management across MCCS
• The AC distribution system, distributing power to all Level 4 components under CMB control
• Required cabling
Additionally, two of the MCCS Compute Processors contain one LMC Infrastructure node
together with an associated UPS.
The MCCS software is logically partitioned into several L5 components:
• Local Monitor and Control TANGO Framework, which encapsulates most of the LMC
functionality
• Management Software Module, which manages the hardware and software configuration of
the whole LFAA
• Graphical User Interface, which provides an engineering user interface for use in
commissioning, testing and maintenance
• Data Acquisition Software Module, which is responsible for acquiring LMC data transmitted
by SPS, used for calibration, transient buffering and diagnostics
• Pointing Software, which computes the delay and delay rate per antenna for a given
station/sub-station configuration
• Calibration Software, which runs the calibration algorithm and generates calibration
coefficients that are transmitted to SPS
• Diagnostic Software, which monitors the state of the LFAA (both hardware and software),
including Field Node diagnostics, calibration diagnostics and network diagnostics. Some of
these diagnostics can be performed within the associated TANGO devices; however, others
require a larger amount of processing power, in which case they are run as standalone
applications.
External Interfaces
The external interfaces between MCCS and other elements are listed in Table 3-2 and shown in
Figure 3-6, whilst the interfaces between MCCS and other LFAA sub-elements are listed in Table 3-3
and shown in Figure 3-7 and Figure 3-8. The external interfaces are defined in [AD4]/[AD5]/[AD6],
whilst the internal interfaces are defined in [RD2]; MCCS intends to be compliant with them.
Table 3-2. External interfaces
External Entity | Interface ID | Leading Organization | Key Data or Message Flows
TM | S1L.TM_LFAA.001 | TM | Overall LFAA monitoring and control functionality
SDP | S1L.SDP_LFAA.002 | SDP | Transient buffer
SDP | S1L.SPA_LFAA.001 | SDP | Global sky model updates
SaDT | S1L.SADT_LFAA.007 | SaDT | Monitor and control and NTP – physical link
SaDT | S1L.SADT_LFAA.009 | SaDT | Transient buffer data – physical link
INAU | S1L.LFAA_INAU.005 | LFAA | Rack power
INAU | S1L.LFAA_INAU.008 | LFAA | Rack cooling
INAU | S1L.LFAA_INAU.009 | LFAA | Floor space
Table 3-3. L2 interfaces to other LFAA sub-elements
LFAA Entity | Interface ID | Key Data or Message Flows
SPS | S1L.MCCS_SPS.001 | Physical links between CPF and MCCS
SPS | S1L.MCCS_SPS.002 | Physical links between RPFs and MCCS
SPS | S1L.MCCS_SPS.003 | Calibration, transient data exchange between SPS and MCCS
SPS | S1L.MCCS_SPS.004 | LMC data exchange between SPS TPMs and MCCS
SPS | S1L.MCCS_SPS.005 | LMC data exchange between SPS CMBs and MCCS
SPS | S1L.MCCS_SPS.006 | LMC data exchange between SPS SRMBs and MCCS
SPS | S1L.MCCS_SPS.007 | LMC data exchange between SPS Network and MCCS
Field Node | S1L.MCCS_FN.001 | LMC data exchange between FN and MCCS
Internal Interfaces
The physical interfaces within MCCS are those required for:
• Distribution of power from the rack power supplies to the PDU and subsequently to the rack
equipment (via the UPS in the case of the head and shadow nodes)
• 1Gb and 100Gb network connectivity.
Interfaces between software components are described in the MCCS Software Architecture
Document [RD7].
Figure 3-6. LFAA L3 context diagram
Figure 3-7. MCCS - Field interface
Figure 3-8. MCCS - SPS interface
4 Operational Concepts
Figure 4-1 MCCS Sub-Element top-level context diagram showing all external interfaces
Operational Environment
The screened Central Processing Facility (CPF) will house the MCCS equipment as well as other
surrounding/support equipment, such as that used for SaDT timing and networks, the CSP correlator
and the LFAA SPS.
The CPF is an RFI-shielded facility supporting liquid cooling. This facility has some level of ESD
protection and has HVAC filters to prevent dust accumulation on equipment. Notwithstanding the
RFI-shielded facility, LFAA LRUs, including those comprising MCCS, are required individually to meet
CISPR-22/32 Class A [RD3]/[RD5] radiated and conducted emissions levels. Additionally, MCCS LRUs
must meet CISPR 24/35 [RD4]/[RD6] Class A radiated and conducted susceptibility levels or
equivalent.
Operations
During normal operations MCCS is controlled via the interface with TM [AD4]. MCCS implements a
high-level interface which allows TM to control and monitor MCCS as a single instrument. A single
point of access is provided for housekeeping commands such as power-up, power-down, and state
and mode transitions. Monitoring and error reporting are subscription-based; all parameters that
may be of interest to TM and operations in general, including the rolled-up overall operational state
and health, are available for subscription. In addition, the MCCS interface provides introspection,
i.e. allows an authorized client to ‘discover’ and access parameters and commands implemented by
the lower level components when required to support diagnostics and maintenance.
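Assuming PyTango, the subscription and introspection mechanisms described above could be exercised by a client roughly as follows; the device name used here is a placeholder, not the real MCCS FQDN:

```python
# Sketch of subscription-based monitoring and introspection from a
# client, assuming PyTango. The device name is a placeholder.
import tango

proxy = tango.DeviceProxy("low-mccs/control/control")   # placeholder FQDN

# Introspection: discover the attributes and commands the device exposes.
print(proxy.get_attribute_list())
print(proxy.get_command_list())

# Subscription-based monitoring: receive state changes as events
# instead of polling.
def on_health_change(event):
    if not event.err:
        print("healthState ->", event.attr_value.value)

sub_id = proxy.subscribe_event("healthState",
                               tango.EventType.CHANGE_EVENT,
                               on_health_change)
# ... later, when monitoring is no longer needed:
proxy.unsubscribe_event(sub_id)
```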
Signal processing functions are controlled via sub-arrays and scans. MCCS supports the configuration
and monitoring of sub-arrays, i.e. provides high-level commands [AD4] that TM can use to sub-divide
the Low Telescope into up to 16 sub-arrays and operate each sub-array independently. MCCS
exposes sub-arrays as top-level entities and makes provision for TM to assign antennae to sub-arrays
and select signal processing functions to be performed per sub-array. A scan is defined as a time
interval during which a sub-array's configuration does not change. During normal operations, TM
accesses a sub-array directly to assign antennae, select signal processing functions, and start and
stop the scan (i.e. start and stop signal processing).
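A client-side view of the sub-array and scan life cycle described above might look as follows. This is a hypothetical sequence: the command names, argument formats and device name are placeholders, and the real interface is defined in the TM-LFAA ICD [AD4].

```python
# Hypothetical sub-array/scan life cycle from the TM side, assuming
# PyTango. All command names and arguments are placeholders.
import json
import tango

subarray = tango.DeviceProxy("low-mccs/subarray/01")    # placeholder name

# 1. Assign stations to the sub-array.
subarray.command_inout("AssignResources", json.dumps({"stations": [1, 2, 3]}))

# 2. Configure: pointing, band selection and processing functions.
subarray.command_inout("Configure", json.dumps({"station_beams": 1}))

# 3. A scan is the interval during which this configuration is fixed.
subarray.command_inout("Scan")
# ... observing ...
subarray.command_inout("EndScan")

# 4. Release resources once the scheduling block completes.
subarray.command_inout("ReleaseAllResources")
```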
Maintenance
The MCCS Sub-element is an IT system and therefore requires typical data centre/IT system
maintenance. Maintenance personnel will accordingly be typical IT-support personnel, and normally
only such personnel are expected to be required at the site itself.
Maintenance is described further in [RD9] Section 14.
Operator Role
Apart from maintenance activities, LFAA (and thus MCCS) is remotely controlled via TM, and
ultimately by an operator within the TM environment. The operator/maintainer role and how it
relates to MCCS is described in more detail in [RD9] Section 14.
Support Environment
Support for MCCS will be provided both on-site (i.e. at Boolardy) and off-site (i.e. at the SKA1_Low
Telescope support facility at Geraldton and/or in/near Perth), as well as remotely. On- and off-site
support is described in more detail in [RD9] Section 14.
On-site Maintainer role
The MCCS on-site maintainer needs a technical hardware support background as described in [RD9]
Section 14 to execute the required maintenance tasks. This maintainer’s primary objective is to
detect and isolate faulty LRUs (corrective maintenance) and to remove and replace these to restore
the MCCS functionality. The maintainer’s secondary objective is to determine what maintenance
needs to be scheduled (predicted and preventative maintenance) and to coordinate and perform the
required tasks when scheduled.
Off-site Maintainer role
The off-site maintainer needs software/hardware technical support background to perform second
line LRU repairs, configuration, and verification as described in [RD9] Section 14. The off-site
maintainer is located at the SKA1_Low Telescope support facility. The off-site maintainer removes
and replaces selected SRUs to repair LRUs, configures the repaired COTS LRUs, and tests all repaired
equipment in a representative environment to verify that they are fully operational. Once this is
confirmed, LRUs are returned to the on-site or close-to-on-site spares store.
Remote support
MCCS maintenance and support personnel will remotely connect to the CPF over the SKAO
communication network to read equipment status, review equipment log files and access MCCS long
term monitoring data that is stored in the TM Engineering Data Archive (EDA), to help isolate faults.
Off-site support is generally more specialized, and is hence preferred for detecting and isolating
faults. Off-site support will be provided where possible to assist the Telescope on-site operations
personnel in diagnosing faulty LRU equipment and problems with telescope functionality (firmware
and software). The general rule is: if something can be done remotely, it should be done remotely,
but with on-site capability and assistance where such capability is useful (such as GUIs on consoles
to track down problems, as described in [RD7] Section 7). See [RD9] Section 14 for more information
on remote support.
States and Modes
The MCCS implementation of states and modes is compliant with the SKA Control System Guidelines
document [RD1]. Per these guidelines, MCCS implements and reports the standard set of SKA state
and mode indicators for SPS, individual sub-arrays and MCCS itself. MCCS monitors state and mode
transitions and, based on the status reported by LFAA sub-systems, derives the overall LFAA state
and mode indicators. For more detailed information on how states and modes are implemented in
the MCCS software architecture, refer to [RD7].
Table 4-1 lists the states and modes for a sub-array. The states and modes are applicable to all
hardware, software and logical components, although it is not mandatory that all states and modes
are applied to each component. Figure 4-2 shows the state transition diagram as derived from [RD1].
Table 4-1. MCCS states and modes

adminMode (read-write): set by an outside authority (operations via TM and MCCS).
• ONLINE: The sub-array can be used for scientific observing.
• MAINTENANCE: The sub-array is not to be used for scientific observing but can be used for
testing and commissioning.
• OFFLINE: The sub-array is not to be used at all.
• NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): MCCS intelligently rolls up the operational state of all components used by the
sub-array and reports the overall operational state for the sub-array.
• INIT: The sub-array is being initialized.
• OFF: The sub-array is 'empty'; no receptors have been assigned to the sub-array.
• ON: At least one receptor has been allocated to the sub-array; the sub-array is ready to
accept a scan configuration.
• ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits.
Some or all functionality may not be available.
• DISABLE: The sub-array is administratively disabled (adminMode=OFFLINE or NOT_FITTED);
basic monitor and control functionality is available, but signal processing functionality is not
available.
• FAULT: An unrecoverable fault has been detected. The sub-array is not available for use;
maintainer/operator intervention is required.
• UNKNOWN: The sub-array is unresponsive, e.g. due to loss of communication.
healthState (read-only): MCCS intelligently rolls up attribute quality factors, states, and other
indicators for all components and capabilities used by the sub-array and reports the overall
sub-array healthState. Range: OK, DEGRADED, FAILED.

obsState (read-only): the sub-array Observing State indicates status related to scan configuration
and execution.
• IDLE: The sub-array is not processing input data and is not generating output products.
When a sub-array is IDLE, SCAN ID=0.
• CONFIGURING: Transient state entered when a command to re-configure the sub-array is
received. The sub-array leaves this state when re-configuration is completed.
• READY: The sub-array enters READY when re-configuration has been completed.
• SCANNING: The sub-array is processing input data and generating output products.
• ABORTED: The sub-array transitions to this state when an 'abort scan' command is received.
In this state re-configuration, delay tracking, and any other on-going processing functions
are stopped.
• FAULT: An unrecoverable error that requires operator intervention has been detected.
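The healthState roll-up lends itself to a compact illustration. The sketch below encodes two of the attributes in Table 4-1 as Python enums and applies a naive worst-case roll-up rule; the actual roll-up logic is defined in [RD7].

```python
# Table 4-1 attributes as enums, with a naive worst-case health roll-up.
# The roll-up rule shown is illustrative, not the MCCS algorithm.
from enum import Enum

class AdminMode(Enum):
    ONLINE = 0
    MAINTENANCE = 1
    OFFLINE = 2
    NOT_FITTED = 3

class HealthState(Enum):
    OK = 0
    DEGRADED = 1
    FAILED = 2

def roll_up(component_states):
    """Overall sub-array health from its components' health states."""
    states = list(component_states)
    if any(s is HealthState.FAILED for s in states):
        return HealthState.FAILED
    if any(s is HealthState.DEGRADED for s in states):
        return HealthState.DEGRADED
    return HealthState.OK

assert roll_up([HealthState.OK, HealthState.DEGRADED]) is HealthState.DEGRADED
```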
Figure 4-2. Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are mandatory
for each hardware and software component.
5 MCCS Software Overview
5.1 Overview of Software Architecture
The software infrastructure of the LFAA must cater for the responsibilities specified above, with a
focus on telescope monitoring and control, and observation management. Additionally, the
architecture must meet the non-functional requirements listed in [RD9] Section 3.3. A high-level
description of the LFAA software architecture is shown in Figure 5-1. The diagram separates
components which are within the software architecture context from those which are considered
external (here, the Telescope Manager and hardware devices). Note that not all software
components are shown, to avoid clutter. The architecture itself is separated into four sub-systems
which communicate with each other over the TANGO bus. This separation is purely logical, since
almost all software components are implemented as TANGO devices (or have an associated TANGO
device). These sub-systems are:
Hardware Devices: Each monitorable and/or controllable hardware device in the LFAA has an
associated TANGO device through which all operations are performed. These include: TPMs,
antennas, APIUs, switches, rack management units and servers. Note that an antenna cannot be
monitored and controlled directly; these operations have to go through the APIU and TPM to which
the antenna is connected.
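The indirection for antennas can be sketched as follows, assuming PyTango: the antenna device holds no hardware connection of its own and forwards reads through proxies to its APIU and TPM devices. Device classes, properties, attributes and command names are all placeholders.

```python
# Sketch of the antenna indirection described above, assuming PyTango.
# All names (device classes, properties, commands) are placeholders.
import tango
from tango.server import Device, attribute, device_property

class DemoAntenna(Device):
    ApiuName = device_property(dtype=str)     # TANGO name of the APIU device
    TpmName = device_property(dtype=str)      # TANGO name of the TPM device
    LogicalId = device_property(dtype=int)    # antenna index within APIU/TPM

    def init_device(self):
        super().init_device()
        self._apiu = tango.DeviceProxy(self.ApiuName)
        self._tpm = tango.DeviceProxy(self.TpmName)

    @attribute(dtype=float, unit="W")
    def power(self):
        # Power is only observable at the APIU feeding this antenna.
        return self._apiu.command_inout("GetAntennaPower", self.LogicalId)

    @attribute(dtype=float)
    def rmsLevel(self):
        # Signal level is only observable at the TPM input channel.
        return self._tpm.command_inout("GetInputRms", self.LogicalId)
```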
Figure 5-1. LFAA overall software architecture overview
MCCS physical devices are represented by green boxes in Figure 5-1. Table 5-1 shows the
relationship between these hardware devices and the physical devices listed in the LFAA PBS. Note
that in some instances a hardware device is mapped to multiple physical devices, in which case the
hardware device can interact with each physical device separately or through a controlling
management device. For example, the Sub-rack can control and monitor power and signal
distribution, presenting a rolled-up status (although direct device access is still permitted). The
detailed design of these devices, in terms of monitoring and control functionality, has not yet been
finalised.
Table 5-1. Link between hardware components as described in software and the physical
components as defined in the PBS

Component in Figure | Physical Component in PBS | PBS #
CMB | SPS Cabinet | 95
    |   Cabinet Chassis | 105
    |   AC Power distribution | 101
    |   Cooling System | 109
    |   Cabinet Management Board | 106
    | MCCS Cabinet | 120
    |   Cabinet Chassis | 105
    |   AC Power distribution | 101
    |   Cooling System | 109
    |   UPS | 133
SRMB | TPM Sub-rack | 128
    |   AC DC Power Supply | 138
    |   Sub-Rack Management Board | 158
    |   TPM Sub-rack | 162
APIU | Antenna Power Interface Unit (as a single entity) | 103
Antenna | Antenna (through APIU and TPM) | 139
TPM | TPM | 161
Switch | 100G Ethernet Switch (SPS, MCCS) | 98, 99
    | 1 Gb Ethernet Switch (MCCS) | 129
MCCS Server | MCCS High Performance Computing Units | 121
    | LMC Head Node | 130
Observation Management: Observation creation and management is a complex task which requires the interaction of most of the software components shown in Figure 5-1. The observation management sub-system contains the software components which are unique to this functionality, essentially the TANGO devices which manage subarrays, stations, station beams and transient buffers. This sub-system includes the calibration, pointing, DAQ and transient buffer processes, and is described in greater detail in Figure 5-2.
Cluster Management: The MCCS will be composed of at least 64 high-performance servers, each housing several GPUs. These numbers are based on the estimated bandwidth, memory and compute power required to calibrate and buffer (transient buffer) all the stations in LFAA. Each server is responsible for at most eight stations, such that each GPU can calibrate two. Cabinet TANGO devices (and those for the hardware within) and observation-related components are partitioned across the cluster and deployed on their associated server. Distributed storage is assumed, such that there is no central point of failure. A cluster manager and a storage manager will be used to administer these resources, as well as to allow the TANGO control system and observation components to submit jobs on the cluster; a sketch of such a job submission is given below.
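For illustration, a Station device could submit a job through the Cluster Manager TANGO device roughly as follows. The FQDNs, command name and job description format here are hypothetical assumptions; the actual interface is defined in the detailed design [RD9].

    from tango import DeviceProxy

    # Hypothetical device name for the Cluster Manager TANGO device
    cluster_manager = DeviceProxy("lfaa/mccs/cluster_manager")

    # A Station submits its DAQ job, passing its own FQDN so that the job
    # can create a proxy back to its creator (as described later in this
    # section).
    job_id = cluster_manager.command_inout(
        "SubmitJob", "daq --creator lfaa/mccs/station_001")
    print("submitted job:", job_id)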
Figure 5-2. LFAA observation management overview
Monitoring and Control: This subsystem contains all the elements defined in the SKA monitor and control guidelines, including the LFAA Master device, which is the root of the TANGO hierarchy and the main communication point for the LFAA, logging and alarm handling, as well as the TelState device which is managed by TM.
The interaction of observation-related devices is shown in Figure 5-2, while the following provides a high-level step-by-step description of what happens during observation creation and management (certain steps are omitted here; the full sequence is detailed in [RD7] Section 5.2):
1. When the system is started, 16 Subarrays and 512 Stations are created, each unassigned. For each Station, 8 Station Beams and one Transient Buffer device are instantiated. These remain idle until they are required for an observation.
2. At any point, TM can send an observation configuration command to a Subarray. Assuming all resources are available, Tiles are grouped into Stations, and the Stations are associated with the Subarray. If the stations were already initialised for a prior scan (such that all required SPS and MCCS resources are not in low-power mode and are already configured and calibrated), then the process skips directly to step 4. Subarray configuration includes the following operations:
a. MCCS will transition all required Field Nodes, SPS and MCCS resources from low-
power mode to the Ready state. The time it takes to do so depends on the time
required to stabilise the SPS racks (network switches take some time to switch on,
and the cooling system needs to stabilise)
b. When ready, the stations are initialised. TPMs are programmed and initialised (if
required) and signal processing starts. The beamforming chain does not need to be
initialised at this point.
c. Each Station submits a DAQ, Calibration and Bandpass job to the Cluster Manager,
which instantiates them. The jobs are provided with the TANGO FQDN of the creator
(the station) such that a proxy can be created. These jobs are initialised and wait for
incoming calibration spigot and diagnostic data from the station’s TPMs.
d. The Calibration process loads the previous gain and phase coefficients for this
station (if any).
e. The calibration cycle is started by instructing the TPMs to send LMC data to the MCCS server. The DAQ process reads this stream and generates the correlation matrix, which is dumped to disk. The Calibration process reads this and computes the phase and gain coefficients for one frequency channel at a time. These coefficients are written to the Station device, which downloads them to the TPMs. A frequency channel is calibrated every second in this manner.
f. Device-specific checks are performed, and any required alarms are created.
3. Once the system is fully calibrated (this can take one to two calibration cycles), TM is notified that configuration is complete.
4. TM sends the full subarray configuration and MCCS performs final configuration (note that this step should be compliant with SKA1-LFAA_MCCS_REQ-19):
a. The beamforming chain is configured on the TPMs (the station beams are not
transmitted to CSP at this point)
b. Each Station Beam and Transient Buffer device creates a Pointing and Transient
Buffer process (respectively).
5. TM sends the initial beam pointing polynomials, which are distributed to the respective Station Beams. The pointing processes calculate the required delay and delay rates per antenna and download them to the TPMs (a sketch of this calculation follows the list). The delay and delay rates are then updated periodically.
6. TM sends the start observation command to the subarray:
a. TPMs are instructed to send the generated station beam(s) to CSP
b. TPMs start transmitting the quantised station beam which is received by the
Transient Buffer process and stored in the internal buffer. If triggered by TM, the
required section of the buffer is transmitted to SDP (see [RD7] Section 5.2.9)
c. Diagnostic operations are performed routinely
7. At any point, TM can update the beam pointing polynomials
8. At any point, TM can read attributes from the devices contributing to the observation
9. At any point, TM can issue a command on the subarray which changes its state. These include abort and stop. The stop command stops the transmission of calibrated station beams to CSP. The abort command will, in addition, result in the de-configuration of the components. When stopping, the processes are terminated but the station/subarray configuration remains as is.
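The delay calculation in step 5 can be sketched as follows, assuming antenna positions in East-North-Up coordinates relative to the station reference point. The sign convention and the exact polynomial evaluation used by the real pointing process may differ; this is illustrative only.

    import numpy as np

    C = 299792458.0  # speed of light in m/s

    def geometric_delays(enu_positions, az, el):
        """Per-antenna geometric delay (s) towards azimuth/elevation (rad).

        enu_positions: (N, 3) array of antenna positions in metres,
        East-North-Up, relative to the station reference point.
        """
        direction = np.array([np.cos(el) * np.sin(az),   # East
                              np.cos(el) * np.cos(az),   # North
                              np.sin(el)])               # Up
        return enu_positions @ direction / C

    # Delay rates can be estimated from two pointings a short interval
    # apart, since the pointing process must track a moving source.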
Figure 5-3 shows additional elements in the monitoring and control subsystem which are exclusive
to LFAA (not covered in the SKA monitoring and control guidelines), which include:
- Beam Model Device, which can provide beam metrics for a particular azimuth and elevation
- Inventory Database and associated TANGO device. The LFAA must keep track of all hardware components and their serial numbers, which might be requested by TM. This database is also used during fault finding
- Command Line Interface and Graphical User Interface, which are clients to the LFAA LMC
used by external users and operators
The unified state of the telescope can be collected via the LFAA Master; however, the TelState device can also be communicated with directly (as with all other TANGO devices on the TANGO bus) to investigate the state of groups of devices, or individual ones.
Figure 5-3. LFAA local monitoring and control overview
Figure 5-4 shows the TANGO control hierarchy for the LFAA. Four types of TANGO devices are shown: green representing TANGO devices which are associated with a hardware component, yellow representing TANGO devices which are observation-related (logical devices representing observational entities), red representing TANGO devices which interface with third-party software, and blue representing TANGO devices which support the TANGO infrastructure or are required for the overall monitoring and control of the system, including devices specified in the SKA control guidelines [RD1]. The connections between devices in the diagram show relationships and multiplicities, with LFAA Master being the root of the hierarchy tree. Note that element-level devices (Alarm Handler, TelState, Element Logger) are functionally independent, providing different types of aggregation and functionality to TM or the other Elements.
Figure 5-4. TANGO control structure
5.2 Software Component List
Table 5-2 provides a short description of each entity in the figures described above, whilst Table 5-3
describes some of the relations between these components.
Table 5-2. List of elements in the Architecture System Overview

# | Name | Type | Multiplicity | Description
1 | Graphical User Interface | SW | 1 | A graphical interface through which users can locally access parts of the LMC, mainly to support maintenance and debugging
2 | Command Line Interface | SW | 1 | A wrapper around the LFAA Master which allows external libraries and clients to perform actions and request information
3 | Configuration Database | DB | 1 | A central store holding the configuration required to load and run the LMC
4 | Log Storage | DB | 1 | Storage for generated logs
5 | LFAA Master | SW | 1 | The LFAA Master device, which orchestrates all the operations of the LMC and acts as the communication point with external entities, particularly TM
6 | Element Logger | SW | 1 | The LMC device which handles element logging functionality
7 | Inventory Database | DB | 1 | A database containing the list of hardware devices and cables, including their location within the CPF and how they are interconnected
8 | Inventory Device | SW | 1 | A TANGO device which interfaces with the Inventory Database
9 | Subarray | SW | 16 | Creates, monitors and controls a subarray (a collection of station devices) when and as instructed by TM
10 | Station | SW | 1..512 | Creates, monitors and controls a logical station
11 | Station Beam | SW | 1..8 per station | Controls the pointing functionality for a station beam
12 | Beam Model Device | SW | 1 | TANGO device which contains the beam pointing model for an antenna and station
13 | TelState Device | SW | 1 | TANGO device which mirrors the TelState device in Telescope Manager
14 | Cluster Manager | SW/HW | 1 | TANGO device which interfaces with the cluster manager for monitoring, control and execution of jobs
15 | Transient Buffer | SW | 1 per station | TANGO device which controls the transient buffer process and processes triggers
16 | Transient Buffer Process | SW | 1 per station | Process which takes care of the transient buffer for a station
17 | DAQ Process | SW | 1 per station | Process which enables the reception and storage of data from TPMs
18 | Pointing Process | SW | 1..8 per station | Process which calculates the pointing coefficients for station beams
19 | Calibration Process | SW | 1 per station | Process which performs station calibration
20 | Bandpass Process | SW | 1 per station | Process which calculates the scaling factors for flattening the bandpass and runs diagnostics based on the antenna bandpass
21 | Cabinet Device | SW | 256 | TANGO device which monitors and controls the Cabinet Management Boards in SPS cabinets
22 | Sub-Rack Device | SW | 512 | TANGO device which monitors and controls the Sub-Rack Management Boards in SPS cabinets
23 | MCCS Server | HW | 64 | Physical high-performance server, making up the MCCS
24 | Switch | HW | 512+20 | Physical network switch, composing the LFAA-DN
25 | Switch Device | SW | 512+20 | TANGO device for monitoring and controlling switches
26 | TPM | HW | 8192 | Physical TPM, which hosts the digital signal processing chain
27 | Tile | SW | 8192 | TANGO device for monitoring and controlling a TPM
28 | Antenna | HW | 131072 | Physical antenna
29 | Antenna Device | SW | 131072 | TANGO device which monitors an antenna
30 | APIU | HW | 2048 | Physical APIU, which powers and monitors antennas
31 | APIU Device | SW | 2048 | TANGO device which monitors and controls an APIU
# | Component A | Component B | Relationship Description
1 | Graphical User Interface | LFAA Master | The GUI uses the LFAA Master’s API to provide local users and personnel with access to the LMC
2 | Command Line Interface | LFAA Master | The CLI uses the LFAA Master’s API to allow local and remote access to the LMC
3 | LFAA Master | Configuration Database | The LFAA Master uses the Configuration Database to initialize the LMC and keep track of configuration changes
4 | LFAA Master | Inventory Device | The LFAA Master provides high-level information and actions from/on the Inventory Database
5 | Inventory Device | Inventory Database | The Inventory Device manages and updates the Inventory Database
6 | Element Logger Device | Log Store | The Element Logger Device receives logs from TANGO devices and stores them in the Log Store
7 | LFAA Master | Beam Model Device | The LFAA Master provides beam metrics for a given azimuth and elevation when requested by TM
8 | TANGO Device | Alarm Handler | Alarms defined on TANGO devices are captured and processed by the Alarm Handler
9 | TANGO Device | TelState | TANGO devices can read the overall state of the telescope from the TelState device
10 | TANGO Device | Element Logger | Logs generated by TANGO devices are forwarded to the Element Logger for filtering and storage
11 | Station and Station Beam | Cluster Manager | The Station and Station Beam devices submit jobs to the Cluster Manager
12 | Subarray Device | Station Device | Subarray devices create, monitor and control a Station Device for each station in the subarray
13 | Station Device | Station Beam Device | The Station Device creates a Station Beam Device for each station beam. Each Station Beam will have an associated pointing process
14 | Station Device | Cluster Manager Device | Each Station Device submits DAQ, Calibration and Transient Buffer jobs to the Cluster Manager via the Cluster Manager Device
15 | Station Device | Transient Buffer Device | Each Station Device creates a Transient Buffer Device, which keeps track of the transient buffer for that station and responds to triggers, sending the buffered data to SDP
16 | Transient Buffer Process | Transient Buffer | The Transient Buffer Device launches a Transient Buffer Process and processes triggers
17 | DAQ Process | Station | The DAQ Process notifies the associated Station when a new file has been written to disk
18 | Calibration Process | Station | The Calibration Process updates the calibration coefficients being used by the associated station
19 | Bandpass Process | Station | The Bandpass Process calculates bandpass flattening factors and performs diagnostics on the antenna bandpass
20 | Pointing Process | Station Beam | The Pointing Process updates the delay and delay rates being used by the associated Station Beam
21 | Cluster Manager | MCCS Server | The Cluster Manager Device communicates with the Cluster Manager, allowing the rest of the LMC to submit jobs and monitor the state of running jobs
22 | Storage Manager | MCCS Server | The Storage Manager manages the disk space allocated to the distributed storage on MCCS servers
23 | Cabinet Device | Rack Management Board | The Cabinet Device monitors the cabinet environment (such as temperature) by interfacing with the rack management board
24 | Server Device | Server | The Server Device monitors the state of a server
25 | Switch Device | Switch | The Switch Device monitors the state of a switch, including statistics per port
26 | Cabinet Device | CMB | Monitors and controls the Cabinet Management Board
27 | SubRack Device | SRMB | Monitors and controls the Sub-Rack Management Board
28 | TPM Device | TPM | The TPM Device monitors and controls a TPM, including programming and initializing it, and allows the LMC to control the running firmware
29 | Antenna Device | Antenna | The Antenna Device monitors the state of an Antenna
30 | APIU Device | APIU | The APIU Device monitors and controls an APIU, including the ability to read out antenna power and shut off the antenna if required
Table 5-3. Relationships between major elements in the Architecture System Overview
Figure 5-5 shows a high-level module decomposition diagram which groups several of the software components described in this section into modules. It also shows the system services and software which are required to run the system (the System Services module), as described in [RD9]. The Hardware TANGO Devices and System Service TANGO Devices represent all the TANGO devices which interface with hardware devices (in MCCS and SPS) and software services (including the cluster manager, storage manager, node provisioner, and so on).
Figure 5-5. Software module decomposition diagram
5.3 Software-Hardware Mapping
Figure 5-6 shows the mapping between a subset of the array and some of the software components shown in Figure 5-2 for a specific observation setup. In this case a subarray is configured to contain three stations, one of which contains two sub-stations. For the software architecture, a sub-station is defined as a specific instance of a station beam in which a subset of the antennas does not contribute to the beam (a weight of 0 is applied to these antennas), such that a Sub-Station TANGO device is not required. Sub-stations are therefore defined through appropriate configuration of the station beams. Each station has 256 antennas, which are connected to 16 TPMs. Each TPM has an associated Tile TANGO Device instance through which all interactions with the TPM (and hence control of the antennas and beams) are performed. A Station Device instance is associated with each station and the required number of Station Beam Device instances are then configured with the station. Since station beams (and sub-stations) are pointed independently, delay calculation
(pointing) is performed at the station beam level, whilst calibration is performed at the station level (these relationships are shown in Figure 5-2). The Station instances are then grouped and associated with a single Subarray TANGO Device instance, through which observation control is coordinated by TM.
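A sketch of how a sub-station can be expressed purely through station beam antenna weights is given below. The helper function and the weight convention (unit weight for members, zero otherwise) are illustrative assumptions rather than the actual MCCS interface.

    import numpy as np

    N_ANTENNAS = 256  # antennas per station

    def substation_weights(member_antennas):
        """Unit weights for antennas in the sub-station, zero for the rest."""
        weights = np.zeros(N_ANTENNAS)
        weights[member_antennas] = 1.0
        return weights

    # A sub-station formed from the first 128 antennas of the station:
    # the remaining antennas get weight 0 and do not contribute to the beam.
    w = substation_weights(np.arange(128))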
Figure 5-6. Mapping between array and software components
5.4 Software Life Cycle
The Scaled Agile Framework, also known as SAFe, is an enterprise-scale development methodology developed by Scaled Agile, Inc. SAFe combines Lean and Agile principles within a templated framework. The main principles of SAFe interweave systems thinking and fast incremental development based on small and regular milestones within those increments. A summary of these principles can be found in [RD11].
Agile Release Trains
An Agile Release Train, or ART, is a fundamental concept within the Scaled Agile Framework, and is the primary value delivery method of SAFe. Agile Teams are small groups of individuals focused on defining, building, and testing solutions within a short time frame. An ART is a self-organizing, long-lived group of Agile Teams, whose purpose is to plan, commit, and execute solutions together. System development will have backlog items assigned in logical groupings, worked on within increments of stipulated duration (e.g. a few weeks per increment).
SAFe Implementation Overview
Given the sheer size and scope of SAFe, proper implementation can be rather daunting, especially when starting out. Since a full explanation of SAFe implementation would require tens of thousands of
words, and since more detailed information is available on the official website, we shall only cover a brief overview of implementation here:
1. Train Implementers: Due to the sheer scope and challenge required in adopting SAFe, most
organizations will need a combination of internal and external mentors and coaches. These
people should be capable of easily teaching and delivering SAFe techniques to others
throughout the organization.
2. Train Executives, Managers, and Leaders: The initial batch of Implementers should first focus
on training all executives, managers, and leaders. Once these fundamental team members
understand the Lean-Agile mindset, core SAFe principles, and implementation techniques,
the process will become much smoother for the entire organization.
3. Train Teams: Individuals should initially be organized into Agile Teams, who can then all be
trained on the various Lean, Agile, and SAFe principles.
4. Launch Agile Release Trains: Finally, once the organization has been properly trained, it’s
time to group Agile Teams together into ARTs, and then generate models for objective
planning, program execution, program increment planning, and all the other components
required for a successful Agile Release Train.
Essential SAFe
The essential basic configuration of the SAFe framework is shown in Figure 5-7 and provides all the
elements necessary to have a complete SAFe system. Rather than focus on explaining the SAFe
framework, we shall focus on particular elements within this framework, which require some
discussion.
Figure 5-7: Essential SAFe configuration.
The software development process will employ the following key principles – adapted in general
from the SAFe framework:
1. Collaborating closely both with stakeholders and with other developers, providing valuable feedback and collaboration.
2. Implementing functionality in priority order – the requirements will be developed based on
array assembly prioritisations – and these might change along the way.
3. Analysing and designing - The individual requirements are analysed by model storming on a
just-in-time (JIT) basis for a few minutes before spending several hours or days
implementing the requirement.
4. Ensuring quality – Use coding conventions, development guidelines and constant refactoring
for quality.
5. Regularly delivering working solutions - At the end of each development cycle/iteration
there will be a partial, working solution for demonstration/analysis.
6. Testing – Perform a significant amount of testing throughout construction.
For more detail on the framework, refer to [RD12].
Software Development Process During Construction Iterations
During construction iterations, developers will incrementally deliver high-quality working software which meets the changing needs of the system, as outlined in Figure 5-8.
Figure 5-8: Software development process during a construction iteration.
The Test-First Approach to Construction
The test-first approach to software development is shown in Figure 5-9. The full testing regime for MCCS is detailed in [RD10].
Figure 5-9: Test-first development approach.
Within the context of development iterations in an Agile approach, this test-first approach is
encompassed within iterations as shown in Figure 5-10.
Figure 5-10: Testing during construction iterations.
5.5 Commissioning
Project commissioning is the process of assuring that all systems and components of the project are
designed, installed, tested, operated, and maintained according to the operational requirements of
the stakeholders. A commissioning process may be applied not only to new projects but also to
existing units and systems subject to updates, refactoring, etc.
In practice, the commissioning process comprises the integrated application of a set of engineering
techniques and procedures to check, inspect and test every operational component of the project,
from individual functions, such as instruments and equipment, up to complex amalgamations such
as modules, software subsystems and systems.
Commissioning activities, in the broader sense, are applicable to all phases of the project, from the
basic and detailed design, procurement, construction and assembly, until the final handover of the
unit to the owner, including sometimes an assisted operation phase.
The testing procedures and acceptance process for all sub-units of the system, as well as for the integrated system working as a single element, are detailed in the MCCS Assembly Verification and Test Plan [RD10]. The commissioning procedure is made up of:
• Functional tests
• Non-functional tests
• A testing cycle for each test
• Regression testing
• A qualification process
• An acceptance process
It is assumed that the commissioning process for MCCS will form part of a wider commissioning
procedure. There are various completion and commissioning tools which can be utilised for this
purpose. With regard to MCCS, the commissioning process will support the AIV Element roll-out
plan [AD2]. The commissioning process will be split to cater for:
1. Full system commissioning
2. Hardware commissioning
3. Software/Code commissioning
More details of this split can be seen in the MCCS Detailed Design Document [RD9].
6 MCCS Physical Overview
MCCS is essentially a compute cluster, requiring enough compute processing power, network
bandwidth and memory space to run the MCCS software. Compute processing power is dominated
by the correlation and calibration processes, network bandwidth is dominated by the transmission
of calibration spigots from SPS to MCCS, while memory space is dominated by the fast transient
buffer. The compute servers are distributed across 4 MCCS cabinets which, apart from the compute
servers themselves, contain the required number of network switches to transport SPS LMC data from SPS to MCCS, interconnect the compute servers, and transmit the fast transient buffer to SDP. The following sections analyse the compute, network and cabinet requirements, and describe the software necessary for these components to function properly.
6.1 Compute Server
There are a total of 68 compute servers in MCCS distributed across 4 racks (including one spare per rack). Each compute server is responsible for 8 stations. [RD9] Section 5.1 provides an analysis of the compute, network and memory requirements for a single server, summarised below, resulting in the compute server configuration listed in Table 6-1:
• 4 high performance GPUs to run the correlation and calibration related processes
• One 100Gb interface for receiving data from 8 stations (64 TPMs)
• At least 1.5 TB RAM, primarily dominated by the space required to store the transient
buffers for eight stations
• About 80 CPU cores
Table 6-1. MCCS compute server configuration

Item | Quantity | Minimum Specification
Chassis | 1 | 1U, min 2x SATA, dual 1 Gb Ethernet, 2 kW redundant power supply, NVLink support
CPU | 2 | 20 cores, 2 GHz minimum
GPU | 4 | NVIDIA P100 with NVLink or equivalent
RAM | 12 | 128 GB 2666 MHz DDR4 (12 × 128 GB = 1536 GB, meeting the 1.5 TB requirement)
1 Gb interfaces | 1 | On chassis
100 Gb interfaces | 2 | Mellanox 100 Gb ConnectX-5 with 1 QSFP, or equivalent
SSDs | 2 | 1 TB 2.5” SATA 6.0 Gb/s
Two additional servers are included which act as the master and shadow master nodes of the MCCS cluster, on which the core LMC functionality, hardware configuration database, maintenance support tools, graphical user interface and other high-level software components will operate. These servers will also be responsible for configuring all of LFAA and for interacting with TM. The shadow node takes over when the master node is compromised.
6.2 Network
MCCS is connected to external entities and other LFAA Sub-elements through the network links shown in Figure 6-1. Communication with SPS goes through a single 100 Gb link between each SPS cabinet in the RPF and groups of two SPS cabinets in the CPF, totalling 110 100 Gb links. Communication with TM goes through a 1 Gb link, of which there are two for redundancy. The transient buffer is transmitted to SDP via a 100 Gb link provided by SaDT.
Figure 6-1. Network links between MCCS and external entities
These connections need to be distributed across the four racks which host MCCS. Core SPS cabinets have one 100 Gbps link per two cabinets to MCCS, RPFs within 25 km have one 100 Gbps link each to MCCS, and RPFs farther away than 25 km use DWDM through a muxponder, multiplexed to 100 Gbps to MCCS. Apart from the 100G network, there is a separate 1G network local to MCCS which is used for monitoring and control, and acts as a back-up in case the 100G network goes offline. MCCS is also responsible for the configuration, management and control of all the networks and network components within LFAA, including the data network which forms the backbone of SPS, as well as all external network links provided by SaDT.
Figure 6-2 shows the network diagram for a single MCCS rack. Compute servers are split into two groups, each connected to a separate 32-port 100 Gb network switch. Each 100 Gb network switch ingests 14 SPS links, except for the bottom switch of the first and last rack, which ingests 13 SPS links (6 × 14 + 2 × 13 = 110 links in total). A single 32-port 1 Gb network switch is required to interconnect all hardware devices within an MCCS rack, with enough free ports for creating a full 1G mesh with the rest of the racks. Links to TM and SDP are also shown; however, these are not present in all racks. The TM links are connected to the 1G switch in the central two racks, whilst the SDP links can be connected to any of the racks. The head/shadow nodes are also located in the central two racks, each requiring two 1 Gb links for redundancy. Note that in the diagram, links without a multiplicity denote a single link. The MCCS layout is described in detail in [RD9] Section 4.
Figure 6-2. MCCS network diagram
6.3 Rack Assembly
The cabinet design is presented in Figure 6-3, with each cabinet containing:
• 16 compute servers and one spare compute server
• Two 100 Gb switches
• One 1 Gb switch
• For two of the racks, an additional server to act as a master/shadow node
• For the racks containing the master/shadow node, a UPS
The head/shadow node and the 1 Gb switch connecting it to TM are connected to the UPS, such that if a power failure arises MCCS can inform TM and perform emergency shutdown operations, ensuring that the system will be capable of going back online when power is restored. Since the head/shadow servers will be low-power servers (when compared to the compute servers), a standard rack UPS should be able to provide enough up-time for the head/shadow node to perform these operations.
Figure 6-3. MCCS rack assembly
7 Scenarios
In this section some example scenarios have been chosen to be described in detail. The selection is considered to represent many similar operational scenarios. The following scenarios are documented in the subsections below:
1. Application of power to MCCS, including power sequencing
2. Transitioning to low power mode
3. Power down sequencing, that is transitioning to offline mode
4. Observation configuration and start
5. Calibration of LFAA and how this can detect failed antennas
6. Stopping an observation
7. Detection of MCCS failures, including redundancy for continued operations, and how failures are detected and reported by the software and failed units replaced
8. Software, BIOS and LRU firmware update
7.1 Application of power
When power is applied to MCCS, a boot-up sequence of minimal hardware and software components occurs. This will transition the operational state of these components from Unknown or Offline to Ready:
1. Power is applied to one of the racks which has a master or shadow node
2. The master and shadow nodes are configured to boot up on power, such that they will boot up and load the operating system. The 1 Gb network switch also powers up when power is applied, such that MCCS can then directly access the rack power supply.
3. An LMC bootstrap mechanism is run automatically at start-up which loads:
a. The bare-metal provisioning software
b. The distributed storage management software
c. The TANGO database
d. The TANGO starter
e. The LFAA Element Master (root TANGO device)
f. The Software Configuration database (in the case that this is an actual database and
needs to be loaded)
4. The LFAA Element Master will then start up the rest of the core LMC system by reading the required configuration, and communication with TM is established
5. At this point the MCCS head node is powered on. Action from TM is required to power the
rest of the MCCS as well as SPS. Once TM issues this command the power-on continues
6. The master/shadow node starts powering on the rest of the MCCS hardware one rack at a time (a sketch of the per-rack sequencing follows this list):
a. Rack power is enabled (for the racks which do not host the head/shadow node)
b. Power to the data switches is enabled
c. Power to the compute nodes is enabled (compute nodes are not configured to start up when power is applied)
d. For each compute node, the LMC Element Master instructs the bare-metal provisioning software to power it on. Nodes are powered sequentially, with the time between each node TBD. The provisioning software loads an operating system image on the compute node, which in turn goes through the boot-up process. Nodes are then ready for software provisioning
7. The LFAA Element Master provisions the required containers on each compute node to start the TANGO devices for monitoring and controlling the associated SPS hardware components, as well as the distributed storage and other required services
8. MCCS is now ready to accept and start observation configurations
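Step 6d could be implemented along the following lines. The provisioner interface is a hypothetical placeholder, and the power-on interval is an arbitrary example since the document leaves this time TBD.

    import time

    NODE_POWER_ON_INTERVAL = 5.0  # seconds between nodes; TBD in practice

    def power_on_rack(provisioner, compute_nodes):
        """Sequentially power on the compute nodes of one rack.

        'provisioner' stands in for the bare-metal provisioning software;
        its power_on() method is a hypothetical interface.
        """
        for node in compute_nodes:
            provisioner.power_on(node)          # load OS image and boot
            time.sleep(NODE_POWER_ON_INTERVAL)  # stagger the power-on load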
7.2 Transition to Low Power Mode
Compute servers in MCCS cannot be turned off, since they host TANGO devices which monitor and control SPS equipment. Equipment in SPS which is in low power mode still needs to be monitored (sensor and health status can still be accessed). The master and shadow nodes do not have a low power mode. Low-power mode for MCCS can be described as follows:
• A compute server can only be in low-power mode if all associated SPS equipment is in low-power mode (not being used for observations or part of a maintenance subarray)
• Network switches cannot be switched to low-power mode; however, they generally have power-saving features which can be used to reduce their power consumption
A compute server in low power mode translates to the following operations being performed (a sketch of these operations follows):
• Switch off GPUs or set their power management configuration to the minimal power consumption setting (this depends on the available GPU settings; PCI devices can also be disabled with appropriate kernel modules)
• Set all CPU cores to low power mode. CPU cores can also be disabled through appropriate Linux configuration
The power consumption of network switches depends on the network traffic, so they will automatically consume less power. Additionally, unused ports on the switches are disabled such that they do not consume any power.
During observation configuration, if a compute server in low power mode is required, the required GPUs and CPU cores are re-enabled or switched to the normal power configuration.
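A sketch of these low-power operations, using standard Linux sysfs and NVIDIA management interfaces, is shown below. The specific core indices, the 100 W power limit and the availability of these controls are illustrative assumptions that depend on the deployed hardware and drivers.

    import subprocess

    def enter_low_power_mode(cpu_cores, gpu_indices):
        """Illustrative low-power transition for one compute server."""
        # Take unneeded CPU cores offline via the Linux sysfs interface
        for core in cpu_cores:
            with open(f"/sys/devices/system/cpu/cpu{core}/online", "w") as f:
                f.write("0")
        # Lower each GPU's power limit (watts); 100 W is an arbitrary example
        for gpu in gpu_indices:
            subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", "100"],
                           check=True)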
7.3 Transition to Off-line
MCCS must support the ability to shut down the entire sub-element. This may be related to maintenance,
power saving measures, power supply emergency, etc. MCCS will support two types of shutdown:
• Controlled: orderly shutdown of servers and equipment
• Uncontrolled: immediate removal of power to running equipment
Controlled shutdown
To transition to off-line (controlled power shutdown) the MCCS head node will:
• Terminate all running observations (through the appropriate Devices, which will in turn
terminate all running compute processes on the compute nodes),
• Terminate the LMC control hierarchy for SPS (keeping the LMC core running up to this
point)
• Instruct the node provisioning system to shut down all compute servers
• Disable rack power to all racks except for the rack containing the head node
• Disable power to switches in the racks containing the master or shadow node
• Shut itself down
Note that power to the main rack must be switched off manually if required (or through the building management system).
Uncontrolled shutdown
In the event of a power emergency, MCCS may be instructed to perform an uncontrolled shutdown,
whereby the equipment is turned off as quickly as possible. In this situation the head node will send
the shutdown signal to all the compute nodes (through the node provisioning system, regardless of
what processing is being performed). When all the compute nodes are powered down (in the order
of a few seconds), the head node will disable rack power to all racks and shut itself down.
In the event of a power failure, where MCCS power is lost, all of MCCS will go offline except for the head and shadow nodes and the 1 Gb switches in the central two racks, which are connected to a UPS. This permits the head node to perform a proper shutdown and inform TM that MCCS has lost power and is going to shut down. The latter assumes that all intermediary switches between MCCS and TM are still powered. If the entire CPF loses power then TM will be unreachable.
7.4 Set up and Start Observation
Observation setup is described in detail in [RD7] Section 5.2.1 and summarised in this document,
Section 5.1. When the start observation command is received the following steps are performed:
1. TM sends the start scan command to the Subarray
2. The Subarray calls the start command on all associated Stations in parallel
3. The Station finalizes configuration on the Tiles. This includes:
a. Setting the CSP ingest node IP, MAC and port as the destination parameters for the final Tile in the chain
b. Instructing the Tiles to start transmission of data
4. Once all Tiles are configured, the Station returns a reply to the Subarray
5. The Subarray in turn waits for all Stations to finalize their configuration and returns a reply
to TM once configuration is finished
At this point signals are being processed and station beams are being sent to CSP. Throughout the observation, calibration and pointing coefficients are calculated and updated, and control data from the Tiles is received and processed accordingly. The parallel fan-out of steps 1 and 2 is sketched below.
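Steps 1 and 2 amount to a simple parallel fan-out over TANGO device proxies. The device names and command name below are hypothetical; the real Subarray device performs this internally.

    from concurrent.futures import ThreadPoolExecutor
    from tango import DeviceProxy

    # Hypothetical FQDNs for three stations associated with the subarray
    stations = [DeviceProxy(f"lfaa/mccs/station_{i:03d}") for i in (1, 2, 3)]

    # The Subarray starts all Stations in parallel; each call returns once
    # that Station's Tiles are configured and transmitting.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda s: s.command_inout("Start"), stations))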
7.5 Calibration
The calibration process as well as diagnostics which can be performed on the generated calibration
solutions is described in [RD7] Sections 5.2.6 and 5.2.7, summarised below:
• Raw channel data needs to be transmitted by all the TPMs forming part of a station. This is used for calibration (and diagnostics) and is not transmitted to CSP. This data is directed towards an MCCS compute node, assigned during initialization, on which a DAQ process is running.
• The DAQ process reads in this data and buffers it for correlation. This data stream amounts
to ~6.4Gbps.
• Once all the time samples for a frequency channel are received (that is, the stream switches
to a new frequency channel), the buffer is marked as ready and copied to GPU memory.
• The GPU correlator computes the auto and cross correlation of the data and integrates the
entire buffer to a single correlation matrix.
• The correlation matrix for the current frequency channel is saved to disk.
• Once the file is written, the Calibration process is notified
• Assuming a standard calibration algorithm implementation, the difference between the sky
model and acquired visibilities is minimized, generating a set of coefficients which describe
the difference between the two.
• The generated coefficients are sent to the Station device.
• The Station device then distributes the calibration coefficients to its Tiles, which download them to the TPMs.
• The Tile devices also distribute the calibration coefficients to the respective Antenna devices
(not shown), where they are archived for diagnostic purposes. These coefficients are kept in
the LFAA archive for several days.
Sanity checks and diagnostics on the generated calibration solutions are also performed to ensure that the system is stable and to detect misbehaving devices. These checks include:
• Comparing the calibration solutions for each antenna against each other to detect outlier antennas (for example by computing the RMS and evaluating antennas against it; a sketch follows this list)
• Checking how calibration solutions evolve in time to verify system stability (for example by seeing how each antenna’s RMS varies)
• Identifying noisy frequency channels (RFI)
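The first check above could, for instance, be realised as follows. The statistic and threshold are illustrative assumptions; the deployed diagnostics are those described in [RD7].

    import numpy as np

    def flag_outlier_antennas(gains, threshold=5.0):
        """Flag antennas whose calibration solutions deviate from the rest.

        gains: complex array of shape (n_antennas, n_channels) holding the
        per-antenna, per-channel calibration coefficients.
        """
        amplitudes = np.abs(gains)
        rms = np.sqrt(np.mean(amplitudes ** 2, axis=1))   # per-antenna RMS
        deviation = np.abs(rms - np.median(rms))
        spread = np.median(deviation) + 1e-12             # robust scale
        return np.where(deviation / spread > threshold)[0]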
7.6 Stop Observing
At any point TM can issue a stop or abort command on a running subarray:
• Stop: The current observation is stopped, and the subarray is moved back to the READY state. Data output to CSP is stopped. Jobs and Tiles are left configured, so that if the next observation requires the same parameters the devices do not have to be re-configured.
• Abort: Abort moves the subarray to the ABORTED state. The possible state changes from this are to the CONFIGURING and IDLE states, which means that all resources can be freed up (to be re-used later). Output to CSP is first stopped to avoid invalid data being transmitted while aborting the observation. All running jobs are terminated (through the initiating device via the Cluster Manager device). Tiles are de-configured (but not put in low-power mode). Station, Station Beam and Tile devices are unassigned.
When the subarray receives a reset command while in the READY state, the same operations as abort above are performed. Additionally, the Tiles are de-programmed and placed in low-power mode. This also happens when the command is received whilst in the FAULT state.
7.7 MCCS Failures
MCCS can suffer failures at any point: during observation configuration, while observing, or whilst in low power mode. For hardware failures, the hardware is switched off, its status is changed to FAULTY and TM is notified. The MCCS software has a hardware configuration database which, apart from storing the configuration of all hardware components, contains their location within the CPF and RPF to help maintenance personnel quickly localise the equipment. The following equipment can become faulty (for several reasons):
• Compute node, in which case the spare server in the rack takes over the operations of this compute node. A total of four spare compute nodes are always present in MCCS, such that up to four nodes can become faulty. If more than four nodes are faulty or offline then the associated resources (stations) cannot be used until the faulty nodes are replaced.
• 100G switch, in which case the incoming signal from SPS passing through this switch will be blocked (there is no redundancy for high-bandwidth data between SPS and MCCS). LMC
communication can be re-routed through other switches. Within MCCS there are several redundant links between switches, such that LMC traffic can be re-routed if a switch is faulty or offline.
• 1 G switch, in which case rack devices which are controlled through the 1 Gb network (such
as switches and rack power) will become unreachable. LMC communication between
compute nodes can still be routed through the 100Gb network.
• Master node, in which case the shadow node will take over all operations performed by the master node. If the shadow node also becomes faulty then all MCCS and SPS equipment will be unreachable by TM, and the LMC system will become unavailable.
When faulty LRUs are replaced, maintenance personnel must update the hardware configuration database with the device’s new IP address (through the provided tools). This is not required for compute servers, since the provisioning software can automatically detect new nodes once they are physically powered up.
Software failures are entirely handled by the LMC software system. TANGO devices and system services are automatically restarted after a crash. It is assumed that all software running in MCCS will have been thoroughly tested during the verification stage (having gone through the software test cycle).
7.8 Software Upgrades
Software upgrades and updates will happen throughout the commissioning phases of the SKA, as well as during its long lifetime. Upgrades should happen with minimal disruption of service, that is, with minimal impact on the capabilities of the telescope. These upgrades can be split into three types:
Software upgrades
This refers to all software running on MCCS, including the OS and other system software, management software, third-party software and bespoke software developed for MCCS. The way in which these are updated depends on whether they are running on a compute node or a cluster/head node.
Upgrading software on the master node
When a software upgrade is required on the head nodes, the shadow node is updated first (since it only mirrors the functionality of the master node, no disruption is caused). Once the upgrade is complete, several tests are performed to verify that the upgrade process was successful and that all required functionality is still available. Once ready, the shadow node takes over control from the master (becomes the master), which is in turn upgraded in the same manner. This can be used to update the operating system, system libraries and services, and the TANGO core system.
Upgrading software on compute nodes
The OS images for compute nodes are stored locally on the master node. These images can be updated and versioned independently of what is running on the compute nodes. This scheme is used to update the operating system and system libraries and services. When a compute node is rebooted, the new OS image is used to load the compute node. This can either be performed during a scheduled maintenance window in which all the MCCS compute nodes are rebooted, or in a staged manner where compute nodes are rebooted when they are not in use (with their running system offloaded to the spare servers, the spare servers having been upgraded first).
Updates to observation-related software (such as the calibration and correlation algorithms) result in a new version of the binaries, which are stored on the master node. When a new observation is defined it will simply use the new (or any required) version of the software. These programs are launched in containers on the compute nodes, so no system updates are required.
TANGO devices running on the compute nodes are also launched in containers, such that the same scheme as that for observation-related software can be used. In this case the new version of the TANGO device is first launched, and once it is running the older version of the device is stopped; a sketch of this swap is given below.
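This start-new-then-stop-old swap could look roughly as follows. The container runtime (Docker here) and the naming scheme are illustrative assumptions, since the document does not prescribe a specific container technology.

    import subprocess

    def upgrade_device_container(name, new_image):
        """Blue/green-style upgrade of a containerised TANGO device."""
        # Start the new version alongside the running one
        subprocess.run(["docker", "run", "-d", "--name", name + "-new",
                        new_image], check=True)
        # Once the new device is confirmed running, retire the old version
        subprocess.run(["docker", "stop", name], check=True)
        subprocess.run(["docker", "rm", name], check=True)
        subprocess.run(["docker", "rename", name + "-new", name], check=True)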
BIOS updates
It is inevitable that BIOS updates will become available during the lifetime of the MCCS servers. They are generally installed through software provided by the manufacturer and require the node to be rebooted. For the master and shadow nodes the same scheme as above can be used, where operations are taken over by one server whilst the other runs the BIOS update program. For updating the BIOS of a compute node (which must not be performing any observation-related functionality), all TANGO devices running on the node are offloaded to a spare server, after which the BIOS update program is run. When complete, the TANGO devices are set to run again on the updated node.
LRU firmware updates
Additional hardware in MCCS and SPS, including network switches and power supply units, will also need firmware and software updates. These updates are generally performed by manufacturer-provided software. The device will be offline during the update, which will result in down-time.