Name, Designation, Affiliation, Signature and Date:
Authored by: A. Magro, Subject Matter Expert, AADC. Date: 2018-11-01
Owned by: M. Waterson, Domain Specialist, SKAO. Date: 2018-11-01
Approved by: P. Gibbs, Engineering Project Manager, SKAO. Signed: Philip Gibbs. Date: 2018-11-01
Released by: J. G. Bij de Vaate, Consortium Lead, AADC. Date: 2018-11-01
MCCS ARCHITECTURE OVERVIEW
Document number ...................................................................... SKA-TEL-LFAA-0600050
Context ........................................................................................................................ DRE
Revision ......................................................................................................................... 01
Author ........................................................................................... A. Magro, A. DeMarco
Date ................................................................................................................. 2018-10-31
Document Classification ............................................................. FOR PROJECT USE ONLY
Status ................................................................................................................... Released
DOCUMENT HISTORY
Revision | Date of Issue | Engineering Change Number | Comments
A | 2018-06-04 | - | Draft Template version released within consortium
01 | 2018-10-31 | - | First Release
DOCUMENT SOFTWARE
Package | Version | Filename
Wordprocessor (MS Word) | Word 2016 | SKA-TEL-LFAA-0600050-01 MCCS Architecture Overview
Block diagrams | - | -
Other | - | -
ORGANISATION DETAILS
Name: Aperture Array Design and Construction Consortium
Registered Address: ASTRON, Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands
Tel. +31 (0)521 595100
Fax. +31 (0)521 595101
Website: www.skatelescope.org/lfaa/
Copyright/Document owner: Aperture Array Design and Construction Consortium. This document is written for internal use in the SKA project.
TABLE OF CONTENTS
1 INTRODUCTION ............................................................................................. 8
1.1 Purpose of the document ....................................................................................................... 8
1.2 Scope of the document ........................................................................................................... 8
1.3 Intended Audience .................................................................................................................. 8
1.4 Document Overview ............................................................................................................... 8
1.5 Document Tree ....................................................................................................................... 9
2 REFERENCES .............................................................................................. 10
2.1 Applicable documents........................................................................................................... 10
2.2 Reference documents ........................................................................................................... 10
3 MCCS ARCHITECTURE OVERVIEW ................................................................... 11
3.1 Telescope Overview .............................................................................................................. 11
3.2 LFAA Overview ...................................................................................................................... 12
3.3 Role of MCCS in LFAA ............................................................................................................ 14
3.4 Main MCCS Responsibilities .................................................................................................. 14
3.5 MCCS Top-Level Static Decomposition Diagram .................................................................. 18
3.6 Interfaces .............................................................................................................................. 18
External Entities ............................................................................................................ 18
Level 4 and Level 5 Components .................................................................................. 18
External Interfaces ........................................................................................................ 19
Internal Interfaces ......................................................................................................... 20
4 OPERATIONAL CONCEPTS .............................................................................. 23
Operational Environment ............................................................................................. 23
Operations................................................................................................................. 23
Maintenance ............................................................................................................. 24
Operator Role ............................................................................................................ 24
Support Environment .................................................................................................... 24
On-site Maintainer role ............................................................................................. 24
Off-site Maintainer role ............................................................................................ 24
Remote support ........................................................................................................ 24
States and Modes.......................................................................................................... 25
5 MCCS SOFTWARE OVERVIEW ........................................................................ 28
5.1 Overview of Software Architecture ...................................................................................... 28
5.2 Software Component List ..................................................................................................... 33
5.3 Software-Hardware Mapping ............................................................................................... 37
5.4 Software Life Cycle ................................................................................................................ 38
Agile Release Trains ...................................................................................................... 38
SAFe Implementation Overview ................................................................................... 38
Essential SAFe ............................................................................................................... 39
Software Development Process During Construction Iterations .............................. 40
The Test-First Approach to Construction .................................................................. 41
5.5 Commissioning ...................................................................................................................... 43
6 MCCS PHYSICAL OVERVIEW .......................................................................... 44
6.1 Compute Server .................................................................................................................... 44
6.2 Network ................................................................................................................................ 44
6.3 Rack Assembly ....................................................................................................................... 46
7 SCENARIOS ................................................................................................ 48
7.1 Application of power ............................................................................................................. 48
7.2 Transition to Low Power Mode ............................................................................................. 49
7.3 Transition to Off-line ............................................................................................................. 49
Controlled shutdown .................................................................................................... 49
Uncontrolled shutdown ................................................................................................ 50
7.4 Set up and Start Observation ................................................................................................ 50
7.5 Calibration ............................................................................................................................. 50
7.6 Stop Observing ...................................................................................................................... 51
7.7 MCCS Failures ....................................................................................................................... 51
7.8 Software Upgrades ............................................................................................................... 52
Software upgrades ........................................................................................................ 52
BIOS updates ................................................................................................................. 53
LRU firmware updates .................................................................................................. 53
LIST OF FIGURES
Figure 1-1 SKA1 LFAA Element Documentation Tree ............................................................. 9
Figure 3-1 SKA1 Telescope Overview .................................................................................................... 11
Figure 3-2 SKA1_Low Functional Diagram ............................................................................................ 12
Figure 3-3. LFAA overall architecture.................................................................................................... 13
Figure 3-4. LFAA observation organization ........................................................................................... 15
Figure 3-5. MCCS top-level static decomposition ................................................................................. 17
Figure 3-6. LFAA L3 context diagram .................................................................................................... 21
Figure 3-7. MCCS - Field interface ......................................................................................................... 21
Figure 3-8. MCCS - SPS interface ........................................................................................................... 22
Figure 4-1 MCCS Sub-Element top-level context diagram showing all external interfaces ................. 23
Figure 4-2. Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are
mandatory for each hardware and software component. ......................................................... 27
Figure 5-1. LFAA overall software architecture overview ..................................................................... 28
Figure 5-2. LFAA observation management overview .......................................................................... 30
Figure 5-3. LFAA local monitoring and control overview ...................................................................... 32
Figure 5-4. TANGO control structure .................................................................................................... 33
Figure 5-5. Software module decomposition diagram ......................................................................... 37
Figure 5-6. Mapping between array and software components .......................................................... 38
Figure 5-7: Essential SAFe configuration............................................................................................... 39
Figure 5-8: Software development process during a construction iteration. ...................................... 41
Figure 5-9: Test-first development approach. ...................................................................................... 42
Figure 5-10: Testing during construction iterations. ............................................................................ 42
Figure 6-1. Network links between MCCS and external entities .......................................................... 45
Figure 6-2. MCCS network diagram ...................................................................................................... 46
Figure 6-3. MCCS rack assembly ........................................................................................................... 47
LIST OF TABLES
Table 3-1. LFAA numbers ...................................................................... 14
Table 3-2. External interfaces ............................................................................................................... 19
Table 3-3. L2 interfaces to other LFAA sub-elements ........................................................................... 20
Table 4-1. MCCS states and modes ....................................................................................................... 25
Table 5-1. Link between hardware components as described in software and the physical
components as defined in the PBS ............................................................................................. 29
Table 5-2. List of elements in the Architecture System Overview ........................................................ 33
Table 5-3. Relationships between major elements in the Architecture System Overview .................. 36
Table 6-1. MCCS compute server configuration ................................................................................... 44
LIST OF ABBREVIATIONS
AADC ................................. Aperture Array Design and Construction Consortium
AAVS ................................. Aperture Array Verification System
ADC ................................... Analog to Digital Converter
Ad-n .................................. nth document in the list of Applicable Documents
APIU .................................. Antenna Power Interface Unit
AIV .................................... Assembly Integration and Verification
BIOS ................................. Basic Input/Output System
CDR ................................... Critical Design Review
CI ....................................... Configuration Item
CMB .................................. Cabinet Management Board
COTS ................................. Commercial Off The Shelf
CPF .................................... Central Processing Facility
CM .................................... Configuration Manager
CPU .................................. Central Processing Unit
CSP .................................... Central Signal Processing
DAQ .................................. Data Acquisition
DDD ................................... Detailed Design Document
DMS .................................. Document/Data Management System
ECP .................................... Engineering Change Proposal
EMI .................................... Electro Magnetic Interference
FN ..................................... Field Node
FoV .................................... Field of View
FPGA ................................. Field Programmable Gate Array
GPU ................................... Graphics Processing Unit
HW .................................... Hardware
ICD .................................... Interface Control Document
INFRAAUS ......................... Infrastructure Australia
ISO..................................... International Organisation for Standardisation
LFAA .................................. Low Frequency Aperture Array
LFAA-DN ............................ Low Frequency Aperture Array – Data Network
LMC ................................... Local Monitoring and Control
FQDN ................................ Fully Qualified Device Name
LNA ................................... Low Noise Amplifier
LRU .................................... Line Replaceable Unit
MCCS................................. Monitor, Control and Calibration subsystem
MRO .................................. Murchison Radio-astronomy Observatory
MWA ................................. Murchison Widefield Array
PBS .................................... Product Breakdown Structure
PPS .................................... Pulse Per Second
QA ..................................... Quality Assurance
RD-N .................................. nth document in the list of Reference Documents
RAM ................................. Random Access Memory
RMS .................................. Root Mean Square
RF ...................................... Radio Frequency
RFI ..................................... Radio Frequency Interference
RFoF .................................. Radio Frequency signal over Fibre
RPF .................................... Remote Processing Facility
SAD ................................... Software Architecture Document
SaDT .................................. Signal and Data Transport
SDP .................................... Science Data Processor
SKA .................................... Square Kilometre Array
SKA-LOW ........................... SKA low frequency part of the full telescope
SKAO ................................. SKA Office
S/N .................................... Signal to noise
SPS ................................... Signal Processing Subsystem
SRMB ................................ Sub-Rack Management Board
SSD ................................... Solid State Drive
SW ..................................... Software
TANGO .............................. TAco Next Generation Objects
TCP-IP ................................ Transmission Control Protocol – Internet Protocol
TBC .................................... To Be Confirmed
TBD ................................... To Be Determined
TM ..................................... Telescope Management
TPM ................................... Tile Processor Module
UPS.................................... Uninterruptible Power Supply
WBS .................................. Work Breakdown Structure
WP .................................... Work Package
1 Introduction
1.1 Purpose of the document
The purpose of this document is to describe the architecture of the Monitor, Control and
Calibration Sub-system (MCCS) for the Low Frequency Aperture Array (LFAA) of SKA Phase 1. It
references detailed design documents for the hardware and network setup, as well as a software
architecture document describing the software system that will run on the MCCS hardware.
Combined, these determine the operational concept, cost, power, equipment space, reliability,
availability and maintainability of the MCCS.
This document should be read after the LFAA Architectural Design and Analysis Document [AD3].
1.2 Scope of the document
This document describes how the LFAA MCCS architecture can meet the requirements within the
SKA LFAA Signal Processing Requirement Specification.
The level of detail in this document is sufficient to:
1. Define interfaces with other SKA Elements and LFAA Sub-elements.
2. Establish a reasonable baseline design at reasonably low perceived risk.
3. Estimate time, effort and cost to deliver the functionality specified in the LFAA Signal
Processing Sub-Element Requirements Specification [AD7].
In other words, the LFAA Sub-Element design is defined in enough detail to reduce the risk of
effort/time/cost overruns in the Construction Phase.
The current release (100% version) will support the Critical Design Review for the LFAA Element. The
level of detail is enough to have high confidence in the referenced design being compliant and able
to be constructed with low risk. This Architecture Design Document (ADD), with references to
supporting information and data, will provide a design artefact to support the Construction Phase
activities.
1.3 Intended Audience
This document is expected to be used by the LFAA Element Consortium Engineering and
Management Team, the SKAO System Engineering Team and the SKAO LFAA Project Manager. It is
also expected to be read by the external CDR review panel.
1.4 Document Overview
This document follows a template agreed between the SKAO and the LFAA Consortium.
It covers the key contents called out in the LFAA SOW [AD8].
Detailed information is contained in reference documents.
1.5 Document Tree
The overall document tree for the LFAA Element is shown in Figure 1-1. Level 1 (L1) is the SKA
System (telescope) level, L2 is the LFAA Element level and L3 is the LFAA sub-element level (where
MCCS resides).
Figure 1-1 SKA1 LFAA Element Documentation Tree
2 References
2.1 Applicable documents
The following documents are applicable to the extent stated herein. In the event of conflict between
the contents of the applicable documents and this document, the applicable documents shall take
precedence.
[AD1] SKA1 System Baseline Design, SKA-TEL-SKO-0000002, Issue 01
[AD2] Roll-out Plan for SKA1 Low, SKA-TEL-AIV-4410001 Issue 05
[AD3] LFAA Architectural Design Document, SKA-TEL-LFAA-0200028
[AD4] SKA1 TM to LFAA ICD, 100-000000-028, Issue 02
[AD5] SKA1 LFAA to INFRA AUS ICD, 100-000000-003, Issue 03
[AD6] SKA1 SADT to LFAA ICD, 100-000000-026, Issue 04
[AD7] SKA1 LFAA SPS Sub-Element Requirements Specification, SKA-TEL-LFAA-0400014
[AD8] SKA1 LFAA Element Statement of Work
2.2 Reference documents
The following documents are referenced in this document. In the event of conflict between the
contents of the referenced documents and this document, this document shall take precedence.
[RD1] SKA1 Control System Guidelines, 000-000000-010, Issue 01
[RD2] LFAA Internal Interface Control Document SKA-TEL-LFAA-0200030, Issue 01
[RD3] CISPR 22 Information technology equipment - Radio disturbance characteristics - Limits
and methods of measurement R2014
[RD4] CISPR 24 Information technology equipment - Immunity characteristics - Limits and
methods of measurement 2010
[RD5] CISPR 32 Electromagnetic compatibility of multimedia equipment - Emission
requirements 2015
[RD6] CISPR 35 Electromagnetic compatibility of multimedia equipment - Immunity
requirements
[RD7] MCCS Software Architecture Document, SKA-TEL-LFAA-0600052
[RD8] SPS Detailed Design Document, SKA-TEL-LFAA-0500035
[RD9] MCCS Detailed Design Document, SKA-TEL-LFAA-0600051
[RD10] MCCS Assembly Verification and Test Plan, SKA-TEL-LFAA-0600053
[RD11] SAFe Principles: https://www.scaledagileframework.com/safe-lean-agile-principles/
[RD12] Essential SAFe: https://www.scaledagileframework.com/essential-safe/
3 MCCS Architecture Overview
3.1 Telescope Overview
Figure 3-1 shows the major SKA1 Observatory entities: SKA1-Low in Australia, SKA1-Mid in South
Africa and the SKA Global Headquarters in the UK. The thick flow-lines show the unidirectional
transport of large amounts of digitised data from the antennas to the Central Processing Facilities
(CPF) on the sites, and from the CPFs to the Science Data Processor (SDP) and Archive facilities. The
thin blue dash-dot lines show the bidirectional transport of system monitor and control data.
The SKA1-Low telescope array includes 512 stations, each consisting of 256 dual-polarisation log-
periodic antennas. The stations are distributed over a distance of 65 km, with the greatest density of
stations in the central core. The Central Processing Facility is located on site, while the SDP and
archive are located in Perth. Additionally, each station can be divided into a number of smaller sub-stations
at reduced bandwidth.
A more detailed schematic of the SKA1-Low telescope, extracted from the SKA1 System Baseline V3
Description (in preparation), is shown in Figure 3-2. This figure shows the major SKA1-Low signal
flow components, as well as the areas of consortia responsibility (red boxes) and the key
technologies needed to implement the components. The green dashed line shows the bi-directional
flow of monitor, control and operational data, and the orange dot-dashed line shows the distribution
of synchronisation and timing signals.
Figure 3-1 SKA1 Telescope Overview
A schematic of the SKA1_Low Telescope, extracted from the Baseline Design [AD1], is shown below,
including the LFAA Element (product [101-000000]).
SKA1-Low operates in imaging and non-imaging modes concurrently, with between 1 and 16+
sub-arrays in operation at the same time. Each sub-array is programmable as a separate conceptual
telescope in terms of antenna pointing, band selection and the setting of configurable imaging and
non-imaging parameters. The only things that are not shared between sub-arrays are observation
time, communications links and some processing resources.
Figure 3-2 SKA1_Low Functional Diagram
3.2 LFAA Overview
The LFAA is primarily a hardware-centric element, such that hardware configuration, monitoring and
control is a central feature and architectural driver. The physical architecture is defined in Figure 3-3
and the system consists of the following major components:
1. Stations, consisting of Field Nodes, Antenna Power Interface Unit(s) and meshes
2. Digital System, consisting of:
a. Signal Processing Subsystem (SPS)
b. SPS Network
3. Monitor, Control and Calibration Sub-system (MCCS), including the MCCS network
LFAA is responsible for reception and digitization of low frequency band (50 MHz to at least 350
MHz) signals transmitted from astronomical objects. The architecture is built around a high-speed
switched network which is controlled by MCCS in a centralized and highly configurable system. The
SPS provides the infrastructure required to support signal conditioning, digitization and processing
functionalities of the TPMs. It consists of cabinets with internal cooling, power and clock distribution,
each receiving a 10 MHz reference and a 1 PPS signal from the Synchronisation and Timing (SAT)
system, which are distributed to each TPM. Each cabinet also includes the first-level data switches
(i.e. those directly connected to the TPMs), which allow tile beams to be formed by summing the
signals from sixteen antennas, and station beams to be formed by summing the tile beams within a
single station. Beamforming itself is performed within the TPMs.
Figure 3-3. LFAA overall architecture
TPMs are the primary components responsible for the processing of signals. They are located within
the processing facilities (the CPF is shown in the diagram) and are housed within Signal Processing
Sub-system (SPS) cabinets. Each TPM receives the analogue RF-over-fibre optical signals from 16
dual-polarisation antennas and converts them back to electrical RF signals. Each signal is then
filtered to limit the frequency bandwidth, amplified, digitized and channelized into ~1 MHz coarse
frequency channels (512 coarse frequency channels in total). Calibration coefficients are applied to
each frequency channel, whilst beamforming delays are applied to each antenna per beam. The
partial beam stream is sent to a digital switch to generate station beams: to generate a station beam,
the outputs of 16 TPMs are combined by making use of one of the data switches.
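As a rough illustration of the per-channel operations just described (a numpy sketch, not the TPM firmware; the coefficient values and channel grid are made up for the example), the calibration and pointing corrections reduce to complex per-antenna, per-channel weights applied before the sum over antennas:

```python
# Illustrative numpy sketch of the per-channel tile beamforming described
# above. Coefficient values and the channel grid are invented for the
# example; the real operations run in TPM firmware.
import numpy as np

N_ANT, N_CHAN, N_SAMP = 16, 512, 256     # antennas per TPM, coarse channels, samples

rng = np.random.default_rng(0)
volt = (rng.standard_normal((N_ANT, N_CHAN, N_SAMP))
        + 1j * rng.standard_normal((N_ANT, N_CHAN, N_SAMP)))

# Complex calibration coefficient per antenna and channel (from MCCS).
cal = np.exp(1j * rng.uniform(0, 2 * np.pi, (N_ANT, N_CHAN)))

# Pointing: a per-antenna delay becomes a per-channel phase ramp.
tau = rng.uniform(-50e-9, 50e-9, N_ANT)              # seconds, per antenna
freq = 50e6 + np.arange(N_CHAN) * 0.78125e6          # assumed channel grid, Hz
point = np.exp(-2j * np.pi * np.outer(tau, freq))    # shape (N_ANT, N_CHAN)

# Tile beam: weighted sum over the 16 antennas, per channel and sample.
tile_beam = ((cal * point)[:, :, None] * volt).sum(axis=0)
print(tile_beam.shape)                               # (512, 256)
```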
The data network is a standard high-speed (40 Gb/s or 100 Gb/s) network which will transport the
various data streams, i.e. control and monitoring information as well as signal data. This network
provides connectivity between the TPMs, the MCCS and the Low correlator and beamformer
(CBF-LOW), involving long-haul links from the TPMs that are located in the Remote Processing
Facilities (RPF).
There will be a total of 256 SPS cabinets, each containing four sub-racks, such that each cabinet is
responsible for two stations. Each sub-rack contains a Sub-rack Management Board which
distributes power, the 1 Gb network, and the 10 MHz and PPS signals. A cabinet-wide management
unit is responsible for distributing these signals to the sub-racks. A single 100 Gb switch connects the
TPMs to the LFAA Network. The Sub-rack Management Board also acts as a proxy for monitoring and
controlling APIUs. MCCS cabinets host at least 16 high-performance servers (plus one or two
additional servers for redundancy), such that each server is responsible for at most eight stations.
MCCS cabinets also host a number of 100 Gb switches to connect the SPS racks to the MCCS servers.
Table 3-1 provides a summary of the number of components in the LFAA and how they are spread
across cabinets (refer to [RD8] for a detailed SPS cabinet design).
Table 3-1. LFAA numbers
Total number of antennas: 131072
Total number of stations: 512
Antennas per station: 256
Antennas per TPM: 16
TPMs per station: 16
Total number of TPMs: 8192
Signals per TPM: 32
Frequency channels: 512
Maximum beams per station: 8
Total number of SPS cabinets: 256
Sub-racks per cabinet: 4
TPMs per sub-rack: 8
100Gb switches per MCCS cabinet: 2
Total number of MCCS cabinets: 4
Servers per cabinet: 17 (+ 0/1)
SDP-LFAA link speed: 100Gb/s
TM-LFAA link speed: 1Gb/s
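The entries in Table 3-1 are mutually consistent; a few lines of Python make the arithmetic explicit (this is just a cross-check of the figures above, not additional design data):

```python
# Cross-check of the headline numbers in Table 3-1.
stations, ants_per_station, ants_per_tpm = 512, 256, 16

tpms_per_station = ants_per_station // ants_per_tpm     # 16
total_tpms = stations * tpms_per_station                # 8192
total_antennas = stations * ants_per_station            # 131072

sps_cabinets = stations // 2                            # 2 stations per cabinet -> 256
assert total_tpms == sps_cabinets * 4 * 8               # 4 sub-racks of 8 TPMs each
assert 32 == 2 * ants_per_tpm                           # dual polarisation -> 32 signals per TPM

print(total_antennas, total_tpms, sps_cabinets)         # 131072 8192 256
```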
3.3 Role of MCCS in LFAA
The MCCS performs the local monitoring, control and calibration functions for the stations and
supporting products. It receives commands from, and reports the LFAA status to, TM. It comprises a
compute cluster (hardware resources composed of off-the-shelf high-performance servers), local
power and cooling distribution, a local network, and job management software to support the LFAA
monitor and control functions. The MCCS is connected to both the SPS and the LFAA Network. It also
calculates the beamforming and calibration coefficients. The MCCS controls the TPMs, the M&C and
data networks, as well as the supporting hardware in the cabinets. It is also responsible for
implementing the transient buffer and transmitting the buffer, when instructed, to SDP via a
dedicated 100 Gb link.
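At its core, the transient buffer is a ring buffer over recent beamformed data that can be dumped on request. The following is a minimal sketch of the idea only; the class name, buffer sizing and data layout are illustrative assumptions, not the MCCS design:

```python
# Minimal ring-buffer sketch for a station-beam transient buffer.
# Sizes, names and layout are illustrative only.
import numpy as np

class TransientBuffer:
    """Keep the most recent `depth` blocks of beam samples; on a trigger
    from TM, return the buffered data for forwarding to SDP."""

    def __init__(self, depth: int, block: int):
        self._data = np.zeros((depth, block), dtype=np.complex64)
        self._next = 0        # slot to overwrite next (oldest when full)
        self._full = False

    def write(self, samples: np.ndarray) -> None:
        self._data[self._next] = samples
        self._next = (self._next + 1) % len(self._data)
        self._full = self._full or self._next == 0

    def dump(self) -> np.ndarray:
        """Return the buffered blocks in time order, oldest first."""
        if not self._full:
            return self._data[:self._next].copy()
        return np.roll(self._data, -self._next, axis=0).copy()

buf = TransientBuffer(depth=8, block=1024)
for i in range(11):
    buf.write(np.full(1024, i, dtype=np.complex64))
assert buf.dump()[0][0] == 3        # blocks 0-2 have been overwritten
```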
3.4 Main MCCS Responsibilities
The two primary responsibilities of the MCCS sub-system are to:
1. Create and monitor observations, including calibration and buffering beamformed data
for transient detection
2. Provide monitoring and control capability for all the hardware and software components
The software architecture for the LFAA is primarily driven by these responsibilities, whilst the sizing
of the MCCS hardware is defined by the resource requirements for calibration, transient buffers and
supporting operations. Observation management is the primary use case for MCCS and defines the
primary functional requirements for the software system, whilst most of the remaining
requirements can be seen as features and specifications required to ensure that the primary use
case remains online, available and working properly, and meets the science cases to which the LFAA
should cater. The functional requirements which the MCCS should provide can be summarized as
follows:
• Create and manage observations, where an observation consists of one subarray containing
multiple stations, which in turn can be composed of multiple sub-stations
• Perform calibration, pointing and bandpass flattening coefficient calculation for running
observations
• Manage TPMs, including downloading firmware, initialising and synchronising the boards
and firmware, and updating required coefficients and configurations throughout the lifetime
of an observation
• Provide a transient buffer such that, when triggered by TM, buffered station beams can be
forwarded to SDP
• Expose maintenance functionality for fault finding, mitigation and correction
• Monitor and control TPMs, antennas and other hardware and software components, and
provide a mechanism for generating reports
• Provide a logging mechanism and store logs for a period of time, where said logs should be
queryable by external parties
• Raise alarms and events to inform internal and external entities of state and other changes
of LFAA components
• Routinely perform status and diagnostic checks
• Provide an inventory database recording labelled hardware components and cables, so that
issues within the CPF and RPFs can be localised easily
• Interact with external entities, including TM, SDP, CSP, operators, engineers and hardware
and software deployers
Observation creation and management, with the associated need to control TPMs, calibrate the
arrays and buffer station beams, is the main driving factor of the architecture, as well as for defining
the minimal performance requirements for sizing the MCCS hardware. The need to monitor all
hardware and software devices, including the need for an alarm and notification system, led to the
adoption of TANGO by the SKA community as the primary control system for the SKA. Through
TANGO, most of the purely LMC-related requirements are met by properly integrating TANGO within
the architecture.
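To make the TANGO approach concrete, below is a minimal PyTango sketch of how an LFAA hardware component could be exposed as a TANGO device. The class, attribute and command names are illustrative placeholders; the actual MCCS device interfaces are specified in [RD7].

```python
# Minimal PyTango sketch of a monitorable LFAA component as a TANGO
# device. Names are placeholders, not the MCCS device interface.
from tango import DevState
from tango.server import Device, attribute, command, run

class DemoTile(Device):
    def init_device(self):
        super().init_device()
        self._temperature = 35.0
        self.set_state(DevState.ON)

    @attribute(dtype=float, unit="degC")
    def boardTemperature(self):
        # A real device would poll the hardware here; TANGO alarm limits
        # configured on the attribute drive the ALARM state.
        return self._temperature

    @command
    def Initialise(self):
        # Placeholder for firmware download / synchronisation logic.
        self.set_state(DevState.ON)

if __name__ == "__main__":
    run((DemoTile,))
```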
Figure 3-4. LFAA observation organization
The primary use case of the LFAA is to generate station beams. Observation organization is shown
in Figure 3-4 and described below:
• A group of 16 Antennas (connected to a TPM) is called a Tile
• A Subarray is a set of Stations grouped together for a single observation scheduling block. A
Station is composed of 256 antennas (distributed across 16 Tiles). The LFAA uses the concept
of a Sub-array to conform with the SKA control guidelines, for grouping related Tiles and
storing Sub-array related metadata. There is no Sub-array specific operation performed in
the signal chain.
• The number of Subarrays which can be defined is configurable (there is no fixed limit). This
document assumes a maximum of 16 Subarrays; however, this can be changed
• A sub-station is defined as a specific instance of a station beam in which a subset of the
antennas does not contribute to the beam (a weight of 0 is applied to these antennas)
• Each Station can generate up to 8 Station Beams
• The Antennas within each Station need to be calibrated (gain and phase calibration). This is
performed on the MCCS servers. The calibration cycle is 10 minutes. During these 10
minutes, coarse frequency channels (from the channels in the Station Beams) are calibrated
in a round-robin fashion, such that each channel is calibrated in ~1 second
• For each Station Beam, given a pointing polynomial, the delay and delay rate per antenna
need to be calculated so that pointing coefficients can be generated (a simplified sketch of
this calculation is given after this list). Delays and delay rates per antenna are calculated on
the MCCS servers, whilst pointing coefficients per antenna/channel (given the delay and
delay rate) are calculated on the TPMs.
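The following sketch illustrates the per-antenna delay and delay-rate calculation for one station beam. The pointing polynomial is reduced here to a time-dependent unit vector and the antenna layout is randomly generated; both are assumptions made purely for illustration.

```python
# Illustrative per-antenna delay / delay-rate calculation for one
# station beam. Geometry and pointing model are invented for the example.
import numpy as np

C = 299_792_458.0                                  # speed of light, m/s

def beam_direction(t: float) -> np.ndarray:
    """Unit pointing vector at time t (stand-in for the TM polynomial)."""
    az, el = 0.1 + 1e-5 * t, 0.8 - 2e-6 * t        # radians, slowly drifting
    return np.array([np.cos(el) * np.sin(az),
                     np.cos(el) * np.cos(az),
                     np.sin(el)])

rng = np.random.default_rng(1)
antenna_xyz = rng.uniform(-20.0, 20.0, (256, 3))   # station antenna positions, m

def delays(t: float) -> np.ndarray:
    """Geometric delay per antenna in seconds: tau = -(r . s_hat) / c."""
    return -antenna_xyz @ beam_direction(t) / C

t0, dt = 0.0, 1.0
tau = delays(t0)                                   # per-antenna delays, s
tau_rate = (delays(t0 + dt) - delays(t0)) / dt     # per-antenna delay rates, s/s

# tau and tau_rate are what the MCCS servers would hand to the TPMs,
# which turn them into per-channel pointing coefficients.
print(tau.shape, tau_rate.shape)                   # (256,) (256,)
```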
Figure 3-5. MCCS top-level static decomposition
3.5 MCCS Top-Level Static Decomposition Diagram
MCCS Compute Processor and MCCS Software are the main components of the MCCS. A brief
architectural description of the software component is presented in Section 5, while an in-depth
analysis and description is provided in [RD7]. The MCCS Compute Processor component is composed
of four almost-identical MCCS cabinets deployed in the CPF. Each cabinet hosts at least 16 high-performance
servers which are connected to SPS via a data network and interconnected both
through the same data network and through a dedicated monitoring and control network. In two of
the four cabinets an LMC head node (one master, one shadow) hosts the core LMC software and
manages both MCCS and SPS. Figure 3-5 shows the top-level static decomposition of MCCS.
3.6 Interfaces
This section describes the external entities to MCCS, the level 4 and level 5 components composing
MCCS, as well as all internal and external interfaces to MCCS.
External Entities
There are no external entities in the MCCS static decomposition diagram. Verification and
maintenance support equipment is not described in detail in this document.
Level 4 and Level 5 Components
Level 4 decomposition has only three elements:
• 4 MCCS Compute Processors, which house the High-Performance Computing Units together
with the data network LRUs which connect the servers together as well as provide
connections to SPS
• MCCS Software, which encapsulates all the LMC and supporting software infrastructure for
MCCS
• LMC infrastructure hardware, comprising one master node and a shadow master node
which is used as a failover in the event that the master node becomes compromised
An MCCS Compute Processor is composed of:
• The cabinet chassis, holding all other hardware components
• 17 high-performance computing units, one of which is a spare (kept in low-power mode
until needed)
• Four 100 Gb 32-port Ethernet switches, implementing a single 100 Gb Ethernet network for
science and LMC data
• One 1 Gb/s 32-port Ethernet switch for control and management across MCCS
• The AC distribution system, distributing power to all Level 4 components under CMB control
• Required cabling
Additionally, two of the MCCS Compute Processors contain one LMC Infrastructure node
together with an associated UPS.
The MCCS software is logically partitioned into several L5 components:
• Local Monitor and Control TANGO Framework, which encapsulates most of the LMC
functionality
• Management Software Module, which manages the hardware and software configuration of
the whole LFAA
• Graphical User Interface, which provides an engineering user interface for use in
commissioning, testing and maintenance
• Data Acquisition Software Module, which is responsible for acquiring LMC data transmitted
by SPS, used for calibration, transient buffering and diagnostics
• Pointing Software, which computes the delay and delay rate per antenna for a given
station/sub-station configuration
• Calibration Software, which runs the calibration algorithm and generates calibration
coefficients that are transmitted to SPS
• Diagnostic Software, which monitors the state of the LFAA (both hardware and software),
including Field Node diagnostics, calibration diagnostics and network diagnostics. Some of
these diagnostics can be performed within the associated TANGO devices; however, others
require a larger amount of processing power, in which case they are run as standalone
applications.
External Interfaces
The external interfaces between MCCS and other elements are listed in Table 3-2 and shown in
Figure 3-6, whilst the interfaces between MCCS and other LFAA sub-elements are listed in Table 3-3
and shown in Figure 3-7 and Figure 3-8. The external interfaces are defined in [AD4]/[AD5]/[AD6],
whilst the internal interfaces are defined in [RD2]; MCCS intends to be compliant with them.
Table 3-2. External interfaces
External Entity | Interface ID | Leading Organization | Key Data or Message Flows
TM | S1L.TM_LFAA.001 | TM | Overall LFAA monitoring and control functionality
SDP | S1L.SDP_LFAA.002 | SDP | Transient buffer
SDP | S1L.SPA_LFAA.001 | SDP | Global sky model updates
SaDT | S1L.SADT_LFAA.007 | SaDT | Monitor and control and NTP – physical link
SaDT | S1L.SADT_LFAA.009 | SaDT | Transient buffer data – physical link
INAU | S1L.LFAA_INAU.005 | LFAA | Rack power
INAU | S1L.LFAA_INAU.008 | LFAA | Rack cooling
INAU | S1L.LFAA_INAU.009 | LFAA | Floor space
Table 3-3. L2 interfaces to other LFAA sub-elements
LFAA Entity | Interface ID | Key Data or Message Flows
SPS | S1L.MCCS_SPS.001 | Physical links between CPF and MCCS
SPS | S1L.MCCS_SPS.002 | Physical links between RPFs and MCCS
SPS | S1L.MCCS_SPS.003 | Calibration, transient data exchange between SPS and MCCS
SPS | S1L.MCCS_SPS.004 | LMC data exchange between SPS TPMs and MCCS
SPS | S1L.MCCS_SPS.005 | LMC data exchange between SPS CMBs and MCCS
SPS | S1L.MCCS_SPS.006 | LMC data exchange between SPS SRMBs and MCCS
SPS | S1L.MCCS_SPS.007 | LMC data exchange between SPS Network and MCCS
Field Node | S1L.MCCS_FN.001 | LMC data exchange between FN and MCCS
Internal Interfaces
The physical interfaces within MCCS are those required for:
• Distribution of power from the rack power supplies to the PDU and subsequently to the rack
equipment (via the UPS in the case of the head and shadow nodes)
• 1Gb and 100Gb network connectivity.
Interfaces between software components are described in the MCCS Software Architecture
Document [RD7].
Figure 3-6. LFAA L3 context diagram
Figure 3-7. MCCS - Field interface
Figure 3-8. MCCS - SPS interface
4 Operational Concepts
Figure 4-1 MCCS Sub-Element top-level context diagram showing all external interfaces
Operational Environment
The screened Central Processing Facility (CPF) will house the MCCS equipment as well as other
surrounding/support equipment, such as that used for SaDT timing and networks, the CSP correlator
and the LFAA SPS.
The CPF is an RFI-shielded facility supporting liquid cooling. This facility has some level of ESD
protection and has HVAC filters to prevent dust accumulation on equipment. Notwithstanding the
RFI-shielded facility, LFAA LRUs, including those comprising MCCS, are required individually to meet
CISPR-22/32 Class A [RD3]/[RD5] radiated and conducted emissions levels. Additionally, MCCS LRUs
must meet CISPR 24/35 [RD4]/[RD6] Class A radiated and conducted susceptibility levels or
equivalent.
Operations
During normal operations MCCS is controlled via the interface with TM [AD4]. MCCS implements a
high-level interface which allows TM to control and monitor MCCS as a single instrument. A single
point of access is provided for housekeeping commands such as power-up, power-down, and state
and mode transitions. Monitoring and error reporting are subscription-based; all parameters that
may be of interest to TM and operations in general, including the rolled-up overall operational state
and health, are available for subscription. In addition, the MCCS interface provides introspection,
i.e. allows an authorized client to ‘discover’ and access parameters and commands implemented by
the lower level components when required to support diagnostics and maintenance.
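Assuming PyTango, the subscription and introspection mechanisms described above could be exercised by a client roughly as follows; the device name used here is a placeholder, not the real MCCS FQDN:

```python
# Sketch of subscription-based monitoring and introspection from a
# client, assuming PyTango. The device name is a placeholder.
import tango

proxy = tango.DeviceProxy("low-mccs/control/control")   # placeholder FQDN

# Introspection: discover the attributes and commands the device exposes.
print(proxy.get_attribute_list())
print(proxy.get_command_list())

# Subscription-based monitoring: receive state changes as events
# instead of polling.
def on_health_change(event):
    if not event.err:
        print("healthState ->", event.attr_value.value)

sub_id = proxy.subscribe_event("healthState",
                               tango.EventType.CHANGE_EVENT,
                               on_health_change)
# ... later, when monitoring is no longer needed:
proxy.unsubscribe_event(sub_id)
```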
Signal processing functions are controlled via sub-arrays and scans. MCCS supports the configuration
and monitoring of sub-arrays, i.e. provides high-level commands [AD4] that TM can use to sub-divide
the Low Telescope into up to 16 sub-arrays and operate each sub-array independently. MCCS
exposes sub-arrays as top-level entities and makes provision for TM to assign antennae to sub-arrays
and select signal processing functions to be performed per sub-array. A scan is defined as a time
interval during which a sub-array's configuration does not change. During normal operations, TM
accesses a sub-array directly to assign antennae, select signal processing functions, and start and
stop the scan (i.e. start and stop signal processing).
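A client-side view of the sub-array and scan life cycle described above might look as follows. This is a hypothetical sequence: the command names, argument formats and device name are placeholders, and the real interface is defined in the TM-LFAA ICD [AD4].

```python
# Hypothetical sub-array/scan life cycle from the TM side, assuming
# PyTango. All command names and arguments are placeholders.
import json
import tango

subarray = tango.DeviceProxy("low-mccs/subarray/01")    # placeholder name

# 1. Assign stations to the sub-array.
subarray.command_inout("AssignResources", json.dumps({"stations": [1, 2, 3]}))

# 2. Configure: pointing, band selection and processing functions.
subarray.command_inout("Configure", json.dumps({"station_beams": 1}))

# 3. A scan is the interval during which this configuration is fixed.
subarray.command_inout("Scan")
# ... observing ...
subarray.command_inout("EndScan")

# 4. Release resources once the scheduling block completes.
subarray.command_inout("ReleaseAllResources")
```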
Maintenance
The MCCS Sub-element is an IT system and therefore requires typical data centre/IT system
maintenance. Maintenance personnel will accordingly be typical IT-support personnel, and normally
only such personnel are expected to be required at the site itself.
Maintenance is described further in [RD9] Section 14.
Operator Role
Apart from maintenance activities, LFAA (and thus MCCS) is remotely controlled via TM, and
ultimately by an operator within the TM environment. The operator/maintainer role and how it
relates to MCCS is described in more detail in [RD9] Section 14.
Support Environment
Support for MCCS will be provided both on-site (i.e. at Boolardy) and off-site (i.e. at the SKA1_Low
Telescope support facility at Geraldton and/or in/near Perth), as well as remotely. On- and off-site
support is described in more detail in [RD9] Section 14.
On-site Maintainer role
The MCCS on-site maintainer needs a technical hardware support background as described in [RD9]
Section 14 to execute the required maintenance tasks. This maintainer’s primary objective is to
detect and isolate faulty LRUs (corrective maintenance) and to remove and replace these to restore
the MCCS functionality. The maintainer’s secondary objective is to determine what maintenance
needs to be scheduled (predicted and preventative maintenance) and to coordinate and perform the
required tasks when scheduled.
Off-site Maintainer role
The off-site maintainer needs software/hardware technical support background to perform second
line LRU repairs, configuration, and verification as described in [RD9] Section 14. The off-site
maintainer is located at the SKA1_Low Telescope support facility. The off-site maintainer removes
and replaces selected SRUs to repair LRUs, configures the repaired COTS LRUs, and tests all repaired
equipment in a representative environment to verify that they are fully operational. Once this is
confirmed, LRUs are returned to the on-site or close-to-on-site spares store.
Remote support
MCCS maintenance and support personnel will remotely connect to the CPF over the SKAO
communication network to read equipment status, review equipment log files and access MCCS long
term monitoring data that is stored in the TM Engineering Data Archive (EDA), to help isolate faults.
Off-site support is generally more specialized, and is hence preferred for detecting and isolating
faults. Off-site support will be provided where possible to assist the Telescope on-site operations
personnel in diagnosing faulty LRU equipment and problems with telescope functionality (firmware
and software). The general rule is: if something can be done remotely, it should be done remotely,
but with on-site capability and assistance where such capability is useful (such as GUIs on consoles
to track down problems, as described in [RD7] Section 7). See [RD9] Section 14 for more information
on remote support.
States and Modes
The MCCS implementation of states and modes is compliant with the SKA Control System Guidelines
document [RD1]. Per these guidelines, MCCS implements and reports the standard set of SKA state
and mode indicators for SPS, individual sub-arrays and MCCS itself. MCCS monitors state and mode
transitions and, based on the status reported by LFAA sub-systems, derives the overall LFAA state
and mode indicators. For more detailed information on how states and modes are implemented in
the MCCS software architecture, refer to [RD7].
Table 4-1 lists the states and modes for a sub-array. The states and modes are applicable to all
hardware, software and logical components, although it is not mandatory that all states and modes
are applied to each component. Figure 4-2 shows the state transition diagram as derived from [RD1].
Table 4-1. MCCS states and modes

adminMode (read-write): set by an outside authority (operations via TM and MCCS).
• ONLINE: The sub-array can be used for scientific observing.
• MAINTENANCE: The sub-array is not to be used for scientific observing but can be used for
testing and commissioning.
• OFFLINE: The sub-array is not to be used at all.
• NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): MCCS intelligently rolls up the operational state of all components used by the
sub-array and reports the overall operational state for the sub-array.
• INIT: The sub-array is being initialized.
• OFF: The sub-array is 'empty'; no receptors have been assigned to the sub-array.
• ON: At least one receptor has been allocated to the sub-array; the sub-array is ready to
accept a scan configuration.
• ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits.
Some or all functionality may not be available.
• DISABLE: The sub-array is administratively disabled (adminMode=OFFLINE or NOT_FITTED);
basic monitor and control functionality is available, but signal processing functionality is not
available.
• FAULT: An unrecoverable fault has been detected. The sub-array is not available for use;
maintainer/operator intervention is required.
• UNKNOWN: The sub-array is unresponsive, e.g. due to loss of communication.
healthState (read-only): MCCS intelligently rolls up attribute quality factors, states, and other
indicators for all components and capabilities used by the sub-array and reports the overall
sub-array healthState. Range: OK, DEGRADED, FAILED.

obsState (read-only): the sub-array Observing State indicates status related to scan configuration
and execution.
• IDLE: The sub-array is not processing input data and is not generating output products.
When a sub-array is IDLE, SCAN ID=0.
• CONFIGURING: Transient state entered when a command to re-configure the sub-array is
received. The sub-array leaves this state when re-configuration is completed.
• READY: The sub-array enters READY when re-configuration has been completed.
• SCANNING: The sub-array is processing input data and generating output products.
• ABORTED: The sub-array transitions to this state when an 'abort scan' command is received.
In this state re-configuration, delay tracking, and any other on-going processing functions
are stopped.
• FAULT: An unrecoverable error that requires operator intervention has been detected.
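The healthState roll-up lends itself to a compact illustration. The sketch below encodes two of the attributes in Table 4-1 as Python enums and applies a naive worst-case roll-up rule; the actual roll-up logic is defined in [RD7].

```python
# Table 4-1 attributes as enums, with a naive worst-case health roll-up.
# The roll-up rule shown is illustrative, not the MCCS algorithm.
from enum import Enum

class AdminMode(Enum):
    ONLINE = 0
    MAINTENANCE = 1
    OFFLINE = 2
    NOT_FITTED = 3

class HealthState(Enum):
    OK = 0
    DEGRADED = 1
    FAILED = 2

def roll_up(component_states):
    """Overall sub-array health from its components' health states."""
    states = list(component_states)
    if any(s is HealthState.FAILED for s in states):
        return HealthState.FAILED
    if any(s is HealthState.DEGRADED for s in states):
        return HealthState.DEGRADED
    return HealthState.OK

assert roll_up([HealthState.OK, HealthState.DEGRADED]) is HealthState.DEGRADED
```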
Figure 4-2. Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are mandatory
for each hardware and software component.
5 MCCS Software Overview
5.1 Overview of Software Architecture
The software infrastructure of the LFAA must cater for the responsibilities specified above, with a
focus on telescope monitoring and control, and observation management. Additionally, the
architecture must meet the non-functional requirements listed in [RD9] Section 3.3. A high-level
description of the LFAA software architecture is shown in Figure 5-1. The diagram separates
components which are within the software architecture context from those which are considered
external (here, the Telescope Manager and hardware devices). Note that not all software
components are shown, to avoid clutter. The architecture itself is separated into four sub-systems
which communicate with each other over the TANGO bus. This separation is purely logical, since
almost all software components are implemented as TANGO devices (or have an associated TANGO
device). These sub-systems are:
Hardware Devices: Each monitorable and/or controllable hardware device in the LFAA has an
associated TANGO device through which all operations are performed. These include: TPMs,
antennas, APIUs, switches, rack management units and servers. Note that an antenna cannot be
monitored and controlled directly; these operations have to go through the APIU and TPM to which
the antenna is connected.
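The indirection for antennas can be sketched as follows, assuming PyTango: the antenna device holds no hardware connection of its own and forwards reads through proxies to its APIU and TPM devices. Device classes, properties, attributes and command names are all placeholders.

```python
# Sketch of the antenna indirection described above, assuming PyTango.
# All names (device classes, properties, commands) are placeholders.
import tango
from tango.server import Device, attribute, device_property

class DemoAntenna(Device):
    ApiuName = device_property(dtype=str)     # TANGO name of the APIU device
    TpmName = device_property(dtype=str)      # TANGO name of the TPM device
    LogicalId = device_property(dtype=int)    # antenna index within APIU/TPM

    def init_device(self):
        super().init_device()
        self._apiu = tango.DeviceProxy(self.ApiuName)
        self._tpm = tango.DeviceProxy(self.TpmName)

    @attribute(dtype=float, unit="W")
    def power(self):
        # Power is only observable at the APIU feeding this antenna.
        return self._apiu.command_inout("GetAntennaPower", self.LogicalId)

    @attribute(dtype=float)
    def rmsLevel(self):
        # Signal level is only observable at the TPM input channel.
        return self._tpm.command_inout("GetInputRms", self.LogicalId)
```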
Figure 5-1. LFAA overall software architecture overview
MCCS physical devices are represented by green boxes in Figure 5-1. Table 5-1 shows the
relationship between these hardware devices and the physical devices listed in the LFAA PBS. Note
that in some instances a hardware device is mapped to multiple physical devices, in which case the
hardware device can interact with each physical device separately or through a controlling
management device. For example, the Sub-rack can control and monitor power and signal
distribution, presenting a rolled-up status (although direct device access is still permitted). The
detailed design of these devices, in terms of monitoring and control functionality, has not yet been
finalised.
Table 5-1. Link between hardware components as described in software and the physical
components as defined in the PBS

Component in Figure | Physical Component in PBS | PBS #
CMB | SPS Cabinet | 95
    |   Cabinet Chassis | 105
    |   AC Power distribution | 101
    |   Cooling System | 109
    |   Cabinet Management Board | 106
    | MCCS Cabinet | 120
    |   Cabinet Chassis | 105
    |   AC Power distribution | 101
    |   Cooling System | 109
    |   UPS | 133
SRMB | TPM Sub-rack | 128
    |   AC DC Power Supply | 138
    |   Sub-Rack Management Board | 158
    |   TPM Sub-rack | 162
APIU | Antenna Power Interface Unit (as a single entity) | 103
Antenna | Antenna (through APIU and TPM) | 139
TPM | TPM | 161
Switch | 100G Ethernet Switch (SPS, MCCS) | 98, 99
    | 1 Gb Ethernet Switch (MCCS) | 129
MCCS Server | MCCS High Performance Computing Units | 121
    | LMC Head Node | 130
Observation Management: Observation creation and management is a complex task which requires the interaction of most of the software components shown in Figure 5-1. The observation management sub-system contains the software components which are unique to this functionality, essentially the TANGO devices which manage subarrays, stations, station beams and transient buffers. This sub-system includes the calibration, pointing, DAQ and transient buffer processes, and is described in greater detail in Figure 5-2.
Cluster Management: The MCCS will be composed of at least 64 high-performance servers, each housing several GPUs. These numbers are based on the estimated bandwidth, memory and compute power required to calibrate and buffer (transient buffer) all the stations in LFAA. Each server is responsible for at most eight stations, such that each GPU can calibrate two. Cabinet TANGO devices (and those for the hardware within) and observation-related components are partitioned across the cluster and deployed on their associated server. Distributed storage is assumed, such that there is no central point of failure. A cluster manager and a storage manager will be used to administer these resources, as well as to allow the TANGO control system and observation components to submit jobs on the cluster; a sketch of such a job submission is given below.
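For illustration, a Station device could submit a job through the Cluster Manager TANGO device roughly as follows. The FQDNs, command name and job description format here are hypothetical assumptions; the actual interface is defined in the detailed design [RD9].

    from tango import DeviceProxy

    # Hypothetical device name for the Cluster Manager TANGO device
    cluster_manager = DeviceProxy("lfaa/mccs/cluster_manager")

    # A Station submits its DAQ job, passing its own FQDN so that the job
    # can create a proxy back to its creator (as described later in this
    # section).
    job_id = cluster_manager.command_inout(
        "SubmitJob", "daq --creator lfaa/mccs/station_001")
    print("submitted job:", job_id)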
Figure 5-2. LFAA observation management overview
Monitoring and Control: This subsystem contains all the elements defined in the SKA monitor and control guidelines, including the LFAA Master device, which is the root of the TANGO hierarchy and the main communication point for the LFAA, logging and alarm handling, as well as the TelState device which is managed by TM.
The interaction of observation-related devices is shown in Figure 5-2, while the following provides a high-level step-by-step description of what happens during observation creation and management (certain steps are omitted here; the full sequence is detailed in [RD7] Section 5.2):
1. When the system is started, 16 Subarrays and 512 Stations are created, each unassigned. For each Station, 8 Station Beams and one Transient Buffer device are instantiated. These remain idle until they are required for an observation.
2. At any point, TM can send an observation configuration command to a Subarray. Assuming all resources are available, Tiles are grouped into Stations, and the Stations are associated with the Subarray. If the stations were already initialised for a prior scan (such that all required SPS and MCCS resources are not in low-power mode and are already configured and calibrated), then the process skips directly to step 4. Subarray configuration includes the following operations:
a. MCCS will transition all required Field Nodes, SPS and MCCS resources from low-
power mode to the Ready state. The time it takes to do so depends on the time
required to stabilise the SPS racks (network switches take some time to switch on,
and the cooling system needs to stabilise)
b. When ready, the stations are initialised. TPMs are programmed and initialised (if
required) and signal processing starts. The beamforming chain does not need to be
initialised at this point.
c. Each Station submits a DAQ, Calibration and Bandpass job to the Cluster Manager,
which instantiates them. The jobs are provided with the TANGO FQDN of the creator
(the station) such that a proxy can be created. These jobs are initialised and wait for
incoming calibration spigot and diagnostic data from the station’s TPMs.
d. The Calibration process loads the previous gain and phase coefficients for this
station (if any).
e. The calibration cycle is started by instructing the TPMs to send LMC data to the MCCS server. The DAQ process reads this stream and generates the correlation matrix, which is dumped to disk. The Calibration process reads this and computes the phase and gain coefficients for one frequency channel at a time. These coefficients are written to the Station device, which downloads them to the TPMs. A frequency channel is calibrated every second in this manner.
f. Device-specific checks are performed, and any required alarms are created.
3. Once the system is fully calibrated (this can take one to two calibration cycles), TM is notified that configuration is complete.
4. TM sends the full subarray configuration and MCCS performs final configuration (note that this step should be compliant with SKA1-LFAA_MCCS_REQ-19):
a. The beamforming chain is configured on the TPMs (the station beams are not
transmitted to CSP at this point)
b. Each Station Beam and Transient Buffer device creates a Pointing and Transient
Buffer process (respectively).
5. TM sends the initial beam pointing polynomials, which are distributed to the respective Station Beams. The pointing processes calculate the required delay and delay rates per antenna and download them to the TPMs (a sketch of this calculation follows the list). The delay and delay rates are then updated periodically.
6. TM sends the start observation command to the subarray:
a. TPMs are instructed to send the generated station beam(s) to CSP
b. TPMs start transmitting the quantised station beam which is received by the
Transient Buffer process and stored in the internal buffer. If triggered by TM, the
required section of the buffer is transmitted to SDP (see [RD7] Section 5.2.9)
c. Diagnostic operations are performed routinely
7. At any point, TM can update the beam pointing polynomials
8. At any point, TM can read attributes from the devices contributing to the observation
9. At any point, TM can issue a command on the subarray which changes its state. These include abort and stop. The stop command stops the transmission of calibrated station beams to CSP. The abort command will, in addition, result in the de-configuration of the components. When stopping, the processes are terminated but the station/subarray configuration remains as is.
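The delay calculation in step 5 can be sketched as follows, assuming antenna positions in East-North-Up coordinates relative to the station reference point. The sign convention and the exact polynomial evaluation used by the real pointing process may differ; this is illustrative only.

    import numpy as np

    C = 299792458.0  # speed of light in m/s

    def geometric_delays(enu_positions, az, el):
        """Per-antenna geometric delay (s) towards azimuth/elevation (rad).

        enu_positions: (N, 3) array of antenna positions in metres,
        East-North-Up, relative to the station reference point.
        """
        direction = np.array([np.cos(el) * np.sin(az),   # East
                              np.cos(el) * np.cos(az),   # North
                              np.sin(el)])               # Up
        return enu_positions @ direction / C

    # Delay rates can be estimated from two pointings a short interval
    # apart, since the pointing process must track a moving source.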
Figure 5-3 shows additional elements in the monitoring and control subsystem which are exclusive
to LFAA (not covered in the SKA monitoring and control guidelines), which include:
- Beam Model Device, which can provide beam metrics for a particular azimuth and elevation
- Inventory Database and associated TANGO device. The LFAA must keep track of all hardware components and their serial numbers, which might be requested by TM. This database is also used during fault finding
- Command Line Interface and Graphical User Interface, which are clients to the LFAA LMC
used by external users and operators
The unified state of the telescope can be collected via the LFAA Master; however, the TelState device can also be communicated with directly (as with all other TANGO devices on the TANGO bus) to investigate the state of groups of devices, or individual ones.
Figure 5-3. LFAA local monitoring and control overview
Figure 5-4 shows the TANGO control hierarchy for the LFAA. Four types of TANGO devices are shown: green representing TANGO devices which are associated with a hardware component, yellow representing TANGO devices which are observation-related (logical devices representing observational entities), red representing TANGO devices which interface with third-party software, and blue representing TANGO devices which support the TANGO infrastructure or are required for the overall monitoring and control of the system, including devices specified in the SKA control guidelines [RD1]. The connections between devices in the diagram show relationships and multiplicities, with LFAA Master being the root of the hierarchy tree. Note that element-level devices (Alarm Handler, TelState, Element Logger) are functionally independent, providing different types of aggregation and functionality to TM or the other Elements.
Figure 5-4. TANGO control structure
5.2 Software Component List
Table 5-2 provides a short description of each entity in the figures described above, whilst Table 5-3
describes some of the relations between these components.
Table 5-2. List of elements in the Architecture System Overview

# | Name | Type | Multiplicity | Description
1 | Graphical User Interface | SW | 1 | A graphical interface through which users can locally access parts of the LMC, mainly to support maintenance and debugging
2 | Command Line Interface | SW | 1 | A wrapper around the LFAA Master which allows external libraries and clients to perform actions and request information
3 | Configuration Database | DB | 1 | A central store holding the configuration required to load and run the LMC
4 | Log Storage | DB | 1 | Storage for generated logs
5 | LFAA Master | SW | 1 | The LFAA Master device, which orchestrates all the operations of the LMC and acts as the communication point with external entities, particularly TM
6 | Element Logger | SW | 1 | The LMC device which handles element logging functionality
7 | Inventory Database | DB | 1 | A database containing the list of hardware devices and cables, including their location within the CPF and how they are interconnected
8 | Inventory Device | SW | 1 | A TANGO device which interfaces with the Inventory Database
9 | Subarray | SW | 16 | Creates, monitors and controls a subarray (a collection of station devices) when and as instructed by TM
10 | Station | SW | 1..512 | Creates, monitors and controls a logical station
11 | Station Beam | SW | 1..8 per station | Controls the pointing functionality for a station beam
12 | Beam Model Device | SW | 1 | TANGO device which contains the beam pointing model for an antenna and station
13 | TelState Device | SW | 1 | TANGO device which mirrors the TelState device in Telescope Manager
14 | Cluster Manager | SW/HW | 1 | TANGO device which interfaces with the cluster manager for monitoring, control and execution of jobs
15 | Transient Buffer | SW | 1 per station | TANGO device which controls the transient buffer process and processes triggers
16 | Transient Buffer Process | SW | 1 per station | Process which takes care of the transient buffer for a station
17 | DAQ Process | SW | 1 per station | Process which enables the reception and storage of data from TPMs
18 | Pointing Process | SW | 1..8 per station | Process which calculates the pointing coefficients for station beams
19 | Calibration Process | SW | 1 per station | Process which performs station calibration
20 | Bandpass Process | SW | 1 per station | Process which calculates the scaling factors for flattening the bandpass and runs diagnostics based on the antenna bandpass
21 | Cabinet Device | SW | 256 | TANGO device which monitors and controls the Cabinet Management Boards in SPS cabinets
22 | Sub-Rack Device | SW | 512 | TANGO device which monitors and controls the Sub-Rack Management Boards in SPS cabinets
23 | MCCS Server | HW | 64 | Physical high-performance server, making up the MCCS
24 | Switch | HW | 512+20 | Physical network switch, composing the LFAA-DN
25 | Switch Device | SW | 512+20 | TANGO device for monitoring and controlling switches
26 | TPM | HW | 8192 | Physical TPM, which hosts the digital signal processing chain
27 | Tile | SW | 8192 | TANGO device for monitoring and controlling a TPM
28 | Antenna | HW | 131072 | Physical antenna
29 | Antenna Device | SW | 131072 | TANGO device which monitors an antenna
30 | APIU | HW | 2048 | Physical APIU, which powers and monitors antennas
31 | APIU Device | SW | 2048 | TANGO device which monitors and controls an APIU
# | Component A | Component B | Relationship Description
1 | Graphical User Interface | LFAA Master | The GUI uses the LFAA Master’s API to provide local users and personnel with access to the LMC
2 | Command Line Interface | LFAA Master | The CLI uses the LFAA Master’s API to allow local and remote access to the LMC
3 | LFAA Master | Configuration Database | The LFAA Master uses the Configuration Database to initialize the LMC and keep track of configuration changes
4 | LFAA Master | Inventory Device | The LFAA Master provides high-level information and actions from/on the Inventory Database
5 | Inventory Device | Inventory Database | The Inventory Device manages and updates the Inventory Database
6 | Element Logger Device | Log Store | The Element Logger Device receives logs from TANGO devices and stores them in the Log Store
7 | LFAA Master | Beam Model Device | The LFAA Master provides beam metrics for a given azimuth and elevation when requested by TM
8 | TANGO Device | Alarm Handler | Alarms defined on TANGO devices are captured and processed by the Alarm Handler
9 | TANGO Device | TelState | TANGO devices can read the overall state of the telescope from the TelState device
10 | TANGO Device | Element Logger | Logs generated by TANGO devices are forwarded to the Element Logger for filtering and storage
11 | Station and Station Beam | Cluster Manager | The Station and Station Beam devices submit jobs to the Cluster Manager
12 | Subarray Device | Station Device | Subarray devices create, monitor and control a Station Device for each station in the subarray
13 | Station Device | Station Beam Device | The Station Device creates a Station Beam Device for each station beam. Each Station Beam will have an associated pointing process
14 | Station Device | Cluster Manager Device | Each Station Device submits DAQ, Calibration and Transient Buffer jobs to the Cluster Manager via the Cluster Manager Device
15 | Station Device | Transient Buffer Device | Each Station Device creates a Transient Buffer Device, which keeps track of the transient buffer for that station and responds to triggers, sending the buffered data to SDP
16 | Transient Buffer Process | Transient Buffer | The Transient Buffer Device launches a Transient Buffer Process and processes triggers
17 | DAQ Process | Station | The DAQ Process notifies the associated Station when a new file has been written to disk
18 | Calibration Process | Station | The Calibration Process updates the calibration coefficients being used by the associated station
19 | Bandpass Process | Station | The Bandpass Process calculates bandpass flattening factors and performs diagnostics on the antenna bandpass
20 | Pointing Process | Station Beam | The Pointing Process updates the delay and delay rates being used by the associated Station Beam
21 | Cluster Manager | MCCS Server | The Cluster Manager Device communicates with the Cluster Manager, allowing the rest of the LMC to submit jobs and monitor the state of running jobs
22 | Storage Manager | MCCS Server | The Storage Manager manages the disk space allocated to the distributed storage on MCCS servers
23 | Cabinet Device | Rack Management Board | The Cabinet Device monitors the cabinet environment (such as temperature) by interfacing with the rack management board
24 | Server Device | Server | The Server Device monitors the state of a server
25 | Switch Device | Switch | The Switch Device monitors the state of a switch, including statistics per port
26 | Cabinet Device | CMB | Monitors and controls the Cabinet Management Board
27 | SubRack Device | SRMB | Monitors and controls the Sub-Rack Management Board
28 | TPM Device | TPM | The TPM Device monitors and controls a TPM, including programming and initializing it, and allows the LMC to control the running firmware
29 | Antenna Device | Antenna | The Antenna Device monitors the state of an Antenna
30 | APIU Device | APIU | The APIU Device monitors and controls an APIU, including the ability to read out antenna power and shut off the antenna if required
Table 5-3. Relationships between major elements in the Architecture System Overview
Figure 5-5 shows a high-level module decomposition diagram which groups several of the software components described in this section into modules. It also shows the system services and software which are required to run the system (the System Services module), as described in [RD9]. The Hardware TANGO Devices and System Service TANGO Devices represent all the TANGO devices which interface with hardware devices (in MCCS and SPS) and software services (including the cluster manager, storage manager, node provisioner, and so on).
Figure 5-5. Software module decomposition diagram
5.3 Software-Hardware Mapping
Figure 5-6 shows the mapping between a subset of the array and some of the software components shown in Figure 5-2 for a specific observation setup. In this case a subarray is configured to contain three stations, one of which contains two sub-stations. For the software architecture, a sub-station is defined as a specific instance of a station beam in which a subset of the antennas does not contribute to the beam (a weight of 0 is applied to these antennas), such that a Sub-Station TANGO device is not required. Sub-stations are therefore defined through appropriate configuration of the station beams. Each station has 256 antennas, which are connected to 16 TPMs. Each TPM has an associated Tile TANGO Device instance through which all interactions with the TPM (and hence control of the antennas and beams) are performed. A Station Device instance is associated with each station and the required number of Station Beam Device instances are then configured with the station. Since station beams (and sub-stations) are pointed independently, delay calculation
(pointing) is performed at the station beam level, whilst calibration is performed at the station level (these relationships are shown in Figure 5-2). The Station instances are then grouped and associated with a single Subarray TANGO Device instance, through which observation control is coordinated by TM.
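A sketch of how a sub-station can be expressed purely through station beam antenna weights is given below. The helper function and the weight convention (unit weight for members, zero otherwise) are illustrative assumptions rather than the actual MCCS interface.

    import numpy as np

    N_ANTENNAS = 256  # antennas per station

    def substation_weights(member_antennas):
        """Unit weights for antennas in the sub-station, zero for the rest."""
        weights = np.zeros(N_ANTENNAS)
        weights[member_antennas] = 1.0
        return weights

    # A sub-station formed from the first 128 antennas of the station:
    # the remaining antennas get weight 0 and do not contribute to the beam.
    w = substation_weights(np.arange(128))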
Figure 5-6. Mapping between array and software components
5.4 Software Life Cycle
The Scaled Agile Framework, also known as SAFe, is an enterprise-scale development methodology developed by Scaled Agile, Inc. SAFe combines Lean and Agile principles within a templated framework. The main principles of SAFe interweave systems thinking and fast incremental development based on small and regular milestones within those increments. A summary of these principles can be found in [RD11].
Agile Release Trains
An Agile Release Train, or ART, is a fundamental concept within the Scaled Agile Framework, and is the primary value delivery method of SAFe. Agile Teams are small groups of individuals focused on defining, building, and testing solutions within a short time frame. An ART is a self-organizing, long-lived group of Agile Teams, whose purpose is to plan, commit, and execute solutions together. System development will have backlog items assigned in logical groupings, worked on within increments of stipulated duration (e.g. a few weeks per increment).
SAFe Implementation Overview
Given the sheer size and scope of SAFe, proper implementation can be rather daunting, especially when starting out. Since a full explanation of SAFe implementation would require tens of thousands of
words, and since more detailed information is available on the official website, we shall only cover a brief overview of implementation here:
1. Train Implementers: Due to the sheer scope and challenge required in adopting SAFe, most
organizations will need a combination of internal and external mentors and coaches. These
people should be capable of easily teaching and delivering SAFe techniques to others
throughout the organization.
2. Train Executives, Managers, and Leaders: The initial batch of Implementers should first focus
on training all executives, managers, and leaders. Once these fundamental team members
understand the Lean-Agile mindset, core SAFe principles, and implementation techniques,
the process will become much smoother for the entire organization.
3. Train Teams: Individuals should initially be organized into Agile Teams, who can then all be
trained on the various Lean, Agile, and SAFe principles.
4. Launch Agile Release Trains: Finally, once the organization has been properly trained, it’s
time to group Agile Teams together into ARTs, and then generate models for objective
planning, program execution, program increment planning, and all the other components
required for a successful Agile Release Train.
Essential SAFe
The essential basic configuration of the SAFe framework is shown in Figure 5-7 and provides all the
elements necessary to have a complete SAFe system. Rather than focus on explaining the SAFe
framework, we shall focus on particular elements within this framework, which require some
discussion.
Figure 5-7: Essential SAFe configuration.
The software development process will employ the following key principles – adapted in general
from the SAFe framework:
1. Collaborating closely both with stakeholders and with other developers, providing valuable feedback and collaboration.
2. Implementing functionality in priority order – the requirements will be developed based on
array assembly prioritisations – and these might change along the way.
3. Analysing and designing - The individual requirements are analysed by model storming on a
just-in-time (JIT) basis for a few minutes before spending several hours or days
implementing the requirement.
4. Ensuring quality – Use coding conventions, development guidelines and constant refactoring
for quality.
5. Regularly delivering working solutions - At the end of each development cycle/iteration
there will be a partial, working solution for demonstration/analysis.
6. Testing – Perform a significant amount of testing throughout construction.
For more detail on the framework, refer to [RD12].
Software Development Process During Construction Iterations
During construction iterations, developers will incrementally deliver high-quality working software which meets the changing needs of the system, as outlined in Figure 5-8.
Figure 5-8: Software development process during a construction iteration.
The Test-First Approach to Construction
The test-first approach to software development is shown in Figure 5-9. The full testing regime for MCCS is detailed in [RD10].
Figure 5-9: Test-first development approach.
Within the context of development iterations in an Agile approach, this test-first approach is
encompassed within iterations as shown in Figure 5-10.
Figure 5-10: Testing during construction iterations.
5.5 Commissioning
Project commissioning is the process of assuring that all systems and components of the project are
designed, installed, tested, operated, and maintained according to the operational requirements of
the stakeholders. A commissioning process may be applied not only to new projects but also to
existing units and systems subject to updates, refactoring, etc.
In practice, the commissioning process comprises the integrated application of a set of engineering
techniques and procedures to check, inspect and test every operational component of the project,
from individual functions, such as instruments and equipment, up to complex amalgamations such
as modules, software subsystems and systems.
Commissioning activities, in the broader sense, are applicable to all phases of the project, from the
basic and detailed design, procurement, construction and assembly, until the final handover of the
unit to the owner, including sometimes an assisted operation phase.
The testing procedures and acceptance process for all sub-units of the system, as well as for the integrated system working as a single element, are detailed in the MCCS Assembly Verification and Test Plan [RD10]. The commissioning procedure is made up of:
• Functional tests
• Non-functional tests
• A testing cycle for each test
• Regression testing
• A qualification process
• An acceptance process
It is assumed that the commissioning process for MCCS will form part of a wider commissioning
procedure. There are various completion and commissioning tools which can be utilised for this
purpose. With regard to MCCS, the commissioning process will support the AIV Element roll-out
plan [AD2]. The commissioning process will be split to cater for:
1. Full system commissioning
2. Hardware commissioning
3. Software/Code commissioning
More details of this split can be seen in the MCCS Detailed Design Document [RD9].
6 MCCS Physical Overview
MCCS is essentially a compute cluster, requiring enough compute processing power, network
bandwidth and memory space to run the MCCS software. Compute processing power is dominated
by the correlation and calibration processes, network bandwidth is dominated by the transmission
of calibration spigots from SPS to MCCS, while memory space is dominated by the fast transient
buffer. The compute servers are distributed across 4 MCCS cabinets which, apart from the compute
servers themselves, contain the required number of network switches to transport SPS LMC data from SPS to MCCS, interconnect the compute servers, and transmit the fast transient buffer to SDP. The following sections analyse the compute, network and cabinet requirements, and describe the software necessary for these components to function properly.
6.1 Compute Server
There are a total of 68 compute servers in MCCS distributed across 4 racks (including one spare per rack). Each compute server is responsible for 8 stations. [RD9] Section 5.1 provides an analysis of the compute, network and memory requirements for a single server, summarised below, resulting in the compute server configuration listed in Table 6-1:
• 4 high performance GPUs to run the correlation and calibration related processes
• One 100Gb interface for receiving data from 8 stations (64 TPMs)
• At least 1.5 TB RAM, primarily dominated by the space required to store the transient
buffers for eight stations
• About 80 CPU cores
Table 6-1. MCCS compute server configuration

Item | Quantity | Minimum Specification
Chassis | 1 | 1U, min 2x SATA, dual 1 Gb Ethernet, 2 kW redundant power supply, NVLink support
CPU | 2 | 20 cores, 2 GHz minimum
GPU | 4 | NVIDIA P100 with NVLink or equivalent
RAM | 12 | 128 GB 2666 MHz DDR4 (12 × 128 GB = 1536 GB, meeting the 1.5 TB requirement)
1 Gb interfaces | 1 | On chassis
100 Gb interfaces | 2 | Mellanox 100 Gb ConnectX-5 with 1 QSFP, or equivalent
SSDs | 2 | 1 TB 2.5” SATA 6.0 Gb/s
Two additional servers are included which act as the master and shadow master nodes of the MCCS cluster, on which the core LMC functionality, hardware configuration database, maintenance support tools, graphical user interface and other high-level software components will operate. These servers will also be responsible for configuring all of LFAA and for interacting with TM. The shadow node takes over when the master node is compromised.
6.2 Network
MCCS is connected to external entities and other LFAA Sub-elements through the network links shown in Figure 6-1. Communication with SPS goes through a single 100 Gb link between each SPS cabinet in the RPF and groups of two SPS cabinets in the CPF, totalling 110 100 Gb links. Communication with TM goes through a 1 Gb link, of which there are two for redundancy. The transient buffer is transmitted to SDP via a 100 Gb link provided by SaDT.
Figure 6-1. Network links between MCCS and external entities
These connections need to be distributed across the four racks which host MCCS. Core SPS cabinets have one 100 Gbps link per two cabinets to MCCS, RPFs within 25 km have one 100 Gbps link each to MCCS, and RPFs farther away than 25 km use DWDM through a muxponder, multiplexed to 100 Gbps to MCCS. Apart from the 100G network, there is a separate 1G network local to MCCS which is used for monitoring and control, and acts as a back-up in case the 100G network goes offline. MCCS is also responsible for the configuration, management and control of all the networks and network components within LFAA, including the data network which forms the backbone of SPS, as well as all external network links provided by SaDT.
Figure 6-2 shows the network diagram for a single MCCS rack. Compute servers are split into two groups, each connected to a separate 32-port 100 Gb network switch. Each 100 Gb network switch ingests 14 SPS links, except for the bottom switch of the first and last rack, which ingests 13 SPS links (6 × 14 + 2 × 13 = 110 links in total). A single 32-port 1 Gb network switch is required to interconnect all hardware devices within an MCCS rack, with enough free ports for creating a full 1G mesh with the rest of the racks. Links to TM and SDP are also shown; however, these are not present in all racks. The TM links are connected to the 1G switch in the central two racks, whilst the SDP links can be connected to any of the racks. The head/shadow nodes are also located in the central two racks, each requiring two 1 Gb links for redundancy. Note that in the diagram, links without a multiplicity denote a single link. The MCCS layout is described in detail in [RD9] Section 4.
Figure 6-2. MCCS network diagram
6.3 Rack Assembly
The cabinet design is presented in Figure 6-3, with each cabinet containing:
• 16 compute servers and one spare compute server
• Two 100 Gb switches
• One 1 Gb switch
• For two of the racks, an additional server to act as a master/shadow node
• For the racks containing the master/shadow node, a UPS
The head/shadow node and the 1 Gb switch connecting it to TM are connected to the UPS, such that if a power failure arises MCCS can inform TM and perform emergency shutdown operations, ensuring that the system will be capable of going back online when power is restored. Since the head/shadow servers will be low-power servers (when compared to the compute servers), a standard rack UPS should be able to provide enough up-time for the head/shadow node to perform these operations.
Figure 6-3. MCCS rack assembly
7 Scenarios
In this section some example scenarios have been chosen to be described in detail. The selection is considered to represent many similar operational scenarios. The following scenarios are documented in the subsections below:
1. Application of power to MCCS, including power sequencing
2. Transitioning to low power mode
3. Power down sequencing, that is transitioning to offline mode
4. Observation configuration and start
5. Calibration of LFAA and how this can detect failed antennas
6. Stopping an observation
7. Detection of MCCS failures, including redundancy for continued operations, and how failures are detected and reported by the software and failed units replaced
8. Software, BIOS and LRU firmware update
7.1 Application of power
When power is applied to MCCS, a boot-up sequence of minimal hardware and software components occurs. This will transition the operational state of these components from Unknown or Offline to Ready:
1. Power is applied to one of the racks which has a master or shadow node
2. The master and shadow nodes are configured to boot up on power, such that they will boot up and load the operating system. The 1 Gb network switch also powers up when power is applied, such that MCCS can then directly access the rack power supply.
3. An LMC bootstrap mechanism is run automatically at start-up which loads:
a. The bare-metal provisioning software
b. The distributed storage management software
c. The TANGO database
d. The TANGO starter
e. The LFAA Element Master (root TANGO device)
f. The Software Configuration database (in the case that this is an actual database and
needs to be loaded)
4. The LFAA Element Master will then start up the rest of the core LMC system by reading the required configuration, and communication with TM is established
5. At this point the MCCS head node is powered on. Action from TM is required to power the
rest of the MCCS as well as SPS. Once TM issues this command the power-on continues
6. The master/shadow node starts powering on the rest of the MCCS hardware one rack at a time (a sketch of the per-rack sequencing follows this list):
a. Rack power is enabled (for the racks which do not host the head/shadow node)
b. Power to the data switches is enabled
c. Power to the compute nodes is enabled (compute nodes are not configured to start up when power is applied)
d. For each compute node, the LMC Element Master instructs the bare-metal provisioning software to power it on. Nodes are powered sequentially, with the time between each node TBD. The provisioning software loads an operating system image on the compute node, which in turn goes through the boot-up process. Nodes are then ready for software provisioning
7. The LFAA Element Master provisions the required containers on each compute node to start the TANGO devices for monitoring and controlling the associated SPS hardware components, as well as the distributed storage and other required services
8. MCCS is now ready to accept and start observation configurations
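Step 6d could be implemented along the following lines. The provisioner interface is a hypothetical placeholder, and the power-on interval is an arbitrary example since the document leaves this time TBD.

    import time

    NODE_POWER_ON_INTERVAL = 5.0  # seconds between nodes; TBD in practice

    def power_on_rack(provisioner, compute_nodes):
        """Sequentially power on the compute nodes of one rack.

        'provisioner' stands in for the bare-metal provisioning software;
        its power_on() method is a hypothetical interface.
        """
        for node in compute_nodes:
            provisioner.power_on(node)          # load OS image and boot
            time.sleep(NODE_POWER_ON_INTERVAL)  # stagger the power-on load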
7.2 Transition to Low Power Mode
Compute servers in MCCS cannot be turned off, since they host TANGO devices which monitor and control SPS equipment. Equipment in SPS which is in low power mode still needs to be monitored (sensor and health status can still be accessed). The master and shadow nodes do not have a low power mode. Low-power mode for MCCS can be described as follows:
• A compute server can only be in low-power mode if all associated SPS equipment is in low-power mode (not being used for observations or part of a maintenance subarray)
• Network switches cannot be switched to low-power mode; however, they generally have power-saving features which can be used to reduce their power consumption
A compute server in low power mode translates to the following operations being performed (a sketch of these operations follows):
• Switch off GPUs or set their power management configuration to the minimal power consumption setting (this depends on the available GPU settings; PCI devices can also be disabled with appropriate kernel modules)
• Set all CPU cores to low power mode. CPU cores can also be disabled through appropriate Linux configuration
The power consumption of network switches depends on the network traffic, so they will automatically consume less power. Additionally, unused ports on the switches are disabled such that they do not consume any power.
During observation configuration, if a compute server in low power mode is required, the required GPUs and CPU cores are re-enabled or switched to the normal power configuration.
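A sketch of these low-power operations, using standard Linux sysfs and NVIDIA management interfaces, is shown below. The specific core indices, the 100 W power limit and the availability of these controls are illustrative assumptions that depend on the deployed hardware and drivers.

    import subprocess

    def enter_low_power_mode(cpu_cores, gpu_indices):
        """Illustrative low-power transition for one compute server."""
        # Take unneeded CPU cores offline via the Linux sysfs interface
        for core in cpu_cores:
            with open(f"/sys/devices/system/cpu/cpu{core}/online", "w") as f:
                f.write("0")
        # Lower each GPU's power limit (watts); 100 W is an arbitrary example
        for gpu in gpu_indices:
            subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", "100"],
                           check=True)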
7.3 Transition to Off-line
MCCS must support the ability to shut down the entire sub-element. This may be related to maintenance,
power saving measures, power supply emergency, etc. MCCS will support two types of shutdown:
• Controlled: orderly shutdown of servers and equipment
• Uncontrolled: immediate removal of power to running equipment
Controlled shutdown
To transition to off-line (controlled power shutdown) the MCCS head node will:
• Terminate all running observations (through the appropriate Devices, which will in turn
terminate all running compute processes on the compute nodes),
• Terminate the LMC control hierarchy for SPS (keeping the LMC core running up to this
point)
• Instruct the node provisioning system to shut down all compute servers
• Disable rack power to all racks except for the rack containing the head node
• Disable power to switches in the racks containing the master or shadow node
• Shut itself down
Note that power to the main rack must be switched off manually if required (or through the building management system).
Uncontrolled shutdown
In the event of a power emergency, MCCS may be instructed to perform an uncontrolled shutdown,
whereby the equipment is turned off as quickly as possible. In this situation the head node will send
the shutdown signal to all the compute nodes (through the node provisioning system, regardless of
what processing is being performed). When all the compute nodes are powered down (in the order
of a few seconds), the head node will disable rack power to all racks and shut itself down.
In the event of a power failure, where MCCS power is lost, all of MCCS will go offline except for the head and shadow nodes and the 1 Gb switches in the central two racks, which are connected to a UPS. This permits the head node to perform a proper shutdown and inform TM that MCCS has lost power and is going to shut down. The latter assumes that all intermediary switches between MCCS and TM are still powered. If the entire CPF loses power then TM will be unreachable.
7.4 Set up and Start Observation
Observation setup is described in detail in [RD7] Section 5.2.1 and summarised in this document,
Section 5.1. When the start observation command is received the following steps are performed:
1. TM sends the start scan command to the Subarray
2. The Subarray calls the start command on all associated Stations in parallel
3. The Station finalizes configuration on the Tiles. This includes:
a. Setting the CSP ingest node IP, MAC and port as the destination parameters for the final Tile in the chain
b. Instructing the Tiles to start transmission of data
4. Once all Tiles are configured, the Station returns a reply to the Subarray
5. The Subarray in turn waits for all Stations to finalize their configuration and returns a reply
to TM once configuration is finished
At this point signals are being processed and station beams are being sent to CSP. Throughout the observation, calibration and pointing coefficients are calculated and updated, and control data from the Tiles is received and processed accordingly. The parallel fan-out of steps 1 and 2 is sketched below.
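Steps 1 and 2 amount to a simple parallel fan-out over TANGO device proxies. The device names and command name below are hypothetical; the real Subarray device performs this internally.

    from concurrent.futures import ThreadPoolExecutor
    from tango import DeviceProxy

    # Hypothetical FQDNs for three stations associated with the subarray
    stations = [DeviceProxy(f"lfaa/mccs/station_{i:03d}") for i in (1, 2, 3)]

    # The Subarray starts all Stations in parallel; each call returns once
    # that Station's Tiles are configured and transmitting.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda s: s.command_inout("Start"), stations))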
7.5 Calibration
The calibration process as well as diagnostics which can be performed on the generated calibration
solutions is described in [RD7] Sections 5.2.6 and 5.2.7, summarised below:
• Raw channel data needs to be transmitted by all the TPMs forming part of a station. This is used for calibration (and diagnostics) and is not transmitted to CSP. This data is directed towards an MCCS compute node, assigned during initialization, on which a DAQ process is running.
• The DAQ process reads in this data and buffers it for correlation. This data stream amounts
to ~6.4Gbps.
• Once all the time samples for a frequency channel are received (that is, the stream switches
to a new frequency channel), the buffer is marked as ready and copied to GPU memory.
• The GPU correlator computes the auto and cross correlation of the data and integrates the
entire buffer to a single correlation matrix.
• The correlation matrix for the current frequency channel is saved to disk.
• Once the file is written, the Calibration process is notified
• Assuming a standard calibration algorithm implementation, the difference between the sky
model and acquired visibilities is minimized, generating a set of coefficients which describe
the difference between the two.
• The generated coefficients are sent to the Station device.
• The Station device then distributes the calibration coefficients to its Tiles, which download them to the TPMs.
• The Tile devices also distribute the calibration coefficients to the respective Antenna devices
(not shown), where they are archived for diagnostic purposes. These coefficients are kept in
the LFAA archive for several days.
Sanity checks and diagnostics on the generated calibration solutions are also performed to ensure that the system is stable and to detect misbehaving devices. These checks include:
• Comparing the calibration solutions for each antenna against each other to detect outlier antennas (for example by computing the RMS and evaluating antennas against it; a sketch follows this list)
• Checking how calibration solutions evolve in time to verify system stability (for example by seeing how each antenna’s RMS varies)
• Identifying noisy frequency channels (RFI)
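The first check above could, for instance, be realised as follows. The statistic and threshold are illustrative assumptions; the deployed diagnostics are those described in [RD7].

    import numpy as np

    def flag_outlier_antennas(gains, threshold=5.0):
        """Flag antennas whose calibration solutions deviate from the rest.

        gains: complex array of shape (n_antennas, n_channels) holding the
        per-antenna, per-channel calibration coefficients.
        """
        amplitudes = np.abs(gains)
        rms = np.sqrt(np.mean(amplitudes ** 2, axis=1))   # per-antenna RMS
        deviation = np.abs(rms - np.median(rms))
        spread = np.median(deviation) + 1e-12             # robust scale
        return np.where(deviation / spread > threshold)[0]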
7.6 Stop Observing
At any point TM can issue a stop or abort command on a running subarray:
• Stop: The current observation is stopped, and the subarray is moved back to the READY state. Data output to CSP is stopped. Jobs and Tiles are left configured, so that if the next observation requires the same parameters the devices do not have to be re-configured.
• Abort: Abort moves the subarray to the ABORTED state. The possible state changes from this are to the CONFIGURING and IDLE states, which means that all resources can be freed up (to be re-used later). Output to CSP is first stopped to avoid invalid data being transmitted while aborting the observation. All running jobs are terminated (through the initiating device via the Cluster Manager device). Tiles are de-configured (but not put in low-power mode). Station, Station Beam and Tile devices are unassigned.
When the subarray receives a reset command while in the READY state, the same operations as abort above are performed. Additionally, the Tiles are de-programmed and placed in low-power mode. This also happens when the command is received whilst in the FAULT state.
7.7 MCCS Failures
MCCS can suffer failures at any point: during observation configuration, while observing, or whilst in low power mode. For hardware failures, the hardware is switched off, its status is changed to FAULTY and TM is notified. The MCCS software has a hardware configuration database which, apart from storing the configuration of all hardware components, contains their location within the CPF and RPF to help maintenance personnel quickly localise the equipment. The following equipment can become faulty (for several reasons):
• Compute node, in which case the spare server in the rack takes over the operations of this compute node. A total of four spare compute nodes are always present in MCCS, such that up to four nodes can become faulty. If more than four nodes are faulty or offline then the associated resources (stations) cannot be used until the faulty nodes are replaced.
• 100G switch, in which case the incoming signal from SPS passing through this switch will be blocked (there is no redundancy for high-bandwidth data between SPS and MCCS). LMC
communication can be re-routed through other switches. Within MCCS there are several redundant links between switches, such that LMC traffic can be re-routed if a switch is faulty or offline.
• 1 G switch, in which case rack devices which are controlled through the 1 Gb network (such
as switches and rack power) will become unreachable. LMC communication between
compute nodes can still be routed through the 100Gb network.
• Master node, in which case the shadow node will take over all operations performed by the master node. If the shadow node also becomes faulty then all MCCS and SPS equipment will be unreachable by TM, and the LMC system will become unavailable.
When faulty LRUs are replaced, maintenance personnel must update the hardware configuration database with the device’s new IP address (through the provided tools). This is not required for compute servers, since the provisioning software can automatically detect new nodes once they are physically powered up.
Software failures are entirely handled by the LMC software system. TANGO devices and system services are automatically restarted after a crash. It is assumed that all software running in MCCS will have been thoroughly tested during the verification stage (having gone through the software test cycle).
7.8 Software Upgrades
Software upgrades and updates will happen throughout the commissioning phases of the SKA, as well as during its long lifetime. Upgrades should happen with minimal disruption of service, that is, with minimal impact on the capabilities of the telescope. These upgrades can be split into three types:
Software upgrades
This refers to all software running on MCCS, including the OS and other system software, management software, third-party software and bespoke software developed for MCCS. The way in which these are updated depends on whether they are running on a compute node or a cluster/head node.
Upgrading software on the master node
When a software upgrade is required on the head nodes, the shadow node is updated first (since it only mirrors the functionality of the master node, no disruption is caused). Once the upgrade is complete, several tests are performed to verify that the upgrade process was successful and that all required functionality is still available. Once ready, the shadow node takes over control from the master (becomes the master), which is in turn upgraded in the same manner. This can be used to update the operating system, system libraries and services, and the TANGO core system.
Upgrading software on compute nodes
The OS images for compute nodes are stored locally on the master node. These images can be updated and versioned independently of what is running on the compute nodes. This scheme is used to update the operating system and system libraries and services. When a compute node is rebooted, the new OS image is used to load the compute node. This can either be performed during a scheduled maintenance window in which all the MCCS compute nodes are rebooted, or in a staged manner where compute nodes are rebooted when they are not in use (with their running system offloaded to the spare servers, the spare servers having been upgraded first).
Updates to observation-related software (such as the calibration and correlation algorithms) result in a new version of the binaries, which are stored on the master node. When a new observation is defined it will simply use the new (or any required) version of the software. These programs are launched in containers on the compute nodes, so no system updates are required.
TANGO devices running on the compute nodes are also launched in containers, such that the same scheme as that for observation-related software can be used. In this case the new version of the TANGO device is first launched, and once it is running the older version of the device is stopped; a sketch of this swap is given below.
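This start-new-then-stop-old swap could look roughly as follows. The container runtime (Docker here) and the naming scheme are illustrative assumptions, since the document does not prescribe a specific container technology.

    import subprocess

    def upgrade_device_container(name, new_image):
        """Blue/green-style upgrade of a containerised TANGO device."""
        # Start the new version alongside the running one
        subprocess.run(["docker", "run", "-d", "--name", name + "-new",
                        new_image], check=True)
        # Once the new device is confirmed running, retire the old version
        subprocess.run(["docker", "stop", name], check=True)
        subprocess.run(["docker", "rm", name], check=True)
        subprocess.run(["docker", "rename", name + "-new", name], check=True)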
BIOS updates
It is inevitable that BIOS updates will become available during the lifetime of the MCCS servers. They are generally installed through software provided by the manufacturer and require the node to be rebooted. For the master and shadow nodes the same scheme as above can be used, where operations are taken over by one server whilst the other runs the BIOS update program. For updating the BIOS of a compute node (which must not be performing any observation-related functionality), all TANGO devices running on the node are offloaded to a spare server, after which the BIOS update program is run. When complete, the TANGO devices are set to run again on the updated node.
LRU firmware updates
Additional hardware in MCCS and SPS, including network switches and power supply units, will also need firmware and software updates. These updates are generally performed by manufacturer-provided software. The device will be offline during the update, which will result in down-time.