38
Clara Gaspar, October 2013 Many thanks to the colleagues in the four experiments and the EN/ICE group, in particular: ALICE: Franco Carena, Vasco Chibante Barroso (DAQ), Andre Augustinus (DCS) ATLAS: Giovanna Lehmann Miotto (DAQ), Stefan Schlenker (DCS) CMS: Hannes Sakulin, Andrea Petrucci (DAQ), Frank Glege (DCS) JCOP: Fernando Varela Rodriguez Experiment Control@LHC An Overview

SMI++ A Finite State Machine Toolkit

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Many thanks to the colleagues in the four experiments and the EN/ICE group, in particular: ALICE: Franco Carena, Vasco Chibante Barroso (DAQ), Andre Augustinus (DCS) ATLAS: Giovanna Lehmann Miotto (DAQ), Stefan Schlenker (DCS) CMS: Hannes Sakulin, Andrea Petrucci (DAQ), Frank Glege (DCS) JCOP: Fernando Varela Rodriguez

Experiment Control@LHC

An Overview

Page 2: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

The LHC Experiments

2

Page 3: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Control System Scope

Detector Channels

Front End Electronics

Readout Boards

High Level Trigger (Farm)

Storage

Trigger

Cont

rol S

yste

m

DAQ

DCS Devices (HV, LV, GAS, Cooling, etc.)

External Systems (LHC, Technical Services, Safety, etc)

Timing

DQ Monitoring

3

Page 4: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Control System Tasks ❚ Configuration

❙ Selecting which components take part in a certain “Activity” ❙ Loading of parameters (according to the “Activity”)

❚ Control core ❙ Sequencing and Synchronization of operations across the various

components

❚ Monitoring, Error Reporting & Recovery ❙ Detect and recover problems as fast as possible

❘ Monitor operations in general ❘ Monitor Data Quality

❚ User Interfacing ❙ Allow the operator to visualize and interact with the system

4

Page 5: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 5

Some Requirements ❚ Large number of devices/IO channels

➨ Need for Distributed Hierarchical Control ❘ De-composition in Systems, sub-systems, … , Devices ❘ Maybe: Local decision capabilities in sub-systems

❚ Large number of independent teams and very different operation modes ➨ Need for Partitioning Capabilities (concurrent usage)

❚ High Complexity & (few) non-expert Operators ➨ Need for good Diagnostics tools and if possible Automation of:

❘ Standard Procedures ❘ Error Recovery Procedures

➨ And for Intuitive User Interfaces

❚ + Scalability, reliability, maintainability, etc.

Page 6: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

History ❚ None of this is really new…

❙ Ex.: At LEP (in the 80s/90s) both ALEPH and DELPHI Control Systems: ❘ Were Distributed & Hierarchical Systems, implemented Partitioning,

were highly Automated and were operated by few shifters: 〡ALEPH: 2 (Shift Leader, Data Quality) 〡DELPHI: 3 (Run Control, Slow Control, Data Quality)

6

❚ ALEPH: DEXPERT ❚ DELPHI: Big Brother

Page 7: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

LHC Exp. Commonalities ❚ Joint COntrols Project (JCOP)

❙ A common project between the four LHC experiments and a CERN Control Group (IT/CO -> EN/ICE)

❙ Mandate (1997/1998): ❘ “Provide a common DCS for all 4 experiments in a resource effective

manner” ❘ “Define, select and/or implement as appropriate the architecture,

framework and components required to build the control system”

❙ Scope: ❘ DCS - Detector Control System (at least)

❙ Main Deliverable: ❘ JCOP Framework (JCOP FW)

➨ Major Success! Still active

7

Page 8: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

LHC Exp. Differences ❚ Basically the Control of everything else:

❙ DAQ, Trigger, etc. -> Run Control ❚ Design Principles

❙ Similar requirements, different emphasis, for example: ❘ ATLAS: Large detector -> Scalability ❘ CMS: Many users -> Web Based ❘ LHCb: Few shifters -> Integration, homogeneity ❘ ALICE: Many sub-detectors -> Customization, Flexibility

8

Page 9: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Architecture & Scope ❚ ALICE

9

LV Dev1

LV Dev2

LV DevN

SubDetN DCS

SubDet1 DCS

SubDet1 LV

SubDet1 TEMP

SubDet1 GAS

Com

man

ds

SubDetN DAQ

SubDet2 DAQ

SubDet1 DAQ

SubDet1 FEE

SubDet1 RO

FEE Dev1

FEE Dev2

FEE DevN

INFR.

LHC DAQ

Central DAQ

S

tatu

s &

Ala

rms

ECS

HLT Trigger& Timing

DCS

FEE Dev1

FEE Dev2

JCOP FW

ALICE DAQ FW

Other Tools

Page 10: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Architecture & Scope ❚ ATLAS

10

Com

man

ds

S

tatu

s &

Ala

rms

LV Dev1

LV Dev2

LV DevN

SubDetN DCS

SubDet1 DCS

SubDet1 LV

SubDet1 TEMP

SubDet1 GAS

SubDetN DAQ

SubDet2 DAQ

SubDet1 DAQ

SubDet1 FEE

SubDet1 RO

FEE Dev1

FEE Dev2

FEE DevN

INFR.

LHC Run Control

DAQ HLT Trigger& Timing

DCS

JCOP FW

ATLAS DAQ FW

Other Tools

Page 11: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Architecture & Scope ❚ CMS

11

Com

man

ds

S

tatu

s &

Ala

rms

LV Dev1

LV Dev2

LV DevN

SubDetN DCS

SubDet1 DCS

SubDet1 LV

SubDet1 TEMP

SubDet1 GAS

SubDetN DAQ

SubDet2 DAQ

SubDet1 DAQ

SubDet1 FEE

SubDet1 RO

FEE Dev1

FEE Dev2

FEE DevN

INFR.

LHC Run Control

DAQ HLT Trigger& Timing

DCS

JCOP FW

CMS DAQ FW

Other Tools

Page 12: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Architecture & Scope ❚ LHCb

12

Com

man

ds

S

tatu

s &

Ala

rms

LV Dev1

LV Dev2

LV DevN

DCS

SubDetN DCS

SubDet1 DCS

SubDet1 LV

SubDet1 TEMP

SubDet1 GAS

SubDetN DAQ

SubDet2 DAQ

SubDet1 DAQ

SubDet1 FEE

SubDet1 RO

FEE Dev1

FEE Dev2

FEE DevN

INFR.

LHC Run Control

DAQ Trigger& Timing

ECS

HLT

JCOP FW

Other Tools

Page 13: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Tools & Components ❚ Main Control System Components:

❙ Communications ❘ Device Access and Message Exchange between processes

❙ Finite State Machines ❘ System Description, Synchronization and Sequencing

❙ Expert System Functionality ❘ Error Recovery, Assistance and Automation

❙ Databases ❘ Configuration, Archive, Conditions, etc.

❙ User Interfaces ❘ Visualization and Operation

❙ Other Services: ❘ Process Management (start/stop processes across machines) ❘ Resource Management (allocate/de-allocate common resources) ❘ Logging, etc.

13

Page 14: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Frameworks ❚ JCOP FW (All Experiments DCSs + LHCb)

❙ Based on SCADA System PVSS II (Now Siemens WinCC-OA) ❘ Comms, FSM, UI, UI builder, Configuration, Archive, HW Access, Alarms, etc.

(also guidelines and ready-made components for many types of equipment)

❚ ALICE ❙ DAQ: DATE (Data Acquisition and Test Environment)

❘ Comms, FSM, UI, Logging, etc.

❚ ATLAS ❙ DAQ: Set of high-level Services + Sub-Detector FW: RodCrateDAQ

❘ Comms, FSM, UI, Configuration, Monitoring, + HW Access libraries

❚ CMS ❙ Control: RCMS (Run Control and Monitoring System)

❘ Comms, FSM, UI, Configuration, Archive ❙ DAQ: XDAQ (DAQ Software Framework)

❘ Comms, FSM, UI, Hw Access, Archive

14

Page 15: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Communications ❚ Each experiment chose one

❙ ALICE DAQ: DIM (mostly within the FSM toolkit) ❘ Mostly for Control, some Configuration and Monitoring

❙ ATLAS DAQ: CORBA (under IPC and IS packages) ❘ IPC (Inter Process Comm.) for Control and Configuration ❘ IS (Information Service) for Monitoring

❙ CMS DAQ: Web Services (used by RCMS, XDAQ) ❘ RCMS for Control ❘ XDAQ for Configuration ❘ XMAS (XDAQ Monitoring and Alarm System) for Monitoring

❙ LHCb & DCSs: PVSSII+drivers+DIM (within JCOP FW) ❘ PVSSII offers many drivers (most used in DCS is OPC) ❘ LHCb DAQ: DIM for Control, Configuration and Monitoring

15

Page 16: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Communications ❚ All Client/Server mostly Publish/Subscribe

❙ Difficult to compare (different “paradigms”) ❘ DIM is a thin layer on top of TCP/IP ❘ ATLAS IPC is a thin layer on top of CORBA

〡Both provide a simple API, a Naming Service and error recovery ❘ CMS RCMS & XDAQ use WebServices (XML/Soap)

〡Remote Procedure Call (RPC) like, also used as Pub./Sub. ❘ OPC is based on Microsoft’s OLE, COM and DCOM

❙ ATLAS IS, CMS XMAS and PVSS II in the DCSs and LHCb

❘ work as data repositories (transient and/or permanent) to be used by clients (UIs, etc.)

16

DIM Efficient, Easy to use Home made

CORBA Efficient, Easy to use (via API) Not so popular anymore Web Services Standard, modern protocol Performance: XML overhead

OPC Industry Standard Only Windows (-> OPC UA)

Page 17: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Finite State Machines ❚ All experiments use FSMs

❙ In order to model the system behaviour: ❘ For Synchronization, Sequencing, in some cases also for Error

Recovery and Automation of procedures

❙ ALICE DAQ: SMI++ ❘ FSM for all sub-systems provided centrally (can be different)

❙ ATLAS DAQ: CHSM -> CLIPS -> C++ ❘ FSM for all sub-systems provided centrally (all the same)

❙ CMS DAQ: Java for RCMS, C++ for XDAQ ❘ Each sub-system provided specific transition code (Java/C++)

❙ LHCb & DCSs: SMI++ (integrated in the JCOP FW) ❘ LHCb: FSM provided centrally, sub-systems can modify template

graphically 17

Page 18: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

FSM Model Design ❚ Two Approaches:

❙ Few, coarse-grained States: ❘ Generic actions are sent from the top

〡Each sub-system synchronizes it’s own operations to go to the required state

❘ The top-level needs very little knowledge of the sub-systems ❘ Assumes most things can be done in parallel ➨ Followed by most experiments (both DAQ & DCS)

〡Ex: CMS States from “ground” to Running: Initial -> Halted -> Configured -> Running

❙ Many, fine-grained States ❘ Every detailed transition is sequenced from the top ❘ The top-level knows the details of the sub-systems ➨ Followed by ALICE DAQ (20 to 25 states, 15 to get to Running)

18

READY

NOT_READY

StartRun StopRun

ERROR

Configure Reset

RUNNING

Recover

Page 19: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Expert System Functionality ❚ Several experiments saw the need…

❙ Approach: ❘ “We are in the mess, how do we get out of it?” ❘ No Learning…

❚ Used for: ❙ Advising the Shifter

➨ ATLAS, CMS

❙ Automated Error Recovery ➨ ATLAS, CMS, LHCb, ALICE (modestly)

❙ Completely Automate Standard Operations ➨ LHCb, and within the DCSs

19

Page 20: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Expert System Functionality ❚ ATLAS

❙ CLIPS for Error Recovery ❘ Central and distributed, domain specific, rules ❘ Used by experts only, sub-system rules on request

❙ Esper for “Shifter Assistant” ❘ Centralised, global “Complex Event Processing”

➨ Moving more towards this approach…

❚ CMS ❙ Java (within RCMS) for Error recovery and Automation ❙ Perl for “DAQ Doctor”

❘ “Rules” are hardcoded by experts

❚ LHCb & DCSs (within JCOP FW) + ALICE (in standalone) ❙ SMI++ for Error Recovery and Automation

❘ Distributed FSM and Rule based system ❘ Sub-systems use it for local rules, central team for top-level rules

20

Page 21: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 21

Expert System Functionality ❚ Decision Making, Reasoning, Approaches

❙ Decentralized (Ex.: SMI++) ❘ Bottom-up: Sub-systems react only to their “children”

〡In an event-driven, asynchronous, fashion

❘ Distributed: Each Sub-System can recover its errors 〡Normally each team knows how to handle local errors

❘ Hierarchical/Parallel recovery ❘ Scalable

❙ Centralized (Ex.: Esper) ❘ All “rules” in the same repository, one central engine

Page 22: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 22

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Control System Archive

DB

Page 23: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 23

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Control System Archive

DB

Page 24: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 24

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Control System Archive

DB

Page 25: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 25

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Control System Archive

DB

Page 26: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 26

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

Control System Archive

DB

Page 27: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 27

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

Control System Archive

DB

Page 28: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 28

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

if History needed

Control System Archive

DB

Page 29: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 29

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

if History needed

Control System Archive

DB

Page 30: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 30

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

if History needed

if needed by Offline

Control System Archive

DB

Page 31: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 31

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

if History needed

if needed by Offline

Control System Archive

DB

Page 32: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013 32

Online Databases ❚ Three main logical Database concepts in the Online

System

➨ But naming, grouping and technology can be different in the different

experiments…

Expe

rimen

t’s H

W &

SW

PVSS

To Offline Cond. DB

Config. DB

To Offline

Configuration settings for a running mode

Monitoring data (at regular intervals)

if History needed

if needed by Offline

if needed for next run settings (Pedestal Followers)

Control System Archive

DB

Page 33: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

User Interfacing ❚ Types of User Interfaces

❙ Alarm Screens and/or Message Displays

❙ Monitoring Displays ❙ Run Control & DCS Control

33

Page 34: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Run Control

34

ALICE: Tcl/Tk

ATLAS: Java

(modular)

CMS: Web Tools (JavaScript+HTML)

LHCb: JCOP FW

(UI builder)

Page 35: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Detector Control System

35

JCOP FW ALICE

ATLAS

CMS

LHCb

Page 36: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Operations ❚ Experiment Operations

❙ Shifters: ❘ ALICE: 4 (SL, DCS, RC, DQ+HLT) ❘ ATLAS: 8 (SL, DCS, RC, TRG, DQ, ID, Muon, Calo) ❘ CMS: 5 (SL, DCS, RC, TRG, DQ) ❘ LHCb: 2 (SL, DQ)

❙ Ex.: Start of Fill sequence ❘ In general DCS (HV) automatically handled

driven by the LHC State ❘ In most cases Run Control Shifter manually

Configures/Starts the Run

36

Page 37: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

Size and Performance ❚ Size of the Control Systems (in PCs)

❙ ALICE: 1 DAQ + ~100 DCS ❙ ATLAS: 32 DAQ + 130 DCS ❙ CMS: 12 DAQ + ~80 DCS ❙ LHCb: ~50 DAQ + ~50 HLT + ~50 DCS

❚ Some Performance numbers

➨ All Experiments work Beautifully! 37

ALICE ATLAS CMS LHCb Cold Start to Running (min.) 5 5 3 4 Stop/Start Run (min.) 6 2 1 1 Fast Stop/Start (sec.) - <10 <10 <10 DAQ Inefficiency (%) 1 <1 <1 <1

Page 38: SMI++ A Finite State Machine Toolkit

Clara Gaspar, October 2013

LHCb Control System

❚ Courtesy of CMS DCS Team 38