
ACET - Accelerator Controls Exploitation Tools
Progress and plans, December 2012


Outline
- Controls system overview
- Motivation and purpose
- Focus points
- 2013
- Conclusions

ACET - TC on 06 December 2012

Controls system overview
[Figure: layered diagram of the controls system. Labels: Knobs, Services, Core, Diagnostics, Applications, Middletier, Front Ends, Sequencer, Orbit, InCA/LSA, Proxies, JMS, SIS, CMW/FESA, Timing, Drivers, DB, Boot, NFS, cmwDir, RBAC, DiaMon, cmwAdmin, FESA Navigator, Video, Syslog, Hardware, TuneRT]
- 425 consoles
- 400 GUIs
- 300 servers
- 200 Java servers
- 1300 FECs
- 600 module types
- 85,000 devices
Typical languages:
- Applications: Java, some XMotif
- Middletier: Java, some C++
- Front End: C and C++

ACET
Motivation
- Distributed and complex controls system
- Knowledge distributed over many experts
- Move towards uniform (LHC) exploitation model across machines
Purpose: allow (non-)experts to carry out more efficient diagnostics
ACET collaborates with CO projects to improve diagnostic facilities of the control system

Focus points
- Diagnostic tools: aggregation and training
- Process metrics: JMX & CMX
  - DiaMon GUI and CLIC agent
- Documentation
  - Wiki/site structure, Portal and Useful links
- Dynamic/runtime dependencies
- Feedback: Tracing & Config
  - Message format, transport, analysis
  - Trace analysis using Splunk
  - Config analysis in CCDB

Diagnostic tools
- Tools evaluated for criticality
- Aggregation into CCM diagnostic menu
- Training given during shutdown lectures

CCM, the Common Console Manager, is a centralized tool for launching applications. The Diagnostics menu was created for all machines and the appropriate tools were put in it.

Process metrics - JMX architecture
http://wikis/display/ACET/JMX+client+instrumentation
[Figure: JMX architecture diagram. Labels: C2Mon, SRV, JMX-DAQ, DiaMon GUI, Metrics, RMI, JMX, mBeans, JMX viewer, JmxDirectory, jConsole, jar1, jar2, mgt, JVM, jmx-dir-client, jVisualVM]

Color codes: green = done, yellow = to do, grey = no change.
Metrics are internal variables and counters, preferably indicating the overall health state. Projects must define what this implies for them.
JMX is a Java technology for exposing metrics and operations, but it lacked support for central registration and dynamic port allocation.
Contributors: Joel, Mike, Steen, Ivan

Process metrics - CMX architecture
http://wikis/display/MW/CMX
[Figure: CMX architecture diagram. Labels: C2Mon, CLIC-DAQ, DiaMon GUI, lib1, lib2, cmx-lib-c, shared memory segments, C process p1, cmx-lib, registry, lib3, lib4, cmx-lib-c++, C++ process p2, CLIC agent, CMX viewer, FEC, Command line tool, DB, Metrics]

No C/C++ mechanism similar to JMX seems to exist, so CMX was developed in-house.
Developers: Gina/Felix, Peter, Mark, Joel
Tests by Xavier (C), Jeremy and Gina (C++, FESA class in testbed)

Process metrics - DiaMon JMX integration

Integration done by Joel

Process metrics - jConsole

Process metrics - Viewers


Process metrics - JMX lookup
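The lookup idea can be pictured with a small sketch. The deck notes that plain JMX lacked central registration and dynamic port allocation; the Python below only illustrates that combination and is not the JmxDirectory implementation. `ServiceDirectory`, `register`, `lookup` and the process name are hypothetical; the URL follows the standard JMX-over-RMI form.

```python
import socket


def allocate_free_port() -> int:
    """Ask the OS for a currently free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        # The port is released when the socket closes; a real server
        # would keep listening on it instead.
        return sock.getsockname()[1]


def jmx_service_url(host: str, port: int) -> str:
    """Build a JMX service URL in the standard RMI/JNDI form."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


class ServiceDirectory:
    """Hypothetical in-memory stand-in for a central directory service."""

    def __init__(self) -> None:
        self._urls = {}

    def register(self, process_name: str, host: str, port: int) -> str:
        """Record the service URL of a process under its name."""
        url = jmx_service_url(host, port)
        self._urls[process_name] = url
        return url

    def lookup(self, process_name: str) -> str:
        """Return the service URL previously registered for a process."""
        return self._urls[process_name]


# A process picks a dynamic port, registers itself, and a tool looks it up.
directory = ServiceDirectory()
port = allocate_free_port()
directory.register("demo-server", "localhost", port)
print(directory.lookup("demo-server"))
```

With such a directory, a viewer only needs the process name; the port can change on every restart without any client being reconfigured.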

The service URL allows connecting various JMX-related tools such as jConsole and jVisualVM.

Documentation - Structure

This structure is a proposal for SLM; maintenance responsibilities need to be agreed.

Documentation - Portal

A prototype to serve as a base for discussions.

Documentation - Useful links

A prototype to serve as a base for discussions.

Dependencies - architecture
[Figure: dependency data collection diagram. Labels: FEC (several), cmwadmin-scanner, Visualization, client connections, cmwAdmin, CMW/FESA, Dependency analysis, cmwDirectory, dot files, log files]
Data collection before LS1

http://wikis/display/MW/Statistics
A central process (cmwadmin-scanner) retrieves the list of all CMW servers from cmwDirectory, then queries the cmwAdmin interface of each one in sequence. This gives a snapshot of the current client connections in each server.

Dependencies - a view
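As a rough illustration of how a connection snapshot could be turned into a picture, the sketch below renders one as Graphviz DOT text, since the architecture slide lists dot files among the outputs. The function name, the snapshot layout (a mapping from server to client names) and the example names are assumptions made for this sketch; the actual scanner output format is not described in the deck. Servers that are registered but not running are drawn in red.

```python
def snapshot_to_dot(snapshot, running):
    """Render a {server: [clients]} snapshot as Graphviz DOT text.

    Servers that are not in `running` are drawn in red.
    """
    lines = ["digraph cmw_connections {"]
    for server in sorted(snapshot):
        color = "black" if server in running else "red"
        lines.append(f'  "{server}" [shape=box, color={color}];')
        for client in sorted(snapshot[server]):
            # One edge per client connection, pointing at the server.
            lines.append(f'  "{client}" -> "{server}";')
    lines.append("}")
    return "\n".join(lines)


# Made-up example data; real names would come from the scanner.
snapshot = {
    "server-a": ["client-1", "client-2"],
    "server-b": [],
}
running = {"server-a"}

print(snapshot_to_dot(snapshot, running))
```

The resulting text can be saved to a file and passed to the Graphviz `dot` tool to produce an image.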

One way of depicting connections between clients and CMW services. Red dots generally indicate a CMW server that is not running but is still present in the cmwDirectory service.

Dependencies - a view

http://wikis/display/MW/Statistics
Face FecBook

Feedback - architecture
http://wikis/display/MW/Log+and+Tracing
[Figure: feedback architecture diagram. Labels: JMS@cs-ccr-tracing, cmw-fb-c, C process, cmw, FESA3, cmw-log, CCDB, cmw-log4j, Java process, jar1, jar2, Listeners, GUIs, syslog@cs-ccr-feop, syslog@cs-ccr-tracing, /var/log/messages, FEC/SRV, JMS@cs-ccr-cmw, Syslog tracing, APEX GUIs, Splunk, syslog converters, Java tracing, Tracing & Config, libs, logfiles, Impl, make, Scripts, cmmnbld, deploy, wreboot]

Syslog forwarding and filtering: Nicolas, Mike
Messages: David/Michel (drivers), Jeremy (Make), Pavel (cmmnbuild), Felix (deploy), Nicolas (wreboot), Xavier (Timing), Frank/Stefano (IEPLC)
API: Andrzej
JMS: Felix
Filtering/throttling: Mike, Steen, Vito
API testers: Xavier, David, Stefano
CCDB: Zory, Jose

Feedback - CCDB tracing GUI


All errors and warnings are captured in CCDB, which allows programmatic extraction.

Feedback - Hardware config CCDB GUI
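One way to picture what such a screen presents, namely the hardware configuration reported back by a front end set against the configuration entered in the database, is a plain dictionary comparison. This is a minimal sketch under assumptions: the function name, the slot-to-module-type layout and the example values are invented for illustration and do not reflect the CCDB schema.

```python
def compare_config(entered, reported):
    """Return (key, entered_value, reported_value) for every mismatch.

    A value of None means the key is absent on that side.
    """
    mismatches = []
    for key in sorted(set(entered) | set(reported)):
        entered_value = entered.get(key)
        reported_value = reported.get(key)
        if entered_value != reported_value:
            mismatches.append((key, entered_value, reported_value))
    return mismatches


# Made-up example: slot name -> module type.
entered = {"slot1": "TYPE_A", "slot2": "TYPE_B", "slot3": "TYPE_C"}
reported = {"slot1": "TYPE_A", "slot2": "TYPE_X", "slot4": "TYPE_D"}

for key, want, got in compare_config(entered, reported):
    print(f"{key}: entered={want} reported={got}")
```

The three kinds of mismatch (different value, declared but not found, found but not declared) fall out of the same loop, which keeps the report easy to scan.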


A possible result of comparing hardware configuration feedback with the entered configuration data.

Splunk - architecture
- Central instance running on dedicated machine
- Project accounts set up
- Training given to projects
- Project-specific searches created
[Figure: Splunk data flow diagram. Labels: FEC (several), SRV, JMS@cs-ccr-tracing, JMS@cs-ccr-cmw, Splunk@cs-ccr-tracing, syslog@cs-ccr-feop, syslog@cs-ccr-tracing, /var/log/messages, logfiles, filter & throttle, cmw-log, cmw-log4j, filters]
Contact Steen for Splunk access

The filter is based on a JMS-client API provided by Felix.

Splunk - Message filter GUI
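The deck does not describe the rules the filter applies. Purely as an illustration of what filtering and throttling a message stream can mean, the sketch below drops messages under a minimum level and then lets only a limited number of identical messages per host through in each time window. All names, the message tuple layout and the windowing policy are assumptions; none of this is the JMS-client API mentioned above.

```python
class Throttle:
    """Let at most `limit` messages per key through in each time window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._windows = {}  # key -> (window_start, count)

    def allow(self, key, now):
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired, start a new one
        count += 1
        self._windows[key] = (start, count)
        return count <= self.limit


def filter_and_throttle(messages, min_level, throttle):
    """Drop messages below `min_level`, then throttle by (host, text)."""
    levels = {"INFO": 0, "WARNING": 1, "ERROR": 2}
    kept = []
    for timestamp, host, level, text in messages:
        if levels[level] < levels[min_level]:
            continue
        if throttle.allow((host, text), timestamp):
            kept.append((timestamp, host, level, text))
    return kept


# Made-up messages: (timestamp in seconds, host, level, text).
messages = [
    (0.0, "fec1", "INFO", "heartbeat"),
    (0.1, "fec1", "ERROR", "queue overflow"),
    (0.2, "fec1", "ERROR", "queue overflow"),
    (0.3, "fec1", "ERROR", "queue overflow"),
    (11.0, "fec1", "ERROR", "queue overflow"),
]
kept = filter_and_throttle(messages, "WARNING", Throttle(limit=2, window_seconds=10.0))
for message in kept:
    print(message)
```

Throttling per (host, text) pair keeps one noisy front end from hiding messages from the others, at the cost of one small table entry per distinct message.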

Splunk - saved searches

Splunk - visualization

Notice the timelines, which provide a trend overview that you cannot get with textual command-line tools.

Splunk - dashboard

Splunk - Use case: japc-ext-dir
- Queue overflow messages from CMW proxy
- Hosts and PIDs reported
- Client application identified
- japc-ext-dir suspected and verified
- Subscriptions made to constant properties
- Data never consumed => queue overflow in proxy
- Problem fixed by Eric

Splunk - Use cases
- Leap second
- RBAC tokens missing/malformed/expired
- CMW slow clients
- Telegram layout and configuration
- JAPC applying wrong token in certain cases
- FESA handling of Timlib error
- Separating test environment from operational

Splunk - Comments (1)
- Proper usage requires very good configuration
- We need to rework our way to log information
- Log files are a bit of a mess now, and only contain a subset of the necessary data
- It is necessary to clean up and extend logging
- It must be possible for others to access the data

Splunk - Comments (2)
Positive comments
- Powerful tool for detecting and reporting anomalies
- Very useful for proactive actions
- Powerful tool to make statistics
- It avoids spending time creating tools for decoding traces
- It is an agile way to gather analytics, to inform design decisions
- It is a very powerful auditing tool
- Trends over time allow spotting new types of problems
- It was useful for me several times for seeing if a problem is on one or multiple machines
- It gives an easy, reusable way of looking at logfiles
- It could become a valuable tool to spot errors, where currently we feel blind whenever there is a problem

Most projects are only beginning to use Splunk, and the raw data still needs improvement. Already there are many positive comments.

Splunk - vision
- Active, daily use by component providers - Dashboards
- Exploit tracing for
  - Pro-active operation
  - Informed evolution
  - Preventive maintenance
- 10 user-friendly message types per project
  - ERROR or WARNING
  - Contact information
  - Link to documentation
  - Message body meaningful to non-expert
  - No Java stack trace
- Continuous improvement of messages
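The message criteria in the vision (ERROR or WARNING level, contact information, link to documentation, a body meaningful to non-experts, no Java stack trace) could be checked mechanically. The sketch below is one hypothetical way to do so; the class, its field names, the example values and the stack-trace heuristic are assumptions, not an existing ACET or CMW format.

```python
from dataclasses import dataclass

ALLOWED_LEVELS = ("ERROR", "WARNING")


@dataclass(frozen=True)
class FeedbackMessage:
    """Hypothetical user-facing message record; field names are assumptions."""

    level: str
    body: str
    contact: str
    doc_link: str

    def problems(self):
        """Return the reasons why this message is not user-friendly."""
        found = []
        if self.level not in ALLOWED_LEVELS:
            found.append("level must be ERROR or WARNING")
        if not self.contact.strip():
            found.append("contact information missing")
        if not self.doc_link.strip():
            found.append("documentation link missing")
        if not self.body.strip():
            found.append("message body missing")
        # Crude heuristic for a pasted Java stack trace.
        if any(line.strip().startswith("at ") for line in self.body.splitlines()):
            found.append("body looks like a Java stack trace")
        return found


# Made-up examples; the contact and link values are placeholders.
good = FeedbackMessage(
    level="WARNING",
    body="Timing card in slot 3 did not answer; check the cabling.",
    contact="timing-support",
    doc_link="http://example.org/doc",
)
bad = FeedbackMessage(
    level="DEBUG",
    body="java.lang.NullPointerException\n    at Foo.bar(Foo.java:12)",
    contact="",
    doc_link="",
)

print(good.problems())
print(bad.problems())
```

Whether a body is meaningful to a non-expert cannot be checked automatically; that part remains a review task for each project.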
Plans for 2013 (a)
DiaMon
- Interactive service-oriented dependency view
- Declare and monitor process metrics
- Integrate metrics viewers
- Launching of external tools
- Make contact information accessible
Splunk
- Improve current setup and configurations
- Increase support and project uptake
- Investigate integration of ITAT

Plans for 2013 (b)
Documentation
- Agree/implement CO-wide website/wiki structure
- Agree on maintenance responsibilities
- Portal review, add and extend pages
- Content all projects provide -page description
Databases
- Finalize Hardware Configuration Feedback mechanisms
- Capturing version information, detecting time bombs
- Update contact information

Plans for 2013 (c)
Feedback (Tracing and Configuration)
- Improve message quality (structure, content, level)
- Increase project usage of feedback API
- All projects review configuration/version feedback
Process metrics
- Work with projects to expose metrics
- Extend CMX (commands,) ?
- MW team take over jmxDirectory

Plans for 2013 (d)
Runtime dependency data
- Analysis and visualization of CMW data
- Collecting network connection information
Drivers
- Finalize hardware configuration feedback
- Version feedback implementation

Conclusions
Done
- Means for provision/transport of tracing, configuration and metrics
- Centralized tracing and analysis
Todo
- Data generation by projects
- Documentation
- Analysis and presentation

Good support from projects in 2012, but too many other priorities for developers and for me.

2013 is for bringing the pieces together.

ACET needs time from all projects in 2013.