A Review of SoA HPC power monitoring systems & future trend on fine-grain data analytics
EE in HPC @CINECA
Ostrava, IT4I, 30-Jan-2020
Andrea Bartolini
(slides from Antonio Libri)
EU H2020 FETHPC
project ANTAREX
(g.a. 671623)
IIS - D-ITET - ETH Zurich
Outline
▪ Motivation – why reliable and high-res power/energy mon?
▪ SoA power monitoring systems
▪ Future trend on fine-grain measurements: data analytics
22-Feb-2019, Antonio Libri
Several Challenges for HPC and Data-centers
Fine-Grain Power and Performance Measurements:
- Verify and classify node performance (& anomalies)
- In-spec / out-of-spec behaviour
- Misconfiguration
- Aging and wear-out
- Predictive maintenance
System Power Capping (reliable energy measurements)
- New installations, grid SLA, power shortage, natural disasters
- Ensures operating power stays below a maximum power consumption level
[Figure: coarse-grain vs fine-grain view of a node (CPUs, accelerators, DIMMs); job scheduler exchanging req / util signals]
Top500/Green500 Power Meas methodology [1]
▪ The document in [1] reports the requirements to rank an HPC system in Top500/Green500

Requirements                    | Level 1 (Min Quality)             | Level 2                           | Level 3 (Best Quality)
Granularity                     | 1 S/s                             | 1 S/s                             | Continuously integrated energy; V & I sampled at least @5 kS/s for AC, @120 S/s for DC
Precision (1σ, relative error)  | 5%                                | 2%                                | Below 1%
Meas synch b/w different meters | Below sampling period (e.g., NTP) | Below sampling period (e.g., NTP) | Below sampling period (e.g., NTP)

[1] EEHPC WG, “Energy Efficient High Performance Computing Power Measurement Methodology”, v2.0 RC 1.0
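As a rough illustration, the three requirement rows above (granularity, precision, synchronization) could be checked against a metering setup as sketched below. This is only a sketch under stated assumptions: it covers only the rows shown on this slide, not the full methodology in [1], and the function name and parameters are hypothetical.

```python
def eehpc_level(sample_rate_hz, rel_precision, sync_error_s,
                integrates_energy=False, ac=True):
    """Return the highest level (3, 2 or 1) met by a setup, or 0 if none.

    Only the granularity, precision and synchronization rows of the
    slide are checked; names and signature are illustrative assumptions.
    """
    # Synchronization between meters must be below the sampling period.
    if sync_error_s >= 1.0 / sample_rate_hz:
        return 0
    # Level 3: continuously integrated energy, V & I sampled at
    # >= 5 kS/s (AC) or >= 120 S/s (DC), precision below 1%.
    min_rate_l3 = 5000.0 if ac else 120.0
    if integrates_energy and sample_rate_hz >= min_rate_l3 and rel_precision < 0.01:
        return 3
    # Levels 2 and 1: at least 1 S/s, precision within 2% or 5%.
    if sample_rate_hz >= 1.0 and rel_precision <= 0.02:
        return 2
    if sample_rate_hz >= 1.0 and rel_precision <= 0.05:
        return 1
    return 0


# A 1 S/s meter with 5% precision and 10 ms clock error.
basic_meter = eehpc_level(1.0, 0.05, 0.01)
# A 50 kS/s energy-integrating meter with 0.9% precision and 1 us clock error.
fine_meter = eehpc_level(50_000.0, 0.009, 1e-6, integrates_energy=True)
```

Note that a fast meter with poor synchronization fails the check: at 1 kS/s the sampling period is 1 ms, so a 5 ms clock error is already out of spec.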
Some (CORRECT) terminology
▪ The terms accuracy and precision are often wrongly interchanged
▪ However:
  ▪ Accuracy → mean error (can be fixed by calibration)
  ▪ Precision → Std Dev (cannot be fixed by calibration)
Note: These definitions apply to any set of measurements (e.g., power measurements, but also time synchronization measurements)
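The distinction can be made concrete with a few repeated readings of a known reference. The sketch below uses made-up readings of a hypothetical 100 W load: subtracting the measured bias (calibration) removes the mean error but leaves the standard deviation unchanged.

```python
import statistics


def accuracy_and_precision(readings, true_value):
    """Return (mean error, standard deviation) of repeated readings."""
    mean_error = statistics.fmean(readings) - true_value  # accuracy (bias)
    spread = statistics.stdev(readings)                    # precision
    return mean_error, spread


# Hypothetical readings of a 100 W reference load.
readings = [103.0, 105.0, 104.0, 106.0, 102.0]
bias, spread = accuracy_and_precision(readings, 100.0)

# Calibration subtracts the bias: the mean error vanishes, the spread stays.
calibrated = [r - bias for r in readings]
cal_bias, cal_spread = accuracy_and_precision(calibrated, 100.0)
```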
Fine-grain Sync w. App Phases on HPC Clusters [1]
▪ μs-resolved time stamps
[Figure: HPC cluster (racks, nodes, CRAC, cold/hot air/water) monitored by embedded SoCs (ESoC_1 to ESoC_n) that collect several metrics (temperature, fan RPM, power, GPU/CPU perf counters) with synchronized clocks; timeline of a parallel MPI application (processes P0 to Pn on Node 1 to Node n) aligned with temperature, power and cache-miss traces, compared with power sampled @1s]
[1] Libri et al., 2018, Evaluation of NTP / PTP Fine-Grain Synchronization Performance in HPC Clusters
SoA HPC Monitoring Systems
▪ Current solutions collect measurements in-band and out-of-band (no overhead on the PE) via
  ▪ built-in tools (e.g., IPMI, Amester, RAPL → hw perf counters)
  ▪ custom sensors (e.g., HDEEM, HAEC → fine-grain power meas)
SoA In-band HPC Power Monitoring systems
State-of-the-art:
1. Intel RAPL [1]
- Sampling period down to 1 ms
- Energy read via RAPL MSR registers
- Synchronization via NTP/PTP (vendor dependent)
- Precision is vendor dependent
- Scalable
[Figure: node block diagram (PSU, DC-DC, VRs feeding PE1 to PE3, BMC; HW / SW / user space); RAPL MSR register read from user space via MSR_Safe]
[1] M. Hahnel et al., “Measuring Energy Consumption for short Code Paths Using RAPL”
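Since RAPL exposes energy counters rather than power, average power has to be derived from two counter readings. The sketch below shows that arithmetic, assuming the documented Intel layout (energy unit of 1/2^ESU joules with ESU in bits 12:8 of MSR_RAPL_POWER_UNIT, and a 32-bit energy status counter that wraps around). How the raw values are read (e.g., through msr-safe) is left out, and the function names are hypothetical.

```python
def energy_unit_joules(power_unit_msr):
    """Energy unit from a raw MSR_RAPL_POWER_UNIT value.

    Bits 12:8 hold the energy status unit (ESU); one count is 1/2**ESU J.
    """
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def average_power_watts(raw_start, raw_end, dt_s, unit_j):
    """Average power between two raw 32-bit energy counter readings."""
    # Modulo 2**32 arithmetic handles a single counter wraparound.
    delta = (raw_end - raw_start) & 0xFFFFFFFF
    return delta * unit_j / dt_s


# Example raw values (illustrative, not taken from the slides).
unit = energy_unit_joules(0x000A0E03)          # ESU = 14
power = average_power_watts(0xFFFFFF00, 0x00000100, 0.001, unit)
```

The wraparound handling only works if the counter is sampled often enough that it wraps at most once between two readings.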
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI [1]
- Slow sampling time (seconds)
- Unreliable time stamping
- Instantaneous power measurement (no energy integration, prone to aliasing)
- Precision is vendor dependent
- Scalable
[Figure: node block diagram (PSU, DC-DC, VRs feeding PE1 to PE3, BMC; HW / SW / user space); the BMC exposes P(t) via IPMI to a remote management node used by the system administrator]
[1] IPMI spec v2.0 rev1.1, April 2015
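The aliasing issue of slow instantaneous sampling can be illustrated with a synthetic load. The sketch below assumes a hypothetical periodic workload (200 W for the first 250 ms of every second, 100 W otherwise): one instantaneous reading per second always lands on the burst, while 1 ms sampling approximates the true average.

```python
def power_w(t_ms):
    """Hypothetical load: 200 W for the first 250 ms of every second, else 100 W."""
    return 200.0 if (t_ms % 1000) < 250 else 100.0


def mean_power(sample_period_ms, duration_ms):
    """Average of instantaneous samples taken every sample_period_ms."""
    samples = [power_w(t) for t in range(0, duration_ms, sample_period_ms)]
    return sum(samples) / len(samples)


coarse = mean_power(1000, 10_000)  # one instantaneous reading per second
fine = mean_power(1, 10_000)       # 1 ms sampling approximates the integral
```

In this constructed case the coarse estimate overstates the average power, and therefore the energy, by 60%. The size and sign of the error depend on how the workload period lines up with the sampling instants.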
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI
2. IBM Amester [1]
- BMC reads power measurements via OCC (within PE)
- An in-band version is also available on the new Power9
- Time resolution up to 250 μs (8 kB buffers)
- Synchronization via NTP/PTP (vendor dependent)
- Precision is vendor dependent
- Scalable
[Figure: node block diagram (PSU, DC-DC, one OCC per PE for PE1 to PE3, BMC; HW / SW / user space); the BMC is reached via IPMI from a remote management node used by the system administrator]
[1] T. Rosedahl et al., “Power/Performance Controlling Techniques in OpenPOWER”
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI
2. IBM Amester
3. HDEEM [1]
- Time resolution up to 1 ms (VR on CPU, DDR) and 125 μs (plug)
- Precision of 2% and 3%, respectively
- Time synchronization up to ms via NTP
- Scalable
[Figure: node block diagram (DC-DC, VRs feeding PE1 to PE3, BMC, FPGA; HW / SW / user space); P(t) reaches the remote management node via IPMI and is stored in a database accessed by system administrator and user]
[1] Ilsche et al., “Power Measurement Techniques for Energy-Efficient Computing: Reconciling Scalability, Resolution and Accuracy”
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI
2. IBM Amester
3. HDEEM
4. HAEC [1]
- 2 NI-DAC: one @7 kS/s (T=143 μs, VR on CPU, DDR), one @500 kS/s (T=2 μs, power plug)
- Precision below 2%
- Current monitoring with shunt resistors (HE sensor also tested)
- Time synchronization up to ms via NTP
- Not scalable (single node only)
[Figure: node block diagram (DC-DC, one shunt resistor per PE for PE1 to PE3, BMC, NI-DAC; HW / SW / user space); P(t) delivered to the user and to the remote management node used by the system administrator]
[1] Ilsche et al., “Power Measurement Techniques for Energy-Efficient Computing: Reconciling Scalability, Resolution and Accuracy”
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI
2. IBM Amester
3. HDEEM
4. HAEC
5. CRAY XC APM [1]
- Measurements via PMBus
- Time res up to 1 s (node power) and 10 s (CPU & memory power)
- Precision ±2.5% (by datasheet)
- Time synchronization up to ms via NTP
- Scalable
[Figure: node block diagram (DC-DC, VRs feeding PE1 to PE3, BMC, PMBus; HW / SW / user space); P(t) reaches the remote management node via IPMI and is stored in a database accessed by system administrator and user]
[1] Steven J. Martin et al., “Cray XC Advanced Power Management Updates”
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI
2. IBM Amester
3. HDEEM
4. HAEC
5. CRAY XC APM
6. PowerInsight [1] / ArduPower [2]
- Open, low-cost embedded SoC (PowerInsight → Beaglebone; ArduPower → Arduino Mega 2560)
- Current monitoring with HE sensor
- Time res up to 1 ms on PowerInsight; ~2 ms on ArduPower
- Precision 1.8% on PowerInsight; not reported for ArduPower
- Time synchronization up to ms via NTP
- Scalable
[Figure: node block diagram (DC-DC, VRs feeding PE1 to PE3, BMC, ESoC; HW / SW / user space); P(t) delivered to the user and, via IPMI, to the remote management node used by the system administrator; photos of ArduPower (Arduino Mega 2560) and PowerInsight (Beaglebone)]
[1] J. L. Laros et al., “PowerInsight – A commodity Power Measurement Capability”
[2] M. F. Dolz et al., “ARDUPOWER: A Low-cost Wattmeter to improve Energy Efficiency of HPC Applications”
SoA Out-of-band HPC Power Monitoring systems
State-of-the-art:
1. BMC – IPMI
2. IBM Amester
3. HDEEM
4. HAEC
5. CRAY XC APM
6. PowerInsight / ArduPower
7. DiG [1]
- Open, low-cost embedded platform (Beaglebone Black, ARM Cortex-A8)
- Tested with both HE sensor and shunt resistor on several architectures (Intel, ARM, IBM)
- Time res up to 20 μs @plug; VR read at different rates depending on the architecture
- Precision below 1%
- Time synchronization up to μs via PTP
- Scalable: big-data communication protocol (MQTT) + real-time edge analytics of the fine-grain measurements
Beaglebone Black (BBB):
- 1 GHz ARM Cortex-A8
- 12-bit 8-ch SAR ADC
- PTP HW enabled
[Figure: node block diagram (DC-DC, VRs feeding PE1 to PE3, BMC, BBB; HW / SW / user space); P(t) published by the BBB via MQTT to the user, the database and the remote management node used by the system administrator]
[1] Libri et al., “DiG: Enabling Out-of-Band Scalable High-Resolution Monitoring for Data-Center Analytics, Automation and Control”
Example of Fine-Grain Power Meas w. DiG
[Figure: power traces measured with DiG. Panel labels: Coarse Grain View; BB View; 1 Node, 20 min; 45 Nodes, 4 s; 45 Nodes, 1 s; sampling rates BB @1s and BB @1ms]
Cluster Analytics – Data Collection SW Stack [1,2]
Back-end
• Send Pow and Perf measurements for cluster-level analytics
Front-end
• Exploit Cassandra (NoSQL DB)
• Data visualization (Grafana) and cluster-level ML (Spark, both RT & batch mode)
[Figure: measurement agents (Meas) in the target facility → MQTT brokers (Broker1 to BrokerM) → MQTT2Kairos → KairosDB → Cassandra nodes (node1 to nodeM, NoSQL) → Grafana, Apache Spark and applications (Python, Matlab)]
[1] https://github.com/EEESlab/examon
[2] F. Beneventi et al., “Continuous learning of HPC infrastructure models using big data analytics and in-memory processing tools”
DiG: High-Res Power Monitoring [1,2]
▪ Fourier analysis of high-resolution power measurements as an example of a feature extraction technique for time series
[Figure: spectra of high-resolution power traces; panels: Application 1, Application 2]
[1] Libri et al., 2018, “DiG: Enabling Out-of-Band Scalable High-Resolution Monitoring for Data-Center Analytics, Automation and Control”
[2] Borghesi et al., 2018, “Online Anomaly Detection in HPC systems”
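A minimal sketch of Fourier-based feature extraction on a power trace is given below. It is not the DiG implementation: it uses a naive O(n²) DFT to stay dependency-free (a real deployment would use an FFT library), the trace is synthetic, and all names are hypothetical. The extracted feature here is simply the dominant frequency of the mean-removed signal.

```python
import cmath
import math


def dft_magnitudes(samples, sample_rate_hz):
    """Return (frequency_hz, magnitude) pairs up to Nyquist, mean removed."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(centered))
        spectrum.append((k * sample_rate_hz / n, abs(acc) / n))
    return spectrum


def dominant_frequency(samples, sample_rate_hz):
    """Frequency of the strongest spectral component."""
    return max(dft_magnitudes(samples, sample_rate_hz), key=lambda p: p[1])[0]


# Synthetic trace: 100 W baseline with a 50 Hz, 10 W ripple, sampled at 1 kS/s.
rate = 1000.0
trace = [100.0 + 10.0 * math.sin(2 * math.pi * 50.0 * i / rate)
         for i in range(200)]
peak = dominant_frequency(trace, rate)
```

With 200 samples at 1 kS/s the bins are 5 Hz apart, so the 50 Hz ripple falls exactly on a bin. Two applications with different periodic behaviour would show different peaks, which is what makes the spectrum usable as a feature vector.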
Our Vision of Future Fine-Grain Analytics – The DiG Approach [1,2]
Leverage real-time analysis between:
▪ Embed Mon – High Res Pow/Perf Edge Analytics
▪ Central Mon – Pow/Perf Cluster Analytics
[Figure: data-center servers (PSU, DC-DC, PEs) with an embedded computer that measures power (I, V) and perf, runs edge analytics @high rate and publishes via MQTT, Pub(top, data); MQTT broker; central monitor that subscribes, Sub(top), and runs cluster analytics @low rate]
[1] Libri et al., 2018, “DiG: Enabling Out-of-Band Scalable High-Resolution Monitoring for Data-Center Analytics, Automation and Control”
[2] Borghesi et al., 2018, “Online Anomaly Detection in HPC systems”
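The split between high-rate edge analytics and low-rate cluster analytics can be sketched as follows. This is an assumption-laden illustration, not the DiG code: a tiny in-memory class stands in for the MQTT broker (a real setup would use an MQTT client library and broker), and the topic name, feature set and sample values are hypothetical.

```python
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for an MQTT broker: topic -> list of subscriber callbacks."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):   # Sub(top)
        self._subs[topic].append(callback)

    def publish(self, topic, data):         # Pub(top, data)
        for callback in self._subs[topic]:
            callback(topic, data)


def edge_summary(window_w):
    """Edge analytics: reduce a high-rate power window to a few features."""
    return {"mean_w": sum(window_w) / len(window_w),
            "max_w": max(window_w),
            "n": len(window_w)}


broker = InMemoryBroker()
received = []  # what the central monitor sees
broker.subscribe("cluster/node01/power",
                 lambda topic, data: received.append((topic, data)))

# Hypothetical 1 ms samples for one second: 1000 points in, one message out.
window = [100.0 + (i % 10) for i in range(1000)]
broker.publish("cluster/node01/power", edge_summary(window))
```

The point of the design is the data reduction: the embedded computer keeps the raw high-rate stream local and only the summary crosses the network to the central monitor.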
Take home messages
▪ Several challenges on HPC systems require reliable and high-resolution measurements (precise, fine-grained & synchronized)
▪ Several SoA methods exist, both in-band and out-of-band, which use built-in and custom sensors
▪ Fine-grain measurements can reveal valuable information that can be used to profile applications and system behavior (e.g., detection of anomalies)
▪ Leverage the huge amount of data from fine-grain measurements by splitting the analytics between edge and cluster level
Thanks for your interest
Contact:
▪ Antonio Libri, [email protected]
Acknowledgements:
▪ EU H2020 FETHPC project ANTAREX (g.a. 671623)
▪ Andrea Bartolini, Francesco Beneventi, Andrea Borghesi and Luca Benini
Backup Slides
DiG: SW Overhead, Scalability & ML Infer
▪ SW overhead: DiG CPU usage of mon daemons < 46% (soon 0% thanks to co-processor offloading)
▪ Synch: via PTP up to μs (below the 20 μs sampling period)
▪ MQTT scalability: tested on 512 nodes of GALILEO (CINECA) → suitable for large-scale systems
▪ Real-time ML inference, preliminary benchmarks:
  ▪ RT feature extraction via FFT on high-resolution measurements in a time window of 40 ms w. around 7% of DiG CPU usage
  ▪ RT ML inference via TF w. a ResNet of 16 layers and channels {16; 16; 32; 64}, respecting the real-time constraint of 40 ms (FFT time window)
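The numbers above imply a simple budget for each FFT window. The sketch below works it out under the assumption that the FFT runs on the 20 μs plug measurements (the slide only says "high-resolution measurements"); the function name and dictionary keys are hypothetical.

```python
def fft_window_budget(sample_period_us, window_us):
    """Derive window size and spectral limits from sampling period and window length."""
    return {
        "samples_per_window": window_us // sample_period_us,
        "bin_resolution_hz": 1e6 / window_us,            # spacing of FFT bins
        "nyquist_hz": 1e6 / (2 * sample_period_us),      # highest resolvable frequency
        "windows_per_second": 1e6 / window_us,           # real-time deadline rate
    }


# Assumed inputs: 20 us sampling period, 40 ms (40000 us) FFT window.
budget = fft_window_budget(20, 40_000)
```

Under these assumptions each window holds 2000 samples, resolves frequencies in 25 Hz steps up to 25 kHz, and both feature extraction and inference must finish 25 times per second to stay real-time.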