Department of Measurement and Information Systems, Budapest University of Technology and Economics, Hungary
Measurement-Driven Resilience Design of Cloud-Based Cyber-Physical Systems
Imre [email protected]
SERENE’14 Autumn School, 2014.10.14.
A View of Cyber-Physical Systems
Cyber-Physical Systems (CPSs)
Ubiquitous embedded and networked systems that can monitor and control the physical world with a high level of intelligence and dependability
Networked embedded systems everywhere
Clouds, „infusable” analytics, Big Data
From embedded to CPS
Direct manual control, „closed world” engineering
→ Highly autonomous, „cyber” backend, environment, swarms, …
Cyber-Physical Systems
Different flavors
o NSF, EU, academia, industry…
Still: it is here
o From smart cities & IoT to self-driving cars
o Scalable, reconfigurable backend is a must
Domains: Health Care, Transportation, Energy
„Classical” case for cloud computing: a brain for a CPS
Video surveillance
Citizen devices
Env. sensors …
Traffic control, situational awareness, deep analytics: normal day and disaster
See: Naphade et al. (IBM), „Smarter Cities and Their Innovation Challenges”, Computer, 2011
Elastic, reconfigurable computing
Reconfiguration
Converging domains
CPS
Cloud computing
Big Data
Detour 1: Cloud Computing
Cloud computing: leased resources
Source: http://cloud.dzone.com/articles/introduction-cloud-computing
Definition?
NIST 800-145
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Properties
On-demand self-service
Broad network access
Resource pooling
Rapid elasticity
Measured service
On the provider side…
Why is it good for the provider?
(Even without the CLT:)
Let X_i be independent random variables, each with mean μ and variance σ²
Coefficient of variation: CV(X_i) = σ/μ
Expected value of a sum: sum of the expected values, so E[X_sum] = nμ
Variance of a sum of independent variables: sum of the variances, so Var(X_sum) = nσ²

CV(X_sum) = √(nσ²) / (nμ) = (1/√n)·(σ/μ) = (1/√n)·CV(X_i)

„Statistical multiplexing”
Variance w.r.t. the mean gets smaller
1/√n decays quickly – even smaller private clouds benefit
Reality is a bit different
Source: http://en.wikipedia.org/wiki/Central_limit_theorem
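A few lines of simulation reproduce the 1/√n effect above; the demand distribution (Gaussian with μ=10, σ=4) and the trial count are arbitrary illustrative choices:

```python
import random
import statistics

def cv(xs):
    """Coefficient of variation: population std dev over mean."""
    return statistics.pstdev(xs) / statistics.mean(xs)

def multiplexed_cv(n, trials=10000, mu=10.0, sigma=4.0, seed=42):
    """CV of the sum of n i.i.d. demands, estimated by simulation."""
    rng = random.Random(seed)
    sums = [sum(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(trials)]
    return cv(sums)

# The CV of aggregate demand shrinks roughly as 1/sqrt(n):
for n in (1, 4, 16, 64):
    print(n, round(multiplexed_cv(n), 3))
```

Doubling the pool four-fold halves the relative variability, which is exactly why a shared pool can be provisioned much tighter than per-tenant peak capacity.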
Gartner, 2013
„For larger businesses with existing internal data centers, well-managed virtualized infrastructure and efficient IT operations teams, IaaS for steady-state workloads is often no less expensive, and may be more expensive, than an internal private cloud.”
„I need it now, and need it fast…”?
Parallelizable loads
More and more embarrassingly parallel, „scale-out” application categories exist
NYT TimesMachine: public domain archive
o Conversion to web-friendly format: Apache Hadoop, a few hundred VMs, 36 hours
In the cloud: costs the same as with one VM
Practically: „speedup for free”
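Under linear pay-per-use pricing, the „speedup for free” observation is plain arithmetic; the hourly rate below is an assumed example figure, not a real provider price:

```python
def iaas_cost(vm_count, hours, hourly_rate=0.10):
    """Total IaaS bill under linear pay-per-use pricing (assumed example rate)."""
    return vm_count * hours * hourly_rate

# One VM for 3600 hours vs. 100 VMs for 36 hours: the same bill,
# but roughly 100x less wall-clock time for an embarrassingly parallel job.
print(iaas_cost(1, 3600) == iaas_cost(100, 36))  # True
```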
Scaling resources
„Scale up”
„Scale out”
o Algorithmics?
o „webscale” technologies
Detour 2: Big Data
1.) Big Data at Rest
Distributed storage
„Computation to data”
„At rest Big Data”
o No update
o No sampling
„Not true, but a very, very good lie!” (T. Pratchett, Night Watch)
MapReduce (Apache Hadoop)
Distributed File System
[Figure: MapReduce dataflow. Map tasks read blocks from the Distributed File System and emit key-value pairs; the SHUFFLE phase groups all values by key; Reduce tasks fold each group into a single output pair.]
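The dataflow can be sketched in a few lines of plain Python; this is a toy word count, not Hadoop API code: Map emits key-value pairs, the shuffle groups them by key, and Reduce folds each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (key, value) pair per word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's value list into a single result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the cloud scales out", "the stream never stops"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 2
```

In Hadoop, each phase runs on many machines in parallel and the shuffle moves data over the network; the computation itself is exactly this shape.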
2.) „Big Data in Motion”
Stream processing
Inherently scalable the same way
Streaming data
Sensor data
o From smart grid to turbine testing
Images
o Satellites: n TB/day
Web services
Network traffic
Trading
…
The stream processor model
Source: Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge: Cambridge University Press. p130
Design & composition
Source: International Technical Support Organization. IBM InfoSphere Streams: Harnessing Data in Motion. September 2010, p76
When we have a WCET constraint…
Emphasis in „plain” Big Data: keeping step with ingress
o But largely the same for direct timeliness
No (direct) disk access
Memory: bounded
Per-tuple processing: bounded
Algorithmic patterns:
o Per-tuple processing
o Sliding window storage and processing
o Specialized sampling
• Gets ugly fast
o Various heuristics
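The sliding-window pattern keeps both memory and per-tuple work bounded; here is a minimal sketch (the window size and the mean statistic are arbitrary choices):

```python
from collections import deque

class SlidingWindowMean:
    """Bounded sliding-window mean: O(1) work per tuple, O(k) memory,
    matching the per-tuple and window-storage patterns above."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest tuple evicted automatically
        self.total = 0.0

    def push(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # subtract the value about to fall out
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

w = SlidingWindowMean(3)
for v in (10, 20, 30, 40):
    last = w.push(v)
print(last)  # mean of (20, 30, 40) = 30.0
```

Because no per-tuple state ever grows beyond the window, the worst-case cost of processing one tuple is constant, which is what a WCET argument needs.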
Application classes
Source: International Technical Support Organization. IBM InfoSphere Streams: Harnessing Data in Motion. September 2010, p80
Takes on cyber-physical clouds: Cloud-in-CPS…
Converging domains
CPS
Cloud computing
Big Data
standard link
Intelligence Reconfigurability
Clouds in CPS – reality, not promise
SENSORS ACTUATORS
Architectural landscape
Takes on cyber-physical clouds: …CPS-in-cloud
Extending Apache VCL for CPS
Apache VCL
Virtualized Data Center
...
Virtual machines
Internet/CAN/LAN
Remote client
Reservation
Establishing connection
Remote desktop or terminal access
Proof of Concept
Time-shareable arrangements
Cloud-on-Cloud
Apache VCL
VCL management network
VCL public network
Cloud instance
Network-attachedphys. devices
Experiment video stream
„Cloud on Cloud” capability
Apache VCL
VCL management network
VCL public network
Apache VCL/OpenStack/...
CoC virtual networks
Bootstrap & capture XaaS
Hypervisors
„Cloud on Cloud” (CoC)
With nested virtualization we have…
o virtualesxi
o VCL over VCL on top of that
Some restrictions apply; in VCL, no…
o storage virtualization
o network virtualization
o dynamic reservations
Integrating a field device: Raspberry Pi
Surprisingly popular
o In the target demographic
Almost a lab PC: rpi VCL module
Linux
o gentler learning curve
o In reservation: SSH access
Useful set of interfaces
ASM, C, scripting, Java, Wolfram
Integrating field devices?
Other device types: adapter computer needed
o E.g. a Raspberry Pi for an Arduino
o Scopes/spectrometers/…: already there
o Autonomous cameras/mesh GWs/…: already inside
Lab.pm: starting point, needs rework
o Field devices: „sanitization” is a stronger concept
o Harder work. Pi: reset + read-only SD netboot
Future: field devices as true cloud hosts
Real-time/embedded virtualization is maturing
o Check out: Siemens Jailhouse
o Xen for ARM
o …
Also see: carrier clouds
Raspberry Pi already has containers!
Educational prototype
Immediate applications: cloud engineering
CoC: teaching virtualization & cloud
o E.g. we use it for an ESXi lab
o Support for local VCL development in progress
Real-life: faults, errors, failures
o CPS: performance!
Virtualization in the loop
o There are existing SWIFI tools…
o … and VCL can be a harness
Immediate applications: people & labs
Internet/CAN/LAN
Remote client
We have EE/CE in view; chemistry, biology, physics, …?
Trusting your cloud with deadlines: is it a good idea?
Clouds for demanding applications?
Standard infrastructure vs. demanding application?
Clouds for demanding applications?
Virtual Desktop Infrastructure
Telecommunications
Extra-functional reqs: throughput, timeliness, availability
„Small problems” have high impact(soft real time)
Test automation
[Figure: lab setup. A test-automation harness drives the system; interference is injected at the hypervisor; OS and hypervisor metrics are collected, with LOLO/HIHI thresholds marked.]
Experimental setup
Short transient faults – long recovery
8 sec platform overload → 30 sec service outage → 120 sec SLA violation
As if you unplugged your desktop for a second...
Deterministic (?!) run-time in the public cloud...
Variance: tolerable by overcapacity
Performance outages: intolerable by overcapacity
The noisy neighbour problem
[Figure: a tenant VM and a noisy neighbor VM sharing a hypervisor]
Tenant-side measurability and observability
Characterizing IaaS performance
IaaS performance
HW not necessarily known
Unknown / uncontrollable deployment
Unknown / uncontrollable scheduling: „noisy neighbors”
Also: management action performance?
IaaS performance
Deployment decisions
o Should I use this cloud?
Capacity planning
o Type and amount of resources
Performance prediction
o QoS to be expected
o And its deviations
→ Benchmarking!
Benchmarking (a pragmatic take on)
(De-facto) standard applications
with well defined execution metrics
that may exercise specific subsystems
to compare IT systems via said metrics.
Popular benchmarks: e.g. Phoronix Test Suite
Benchmarking as a Service: cloudharmony.com
Why traditional benchmarking is not enough
Stability
Homogeneity
Rare events
Repeatability?
o Provider/tenant
Micro/component benchmarks?
o Application sensitivity?
o Cloud functions (scale in and out)?
Towards Measurement-Driven Resilience Design for Clouds
A performance feature model + exp. behavior, homogeneity, stability
Li, Z., O'Brien, L., Cai, R., & Zhang, H. (2012). Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services. In 2012 IEEE Fifth International Conference on Cloud Computing (pp. 344–351). IEEE. doi:10.1109/CLOUD.2012.74
Modeling IaaS performance experiments
„Cloud metrology” and its application
Full stack instrumentation
Full adaptive data acquisition
Fine-grained storage
Exploratory Data Analysis
Confirmatory Data Analysis
Mystery shoppers and routine exercises
Application sensitivity model
(Platform) fault model
Performance/capacity model
Structural defenses
Dynamic defenses
MONITORING
BENCHMARKING
Example: characterizing VDI „CPU Ready Time”
„Ready”: VM ready to run, but not scheduled
o VDI: „stutter”
Rare events
o Sampling
Needs fine granularity! + at least a few months
Very „wide” data
Result: ~QoE capacity + load
Big Data tooling
EDA: hypotheses from „visual tours” of the data
Cloud response time ~ network delay
Client ID ~ location
Client locations
Does not scale for Big Data (yet)
Workflow? (As of now)
Classical tools
Slow EDA on Big Data
Interactive EDA on samples
Statistics on samples
Big Data statistics
Hadoop, Storm, Cassandra, …
The effect of CPS cloud backend instability
Experimental environment
[Figure: experimental environment. Host1 and Host2 plus two workstations; nodes OS_contr, OS_compute (with nimbus) and OS_network; collectd-based collection, replay, and supervisors superv1 and superv2.]
Application
Application topology
Redis spout
Gatherer1
Gatherer2
Aggregator
Timer spout
Sweeper
<ts, city, delay>
<city, delay>
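The aggregation step can be sketched as follows; the component names mirror the topology above, but the tick-driven windowing logic is an assumption about how the Timer spout and Sweeper interact:

```python
from collections import defaultdict

class Aggregator:
    """Per-city delay aggregation. Gatherers feed <ts, city, delay> tuples in;
    on a Timer-spout tick, the Sweeper flushes <city, mean delay> pairs out."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def on_tuple(self, ts, city, delay):
        # Fold one gatherer tuple into the running per-city sums
        self.sums[city] += delay
        self.counts[city] += 1

    def on_tick(self):
        # Sweep: emit per-city means and reset state for the next window
        out = {c: self.sums[c] / self.counts[c] for c in self.sums}
        self.sums.clear()
        self.counts.clear()
        return out

agg = Aggregator()
agg.on_tuple(1, "Budapest", 12.0)
agg.on_tuple(2, "Budapest", 18.0)
agg.on_tuple(3, "Wien", 9.0)
print(agg.on_tick())  # {'Budapest': 15.0, 'Wien': 9.0}
```

State stays bounded by the number of cities, so the bolt satisfies the bounded-memory constraint discussed earlier.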
Workload
Baseline workload
Start of stress, end of stress
CPU utilization
Process latency
Relationship with guest resource usage?
Correlation: 0.890
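The reported value is a plain Pearson correlation coefficient; it can be computed as below (the two series are made-up illustrative numbers, not the measured data):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length series,
    e.g. guest CPU utilization vs. process latency."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

util = [0.2, 0.4, 0.5, 0.8, 0.9]          # hypothetical CPU utilization samples
latency = [11.0, 14.0, 15.0, 22.0, 25.0]  # hypothetical latency samples (ms)
print(round(pearson(util, latency), 3))
```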
Acknowledgements
Special thanks for the experimental environment and data go to our OpenStack measurement „taskforce”:
Ágnes Salánki, Dávid Zilahi, Tamás Nádudvari, György Nádudvari, Gábor Kiss (BME) and
Gábor Urbanics (Quanopt Ltd, our spinoff)