SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber-Physical Systems


Department of Measurement and Information Systems, Budapest University of Technology and Economics, Hungary

Measurement-Driven Resilience Design of Cloud-Based Cyber-Physical Systems

Imre Kocsis (ikocsis@mit.bme.hu)

SERENE’14 Autumn School, 2014.10.14.

A View of Cyber-Physical Systems

Cyber-Physical Systems (CPSs)

Ubiquitous embedded and networked systems that can monitor and control the physical world with a high level of intelligence and dependability.

Networked embedded systems everywhere

Clouds, „infusable” analytics, Big Data

From embedded to CPS

Direct manual control, „closed world” engineering

Highly autonomous, „cyber” backend, environment, swarms, …

Cyber-Physical Systems

Different flavors
o NSF, EU, academia, industry…

Still: it is here
o From smart cities & IoT to self-driving cars

Scalable, reconfigurable backend is a must


Health Care

Transportation

Energy

„Classical” case for cloud computing: a brain for a CPS

[Figure: city-scale data sources (video surveillance, citizen devices, environmental sensors, …) feed traffic control, situational awareness and deep analytics; the workload shifts between normal days and disasters.]

See: Naphade et al. (IBM), „Smarter Cities and Their Innovation Challenges”, Computer, 2011

Elastic, reconfigurable computing (reconfiguration)

Converging domains

CPS

Cloud computing

Big Data


Detour 1: Cloud Computing

Cloud computing: leased resources

Source: http://cloud.dzone.com/articles/introduction-cloud-computing

Definition?

NIST 800-145

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Properties

On-demand self-service

Broad network access

Resource pooling

Rapid elasticity

Measured service


On the provider side…

Why is it good for the provider? (Without the CLT:)

Let X_i be independent random variables, each with mean μ and variance σ².

Coefficient of variation: CV = σ/μ

Expected value of a sum: sum of the expected values (nμ)
Variance of a sum (by independence): sum of the variances (nσ²)

CV(X_sum) = √(nσ²) / (nμ) = (1/√n) · (σ/μ) = (1/√n) · CV(X_i)

„Statistical multiplexing”: variance relative to the mean gets smaller.
1/√n decreases quickly, which is relevant even for smaller private clouds.
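The 1/√n effect can be checked numerically. The sketch below compares the coefficient of variation of a single tenant's load with that of an aggregate of 16 tenants; the uniform load distribution is an assumption made purely for illustration.

```python
import random
import statistics

def cv(xs):
    """Coefficient of variation: standard deviation over mean."""
    return statistics.pstdev(xs) / statistics.fmean(xs)

def multiplexed_cv(n, trials=20000, seed=42):
    """CV of the sum of n i.i.d. per-tenant loads.

    The uniform(50, 150) load distribution is an assumption; any
    distribution with finite variance shows the same 1/sqrt(n) shrinkage."""
    rng = random.Random(seed)
    sums = [sum(rng.uniform(50, 150) for _ in range(n)) for _ in range(trials)]
    return cv(sums)

cv1 = multiplexed_cv(1)    # relative variability of one tenant
cv16 = multiplexed_cv(16)  # aggregate of 16 tenants: roughly cv1 / 4
```

With 16 tenants, the aggregate's relative variability drops to about a quarter of a single tenant's, matching the 1/√n prediction.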

Reality is a bit different

Source: http://en.wikipedia.org/wiki/Central_limit_theorem

Gartner, 2013

„For larger businesses with existing internal data centers, well-managed virtualized infrastructure and efficient IT operations teams, IaaS for steady-state workloads is often no less expensive, and may be more expensive, than an internal private cloud.”

„I need it now, and need it fast…”?

Parallelizable loads

More and more embarrassingly parallel, „scale-out” application categories exist

NYT TimesMachine: public domain archive
o Conversion to web-friendly format: Apache Hadoop, a few hundred VMs, 36 hours

In the cloud: costs the same as with one VM

Practically: „speedup for free”
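A minimal sketch of the scale-out idea (the `convert` work unit is hypothetical; threads stand in for the processes or VMs a real conversion farm would use): the tasks share nothing, so wall-clock time shrinks roughly linearly with worker count while total rented-hours cost stays flat.

```python
from concurrent.futures import ThreadPoolExecutor

def convert(page_id):
    """Stand-in for one independent conversion task (hypothetical unit of
    work; the real NYT job converted archive pages with Hadoop)."""
    return f"page-{page_id}.html"

def convert_archive(page_ids, workers=4):
    """Embarrassingly parallel fan-out: tasks share no state, so adding
    workers shortens the run without changing total work, and in an IaaS
    cloud n VMs for t/n hours cost the same as one VM for t hours."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert, page_ids))
```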

Scaling resources

„Scale up”

„Scale out”
o Algorithmics?
o „Webscale” technologies

Detour 2: Big Data

1.) Big Data at Rest

Distributed storage

„Computation to data”

„At rest” Big Data
o No update
o No sampling

„Not true, but a very, very good lie!” (T. Pratchett, Night Watch)

MapReduce (Apache Hadoop)

[Figure: key-value pairs stored on a Distributed File System flow through Map, SHUFFLE and Reduce stages.]
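The Map → Shuffle → Reduce pipeline can be sketched in a few lines of single-machine Python, using word count as the classic example (real Hadoop distributes each phase across the cluster and the DFS):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (key, value) pairs -- here (word, 1) for each word."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key across mapper outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's value list into a result -- here a sum."""
    return {key: sum(values) for key, values in groups.items()}

def mapreduce_wordcount(documents):
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return reduce_phase(shuffle_phase(mapped))
```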

2.) „Big Data in Motion”

Stream processing

Inherently scalable the same way

Streaming data

Sensor data

o From smart grid to turbine testing

Images

o Satellites: n TB/day

Web services

Network traffic

Trading

The stream processor model

Source: Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge: Cambridge University Press. p130

Design & composition

Source: International Technical Support Organization. IBM InfoSphere Streams: Harnessing Data in Motion. September 2010, p76

When we have a WCET constraint…

Emphasis in „plain” Big Data: keeping step with ingress
o But largely the same for direct timeliness

No (direct) disk access

Memory: bounded

Per-tuple processing: bounded

Algorithmic patterns:
o Per-tuple processing
o Sliding window storage and processing
o Specialized sampling
• Gets ugly fast
o Various heuristics
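A minimal example of the sliding-window pattern with bounded memory and bounded per-tuple work (a running-mean operator; the class name and interface are illustrative, not from any stream-processing product):

```python
from collections import deque

class SlidingWindowMean:
    """Per-tuple stream operator with bounded state: it keeps only the last
    `size` values plus a running sum, so neither memory nor per-tuple
    processing time grows with the length of the stream."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def push(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # subtract the value being evicted
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)
```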

Application classes

Source: International Technical Support Organization. IBM InfoSphere Streams: Harnessing Data in Motion. September 2010, p80

Takes on cyber-physical clouds: Cloud-in-CPS…

Converging domains

[Figure: CPS, cloud computing and Big Data overlap; a standard link connects the CPS to the cloud, which contributes intelligence and reconfigurability.]

Clouds in CPS – reality, not promise

[Figure: the cloud backend sits in the loop between sensors and actuators.]

Architectural landscape


Takes on cyber-physical clouds: …CPS-in-cloud

Extending Apache VCL for CPS

Apache VCL

[Figure: a remote client connects over the Internet/CAN/LAN; after reservation and connection establishment, it gets remote desktop or terminal access to virtual machines in a virtualized data center.]

Proof of Concept

Time-shareable arrangements

Cloud-on-Cloud

[Figure: Apache VCL with its management and public networks provisions a cloud instance and network-attached physical devices; an experiment video stream goes back to the user.]

„Cloud on Cloud” (CoC) capability

[Figure: Apache VCL provisions hypervisors over the VCL management and public networks; on top run Apache VCL/OpenStack/... instances with CoC virtual networks, bootstrapped and captured as XaaS.]

With nested virtualization, we have…
o virtualesxi
o VCL over VCL on that

Some restrictions apply; in VCL, no…
o storage virtualization
o network virtualization
o dynamic reservations

Integrating a field device: Raspberry Pi

Surprisingly popular
o In the target demographic

Almost a lab PC: rpi VCL module

Linux
o Gentler learning curve
o In reservation: SSH access

Useful set of interfaces: ASM, C, scripting, Java, Wolfram

Integrating field devices?

Other device types: adapter computer needed
o E.g. a Raspberry Pi for an Arduino
o Scopes/spectrometers/…: already there
o Autonomous cameras/mesh GWs/…: already inside

Lab.pm: starting point, needs rework
o Field devices: „sanitization” is a stronger concept
o Harder work; for the Pi: reset + read-only SD netboot


Future: field devices as true cloud hosts

Real-time/embedded virtualization is maturing
o Check out: Siemens Jailhouse
o Xen for ARM
o …

Also see: carrier clouds

Raspberry Pi already has containers!

Educational prototype

Immediate applications: cloud engineering

CoC: teaching virtualization & cloud
o E.g. we use it for an ESXi lab
o Support for local VCL development in progress

Real-life: faults, errors, failures
o CPS: performance!

Virtualization in the loop
o There are existing SWIFI tools…
o … and VCL can be a harness

Immediate applications: people & labs

[Figure: remote clients reach physical lab devices over the Internet/CAN/LAN.]

We have EE/CE in view; chemistry, biology, physics, …?

Trusting your cloud with deadlines – is it a good idea?

Clouds for demanding applications?

Standard infrastructure vs demanding application?


Virtual Desktop Infrastructure

Telecommunications

Extra-functional reqs: throughput, timeliness, availability

„Small problems” have high impact (soft real time)

Experimental setup

[Figure: test automation drives a lab workload through a hypervisor while injecting interference; OS and hypervisor metrics are collected under LO and HI load levels.]

Short transient faults – long recovery

8 sec platform overload
30 sec service outage
120 sec SLA violation

As if you unplug your desktop for a second...

Deterministic (?!) run-time in the public cloud...

Variance: tolerable by overcapacity

Performance outage: intolerable by overcapacity

The noisy neighbour problem

[Figure: a tenant VM and a neighbor VM share the same hypervisor.]

Tenant-side measurability and observability

Characterizing IaaS performance

IaaS performance:
o HW not necessarily known
o Unknown / uncontrollable deployment
o Unknown / uncontrollable scheduling: „noisy neighbors”

Also: management action performance?

IaaS performance

Deployment decisions
o Should I use this cloud?

Capacity planning
o Type and amount of resources

Performance prediction
o QoS to be expected
o And its deviations

→ Benchmarking!

Benchmarking (a pragmatic take)

(De-facto) standard applications
with well-defined execution metrics
that may exercise specific subsystems
to compare IT systems via said metrics.

Popular benchmarks: e.g. Phoronix Test Suite

Benchmarking as a Service: cloudharmony.com
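In that spirit, a toy benchmark harness might look like the sketch below (the workload is a stand-in lambda; a real run would execute a standard suite such as the Phoronix Test Suite mentioned above). Note that it reports the metric's spread as well as its mean, which matters in a cloud.

```python
import statistics
import time

def benchmark(workload, repetitions=5):
    """Run a workload repeatedly and report the execution metric's
    distribution: in a cloud, the spread (CV) of the metric matters
    as much as its mean."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    mean = statistics.fmean(samples)
    return {"mean_s": mean,
            "cv": statistics.pstdev(samples) / mean,
            "samples": samples}

# Stand-in compute kernel; not a real, standardized benchmark.
result = benchmark(lambda: sum(i * i for i in range(100_000)))
```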

Why traditional benchmarking is not enough

Stability

Homogeneity

Rare events

Repeatability?
o Provider/tenant

Micro/component benchmarks?
o Application sensitivity?
o Cloud functions (scale in and out)?

Towards Measurement-Driven Resilience Design for Clouds

A performance feature model + expected behavior, homogeneity, stability

Li, Z., O'Brien, L., Cai, R., & Zhang, H. (2012). Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services. In 2012 IEEE Fifth International Conference on Cloud Computing (pp. 344–351). IEEE. doi:10.1109/CLOUD.2012.74

Modeling IaaS performance experiments

Li, Z., O'Brien, L., Cai, R., & Zhang, H. (2012). Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services. In 2012 IEEE Fifth International Conference on Cloud Computing (pp. 344–351). IEEE. doi:10.1109/CLOUD.2012.74

„Cloud metrology” and its application

Full stack instrumentation

Full adaptive data acquisition

Fine-grained storage

Exploratory Data Analysis

Confirmatory Data Analysis

Mystery shoppers and routine exercises

Application sensitivity model

(Platform) fault model

Performance/capacity model

Structural defenses

Dynamic defenses

[Figure sidebars: MONITORING, BENCHMARKING]

Example: characterizing VDI „CPU Ready Time”

„Ready”: VM ready to run, but not scheduled
o VDI: „stutter”

Rare events
o Sampling

Needs fine granularity, and at least a few months: very „wide” data

Result: ~QoE capacity + load

Big Data tooling

EDA: hypotheses from „visual tours” of the data

o Cloud response time ~ network delay
o Client ID ~ location

Client locations: does not scale for Big Data (yet)

Workflow? (As of now)

Classical tools: interactive EDA on samples, statistics on samples

Big Data stack (Hadoop, Storm, Cassandra, …): slow EDA on Big Data, Big Data statistics
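The „interactive EDA on samples” step presupposes drawing a uniform random sample from data too large for classical tools. Reservoir sampling, sketched below with the stdlib only, does this in a single pass without knowing the dataset size in advance (function name and parameters are illustrative):

```python
import random

def sample_for_eda(stream, k, seed=0):
    """Reservoir sampling: keep a uniform random sample of k records from
    an arbitrarily long stream, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = record
    return reservoir
```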

The effect of CPS cloud backend instability

Experimental environment

[Figure: two hosts (Host1, Host2) and two workstations; OpenStack controller (OS_contr), compute (OS_compute) and network (OS_network) nodes; Storm nimbus with supervisors superv1 and superv2; collectd monitoring and a replay component drive the application under test.]

Application topology

[Figure: a Redis spout emits <ts, city, delay> tuples to Gatherer1 and Gatherer2; they forward <city, delay> to an Aggregator; a timer spout drives a Sweeper.]
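The topology can be mimicked in plain Python to show the data flow (component names are taken from the diagram; this is a stand-in for the actual Storm spouts and bolts, not Storm API code):

```python
from collections import defaultdict

def redis_spout(records):
    """Stand-in for the Redis spout: emits <ts, city, delay> tuples."""
    yield from records

def gatherer(tuples):
    """Stand-in for the gatherer bolts: drop the timestamp,
    forward <city, delay>."""
    for ts, city, delay in tuples:
        yield city, delay

class Aggregator:
    """Stand-in for the aggregator bolt: running mean delay per city.
    In the real topology, the timer spout / sweeper would periodically
    flush this state downstream."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def accept(self, city, delay):
        self.sums[city] += delay
        self.counts[city] += 1

    def snapshot(self):
        return {c: self.sums[c] / self.counts[c] for c in self.sums}
```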

Workload

[Figure: CPU utilization and process latency over time under a baseline workload, with the start and end of the stress phase marked.]

Relationship with guest resource usage?

Correlation: 0.890
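The reported figure is a plain Pearson correlation between the two time series. The sketch below shows the computation on hypothetical per-interval samples (the numbers are illustrative, not the measured data behind the 0.890 result):

```python
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative per-interval samples (NOT the measured data from the slides):
cpu_util = [20, 25, 30, 60, 85, 90, 70, 40]
latency = [11, 12, 13, 30, 55, 60, 40, 18]
r = pearson(cpu_util, latency)  # strongly positive for these series
```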

Acknowledgements

Special thanks for the experimental environment and data go to our OpenStack Measurement „task force”:

Ágnes Salánki, Dávid Zilahi, Tamás Nádudvari, György Nádudvari, Gábor Kiss (BME) and

Gábor Urbanics (Quanopt Ltd, our spinoff)

