Cloud Dependability: How Close Are We?

Challenges and Research Issues Towards Dependable Clouds

Marc Lacoste, Thierry Coupaye

Orange Labs

International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS’11)

Grenoble, France - October 12th, 2011

© Orange Labs, Research & Development 2010

Large Magellanic Cloud (distance: 160,000 light-years). Source: NASA

?

Safe & Secure Clouds

2

The Dark Cloud: Recent Threats on Cloud Dependability

Availability:

– Major outage on Amazon EC2 storage (2011).

– DDoS attack on AWS brings Bitbucket services to a halt (2009).

Inter-VM attacks:

– Hey You! Get off My Cloud! on Amazon VMs [Ristenpart et al., CCS’09].

Hypervisor subversion:

– Virtunoid: KVM isolation breakout [Elhage, DEFCON’11].

– CloudBurst: VMware guest VM escape [Kortchinsky et al. BLACKHAT’09].

– Bluepill: rogue hypervisor beneath VMs [Rutkowska et al., BLACKHAT’06].

– SubVirt: VM-based rootkit [King et al., Security&Privacy’06].

Crimeware-as-a-Service:

– EC2 cloud used against Sony’s PlayStation Network (2011).

– Rootkits (SpyEye: 2011), botnets (ZeuS: 2009) in clouds.

Network security:

– Critical vulnerability in Eucalyptus open source cloud (2011).

Where are we on cloud dependability?

3

Agenda

A Short Walk through the Clouds

Dependability Barriers

An Autonomic Perspective on Solutions

– Self-Healing Clouds

– Self-Protecting Clouds

Open Issues and Perspectives

4

A Short Walk through the Clouds

5

Definition of Cloud Computing*

Technological vision: Cloud computing is a model for enabling on-demand network access to a shared pool of virtualized computing resources (networks, servers, storage, applications, devices/mobiles and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction (self-service model through API or web portals)

Market vision (XaaS): same + pay-per-use (or pay-as-you-go) billing models

5 characteristics

1. On Demand Self-Service

2. Broad Network Access

3. Virtualized Resource Pooling

4. Rapid Elasticity

5. Measured Service

3 delivery models (markets)

1. Cloud Software as a Service (SaaS)

2. Cloud Platform as a Service (PaaS)

3. Cloud Infrastructure as a Service (IaaS)

4 deployment models

1. Private cloud

2. Public cloud

3. Hybrid cloud

4. Community Cloud

3 pivotal technologies

1. Virtualization

2. Autonomics (automation)

3. Grid Computing (job scheduling)

* Adapted from NIST definition

6

Technologies, Visions, Usages, Markets

Technologies: server, storage, network, desktop / laptop, and device / embedded virtualization; autonomics; grid computing; P2P.

Visions: cloud computing ~ on-demand computing; utility computing; elastic computing; scalable computing; Future Internet; Internet of Services; Green IT.

Usages: data center consolidation; public / private / virtual private / hybrid clouds; compute / storage clouds; enterprise cloud; mass market clouds.

Markets: SaaS (Service, Apps); PaaS (Platform); IaaS (Infrastructure); RaaS (Resource); FaaS (Facility).

Physical resources: data center resources (servers, routers) ~ computer center, hosting facility, server farm, data farm; desktops / laptops; other devices (mobile, network equipment).

Drivers for hosters

Consolidation through virtualization and automation allows hosters to:

– Make (de)provisioning drastically easier and faster

– Maintain far larger IT infrastructures

– Reduce the risk of human errors

– Manage energy better

– … altogether, tailor the IT infrastructure to actual needs and optimize OPEX/CAPEX

security / compliance

faster deployment of new applications

cost reductions / budget constraints

IT management

IT optimization

rapid adaptation to real needs

8

Dependability Barriers

9

Barriers

Lack of technical maturity

– SLA, auto-scaling/auto-sharing, performance, availability, dependability

– PaaS: applications packaging, deployment, management, test, configuration management…

– Storage

– Network

– Security, privacy

Major risks of lock-in

– Lack of standards (API, programming models)

– Lack of interoperability

– Lack of portability (applications, management tools)

Legal issues

– Software licences

– Data location (e.g., government, health)

Integration with legacy IT (IS)

Huge investments in data centers

– Building, hardware, cooling, energy

10

The cloud raises many security issues

11

Risks are multi-faceted, both for the customer and the provider

Customer perspective

– Business & financial: abuse, nefarious use of cloud computing; business continuity; loss of control; vendor lock-in; return on investment.
– Data protection: data protection (at rest, in transit): access, integrity, leakage, loss, destruction...; migration to the cloud, reversibility.
– Legal: contractual terms with third parties; data localization; responsibility sharing.
– Technical: service hijacking; black-box environment; insecure APIs; integration with IT infrastructure; malicious insiders; availability; shared environment.

Provider perspective

– Business & financial: third-party cloud termination; transparency with customers; auditability; vendor lock-in; return on investment; billing data.
– Data protection: unsafe customer isolation; encryption key loss.
– Legal: licensing risks; data localization; jurisdictional heterogeneity; subpoena, e-discovery.
– Technical: service hijacking; compromise of management interface; malicious insiders; resource exhaustion; customer identities.

12

Cloud computing security: 10 major roadblocks ahead

End-point security: ensure security of virtualized computing infrastructures (important to critical).
– Hypervisor security

Network security: ensure security of virtualized network infrastructures (incremental).
– Network isolation
– Elastic security

Data protection: ensure data security/privacy in a shared context (important to critical).
– Identity management
– Privacy
– Data traceability
– Legal & regulatory issues

Trust enablers: prove to users that clouds are trustworthy (critical).
– Transparency & compliance
– Openness
– End-to-end security

Critical: The roadblock is essential to lift for adoption. Almost no known solutions.

Important: Lifting the roadblock is a major step forward. Some first solutions.

Incremental: The roadblock may be overcome by small enhancement of already existing technologies.

13

End-point security

Guarantee security when computing resources are virtualized.

Barrier #1: Hypervisor security

Virtualization brings many threats. Safe in theory. But…

– Hyperjacking, misconfigurations, malicious device drivers, backdoors between VM and hardware…

VMs also have their own vulnerabilities. Those threats may be mitigated by:

– Hardened images.
– Strict VM security life-cycle management.
– Security-by-default configurations.

14

Network security

Guarantee security when network resources are virtualized.

Barrier #2: Network isolation

Isolation is no longer physical but logical.

Isolation is less precise. Security guarantees are weaker.

Are traditional security architectures still effective in a virtual environment?

Challenge: map existing network security components to new cloud architectures.

Risks are similar to those of known networks. Known counter-measures are thus still applicable:

VPNs, VLANs, firewalls, IDS/IPS, encryption, signature…

15

Network security

Guarantee security when network resources are virtualized.

Barrier #3: Elastic security

Flexible security provisioning to match fast-evolving risks.

Automated security management is still lacking.

Research: autonomic, self-protecting security architectures.

Operational: early warning systems, integrated proactive security architectures.

First solutions:

– Flexible management of VPNs.
– Overlays in full network virtualization.
– SIEMs.

16

Data protection

Guarantee data security in a shared multi-tenant environment.

Barrier #4: Identity

Lack of end-to-end identity management (in and across data centers).

Barriers: scalability, heterogeneity, interoperability.

Authentication: should be overcome.

Authorization: in its infancy.

A Security-as-a-Service opportunity.

Barrier #5: Privacy

Strong isolation throughout the life-cycle of personal information.

Many tough questions:

Need-to-know enforcement, secure data storage, data retention and destruction, legal implications…

Today’s PETs are not enough!

17

Data protection

Guarantee data security in a shared multi-tenant environment.

Barrier #6: Traceability

Being unable to locate the data adds to the loss of control over IT resources.

Legal, political, trust issues: compliance, data hosted abroad exposed to foreign governments. Inability to prove that data comes from a trusted source.

A widely uncharted area.

Barrier #7: Legal issues

Multiple conflicting jurisdictions over the cloud data flows.

Cloud providers: have trouble providing assurance of compliance with regulations.

Customers: have trouble understanding the rights and obligations of each party.

Importance of security SLAs.

18

Trust enablers

Prove to third parties that the cloud infrastructure is trustworthy.

Barrier #8: Transparency

Prove security hygiene of provider infrastructure to third parties.

Auditability, certification process, risk analysis methodologies, compliance.

Trusted cloud computing technologies provide cryptographic evidence.

Clear-cut SLAs to clarify responsibilities.

Barrier #9: Openness

Avoid lock-in, interoperate with other cloud infrastructures.

Issues: API portability across providers, interoperability, scalability.

Enabler of the « inter-cloud » vision (« sky computing »).

Flexibility and security benefits of open source cloud architectures.

Source: Gartner. Analyzing the Risk Dimensions of Cloud and SaaS Computing, 2010.

19

Trust enablers

Prove to third parties that the cloud infrastructure is trustworthy.

Barrier #10: End-to-end security

Seamless orchestration of multiple fragmented security building blocks.

Definition of a cloud reference security architecture providing a consistent overall view of cloud security.

Importance of standardization.

Source: Cloud Security Alliance. Security Guidance for Critical Areas of Focus in Cloud Computing, 2009.

20

Cloud Security Today

Security = #1 barrier for cloud computing adoption

To develop trust of cloud customers, cloud security should be…

Multiple dangers

No inside / outside frontier

Fast evolving threats

Multiple security goals

Integrated security management

Easy administration

Good performance

Cost-effective security

21

Cloud Security Today

… but unfortunately today cloud security is:

Not strong

Not flexible

Not efficient

Not simple

Hard-to-detect vulnerabilities

Hard-to-place and synchronize countermeasures

Static configurations, low reactivity

No generic security supervision architecture

Complex, heterogeneous security mechanisms

Manual administration nightmare

Costly, error-prone management of security

Traditional approaches to protection are not enough!

22

Towards Solutions: Self-Healing, Self-Protecting Clouds

23

Autonomic Cloud Dependability Management

An automated vision of cloud dependability:

self-healing, self-defending clouds

Approach (for security):

Benefits:

– Simple, strong, flexible, autonomous cloud safety for its customers.

– Lighter administration, increased reactivity and agility, lower operating costs.

– Graduated response to threats / to faults.

– Enabler for integrated supervision.

24

Resilience and Fault-Tolerance in Cloud Computing

What we do need for cloud resilience is

– Failure statistics from operational cloud environments

– Fault models

– Fault tolerance algorithms and mechanisms: replication, backup, disaster recovery, self-repair…

… that deal with cloud specificities

– Open, dynamic, virtualized environments: VMs can move between physical servers, so latency and hardware reliability change as well

– Scalability and multi-tenancy: thousands of applications (each generally made of multiple VMs) hosted on a single platform, resources actually shared by multiple applications/users

– Layered infrastructures (SaaS/PaaS/IaaS) with multiple administrative roles/domains, with limited control offered to application developers

– Pay-per-use: you have to pay for the resources you use (VMs)!

25

Where are we on cloud resilience ?

A huge background on distributed systems dependability, including in grid and autonomic computing (self-repair), but few works specifically targeting cloud reliability

– e.g. 0 paper in IEEE Cloud 2009, 1 paper in IEEE Cloud 2010, 1 paper in IEEE Cloud 2011

Some illustrative work-in-progress

– Server failures characterization with objective of characterizing the complete data center hardware (server, storage, network) reliability model [Microsoft Research, ACM Symposium on Cloud Computing’2010]

– 3-replicas ring with scatter placement algorithm of backups of cloud management nodes [U. Maryland, IBM Research, IEEE IPDPS’2009]

– Crash and timing fault tolerance (not Byzantine faults) through strong replica consistency of applicative nodes (semi-active and semi-passive replication) [U. Cleveland, UC Santa Barbara, IEEE Cloud’2010]

– Byzantine fault tolerance through 3f+1 active replication in community (“voluntary”) clouds [U. Hong Kong, NUDT, IEEE Cloud’2011]

– Adaptive fault tolerance (forward and backward recovery) in real-time clouds through continuous runtime reliability assessment of nodes and replication with variants (VMs) [INRIA, IEEE World Congress on Services’2011]
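The replica counts behind these works follow classical bounds: tolerating f Byzantine (arbitrary) faults needs at least 3f+1 replicas, while crash faults under majority-quorum agreement need 2f+1. The sketch below only illustrates this sizing arithmetic; the function names are illustrative and do not come from the cited papers. In a pay-per-use cloud each extra replica is a billed VM, so the bound translates directly into cost.

```python
def min_replicas(f: int, byzantine: bool) -> int:
    """Minimum number of replicas needed to tolerate f faulty nodes.

    3f + 1 for Byzantine (arbitrary) faults, 2f + 1 for crash faults
    when agreement relies on majority quorums.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1 if byzantine else 2 * f + 1


def max_tolerated_faults(n: int, byzantine: bool) -> int:
    """Largest f that n replicas can tolerate under the same bounds."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 3 if byzantine else (n - 1) // 2
```

For example, four replicas tolerate one Byzantine node, whereas three replicas tolerate one crash but no Byzantine node.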

26

Virtual Appliance Management Platform (VAMP)

VAMP is a PaaS Application Lifecycle Platform (ALM) under development inside Orange Labs.

VAMP targets the construction (e.g. VM image generation), deployment and management of distributed applications in the cloud.

VAMP is based on an architectural description of applications (component-based, using the Fractal component model) that is used throughout the complete application lifecycle (including ‘model@runtime’).

Control Plane (VAMP)

Applicative Plane (legacy application)

Reification and control

Components

27

VAMP Runtime Architecture

[Figure: VAMP runtime architecture, showing the VAMP manager, a deployment manager and configurator agents in the control plane (VAMP), the applicative plane (legacy application), VMs VM0 to VM2, and a Message Oriented Middleware (MOM).]

VAMP manager: creates and repairs Deployment Managers. The VAMP manager may be inside or outside the IaaS.

Deployment Manager (one per application): creates applicative VMs and recreates VMs when they fail.

Configurator Agents: self-configure the application.

28

Reliable Self-configuration in VAMP (work in progress)

Self-configuration protocol

– At VM creation, each component C (managed by a Configurator inside a VM):
  – Announces itself to the configurator network;
  – Configures locally the applicative elements;
  – Exports its server interfaces to its “client components”, which can then bind to C (‘bind’ signals);
  – Once all C client interfaces are bound, C starts and notifies its start to its “client components” (‘start’ signals).

– Gradually (“epidemically”), the complete application is started.

– When a VM failure is detected, the faulty VM is replaced by a new VM, which is simply introduced in the self-configuration protocol as if it were in its initial deployment phase (except that ‘start’ signals are interpreted as ‘restart’ by already running components).
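The protocol steps above can be simulated in a few lines. This is a minimal sketch under stated assumptions, not VAMP code: the class names `Component` and `ConfiguratorNetwork`, the in-memory message queue standing in for the MOM, and the rule that a second ‘start’ from the same server means ‘restart’ are all illustrative choices. A component starts as soon as all the server interfaces it requires are bound, as stated on the slide.

```python
from collections import deque


class Component:
    """One applicative component, managed by a configurator inside its VM."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = frozenset(requires)  # servers whose interfaces it binds to
        self.bound = set()                   # servers bound so far
        self.seen_start = set()              # servers that already sent 'start'
        self.started = False


class ConfiguratorNetwork:
    def __init__(self):
        self.components = {}   # announced components, by name
        self.queue = deque()   # asynchronous messages: (kind, sender, receiver)
        self.log = []

    def deploy(self, component):
        """Create (or replace) a component and let it announce itself."""
        self.components[component.name] = component
        # The newcomer exports its server interfaces to its known clients...
        for other in self.components.values():
            if component.name in other.requires:
                self.queue.append(("bind", component.name, other.name))
        # ...and already announced servers export theirs to the newcomer.
        for server in component.requires:
            if server in self.components:
                self.queue.append(("bind", server, component.name))
        self._maybe_start(component)

    def _maybe_start(self, c):
        if not c.started and c.requires <= c.bound:
            c.started = True
            self.log.append(f"start {c.name}")
            for other in self.components.values():
                if c.name in other.requires:
                    self.queue.append(("start", c.name, other.name))

    def run(self):
        """Deliver queued messages until the system is quiescent."""
        while self.queue:
            kind, sender, receiver = self.queue.popleft()
            target = self.components[receiver]
            if kind == "bind":
                target.bound.add(sender)
                self._maybe_start(target)
            elif sender in target.seen_start:
                # A repeated 'start' from the same server is read as 'restart'.
                self.log.append(f"{receiver} rebinds to restarted {sender}")
            else:
                target.seen_start.add(sender)


net = ConfiguratorNetwork()
for comp in (Component("db"), Component("app", ["db"]), Component("web", ["app"])):
    net.deploy(comp)
net.run()
initial_log = list(net.log)

# Repair: the failed 'app' VM is replaced and simply redeployed.
net.deploy(Component("app", ["db"]))
net.run()
repair_log = net.log[len(initial_log):]
```

Repair reuses the deployment path unchanged, which is why the slides describe a running application as being in continuous deployment.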

29

Synthesis on VAMP

A decentralized approach based on asynchronous and reliable communications.

– Each VM evolves in parallel at its own pace.

– No need for global synchronization between VMs.

Replication (primary-backup scheme, passive replication) of DMs completes the self-configuration protocol for complete application reliability.

The approach works for VM crashes, transient network failures, and for stateless components.

The self-configuration protocol is also a (basic) self-repair protocol: a running application is seen as being in a continuous deployment process.

30

Attack targets and countermeasures

[Figure: attack targets in a virtualized host: VMs, VNIC, vSwitch, hypervisor, and hardware (CPU, memory, network, PNIC, storage).]

MAC spoofing/snooping: Static address allocation.

IP attacks: Virtual firewalling.

VLAN hopping: Physical traffic segregation.

Hyperspacing: MMU, IOMMU.

Hyperthreading: No hyperthreading.

Buffer overflows: No eXecute bit.

Hyperjacking:

High attack surface: Certification, open, modular solutions.

Hypervisor integrity: TPM

Secrecy violations: Authentication, signature.

Integrity violations: Encryption of stored data (self-encrypting drives).

DoS (resource starvation): Memory overcommit, optimized page sharing, balloon drivers… Quotas, priorities.

31

Detection: VM Introspection

Benefits:

– Monitor VM behavior, detect unusual behaviors.
– Rely on VMM to measure VM integrity.
– In-VM monitoring: proximity to monitored target.
– In hypervisor: users cannot kill the monitoring component.
– In management VM: performance improvement (offload philosophy).

Issues:

– Few remediation actions (simple actions, e.g., restart, kill VM).
– What happens if the VMM is compromised?
– In-context measurement is difficult (real-time monitoring).
– Semantic gap problem.
– May require adding hooks in the VM.
– Stealth is difficult to achieve.
– In-VM monitoring: possibility of corruption by malware. How to protect the monitoring component?
– In hypervisor: may require modifying the VMM.
– In management VM: less reactive?

Protection target: VM

Systems:

– With hooks in VM: Lares, XenAccess, KVMSec
– With no hooks in VM: CloudSec
– In-VM monitoring code: SIM
– Offline: LiveWire
– Miscellaneous: Re-virt (log), Ether, HIMA

[Figure: monitored VM, management VM, and hypervisor, with three candidate placements of the monitoring agent (1 to 3) and the corresponding hooks.]
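The idea of measuring VM integrity from outside the guest can be illustrated with a simple baseline comparison. This is a hedged sketch, not the mechanism of any system listed above: guest memory is modeled as a plain dictionary of named byte regions (a real monitor would obtain them through a hypervisor interface), and the region names are hypothetical. The monitor sees only raw bytes, which is a small-scale picture of the semantic gap.

```python
import hashlib


def fingerprint(region: bytes) -> str:
    """Digest of one raw memory region as seen from outside the guest."""
    return hashlib.sha256(region).hexdigest()


def take_baseline(guest_memory: dict) -> dict:
    """Record a reference digest for each monitored region."""
    return {name: fingerprint(data) for name, data in guest_memory.items()}


def check(guest_memory: dict, baseline: dict) -> list:
    """Return the names of regions that are missing or whose content changed."""
    return sorted(
        name
        for name, digest in baseline.items()
        if name not in guest_memory or fingerprint(guest_memory[name]) != digest
    )


# Hypothetical guest: two regions the monitor cares about.
guest = {"syscall_table": b"\x01\x02\x03\x04", "kernel_text": b"\x90" * 16}
baseline = take_baseline(guest)
clean_report = check(guest, baseline)

# A rootkit-style modification of one region is detected on the next check.
guest["syscall_table"] = b"\x01\x02\xff\x04"
tampered_report = check(guest, baseline)
```

Detection here only yields a list of suspicious regions; the reaction (restart, kill VM) remains a separate and coarse decision.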

32

Detection: Trusted Computing

Benefits:

– High assurance, strong isolation.
– Flexibility: supports different security requirements / policies.
– Efficient.
– Attestation capabilities.

Issues:

– Protection of the hypervisor remains undefined, e.g., malicious host OS drivers.
– Problem of a software-only approach: e.g., how to define a dynamic root of trust. May be lifted if a TPM is present.
– Compatibility with legacy systems?
– Dynamic monitoring is difficult: in-context measurement.

Protection target: VM

Systems:

– Trusted VMM: Terra + TPM
– In management VM: vTPM
– Certification: seL4

[Figure: monitored VM (e.g., for integrity), management VM, hypervisor, and host OS drivers, with two candidate placements of the monitoring agent (1 and 2) and a hook.]
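Attestation rests on a chain of measurements accumulated with an extend operation: the new register value is a hash of the old value concatenated with the measurement of the next component. The sketch below reproduces that arithmetic in software only, with SHA-256 and hypothetical component contents. It does not call a TPM and is not an API of any system named above.

```python
import hashlib

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for a SHA-256 register


def extend(pcr: bytes, component: bytes) -> bytes:
    """Return the new register value: H(old_value || H(component))."""
    measurement = hashlib.sha256(component).digest()
    return hashlib.sha256(pcr + measurement).digest()


def measure_chain(components) -> bytes:
    """Fold a boot sequence into a single register value."""
    pcr = bytes(PCR_SIZE)  # the register starts zeroed
    for blob in components:
        pcr = extend(pcr, blob)
    return pcr


# Hypothetical boot sequence: contents stand in for real binaries.
expected = measure_chain([b"bootloader", b"hypervisor", b"vm-image"])
tampered = measure_chain([b"bootloader", b"hypervisor+rootkit", b"vm-image"])
reordered = measure_chain([b"hypervisor", b"bootloader", b"vm-image"])
```

Because each value depends on all previous ones, a verifier comparing the final value against `expected` detects both a modified component and a changed load order.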

33

Reaction: Sandboxing

Benefits:

– Good performance (by placing performance-critical code in kernel or using hardware virtualization).
– Strong security against malicious driver behavior in hypervisor (avoids DMA attacks).
– Flexible isolation policies. Flexible frontier between u-driver and k-driver (Micro-Drivers).
– Reduced code size: executing part of the drivers in user mode may reduce the amount of kernel code.

Issues:

– Difficult to protect the RM against attacks if malicious code is in the hypervisor (if no hardware virtualization is used).
– Security policy in the RM is difficult to configure by hand (benefits of the autonomic approach).
– No remediation. Limited to fault / attack containment.
– Requires modification of the kernel.

Protection target: Hypervisor

Systems:

– RM between driver and kernel: Software Fault Isolation (SFI) techniques
– RM between driver and user space: MicroDrivers, Proxos
– RM between driver and device: Nooks

[Figure: VMs, management VM, host OS userland, hypervisor / host OS kernel, driver, and device, with three candidate placements of the Reference Monitor (RM), numbered 1 to 3.]
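A reference monitor mediates every request a driver makes and refuses those outside its policy. The following sketch shows that mediation in its simplest form; the `Request` fields, the operation names, and the address-range policy format are assumptions made for illustration, not the design of SFI, MicroDrivers, Proxos, or Nooks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    driver: str
    operation: str  # hypothetical operation name, e.g. "dma_write"
    address: int


@dataclass
class ReferenceMonitor:
    # driver name -> list of (operation, first_address, last_address) allowed
    policy: dict
    denied: list = field(default_factory=list)

    def allows(self, req: Request) -> bool:
        """Grant the request only if a policy entry covers it."""
        for op, lo, hi in self.policy.get(req.driver, ()):
            if op == req.operation and lo <= req.address <= hi:
                return True
        self.denied.append(req)  # containment only: record and refuse
        return False


rm = ReferenceMonitor(policy={"nic": [("dma_write", 0x1000, 0x1FFF)]})
inside = rm.allows(Request("nic", "dma_write", 0x1800))
outside = rm.allows(Request("nic", "dma_write", 0x9000))
unknown = rm.allows(Request("usb", "dma_write", 0x1800))
```

The monitor contains the fault but does nothing to repair the driver, which matches the "no remediation" issue listed above. Writing such policies by hand for every driver is the configuration burden the autonomic approach aims to remove.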

34

Reaction: Virtualization

Benefits:

– Strong security and isolation (separation from protection target, i.e., one layer below).
– Stealthy (ring -1 technique, for VM protection).
– No driver code modification. Compatibility with legacy code.
– Flexibility: open hypervisor APIs to customize memory management, execution control.
– Performance overhead may be reduced by hardware virtualization.

Issues:

– How to protect the hypervisor layer? Nested virtualization, but with an enormous performance penalty.
– Hypervisor protection mechanisms remain very limited (e.g., memory hiding in VASP/HBSP).
– Heavy performance penalty.

Protection target: VM / Hypervisor

Systems:

– VM protection (guest OS I/O mediation): BitVisor
– VM protection (guest OS isolation): HBSP/VASP
– VM/hypervisor protection (driver virtualization): iKernel (full), LeVasseur et al. (partial).
– Hypervisor protection (nested virtualization): Turtles

[Figure: VMs, management VM, hypervisor / host OS, and hyper-hypervisor, with drivers and stubs placed in different VMs.]

35

Virtual Security Appliances

Design issues:

– Which network security architecture? (physical, virtual, hybrid…)

– Which software layer? (vSwitch, hypervisor, VM, multi-layer…)

Architecture #1: Physical

Architecture #2: Virtual

Architecture #3: Hybrid

36

Autonomic Management of IaaS Infrastructure

Results:

– First framework for building self-protecting IaaS infrastructure.

– Prototype running on KVM hypervisor – extensible to others (VMware).

Flexible confinement of VMs according to risk level
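Confinement according to risk level can be pictured as a graduated response table. The sketch below is only an illustration of that idea: the three risk levels, the alert-count thresholds, and the measure names are hypothetical and are not taken from the framework or its KVM prototype.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Hypothetical confinement measures, cumulative as the risk grows.
CONFINEMENT = {
    Risk.LOW: ("monitor",),
    Risk.MEDIUM: ("monitor", "tighten_firewall"),
    Risk.HIGH: ("monitor", "tighten_firewall", "quarantine_vlan"),
}


def assess(alerts: int, medium_at: int = 3, high_at: int = 10) -> Risk:
    """Map a count of recent security alerts to a risk level (assumed thresholds)."""
    if alerts >= high_at:
        return Risk.HIGH
    if alerts >= medium_at:
        return Risk.MEDIUM
    return Risk.LOW


def confine(vm_alerts: dict) -> dict:
    """Return the confinement measures to apply to each VM."""
    return {vm: CONFINEMENT[assess(count)] for vm, count in vm_alerts.items()}


plan = confine({"vm0": 0, "vm1": 4, "vm2": 12})
```

Here only `vm2` would be quarantined, while `vm1` keeps running behind tighter filtering.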

37

Extension to Multiple Loops

Autonomic security management of IaaS infrastructure.

2 autonomic loops: network-level and device-level.

Difficult (impossible?) administration by hand.

Orchestration patterns to compose autonomic loops:

– Between views.
– Between layers.

Pattern types: centralized, hierarchical, P2P...

[Figure: self-protection loops.]

38

Orchestration of Multiple Loops

Framework overview:

A hierarchical + layered cloud security self-management model.

2 autonomic loops: network-level and device-level.

Difficult (impossible?) administration by hand.

Interference between loops:

– Physical layer: node isolation in a VLAN (network quarantine).
– Security appliance: update attack signature database.
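One way to picture the hierarchical pattern is a coordinator that collects the reactions proposed by the network-level and device-level loops and executes only one of them, so that the loops do not interfere. The sketch below assumes that arbitration rule, along with the loop thresholds, the numeric threat scores, and the priority ranking; none of these come from the framework itself. The two action names echo the reactions listed on this slide.

```python
class Loop:
    """A self-protection loop reduced to a threshold decision."""

    def __init__(self, name, threshold, action):
        self.name = name
        self.threshold = threshold
        self.action = action

    def decide(self, observation: float):
        """Propose the loop's action if the observed threat score is high enough."""
        return self.action if observation >= self.threshold else None


class Orchestrator:
    """Hierarchical pattern: one coordinator arbitrates between loops."""

    def __init__(self, loops, priority):
        self.loops = loops
        self.priority = priority  # action name -> rank, higher wins

    def step(self, observations: dict):
        proposals = []
        for loop in self.loops:
            action = loop.decide(observations.get(loop.name, 0.0))
            if action is not None:
                proposals.append(action)
        if not proposals:
            return None
        # Execute only the strongest reaction to avoid conflicting actions.
        return max(proposals, key=lambda a: self.priority[a])


orchestrator = Orchestrator(
    loops=[
        Loop("device", threshold=0.3, action="update_signatures"),
        Loop("network", threshold=0.7, action="quarantine_vlan"),
    ],
    priority={"update_signatures": 1, "quarantine_vlan": 2},
)
both = orchestrator.step({"device": 0.5, "network": 0.9})
device_only = orchestrator.step({"device": 0.5, "network": 0.1})
quiet = orchestrator.step({})
```

A P2P pattern would drop the coordinator and let loops negotiate directly, at the price of a harder stability argument.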

39

Open Issues and Perspectives

40

Open Issues: Detection

Place of detection mechanisms: which monitoring granularity and scope?

– In which layer? hypervisor (VM introspection), VM (self-defending VMs), hybrid.

– Where in the cloud architecture? host-based, network-based, hybrid.

– Coordination of detection mechanisms.

– Should the monitoring system be co-located with the protection target?

Semantic gap:

– Correlation of monitored information across layers.

– Possible approach: ontologies and Semantic Web.

Monitoring intrusiveness:

– Performance impact?

– Stealth of detection mechanisms?

Forensics: real-time detection vs. post-mortem analysis?

Malware detection: a new field in the cloud not yet understood.

41

Open Issues: Reaction

Place of reaction mechanisms (as for detection).

Adaptable reaction mechanisms:

– Defense configuration remains static / complex, e.g., for network security.

– Emergence of cloud networking. Issues: real-time virtual network reconfiguration; on-demand chaining of security services; security of network management interface.

Influence of VM migration: long distance migration (across data centers, across clouds). Migration of the security state consistently, securely, and efficiently (e.g., IP address space)?

Remediation mechanisms: Currently limited to threat containment, with no elimination.

Hypervisor protection: still little addressed.

– Isolation of device drivers: rich literature in the system community (virtualization, sandboxing, language techniques, new kernel architectures...) is applicable.

– Assurance: control flow integrity, attestation, compliance proofs, trusted computing.

42

Open Issues: Decision

Multi-lateral security policy management:

– Policy definition: despite advances, current policies are not sufficiently flexible for the cloud and the inter-cloud setting. Need for policies spanning organization boundaries, geographical borders, applicable in multiple contexts, with multiple actors.

– Automated policy aggregation and deployment.

Security adaptation strategy:

– Flexible policy negotiation: Some first frameworks (e.g., XENA), but a lot still to be done. Issues: interoperability, multiple conflicting stakeholder responsibilities, multiple jurisdictions. Promising solution: ontologies.

– Strategy representation: notion of policy continuum, DSLs.

Coordination of multiple self-protection loops: Stability of the result?

– Event correlation from monitoring components? Decision synchronization between reaction components?

– Authentication of communications: detection / decision; decision / reaction.

– Lack of security supervision architecture: for the cloud, and the inter-cloud setting.

Learning from past attacks to improve security and build defenses against future threats.

43

Open Issues: Resilience

Main roadblocks toward cloud resilience:

– Cloud layered architecture with different responsibilities: coordination and consistency of mechanisms offered by platforms versus mechanisms programmed by applications?

– Reliability of cloud management services themselves.

– Applicative models (stateless/stateful, strong/weak coupling) and fault-tolerance?

Self-stabilization and cloud resilience?

– Applicability of self-stabilization in the cloud?

– At the hypervisor level? At VM and applicative level?

– Openness of cloud systems?

– Cost (w.r.t. performance) of self-stabilization in cloud?

44

Conclusion

In the cloud, security is clearly not an option:

– Migrating critical data and applications to the cloud is and remains risky.

– Check (and stick to) best practices!

The main challenge of cloud dependability remains trust:

– A lot of building blocks already there, but still a long way to go.

– Importance of SLAs to clarify responsibilities between customer and provider.

– Self-healing, self-protecting cloud architectures can help, both in specification and enforcement of such SLAs.

– Strong security, reliability, transparency, and accountability will be the key enablers to maintain solid and durable trust between a CSP and its customers.