Cloud Dependability: How Close Are We?
Challenges and Research Issues Towards Dependable Clouds
Marc Lacoste, Thierry Coupaye
Orange Labs
International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS’11)
Grenoble, France - October 12th, 2011
© Orange Labs, Research & Development 2010
Large Magellanic Cloud (distance: 160,000 light-years). Source: NASA
?
Safe & Secure Clouds
The Dark Cloud: Recent Threats on Cloud Dependability
Availability:
– Major outage on Amazon EC2 storage (2011).
– DDoS attack on AWS brings Bitbucket services to a halt (2009).
Inter-VM attacks:
– Hey You! Get off My Cloud! on Amazon VMs [Ristenpart et al., CCS’09].
Hypervisor subversion:
– Virtunoid: KVM isolation breakout [Elhage, DEFCON’11].
– CloudBurst: VMware guest VM escape [Kortchinsky et al. BLACKHAT’09].
– Bluepill: rogue hypervisor beneath VMs [Rutkowska et al., BLACKHAT’06].
– SubVirt: VM-based rootkit [King et al., Security&Privacy’06].
Crimeware-as-a-Service:
– EC2 cloud used against Sony’s PlayStation Network (2011).
– Rootkits (SpyEye: 2011), botnets (ZeuS: 2009) in clouds.
Network security:
– Critical vulnerability in Eucalyptus open source cloud (2011).
Where are we on cloud dependability?
Agenda
A Short Walk through the Clouds
Dependability Barriers
An Autonomic Perspective on Solutions
– Self-Healing Clouds
– Self-Protecting Clouds
Open Issues and Perspectives
Definition of Cloud Computing*
Technological vision: Cloud computing is a model for enabling on-demand network access to a shared pool of virtualized computing resources (networks, servers, storage, applications, devices/mobiles and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction (self-service model through API or web portals)
Market vision (XaaS): same + pay-per-use (or pay-as-you-go) billing models
5 characteristics
1. On Demand Self-Service
2. Broad Network Access
3. Virtualized Resource Pooling
4. Rapid Elasticity
5. Measured Service
3 delivery models
(markets)
1. Cloud Software as a Service (SaaS)
2. Cloud Platform as a Service (PaaS)
3. Cloud Infrastructure as a Service (IaaS)
4 deployment models
1. Private cloud
2. Public cloud
3. Hybrid cloud
4. Community Cloud
3 pivotal technologies
1. Virtualization
2. Autonomics (automation)
3. Grid Computing (job scheduling)
* Adapted from NIST definition
Technologies, Visions, Usages, Markets
Server Virtualization, Storage Virtualization, Network Virtualization
Desktop / Laptop Virtualization, Device / Embedded Virtualization
Autonomics
Grid Computing, P2P
Cloud Computing ~ On-demand Computing
Utility Computing, Elastic Computing
Scalable Computing
Future Internet, Internet of Services
Green IT
Enterprise Cloud, Mass market Clouds
Technologies
Data Center Consolidation
Public / Private / Virtual Private / Hybrid Clouds
Compute / Storage Clouds
Data Center Resource (servers, routers) ~ Computer Center Hosting Facility, Server Farm, Data Farm
Desktops / Laptops
Other Devices (mobile, network equipment)
Visions
Usages
Markets
Physical Resources
SaaS (Service, Apps)
PaaS (Platform), IaaS (Infrastructure), RaaS (Resource)
FaaS (Facility)
Drivers for hosters
Consolidation through virtualization and automation allows hosters to:
– Drastically ease and speed up (un)provisioning
– Maintain a far larger IT infrastructure
– Reduce the risk of human error
– Manage energy better
– … altogether, tailor the IT infrastructure to actual needs and optimize OPEX/CAPEX
security / compliance
faster deployment of new applications
cost reductions / budget constraints
IT management
IT optimization
rapid adaptation to real needs
Barriers
Lack of technical maturity
– SLA, auto-scaling/auto-sharing, performance, availability, dependability
– PaaS: applications packaging, deployment, management, test, configuration management…
– Storage
– Network
– Security, privacy
Major risks of lock-in
– Lack of standards (API, programming models)
– Lack of interoperability
– Lack of portability (applications, management tools)
Legal issues
– Software licences
– Data location (e.g., government, health)
Integration with legacy IT (IS)
Huge investments in data centers
– Building, hardware, cooling, energy
Risks are multi-faceted, both for the customer and the provider
Customer Perspective
Business & Financial:
– Abuse, nefarious use of cloud computing
– Business continuity
– Loss of control
– Vendor lock-in
– Return on investment
Data Protection:
– Data protection (at rest, in transit): access, integrity, leakage, loss, destruction...
– Migration to the cloud, reversibility
Legal:
– Contractual terms with third parties
– Data localization
– Responsibility sharing
Technical:
– Service hijacking
– Black-box environment
– Insecure APIs
– Integration with IT infrastructure
– Malicious insiders
– Availability
– Shared environment
Provider Perspective
Business & Financial:
– Third-party cloud termination
– Transparency with customers
– Auditability
– Vendor lock-in
– Return on investment
– Billing data
Data Protection:
– Unsafe customer isolation
– Encryption key loss
Legal:
– Licensing risks
– Data localization
– Jurisdictional heterogeneity
– Subpoena, e-discovery
Technical:
– Service hijacking
– Compromise of management interface
– Malicious insiders
– Resource exhaustion
– Customer identities
Cloud computing security: 10 major roadblocks ahead
End-point Security: Ensure security of virtualized computing infrastructures. Important to critical.
– Hypervisor security
Network Security: Ensure security of virtualized network infrastructures. Incremental.
– Network isolation
– Elastic security
Data Protection: Ensure data security/privacy in a shared context. Important to critical.
– Identity management
– Privacy
– Data traceability
– Legal & regulatory issues
Trust Enablers: Prove to users that clouds are trustworthy. Critical.
– Transparency & compliance
– Openness
– End-to-end security
Critical: The roadblock is essential to lift for adoption. Almost no known solutions.
Important: Lifting the roadblock is a major step forward. Some first solutions.
Incremental: The roadblock may be overcome by small enhancement of already existing technologies.
End-point security: Guarantee security when computing resources are virtualized.
Barrier #1: Hypervisor security
Virtualization brings many threats.
Safe in theory.
But…
Hyperjacking, misconfigurations, malicious device drivers, backdoors between VM and hardware…
VMs also have their own vulnerabilities.
Those threats may be mitigated by:
– Hardened images.
– Strict VM security life-cycle management.
– Security-by-default configurations.
Network security: Guarantee security when network resources are virtualized.
Barrier #2: Network isolation
Isolation is no longer physical but logical.
Isolation is less precise. Security guarantees are weaker.
Challenge: map existing network security components to new cloud architectures.
Are traditional security architectures still effective in a virtual environment?
Risks are similar to those of known networks.
Known counter-measures are thus still applicable:
VPNs, VLANs, firewalls, IDS/IPS, encryption, signature…
Network security: Guarantee security when network resources are virtualized.
Barrier #3: Elastic security
Flexible security provisioning to match fast-evolving risks.
Automated security management is still lacking.
Research: autonomic, self-protecting security architectures.
Operational: early warning systems, integrated proactive security architectures.
First solutions:
– Flexible management of VPNs.
– Overlays in full network virtualization.
– SIEMs.
Data protection: Guarantee data security in a shared multi-tenant environment.
Barrier #4: Identity
Lack of end-to-end identity management (in and across data centers).
Barriers: scalability, heterogeneity, interoperability.
Authentication: should be overcome.
Authorization: in its infancy.
A Security-as-a-Service opportunity.
Barrier #5: Privacy
Strong isolation throughout the life-cycle of personal information.
Many tough questions:
Need-to-know enforcement, secure data storage, data retention and destruction, legal implications…
Today’s PETs are not enough!
Data protection: Guarantee data security in a shared multi-tenant environment.
Barrier #6: Traceability
Being unable to locate the data adds to the loss of control over IT resources.
Legal, political, trust issues: compliance, data hosted abroad exposed to foreign governments. Inability to prove that data comes from a trusted source.
A widely uncharted area.
Barrier #7: Legal issues
Multiple conflicting jurisdictions over the cloud data flows.
Cloud providers: have trouble providing assurance of compliance with regulations.
Customers: have trouble understanding the rights and obligations of each party.
Importance of security SLAs.
Trust enablers: Prove to third parties that the cloud infrastructure is trustworthy.
Source: Gartner. Analyzing the Risk Dimensions of Cloud and SaaS Computing, 2010.
Barrier #8: Transparency
Prove security hygiene of provider infrastructure to third parties.
Auditability, certification process, risk analysis methodologies, compliance.
Trusted cloud computing technologies provide cryptographic evidence.
Clear-cut SLAs to clarify responsibilities.
Barrier #9: Openness
Avoid lock-in, interoperate with other cloud infrastructures.
Issues: API portability across providers, interoperability, scalability.
Enabler of the « inter-cloud » vision (« sky computing »).
Flexibility and security benefits of open source cloud architectures.
Source: Cloud Security Alliance. Security Guidance for Critical Areas of Focus in Cloud Computing, 2009.
Trust enablers: Prove to third parties that the cloud infrastructure is trustworthy.
Barrier #10: End-to-end security
Seamless orchestration of multiple fragmented security building blocks.
Definition of a cloud reference security architecture providing a consistent overall view of cloud security.
Importance of standardization.
Cloud Security Today
Security = #1 barrier for cloud computing adoption
To develop trust of cloud customers, cloud security should be…
Multiple dangers
No inside / outside frontier
Fast evolving threats
Multiple security goals
Integrated security management
Easy administration
Good performance
Cost-effective security
Cloud Security Today
… but unfortunately today cloud security is:
Not strong
Not flexible
Not efficient
Not simple
Hard-to-detect vulnerabilities
Hard-to-place and synchronize countermeasures
Static configurations, low reactivity
No generic security supervision architecture
Complex, heterogeneous security mechanisms
Manual administration nightmare
Costly, error-prone management of security
Traditional approaches to protection are not enough!
Autonomic Cloud Dependability Management
An automated vision of cloud dependability:
self-healing, self-defending clouds
Approach (for security):
Benefits:
– Simple, strong, flexible, autonomous cloud safety for its customers.
– Lighter administration, increased reactivity and agility, lower operating costs.
– Graduated response to threats / to faults.
– Enabler for integrated supervision.
Resilience and Fault-Tolerance in Cloud Computing
What we do need for cloud resilience is
– Failure statistics from operational cloud environments
– Fault models
– Fault tolerance algorithms and mechanisms: replication, backup, disaster recovery, self-repair…
… that deal with cloud specificities
– Open, dynamic, virtualized environments: VMs can move between physical servers, so latency and hardware reliability change as well
– Scalability and multi-tenancy: thousands of applications (each generally made of multiple VMs) hosted on a single platform, resources actually shared by multiple applications/users
– Layered infrastructures (SaaS/PaaS/IaaS) with multiple administrative roles/domains and limited control offered to application developers
– Pay-per-use: you have to pay for the resources you use (VMs)!
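The tension between replication and pay-per-use can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that replicas fail independently (optimistic in a shared, virtualized environment) and uses a made-up availability and hourly price; neither figure comes from the talk.

```python
def replicated_availability(single_availability: float, replicas: int) -> float:
    """Availability of a service that is up while at least one replica is up.

    Assumes independent replica failures, which co-located VMs may violate.
    """
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - single_availability) ** replicas


def hourly_cost(price_per_vm_hour: float, replicas: int) -> float:
    """Pay-per-use cost grows linearly with the number of replica VMs."""
    return price_per_vm_hour * replicas


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    for n in (1, 2, 3):
        availability = replicated_availability(0.99, n)
        cost = hourly_cost(0.10, n)
        print(f"{n} replica(s): availability={availability:.6f}, cost={cost:.2f}/h")
```

Each extra replica multiplies the unavailability by the same factor but adds a full VM to the bill, so the marginal benefit shrinks while the marginal cost stays constant.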
Where are we on cloud resilience?
A huge background on distributed systems dependability, including in grid and autonomic computing (self-repair), but few works specifically targeting cloud reliability
– e.g. 0 papers in IEEE Cloud 2009, 1 paper in IEEE Cloud 2010, 1 paper in IEEE Cloud 2011
Some illustrative work-in-progress
– Server failure characterization, with the objective of characterizing the complete data center hardware (server, storage, network) reliability model [Microsoft Research, ACM Symposium on Cloud Computing’2010]
– 3-replica ring with scatter placement algorithm of backups of cloud management nodes [U. Maryland, IBM Research, IEEE IPDPS’2009]
– Crash and timing fault tolerance (not Byzantine faults) through strong replica consistency of applicative nodes (semi-active and semi-passive replication) [U. Cleveland, UC Santa Barbara, IEEE Cloud’2010]
– Byzantine fault tolerance through 3f+1 active replication in community (“voluntary”) cloud [U. Hong Kong, NUDT, IEEE Cloud’2011]
– Adaptive fault tolerance (forward and backward recovery) in real-time cloud through runtime continuous reliability assessment of nodes and replication with variants (VMs) [INRIA, IEEE World Congress on Services’2011]
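The 3f+1 figure quoted for Byzantine fault tolerance can be worked through in a few lines. The helper names below are illustrative and not taken from any of the cited systems; the quorum size 2f+1 is the standard companion bound, since two quorums of that size out of 3f+1 replicas overlap in at least f+1 replicas, hence in at least one correct one.

```python
def bft_replicas(f: int) -> int:
    """Minimum number of replicas needed to tolerate f Byzantine faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def bft_quorum(f: int) -> int:
    """Quorum size such that any two quorums share at least one correct replica."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def max_byzantine_faults(n: int) -> int:
    """Largest f that n replicas can tolerate, i.e. the largest f with 3f + 1 <= n."""
    if n < 1:
        raise ValueError("need at least one replica")
    return (n - 1) // 3


if __name__ == "__main__":
    for f in range(4):
        n, q = bft_replicas(f), bft_quorum(f)
        overlap = 2 * q - n  # minimum intersection of two quorums: f + 1
        print(f"f={f}: replicas={n}, quorum={q}, min quorum overlap={overlap}")
```

Under pay-per-use billing, tolerating a single Byzantine fault therefore means paying for four VMs instead of one.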
Virtual Appliance Management Platform (VAMP)
VAMP is a PaaS Application Lifecycle Management (ALM) platform under development inside Orange Labs.
VAMP targets the construction (e.g. VM image generation), deployment and management of distributed applications in the cloud.
VAMP is based on an architectural description (component-based with the Fractal component model) of applications that is used throughout the complete lifecycle of applications (including ‘model@runtime’)
Control Plane (VAMP)
Applicative Plane (legacy application)
Reification and control
Components
VAMP Runtime Architecture
deployment manager
configurator agent
configurator agent
VM0 VM1 VM2
Control Plane (VAMP)
Applicative Plane (legacy application)
Message Oriented Middleware (MOM)
VAMP manager
The VAMP manager creates and repairs Deployment Managers. It may be inside or outside the IaaS.
The Deployment Manager (one per application) creates applicative VMs and recreates VMs when they fail.
Configurator Agents self-configure the application.
Reliable Self-configuration in VAMP (work in progress)
Self-configuration protocol
– At VM creation, each component C (managed by a Configurator inside a VM):
  – Announces itself to the configurator network;
  – Configures locally the applicative elements;
  – Exports its server interfaces to its “client components”, which can then bind to C (‘bind’ signals);
  – Once all C client interfaces are bound, C starts and notifies its start to its “client components” (‘start’ signals).
– Gradually (“epidemically”), the complete application is started.
– When a VM failure is detected, the faulty VM is replaced by a new VM which is simply introduced in the self-configuration protocol as if it was in its initial deployment phase (except that ‘start’ signals are interpreted as ‘restart’ by already running components).
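The protocol steps can be sketched as a small single-process simulation. This is one possible reading of the slide, not VAMP code: the class names, the three-tier example, and the rule "a 'start' from a server already seen running is a restart" are assumptions made for illustration, and a deque stands in for the message-oriented middleware.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    requires: frozenset                              # servers its client interfaces bind to
    bound: set = field(default_factory=set)          # servers bound so far
    seen_started: set = field(default_factory=set)   # servers already seen running
    started: bool = False


class Deployment:
    """Toy model of the self-configuration protocol (hypothetical names)."""

    def __init__(self, requires):
        self.requires = {name: frozenset(deps) for name, deps in requires.items()}
        self.components = {}   # only components that have announced themselves
        self.queue = deque()   # (signal, sender, receiver)
        self.log = []

    def create_vm(self, name):
        """Introduce a new (or replacement) VM hosting component `name`."""
        c = self.components[name] = Component(name, self.requires[name])
        for other in self.components.values():
            if name in other.requires:       # C exports its server interfaces to its clients
                self.queue.append(("bind", name, other.name))
            if other.name in c.requires:     # known servers export theirs to C
                self.queue.append(("bind", other.name, name))
        self._try_start(c)
        self._drain()

    def _try_start(self, c):
        if not c.started and c.bound == c.requires:
            c.started = True
            self.log.append(f"{c.name} started")
            for other in self.components.values():
                if c.name in other.requires:
                    self.queue.append(("start", c.name, other.name))

    def _drain(self):
        while self.queue:
            signal, sender, receiver = self.queue.popleft()
            target = self.components[receiver]
            if signal == "bind":
                target.bound.add(sender)
                self._try_start(target)
            elif sender in target.seen_started:
                self.log.append(f"{receiver} handles restart of {sender}")
            else:
                target.seen_started.add(sender)


if __name__ == "__main__":
    d = Deployment({"db": [], "app": ["db"], "web": ["app"]})
    for vm in ("web", "app", "db"):   # VMs may come up in any order
        d.create_vm(vm)
    d.create_vm("app")                # replacement of a failed VM
    print(d.log)
```

In this reading a component starts as soon as its client interfaces are bound, regardless of creation order, and a replacement VM goes through exactly the same steps as an initial one.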
Synthesis on VAMP
A decentralized approach based on asynchronous and reliable communications.
– Each VM evolves in parallel at its own pace.
– No need for global synchronization between VMs.
Replication (primary-backup scheme, passive replication) of DMs completes the self-configuration protocol for complete application reliability.
The approach works for VM crashes, transient network failures, and stateless components.
The self-configuration protocol is also a (basic) self-repair protocol: a running application is seen as being in a continuous deployment process.
Attack targets and countermeasures
Hypervisor
vSwitch
Hardware
VM VM VM VM
STORAGE
VNIC
CPU MEMORY NETWORK
PNIC
MAC spoofing/snooping: Static address allocation.
IP attacks: Virtual firewalling.
VLAN hopping: Physical traffic segregation.
Hyperspacing: MMU, IOMMU.
Hyperthreading: No hyperthreading.
Buffer overflows: No eXecute bit.
Hyperjacking:
High attack surface: Certification, open, modular solutions.
Hypervisor integrity: TPM
Secrecy violations: Authentication, signature.
Integrity violations: Encryption of stored data (self-encrypting drives).
DoS (resource starvation): Memory overcommit, optimized page sharing, balloon drivers… Quotas, priorities.
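As an illustration of one countermeasure in this list, static address allocation against MAC spoofing amounts to checking each frame's source address against a fixed table. The table contents and names below are hypothetical; a real vSwitch would enforce this in its datapath rather than in Python.

```python
# Hypothetical static allocation: virtual NIC identifier -> assigned MAC address.
ALLOCATED_MACS = {
    "vm1-vnic0": "52:54:00:aa:bb:01",
    "vm2-vnic0": "52:54:00:aa:bb:02",
}


def is_spoofed(vnic: str, source_mac: str, allocation: dict) -> bool:
    """Return True if a frame's source MAC does not match the vNIC's static address.

    Frames from unknown vNICs are treated as spoofed (deny by default).
    """
    expected = allocation.get(vnic)
    if expected is None:
        return True
    return expected.lower() != source_mac.lower()


if __name__ == "__main__":
    print(is_spoofed("vm1-vnic0", "52:54:00:AA:BB:01", ALLOCATED_MACS))  # legitimate frame
    print(is_spoofed("vm1-vnic0", "52:54:00:aa:bb:02", ALLOCATED_MACS))  # vm1 posing as vm2
```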
Detection: VM Introspection
Benefits:
– Monitor VM behavior, detect unusual behaviors.
– Rely on VMM to measure VM integrity.
– In-VM monitoring: proximity to monitored target.
– In hypervisor: users cannot kill monitoring component.
– In management VM: performance improvement (offload philosophy).
Issues:
– Few remediation actions (simple actions, e.g., restart, kill VM).
– What happens if VMM is compromised?
– In-context measurement is difficult (real-time monitoring).
– Semantic gap problem.
– May require adding hooks in the VM.
– Stealth is difficult to achieve.
– In-VM monitoring: possibility of corruption by malware. How to protect the monitoring component?
– In hypervisor: may require modifying the VMM.
– In management VM: less reactive?
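A minimal sketch of out-of-VM integrity measurement: hash selected guest memory regions and compare them to a known-good baseline. It assumes the monitoring agent can already obtain raw snapshots of those regions (plain bytes here); how to read guest memory and how to bridge the semantic gap are not addressed, and all names are illustrative.

```python
import hashlib


def measure(region: bytes) -> str:
    """SHA-256 digest of a memory region snapshot."""
    return hashlib.sha256(region).hexdigest()


def check_integrity(snapshots: dict, baseline: dict) -> list:
    """Return the names of regions that are missing or differ from the baseline."""
    return sorted(
        name
        for name, digest in baseline.items()
        if name not in snapshots or measure(snapshots[name]) != digest
    )


if __name__ == "__main__":
    # Stand-ins for guest memory regions captured while the VM is in a known-good state.
    clean = {"kernel_text": b"\x90" * 64, "syscall_table": b"\x01\x02\x03\x04"}
    baseline = {name: measure(data) for name, data in clean.items()}

    # Later snapshot in which the system call table has been tampered with.
    current = dict(clean, syscall_table=b"\x01\x02\xff\x04")
    print(check_integrity(current, baseline))
```

Such periodic checks only detect changes between two measurements, which is one reason in-context, real-time measurement is listed as difficult.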
Protection target: VM
Systems
With hooks in VM: Lares, XenAccess, KVMSec
With no hooks in VM: CloudSec
In VM monitoring code: SIM
Offline: LiveWire
Miscellaneous: Re-virt (log), Ether, HIMA
[Figure: placement of hooks and monitoring agents across the monitored VM, the management VM, and the hypervisor.]
Detection: Trusted Computing
Benefits:
– High assurance, strong isolation.
– Flexibility: supports different security requirements / policies.
– Efficient.
– Attestation capabilities.
Issues:
– Protection of hypervisor remains undefined, e.g., malicious host OS drivers.
– Problem of a software-only approach: e.g., how to define a dynamic root of trust. May be lifted if TPM is present.
– Compatibility with legacy systems?
– Dynamic monitoring is difficult: in-context measurement.
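The attestation capability rests on a hash chain: each measured component extends a register, and a verifier replays the measurement log to check the reported value. The sketch below mimics a TPM-style extend operation with SHA-256 in plain Python; it omits the signed quote and the hardware root of trust, and the component names are made up.

```python
import hashlib

PCR_INIT = bytes(32)  # register starts as all zeros


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """New register value = H(old value || measurement digest)."""
    return hashlib.sha256(pcr + measurement).digest()


def replay(event_log: list) -> bytes:
    """Recompute the register value from an ordered list of measurement digests."""
    pcr = PCR_INIT
    for measurement in event_log:
        pcr = extend(pcr, measurement)
    return pcr


def verify(reported_pcr: bytes, event_log: list, trusted: set) -> bool:
    """Accept only if the log reproduces the register and every entry is trusted."""
    return replay(event_log) == reported_pcr and all(m in trusted for m in event_log)


if __name__ == "__main__":
    # Hypothetical boot chain; digests of the loaded components.
    chain = [hashlib.sha256(c).digest() for c in (b"bootloader", b"hypervisor", b"mgmt-vm")]
    reported = replay(chain)
    print(verify(reported, chain, set(chain)))        # expected boot sequence
    print(verify(reported, chain[::-1], set(chain)))  # same components, wrong order
```

Because the register can only be extended, never set, the final value commits to both the content and the order of everything measured at load time. This is also why dynamic, in-context monitoring is harder: changes after loading are not reflected.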
Protection target: VM
Systems
Trusted VMM: Terra + TPM
In management VM: vTPM
Certification: seL4
[Figure: hook in the monitored VM (e.g., for integrity), monitoring agents in the management VM and the hypervisor; host OS drivers marked as an open question.]
Reaction: Sandboxing
Benefits:
– Good performance (by placing performance-critical code in kernel or using hardware virtualization).
– Strong security against malicious driver behavior in hypervisor (avoid DMA attacks).
– Flexible isolation policies.
– Flexible frontier between u-driver and k-driver (Micro-Drivers).
– Reduced code size: executing part of the drivers in user-mode may reduce amount of kernel code.
Issues:
– Difficult to protect RM against attacks if malicious code in hypervisor (if no hardware virtualization is used).
– Security policy in the RM is difficult to configure by hand (benefits of the autonomic approach).
– No remediation. Limited to fault / attack containment.
– Requires modification of kernel.
Systems
RM between driver and kernel: Software Fault Isolation (SFI) techniques
RM between driver and user space: MicroDrivers, Proxos
RM between driver and device: Nooks
Protection target: Hypervisor
[Figure: three placements of the reference monitor (RM) around a driver, across VMs, management VM, host OS userland, hypervisor / host OS kernel, and device.]
Reaction: Virtualization
Benefits:
– Strong security and isolation (separation from protection target, i.e., one layer below).
– Stealthy (ring -1 technique, for VM protection).
– No driver code modification.
– Compatibility with legacy code.
– Flexibility: open hypervisor APIs to customize memory management, execution control.
– Performance overhead may be reduced by hardware virtualization.
Issues:
– How to protect the hypervisor layer? Nested virtualization is possible, but at an enormous performance penalty.
– Hypervisor protection mechanisms remain very limited (e.g., memory hiding in VASP/HBSP).
– Heavy performance penalty.
Protection target: VM / Hypervisor
Systems
VM protection (guest OS I/O mediation): BitVisor
VM protection (guest OS isolation): HBSP/VASP
VM/hypervisor protection (driver virtualization): iKernel (full), LeVasseur et al. (partial).
Hypervisor protection (nested virtualization): Turtles
[Figure: VMs with drivers and stubs, management VM, hypervisor / host OS, and hyper-hypervisor.]
Virtual Security Appliances
Design issues:
– Which network security architecture? (physical, virtual, hybrid…)
– Which software layer? (vSwitch, hypervisor, VM, multi-layer…)
Architecture #1: Physical
Architecture #2: Virtual
Architecture #3: Hybrid
Autonomic Management of IaaS Infrastructure
Results:
– First framework for building self-protecting IaaS infrastructure.
– Prototype running on KVM hypervisor – extensible to others (VMware).
Flexible confinement of VMs according to risk level
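Risk-driven confinement can be pictured as one pass of a monitor, analyze, plan loop. The following sketch is not the framework described here: the risk levels, alert thresholds, and confinement action names are all hypothetical, and the execution step (actually reconfiguring the hypervisor or network) is left out.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Hypothetical confinement actions, ordered from lightest to strictest.
CONFINEMENT = {
    Risk.LOW: "no-confinement",
    Risk.MEDIUM: "restrict-network-flows",
    Risk.HIGH: "quarantine-vlan",
}


def analyze(alert_count: int) -> Risk:
    """Map the number of recent alerts for a VM to a risk level (made-up thresholds)."""
    if alert_count >= 5:
        return Risk.HIGH
    if alert_count >= 1:
        return Risk.MEDIUM
    return Risk.LOW


def plan(risk: Risk) -> str:
    """Choose the confinement action matching the risk level (graduated response)."""
    return CONFINEMENT[risk]


def loop_iteration(alerts_per_vm: dict) -> dict:
    """One monitor -> analyze -> plan pass over all VMs."""
    return {vm: plan(analyze(count)) for vm, count in alerts_per_vm.items()}


if __name__ == "__main__":
    print(loop_iteration({"vm1": 0, "vm2": 2, "vm3": 7}))
```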
Extension to Multiple Loops
Autonomic security management of IaaS infrastructure.
Orchestration patterns to compose autonomic loops:
– Between views.
– Between layers.
Pattern types: centralized, hierarchical, P2P...
Difficult (impossible?) administration by hand.
2 autonomic loops: network-level and device-level.
Self-protection loops
Orchestration of Multiple Loops
Interference between loops:
– Physical layer: node isolation in a VLAN (network quarantine).
– Security appliance: update attack signature database.
Framework overview:
A hierarchical + layered cloud security self-management model.
Open Issues: Detection
Place of detection mechanisms: which monitoring granularity and scope?
– In which layer? hypervisor (VM introspection), VM (self-defending VMs), hybrid.
– Where in the cloud architecture? host-based, network-based, hybrid.
– Coordination of detection mechanisms.
– Should the monitoring system be co-located with the protection target?
Semantic gap:
– Correlation of monitored information across layers.
– Possible approach: ontologies and Semantic Web.
Monitoring intrusiveness:
– Performance impact?
– Stealth of detection mechanisms?
Forensics: real-time detection vs. post-mortem analysis?
Malware detection: a new field in the cloud not yet understood.
Open Issues: Reaction
Place of reaction mechanisms (as for detection).
Adaptable reaction mechanisms:
– Defense configuration remains static / complex, e.g., for network security.
– Emergence of cloud networking. Issues: real-time virtual network reconfiguration; on-demand chaining of security services; security of network management interface.
Influence of VM migration: long distance migration (across data centers, across clouds). Migration of the security state consistently, securely, and efficiently (e.g., IP address space)?
Remediation mechanisms: Currently limited to threat containment, with no elimination.
Hypervisor protection: still rarely addressed.
– Isolation of device drivers: rich literature in the system community (virtualization, sandboxing, language techniques, new kernel architectures...) is applicable.
– Assurance: control flow integrity, attestation, compliance proofs, trusted computing.
Open Issues: Decision
Multi-lateral security policy management:
– Policy definition: despite advances, current policies are not sufficiently flexible for the cloud and the inter-cloud setting. Need for policies spanning organization boundaries, geographical borders, applicable in multiple contexts, with multiple actors.
– Automated policy aggregation and deployment.
Security adaptation strategy:
– Flexible policy negotiation: Some first frameworks (e.g., XENA), but a lot still to be done. Issues: interoperability, multiple conflicting stakeholder responsibilities, multiple jurisdictions. Promising solution: ontologies.
– Strategy representation: notion of policy continuum, DSLs.
Coordination of multiple self-protection loops: Stability of the result?
– Event correlation from monitoring components? Decision synchronization between reaction components?
– Authentication of communications: detection / decision; decision / reaction.
– Lack of security supervision architecture: for the cloud, and the inter-cloud setting.
Learning from past attacks to improve security and build defenses against future threats.
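Authenticating the messages exchanged between detection, decision, and reaction components could, for instance, rely on a keyed MAC. The sketch below uses HMAC-SHA256 from the Python standard library with a pre-shared key; the message fields and key handling are assumptions, and replay protection (nonces, timestamps) is deliberately left out.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign(key: bytes, message: dict) -> dict:
    """Wrap a message with an HMAC-SHA256 tag computed over its canonical JSON form."""
    payload = json.dumps(message, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify(key: bytes, envelope: dict) -> Optional[dict]:
    """Return the message if the tag is valid, otherwise None."""
    expected = hmac.new(key, envelope["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["tag"]):
        return None
    return json.loads(envelope["payload"])


if __name__ == "__main__":
    key = b"pre-shared key between detection and decision"  # hypothetical
    alert = {"source": "ids-1", "vm": "vm3", "event": "port-scan"}
    envelope = sign(key, alert)
    print(verify(key, envelope))

    # A forged alert reusing the old tag is rejected.
    envelope["payload"] = envelope["payload"].replace("vm3", "vm1")
    print(verify(key, envelope))
```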
Open Issues: Resilience
Main roadblocks toward cloud resilience:
– Cloud layered architecture with different responsibilities: coordination and consistency of mechanisms offered by platforms versus mechanisms programmed by applications?
– Reliability of cloud management services themselves.
– Applicative models (stateless/stateful, strong/weak coupling) and fault-tolerance?
Self-stabilization and cloud resilience?
– Applicability of self-stabilization in cloud?
– At the hypervisor level? At VM and applicative level?
– Openness of cloud systems?
– Cost (w.r.t. performance) of self-stabilization in cloud?
Conclusion
In the cloud, security is clearly not an option:
– Migrating critical data and applications to the cloud is and remains risky.
– Check (and stick to) best practices!
The main challenge of cloud dependability remains trust:
– A lot of building blocks already there, but still a long way to go.
– Importance of SLAs to clarify responsibilities between customer and provider.
– Self-healing, self-protecting cloud architectures can help, both in specification and enforcement of such SLAs.
– Strong security, reliability, transparency, and accountability will be the key enablers to maintain solid and durable trust between a CSP and its customers.