UC Berkeley

Cloud Computing: Past, Present, and Future

Professor Anthony D. Joseph*, UC Berkeley
Reliable Adaptive Distributed Systems Lab
*Director, Intel Research Berkeley
http://abovetheclouds.cs.berkeley.edu/

RWTH Aachen, 22 March 2010
RAD Lab 5-year Mission
Enable 1 person to develop, deploy, and operate a next-generation Internet application

• Key enabling technology: statistical machine learning – debugging, monitoring, power management, auto-configuration, performance prediction, ...
• Highly interdisciplinary faculty & students
 – PIs: Patterson/Fox/Katz (systems/networks), Jordan (machine learning), Stoica (networks & P2P), Joseph (security), Shenker (networks), Franklin (DB)
 – 2 postdocs, ~30 PhD students, ~6 undergrads
• Grad/undergrad teaching integrated with research
Course Timeline
• Friday
 – 10:00-12:00 History of Cloud Computing: time-sharing, virtual machines, datacenter architectures, utility computing
 – 12:00-13:30 Lunch
 – 13:30-15:00 Modern Cloud Computing: economics, elasticity, failures
 – 15:00-15:30 Break
 – 15:30-17:00 Cloud Computing Infrastructure: networking, storage, computation models
• Monday
 – 10:00-12:00 Cloud Computing research topics: scheduling, multiple datacenters, testbeds
NEXUS: A COMMON SUBSTRATE FOR CLUSTER COMPUTING

Joint work with Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi
Recall: Hadoop on HDFS
[Diagram: three slave nodes, each running a datanode daemon over the Linux file system plus a tasktracker; a namenode runs the namenode daemon, and a job submission node runs the jobtracker.]

Adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creative Commons Attribution 3.0 License)
Problem
• Rapid innovation in cluster computing frameworks
• No single framework is optimal for all applications
• Energy efficiency means maximizing cluster utilization
• Want to run multiple frameworks in a single cluster
What do we want to run in the cluster?

• Dryad
• Apache Hama
• Pregel
• Pig
Why share the cluster between frameworks?

• Better utilization and efficiency (e.g., take advantage of diurnal patterns)
• Better data sharing across frameworks and applications
Solution
Nexus is an "operating system" for the cluster over which diverse frameworks can run
 – Nexus multiplexes resources between frameworks
 – Frameworks control job execution
Goals
• Scalable
• Robust (i.e., simple enough to harden)
• Flexible enough for a variety of different cluster frameworks
• Extensible enough to encourage innovative future frameworks
Question 1: Granularity of Sharing
Option: Coarse-grained sharing
 – Give a framework a (slice of a) machine for its entire duration

[Diagram: nodes statically assigned to Hadoop 1, Hadoop 2, and Hadoop 3.]

 – Data locality is compromised if a machine is held for a long time
 – Hard to account for new frameworks and changing demands -> hurts utilization and interactivity
Question 1: Granularity of Sharing

Nexus: Fine-grained sharing
 – Support frameworks that use smaller tasks (in time and space) by multiplexing them across all available resources

[Diagram: nodes time-multiplexed among tasks from Hadoop 1, Hadoop 2, and Hadoop 3.]

 + Frameworks can take turns accessing data on each node
 + Can resize framework shares to get utilization & interactivity
Question 2: Resource Allocation

Option: Global scheduler
 – Frameworks express needs in a specification language; a global scheduler matches resources to frameworks
 • Requires encoding a framework's semantics using the language, which is complex and can lead to ambiguities
 • Restricts frameworks if the specification is unanticipated

Designing a general-purpose global scheduler is hard
Question 2: Resource Allocation

Nexus: Resource offers
 – Offer free resources to frameworks; let frameworks pick which resources best suit their needs
 + Keeps Nexus simple and allows us to support future jobs
 – Distributed decisions might not be optimal
Outline

• Nexus Architecture
• Resource Allocation
• Multi-Resource Fairness
• Implementation
• Results
NEXUS ARCHITECTURE
Overview

[Diagram: a Nexus master coordinates multiple Nexus slaves. A Hadoop v19 scheduler, a Hadoop v20 scheduler, and an MPI scheduler each submit jobs; the corresponding executors (Hadoop v19, Hadoop v20, MPI) run tasks on the slaves.]
Resource Offers

[Diagram: the Nexus master picks a framework to offer to and sends it a resource offer; the Hadoop and MPI schedulers run jobs through their executors on the Nexus slaves.]
Resource Offers

[Diagram as above: the master picks a framework to offer to.]

Resource offer = list of {machine, free_resources}

Example: [ {node1, <2 CPUs, 4 GB>}, {node2, <2 CPUs, 4 GB>} ]
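To make the offer flow concrete, here is a minimal sketch in Python; the record and method names (Offer, resource_offer) are assumptions for exposition, not the actual Nexus API. The framework scheduler receives the offer list, claims what fits its per-task demand, and implicitly declines the rest.

```python
from dataclasses import dataclass

# Hypothetical offer record: one entry per machine with free resources.
@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int

class MyScheduler:
    """Toy framework scheduler: claim offers that fit one 2-CPU / 4-GB task."""

    def resource_offer(self, offers):
        launches = []
        for o in offers:
            if o.cpus >= 2 and o.mem_gb >= 4:           # per-task demand
                launches.append((o.node, {"cpus": 2, "mem_gb": 4}))
        return launches                                  # unclaimed offers return to the master

offers = [Offer("node1", 2, 4), Offer("node2", 2, 4)]
print(MyScheduler().resource_offer(offers))
# [('node1', {'cpus': 2, 'mem_gb': 4}), ('node2', {'cpus': 2, 'mem_gb': 4})]
```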
Resource Offers

[Diagram: the resource offer reaches the Hadoop scheduler, which makes framework-specific scheduling decisions; the master launches and isolates executors, and the Hadoop executor runs tasks on a Nexus slave.]
Resource Offer Details

• Min and max task sizes to control fragmentation
• Filters let a framework restrict the offers sent to it:
 – By machine list
 – By quantity of resources
• Timeouts can be added to filters
• Frameworks can signal when to destroy filters, or when they want offers again
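A sketch of how a master might honor such filters; all names here are hypothetical, not the real Nexus interface:

```python
import time

class Filter:
    """Hypothetical master-side filter: skip offers a framework would reject."""

    def __init__(self, machines=None, min_cpus=0, timeout=None):
        self.machines = set(machines) if machines else None   # acceptable nodes
        self.min_cpus = min_cpus                              # smallest useful offer
        self.expires = time.time() + timeout if timeout else None

    def suppresses(self, node, cpus):
        """True if this offer should NOT be sent to the framework."""
        if self.expires is not None and time.time() > self.expires:
            return False                  # expired filters stop applying
        if self.machines is not None and node not in self.machines:
            return True                   # node outside the framework's list
        return cpus < self.min_cpus       # offer too small to be useful
```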
Using Offers for Data Locality

We found that a simple policy called delay scheduling can give very high locality:
 – A framework waits for offers on nodes that have its data
 – If it has waited longer than a certain delay, it starts launching non-local tasks
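The policy fits in a few lines; a minimal sketch, with class and field names invented for illustration:

```python
import time

DELAY = 5.0  # seconds to hold out for a data-local slot (tunable)

class Job:
    """Toy delay-scheduling policy: prefer nodes holding our input data."""

    def __init__(self, local_nodes):
        self.local_nodes = set(local_nodes)    # nodes storing this job's data
        self.waiting_since = time.time()

    def accept_offer(self, node):
        if node in self.local_nodes:
            self.waiting_since = time.time()   # reset the clock on a local launch
            return True
        # Non-local offer: accept only after waiting past the delay.
        return time.time() - self.waiting_since > DELAY
```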
Framework Isolation

• The isolation mechanism is pluggable, due to the inherent performance/isolation tradeoff
• The current implementation supports Solaris projects and Linux containers
 – Both isolate CPU, memory, and network bandwidth
 – Linux developers are working on disk I/O isolation
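For flavor, a minimal sketch of CPU/memory isolation via the Linux cgroup v1 filesystem interface (paths assume a standard v1 mount, and this is illustrative, not Nexus's actual isolation module):

```python
import os

def limit_executor(name, pid, cpu_shares=512, mem_bytes=2 << 30):
    """Cap an executor's CPU weight and memory, then attach its process."""
    for subsys, knob, value in [
        ("cpu", "cpu.shares", cpu_shares),                # relative CPU weight
        ("memory", "memory.limit_in_bytes", mem_bytes),   # hard memory cap (2 GiB default)
    ]:
        path = f"/sys/fs/cgroup/{subsys}/{name}"
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, knob), "w") as f:
            f.write(str(value))
        with open(os.path.join(path, "tasks"), "w") as f:
            f.write(str(pid))                             # move the process into the group
```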
RESOURCE ALLOCATION
Allocation Policies

• Nexus picks the framework to offer resources to, and hence controls how many resources each framework can get (but not which)
• Allocation policies are pluggable, to suit organization needs, through allocation modules
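A sketch of what such a pluggable module might look like (the interface names are illustrative, not the Nexus codebase): policies swap in by subclassing.

```python
class AllocationModule:
    """Pluggable policy: decide which framework receives the next offer."""

    def next_framework(self, frameworks, usage):
        raise NotImplementedError

class FairShare(AllocationModule):
    def next_framework(self, frameworks, usage):
        # Offer to whichever framework currently holds the smallest share.
        return min(frameworks, key=lambda f: usage.get(f, 0.0))

policy = FairShare()
print(policy.next_framework(["hadoop", "mpi"], {"hadoop": 0.6, "mpi": 0.2}))  # mpi
```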
Example: Hierarchical Fairshare Policy

[Figure: the cluster share policy splits Facebook.com's 100% between Spam (20%) and Ads (80%); within Spam, User 1 (70%, i.e., 14% of the cluster) and User 2 (30%, i.e., 6%) run their jobs, and plots show each allocation over time.]
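The arithmetic behind such a policy is just multiplying weights down the tree; a tiny sketch, with the tree structure assumed from the figure:

```python
# Weighted hierarchy (child weights sum to 1 at each node).
tree = {
    "Facebook.com": {"Spam": 0.20, "Ads": 0.80},
    "Spam": {"User 1": 0.70, "User 2": 0.30},
}

def leaf_share(path):
    """Multiply weights along the path from root to leaf."""
    share = 1.0
    for parent, child in zip(path, path[1:]):
        share *= tree[parent][child]
    return share

print(leaf_share(["Facebook.com", "Spam", "User 1"]))  # ~0.14 of the cluster
print(leaf_share(["Facebook.com", "Spam", "User 2"]))  # ~0.06 of the cluster
```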
Revocation
Killing tasks to make room for other users

Not the normal case, because fine-grained tasks enable quick reallocation of resources

Sometimes necessary:
 – Long-running tasks never relinquishing resources
 – A buggy job running forever
 – A greedy user who decides to make his tasks long
Revocation Mechanism
The allocation policy defines a safe share for each user
 – Users will get at least their safe share within a specified time

Revoke only if a user is below its safe share and is interested in offers
 – Revoke tasks from users farthest above their safe share
 – A framework is warned before its task is killed
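A sketch of that victim-selection rule (data layout assumed; in the real system frameworks are warned before any task dies):

```python
def pick_victims(usage, safe_share, needed):
    """Free `needed` cluster share by revoking from users farthest above
    their safe share."""
    over = {u: usage[u] - safe_share[u]
            for u in usage if usage[u] > safe_share[u]}
    victims, freed = [], 0.0
    for user in sorted(over, key=over.get, reverse=True):   # most-over first
        if freed >= needed:
            break
        take = min(over[user], needed - freed)
        victims.append((user, take))
        freed += take
    return victims

print(pick_victims({"A": 0.5, "B": 0.3, "C": 0.2},
                   {"A": 0.2, "B": 0.3, "C": 0.4}, needed=0.2))
# [('A', 0.2)] -- A is farthest above its safe share
```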
How Do We Run MPI?
Users are always told their safe share
 – Avoid revocation by staying below it

Giving each user a small safe share may not be enough if jobs need many machines

Can run a traditional grid or HPC scheduler as a user with a larger safe share of the cluster, and have MPI jobs queue up on it
 – E.g., Torque gets 40% of the cluster
Example: Torque on Nexus

[Figure: Torque runs as a Nexus user with a safe share of 40% and queues MPI jobs onto it, while Facebook.com's Spam (20%) and Ads (40%) users run their jobs on the rest of the cluster.]
MULTI-RESOURCE FAIRNESS
What is Fair?
• Goal: define a fair allocation of resources in the cluster between multiple users
• Example: suppose we have:
 – 30 CPUs and 30 GB RAM
 – Two users with equal shares
 – User 1 needs <1 CPU, 1 GB RAM> per task
 – User 2 needs <1 CPU, 3 GB RAM> per task
Definition 1: Asset Fairness

• Idea: give weights to resources (e.g., 1 CPU = 1 GB) and equalize the value of the resources given to each user
• Algorithm: when resources are free, offer them to whoever has the least value
• Result:
 – U1: 12 tasks: 12 CPUs, 12 GB ($24)
 – U2: 6 tasks: 6 CPUs, 18 GB ($24)

PROBLEM: User 1 has < 50% of both CPUs and RAM

[Chart: per-user shares of CPU and RAM under asset fairness.]
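The 12/6 split can be reproduced in a few lines; this greedy loop is a sketch of the stated algorithm (offer to whoever has the least value), not Nexus code:

```python
# Greedy asset fairness: repeatedly grant a task to the user whose
# allocation has the least total "value" (weights: 1 CPU = 1 GB).
cpus, ram = 30.0, 30.0
demands = {"U1": (1, 1), "U2": (1, 3)}   # (CPUs, GB) per task
tasks = {"U1": 0, "U2": 0}
value = {"U1": 0.0, "U2": 0.0}

while True:
    for user in sorted(value, key=value.get):   # least-value user first
        c, r = demands[user]
        if c <= cpus and r <= ram:
            cpus, ram = cpus - c, ram - r
            tasks[user] += 1
            value[user] += c + r                # 1 CPU = 1 GB = $1
            break
    else:
        break                                   # no task fits: stop

print(tasks)   # {'U1': 12, 'U2': 6} -- each user ends with value $24
```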
Lessons from Definition 1

• "You shouldn't do worse than if you ran a smaller, private cluster equal in size to your share"
• Thus, given N users, each user should get ≥ 1/N of his dominating resource (i.e., the resource that he consumes most of)
Def. 2: Dominant Resource Fairness

• Idea: give every user an equal share of her dominant resource (i.e., the resource she consumes most of)
• Algorithm: when resources are free, offer them to the user with the smallest dominant share (i.e., the fractional share of her dominant resource)
• Result:
 – U1: 15 tasks: 15 CPUs, 15 GB
 – U2: 5 tasks: 5 CPUs, 15 GB

[Chart: per-user shares of CPU and RAM under DRF; both users reach a 50% dominant share.]
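The same greedy loop with "least value" swapped for "smallest dominant share" reproduces the 15/5 result; again a sketch of the stated algorithm, not Nexus code:

```python
total = {"cpu": 30.0, "ram": 30.0}
free = dict(total)
demands = {"U1": {"cpu": 1, "ram": 1}, "U2": {"cpu": 1, "ram": 3}}
alloc = {u: {"cpu": 0.0, "ram": 0.0} for u in demands}
tasks = {u: 0 for u in demands}

def dominant_share(user):
    # Largest fractional share this user holds across all resources.
    return max(alloc[user][r] / total[r] for r in total)

while True:
    for user in sorted(demands, key=dominant_share):   # smallest share first
        d = demands[user]
        if all(d[r] <= free[r] for r in total):
            for r in total:
                free[r] -= d[r]
                alloc[user][r] += d[r]
            tasks[user] += 1
            break
    else:
        break                                          # no task fits: stop

print(tasks)   # {'U1': 15, 'U2': 5} -- both end at a 50% dominant share
```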
Fairness Properties

Property                 | Asset | Dynamic | CEEI | DRF
-------------------------|-------|---------|------|----
Pareto efficiency        |   x   |    x    |  x   |  x
Single-resource fairness |   x   |    x    |  x   |  x
Bottleneck fairness      |       |    x    |  x   |  x
Share guarantee          |       |         |  x   |  x
Population monotonicity  |   x   |         |      |  x
Envy-freedom             |   x   |         |  x   |  x
Resource monotonicity    |       |         |      |
IMPLEMENTATION
Implementation Stats
• 7,000 lines of C++
• APIs in C, C++, Java, Python, Ruby
• Executor isolation using Linux containers and Solaris projects
Frameworks
Ported frameworks:
 – Hadoop (900-line patch)
 – MPI (160-line wrapper scripts)

New frameworks:
 – Spark, a Scala framework for iterative jobs (1,300 lines)
 – Apache + haproxy, an elastic web server farm (200 lines)
RESULTS
Overhead
Less than 4% seen in practice
Dynamic Resource Sharing
Multiple Hadoops Experiment

[Diagram: Hadoop 1, Hadoop 2, and Hadoop 3 statically partitioned across the cluster.]
Multiple Hadoops Experiment

[Diagram: the same three Hadoop instances sharing all nodes under fine-grained multiplexing, with tasks from Hadoop 1, 2, and 3 interleaved on each machine.]
Results with 16 Hadoops
WEB SERVER FARM FRAMEWORK
Web Framework Experiment

[Diagram: a load-generation framework runs httperf tasks on Nexus slaves, sending HTTP requests to Apache web executor tasks; an haproxy-based web scheduler receives load calculations, resource offers, and task status updates through the Nexus master and resizes the Apache farm accordingly.]
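The elastic-farm logic amounts to a scaling rule over measured load; a minimal sketch, with the threshold and names invented for illustration:

```python
import math

TARGET_RPS_PER_SERVER = 500   # illustrative capacity per Apache task

def servers_delta(total_rps, current_servers):
    """>0: accept offers to launch more Apache tasks; <0: release that many."""
    wanted = max(1, math.ceil(total_rps / TARGET_RPS_PER_SERVER))
    return wanted - current_servers

print(servers_delta(2600, 4))   # 2 -> grow the farm by two web executors
```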
Web Framework Results
Future Work

• Experiment with parallel programming models
• Further explore low-latency services on Nexus (web applications, etc.)
• Shared services (e.g., BigTable, GFS)
• Deploy to users and open source
CLOUD COMPUTING TESTBEDS
OPEN CIRRUS™: SEIZING THE OPEN SOURCE CLOUD STACK OPPORTUNITY

A joint initiative sponsored by HP, Intel, and Yahoo!: http://opencirrus.org
Proprietary Cloud Computing Stacks

Layer                   | Google                                                   | Amazon       | Microsoft
------------------------|----------------------------------------------------------|--------------|----------------------------------
Applications            | Applications                                             | Applications | Applications
Application frameworks  | MapReduce, Sawzall, Google App Engine, Protocol Buffers  | EMR (Hadoop) | .NET Services
VM management           | Borg                                                     | EC2          | Fabric Controller
Job scheduling          | Borg                                                     | --           | Fabric Controller
Storage management      | GFS, BigTable                                            | S3, EBS      | SQL Services, blobs, tables, queues
Monitoring              | Borg                                                     | --           | Fabric Controller
Hardware infrastructure | Borg                                                     | --           | Fabric Controller

(Callout in the original figure: publicly accessible layer.)
Open Cirrus Cloud Computing Testbed

Shared: research, applications, infrastructure (12K cores), data sets
Global services: sign on, monitoring, store; open-source stack (PRS, Tashi, Hadoop)
Sponsored by HP, Intel, and Yahoo! (with additional support from NSF)

• 9 sites currently, target of around 20 in the next two years
Open Cirrus Goals
• Goals
 • Foster new systems and services research around cloud computing
 • Catalyze the open-source stack and APIs for the cloud
• How are we unique?
 • Support for systems research and applications research
 • Federation of heterogeneous datacenters
Open Cirrus Organization
• Central Management Office oversees Open Cirrus
 • Currently owned by HP
• Governance model
 • Research team
 • Technical team
 • New site additions
 • Support (legal (export, privacy), IT, etc.)
• Each site
 • Runs its own research and technical teams
 • Contributes individual technologies
 • Operates some of the global services
• E.g.:
 • HP site supports portal and PRS
 • Intel site developing and supporting Tashi
 • Yahoo! contributes to Hadoop
Open Cirrus Sites
Site characteristics:

Site        | #Cores | #Srvrs | Public | Memory | Storage                  | Spindles | Network                    | Focus
------------|--------|--------|--------|--------|--------------------------|----------|----------------------------|--------------------------------
HP          | 1,024  | 256    | 178    | 3.3TB  | 632TB                    | 1152     | 10G internal, 1Gb/s x-rack | Hadoop, Cells, PRS, scheduling
IDA         | 2,400  | 300    | 100    | 4.8TB  | 43TB + 16TB SAN          | 600      | 1Gb/s                      | Apps based on Hadoop, Pig
Intel       | 1,364  | 198    | 145    | 1.8TB  | 610TB local, 60TB attach | 746      | 1Gb/s                      | Tashi, PRS, MPI, Hadoop
KIT         | 2,048  | 256    | 128    | 10TB   | 1PB                      | 192      | 1Gb/s                      | Apps with high throughput
UIUC        | 1,024  | 128    | 64     | 2TB    | ~500TB                   | 288      | 1Gb/s                      | Datasets, cloud infrastructure
CMU         | 1,024  | 128    | 64     | 2TB    | --                       | --       | 1Gb/s                      | Storage, Tashi
Yahoo (M45) | 3,200  | 480    | 400    | 2.4TB  | 1.2PB                    | 1600     | 1Gb/s                      | Hadoop on demand
Total       | 12,074 | 1,746  | 1,029  | 26.3TB | ~4PB                     |          |                            |
Testbed Comparison
Testbed          | Open Cirrus                              | IBM/Google                            | TeraGrid                                   | PlanetLab                                        | EmuLab                                      | Open Cloud Consortium                          | Amazon EC2                     | LANL/NSF cluster
-----------------|------------------------------------------|---------------------------------------|--------------------------------------------|--------------------------------------------------|---------------------------------------------|------------------------------------------------|--------------------------------|------------------------------------------
Type of research | Systems & services                       | Data-intensive applications research  | Scientific applications                    | Systems and services                             | Systems                                     | Interoperability across clouds using open APIs | Commercial use                 | Systems
Approach         | Federation of heterogeneous datacenters  | A cluster supported by Google and IBM | Multi-site hetero clusters, supercomputers | A few 100 nodes hosted by research institutions  | A single-site cluster with flexible control | Multi-site hetero clusters, focus on network   | Raw access to virtual machines | Re-use of LANL's retiring clusters
Participants     | HP, Intel, IDA, KIT, UIUC, Yahoo!, CMU   | IBM, Google, Stanford, U.Wash, MIT    | Many schools and orgs                      | Many schools and orgs                            | University of Utah                          | 4 centers                                      | Amazon                         | CMU, LANL, NSF
Distribution     | 7(9) sites, 1,746 nodes, 12,074 cores    | 1 site                                | 11 partners in US                          | >700 nodes worldwide                             | >300 nodes @ Utah                           | 480 cores, distributed in four locations       | 1 site                         | 1 site, 1000s of older, still useful nodes
Open Cirrus Stack
[Diagram: compute + network + storage resources, with power + cooling, under a management and control subsystem; physical resource sets are carved out by the Zoni service.]

Credit: John Wilkes (HP)
Open Cirrus Stack
[Diagram: the Zoni service provisions PRS clients, each with their own "physical data center"; research, Tashi, an NFS storage service, and an HDFS storage service run on top.]
Open Cirrus Stack
[Diagram: virtual clusters (e.g., Tashi) are added on top of the Zoni service and the storage services.]
Open Cirrus Stack
[Diagram: a BigData application shown running 1. as an application, 2. on Hadoop, 3. on a Tashi virtual cluster, 4. on a PRS, 5. on real hardware.]
Open Cirrus Stack
[Diagram: experiment save/restore is added alongside the BigData app on Hadoop.]
Open Cirrus Stack
[Diagram: platform services are added to the stack.]
Open Cirrus Stack
[Diagram: user services complete the stack: Zoni, virtual clusters, BigData app on Hadoop, experiment save/restore, platform services, and user services.]
Open Cirrus Stack
[Diagram: the complete Open Cirrus stack, from Zoni up through virtual clusters, Hadoop, experiment save/restore, platform services, and user services.]
Open Cirrus Stack - Tashi
• An open-source Apache Software Foundation project sponsored by Intel (with CMU, Yahoo, HP)
 • Infrastructure for cloud computing on Big Data
 • http://incubator.apache.org/projects/tashi
• Research focus:
 • Location-aware co-scheduling of VMs, storage, and power
 • Seamless physical/virtual migration
 • Joint with Greg Ganger (CMU), Mor Harchol-Balter (CMU), Milan Milenkovic (CTG)
Tashi High-Level Design

[Diagram: a Cluster Manager (CM) with a scheduler, a storage service, and a virtualization service spanning the cluster nodes.]

• Cluster nodes are assumed to be commodity machines
• Services are instantiated through virtual machines
• Data location and power information is exposed to the scheduler and services
• The CM maintains databases and routes messages; its decision logic is limited
• Most decisions happen in the scheduler, which manages compute/storage/power in concert
• The storage service aggregates the capacity of the commodity nodes to house Big Data repositories
Location Matters
[Figure: calculated throughput per disk (MB/s) for 40 racks × 30 nodes × 2 disks, comparing random vs. location-aware placement across Disk-1G, SSD-1G, Disk-10G, and SSD-10G configurations; location-aware placement wins by 3.6x, 11x, 3.5x, and 9.2x respectively.]
Open Cirrus Stack - Hadoop

• An open-source Apache Software Foundation project sponsored by Yahoo!
 • http://wiki.apache.org/hadoop/ProjectDesc
• Provides a parallel programming model (MapReduce), a distributed file system (HDFS), and a parallel database
What kinds of projects are Open Cirrus sites looking for?
• Open Cirrus is seeking research in the following areas (different centers will weight these differently):
 • Datacenter federation
 • Datacenter management
 • Web services
 • Data-intensive applications and systems
• The following kinds of projects are generally not of interest:
 • Traditional HPC application development
 • Production applications that just need lots of cycles
 • Closed-source system development
How do users get access to Open Cirrus sites?
• Project PIs apply to each site separately.
• Contact names, email addresses, and web links for applications to each site will be available on the Open Cirrus Web site (which goes live Q2 09) – http://opencirrus.org
• Each Open Cirrus site decides which users and projects get access to its site.
• Developing a global sign-on for all sites (Q2 09)
 – Users will be able to log in to each Open Cirrus site for which they are authorized using the global sign-on
Summary and Lessons
• Intel is collaborating with HP and Yahoo! to provide a cloud computing testbed for the research community
• Using the cloud as an accelerator for interactive streaming/big-data apps is an important usage model
• Primary goals are to:
 • Foster new systems research around cloud computing
 • Catalyze an open-source reference stack and APIs for the cloud – access model, local and global services, application frameworks
 • Explore location-aware and power-aware workload scheduling
 • Develop integrated physical/virtual allocations to combat cluster squatting
 • Design cloud storage models
OTHER CLOUD COMPUTING RESEARCH TOPICS: ISOLATION AND DC ENERGY
Heterogeneity in Virtualized Environments
• VM technology isolates CPU and memory, but disk and network are shared
 – Full bandwidth when there is no contention
 – Equal shares when there is contention
• 2.5x performance difference [chart: measured across EC2 small instances]
Isolation Research
• Need predictable variance over raw performance
• Some resources that people have run into problems with:
 – Power, disk space, disk I/O rate (drive, bus), memory space (user/kernel), memory bus, caches at all levels (TLB, etc.), hyperthreading, CPU rate, interrupts
 – Network: NIC (Rx/Tx), switch, cross-datacenter, cross-country
 – OS resources: file descriptors, ports, ...
Datacenter Energy
• EPA, 8/2007:
 – 1.5% of total U.S. energy consumption
 – Growing from 60 to 100 billion kWh in 5 yrs
 – 48% of a typical IT budget spent on energy
• 75 MW of new DC deployments in PG&E's service area – that they know about! (expect another 2x)
• Microsoft: $500M new Chicago facility
 – Three substations with a capacity of 198 MW
 – 200+ shipping containers w/ 2,000 servers each
Power/Cooling Issues
First Milestone: DC Energy Conservation
• DCs limited by power
 – For each dollar spent on servers, add $0.48 (2005) / $0.71 (2010) for power/cooling
 – $26B spent to power and cool servers in 2005 grows to $45B in 2010
• Within DC racks, network equipment is often the "hottest" component in the hot spot
Thermal Image of Typical Cluster Rack

[Thermal image: the rack switch stands out as a hot spot.]

M.K. Patterson, A. Pratt, P. Kumar, "From UPS to Silicon: an end-to-end evaluation of datacenter efficiency", Intel Corporation
DC Networking and Power
• Selectively power down ports/portions of network elements
• Enhanced power-awareness in the network stack
 – Power-aware routing and support for system virtualization
 – Support for datacenter "slice" power down and restart
 – Application- and power-aware media access/control
  • Dynamic selection of full/half duplex
  • Directional asymmetry to save power, e.g., 10Gb/s send, 100Mb/s receive
 – Power-awareness in applications and protocols
  • Hard state (proxying), soft state (caching), protocol/data "streamlining" for power as well as bandwidth reduction
• Power implications for topology design
 – Tradeoffs in redundancy/high-availability vs. power consumption
 – VLAN support for power-aware system virtualization
UC Berkeley
Thank you!
http://abovetheclouds.cs.berkeley.edu/