Clouds will win! CTS Conference 2011 Philadelphia May 23 2011 Geoffrey Fox gcf@indiana.edu http://www.infomall.org http://www.futuregrid.org Director, Digital Science Center, Pervasive Technology Institute; Associate Dean for Research and Graduate Studies, School of Informatics and Computing; Indiana University Bloomington




Page 1:

Clouds will win!
CTS Conference 2011
Philadelphia, May 23 2011

Geoffrey Fox
gcf@indiana.edu

http://www.infomall.org http://www.futuregrid.org

Director, Digital Science Center, Pervasive Technology Institute

Associate Dean for Research and Graduate Studies,  School of Informatics and Computing

Indiana University Bloomington

Page 2:

Important Trends

• Data Deluge in all fields of science
• Multicore implies parallel computing important again
  – Performance from extra cores – not extra clock speed
  – GPU enhanced systems can give big power boost

• Clouds – new commercially supported data center model replacing compute grids (and your general purpose computer center)

• Lightweight clients: sensors, smartphones and tablets accessing and supported by backend services in the cloud

• Commercial efforts moving much faster than academia in both innovation and deployment

Page 3:

Jaliya Ekanayake - School of Informatics and Computing

Big Data in Many Domains

According to one estimate, we created 150 exabytes (billion gigabytes) of data in 2005. This year, we will create 1,200 exabytes.

PCs have ~100 Gigabytes of disk and 4 Gigabytes of memory

Size of the web ~ 3 billion web pages; MapReduce at Google was on average processing 20 PB per day in January 2008

During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years' worth of video footage http://www.economist.com/node/15579717 New models being deployed this year will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many

~108 million sequence records in GenBank in 2009, doubling every 18 months

~20 million purchases at Wal-Mart a day

90 million Tweets a day

Astronomy, Particle Physics, Medical Records …

Most scientific tasks show a CPU:IO ratio of 10000:1 – Dr. Jim Gray

The Fourth Paradigm: Data-Intensive Scientific Discovery
Large Hadron Collider at CERN; 100 Petabytes to find Higgs Boson

Page 4:
Page 5:

Data Centers Clouds & Economies of Scale I

Range in size from “edge” facilities to megascale.

Economies of scale: approximate costs for a small size center (1K servers) and a larger, 50K server center.

Each data center is 11.5 times the size of a football field

Technology        Cost in small-sized Data Center    Cost in Large Data Center      Ratio
Network           $95 per Mbps/month                 $13 per Mbps/month             7.1
Storage           $2.20 per GB/month                 $0.40 per GB/month             5.7
Administration    ~140 servers/Administrator         >1000 servers/Administrator    7.1

2 Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon. Such centers use 20MW-200MW (Future) each with 150 watts per CPU. Save money from large size, positioning with cheap power and access with Internet.
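For scale, simple arithmetic on the numbers above: at 150 watts per CPU, a 20 MW center corresponds to roughly 20,000,000 W / 150 W ≈ 130,000 CPUs, and a 200 MW center to about ten times that.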

Page 6:


Data Centers, Clouds & Economies of Scale II

• Builds giant data centers with 100,000s of computers; ~200-1000 to a shipping container with Internet access

• “Microsoft will cram between 150 and 220 shipping containers filled with data center gear into a new 500,000 square foot Chicago facility. This move marks the most significant, public use of the shipping container systems popularized by the likes of Sun Microsystems and Rackable Systems to date.”

Page 7:

X as a Service
• SaaS: Software as a Service implies software capabilities (programs) have a service (messaging) interface
  – Applying systematically reduces system complexity to being linear in number of components
  – Access via messaging rather than by installing in /usr/bin (illustrated in the sketch after this list)
• IaaS: Infrastructure as a Service or HaaS: Hardware as a Service – get your computer time with a credit card and with a Web interface
• PaaS: Platform as a Service is IaaS plus core software capabilities on which you build SaaS
• Cyberinfrastructure is “Research as a Service”
[Diagram: Clients and Other Services layered on these service models]
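A minimal Java sketch of "access via messaging rather than installing in /usr/bin": the capability is invoked through its service (REST) interface over HTTP instead of as a local binary. The endpoint URL and query are hypothetical, made up for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: call a capability through its service interface (a message over HTTP)
// rather than running a locally installed program.
public class ServiceCall {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.org/align?sequence=ACGT");   // hypothetical SaaS endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");                                   // the "message" sent to the service
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) System.out.println(line);  // service reply
        }
    }
}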

Page 8:

Sensors as a Service
Cell phones are important sensors
[Diagram: Sensors as a Service feeding Sensor Processing as a Service (MapReduce)]

Page 9:

Clouds and Jobs
• Clouds are a major industry thrust with a growing fraction of IT expenditure that IDC estimates will grow to $44.2 billion direct investment in 2013, while 15% of IT investment in 2011 will be related to cloud systems with a 30% growth in the public sector.

• Gartner also rates cloud computing high on its list of critical emerging technologies, with, for example, “Cloud Computing” and “Cloud Web Platforms” rated as transformational (their highest rating for impact) in the next 2-5 years.

• Correspondingly there are and will continue to be major opportunities for new jobs in cloud computing, with a recent European study estimating there will be 2.4 million new cloud computing jobs in Europe alone by 2015.

• Cloud computing is an attractive area for projects focusing on workforce development. Note that the recently signed “America Competes Act” calls out the importance of economic development in the broader impact of NSF projects.

Page 10:

Gartner 2009 Hype Curve: Clouds, Web 2.0, Service Oriented Architectures
[Chart: Gartner hype curve with impact ratings Transformational, High, Moderate, Low; entries shown include Cloud Computing, Cloud Web Platforms and Media Tablet]

Page 11:

Philosophy of Clouds and Grids
• Clouds are (by definition) a commercially supported approach to large scale computing
  – So we should expect Clouds to replace Compute Grids
  – Current Grid technology involves “non-commercial” software solutions which are hard to evolve/sustain

• Public Clouds are broadly accessible resources like Amazon and Microsoft Azure – powerful but not easy to optimize, and with possible data trust/privacy issues

• Private Clouds run similar software and mechanisms but on “your own computers”

• Services are still the correct architecture, with either REST (Web 2.0) or Web Services

• Clusters still critical concept

Page 12:

Grids, MPI and Clouds
• Grids are useful for managing distributed systems
  – Pioneered service model for Science
  – Developed importance of Workflow
  – Performance issues – communication latency – intrinsic to distributed systems
  – Can never run large differential equation based simulations or datamining
• Clouds can execute any job class that was good for Grids plus
  – More attractive due to platform plus elastic on-demand model
  – MapReduce easier to use than MPI for appropriate parallel jobs
  – Currently have performance limitations due to poor affinity (locality) for compute-compute (MPI) and compute-data
  – These limitations are not “inevitable” and should gradually improve as in the July 13 2010 Amazon Cluster announcement
  – Will probably never be best for the most sophisticated parallel differential equation based simulations
• Classic Supercomputers (MPI Engines) run communication demanding differential equation based simulations
  – MapReduce and Clouds replace MPI for other problems
  – Much more data is processed today by MapReduce than MPI (Industry Information Retrieval ~50 Petabytes per day)

Page 13:

Fault Tolerance and MapReduce

• MPI does “maps” followed by “communication” including “reduce” but does this iteratively

• There must (for most communication patterns of interest) be a strict synchronization at the end of each communication phase
  – Thus if a process fails then everything grinds to a halt

• In MapReduce, all Map processes and all Reduce processes are independent and stateless and read and write to disks
  – With only 1 or 2 (reduce+map) iterations, there are no difficult synchronization issues

• Thus failures can easily be recovered by rerunning the failed process without other jobs hanging around waiting (see the sketch below)

• Re-examine MPI fault tolerance in light of MapReduce
  – Twister will interpolate between MPI and MapReduce
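To illustrate why stateless, disk-based tasks make failure handling simple, here is a minimal hypothetical Java sketch (not Hadoop or Twister code): a failed task is simply rerun from its input, while other tasks continue; contrast with MPI, where one failed rank stalls the synchronized communication phase.

import java.util.concurrent.Callable;

// Hypothetical illustration: a stateless task reads its input from disk and writes its output
// to disk, so recovery is just rerunning it; no other task has to wait.
public class RetryRunner {
    public static <T> T runWithRetry(Callable<T> statelessTask, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return statelessTask.call();   // re-reads its input, rewrites its output
            } catch (Exception e) {
                last = e;                      // no global barrier: nothing else grinds to a halt
            }
        }
        throw last;                            // surface the failure only after retries are exhausted
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithRetry(() -> "map output for partition 7", 3));
    }
}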

Page 14:

Components of a Scientific Computing Platform

Authentication and Authorization: Provide single sign in to both FutureGrid and Commercial Clouds linked by workflow

Workflow: Support workflows that link job components between FutureGrid and Commercial Clouds. Trident from Microsoft Research is initial candidate

Data Transport: Transport data between job components on FutureGrid and Commercial Clouds respecting custom storage patterns

Program Library: Store Images and other Program material (basic FutureGrid facility)
Blob: Basic storage concept similar to Azure Blob or Amazon S3
DPFS Data Parallel File System: Support of file systems like Google (MapReduce), HDFS (Hadoop) or Cosmos (Dryad) with compute-data affinity optimized for data processing
Table: Support of Table Data structures modeled on Apache HBase/CouchDB or Amazon SimpleDB/Azure Table. There are “Big” and “Little” tables – generally NOSQL
SQL: Relational Database
Queues: Publish-Subscribe based queuing system
Worker Role: This concept is implicitly used in both Amazon and TeraGrid but was first introduced as a high level construct by Azure (a minimal sketch follows this list)

MapReduce: Support MapReduce Programming model including Hadoop on Linux, Dryad on Windows HPCS and Twister on Windows and Linux

Software as a Service: This concept is shared between Clouds and Grids and can be supported without special attention

Web Role: This is used in Azure to describe important link to user and can be supported in FutureGrid with a Portal framework
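A minimal sketch of the Queue plus Worker Role pattern named above, using an in-memory java.util.concurrent queue as a stand-in (illustrative only, not the Azure or Amazon queue API): a worker role loops forever, pulling task messages from a queue and processing them.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical Worker Role: drain task messages from a queue and process each one.
public class WorkerRoleSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        queue.put("process blob-0001");                    // messages a web role or client might enqueue
        queue.put("process blob-0002");

        Thread worker = new Thread(() -> {                 // the "worker role"
            try {
                while (true) {
                    String task = queue.take();            // block until a task message arrives
                    System.out.println("worker handling: " + task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();        // shut down when interrupted
            }
        });
        worker.start();
        Thread.sleep(500);                                  // let the worker drain the queue
        worker.interrupt();
    }
}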

Page 15:

Amazon offers a lot!

Page 16:

Traditional File System?

• Typically a shared file system (Lustre, NFS …) used to support high performance computing

• Big advantages in flexible computing on shared data but doesn’t “bring computing to data”

[Diagram: a Compute Cluster of C nodes connected to separate Storage Nodes (data servers) and Archive storage]

Page 17:

Data Parallel File System?

• No archival storage and computing brought to data

[Diagram: File1 is broken up into Block1 … BlockN; each block is replicated and stored on the compute nodes (C + Data), so computation runs where the data is]
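A minimal sketch of the block/replication idea above (hypothetical names, not HDFS or Cosmos code): split a file into fixed-size blocks and assign each block to several nodes, so a map task can later be scheduled where a replica already lives.

import java.util.*;

// Hypothetical data-parallel file placement: fixed-size blocks, each replicated
// on several compute nodes (compute-data affinity).
public class BlockPlacement {
    public static Map<Integer, List<Integer>> place(long fileSize, long blockSize,
                                                    int numNodes, int replication) {
        Map<Integer, List<Integer>> blockToNodes = new HashMap<>();
        int numBlocks = (int) ((fileSize + blockSize - 1) / blockSize);
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++)
                nodes.add((b + r) % numNodes);       // spread replicas over distinct nodes
            blockToNodes.put(b, nodes);              // a map task for block b runs on one of these nodes
        }
        return blockToNodes;
    }

    public static void main(String[] args) {
        System.out.println(place(1_000_000_000L, 64_000_000L, 8, 3));
    }
}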

Page 18:

Important Platform Capability: MapReduce
• Implementations (Hadoop – Java; Dryad – Windows) support:
  – Splitting of data
  – Passing the output of map functions to reduce functions
  – Sorting the inputs to the reduce function based on the intermediate keys
  – Quality of service

[Diagram: Data Partitions → Map(Key, Value) → Reduce(Key, List<Value>) → Reduce Outputs]
A hash function maps the results of the map tasks to reduce tasks
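To make the Map(Key, Value) / Reduce(Key, List<Value>) pattern and the hash partitioning concrete, here is a minimal self-contained Java sketch (illustrative only, not the Hadoop or Dryad API) that counts words across input partitions and uses a hash of each intermediate key to pick its reduce task.

import java.util.*;

// Minimal word-count illustration of the MapReduce pattern (not a real runtime).
public class MiniMapReduce {
    static final int NUM_REDUCERS = 4;

    // Map(Key, Value): emit (word, 1) for every word in one input partition
    static List<Map.Entry<String, Integer>> map(String partition) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : partition.split("\\s+"))
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Reduce(Key, List<Value>): sum the counts for one word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] partitions = {"clouds will win", "clouds and grids"};
        List<Map<String, List<Integer>>> groups = new ArrayList<>();
        for (int i = 0; i < NUM_REDUCERS; i++) groups.add(new TreeMap<>());
        for (String p : partitions)
            for (Map.Entry<String, Integer> kv : map(p)) {
                int r = Math.abs(kv.getKey().hashCode()) % NUM_REDUCERS;  // hash picks the reduce task
                groups.get(r).computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        for (int r = 0; r < NUM_REDUCERS; r++)                            // each reduce task handles its keys
            for (Map.Entry<String, List<Integer>> e : groups.get(r).entrySet())
                System.out.println(e.getKey() + " -> " + reduce(e.getKey(), e.getValue()));
    }
}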

Page 19:

MapReduce “File/Data Repository” Parallelism

[Diagram: Instruments and Disks feed Map1, Map2, Map3; a Communication phase feeds Reduce; results go to Portals/Users]
Map = (data parallel) computation reading and writing data
Reduce = Collective/Consolidation phase e.g. forming multiple global sums as in histogram
Iterative MapReduce: Map Map Map Map … Reduce Reduce Reduce

Page 20:

DNA Sequencing Pipeline

[Pipeline diagram: sequencing instruments (Illumina/Solexa, Roche/454 Life Sciences, Applied Biosystems/SOLiD) send data over the Internet; a FASTA file of N sequences goes through Read Alignment and Blocking/Sequence alignment, forming block pairings and a Dissimilarity Matrix of N(N-1)/2 values; MDS and Pairwise clustering feed Visualization with Plotviz; MapReduce and MPI label the processing stages]
~300 million base pairs per day leading to ~3000 sequences per day per instrument? 500 instruments at ~0.5M$ each
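For context on the N(N-1)/2 figure: every unordered pair of sequences needs one dissimilarity value. A tiny illustrative Java sketch (hypothetical, not the pipeline code) enumerating those pairs:

// Illustrative only: count the N(N-1)/2 unordered pairs that need a dissimilarity value.
public class PairCount {
    public static void main(String[] args) {
        int n = 5;                               // number of sequences
        int pairs = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)      // each unordered pair (i, j) once
                pairs++;
        System.out.println(pairs + " pairs = n*(n-1)/2 = " + (n * (n - 1) / 2));
    }
}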

Page 21:

Cap3 Cost
[Chart: Cost ($) versus Num. Cores * Num. Files (64*1024 through 192*3072) comparing Azure MapReduce, Amazon EMR and Hadoop on EC2]

Page 22:

SWG Cost
[Chart: Cost ($) versus Num. Cores * Num. Blocks (64*1024 through 192*3072) comparing AzureMR, Amazon EMR and Hadoop on EC2]

Page 23:

Twister v0.9 (March 15, 2011)

New Interfaces for Iterative MapReduce Programming
http://www.iterativemapreduce.org/

SALSA Group

Bingjing Zhang, Yang Ruan, Tak-Lon Wu, Judy Qiu, Adam Hughes, Geoffrey Fox, Applying Twister to Scientific Applications, Proceedings of IEEE CloudCom 2010 Conference, Indianapolis, November 30-December 3, 2010

Twister4Azure released May 2011: http://salsahpc.indiana.edu/twister4azure/
MapReduceRoles4Azure available for some time at http://salsahpc.indiana.edu/mapreduceroles4azure/

Page 24:

K-Means Clustering
• Iteratively refining operation
• Typical MapReduce runtimes incur extremely high overheads
  – New maps/reducers/vertices in every iteration
  – File system based communication
• Long running tasks and faster communication in Twister enable it to perform close to MPI
[Chart: Time for 20 iterations]
[Diagram: map tasks compute the distance to each data point from each cluster center and assign points to cluster centers; a reduce task computes new cluster centers; the User program drives the iteration]
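A minimal sketch of the iteration the diagram describes, in plain Java (illustrative only, not the Twister API; 1-D points for brevity): the "map" step assigns each point to its nearest center, the "reduce" step averages the assigned points into new centers, and the loop repeats.

import java.util.Arrays;

// Illustrative K-means iteration in the map/reduce style (not Twister code).
public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9};
        double[] centers = {0.0, 6.0};                        // initial cluster centers
        for (int iter = 0; iter < 20; iter++) {               // e.g. "Time for 20 iterations"
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            // "map": assign each point to its nearest center
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) best = c;
                sum[best] += p;
                count[best]++;
            }
            // "reduce": compute new cluster centers from the partial sums
            for (int c = 0; c < centers.length; c++)
                if (count[c] > 0) centers[c] = sum[c] / count[c];
        }
        System.out.println("Final centers: " + Arrays.toString(centers));
    }
}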

Page 25:

Twister

• Streaming based communication
• Intermediate results are directly transferred from the map tasks to the reduce tasks – eliminates local files
• Cacheable map/reduce tasks
• Static data remains in memory
• Combine phase to combine reductions
• User Program is the composer of MapReduce computations
• Extends the MapReduce model to iterative computations

[Architecture diagram: the User Program and MRDriver distribute Data Splits through a Pub/Sub Broker Network to Worker Nodes, each running Map Workers (M), Reduce Workers (R) and an MRDaemon with Data Read/Write to the File System; the programming model is Configure() with static data, iterated Map(Key, Value), Reduce(Key, List<Value>) and Combine(Key, List<Value>) with δ flow back to the User Program, and Close()]

Different synchronization and intercommunication mechanisms used by the parallel runtimes

Page 26:

Twister4Azure early results
[Chart: Parallel Efficiency (0-100%) versus Number of Query Files (128 to 728) for a BLAST application, comparing Hadoop-Blast, EC2-ClassicCloud-Blast, DryadLINQ-Blast and AzureTwister]