Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR IEEE e-Science 2010 Conference 7 - 10 DECEMBER 2010


The Million Server Datacenter

HPC and Clouds – Select Comparisons

• Node and system architectures
• Communication fabric
• Storage systems and analytics
• Physical plant and operations
• Programming models (rest of tutorial)

HPC Node Architecture
Moore's "Law" favored commodity systems:
• Specialized processors and systems faltered
• "Killer micros" and industry-standard blades led
• Inexpensive clusters now dominate

www.top500.org

HPC Interconnects
• Ethernet for low end (cost sensitive)
• High-end expectations:
  • Nearly flat networks and very large switches
  • Operating system bypass for low latency (microseconds)

www.top500.org

Modern Data Center Network

(Diagram: Internet connects to CRs at the data center Layer 3 border; ARs sit below, then L2 switches and load balancers, then 20-server racks; GigE and 10 GigE links.)

Key:
• CR (L3 Border Router)
• AR (L3 Access Router)
• S (L2 Switch)
• LB (Load Balancer)
• A (20-server rack, TOR)

HPC Storage Systems
• Local disk: scratch or non-existent
• Secondary storage: SAN and parallel file systems; hundreds of TBs (at most)
• Tertiary storage: tape robot(s); 3–5 GB/s bandwidth

www.nersc.gov — ~60 PB capacity

HPC and Clouds – Select Comparisons

• Node and system architectures
• Communication fabric
• Storage systems and analytics
• Physical plant and operations
• Programming models (rest of tutorial)

A Tour Around Windows Azure

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November 2010 Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model


Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances | Worker Role Instances

Azure Storage: Blob, Queue, Table

SQL Azure

Windows Azure Application Model

Key Components
Fabric Controller:
• Manages hardware and virtual machines for service
Compute:
• Web Roles – web application front end
• Worker Roles – utility compute
• VM Roles – custom compute role; you own and customize the VM
Storage:
• Blobs – binary objects
• Tables – entity storage
• Queues – role coordination
• SQL Azure – SQL in the cloud

Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines

Key Components: Fabric Controller
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram, written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action — perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
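As a rough sketch of how the two files divide the work — the role names, sizes, endpoint, and setting below are hypothetical examples, not taken from the deck:

```xml
<!-- ServiceDefinition.csdef: the "shape" of the service (hypothetical example) -->
<ServiceDefinition name="MyService">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
  </WebRole>
  <WorkerRole name="Worker" vmsize="Large" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings that configure that shape -->
<ServiceConfiguration serviceName="MyService">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="StorageAccountKey" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

The definition rarely changes after deployment; the configuration (e.g. instance count) can be updated on a running service.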

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API — easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced
Durable Storage at Massive Scale

Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default) – will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, committed back into Big.mpg.)
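The upload-then-commit behavior above can be sketched in a few lines of Python. The `BlockStore` class is a hypothetical stand-in for the REST operations (Put Block / Put Block List), not the real Azure client library:

```python
# Sketch of the block-blob pattern: upload named blocks in any order,
# then commit a block list that fixes their final order in the blob.

class BlockStore:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.blobs = {}         # blob_name -> bytes

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, blob_name, block_ids):
        # Blocks may arrive in any order; the commit list defines the blob.
        self.blobs[blob_name] = b"".join(self.uncommitted[b] for b in block_ids)

store = BlockStore()
# Parallel uploads can finish out of order.
store.put_block("blk-2", b"world")
store.put_block("blk-1", b"hello ")
store.put_block_list("big.mpg", ["blk-1", "blk-2"])
print(store.blobs["big.mpg"])  # b'hello world'
```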


Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The Page Blob persists even when the drive is not mounted

Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table Name: Movies — entities: Star Wars, Star Trek, Fan Boys
• Table Name: Customers — entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties: Timestamp, PartitionKey, RowKey.
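A table entity is essentially a property bag plus those three system properties; the schema can vary per entity, as noted above. A minimal Python sketch (the helper and sample values are illustrative, not a real client API):

```python
# Sketch of a table entity: three required system properties plus an
# open-ended set of application properties that may differ per entity.
import datetime

def make_entity(partition_key, row_key, **properties):
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        # In the real service the Timestamp is assigned server-side.
        "Timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    entity.update(properties)
    return entity

movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
customer = make_entity("1", "Customer-John Smith", Name="John Smith")
# Different schemas can coexist in the same table.
assert {"PartitionKey", "RowKey", "Timestamp"} <= movie.keys()
```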

Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance.

• Partitioning is different for each data type (blobs, entities, queues)
Every data object has a partition key:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Or the single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition.
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name: every blob and its snapshots are in a single partition.
Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Messages – Queue name: all messages for a single queue belong to the same partition.
Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
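The three partitioning rules above can be written down directly. This is a toy model for reasoning about locality, not anything the service exposes:

```python
# Sketch of how each abstraction maps objects to partitions:
# tables by (TableName, PartitionKey), blobs by (Container, BlobName),
# queues by queue name alone.

def table_partition(table, entity):
    return (table, entity["PartitionKey"])

def blob_partition(container, blob_name):
    # Every blob (and its snapshots) forms its own partition.
    return (container, blob_name)

def queue_partition(queue_name):
    # All messages in one queue share a single partition.
    return (queue_name,)

rows = [{"PartitionKey": "1", "RowKey": "Customer-John Smith"},
        {"PartitionKey": "1", "RowKey": "Order - 1"}]
# Same PartitionKey -> same partition -> entity group transactions possible.
assert table_partition("Customers", rows[0]) == table_partition("Customers", rows[1])
```

Entities that must be updated atomically together should therefore share a PartitionKey, while load that must spread out should not.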

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account:
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see a '503 Server Busy'; applications should implement exponential backoff.
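A minimal sketch of that exponential backoff on '503 Server Busy'. The `request` callable is a placeholder for a storage call; real code would sleep for each delay and read the status from the HTTP response:

```python
# Retry with exponentially growing delays until the request stops
# returning 503, or give up after max_retries attempts.

def with_backoff(request, max_retries=5, base_delay=0.5):
    delays = []
    for attempt in range(max_retries):
        status = request()
        if status != 503:
            return status, delays
        delays.append(base_delay * (2 ** attempt))  # 0.5, 1, 2, 4, ...
    raise RuntimeError("server still busy after retries")

responses = iter([503, 503, 200])          # simulated server answers
status, delays = with_backoff(lambda: next(responses))
assert status == 200 and delays == [0.5, 1.0]
```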

Example: a Movies table partitioned by Category

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Partitions and Partition Ranges

Initially: Server A serves Table = Movies [Min – Max]
After a split:
• Server A serves Table = Movies [Min – Comedy)
• Server B serves Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query returns a continuation token when it hits:
• The maximum of 1000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query
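The standard pattern is a loop that keeps re-issuing the query with the returned token until the token comes back empty. Here `fake_query` stands in for the table service, simulating the 1000-row cap:

```python
# Continuation-token loop: a query may stop early and hand back a token;
# keep querying until the token is None to collect the full result set.

ROWS = list(range(2500))
PAGE = 1000  # server-side cap per response

def fake_query(token):
    start = token or 0
    page = ROWS[start:start + PAGE]
    next_token = start + PAGE if start + PAGE < len(ROWS) else None
    return page, next_token

results, token = [], None
while True:
    page, token = fake_query(token)
    results.extend(page)
    if token is None:
        break
assert len(results) == 2500
```

Code that ignores the token silently sees only the first page, which is why the slide says "seriously".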

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Guidance:
• Select a PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
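The work ticket pattern mentioned above keeps queue messages small: the payload lives in blob storage and the message carries only a reference. A sketch, with plain dicts and a `queue.Queue` standing in for the Azure services:

```python
# Work ticket pattern: enqueue a small reference ("ticket") while the
# large payload sits in blob storage, staying under the 8 KB message limit.
import queue

blobs = {}                      # stand-in for blob storage
work_queue = queue.Queue()      # stand-in for an Azure queue

def submit(job_id, big_payload):
    blobs[job_id] = big_payload          # store the large data once
    work_queue.put({"ticket": job_id})   # enqueue only the reference

def worker():
    msg = work_queue.get()
    payload = blobs[msg["ticket"]]       # dereference the ticket
    return len(payload)                  # 'process' the job

submit("job-1", b"x" * 100_000)          # far larger than 8 KB
assert worker() == 100_000
```

Remember to garbage collect the blob after the worker finishes, or orphaned payloads accumulate.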

Queue Terminology

Message Lifecycle
(Diagram: a Web Role calls PutMessage to put messages (Msg 1–4) on the queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1

(Diagram: consumers C1 and C2 polling the queue at growing intervals.)
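The interval rule above fits in one function; the truncation cap of 64 below is an illustrative choice, not from the deck:

```python
# Truncated exponential back-off polling: double the sleep interval on
# each empty poll, cap it, and reset to 1 on a successful poll.

def next_interval(current, got_message, cap=64):
    if got_message:
        return 1
    return min(current * 2, cap)

interval, observed = 1, []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    observed.append(interval)
assert observed == [2, 4, 8, 1, 2]
```

This keeps transaction costs low on idle queues while staying responsive once work shows up.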

Removing Poison Messages

Scenario with producers P1, P2 and consumers C1, C2 on queue Q:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
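The scenario's final steps (12–13) are the poison-message rule: once a message's dequeue count passes a threshold, delete it instead of processing it again. A sketch, with a dict standing in for the queue message and its DequeueCount property:

```python
# Poison-message handling: a message that keeps reappearing after consumer
# crashes is deleted once its dequeue count exceeds a threshold.

MAX_DEQUEUE = 2

def handle(message, process):
    message["DequeueCount"] += 1
    if message["DequeueCount"] > MAX_DEQUEUE:
        return "deleted-as-poison"   # DeleteMessage without processing
    try:
        process(message)
        return "processed"           # then DeleteMessage as usual
    except RuntimeError:
        return "requeued"            # becomes visible again after the timeout

def crashing_process(msg):
    raise RuntimeError("consumer crashed")

msg = {"id": "msg1", "DequeueCount": 0}
outcomes = [handle(msg, crashing_process) for _ in range(3)]
assert outcomes == ["requeued", "requeued", "deleted-as-poison"]
```

In practice the poison message would be logged or moved to a side queue for inspection rather than silently dropped.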

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up the CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
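The deck recommends .NET 4's Task Parallel Library; an analogous data-parallel sketch in Python uses a thread pool to fan a computation across chunks, the way you would keep the cores of a larger instance busy (the `score` workload is a made-up example):

```python
# Data parallelism: partition the input, map a worker over the chunks in
# a pool, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def score(chunk):
    # Hypothetical unit of work standing in for real per-chunk computation.
    return sum(x * x for x in chunk)

data = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(score, data))  # one task per chunk
total = sum(partials)
assert total == sum(x * x for x in range(1000))
```

For CPU-bound Python work a process pool would be the better fit; the structure is the same.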

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure or poor user experience from not having excess capacity against the cost of having idling VMs

Performance vs. Cost

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.)
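Point 1 in practice: repetitive text output (HTML, JSON, JavaScript) compresses dramatically. A quick Python demonstration with the standard gzip module:

```python
# Gzip-compress a page before it goes over the wire; browsers inflate it
# transparently, so bandwidth (and often storage) cost drops sharply.
import gzip

page = b"<html>" + b"<div>hello azure</div>" * 500 + b"</html>"
wire = gzip.compress(page)

assert len(wire) < len(page) / 10          # large saving on repetitive text
assert gzip.decompress(wire) == page       # lossless round trip
```

The compute spent compressing is usually far cheaper than the bandwidth saved, which is the trade-off point 2 describes.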

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large-volume data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST
• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow
A simple split/join pattern.

Leverage the multi-core of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting in case of instance failure

(Diagram: a splitting task fans the input out to many BLAST tasks, and a merging task joins their results.)
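The split/join task flow can be sketched directly — split the input sequences into fixed-size partitions (the micro-benchmarks below found ~100 sequences per partition best), fan them out, then merge. `blast_task` is a placeholder for running NCBI-BLAST on a worker:

```python
# Query segmentation: split, fan out, merge — AzureBLAST's data-parallel
# pattern in miniature.

def split(sequences, per_partition=100):
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def blast_task(partition):
    # Placeholder for invoking NCBI-BLAST on one partition of queries.
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    return [hit for part in results for hit in part]

seqs = [f"seq{i}" for i in range(250)]
partitions = split(seqs)
assert len(partitions) == 3              # 100 + 100 + 50
merged = merge(blast_task(p) for p in partitions)
assert len(merged) == 250
```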

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

(Diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, scaling engine, and database-updating role; worker roles pull tasks from a global dispatch queue; an Azure Table holds the job registry; Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc. A splitting task fans out BLAST tasks that a merging task joins.)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory states

(Diagram: the job portal fronts the web portal and web service; job registration feeds the job scheduler, scaling engine, and job registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each will be submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances arise, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades
North Europe Data Center: in total, 34,256 tasks processed.
• All 62 compute nodes lost tasks and then came back in a group — this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed, and the job was killed.
• 35 nodes experienced a blob-writing failure at the same time
• A reasonable guess: the fault domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky

• Lots of inputs: big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
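The Penman-Monteith formula above reduces to a few lines of code. This is an illustrative sketch; the sample input values are assumptions, not from the slides, and λv is taken in J/kg (2.45e6 J/kg ≈ 2450 J/g) so the result is a mass flux in kg m⁻² s⁻¹:

```python
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2.45e6):
    """ET = (delta*Rn + rho_a*c_p*dq*g_a) / (lambda_v*(delta + gamma*(1 + g_a/g_s)))"""
    return (delta * Rn + rho_a * c_p * dq * g_a) / \
           (lambda_v * (delta + gamma * (1.0 + g_a / g_s)))

# Illustrative mid-day values (assumed): delta=145 Pa/K, Rn=400 W/m^2,
# rho_a=1.2 kg/m^3, c_p=1013 J/(kg K), dq=1000 Pa, g_a=0.02 m/s, g_s=0.01 m/s
et = penman_monteith_et(145.0, 400.0, 1.2, 1013.0, 1000.0, 0.02, 0.01)
print("%.2e kg m^-2 s^-1" % et)  # on the order of 1e-4, roughly 8 mm of water/day
```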

ET Synthesizes Imagery, Sensors, Models, and Field Data

NASA MODIS imagery source archives: 5 TB (600K files)

FLUXNET curated sensor dataset: 30 GB (960 files)

FLUXNET curated field dataset: 2 KB (1 file)

NCEP/NCAR: ~100 MB (4K files)

Vegetative clumping: ~5 MB (1 file)

Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download Queue

Scientists

Science results

Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables

<PipelineStage> Request

… <PipelineStage>JobStatus

Persist <PipelineStage>Job Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse & Persist <PipelineStage>TaskStatus

hellip

Dispatch <PipelineStage>Task Queue

MODISAzure Architectural Big Picture (2/2)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse & Persist <PipelineStage>TaskStatus

GenericWorker (Worker Role)

hellip

hellip

Dispatch <PipelineStage>Task Queue

hellip

<Input>Data Storage

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Example Pipeline Stage: Reprojection Service

Reprojection Request …

Service Monitor (Worker Role)

Persist ReprojectionJobStatus

Parse & Persist ReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (i.e., a single tile)

Query this table to get geo-metadata (e.g., boundaries) for each swath tile

Query this table to get the list of satellite scan times that cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times

• Storage costs driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download Queue

Scientists

Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage

400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers

$50 upload, $450 storage

400 GB, 45K files, 3500 hours, 20-100 workers

5-7 GB, 55K files, 1800 hours, 20-100 workers

<10 GB, ~1K files, 1800 hours, 20-100 workers

$420 CPU, $60 download

$216 CPU, $1 download, $6 storage

$216 CPU, $2 download, $9 storage

AzureMODIS Service Web Role Portal

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com



Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model


Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances / Worker Role Instances

Azure Storage: Blob, Queue, Table

SQL Azure

Windows Azure Application Model

Key Components

Fabric Controller

• Manages hardware and virtual machines for the service

Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM

Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
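The queue-as-glue model above can be sketched with an in-process queue standing in for an Azure queue. The role names and ticket format here are illustrative assumptions, not an Azure API:

```python
import queue

# Work-ticket pattern sketch: the queue carries small "tickets" (here a blob
# name) rather than the payload itself; queue.Queue stands in for an Azure queue.
work = queue.Queue()

def web_role_enqueue(blob_name):
    work.put({"blob": blob_name})            # PutMessage

def worker_role_step():
    ticket = work.get(timeout=1)             # GetMessage
    result = "processed:" + ticket["blob"]   # do the actual work here
    work.task_done()                         # RemoveMessage on success
    return result

web_role_enqueue("movies/barga.mpg")
print(worker_role_step())  # processed:movies/barga.mpg
```

Because the web role only ever touches the queue, worker instances can be added or removed without reconfiguring the front end.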

Key Components – Compute: VM Roles

• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • Base OS
    • Differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action, perhaps even relocate your app
• At all times the 'diagram' stays whole

Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double-click on the Role Name in the Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale

Blob – massive files, e.g., videos, logs

Drive – use standard file system APIs

Tables – non-relational, but with few scale limits; use SQL Azure for relational data

Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob
    • Inserts a new blob, overwrites an existing blob
  • GetBlob
    • Get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – default; will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

Blocks

[Diagram: blocks of Big.mpg (1 6 8 3 5 4 7 2) uploaded in arbitrary order, then committed in order as Big.mpg]
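The upload-blocks-then-commit behavior can be illustrated with an in-memory stand-in. This sketch mirrors the Put Block / Put Block List semantics described above but is not the real blob API:

```python
# In-memory sketch of the block-blob model: blocks arrive in any order,
# and the committed block list defines the final byte order of the blob.
class BlockBlob:
    def __init__(self):
        self._uncommitted = {}   # block_id -> bytes, pre-commit staging area
        self._data = b""         # committed blob content

    def put_block(self, block_id, data):
        self._uncommitted[block_id] = data

    def put_block_list(self, ordered_ids):
        # Commit: concatenate blocks in the order given, not arrival order.
        self._data = b"".join(self._uncommitted[i] for i in ordered_ids)
        self._uncommitted.clear()

blob = BlockBlob()
for bid, chunk in [("b2", b"world"), ("b1", b"hello ")]:  # out-of-order arrival
    blob.put_block(bid, chunk)
blob.put_block_list(["b1", "b2"])  # commit order defines the blob
print(blob._data)  # b'hello world'
```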


Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists even when not mounted as a Page Blob

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Star Wars / Star Trek / Fan Boys

Table Name: Movies

Brian H Prince / Jason Argonaut / Bill Gates

Table Name: Customers

Account

Table

Entity

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)

Every data object has a partition key
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name; every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor-bighouse.jpg
image          | foxborough-gillette.jpg
video          | annarbor-bighouse.jpg

• Messages – Queue name; all messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
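A small helper makes the partitioning rules above concrete. This is an illustrative sketch, not an Azure API:

```python
# Partition key per storage abstraction, as described above:
# entities by (table, PartitionKey), blobs by (container, blob), messages by queue.
def partition_key(kind, **names):
    if kind == "entity":
        return (names["table"], names["partition_key"])
    if kind == "blob":
        return (names["container"], names["blob"])
    if kind == "message":
        return (names["queue"],)
    raise ValueError("unknown kind: %s" % kind)

# Same blob name in different containers -> different partitions:
print(partition_key("blob", container="image", blob="annarbor-bighouse.jpg"))
print(partition_key("blob", container="video", blob="annarbor-bighouse.jpg"))
```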

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

Server 1 | Server 2 | Server 3

P1, P2, …, Pn (each server holds a replica of every partition)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]

After a range split:
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
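A paging loop that honors continuation tokens might look like this sketch, with a fake in-memory "service" standing in for the Table service (`query_page` and its token shape are assumptions for illustration):

```python
# Fake service: returns at most page_size rows plus a token when more remain.
def query_page(rows, token, page_size=1000):
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

# Client loop: always keep fetching until the continuation token is None,
# even if a page comes back smaller than page_size (e.g. a partition boundary).
def query_all(rows, page_size=1000):
    results, token = [], None
    while True:
        page, token = query_page(rows, token, page_size)
        results.extend(page)
        if token is None:
            return results

data = list(range(2500))
print(len(query_all(data)))  # 2500, fetched in 3 round trips (1000 + 1000 + 500)
```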

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
  • AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) / RemoveMessage

Msg 2, Msg 1

Worker Role

Msg 2

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x (up to a cap), and a successful poll sets the interval back to 1.
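The back-off rule above is essentially a one-liner; a sketch (the cap of 60 seconds is an assumed value for illustration):

```python
def next_poll_interval(current, empty, base=1.0, cap=60.0):
    """Truncated exponential back-off: each empty poll doubles the
    interval up to `cap`; a successful poll resets it to `base`."""
    return min(current * 2, cap) if empty else base

interval = 1.0
for _ in range(8):                                  # eight empty polls in a row
    interval = next_poll_interval(interval, empty=True)
print(interval)                                     # 60.0 (truncated at the cap)
print(next_poll_interval(interval, empty=False))    # 1.0 (reset on success)
```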

Removing Poison Messages

Producers P1, P2; consumers C1, C2.
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages

Producers P1, P2; consumers C1, C2.
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages

Producers P1, P2; consumers C1, C2.
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)

Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
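The dequeue-count rule for poison messages can be sketched like this in-memory simulation; the threshold of 2 matches the diagram above, and the function names are illustrative:

```python
MAX_DEQUEUE = 2  # matches the "DequeueCount > 2" check in the diagram

def handle(message, process):
    """Returns 'processed', 'poisoned', or 'requeued'."""
    message["dequeue_count"] += 1
    if message["dequeue_count"] > MAX_DEQUEUE:
        return "poisoned"       # DeleteMessage and log the body for inspection
    try:
        process(message)
        return "processed"      # DeleteMessage on success
    except Exception:
        return "requeued"       # message becomes visible again after the timeout

def crash(message):             # a worker that always fails on this message
    raise RuntimeError("simulated worker crash")

msg = {"body": "msg 1", "dequeue_count": 0}
print([handle(msg, crash) for _ in range(3)])
# ['requeued', 'requeued', 'poisoned'] — two failed attempts, then deletion
```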

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance = one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library

• Data parallelism

• Task parallelism
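The deck points at .NET's Task Parallel Library; as a language-neutral sketch under the same idea, here is the data-parallel case (one operation fanned out over many items) with a standard worker pool. The `score` function is a hypothetical stand-in for the per-item work a role would do:

```python
from concurrent.futures import ThreadPoolExecutor  # ProcessPoolExecutor for CPU-bound work

def score(seq):
    # Hypothetical per-item work; in a worker role this might be one alignment.
    return sum(ord(c) for c in seq)

def score_all(seqs, workers=4):
    # Data parallelism: the same operation applied to every item, fanned out
    # over a pool; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, seqs))
```

Task parallelism is the same pool with heterogeneous callables submitted via `pool.submit`.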

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be good neighbors for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between risk of failure/poor user experience due to not having excess capacity and the costs of having idling VMs
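Combining this with the earlier "use message count to scale" advice, a scaling policy can be sketched as a pure function from queue depth to instance count. All thresholds here are illustrative assumptions, not Azure defaults; the asymmetry (scale up eagerly, down conservatively) reflects the boot-time and user-experience trade-off above:

```python
def target_instances(queue_depth, current, msgs_per_instance=100,
                     min_instances=2, max_instances=20):
    """Pick an instance count from queue depth, with simple hysteresis.

    Thresholds are illustrative; VMs take minutes to boot and cost money
    while idle, so scale up immediately and down only on a clear gap.
    """
    desired = max(min_instances, -(-queue_depth // msgs_per_instance))  # ceil division
    if desired > current:
        return min(desired, max_instances)   # scale up immediately
    if desired < current - 1:
        return max(desired, min_instances)   # scale down only past a dead band
    return current                           # inside the dead band: hold steady
```

The one-instance dead band keeps the service from flapping when the queue hovers near a threshold.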

Performance & Cost

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction

• Service choice can make a big cost difference based on your app profile

• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

Uncompressed Content → Gzip / Minify JavaScript / Minify CSS / Minify Images → Compressed Content
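The gzip step is cheap to demonstrate end to end. This sketch compresses a repetitive HTML payload the way a web role would before responding to a client whose Accept-Encoding includes gzip; the payload itself is made up for the example:

```python
import gzip

def gzip_bytes(payload: bytes) -> bytes:
    # What a web role would do before answering a gzip-capable client;
    # mtime=0 keeps the output deterministic across runs.
    return gzip.compress(payload, mtime=0)

# Repetitive markup (like most HTML) compresses very well.
html = b"<ul>" + b"".join(b"<li>row %d</li>" % i for i in range(200)) + b"</ul>"
packed = gzip_bytes(html)
ratio = len(packed) / len(html)
```

Fewer bytes on the wire is both a bandwidth saving and a perceived-latency win; the trade is the CPU spent compressing, per point 2 above.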

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input
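The input-segmentation step above can be sketched directly: split FASTA-formatted input at each `>` header and group the sequences into fixed-size partitions that can be queued as independent tasks. The partition size is a parameter (a later slide reports ~100 sequences per partition worked best); this is an illustrative sketch, not the AzureBLAST code:

```python
def split_fasta(text, per_partition=100):
    """Split FASTA input into partitions of `per_partition` sequences.

    Each '>' header starts a new sequence; partitions can then be queued
    as independent BLAST tasks and the results merged afterwards.
    """
    seqs, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            seqs.append("\n".join(current))   # close the previous record
            current = []
        current.append(line)
    if current:
        seqs.append("\n".join(current))
    return ["\n".join(seqs[i:i + per_partition])
            for i in range(0, len(seqs), per_partition)]
```

Because each partition is self-contained, querying is pleasingly parallel; only the merge step needs to see all results.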

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow
A simple Split/Join pattern

Leverage the multi-core capability of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead

Best practice: test runs to profile, and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

Worker / Worker

Worker / Worker

Worker / Worker

Global dispatch queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

(BLAST databases, temporary data, etc.)

Job Registry / NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe

• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service

• Divide 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions

• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB

• The number of total hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
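Detecting the "something is wrong" case above can be automated by pairing start and completion records. This sketch assumes only the log phrasing shown on the slide ("Executing the task N" / "Execution of task N is done") and reports task ids that started but never finished:

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(lines):
    """Return task ids that logged a start but no matching completion."""
    started, finished = set(), set()
    for line in lines:
        m = START.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the abnormal records above, this flags task 251774 (started at 8:22, never completed) while 251895 pairs up cleanly.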

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group: this is an update domain
~30 mins, ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed, and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
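The Penman-Monteith formula transcribes directly into code. This is a plain transcription of the equation with the symbol definitions above; the default values for γ and λv are the slide's ≈66 Pa/K and a typical latent heat in J/g, and any concrete inputs are illustrative:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith evapotranspiration.

    ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

    gamma defaults to ~66 Pa/K (psychrometric constant); lambda_v is the
    latent heat of vaporization in J/g. Units follow the slide's symbol list.
    """
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

In the pipeline this scalar formula is evaluated per pixel over reprojected MODIS tiles, which is why the derivation stage is a large data reduction.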

ET Synthesizes Imagery, Sensors, Models and Field Data

NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

Download Queue

Scientists

Science results

Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables

<PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Generic Worker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue → Generic Worker (Worker Role) → <Input> Data Storage

Example Pipeline Stage: Reprojection Service

Reprojection Request → Service Monitor (Worker Role) → Persist Reprojection JobStatus → Job Queue → Parse & Persist Reprojection TaskStatus → Dispatch → Task Queue → Generic Worker (Worker Role) → Reprojection Data Storage

Points to: ScanTimeList, SwathGranuleMeta

Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times

• Storage costs driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates


Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
Derivation reduction stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Application Model Comparison

Ad hoc application model:
Machines running IIS / ASP.NET
Machines running Windows Services
Machines running SQL Server

Windows Azure application model:
Web Role instances
Worker Role instances
Azure Storage (Blob, Queue, Table)
SQL Azure

Key Components

Fabric Controller
• Manages hardware and virtual machines for services

Compute
• Web Roles
• Web application front end
• Worker Roles
• Utility compute
• VM Roles
• Custom compute role; you own and customize the VM

Storage
• Blobs
• Binary objects
• Tables
• Entity storage
• Queues
• Role coordination
• SQL Azure
• SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of:
• Windows Server 2008
• A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers

• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
• Role types
• Role VM sizes
• External and internal endpoints
• Local storage
• The configuration settings configure a service:
• Instance count
• Storage keys
• Application-specific settings

• Manages "nodes" and "edges" in the "fabric" (the hardware):
• Power-on automation devices
• Routers, switches
• Hardware load balancers
• Physical servers
• Virtual servers
• State transitions:
• Current state
• Goal state
• Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
• Never sleeps
• Doesn't ever ask for a raise
• Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
• Protected space on the local drive, considered volatile storage
• May communicate with outside services
• Azure Storage
• SQL Azure
• Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role
• You own the box

• How it works:
• Download "Guest OS" to Server 2008 Hyper-V
• Customize the OS as you need to
• Upload the differencing VHD
• Azure runs your VM role using:
• Base OS
• Differencing VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy and manage that diagram for you:

• Find hardware homes

• Copy and launch your app binaries

• Monitor your app and the hardware

• In case of failure, take action; perhaps even relocate your app

• At all times the 'diagram' stays whole

Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:

ServiceDefinition.csdef
ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
• Encrypted package of your code
• Your config file

• You must create an Azure account, then a service, and then you deploy your code

• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:

1. Process service model
   1. Determine resource requirements
   2. Create role images

2. Allocate resources

3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles

4. Configure load balancers

5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

Blob – Massive files, e.g., videos, logs

Drive – Use standard file system APIs

Tables – Non-relational, but with few scale limits; use SQL Azure for relational data

Queues – Facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface:
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob

• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
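The addressing scheme is mechanical enough to sketch. This helper just mirrors the URL pattern shown on the slide; the account, container and blob names in the test come from the slide's own example:

```python
def blob_uri(storage_account, container, blob_name):
    # Mirrors the address scheme from the slide:
    # http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
    return "http://{0}.blob.core.windows.net/{1}/{2}".format(
        storage_account, container, blob_name)
```

Because the account name is part of the hostname, requests for different accounts can be routed at the DNS/load-balancer level before any storage server is involved.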

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private: the default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

• Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

Big.mpg: 1 6 8 3 5 4 7 2 → Big.mpg
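The upload-then-commit flow above can be simulated in memory. This is a stand-in for the real Put Block / Put Block List REST operations, not the Azure client: blocks land in an uncommitted staging area and only become the blob when the block list is committed, in the order the list specifies:

```python
import base64

def make_block_id(index):
    # Block ids are opaque Base64 strings of equal length within a blob.
    return base64.b64encode(b"block-%06d" % index).decode()

class BlockBlobStore:
    """In-memory stand-in for Put Block / Put Block List."""
    def __init__(self):
        self.uncommitted = {}
        self.blobs = {}

    def put_block(self, blob_name, block_id, data):
        self.uncommitted[(blob_name, block_id)] = data

    def put_block_list(self, blob_name, block_ids):
        # Committing assembles blocks in the order listed, not upload order.
        self.blobs[blob_name] = b"".join(
            self.uncommitted.pop((blob_name, bid)) for bid in block_ids)

store = BlockBlobStore()
data = b"0123456789" * 3
block_size = 10
ids = []
for i in range(0, len(data), block_size):
    bid = make_block_id(i // block_size)
    store.put_block("big.mpg", bid, data[i:i + block_size])
    ids.append(bid)
store.put_block_list("big.mpg", ids)
```

Because commit order is independent of upload order, blocks can be uploaded in parallel (the 1 6 8 3 5 4 7 2 picture above) and still assemble into the correct file.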


Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob

• Operations: Acquire, Renew, Release, Break

• Must have the lease ID to perform operations

• Can check the LeaseStatus property

• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount Page Blob as X:\
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>

• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists even when not mounted as a Page Blob

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD

• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance

• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using

• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive

• Unmount Drive – unmounts the drive and frees up the drive letter

• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)

• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
Table Name: Movies – entities: Star Wars, Star Trek, Fan Boys
Table Name: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows

• Highly available & durable
• Data is replicated several times

• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
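The entity shape above is easy to model: three mandatory properties plus arbitrary extra ones, so schema can vary row to row within one table. This sketch uses plain dicts (not the real table client) and also groups entities by PartitionKey, since entities sharing a PartitionKey live in the same partition; the `Studio` property and the movie rows are illustrative:

```python
from collections import defaultdict
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **props):
    # Every entity carries PartitionKey, RowKey and Timestamp; everything
    # else is free-form, so schema can vary within the same table.
    entity = {"PartitionKey": partition_key, "RowKey": row_key,
              "Timestamp": datetime.now(timezone.utc)}
    entity.update(props)
    return entity

def by_partition(entities):
    # Entities sharing a PartitionKey are served from the same partition.
    groups = defaultdict(list)
    for e in entities:
        groups[e["PartitionKey"]].append(e)
    return groups

movies = [
    make_entity("Action", "Fast & Furious", ReleaseDate=2009),
    make_entity("Action", "The Bourne Ultimatum", ReleaseDate=2007),
    make_entity("Animation", "Open Season 2", ReleaseDate=2009, Studio="Sony"),
]
```

Using the category as PartitionKey means a query for all Action titles touches one partition; the flip side is that one very hot category cannot be spread across servers.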

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch

• Programming semantics ensure that a message can be processed at least once

• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:
• Different for each data type (blobs, entities, queues)
• Every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy" responses:
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• May mean single-partition limits have been reached
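The "exponential backoff on Server Busy" advice can be sketched as a small retry helper. `ServerBusyError` and the injectable `sleep` are stand-ins for the real storage client's 503 exception and `time.sleep`:

```python
import random

class ServerBusyError(Exception):
    """Stand-in for the storage service's 503 'Server Busy' error."""

def retry_with_backoff(operation, max_attempts=6, base=0.5, cap=30.0, sleep=None):
    """Retry `operation` on 'Server Busy', doubling the delay each
    attempt and capping it (truncated exponential backoff)."""
    sleep = sleep or (lambda s: None)  # injectable for testing; use time.sleep in real code
    delay = base
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_attempts - 1:
                raise
            # Jitter spreads retries from many clients so they don't re-collide.
            sleep(random.uniform(0, delay))
            delay = min(delay * 2, cap)
```

The cap matters: without it, a long outage pushes the delay into minutes and the client effectively stops making progress.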

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + blob name. Every blob and its snapshots are in a single partition.

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages – Queue name. All messages for a single queue belong to the same partition.

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1 … Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges

Initially one server holds the whole table:
• Server A: Table = Movies [Min – Max]

As traffic grows, the system splits the range across servers:
• Server A: Table = Movies [Min – Comedy)
• Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned when any of these limits is hit:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
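Because a token can arrive before 1000 rows (or even with zero rows), the drain loop must key off the token, not the row count. A minimal sketch with a fake paged backend standing in for the table service:

```python
def query_all(fetch_page):
    """Drain a paged table query. `fetch_page(token)` returns
    (rows, next_token); next_token is None when the result set is done.
    Loop on the token, not on how many rows came back."""
    rows, token = fetch_page(None)
    results = list(rows)
    while token is not None:
        rows, token = fetch_page(token)
        results.extend(rows)
    return results

# A fake backend that pages 2500 rows in chunks of up to 1000.
DATA = list(range(2500))

def fake_fetch(token):
    start = token or 0
    page = DATA[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(DATA) else None
    return page, next_token
```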

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help you scale
• Avoid "append only" patterns – distribute by using a hash etc. as a key prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Decoupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
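The work ticket pattern mentioned above keeps queue messages under the 8 KB limit by storing big payloads in a blob and queuing only a reference. A sketch with in-memory stand-ins for blob and queue storage (real code would use the Storage Client Library or the REST API):

```python
import json
import uuid

# In-memory stand-ins for blob and queue storage.
blobs, queue = {}, []

MAX_QUEUE_MESSAGE_BYTES = 8 * 1024  # queue messages are limited to 8 KB

def enqueue_work(payload: bytes):
    """Small payloads go straight in the message; large payloads go to a
    blob and the queue carries only the 'work ticket' reference."""
    if len(payload) <= MAX_QUEUE_MESSAGE_BYTES:
        queue.append(json.dumps({"inline": payload.decode("utf-8")}))
    else:
        blob_name = f"work/{uuid.uuid4()}"
        blobs[blob_name] = payload
        queue.append(json.dumps({"blob": blob_name}))

def dequeue_work() -> bytes:
    ticket = json.loads(queue.pop(0))
    if "inline" in ticket:
        return ticket["inline"].encode("utf-8")
    payload = blobs[ticket["blob"]]
    del blobs[ticket["blob"]]  # garbage-collect the blob once consumed
    return payload
```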

Queue Terminology

Message Lifecycle (diagram): a Web Role calls PutMessage to add Msg 1–4 to the queue; a Worker Role calls GetMessage (with a visibility timeout) to retrieve a message, processes it, and calls RemoveMessage to delete it once processing succeeds.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
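The polling schedule above (double on empty, reset on success, truncate at a cap) can be expressed as a few lines of Python; the cap value is an assumption for illustration:

```python
def polling_intervals(polls, start=1, cap=60):
    """Truncated exponential back-off polling. `polls` is a sequence of
    booleans (True = a message was found). Each empty poll doubles the
    sleep interval up to `cap` seconds; any successful poll resets it
    to `start`. Returns the interval used after each poll."""
    interval, out = start, []
    for got_message in polls:
        interval = start if got_message else min(interval * 2, cap)
        out.append(interval)
    return out
```

This keeps idle workers from hammering the queue (and paying per transaction) while still reacting quickly once traffic resumes.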

Removing Poison Messages

A walk through the scenario (producers P1, P2; consumers C1, C2):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) – the message is treated as poison and removed

Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
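The dequeue-count threshold from the recap can be sketched as a guard around the handler; the threshold of 3 and the dict message shape are illustrative, not the library's API:

```python
MAX_DEQUEUE_COUNT = 3  # illustrative threshold

def process_queue_message(msg, handler, poison_store):
    """Poison-message guard: if a message keeps reappearing (its dequeue
    count exceeds the threshold), set it aside and delete it instead of
    retrying forever. `msg` is a dict with 'dequeue_count' and 'body';
    `poison_store` collects bodies for offline inspection."""
    if msg["dequeue_count"] > MAX_DEQUEUE_COUNT:
        poison_store.append(msg["body"])  # set aside, then delete from the queue
        return "deleted-as-poison"
    handler(msg["body"])  # handler must be idempotent: at-least-once delivery
    return "processed"
```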

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each barely using CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
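The Task Parallel Library is a .NET facility; the data-parallel pattern it supports is language-agnostic. The same fan-out over a pool sized to the core count looks like this in Python (illustrative analogue, not TPL itself):

```python
from concurrent.futures import ThreadPoolExecutor

def word_lengths(docs, workers=4):
    """Data parallelism: apply the same function to every element of a
    collection, fanned out across a worker pool. Order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(len, docs))
```

Sizing `workers` near the core count avoids the oversubscription problem flagged in the first bullet.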

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure / poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
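The "gzip all output" step is a one-liner in most stacks; a minimal Python sketch that honors the client's Accept-Encoding header:

```python
import gzip

def gzip_response(body: bytes, accept_encoding: str = "gzip, deflate"):
    """Gzip the response body when the client advertises support.
    Returns (payload, extra_headers)."""
    if "gzip" in accept_encoding:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}

# Repetitive markup compresses dramatically, cutting bandwidth charges.
html = b"<html>" + b"<p>hello world</p>" * 500 + b"</html>"
compressed, headers = gzip_response(html)
```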

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
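The query-segmentation split/join pattern is simple enough to sketch end-to-end. The partition size and the list-of-hits shape are illustrative; real AzureBLAST moves partitions through queues and blobs:

```python
def split_sequences(sequences, partition_size):
    """Query segmentation: split the input sequences into fixed-size
    partitions, each becoming one independent BLAST task."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(per_partition_results):
    """Join step: concatenate per-partition hit lists in partition order
    so the merged output matches a serial run over the full input."""
    merged = []
    for hits in per_partition_results:
        merged.extend(hits)
    return merged
```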

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task.

• Leverage the multiple cores of one instance
  • The "-a" argument of NCBI-BLAST
  • 1/2/4/8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: run test runs to profile, and set the size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST architecture (diagram): a Web Role hosts the web portal and web service for job registration; a Job Management Role contains the job scheduler and scaling engine, backed by a job registry in an Azure Table; tasks are dispatched via a global dispatch queue to pools of worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the NCBI databases current.


AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

Demonstration

R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record is a matched pair: "Executing the task N" followed by "Execution of task N is done". Otherwise something is wrong (e.g., the task failed to complete).

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774   (no completion record)

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
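Finding the abnormal records mechanically means pairing each "Executing the task N" line with its "Execution of task N is done" line; whatever stays unpaired never finished. A sketch of that log analysis:

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def incomplete_tasks(log_lines):
    """Pair 'Executing' records with their completion records; anything
    left open never finished (instance failure, upgrade, ...)."""
    open_tasks = set()
    for line in log_lines:
        if m := START.search(line):
            open_tasks.add(m.group(1))
        elif m := DONE.search(line):
            open_tasks.discard(m.group(1))
    return sorted(open_tasks)
```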

Surviving System Upgrades

North Europe data center: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group down for ~30 mins – this is an update domain in action.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed before the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964), where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
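The Penman-Monteith formula itself is a single expression once the inputs are in hand; the hard part, as noted, is estimating the conductivities. A direct transcription in Python (the numeric inputs in any real run come from the MODIS/FLUXNET data, not constants):

```python
def penman_monteith(delta, rn, rho_a, cp, dq, ga, gs, lam_v, gamma=66.0):
    """Penman-Monteith ET:
        ET = (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lam_v)
    Arguments must use units consistent with the definitions above
    (gamma defaults to ~66 Pa/K, the psychrometric constant)."""
    return (delta * rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1 + ga / gs)) * lam_v
    )
```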

ET Synthesizes Imagery Sensors Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram): scientists submit requests through the AzureMODIS Service Web Role Portal; the Request Queue and Download Queue feed the data collection stage, which pulls from the source imagery download sites using source metadata; the Reprojection Queue feeds the reprojection stage; the Reduction 1 and Reduction 2 Queues feed the derivation and analysis reduction stages; scientific results are then available for download.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the <PipelineStage> Task Queue feeds GenericWorker (Worker Role) instances, which read from <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram): a Reprojection Request arrives via the Job Queue at the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request), then parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile). Tasks are dispatched via the Task Queue to GenericWorker (Worker Role) instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile; reprojection data is read from Swath Source Data Storage.

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures:
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Application Model Comparison

Ad hoc application model:
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Windows Azure application model:
• Web Role instances
• Worker Role instances
• Azure Storage (Blob, Queue, Table)
• SQL Azure

Key Components

• Fabric Controller – manages hardware and virtual machines for services
• Compute
  • Web Roles – web application front end
  • Worker Roles – utility compute
  • VM Roles – custom compute role; you own and customize the VM
• Storage
  • Blobs – binary objects
  • Tables – entity storage
  • Queues – role coordination
  • SQL Azure – SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of Windows Server 2008 and a custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service: role types, role VM sizes, external and internal endpoints, local storage
  • The configuration settings configure a service: instance count, storage keys, application-specific settings
• Manages "nodes" and "edges" in the "fabric" (the hardware): power-on automation devices, routers, switches, hardware load balancers, physical servers, virtual servers
• State transitions: current state and goal state; it does what is needed to reach and maintain the goal state
• It's a perfect IT employee: never sleeps, doesn't ever ask for a raise, and always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute on Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services: Azure Storage, SQL Azure, other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, fault-tolerant applications

Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Allocate resources differently, e.g. different priority queues and back-end servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role – you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using the base OS plus the differencing VHD

Application Hosting

"Grokking" the service model:
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app

• At all times, the "diagram" stays whole

Automated Service Management

Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
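Illustrative fragments of the two files, a sketch only: element names are abbreviated from the real schemas, and the role names and setting values are hypothetical.

```xml
<!-- ServiceDefinition.csdef: the "shape" of the service -->
<ServiceDefinition name="MyService">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
  </WebRole>
  <WorkerRole name="BackgroundWorker" vmsize="Large" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings for a deployment -->
<ServiceConfiguration serviceName="MyService">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

The split matters operationally: the .csdef is fixed at deployment, while .cscfg values (like instance count) can be changed on a running service.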

Service Definition

Service Configuration

GUI

Double-click on the role name in the Azure project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the "brain" behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable storage at massive scale:
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
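Since access is plain REST, blob addresses can be composed by hand; a tiny helper (illustrative, reusing the account and container names from the example above):

```python
def blob_address(account: str, container: str, blob_name: str) -> str:
    """Build the public URL of a blob from its account, container, and name."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

# Reconstructs the sample address from the slide:
print(blob_address("movieconversion", "originals", "barga.mpg"))
# → http://movieconversion.blob.core.windows.net/originals/barga.mpg
```

Any HTTP client can then GET that URL (for a public container) or sign the request with the account key.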

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read-only

Two Types of Blobs Under the Hood

• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a block ID
  • Size limit: 200 GB per blob

• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in "blocks"; each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, committed in order as Big.mpg)
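A minimal in-memory sketch of the put-block / commit-block-list flow described above; the class and method names loosely mirror the REST operations but are mine, not an actual client library:

```python
class BlockBlob:
    """In-memory model of the block-blob scheme: upload blocks in any
    order, then commit a block list that fixes the final ordering."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes (GC'd after a week if never committed)
        self.committed = b""

    def put_block(self, block_id: str, data: bytes):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: the blob becomes the concatenation of the named blocks.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
for block_id, chunk in [("1", b"AA"), ("3", b"CC"), ("2", b"BB")]:  # arrives out of order
    blob.put_block(block_id, chunk)
blob.put_block_list(["1", "2", "3"])  # the commit fixes the order
print(blob.committed)  # → b'AABBCC'
```

This is why block blobs suit streaming uploads: blocks can be sent in parallel and out of order, and nothing is visible until the commit.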


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a one-minute exclusive write lock on a blob

• Operations: Acquire, Renew, Release, Break

• Must have the lease ID to perform operations

• Can check the LeaseStatus property

• Currently can only be done through REST
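A toy model of the lease semantics above. The real operations are REST calls against the blob endpoint; this sketch only illustrates the id-checked Acquire/Renew/Release behavior and the one-minute duration:

```python
import time
import uuid

class Lease:
    """Toy model of a blob lease: a one-minute exclusive write lock."""
    DURATION = 60.0  # seconds

    def __init__(self):
        self.lease_id = None
        self.expires = 0.0

    def acquire(self, now=None):
        now = time.time() if now is None else now
        if self.lease_id is not None and now < self.expires:
            raise RuntimeError("409 Conflict: lease already held")
        self.lease_id = str(uuid.uuid4())
        self.expires = now + self.DURATION
        return self.lease_id

    def renew(self, lease_id, now=None):
        now = time.time() if now is None else now
        if lease_id != self.lease_id:
            raise RuntimeError("409 Conflict: wrong lease id")
        self.expires = now + self.DURATION

    def release(self, lease_id):
        if lease_id != self.lease_id:
            raise RuntimeError("409 Conflict: wrong lease id")
        self.lease_id, self.expires = None, 0.0

lease = Lease()
lid = lease.acquire(now=0.0)   # exclusive lock granted
lease.renew(lid, now=30.0)     # operations require the lease id
lease.release(lid)             # lock freed for other writers
```

The key point for coordination patterns: a second writer attempting Acquire while the lease is live gets a conflict, so a lease can serve as a crude distributed mutex.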

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a page blob
  • Example: mount a page blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the page blob
• The drive is made durable through standard page blob replication
• The drive persists even when not mounted, as a page blob

Windows Azure Drive API

• Create Drive – creates a page blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted page blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and page blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (page blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (page blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table Name: Movies – entities: Star Wars, Star Trek, Fan Boys
• Table Name: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable:
  • Data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key, defined differently for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• "Server Busy" can also mean single-partition limits have been reached

Partition Keys in Each Abstraction

• Entities – TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order-1               |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order-3               |              |                     | $10.00

• Blobs – container name + blob name. Every blob and its snapshots are in a single partition.

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages – queue name. All messages for a single queue belong to the same partition.

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3)

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
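A sketch of the recommended exponential backoff on "503 Server Busy". The helper and exception names are hypothetical, and `sleep` is injectable so the policy can be exercised without actually waiting:

```python
import random

class ServerBusyError(Exception):
    """Stand-in for an HTTP 503 'Server Busy' from the storage service."""

def with_backoff(op, max_retries=5, base=0.5, cap=30.0, sleep=lambda s: None):
    """Retry `op` with truncated exponential backoff on ServerBusyError."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except ServerBusyError:
            if attempt == max_retries:
                raise
            # Double the delay each attempt, truncate at `cap`, add jitter.
            sleep(min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0))

# A flaky operation that succeeds on the third call:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusyError()
    return "ok"

print(with_backoff(flaky))  # → ok
```

The jitter keeps a fleet of workers from retrying in lock-step against the same hot partition.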

Partitions and Partition Ranges

Movies table, keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially one server holds the whole range:
• Server A: Table = Movies, [Min – Max]

As load grows, the range is split across servers:
• Server A: Table = Movies, [Min – Comedy)
• Server B: Table = Movies, [Comedy – Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query
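The token-draining loop this implies can be sketched generically; `fake_page` below is a stand-in service (not the real table API) that caps responses at 1,000 rows:

```python
def query_all(query_page):
    """Drain a paged query: keep following continuation tokens until the
    service stops returning one. `query_page(token)` returns
    (rows, next_token); next_token is None on the last page."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Fake service returning at most 1000 rows per response:
data = list(range(2500))
def fake_page(token):
    start = token or 0
    page = data[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(data) else None
    return page, nxt

print(len(query_all(fake_page)))  # → 2500
```

Code that ignores the token silently processes only the first page, which is why the slide says "seriously".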

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale:
• Distribute by using a hash, etc., as a prefix

Avoid "append only" patterns.

Always handle continuation tokens:
• Expect continuation tokens for range queries

"OR" predicates are not optimized:
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries on "Server Busy":
• The system load balances partitions to meet traffic needs
• Load on a single partition may have exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Decoupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Messages are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle (diagram): a Web Role calls PutMessage to enqueue Msg 1–4; Worker Roles call GetMessage with a visibility timeout to dequeue, then RemoveMessage once processing completes.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

• Consider a back-off polling approach
• Each empty poll increases the interval by 2x, up to a cap
• A successful poll resets the interval back to 1
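The policy above reduces to a one-line interval update; names and the floor/cap values here are illustrative:

```python
def next_poll_interval(current, got_message, floor=1, cap=60):
    """Truncated exponential back-off polling: each empty poll doubles
    the interval up to `cap`; a successful poll resets it to `floor`."""
    if got_message:
        return floor
    return min(cap, current * 2)

interval = 1
history = []
for got in [False, False, False, True, False]:  # three empty polls, a hit, then empty
    interval = next_poll_interval(interval, got)
    history.append(interval)
print(history)  # → [2, 4, 8, 1, 2]
```

This keeps idle workers from hammering the queue (each GetMessage is a billable transaction) while still reacting quickly once traffic resumes.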

Removing Poison Messages

(Diagram: producers P1 and P2, consumers C1 and C2, around queue Q)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count is now 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1 (dequeue count is now 3)
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) – the poison message is removed

Queues Recap

• No need to deal with failures – make message processing idempotent
• Invisible messages result in out-of-order delivery – do not rely on order
• Enforce a threshold on a message's dequeue count – use the dequeue count to remove poison messages
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers – use the message count to scale
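The dequeue-count threshold from the walkthrough above, as a self-contained simulation. This is not the real queue client: a crashing consumer is modeled by simply advancing the clock past the visibility timeout.

```python
class Queue:
    """Minimal model of GetMessage/DeleteMessage with a visibility
    timeout and a per-message dequeue count."""

    def __init__(self):
        self.messages = []

    def put(self, body):
        self.messages.append({"body": body, "visible_at": 0.0, "dequeue_count": 0})

    def get(self, now, visibility_timeout=30.0):
        for m in self.messages:
            if now >= m["visible_at"]:
                m["visible_at"] = now + visibility_timeout  # hide the message
                m["dequeue_count"] += 1
                return m
        return None

    def delete(self, msg):
        self.messages.remove(msg)

MAX_DEQUEUE = 2
q = Queue()
q.put("bad message")

now = 0.0
while True:
    msg = q.get(now)
    if msg is None:
        break
    if msg["dequeue_count"] > MAX_DEQUEUE:
        q.delete(msg)  # poison: remove instead of retrying forever
        break
    # ...consumer "crashes" while processing; the message reappears later
    now = msg["visible_at"]

print(len(q.messages))  # → 0
```

In production the poison message would typically be copied to a dead-letter table or blob before deletion, so it can be inspected.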

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library:
  • Data parallelism
  • Task parallelism

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive

• Find code that is intensive with different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in a poor user experience

• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile:
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile

• Sending fewer things over the wire often means getting fewer things from storage

• Saving bandwidth costs often leads to savings in other places

• Sending fewer things means your VM has time to do other tasks

• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip + minify JavaScript + minify CSS + minify images → compressed content)
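A quick illustration of why gzipping markup pays off (repetitive HTML compresses very well), using Python's standard `gzip` module; the sample page is synthetic:

```python
import gzip

# A synthetic, highly repetitive HTML page:
html = b"<html><body>" + b"<p>hello world</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)
print(len(html), "->", len(compressed))      # markup shrinks dramatically
assert gzip.decompress(compressed) == html   # lossless round trip
```

Every byte saved here is paid for twice: once in bandwidth billing and once in user-perceived latency.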

Best Practices Summary

Doing "less" is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
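The input-segmentation step is straightforward; a sketch that splits a FASTA input into fixed-size partitions of sequences (the function name and the toy sequences are mine, and the default of 100 sequences per partition reflects the AzureBLAST benchmark result discussed below):

```python
def partition_fasta(fasta_text, seqs_per_partition=100):
    """Split a FASTA input into fixed-size partitions of sequences,
    the query-segmentation pattern: each partition becomes one
    independently runnable BLAST task."""
    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):          # header starts a new record
            if current:
                records.append("\n".join(current))
            current = [line]
        elif line.strip():                # sequence data for the current record
            current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]

# 250 toy sequences -> partitions of 100, 100, and 50:
fasta = "".join(f">seq{i}\nMKV\n" for i in range(250))
parts = partition_fasta(fasta, seqs_per_partition=100)
print([p.count(">") for p in parts])  # → [100, 100, 50]
```

Each partition is then a work ticket: enqueue its blob name, and any idle worker can run NCBI-BLAST against it.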

AzureBLAST

• A parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, which feed a merging task.

Leverage the multiple cores of one instance:
• Use the "-a" argument of NCBI-BLAST
• Set to 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting time in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST Architecture

(Diagram) A Web Role hosts the web portal and web service for job registration. A Job Management Role runs the job scheduler and scaling engine, tracking jobs in an Azure Table job registry. Worker roles pull BLAST tasks from a global dispatch queue. Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps them current. The task flow is the split/join pattern: a splitting task, many parallel BLAST tasks, and a merging task.

AzureBLAST Job Portal

• ASP.NET program hosted by a web role instance:
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table:
  • Fault tolerance – avoid in-memory state

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances:
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST:
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments:
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances arose, the load was redistributed manually

(Diagram: segments per deployment – 50, 62, 62, 62, 62, 62, 50, 62)

End Result

• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• We looked into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes – this is an update domain at work.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain was working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
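The Penman-Monteith formula translates directly into code. The sample input magnitudes below are assumed for illustration only, not field data:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s, lam_v, gamma=66.0):
    """Penman-Monteith ET:
        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
    Argument units follow the symbol definitions on the slide."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / \
           ((delta + gamma * (1.0 + g_a / g_s)) * lam_v)

# Illustrative magnitudes (assumed, not measured field values):
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=1000.0, g_a=0.02, g_s=0.01, lam_v=2450.0)
print(f"{et:.4f}")  # a small positive ET for daytime-like forcing
```

The per-pixel arithmetic is trivial; the hard part, as the slide notes, is estimating the conductivities ga and gs across a catchment, which is what the data pipeline exists to do.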

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: the AzureMODIS Service Web Role portal feeds a request queue; download, reprojection, and two reduction queues connect the data collection, reprojection, derivation reduction, and analysis reduction stages, drawing on the source imagery download sites and source metadata; scientists download the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request flows through the MODISAzure Service (Web Role) into the <PipelineStage> job queue; the Service Monitor (Worker Role) persists <PipelineStage>JobStatus, parses and persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role
• The Generic Worker (Worker Role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; Generic Workers dequeue the tasks and read from <Input> data storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request flows through the job queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue consumed by Generic Workers (Worker Roles) reading swath source data storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (approximate, from the slide's pipeline diagram):
• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press. Programming Windows Azure, O'Reilly Press. Bing: "Channel 9 Windows Azure". Bing: "Windows Azure Platform Training Kit - November Update". http://research.microsoft.com/azure  xcgngage@microsoft.com


Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances, Worker Role Instances

Azure Storage (Blob, Queue, Table), SQL Azure

Windows Azure Application Model

Key Components

Fabric Controller
• Manages hardware and virtual machines for service

Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM

Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• "Cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines

Key Components: Fabric Controller

• Think of it as an automated IT department
• "Cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers / switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
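The queue-glue pattern above can be sketched with standard-library stand-ins; this is not the Azure API, just a minimal local simulation of a web role enqueuing small work tickets and worker roles processing them independently:

```python
import queue
import threading

# Local stand-in for an Azure queue: the web role enqueues small "work
# ticket" messages; worker roles dequeue and process them independently.
work_queue = queue.Queue()
results = []

def web_role_handle_request(request_id):
    # The front end does no heavy work; it just drops a ticket on the queue.
    work_queue.put({"request_id": request_id})

def worker_role_loop():
    while True:
        msg = work_queue.get()
        if msg is None:              # sentinel: shut down this worker
            work_queue.task_done()
            break
        # Heavy processing happens here, decoupled from the front end.
        results.append(f"processed {msg['request_id']}")
        work_queue.task_done()

workers = [threading.Thread(target=worker_role_loop) for _ in range(2)]
for w in workers:
    w.start()
for rid in range(5):
    web_role_handle_request(rid)
work_queue.join()                    # all tickets processed
for _ in workers:
    work_queue.put(None)
for w in workers:
    w.join()
print(len(results))                  # → 5
```

Because the roles only share the queue, either side can be scaled (or fail and restart) without the other noticing, which is the point of the pattern.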

Key Components – Compute: VM Roles

• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • Base OS
    • Differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy and manage that diagram for you:
  • Find hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API: easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private
  • Default; will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
    • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
    • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
  • Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB, and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: Big.mpg uploaded as blocks 1-8 out of order, then committed in sequence into the Big.mpg blob.]
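The block semantics above (upload in any order, commit list defines the blob) can be illustrated with a small local simulation; this is not the real Azure storage API, just the logic of put-block / put-block-list:

```python
# Minimal local simulation of block-blob semantics (not the real Azure API):
# blocks are uploaded individually with an id, then a commit list assembles
# them into the final blob in whatever order the client chooses.

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.committed = []     # ordered list of (block_id, bytes)

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit blocks in the order given, regardless of upload order.
        self.committed = [(bid, self.uncommitted[bid]) for bid in block_ids]
        self.uncommitted.clear()

    def content(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
for bid, chunk in [("b2", b"world"), ("b1", b"hello "), ("b3", b"!")]:
    blob.put_block(bid, chunk)               # uploaded out of order
blob.put_block_list(["b1", "b2", "b3"])      # commit order defines the blob
print(blob.content())                        # → b'hello world!'
```

The commit list is what makes parallel, out-of-order upload safe: the blob's byte order is fixed only at commit time.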

Pages

• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1 minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount Page Blob as X:
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
    • Drive made durable through standard Page Blob replication
  • Drive persists even when not mounted, as a Page Blob

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table Name: Movies
  Star Wars, Star Trek, Fan Boys

Table Name: Customers
  Brian H Prince, Jason Argonaut, Bill Gates

Account contains Tables; Tables contain Entities.

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables
    • Billions of entities (rows) and TBs of data
    • Can use thousands of servers as traffic grows
  • Highly available & durable
    • Data is replicated several times
• Familiar and easy to use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational

Can not:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)

Partition key is the unit of scale:
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

System load balances:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to be available on a different server

Server busy:
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
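The "exponential backoff on 503" advice above can be sketched as follows; `do_request` is a hypothetical stand-in for any storage call that returns an HTTP status code, not a real Azure client method:

```python
import random
import time

# Retry a storage operation with exponential backoff whenever the service
# answers "503 Server Busy". A sketch, not a production retry policy.

def with_backoff(do_request, max_retries=5, base_delay=0.1):
    delay = base_delay
    for attempt in range(max_retries):
        status = do_request()
        if status != 503:
            return status
        # Sleep delay with jitter, then double it for the next attempt,
        # so a busy partition is given progressively more breathing room.
        time.sleep(delay * random.uniform(0.5, 1.5))
        delay *= 2
    raise RuntimeError("server still busy after retries")

# Simulated service: busy twice, then succeeds.
responses = iter([503, 503, 200])
print(with_backoff(lambda: next(responses), base_delay=0.001))  # → 200
```

Jitter matters here: without it, many clients backing off in lockstep would all retry the same overloaded partition at the same instant.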

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

After the partition splits:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
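Because any of the conditions above can end a response early, range queries must loop until no continuation token is returned. Here `query_segment` is a hypothetical stand-in for a table query that returns one page of rows plus an opaque token when more results remain:

```python
# Generic continuation-token loop: keep querying until the service stops
# handing back a token. A sketch; the real client API is not used here.

def query_all(query_segment):
    rows, token = [], None
    while True:
        page, token = query_segment(token)
        rows.extend(page)
        if token is None:      # no continuation token: result set complete
            return rows

# Simulated paged service: three pages of results keyed by token.
pages = {None: (["a", "b"], "t1"), "t1": (["c"], "t2"), "t2": (["d"], None)}
print(query_all(lambda tok: pages[tok]))  # → ['a', 'b', 'c', 'd']
```

The common bug this guards against is treating the first page as the whole result set, which silently truncates data once a table grows past one page.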

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
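The work ticket pattern mentioned above pairs naturally with the 8 KB message limit: store the large payload as a blob and enqueue only a small reference. This sketch uses plain dict/list stand-ins for blob and queue storage, not the Azure API:

```python
# Work ticket pattern for payloads larger than the 8 KB message limit:
# the payload goes to blob storage, the queue carries only a tiny ticket.

blobs = {}     # blob name -> payload (stand-in for blob storage)
tickets = []   # stand-in for the queue of small work-ticket messages

def submit_job(job_id, large_payload):
    blob_name = f"jobs/{job_id}"
    blobs[blob_name] = large_payload                     # payload to blobs
    tickets.append({"job": job_id, "blob": blob_name})   # tiny ticket to queue

def worker_process_one():
    ticket = tickets.pop(0)
    payload = blobs[ticket["blob"]]      # fetch the real data by reference
    result = len(payload)                # placeholder for real processing
    del blobs[ticket["blob"]]            # garbage collect the orphaned blob
    return ticket["job"], result

submit_job("j1", b"x" * 100_000)         # far larger than 8 KB
print(worker_process_one())              # → ('j1', 100000)
```

Deleting the blob after processing matches the later recap advice to garbage collect orphaned blobs, since a crashed worker can otherwise leave payloads behind.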

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue; Worker Roles call GetMessage (with a timeout) to dequeue messages and RemoveMessage to delete them after processing.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue, with intervals growing from 1 up to 60 seconds.]
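The polling rule above (double on empty, reset on success, capped at a maximum) can be written out directly; this is a sketch of the scheduling logic only, with no queue involved:

```python
# Truncated exponential back off polling: each empty poll doubles the
# interval (capped at `maximum`); a successful poll resets it to `initial`.

def polling_intervals(poll_results, initial=1, maximum=60):
    """Yield the wait interval used before each poll, given whether each
    poll found a message (True) or came back empty (False)."""
    interval = initial
    for got_message in poll_results:
        yield interval
        if got_message:
            interval = initial                     # success: poll fast again
        else:
            interval = min(interval * 2, maximum)  # empty: back off, truncated

# Five empty polls, then a message, then another empty poll:
print(list(polling_intervals([False] * 5 + [True, False])))
# → [1, 2, 4, 8, 16, 32, 1]
```

The truncation cap keeps a long-idle worker from backing off so far that it reacts sluggishly when traffic returns.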

Removing Poison Messages

[Diagram, across producers P1/P2 and consumers C1/C2, step by step:]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
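Step 12 above is the key idea: a message that keeps coming back is presumed poisonous and is removed instead of retried forever. A minimal sketch of that policy, using a plain list as the queue stand-in:

```python
# Poison message handling: each message carries a dequeue count; once it
# exceeds a threshold the message is set aside instead of retried forever.

MAX_DEQUEUE = 2

def process_queue(queue, handler, dead_letter):
    while queue:
        msg = queue.pop(0)
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > MAX_DEQUEUE:
            dead_letter.append(msg)      # poison: stop retrying, keep for inspection
            continue
        try:
            handler(msg)                 # delete-on-success is implied by pop
        except Exception:
            queue.append(msg)            # becomes visible again for retry

def handler(msg):
    if msg["body"] == "bad":
        raise ValueError("cannot process")   # simulates a crashing consumer

queue = [{"body": "good", "dequeue_count": 0},
         {"body": "bad", "dequeue_count": 0}]
dead = []
process_queue(queue, handler, dead)
print([m["body"] for m in dead])         # → ['bad']
```

Setting poison messages aside rather than deleting them outright preserves the evidence needed to debug why processing failed.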

Queues Recap

Make message processing idempotent
• No need to deal with failures

Do not rely on order
• Invisible messages result in out-of-order delivery

Use dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count

Use a blob to store message data, with a reference in the message
• Messages > 8KB
• Batch messages
• Garbage collect orphaned blobs

Use message count to scale
• Dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
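The slide's example is .NET's Task Parallel Library; the analogous data-parallel pattern, mapping one function over a collection with a worker pool, looks like this in Python (a sketch of the pattern, not the TPL itself):

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: one function applied to many inputs in parallel,
# keeping all cores of the role instance busy.

def cpu_task(n):
    return sum(i * i for i in range(n))   # stand-in for real work

inputs = [10, 100, 1000]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_task, inputs))
print(results)  # → [285, 328350, 332833500]
```

Task parallelism is the same machinery with heterogeneous work: submit different functions as separate futures instead of mapping one function over a list.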

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs

Performance vs. Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places
  • Sending fewer things over the wire often means getting fewer things from storage
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content becomes compressed content via: Gzip, minify JavaScript, minify CSS, minify images

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
  • 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, AzureBlast: A Case Study of Developing Science Applications on the Cloud, in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
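The query-segmentation pattern above reduces to split / parallel task / merge. A sketch with a trivial stand-in for the BLAST task itself (the partition size of ~100 sequences echoes the micro-benchmark result reported later):

```python
# Query segmentation: split the input sequences into fixed-size partitions,
# run each partition as an independent task, then merge the results.

def split(sequences, partition_size):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running NCBI-BLAST over one partition of query sequences.
    return [f"hit:{seq}" for seq in partition]

def merge(results_per_partition):
    return [hit for part in results_per_partition for hit in part]

sequences = [f"seq{i}" for i in range(250)]
partitions = split(sequences, 100)     # micro-benchmarks favored ~100/partition
print([len(p) for p in partitions])    # → [100, 100, 50]
merged = merge(blast_task(p) for p in partitions)
print(len(merged))                     # → 250
```

In AzureBLAST each partition becomes a queue task picked up by a worker role, and the merging task runs once all partition results have landed in storage.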

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, which feed a merging task.

Leverage the multi-core capacity of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, middle, large and extra large instance sizes

Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data transferring overhead
• Best practice: test runs to profile, and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long period of waiting time in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger size worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

[Diagram: a Web Role hosts the Web Portal and Web Service; job registration feeds the Job Management Role, which runs the Job Scheduler and Scaling Engine and dispatches work through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a separate Database Updating Role handles database refreshes]


AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists
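The single-desktop figure can be checked with simple arithmetic (a sketch; the 3,216,731-minute estimate is the sampled value from the slide):

```python
# Sampled estimate from above: total minutes of compute on one desktop.
minutes = 3_216_731
years = minutes / (60 * 24 * 365)   # minutes -> years
print(f"{years:.1f} years")          # about 6.1 years
```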

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be a matched "Executing"/"done" pair:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
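Mining such logs for anomalies is mechanical; a hypothetical Python sketch (the log lines are the examples above, and the two regexes assume the message formats shown) that flags tasks which started but never reported completion:

```python
import re

log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = set(), set()
for line in log.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done", line)
    if m:
        finished.add(m.group(1))

# Tasks that started but never completed: candidates for node failures,
# system upgrades, or storage errors.
print(sorted(started - finished))
```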

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed, and the job was killed

35 nodes experienced blob writing failures at the same time

A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / (λv·(Δ + γ·(1 + ga/gs)))

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
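Evaluating the combination equation itself is the easy part; as the slide notes, estimating the conductivities is what is hard. A Python sketch of the formula with made-up illustrative input values (not field data):

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET = (Delta*Rn + rho_a*c_p*dq*g_a) / (lambda_v*(Delta + gamma*(1 + g_a/g_s)))

    Units follow the slide: Rn in W m-2, lambda_v in J/g, gamma and dq in Pa,
    conductivities in m s-1."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / \
           (lambda_v * (delta + gamma * (1.0 + g_a / g_s)))

# Illustrative (made-up) values, only to exercise the formula:
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1004.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
print(et)
```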

ET Synthesizes Imagery, Sensors, Models and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

[Diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal and download scientific results from it. A Request Queue and Download Queue feed the Data Collection Stage, which pulls tiles from the Source Imagery Download Sites and records Source Metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages, which produce the science results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role (GenericWorker)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus; GenericWorker (Worker Role) instances dequeue from the <PipelineStage> Task Queue and read <Input>Data Storage]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and parses & persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile) before dispatching to the Task Queue; GenericWorker (Worker Role) instances dequeue tasks, query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, then read Swath Source Data Storage and write Reprojection Data Storage]

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

[Diagram: the four-stage pipeline annotated per stage:
Data Collection Stage – 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
Reprojection Stage – 400 GB, 45K files, 3500 hours, 20-100 workers: $420 cpu, $60 download
Derivation Reduction Stage – 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 cpu, $1 download, $6 storage
Analysis Reduction Stage – <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 cpu, $2 download, $9 storage]

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault tolerance and scalability abstractions

• Clouds are an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Application Model Comparison

[Diagram: the ad hoc application model — machines running IIS/ASP.NET, machines running Windows Services, machines running SQL Server]

Application Model Comparison

[Diagram: the ad hoc application model (machines running IIS/ASP.NET, Windows Services, SQL Server) vs. the Windows Azure application model: Web Role instances, Worker Role instances, Azure Storage (Blob, Queue, Table), SQL Azure]

Key Components

Fabric Controller
• Manages hardware and virtual machines for services

Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM

Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines

Key Components: Fabric Controller

• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service
    • Instance count
    • Storage keys
    • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using
    • Base OS
    • Differencing VHD

Application Hosting

'Grokking' the Service Model

• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app

• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
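A minimal sketch of what those two files contain (an illustrative fragment only: the role names and settings here are made up, and schema attributes are abbreviated; consult the real csdef/cscfg schemas for the full element set):

```xml
<!-- ServiceDefinition.csdef: the "shape" of the service -->
<ServiceDefinition name="MyService">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
  </WebRole>
  <WorkerRole name="BackgroundWorker" vmsize="Large" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings for that shape -->
<ServiceConfiguration serviceName="MyService">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="StorageAccountKey" value="..." />
    </ConfigurationSettings>
  </Role>
  <Role name="BackgroundWorker">
    <Instances count="4" />
  </Role>
</ServiceConfiguration>
```

The split matters operationally: the definition (shape) requires a redeploy to change, while the configuration (instance counts, keys, settings) can be changed on a running service.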

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the Cloud

• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private
  • Default; will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
  • Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: Big.mpg uploaded as blocks in the order 1 6 8 3 5 4 7 2, then committed in sequence into Big.mpg]
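The block mechanics can be illustrated with a small simulation (pure Python, no Azure SDK; `put_block` and `put_block_list` here only mimic the REST operations described above):

```python
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> data; GC'd if never committed
        self.committed = b""

    def put_block(self, block_id, data):
        # Blocks may arrive in any order, e.g. a parallel upload of Big.mpg.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Committing assembles the blob in the order given, not arrival order.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
for block_id in [1, 6, 8, 3, 5, 4, 7, 2]:       # out-of-order arrival
    blob.put_block(block_id, f"<part{block_id}>".encode())
blob.put_block_list([1, 2, 3, 4, 5, 6, 7, 8])   # commit order
print(blob.committed)
```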


Pages

• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

[Diagram: the storage account "MovieData" contains two tables. Table "Movies" holds the entities Star Wars, Star Trek, and Fan Boys; table "Customers" holds the entities Brian H Prince, Jason Argonaut, and Bill Gates. The hierarchy is Account → Table → Entity.]

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables
    • Billions of entities (rows) and TBs of data
    • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
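How PartitionKey + RowKey address an entity can be sketched with an in-memory stand-in (not the real Table service): a point query touches exactly one partition and one row, which is why it is the cheapest access pattern.

```python
# In-memory stand-in for a table: partitions keyed by PartitionKey,
# rows keyed by RowKey. Schema can vary per entity.
table = {}

def insert(entity):
    part = table.setdefault(entity["PartitionKey"], {})
    part[entity["RowKey"]] = entity

def point_query(pk, rk):
    # Most efficient query shape: a single partition, a single row.
    return table[pk][rk]

insert({"PartitionKey": "1", "RowKey": "Customer-John Smith",
        "Name": "John Smith"})
insert({"PartitionKey": "1", "RowKey": "Order-1", "OrderTotal": 35.12})

print(point_query("1", "Order-1")["OrderTotal"])
```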

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• Controls entity locality

Partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern

The system load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Or single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg

Messages – Queue name
• All messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message1
  jobs     | Message2
  workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

A single server can serve the whole table:

Server A – Table = Movies [Min - Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Under load, the system splits the partition range across servers:

Server A – Table = Movies [Min - Comedy)

  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006

Server B – Table = Movies [Comedy - Max]

  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

• Maximum of 1000 rows in a response
• A token is returned at the end of a partition range boundary
• Maximum of 5 seconds to execute the query
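Handling continuation tokens is just a drain loop. A sketch against a simulated paged query (`query_page` is a stand-in for a Table query that returns at most 1000 rows plus a token, not the real API):

```python
DATA = list(range(2500))   # pretend table rows
PAGE = 1000                # server returns at most 1000 rows per response

def query_page(continuation=0):
    # Stand-in for a Table query: one page of rows plus the next token,
    # or None when the result set is exhausted.
    rows = DATA[continuation:continuation + PAGE]
    nxt = continuation + PAGE if continuation + PAGE < len(DATA) else None
    return rows, nxt

def query_all():
    rows, token = query_page()
    results = list(rows)
    while token is not None:          # always handle continuation tokens
        rows, token = query_page(token)
        results.extend(rows)
    return results

print(len(query_all()))
```

Forgetting the loop silently truncates results at 1000 rows, which is why the slide stresses the point.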

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
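The work ticket pattern mentioned above: since a message is limited to 8 KB, park the payload in a blob and enqueue only a small ticket that references it. A sketch with in-memory stand-ins for the queue and blob store (names like `submit_work` are made up for illustration):

```python
from collections import deque

blob_store = {}    # stand-in for blob storage
queue = deque()    # stand-in for an Azure queue (8 KB message limit)

def submit_work(job_id, payload):
    # Payload may be far larger than 8 KB, so park it in a blob...
    blob_name = f"jobs/{job_id}"
    blob_store[blob_name] = payload
    # ...and enqueue a small "work ticket" that merely references it.
    queue.append({"job": job_id, "blob": blob_name})

def worker():
    ticket = queue.popleft()
    data = blob_store[ticket["blob"]]
    return f"processed {ticket['job']}: {len(data)} bytes"

submit_work("42", b"x" * 100_000)
print(worker())
```

The queue (unlike a table) also supplies the dispatch semantics: visibility timeouts, at-least-once delivery, and dequeue counts.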

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1-4 to the Queue; Worker Roles call GetMessage (with a timeout) to dequeue messages and RemoveMessage to delete them once processed]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, truncated at some maximum (e.g. 60 s); a successful poll resets the interval back to 1.
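A minimal sketch of the truncated exponential back-off policy just described; the minimum and maximum intervals are illustrative:

```python
def next_interval(current: float, got_message: bool,
                  minimum: float = 1.0, maximum: float = 60.0) -> float:
    """Truncated exponential back-off: double the polling interval on an
    empty poll (capped at `maximum`), reset to `minimum` on a success."""
    if got_message:
        return minimum
    return min(current * 2, maximum)

interval = 1.0
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    print(interval)  # 2.0, 4.0, 8.0, 1.0, 2.0
```

The cap matters: without it, a long quiet period would leave the worker polling so rarely that new work sits unnoticed.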

Removing Poison Messages

Producers P1 and P2 enqueue work; consumers C1 and C2 dequeue it with a 30-second visibility timeout:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
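The dequeue-count defense against poison messages can be sketched as an in-memory simulation (not queue-service code; `MAX_DEQUEUE_COUNT` is an assumed threshold):

```python
MAX_DEQUEUE_COUNT = 3

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0  # incremented each time the message is dequeued

def handle(msg: Message, process) -> str:
    """Delete a message that has already failed too many times
    instead of letting it be retried forever."""
    msg.dequeue_count += 1
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        return "deleted-as-poison"   # optionally log or dead-letter it first
    try:
        process(msg.body)
        return "processed"
    except Exception:
        return "left-for-retry"      # becomes visible again after the timeout

bad = Message("corrupt payload")
def crash(_body):
    raise ValueError("cannot parse")

outcomes = [handle(bad, crash) for _ in range(4)]
print(outcomes)
```

After three failed attempts the fourth dequeue removes the message, so one bad payload cannot pin a worker in a crash loop.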

Queues Recap

• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways

Data abstractions to build your applications:

Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice: larger, fewer VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• Only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
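That sizing experiment boils down to simple arithmetic once you have measured the speedup. A sketch, with illustrative 2010-era rates (not current pricing) and a hypothetical `cheaper_option` helper:

```python
def cheaper_option(task_hours_on_small: float, speedup_on_xl: float,
                   small_rate: float = 0.12, xl_rate: float = 0.96) -> str:
    """Compare the cost of finishing a job on 8 small instances
    (assumed to scale linearly) vs one 8-core extra-large instance,
    given the measured speedup on the extra-large."""
    cost_small = 8 * small_rate * (task_hours_on_small / 8)
    cost_xl = xl_rate * (task_hours_on_small / speedup_on_xl)
    return "extra-large" if cost_xl < cost_small else "small instances"

print(cheaper_option(80, speedup_on_xl=9.5))  # super-linear speedup favors XL
print(cheaper_option(80, speedup_on_xl=6.0))  # sub-linear speedup favors small
```

With per-core pricing roughly flat, the break-even point is simply whether the measured speedup beats the core count; super-linear scaling (as AzureBLAST later observed, due to memory) tips the balance toward larger VMs.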

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up its CPU

• Balance between using up CPU and having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library

• Data parallelism

• Task parallelism
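The slide names .NET's Task Parallel Library; the same two styles can be sketched in Python with the standard `concurrent.futures` module (the `transform` function is a stand-in for real per-record work):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def transform(record: int) -> int:
    return record * record          # stand-in for real per-record work

records = range(100)

# Data parallelism: the same operation applied across a partitioned dataset.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    squares = list(pool.map(transform, records))

# Task parallelism: different operations running concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, squares)
    biggest = pool.submit(max, squares)
    print(total.result(), biggest.result())
```

For CPU-bound Python work a process pool is usually the better fit; the structure is identical.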

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive

• Find code that is intensive with different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

Diagram: uncompressed content becomes compressed content via Gzip, minified JavaScript, minified CSS, and minified images.
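A quick illustration of the gzip trade-off (spend compute to shrink bytes) using Python's standard library; the repetitive HTML payload is a contrived example:

```python
import gzip

html = (b"<html><body>"
        + b"<p>Windows Azure saves bandwidth.</p>" * 200
        + b"</body></html>")

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
print(len(html), len(compressed), f"{ratio:.0%}")

# The round trip is lossless, so nothing is sacrificed but CPU time.
assert gzip.decompress(compressed) == html
```

Markup and JSON compress extremely well because they are repetitive; already-compressed formats (JPEG, PNG) gain little, which is why image savings come from minification instead.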

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
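Input segmentation is what makes the query side pleasingly parallel: each chunk of query sequences can be BLASTed against the database independently. A sketch with a hypothetical `split_fasta` helper (not AzureBLAST's actual code):

```python
def split_fasta(text: str, seqs_per_partition: int):
    """Partition a FASTA file into chunks of N query sequences each;
    every chunk can be searched independently and results concatenated."""
    sequences, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            sequences.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sequences.append("\n".join(current))
    return ["\n".join(sequences[i:i + seqs_per_partition])
            for i in range(0, len(sequences), seqs_per_partition)]

fasta = ">seq1\nMKV\n>seq2\nGHT\n>seq3\nALW\n>seq4\nPQR\n>seq5\nDEF\n"
parts = split_fasta(fasta, seqs_per_partition=2)
print(len(parts))  # 3 partitions of 2, 2, and 1 sequences
```

Because query results are per-sequence, merging is a simple concatenation; database segmentation (the mpiBLAST approach) is the variant that needs a real reduction step.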

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task Flow: a simple split/join pattern

Leverage the multiple cores of one instance
• the "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting in case of an instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size and instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

Workers

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

…

Scaling Engine

(BLAST databases, temporary data, etc.)

Job Registry
NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job Portal: an ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute it manually

(Deployment map: instance counts per deployment: 50, 62, 62, 62, 62, 62, 50, 62)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like the following; otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
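The analysis described here, spotting tasks that were started but never finished, amounts to matching "Executing" records against "done" records. A sketch over an abridged version of the excerpt above:

```python
import re

log = """\
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started = set(re.findall(r"Executing the task (\d+)", log))
finished = set(re.findall(r"Execution of task (\d+) is done", log))
lost = started - finished   # started on this node but never completed there
print(sorted(lost))
```

Clustering the timestamps of such lost tasks by node is what revealed the update-domain and fault-domain events on the next slides.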

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed, and then the job was killed

35 nodes experienced blob-writing failures at the same time

A reasonable guess: the fault domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)        Penman-Monteith (1964)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
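The Penman-Monteith formula transcribes directly into code. The inputs below are illustrative, not field-calibrated values; the default λv assumes liquid water (~2260 J/g):

```python
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith (1964):
    ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
    Units follow the variable definitions on the slide above."""
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs only:
et = penman_monteith_et(delta=145.0, Rn=400.0, rho_a=1.2,
                        c_p=1005.0, dq=800.0, g_a=0.02, g_s=0.01)
print(et)
```

The hard part, as the slide notes, is not this arithmetic but estimating the conductivities ga and gs across a whole catchment: that is where the imagery, sensor, and model inputs come in.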

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage> JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage> TaskStatus and dispatches to the <PipelineStage> Task Queue.

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role), which:

• Dequeues tasks created by the Service Monitor from the <PipelineStage> Task Queue
• Retries failed tasks 3 times
• Maintains all task status
• Works against <Input> Data Storage

Example Pipeline Stage Reprojection Service

Diagram: a Reprojection Request is queued to the Job Queue; the Service Monitor (Worker Role) persists Reprojection JobStatus, parses the job into per-tile tasks, persists Reprojection TaskStatus, and dispatches them to the Task Queue, from which GenericWorker (Worker Role) instances pull work against Swath Source Data Storage and Reprojection Data Storage.

Each entity in the job table specifies a single reprojection job request

Each entity in the task table specifies a single reprojection task (i.e. a single tile)

Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile

Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run reduction multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates


Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Provide valuable fault tolerance and scalability abstractions

• Clouds serve as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

11

Application Model Comparison

Machines RunningIIS ASPNET

Machines RunningWindows Services

Machines RunningSQL Server

Ad Hoc Application Model

12

Application Model Comparison

Machines RunningIIS ASPNET

Machines RunningWindows Services

Machines RunningSQL Server

Ad Hoc Application Model

Web Role Instances Worker RoleInstances

Azure StorageBlob Queue Table

SQL Azure

Windows Azure Application Model

Key ComponentsFabric Controller

bull Manages hardware and virtual machines for service

Computebull Web Roles

bull Web application front end

bull Worker Rolesbull Utility compute

bull VM Rolesbull Custom compute rolebull You own and customize the VM

Storagebull Blobs

bull Binary objects

bull Tablesbull Entity storage

bull Queuesbull Role coordination

bull SQL Azurebull SQL in the cloud

Key ComponentsFabric Controller

bull Think of it as an automated IT departmentbull ldquoCloud Layerrdquo on top ofbull Windows Server 2008bull A custom version of Hyper-V called the Windows Azure Hypervisor

bull Allows for automated management of virtual machines

Key ComponentsFabric Controller

bull Think of it as an automated IT departmentbull ldquoCloud Layerrdquo on top ofbull Windows Server 2008bull A custom version of Hyper-V called the Windows Azure Hypervisor

bull Allows for automated management of virtual machines

bull Itrsquos job is to provision deploy monitor and maintain applications in data centers

bull Applications have a ldquoshaperdquo and a ldquoconfigurationrdquobull The configuration definition describes the shape of a service

bull Role typesbull Role VM sizesbull External and internal endpointsbull Local storage

bull The configuration settings configures a servicebull Instance countbull Storage keysbull Application-specific settings

Key ComponentsFabric Controller

bull Manages ldquonodesrdquo and ldquoedgesrdquo in the ldquofabricrdquo (the hardware)bull Power-on automation devicesbull Routers Switchesbull Hardware load balancersbull Physical serversbull Virtual servers

bull State transitionsbull Current Statebull Goal Statebull Does what is needed to reach and maintain the goal state

bull Itrsquos a perfect IT employeebull Never sleepsbull Doesnrsquot ever ask for raisebull Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components ndash ComputeWeb Roles

Web Front Endbull Cloud web serverbull Web pagesbull Web services

You can create the following typesbull ASPNET web rolesbull ASPNET MVC 2 web rolesbull WCF service web rolesbull Worker rolesbull CGI-based web roles

Key Components ndash ComputeWorker Roles

bull Utility computebull Windows Server 2008bull Background processingbull Each role can define an amount of local storagebull Protected space on the local drive considered volatile

storage bull May communicate with outside servicesbull Azure Storagebull SQL Azurebull Other Web services

bull Can expose external and internal endpoints

Suggested Application ModelUsing queues for reliable messaging

Scalable Fault Tolerant Applications

Queues are the application gluebull Decouple parts of application easier to scale independentlybull Resource allocation different priority queues and backend

serversbull Mask faults in worker roles (reliable messaging)

Key Components ndash ComputeVM Roles

bull Customized Rolebull You own the box

bull How it worksbull Download ldquoGuest OSrdquo to Server 2008 Hyper-Vbull Customize the OS as you need tobull Upload the differences VHDbull Azure runs your VM role usingbull Base OSbull Differences VHD

Application Hosting

lsquoGrokkingrsquo the service modelbull Imagine white-boarding out your service architecture with boxes for

nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service ManagementProvide code + service modelbull Platform identifies and allocates resources deploys the service

manages service healthbull Configuration is handled by two files

ServiceDefinitioncsdefServiceConfigurationcscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

bull We can deploy from the portal or from scriptbull VS builds two filesbull Encrypted package of your codebull Your config file

bull You must create an Azure account then a service and then you deploy your code

bull Can take up to 20 minutes bull (which is better than six months)

Service Management API

bullREST based API to manage your servicesbullX509-certs for authenticationbullLets you create delete change upgrade swaphellipbullLots of community and MSFT-built tools around the API- Easy to roll your own

The Secret Sauce ndash The Fabric The Fabric is the lsquobrainrsquo behind Windows Azure

1Process service model1 Determine resource requirements

2 Create role images

2Allocate resources

3Prepare nodes1 Place role images on nodes

2 Configure settings

3 Start roles

4Configure load balancers

5Maintain service health1 If role fails restart the role based on policy

2 If node fails migrate the role based on policy

StorageReplicated Highly Available Load Balanced

Durable Storage At Massive Scale

Blob- Massive files eg videos logs

Drive- Use standard file system APIs

Tables- Non-relational but with few scale limits- Use SQL Azure for relational data

Queues- Facilitate loosely-coupled reliable systems

Blob Features and Functionsbull Store Large Objects (up to 1TB

in size)

bull You can have as many containers and Blobs as you want

bull Standard REST Interfacebull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlobbull Get whole blob or a specific range

bull DeleteBlobbull CopyBlobbull SnapshotBlobbull LeaseBlob

bull Each Blob has an addressbull httpltstorageaccountgtblobcorewindowsnetltContainergtltBlobNamegtbull httpmovieconversionblobcorewindowsnetoriginalsbargampg

Containers

bull Similar to a top level folderbull Has an unlimited capacitybull Can only contain BLOBs

Each container has an access level- Private

- Default will require the account key to access- Full public read- Public read only

Two Types of Blobs Under the Hood

bull Block Blob bull Targeted at streaming

workloadsbull Each blob consists of a

sequence of blocksbull Each block is identified by a Block

ID

bull Size limit 200GB per blob

bull Page Blob bull Targeted at random

readwrite workloadsbull Each blob consists of an

arrayof pagesbull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

bull You can upload a file in lsquoblocksrsquobull Each block has an idbull Then commit those blocks in any order into a

blobbull Final blob limited to 1 TB and up to 50000

blocksbull Can modify a blob by inserting updating and

removing blocksbull Blocks live for a week before being GCrsquod if not

committed to a blobbull Optimized for streaming

Blocks

Bigmpg1 6 8 3 5 4 7 2

Bigmpg

Brian Prince (DPE)
Fix the animation

Pagesbull Similar to block blobsbull Optimized for random readwrite operations and

provide the ability to write to a range of bytes in a blob

bull Call Put Blob set max size Then call Put Pagebull All pages must align 512-byte page boundariesbull Writes to page blobs happen in-place and are

immediately committed to the blobbull The maximum size for a page blob is 1 TB A

page written to a page blob may be up to 1 TB in size

BLOB Leases

bull Creates a 1 minute exclusive write lock on a BLOB

bull Operations Acquire Renew Release Break

bull Must have the lease id to perform operations

bull Can check LeaseStatus property

bull Currently can only be done through REST

Windows Azure Drive

bull Provides a durable NTFS volume for Windows Azure applications to usebull Use existing NTFS APIs to access a durable

drivebull Durability and survival of data on application failover

bull Enables migrating existing NTFS applications tothe cloud

bullA Windows Azure Drive is a Page Blobbull Example mount Page Blob as X

bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt

bull All writes to drive are made durable to the Page Blobbull Drive made durable through standard Page Blob

replicationbull Drive persists even when not mounted as a Page

Blob

Windows Azure Drive API

• Create Drive – Creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – Returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – Unmounts the drive and frees up the drive letter
• Snapshot Drive – Allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – Provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
Table Name: Movies — Star Wars, Star Trek, Fan Boys
Table Name: Customers — Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable
• Data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)

Every data object has a partition key
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits may have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

• Blobs – Container name + Blob name; every blob and its snapshots are in a single partition
• Messages – Queue name; all messages for a single queue belong to the same partition

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: Servers 1, 2, and 3 each hold a replica of partitions P1, P2, …, Pn.)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity Group Transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
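Because entities that sort adjacently share a partition, a common way to distribute load is to prefix the natural PartitionKey with a stable hash bucket, as the recap's "hash as prefix" tip suggests. A minimal sketch in Python (function name and bucket count are illustrative, not part of any Azure API):

```python
import hashlib


def spread_partition_key(natural_key, buckets=16):
    """Prefix the natural key with a stable hash bucket so that
    lexicographically adjacent keys land in different partitions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}-{natural_key}"
```

The trade-off: range queries over the natural key now fan out into one query per bucket, so choose the bucket count with your query pattern in mind.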

Expect Continuation Tokens – Seriously

• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
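The "always handle continuation tokens" rule boils down to a client loop that keeps re-issuing the query until no token comes back. A minimal Python simulation of that loop (the 1,000-row page size matches the deck; `query_page`/`query_all` are illustrative names, not a real storage API):

```python
PAGE_SIZE = 1000  # Azure Tables returns at most 1,000 rows per response


def query_page(rows, continuation=0):
    """One server round trip: a page of results plus a continuation
    token (None once the result set is exhausted)."""
    page = rows[continuation:continuation + PAGE_SIZE]
    done = continuation + PAGE_SIZE >= len(rows)
    return page, (None if done else continuation + PAGE_SIZE)


def query_all(rows):
    """Client loop: keep following continuation tokens until done."""
    results, token = [], 0
    while token is not None:
        page, token = query_page(rows, token)
        results.extend(page)
    return results
```

A client that forgets this loop silently sees only the first 1,000 rows, which is why the deck says "Seriously".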

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1–Msg 4 to the Queue; Worker Roles call GetMessage with a timeout to receive a message, then RemoveMessage to delete it once processed.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
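The Get/Delete pair above works because GetMessage hides the message for the visibility timeout and returns a pop receipt that DeleteMessage must present. A tiny in-memory model of that lifecycle, useful for reasoning about it (class and method names are illustrative, not the real client library):

```python
import itertools
import time


class SimQueue:
    """Toy model of the queue message lifecycle: GetMessage hides a
    message for visibility_timeout seconds and hands back a pop
    receipt; DeleteMessage requires that receipt."""

    _receipts = itertools.count()

    def __init__(self):
        self._messages = []  # each entry: [visible_at, pop_receipt, body]

    def put_message(self, body):
        self._messages.append([0.0, None, body])

    def get_message(self, visibility_timeout=30, now=None):
        now = time.monotonic() if now is None else now
        for entry in self._messages:
            if entry[0] <= now:                      # currently visible
                entry[0] = now + visibility_timeout  # hide it
                entry[1] = next(self._receipts)      # fresh pop receipt
                return entry[1], entry[2]
        return None

    def delete_message(self, pop_receipt):
        before = len(self._messages)
        self._messages = [e for e in self._messages if e[1] != pop_receipt]
        return len(self._messages) < before
```

If a consumer crashes before deleting, the message simply reappears after the timeout, which is exactly the at-least-once guarantee the deck describes.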

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a maximum
• A successful poll resets the interval back to 1
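The back-off rule above can be sketched in a few lines of Python (class name, units, and the 60-second cap are illustrative assumptions):

```python
class TruncatedBackoff:
    """Truncated exponential back-off for queue polling.

    Each empty poll doubles the wait (up to max_interval);
    a successful poll resets the wait back to min_interval.
    """

    def __init__(self, min_interval=1, max_interval=60):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.interval = min_interval

    def next_interval(self, got_message):
        if got_message:
            self.interval = self.min_interval  # reset on success
        else:
            # double on an empty poll, truncated at the maximum
            self.interval = min(self.interval * 2, self.max_interval)
        return self.interval
```

A polling loop would sleep for `next_interval(...)` seconds after each GetMessage call, which keeps idle workers from hammering the queue (and paying per transaction) while staying responsive when traffic returns.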


Removing Poison Messages

(Diagram: producers P1, P2 and consumers C1, C2 sharing queue Q.)
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2

Removing Poison Messages

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after Dequeue
7. GetMessage(Q, 30 s) → msg 1

Removing Poison Messages

1. Dequeue(Q, 30 sec) → msg 1
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after Dequeue
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after Dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent — no need to deal with failures
• Do not rely on order — invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages — enforce a threshold on a message's dequeue count
• Messages > 8 KB — use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale — dynamically increase/reduce workers
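The DequeueCount threshold from the recap can be sketched as a small processing loop: a message whose handler keeps failing is eventually parked for offline inspection instead of cycling forever. This is a Python sketch (names and the threshold of 3 are illustrative assumptions):

```python
import collections

MAX_DEQUEUE_COUNT = 3  # threshold before a message is treated as poison


class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0


def process_queue(queue, handler, poison_sink):
    """Drain the queue; messages whose handler keeps failing are moved
    to poison_sink once they exceed MAX_DEQUEUE_COUNT."""
    while queue:
        msg = queue.popleft()
        msg.dequeue_count += 1
        if msg.dequeue_count > MAX_DEQUEUE_COUNT:
            poison_sink.append(msg)  # delete from queue, park for inspection
            continue
        try:
            handler(msg.body)        # success: message is not re-enqueued
        except Exception:
            queue.append(msg)        # failure: message becomes visible again
```

Without the threshold, a single malformed "poison" message would be redelivered forever, starving the workers, which is the failure mode the three walkthrough slides illustrate.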

Windows Azure Storage Takeaways

Data abstractions to build your applications:
Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU

• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency

• May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library

• Data parallelism

• Task parallelism
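The deck's example is the .NET 4 Task Parallel Library; the same data-parallelism idea can be sketched in Python with `concurrent.futures` (function names are illustrative; the real work item would be your own code):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    """Stand-in for per-item work (the 'data parallelism' case)."""
    return item * item


def parallel_map(items, workers=None):
    # Sizing the pool to the core count avoids oversubscription,
    # mirroring the 'active processes should not exceed cores' advice.
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_item, items))
```

`pool.map` preserves input order, so results come back as if computed sequentially; for CPU-bound Python work a process pool would be the closer analogue, but the sizing principle is the same.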

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive

• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction

• Service choice can make a big cost difference based on your app profile

• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.)
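Point 1 above is a one-liner in most stacks; a minimal Python sketch showing why repetitive markup compresses so well (the sample page body is made up for illustration):

```python
import gzip


def gzip_response(body: bytes) -> bytes:
    """Compress an HTTP response body; browsers advertise support via
    the Accept-Encoding: gzip request header and decompress on the fly."""
    return gzip.compress(body)


# Highly repetitive text (HTML, JSON, JavaScript) compresses well,
# cutting both bandwidth and storage charges.
page = b"<li>row</li>" * 1000
compressed = gzip_response(page)
```

In a real web role you would also set `Content-Encoding: gzip` on the response, and only compress when the client's Accept-Encoding header allows it.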

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern:
• split the input sequences
• query partitions in parallel
• merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow
A simple Split/Join pattern

Leverage multi-core within one instance:
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of an instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task
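The Split/Join task-flow pictured above can be sketched in miniature: a splitting task partitions the input, BLAST tasks run in parallel, and a merging task concatenates the results. This is an illustrative Python sketch, not the AzureBLAST code itself; `blast_task` is a stand-in for invoking NCBI-BLAST, and the 100-sequence partition size echoes the micro-benchmark finding below:

```python
from concurrent.futures import ThreadPoolExecutor


def split(sequences, partition_size=100):
    """Splitting task: partition the input into fixed-size chunks."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]


def blast_task(partition):
    """Stand-in for running NCBI-BLAST over one partition."""
    return [f"hit:{seq}" for seq in partition]


def run_job(sequences):
    partitions = split(sequences)
    with ThreadPoolExecutor() as pool:  # worker roles, in miniature
        partial_results = pool.map(blast_task, partitions)
    # Merging task: concatenate per-partition results in order.
    return [hit for partial in partial_results for hit in partial]
```

In the real system the partitions travel through a queue to worker role instances, and the merge runs as a separate task once all partitions report completion.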

Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe

• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service

• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions

• When load imbalances appear, redistribute the load manually


End Result
• Total size of the output result is ~230 GB

• The number of total hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group — this is an update domain (~30 mins, ~6 nodes in one group)

35 nodes experienced blob-writing failures at the same time

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks completed, and then the job was killed

A reasonable guess: the fault domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (12)

• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

• Dequeues tasks created by the Service Monitor

• Retries failed tasks 3 times
• Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (i.e. a single tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times

• Storage costs driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers

$50 upload, $450 storage

400 GB, 45K files, 3500 hours, 20–100 workers

5–7 GB, 55K files, 1800 hours, 20–100 workers

<10 GB, ~1K files, 1800 hours, 20–100 workers

$420 cpu, $60 download

$216 cpu, $1 download, $6 storage

$216 cpu, $2 download, $9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104


Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model


Application Model Comparison

Machines Running IIS / ASP.NET

Machines Running Windows Services

Machines Running SQL Server

Ad Hoc Application Model

Web Role Instances

Worker Role Instances

Azure Storage (Blob, Queue, Table)

SQL Azure

Windows Azure Application Model

Key Components

Fabric Controller
• Manages hardware and virtual machines for the service

Compute
• Web Roles
• Web application front end
• Worker Roles
• Utility compute
• VM Roles
• Custom compute role; you own and customize the VM

Storage
• Blobs
• Binary objects
• Tables
• Entity storage
• Queues
• Role coordination
• SQL Azure
• SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "Cloud Layer" on top of:
• Windows Server 2008
• A custom version of Hyper-V called the Windows Azure Hypervisor

• Allows for automated management of virtual machines

Key Components: Fabric Controller

• Its job is to provision, deploy, monitor, and maintain applications in data centers

• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current State
  • Goal State
  • Does what is needed to reach and maintain the goal state
• It's the perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web Front End
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other Web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
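The decoupling described above can be sketched with an in-process queue standing in for an Azure queue; a front end enqueues small "work tickets" and independent workers drain them. All names here (web_role_frontend, worker_role) are illustrative, not Azure APIs:

```python
import queue
import threading

# "Work ticket" pattern: the front end enqueues references to work,
# and decoupled workers dequeue and process them independently.
work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role_frontend(job_ids):
    # The front end stays responsive: it only enqueues tickets.
    for job_id in job_ids:
        work_queue.put({"ticket": job_id, "blob": "input/%d.dat" % job_id})

def worker_role(worker_id):
    while True:
        try:
            msg = work_queue.get(timeout=0.1)  # like GetMessage with a timeout
        except queue.Empty:
            return                             # queue drained; worker exits
        with results_lock:
            results.append((worker_id, msg["ticket"]))
        work_queue.task_done()                 # like DeleteMessage on success

web_role_frontend(range(6))
workers = [threading.Thread(target=worker_role, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(ticket for _, ticket in results))  # every ticket handled once
```

Because the two sides share only the queue, either side can be scaled (or can fail and restart) without the other noticing.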

Key Components – Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
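As a rough illustration of how the two files split "shape" from "settings", a minimal pair might look like this (role names, setting names, and attribute values here are invented for illustration; the exact schema is defined by the Azure SDK):

```xml
<!-- ServiceDefinition.csdef: the shape of the service (role types, sizes, endpoints) -->
<ServiceDefinition name="MyService">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
  </WebRole>
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings (instance counts, storage keys, app settings) -->
<ServiceConfiguration serviceName="MyService">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

The split matters operationally: the .cscfg (instance counts, keys) can be changed on a running service, while the .csdef shape requires a redeploy.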

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob: massive files, e.g. videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
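A toy model of the block mechanics above: blocks are uploaded in any order, and the committed block list defines the final blob. This is an in-memory stand-in, not the Azure REST API; the 8-byte block size is just for illustration:

```python
import hashlib

# In-memory model of a block blob: put_block stages data, put_block_list
# commits an ordered list of block ids that defines the blob's contents.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes (GC'd after ~a week on Azure)
        self.block_list = []    # committed order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: the blob becomes the concatenation of blocks in this order.
        missing = [b for b in block_ids if b not in self.uncommitted]
        if missing:
            raise KeyError("uncommitted blocks referenced: %r" % missing)
        self.block_list = list(block_ids)

    def read(self):
        return b"".join(self.uncommitted[b] for b in self.block_list)

data = b"large video file contents"
blob = BlockBlob()
ids = []
for i in range(0, len(data), 8):            # split into tiny 8-byte "blocks"
    chunk = data[i:i + 8]
    block_id = hashlib.md5(chunk).hexdigest()
    blob.put_block(block_id, chunk)         # blocks may arrive in any order
    ids.append(block_id)
blob.put_block_list(ids)                    # the commit defines the blob
assert blob.read() == data
```

Because nothing is visible until the block list is committed, uploads can run in parallel and be retried per block without corrupting the blob.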

Blocks

Big.mpg blocks: 1 6 8 3 5 4 7 2

Big.mpg


Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

bull Creates a 1 minute exclusive write lock on a BLOB

bull Operations Acquire Renew Release Break

bull Must have the lease id to perform operations

bull Can check LeaseStatus property

bull Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists, as a Page Blob, even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table Name: Movies
  Star Wars
  Star Trek
  Fan Boys

Table Name: Customers
  Brian H. Prince
  Jason Argonaut
  Bill Gates

Account

Table

Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
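A minimal in-memory model of those required properties and of schema varying within a single table (illustrative only; this is not the Azure Tables client API):

```python
import time

# Model a table as a dict keyed by the full (PartitionKey, RowKey) pair,
# which is exactly the key a "point query" supplies.
table = {}

def insert(entity):
    # Every entity must carry PartitionKey and RowKey; Timestamp is
    # stamped by the store, mirroring the three mandatory properties.
    if not {"PartitionKey", "RowKey"} <= entity.keys():
        raise ValueError("entity missing PartitionKey/RowKey")
    entity = dict(entity, Timestamp=time.time())
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

# Schema can vary entity-by-entity within one table:
insert({"PartitionKey": "Movies", "RowKey": "Star Wars", "Year": 1977})
insert({"PartitionKey": "Customers", "RowKey": "Bill Gates", "City": "Seattle"})

# Point query (the most efficient pattern): full-key lookup.
print(table[("Movies", "Star Wars")]["Year"])
```

Note how nothing enforces a shared column set: only the three system properties are universal.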

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

• Different for each data type (blobs, entities, queues)

Every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

The system load balances
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• "Server Busy" can also mean single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order-1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order-3 | | | $10.00

• Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition.
• Messages: Queue name. All messages for a single queue belong to the same partition.
Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

Server 1 | Server 2 | Server 3

(each server holds replicas of partitions P1, P2, ..., Pn)

Scalability Targets

Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
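The recommended backoff-on-503 behavior can be sketched as follows; the flaky service and the delay constants are invented for illustration, and the actual sleep is omitted so the sketch runs instantly:

```python
import random

# Truncated exponential backoff: retry on '503 Server Busy', doubling the
# (capped) delay each attempt, with jitter to avoid synchronized retries.
def call_with_backoff(op, max_retries=6, base=0.1, cap=5.0):
    delays = []
    for attempt in range(max_retries):
        status = op()
        if status != 503:
            return status, delays
        delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
        delays.append(delay)  # a real client would time.sleep(delay) here
    raise RuntimeError("server still busy after retries")

# Stand-in for a busy partition: two 503s, then success.
calls = iter([503, 503, 200])
status, delays = call_with_backoff(lambda: next(calls))
assert status == 200 and len(delays) == 2
```

The cap ("truncated") keeps the worst-case wait bounded; the jitter spreads retries from many clients so they do not hammer the recovering partition in lockstep.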

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | ... | 2009
Action | The Bourne Ultimatum | ... | 2007
... | ... | ... | ...
Animation | Open Season 2 | ... | 2009
Animation | The Ant Bully | ... | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | ... | 1999
... | ... | ... | ...
SciFi | X-Men Origins: Wolverine | ... | 2009
... | ... | ... | ...
War | Defiance | ... | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | ... | 2009
Action | The Bourne Ultimatum | ... | 2007
... | ... | ... | ...
Animation | Open Season 2 | ... | 2009
Animation | The Ant Bully | ... | 2006
... | ... | ... | ...
Comedy | Office Space | ... | 1999
... | ... | ... | ...
SciFi | X-Men Origins: Wolverine | ... | 2009
... | ... | ... | ...
War | Defiance | ... | 2008

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics, and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A continuation token is returned:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
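A generic continuation-token loop looks like the following; the query function simulates a server that caps responses at 1,000 rows and hands back a token while more remain (a stand-in, not the Azure Tables API):

```python
# Client-side pagination: keep issuing the query, passing the continuation
# token back, until the server reports there is nothing left.
ROWS = list(range(2500))  # pretend table contents
PAGE = 1000               # server returns at most 1,000 rows per response

def query(continuation=0):
    page = ROWS[continuation:continuation + PAGE]
    more = continuation + PAGE < len(ROWS)
    return page, (continuation + PAGE if more else None)

results, token = [], 0
while token is not None:
    page, token = query(token)
    results.extend(page)

assert results == ROWS       # nothing missed: 2,500 rows over 3 round trips
```

Treating a token-less response as "done" (rather than assuming one response is complete) is exactly the habit the slide is urging: a 900-row result may still carry a token if it hit a partition boundary or the 5-second limit.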

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute by using a hash, etc., as a prefix
• Avoid 'append only' patterns
• Always handle continuation tokens: expect them for range queries
• 'OR' predicates are not optimized: execute the queries that form the 'OR' predicates as separate queries
• Implement a back-off strategy for retries: 'Server Busy' means either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout) / RemoveMessage

Msg 2, Msg 1

Worker Role

Msg 2

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a backoff polling approach: each empty poll increases the polling interval by 2x, up to some maximum; a successful poll sets the interval back to 1.
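The polling policy above reduces to a few lines (the cap of 60 is an assumed example, not a value from the slides):

```python
# Truncated exponential back-off polling: double the interval on each empty
# poll, cap it, and reset to 1 as soon as a message is found.
def next_interval(current, got_message, cap=60):
    if got_message:
        return 1
    return min(current * 2, cap)

interval = 1
history = []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)

assert history == [2, 4, 8, 1, 2]
```

Three idle polls back the consumer off to an 8-unit interval; the first hit snaps it back to 1, so an active queue is drained promptly while an idle one costs almost nothing in transactions.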

Removing Poison Messages

(Producers: P1, P2; Consumers: C1, C2)

Scenario 1 (normal dequeue):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 (consumer crash):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 (poison message removed via dequeue count):
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
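The dequeue-count rule from the recap can be sketched with an in-memory stand-in for a queue (MAX_DEQUEUE is an example threshold; none of this is the Azure Queue API):

```python
from collections import deque

# Poison-message handling: a message whose DequeueCount exceeds the
# threshold is diverted to a dead-letter store instead of being retried.
MAX_DEQUEUE = 2
q = deque([{"body": "good", "DequeueCount": 0},
           {"body": "poison", "DequeueCount": 0}])
dead_letter, processed = [], []

def handle(msg):
    if msg["body"] == "poison":
        raise ValueError("crashes the worker every time")
    processed.append(msg["body"])

while q:
    msg = q.popleft()
    msg["DequeueCount"] += 1
    if msg["DequeueCount"] > MAX_DEQUEUE:
        dead_letter.append(msg["body"])   # take the poison message out of rotation
        continue
    try:
        handle(msg)                       # then DeleteMessage on success
    except ValueError:
        q.append(msg)                     # becomes visible again after timeout

assert processed == ["good"]
assert dead_letter == ["poison"]
```

Without the threshold, the poison message would cycle forever, each visibility timeout handing it to the next worker to crash on.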

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
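The slide's recommendation is .NET 4's Task Parallel Library; purely as an analogy, the same two patterns look like this with Python's concurrent.futures (cpu_task is an invented workload):

```python
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation over a collection of inputs.
    squares = list(pool.map(cpu_task, [10, 100, 1000]))

    # Task parallelism: independent, heterogeneous tasks in flight at once.
    f1 = pool.submit(cpu_task, 50)
    f2 = pool.submit(len, "windows azure")
    task_results = (f1.result(), f2.result())

print(squares, task_results)
```

The distinction carries over directly: `map` mirrors TPL's `Parallel.For`/PLINQ style data parallelism, while `submit` mirrors spawning independent `Task` objects.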

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the costs of having idling VMs

Performance, Cost

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed Content → [Gzip, Minify JavaScript, Minify CSS, Minify Images] → Compressed Content
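Gzipping output is one call in most stacks; a quick illustration of the payoff on repetitive HTML (the sample payload is invented):

```python
import gzip

# Compress a typical repetitive text payload before serving or storing it;
# browsers decompress on the fly when Content-Encoding: gzip is set.
html = (b"<html><body>"
        + b"<p>Windows Azure for Research</p>" * 200
        + b"</body></html>")
compressed = gzip.compress(html)

print(len(html), "->", len(compressed))     # repetitive markup shrinks a lot
assert len(compressed) < len(html) // 5
assert gzip.decompress(compressed) == html  # lossless round trip
```

Since Azure bills both bandwidth and storage by the byte, the same few lines cut both cost lines at once, at the price of a little CPU, which is exactly the compute-for-size trade the slide describes.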

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• Argument '-a' of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition → load imbalance
• Small partition → unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

...

Merging Task
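The split/join task flow above can be sketched generically; the toy blast_task below stands in for an actual NCBI-BLAST invocation, and the sequences are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Query-segmentation split/join: split input sequences into partitions,
# "query" each partition in parallel, then merge per-partition results.
def split(sequences, partition_size):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running BLAST on one partition of query sequences.
    return [(seq, len(seq)) for seq in partition]

def merge(partial_results):
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

queries = ["ATCG", "GGCATT", "TT", "CAGTACG", "AAC"]
partitions = split(queries, 2)          # partition size = task granularity knob
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(blast_task, partitions))
results = merge(partials)

assert len(partitions) == 3
assert results == [(q, len(q)) for q in queries]  # join preserves input order
```

The `partition_size` argument is exactly the granularity trade-off on the slide: fewer, larger partitions risk load imbalance; many tiny ones multiply the per-task overhead.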

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

Workers

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

...

Scaling Engine

(BLAST databases, temporary data, etc.)

Job Registry | NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

...

Merging Task

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time...

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place...


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe Data Center: 30,976 tasks completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the Fault Domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
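The Penman-Monteith formula translates line-for-line into code; a quick sketch with made-up but physically plausible inputs (every numeric value below is illustrative, not from the tutorial):

```python
# Penman-Monteith ET, computed directly from the slide's formula:
#   ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0,     # psychrometric constant (Pa/K)
                    lam_v=2450.0):  # latent heat of vaporization (J/g)
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lam_v
    return numerator / denominator

et = penman_monteith(delta=145.0,   # Pa/K
                     Rn=400.0,      # W/m^2
                     rho_a=1.2,     # kg/m^3
                     cp=1005.0,     # J/(kg K)
                     dq=1000.0,     # Pa
                     ga=0.02,       # m/s
                     gs=0.01)       # m/s
assert et > 0
```

The arithmetic itself is trivial; as the slides note, the hard part is producing defensible values of ga and gs across a whole catchment, which is where the multi-terabyte data reduction comes in.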

ET Synthesizes Imagery, Sensors, Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction Stage | Derivation Reduction Stage | Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks, i.e. recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

<PipelineStage> Request

Persist <PipelineStage> JobStatus

<PipelineStage> Job Queue

MODISAzure Service (Web Role)

Service Monitor (Worker Role)

Parse & Persist <PipelineStage> TaskStatus

Dispatch <PipelineStage> Task Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input> Data Storage

Example Pipeline Stage: Reprojection Service

• Each entity in the Job Queue specifies a single reprojection job request
• Each entity in the Task Queue specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus; Parse & Persist ReprojectionTaskStatus → Dispatch Task Queue → GenericWorker (Worker Role) → Reprojection Data Storage, reading from Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

[Figure: the AzureMODIS pipeline (Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages with their queues and the Web Role portal), annotated with per-stage data volumes and costs:]

• Data Collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 CPU hours, 20-100 workers; $420 CPU, $60 download
• Derivation Reduction stage: 5-7 GB, 55K files, 1800 CPU hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis Reduction stage: <10 GB, ~1K files, 1800 CPU hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing: Channel 9 Windows Azure; Bing: Windows Azure Platform Training Kit – November Update; http://research.microsoft.com/azure; xcgngage@microsoft.com


A Tour Around Windows Azure


Application Model Comparison

Ad Hoc Application Model:
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Application Model Comparison

Ad Hoc Application Model:
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Windows Azure Application Model:
• Web Role instances
• Worker Role instances
• Azure Storage (Blob, Queue, Table)
• SQL Azure

Key Components

Fabric Controller
• Manages hardware and virtual machines for the service

Compute
• Web Roles: web application front end
• Worker Roles: utility compute
• VM Roles: custom compute role; you own and customize the VM

Storage
• Blobs: binary objects
• Tables: entity storage
• Queues: role coordination
• SQL Azure: SQL in the cloud


Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings

Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Never asks for a raise
  • Always does what you tell it to do, via the configuration definition and settings
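The current-state/goal-state idea above can be illustrated with a toy reconciliation step; the role names and callbacks here are hypothetical, not Fabric Controller APIs.

```python
def reconcile(current_roles, goal_roles, start_role, stop_role):
    """One reconciliation pass: start roles missing from the current state,
    stop roles no longer in the goal state, and report the resulting state."""
    for role in sorted(goal_roles - current_roles):
        start_role(role)       # e.g., provision a VM and launch the role
    for role in sorted(current_roles - goal_roles):
        stop_role(role)        # e.g., tear the instance down
    return set(goal_roles)     # the fabric repeats this to *maintain* the goal
```

Running such a pass repeatedly is what lets the controller recover from failures: a crashed role simply disappears from the current state and gets restarted on the next pass.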

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
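A minimal sketch of this model, with an in-memory queue standing in for an Azure queue; the role functions are illustrative, not SDK calls.

```python
from collections import deque

class WorkQueue:
    """In-memory stand-in for an Azure queue acting as 'application glue'."""
    def __init__(self):
        self._messages = deque()
    def put(self, msg):
        self._messages.append(msg)
    def get(self):
        return self._messages.popleft() if self._messages else None

def web_role(q, request):
    """Front end: enqueue a work ticket and return to the user immediately."""
    q.put(request)

def worker_role(q, results):
    """Back end: pull a ticket and do the (stand-in) heavy lifting."""
    msg = q.get()
    if msg is not None:
        results.append(msg.upper())
```

Because the two roles share only the queue, each side can be scaled, or fail and restart, independently of the other.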

Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find hardware homes
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action, perhaps even relocating your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale

• Blobs: massive files, e.g., videos, logs
• Drives: use standard file system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood
• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'; each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Figure: blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2), then committed into the final Big.mpg blob.]
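The upload-then-commit behavior of block blobs can be modeled in a few lines. This is a toy model of the semantics only, not the storage service API.

```python
class BlockBlob:
    """Toy block blob: blocks are uploaded in any order, then an ordered
    block list is committed to define the final blob contents."""
    def __init__(self):
        self._uncommitted = {}      # block id -> bytes
        self._committed = b""
    def put_block(self, block_id, data):
        self._uncommitted[block_id] = data
    def put_block_list(self, block_ids):
        # Commit: the blob is the concatenation of blocks in list order.
        self._committed = b"".join(self._uncommitted[b] for b in block_ids)
        # The real service garbage-collects uncommitted blocks after a week.
        self._uncommitted.clear()
    def content(self):
        return self._committed
```

Note that the upload order ("b2" first) does not matter; only the committed block list order defines the blob.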

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
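The 512-byte alignment rule above is easy to encode as a pre-flight check; this helper is illustrative, not part of any SDK.

```python
PAGE_SIZE = 512  # page blobs are addressed in 512-byte pages

def validate_page_write(offset, data):
    """Reject page-blob writes that violate the alignment rules."""
    if offset % PAGE_SIZE != 0:
        raise ValueError("offset must fall on a 512-byte page boundary")
    if len(data) == 0 or len(data) % PAGE_SIZE != 0:
        raise ValueError("length must be a positive multiple of 512 bytes")
    return True
```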

BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted

Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table "Movies": Star Wars, Star Trek, Fan Boys
• Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
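A toy model of these rules shows how a schemaless table can still enforce the three required properties. This is illustrative only, not the Table service client.

```python
import time

REQUIRED = ("PartitionKey", "RowKey")

class ToyTable:
    """Entities are keyed by (PartitionKey, RowKey); other properties vary."""
    def __init__(self):
        self._rows = {}
    def insert(self, entity):
        for prop in REQUIRED:
            if prop not in entity:
                raise ValueError("missing required property: " + prop)
        stored = dict(entity, Timestamp=time.time())  # service-assigned
        self._rows[(stored["PartitionKey"], stored["RowKey"])] = stored
    def get(self, partition_key, row_key):
        return self._rows[(partition_key, row_key)]
```

Two entities with completely different property sets can live in the same table, as long as both carry a PartitionKey and a RowKey.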

Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

• Every data object has a partition key
• Partitioning is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• A "Server Busy" response can mean the limits of a single partition have been reached

Partition Keys In Each Abstraction

• Entities: TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

• Blobs: Container name + Blob name; every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

• Messages: Queue name; all messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Figure: partitions P1..Pn replicated across Server 1, Server 2, and Server 3.]

Scalability Targets

Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Initially, one server holds the entire table.
Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | ... | 2009
Action | The Bourne Ultimatum | ... | 2007
... | ... | ... | ...
Animation | Open Season 2 | ... | 2009
Animation | The Ant Bully | ... | 2006
... | ... | ... | ...
Comedy | Office Space | ... | 1999
... | ... | ... | ...
SciFi | X-Men Origins: Wolverine | ... | 2009
... | ... | ... | ...
War | Defiance | ... | 2008

After the partition range splits under load:
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity Group Transactions
• Transactions across a single partition
• Transaction semantics and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• Returned at the end of a partition range boundary
• Maximum of 5 seconds to execute the query
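Handling continuation tokens correctly means looping until the server stops returning one. A sketch with a simulated server follows; the 1000-row page limit is from the slide, while `server_query` is a stand-in, not a real client call.

```python
PAGE_LIMIT = 1000  # a response carries at most 1000 rows

def query_all(rows):
    """Follow continuation tokens until none is returned."""
    def server_query(start):
        # Simulated server: one page of rows, plus a token if more remain.
        page = rows[start:start + PAGE_LIMIT]
        token = start + PAGE_LIMIT if start + PAGE_LIMIT < len(rows) else None
        return page, token
    results, token = [], 0
    while token is not None:
        page, token = server_query(token)
        results.extend(page)
    return results
```

The common bug is stopping after the first page; with 2500 rows, that silently drops 1500 of them.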

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy": either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling aids scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle
[Figure: a Web Role calls PutMessage to enqueue Msg 1..4; Worker Roles call GetMessage (with a timeout) and then RemoveMessage.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling
• Consider a back-off polling approach: each empty poll increases the polling interval by 2x
• A successful poll resets the polling interval back to 1
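The policy above in code form; the floor and ceiling values are illustrative.

```python
def next_interval(current, got_message, floor=1, ceiling=60):
    """Truncated exponential back-off: each empty poll doubles the polling
    interval, capped at `ceiling`; a successful poll resets it to `floor`."""
    if got_message:
        return floor
    return min(current * 2, ceiling)
```

Seven empty polls in a row walk the interval up to the 60-second cap; a single successful poll drops it straight back to 1.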

Removing Poison Messages

[Figure: producers P1 and P2 and consumers C1 and C2 working against a queue holding msg 1 and msg 2.]

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (continued)
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (continued)
1. Dequeue(Q, 30 s) → msg 1
2. Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)

Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
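The dequeue-count rule from the recap, sketched with an in-memory queue; the threshold and the dead-letter list are illustrative choices, not service features.

```python
POISON_THRESHOLD = 2  # give up once a message has been dequeued more than twice

def drain(messages, handler, dead_letter):
    """Process a queue of dicts; failing messages become visible again,
    and messages dequeued too many times are removed as poison."""
    while messages:
        msg = messages.pop(0)
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > POISON_THRESHOLD:
            dead_letter.append(msg)       # poison: delete instead of retrying forever
            continue
        try:
            handler(msg)
        except Exception:
            messages.append(msg)          # simulate the visibility timeout expiring
```

Without the threshold check, a message the handler can never process would cycle through the queue forever, burning compute on every pass.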

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance = one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism

Finding Good Code Neighbors
• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

[Figure: uncompressed vs. compressed content; Gzip, minify JavaScript, minify CSS, minify images.]
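Point 1 is a one-liner with the standard library. The sample page below is made up, but the size reduction on repetitive markup is typical.

```python
import gzip

def gzip_content(body: bytes) -> bytes:
    """Compress response content before it goes over the wire;
    modern browsers decompress it on the fly."""
    return gzip.compress(body)

# An illustrative, highly repetitive HTML page.
page = b"<html>" + b"<p>hello cloud</p>" * 500 + b"</html>"
compressed = gzip_content(page)
```

On a page like this, the compressed body is a small fraction of the original, which directly cuts the bandwidth bill.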

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 to 1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
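The query-segmentation pattern described above is a plain split/join. In this sketch, `blast_task` is a stand-in for invoking NCBI-BLAST on a worker, not the real tool.

```python
def split_query(sequences, partition_size):
    """Split the input sequences into fixed-size partitions, one task each."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for running NCBI-BLAST over one partition of the query."""
    return [(seq, "alignment for " + seq) for seq in partition]

def merge_results(per_task_results):
    """Join step: concatenate per-partition results in partition order."""
    merged = []
    for result in per_task_results:
        merged.extend(result)
    return merged
```

In the real system, each partition becomes a work ticket on the dispatch queue, and the merging task runs once every partition has reported back.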

AzureBLAST Task-Flow
A simple split/join pattern
[Figure: a splitting task fans out into many BLAST tasks, which feed a merging task.]

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST

(Architecture diagram: the Web Role hosts the Web Portal, Web Service, and Job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work to Worker instances through a global dispatch queue. Job state lives in the Job Registry, an Azure Table. Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role keeps the NCBI databases current. Within each job, the task flow is: Splitting task → BLAST tasks in parallel → Merging task.)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track each job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal sits alongside the Web Portal, Web Service, Job registration, Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load is imbalanced, redistribute it manually


End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • Based on our estimates, real working instance time should be 6-8 days
  • Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
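The log analysis described here amounts to pairing "Executing" records with their "done" records. A sketch, with the line format assumed from the samples on this slide (note how task 251774 above starts but never finishes):

```python
# Pair "Executing" records with "done" records to compute task durations
# and flag tasks that never completed.
import re

EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done, it took ([\d.]+)\s*mins")

def analyze(lines):
    started, durations = set(), {}
    for line in lines:
        if m := EXEC_RE.search(line):
            started.add(m.group(1))
        elif m := DONE_RE.search(line):
            durations[m.group(1)] = float(m.group(2))
    unfinished = sorted(started - set(durations))
    return durations, unfinished
```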

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Penman-Monteith (1964)

ET = Water volume evapotranspired (m3 s-1 m-2)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = Latent heat of vaporization (J g-1)
Rn = Net radiation (W m-2)
cp = Specific heat capacity of air (J kg-1 K-1)
ρa = Dry air density (kg m-3)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s-1)
gs = Conductivity of plant stoma (inverse of rs) (m s-1)
γ = Psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants
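The Penman-Monteith form can be transcribed directly. A sketch only: the sample input values below are made up for illustration and are not from the MODISAzure data.

```python
# Direct transcription of the Penman-Monteith equation:
# ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Example with made-up (but plausible-magnitude) inputs:
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
```

Note that ET increases with net radiation Rn and decreases as the stomatal conductivity gs falls, which matches the physical intuition.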

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal into a Request Queue. The Data Collection Stage pulls source imagery from the download sites via the Download Queue; the Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage are fed by the Reprojection Queue, Reduction 1 Queue, and Reduction 2 Queue, with source metadata kept alongside. Scientific results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus, then dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a GenericWorker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue tasks and read/write <Input>Data Storage.)
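The GenericWorker behavior described above (dequeue, execute, retry failed tasks up to 3 times, track status) can be sketched as follows. The names and the in-memory "queue" are illustrative, not the MODISAzure code or the Azure SDK.

```python
# Sketch of a generic worker loop with bounded retries and status tracking.
MAX_ATTEMPTS = 3

def run_worker(task_queue, execute, task_status):
    while task_queue:
        task = task_queue.pop(0)
        record = task_status.setdefault(task, {"attempts": 0, "state": "queued"})
        record["attempts"] += 1
        try:
            execute(task)
            record["state"] = "done"
        except Exception:
            if record["attempts"] < MAX_ATTEMPTS:
                task_queue.append(task)      # re-queue for retry
                record["state"] = "retrying"
            else:
                record["state"] = "failed"   # give up after 3 attempts
```

In the real system the retry is driven by queue visibility timeouts and persisted TaskStatus entities rather than a Python list and dict.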

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances execute tasks against Reprojection Data Storage and Swath Source Data Storage.)

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November 2010 Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research, Roger Barga, Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components: Fabric Controller
  • Key Components: Fabric Controller (2)
  • Key Components: Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable, Fault-Tolerant Applications
  • Key Components – Compute: VM Roles
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Durable Storage, At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection: Things to Consider
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure: Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery, Sensors, Models, and Field Data
  • MODISAzure Four-Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage: Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources: Cloud Research Community Site
  • Resources: AzureScope
  • Resources: AzureScope (2)
  • Demonstration (2)

Application Model Comparison

Ad Hoc Application Model
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Application Model Comparison

Ad Hoc Application Model
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Windows Azure Application Model
• Web Role instances
• Worker Role instances
• Azure Storage: Blob, Queue, Table
• SQL Azure

Key Components

Fabric Controller
• Manages hardware and virtual machines for services

Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM

Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue
• Decouple parts of the application so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using:
    • The base OS
    • The differencing VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action, perhaps even relocating your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
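As a rough illustration of how the two files divide responsibilities, here is a minimal sketch. The service name, role names, instance count, and setting are invented for this example; consult the SDK schema reference for the authoritative element set.

```xml
<!-- ServiceDefinition.csdef: the "shape" of the service (role types, endpoints) -->
<ServiceDefinition name="MyResearchService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
  </WebRole>
  <WorkerRole name="ComputeWorker" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings (instance counts, keys) -->
<ServiceConfiguration serviceName="MyResearchService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="ComputeWorker">
    <Instances count="16" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

Changing the instance count in the .cscfg (not the .csdef) is what lets you scale a deployed service without rebuilding the package.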

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

Blobs
• Massive files, e.g., videos, logs

Drives
• Use standard file system APIs

Tables
• Non-relational, but with few scale limits
• Use SQL Azure for relational data

Queues
• Facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private
  • The default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blobs
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blobs
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: the blocks of Big.mpg are uploaded out of order (1 6 8 3 5 4 7 2), then committed as Big.mpg.)

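The block-upload-then-commit semantics can be modeled in memory. An illustration of the behavior only, not the storage client API:

```python
# In-memory model of block-blob semantics: blocks are uploaded out of order
# under block IDs; the blob's content is defined only when a block list is
# committed, and the commit order (not upload order) defines the blob.

class BlockBlob:
    def __init__(self):
        self._uncommitted = {}   # block id -> bytes
        self.content = None      # committed content

    def put_block(self, block_id, data):
        self._uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        self.content = b"".join(self._uncommitted[b] for b in block_ids)
        self._uncommitted.clear()  # uncommitted blocks would otherwise be GC'd
```

This is why a parallel uploader can push blocks from many threads and still produce a correctly ordered blob: ordering is decided once, at commit time.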

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size; then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted as a Page Blob

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in the .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table Name: Movies
  • Entities: Star Wars, Star Trek, Fan Boys
• Table Name: Customers
  • Entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

The partition key is the unit of scale
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• Controls entity locality

System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server busy
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• Or the single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync

(Diagram: partitions P1…Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

Server A – Table = Movies [Min - Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After the partition range splits:

Server A – Table = Movies [Min - Comedy)

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B – Table = Movies [Comedy - Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
A query response can stop short for any of these reasons:
• Maximum of 1,000 rows in a response
• At the end of a partition-range boundary
• Maximum of 5 seconds to execute the query
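The rule of thumb above reduces to one loop shape: keep reissuing the query with the returned token until no token comes back. A sketch with a simulated server (`query_page` is a hypothetical stand-in that returns at most 1,000 rows per response plus a continuation token):

```python
# Sketch of handling continuation tokens when querying a table service.
PAGE_LIMIT = 1000

def query_page(rows, token=None):
    """Simulated server: return one page of results and a continuation token."""
    start = token or 0
    page = rows[start:start + PAGE_LIMIT]
    next_token = start + PAGE_LIMIT if start + PAGE_LIMIT < len(rows) else None
    return page, next_token

def query_all(rows):
    results, token = [], None
    while True:
        page, token = query_page(rows, token)
        results.extend(page)
        if token is None:        # no continuation token: the query is complete
            return results
```

Code that stops after the first response silently drops everything past the first page, which is why the slide says "seriously."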

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Distribute load by using a hash, etc., as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy
• Load balancing partitions to meet traffic needs
• Load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
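The "distribute by using a hash as a prefix" tip from the recap can be sketched as follows. The bucket count and key format are assumptions for illustration: the point is that prefixing an append-only natural key (such as a timestamp) with a small hash bucket spreads new entities across partitions instead of hammering the last one.

```python
# Sketch: derive a partition key by prefixing a hash bucket to an
# append-only natural key, spreading inserts across NUM_BUCKETS partitions.
import hashlib

NUM_BUCKETS = 16

def partition_key(natural_key):
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: a range query over the natural key now needs NUM_BUCKETS parallel queries, one per bucket, which matches the "parallelize queries" advice above.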

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add messages to the queue; Worker Roles call GetMessage (with a timeout) to dequeue and process messages, then RemoveMessage to delete them.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

• Consider a back-off polling approach
• Each empty poll increases the interval by 2x, up to a maximum
• A successful poll sets the interval back to 1
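The policy above fits in a few lines. A sketch (the cap of 64 is an assumption; pick one that bounds your worst-case latency):

```python
# Truncated exponential back-off polling: double the interval on each empty
# poll up to a cap; reset to 1 on a successful poll.

def next_interval(current, got_message, cap=64):
    if got_message:
        return 1
    return min(current * 2, cap)

def polling_intervals(poll_results, cap=64):
    """Trace the interval used after each poll outcome."""
    intervals, current = [], 1
    for got_message in poll_results:
        current = next_interval(current, got_message, cap)
        intervals.append(current)
    return intervals
```

This keeps idle workers from hammering the queue (and paying per-transaction costs) while still reacting quickly once messages start arriving.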

Removing Poison Messages

Producers P1 and P2 feed queue Q; consumers C1 and C2 drain it:

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1)
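A sketch of that dequeue-count guard, using an in-process queue to model the service's DequeueCount and visibility-timeout behavior (the threshold of 3, and all names, are assumed values for illustration):

```python
import queue

MAX_DEQUEUE_COUNT = 3  # assumed threshold for the sketch

def get_message(q):
    """Model GetMessage: each delivery increments the message's DequeueCount,
    the counter the real queue service maintains for exactly this purpose."""
    msg = q.get_nowait()
    msg["dequeue_count"] += 1
    return msg

def process(q, msg, handler):
    if msg["dequeue_count"] > MAX_DEQUEUE_COUNT:
        return "deleted-as-poison"        # drop (or dead-letter) the message
    try:
        handler(msg["body"])
        return "processed"                # then DeleteMessage in the real API
    except Exception:
        q.put(msg)                        # becomes visible again, as after a
        return "requeued"                 # visibility-timeout expiry

q = queue.Queue()
q.put({"body": "bad payload", "dequeue_count": 0})

def crashing_handler(body):
    raise ValueError("cannot parse")      # this consumer always fails

outcomes = []
while not q.empty():
    outcomes.append(process(q, get_message(q), crashing_handler))
print(outcomes)   # three failed attempts, then the poison message is removed
```

Without the threshold, the poison message would circulate forever, blocking a worker on every visibility-timeout cycle.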

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures
• Invisible messages result in out-of-order delivery: do not rely on message order
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance need not mean one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
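The TPL's two shapes, data parallelism (one operation over a partitioned dataset) and task parallelism (distinct tasks running concurrently), have direct analogues in most runtimes. A small Python sketch using `concurrent.futures` (the workload `score` is a made-up stand-in for real per-item work):

```python
from concurrent.futures import ThreadPoolExecutor
import math

def score(item):
    # Stand-in for a CPU- or I/O-heavy unit of work.
    return math.isqrt(item)

items = list(range(100))

# Data parallelism: the same operation applied across a partitioned dataset.
with ThreadPoolExecutor(max_workers=4) as pool:
    data_results = list(pool.map(score, items))

# Task parallelism: different tasks submitted to run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, items)
    peak = pool.submit(max, items)
    task_results = (total.result(), peak.result())

print(data_results[:5], task_results)   # -> [0, 1, 1, 1, 2] (4950, 99)
```

As the slide notes, sizing the pool near the core count matters: oversubscribing processes or threads past the available cores mostly adds scheduling overhead.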

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs

Performance and Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content vs. compressed content, via Gzip, minified JavaScript, minified CSS, and minified images.)
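Gzip pays off most on the repetitive markup typical of web output. A quick sketch with Python's stdlib `gzip` (the page body is a made-up sample):

```python
import gzip

# A plausible HTML-ish payload: repetitive markup compresses very well.
page = b"<html><body>" + b"<div class='row'>item</div>" * 500 + b"</body></html>"

compressed = gzip.compress(page)
ratio = len(compressed) / len(page)
print(len(page), len(compressed), f"{ratio:.1%}")

# Browsers that send "Accept-Encoding: gzip" decompress this transparently;
# round-tripping shows no information is lost.
assert gzip.decompress(compressed) == page
```

Every byte not sent is bandwidth not billed, and the compute cost of compressing is usually the cheaper side of the trade.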

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST); this needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10GB
• 100 nodes means the peak storage bandwidth demand could reach 1TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow: a simple split/join pattern (splitting task → BLAST tasks in parallel → merging task)

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
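The split/join control flow can be sketched independently of BLAST itself; `blast_one` below is a placeholder that fabricates a "hit", not NCBI-BLAST, and the partition size of 3 stands in for the profiled granularity the slide recommends:

```python
from concurrent.futures import ThreadPoolExecutor

def blast_one(sequence):
    """Stand-in for running NCBI-BLAST on one input sequence."""
    return f"hit:{sequence}"

def split(sequences, partition_size):
    # Splitting task: partition the input query sequences.
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_partition(partition):
    # One BLAST task: query a whole partition on one worker.
    return [blast_one(s) for s in partition]

def merge(partial_results):
    # Merging task: concatenate results once all partitions are done.
    return [hit for part in partial_results for hit in part]

sequences = [f"seq{i}" for i in range(10)]
partitions = split(sequences, 3)            # granularity knob: 3 per task
with ThreadPoolExecutor(max_workers=4) as pool:
    results = merge(pool.map(run_partition, partitions))
print(len(partitions), results[:2])   # -> 4 ['hit:seq0', 'hit:seq1']
```

In AzureBLAST the partitions travel through a queue to worker roles rather than a local thread pool, but the split, parallel query, and merge stages are the same.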

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to Worker instances and a Database updating Role; an Azure Table holds the Job Registry, and Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc. Tasks follow the split/join flow: splitting task → BLAST tasks in parallel → merging task.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

(Diagram: the Web Portal and Web Service handle job registration into the Job Registry; the Job Scheduler and Scaling Engine pick up registered jobs.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

End Result
• Total size of the output result is ~230GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
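Given records in that shape, finding tasks that started but never logged a completion is a small scan. A sketch over an abridged copy of the slide's log text (the regexes assume the "Executing the task N" / "Execution of task N is done" phrasing shown above):

```python
import re

log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = set(), set()
for line in log.splitlines():
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    if m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

# A start with no matching completion signals a failed or lost task.
print(sorted(started - finished))   # -> ['251774']
```

The same scan, grouped by node ID and timestamp, is what surfaces the update-domain and fault-domain patterns discussed next.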

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes; each group is an update domain.

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed, and the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air, the inverse of ra (m s-1)
• gs = conductivity of plant stoma air, the inverse of rs (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100MB (4K files)
• Vegetative clumping: ~5MB (1 file)
• Climate classification: ~1MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue; Download, Reprojection, Reduction 1, and Reduction 2 Queues drive the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages against the Source Metadata and the Source Imagery Download Sites; scientists download the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • The execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage>Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage>Task Queue; GenericWorker instances consume the tasks and read/write <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus to the Job Queue, where each entity specifies a single reprojection job request. The monitor then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue, where each entity specifies a single reprojection task, i.e. a single tile. GenericWorker (Worker Role) instances consume the tasks against Reprojection Data Storage and Swath Source Data Storage. Query the SwathGranuleMeta table for the geo-metadata, e.g. boundaries, of each swath tile; query the ScanTimeList table for the list of satellite scan times that cover a target tile.)

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (workers / data / compute / cost):
• Data collection (<10 workers): 400-500 GB, 60K files, 10 MB/sec; 11 hours; $50 upload + $450 storage
• Reprojection (20-100 workers): 400 GB, 45K files; 3500 hours; $420 CPU + $60 download
• Derivation reduction (20-100 workers): 5-7 GB, 55K files; 1800 hours; $216 CPU + $1 download + $6 storage
• Analysis reduction (20-100 workers): <10 GB, ~1K files; 1800 hours; $216 CPU + $2 download + $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Application Model Comparison

Ad Hoc Application Model:
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Windows Azure Application Model:
• Web Role instances and Worker Role instances
• Azure Storage (Blob, Queue, Table)
• SQL Azure

Key Components

• Fabric Controller: manages hardware and virtual machines for the service
• Compute
  • Web Roles: web application front end
  • Worker Roles: utility compute
  • VM Roles: custom compute role; you own and customize the VM
• Storage
  • Blobs: binary objects
  • Tables: entity storage
  • Queues: role coordination
  • SQL Azure: SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department: a "cloud layer" on top of Windows Server 2008 and a custom version of Hyper-V called the Windows Azure Hypervisor, allowing automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service: role types, role VM sizes, external and internal endpoints, local storage
  • The configuration settings configure a service: instance count, storage keys, application-specific settings
• Manages "nodes" and "edges" in the "fabric" (the hardware): power-on automation devices, routers, switches, hardware load balancers, physical servers, virtual servers
• State transitions: current state and goal state; it does what is needed to reach and maintain the goal state
• It's a perfect IT employee: never sleeps, doesn't ever ask for a raise, and always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components - Compute: Web Roles

The web front end: a cloud web server serving web pages and web services.

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components - Compute: Worker Roles

• Utility compute on Windows Server 2008, for background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services: Azure Storage, SQL Azure, other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, fault-tolerant applications: queues are the application glue.
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components - Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + a service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files: ServiceDefinition.csdef and ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files: an encrypted package of your code, and your config file
• You must create an Azure account, then a service, and then you deploy your code
• This can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce - The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model: determine resource requirements, create role images
2. Allocate resources
3. Prepare nodes: place role images on nodes, configure settings, start roles
4. Configure load balancers
5. Maintain service health: if a role fails, restart the role based on policy; if a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable storage at massive scale:
• Blob: massive files, e.g. videos, logs
• Drive: use standard file-system APIs
• Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwriting any existing blob
  • GetBlob: gets the whole blob or a specific range
  • DeleteBlob, CopyBlob, SnapshotBlob, LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (the default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks, each identified by a Block ID
  • Size limit: 200GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages, each identified by its offset from the start of the blob
  • Size limit: 1TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg uploaded as blocks 1-8, then committed into the blob Big.mpg.)


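The block semantics can be modeled in a few lines. `put_block` and `put_block_list` below mirror the Put Block / Put Block List REST operations in shape only; the block size, ID scheme, and storage dictionaries are assumptions for the sketch:

```python
import base64

BLOCK_SIZE = 4  # tiny for the demo; real blocks can be much larger

uncommitted = {}   # block id -> bytes; the week-long staging area
committed = b""

def put_block(data, index):
    """Stage one block; the block id is an opaque base64 string."""
    block_id = base64.b64encode(f"{index:08d}".encode()).decode()
    uncommitted[block_id] = data
    return block_id

def put_block_list(block_ids):
    """Commit blocks in the order given, forming the readable blob."""
    global committed
    committed = b"".join(uncommitted[b] for b in block_ids)

payload = b"Big.mpg pretend contents!"
ids = [put_block(payload[i:i + BLOCK_SIZE], i // BLOCK_SIZE)
       for i in range(0, len(payload), BLOCK_SIZE)]

# Blocks may be uploaded (and retried) in any order; only the order in the
# commit call determines the final blob, as in the Big.mpg diagram above.
put_block_list(ids)
print(committed == payload)   # -> True
```

This is why block blobs suit streaming uploads: independent blocks can go up in parallel, and a failed block is simply re-sent before the single atomic commit.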
Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• You must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
    http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists, as a Page Blob, even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives, as drive letters and Page Blob URLs for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

• Account: each storage account holds many tables (e.g. account "MovieData")
• Table: a named set of entities (e.g. table "Movies" containing Star Wars, Star Trek, Fan Boys; table "Customers" containing Brian H. Prince, Jason Argonaut, Bill Gates)
• Entity: tables store entities, and entity schema can vary in the same table

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data, using thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, REST with any platform or language

It is not relational. You cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Run server-side aggregates (no server-side Count(), for example)

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:
• Partitioning is different for each data type (blobs, entities, queues)
• Every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• It can also mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Messages – Queue name
• All messages for a single queue belong to the same partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
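The partition keys above can be modelled directly. A small Python sketch (the helper names are illustrative, not part of any Azure SDK):

```python
# Partition keys per abstraction, as in the tables above:
#   entities  -> TableName + PartitionKey
#   blobs     -> container name + blob name
#   messages  -> queue name
def entity_partition(table_name, partition_key):
    return (table_name, partition_key)

def blob_partition(container_name, blob_name):
    return (container_name, blob_name)

def message_partition(queue_name):
    return (queue_name,)

# Two blobs with the same name in different containers land in
# different partitions, because the container name is part of the key:
assert blob_partition("image", "annarbor/bighouse.jpg") != \
       blob_partition("video", "annarbor/bighouse.jpg")

# All messages in one queue share a single partition:
assert message_partition("jobs") == message_partition("jobs")
```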

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
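The "503 Server Busy, back off exponentially" advice can be sketched as a retry wrapper. A hedged Python sketch (ServerBusy, the base delay, and the cap are illustrative assumptions, not Azure SDK names):

```python
import time

class ServerBusy(Exception):
    """Models the '503 Server Busy' response."""

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Truncated exponential backoff: the delay doubles with each
    attempt (0-based) but never exceeds the cap."""
    return min(cap, base * (2 ** attempt))

def with_backoff(op, max_retries=8, sleep=time.sleep):
    """Retry `op` on ServerBusy, sleeping a growing delay in between."""
    for attempt in range(max_retries):
        try:
            return op()
        except ServerBusy:
            sleep(backoff_delay(attempt))
    raise ServerBusy("still busy after %d retries" % max_retries)
```

Production variants usually add jitter so that many instances backing off together do not retry in lockstep.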

Partitions and Partition Ranges

Example: a Movies table with PartitionKey = Category and RowKey = Title:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially the whole table is served from one server:
  Server A: Table = Movies [Min – Max]

As traffic grows, the system splits the partition range across servers:
  Server A: Table = Movies [Min – Comedy)
  Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query can return a continuation token (and a partial result) in any of these cases:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
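A drain loop over continuation tokens might look like this Python sketch (`execute_query` is a stand-in for one table-service round trip, not a real SDK call):

```python
def query_all(execute_query):
    """Drain a paged query: re-issue with the returned continuation
    token until the service stops handing one back. `execute_query(token)`
    models one round trip returning (rows, next_token); a page holds at
    most 1000 rows and may be cut short at a partition-range boundary or
    by the 5-second execution limit."""
    rows, token = [], None
    while True:
        page, token = execute_query(token)
        rows.extend(page)
        if token is None:       # no continuation token: we are done
            return rows

# Simulated service holding 2500 rows, served 1000 at a time:
DATA = list(range(2500))

def fake_query(token):
    start = token or 0
    page = DATA[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(DATA) else None
    return page, nxt

assert query_all(fake_query) == DATA   # 3 round trips: 1000 + 1000 + 500
```

The point of the slide is the loop itself: code that assumes one round trip returns everything silently drops data.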

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute load by using a hash etc. as a prefix
• Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
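The work ticket pattern mentioned above, sketched in Python with in-memory dictionaries standing in for blob storage and the queue (names are illustrative):

```python
blobs = {}   # stand-in for blob storage
queue = []   # stand-in for a queue whose messages max out at 8 KB

def submit_work(job_id, payload):
    """Producer: park the (possibly huge) payload in a blob and enqueue
    a small work ticket that merely points at it."""
    blob_name = "tickets/%s" % job_id
    blobs[blob_name] = payload
    queue.append({"job": job_id, "blob": blob_name})  # well under 8 KB

def take_work():
    """Consumer: pop a ticket and dereference the blob."""
    ticket = queue.pop(0)
    return ticket["job"], blobs[ticket["blob"]]
```

The ticket stays tiny regardless of payload size, which is how the 8 KB message limit coexists with multi-megabyte work items.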

Queue Terminology

Message Lifecycle

[Diagram: a web role calls PutMessage to add Msg 1 … Msg 4 to the queue; worker roles call GetMessage (with a visibility timeout) to retrieve a message, and RemoveMessage to delete it once processed.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
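The Get/Delete lifecycle can be modelled with a toy queue; a Python sketch (the class and its field names are illustrative, with an explicit clock passed in instead of wall time):

```python
import uuid

class ToyQueue:
    """Models the message lifecycle: GetMessage hides a message for a
    visibility timeout and returns a pop receipt; DeleteMessage needs
    that receipt. Time is a plain number of seconds."""
    def __init__(self):
        self._messages = []

    def put(self, body):
        self._messages.append({"body": body, "visible_at": 0.0, "receipt": None})

    def get(self, now, timeout=30.0):
        """Return (body, pop_receipt) for the first visible message and
        hide it for `timeout` seconds, or None if nothing is visible."""
        for m in self._messages:
            if m["visible_at"] <= now:
                m["visible_at"] = now + timeout   # invisible until timeout
                m["receipt"] = uuid.uuid4().hex   # fresh pop receipt
                return m["body"], m["receipt"]
        return None

    def delete(self, receipt):
        self._messages = [m for m in self._messages if m["receipt"] != receipt]
```

If the consumer crashes before deleting, the message simply reappears after the timeout, which is exactly the at-least-once guarantee described earlier.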

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1
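The polling rule is a one-line state update; a Python sketch (the floor and ceiling values are illustrative):

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential backoff polling: each empty poll doubles
    the interval (capped at a ceiling); a successful poll resets it to
    the floor."""
    return floor if got_message else min(ceiling, current * 2)
```

A worker that keeps finding an empty queue quickly settles at the ceiling, so idle workers cost almost no transactions; the moment work appears, the interval snaps back to the floor.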

Removing Poison Messages

Scenario: producers P1 and P2 feed a queue consumed by C1 and C2, and a "poison" message repeatedly crashes its consumer:

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
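The dequeue-count rule illustrated above, as a Python sketch; the threshold of 2 follows the sequence shown, and the dead-letter list is an illustrative stand-in for wherever you would park bad messages for inspection:

```python
POISON_THRESHOLD = 2

def handle(msg, process, dead_letters):
    """If a message has been dequeued more than POISON_THRESHOLD times,
    assume it keeps killing its consumer and remove it (here: diverted
    to a dead-letter list) instead of processing it again."""
    if msg["dequeue_count"] > POISON_THRESHOLD:
        dead_letters.append(msg)
        return "discarded"
    process(msg)
    return "processed"
```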

Queues Recap

Make message processing idempotent
• No need to deal with failures

Do not rely on order
• Invisible messages result in out-of-order delivery

Use dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count

Messages > 8 KB?
• Use a blob to store the message data, with a reference in the message
• Batch messages
• Garbage collect orphaned blobs

Use message count to scale
• Dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Saving bandwidth costs often leads to savings in other places:
• Sending fewer things over the wire often means getting fewer things from storage
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
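Gzipping output is a one-liner in most stacks; a Python illustration of why repetitive HTML compresses so well:

```python
import gzip

# Typical generated markup is highly repetitive:
html = b"<html>" + b"<p>hello azure</p>" * 500 + b"</html>"

compressed = gzip.compress(html)   # what the server would send with
                                   # Content-Encoding: gzip

assert len(compressed) < len(html)           # repetitive markup shrinks dramatically
assert gzip.decompress(compressed) == html   # the browser inflates on the fly
```

That size reduction is paid for in CPU at compression time, which is the trade-off in point 2 above.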

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model
  • Web Role + Queue + Worker

• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into BLAST tasks (BLAST task, BLAST task, …), whose outputs are combined by a merging task.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long period of waiting in case of instance failure
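The split/join pattern can be sketched in a few lines of Python (`blast_task` is a stand-in for invoking NCBI-BLAST, not the real tool):

```python
def split(sequences, partition_size=100):
    """Query segmentation: cut the input into fixed-size partitions
    (100 sequences per partition was the best choice per the
    micro-benchmarks on the next slide)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for running NCBI-BLAST over one partition."""
    return ["hit:" + seq for seq in partition]

def merge(partial_results):
    """Join step: concatenate per-partition results in order."""
    return [hit for part in partial_results for hit in part]

# The BLAST tasks in the middle are pleasingly parallel:
results = merge(blast_task(p) for p in split(["acgt"] * 250))
assert len(results) == 250
```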

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine against a Job Registry kept in an Azure Table; worker instances pull work from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases. Each job follows the split/join task flow: a splitting task fans out BLAST tasks, and a merging task joins their results.]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

[Diagram: the Job Portal fronts the Web Portal and Web Service for job registration; the Job Scheduler and Scaling Engine work against the Job Registry.]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

[Diagram: instance counts for the deployments: 50, 62, 62, 62, 62, 62, 50, 62]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6-8 days
  • Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record pair looks like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
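Pairing "Executing" records with their "done" records is a small parsing job; a Python sketch over abbreviated log lines in the format above:

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, durations = set(), {}
for line in LOG.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done, it took ([\d.]+) mins", line)
    if m:
        durations[m.group(1)] = float(m.group(2))

lost = started - set(durations)   # "Executing" with no matching "done"
assert lost == {"251774"}         # the node moved on without completing it
```

Running this over the full North Europe and West Europe logs is what surfaced the update-domain and fault-domain behaviors on the next slides.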

Surviving System Upgrades

North Europe data center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins per group, ~6 nodes in one group

Surviving Storage Failures

West Europe data center: 30,976 tasks were completed, and the job was killed.
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
  ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
  Δ  = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m⁻²)
  cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
  ρa = dry air density (kg m⁻³)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s⁻¹)
  gs = conductivity of plant stoma (inverse of rs) (m s⁻¹)
  γ  = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal onto a Request Queue; a Download Queue feeds the Data Collection stage (pulling from source imagery download sites), a Reprojection Queue feeds the Reprojection stage, and Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction stages; source metadata is consulted along the way, and scientific results are downloaded at the end.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request flows into the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (the GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker (Worker Role) instances dispatch from the <PipelineStage> Task Queue, read <Input>Data Storage, and persist <PipelineStage>TaskStatus.]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile); GenericWorker (Worker Role) instances dispatch from the Task Queue against Reprojection Data Storage and Swath Source Data Storage.]

• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (data, files, compute, workers; cost):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit – November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com


Application Model Comparison

Ad hoc application model:
• Machines running IIS / ASP.NET
• Machines running Windows Services
• Machines running SQL Server

Windows Azure application model:
• Web Role instances
• Worker Role instances
• Azure Storage (Blob, Queue, Table) and SQL Azure

Key Components

Fabric Controller
• Manages hardware and virtual machines for services

Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM

Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud

Key Components: Fabric Controller

• Think of it as an automated IT department
  • A "cloud layer" on top of:
    • Windows Server 2008
    • A custom version of Hyper-V called the Windows Azure Hypervisor
  • Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

The web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other Web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components ndash ComputeVM Roles

bull Customized Rolebull You own the box

bull How it worksbull Download ldquoGuest OSrdquo to Server 2008 Hyper-Vbull Customize the OS as you need tobull Upload the differences VHDbull Azure runs your VM role usingbull Base OSbull Differences VHD

Application Hosting

'Grokking' the service model:
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find the hardware a home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action. Perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
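A minimal sketch of what the two files look like. The role names, sizes, and settings below are hypothetical, illustrative values, not from this deck; the shape (definition vs. configuration split) matches the slides above.

```xml
<!-- ServiceDefinition.csdef: the "shape" of the service (role names/sizes are hypothetical) -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
  </WebRole>
  <WorkerRole name="BackgroundWorker" vmsize="Medium" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings that configure the deployed service -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="StorageAccountKey" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

Note that instance count lives in the .cscfg, which is why you can scale out without redeploying the package.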

Service Definition

Service Configuration

GUI

Double-click on the role name in the Azure project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
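The upload-blocks-then-commit flow can be sketched in memory. This is an illustrative model of the protocol, not the real Storage Client Library API; the class and method names are invented for the sketch. The base64 block IDs match the REST API's requirement that block IDs be base64-encoded strings of equal length.

```python
import base64

# In-memory sketch of the block blob model: upload blocks with IDs,
# then commit an ordered block list to form the final blob.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes
        self.committed = []     # ordered list of committed block IDs

    @staticmethod
    def make_block_id(n):
        # Block IDs are opaque base64 strings; encoding a fixed-width
        # counter is a common choice.
        return base64.b64encode(f"{n:08d}".encode()).decode()

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit blocks in any chosen order; blocks never committed
        # are eventually garbage-collected (after about a week).
        self.committed = list(block_ids)

    def content(self):
        return b"".join(self.uncommitted[b] for b in self.committed)

blob = BlockBlob()
ids = [BlockBlob.make_block_id(i) for i in range(3)]
for i, bid in enumerate(ids):
    blob.put_block(bid, f"chunk{i}-".encode())
blob.put_block_list([ids[2], ids[0], ids[1]])   # commit in a chosen order
print(blob.content())  # b'chunk2-chunk0-chunk1-'
```

Because commit order is independent of upload order, blocks can be uploaded in parallel and assembled at the end.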

Blocks

(Diagram: Big.mpg uploaded as numbered blocks, then committed in order to form Big.mpg.)

Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
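The two rules above, 512-byte alignment and in-place writes, can be sketched with a byte array standing in for the blob (this is a toy model, not the real API):

```python
PAGE_SIZE = 512

def aligned(offset, length):
    """A page-blob write range must start and end on 512-byte boundaries."""
    return offset % PAGE_SIZE == 0 and length % PAGE_SIZE == 0

# Put Blob fixes the maximum size up front; here, a 4-page blob.
blob = bytearray(4 * PAGE_SIZE)

def put_page(offset, data):
    if not aligned(offset, len(data)):
        raise ValueError("writes must align to 512-byte page boundaries")
    blob[offset:offset + len(data)] = data   # in-place, immediately committed

put_page(512, b"\x01" * 512)      # fine: page-aligned
print(aligned(100, 512))          # False: offset 100 is not page-aligned
```

This in-place semantics is what makes page blobs suitable for random read/write workloads such as the Windows Azure Drive described below.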

BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives, consisting of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in the .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table name: Movies
• Star Wars
• Star Trek
• Fan Boys

Table name: Customers
• Brian H. Prince
• Jason Argonaut
• Bill Gates

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• A "Server Busy" response means single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

• Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition.

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages: Queue name. All messages for a single queue belong to the same partition.

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, ..., Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
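The targets above translate directly into back-of-envelope capacity planning. A small sketch, using the 2010-era per-partition numbers quoted on this slide (the target workloads are made-up examples):

```python
import math

# Per-partition targets from the slide above.
PARTITION_TPS = 500      # single queue/table partition
BLOB_MBPS = 60           # single blob partition

def partitions_needed(target, per_partition):
    """Minimum partition count to sustain a target aggregate rate."""
    return math.ceil(target / per_partition)

# Hypothetical workloads:
print(partitions_needed(5000, PARTITION_TPS))  # 10 table partitions for 5,000 tps
print(partitions_needed(600, BLOB_MBPS))       # 10 blob partitions for 600 MB/s
```

The partition count then drives PartitionKey design: you need at least that many distinct, well-distributed key values.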

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

Partitions and Partition Ranges

Initially one server holds the whole table:
• Server A: Table = Movies [Min - Max]

As load grows, the system splits the table's partition ranges across servers:
• Server A: Table = Movies [Min - Comedy)
• Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query

Tables Recap

• Select a PartitionKey and RowKey that help scale: distribute load by using a hash etc. as a prefix, and avoid "append only" patterns
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server busy": the system load balances partitions to meet traffic needs, and "Server busy" means the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
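The continuation-token loop is the part most often forgotten. An in-memory sketch of the paging contract (not the real WCF Data Services client; `query` here is a stand-in for a table query):

```python
# The service returns at most 1,000 rows per response plus a continuation
# token; clients must keep querying until the token is exhausted.
MAX_ROWS = 1000

rows = [f"row{i}" for i in range(2500)]   # pretend table contents

def query(continuation=0):
    """Stand-in for one table-query round trip."""
    page = rows[continuation:continuation + MAX_ROWS]
    more = continuation + MAX_ROWS < len(rows)
    return page, (continuation + MAX_ROWS if more else None)

results, token = [], 0
while token is not None:
    page, token = query(token)
    results.extend(page)

print(len(results))  # 2500: dropping the loop would silently return only 1000
```

The same loop is needed even for small result sets, since a token can also appear at a partition range boundary or after 5 seconds of execution.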

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1-4 to a queue; Worker Roles call GetMessage (with a timeout) to dequeue messages and RemoveMessage to delete them after processing.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a maximum
• A successful poll sets the interval back to 1
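The policy above is two lines of logic. A minimal sketch (the cap of 64 is an assumed, illustrative maximum, not from the deck):

```python
# Truncated exponential back-off: each empty poll doubles the wait
# (capped at MAX_INTERVAL); a successful poll resets it to 1.
MAX_INTERVAL = 64   # illustrative cap

def next_interval(current, got_message):
    if got_message:
        return 1
    return min(current * 2, MAX_INTERVAL)

interval, observed = 1, []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    observed.append(interval)
print(observed)  # [2, 4, 8, 1, 2]
```

In a worker role, the interval would be the sleep between GetMessage polls, keeping transaction costs low on idle queues while staying responsive under load.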

Removing Poison Messages

(Diagram sequence: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30-second visibility timeout.)

Scenario 1:
1. C1 calls GetMessage(Q, 30 s), gets msg 1
2. C2 calls GetMessage(Q, 30 s), gets msg 2

Scenario 2 (consumer crash):
1. C1 calls GetMessage(Q, 30 s), gets msg 1
2. C2 calls GetMessage(Q, 30 s), gets msg 2
3. C2 consumes msg 2
4. C2 calls DeleteMessage(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after its dequeue
7. C2 calls GetMessage(Q, 30 s), gets msg 1

Scenario 3 (poison message removal):
1. C1 calls Dequeue(Q, 30 s), gets msg 1
2. C2 calls Dequeue(Q, 30 s), gets msg 2
3. C2 consumes msg 2
4. C2 calls Delete(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after dequeue
7. C2 calls Dequeue(Q, 30 s), gets msg 1
8. C2 crashes
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarts
11. C1 calls Dequeue(Q, 30 s), gets msg 1
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1 calls Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
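The poison-message scenario above reduces to a small piece of consumer logic. An in-memory sketch of the visibility timeout and DequeueCount mechanics (illustrative names, not the real queue API):

```python
# At-least-once delivery with a visibility timeout and a DequeueCount-based
# poison-message threshold, modeled in memory.
POISON_THRESHOLD = 3

class Queue:
    def __init__(self):
        self.messages = []   # each: dict with body, visible_at, dequeue_count

    def put(self, body):
        self.messages.append({"body": body, "visible_at": 0.0, "dequeue_count": 0})

    def get(self, visibility_timeout, now):
        for m in self.messages:
            if m["visible_at"] <= now:
                m["visible_at"] = now + visibility_timeout   # hide, don't delete
                m["dequeue_count"] += 1
                return m
        return None

    def delete(self, m):
        self.messages.remove(m)

q = Queue()
q.put("bad-input")

# Simulate consumers that crash before deleting: the message keeps reappearing
# after each 30 s visibility timeout, until the threshold trips.
now = 0.0
while True:
    msg = q.get(visibility_timeout=30, now=now)
    if msg["dequeue_count"] >= POISON_THRESHOLD:
        q.delete(msg)          # dead-letter it instead of retrying forever
        break
    now += 30                  # consumer crashes; message becomes visible again

print(msg["dequeue_count"])  # 3
```

Without the threshold, a message whose processing always crashes the worker would cycle through the queue forever.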

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library:
  • Data parallelism
  • Task parallelism
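The two styles named above can be sketched briefly. This is an analogous example in Python rather than .NET's Task Parallel Library, since the distinction is the same in any language:

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation applied across a collection.
    squares = list(pool.map(lambda x: x * x, range(8)))

    # Task parallelism: distinct tasks running concurrently.
    t1 = pool.submit(sum, range(100))
    t2 = pool.submit(max, [3, 1, 4, 1, 5])
    results = (t1.result(), t2.result())

print(squares, results)  # [0, 1, 4, 9, 16, 25, 36, 49] (4950, 5)
```

In the TPL the analogues would be `Parallel.For`/PLINQ for the first and `Task.Factory.StartNew` for the second.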

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs

Performance vs. Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
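Gzip pays for itself quickly on repetitive text content. A quick sketch of the size difference on a made-up HTML-ish payload:

```python
import gzip

# Repetitive markup compresses extremely well.
payload = b"<div class='row'>hello world</div>\n" * 200
compressed = gzip.compress(payload)

print(len(payload), len(compressed))           # compressed is far smaller
assert gzip.decompress(compressed) == payload  # browsers decompress on the fly
```

Smaller responses cut both bandwidth charges and the number of bytes fetched from storage, which is the "savings in other places" effect noted above.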

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage traffic could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
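The split/query/merge pattern can be sketched compactly. The per-partition function below is a stand-in for running NCBI-BLAST on one worker, not real BLAST; the partition size of 100 sequences matches the micro-benchmark finding quoted later in this deck:

```python
# Query-segmentation pattern: split input sequences into fixed-size
# partitions, process each independently, then merge the results.
SEQUENCES_PER_PARTITION = 100

def split(sequences, size=SEQUENCES_PER_PARTITION):
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

def blast_partition(partition):
    # Stand-in for running NCBI-BLAST over one partition on one worker.
    return [f"hit:{seq}" for seq in partition]

def run(sequences):
    partitions = split(sequences)                        # splitting task
    results = [blast_partition(p) for p in partitions]   # parallel in practice
    return [hit for part in results for hit in part]     # merging task

hits = run([f"seq{i}" for i in range(250)])
print(len(hits))  # 250 hits across 3 partitions
```

In AzureBLAST each partition becomes a work-ticket message on the dispatch queue, so the merge step must tolerate partitions completing out of order.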

AzureBLAST Task Flow: A Simple Split/Join Pattern

Splitting task → BLAST task, BLAST task, BLAST task, ... → Merging task

Leverage the multi-core capacity of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting time in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and the most economical throughput
• It fully utilizes the resource

AzureBLAST Architecture

(Diagram:)
• Web Role: web portal, web service, job registration
• Job Management Role: job scheduler, global dispatch queue, scaling engine
• Workers: splitting task → BLAST tasks → merging task
• Database updating Role
• Azure Table: job registry
• Azure Blob: NCBI databases, BLAST databases, temporary data, etc.

AzureBLAST Job Portal

• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory states

(Diagram: the job portal's web portal and web service feed job registration, the job scheduler, the scaling engine, and the job registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences
• "All against all" query:
  • The database is also the input query
  • The protein database is large (4.2 GB)
  • In total, 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), in four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Diagram: deployments of 50 and 62 extra-large instances across the four datacenters.)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
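The anomaly check described above is a simple set difference over the log: tasks with an "Executing" record but no matching "done" record never completed. A sketch (the embedded log is a short excerpt of the records shown above):

```python
import re

log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started = set(re.findall(r"Executing the task (\d+)", log))
finished = set(re.findall(r"Execution of task (\d+) is done", log))
print(sorted(started - finished))  # ['251774'] never completed
```

Run over the full logs, this kind of analysis is what surfaced the update-domain and fault-domain behavior discussed next.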

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins of downtime
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed, and then the job was killed.

• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

$$ET = \frac{\Delta R_n + \rho_a c_p\,(\delta q)\,g_a}{\left(\Delta + \gamma\left(1 + g_a/g_s\right)\right)\lambda_v}$$

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J g⁻¹)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage, visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage (download queue, source imagery download sites, source metadata), then the reprojection queue, then the reduction 1 and reduction 2 queues across the derivation and analysis reduction stages; scientific results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks, recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request flows to the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and dispatches to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role:
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> task queue and read/write <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request enters the job queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue for the Generic Workers (Worker Roles), which read reprojection data storage.)

• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e. a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g. boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile
• Swath source data storage holds the input imagery

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (via the AzureMODIS Service Web Role Portal):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Key Components

Fabric Controller
• Manages hardware and virtual machines for service

Compute
• Web Roles: web application front end
• Worker Roles: utility compute
• VM Roles: custom compute role; you own and customize the VM

Storage
• Blobs: binary objects
• Tables: entity storage
• Queues: role coordination
• SQL Azure: SQL in the cloud

Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines

Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of Windows Server 2008 and a custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
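The current-state/goal-state behavior described above is essentially a reconciliation loop. The sketch below is a toy illustration of that idea only; the names and data shapes are made up and bear no relation to the real controller's internals:

```python
# Toy reconciliation loop in the spirit of the fabric controller:
# compare the current state to the goal state and act on the difference.
# All names here are illustrative, not the real controller's API.

def reconcile(current, goal):
    """Return the actions needed to move `current` toward `goal`.

    Both arguments map role names to instance counts.
    """
    actions = []
    for role, wanted in goal.items():
        have = current.get(role, 0)
        if have < wanted:
            actions.append(("start", role, wanted - have))
        elif have > wanted:
            actions.append(("stop", role, have - wanted))
    return actions

current = {"web": 1, "worker": 4}
goal = {"web": 2, "worker": 3}
print(reconcile(current, goal))  # [('start', 'web', 1), ('stop', 'worker', 1)]
```

Running the loop repeatedly against a changing `current` state is what keeps the deployment at its goal state after failures.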

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using the base OS plus the differencing VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double-click on the role name in the Azure project

Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob: massive files, e.g., videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues: facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Figure: blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2), then committed into the final Big.mpg blob]

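The upload-then-commit model above is easy to mimic in a few lines. This is an in-memory stand-in for the idea only, not the real PutBlock/PutBlockList REST calls:

```python
# Simulate the block-blob upload pattern: upload named blocks in any
# order, then commit an ordered block list to form the final blob.
import base64

def split_into_blocks(data: bytes, block_size: int):
    """Yield (block_id, chunk) pairs; IDs are base64, as the service requires."""
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"{i // block_size:08d}".encode()).decode()
        yield block_id, data[i:i + block_size]

def commit(uploaded: dict, block_list: list) -> bytes:
    """Assemble the blob from the committed block list, in list order."""
    return b"".join(uploaded[bid] for bid in block_list)

data = b"a" * 10 + b"b" * 10 + b"c" * 5
uploaded = {}
order = []
for bid, chunk in split_into_blocks(data, block_size=10):
    uploaded[bid] = chunk   # blocks may arrive in any order
    order.append(bid)
assert commit(uploaded, order) == data  # the commit step reassembles the file
```

Because the commit is a separate step, uploads can proceed in parallel and a failed chunk can be retried without restarting the whole transfer.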

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table "Movies": entities Star Wars, Star Trek, Fan Boys
• Table "Customers": entities Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)

The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

Server Busy:
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
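Partition locality, as described above, is easy to visualize: grouping entities by PartitionKey shows which rows live together on one server (and so can share an entity group transaction). A sketch with illustrative data:

```python
# Group entities by partition key to show which rows share a partition
# (and can therefore be served together by one server). Data is made up.
from collections import defaultdict

entities = [
    {"PartitionKey": "1", "RowKey": "Customer-John Smith"},
    {"PartitionKey": "1", "RowKey": "Order-1"},
    {"PartitionKey": "2", "RowKey": "Customer-Bill Johnson"},
    {"PartitionKey": "2", "RowKey": "Order-3"},
]

partitions = defaultdict(list)
for e in entities:
    partitions[e["PartitionKey"]].append(e["RowKey"])

print(dict(partitions))
# Customer 1's rows land in one partition, customer 2's in another.
```

Choosing CustomerId as the PartitionKey is what makes "customer plus their orders" a single-partition (and therefore transactable) unit.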

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: Server 1, Server 2, and Server 3 each holding copies of partitions P1, P2, ..., Pn]

Scalability Targets

Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

[Example table "Movies": PartitionKey = Category, RowKey = Title, plus Timestamp and ReleaseDate]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Partitions and Partition Ranges

• Initially one server holds the whole table: Server A, Table = Movies [Min - Max]
• Under load, the range splits across servers: Server A, Table = Movies [Min - Comedy); Server B, Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect continuation tokens for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: on "server busy", either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
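The "always handle continuation tokens" rule amounts to a simple client-side loop: keep querying until the service stops returning a token. The pager below simulates that contract (the 1000-row cap comes from the slides; the query function is a stand-in, not the real table API):

```python
# Simulate paged table queries: each call returns at most 1000 rows
# plus a continuation token until the result set is exhausted.
MAX_ROWS = 1000

def fake_query(all_rows, token=None):
    """Stand-in for a table query; the `token` is just a row offset here."""
    start = token or 0
    page = all_rows[start:start + MAX_ROWS]
    next_token = start + MAX_ROWS if start + MAX_ROWS < len(all_rows) else None
    return page, next_token

def query_all(all_rows):
    """The client-side loop: follow tokens until none is returned."""
    results, token = [], None
    while True:
        page, token = fake_query(all_rows, token)
        results.extend(page)
        if token is None:
            return results

rows = list(range(2500))
assert query_all(rows) == rows  # three pages: 1000 + 1000 + 500
```

A client that ignores the token silently sees only the first page, which is why the slides stress this point.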

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 ... Msg 4) to the queue; a Worker Role calls GetMessage with a timeout, which makes the message invisible for that period; after processing, the worker calls RemoveMessage to delete it]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/<messageid>?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
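The Get/visibility-timeout/Delete lifecycle shown in the REST exchange above can be modeled in miniature. The toy queue below is not the Azure client library; it only mirrors the semantics (a dequeued message is invisible until its timeout elapses, and deletion requires the pop receipt):

```python
# Minimal in-memory model of the queue message lifecycle:
# Put -> Get (message becomes invisible until `now + timeout`) -> Delete.
import itertools

class ToyQueue:
    def __init__(self):
        self._msgs = []          # each entry: [pop_receipt, visible_at, text]
        self._ids = itertools.count()
        self.now = 0             # simulated clock, in seconds

    def put(self, text):
        self._msgs.append([next(self._ids), 0, text])

    def get(self, timeout):
        for m in self._msgs:
            if m[1] <= self.now:            # is it visible?
                m[1] = self.now + timeout   # hide it while it's processed
                return m[0], m[2]           # pop receipt, body
        return None

    def delete(self, pop_receipt):
        self._msgs = [m for m in self._msgs if m[0] != pop_receipt]

q = ToyQueue()
q.put("work item")
receipt, body = q.get(timeout=30)
assert q.get(timeout=30) is None      # invisible while being processed
q.now = 31                            # consumer crashed; the timeout elapses
assert q.get(timeout=30) is not None  # message reappears: at-least-once delivery
```

The reappearance after a crash is exactly what gives queues their "at least once" guarantee, and why handlers should be idempotent.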

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll resets the interval back to 1
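The back-off rule above fits in one function: double the interval on each empty poll, cap it (hence "truncated"), and reset on success. A sketch; the 60-second cap is an assumption for illustration:

```python
# Truncated exponential back-off for queue polling: empty polls double
# the sleep interval up to a cap; a successful poll resets it to 1.
def next_interval(current, got_message, cap=60):
    if got_message:
        return 1
    return min(current * 2, cap)

interval = 1
history = []
for got in [False, False, False, False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
print(history)  # [2, 4, 8, 16, 32, 60, 1, 2]
```

This keeps an idle worker from hammering the queue (every poll is a billable transaction) while staying responsive once messages arrive.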

Removing Poison Messages

Scenario 1 (producers P1, P2; consumers C1, C2):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
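The dequeue-count rule from scenario 3 (drop a message once DequeueCount exceeds a threshold) in miniature; a toy model of the pattern, not the storage client, and the dead-letter handling is illustrative:

```python
# Poison-message handling: track how many times each message has been
# dequeued and discard it once the count exceeds a threshold.
MAX_DEQUEUES = 2

def process(queue_messages, handler):
    """Drain messages; poison messages are dropped instead of retried forever."""
    poison = []
    for msg in queue_messages:
        msg["DequeueCount"] = msg.get("DequeueCount", 0) + 1
        if msg["DequeueCount"] > MAX_DEQUEUES:
            poison.append(msg["body"])    # dead-letter it (e.g., log it to a blob)
            continue
        try:
            handler(msg["body"])
        except Exception:
            queue_messages.append(msg)    # becomes visible again and is retried
    return poison

msgs = [{"body": "good"}, {"body": "bad"}]
def handler(body):
    if body == "bad":
        raise ValueError("crash while processing")

assert process(msgs, handler) == ["bad"]  # retried twice, then removed
```

Without the threshold, a message whose processing always crashes the consumer would circulate forever, taking a worker down on every delivery.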

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
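The slide's recommendation is the .NET Task Parallel Library; the same data-parallel idea is sketched below in Python with only the standard library, so it stays self-contained. (Threads are used here just to keep the example runnable; CPU-bound pure-Python work would use a process pool instead because of the GIL.)

```python
# Data parallelism: apply one function to many inputs across a pool of
# workers sized to the core count, rather than one task per role.
from concurrent.futures import ThreadPoolExecutor
import os

def work(n):
    """Stand-in for per-item work (e.g., one alignment, one tile)."""
    return n * n

items = list(range(100))
workers = os.cpu_count() or 4   # match the pool size to available cores
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(work, items))

assert results == [n * n for n in items]
```

Sizing the pool to the core count is the point: it fills the VM you are already paying for without oversubscribing it.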

Finding Good Code Neighbors
• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off (performance vs. cost): the risk of failure or poor user experience from not having excess capacity, against the cost of having idling VMs

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
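The gzip advice is easy to quantify with the standard library; repetitive text like generated HTML compresses dramatically:

```python
# Gzip output content before sending it over the wire: the bandwidth
# saved usually dwarfs the CPU cost of compressing.
import gzip

html = b"<div class='row'>cell</div>" * 1000   # repetitive markup
compressed = gzip.compress(html)

print(len(html), "->", len(compressed))
assert len(compressed) < len(html) // 10       # well over 90% smaller here
assert gzip.decompress(compressed) == html     # lossless round trip
```

In practice the web server (or IIS in an Azure web role) does this for you once response compression is enabled; the example only shows why it pays.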

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into many BLAST tasks, and a merging task joins their results.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure
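The split/join pattern itself is tiny: partition the input sequences, run each partition independently, and concatenate the hit lists. A sketch with a trivial `blast` function standing in for NCBI-BLAST:

```python
# Query segmentation: split input sequences into fixed-size partitions
# (the micro-benchmarks below found ~100 sequences per partition worked
# well), process each independently, then merge. `blast` is a stand-in.
def partition(seqs, size):
    return [seqs[i:i + size] for i in range(0, len(seqs), size)]

def blast(seq):
    """Stand-in for running NCBI-BLAST on one sequence."""
    return f"hits({seq})"

def run_job(seqs, partition_size=100):
    parts = partition(seqs, partition_size)
    partial = [[blast(s) for s in p] for p in parts]    # in parallel, in reality
    merged = [hit for part in partial for hit in part]  # the merging task
    return merged

seqs = [f"seq{i}" for i in range(250)]
assert run_job(seqs) == [f"hits(seq{i})" for i in range(250)]
```

In AzureBLAST, each partition becomes one queue message processed by a worker role; only the splitting and merging steps are sequential.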

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; a global dispatch queue feeds multiple Worker instances; a Database Updating Role refreshes the NCBI databases; an Azure Table holds the Job Registry; Azure Blobs hold the BLAST databases, temporary data, etc. The workers execute the split/join task flow: a splitting task, many BLAST tasks, and a merging task]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

[Components: Web Portal, Web Service, job registration, Job Scheduler, Job Portal, Scaling Engine, Job Registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four data centers: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

[Map: eight deployments of 50-62 VMs each across the four data centers]

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
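The kind of log audit the slides describe can be sketched in a few lines. This is an illustrative sketch, assuming the cleaned record format shown above ("date time node Executing the task N" / "Execution of task N is done"); the function names are ours, not from AzureBLAST.

```python
# Pair "Executing" / "is done" records per task; flag tasks that started
# but never reported completion, and compute durations for the rest.
from datetime import datetime

def parse_time(line):
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M")

def audit(lines):
    started, finished = {}, {}
    for line in lines:
        if "Executing the task" in line:
            task = line.split("Executing the task")[1].split()[0]
            started[task] = parse_time(line)
        elif "Execution of task" in line and "is done" in line:
            task = line.split("Execution of task")[1].split()[0]
            finished[task] = parse_time(line)
    incomplete = [t for t in started if t not in finished]
    minutes = {t: (finished[t] - started[t]).total_seconds() / 60
               for t in finished if t in started}
    return incomplete, minutes
```

Running this over the abnormal excerpt above flags task 251774 as lost, which is exactly the signal used to detect system upgrades and storage failures in the next slides.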

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed, and the job was killed
• 35 nodes experienced blob writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.
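The Penman-Monteith formula above translates directly into code. A minimal sketch, with the function name and the sample input values chosen by us purely for illustration (they are not from the MODISAzure project):

```python
def penman_monteith_et(delta, Rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lam_v=2450.0):
    """ET = (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lam_v)

    delta: Pa/K; Rn: W/m^2; rho_a: kg/m^3; cp: J/(kg K); dq: Pa;
    ga, gs: m/s; gamma: psychrometric constant (Pa/K);
    lam_v: latent heat of vaporization (J/g)."""
    return (delta * Rn + rho_a * cp * dq * ga) / \
           ((delta + gamma * (1 + ga / gs)) * lam_v)
```

As a sanity check, ET should increase with net radiation Rn, holding the other inputs fixed.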

ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

[Pipeline diagram: AzureMODIS Service Web Role Portal; Request Queue; Download Queue; Source Imagery Download Sites; Data Collection Stage; Reprojection Queue; Reprojection Stage; Reduction 1 Queue; Derivation Reduction Stage; Reduction 2 Queue; Analysis Reduction Stage; Source Metadata; Scientific Results Download; Scientists; Science results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) ↔ <Input>Data Storage]

Example Pipeline Stage: Reprojection Service

[Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role); tables ScanTimeList and SwathGranuleMeta point to Reprojection Data Storage and Swath Source Data Storage]

• Each Job Queue entity specifies a single reprojection job request
• Each Task Queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Per-stage figures (from the pipeline diagram):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest scale computer centers ever constructed, and have the potential to be important to both large and small scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data parallel applications, and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing: Channel 9 Windows Azure; Bing: Windows Azure Platform Training Kit - November Update; http://research.microsoft.com/azure; xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor and maintain applications in data centers

• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault Tolerant Applications
Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles
• Customized role
• You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram, written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI
Double click on the Role Name in the Azure Project

Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API
• REST based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale
• Blobs: massive files, e.g. videos, logs
• Drives: use standard file system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers
• Similar to a top level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (the default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks 1, 6, 8, 3, 5, 4, 7, 2 of Big.mpg uploaded out of order, then committed in sequence to form Big.mpg]
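The block-commit semantics above can be sketched with a tiny in-memory stand-in. This is illustrative only (the class and method names echo the PutBlock/PutBlockList REST operations but are not the real Azure Storage SDK):

```python
# Block-blob semantics: upload blocks in any order, then commit an
# ordered block list; the blob becomes the blocks in that order.
class BlockBlob:
    def __init__(self):
        self._uncommitted = {}   # block_id -> bytes
        self._committed = b""

    def put_block(self, block_id, data):
        self._uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: only blocks named in the list, in the given order.
        self._committed = b"".join(self._uncommitted[b] for b in block_ids)
        self._uncommitted.clear()   # uncommitted leftovers get GC'd

    def content(self):
        return self._committed
```

Uploading block "2" before block "1" and then committing `["1", "2"]` still yields the blocks in committed order, which is the point of the out-of-order upload diagram.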

Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases
• Creates a 1 minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted, as a Page Blob

Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table Name: Movies (entities: Star Wars, Star Trek, Fan Boys)
• Table Name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)

Account → Table → Entity

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy to use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server side joins between tables
• Create custom indexes on the tables
• No server side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
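The "required properties plus varying schema" rule can be illustrated with plain dictionaries. A minimal sketch (the helper names are ours, not the storage SDK's):

```python
# Azure Table entities as property bags: every entity must carry
# PartitionKey, RowKey and Timestamp; all other properties may differ
# from entity to entity, even within one table.
from datetime import datetime, timezone

REQUIRED = {"PartitionKey", "RowKey", "Timestamp"}

def make_entity(partition_key, row_key, **properties):
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entity.update(properties)          # schema varies per entity
    return entity

def is_valid(entity):
    return REQUIRED.issubset(entity)
```

A movie entity and an order entity can legally live in the same table with completely different extra properties, as long as the three required ones are present.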

Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• Controls entity locality

The system load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single partition limits have been reached

Partition Keys In Each Abstraction

Entities: TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

Blobs: Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Queues: Queue name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2 and Server 3]

Scalability Targets

Storage account
• Capacity: up to 100 TBs
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After load balancing, the partition ranges split across two servers:

Server A: Table = Movies [Min - Comedy)

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B: Table = Movies [Comedy - Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
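Because any of those three conditions can end a response early, query code must loop until no continuation token is returned. A minimal sketch, with `query_page` standing in for a table query (an in-memory fake here, not the real service):

```python
# Continuation-token loop: keep fetching pages until the service stops
# returning a token, then the result set is complete.
def fetch_all(query_page):
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:          # no continuation token: we are done
            return rows

# In-memory stand-in for a paged table query (page size 3).
def make_query(data, page_size=3):
    def query_page(token):
        start = token or 0
        page = data[start:start + page_size]
        nxt = start + page_size
        return page, (nxt if nxt < len(data) else None)
    return query_page
```

The common bug this guards against is treating a single page as the full result, which silently drops rows past the 1000-row or 5-second cutoff.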

Tables Recap
• Select a PartitionKey and RowKey that help scale
  • Distribute by using a hash etc. as a prefix
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy: load balances partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle
[Diagram: Web Role → PutMessage → Queue (Msg 1, Msg 2, Msg 3, Msg 4) → Worker Roles via GetMessage (Timeout) and RemoveMessage]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue, intervals growing 1, 2, 4, … and resetting on success]
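The polling rule above fits in a few lines. A sketch (the function and parameter names are illustrative; a real worker would `time.sleep(wait)` between polls):

```python
# Truncated exponential backoff polling: an empty poll doubles the wait,
# capped at max_wait; a successful poll resets the wait to min_wait.
def poll_loop(get_message, handle, min_wait=1, max_wait=60, rounds=100):
    waits = []                 # record the wait used before each poll
    wait = min_wait
    for _ in range(rounds):
        waits.append(wait)
        msg = get_message()
        if msg is None:        # empty poll: back off, truncated at the cap
            wait = min(wait * 2, max_wait)
        else:                  # got work: reset the interval
            handle(msg)
            wait = min_wait
        # time.sleep(wait) would go here in a real worker
    return waits
```

With four empty polls, one message, then another empty poll, the intervals go 1, 2, 4, 8, 16 and then snap back to 1, which is exactly the behavior the diagram sketches.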

Removing Poison Messages

[Diagram 1: producers P1, P2 and consumers C1, C2 around a queue]
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2

[Diagram 2: a crashed consumer's message becomes visible again]
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1

[Diagram 3: dequeue count identifies the poison message]
1. Dequeue(Q, 30 sec) → msg 1
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
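The dequeue-count rule in step 12 can be sketched with an in-memory stand-in for the queue's visibility-timeout behavior (this is not the real Azure Storage API; `max_dequeue` and the class names are illustrative):

```python
# Poison-message removal: a message whose handler keeps failing reappears
# (visibility timeout), and once its dequeue count exceeds a threshold it
# is removed from the queue instead of being retried forever.
import collections

class Queue:
    def __init__(self):
        self._msgs = collections.deque()

    def put(self, body):
        self._msgs.append({"body": body, "dequeue_count": 0})

    def get(self):
        if not self._msgs:
            return None
        msg = self._msgs.popleft()
        msg["dequeue_count"] += 1
        self._msgs.append(msg)   # models reappearing unless deleted
        return msg

    def delete(self, msg):
        self._msgs.remove(msg)

def process(queue, handler, max_dequeue=3):
    poison = []
    while (msg := queue.get()) is not None:
        if msg["dequeue_count"] > max_dequeue:
            poison.append(msg["body"])   # set aside, e.g. in a blob/table
            queue.delete(msg)
            continue
        try:
            handler(msg["body"])
            queue.delete(msg)
        except Exception:
            pass                          # leave it; it will reappear
    return poison
```

A message whose handler always raises is retried `max_dequeue` times and then removed, so one bad message cannot wedge the whole worker pool.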

Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• Only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance = one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
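The deck's recommendation is .NET's Task Parallel Library; a language-neutral sketch of the same data-parallel idea, using Python's standard-library pool (the `reproject_tile` stand-in is ours, purely illustrative of per-tile work like the reprojection stage):

```python
# Data parallelism: apply the same operation across an input set, letting
# a worker pool use the cores a single-threaded role would leave idle.
from concurrent.futures import ThreadPoolExecutor

def reproject_tile(tile_id):
    # stand-in for per-tile CPU work
    return tile_id * tile_id

def process_tiles(tile_ids, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reproject_tile, tile_ids))  # preserves order
```

Task parallelism is the same structure with heterogeneous functions submitted individually instead of one function mapped over data.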

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not

over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs

(Trade-off: performance vs. cost)

Storage Costs

• Understand your application's storage profile and how storage billing works

• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction

• Service choice can make a big cost difference based on your app profile

• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content

• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Pipeline: Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content)
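A minimal Python sketch of the payoff from gzipping output content; the repetitive HTML-like sample is made up, but it shows why markup compresses so well:

```python
import gzip

# A page of HTML-like output: repetitive text compresses very well.
page = b"<html><body>" + b"<div class='row'>item</div>" * 500 + b"</body></html>"

compressed = gzip.compress(page)

# The compressed payload is a small fraction of the original size, so both
# bandwidth charges and transfer time drop for every response served.
print(len(page), len(compressed))

# Browsers decompress transparently; the server just needs to set the
# Content-Encoding: gzip header on the response.
assert gzip.decompress(compressed) == page
```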

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile inside and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel

• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large data volume:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage traffic could reach 1 TB

• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model:
  • Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
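The query-segmentation split/join pattern can be sketched in a few lines of Python; the record format and "hit" strings below are invented placeholders, not AzureBLAST's actual formats:

```python
# Hypothetical FASTA-like input: a header line plus sequence per record.
records = [f">seq{i}\nACGT" for i in range(10)]

def split_queries(records, partition_size):
    """Split the input sequences into fixed-size partitions (the split step)."""
    return [records[i:i + partition_size]
            for i in range(0, len(records), partition_size)]

def merge_results(partial_results):
    """Concatenate per-partition results in order (the join step)."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

partitions = split_queries(records, 3)   # 4 partitions: 3 + 3 + 3 + 1 records
# Each partition would be handed to a worker running NCBI-BLAST;
# here we fake a per-partition result list.
results = [[f"hit:{r.splitlines()[0]}" for r in p] for p in partitions]
merged = merge_results(results)
print(len(partitions), len(merged))   # 4 10
```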

AzureBLAST Task-Flow

A simple split/join pattern:

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead

Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting in case of instance failure

(Task flow: a splitting task fans out into parallel BLAST tasks, followed by a merging task)

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

(Architecture diagram: a Web Role hosts the web portal, web service, job registration, and job scheduler; a Job Management Role runs the scaling engine and dispatches work through a global dispatch queue to worker instances; a database-updating role refreshes the NCBI databases; an Azure Table holds the job registry, and Azure Blobs hold the BLAST databases, temporary data, etc. Each job executes as a splitting task, parallel BLAST tasks, and a merging task.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track job status and logs

Authentication/authorization is based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state


Demonstration

R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• A total of 9,865,668 sequences to be queried
• Theoretically ~100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Instance counts across the 8 deployments: 50, 62, 62, 62, 62, 62, 50, 62)

End Result

• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
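A small Python sketch of the kind of log analysis described above: pairing "Executing" and "done" records per task to spot tasks that started but never completed (log lines taken from the slide; the parsing is illustrative):

```python
from datetime import datetime

log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = {}, {}
for line in log.splitlines():
    date, time, node, rest = line.split(" ", 3)
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M")
    if rest.startswith("Executing the task"):
        task = rest.split()[3].rstrip(".")
        started[task] = stamp          # record when the task began
    elif rest.startswith("Execution of task"):
        task = rest.split()[3]
        finished[task] = stamp         # record when it completed

# A task that started but never finished is a candidate failure.
lost = sorted(set(started) - set(finished))
print(lost)   # ['251774']
```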

Surviving System Upgrades

North Europe datacenter: a total of 34,256 tasks processed

• All 62 compute nodes lost tasks and then came back in groups: this is an update domain in action
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed

• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." - Irish proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = Latent heat of vaporization (J g-1)
Rn = Net radiation (W m-2)
cp = Specific heat capacity of air (J kg-1 K-1)
ρa = Dry air density (kg m-3)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s-1)
gs = Conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = Psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

    ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US year = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role portal into a Request Queue; a Download Queue feeds the Data Collection Stage, which pulls from source imagery download sites; the Reprojection, Reduction 1, and Reduction 2 Queues feed the Reprojection, Derivation Reduction, and Analysis Reduction Stages; source metadata is tracked throughout, and scientists download the science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks - recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role (the Generic Worker):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers dequeue tasks and read from <Input> Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus and ReprojectionTaskStatus; Generic Workers dequeue from the Task Queue and read from Reprojection Data Storage and Swath Source Data Storage.)

• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction stages multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures from the pipeline diagram, in pipeline order:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers - $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers - $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers - $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers - $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site

http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope

http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

• Azure in Action, Manning Press
• Programming Windows Azure, O'Reilly Press
• Bing: Channel 9 Windows Azure
• Bing: Windows Azure Platform Training Kit - November Update
• http://research.microsoft.com/azure
• xcgngage@microsoft.com


Key Components: Fabric Controller

• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor

• Allows for automated management of virtual machines

• Its job is to provision, deploy, monitor, and maintain applications in data centers

• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings

Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers

• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state

• It's the perfect IT employee:
  • Never sleeps
  • Never asks for a raise
  • Always does what you tell it to do in the configuration definition and settings
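The current-state/goal-state idea can be illustrated with a toy reconcile loop in Python (this is a sketch of the concept, not the actual fabric controller logic; role names are invented):

```python
# Illustrative sketch: compare current state to goal state and compute
# the actions needed to close the gap, as a goal-seeking controller does.
def reconcile(current, goal):
    """Return actions that move `current` toward `goal`.

    Both are dicts mapping role name -> running instance count.
    """
    actions = []
    for role, wanted in goal.items():
        have = current.get(role, 0)
        if have < wanted:
            actions.append(("start", role, wanted - have))
        elif have > wanted:
            actions.append(("stop", role, have - wanted))
    return actions

current = {"WebRole": 2, "WorkerRole": 5}
goal = {"WebRole": 3, "WorkerRole": 4}
print(reconcile(current, goal))
# [('start', 'WebRole', 1), ('stop', 'WorkerRole', 1)]
```

Run continuously, such a loop also repairs drift: if a node fails and the current state drops below the goal, the next pass starts replacement instances.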

Creating a New Project

Windows Azure Compute

Key Components - Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components - Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components - Compute: VM Roles

• Customized role:
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • The base OS
    • The differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action - perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

Service Definition

Service Configuration
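The two slides above showed screenshots of these files; a minimal sketch of their typical shape, with all service, role, and setting names illustrative:

```xml
<!-- ServiceDefinition.csdef: the shape of the service -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <Setting name="StorageConnectionString" />
    </ConfigurationSettings>
  </WebRole>
  <WorkerRole name="BackgroundWorker" />
</ServiceDefinition>
```

```xml
<!-- ServiceConfiguration.cscfg: the settings for one deployment -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="3" />
    <ConfigurationSettings>
      <Setting name="StorageConnectionString" value="..." />
    </ConfigurationSettings>
  </Role>
  <Role name="BackgroundWorker">
    <Instances count="2" />
  </Role>
</ServiceConfiguration>
```

Note the split: the .csdef fixes the shape (roles, endpoints, which settings exist), while the .cscfg supplies per-deployment values such as instance counts and keys.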

GUI

Double-click on the role name in the Azure project

Deploying to the cloud

• We can deploy from the portal or from a script
• Visual Studio builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API - easy to roll your own

The Secret Sauce - The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage At Massive Scale

Blob - massive files, e.g., videos, logs

Drive - use standard file system APIs

Tables - non-relational, but with few scale limits; use SQL Azure for relational data

Queues - facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
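The addressing scheme can be sketched as a one-line formatter (illustrative only; real client libraries build and sign these URIs for you):

```python
# Compose a blob URI from account, container, and blob name, following
# the addressing scheme above. Names below match the slide's example.
def blob_uri(account, container, blob_name):
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

uri = blob_uri("movieconversion", "originals", "barga.mpg")
print(uri)   # http://movieconversion.blob.core.windows.net/originals/barga.mpg
```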

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private
  • Default; will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
    • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
    • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg uploaded as blocks 1-8, possibly out of order, then committed as Big.mpg)


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a single page write may be up to 4 MB in size
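A small helper illustrating the 512-byte alignment rule (a sketch, not part of any Azure SDK):

```python
# Page blob writes must start and end on 512-byte page boundaries.
# This helper rounds an arbitrary byte range out to valid page bounds.
PAGE = 512

def page_aligned_range(start, length):
    aligned_start = (start // PAGE) * PAGE            # round start down
    end = start + length
    aligned_end = ((end + PAGE - 1) // PAGE) * PAGE   # round end up
    return aligned_start, aligned_end

print(page_aligned_range(0, 512))     # (0, 512)   already aligned
print(page_aligned_range(100, 700))   # (0, 1024)  rounded out to whole pages
```

A client using such a helper reads the aligned range, patches the bytes it cares about, and writes the whole aligned range back.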

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive - creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache - allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive - takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives - returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive - unmounts the drive and frees up the drive letter
• Snapshot Drive - allows the client application to create a backup of the drive (Page Blob)
• Copy Drive - provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in the .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table: Movies
    Entities: Star Wars, Star Trek, Fan Boys
  Table: Customers
    Entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage:
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable:
  • Data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST - with any platform or language

Is not relational

Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
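A Python sketch of the entity model using plain dicts (illustrative only; in the real service the Timestamp is maintained server-side, and the .NET client classes or REST payloads carry these properties for you):

```python
from datetime import datetime, timezone

# Every entity carries PartitionKey, RowKey, and Timestamp;
# all other properties can vary from entity to entity.
REQUIRED = {"PartitionKey", "RowKey", "Timestamp"}

def make_entity(partition_key, row_key, **properties):
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entity.update(properties)
    return entity

# Schema varies within the same table, but the required keys are always there.
movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
customer = make_entity("1", "Customer-John Smith", Name="John Smith")

assert REQUIRED <= movie.keys() and REQUIRED <= customer.keys()
print(movie["PartitionKey"], customer["Name"])
```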

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance:

• Every data object has a partition key
  • Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
  • The system load-balances partitions based on traffic patterns
  • Controls entity locality
• The partition key is the unit of scale
• System load balancing:
  • Load balancing can take a few minutes to kick in
  • It can take a couple of seconds for a partition to become available on a different server
• Server busy:
  • Use exponential backoff on "Server Busy"
  • The system load-balances to meet your traffic needs
  • Or the limits of a single partition have been reached

Partition Keys In Each Abstraction

• Entities - TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order - 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order - 3             |              |                     | $10.00

• Blobs - Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages - Queue name: all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3)

Scalability Targets

Storage account:
• Capacity - up to 100 TB
• Transactions - up to a few thousand requests per second
• Bandwidth - up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
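A sketch of truncated exponential backoff with jitter, the retry strategy recommended above for '503 Server Busy' (the constants and function name are illustrative):

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, seed=42):
    """Compute the sleep (in seconds) before each retry attempt."""
    rng = random.Random(seed)   # seeded only to make this sketch reproducible
    delays = []
    for attempt in range(max_retries):
        # The delay doubles each attempt and is truncated at `cap`; jitter
        # keeps many instances from retrying in lockstep after a 503.
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay * rng.uniform(0.5, 1.0))
    return delays

delays = backoff_delays()
print([round(d, 2) for d in delays])
```

In a real client, each delay would be slept between retries of the failed storage call, giving the system time to load-balance the hot partition.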

Partitions and Partition Ranges

Initially one server holds the full range:

  Server A: Table = Movies [Min - Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  Comedy                  | Office Space             | …         | 1999
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  War                     | Defiance                 | …         | 2008

After load balancing, the partition range is split across servers:

  Server A: Table = Movies [Min - Comedy)
  Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens - Seriously

A query returns a continuation token at:
• A maximum of 1,000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds of query execution
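The continuation-token loop can be sketched against a fake paged query (in the real service the token is an opaque value in response headers, not an integer offset; this only shows the client control flow):

```python
# Fake dataset and paged query standing in for a table query that
# returns at most `page_size` rows plus a continuation token.
DATA = [f"row{i}" for i in range(2500)]

def query_page(token=0, page_size=1000):
    rows = DATA[token:token + page_size]
    next_token = token + page_size if token + page_size < len(DATA) else None
    return rows, next_token

# The client must keep issuing the query until no token comes back;
# stopping after the first response silently drops rows.
all_rows, token = [], 0
while token is not None:
    rows, token = query_page(token)
    all_rows.extend(rows)

print(len(all_rows))   # 2500
```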

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns - distribute by using a hash etc. as prefix

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement back-off strategy for retries
• "Server busy" means the load on a single partition has exceeded the limits; the system load balances partitions to meet traffic needs
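Hash-prefixing a naturally "append only" key (such as a timestamp) is one way to distribute load; a minimal sketch, with the bucket count chosen arbitrarily:

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Spread an 'append only' key space (e.g. timestamps) across partitions
    by prepending a stable, hash-derived bucket prefix to the PartitionKey."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "%02d_%s" % (bucket, natural_key)
```

Range queries then fan out across the bucket prefixes, trading one hot partition for a handful of parallel queries.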

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
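The work ticket pattern keeps large payloads out of the 8 KB message: store the data as a blob and enqueue only a small reference. A minimal in-memory sketch (the dict and list stand in for blob storage and a queue; not real client APIs):

```python
import json
import uuid

blob_store = {}   # stands in for blob storage
queue = []        # stands in for an Azure queue

def enqueue_work(payload_bytes, limit=8 * 1024):
    """Work ticket pattern: large payloads go to the blob store and only a
    small reference ('ticket') travels through the queue."""
    if len(payload_bytes) <= limit:
        queue.append(json.dumps({"inline": payload_bytes.decode("utf-8")}))
    else:
        blob_name = str(uuid.uuid4())
        blob_store[blob_name] = payload_bytes
        queue.append(json.dumps({"blob_ref": blob_name}))

def dequeue_work():
    """Resolve the ticket back to the payload."""
    ticket = json.loads(queue.pop(0))
    if "blob_ref" in ticket:
        return blob_store[ticket["blob_ref"]]
    return ticket["inline"].encode("utf-8")
```

A real worker would also delete the blob after processing; otherwise orphaned blobs accumulate (see the garbage-collection note in the Queues Recap).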

Queue Terminology

Message Lifecycle

Diagram: a Web Role calls PutMessage to add messages (Msg 1-4) to the queue; Worker Roles call GetMessage (with a timeout) to retrieve messages, and RemoveMessage to delete them once processed.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the interval by 2x; a successful poll resets the interval back to 1.
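The interval update is a one-liner; a sketch with illustrative initial and maximum values:

```python
def next_poll_interval(current, got_message, initial=1.0, maximum=60.0):
    """Truncated exponential back off for queue polling: every empty poll
    doubles the interval (truncated at `maximum`); a successful poll
    resets it to the initial value."""
    if got_message:
        return initial
    return min(maximum, current * 2)
```

This keeps idle workers from burning transaction charges on an empty queue while still reacting quickly once messages appear.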

Removing Poison Messages

Producers (P1, P2) enqueue messages; consumers (C1, C2) dequeue them. A walk-through of how a "poison" message is eventually removed:

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2, so it is treated as a poison message
13. DeleteMessage(Q, msg 1)

Queues Recap
• Make message processing idempotent - then there is no need to deal with failures
• Do not rely on order - invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages - enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
• Batch messages
• Garbage collect orphaned blobs
• Use the message count to scale - dynamically increase/reduce workers
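The poison-message threshold from the walk-through above can be sketched as follows; the message shape and threshold are illustrative:

```python
POISON_THRESHOLD = 3  # illustrative; pick based on expected transient failures

def process_or_discard(message, handle, dead_letter):
    """Divert a message once its dequeue count passes the threshold, so a
    crashing consumer cannot loop on the same 'poison' message forever.

    `message` is a dict with 'dequeue_count' and 'body' (hypothetical shape);
    `dead_letter` is a list standing in for a store for offline inspection.
    """
    if message["dequeue_count"] > POISON_THRESHOLD:
        dead_letter.append(message)   # keep it around for diagnosis
        return False
    handle(message["body"])
    return True
```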

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs - files and large objects
• Drives - NTFS APIs for migrating applications
• Tables - massively scalable structured storage
• Queues - reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice - larger, fewer VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• Only real right answer - experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake - splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure/poor user experience due to not having excess capacity against the costs of idling VMs

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
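The payoff of gzipping text output is easy to demonstrate with the standard library; the sample HTML below is illustrative:

```python
import gzip

def gzip_bytes(data: bytes) -> bytes:
    """Gzip a response body; repetitive text (HTML, JSON, JS) shrinks a lot."""
    return gzip.compress(data)

# Illustrative payload: repetitive markup, the common case for web output
html = b"<div class='row'>hello</div>" * 500
compressed = gzip_bytes(html)
```

In a web role you would set `Content-Encoding: gzip` and let the browser decompress on the fly, as the slide notes.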

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input - segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) - needs special result reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
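The query-segmentation idea is a plain split/join; a minimal sketch (the partition size of 100 echoes the micro-benchmark result reported later in the deck):

```python
def partition_sequences(sequences, per_partition=100):
    """Query segmentation: split the input sequences into fixed-size
    partitions that can be BLASTed in parallel and merged afterwards."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def merge_results(partial_results):
    """Join step: concatenate per-partition hit lists back together."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```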

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
• Best practice: test runs to profile, then set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

Task flow: splitting task → BLAST tasks (run in parallel) → merging task

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

(BLAST databases, temporary data, etc.)

Job RegistryNCBI databases


AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance - avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place


Understanding Azure by Analyzing Logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
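A small parser over log lines in the shape shown above makes the failed tasks visible: any "Executing" record without a matching "done" record. A sketch (the log format follows the sample records, not a documented schema):

```python
import re

# Matches completion records like:
#   "... Execution of task 251523 is done, it took 10.9 mins"
LOG_RE = re.compile(
    r"Execution of task (?P<task>\d+) is done, it took (?P<mins>[\d.]+)\s*mins")

def completed_tasks(log_lines):
    """Return {task_id: minutes} for tasks that logged a completion record;
    'Executing' records without a matching entry here point at failures."""
    done = {}
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            done[m.group("task")] = float(m.group("mins"))
    return done
```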

Surviving System Upgrades

North Europe datacenter: in total 34,256 tasks processed.

All 62 compute nodes lost tasks and then came back in a group - this is an update domain:
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed and the job was killed.

35 nodes experienced blob writing failures at the same time.

A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." - Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
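As a numeric sketch, the Penman-Monteith formula is a one-line reduction once the inputs are in hand; the sample values below are purely illustrative, not from the MODISAzure data:

```python
def penman_monteith_et(delta, Rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lam_v=2450.0):
    """Penman-Monteith ET, same symbols as the formula above:
    ET = (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lam_v)
    gamma defaults to ~66 Pa/K; lam_v (J/g) is an illustrative default."""
    return (delta * Rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1.0 + ga / gs)) * lam_v)
```

The hard part in practice is not this arithmetic but producing gridded, quality-controlled values of ga and gs across a catchment, which is what the imagery/sensor pipeline below is for.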

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; requests flow through the Request, Download, Reprojection, Reduction 1, and Reduction 2 queues across the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; scientific results are then downloaded from storage.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks - recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role.

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

• Dequeues tasks created by the Service Monitor

• Retries failed tasks 3 times
• Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request.

Each entity specifies a single reprojection task (i.e. a single tile).

Query this table to get geo-metadata (e.g. boundaries) for each swath tile.

Query this table to get the list of satellite scan times that cover a target tile.

Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers - $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3,500 hours, 20-100 workers - $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1,800 hours, 20-100 workers - $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1,800 hours, 20-100 workers - $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Key Components: Fabric Controller

• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers

• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state

• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in configuration definition and settings

Creating a New Project

Windows Azure Compute

Key Components - Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components - Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components - Compute: VM Roles

• Customized role - you own the box

• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action - perhaps even relocate your app

• At all times the 'diagram' stays whole

Automated Service Management

Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file

• You must create an Azure account, then a service, and then you deploy your code

• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API - easy to roll your own

The Secret Sauce - the Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale

• Blob - massive files, e.g. videos, logs
• Drive - use standard file system APIs
• Tables - non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues - facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface:
  • PutBlob - inserts a new blob, overwrites the existing blob
  • GetBlob - get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private - the default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

Diagram: blocks of Big.mpg are uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2), then committed in sequence as Big.mpg.


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob

• Operations: Acquire, Renew, Release, Break

• Must have the lease ID to perform operations

• Can check the LeaseStatus property

• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>

• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted, as a Page Blob

Windows Azure Drive API

• Create Drive - creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache - allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive - takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives - returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive - unmounts the drive and frees up the drive letter
• Snapshot Drive - allows the client application to create a backup of the drive (Page Blob)
• Copy Drive - provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table "Movies": Star Wars, Star Trek, Fan Boys
• Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates

The hierarchy is Account → Table → Entity. Tables store entities; entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
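Since every entity carries the three system properties alongside its own columns, and schema can vary per entity, a plain dictionary is a fair mental model. A sketch (the helper is invented for illustration):

```python
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **properties):
    """Build a table entity: PartitionKey + RowKey form the unique key,
    and Timestamp is maintained by the system. The remaining properties
    can differ from entity to entity within the same table."""
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc),
    }
    entity.update(properties)
    return entity

movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
customer = make_entity("1", "Customer-John Smith", Name="John Smith")
# Varying schema in the same table is allowed
assert set(movie) != set(customer)
```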

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Partitioning is different for each data type (blobs, entities, queues)

Every data object has a partition key:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• "Server Busy" may mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Messages – Queue name
• All messages for a single queue belong to the same partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
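The partition key for each abstraction boils down to a simple mapping. An illustrative sketch (function and argument names are invented):

```python
def partition_key(kind, **names):
    """Return the partition key for each storage abstraction.

    Entities:  TableName + PartitionKey (entities sharing it live together)
    Blobs:     Container name + Blob name (a blob and its snapshots)
    Messages:  Queue name (all messages in one queue share a partition)
    """
    if kind == "entity":
        return (names["table"], names["partition_key"])
    if kind == "blob":
        return (names["container"], names["blob"])
    if kind == "message":
        return (names["queue"],)
    raise ValueError(f"unknown kind: {kind}")

# Blobs with the same name in different containers are separate partitions
assert partition_key("blob", container="image", blob="annarbor/bighouse.jpg") != \
       partition_key("blob", container="video", blob="annarbor/bighouse.jpg")
```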

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Initially a single server holds the entire range:
• Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

As traffic grows, the system splits the range across servers:
• Server A: Table = Movies [Min – Comedy)
• Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
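Because a query can stop early at any of those three points, clients must loop on the token. A simulated paging loop (the token here is a plain offset; the real token is opaque, and this in-memory "service" is invented for illustration):

```python
MAX_ROWS = 1000  # the service returns at most 1000 rows per response

def query_page(rows, continuation=0):
    """Simulated table query: return one page plus a continuation token."""
    page = rows[continuation:continuation + MAX_ROWS]
    next_token = continuation + MAX_ROWS
    return page, (next_token if next_token < len(rows) else None)

def query_all(rows):
    """Loop until the service stops returning a continuation token."""
    results, token = [], 0
    while token is not None:
        page, token = query_page(rows, token)
        results.extend(page)
    return results

data = list(range(2500))
assert query_all(data) == data  # 3 round trips: 1000 + 1000 + 500 rows
```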

Tables Recap

• Efficient for frequently used queries; supports batch transactions; distributes load
• Select PartitionKey and RowKey that help scale
  • Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy": the system load balances partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
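The work ticket pattern keeps queue messages small: payloads over the 8 KB message limit go to blob storage, and the message carries only a reference. A sketch with invented in-memory stand-ins for the blob and queue services:

```python
import uuid

MAX_MESSAGE_BYTES = 8 * 1024  # queue messages are limited to 8 KB

blob_store, queue = {}, []

def enqueue_work(payload: bytes):
    """Store large payloads in a blob; enqueue only a work ticket."""
    if len(payload) <= MAX_MESSAGE_BYTES:
        queue.append({"inline": payload})
    else:
        blob_name = f"work/{uuid.uuid4()}"
        blob_store[blob_name] = payload
        queue.append({"blob_ref": blob_name})  # the work ticket

def dequeue_work() -> bytes:
    msg = queue.pop(0)
    return msg["inline"] if "inline" in msg else blob_store[msg["blob_ref"]]

enqueue_work(b"small job")
enqueue_work(b"x" * 100_000)  # too big for a message; spills to a blob
assert dequeue_work() == b"small job"
assert dequeue_work() == b"x" * 100_000
```

Remember to garbage collect orphaned blobs once their tickets have been processed.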

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1–4 to a queue; Worker Roles call GetMessage with a visibility timeout to dequeue messages, and RemoveMessage to delete them after processing.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;QueueMessagesList&gt;
  &lt;QueueMessage&gt;
    &lt;MessageId&gt;5974b586-0df3-4e2d-ad0c-18e3892bfca2&lt;/MessageId&gt;
    &lt;InsertionTime&gt;Mon, 22 Sep 2008 23:29:20 GMT&lt;/InsertionTime&gt;
    &lt;ExpirationTime&gt;Mon, 29 Sep 2008 23:29:20 GMT&lt;/ExpirationTime&gt;
    &lt;PopReceipt&gt;YzQ4Yzg1MDIGM0MDFiZDAwYzEw&lt;/PopReceipt&gt;
    &lt;TimeNextVisible&gt;Tue, 23 Sep 2008 05:29:20 GMT&lt;/TimeNextVisible&gt;
    &lt;MessageText&gt;PHRlc3Q+dGdGVzdD4=&lt;/MessageText&gt;
  &lt;/QueueMessage&gt;
&lt;/QueueMessagesList&gt;

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, truncated at a maximum (e.g., 60 seconds)
• A successful poll sets the interval back to 1

Removing Poison Messages

Scenario (producers P1, P2; consumers C1, C2 working against queue Q):

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount &gt; 2 — treat it as a poison message
13. C1: DeleteMessage(Q, msg 1)
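The scenario above is mechanical enough to sketch: each dequeue increments the message's dequeue count, and once the count crosses a threshold the consumer deletes (or side-lines) the message instead of retrying it. An in-memory sketch with an invented handler:

```python
POISON_THRESHOLD = 3  # give each message at most 3 delivery attempts

def process_queue(messages, handler):
    """Drain a queue, removing poison messages by dequeue count.

    messages: list of dicts with 'body' and 'dequeue_count' keys.
    handler: callable that may raise, simulating a consumer crash.
    Returns (processed, poisoned) lists of message bodies.
    """
    processed, poisoned = [], []
    while messages:
        msg = messages.pop(0)
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > POISON_THRESHOLD:
            poisoned.append(msg["body"])    # delete (or archive) the poison message
            continue
        try:
            handler(msg["body"])
            processed.append(msg["body"])   # DeleteMessage after successful processing
        except Exception:
            messages.append(msg)            # message becomes visible again later
    return processed, poisoned

def handler(body):
    if body == "poison":
        raise RuntimeError("consumer crashed while processing")

q = [{"body": "good", "dequeue_count": 0},
     {"body": "poison", "dequeue_count": 0}]
done, bad = process_queue(q, handler)
assert done == ["good"] and bad == ["poison"]
```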

Queues Recap

• Make message processing idempotent
  • No need to deal with failures
• Do not rely on order
  • Invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages
  • Enforce a threshold on a message's dequeue count
• Messages &gt; 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use message count to scale
  • Dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content vs. compressed content — gzip and minify JavaScript, minify CSS, minify images.)
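The payoff from gzipping text output is easy to demonstrate with the standard library (the exact ratio will vary with content):

```python
import gzip

# Repetitive text (HTML, JSON, JavaScript) compresses extremely well
page = b"<div class='row'>hello azure</div>\n" * 500

compressed = gzip.compress(page)

assert len(compressed) < len(page) // 10    # over 90% smaller here
assert gzip.decompress(compressed) == page  # lossless; browsers inflate on the fly
```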

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow

A simple Split/Join pattern: a splitting task fans out into many parallel BLAST tasks, followed by a merging task.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup, data transfer)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

AzureBLAST

(Architecture diagram: a Web Role hosts the Web Portal, Web Service, and job registration; a Job Management Role contains the Job Scheduler and Scaling Engine, dispatching tasks — a splitting task, parallel BLAST tasks, and a merging task — to Worker roles through a global dispatch queue. An Azure Table holds the Job Registry; Azure Blobs hold the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state

(Diagram: the Job Portal fronts the Web Portal, Web Service, and job registration, which feed the Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • Based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups — this is an update domain
• ~30 mins; ~6 nodes in each group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks completed, and the job was killed.
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration — evaporation through plant membranes — by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Data Collection Stage (Download Queue, source imagery download sites, source metadata), followed by the Reprojection Stage (Reprojection Queue), the Derivation Reduction Stage (Reduction 1 Queue), and the Analysis Reduction Stage (Reduction 2 Queue); scientific results are then available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a &lt;PipelineStage&gt; Request enters the MODISAzure Service (Web Role), which persists &lt;PipelineStage&gt;JobStatus and enqueues to the &lt;PipelineStage&gt; Job Queue; the Service Monitor (Worker Role) parses and persists &lt;PipelineStage&gt;TaskStatus and dispatches to the &lt;PipelineStage&gt; Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists &lt;PipelineStage&gt;TaskStatus and dispatches to the &lt;PipelineStage&gt; Task Queue; GenericWorker (Worker Role) instances pull tasks and read &lt;Input&gt; Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request is persisted as ReprojectionJobStatus by the Service Monitor (Worker Role), parsed and persisted as ReprojectionTaskStatus, and dispatched through the Job Queue and Task Queue to GenericWorker (Worker Role) instances, which read Reprojection Data Storage and Swath Source Data Storage.)

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage by stage:
• Data Collection Stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, &lt;10 workers — $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20–100 workers — $420 CPU, $60 download
• Derivation Reduction Stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers — $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: &lt;10 GB, ~1K files, 1800 hours, 20–100 workers — $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research by providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press • Programming Windows Azure, O'Reilly Press • Bing: Channel 9 Windows Azure • Bing: Windows Azure Platform Training Kit – November Update • http://research.microsoft.com/azure • xcgngage@microsoft.com


Creating a New Project

Windows Azure Compute

Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints

Suggested Application Model: Using Queues for Reliable Messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of your application – easier to scale independently
• Resource allocation – different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role – you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using the base OS plus the differencing VHD

Application Hosting

'Grokking' the Service Model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action — perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
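To make the split between the two files concrete, here is an illustrative sketch (role names, setting names, and the exact schema details are ours, not taken from the tutorial; consult the SDK documentation for the authoritative format). The .csdef declares the shape of the service; the .cscfg supplies per-deployment values such as instance counts:

```xml
<!-- ServiceDefinition.csdef (illustrative): what roles exist and what
     settings/endpoints they declare -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <Setting name="StorageConnectionString" />
    </ConfigurationSettings>
  </WebRole>
  <WorkerRole name="BackEnd" />
</ServiceDefinition>
```

```xml
<!-- ServiceConfiguration.cscfg (illustrative): values for this deployment,
     including how many instances of each role to run -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="StorageConnectionString"
               value="DefaultEndpointsProtocol=http;AccountName=myacct;AccountKey=..." />
    </ConfigurationSettings>
  </Role>
  <Role name="BackEnd">
    <Instances count="4" />
  </Role>
</ServiceConfiguration>
```

Because the instance count lives in the .cscfg, scaling a deployment is a configuration change rather than a rebuild of the package.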

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric. The Fabric is the 'brain' behind Windows Azure:

1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If role fails, restart the role based on policy
   2. If node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage At Massive Scale

Blob – massive files, e.g. videos, logs

Drive – use standard file system APIs

Tables – non-relational, but with few scale limits; use SQL Azure for relational data

Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – default; will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: blocks of Big.mpg are uploaded out of order – 1 6 8 3 5 4 7 2 – and then committed in sequence into the final Big.mpg blob.)
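The upload-then-commit flow above can be sketched without the real service. This is a toy in-memory stand-in (class and function names are ours, not the storage client API): stage blocks under opaque IDs in any order, then commit an ordered block list to make the blob readable.

```python
import base64

BLOCK_SIZE = 4 * 1024 * 1024  # our chosen chunk size; the service caps block counts/sizes

class ToyBlockBlob:
    """In-memory stand-in for a block blob: stage blocks, then commit a list."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; the real service GCs these after ~a week
        self.committed = b""

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit in the order given, which need not be the upload order.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

def upload(blob, payload):
    """Split payload into blocks, stage them, then commit in the right order."""
    ids = []
    for i in range(0, len(payload), BLOCK_SIZE):
        chunk = payload[i:i + BLOCK_SIZE]
        # Block IDs are opaque strings; base64 of a sequence number is a common choice.
        block_id = base64.b64encode(f"{i:08d}".encode()).decode()
        blob.put_block(block_id, chunk)
        ids.append(block_id)
    blob.put_block_list(ids)  # the blob only becomes readable after the commit
```

The point of the two-phase design is that blocks can be uploaded in parallel and retried independently; only the final `put_block_list` decides the blob's contents.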

Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size; then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
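The 512-byte alignment rule is easy to get wrong, so a client typically validates ranges before issuing a Put Page request. A small helper sketch (our own, not part of any SDK):

```python
PAGE_SIZE = 512  # page blob writes must start and end on 512-byte boundaries

def validate_page_range(start, length):
    """Reject writes that don't respect 512-byte alignment; return the
    inclusive byte range a request header would carry."""
    if start % PAGE_SIZE != 0:
        raise ValueError(f"start offset {start} is not {PAGE_SIZE}-byte aligned")
    if length == 0 or length % PAGE_SIZE != 0:
        raise ValueError(f"length {length} is not a positive multiple of {PAGE_SIZE}")
    return (start, start + length - 1)
```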

BLOB Leases

• Creates a 1 minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

  Table Name: Movies
    Star Wars | Star Trek | Fan Boys

  Table Name: Customers
    Brian H. Prince | Jason Argonaut | Bill Gates

Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides Structured Storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly Available & Durable
  • Data is replicated several times
• Familiar and Easy to use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
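The "property bag with three required properties" model can be sketched in a few lines (a plain-dict illustration of the data model, not the storage client library):

```python
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **properties):
    """A table entity is a property bag; only three properties are required,
    and the rest can vary from entity to entity in the same table."""
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entity.update(properties)
    return entity

# Two entities in the same 'Movies' table with different schemas:
e1 = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
e2 = make_entity("Action", "The Bourne Ultimatum", ReleaseDate=2007, Rating="PG-13")
```

Within a partition, (PartitionKey, RowKey) uniquely identifies an entity, which is why point queries that supply both keys are the cheapest operation.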

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to be available on a different server

System load balances
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• "Server Busy" can also mean single partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name. Every blob and its snapshots are in a single partition.

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages – Queue Name. All messages for a single queue belong to the same partition.

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2 and Server 3.)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
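The recommended reaction to '503 Server Busy' can be sketched as a generic retry wrapper (the exception type and operation are stand-ins; a real client would map HTTP 503 to it):

```python
import random
import time

class ServerBusy(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response."""

def with_backoff(op, retries=5, base_delay=0.1, max_delay=30.0, sleep=time.sleep):
    """Retry op() with truncated exponential backoff when it raises ServerBusy."""
    for attempt in range(retries):
        try:
            return op()
        except ServerBusy:
            if attempt == retries - 1:
                raise  # retry budget exhausted; surface the error
            # Delay doubles each attempt (with jitter) and is capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
```

Backing off gives the system time to split or move the hot partition; hammering a busy partition only prolongs the throttling.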

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

Partitions and Partition Ranges

Initially one server holds the whole table:
  Server A: Table = Movies [Min – Max]

Under load, the system splits the range across servers:
  Server A: Table = Movies [Min – Comedy)
  Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query can stop short and return a continuation token when it hits:
• a maximum of 1000 rows in a response
• the end of a partition range boundary
• a maximum of 5 seconds to execute the query
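Handling continuation tokens correctly means looping until the service stops returning one. A sketch against a stand-in query function (the fake backend below is ours, for illustration):

```python
def query_all(execute_segment):
    """Drain a segmented query: keep handing the continuation token back
    to the service until it stops returning one."""
    results, token = [], None
    while True:
        rows, token = execute_segment(token)  # each call may return <= 1000 rows
        results.extend(rows)
        if token is None:     # no token means the query is complete
            break
    return results

# Fake backend: 2500 rows served 1000 at a time; token = next start index.
def fake_segment(token):
    start = token or 0
    rows = list(range(start, min(start + 1000, 2500)))
    next_token = start + 1000 if start + 1000 < 2500 else None
    return rows, next_token
```

Code that treats the first response as the full result silently drops data, which is why the slide says "seriously".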

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
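The "hash as prefix" advice guards against append-only patterns: monotonically increasing keys (timestamps, sequence numbers) all land in the last partition and make it hot. A sketch (bucket count is our choice):

```python
import hashlib

BUCKETS = 16  # number of hash buckets to spread load across (illustrative choice)

def prefixed_partition_key(natural_key):
    """Prefix the natural key with a stable hash bucket so that
    monotonically increasing keys don't pile into one hot partition."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: a range query over the natural key now needs one query per bucket (run in parallel), which is exactly the "execute as separate queries" pattern above.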

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; a Worker Role calls GetMessage with a timeout to retrieve a message, processes it, and calls RemoveMessage to delete it. While a message is checked out it is invisible to other Worker Roles.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1.
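The policy in code, as a sketch (the queue accessor and handler are stand-ins; the idle budget exists only so the loop terminates in this illustration):

```python
def poll_queue(get_message, handle, idle_polls_budget=8,
               base_interval=1.0, max_interval=60.0, sleep=None):
    """Poll a queue, doubling the sleep interval on every empty poll
    (truncated at max_interval) and resetting to base on success."""
    interval = base_interval
    idle = 0
    while idle < idle_polls_budget:
        msg = get_message()
        if msg is not None:
            handle(msg)
            interval = base_interval   # success: back to fast polling
            idle = 0
        else:
            idle += 1
            if sleep:
                sleep(interval)
            interval = min(interval * 2, max_interval)  # empty poll: back off
```

This keeps per-message latency low when the queue is busy while cutting transaction charges (every GetMessage call is billed) when it is idle.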

Removing Poison Messages

(Diagram sequence; producers P1, P2 feed the queue, consumers C1, C2 drain it:)

1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2, so msg 1 is treated as poison
13. C1: Delete(Q, msg 1)

Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
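The poison-message rule from the sequence above can be sketched with an in-memory queue (the dict-based message shape and threshold are ours; the real service tracks DequeueCount for you):

```python
MAX_DEQUEUE = 3  # threshold before a message is declared poison (our choice)

def make_msg(body):
    return {"body": body, "dequeue_count": 0}

def process_queue(queue, handler, dead_letters):
    """Drain a queue; any message dequeued too many times is parked as poison.
    The handler must be idempotent, since a crash means the message reruns."""
    while queue:
        msg = queue.pop(0)
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > MAX_DEQUEUE:
            dead_letters.append(msg)   # park for offline inspection
            continue                   # i.e. delete it from the live queue
        try:
            handler(msg["body"])
        except Exception:
            queue.append(msg)          # crash/timeout: message reappears
```

Without the threshold, a message whose processing always crashes the worker would cycle forever, starving the rest of the queue.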

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing – they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.)
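Point 1 is a one-liner with the standard library; a hedged sketch (the page content is made up, and a real web role would also check the request's Accept-Encoding header):

```python
import gzip

def gzip_response(body: bytes, level: int = 6) -> bytes:
    """Gzip an HTTP response body; browsers advertising
    'Accept-Encoding: gzip' decompress it on the fly."""
    return gzip.compress(body, compresslevel=level)

# Text/HTML compresses dramatically; already-compressed media (PNG, JPEG) won't.
page = b"<html><body>" + b"<p>row of repetitive markup</p>" * 500 + b"</body></html>"
packed = gzip_response(page)
```

The `level` parameter is the compute-vs-size trade-off from point 2: higher levels burn more CPU per response for smaller payloads.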

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result reduction processing

Large volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
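Query segmentation is a plain scatter/gather. A sketch of the shape of it (the `blast_partition` function here is a stand-in that tags sequences; a real worker would invoke NCBI-BLAST on its partition):

```python
from concurrent.futures import ThreadPoolExecutor

def segment(sequences, partition_size):
    """Split the input sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for running BLAST over one partition of query sequences."""
    return [f"hit:{seq}" for seq in partition]

def run(sequences, partition_size=100, workers=8):
    """Scatter partitions to workers, then merge (join) results in order."""
    parts = segment(sequences, partition_size)
    merged = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(blast_partition, parts):  # map preserves order
            merged.extend(result)
    return merged
```

Because each partition is independent, this maps directly onto the Web Role + Queue + Worker model described below: one queue message per partition, one merge step at the end.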

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. AzureBlast: A Case Study of Developing Science Applications on the Cloud. In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow: a simple Split/Join pattern

(Splitting task → BLAST task, BLAST task, BLAST task, … → Merging task)

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra large instance sizes

Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

(Diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to pools of Worker instances. A Database Updating Role refreshes the NCBI databases. An Azure Table holds the Job Registry; Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc. Within a job, the task flow is the Split/Join pattern: Splitting task → BLAST tasks → Merging task.)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Map: deployments of 50 and 62 instances spread across the datacenters.)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6~8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: in total, 34,256 tasks processed
All 62 compute nodes lost tasks and then came back in a group; this is an update domain (~30 mins, ~6 nodes in one group)

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the Fault Domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish Proverb

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.
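The Penman-Monteith equation itself is a tiny computation; the sample input values below are ours for illustration, not from the tutorial, and λv is given in J/kg (the slide lists J/g) so the result comes out as a water mass flux:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2.45e6):
    """Penman-Monteith:
        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
    With λv in J/kg and the other units as in the symbol list, the result
    is a water mass flux in kg m⁻² s⁻¹ (numerically mm of water per second)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Plausible mid-day values (illustrative only):
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2,
                        c_p=1013.0, dq=1000.0, g_a=0.02, g_s=0.01)
```

The hard part, as the slide notes, is not this arithmetic but estimating ga and gs across a whole catchment, which is what drives the big data reduction.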

ET Synthesizes Imagery Sensors Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue and Download Queue feed the Data Collection Stage, which pulls from the source imagery download sites; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation and Analysis Reduction Stages; source metadata is consulted throughout, and scientific results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request flows to the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue tasks and read from <Input> Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus into the Job Queue table, where each entity specifies a single reprojection job request, and parses and persists ReprojectionTaskStatus into the Task Queue, where each entity specifies a single reprojection task, i.e. a single tile. Generic Workers (Worker Roles) dequeue tasks and read from Reprojection Data Storage. The ScanTimeList table is queried to get the list of satellite scan times that cover a target tile; the SwathGranuleMeta table is queried to get geo-metadata (e.g. boundaries) for each swath tile; tile data itself comes from Swath Source Data Storage.)

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6 month project duration
• Small with respect to the people costs, even at graduate student rates

Stage-by-stage (approximate):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest scale computer centers ever constructed, and have the potential to be important to both large and small scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Windows Azure Compute

Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services: Azure Storage, SQL Azure, other web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
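The decoupling described above can be sketched with a plain in-memory queue standing in for the Azure Queue service (a real deployment would use the REST-based queue API; the role names here are illustrative only):

```python
import queue

def web_role_enqueue(q, jobs):
    """Front end: drop work tickets on the queue and return immediately."""
    for job in jobs:
        q.put(job)

def worker_role_drain(q):
    """Back end: pull tickets until the queue is empty."""
    done = []
    while True:
        try:
            done.append(q.get_nowait())
        except queue.Empty:
            return done

q = queue.Queue()
web_role_enqueue(q, ["resize:1.jpg", "resize:2.jpg"])
print(worker_role_drain(q))  # both tickets processed, in FIFO order
```

Because the web role never calls the worker directly, either side can be scaled (or fail and restart) independently; only the queue contract is shared.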

Key Components – Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model.
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage At Massive Scale

Blob – massive files, e.g. videos, logs

Drive – use standard file-system APIs

Tables – non-relational, but with few scale limits (use SQL Azure for relational data)

Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg
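The addressing scheme above is purely positional, so a blob URL can be built from its three parts; a minimal sketch:

```python
def blob_address(account, container, blob_name):
    """Build the REST address for a blob, following the pattern on this
    slide: http://<account>.blob.core.windows.net/<Container>/<BlobName>."""
    return "http://{0}.blob.core.windows.net/{1}/{2}".format(
        account, container, blob_name)

print(blob_address("movieconversion", "originals", "barga.mpg"))
# http://movieconversion.blob.core.windows.net/originals/barga.mpg
```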

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default) – requires the account key to access
• Full public read
• Public read-only

Two Types of Blobs Under the Hood

• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a block ID
  • Size limit: 200 GB per blob

• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
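The split-then-commit flow can be simulated without the storage service at all; this sketch mints block IDs from a content hash (one possible convention; the service only requires opaque base64 IDs, and identical chunks share an ID here):

```python
import base64, hashlib

def split_into_blocks(data, block_size):
    """Chop a payload into (block_id, bytes) pairs, as a client would
    before issuing Put Block calls."""
    blocks = []
    for off in range(0, len(data), block_size):
        chunk = data[off:off + block_size]
        block_id = base64.b64encode(hashlib.md5(chunk).digest()).decode()
        blocks.append((block_id, chunk))
    return blocks

def commit(blocks, order):
    """Assemble the final blob from a chosen commit order of block IDs
    (the analogue of Put Block List)."""
    by_id = dict(blocks)
    return b"".join(by_id[i] for i in order)

blocks = split_into_blocks(b"hello azure blocks", 4)
ids = [i for i, _ in blocks]
assert commit(blocks, ids) == b"hello azure blocks"
```

Committing the IDs in a different order yields a different blob, which is exactly the "commit in any order" flexibility the slide describes.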

Blocks

Figure: Big.mpg uploaded as blocks (1, 6, 8, 3, 5, 4, 7, 2), then committed as Big.mpg.

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases

• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a page blob
• Example: mount a page blob as X:\
  http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the page blob
• The drive is made durable through standard page-blob replication
• The drive persists, as a page blob, even when not mounted

Windows Azure Drive API

• Create Drive – creates a page blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted page blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives: the drive letter and page-blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (page blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (page blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table Name: Movies — Star Wars, Star Trek, Fan Boys
Table Name: Customers — Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, REST (with any platform or language)

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
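Since entities in one table need not share a schema, an entity is essentially a property bag with the three mandatory properties; a minimal sketch (the helper name is ours, not an SDK call):

```python
import datetime

def make_entity(partition_key, row_key, **props):
    """Build a table entity: the three mandatory properties plus any
    per-entity schema (which may differ between rows of the same table)."""
    entity = {"PartitionKey": partition_key,
              "RowKey": row_key,
              "Timestamp": datetime.datetime.utcnow().isoformat()}
    entity.update(props)
    return entity

# Two entities with different schemas can live in the same table.
movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
customer = make_entity("1", "Customer-John Smith", Name="John Smith")
assert {"PartitionKey", "RowKey", "Timestamp"} <= set(movie)
```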

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance.

• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• The partition key controls entity locality and is the unit of scale
• Load balancing can take a few minutes to kick in; it can take a couple of seconds for a partition to become available on a different server
• On "Server Busy": use exponential backoff; either the system is load-balancing to meet your traffic needs, or a single partition's limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition. Example:

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition. Example:

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition. Example:

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

Figure: Server 1, Server 2, and Server 3 each hold copies of partitions P1, P2, …, Pn.

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Example: a Movies table, keyed by PartitionKey (Category) and RowKey (Title):

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges

Initially, one server holds the whole table:
  Server A: Table = Movies [Min – Max]

After a split, the range is divided across servers:
  Server A: Table = Movies [Min – Comedy)
  Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics; reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query response stops early at any of the following, returning a continuation token:
• A maximum of 1,000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
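Because any of those limits can truncate a response, a correct client always loops until the token comes back empty; a sketch against a stand-in paged source (the `fetch_page` callable is hypothetical, standing in for a table query):

```python
def query_all(fetch_page):
    """Drain a paged query by always honoring continuation tokens.
    fetch_page(token) -> (rows, next_token_or_None)."""
    rows, token = [], None
    while True:
        page, token = fetch_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Fake paged source: three pages of rows, chained by tokens.
pages = {None: ([1, 2], "t1"), "t1": ([3], "t2"), "t2": ([4, 5], None)}
assert query_all(lambda t: pages[t]) == [1, 2, 3, 4, 5]
```

Stopping after the first page is the classic bug: the code works on small tables and silently drops rows on large ones.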

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale
• Avoid "append-only" patterns: distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "server busy": either partitions are being load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
• Tight coupling leads to brittleness; decoupling can aid scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

Figure: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue; Worker Roles call GetMessage (with a visibility timeout) to dequeue a message, and RemoveMessage to delete it once processed.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to a cap (e.g. 60 seconds); a successful poll sets the interval back to 1.
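The back-off policy just described is a one-liner; a sketch with an assumed floor of 1 second and cap of 60:

```python
def next_interval(current, hit, floor=1, ceiling=60):
    """Truncated exponential back-off: an empty poll doubles the wait
    (capped at `ceiling` seconds); a successful poll resets it to `floor`."""
    if hit:
        return floor
    return min(current * 2, ceiling)

# Empty polls: 1 -> 2 -> 4 -> ... truncated at 60; a hit resets to 1.
interval = 1
for _ in range(8):
    interval = next_interval(interval, hit=False)
assert interval == 60
assert next_interval(interval, hit=True) == 1
```

A worker would sleep for `interval` seconds between GetMessage calls, so an idle queue costs at most one transaction per minute instead of a tight polling loop.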

Removing Poison Messages

Scenario 1 (normal operation): producers P1 and P2 enqueue messages. Consumer C1 issues GetMessage(Q, 30 s) and receives msg 1; C2 issues GetMessage(Q, 30 s) and receives msg 2. Each message's dequeue count is now 1.

Scenario 2 (consumer crash): C2 consumes msg 2 and calls DeleteMessage(Q, msg 2). C1 crashes while holding msg 1; 30 seconds after the dequeue, msg 1 becomes visible again, and a later GetMessage(Q, 30 s) returns msg 1 with dequeue count 2.

Scenario 3 (poison message): C2 picks up msg 1 and also crashes; msg 1 becomes visible again 30 seconds later. When C1 restarts and dequeues msg 1, its dequeue count exceeds 2, so the consumer treats it as poison and calls Delete(Q, msg 1) to remove it.
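The dequeue-count guard from scenario 3 can be sketched in a few lines (the message shape and threshold here are illustrative; a real consumer would read DequeueCount from the queue service):

```python
MAX_DEQUEUE = 2  # threshold from the scenario above

def handle(message, process, dead_letter):
    """Give up on a message after MAX_DEQUEUE attempts instead of
    letting it crash every consumer that picks it up."""
    if message["dequeue_count"] > MAX_DEQUEUE:
        dead_letter.append(message)  # delete from queue, park for inspection
        return "poisoned"
    process(message)
    return "processed"

dead = []
msg = {"id": "msg1", "dequeue_count": 3}
assert handle(msg, lambda m: None, dead) == "poisoned"
assert dead == [msg]
```

Parking the message (rather than silently deleting it) keeps the evidence for later debugging while unblocking the queue.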

Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O completion ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network-I/O-intensive, storage-I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage-I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (and not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.
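A quick check of the compute-for-bandwidth trade-off, using the standard library's gzip on the kind of repetitive markup a web app serves:

```python
import gzip

# Spend CPU to shrink what goes over the wire (and into storage).
page = b"<html>" + b"<div>hello azure</div>" * 200 + b"</html>"
packed = gzip.compress(page)
assert len(packed) < len(page) // 10    # repetitive HTML shrinks a lot
assert gzip.decompress(packed) == page  # lossless round trip
```

In practice the web server (or ASP.NET output compression) does this per response; the point is that the ratio on text content is routinely an order of magnitude.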

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST); needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences, query the partitions in parallel, merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow
A simple split/join pattern: a splitting task fans out into many BLAST tasks, which fan back into a merging task.

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST architecture:
• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler, scaling engine, and job registry (Azure Table)
• Worker roles fed from a global dispatch queue
• Database-updating role
• Azure Blob storage: NCBI databases, BLAST databases, temporary data, etc.
• Task flow: a splitting task fans out into BLAST tasks, then a merging task combines the results

AzureBLAST Job Portal
An ASP.NET program hosted by a web-role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table (fault tolerance: avoid in-memory state)

Components: web portal, web service, job registration, job scheduler, scaling engine, job registry.

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 minutes
• Against ~5,000 proteins from another strain: completed in less than 30 seconds

AzureBLAST significantly saved computing time…

All-Against-All Experiment
Discovering homologs: discover the interrelationships of known protein sequences.

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments; each was submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances appeared, the load was redistributed manually

End Result
• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working-instance time should be 6–8 days
• We looked into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each outage lasting ~30 minutes: this is an update domain at work.

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and then the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
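The Penman-Monteith formula above is a direct per-pixel computation once the inputs are in hand; a sketch (the sample input values and the λv of 2.45 MJ/kg are assumed for illustration, not taken from the pipeline):

```python
def penman_monteith(delta, rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lam_v=2.45e6):
    """ET per Penman-Monteith, term for term as in the formula above.
    gamma ~ 66 Pa/K; lam_v given here in J/kg (assumed value)."""
    num = delta * rn + rho_a * cp * dq * ga
    den = (delta + gamma * (1.0 + ga / gs)) * lam_v
    return num / den

# Plausible mid-day values: ET should come out small and positive.
et = penman_monteith(delta=145.0, rn=400.0, rho_a=1.2, cp=1005.0,
                     dq=1000.0, ga=0.02, gs=0.01)
assert et > 0
```

The hard part, as the slide notes, is not this arithmetic but estimating the conductivities ga and gs across a whole catchment from imagery, sensors, and field data.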

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: AzureMODIS Service Web Role Portal; Request, Download, Reprojection, Reduction 1, and Reduction 2 Queues; Source Metadata; Source Imagery Download Sites; stages: Data Collection, Reprojection, Derivation Reduction, Analysis Reduction; Scientists download science results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input> Data Storage]
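The GenericWorker behavior described above (dequeue a task, execute it, retry failed tasks up to 3 times, record status) can be sketched as a plain Python loop. This is an illustration, not the actual MODISAzure code: the queue here is in-memory, and the names `run_worker` and `status_table` are hypothetical.

```python
import queue

MAX_RETRIES = 3  # "retries failed tasks 3 times"

def run_worker(task_queue, handler, status_table):
    """Drain the task queue, retrying each failed task up to MAX_RETRIES."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break  # no more work
        try:
            handler(task)
            status_table[task["id"]] = "Succeeded"
        except Exception:
            task["retries"] = task.get("retries", 0) + 1
            if task["retries"] < MAX_RETRIES:
                task_queue.put(task)  # re-dispatch for another attempt
            else:
                status_table[task["id"]] = "Failed"
```

In the real pipeline the task queue is an Azure Queue and the status lives in an Azure Table, but the control flow is the same.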

Example Pipeline Stage: Reprojection Service

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

[diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role), reading Swath Source Data Storage and writing Reprojection Data Storage; Task Queue entries point to the ScanTimeList and SwathGranuleMeta metadata tables]

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (from the pipeline diagram):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit, November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Key Components – Compute: Web Roles

Web front end
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles

• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other Web services
• Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
  • At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob: massive files, e.g., videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
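The addressing scheme above is purely mechanical, so blob URIs can be built with a one-line helper (the function name is illustrative):

```python
def blob_uri(account, container, blob_name):
    """Build a blob address: http://<account>.blob.core.windows.net/<container>/<blob>."""
    return "http://{0}.blob.core.windows.net/{1}/{2}".format(
        account, container, blob_name)
```

Using the slide's example account, `blob_uri("movieconversion", "originals", "barga.mpg")` reproduces the address shown above.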

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'; each block has an id
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[diagram: Big.mpg split into blocks 1-8, uploaded out of order, committed in order]
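The block/commit model can be mimicked locally. This sketch (function names are illustrative, and the tiny block size in the usage note is only for demonstration) splits data into identified blocks and then assembles the final blob from a committed block list, the same way Put Block followed by Put Block List composes a block blob.

```python
import base64

def split_into_blocks(data, block_size):
    """Yield (block_id, chunk) pairs. Block IDs within a blob must be
    equal-length strings, so use a base64-encoded zero-padded counter."""
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(b"%08d" % (i // block_size)).decode("ascii")
        yield block_id, data[i:i + block_size]

def commit_block_list(uploaded_blocks, ordered_ids):
    """Model of Put Block List: the final blob is the uploaded blocks
    concatenated in the committed order."""
    return b"".join(uploaded_blocks[block_id] for block_id in ordered_ids)
```

Because the commit step names the IDs explicitly, blocks can be uploaded in any order and even replaced before the list is committed, which is exactly the flexibility the slide describes.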

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

• Account: MovieData
  • Table name: Movies (entities: Star Wars, Star Trek, Fan Boys)
  • Table name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)

Account → Table → Entity

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
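Since only Timestamp, PartitionKey, and RowKey are required and the rest of the schema can vary per entity, an entity is essentially a property bag. A minimal sketch (the helper name is hypothetical; the three property names are the real required ones):

```python
def make_entity(partition_key, row_key, **properties):
    """Every entity carries PartitionKey and RowKey (Timestamp is
    assigned by the service); all other properties are free-form."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key}
    entity.update(properties)
    return entity
```

For example, `make_entity("Action", "Fast & Furious", ReleaseDate=2009)` and `make_entity("Action", "Defiance")` can coexist in the same table with different schemas.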

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

On "Server Busy":
• Use exponential backoff
• The system load-balances to meet your traffic needs
• It may also mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities: TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs: Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Messages: Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync

[diagram: partitions P1 … Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
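The "503 plus exponential backoff" advice looks like this in code. A sketch with illustrative names: `op` returns an HTTP-style `(status, result)` pair, and the sleeps are elided as a comment so the control flow stays visible.

```python
def retry_with_backoff(op, max_attempts=5, base_delay=1.0, cap=30.0):
    """Retry op while it reports 503 Server Busy, doubling a truncated delay."""
    delay = base_delay
    for _ in range(max_attempts):
        status, result = op()
        if status != 503:
            return result
        # time.sleep(delay) would go here in real code
        delay = min(cap, delay * 2.0)
    raise RuntimeError("server busy after %d attempts" % max_attempts)
```

The cap keeps a long outage from pushing the delay toward minutes; the doubling keeps a busy partition from being hammered while the system rebalances.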

Partitions and Partition Ranges

Example Movies table, with PartitionKey = Category and RowKey = Title:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
Comedy | Office Space | … | 1999
SciFi | X-Men Origins: Wolverine | … | 2009
War | Defiance | … | 2008

Initially one server holds the whole range:
• Server A: Table = Movies [Min - Max]

After a split for load balancing:
• Server A: Table = Movies [Min - Comedy)
• Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query response may stop short of the full result set:
• Maximum of 1000 rows in a response
• At the end of a partition-range boundary
• Maximum of 5 seconds to execute the query

Tables Recap
• Select a PartitionKey and RowKey that help scale: distribute load by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy": either the system is load-balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
• Efficient for frequently used queries; supports batch transactions; distributes load

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[diagram: Web Role → PutMessage → Queue (Msg 1 … Msg 4) → Worker Roles → GetMessage (Timeout) / RemoveMessage]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x, and a successful poll sets the interval back to 1.

[diagram: consumers C1 and C2 polling with intervals growing 1 → 2 → … → 60]
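The interval rule above (double on an empty poll, reset on success, truncate at a ceiling) fits in one function; the function name is illustrative and the 1 s / 60 s bounds mirror the diagram.

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off polling for a queue consumer."""
    if got_message:
        return floor                        # successful poll: reset to minimum
    return min(ceiling, current * 2.0)      # empty poll: double, truncated
```

A consumer calls this after every poll, sleeping `next_poll_interval(...)` seconds before the next GetMessage, so an idle queue costs few transactions while a busy queue is drained promptly.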

Removing Poison Messages

[diagram: producers P1 and P2 enqueue; consumers C1 and C2 dequeue with a 30 s visibility timeout]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1); the poison message is removed

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
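The dequeue-count rule from the recap, deleting a message once its DequeueCount exceeds a threshold instead of reprocessing it forever, can be sketched as follows (the dead-letter list stands in for wherever you park poison messages for diagnosis; names are illustrative):

```python
MAX_DEQUEUE_COUNT = 2  # threshold used in the poison-message walk-through

def process_message(msg, handler, dead_letter):
    """Dead-letter a message that keeps reappearing; otherwise process it."""
    if msg["DequeueCount"] > MAX_DEQUEUE_COUNT:
        dead_letter.append(msg)   # poison: remove instead of retrying forever
        return "deleted"
    handler(msg)
    return "processed"
```

Without this check, a message whose handler always crashes would reappear after every visibility timeout and wedge a worker indefinitely.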

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
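As a language-neutral stand-in for the Task Parallel Library's data parallelism, the same fan-out pattern can be shown with the Python standard library (the worker count is an illustrative choice, not a recommendation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, workers=4):
    """Apply fn to every item using a small thread pool; results are
    returned in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

The point of the slide carries over directly: size the pool to the cores you actually have, since oversubscribing the CPU with more active workers than cores buys nothing.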

Finding Good Code Neighbors
• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[diagram: uncompressed content passes through Gzip and JavaScript/CSS/image minification to become compressed content]
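Gzipping output is a one-liner with the standard library. The helper below (illustrative name) returns both the raw and compressed bodies so the savings can actually be measured, which is the point of the slide.

```python
import gzip

def gzip_body(text):
    """Compress a response body; browsers decompress on the fly when
    Content-Encoding: gzip is set on the response."""
    raw = text.encode("utf-8")
    return raw, gzip.compress(raw)
```

On repetitive content such as HTML, JSON, or logs the compressed body is dramatically smaller, which cuts both bandwidth charges and transfer time.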

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large-volume data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern:
[diagram: splitting task → BLAST tasks (in parallel) → merging task]

Leverage the multi-core of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST startup, data transfer)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
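The split/join pattern above reduces to two small functions; a sketch with illustrative names, using the 100-sequences-per-partition granularity that the micro-benchmarks found best.

```python
def split_queries(sequences, partition_size=100):
    """Split the input sequences into fixed-size partitions, each of
    which becomes one parallel BLAST task."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    """Join step: concatenate the per-partition results in order."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

Choosing `partition_size` is exactly the granularity trade-off on the slide: fewer, larger partitions risk load imbalance, while many tiny ones pay the NCBI-BLAST startup and data-transfer overhead repeatedly.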

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST Architecture

[diagram: a Web Role hosts the web portal and web service; job registration flows to a Job Management Role containing the job scheduler and scaling engine, with the Job Registry kept in an Azure Table; tasks are dispatched through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc., kept current by a database-updating role; each job follows the splitting task → BLAST tasks → merging task flow]

AzureBLAST Job Portal: an ASP.NET program hosted by a web role instance • submit jobs • track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table • fault tolerance: avoid in-memory state

[Diagram: Web Portal / Web Service → job registration → Job Scheduler, alongside the Job Portal, Scaling Engine, and Job Registry]

Demonstration

R. palustris as a platform for H2 production. Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences): • against all NCBI non-redundant proteins: completed in 30 min • against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs • discover the interrelationships of known protein sequences

"All against All" query: • the database is also the input query • the protein database is large (4.2 GB) • 9,865,668 sequences in total to be queried • theoretically, 100 billion sequence comparisons

Performance estimation: • based on sampling runs on one extra-large Azure instance • would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists

Our Approach: • allocated a total of ~4,000 instances • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and Northern Europe • 8 deployments of AzureBLAST • each deployment has its own co-located storage service

• Divide the 10 million sequences into multiple segments • each segment is submitted to one deployment as one job for execution • each segment consists of smaller partitions • when load imbalances appear, redistribute the load manually

[Deployment map: instances per deployment — 50, 62, 62, 62, 62, 62, 50, 62]

End Result: • total size of the output is ~230 GB • 1,764,579,487 total hits • started on March 25th; the last task completed on April 8th (10 days of compute) • but based on our estimates, real working-instance time should be 6–8 days • look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record looks like the following; otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
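A small Python sketch of the kind of log audit described above — matching "Executing the task N" lines against "Execution of task N is done … took X mins" lines to surface tasks that started but never completed (the patterns assume only that phrasing, which is a reasonable reading of the log excerpts here):

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done.* took ([\d.]+)\s*mins")

def audit(log_lines):
    """Return (durations, unfinished): per-task run times in minutes,
    and task ids that were started but never reported done."""
    started, durations = set(), {}
    for line in log_lines:
        m = START.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            durations[m.group(1)] = float(m.group(2))
    return durations, sorted(started - set(durations))

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
]
durations, unfinished = audit(log)  # unfinished == ['251774']
```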

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group out for ~30 mins — this is an update domain

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, then the job was killed; 35 nodes experienced blob-write failures at the same time

A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

ET = water volume evapotranspired (m3 s-1 m-2); Δ = rate of change of saturation specific humidity with air temperature (Pa K-1); λv = latent heat of vaporization (J/g); Rn = net radiation (W m-2); cp = specific heat capacity of air (J kg-1 K-1); ρa = dry air density (kg m-3); δq = vapor pressure deficit (Pa); ga = conductivity of air (inverse of ra) (m s-1); gs = conductivity of plant stoma (inverse of rs) (m s-1); γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky: • lots of inputs, big data reduction • some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
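The Penman-Monteith formula can be transcribed directly into a small function; the default γ ≈ 66 Pa/K comes from the slide, while the default λv (≈ 2450 J/g, a common value near 20 °C) is an illustrative assumption — callers must supply unit-consistent inputs:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith evapotranspiration:
    ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    A direct transcription of the formula; no unit conversion is done."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```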

ET Synthesizes Imagery, Sensors, Models, and Field Data

NASA MODIS imagery archives: 5 TB (600K files)

FLUXNET curated sensor dataset: 30 GB (960 files)

FLUXNET curated field dataset: 2 KB (1 file)

NCEP/NCAR: ~100 MB (4K files)

Vegetative clumping: ~5 MB (1 file)

Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage: • downloads requested input tiles from NASA ftp sites • includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage: • converts source tile(s) to intermediate-result sinusoidal tiles • simple nearest-neighbor or spline algorithms

Derivation reduction stage: • first stage visible to the scientist • computes ET in our initial use

Analysis reduction stage: • optional second stage visible to the scientist • enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: Scientists → AzureMODIS Service Web Role Portal → Request Queue → Download Queue (Data Collection Stage, Source Imagery Download Sites) → Reprojection Queue (Reprojection Stage) → Reduction 1 Queue (Derivation Reduction Stage) → Reduction 2 Queue (Analysis Reduction Stage) → science results / Scientific Results Download; Source Metadata feeds the stages]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door: • receives all user requests • queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role: • parses all job requests into tasks – recoverable units of work • execution status of all jobs and tasks is persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role: • dequeues tasks created by the Service Monitor • retries failed tasks 3 times • maintains all task status

[Diagram: <PipelineStage> Task Queue → Dispatch → GenericWorker (Worker Role) instances → Parse & Persist <PipelineStage>TaskStatus; workers read from <Input>Data Storage]

Example Pipeline Stage: Reprojection Service

[Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role) instances → Reprojection Data Storage; tasks point to the ScanTimeList and SwathGranuleMeta tables and to Swath Source Data Storage]

Each entity in the job table specifies a single reprojection job request; each entity in the task table specifies a single reprojection task (i.e., a single tile)

Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times

• Storage costs driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates

[Cost/pipeline diagram, Scientists → AzureMODIS Service Web Role Portal → stage queues — Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers, $50 upload + $450 storage; Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers, $420 CPU + $60 download; Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers, $216 CPU + $1 download + $6 storage; Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers, $216 CPU + $2 download + $9 storage]

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers • Available research services • Use cases on Azure for research • Event announcements • Detailed tutorials • Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services • Benchmarks for reference algorithms • Best-practice tips • Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Key Components – Compute: Worker Roles

• Utility compute • Windows Server 2008 • Background processing • Each role can define an amount of local storage: protected space on the local drive, considered volatile storage • May communicate with outside services: • Azure Storage • SQL Azure • other Web services • Can expose external and internal endpoints

Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue: • decouple parts of the application, so they are easier to scale independently • resource allocation: different priority queues and backend servers • mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role: you own the box

• How it works: • download the "Guest OS" to Server 2008 Hyper-V • customize the OS as you need to • upload the differences VHD • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy, and manage that diagram for you: • find a hardware home • copy and launch your app binaries • monitor your app and the hardware • in case of failure, take action, perhaps even relocating your app

• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model • The platform identifies and allocates resources, deploys the service, and manages service health • Configuration is handled by two files:

ServiceDefinition.csdef
ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script • VS builds two files: • an encrypted package of your code • your config file

• You must create an Azure account, then a service, and then you deploy your code

• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services • X.509 certs for authentication • Lets you create, delete, change, upgrade, swap, … • Lots of community- and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model: determine resource requirements; create role images
2. Allocate resources
3. Prepare nodes: place role images on nodes; configure settings; start roles
4. Configure load balancers
5. Maintain service health: if a role fails, restart the role based on policy; if a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

Blobs – massive files, e.g., videos, logs
Drives – use standard file-system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface: • PutBlob – inserts a new blob, overwrites the existing blob • GetBlob – get the whole blob or a specific range • DeleteBlob • CopyBlob • SnapshotBlob • LeaseBlob

• Each blob has an address: • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName> • e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder • Unlimited capacity • Can only contain blobs

Each container has an access level: • private (the default; requires the account key to access) • full public read • public read only

Two Types of Blobs Under the Hood

• Block blob: • targeted at streaming workloads • each blob consists of a sequence of blocks • each block is identified by a Block ID • size limit: 200 GB per blob

• Page blob: • targeted at random read/write workloads • each blob consists of an array of pages • each page is identified by its offset from the start of the blob • size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an id • Then commit those blocks, in any order, into a blob • The final blob is limited to 1 TB and up to 50,000 blocks • You can modify a blob by inserting, updating, and removing blocks • Blocks live for a week before being GC'd if not committed to a blob • Optimized for streaming

[Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed in order into Big.mpg]
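The block-upload flow can be sketched as pure logic in Python — split a payload into (block id, chunk) pairs, then "commit" an ordered block list. The 4 MB block size and the base64 id scheme are illustrative assumptions, and `commit()` merely stands in for the service's Put Block List call:

```python
import base64

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size for this sketch

def make_blocks(data, block_size=BLOCK_SIZE):
    """Split a payload into (block_id, bytes) pairs; block ids must be
    equal-length strings, conventionally base64-encoded."""
    blocks = []
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"{i // block_size:08d}".encode()).decode()
        blocks.append((block_id, data[i:i + block_size]))
    return blocks

def commit(blocks):
    """Stand-in for Put Block List: the blob is the blocks committed in
    the order you name them, not the order they were uploaded."""
    return b"".join(chunk for _, chunk in blocks)
```

Because the commit order defines the blob, blocks can be uploaded in parallel and out of order, then assembled once.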


Pages

• Similar to block blobs • Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob

• Call Put Blob to set the max size, then call Put Page • All pages must align to 512-byte page boundaries • Writes to page blobs happen in-place and are immediately committed to the blob • The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
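A tiny helper for the 512-byte alignment rule — rounding an arbitrary byte range out to page boundaries before issuing a Put Page-style write (a sketch of the arithmetic, not an API call):

```python
PAGE = 512  # page blobs require 512-byte-aligned writes

def page_range(offset, length):
    """Round a byte range outward to 512-byte page boundaries."""
    start = (offset // PAGE) * PAGE
    end = -(-(offset + length) // PAGE) * PAGE  # ceiling division
    return start, end

page_range(100, 700)  # → (0, 1024)
```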

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob • Operations: Acquire, Renew, Release, Break • You must have the lease id to perform operations • You can check the LeaseStatus property • Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use: • use existing NTFS APIs to access a durable drive • durability and survival of data on application failover • enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob • Example: mount a Page Blob as X: • http://<accountname>.blob.core.windows.net/<containername>/<blobname>

• All writes to the drive are made durable to the Page Blob • The drive is made durable through standard Page Blob replication • The drive persists, as a Page Blob, even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg • Do not share keys; wrap access with a service • Have a strategy for accounts and containers • You can assign a custom domain to your storage account • There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
Table Name: Movies — entities: Star Wars, Star Trek, Fan Boys
Table Name: Customers — entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities; entity schema can vary in the same table

Windows Azure Tables

• Provides structured storage • Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows

• Highly available & durable: data is replicated several times

• Familiar and easy-to-use API: • WCF Data Services and OData • .NET classes and LINQ • REST – with any platform or language

Is not relational. Cannot: • create foreign-key relationships between tables • perform server-side joins between tables • create custom indexes on the tables • no server-side Count(), for example

All entities must have the following properties: • Timestamp • PartitionKey • RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery: simple asynchronous work dispatch

• Programming semantics ensure that a message can be processed at least once

• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

Every data object has a partition key: • different for each data type (blobs, entities, queues) • a partition can be served by a single server • the system load-balances partitions based on traffic pattern • controls entity locality

The partition key is the unit of scale: • load balancing can take a few minutes to kick in • it can take a couple of seconds for a partition to become available on a different server

"Server Busy": • use exponential backoff on "Server Busy" • the system load-balances to meet your traffic needs • it can also mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name: all messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas • Replicas are created as needed • A write operation is not complete until it has been written to all three replicas • Reads are only load-balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage account: • capacity – up to 100 TB • transactions – up to a few thousand requests per second • bandwidth – up to a few hundred megabytes per second

Single queue/table partition: • up to 500 transactions per second

Single blob partition: • throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

A Movies table keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Initially a single server holds the whole range: Server A, Table = Movies [Min – Max]

As load grows, the system splits the partition range across servers: Server A, Table = Movies [Min – Comedy); Server B, Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability: • distribute load as much as possible • hot partitions can be load-balanced • PartitionKey is critical for scalability

Query efficiency & speed: • avoid frequent large scans • parallelize queries • point queries are most efficient

Entity group transactions: • transactions across a single partition • transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

You get a continuation token: • at a maximum of 1,000 rows in a response • at the end of a partition range boundary • after a maximum of 5 seconds of query execution
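The standard drain loop for continuation tokens can be sketched as follows, with `fetch_page` simulating a table query that returns at most 1,000 rows plus an optional token:

```python
def query_all(fetch_page):
    """Drain a paged query: keep passing the continuation token back
    until the service stops returning one. `fetch_page` stands in for
    a table query returning (rows, token_or_None)."""
    rows, token = [], None
    while True:
        page, token = fetch_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Simulated service: 1,000-row pages over 2,500 rows.
DATA = list(range(2500))

def fetch_page(token):
    start = token or 0
    page = DATA[start:start + 1000]
    nxt = start + 1000
    return page, (nxt if nxt < len(DATA) else None)

rows = query_all(fetch_page)  # three round trips: 1000 + 1000 + 500 rows
```

Note that a token can arrive even for an "empty" page (e.g., at a partition range boundary), so the loop must test the token, not the row count.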

Tables Recap

• Select PartitionKey and RowKey that help scale: efficient for frequently used queries, supports batch transactions, distributes load
• Avoid "append only" patterns: distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the system is load-balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
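The hash-prefix trick for avoiding append-only partition keys might look like this sketch (MD5 and 16 buckets are arbitrary illustrative choices):

```python
import hashlib

def scaled_partition_key(natural_key, buckets=16):
    """Prefix an append-only key (e.g., a timestamp) with a stable
    hash bucket so new writes spread across `buckets` partitions
    instead of always hitting the last one."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

Range queries then fan out over the bucket prefixes (one query per bucket), trading some query complexity for write scalability.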

WCF Data Services

• Use a new context for each logical operation • AddObject/AttachTo can throw an exception if the entity is already being tracked • A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together: • tight coupling leads to brittleness • decoupling can aid in scaling and performance

• A queue can hold an unlimited number of messages • Messages must be serializable as XML and are limited to 8 KB in size • Commonly use the work-ticket pattern • Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: Web Role → PutMessage → Queue (Msg 1, Msg 2, Msg 3, Msg 4) → Worker Roles: GetMessage (with timeout) → process → RemoveMessage]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to some maximum (e.g., 60 seconds); a successful poll resets the interval back to 1
Removing Poison Messages

[Scenario, part 1 — producers P1, P2; consumers C1, C2; the queue holds msg 1 and msg 2, each with dequeue count 1:]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (continued)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1 (dequeue count now 2)

Removing Poison Messages (continued)

1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message • batch messages • garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
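The dequeue-count guard can be simulated in a few lines of Python; the in-memory queue, visibility timeout, and dead-lettering are all stand-ins, and MAX_DEQUEUE = 3 is an arbitrary threshold:

```python
MAX_DEQUEUE = 3  # illustrative threshold on a message's dequeue count

def process_queue(messages, handler):
    """Sketch of the dequeue-count guard: a message whose handler
    keeps crashing reappears (as after a visibility timeout) until
    its dequeue count exceeds MAX_DEQUEUE, then it is dead-lettered
    instead of being retried forever."""
    counts, pending = {}, list(messages)
    done, poison = [], []
    while pending:
        msg = pending.pop(0)
        counts[msg] = counts.get(msg, 0) + 1
        if counts[msg] > MAX_DEQUEUE:
            poison.append(msg)       # dead-letter it (e.g., log to a blob)
            continue
        try:
            handler(msg)
            done.append(msg)         # DeleteMessage on success
        except Exception:
            pending.append(msg)      # becomes visible again after timeout
    return done, poison
```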

Windows Azure Storage Takeaways

Data abstractions to build your applications:

Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember: • 1 role instance == 1 VM running Windows • 1 role instance != one specific task for your code • you're paying for the entire VM, so why not use it?

• Common mistake – splitting code up into multiple roles, each not using up its CPU

• Balance between using up CPU and having free capacity in times of need • There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency • May not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively • In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library: • data parallelism • task parallelism
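The same data-parallel vs. task-parallel distinction, sketched with Python's `concurrent.futures` rather than the .NET Task Parallel Library:

```python
from concurrent.futures import ThreadPoolExecutor

def data_parallel(items, work, workers=4):
    """Data parallelism: the same operation over a partitioned dataset."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, items))

def task_parallel(tasks, workers=4):
    """Task parallelism: distinct operations running concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(t) for t in tasks]
        return [f.result() for f in futures]
```

Either way, the goal from the slide holds: keep the cores of the VM you are already paying for busy.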

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive

• Find code that is intensive with different resources to live together • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing – they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
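A quick sketch of point 1 using Python's standard gzip module, showing why compressing repetitive output content pays off (the HTML snippet is illustrative):

```python
import gzip

# Highly repetitive markup, typical of generated pages and logs
html = b"<div class='row'>example row</div>\n" * 1000
compressed = gzip.compress(html)

# Compressed size is a small fraction of the original
print(len(html), len(compressed))
```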

Uncompressed content → Gzip; minify JavaScript; minify CSS; minify images → compressed content

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task Flow: a simple split/join pattern

Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)

Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waits in case of instance failure

(Task flow: a splitting task fans out to BLAST tasks that run in parallel, followed by a merging task.)
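The split/join flow above can be sketched as plain code – a splitting task partitions the queries, the partitions run in parallel, and a merging task concatenates the results (blast_partition is a stand-in for invoking NCBI-BLAST, not the real tool):

```python
from concurrent.futures import ThreadPoolExecutor

def blast_partition(seqs):
    # stand-in for running NCBI-BLAST over one partition of query sequences
    return [("hit", s) for s in seqs]

def split(items, size):
    # splitting task: fixed-size partitions (size tunes the granularity trade-off)
    return [items[i:i + size] for i in range(0, len(items), size)]

queries = ["acgt", "ttga", "ccat", "gatc", "aacc"]
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(blast_partition, split(queries, 2)))
merged = [hit for part in parts for hit in part]  # merging task
print(len(merged))
```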

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilize the resources

AzureBLAST

(Architecture: the Web Role hosts the web portal, web service, and job registration; a Job Management Role runs the job scheduler and scaling engine, dispatching work through a global dispatch queue; Worker instances execute tasks; an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role keeps the NCBI databases current. Task flow: splitting task → parallel BLAST tasks → merging task.)

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

(Components: web portal, web service, job registration, job scheduler, scaling engine, job registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (42 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists

Our approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe

• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service

• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions

• When load imbalances appear, redistribute the load manually

(Deployment map: 8 deployments of 50–62 extra-large instances each, ~475 VMs in total.)

End result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, the real working-instance time should be 6–8 days
  • Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed and the job was killed

35 nodes experienced blob-writing failures at the same time

A reasonable guess: the fault domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
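A numeric sketch of the formula above; all input values are illustrative placeholders, not field data:

```python
# Illustrative inputs (units as in the definitions above)
delta = 145.0    # Δ, Pa/K
Rn = 400.0       # net radiation, W/m^2
rho_a = 1.2      # dry air density, kg/m^3
cp = 1004.0      # specific heat capacity of air, J/(kg K)
dq = 1000.0      # vapor pressure deficit, Pa
ga = 0.02        # conductivity of air, m/s
gs = 0.01        # stomatal conductivity, m/s
gamma = 66.0     # psychrometric constant, Pa/K
lam_v = 2450.0   # latent heat of vaporization, J/g

# Penman-Monteith form from the slide
ET = (delta * Rn + rho_a * cp * dq * ga) / ((delta + gamma * (1 + ga / gs)) * lam_v)
print(ET)
```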

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role portal; a request queue and download queue feed the data collection stage, which pulls from the source imagery download sites; the reprojection stage, derivation reduction stage, and analysis reduction stage are each driven by their own queue (reprojection, reduction 1, reduction 2), with source metadata tracked along the way; science results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Flow: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a GenericWorker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Flow: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue tasks and access the <Input> Data Storage.)

Example Pipeline Stage: Reprojection Service

(Flow: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus – each entity specifies a single reprojection job request – then parses and persists ReprojectionTaskStatus – each entity specifies a single reprojection task, i.e., a single tile – and dispatches to the Task Queue for GenericWorker (Worker Role) instances. Workers query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table to get the list of satellite scan times that cover a target tile; reprojection data and swath source data live in storage.)

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Suggested Application Model: using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue
• Decouple parts of the application – easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
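The glue pattern can be sketched with an in-process queue standing in for an Azure queue (the real service adds visibility timeouts and REST access; the ticket names here are illustrative):

```python
import queue
import threading

work = queue.Queue()  # stands in for an Azure queue between roles
processed = []

def worker_role():
    while True:
        msg = work.get()          # like GetMessage on an Azure queue
        if msg is None:
            break
        processed.append(msg)     # do the work; success acts like RemoveMessage
        work.task_done()

t = threading.Thread(target=worker_role)
t.start()
for ticket in ["resize:1.png", "resize:2.png"]:
    work.put(ticket)              # the web role enqueues work tickets
work.put(None)                    # shutdown signal, for this sketch only
t.join()
print(processed)
```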

Key Components – Compute: VM Roles

• Customized role
  • You own the box

• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS + differences VHD

Application Hosting

'Grokking' the service model

• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy, and manage that diagram for you
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app

• At all times, the 'diagram' stays whole

Automated Service Management

Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
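A sketch of what the two files contain; the role names, sizes, and instance counts are illustrative, so consult the SDK for the exact schema:

```xml
<!-- ServiceDefinition.csdef (sketch): the shape of the service, changed only by redeploying -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
  </WebRole>
  <WorkerRole name="BackEnd" />
</ServiceDefinition>
```

```xml
<!-- ServiceConfiguration.cscfg (sketch): values changeable at runtime, e.g. instance counts -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="2" />
  </Role>
  <Role name="BackEnd">
    <Instances count="4" />
  </Role>
</ServiceConfiguration>
```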

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale

Blob – massive files, e.g., videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits (use SQL Azure for relational data)
Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface
  • PutBlob – inserts a new blob, overwrites an existing blob
  • GetBlob – gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
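The address scheme is simple enough to build by hand; a sketch using the slide's example names:

```python
# Compose a blob URL from account, container, and blob name
# (values taken from the slide's example)
account, container, blob = "movieconversion", "originals", "barga.mpg"
url = f"http://{account}.blob.core.windows.net/{container}/{blob}"
print(url)
```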

Containers

• Similar to a top-level folder
• Unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg is split into blocks 1–8, uploaded out of order, and committed as Big.mpg.)
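A toy sketch of the block-blob idea – blocks arrive in any order, and the committed block list defines the blob. This mimics PutBlock/PutBlockList in-process; it is not the storage API:

```python
uncommitted = {}  # block_id -> bytes, like uploaded-but-uncommitted blocks

def put_block(block_id, data):
    uncommitted[block_id] = data

def put_block_list(ordered_ids):
    # committing: the order of IDs here, not the upload order, defines the blob
    return b"".join(uncommitted[b] for b in ordered_ids)

put_block("b2", b"world")
put_block("b1", b"hello ")           # uploaded out of order
blob = put_block_list(["b1", "b2"])  # commit in the intended order
print(blob)
```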


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob

• Operations: Acquire, Renew, Release, Break

• Must have the lease ID to perform operations

• Can check the LeaseStatus property

• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>

• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD

• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance

• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using

• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive

• Unmount Drive – unmounts the drive and frees up the drive letter

• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)

• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in the .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
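The last point is a probe-and-catch pattern. A sketch with a hypothetical client object – FetchAttributes is the real .NET call, while this client and its NotFound error are stand-ins:

```python
class NotFound(Exception):
    """Stands in for the storage error returned for a missing container."""

def container_exists(client, name):
    # There is no direct existence check: probe the metadata and catch the error
    try:
        client.fetch_attributes(name)
        return True
    except NotFound:
        return False
```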

Table Structure

Account: MovieData
  Table "Movies": Star Wars, Star Trek, Fan Boys
  Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows

• Highly available and durable
  • Data is replicated several times

• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch

• Programming semantics ensure that a message can be processed at least once

• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• Controls entity locality

The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential back-off on "Server Busy"
• The system load-balances to meet your traffic needs
• Or the single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential back-off

Partitions and Partition Ranges

Movies table (PartitionKey = Category, RowKey = Title; each entity also has a Timestamp and ReleaseDate):

Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Initially a single server holds the whole range: Server A, Table = Movies [Min – Max]. After the partition range splits, Server A serves Movies [Min – Comedy) and Server B serves Movies [Comedy – Max].

Key Selection Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A continuation token is returned:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query
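Handling continuation tokens correctly means looping until the service stops returning one. A sketch in which query_page stands in for a table query returning (rows, next_token):

```python
def fetch_all(query_page):
    # Keep requesting pages until no continuation token comes back;
    # a single request may stop early at 1000 rows, a partition
    # boundary, or the 5-second execution limit.
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows
```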

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale
  • Distribute load by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy": partitions are load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits
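One way to implement the "distribute by using a hash as a prefix" advice: derive a small bucket number from the natural key and prepend it, so append-only key ranges spread across partitions (the bucket count is illustrative):

```python
import hashlib

def partition_key(natural_key, buckets=16):
    # Stable hash -> small bucket id, prepended so sequential keys
    # (e.g. dates) land in different partitions instead of one hot range
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}-{natural_key}"

print(partition_key("2010-12-07"))
```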

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid scaling and performance

• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work-ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add messages (Msg 1–4) to the queue; Worker Roles call GetMessage with a visibility timeout to receive Msg 1 and Msg 2, then RemoveMessage to delete each one once processed.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll resets the interval back to 1.
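That polling policy can be sketched directly; the base and cap values are illustrative:

```python
def next_interval(current: float, got_message: bool,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Truncated exponential back-off: each empty poll doubles the
    interval up to a cap; a successful poll resets it to the base."""
    if got_message:
        return base
    return min(current * 2, cap)

# Empty polls double the interval until it is truncated at the cap...
intervals = []
interval = 1.0
for _ in range(8):
    interval = next_interval(interval, got_message=False)
    intervals.append(interval)

# ...and one successful poll resets it.
interval = next_interval(interval, got_message=True)
```

This keeps idle workers from hammering the queue (each GetMessage is a billed transaction) while staying responsive once traffic resumes.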


Removing Poison Messages


1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2


Removing Poison Messages


1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1


Removing Poison Messages


1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)


Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
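The poison-message rule from the recap, sketched with a simulated queue (the Message class, handler, and threshold are illustrative, not SDK types):

```python
MAX_DEQUEUE_COUNT = 3  # illustrative threshold

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def handle(queue, process, poison_bin):
    """Dequeue one message; if its dequeue count exceeds the threshold,
    treat it as poison and set it aside instead of retrying forever."""
    msg = queue.pop(0)
    msg.dequeue_count += 1
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        poison_bin.append(msg)        # dead-letter it for inspection
        return
    try:
        process(msg.body)             # success: message is not re-queued
    except Exception:
        queue.append(msg)             # simulates it becoming visible again

def always_fails(body):
    raise RuntimeError("simulated crash while processing " + body)

queue, poison = [Message("bad")], []
while queue:
    handle(queue, always_fails, poison)
```

Without the threshold check, the `while` loop above would never terminate, which is exactly what a poison message does to a worker fleet.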

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – fewer larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code up into multiple roles, each not using much CPU

• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
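The deck's advice is .NET-specific (the Task Parallel Library); as an analogous data-parallel sketch in Python, size the pool to the core count so active workers do not exceed cores:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def score(segment):
    """Work on one independent segment (data parallelism)."""
    return sum(ord(c) for c in segment)

segments = ["ACGT" * 100, "TTGA" * 100, "CCAG" * 100, "GGTA" * 100]

# Match the pool size to the core count, per the slide's guidance.
workers = os.cpu_count() or 2
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(score, segments))

serial = [score(s) for s in segments]  # same answers, sequentially
```

Task parallelism is the same pattern with heterogeneous callables submitted via `pool.submit` instead of a uniform `map`.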

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of having idling VMs

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

[Figure: uncompressed vs. compressed content – Gzip, minified JavaScript, minified CSS, minified images]
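A quick sketch of the payoff from gzipping output content; exact sizes depend on the input, so only the direction of the result is guaranteed:

```python
import gzip

# Repetitive output (HTML, JSON, JavaScript) compresses very well.
page = b"<div class='row'>hello cloud</div>\n" * 500

compressed = gzip.compress(page)
ratio = len(compressed) / len(page)     # fraction of bandwidth actually sent

restored = gzip.decompress(compressed)  # browsers do this on the fly
```

The CPU spent compressing is the "trade compute costs for storage size" point above: you pay a little VM time to cut both bandwidth and storage bills.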

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model
  • Web Role + Queue + Worker

• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.

AzureBLAST Task Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure
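The split/join pattern can be sketched as two pure functions; the 100-sequences-per-partition default follows the micro-benchmark result cited below, and the names are illustrative:

```python
def split_queries(sequences, partition_size=100):
    """Split input sequences into partitions; each partition becomes
    one BLAST task. Larger partitions risk load imbalance, smaller
    ones add per-task overhead."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(per_partition_results):
    """Join step: concatenate per-partition hit lists in input order."""
    merged = []
    for hits in per_partition_results:
        merged.extend(hits)
    return merged

sequences = [f"seq{i}" for i in range(1050)]
tasks = split_queries(sequences)        # 10 full partitions + 1 of 50
# Pretend each worker returned its partition's hits unchanged.
results = merge_results(tasks)
```

In the real system each partition would be enqueued as a work ticket and processed by a worker running NCBI-BLAST.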

[Diagram: a splitting task fans out into many BLAST tasks, whose outputs feed a merging task.]

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal, Web Service, and job registration; a Job Management Role hosts the Job Scheduler and Scaling Engine, dispatching tasks through a global dispatch queue to Worker roles; an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a separate role handles database updating.]


AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state


Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.
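The single-desktop estimate is straightforward arithmetic to check:

```python
# Convert the sampled estimate of 3,216,731 minutes into years.
minutes = 3_216_731
minutes_per_year = 60 * 24 * 365   # 525,600
years = minutes / minutes_per_year  # about 6.1 years
```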

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like this; otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe data center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in a group: this is an update domain (~30 mins, ~6 nodes in one group).

35 nodes experienced blob-writing failures at the same time.

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed.

A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)        Penman-Monteith (1964)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.
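The Penman-Monteith formula translates directly into code; the sample inputs below are illustrative values only, not calibrated MODIS/FLUXNET data:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2.45e3):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
    Symbols follow the slide's definitions; the γ ≈ 66 Pa/K default is
    from the slide, and the λv default (J/g) is an indicative value."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs only; real runs draw these from imagery and sensors.
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2,
                        c_p=1005.0, dq=800.0, g_a=0.02, g_s=0.01)
```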

ET Synthesizes Imagery Sensors Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue and download queue feed the data collection stage, which pulls from source imagery download sites; the reprojection queue feeds the reprojection stage; the Reduction 1 and Reduction 2 queues feed the derivation and analysis reduction stages; scientific results are then available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage> JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage> TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage> TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read <Input> Data Storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) via the Job Queue, and parses & persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile); tasks are dispatched through the Task Queue to GenericWorker (Worker Role) instances, which read Reprojection Data Storage and Swath Source Data Storage. Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile.]

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 cpu, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 cpu, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Scalable Fault Tolerant Applications

Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Key Components – Compute: VM Roles

• Customized role
  • You own the box

• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate

• The service model is the same diagram written down in a declarative format

• You give the Fabric the service model and the binaries that go with each of those nodes

• The Fabric can provision, deploy, and manage that diagram for you:

• Find a hardware home

• Copy and launch your app binaries

• Monitor your app and the hardware

• In case of failure, take action; perhaps even relocate your app

• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:

ServiceDefinition.csdef
ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double-click on the Role Name in the Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file

• You must create an Azure account, then a service, and then you deploy your code

• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images

2. Allocate resources

3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles

4. Configure load balancers

5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale

Blob – massive files, e.g., videos, logs

Drive – use standard file system APIs

Tables – non-relational, but with few scale limits; use SQL Azure for relational data

Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites an existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
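The upload/commit model can be sketched as a toy class (not the real storage client): blocks may arrive in any order, and the committed block list fixes the final order:

```python
import base64

class BlockBlob:
    """Sketch of the block-blob model: upload named blocks in any
    order, then commit an ordered block list to form the blob."""
    def __init__(self):
        self.uncommitted = {}
        self.committed = b""

    def put_block(self, block_id: str, data: bytes):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit order, not upload order, defines the blob's contents.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
ids = [base64.b64encode(bytes([i])).decode() for i in range(4)]
for i in [2, 0, 3, 1]:                      # upload out of order
    blob.put_block(ids[i], f"part{i};".encode())
blob.put_block_list(ids)                    # commit in logical order
```

This is why parallel uploads work well for block blobs: workers can push blocks concurrently and a single commit assembles them.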


Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob

• Operations: Acquire, Renew, Release, Break

• Must have the lease id to perform operations

• Can check the LeaseStatus property

• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists, as a Page Blob, even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist

Table Structure

[Diagram: an Account (e.g., MovieData) contains Tables (e.g., "Movies" with entities Star Wars, Star Trek, Fan Boys; "Customers" with entities Brian H. Prince, Jason Argonaut, Bill Gates), and Tables contain Entities.]

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows

• Highly available & durable
  • Data is replicated several times

• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch

• Programming semantics ensure that a message can be processed at least once

• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• It is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server busy":
• Use exponential backoff on "server busy"
• The system load balances to meet your traffic needs
• It can also mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order-1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order-3 | | | $10.00

Blobs – Container name + Blob name: every blob and its snapshots are in a single partition.

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue name: all messages for a single queue belong to the same partition.

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

[Diagram: Server 1, Server 2, and Server 3 each holding replicas of partitions P1, P2, …, Pn]

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
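The recommended backoff can be sketched as a small retry helper (Python; the names, attempt counts, and delays are illustrative, not part of any Azure SDK):

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for an HTTP 503 ('Server Busy') from the storage service."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on ServerBusyError with exponential
    backoff plus jitter; re-raise once the attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_attempts - 1:
                raise
            # Double the base delay each attempt; jitter de-synchronizes clients.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```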

Single Blob Partition
• Throughput up to 60 MB/s

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges

A single server initially holds the whole table:
Server A: Table = Movies [Min – Max]

Under load, the system splits the range across servers:
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity Group Transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A continuation token is returned when the query hits:
• The maximum of 1,000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query
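A client therefore has to keep reissuing the query until no token comes back. A sketch (Python; `query_page` is a hypothetical stand-in for a table-service request):

```python
def query_all(query_page):
    """Drain a paged query. `query_page(token)` must return
    (rows, next_token), with next_token=None on the final page.
    A page may legitimately contain zero rows yet still carry a
    continuation token (e.g. at a partition range boundary)."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows
```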

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Distribute by using a hash, etc., as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy: partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
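The hash-prefix advice above can be sketched in a few lines (Python; the bucket count and naming are illustrative):

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Spread 'append only' keys (e.g. timestamps) across partitions
    by prefixing a stable hash bucket, so sequential inserts do not
    all land on one hot partition."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: a later range query over the natural key must fan out across all buckets (and expect continuation tokens per bucket).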

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Messages are limited to 8 KB in size
• Commonly used with the work ticket pattern

• Why not simply use a table?
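The work ticket pattern keeps each queue message small: the payload goes to a blob and the message carries only a reference. A sketch (Python, with in-memory dicts standing in for the blob and queue services):

```python
import json
import uuid

blob_store = {}   # stand-in for blob storage
queue = []        # stand-in for an Azure queue

def submit_work(payload_bytes):
    """Producer: store the (possibly large) payload in a blob and
    enqueue a small 'work ticket' that points at it."""
    blob_name = f"work/{uuid.uuid4()}"
    blob_store[blob_name] = payload_bytes
    ticket = json.dumps({"blob": blob_name})   # well under the 8 KB limit
    queue.append(ticket)
    return blob_name

def process_next():
    """Consumer: dequeue a ticket, fetch the payload, then clean up
    the blob so orphaned payloads do not accumulate."""
    ticket = json.loads(queue.pop(0))
    return blob_store.pop(ticket["blob"])
```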

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1–4 to a queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage once processing completes]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
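The lifecycle semantics (a get makes the message invisible for a timeout; a delete needs the pop receipt) can be modelled in a few lines (Python; an in-memory stand-in, not the REST client):

```python
import uuid

class SimQueue:
    """In-memory model of get/delete with a visibility timeout."""
    def __init__(self):
        self._messages = []   # each item: [message_id, body, invisible_until]
        self._clock = 0.0

    def put(self, body):
        self._messages.append([str(uuid.uuid4()), body, 0.0])

    def get(self, visibility_timeout):
        for msg in self._messages:
            if msg[2] <= self._clock:                 # currently visible
                msg[2] = self._clock + visibility_timeout
                pop_receipt = f"{msg[0]}:{msg[2]}"
                return msg[0], msg[1], pop_receipt
        return None

    def delete(self, message_id, pop_receipt):
        self._messages = [m for m in self._messages if m[0] != message_id]

    def advance(self, seconds):
        self._clock += seconds    # simulate time passing
```

If the consumer crashes and never deletes, the message simply reappears after the timeout: this is the at-least-once guarantee.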

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to a maximum; a successful poll sets the interval back to 1.

[Diagram: consumers C1 and C2 polling a queue, with intervals growing 1, 2, …, up to 60 seconds]
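The interval rule can be written directly (Python; the 60-second cap matches the figure, everything else is illustrative):

```python
def poll(queue_get, handle, minimum=1, maximum=60, rounds=10,
         sleep=lambda s: None):
    """Poll `queue_get` for `rounds` iterations, sleeping with a
    truncated exponential back-off between empty polls. Returns the
    interval used after each round, for inspection."""
    interval = minimum
    intervals = []
    for _ in range(rounds):
        msg = queue_get()
        if msg is None:
            interval = min(interval * 2, maximum)   # empty: back off
        else:
            handle(msg)
            interval = minimum                      # success: reset
        intervals.append(interval)
        sleep(interval)
    return intervals
```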

Removing Poison Messages

A message that repeatedly crashes its consumer ("poison") reappears after each visibility timeout. The dequeue count exposes this. Scenario (producers P1, P2; consumers C1, C2):

1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumes msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after the dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashes
9. msg 1 becomes visible again 30 s after the dequeue
10. C1 restarts
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1): the poison message is removed rather than retried forever

Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
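The dequeue-count threshold can be sketched as part of the consumer loop (Python; the threshold and message shape are illustrative, and the count itself is maintained by the queue service):

```python
POISON_THRESHOLD = 3

def handle_message(msg, process, dead_letter):
    """Process one dequeued message; divert it if it has failed too
    many times. `msg` is a dict carrying 'body' and 'dequeue_count'."""
    if msg["dequeue_count"] > POISON_THRESHOLD:
        dead_letter.append(msg)   # quarantine for offline inspection
        return "dead-lettered"
    process(msg["body"])
    return "processed"
```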

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using much CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
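The deck's examples assume .NET's Task Parallel Library; as an analogous data-parallel sketch in Python (names and chunk sizes are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in for per-item work (e.g. transforming or scoring data)."""
    return sum(x * x for x in chunk)

def data_parallel_sum_of_squares(data, workers=4, chunk_size=1000):
    """Data parallelism: the same operation applied to disjoint
    chunks of the input across a pool of workers."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))
```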

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs

Performance vs. Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content vs. compressed content, via Gzip and minified JavaScript, CSS, and images]
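A gzip sketch (Python standard library; in a real web role this is typically handled by the web server or framework rather than by hand):

```python
import gzip

def gzip_response(body_bytes, level=6):
    """Compress an HTTP response body; browsers advertising
    'Accept-Encoding: gzip' decompress it on the fly."""
    compressed = gzip.compress(body_bytes, compresslevel=level)
    headers = {"Content-Encoding": "gzip",
               "Content-Length": str(len(compressed))}
    return headers, compressed
```

Text-heavy content often shrinks several-fold, cutting both bandwidth and storage bills.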

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern:
  • split the input sequences
  • query partitions in parallel
  • merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker
• Special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: test runs to profile, then set the size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure

[Diagram: a splitting task fans out to many BLAST tasks, which feed a merging task]

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

AzureBLAST

[Diagram: a Web Role hosts the web portal, web service, job registration, and job scheduler; a Job Management Role runs the scaling engine and a global dispatch queue; worker roles pull BLAST tasks; a Database Updating Role refreshes the NCBI databases; an Azure Table holds the job registry; Azure Blobs hold the BLAST databases, temporary data, etc.]

[Diagram: a splitting task fans out to many BLAST tasks, which feed a merging task]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

[Diagram: the job portal's web service handles job registration and feeds the job scheduler, scaling engine, and job registry]

Demonstration

R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

[Diagram: instance counts (50–62) per deployment across the four datacenters]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~6 nodes in one group; each group was out for ~30 mins

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests via the AzureMODIS Service web role portal; requests flow through the download, reprojection, reduction 1, and reduction 2 queues, from the source imagery download sites through the reprojection, derivation reduction, and analysis reduction stages, with source metadata consulted along the way, to downloadable scientific results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the web role front door:
  • Receives all user requests
  • Queues each request to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated worker role:
  • Parses all job requests into tasks, the recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request flows from the MODISAzure Service (web role) into the <PipelineStage> job queue; the Service Monitor (worker role) persists <PipelineStage> JobStatus, parses and persists <PipelineStage> TaskStatus, and dispatches work to the <PipelineStage> task queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a worker role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker (worker role) instances pull from the <PipelineStage> task queue dispatched by the Service Monitor and read/write <Input> data storage]

Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request enters the job queue; the Service Monitor (worker role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), and dispatches to the task queue consumed by GenericWorker (worker role) instances]

Supporting tables in reprojection data storage:
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile
• Swath source data storage holds the source tiles

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage-by-stage (approximate):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers; $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Key Components – Compute: VM Roles

• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS plus the differences VHD

Application Hosting

'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action, perhaps even relocating your app
  • At all times, the 'diagram' stays whole

Automated Service Management

Provide code + a service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files: ServiceDefinition.csdef and ServiceConfiguration.cscfg
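Roughly, the two files look like this (a sketch of the shape only; the service and role names such as MyService and WebFrontEnd are illustrative):

```xml
<!-- ServiceDefinition.csdef: the shape of the service (roles, endpoints) -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
  </WebRole>
  <WorkerRole name="Worker" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: per-deployment values (instance counts, settings) -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="2" />
  </Role>
  <Role name="Worker">
    <Instances count="4" />
  </Role>
</ServiceConfiguration>
```

The split matters: the definition rarely changes, while the configuration (e.g. instance counts) can be updated on a running service.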

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from a script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read-only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2), then committed in order into Big.mpg]
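The two-phase put-block/commit flow can be modelled in a few lines (Python; an in-memory stand-in for the real REST calls):

```python
class SimBlockBlob:
    """In-memory model of a block blob: blocks are uploaded
    individually (in any order), then a commit list assembles them."""
    def __init__(self):
        self._uncommitted = {}   # block_id -> bytes
        self._content = b""

    def put_block(self, block_id, data):
        self._uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The commit order, not the upload order, defines the blob.
        self._content = b"".join(self._uncommitted[b] for b in block_ids)
        # In the real service, uncommitted blocks are GC'd after a week.
        self._uncommitted.clear()

    def read(self):
        return self._content
```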


Pages
• Similar to block blobs, but optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
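The 512-byte alignment rule is easy to get wrong; a small validation helper (Python; illustrative, not part of any SDK):

```python
PAGE_SIZE = 512

def validate_page_range(offset, length):
    """A page-blob write must start and end on 512-byte boundaries.
    Returns the inclusive byte range for the request."""
    if offset % PAGE_SIZE != 0:
        raise ValueError(f"offset {offset} is not {PAGE_SIZE}-byte aligned")
    if length == 0 or length % PAGE_SIZE != 0:
        raise ValueError(f"length {length} is not a multiple of {PAGE_SIZE}")
    return (offset, offset + length - 1)
```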

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a page blob
  • Example: mount a page blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the page blob
• The drive is made durable through standard page blob replication
• The drive persists even when not mounted, as a page blob

Windows Azure Drive API

• Create Drive – creates a page blob formatted as a single-partition NTFS volume VHD

• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance

• Mount Drive – takes a formatted page blob and mounts it to a drive letter for the Windows Azure application to start using

• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and page blob URL for each mounted drive

• Unmount Drive – unmounts the drive and frees up the drive letter

• Snapshot Drive – allows the client application to create a backup of the drive (page blob)

• Copy Drive – provides the ability to copy a drive or snapshot to another drive (page blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table "Movies": Star Wars, Star Trek, Fan Boys
  Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates

The hierarchy is Account → Table → Entity. Tables store entities, and entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows

• Highly available and durable: data is replicated several times

• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST, with any platform or language

Is not relational. You cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Use server-side aggregates – no server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
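Because schema can vary per entity, an entity is essentially a property bag with the two required keys (Timestamp is maintained by the service, so a client sketch only sets PartitionKey and RowKey). A minimal illustration with a hypothetical `make_entity` helper:

```python
def make_entity(partition_key: str, row_key: str, **properties):
    """Sketch of a table entity: PartitionKey and RowKey are required
    and set by the caller; Timestamp is service-maintained; all other
    properties may vary entity-to-entity within the same table."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key}
    entity.update(properties)
    return entity

# Two entities in the same table with different schemas:
movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
short = make_entity("Animation", "Open Season 2")
```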

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch

• The programming semantics ensure that a message can be processed at least once

• Access is provided via REST
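"At least once" comes from the visibility-timeout mechanic: a gotten message is hidden, not removed, and reappears unless explicitly deleted. The class below is a toy in-memory model of that behavior (with a logical clock instead of wall time); it is an illustration of the semantics, not the queue service API.

```python
import itertools

class SimulatedQueue:
    """Toy model of at-least-once delivery: GetMessage hides a message
    for a visibility timeout; it reappears unless deleted in time."""
    _ids = itertools.count()

    def __init__(self):
        self._now = 0.0
        self._messages = []  # dicts: id, text, visible_at, dequeue_count

    def put(self, text):
        self._messages.append({"id": next(self._ids), "text": text,
                               "visible_at": 0.0, "dequeue_count": 0})

    def get(self, visibility_timeout=30.0):
        for msg in self._messages:
            if msg["visible_at"] <= self._now:
                msg["visible_at"] = self._now + visibility_timeout
                msg["dequeue_count"] += 1
                return msg
        return None  # queue appears empty while messages are invisible

    def delete(self, msg_id):
        self._messages = [m for m in self._messages if m["id"] != msg_id]

    def advance(self, seconds):
        self._now += seconds
```

If a worker crashes after `get` but before `delete`, the message simply becomes visible again after the timeout and another worker picks it up — hence "at least once", and hence the need for idempotent processing.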

Storage Partitioning

Understanding partitioning is key to understanding performance.

• It is different for each data type (blobs, entities, queues)

Every data object has a partition key:
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Or the single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

| PartitionKey (CustomerId) | RowKey (RowKind)     | Name         | CreditCardNumber    | OrderTotal |
| 1                         | Customer-John Smith  | John Smith   | xxxx-xxxx-xxxx-xxxx |            |
| 1                         | Order – 1            |              |                     | $35.12     |
| 2                         | Customer-Bill Johnson| Bill Johnson | xxxx-xxxx-xxxx-xxxx |            |
| 2                         | Order – 3            |              |                     | $10.00     |

• Blobs – container name + blob name: every blob and its snapshots are in a single partition

| Container Name | Blob Name            |
| image          | annarbor/bighouse.jpg |
| image          | foxborough/gillette.jpg |
| video          | annarbor/bighouse.jpg |

• Messages – queue name: all messages for a single queue belong to the same partition

| Queue    | Message  |
| jobs     | Message1 |
| jobs     | Message2 |
| workflow | Message1 |

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Initially a single server serves the entire table:

Server A — Table = Movies [Min – Max]

| PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate |
| Action                  | Fast & Furious            | …         | 2009        |
| Action                  | The Bourne Ultimatum      | …         | 2007        |
| …                       | …                         | …         | …           |
| Animation               | Open Season 2             | …         | 2009        |
| Animation               | The Ant Bully             | …         | 2006        |
| …                       | …                         | …         | …           |
| Comedy                  | Office Space              | …         | 1999        |
| …                       | …                         | …         | …           |
| SciFi                   | X-Men Origins: Wolverine  | …         | 2009        |
| …                       | …                         | …         | …           |
| War                     | Defiance                  | …         | 2008        |

As traffic grows, the system splits the partition range across servers:

Server A — Table = Movies [Min – Comedy): the Action and Animation partitions
Server B — Table = Movies [Comedy – Max]: the Comedy, SciFi, and War partitions

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• The PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are the most efficient

Entity group transactions:
• Transactions are possible only within a single partition
• Transaction semantics reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

You receive a continuation token when any of the following occurs:
• Maximum of 1,000 rows in a response
• The query reaches the end of a partition range boundary
• Maximum of 5 seconds to execute the query
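Draining a query therefore always means looping on the token. A small, service-agnostic sketch — `fetch_page` is a hypothetical callable standing in for whatever client issues the actual table query and returns `(rows, next_token_or_None)`:

```python
def query_all(fetch_page):
    """Follow continuation tokens until the service stops returning one.
    fetch_page(token) -> (rows, next_token_or_None)."""
    token = None
    while True:
        rows, token = fetch_page(token)
        yield from rows
        if token is None:
            break

# Fake backend returning at most 1,000 rows per call, like the service.
DATA = list(range(2500))

def fake_fetch_page(token):
    start = token or 0
    page = DATA[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(DATA) else None
    return page, next_token
```

Code that checks only the first response would silently drop 1,500 of the 2,500 rows here — which is exactly the bug "Expect Continuation Tokens – Seriously" is warning about.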

Tables Recap

• Efficient for frequently used queries; supports batch transactions; distributes load

• Select a PartitionKey and RowKey that help you scale — distribute by using a hash, etc., as a prefix

• Avoid "append only" patterns

• Always handle continuation tokens — expect them for range queries

• "OR" predicates are not optimized — execute the queries that form the "OR" predicates as separate queries

• Implement a back-off strategy for retries — "server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
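The "hash as a prefix" advice above avoids the append-only hot partition: instead of keying on, say, an ever-increasing timestamp, prepend a stable hash bucket so inserts spread across partitions. A sketch (the helper name and the 16-bucket choice are illustrative):

```python
import hashlib

def spread_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix the natural key with a stable hash bucket so that
    monotonically increasing keys land on different partitions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: range queries over the natural key now become up to `buckets` separate queries, one per prefix — which is why this recap also says to expect continuation tokens and to run "OR"-style lookups as parallel queries.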

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked

• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This decoupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a web role calls PutMessage to add messages Msg 1 … Msg 4 to the queue; a worker role calls GetMessage with a visibility timeout, processes the message, and then calls RemoveMessage to delete it. Until deleted, the message is only invisible, not gone.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

GET response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1

(Diagram: consumers C1 and C2 polling the queue at intervals growing 1, 2, 4, … up to a cap.)
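The doubling-with-a-cap policy above fits in a few lines. A sketch (class name and the 1 s / 60 s bounds are illustrative defaults, not prescribed by the service):

```python
class TruncatedBackoffPoller:
    """Each empty poll doubles the next wait, up to a maximum;
    a successful poll resets the wait to the initial interval."""

    def __init__(self, initial: float = 1.0, maximum: float = 60.0):
        self.initial, self.maximum = initial, maximum
        self.interval = initial

    def on_empty_poll(self) -> float:
        """Return how long to sleep before the next poll."""
        wait = self.interval
        self.interval = min(self.interval * 2, self.maximum)
        return wait

    def on_message(self):
        """A message arrived: poll eagerly again."""
        self.interval = self.initial
```

Truncating at a maximum keeps a long-idle worker from ending up hours behind when traffic resumes, while still cutting transaction costs (every empty GetMessage is a billable storage transaction) during quiet periods.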

Removing Poison Messages

(Diagram walkthrough with producers P1, P2 and consumers C1, C2:)

1. C1 calls GetMessage(Q, 30 s) and receives msg 1
2. C2 calls GetMessage(Q, 30 s) and receives msg 2
3. C2 consumes msg 2
4. C2 calls DeleteMessage(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after its dequeue
7. C2 calls GetMessage(Q, 30 s) and receives msg 1
8. C2 crashes
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 is restarted
11. C1 calls GetMessage(Q, 30 s) and receives msg 1
12. C1 sees DequeueCount > 2
13. C1 calls DeleteMessage(Q, msg 1), removing the poison message
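Step 12 above is the whole poison-message defense: the service increments a DequeueCount on every get, so a consumer just compares it to a threshold before doing any work. A self-contained sketch (the threshold of 3 and the function name are illustrative; a real system would typically copy the message to a dead-letter store before deleting it):

```python
MAX_DEQUEUE_COUNT = 3

def classify(message: dict) -> str:
    """Decide what to do with a freshly dequeued message, where
    message["dequeue_count"] mirrors the service's DequeueCount.
    A message that has already failed too many times is treated as
    poison and deleted instead of reprocessed forever."""
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        return "delete-as-poison"   # log / dead-letter it first in practice
    return "process"
```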

Queues Recap

• Make message processing idempotent — then there is no need to deal with failures

• Do not rely on order — invisible messages result in out-of-order delivery

• Use the dequeue count to remove poison messages — enforce a threshold on a message's dequeue count

• Messages > 8 KB — use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs

• Use the message count to scale — dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:

Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• The fundamental choice – fewer larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• A common mistake – splitting code into multiple roles, each not using much CPU

• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency

• This may not be ideal if the number of active processes exceeds the number of cores

• Use multithreading aggressively

• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads

• In .NET 4, use the Task Parallel Library

• Data parallelism

• Task parallelism

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive

• Find code that is intensive with different resources to live together

• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in a poor user experience

• There is a trade-off between the risk of failure or poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction

• The service choice can make a big cost difference based on your app profile

• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Diagram: uncompressed content vs. compressed content — gzip, minify JavaScript, minify CSS, minify images.)
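Point 1 above, in miniature: gzip the response body only when the client's Accept-Encoding header advertises support. The helper name is illustrative; in ASP.NET this would typically be wired up as an HTTP module or IIS dynamic compression instead.

```python
import gzip

def maybe_gzip(body: bytes, accept_encoding: str):
    """Gzip the response body when the client supports it.
    Returns (payload, value_for_Content_Encoding_header_or_None)."""
    if "gzip" in accept_encoding.lower():
        return gzip.compress(body), "gzip"
    return body, None
```

On repetitive markup the savings are dramatic, which is why gzip pays for its CPU cost twice: smaller bandwidth bills and faster page loads.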

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• One of the most important pieces of software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input — segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) — needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB

• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern:
• Split the input sequences
• Query the partitions in parallel
• Merge the results together when done

• Follows the generally suggested application model: Web Role + Queue + Worker

• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• Set to 1, 2, 4, or 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → an unnecessarily long wait in case of instance failure

(Diagram: a splitting task fans out into BLAST tasks, which feed a merging task.)
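The splitting task above reduces to chunking the query sequences into fixed-size partitions. A sketch (the helper name is mine; 100 sequences per task is the figure the micro-benchmarks in this deck arrived at):

```python
def partition_queries(sequences, per_task: int = 100):
    """Split input sequences into fixed-size partitions, one queue
    message / BLAST task per partition. Partition size trades load
    imbalance (too large) against per-task overhead (too small)."""
    return [sequences[i:i + per_task]
            for i in range(0, len(sequences), per_task)]
```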

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

(Architecture diagram:)
• Web Role: web portal, web service, and job registration
• Job Management Role: job scheduler and scaling engine, backed by a job registry in an Azure table
• Worker roles: pull BLAST tasks from a global dispatch queue
• Azure Blob storage: NCBI databases; BLAST databases, temporary data, etc.
• Database updating role: keeps the NCBI databases current

The task flow remains a simple split/join: a splitting task fans out BLAST tasks, and a merging task joins the results.

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance — avoid in-memory state

(Diagram: web portal and web service feed job registration; the job scheduler and scaling engine consume the job registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

An "all against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe

• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service

• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions

• When load imbalances occur, redistribute the load manually

(Diagram: instance counts per deployment — 50, 62, 62, 62, 62, 62, 50, 62.)

End Result

• The total size of the output result is ~230 GB

• The number of total hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
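Spotting the anomaly above mechanically is a set difference: every task that logged "Executing" but never logged "done" failed or was interrupted. A sketch of that log scan (regexes match the record format shown above):

```python
import re

EXECUTING = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Return task IDs that started but never logged completion —
    the pattern used above to spot failed or interrupted tasks."""
    started, finished = set(), set()
    for line in log_lines:
        m = EXECUTING.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
    return started - finished
```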

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over roughly 30 minutes. This is an update domain at work.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed before the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, the inverse of ra (m s⁻¹)
gs = conductivity of plant stoma air, the inverse of rs (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes a geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• The first stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage:
• An optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: the AzureMODIS Service web role portal feeds a request queue; a download queue drives the data collection stage against source imagery download sites; reprojection, reduction 1, and reduction 2 queues drive the remaining stages; scientists download scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate download, reprojection, or reduction job queue

• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• The execution status of all jobs and tasks is persisted in tables

(Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by GenericWorker Worker Roles:

• Dequeue tasks created by the Service Monitor
• Retry failed tasks 3 times
• Maintain all task status

(Diagram: GenericWorker instances pull from the <PipelineStage> task queue and read/write <input> data storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request flows through the same Web Role → job queue → Service Monitor → task queue → GenericWorker path.)

• Each entity in the ReprojectionJobStatus table specifies a single reprojection job request
• Each entity in the ReprojectionTaskStatus table specifies a single reprojection task (i.e., a single tile)
• The SwathGranuleMeta table is queried to get geo-metadata (e.g., boundaries) for each swath tile
• The ScanTimeList table is queried to get the list of satellite scan times that cover a target tile
• Swath source data storage holds the input imagery

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction stages multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Both are small with respect to the people costs, even at graduate-student rates

| Stage                | Data                             | Compute                    | Cost                               |
| Data collection      | 400–500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers      | $50 upload, $450 storage           |
| Reprojection         | 400 GB, 45K files                | 3500 hours, 20–100 workers | $420 CPU, $60 download             |
| Derivation reduction | 5–7 GB, 55K files                | 1800 hours, 20–100 workers | $216 CPU, $1 download, $6 storage  |
| Analysis reduction   | <10 GB, ~1K files                | 1800 hours, 20–100 workers | $216 CPU, $2 download, $9 storage  |

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press. Programming Windows Azure, O'Reilly Press. Bing: "Channel 9 Windows Azure". Bing: "Windows Azure Platform Training Kit – November Update". http://research.microsoft.com/azure. xcgngage@microsoft.com


Application Hosting

lsquoGrokkingrsquo the service modelbull Imagine white-boarding out your service architecture with boxes for

nodes and arrows describing how they communicate

bull The service model is the same diagram written down in a declarative format

bull You give the Fabric the service model and the binaries that go with each of those nodes

bull The Fabric can provision deploy and manage that diagram for you

bull Find hardware home

bull Copy and launch your app binaries

bull Monitor your app and the hardware

bull In case of failure take action Perhaps even relocate your app

bull At all times the lsquodiagramrsquo stays whole

Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI
• Double-click on the Role Name in the Azure Project

Deploying to the Cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
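The addressing scheme above is simple enough to build by hand; a minimal sketch, using the account, container, and blob names from the slide's example:

```python
def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build a blob address following the scheme
    http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

# Reproduces the example address from the slide:
print(blob_url("movieconversion", "originals", "barga.mpg"))
```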

Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read-only

Two Types of Blobs Under the Hood
• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a block ID
  • Size limit: 200 GB per blob
• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg uploaded as blocks 1, 6, 8, 3, 5, 4, 7, 2, then committed into the blob Big.mpg.)
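The upload-then-commit flow can be sketched as a toy in-memory model; this is a simulation of the block-blob semantics described above, not the real storage client library:

```python
class BlockBlob:
    """Toy model of block-blob semantics: upload blocks in any order,
    then commit a block list that fixes the final order."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, not yet part of the blob
        self.order = []         # committed block ids, in final order
        self.blocks = {}        # committed block id -> bytes

    def put_block(self, block_id: str, data: bytes) -> None:
        # Blocks can arrive in any order; uncommitted blocks would be
        # garbage-collected after a week, per the slide.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids: list) -> None:
        # Committing fixes the order of the final blob.
        self.order = list(block_ids)
        self.blocks = {bid: self.uncommitted[bid] for bid in block_ids}

    def content(self) -> bytes:
        return b"".join(self.blocks[bid] for bid in self.order)

blob = BlockBlob()
for block_id, chunk in [("b1", b".m"), ("b0", b"Big"), ("b2", b"pg")]:
    blob.put_block(block_id, chunk)      # uploaded out of order
blob.put_block_list(["b0", "b1", "b2"])  # commit fixes the order
assert blob.content() == b"Big.mpg"
```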

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases
• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure
(Diagram: a storage Account, e.g. MovieData, contains Tables; a Table, e.g. Movies (Star Wars, Star Trek, Fan Boys) or Customers (Brian H. Prince, Jason Argonaut, Bill Gates), contains Entities.)

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable:
  • Data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is Not Relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance:
• Different for each data type (blobs, entities, queues)

Every data object has a partition key:
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• Controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

On "Server Busy":
• Use exponential backoff
• The system load-balances to meet your traffic needs
• It can also mean single-partition limits have been reached

Partition Keys in Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
  1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
  1 | Order – 1 | | | $35.12
  2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2 | Order – 3 | | | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image | annarbor/bighouse.jpg
  image | foxborough/gillette.jpg
  video | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

  Queue | Message
  jobs | Message 1
  jobs | Message 2
  workflow | Message 1

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Initially one server serves the entire table.
Server A: Table = Movies [Min – Max]

  PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
  Action | Fast & Furious | … | 2009
  Action | The Bourne Ultimatum | … | 2007
  … | … | … | …
  Animation | Open Season 2 | … | 2009
  Animation | The Ant Bully | … | 2006
  … | … | … | …
  Comedy | Office Space | … | 1999
  … | … | … | …
  SciFi | X-Men Origins: Wolverine | … | 2009
  … | … | … | …
  War | Defiance | … | 2008

After the range is split across two servers:

Server A: Table = Movies [Min – Comedy)

  PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
  Action | Fast & Furious | … | 2009
  Action | The Bourne Ultimatum | … | 2007
  … | … | … | …
  Animation | Open Season 2 | … | 2009
  Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]

  PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
  Comedy | Office Space | … | 1999
  … | … | … | …
  SciFi | X-Men Origins: Wolverine | … | 2009
  … | … | … | …
  War | Defiance | … | 2008

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously
A query returns a continuation token when it hits:
• The maximum of 1,000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale; avoid "append only" patterns – distribute by using a hash, etc., as a prefix
• Always handle continuation tokens – expect continuation tokens for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server busy" means the system is load-balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
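The "distribute by using a hash as a prefix" tip can be sketched as follows; the bucket count of 16 and the key format are illustrative choices, not anything prescribed by the service:

```python
import hashlib

def distributed_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix the natural key with a stable hash bucket so that
    lexicographically adjacent keys (an 'append only' pattern, e.g.
    timestamps) spread across partitions instead of hammering one."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"

# Sequential keys no longer sort into a single hot partition range:
keys = [distributed_partition_key(f"2010-12-07-{i:04d}") for i in range(5)]
print(keys)
```

The trade-off is that range queries over the natural key now require one query per bucket (executed in parallel), which is exactly the continuation-token and parallel-query advice above.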

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle
(Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage (with a timeout) to receive messages and RemoveMessage to delete them once processed.)

PutMessage request:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage request:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
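A consumer only needs two fields out of that response, the message ID and the pop receipt, to authorize the later DELETE. A minimal sketch using only Python's standard library to parse the (abridged) response above:

```python
import xml.etree.ElementTree as ET

# Abridged GetMessage response body, fields taken from the example above.
response = """<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>"""

root = ET.fromstring(response)
msg = root.find("QueueMessage")
message_id = msg.find("MessageId").text
pop_receipt = msg.find("PopReceipt").text

# The pop receipt is what authorizes the later DELETE:
delete_url = (f"http://myaccount.queue.core.windows.net/myqueue/messages/"
              f"{message_id}?popreceipt={pop_receipt}")
print(delete_url)
```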

Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1

(Diagram: consumers C1 and C2 polling the queue at increasing intervals, truncated at 60.)

Removing Poison Messages
(Diagram: producers P1 and P2, consumers C1 and C2, working against a queue with a 30-second visibility timeout. The walkthrough:)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1) – the poison message is removed
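The walkthrough above can be condensed into a toy in-memory model (this is a simulation, not the Azure queue API; the threshold of 2 matches the DequeueCount > 2 check in the walkthrough):

```python
class ToyQueue:
    """Minimal model of dequeue-count-based poison handling: a message
    whose consumer keeps crashing reappears after the visibility timeout,
    and is deleted once its dequeue count passes a threshold."""
    def __init__(self, bodies):
        self.messages = [{"body": b, "dequeue_count": 0} for b in bodies]

    def get_message(self):
        # A real queue hides the message for the visibility timeout;
        # here we just hand it out and bump its dequeue count.
        msg = self.messages[0]
        msg["dequeue_count"] += 1
        return msg

    def delete_message(self, msg):
        self.messages.remove(msg)

POISON_THRESHOLD = 2
q = ToyQueue(["msg1"])
quarantined = []
while q.messages:
    msg = q.get_message()
    if msg["dequeue_count"] > POISON_THRESHOLD:
        quarantined.append(msg["body"])   # step 12-13: remove the poison
        q.delete_message(msg)
    # else: simulate the consumer crashing before DeleteMessage, so the
    # message becomes visible again after the timeout and loops around
```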

Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
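The blob-reference tip is the work ticket pattern mentioned earlier; a toy sketch (the names `blob_store`, `submit_work`, and `process_one` are illustrative, not Azure APIs):

```python
# Work ticket pattern: the large payload lives in blob storage and the
# queue message carries only a small reference ('ticket') to it, keeping
# the message under the 8 KB limit.
blob_store = {}   # stands in for blob storage
queue = []        # stands in for the queue

def submit_work(job_id: str, payload: bytes):
    blob_store[f"jobs/{job_id}"] = payload        # large data -> blob
    queue.append({"blob_ref": f"jobs/{job_id}"})  # small ticket -> queue

def process_one():
    ticket = queue.pop(0)
    payload = blob_store[ticket["blob_ref"]]
    result = len(payload)                 # stand-in for real processing
    del blob_store[ticket["blob_ref"]]    # avoid orphaned blobs
    return result

submit_work("42", b"x" * 100_000)         # far larger than 8 KB
assert process_one() == 100_000
```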

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up the CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
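The deck's own examples are .NET (the Task Parallel Library); as a language-neutral illustration of the same data-parallel idea, here is a Python analogue that sizes a worker pool to the instance's core count (`score` is a stand-in for real per-item work):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def score(item: int) -> int:
    return item * item          # stand-in for per-item work

items = list(range(100))
workers = os.cpu_count() or 1   # match pool size to the VM's cores
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(score, items))  # fan out, collect in order
```

For CPU-bound Python work a process pool (or, in .NET, the TPL's data-parallel loops) is the better fit; the structure of the fan-out is the same either way.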

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience due to not having excess capacity, and the cost of having idling VMs

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile:
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.)
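A quick sketch of how much the gzip tip buys on typical repetitive HTML (a real web role would set the Content-Encoding: gzip header and let the browser decompress on the fly):

```python
import gzip

html = (b"<html><body>" +
        b"<p>Windows Azure for Research</p>" * 200 +
        b"</body></html>")
compressed = gzip.compress(html)

# Lossless round trip, and repetitive markup shrinks dramatically:
assert gzip.decompress(compressed) == html
print(f"{len(html)} bytes -> {len(compressed)} bytes")
```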

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model:
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
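The query-segmentation pattern above can be sketched as a split/join; `blast_partition` is a placeholder for running NCBI-BLAST on one partition, not real alignment:

```python
def split(sequences, partition_size):
    """Split the input sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    # Placeholder for querying one partition; in AzureBLAST each
    # partition would be a task a worker pulls from the queue.
    return [(seq, len(seq)) for seq in partition]

def merge(partial_results):
    """Join step: concatenate per-partition results."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

sequences = ["ATCG", "GGCATT", "TTAGA"]
partitions = split(sequences, 2)   # 2 partitions: 2 + 1 sequences
results = merge(blast_partition(p) for p in partitions)
```

In the real system the split and merge are themselves tasks (the splitting and merging tasks in the task-flow diagram), and the partitions run on worker role instances in parallel.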

AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into BLAST tasks, and a merging task joins the results.
(Diagram: Splitting task → BLAST tasks … → Merging task.)

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST (architecture)
(Diagram: a Web Role hosts the Web Portal and Web Service, which handle job registration. A Job Management Role runs the Job Scheduler and the Scaling Engine, with the Job Registry kept in an Azure Table. The scheduler feeds a global dispatch queue that worker instances pull from. Azure Blob storage holds the NCBI databases plus the BLAST databases, temporary data, etc., and a Database Updating Role keeps them current. Work flows as: Splitting task → BLAST tasks … → Merging task.)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory states

(Diagram: Job Portal → Web Portal → Web Service → job registration → Job Scheduler and Scaling Engine, backed by the Job Registry.)

Demonstration

R. palustris as a Platform for H2 Production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), in four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

(Diagram: per-deployment VM counts of 50, 62, 62, 62, 62, 62, 50, and 62.)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, the real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs
A normal log record should look like this:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins per group, ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
• 35 nodes experienced blob writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
• 20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline
1. Data collection (map) stage
   • Downloads requested input tiles from NASA FTP sites
   • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
   • Converts source tile(s) to intermediate-result sinusoidal tiles
   • Simple nearest-neighbor or spline algorithms
3. Derivation reduction stage
   • First stage visible to the scientist
   • Computes ET in our initial use
4. Analysis reduction stage
   • Optional second stage visible to the scientist
   • Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the download queue for the data collection stage, which pulls from the source imagery download sites; source metadata, the reprojection queue, and the reduction 1 and reduction 2 queues drive the later stages; scientific results are downloaded at the end.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues on the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role (the GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read/write <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service
(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, where each entity specifies a single reprojection job request, then parses and persists ReprojectionTaskStatus, where each entity specifies a single reprojection task, i.e. a single tile, and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances pull. Workers query the SwathGranuleMeta table for the geo-metadata (e.g. boundaries) of each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile; data flows from Swath Source Data Storage to Reprojection Data Storage.)

Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures (from the pipeline diagram):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
• Find a hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action – perhaps even relocate your app
• At all times, the 'diagram' stays whole

Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
• Encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
• (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
• Determine resource requirements
• Create role images
2. Allocate resources
3. Prepare nodes
• Place role images on nodes
• Configure settings
• Start roles
4. Configure load balancers
5. Maintain service health
• If a role fails, restart the role based on policy
• If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

Blob – Massive files, e.g. videos, logs
Drive – Use standard file system APIs
Tables – Non-relational, but with few scale limits; use SQL Azure for relational data
Queues – Facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob – inserts a new blob, overwrites the existing blob
• GetBlob – get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
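The address scheme above is simple enough to capture in a small helper; a minimal sketch in Python, using the illustrative account, container, and blob names from the slide:

```python
def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the public address of a blob from its account, container, and name."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

# The example address from the slide:
url = blob_url("movieconversion", "originals", "barga.mpg")
print(url)  # http://movieconversion.blob.core.windows.net/originals/barga.mpg
```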

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

• Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Figure: blocks of Big.mpg uploaded out of order (1 6 8 3 5 4 7 2), then committed into the final Big.mpg blob]
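The upload-then-commit flow can be simulated in a few lines; this is a sketch of the semantics only, not the real PutBlock/PutBlockList REST API:

```python
class BlockBlob:
    """Toy model of a block blob: blocks are uploaded in any order, then a
    commit with an ordered block-ID list defines the blob's final content."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.committed = b""

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, ordered_ids):
        # Commit: the blob becomes the concatenation of the listed blocks, in order
        self.committed = b"".join(self.uncommitted[i] for i in ordered_ids)
        self.uncommitted.clear()

blob = BlockBlob()
for block_id, chunk in [("03", b"c"), ("01", b"a"), ("02", b"b")]:
    blob.put_block(block_id, chunk)        # blocks arrive out of order
blob.put_block_list(["01", "02", "03"])    # the commit defines the final order
print(blob.committed)  # b'abc'
```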

Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount Page Blob as X:\
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists even when not mounted, as a Page Blob

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

[Diagram: Account 'MovieData' holds Table 'Movies' (entities: Star Wars, Star Trek, Fan Boys) and Table 'Customers' (entities: Brian H Prince, Jason Argonaut, Bill Gates)]

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows

• Highly available & durable
• Data is replicated several times

• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)

Every data object has a partition key
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits may have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: Server 1, Server 2, and Server 3 each hold replicas of partitions P1, P2, …, Pn]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

Server A – Table = Movies [Min – Max]:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Once the table is split across servers:

Server A – Table = Movies [Min – Comedy):
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B – Table = Movies [Comedy – Max]:
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
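The rule above — always expect a continuation token, even before 1,000 rows — is easiest to see in a paging loop. A minimal sketch with a simulated query (the real API returns the token in response headers):

```python
def query_page(rows, start, page_size=1000):
    """Simulate a table query: returns up to page_size rows plus a
    continuation token whenever more rows remain."""
    page = rows[start:start + page_size]
    token = start + page_size if start + page_size < len(rows) else None
    return page, token

def query_all(rows):
    """Keep following continuation tokens until the server stops issuing them."""
    results, token = [], 0
    while token is not None:
        page, token = query_page(rows, token)
        results.extend(page)
    return results

data = list(range(2500))            # comes back in 3 pages of at most 1000
print(len(query_all(data)))  # 2500
```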

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Guidance:
• Select a PartitionKey and RowKey that help scale – distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
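The "distribute by using a hash as a prefix" tip might look like this in practice; a sketch, with the bucket count chosen arbitrarily:

```python
import hashlib

def prefixed_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prepend a stable hash bucket to a natural key so that
    lexicographically adjacent keys (dates, sequence numbers) land in
    different partitions, avoiding 'append only' hot partitions."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"

keys = [prefixed_partition_key(f"2010-12-{day:02d}") for day in range(1, 8)]
print(keys)  # consecutive dates now spread across hash buckets
```

The trade-off: range queries over the natural key now require one query per bucket, which is why the slide pairs this with "parallelize queries".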

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1–4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) to dequeue Msg 1 and Msg 2, then RemoveMessage to delete them once processed]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

• Consider a back-off polling approach
• Each empty poll increases the interval by 2x, up to a maximum
• A successful poll sets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue, intervals growing 1, 2, … up to 60 seconds]
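The back-off rule above fits in one function; a sketch, where the 1 s floor and 60 s cap are the values suggested by the slide's diagram:

```python
def next_interval(current: float, got_message: bool,
                  floor: float = 1.0, cap: float = 60.0) -> float:
    """Each empty poll doubles the polling interval (truncated at `cap`);
    a successful poll resets it back to the floor."""
    if got_message:
        return floor
    return min(current * 2, cap)

interval = 1.0
for _ in range(8):                                  # 8 empty polls in a row
    interval = next_interval(interval, got_message=False)
print(interval)  # 60.0 -- capped rather than growing unboundedly
print(next_interval(interval, got_message=True))  # 1.0 -- reset on success
```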

Removing Poison Messages

[Diagram: producers P1 and P2 enqueue messages into queue Q; consumers C1 and C2 dequeue with a 30 s visibility timeout]

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)
1. Dequeue(Q, 30 s) → msg 1
2. Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
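The mechanism walked through above fits in a short simulation; a sketch, assuming the dequeue-count threshold of 2 used in step 12:

```python
import collections

class ToyQueue:
    """Toy queue with a per-message dequeue count, mimicking the
    visibility-timeout redelivery behavior shown in the slides."""

    def __init__(self, messages):
        self.pending = collections.deque(messages)
        self.dequeue_count = collections.Counter()

    def get(self):
        msg = self.pending.popleft()
        self.dequeue_count[msg] += 1
        return msg

    def redeliver(self, msg):
        # Consumer crashed: visibility timeout expires and the message reappears
        self.pending.append(msg)

q = ToyQueue(["msg1"])
deleted = []
while q.pending:
    msg = q.get()
    if q.dequeue_count[msg] > 2:
        deleted.append(msg)    # poison: delete (or divert for later inspection)
        continue
    q.redeliver(msg)           # simulate the consumer crashing every time

print(deleted)  # ['msg1'] -- removed after the third delivery attempt
```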

Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways
Data abstractions to build your applications:

Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
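The slide's advice is .NET-specific (I/O Completion Ports, the Task Parallel Library); the same data-parallel idea, sketched with the Python standard library instead:

```python
import concurrent.futures

def work_item(n: int) -> int:
    # Stand-in for one unit of data parallelism (e.g. process one record)
    return sum(i * i for i in range(n))

inputs = [10_000, 20_000, 30_000, 40_000]

# Fan the items out across a worker pool. (Threads suit I/O-bound work in
# Python; for CPU-bound code, ProcessPoolExecutor avoids the GIL.)
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(work_item, inputs))

print(len(results))  # 4 -- one result per input, order preserved by map()
```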

Finding Good Code Neighbors
• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation, data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done

• Follows the general suggested application model
• Web Role + Queue + Worker

• With three special considerations:
• Batch job management
• Task parallelism on an elastic Cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow
A simple Split/Join pattern:

Splitting task → BLAST task, BLAST task, BLAST task, … → Merging task

Leverage the multi-core capability of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition → load imbalance
• Small partition → unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure
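The Split/Join flow above is the heart of AzureBLAST; its shape can be sketched independently of BLAST itself, with a made-up per-sequence "hit" standing in for a real NCBI-BLAST invocation:

```python
def split(sequences, partition_size):
    """Splitting task: divide the input sequences into partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for one BLAST task run by a worker (not real BLAST)."""
    return [f"hit:{seq}" for seq in partition]

def merge(partial_results):
    """Merging task: join the per-partition results when all are done."""
    return [hit for partial in partial_results for hit in partial]

sequences = [f"seq{i}" for i in range(10)]
partitions = split(sequences, partition_size=4)      # 3 partitions: 4 + 4 + 2
results = merge(blast_task(p) for p in partitions)   # workers run these in parallel in practice
print(len(results))  # 10
```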

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to Worker instances; Azure Tables hold the Job Registry; Azure Blobs hold the NCBI databases, BLAST databases, and temporary data; a Database Updating Role refreshes the databases. Each job runs as: Splitting task → BLAST tasks … → Merging task]

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the job registry table:
• Fault tolerance – avoid in-memory states

[Diagram: the Web Portal and Web Service handle job registration; the Job Scheduler and Scaling Engine process jobs from the Job Registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists
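The desktop estimate is simple arithmetic worth making explicit; converting the sampled run time from minutes to years:

```python
minutes = 3_216_731                 # estimated single-desktop run time
minutes_per_year = 60 * 24 * 365    # 525,600 minutes in a (non-leap) year
years = minutes / minutes_per_year
print(round(years, 1))  # 6.1
```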

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually

[Diagram: instance counts per deployment – 50, 62, 62, 62, 62, 62, 50, 62]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
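Spotting the "something is wrong" case can be automated by pairing each "Executing" line with its "done" line; a sketch over the sample records above (task IDs from the slide):

```python
import re

log = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started = set(re.findall(r"Executing the task (\d+)", log))
finished = set(re.findall(r"Execution of task (\d+) is done", log))

incomplete = started - finished   # tasks that never logged completion
print(incomplete)  # {'251774'}
```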

Surviving System Upgrades

North Europe Data Center: in total 34,256 tasks processed

[Plot: all 62 compute nodes lost tasks and then came back in a group – this is an update domain (~30 mins, ~6 nodes in one group)]

Surviving Storage Failures

West Europe Data Center: 30,976 tasks completed, and the job was killed

35 nodes experienced blob-writing failures at the same time; a reasonable guess is that the Fault Domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
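The Penman-Monteith equation translates directly into code; a sketch, where the input values are purely illustrative (real values come from the imagery/sensor pipeline) and the default λv is an assumption:

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET: (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).
    gamma defaults to the slide's ~66 Pa/K; lambda_v ~2450 J/g is an
    assumed typical latent heat of vaporization."""
    return (delta * Rn + rho_a * cp * dq * ga) / \
           ((delta + gamma * (1.0 + ga / gs)) * lambda_v)

# Illustrative inputs only -- not field measurements
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, cp=1013.0,
                     dq=800.0, ga=0.02, gs=0.01)
assert et > 0.0   # water is being released, not absorbed, for these inputs
```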

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS Service Web Role Portal takes requests from scientists via a Request Queue; the Data Collection Stage pulls source imagery from download sites via a Download Queue; the Reprojection, Derivation Reduction, and Analysis Reduction stages are driven by the Reprojection, Reduction 1, and Reduction 2 queues, with source metadata tracked and scientific results available for download]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role

Service Monitor (Worker Role): Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input> Data Storage

GenericWorker:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Example Pipeline Stage: Reprojection Service

Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue
Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role)

Storage tables and their roles:
• ReprojectionJobStatus: each entity specifies a single reprojection job request
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Pipeline: Source Imagery Download Sites → Download Queue → Data Collection Stage → Reprojection Queue → Reprojection Stage → Reduction 1 Queue → Derivation Reduction Stage → Reduction 2 Queue → Analysis Reduction Stage → Scientific Results Download → Scientists (AzureMODIS Service Web Role Portal, Source Metadata, Request Queue)

400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers

$50 upload, $450 storage

400 GB, 45K files, 3,500 hours, 20-100 workers

5-7 GB, 55K files, 1,800 hours, 20-100 workers

<10 GB, ~1K files, 1,800 hours, 20-100 workers

$420 CPU, $60 download

$216 CPU, $1 download, $6 storage

$216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Automated Service Management

Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:

ServiceDefinition.csdef
ServiceConfiguration.cscfg

Service Definition

Service Configuration

GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file

• You must create an Azure account, then a service, and then you deploy your code

• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure

1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage At Massive Scale

Blob – Massive files, e.g., videos, logs

Drive – Use standard file system APIs

Tables – Non-relational, but with few scale limits; use SQL Azure for relational data

Queues – Facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
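The addressing scheme above is easy to generate in code. A minimal sketch (not part of the original deck; Python is used here for brevity):

```python
def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the public address of a blob:
    http://<account>.blob.core.windows.net/<container>/<blob>."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

# The slide's own example: blob 'barga.mpg' in container 'originals'
# of the 'movieconversion' storage account.
url = blob_url("movieconversion", "originals", "barga.mpg")
print(url)  # http://movieconversion.blob.core.windows.net/originals/barga.mpg
```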

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
- Private: the default; requires the account key to access
- Full public read
- Public read only

Two Types of Blobs Under the Hood

• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

Big.mpg: blocks 1 6 8 3 5 4 7 2 → Big.mpg
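The upload-then-commit model can be sketched with a toy in-memory class (an illustration of the semantics, not the real storage client API):

```python
class ToyBlockBlob:
    """In-memory illustration of the block-blob model:
    upload blocks by ID in any order, then commit an ordered ID list."""

    def __init__(self):
        self._uncommitted = {}  # block id -> bytes
        self._content = b""

    def put_block(self, block_id: str, data: bytes) -> None:
        self._uncommitted[block_id] = data

    def put_block_list(self, block_ids: list) -> None:
        # The commit order, not the upload order, defines the final blob.
        self._content = b"".join(self._uncommitted[b] for b in block_ids)
        # Committed; in the real service, uncommitted leftovers are GC'd after a week.
        self._uncommitted.clear()

    def content(self) -> bytes:
        return self._content

blob = ToyBlockBlob()
blob.put_block("2", b"world")    # uploaded first...
blob.put_block("1", b"hello ")   # ...uploaded second
blob.put_block_list(["1", "2"])  # ...but committed in logical order
print(blob.content())  # b'hello world'
```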

Brian Prince (DPE)
Fix the animation

Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
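The 512-byte alignment rule is a common source of failed Put Page requests; a small validation helper (an illustrative sketch, not part of any real SDK) makes the rule concrete:

```python
PAGE_SIZE = 512  # page-blob writes must align to 512-byte boundaries

def validate_page_write(offset: int, length: int) -> None:
    """Raise if a Put Page request would violate the alignment rules above."""
    if offset % PAGE_SIZE != 0:
        raise ValueError(f"offset {offset} is not aligned to {PAGE_SIZE} bytes")
    if length % PAGE_SIZE != 0:
        raise ValueError(f"length {length} is not a multiple of {PAGE_SIZE} bytes")

validate_page_write(0, 512)      # fine
validate_page_write(1024, 4096)  # fine
try:
    validate_page_write(100, 512)  # misaligned offset
except ValueError as err:
    print("rejected:", err)
```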

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
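The acquire/renew/release semantics can be modeled in a few lines. This is a toy state machine to illustrate the one-minute exclusivity, not the real REST protocol:

```python
LEASE_DURATION = 60.0  # the slide describes a one-minute exclusive write lock

class ToyLease:
    """Illustrative model of blob-lease semantics (the real service is REST-only)."""

    def __init__(self):
        self.lease_id = None
        self.expires_at = 0.0

    def acquire(self, lease_id: str, now: float) -> bool:
        if self.lease_id is None or now >= self.expires_at:
            self.lease_id, self.expires_at = lease_id, now + LEASE_DURATION
            return True
        return False  # someone else holds an unexpired lease

    def renew(self, lease_id: str, now: float) -> bool:
        if lease_id == self.lease_id and now < self.expires_at:
            self.expires_at = now + LEASE_DURATION
            return True
        return False

    def release(self, lease_id: str) -> bool:
        if lease_id == self.lease_id:
            self.lease_id = None
            return True
        return False

lease = ToyLease()
print(lease.acquire("writer-1", now=0.0))   # True
print(lease.acquire("writer-2", now=10.0))  # False: exclusive for one minute
print(lease.acquire("writer-2", now=61.0))  # True: the previous lease expired
```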

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists, as a Page Blob, even when not mounted

Windows Azure Drive API

• Create Drive – Creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – Returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – Unmounts the drive and frees up the drive letter
• Snapshot Drive – Allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – Provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table Name: Movies
Star Wars, Star Trek, Fan Boys

Table Name: Customers
Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
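The three required system properties can be checked up front; a minimal validation sketch (illustrative names, not the real storage client), which also shows that entities in one table may otherwise carry different columns:

```python
REQUIRED_PROPERTIES = {"PartitionKey", "RowKey", "Timestamp"}

def validate_entity(entity: dict) -> None:
    """Every table entity must carry the three system properties above."""
    missing = REQUIRED_PROPERTIES - entity.keys()
    if missing:
        raise ValueError(f"entity is missing required properties: {sorted(missing)}")

# Entities in the same table may otherwise have different schemas.
movie = {"PartitionKey": "Action", "RowKey": "Fast & Furious",
         "Timestamp": "2009-01-01T00:00:00Z", "ReleaseDate": 2009}
validate_entity(movie)  # passes

try:
    validate_entity({"PartitionKey": "War", "RowKey": "Defiance"})
except ValueError as err:
    print(err)  # missing Timestamp
```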

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

Server 1: P1, P2, …, Pn
Server 2: P1, P2, …, Pn
Server 3: P1, P2, …, Pn

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

After the table is split across servers by partition range:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A continuation token is returned:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
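Handling continuation tokens means looping until no token comes back. A sketch of the drain loop, with a simulated paged service standing in for the real table query (the real per-page limit is 1,000 rows; 3 is used here so the paging is visible):

```python
def query_all(fetch_page):
    """Drain a paged query: keep requesting until no continuation token is returned."""
    results, token = [], None
    while True:
        page, token = fetch_page(token)
        results.extend(page)
        if token is None:  # no token -> the result set is complete
            break
    return results

# Simulated storage service returning at most 3 rows per call.
DATA = list(range(10))

def fake_fetch(token):
    start = token or 0
    page = DATA[start:start + 3]
    next_token = start + 3 if start + 3 < len(DATA) else None
    return page, next_token

print(query_all(fake_fetch))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```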

Tables Recap

• Select PartitionKey and RowKey that help scale
  • Efficient for frequently used queries
  • Supports batch transactions
  • Distributes load
• Avoid "append only" patterns
  • Distribute by using a hash, etc., as a prefix
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy": the system load balances partitions to meet traffic needs, or load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

Queue: Msg 1, Msg 2, Msg 3, Msg 4
Web Role → PutMessage → Queue
Worker Role → GetMessage (Timeout), then RemoveMessage

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a maximum (60 in the figure)
• A successful poll sets the interval back to 1
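The back-off rule is a one-liner; a sketch (the 1-second floor and 60-second cap follow the slide's figure):

```python
MIN_INTERVAL = 1   # seconds: reset here after a successful poll
MAX_INTERVAL = 60  # seconds: truncate the back-off here

def next_interval(current: int, got_message: bool) -> int:
    """Truncated exponential back-off: double on an empty poll, reset on success."""
    if got_message:
        return MIN_INTERVAL
    return min(current * 2, MAX_INTERVAL)

# Six empty polls in a row, then a successful one:
interval, history = MIN_INTERVAL, []
for got in [False, False, False, False, False, False, True]:
    interval = next_interval(interval, got)
    history.append(interval)
print(history)  # [2, 4, 8, 16, 32, 60, 1]
```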

Removing Poison Messages

Producers P1, P2; Consumers C1, C2; queue holds msg 1 and msg 2.

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
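Step 12 above is the poison-message rule: once a message's dequeue count exceeds a threshold, delete (or quarantine) it instead of retrying forever. A minimal sketch with a simulated queue (illustrative structure, not the real queue client):

```python
MAX_DEQUEUE_COUNT = 2  # beyond this, treat the message as poison

def process_queue(messages, handler):
    """Drop messages whose dequeue count exceeds the threshold
    instead of reprocessing them forever."""
    poison = []
    for msg in messages:
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > MAX_DEQUEUE_COUNT:
            poison.append(msg)  # delete / quarantine for later inspection
            continue
        handler(msg)
    return poison

msgs = [{"body": "ok", "dequeue_count": 0},
        {"body": "bad", "dequeue_count": 2}]  # 'bad' already failed twice
poison = process_queue(msgs, handler=lambda m: None)
print([m["body"] for m in poison])  # ['bad']
```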

Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
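The slide names .NET's Task Parallel Library; the same data-parallel vs. task-parallel split exists in most runtimes. A hedged sketch using Python's `concurrent.futures` as a stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text: str) -> int:
    return len(text.split())

documents = ["a b c", "d e", "f g h i", "j"]

# Data parallelism: apply the same function to every item of a collection.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(word_count, documents))

# Task parallelism: run *different* operations concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, counts)
    longest = pool.submit(max, counts)
    print(total.result(), longest.result())  # 10 4
```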

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs

Performance vs. Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content
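The gzip savings are easy to demonstrate; a sketch using Python's standard `gzip` module on a repetitive payload, the kind of markup a web role typically serves:

```python
import gzip

# A compressible payload: repetitive text, like typical HTML/JSON output.
payload = ("<li>Windows Azure for Research</li>" * 200).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)

print(f"{len(payload)} bytes -> {len(compressed)} bytes ({ratio:.1%})")
assert gzip.decompress(compressed) == payload  # the round-trip is lossless
```

Fewer bytes on the wire cuts both bandwidth and storage billing, at the cost of a little CPU per request.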

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large-volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010
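The query-segmentation pattern above is a plain split/process/merge. A toy sketch (the "BLAST" worker here is a trivial substring match standing in for the real alignment code):

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    """Split stage: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for one BLAST worker: 'matches' sequences containing 'AT'."""
    return [s for s in partition if "AT" in s]

def merge(partial_results):
    """Join stage: concatenate the per-partition results."""
    return [hit for part in partial_results for hit in part]

sequences = ["ATCG", "GGCC", "TTAT", "CCGG", "ATAT"]
partitions = split(sequences, partition_size=2)
with ThreadPoolExecutor() as pool:
    partial = list(pool.map(blast_task, partitions))  # partitions run in parallel
print(merge(partial))  # ['ATCG', 'TTAT', 'ATAT']
```

In AzureBLAST the partitions become queue messages consumed by worker roles, but the split/join structure is the same.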

AzureBLAST Task-Flow
A simple Split/Join pattern: Splitting task → BLAST task, BLAST task, BLAST task, … → Merging task

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

• Web Role: Web Portal and Web Service → job registration
• Job Management Role: Job Scheduler and Scaling Engine, dispatching to a global dispatch queue
• Workers: pools of worker instances consuming from the dispatch queue
• Database updating Role
• Azure Table: Job Registry
• Azure Blob: NCBI databases; BLAST databases, temporary data, etc.

Task flow: Splitting task → BLAST task, BLAST task, BLAST task, … → Merging task

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

Web Portal / Web Service → Job registration → Job Scheduler, Scaling Engine, Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons!

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Figure: per-deployment instance counts of 50 and 62 across the 8 deployments)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
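Finding the "something is wrong" cases is a simple pairing exercise: every "Executing" line should have a matching "is done" line. A sketch of that analysis over the sample records above:

```python
import re

LOG = """3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins"""

started, finished = set(), set()
for line in LOG.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done", line)
    if m:
        finished.add(m.group(1))

# Tasks that started but never logged completion are suspect
# (instance crash, system upgrade, storage failure, ...).
print(sorted(started - finished))  # ['251774']
```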

Surviving System Upgrades

North Europe datacenter: in total, 34,256 tasks processed

All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and then the job was killed

35 nodes experienced blob-writing failures at the same time

A reasonable guess: the fault domain was working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, the inverse of ra (m s⁻¹)
gs = conductivity of plant stoma, the inverse of rs (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
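In code, the Penman-Monteith formula is a one-liner; a sketch (argument names mirror the symbols above, with the units listed there):

```python
def penman_monteith(delta, rn, rho_a, cp, dq, ga, gs, lambda_v, gamma=66.0):
    """Penman-Monteith ET:  ET = (D*Rn + rho_a*cp*dq*ga) / ((D + g*(1 + ga/gs)) * lv)

    gamma defaults to the psychrometric constant of ~66 Pa/K quoted above.
    """
    numerator = delta * rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator
```

The MODISAzure pipeline evaluates this per pixel; the hard part, as the slide notes, is producing the conductivity inputs, not the arithmetic.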

ET Synthesizes Imagery, Sensors, Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Architecture diagram: the AzureMODIS Service Web Role Portal takes requests from scientists onto a Request Queue; Download, Reprojection, Reduction 1, and Reduction 2 queues drive the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; source imagery comes from download sites, and scientists download the science results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, which are recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read/write <Input> Data Storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the service, which persists ReprojectionJobStatus and enqueues to the Job Queue; the Service Monitor (Worker Role) parses and persists ReprojectionTaskStatus and dispatches to the Task Queue; GenericWorker (Worker Role) instances consult the ScanTimeList and SwathGranuleMeta tables and read Swath Source Data Storage to produce Reprojection Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (as laid out on the slide):
• Data Collection Stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20–100 workers; $420 CPU, $60 download
• Derivation Reduction Stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers; $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20–100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit, November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Service Definition

Service Configuration

GUI

Double-click on the Role Name in the Azure Project.

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API; easy to roll your own

The Secret Sauce: The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale

• Blob: massive files, e.g., videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (the default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed in order as Big.mpg]

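The stage-then-commit protocol above can be sketched without the real service: stage chunks under Base64 block IDs, then assemble the blob from an ordered block list. This is a simulation of the Put Block / Put Block List semantics, not SDK code; the function names are ours:

```python
import base64

def split_into_blocks(data: bytes, block_size: int = 4 * 1024 * 1024):
    """Stage (block_id, chunk) pairs, mirroring Put Block."""
    blocks = []
    for i in range(0, len(data), block_size):
        # Block IDs must be Base64 strings of equal length within one blob
        block_id = base64.b64encode(f"block-{i // block_size:08d}".encode()).decode()
        blocks.append((block_id, data[i:i + block_size]))
    return blocks

def commit_block_list(staged_blocks, block_ids):
    """Mirror Put Block List: the blob is the staged blocks in committed order."""
    staged = dict(staged_blocks)
    return b"".join(staged[bid] for bid in block_ids)
```

Because blocks can be uploaded in parallel and in any order, only the final commit fixes the layout, which is why the diagram can show blocks arriving as 1 6 8 3 5 4 7 2 yet produce an intact file.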

Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table: Movies
    Entities: Star Wars, Star Trek, Fan Boys
  Table: Customers
    Entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity. Tables store entities, and entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. You cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Use a server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• It is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• It controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy" means the system is load balancing:
• Use exponential back-off on "Server Busy"
• The system load-balances to meet your traffic needs
• Or the limits of a single partition have been reached

Partition Keys In Each Abstraction

Entities: TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order-1               |              |                     | $3512
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order-3               |              |                     | $1000

Blobs: Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages: Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential back-off.

Partitions and Partition Ranges

Example: a Movies table with PartitionKey = Category and RowKey = Title:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially a single server holds the entire key range:
• Server A: Table = Movies [Min - Max]

As load grows, the system splits the range across servers:
• Server A: Table = Movies [Min - Comedy)
• Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions work across a single partition only
• Transaction semantics reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens. Seriously.

A query can stop early and return a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• After a maximum of 5 seconds of query execution
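Handling this correctly means looping until no token comes back. A sketch of the drain loop; here `query_page` stands in for one Table-service round trip and is an assumption, not a real client API:

```python
def query_all(query_page):
    """Drain a paged table query by following continuation tokens.

    `query_page(token)` performs one request and returns (rows, next_token);
    next_token is None once the full result set has been read. Because the
    service may stop at 1000 rows, at a partition range boundary, or after
    5 seconds, every range query must loop like this.
    """
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:  # no continuation token: all rows retrieved
            return rows
```

Forgetting this loop is the classic bug: the query "works" in testing on small tables, then silently truncates results in production.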

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute load by using a hash, etc., as a key prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicate as separate queries
• Implement a back-off strategy for retries: "Server busy" means partitions are being load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
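The "hash as prefix" tip can look like the sketch below; the bucket count and key format are illustrative assumptions, not a prescribed scheme:

```python
import hashlib

def scaled_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix a natural key with a stable hash bucket to spread write load.

    An append-only key (e.g., a timestamp) makes every insert land on one hot
    partition; prefixing with one of `buckets` stable hash values gives the
    table service independent partitions to balance across servers.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: a range query over the natural key now fans out into one query per bucket, which is exactly the "execute as separate queries" pattern noted above.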

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Messages are limited to 8 KB in size
  • Commonly used with the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1…Msg 4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) and later RemoveMessage.]

PutMessage:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x, up to a cap (e.g., 60 seconds)
• A successful poll sets the interval back to 1
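The interval update rule is tiny; a sketch (the function name and defaults are ours):

```python
def next_interval(current: float, got_message: bool,
                  min_interval: float = 1.0, max_interval: float = 60.0) -> float:
    """Next polling delay for truncated exponential back-off."""
    if got_message:
        return min_interval                  # a successful poll resets to the floor
    return min(current * 2.0, max_interval)  # an empty poll doubles, truncated at the cap
```

Starting at 1 s, a run of empty polls yields 2, 4, 8, 16, 32, 60, 60, …; the first message snaps the interval back to 1 s. This keeps idle workers from hammering the queue with billable GetMessage transactions while still reacting quickly once traffic resumes.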

Removing Poison Messages

[Diagram: producers P1 and P2 feed a queue; consumers C1 and C2 dequeue with a 30-second visibility timeout. Each message carries a dequeue count.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1), discarding it as a poison message
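The scenario above reduces to a simple rule in the worker loop: check the dequeue count before processing, and delete only after success. A sketch; `queue` stands in for a queue client with `get_message()`/`delete_message()` methods, which are illustrative names, not the real SDK surface:

```python
MAX_DEQUEUE_COUNT = 3

class Message:
    """Minimal stand-in for a queue message (body + dequeue count)."""
    def __init__(self, body, dequeue_count):
        self.body = body
        self.dequeue_count = dequeue_count

def drain_one(queue, handle):
    """One worker iteration with poison-message protection."""
    msg = queue.get_message()          # message becomes invisible for the timeout
    if msg is None:
        return "empty"
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        queue.delete_message(msg)      # poison: drop (or dead-letter) it
        return "poisoned"
    try:
        handle(msg.body)
        queue.delete_message(msg)      # delete only after successful processing
        return "done"
    except Exception:
        return "retry"                 # message reappears after the visibility timeout
```

Deleting only after success gives at-least-once processing; the dequeue-count threshold caps how many times one bad message can crash the workers.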

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages larger than 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase or reduce the number of workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
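The deck recommends .NET 4's Task Parallel Library; the same data-parallel pattern looks like this in Python with the stdlib executor (the function names are ours, and `square` stands in for real per-item work):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def square(item: int) -> int:
    return item * item  # stand-in for real per-item work

def run_parallel(items):
    """Data-parallel map across a pool sized to the instance's core count."""
    with ThreadPoolExecutor(max_workers=os.cpu_count() or 4) as pool:
        return list(pool.map(square, items))  # preserves input order
```

Sizing the pool to the core count is the point of the slide: an extra-large instance paid for 8 cores, so per-item work should fan out across all of them.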

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find pieces of code that are intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app's profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app's profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content is reduced to compressed content via Gzip, minified JavaScript, minified CSS, and minified images.]
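Gzipping a response body is a one-liner with any standard library; a Python sketch (the helper name is ours):

```python
import gzip

def compress_response(body: bytes, level: int = 6) -> bytes:
    """Gzip an outgoing payload.

    mtime=0 keeps the output deterministic, so identical bodies compress to
    identical bytes, which helps when cached blobs are compared by hash.
    """
    return gzip.compress(body, compresslevel=level, mtime=0)
```

For typical HTML/JSON the compressed payload is a small fraction of the original, which cuts both the bandwidth bill and the storage transactions needed to serve it.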

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile inside and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
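The query-segmentation step can be sketched directly: split a multi-sequence FASTA input into fixed-size partitions, one per task. This is our illustration of the pattern, not code from AzureBLAST:

```python
def split_fasta(text: str, seqs_per_partition: int = 100):
    """Split a multi-sequence FASTA query file into partitions.

    Each partition becomes one BLAST task; the default of 100 sequences per
    partition follows the micro-benchmark result quoted in the deck, and
    should be treated as a tunable.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))  # flush the previous record
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]
```

Each partition is then uploaded to blob storage and referenced by a work-ticket message on the dispatch queue, following the suggested Web Role + Queue + Worker model.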

AzureBLAST Task-Flow

A simple split/join pattern:

[Diagram: a splitting task fans out to many BLAST tasks, which feed a merging task.]

Leverage the multiple cores of one instance:
• Use the "-a" argument of NCBI-BLAST
• Set it to 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the web portal and web service; job registration writes to the Job Registry (an Azure Table); a Job Management Role runs the job scheduler and scaling engine and feeds a global dispatch queue consumed by the worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a dedicated Database Updating Role refreshes the NCBI databases. The split/join task flow (splitting task → BLAST tasks → merging task) runs on the workers.]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

Each accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

[Diagram: the Job Portal's Web Portal and Web Service handle job registration in front of the Job Scheduler, Scaling Engine, and Job Registry.]

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
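Pairing "Executing"/"done" records like these is enough to spot lost tasks. The sketch below assumes the line format shown in the sample records; it is an illustration of the log analysis, not the actual AzureBLAST tooling.

```python
import re

# Log-scanning sketch: pair "Executing"/"done" records per task and flag
# tasks that never completed (e.g. lost to an instance failure or upgrade).
EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done, it took ([\d.]+)")

def audit_tasks(log_lines):
    started, durations = set(), {}
    for line in log_lines:
        done = DONE_RE.search(line)
        if done:
            durations[done.group(1)] = float(done.group(2))  # minutes
            continue
        begun = EXEC_RE.search(line)
        if begun:
            started.add(begun.group(1))
    unfinished = started - set(durations)  # started but never completed
    return durations, unfinished
```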

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups. This is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, then the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
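The Penman-Monteith formula above is a direct per-pixel computation once the inputs are assembled. The transcription below mirrors the slide's equation term by term; the default γ and λv follow the variable list, and keeping the inputs in consistent units is the caller's responsibility.

```python
# ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
def penman_monteith(delta, net_radiation, rho_air, cp_air,
                    vapor_deficit, g_air, g_stoma,
                    gamma=66.0, lambda_v=2450.0):
    numerator = delta * net_radiation + rho_air * cp_air * vapor_deficit * g_air
    denominator = (delta + gamma * (1.0 + g_air / g_stoma)) * lambda_v
    return numerator / denominator
```

Applied over every pixel of the reprojected MODIS tiles, this is the "big data reduction" the pipeline's derivation stage performs.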

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal (request queue); the Data Collection Stage pulls from a download queue against source metadata and the source imagery download sites; the Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage are fed by the reprojection, reduction 1, and reduction 2 queues; scientists download the science results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door:
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks (recoverable units of work)
• Persists the execution status of all jobs and tasks in Tables

[Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances (Worker Role) dequeue tasks and read/write <Input> Data Storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker instances (Worker Role) process tasks against Swath Source Data Storage and Reprojection Data Storage.]

• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Service Configuration

GUI

Double-click on the role name in the Azure project.

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
• An encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own

The Secret Sauce: The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Durable Storage, At Massive Scale

Storage is replicated, highly available, and load balanced:
• Blob: massive files, e.g., videos, logs
• Drive: use standard file system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob

• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2) and committed into the final Big.mpg blob]
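The upload-then-commit model can be sketched with a small in-memory class. This is a toy model of the semantics described above, not the Azure storage client: blocks arrive in any order, and only the committed block list defines the readable blob.

```python
# In-memory sketch of the block-blob model: put blocks in any order,
# then commit an ordered block list; only committed blocks are readable.
class BlockBlob:
    def __init__(self):
        self._uncommitted = {}  # block id -> bytes, awaiting commit
        self._committed = []    # ordered list of (block id, bytes)

    def put_block(self, block_id, data):
        self._uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit in the order given, regardless of upload order.
        self._committed = [(b, self._uncommitted[b]) for b in block_ids]
        self._uncommitted.clear()  # the real service GCs unused blocks later

    def read(self):
        return b"".join(data for _, data in self._committed)
```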

Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:\
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account → Table → Entity

[Example: the account MovieData contains a table named Movies (entities: Star Wars, Star Trek, Fan Boys) and a table named Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates).]

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables:
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available and durable:
• Data is replicated several times
• Familiar and easy-to-use API:
• WCF Data Services and OData
• .NET classes and LINQ
• REST, with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient and highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition:

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order-1               |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order-3               |              |                     | $10.00

Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition:

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages: Queue name. All messages for a single queue belong to the same partition:

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

A Movies table keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially one server holds the whole table:
Server A: Table = Movies [Min - Max]

As load grows, the partition range is split across servers:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens. Seriously.

A query can stop short and return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
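Handling continuation tokens is just a loop: re-issue the query with the last token until none comes back. The sketch below models that loop; `query_page` is a hypothetical stand-in for a table query that returns at most a page of rows plus an opaque token, not a real storage API.

```python
def query_page(rows, token, page_size=1000):
    """One round trip: at most `page_size` rows plus a continuation token."""
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

def query_all(rows, page_size=1000):
    """Keep re-issuing the query until no continuation token comes back."""
    results, token = [], None
    while True:
        page, token = query_page(rows, token, page_size)
        results.extend(page)
        if token is None:  # no token: the result set is complete
            return results
```

Note that a short page does not mean the end of the results: a token can also appear at a partition-range boundary or a 5-second cutoff, which is why the loop tests the token, not the page length.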

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "server busy": the system load balances partitions to meet traffic needs, and the load on a single partition may have exceeded the limits
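The "distribute by using a hash as a prefix" advice can be made concrete. The helper below is an illustrative sketch: it prepends a stable hash-derived bucket to an otherwise monotonically increasing key (e.g. a timestamp), so writes spread across partitions instead of hammering the last one; the bucket count of 16 is an arbitrary choice for the example.

```python
import hashlib

def prefixed_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Spread an append-only key space across `buckets` partitions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets  # stable: same key, same bucket
    return f"{bucket:02d}_{natural_key}"
```

The cost of this trick is on the read side: a range query over the natural key order now has to fan out across all the buckets.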

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work-ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1-4) to the queue; Worker Roles call GetMessage (with a timeout) to receive them and RemoveMessage to delete them after processing.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
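The Get/Delete lifecycle (receive hides the message for a visibility timeout; delete requires the pop receipt; an undeleted message reappears) can be modeled in a few lines. This is a toy in-memory model of the semantics, not the Azure queue service; the `now` parameter stands in for wall-clock time so the behavior is easy to follow.

```python
import time
import uuid

class AzureishQueue:
    """In-memory sketch of the queue message lifecycle."""

    def __init__(self):
        # message id -> [text, invisible_until (epoch seconds), pop_receipt]
        self._messages = {}

    def put_message(self, text):
        self._messages[str(uuid.uuid4())] = [text, 0.0, None]

    def get_message(self, timeout=30.0, now=None):
        now = time.time() if now is None else now
        for msg_id, msg in self._messages.items():
            if msg[1] <= now:              # currently visible?
                msg[1] = now + timeout     # hide for the visibility timeout
                msg[2] = str(uuid.uuid4()) # fresh pop receipt per dequeue
                return msg_id, msg[0], msg[2]
        return None                        # queue looks empty

    def delete_message(self, msg_id, pop_receipt):
        msg = self._messages.get(msg_id)
        if msg is not None and msg[2] == pop_receipt:
            del self._messages[msg_id]
```

A crashed worker never calls `delete_message`, so its message simply becomes visible again when the timeout expires; that is what makes at-least-once delivery work.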

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a maximum
• A successful poll sets the interval back to 1
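The two rules above can be written out directly. A minimal sketch of the interval schedule (the base interval of 1 second and cap of 60 seconds are illustrative choices):

```python
def backoff_intervals(poll_results, base=1.0, cap=60.0):
    """Return the polling interval in effect after each poll.

    poll_results: True for a successful dequeue, False for an empty poll.
    """
    interval = base
    intervals = []
    for got_message in poll_results:
        if got_message:
            interval = base                    # success: reset to the base
        else:
            interval = min(interval * 2, cap)  # empty poll: double, truncated
        intervals.append(interval)
    return intervals
```

In a real consumer loop you would sleep for the returned interval between polls; the truncation cap bounds the worst-case latency for picking up new work once the queue has been idle.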

Removing Poison Messages

[Diagram sequence with producers P1, P2 and consumers C1, C2:]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1) — the poison message is removed
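The DequeueCount check in steps 12-13 is the whole trick: a message that keeps killing its consumers would otherwise circulate forever. A minimal sketch, using a plain list as the queue and tracking dequeue counts by hand (the threshold of 2 matches the walkthrough above):

```python
def process_with_poison_check(messages, handler, max_dequeue=2):
    """Drain `messages` (a list used as a queue); quarantine poison ones."""
    dequeue_count = {}
    processed, dead_letter = [], []
    while messages:
        msg = messages.pop(0)
        dequeue_count[msg] = dequeue_count.get(msg, 0) + 1
        if dequeue_count[msg] > max_dequeue:
            dead_letter.append(msg)  # DequeueCount over threshold: remove it
            continue
        try:
            processed.append(handler(msg))
        except Exception:
            messages.append(msg)     # becomes visible again and is retried
    return processed, dead_letter
```

Quarantining to a dead-letter list rather than silently deleting keeps the bad message available for later diagnosis.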

Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library:
• Data parallelism
• Task parallelism
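The deck's examples are .NET, but the two parallelism shapes it names translate directly to any pool-based API. A Python analogue of the TPL suggestion, as a sketch: one helper maps a function over data (data parallelism), the other runs independent tasks (task parallelism), both sized to the core count so active workers do not exceed the cores available.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def data_parallel(items, work, workers=None):
    """Apply `work` to every item concurrently; results keep input order."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, items))

def task_parallel(tasks, workers=None):
    """Run independent zero-argument tasks concurrently."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]
```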

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled): a trade-off between performance and cost
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade off the risk of failure or poor user experience from not having excess capacity against the costs of having idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile:
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content:
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes:
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content vs. compressed content, via Gzip, minified JavaScript, minified CSS, and minified images]

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700~1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
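Query segmentation, the first parallelization strategy above, is just chunking the input sequences into independent tasks. A minimal Python sketch (the partition size of 100 echoes the micro-benchmark result later in the deck; the sequence names are made up):

```python
def split_queries(sequences, per_partition=100):
    """Query segmentation: each partition becomes an independent BLAST task."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

# 250 toy sequences -> 3 partitions that can be queried in parallel
partitions = split_queries([f"seq{i}" for i in range(250)])
```

Each partition runs against the full database with no coordination; only the final merge step joins results.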

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • split the input sequences
  • query partitions in parallel
  • merge results together when done

• Follows the general suggested application model
  • Web Role + Queue + Worker

• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead

Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• too small: repeated computation
• too large: unnecessarily long waiting period in case of instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

…

Merging Task

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

…

Scaling Engine

(BLAST databases, temporary data, etc.)

Job Registry / NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

…

Merging Task

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

(Map: instance counts per deployment across the datacenters: 50, 62, 62, 62, 62, 62, 50, 62)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6~8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
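The "something is wrong" pattern above — an "Executing" record with no matching "done" record — is easy to detect mechanically. An illustrative Python sketch (the regexes and log lines are modeled on the samples above, not on any real AzureBLAST tooling):

```python
import re

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...",
]

started, finished = set(), set()
for line in log:
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    if m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

suspect = started - finished   # tasks that never logged completion
```

Running this over the full North Europe logs is how the update-domain pattern on the next slide becomes visible.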

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group: this is an update domain

(~30 mins; ~6 nodes in one group)

35 nodes experienced blob-writing failures at the same time

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed, then the job was killed

A reasonable guess: the fault domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky

• Lots of inputs: big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants
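The Penman-Monteith formula above is a single arithmetic expression once its inputs are known; the hard part, as the slide says, is estimating those inputs across a catchment. A Python sketch of just the formula (the sample input values are assumed, plausible mid-latitude daytime numbers, not data from MODISAzure; λv is taken in J/kg here to match cp in J kg-1 K-1):

```python
def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lambda_v=2.45e6):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)"""
    return (delta * rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1 + ga / gs)) * lambda_v)

# Illustrative inputs (assumptions, not slide data):
et = penman_monteith_et(delta=145.0, rn=400.0, rho_a=1.2, cp=1005.0,
                        dq=1000.0, ga=0.02, gs=0.01)
```

With these numbers ET comes out on the order of 1e-4 kg m-2 s-1, i.e. a few millimetres of water per day, which is the right ballpark for a vegetated surface.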

ET Synthesizes Imagery, Sensors, Models and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables

<PipelineStage> Request

… <PipelineStage> JobStatus

Persist <PipelineStage> Job Queue

MODISAzure Service (Web Role)

Service Monitor (Worker Role)

Parse & Persist <PipelineStage> TaskStatus

…

Dispatch <PipelineStage> Task Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role

Service Monitor (Worker Role)

Parse & Persist <PipelineStage> TaskStatus

GenericWorker (Worker Role)

…

Dispatch <PipelineStage> Task Queue

…

<Input> Data Storage

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Example Pipeline Stage: Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

Persist ReprojectionJobStatus

Parse & Persist ReprojectionTaskStatus

GenericWorker (Worker Role)

…

Job Queue

…

Dispatch

Task Queue

Points to

…

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (i.e. a single tile)

Query this table to get geo-metadata (e.g. boundaries) for each swath tile

Query this table to get the list of satellite scan times that cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reductions multiple times

• Storage costs driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers

$50 upload, $450 storage

400 GB, 45K files, 3500 hours, 20-100 workers

5-7 GB, 55K files, 1800 hours, 20-100 workers

<10 GB, ~1K files, 1800 hours, 20-100 workers

$420 cpu, $60 download

$216 cpu, $1 download, $6 storage

$216 cpu, $2 download, $9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premise compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


GUI

Double click on Role Name in Azure Project

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file

• You must create an Azure account, then a service, and then you deploy your code

• Can take up to 20 minutes
  • (which is better than six months)

Service Management API

• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API: easy to roll your own

The Secret Sauce: The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images

2. Allocate resources

3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles

4. Configure load balancers

5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

Blob: massive files, e.g. videos, logs

Drive: use standard file system APIs

Tables: non-relational, but with few scale limits; use SQL Azure for relational data

Queues: facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)

• You can have as many containers and blobs as you want

• Standard REST interface
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob

• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain BLOBs

Each container has an access level:
• Private
  • Default; will require the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks, in any order, into a blob
• Final blob limited to 200 GB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
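The two-phase upload described above — stage blocks by id, then commit an ordered block list — can be sketched with a toy in-memory model (this is an illustration of the semantics, not the real storage client; class and method names are made up to mirror the REST operations):

```python
class ToyBlockBlob:
    """Toy model of block-blob semantics: stage blocks, then commit a list."""

    def __init__(self):
        self.uncommitted = {}   # staged blocks, invisible to readers
        self.committed = b""    # the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The commit defines both membership and final order of the blob.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = ToyBlockBlob()
blob.put_block("blk-2", b"world")       # blocks may arrive in any order...
blob.put_block("blk-1", b"hello ")
blob.put_block_list(["blk-1", "blk-2"]) # ...the commit list imposes order
```

This is why parallel uploads work so well: many workers can stage blocks concurrently, and a single commit stitches them together.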

Blocks

(Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed as Big.mpg)


Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob

• Call Put Blob, set the max size; then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB

• Operations: Acquire, Renew, Release, Break

• Must have the lease id to perform operations

• Can check the LeaseStatus property

• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>

• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD

• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance

• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using

• Get Mounted Drives: returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive

• Unmount Drive: unmounts the drive and frees up the drive letter

• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)

• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table Name: Movies
Star Wars, Star Trek, Fan Boys

Table Name: Customers
Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows

• Highly available & durable
  • Data is replicated several times

• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch

• Programming semantics ensure that a message can be processed at least once

• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• Controls entity locality

The system load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities: TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order-1               |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order-3               |              |                     | $10.00

Blobs: Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages: Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, Server 3)

Scalability Targets

Storage account
• Capacity: up to 100 TBs
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

Server A: Table = Movies [Min - Comedy)

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006

Server B: Table = Movies [Comedy - Max]

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens: Seriously

• Maximum of 1,000 rows in a response

• At the end of a partition range boundary

• Maximum of 5 seconds to execute the query
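Because a response can stop early for any of the reasons above, client code must always loop on the continuation token. A Python sketch simulating the paging contract (the in-memory "server" and its token format are made up for illustration):

```python
PAGE_LIMIT = 1000   # a table query returns at most 1,000 rows per response

rows = list(range(2500))

def fetch_page(token):
    """Simulated server call: returns one page plus a continuation token."""
    start = token or 0
    page = rows[start:start + PAGE_LIMIT]
    nxt = start + PAGE_LIMIT if start + PAGE_LIMIT < len(rows) else None
    return page, nxt

def query_all(fetch):
    """Always loop until the continuation token comes back empty."""
    results, token = [], None
    while True:
        page, token = fetch(token)
        results.extend(page)
        if token is None:
            return results

everything = query_all(fetch_page)   # 2,500 rows across 3 responses
```

Code that takes only the first response silently drops data; the loop is not optional.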

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale
  • Distribute by using a hash, etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy: the system load balances partitions to meet traffic needs, or the load on a single partition has exceeded the limits
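The "distribute by using a hash as a prefix" tip above counters the append-only anti-pattern: keys that grow monotonically (timestamps, sequence numbers) all land in the last partition. A hedged Python sketch of one way to build such a key (the bucket count and format are arbitrary choices, not prescribed by the deck):

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Prepend a stable hash bucket so append-only keys (e.g. timestamps)
    spread across partitions instead of hammering the last one."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"

# 60 consecutive timestamps fan out over many buckets
keys = [prefixed_partition_key(f"2010-12-07T10:00:{s:02d}") for s in range(60)]
```

The trade-off: range queries over the natural key now require one query per bucket, which is exactly the "execute as separate queries" pattern mentioned above.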

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked

• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance

• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout)RemoveMessage

Msg 2Msg 1

Worker Role

Msg 2

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
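The Get/Delete lifecycle with a pop receipt can be modeled in a few lines. This is an illustrative in-memory toy (class and method names are invented; it mirrors the semantics of the REST calls above, not their wire format): a dequeued message becomes invisible for a timeout, and only deleting it with the pop receipt removes it for good.

```python
import itertools

class ToyQueue:
    """Toy model of the queue message lifecycle with a visibility timeout."""
    _receipts = itertools.count(1)

    def __init__(self):
        self.messages = []  # each: {"visible_at": t, "receipt": r, "body": b}

    def put(self, body, now=0.0):
        self.messages.append({"visible_at": now, "receipt": None, "body": body})

    def get(self, now, timeout=30.0):
        for m in self.messages:
            if m["visible_at"] <= now:
                m["visible_at"] = now + timeout       # hidden until then
                m["receipt"] = str(next(self._receipts))
                return m["receipt"], m["body"]
        return None                                   # nothing visible

    def delete(self, receipt):
        self.messages = [m for m in self.messages if m["receipt"] != receipt]

q = ToyQueue()
q.put("msg 1")
receipt, body = q.get(now=0.0)    # a worker takes the message
crashed = q.get(now=10.0)         # still invisible: no one else gets it
reappeared = q.get(now=31.0)      # timeout passed: the message comes back
q.delete(reappeared[0])           # processing finished: remove it for good
drained = q.get(now=100.0)
```

This is the mechanism behind at-least-once delivery: a worker that crashes never acknowledges, so the message simply reappears.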

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:

Each empty poll increases the interval by 2x

A successful poll sets the interval back to 1

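The truncated back-off rule is two lines of arithmetic. A Python sketch (the floor and ceiling values are illustrative choices, not mandated anywhere):

```python
FLOOR, CEILING = 1.0, 60.0   # assumed bounds, in seconds

def next_interval(interval, got_message):
    """Empty poll doubles the wait (capped); a hit resets it to the floor."""
    return FLOOR if got_message else min(interval * 2.0, CEILING)

interval = FLOOR
history = []
for _ in range(8):           # eight empty polls in a row
    interval = next_interval(interval, got_message=False)
    history.append(interval)

reset = next_interval(interval, got_message=True)   # a hit resets to the floor
```

The truncation (the ceiling) matters: without it, an idle queue would drive the polling interval toward infinity and make the worker unresponsive when work finally arrives.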

Removing Poison Messages

Producers: P1, P2. Consumers: C1, C2.

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
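The dequeue-count rule in the sequence above — give up on a message after it has been handed out too many times — is the standard poison-message defense. A hedged Python sketch (the threshold of 2 matches the example; the handler shape is an assumption for illustration):

```python
MAX_DEQUEUE_COUNT = 2

def handle(body, dequeue_count, process, dead_letter):
    """Past the threshold, stop retrying and sideline the poison message."""
    if dequeue_count > MAX_DEQUEUE_COUNT:
        dead_letter(body)        # delete it, or park it for offline inspection
        return "poisoned"
    process(body)
    return "processed"

sidelined = []
outcome_ok = handle("msg 2", 1, process=lambda b: None,
                    dead_letter=sidelined.append)
outcome_bad = handle("msg 1", 3, process=lambda b: None,
                     dead_letter=sidelined.append)
```

Without this check, a message whose processing always crashes the worker would cycle through the visibility timeout forever, wasting a consumer on every pass.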

Queues Recap

• No need to deal with failures: make message processing idempotent

• Invisible messages result in out-of-order delivery: do not rely on order

• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages

• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs

• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:

Blobs: files and large objects
Drives: NTFS APIs for migrating applications
Tables: massively scalable structured storage
Queues: reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:

http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice: fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up CPU

• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
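For comparison, the data-parallel idea the slide attributes to .NET 4's Task Parallel Library looks like this in Python's standard library – a sketch, not Azure-specific code, with a placeholder work function:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def score(seq):
    # Placeholder per-item work; in a worker role this would be the
    # CPU-bound task you want fanned out.
    return sum(ord(c) for c in seq)

def score_all(seqs):
    # Data parallelism: the same operation applied to every item, spread
    # across a pool sized to the core count (avoid more workers than cores,
    # as the slide warns about excess active processes).
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, seqs))
```

Task parallelism is the same machinery with *different* callables submitted to the pool rather than one callable mapped over data.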

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs
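The "use message count to scale" rule from the queues recap reduces to a small policy function. Everything here – the per-worker threshold, floor, and ceiling – is an illustrative assumption, not a recommendation:

```python
def target_worker_count(queue_length, msgs_per_worker=100,
                        min_workers=2, max_workers=20):
    """Pick a worker-instance count from the approximate queue length.

    Keeps a floor of min_workers so a spike doesn't hit a cold service
    (VMs take minutes to come up), and a ceiling so a burst can't run
    the bill up. All thresholds are made-up placeholders.
    """
    wanted = -(-queue_length // msgs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, wanted))
```

A scaling engine would call this periodically with the queue's approximate message count and spin instances up or down toward the target.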

Performance Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing – they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places.

Sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Diagram: Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content)
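The gzip advice is easy to verify from any language; a sketch using Python's standard library shows the size win on repetitive markup (typical of HTML/JS output). The sample payload is made up:

```python
import gzip

def gzip_payload(data: bytes) -> bytes:
    # Compress a response body; browsers inflate this on the fly when the
    # response carries a Content-Encoding: gzip header.
    return gzip.compress(data)

# Repetitive markup, standing in for a typical HTML response body.
html = b"<div class='row'>hello</div>" * 500
packed = gzip_payload(html)
```

On text like this the compressed body is a small fraction of the original, which is exactly the bandwidth (and storage) saving the slide is after.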

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100× larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions – load imbalance
• Small partitions – unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small – repeated computation
• Too large – an unnecessarily long wait in case of instance failure

(Diagram: Splitting task → BLAST tasks in parallel → Merging task)
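The split/join task flow reduces to a simple pattern: partition the input sequences, run each partition independently, and concatenate the per-partition results. A sketch, with the actual BLAST invocation replaced by a stand-in function:

```python
def split(sequences, partition_size):
    # Splitting task: fixed-size partitions. AzureBLAST's micro-benchmarks
    # found ~100 sequences per partition to be the sweet spot.
    return [sequences[i:i + partition_size]
            for i in range((0), len(sequences), partition_size)]

def merge(partial_results):
    # Merging task: hits from independent partitions simply concatenate,
    # because query segmentation keeps the database identical for every task.
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

def run_job(sequences, blast_fn, partition_size=100):
    # blast_fn stands in for one BLAST task run over a partition; in
    # AzureBLAST each call would be a queued task picked up by a worker.
    return merge(blast_fn(p) for p in split(sequences, partition_size))
```

In the real system the middle step is distributed: each partition becomes a queue message, and workers execute `blast_fn` in parallel before the merging task runs.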

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, recording jobs in a Job Registry in Azure Tables and dispatching work through a global dispatch queue to Worker instances; a Database updating Role refreshes the NCBI databases. Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. Each job follows the split/join task flow: Splitting task → BLAST tasks in parallel → Merging task.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state

(Diagram: the Job Portal's Web Portal and Web Service feed job registration, which hands jobs to the Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our approach:
• Allocated a total of ~4,000 cores
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Figure: distribution of the eight deployments – 50 or 62 extra-large VMs each – across the four datacenters.)

End result:
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working-instance time should be 6–8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
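Finding tasks that started but never finished – like task 251774 above – is a single pass over the log. A sketch assuming only the two line shapes shown:

```python
import re

def unfinished_tasks(log_lines):
    """Return IDs of tasks with an 'Executing' record but no 'done' record."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Running this per node over the full log is how anomalies like the update-domain and fault-domain incidents on the next slides surface.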

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in a group – this is an update domain (~30 mins; ~6 nodes in one group).

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
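As a worked example, the Penman-Monteith formula evaluates directly once the inputs are in hand. The input values below are made-up placeholders for illustration, not data from the MODISAzure project:

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET = (Delta*Rn + rho_a*c_p*dq*g_a) / ((Delta + gamma*(1 + g_a/g_s)) * lambda_v).

    Arguments follow the slide's symbol list (SI units); gamma is the
    psychrometric constant (~66 Pa/K) and lambda_v the latent heat of
    vaporization in J/g.
    """
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs only.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1012.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
```

The pipeline's hard part is not this arithmetic but producing per-pixel values of the conductivities and radiation terms from the imagery and sensor data.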

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal. A Request Queue and Download Queue drive the Data Collection Stage, which pulls imagery from the source download sites and consults source metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; scientists then download the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue tasks, read <Input> Data Storage, and update <PipelineStage>TaskStatus.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters the Job Queue. The Service Monitor (Worker Role) persists ReprojectionJobStatus – each entity specifies a single reprojection job request – and parses and persists ReprojectionTaskStatus – each entity specifies a single reprojection task, i.e. a single tile – then dispatches tasks to the Task Queue. Generic Workers (Worker Roles) dequeue tasks, read Swath Source Data Storage, query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, and write Reprojection Data Storage.)

Costs for 1 US Year of ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage-by-stage (from the pipeline diagram):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing: Channel 9 Windows Azure; Bing: Windows Azure Platform Training Kit – November Update; http://research.microsoft.com/azure; xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Deploying to the cloud

• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob – massive files, e.g. videos, logs
• Drive – use standard file-system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg is uploaded as blocks 1–8, arriving out of order, then committed as Big.mpg.)
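The stage-then-commit protocol above can be mimicked locally to see why out-of-order upload works: blocks are opaque until the block list fixes their order. The REST calls themselves (Put Block / Put Block List) are elided; the `staged` dict is a stand-in for the service's uncommitted-block store:

```python
import base64

def block_id(n):
    # Block IDs are opaque base64 strings of equal length within a blob.
    return base64.b64encode(f"{n:08d}".encode()).decode()

def upload_blocks(data, block_size, staged):
    # Stage blocks (stand-in for Put Block); they can arrive in any order,
    # and nothing is visible in the blob until the commit.
    ids = []
    for n, off in enumerate(range(0, len(data), block_size)):
        bid = block_id(n)
        staged[bid] = data[off:off + block_size]
        ids.append(bid)
    return ids

def commit(ids, staged):
    # Stand-in for Put Block List: the blob is exactly the staged blocks
    # joined in committed order -- order is decided here, not at upload.
    return b"".join(staged[bid] for bid in ids)
```

Committing the same staged blocks in a different order yields a different blob, which is the flexibility the diagram's out-of-order arrival relies on.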


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists, as a Page Blob, even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData

Table "Movies": Star Wars, Star Trek, Fan Boys
Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Partitioning is different for each data type (blobs, entities, queues)

Every data object has a partition key:
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale.

The system load-balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

Server Busy:
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• It can also mean the limits of a single partition have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition:

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name; every blob and its snapshots are in a single partition:

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue name; all messages for a single queue belong to the same partition:

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
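The recommended response to '503 Server Busy' is a truncated exponential backoff on retries. A sketch; `ServerBusyError` is a hypothetical stand-in for whatever exception your storage client raises on a 503, and the jitter factor is an illustrative choice:

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response from storage."""

def with_backoff(op, retries=5, base_delay=0.1, sleep=time.sleep):
    # Exponential backoff: the delay doubles per consecutive 503, with
    # jitter so that many workers do not retry in lockstep. `sleep` is
    # injectable so the policy can be tested without real waiting.
    for attempt in range(retries):
        try:
            return op()
        except ServerBusyError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
```

Backing off gives the service's load balancer the minutes it may need to move a hot partition to another server.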

Partitions and Partition Ranges

(Figure: a Movies table with PartitionKey = Category and RowKey = Title, plus Timestamp and ReleaseDate properties – e.g. Action / Fast & Furious (2009), Action / The Bourne Ultimatum (2007), Animation / Open Season 2 (2009), Animation / The Ant Bully (2006), Comedy / Office Space (1999), SciFi / X-Men Origins: Wolverine (2009), War / Defiance (2008).)

Initially Server A serves the whole table: Movies [Min – Max]. As load grows, the range is split across servers: Server A serves Movies [Min – Comedy) and Server B serves Movies [Comedy – Max].

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• The PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query
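Handling tokens correctly means looping until the token is exhausted, not until an empty page arrives – a boundary or timeout can hand back a token with zero rows. A sketch; `query_page` is a hypothetical stand-in for a table query, not the Storage Client Library's API:

```python
def query_all(query_page):
    """Drain a paged table query.

    `query_page(token)` returns (rows, continuation_token). The service
    may return a token after 1000 rows, at a partition-range boundary,
    or after 5 seconds -- possibly with an empty page -- so the loop
    terminates on token is None, never on an empty result.
    """
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows
```

Treating an empty page as "done" is the classic bug this slide is warning about: results past a partition boundary silently disappear.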

Tables Recap

Efficient for frequently used queries; supports batch transactions; distributes load.

• Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns – distribute by using a hash, etc., as a prefix
• Always handle continuation tokens – expect continuation tokens for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "server busy" means the system is load-balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
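The "distribute by using a hash as prefix" advice can be sketched concretely. The bucket count and key format below are illustrative assumptions, not prescribed values:

```python
import hashlib

def distributed_partition_key(natural_key, buckets=16):
    # Prefix the natural key with a stable hash bucket so that
    # monotonically increasing keys (the 'append only' pattern, e.g.
    # timestamps) spread across partitions instead of hammering the
    # last one.
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: range queries over the natural key must now fan out across all buckets (and expect continuation tokens on each), which is the price paid for even write distribution.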

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1…Msg 4 to the Queue; Worker Roles call GetMessage with a visibility timeout, process the message, and then RemoveMessage.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to a cap (hence "truncated"); a successful poll resets the interval back to 1.

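The doubling/reset policy above can be sketched as follows; `get_message` and `handle` are hypothetical stand-ins for the queue client and message handler, not the real storage API.

```python
import time

MIN_INTERVAL = 1.0   # seconds; reset target after a successful poll
MAX_INTERVAL = 60.0  # the truncation cap

def next_interval(current: float, got_message: bool) -> float:
    """Next polling delay under truncated exponential back-off."""
    if got_message:
        return MIN_INTERVAL                 # success resets the interval
    return min(current * 2, MAX_INTERVAL)   # empty poll doubles it, capped

def poll_loop(get_message, handle, polls: int) -> None:
    """Drive a queue for a fixed number of polls (hypothetical client)."""
    interval = MIN_INTERVAL
    for _ in range(polls):
        msg = get_message()
        if msg is not None:
            handle(msg)
        else:
            time.sleep(interval)            # back off only on empty polls
        interval = next_interval(interval, msg is not None)
```

The cap keeps an idle worker from polling (and being billed for transactions) too often, while the reset keeps latency low when traffic returns.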

Removing Poison Messages (1/3)

Producers P1 and P2; consumers C1 and C2.

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2/3)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3/3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
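The dequeue-count guard in the sequence above can be sketched like this; `msg.dequeue_count`, `msg.body`, and `queue.delete` are illustrative names, not the real storage client API.

```python
MAX_DEQUEUE_COUNT = 3  # threshold; the sequence above deletes when DequeueCount > 2

def process_queue_message(msg, queue, handler):
    """Guard against poison messages using the dequeue count."""
    if msg.dequeue_count > MAX_DEQUEUE_COUNT - 1:
        # Poison message: it has already failed repeatedly; remove it
        # (optionally after logging it or copying it to a dead-letter store).
        queue.delete(msg)
        return "discarded"
    try:
        handler(msg.body)      # processing must be idempotent
    except Exception:
        # Leave the message alone; it becomes visible again after the
        # visibility timeout, and its dequeue count rises on the next get.
        return "retry"
    queue.delete(msg)          # success: delete before the timeout expires
    return "done"
```

Because a crash between `handler` and `delete` re-delivers the message, the handler itself must be idempotent — the same point made in the recap below.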

Queues Recap

• No need to deal with failures — make message processing idempotent
• Invisible messages result in out-of-order delivery — do not rely on order
• Enforce a threshold on a message's dequeue count — use the dequeue count to remove poison messages
• Messages > 8 KB — use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers — use the message count to scale

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using much CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library: data parallelism and task parallelism
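The slide's advice is .NET-specific (TPL, I/O completion ports), but the sizing idea — fan tasks out across a pool matched to the core count rather than spawning unbounded workers — can be sketched in Python:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def fetch_or_compute(n: int) -> int:
    # Stand-in for one unit of concurrency (one task).
    return sum(range(n))

def run_tasks(sizes):
    """Run tasks on a pool sized to the core count."""
    # More active workers than cores mostly adds scheduling overhead,
    # mirroring the "processes exceed cores" caveat above.
    workers = os.cpu_count() or 4
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_or_compute, sizes))
```

For CPU-bound .NET code the TPL (or, in Python, a process pool) is the closer analogue; a thread pool is shown here for simplicity.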

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs

Performance & Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs); crush your PNGs; strip needless metadata; make all PNGs palette PNGs

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
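The "trade compute for bytes" point can be made concrete with a tiny sketch; the sample `html` payload is invented for illustration:

```python
import gzip

def compress_response(body: bytes, level: int = 6) -> bytes:
    """Gzip a response body; higher levels trade more CPU for fewer bytes."""
    return gzip.compress(body, compresslevel=level)

# Repetitive markup compresses dramatically; browsers inflate it on the fly.
html = b"<html>" + b"<p>hello azure</p>" * 500 + b"</html>"
packed = compress_response(html)
ratio = len(packed) / len(html)   # typically a small fraction for text
```

Every byte saved here is billed bandwidth avoided, on top of the latency win for users.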

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
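The query-segmentation idea — split the input sequences into partitions and BLAST each partition independently — can be sketched as below. The FASTA parsing is a simple illustration, and `seqs_per_partition=100` echoes the micro-benchmark result mentioned later; this is not the AzureBLAST implementation itself.

```python
def split_fasta(text: str, seqs_per_partition: int = 100):
    """Split FASTA text into partitions of at most N sequences each."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))   # close previous record
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    # Group whole records into fixed-size partitions for parallel querying.
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]
```

Each partition becomes one work-ticket message; results are merged in a final reduction step.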

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences, query the partitions in parallel, merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.

AzureBLAST Task-Flow: a simple split/join pattern

• Leverage the multiple cores of one instance: argument "-a" of NCBI-BLAST; 1/2/4/8 for small, medium, large, and extra-large instance sizes
• Task granularity: large partitions cause load imbalance; small partitions cause unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task: essentially an estimate of the task run time; too small causes repeated computation; too large causes an unnecessarily long wait in case of instance failure

[Task flow: splitting task → BLAST tasks (in parallel) → merging task]

Micro-Benchmarks Inform Design

• Task size vs. performance: benefit of the warm-cache effect; 100 sequences per partition is the best choice
• Instance size vs. performance: super-linear speedup with larger worker instances, primarily due to memory capability
• Task size/instance size vs. cost: the extra-large instance generated the best and most economical throughput; fully utilize the resource

AzureBLAST

[Architecture diagram: a Web Role (web portal + web service for job registration) feeds a Job Management Role (job scheduler + scaling engine), which dispatches work through a global dispatch queue to worker instances. Job state lives in Azure Tables (job registry); BLAST databases and temporary data live in Azure Blob storage; a database-updating role refreshes the NCBI databases. Task flow: splitting task → BLAST tasks → merging task.]

AzureBLAST Job Portal

• ASP.NET program hosted by a web role instance: submit jobs; track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table: fault tolerance, avoid in-memory state


Demonstration

R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

• Blasted ~5,000 proteins (700K sequences): against all NCBI non-redundant proteins, completed in 30 min; against ~5,000 proteins from another strain, completed in less than 30 sec
• AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences
• "All against All" query: the database is also the input query; the protein database is large (4.2 GB); 9,865,668 sequences to be queried in total; theoretically 100 billion sequence comparisons
• Performance estimation: based on sampling runs on one extra-large Azure instance, this would require 3,216,731 minutes (6.1 years) on one desktop
• This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters — US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

End Result
• Total size of the output result is ~230 GB; the number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute), but based on our estimates the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
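Spotting the anomaly above mechanically means pairing each "Executing" line with its "done" line per node; tasks left unpaired never completed. A sketch, with the log format taken from the records shown (the parsing is illustrative):

```python
import re

LINE = re.compile(r"(?P<ts>\S+ \S+) (?P<node>\S+) (?P<msg>.+)")

def unfinished_tasks(lines):
    """Return task IDs that started but never logged completion."""
    open_tasks = {}   # (node, task_id) -> start timestamp
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        node, msg = m.group("node"), m.group("msg")
        started = re.match(r"Executing the task (\d+)", msg)
        done = re.match(r"Execution of task (\d+) is done", msg)
        if started:
            open_tasks[(node, started.group(1))] = m.group("ts")
        elif done:
            # Close out the matching "Executing" record, if any.
            open_tasks.pop((node, done.group(1)), None)
    return sorted(task for (_, task) in open_tasks)
```

Run over the sample above, task 251774 would surface as unfinished — the signature of a node losing work to an upgrade or failure, as the next slides show.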

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups of ~6 nodes — this is an update domain
• Each group was out for ~30 mins

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Penman–Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.
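The Penman–Monteith form above transcribes directly into code; variable names follow the slide's symbols, the default γ is the slide's ≈66 Pa K⁻¹, and the default λv (≈2450 J/g near 20 °C) is an assumed typical value. No unit checking is done — inputs must match the units listed.

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

In the MODISAzure pipeline this arithmetic is trivial; the hard part, as the slide notes, is producing defensible ga and gs values across a whole catchment from the input imagery and sensor data.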

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

1. Data collection (map) stage: downloads requested input tiles from NASA FTP sites; includes geospatial lookup for the non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage: converts source tile(s) to intermediate-result sinusoidal tiles; simple nearest-neighbor or spline algorithms
3. Derivation reduction stage: the first stage visible to scientists; computes ET in our initial use
4. Analysis reduction stage: an optional second stage visible to scientists; enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role portal (request queue); the data collection stage pulls source imagery from the download sites (download queue) and records source metadata; work then flows through the reprojection stage (reprojection queue), the derivation reduction stage (reduction 1 queue), and the analysis reduction stage (reduction 2 queue); science results are available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door: receives all user requests; queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: parses all job requests into tasks — recoverable units of work; the execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request flows from the MODISAzure Service (Web Role) into the <PipelineStage> Job Queue; the Service Monitor (Worker Role) persists <PipelineStage>JobStatus, parses and persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role
• The GenericWorker (Worker Role) dequeues tasks created by the Service Monitor from the <PipelineStage> Task Queue, reads from <Input>Data Storage, retries failed tasks 3 times, and maintains all task status

Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus — each entity specifies a single reprojection job request — then parses and persists ReprojectionTaskStatus — each entity specifies a single reprojection task (i.e., a single tile) — and dispatches to the Task Queue consumed by GenericWorkers. Reprojection Data Storage points to the ScanTimeList table (query it to get the list of satellite scan times that cover a target tile), the SwathGranuleMeta table (query it to get geo-metadata, e.g., boundaries, for each swath tile), and Swath Source Data Storage.]

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers — $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers — $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers — $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers — $216 CPU, $2 download, $9 storage

Total: ~$1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press • Programming Windows Azure, O'Reilly Press • Bing: Channel 9 Windows Azure • Bing: Windows Azure Platform Training Kit – November Update • http://research.microsoft.com/azure • xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Service Management API

• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API – easy to roll your own
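A minimal sketch of rolling your own call: build a GET against the management endpoint and authenticate with the management certificate over TLS. The host, path shape, and `x-ms-version` value reflect the 2010-era API as best recalled — treat them, and the subscription ID and cert paths, as assumptions to verify against the official reference.

```python
import http.client
import ssl

API_VERSION = "2010-10-28"   # assumed era-appropriate x-ms-version value

def management_request(subscription_id: str, operation: str):
    """Build the (method, path, headers) for a management GET call."""
    return ("GET",
            f"/{subscription_id}/services/{operation}",
            {"x-ms-version": API_VERSION})

def list_hosted_services(subscription_id: str, cert_pem: str, key_pem: str):
    """List hosted services, authenticating with the X.509 management cert."""
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(cert_pem, keyfile=key_pem)
    method, path, headers = management_request(subscription_id,
                                               "hostedservices")
    conn = http.client.HTTPSConnection("management.core.windows.net",
                                       context=ctx)
    conn.request(method, path, headers=headers)
    resp = conn.getresponse()
    return resp.status, resp.read()   # XML body describing the services
```

The same request/certificate plumbing underlies create, delete, upgrade, and swap operations; only the path and verb change.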

The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model: determine resource requirements; create role images
2. Allocate resources
3. Prepare nodes: place role images on nodes; configure settings; start roles
4. Configure load balancers
5. Maintain service health: if a role fails, restart the role based on policy; if a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable storage at massive scale:
• Blobs – massive files, e.g., videos, logs
• Drives – use standard file system APIs
• Tables – non-relational, but with few scale limits (use SQL Azure for relational data)
• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites an existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg
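For a container with public read access, the addressing scheme above is all a GET needs; a minimal sketch (the account/container/blob names are the slide's examples, and private containers would additionally require a signed Authorization header):

```python
from urllib.request import urlopen

def blob_url(account: str, container: str, blob: str) -> str:
    """Build the REST address for a blob, per the scheme above."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

def get_public_blob(account: str, container: str, blob: str) -> bytes:
    # Works only for containers with public read access.
    with urlopen(blob_url(account, container, blob)) as resp:
        return resp.read()
```

A range GET (the "specific range" bullet) would add a `Range: bytes=start-end` header to the same request.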

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (the default) – requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

• Block blob: targeted at streaming workloads; each blob consists of a sequence of blocks; each block is identified by a block ID; size limit 200 GB per blob
• Page blob: targeted at random read/write workloads; each blob consists of an array of pages; each page is identified by its offset from the start of the blob; size limit 1 TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2), then committed into the ordered blob Big.mpg]
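The upload-then-commit flow can be sketched at the REST level: PUT each block with `comp=block` and a base64 block ID, then PUT a block list with `comp=blocklist` naming the order. The query-parameter names and XML shape follow the Blob service REST interface as best recalled; verify against the official reference before relying on them.

```python
import base64

def block_id(i: int) -> str:
    # Block IDs must be base64-encoded, and equal length within one blob.
    return base64.b64encode(f"block-{i:06d}".encode()).decode()

def block_put_path(container: str, blob: str, bid: str) -> str:
    """Path for uploading one uncommitted block."""
    return f"/{container}/{blob}?comp=block&blockid={bid}"

def block_list_xml(block_ids) -> str:
    """Put Block List body: commits blocks in the order listed here."""
    items = "".join(f"<Latest>{b}</Latest>" for b in block_ids)
    return ('<?xml version="1.0" encoding="utf-8"?>'
            f"<BlockList>{items}</BlockList>")
```

Because commit order is independent of upload order, blocks can be uploaded in parallel and even retried individually, which is what makes block blobs friendly to streaming uploads.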

Pages
• Similar to block blobs, but optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the maximum size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Hierarchy: Account → Table → Entity
• Account: MovieData
  • Table "Movies": entities Star Wars, Star Trek, Fan Boys
  • Table "Customers": entities Brian H. Prince, Jason Argonaut, Bill Gates

Tables store entities; the entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage: massively scalable tables; billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST – with any platform or language

It is not relational. You cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Use server-side Count(), for example

All entities must have the following properties: Timestamp, PartitionKey, RowKey.

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• The programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key; it is different for each data type (blobs, entities, queues)
• A partition can be served by a single server; the system load-balances partitions based on traffic patterns; the partition key controls entity locality
• The partition key is the unit of scale: load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• "Server Busy": use exponential backoff; the system load-balances to meet your traffic needs; single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

[Diagram: partitions P1…Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Initially one server holds the entire table:
Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

As load grows, the partition range is split by PartitionKey:
Server A: Table = Movies [Min – Comedy) — the Action and Animation rows
Server B: Table = Movies [Comedy – Max] — the Comedy, SciFi, and War rows

Key Selection Things to Consider

bullDistribute load as much as possiblebullHot partitions can be load balancedbullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scansbull Parallelize queriesbull Point queries are most efficient

bullTransactions across a single partitionbullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens – Seriously

A continuation token can be returned when:
• There is a maximum of 1000 rows in a response
• The query reaches a partition range boundary
• The query hits the maximum of 5 seconds to execute
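Because a token can show up even for tiny result sets, every range query needs a drain loop. A generic sketch of that loop (here `fetch_page` is a hypothetical stand-in for a table query returning rows plus the next token):

```python
def query_all(fetch_page):
    """Drain a paged query, always following continuation tokens.

    `fetch_page(token)` returns (rows, next_token); next_token is None
    when the result set is exhausted. Never assume a single page: tokens
    appear at 1000-row limits, partition boundaries, and the 5-second
    execution limit.
    """
    token = None
    results = []
    while True:
        rows, token = fetch_page(token)
        results.extend(rows)
        if token is None:
            return results
```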

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the load on a single partition has exceeded the limits, and partitions are load balanced to meet traffic needs

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
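The work ticket pattern mentioned above keeps queue messages small by storing the real payload in a blob and enqueueing only a reference. A minimal sketch with in-memory stand-ins for blob and queue storage (names here are illustrative, not the storage API):

```python
import uuid

# In-memory stand-ins, just to show the shape of the pattern: the queue
# carries a small reference (the "work ticket"), and the large payload
# lives in blob storage, keeping the message under the 8 KB limit.
blobs = {}
queue = []

def enqueue_work(payload: bytes) -> str:
    blob_name = f"work/{uuid.uuid4()}"
    blobs[blob_name] = payload      # large data goes to a blob
    queue.append(blob_name)         # tiny ticket goes on the queue
    return blob_name

def process_next() -> int:
    blob_name = queue.pop(0)        # GetMessage
    payload = blobs[blob_name]
    result = len(payload)           # ... the real work happens here ...
    del blobs[blob_name]            # garbage collect the orphan blob
    return result                   # then DeleteMessage
```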

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to a queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the polling interval by 2x; a successful poll resets the interval back to 1.
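The polling policy just described reduces to a one-line rule, sketched here as a pure function (a hedged illustration; the minimum/maximum bounds are assumed, not prescribed by the platform):

```python
def next_poll_interval(current: int, got_message: bool,
                       minimum: int = 1, maximum: int = 60) -> int:
    """Truncated exponential back-off polling policy.

    Empty poll: double the interval, capped ("truncated") at `maximum`.
    Successful poll: reset to `minimum`.
    """
    if got_message:
        return minimum
    return min(current * 2, maximum)
```

Keeping the policy pure makes it trivial to unit-test separately from the worker loop that sleeps on it.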

Removing Poison Messages

[Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them; each message carries a dequeue count.]

Scenario 1 – normal dequeue:
1. C1 calls GetMessage(Q, 30 s); gets msg 1
2. C2 calls GetMessage(Q, 30 s); gets msg 2

Scenario 2 – consumer crash:
1. C1 calls GetMessage(Q, 30 s); gets msg 1
2. C2 calls GetMessage(Q, 30 s); gets msg 2
3. C2 consumes msg 2
4. C2 calls DeleteMessage(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after the dequeue
7. C2 calls GetMessage(Q, 30 s); gets msg 1

Scenario 3 – poison message:
1. C1 calls Dequeue(Q, 30 sec); gets msg 1
2. C2 calls Dequeue(Q, 30 sec); gets msg 2
3. C2 consumes msg 2
4. C2 calls Delete(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible 30 s after the dequeue
7. C2 calls Dequeue(Q, 30 sec); gets msg 1
8. C2 crashes
9. msg 1 becomes visible 30 s after the dequeue
10. C1 restarts
11. C1 calls Dequeue(Q, 30 sec); gets msg 1
12. DequeueCount > 2
13. C1 calls Delete(Q, msg 1), removing it as a poison message

Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
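The poison-message rule from the recap can be isolated into one guard function. A minimal sketch, assuming caller-supplied `process` and `quarantine` callables (both hypothetical names):

```python
POISON_THRESHOLD = 3

def handle(message, dequeue_count: int, process, quarantine) -> bool:
    """Dequeue-count check for poison messages.

    If the same message keeps reappearing (every consumer that took it
    crashed), quarantine it instead of processing it again; either way
    the caller then deletes it from the main queue.
    """
    if dequeue_count > POISON_THRESHOLD:
        quarantine(message)   # e.g. log it, or move to a "poison" store
        return False
    process(message)
    return True
```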

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately: Performance vs. Cost
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often leads to savings in other places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content vs. compressed content – Gzip/minify JavaScript, minify CSS, minify images.]
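The gzip trade-off (CPU for bandwidth/storage) is easy to see on any repetitive payload. A minimal sketch using the standard library; the sample page below is fabricated purely for illustration:

```python
import gzip

def gzip_payload(body: bytes) -> bytes:
    """Gzip a response body; modern browsers decompress on the fly."""
    return gzip.compress(body, compresslevel=6)

# Repetitive text (HTML, JSON, JavaScript) compresses very well, cutting
# both bandwidth and storage costs at the price of some CPU time.
page = b"<html><body>" + b"<div class='row'>cell</div>" * 500 + b"</body></html>"
small = gzip_payload(page)
```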

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow: a simple split/join pattern

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

[Diagram: a splitting task fans out into many parallel BLAST tasks, which feed a merging task.]
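The split step of this split/join flow is just fixed-size partitioning of the input query sequences. A minimal sketch (the 100-sequence default echoes the micro-benchmark result in this deck; profile your own workload before trusting it):

```python
def partition_sequences(sequences, per_partition: int = 100):
    """Split input query sequences into fixed-size partitions.

    Each partition becomes one BLAST task; the merging task combines
    results once all partitions complete. Partition size trades load
    imbalance (too large) against per-task overhead (too small).
    """
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]
```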

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

[Diagram: a Web Role hosts the web portal, web service, and job registration. A Job Management Role runs the job scheduler and scaling engine against a global dispatch queue; worker instances process the split/join task flow (a splitting task fans out into parallel BLAST tasks that feed a merging task); a database-updating role refreshes the NCBI databases. Azure Tables hold the job registry, and Azure Blobs hold the BLAST databases, temporary data, etc.]

AzureBLAST Job Portal
• ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

[Diagram: the job portal connects the web portal, web service, and job registration to the job scheduler, scaling engine, and job registry.]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record pair looks like this:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
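Scanning for "Executing" records that never got a matching "done" record is a simple set difference. A hedged sketch of that idea (the regexes match the reconstructed log format above, which is itself an assumption):

```python
import re

# Every "Executing the task N" should be followed by a later
# "Execution of task N is done"; unmatched starts flag failed tasks.
START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        if (m := START.search(line)):
            started.add(m.group(1))
        elif (m := DONE.search(line)):
            finished.add(m.group(1))
    return started - finished
```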

Surviving System Upgrades
North Europe data center: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in a group: this is an update domain.
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman–Monteith (1964):

  ET = (Δ Rn + ρa cp (δq) ga) / ((Δ + γ (1 + ga/gs)) λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
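The Penman–Monteith relation above transcribes directly into a function. A sketch for illustration only: the symbol names follow the slide, the default constants are rough textbook-scale values, and no claim is made that this matches the MODISAzure implementation:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith evapotranspiration.

    ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)

    Defaults: gamma ~ 66 Pa/K (psychrometric constant),
    lambda_v ~ 2450 J/g (latent heat of vaporization) - illustrative only.
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```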

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service web role portal. A request queue feeds the data collection stage, which pulls from the source imagery download sites using source metadata; a reprojection queue feeds the reprojection stage; reduction 1 and reduction 2 queues feed the derivation reduction and analysis reduction stages; a download queue delivers scientific results back to the scientists.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the web role front door
  • Receives all user requests
  • Queues each request to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated worker role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: <PipelineStage> requests flow to the MODISAzure Service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a worker role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker (worker role) instances dequeue tasks and read <Input>Data Storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request is persisted as ReprojectionJobStatus (each entity specifies a single reprojection job request) and placed on the job queue; the Service Monitor parses it into ReprojectionTaskStatus entities (each specifying a single reprojection task, i.e. a single tile) and dispatches them to the task queue for GenericWorker instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile, reading swath source data storage and writing reprojection data storage.]

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: ~$1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 32: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure.

1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy

Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: big.mpg is split into blocks 1–8, uploaded, and committed into the blob big.mpg.]

Pages
• Similar to block blobs
• Optimized for random read/write operations; provides the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
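The 512-byte alignment rule is easy to enforce before issuing a write. A small sketch (helper names are illustrative, not part of any storage SDK):

```python
PAGE_SIZE = 512  # page blobs require 512-byte-aligned write ranges

def aligned_range(offset: int, length: int) -> bool:
    """True if a write range falls on 512-byte page boundaries."""
    return offset % PAGE_SIZE == 0 and length % PAGE_SIZE == 0

def round_up_to_page(n: int) -> int:
    """Pad a payload length up to the next 512-byte boundary."""
    return ((n + PAGE_SIZE - 1) // PAGE_SIZE) * PAGE_SIZE
```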

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a page blob
  • Example: mount a page blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the page blob
  • The drive is made durable through standard page blob replication
  • The drive persists even when not mounted, as a page blob

Windows Azure Drive API

• Create Drive – creates a page blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted page blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and page blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (page blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (page blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table name: Movies (entities: Star Wars, Star Trek, Fan Boys)
  Table name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is Not Relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance.

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

The system load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

Messages – Queue name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

A Movies table, keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially a single server holds the whole table:
• Server A: Table = Movies [Min – Max]

As traffic grows, the system splits the key range across servers:
• Server A: Table = Movies [Min – Comedy)
• Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability
• PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
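One concrete way to "distribute load as much as possible" is to prepend a stable hash bucket to a naturally monotonic key (a timestamp, a sequence number), so inserts spread across many partitions instead of hammering one. A minimal sketch; the function name, bucket count, and MD5 choice are mine, not from the deck:

```python
import hashlib

def prefixed_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prepend a deterministic hash bucket (00..15) to the natural key.
    The same input always maps to the same bucket, so point lookups
    still work, but consecutive keys land on different partitions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off is that range scans over the natural key now require one query per bucket, fanned out in parallel.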

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
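The continuation-token loop is mechanical but easy to forget. The sketch below simulates a paged table query entirely in memory (no Azure calls; all names are mine) just to show the shape of the loop: keep reissuing the query with the returned token until the token comes back empty.

```python
def query_page(rows, token=None, page_size=1000):
    """Stand-in for one table query round trip: returns up to page_size
    rows plus a continuation token when more remain. (The real service
    can also return a token early, e.g. at a partition boundary or the
    5-second execution limit, so never assume a full page.)"""
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

def query_all(rows, page_size=1000):
    """Drain the query: loop until no continuation token is returned."""
    results, token = [], None
    while True:
        page, token = query_page(rows, token, page_size)
        results.extend(page)
        if token is None:
            return results
```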

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns
• Distribute by using a hash etc. as a prefix

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Decoupling through queues can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?

Queue Terminology / Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage (with a visibility timeout) to dequeue messages and, after processing, RemoveMessage to delete them]

PutMessage request:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage request:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue at intervals growing 1, 2, … up to a cap of 60]
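The back-off polling rule above fits in one function. A minimal sketch with names and floor/ceiling values of my choosing; a worker would sleep for the returned interval between GetMessage calls.

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off polling: an empty poll doubles
    the wait (capped at ceiling); a successful dequeue resets it to
    the floor so a busy queue is drained quickly."""
    if got_message:
        return floor
    return min(ceiling, current * 2)
```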

Removing Poison Messages

Scenario 1 – normal processing:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 – a consumer crashes:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 – a poison message that keeps killing its consumer:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) – the poison message is removed
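Scenario 3's dequeue-count guard can be sketched as a tiny in-memory simulation (no Azure calls; the queue is a plain list, the threshold and names are mine). The point is the shape of the loop: check the count before processing, and park messages that have been retried too often.

```python
MAX_DEQUEUE = 3  # illustrative threshold, mirroring "DequeueCount > 2"

def process(queue, poison_queue, handler):
    """One pass over the queue. A message whose handler keeps crashing
    (so it is never deleted) reappears with a higher dequeue count and
    is eventually moved aside instead of being retried forever."""
    for msg in list(queue):
        if msg["dequeue_count"] > MAX_DEQUEUE:
            queue.remove(msg)
            poison_queue.append(msg)   # park for offline inspection
            continue
        msg["dequeue_count"] += 1
        try:
            handler(msg)
            queue.remove(msg)          # DeleteMessage on success
        except Exception:
            pass                       # message becomes visible again later
```

In the real service the dequeue count arrives with the message, so workers need no shared state to enforce the threshold.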

Queues Recap

Make message processing idempotent
• No need to deal with failures

Do not rely on order
• Invisible messages result in out-of-order delivery

Use the dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count

Messages > 8 KB
• Use a blob to store the message data, with a reference in the message
• Batch messages
• Garbage collect orphaned blobs

Use message count to scale
• Dynamically increase/reduce workers
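The "blob reference in the message" recommendation for payloads over 8 KB is the work ticket pattern. A minimal in-memory sketch; the dict standing in for a blob container, the size constant, and all names are mine:

```python
import uuid

MAX_INLINE = 8 * 1024  # queue message size limit described in the deck

def enqueue(queue, blobs, payload: bytes):
    """Small payloads travel inline; large ones go to blob storage and
    only a reference (the work ticket) goes through the queue."""
    if len(payload) <= MAX_INLINE:
        queue.append({"inline": payload})
    else:
        name = str(uuid.uuid4())
        blobs[name] = payload
        queue.append({"blob_ref": name})

def dequeue(queue, blobs) -> bytes:
    """Resolve the ticket back to the payload. Note the blob is not
    deleted here: orphaned blobs need separate garbage collection,
    exactly as the recap warns."""
    msg = queue.pop(0)
    if "inline" in msg:
        return msg["inline"]
    return blobs[msg["blob_ref"]]
```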

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs

Performance & Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content vs. content compressed with Gzip, minified JavaScript, minified CSS, and minified images]
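The gzip trade described above (spend a little CPU, ship far fewer bytes) is easy to see with the standard library. A small sketch using Python's built-in `gzip` module; the sample HTML fragment is deliberately repetitive, which markup usually is:

```python
import gzip

def gzip_bytes(data: bytes, level: int = 6) -> bytes:
    """Compress response bytes; level 6 is a common speed/size balance."""
    return gzip.compress(data, compresslevel=level)

# Repetitive markup, as typical HTML output is, compresses dramatically.
html = b"<div class='row'><span>item</span></div>" * 1000
packed = gzip_bytes(html)
```

In practice the web server or framework sets `Content-Encoding: gzip` and does this for you; the point here is only the size arithmetic behind the cost savings.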

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
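The query-segmentation split/merge pattern is simple to sketch. This is an illustration of the pattern, not AzureBLAST's actual code; the partition size of 100 mirrors the micro-benchmark result mentioned later in the deck, and all names are mine.

```python
def split_input(records, per_partition=100):
    """Split stage: chunk the input sequences into fixed-size partitions.
    Each partition becomes one independent (pleasingly parallel) BLAST task."""
    return [records[i:i + per_partition]
            for i in range(0, len(records), per_partition)]

def merge_results(results_per_partition):
    """Join stage: with query segmentation, merging is just concatenating
    per-partition hit lists in partition order (no special reduction,
    unlike database segmentation)."""
    merged = []
    for part in results_per_partition:
        merged.extend(part)
    return merged
```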

AzureBLAST Task Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, whose outputs are combined by a merging task

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

[Diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to pools of Worker instances and a database-updating Role; Azure Tables hold the Job Registry, and Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc. A splitting task fans out BLAST tasks that a merging task combines]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: the Job Portal's Web Portal and Web Service feed job registration to the Job Scheduler, Scaling Engine, and Job Registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

[Diagram: ~50–62 VMs allocated to each of the 8 deployments]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
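The "something is wrong" check amounts to pairing each "Executing" line with its "done" line and flagging the orphans. A small sketch over the sample records above (the log lines are reproduced from the deck; the function and regexes are mine):

```python
import re

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]

def unfinished_tasks(lines):
    """A healthy record is an Executing/done pair; an Executing line
    with no matching done line (task 251774 above) signals a task
    that failed, was preempted, or was lost to a node restart."""
    started, finished = set(), set()
    for line in lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished
```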

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

[Timeline: all 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group down ~30 mins – this is an update domain]

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks completed, and the job was killed

[Timeline: 35 nodes experienced blob-writing failures at the same time]
A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs, big data reduction
• Some of the inputs are not so simple
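The Penman-Monteith formula translates directly into code. A sketch using the slide's symbol list; the default values for γ and λv are typical textbook magnitudes I supplied for illustration (they are not from the deck), and no unit conversion is attempted, so treat this as the shape of the computation rather than a calibrated model.

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e6):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
    delta: slope of saturation humidity vs. temperature (Pa/K)
    Rn: net radiation; rho_a: dry air density; c_p: specific heat of air
    dq: vapor pressure deficit; g_a / g_s: air / stomatal conductivity
    gamma, lambda_v: assumed typical values, not taken from the deck."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

One sanity check the formula supports: reducing stomatal conductivity g_s (stomata closing) increases the denominator and lowers ET, as expected physically.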

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives – 5 TB (600K files)
• FLUXNET curated sensor dataset – 30 GB (960 files)
• FLUXNET curated field dataset – 2 KB (1 file)
• NCEP/NCAR – ~100 MB (4K files)
• Vegetative clumping – ~5 MB (1 file)
• Climate classification – ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS Service Web Role Portal accepts scientists' requests via a Request Queue; the Data Collection Stage pulls from source imagery download sites and feeds the Reprojection Queue, Reduction 1 Queue, Reduction 2 Queue, and Download Queue; scientists download the science results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request flows into the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> Task Queue dispatched by the Service Monitor, read <Input>Data Storage, and persist <PipelineStage>TaskStatus]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue consumed by Generic Workers operating over Reprojection Data Storage and Swath Source Data Storage]

• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Storage
Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale

• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely-coupled, reliable systems

Blob Features and Functions

• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob – inserts a new blob, overwrites the existing blob
• GetBlob – get a whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob

• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg

Containers

• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 200 GB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: Big.mpg is uploaded as blocks 1-8, possibly out of order, then committed into the blob Big.mpg]
Fix the animation

Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
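The 512-byte alignment rule means an arbitrary byte range must be expanded to the enclosing page boundaries before a page write. A small sketch of that arithmetic (the helper name is mine):

```python
PAGE = 512

def aligned_range(offset: int, length: int) -> tuple:
    """Expand [offset, offset+length) to the smallest enclosing range
    that starts and ends on 512-byte page boundaries, as page-blob
    writes require."""
    start = (offset // PAGE) * PAGE
    end = -(-(offset + length) // PAGE) * PAGE  # ceiling to a page boundary
    return start, end
```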

BLOB Leases

• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:\
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

• Account: MovieData
• Table: Movies – entities: Star Wars, Star Trek, Fan Boys
• Table: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity
Tables store entities; entity schema can vary within the same table

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable
• Data is replicated several times

bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language

Tables are not relational. They cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Run server-side aggregates (no server-side Count(), for example)

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
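A quick client-side check for the three required system properties; `validate_entity` is a hypothetical helper for illustration, not part of any Azure SDK:

```python
def validate_entity(entity):
    """Check the three system properties every Azure Table entity must
    carry; beyond them, the schema is free to vary per entity."""
    required = {"PartitionKey", "RowKey", "Timestamp"}
    missing = required - entity.keys()
    if missing:
        raise ValueError("missing required properties: %s" % sorted(missing))
    return entity
```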

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Every data object has a partition key; it is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns, which controls entity locality
• The partition key is the unit of scale
• Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server
• On "Server Busy", use exponential backoff: either the system is load balancing to meet your traffic needs, or the limits of a single partition have been reached
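The exponential-backoff advice can be sketched as follows; the function name, retry limits, and the `ServerBusyError` type are illustrative, not SDK names:

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for a '503 Server Busy' response from storage."""
    pass

def with_backoff(operation, max_retries=6, base_delay=0.5, cap=30.0):
    # Double the wait on each "Server Busy", cap it, and add jitter so
    # many clients do not retry in lockstep.
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.0, 1.0))
    raise ServerBusyError("retries exhausted")
```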

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

• Blobs – container name + blob name; every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg

• Messages – queue name; all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message1
  jobs     | Message2
  workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

(Diagram: partitions P1, P2, ..., Pn each replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Example table (PartitionKey = Category, RowKey = Title):

  PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
  Action                  | Fast & Furious            | ...       | 2009
  Action                  | The Bourne Ultimatum      | ...       | 2007
  ...                     | ...                       | ...       | ...
  Animation               | Open Season 2             | ...       | 2009
  Animation               | The Ant Bully             | ...       | 2006
  ...                     | ...                       | ...       | ...
  Comedy                  | Office Space              | ...       | 1999
  ...                     | ...                       | ...       | ...
  SciFi                   | X-Men Origins: Wolverine  | ...       | 2009
  ...                     | ...                       | ...       | ...
  War                     | Defiance                  | ...       | 2008

Initially a single server holds the entire range:
• Server A: Table = Movies [Min – Max]

As traffic grows, the system splits the partition range across servers:
• Server A: Table = Movies [Min – Comedy)
• Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability:
• The PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned when:
• The response reaches the maximum of 1,000 rows
• The query reaches the end of a partition range boundary
• The query reaches the maximum of 5 seconds of execution time
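A continuation-token drain loop looks like this in outline; `query_page` is a stand-in for whatever "query with continuation token" call your client library exposes:

```python
def query_all(query_page):
    """Drain a paged table query. `query_page(token)` must return
    (rows, next_token), where next_token is None once the final page
    has been served."""
    results, token = [], None
    while True:
        rows, token = query_page(token)
        results.extend(rows)
        if token is None:  # no token means the result set is complete
            return results
```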

Tables Recap

• Efficient for frequently used queries; supports batch transactions; distributes load
• Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns: distribute writes by using a hash or similar as a key prefix
• Always handle continuation tokens; expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness
• Decoupling through queues can aid scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to place Msg 1–4 on the queue; Worker Roles call GetMessage with a visibility timeout to receive a message and RemoveMessage to delete it after processing.)

PutMessage request:

  POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

  HTTP/1.1 200 OK
  Transfer-Encoding: chunked
  Content-Type: application/xml
  Date: Tue, 09 Dec 2008 21:04:30 GMT
  Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

  <?xml version="1.0" encoding="utf-8"?>
  <QueueMessagesList>
    <QueueMessage>
      <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
      <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
      <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
      <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
      <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
      <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
    </QueueMessage>
  </QueueMessagesList>

RemoveMessage request:

  DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x, up to a maximum
• A successful poll resets the interval back to 1
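The interval rule above can be captured in one function; the floor and the 60-second ceiling here are illustrative choices:

```python
def next_interval(current, got_message, floor=1, ceiling=60):
    """Truncated exponential back-off for queue polling: each empty
    poll doubles the interval (capped at `ceiling` seconds); a
    successful poll resets it to `floor`."""
    if got_message:
        return floor
    return min(ceiling, current * 2)
```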

Removing Poison Messages

Producers P1 and P2 put messages on the queue; consumers C1 and C2 take them off.

Scenario 1 – normal processing:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 – consumer crash:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 – poison message:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after its dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after its dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. msg 1's DequeueCount is now > 2
13. C1: Delete(Q, msg 1) – the poison message is removed
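The dequeue-count threshold at the end of the scenario can be sketched like this; the threshold value and the callback names are illustrative policy choices, not Azure constants:

```python
MAX_DEQUEUE_COUNT = 3  # policy choice, not an Azure constant

def handle(message, process, quarantine, delete):
    """Poison-message guard: a message that keeps reappearing because
    its consumer crashes mid-processing accumulates a dequeue count;
    past the threshold, quarantine it (e.g. to a 'poison' queue or
    blob) instead of retrying forever. The three callbacks are
    caller-supplied stand-ins."""
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        quarantine(message)
        delete(message)
        return "quarantined"
    process(message)
    delete(message)
    return "processed"
```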

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures specially
• Do not rely on order: invisible messages can result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• For messages larger than 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase or reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• A common mistake is splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU against keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency; this may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library: data parallelism and task parallelism
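The same task-parallel idea can be sketched outside .NET. A minimal Python analogue (Python and a thread pool stand in for the Task Parallel Library; `work_item` is a hypothetical unit of work) that sizes the pool to the core count, per the caution above:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work_item(n):
    # Hypothetical CPU- or I/O-bound unit of work.
    return n * n

def run_all(items):
    # Size the pool to the core count so active workers do not
    # oversubscribe the VM.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(work_item, items))
```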

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content vs. content compressed with Gzip, minified JavaScript, minified CSS, and minified images.)
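Gzipping output is one line with Python's standard library; this is a generic illustration of the payoff, not Azure-specific code:

```python
import gzip

def gzip_body(text):
    """Gzip an HTTP response body before sending it; the client signals
    support via Accept-Encoding and decompresses on the fly."""
    return gzip.compress(text.encode("utf-8"))

# Repetitive markup compresses dramatically:
body = "<li>item</li>" * 1000
packed = gzip_body(body)
```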

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); this needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences, query the partitions in parallel, and merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task Flow: a simple split/join pattern

• Leverage the multiple cores of one instance: the "-a" argument of NCBI BLAST, set to 1/2/4/8 for the small, medium, large, and extra-large instance sizes
• Task granularity: a large partition causes load imbalance; a small partition causes unnecessary overheads (NCBI BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
• Value of the visibilityTimeout for each BLAST task: essentially an estimate of the task run time; too small causes repeated computation, too large causes an unnecessarily long wait in case of instance failure

(Diagram: a splitting task fans out to many BLAST tasks, whose outputs feed a merging task.)
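The split/join pattern reduces to three small functions. This Python sketch substitutes a trivial `blast_task` for a real NCBI BLAST invocation; the partition size is the granularity knob discussed above:

```python
def split(sequences, partition_size):
    # Splitting task: fixed-size partitions (e.g. 100 sequences each).
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running NCBI BLAST over one partition of queries.
    return ["hit:" + seq for seq in partition]

def merge(per_partition_results):
    # Merging task: join per-partition outputs into one result set.
    return [hit for part in per_partition_results for hit in part]

def run_job(sequences, partition_size=100):
    return merge(blast_task(p) for p in split(sequences, partition_size))
```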

Micro-Benchmarks Inform Design

• Task size vs. performance: there is a benefit from the warm-cache effect; 100 sequences per partition is the best choice
• Instance size vs. performance: super-linear speedup with larger worker instances, primarily due to memory capacity
• Task size and instance size vs. cost: the extra-large instance generated the best and most economical throughput, fully utilizing the resource

AzureBLAST Architecture

(Diagram: a Web Role hosts the web portal, web service, job registration, and job scheduler. A job management role runs the scaling engine and dispatches work through a global dispatch queue to worker instances, which execute a splitting task, many BLAST tasks, and a merging task. Azure Tables hold the job registry; Azure Blobs hold the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the NCBI databases current.)

AzureBLAST Job Portal

• An ASP.NET program hosted by a web role instance: submit jobs, and track a job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored in the job registry table: fault tolerance, avoiding in-memory state

Demonstration

R. palustris as a Platform for H2 Production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

• Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments of this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution, and each segment consists of smaller partitions
• When load imbalances arose, the load was redistributed manually

(Diagram: instance counts per deployment.)

End Result

• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place

Understanding Azure by Analyzing Logs

A normal log record should look like:

  3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
  3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
  3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
  3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
  3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
  3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

  3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
  3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
  3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group, ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
  ET = water volume evapotranspired (m3 s-1 m-2)
  Δ  = rate of change of saturation specific humidity with air temperature (Pa K-1)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m-2)
  cp = specific heat capacity of air (J kg-1 K-1)
  ρa = dry air density (kg m-3)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s-1)
  gs = conductivity of plant stoma (inverse of rs) (m s-1)
  γ  = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
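The Penman-Monteith formula translates directly into code. This sketch uses the slide's symbols; the defaults for γ and λv are illustrative, and all inputs must use the units listed above:

```python
def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lambda_v).
    Defaults: gamma ~ 66 Pa/K (psychrometric constant); lambda_v in J/g."""
    numerator = delta * rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator
```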

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

1. Data collection (map) stage: downloads requested input tiles from NASA FTP sites; includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage: converts source tile(s) to intermediate-result sinusoidal tiles, using simple nearest-neighbor or spline algorithms
3. Derivation reduction stage: the first stage visible to the scientist; computes ET in our initial use
4. Analysis reduction stage: an optional second stage visible to the scientist; enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds the data collection stage, which pulls from source imagery download sites using source metadata; a reprojection queue feeds the reprojection stage; the reduction 1 and reduction 2 queues feed the derivation and analysis reduction stages; a download queue delivers the scientific results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the web role front door: it receives all user requests and queues each request to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated worker role: it parses all job requests into tasks (recoverable units of work); the execution status of all jobs and tasks is persisted in Tables

(Diagram: <PipelineStage> requests flow from the MODISAzure Service (web role) into a <PipelineStage> job queue; the Service Monitor (worker role) persists <PipelineStage> JobStatus, parses and persists <PipelineStage> TaskStatus, and dispatches to a <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a worker role:
• Generic Worker roles dequeue tasks created by the Service Monitor
• Failed tasks are retried 3 times
• All task status is maintained

(Diagram: Generic Workers (worker roles) pull from the <PipelineStage> task queue and read and write <Input> data storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request enters the job queue; the Service Monitor (worker role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue consumed by Generic Worker roles, which read swath source data storage and write reprojection data storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

  Stage                | Data                              | Compute                    | Cost
  Data collection      | 400-500 GB, 60K files, 10 MB/sec  | 11 hours, <10 workers      | $50 upload, $450 storage
  Reprojection         | 400 GB, 45K files                 | 3500 hours, 20-100 workers | $420 CPU, $60 download
  Derivation reduction | 5-7 GB, 55K files                 | 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
  Analysis reduction   | <10 GB, ~1K files                 | 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November 2010 Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Durable Storage, At Massive Scale

• Blobs – massive files, e.g., videos and logs
• Drives – use standard file-system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems

Blob Features and Functions

• Stores large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, or overwrites an existing blob
  • GetBlob – gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg
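The addressing scheme is simple enough to capture in a one-line helper; this is a generic sketch of the URL layout, not an SDK call:

```python
def blob_url(account, container, blob_name):
    # Every blob is addressable by storage account, container, and name.
    return "http://%s.blob.core.windows.net/%s/%s" % (account, container, blob_name)
```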

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (the default): requires the account key for access
• Full public read
• Public read only

Two Types of Blobs Under the Hood

Block blobs:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks; each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blobs:
• Targeted at random read/write workloads
• Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks

• You can upload a file in blocks; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: the blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2) and committed into the blob in order.)

Pages

• Similar to block blobs, but optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the maximum size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
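The 512-byte alignment rule can be checked before issuing a Put Page call; `page_range` is a hypothetical client-side validator for illustration:

```python
PAGE_SIZE = 512  # page-blob writes must align to 512-byte boundaries

def page_range(offset, length):
    """Validate a Put Page range: both the offset and the end of the
    range must fall on 512-byte boundaries."""
    if offset % PAGE_SIZE or (offset + length) % PAGE_SIZE:
        raise ValueError("range must align to 512-byte page boundaries")
    return (offset, offset + length - 1)
```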

BLOB Leases

• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• You must have the lease ID to perform operations
• You can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive (Page Blob) persists even when not mounted

Windows Azure Drive API
• Create Drive – Creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – Returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – Unmounts the drive and frees up the drive letter
• Snapshot Drive – Allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – Provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

[Diagram: Account "MovieData" contains Table "Movies" (entities: Star Wars, Star Trek, Fan Boys) and Table "Customers" (entities: Brian H. Prince, Jason Argonaut, Bill Gates)]

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

System load balances
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• "Server Busy" means single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Initially a single server holds the whole range:
Server A: Table = Movies [Min - Max]

As load grows, the range is split across servers:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A continuation token is returned when any of the following applies:
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
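The client-side consequence is a loop that keeps issuing the query until no token comes back. The sketch below simulates the 1,000-row cap with an in-memory "table"; the function names and token shape are illustrative, not the real API:

```python
# A single response never exceeds MAX_ROWS even when more data matches,
# so callers must loop on the continuation token.
MAX_ROWS = 1000

def query_page(rows, token=None):
    start = token or 0
    page = rows[start:start + MAX_ROWS]
    next_token = start + MAX_ROWS if start + MAX_ROWS < len(rows) else None
    return page, next_token

def query_all(rows):
    results, token = [], None
    while True:
        page, token = query_page(rows, token)
        results.extend(page)
        if token is None:          # only stop when no token comes back
            return results

entities = [{"RowKey": i} for i in range(2500)]
first_page, tok = query_page(entities)
everything = query_all(entities)
```

A query over 2,500 entities takes three round trips; forgetting the loop silently drops 1,500 rows.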

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select PartitionKey and RowKey that help scale
  • Distribute by using a hash etc. as prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy": load balance partitions to meet traffic needs, or load on a single partition has exceeded the limits
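The "distribute by using a hash as prefix" tip can be sketched as follows. An append-only natural key (such as a timestamp) sends every write to the newest partition; a short hash-bucket prefix spreads them out. The key format and bucket count here are illustrative choices, not a prescribed scheme:

```python
import hashlib

BUCKETS = 16  # illustrative; pick to match your throughput needs

def distributed_partition_key(natural_key: str) -> str:
    """Prefix the natural key with a deterministic hash bucket so that
    lexically adjacent keys land in different partitions."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return f"{bucket:02d}_{natural_key}"

# 60 consecutive timestamps (a classic append-only pattern) now fan out
keys = [distributed_partition_key(f"2010-12-07T10:00:{i:02d}") for i in range(60)]
buckets_used = {k.split("_", 1)[0] for k in keys}
```

The trade-off is that range queries over the natural key must now be issued once per bucket and merged, which is why the prefix should only be added when write distribution, not range scanning, is the bottleneck.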

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
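The work ticket pattern mentioned above can be sketched with toy stores: the queue message carries only a small reference, while the payload (which may exceed the 8 KB message limit) lives in blob storage. FakeBlobStore and the key format are illustrative, not the real API:

```python
import uuid

# Toy blob store: put() returns a key that serves as the work ticket.
class FakeBlobStore(dict):
    def put(self, data: bytes) -> str:
        key = f"payloads/{uuid.uuid4()}"
        self[key] = data
        return key

blobs = FakeBlobStore()
queue = []

def enqueue_work(payload: bytes):
    ticket = blobs.put(payload)     # payload goes to blob storage
    queue.append(ticket)            # small work ticket goes on the queue

def process_next() -> bytes:
    ticket = queue.pop(0)
    data = blobs[ticket]
    result = data.upper()           # stand-in for the real processing
    del blobs[ticket]               # garbage-collect the orphaned blob
    return result

enqueue_work(b"x" * 20_000)         # far bigger than an 8 KB message
output = process_next()
```

Note the cleanup step: if the worker forgets to delete the blob after processing, orphaned payloads accumulate, which is why the recap later recommends garbage-collecting them.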

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to place Msg 1–4 on the Queue; Worker Roles call GetMessage (with a visibility timeout) and RemoveMessage once processing succeeds]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
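The lifecycle can be simulated with a toy queue and a manual clock: GetMessage hides a message for the visibility timeout, and unless it is deleted in time it reappears for another consumer. The class and its clock are illustrative, not the real service:

```python
import itertools

class FakeQueue:
    """Minimal model of the get/delete lifecycle with a visibility timeout."""
    def __init__(self):
        self.clock = 0
        self._ids = itertools.count()
        self.msgs = []

    def put_message(self, body):
        self.msgs.append({"id": next(self._ids), "body": body, "visible_at": 0})

    def get_message(self, visibility_timeout):
        for m in self.msgs:
            if m["visible_at"] <= self.clock:
                m["visible_at"] = self.clock + visibility_timeout  # hide it
                return dict(m)
        return None

    def delete_message(self, msg_id):
        self.msgs = [m for m in self.msgs if m["id"] != msg_id]

q = FakeQueue()
q.put_message("msg 1")
taken = q.get_message(visibility_timeout=30)    # worker takes msg 1...
missing = q.get_message(visibility_timeout=30)  # ...others cannot see it
q.clock = 31                                    # worker crashed; timeout passes
retaken = q.get_message(visibility_timeout=30)  # msg 1 is visible again
q.delete_message(retaken["id"])                 # processed: remove it for good
```

This is exactly the at-least-once guarantee from the previous slide: the crash scenario re-delivers the message instead of losing it.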

Truncated Exponential Back Off Polling
• Consider a back-off polling approach: each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue, intervals growing 1 → 2 → … → 60]
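The back-off rule above fits in one function. The base and cap of 1 and 60 seconds match the diagram; both are tunable:

```python
# Truncated exponential back-off: empty polls double the interval up to a
# cap; a successful poll resets it to the base (values in seconds).
def next_interval(current: int, got_message: bool, base: int = 1, cap: int = 60) -> int:
    return base if got_message else min(current * 2, cap)

# Six empty polls followed by a hit: 1, 2, 4, 8, 16, 32, then capped at 60
intervals, cur = [], 1
for got_message in [False, False, False, False, False, False, True]:
    intervals.append(cur)
    cur = next_interval(cur, got_message)
```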

Removing Poison Messages

[Diagram: producers P1, P2 and consumers C1, C2 working a queue; each message carries a dequeue count]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1), removing the poison message

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
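The dequeue-count threshold can be sketched as a small pump loop: a message fetched more than MAX_DEQUEUE times is treated as poison and parked in a dead-letter list instead of being retried forever. The threshold of 2 matches the "DequeueCount > 2" step in the diagram; the structures are illustrative:

```python
MAX_DEQUEUE = 2

def pump(queue, handler, dead_letters):
    msg = queue.pop(0)
    msg["dequeue_count"] += 1
    if msg["dequeue_count"] > MAX_DEQUEUE:
        dead_letters.append(msg)       # poison: stop retrying
        return
    try:
        handler(msg)
    except Exception:
        queue.append(msg)              # timeout expires: visible again

def always_fails(msg):
    raise RuntimeError("cannot process " + msg["body"])

queue = [{"body": "bad payload", "dequeue_count": 0}]
dead = []
while queue:
    pump(queue, always_fails, dead)
```

Without the threshold, the `while` loop above would never terminate, which is precisely how one malformed message can stall a worker pool.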

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance need not mean one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
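The slide names the .NET 4 Task Parallel Library; as a rough cross-language analogue of its data-parallel pattern (one pool, many independent per-item tasks), a Python sketch looks like:

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: apply the same work function to each item of a
# collection, letting the pool schedule items across workers.
def expensive(x: int) -> int:
    return x * x          # stand-in for real per-item work

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(expensive, range(10)))
```

The point mirrors the slide's advice: the pool, not the application, decides how many workers run concurrently, which keeps the instance's cores busy without oversubscribing them.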

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience due to not having excess capacity, and the cost of having idling VMs (a performance vs. cost trade-off)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

[Diagram: uncompressed vs. compressed content pipeline – Gzip, minify JavaScript, minify CSS, minify images]
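The bandwidth claim is easy to verify: repetitive markup compresses dramatically, at some CPU cost. A quick check using the standard library:

```python
import gzip

# Repetitive HTML-like output, typical of templated pages
page = b"<div class='row'><span>value</span></div>" * 500
compressed = gzip.compress(page)
savings = 1 - len(compressed) / len(page)  # fraction of bandwidth saved
```

On content like this the savings fraction is well above 90%; real pages compress less, but the trade of CPU for bandwidth usually still pays.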

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
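The query-segmentation split/join pattern can be sketched in a few lines. The 100-sequences-per-partition figure comes from the micro-benchmarks later in the deck; the fake per-sequence "hit" is illustrative:

```python
# Split/Join: partition the input sequences, "BLAST" each partition
# independently on a worker, then merge the partial results.
def split_sequences(sequences, partition_size=100):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

sequences = [f"seq{i}" for i in range(250)]
partitions = split_sequences(sequences)
# pretend each worker returns one hit per input sequence
results = merge_results([[f"hit:{s}" for s in part] for part in partitions])
```

250 sequences become three independent tasks (100 + 100 + 50); since each partition is self-contained, the tasks map directly onto queue messages for worker roles.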

AzureBLAST Task-Flow
A simple Split/Join pattern

Leverage the multi-core capability of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of an instance failure

[Diagram: Splitting task → BLAST tasks in parallel → Merging task]

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine over a global dispatch queue; Worker Roles process BLAST tasks; a Database Updating Role refreshes the NCBI databases; Azure Tables hold the Job Registry; Azure Blobs hold the BLAST databases, temporary data, etc. Tasks flow from a Splitting task through parallel BLAST tasks to a Merging task.]

AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states

[Diagram: Job Portal → Web Portal / Web Service → Job registration → Job Scheduler → Scaling Engine, backed by the Job Registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

[Diagram: deployments of 50–62 instances spread across the four datacenters]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades
North Europe Data Center: in total, 34,256 tasks processed

[Chart: all 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 mins. This is an update domain]

Surviving Storage Failures
West Europe Data Center: 30,976 tasks were completed, and the job was killed

[Chart: 35 nodes experienced blob-writing failures at the same time]

A reasonable guess: the Fault Domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
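The Penman-Monteith form on this slide can be sketched numerically. The input values below are made-up placeholders for illustration, not field data; units follow the legend:

```python
# ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    return (delta * Rn + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Placeholder inputs: Δ in Pa/K, Rn in W/m2, ρa in kg/m3, cp in J/(kg·K),
# δq in Pa, conductivities in m/s, λv in J/g
et = penman_monteith_et(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1013.0,
                        dq=1000.0, g_a=0.02, g_s=0.01)
```

The point of the MODISAzure pipeline is that ga and gs in this formula are the hard part: they must be estimated per pixel from the imagery, sensor, and field datasets listed on the next slide.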

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

[Pipeline diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal (Request Queue); the Data Collection Stage pulls from the Source Imagery Download Sites (Download Queue, Source Metadata); tiles flow through the Reprojection Queue, Reduction 1 Queue, and Reduction 2 Queue across the Reprojection, Derivation Reduction, and Analysis Reduction Stages; scientific results are downloaded by scientists]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks (recoverable units of work)
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role (GenericWorker)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus; GenericWorker (Worker Role) instances dispatch from the <PipelineStage> Task Queue and read <Input>Data Storage]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile); GenericWorker (Worker Role) instances dispatch from the Task Queue and read Reprojection Data Storage. The SwathGranuleMeta table is queried for geo-metadata (e.g. boundaries) for each swath tile; the ScanTimeList table is queried for the list of satellite scan times that cover a target tile; sources live in Swath Source Data Storage]

Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers: $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers: $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Blob Features and Functionsbull Store Large Objects (up to 1TB

in size)

bull You can have as many containers and Blobs as you want

bull Standard REST Interfacebull PutBlob

bull Inserts a new blob overwrites the existing blob

bull GetBlobbull Get whole blob or a specific range

bull DeleteBlobbull CopyBlobbull SnapshotBlobbull LeaseBlob

bull Each Blob has an addressbull httpltstorageaccountgtblobcorewindowsnetltContainergtltBlobNamegtbull httpmovieconversionblobcorewindowsnetoriginalsbargampg

Containers

bull Similar to a top level folderbull Has an unlimited capacitybull Can only contain BLOBs

Each container has an access level- Private

- Default will require the account key to access- Full public read- Public read only

Two Types of Blobs Under the Hood

bull Block Blob bull Targeted at streaming

workloadsbull Each blob consists of a

sequence of blocksbull Each block is identified by a Block

ID

bull Size limit 200GB per blob

bull Page Blob bull Targeted at random

readwrite workloadsbull Each blob consists of an

arrayof pagesbull Each page is identified by its offset

from the start of the blob

bull Size limit 1TB per blob

Blocks

• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: blocks of Big.mpg uploaded out of order – 1 6 8 3 5 4 7 2 – then committed into the final Big.mpg.)
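The stage-then-commit flow above can be sketched as a small in-memory model. This is a hypothetical stand-in, not the real storage client API; `BlockBlob`, `put_block`, and `put_block_list` are illustrative names chosen to mirror the REST operations.

```python
# Hypothetical in-memory model of the block-blob protocol: stage blocks
# (Put Block), then commit a chosen ordering (Put Block List).
class BlockBlob:
    def __init__(self):
        self.staged = {}      # uncommitted blocks, keyed by block ID
        self.committed = []   # ordered list of committed block IDs

    def put_block(self, block_id, data):
        self.staged[block_id] = data

    def put_block_list(self, block_ids):
        # Commit blocks in any order; uncommitted leftovers would be GC'd.
        missing = [b for b in block_ids if b not in self.staged]
        if missing:
            raise KeyError(f"blocks not staged: {missing}")
        self.committed = list(block_ids)

    def content(self):
        return b"".join(self.staged[b] for b in self.committed)

blob = BlockBlob()
# Upload out of order, like the Big.mpg example on the slide.
for block_id, chunk in [("b2", b"world"), ("b1", b"hello "), ("b3", b"!")]:
    blob.put_block(block_id, chunk)
blob.put_block_list(["b1", "b2", "b3"])   # the commit defines the final order
print(blob.content())                     # b'hello world!'
```

The point of the model: upload order is irrelevant; only the committed block list determines the blob's content.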


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
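The 512-byte alignment rule can be expressed as a small validation sketch (function and constant names are illustrative, not part of any real client library):

```python
# Sketch of the Put Page alignment rule: both the start offset and the
# length of a page write must fall on 512-byte boundaries.
PAGE_SIZE = 512

def valid_page_write(offset, length):
    return offset % PAGE_SIZE == 0 and length % PAGE_SIZE == 0 and length > 0

print(valid_page_write(0, 512))      # True
print(valid_page_write(1024, 4096))  # True
print(valid_page_write(100, 512))    # False: offset not aligned
```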

BLOB Leases

• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table: Movies – entities: Star Wars, Star Trek, Fan Boys
  Table: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity
Tables store entities; entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational – cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
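A minimal sketch of such entities, using plain dictionaries rather than the real .NET or REST client (`make_entity` is an illustrative helper, not a library call):

```python
# Every entity carries Timestamp, PartitionKey, and RowKey; all other
# properties can vary from entity to entity within the same table.
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **properties):
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entity.update(properties)
    return entity

movies = [
    make_entity("MovieData", "Star Wars", Year=1977),
    make_entity("MovieData", "Fan Boys", Tagline="..."),  # different schema, same table
]
print(all({"PartitionKey", "RowKey", "Timestamp"} <= set(e) for e in movies))  # True
```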

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality

The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• It can also mean the limits of a single partition have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg

Messages – Queue name
• All messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn each replicated on Server 1, Server 2, and Server 3.)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
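The recommended reaction to '503 Server Busy' can be sketched as a retry wrapper with exponential backoff. `ServerBusyError` and `flaky_put` are illustrative stand-ins for a real storage call, not library types:

```python
# Sketch: retry a storage operation with exponential backoff plus jitter.
import random

class ServerBusyError(Exception):
    """Stand-in for a '503 Server Busy' response."""

def with_backoff(op, max_retries=5, base_delay=0.1, sleep=lambda s: None):
    delay = base_delay
    for _ in range(max_retries):
        try:
            return op()
        except ServerBusyError:
            sleep(delay + random.uniform(0, delay))  # jitter avoids lockstep retries
            delay *= 2                               # exponential growth
    raise ServerBusyError("gave up after retries")

calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusyError("503")
    return "ok"

print(with_backoff(flaky_put))  # 'ok' on the third attempt
```

The `sleep` parameter is injected so the schedule can be tested without waiting; production code would pass `time.sleep`.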

Partitions and Partition Ranges

Server A – Table = Movies [Min – Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

After the system splits the partition range across servers:

Server A – Table = Movies [Min – Comedy)

  PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
  Action                  | Fast & Furious       | …         | 2009
  Action                  | The Bourne Ultimatum | …         | 2007
  …                       | …                    | …         | …
  Animation               | Open Season 2        | …         | 2009
  Animation               | The Ant Bully        | …         | 2006

Server B – Table = Movies [Comedy – Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query may stop early and return a continuation token when it hits:
• The maximum of 1,000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query
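The consequence for client code is a paging loop. Here is a sketch against a simulated service that returns at most 1,000 rows per call; `query_page` stands in for the real table query API and is not a library function:

```python
# Simulated paged query: the service returns at most 1,000 rows plus a
# continuation token; the client must loop until the token is None.
PAGE_LIMIT = 1000

def query_page(rows, token):
    start = token or 0
    page = rows[start:start + PAGE_LIMIT]
    next_token = start + PAGE_LIMIT if start + PAGE_LIMIT < len(rows) else None
    return page, next_token

def query_all(rows):
    results, token = [], None
    while True:
        page, token = query_page(rows, token)
        results.extend(page)
        if token is None:   # always check the token, never assume one page
            break
    return results

rows = list(range(2500))
print(query_all(rows) == rows)  # True, via 3 round trips: 1000 + 1000 + 500
```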

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale – distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
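The hash-prefix advice can be sketched as follows; the bucket count and key format are illustrative choices, not a prescribed scheme:

```python
# Sketch of avoiding an "append only" hot partition: prefix the natural key
# with a stable hash bucket so sequential inserts spread across partitions.
import hashlib

def bucketed_partition_key(natural_key, buckets=16):
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"

# Sequential timestamps would all land in one partition; bucketed keys don't.
keys = [bucketed_partition_key(f"2010-12-07T10:00:{i:02d}") for i in range(100)]
prefixes = {k.split("_")[0] for k in keys}
print(len(prefixes) > 1)  # True: the load is spread across several buckets
```

The trade-off: range queries over the natural key now require one query per bucket, which is why the recap also suggests parallelizing queries.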

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
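The work ticket pattern mentioned above can be sketched with plain dicts standing in for the blob and queue services (all names here are illustrative, not real client APIs):

```python
# Work ticket pattern: large payloads go to blob storage; the queue message
# carries only a small reference (the "ticket"), staying under the 8 KB limit.
import json
import uuid

blob_store, queue = {}, []

def enqueue_work(payload: bytes):
    blob_name = f"work/{uuid.uuid4()}"
    blob_store[blob_name] = payload                  # big data to blobs
    queue.append(json.dumps({"blob": blob_name}))    # small ticket to the queue

def process_next():
    ticket = json.loads(queue.pop(0))
    return blob_store.pop(ticket["blob"])            # fetch payload, GC the blob

enqueue_work(b"x" * 100_000)   # far larger than the 8 KB message limit
data = process_next()
print(len(data))               # 100000
```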

Queue Terminology / Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1…Msg 4 to a queue; Worker Roles call GetMessage with a visibility timeout to receive messages, process them, and then call RemoveMessage to delete them.)

PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

• Consider a backoff polling approach
• Each empty poll increases the polling interval by 2x, up to a maximum
• A successful poll sets the interval back to 1

(Diagram: consumers C1 and C2 polling a queue, their intervals growing 1, 2, … toward 60 as polls come back empty.)
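A sketch of the truncated backoff schedule; the minimum and maximum intervals below are illustrative, matching the 1-to-60 range suggested by the diagram:

```python
# Truncated exponential backoff polling: every empty poll doubles the
# interval up to a cap; a successful poll resets it to the minimum.
def next_interval(current, got_message, minimum=1, maximum=60):
    if got_message:
        return minimum
    return min(current * 2, maximum)   # "truncated": never exceed the cap

interval = 1
history = []
for got in [False, False, False, False, False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
print(history)  # [2, 4, 8, 16, 32, 60, 60, 1, 2]
```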

Removing Poison Messages

(Diagram sequence – producers P1, P2 and consumers C1, C2 on queue Q:)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – the poison message is removed
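The dequeue-count check from step 12 can be sketched against a toy queue (visibility timeouts are not modelled; the `Queue` class here is illustrative, not the storage client):

```python
# Poison-message handling: the service tracks DequeueCount per message;
# once it crosses a threshold, the consumer deletes the message instead of
# processing it again.
MAX_DEQUEUE = 2

class Queue:
    def __init__(self, messages):
        self.messages = [{"body": m, "DequeueCount": 0} for m in messages]

    def get_message(self):
        msg = self.messages[0]
        msg["DequeueCount"] += 1   # the service increments this on each dequeue
        return msg

    def delete_message(self, msg):
        self.messages.remove(msg)

q = Queue(["poison"])
deleted_as_poison = False
for _ in range(5):
    msg = q.get_message()
    if msg["DequeueCount"] > MAX_DEQUEUE:
        q.delete_message(msg)      # steps 12-13 on the slide
        deleted_as_poison = True
        break
    # ...processing crashes here; the message becomes visible again...
print(deleted_as_poison)  # True, on the third dequeue
```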

Queues Recap

• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
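In this deck's .NET setting the Task Parallel Library plays this role; an equivalent data-parallel sketch in Python, with `process` standing in for any per-item unit of work:

```python
# Data parallelism on one multi-core instance: fan work out over a pool
# instead of leaving cores idle.
from concurrent.futures import ThreadPoolExecutor

def process(item):          # stand-in for a CPU- or I/O-bound unit of work
    return item * item

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, range(10)))   # map preserves input order
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

For genuinely CPU-bound Python work a process pool would sidestep the GIL; the structure of the fan-out is the same either way.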

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs

Performance and Cost

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip / minify JavaScript / minify CSS / minify images → compressed content.)
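A sketch of the compute-for-bandwidth trade, using gzip on repetitive HTML (the sample page below is made up for illustration):

```python
# Gzip text content before storing or serving it: typical web text,
# being repetitive, compresses several-fold.
import gzip

page = (b"<html><body>"
        + b"<p>Windows Azure for Research</p>" * 200
        + b"</body></html>")
compressed = gzip.compress(page)

print(len(compressed) < len(page) // 5)       # True: big savings on repetitive text
print(gzip.decompress(compressed) == page)    # True: the round trip is lossless
```

The compression ratio varies with content; images and already-compressed media gain nothing, which is why the slide treats them separately.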

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large-volume data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
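Query segmentation can be sketched as splitting FASTA records into partitions and concatenating results afterwards. A toy partition size of 2 is used below; the micro-benchmarks later in the deck suggest ~100 sequences per partition in practice:

```python
# Query segmentation for BLAST: split the input FASTA records into
# fixed-size partitions, query each partition in parallel, then merge.
def split_fasta(text, per_partition=100):
    records = [">" + r for r in text.strip().split(">") if r]
    return [records[i:i + per_partition]
            for i in range(0, len(records), per_partition)]

fasta = ">seq1\nMKV\n>seq2\nGAT\n>seq3\nTTA\n>seq4\nCCG\n>seq5\nAAA\n"
partitions = split_fasta(fasta, per_partition=2)
print(len(partitions))   # 3 partitions: 2 + 2 + 1 sequences

# The join step simply gathers per-partition results back together.
merged = [seq for part in partitions for seq in part]
print(len(merged))       # 5: every query is accounted for
```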

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.

AzureBLAST Task Flow

A simple split/join pattern: a splitting task fans out into BLAST tasks, which feed a merging task.

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST Architecture

(Diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work to a global dispatch queue consumed by Worker Roles; a Database Updating Role maintains the NCBI databases; Azure Tables hold the Job Registry; Azure Blobs hold the BLAST databases, temporary data, etc. Each job follows the splitting task → BLAST tasks → merging task flow.)

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal's Web Portal and Web Service handle job registration, backed by the Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

(Map: deployments of 50–62 extra-large instances each across the four datacenters.)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total

• All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
  ET = water volume evapotranspired (m3 s-1 m-2)
  Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m-2)
  cp = specific heat capacity of air (J kg-1 K-1)
  ρa = dry air density (kg m-3)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s-1)
  gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
  γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
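The Penman-Monteith form named above can be evaluated numerically; the sketch below uses illustrative placeholder values, not field data, and the function name is our own:

```python
# Numerical sketch of the Penman-Monteith form:
#   ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2450.0):
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

et = penman_monteith(delta=145.0,   # Pa/K   (illustrative)
                     Rn=400.0,      # W/m^2  (illustrative)
                     rho_a=1.2,     # kg/m^3
                     cp=1004.0,     # J/(kg K)
                     dq=1000.0,     # Pa     (illustrative)
                     ga=0.02,       # m/s    (illustrative)
                     gs=0.01)       # m/s    (illustrative)
print(et > 0)  # True: a positive ET flux for daytime-like inputs
```

The pipeline's real difficulty, as the slide notes, is not this arithmetic but estimating the conductivities ga and gs across a whole catchment.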

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage, which pulls from the source imagery download sites; download, reprojection, reduction 1, and reduction 2 queues connect the reprojection, derivation reduction, and analysis reduction stages; scientific results are downloaded by the scientists.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request flows to the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue consumed by GenericWorker (Worker Role) instances, which read Reprojection Data Storage and Swath Source Data Storage.)

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
Derivation reduction stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Containers

• Similar to a top-level folder
• Has unlimited capacity
• Can only contain BLOBs

Each container has an access level:
- Private: the default; requires the account key to access
- Full public read
- Public read only

Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, committed into Big.mpg
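The upload/commit flow above can be sketched as a toy model. This is an illustrative sketch only, not the Storage Client Library: `BlockBlob`, `put_block`, `put_block_list`, and `upload` are hypothetical names that mirror the Put Block / Put Block List semantics the slide describes.

```python
import base64

class BlockBlob:
    # Toy model of block-blob semantics: put_block stages a block,
    # put_block_list commits an ordered list of staged blocks into the blob.
    def __init__(self):
        self._blocks = {}      # block_id -> bytes (uncommitted blocks are GC'd after a week)
        self._committed = []   # ordered block ids forming the committed blob

    def put_block(self, block_id, data):
        self._blocks[block_id] = data

    def put_block_list(self, block_ids):
        assert len(block_ids) <= 50000, "a blob may have at most 50,000 blocks"
        self._committed = list(block_ids)   # commit in any order you choose

    def read(self):
        return b"".join(self._blocks[bid] for bid in self._committed)

def upload(blob, data, block_size=4):
    # Split the input into fixed-size blocks, stage each, then commit the list.
    ids = []
    for i in range(0, len(data), block_size):
        bid = base64.b64encode(f"{i:08d}".encode()).decode()  # block ids are opaque strings
        blob.put_block(bid, data[i:i + block_size])
        ids.append(bid)
    blob.put_block_list(ids)
    return ids

blob = BlockBlob()
upload(blob, b"hello azure blocks", block_size=4)
assert blob.read() == b"hello azure blocks"
```

Committing in a different order (as in the Big.mpg diagram) simply reorders the list passed to `put_block_list`.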


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
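The page-blob rules above (declared max size, 512-byte alignment, in-place committed writes, sparse zero-filled ranges) can be sketched as a toy model. `PageBlob`, `put_page`, and `read` are hypothetical names for illustration, not a real client API.

```python
PAGE = 512  # all page writes must align to 512-byte boundaries

class PageBlob:
    # Toy model of page-blob semantics: max size declared up front (Put Blob),
    # then in-place writes to aligned ranges (Put Page), committed immediately.
    def __init__(self, max_size):
        assert max_size % PAGE == 0
        self.max_size = max_size
        self.pages = {}  # offset -> 512-byte page; sparse, unwritten ranges read as zeros

    def put_page(self, offset, data):
        assert offset % PAGE == 0 and len(data) % PAGE == 0, "must align to 512-byte pages"
        assert offset + len(data) <= self.max_size
        for i in range(0, len(data), PAGE):
            self.pages[offset + i] = data[i:i + PAGE]  # in-place, committed immediately

    def read(self, offset, length):
        out = bytearray()
        for o in range(offset, offset + length, PAGE):
            out += self.pages.get(o, b"\x00" * PAGE)
        return bytes(out[:length])

blob = PageBlob(max_size=4096)             # like Put Blob with a declared max size
blob.put_page(1024, b"x" * 512)            # random write at an aligned offset
assert blob.read(1024, 512) == b"x" * 512
assert blob.read(0, 512) == b"\x00" * 512  # unwritten ranges read back as zeros
```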

BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists even when not mounted as a Page Blob

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table Name: Movies (entities: Star Wars, Star Trek, Fan Boys)
  Table Name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Can not:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
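The (PartitionKey, RowKey) pair acts as the table's only index, and entity schema may vary per row. A minimal in-memory sketch (the `Table` class and its methods are hypothetical, for illustration only):

```python
class Table:
    # Toy model: entities are dicts; (PartitionKey, RowKey) is the only index,
    # and schema may vary per entity (no fixed columns, no secondary indexes).
    def __init__(self):
        self._rows = {}  # (pk, rk) -> entity dict

    def insert(self, entity):
        for required in ("PartitionKey", "RowKey", "Timestamp"):
            assert required in entity, f"every entity must carry {required}"
        self._rows[(entity["PartitionKey"], entity["RowKey"])] = entity

    def point_query(self, pk, rk):
        # Most efficient query: a single lookup within one partition.
        return self._rows.get((pk, rk))

    def partition_scan(self, pk):
        # Scans one partition; cross-partition scans are the expensive case.
        return [e for (p, _), e in self._rows.items() if p == pk]

movies = Table()
movies.insert({"PartitionKey": "Action", "RowKey": "The Bourne Ultimatum",
               "Timestamp": "2010-12-07", "ReleaseDate": 2007})
movies.insert({"PartitionKey": "Action", "RowKey": "Fast & Furious",
               "Timestamp": "2010-12-07", "ReleaseDate": 2009, "Sequel": True})  # varying schema
assert movies.point_query("Action", "Office Space") is None
assert len(movies.partition_scan("Action")) == 2
```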

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)

Partition key is the unit of scale:
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

System load balances:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to be available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey:
• Entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

Blobs – Container name + Blob name:
• Every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg

Messages – Queue name:
• All messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.

Scalability Targets

Storage Account:
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition:
• Up to 500 transactions per second

Single Blob Partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Movies table (PartitionKey = Category, RowKey = Title, plus Timestamp and ReleaseDate):

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Initially: Server A serves Table = Movies [Min – Max].
After a split: Server A serves Movies [Min – Comedy); Server B serves Movies [Comedy – Max].

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics; reduce round trips

See http://www.microsoftpdc.com/2009 (SVC09) and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned when any of these limits is reached:
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
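The client-side consequence is a loop: keep re-issuing the query with the returned token until no token comes back. A minimal sketch of the server/client handshake (the `query_segment`/`query_all` names and the integer token are illustrative assumptions; real tokens are opaque):

```python
MAX_ROWS = 1000  # a single response returns at most 1,000 rows

def query_segment(rows, token=None):
    # Server-side sketch: return up to MAX_ROWS starting at `token`,
    # plus a continuation token when more rows remain.
    start = token or 0
    segment = rows[start:start + MAX_ROWS]
    next_token = start + MAX_ROWS if start + MAX_ROWS < len(rows) else None
    return segment, next_token

def query_all(rows):
    # Client-side loop: ALWAYS keep asking until no token is returned,
    # otherwise a "complete" result may silently be a partial one.
    out, token = [], None
    while True:
        segment, token = query_segment(rows, token)
        out.extend(segment)
        if token is None:
            return out

data = list(range(2500))
assert query_all(data) == data  # three round trips: 1000 + 1000 + 500 rows
```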

Tables Recap

• Efficient for frequently used queries; supports batch transactions; distributes load
• Select a PartitionKey and RowKey that help scale
  • Distribute load by using a hash etc. as a prefix
  • Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

Web Role → PutMessage → Queue [Msg 1 | Msg 2 | Msg 3 | Msg 4] → GetMessage (Timeout) / RemoveMessage → Worker Roles

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
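The lifecycle above — get a message, which becomes invisible for the timeout; delete it with the pop receipt; otherwise it reappears — can be modeled in a few lines. This is a toy sketch with hypothetical names (`Queue`, `get_message`, etc.), not the REST API itself:

```python
import uuid

class Queue:
    # Toy model of the queue message lifecycle: GetMessage hides a message for
    # `timeout` ticks and hands back a pop receipt; DeleteMessage needs that receipt.
    def __init__(self):
        self._messages = []  # dicts: text, visible_at, pop_receipt
        self._clock = 0

    def put_message(self, text):
        self._messages.append({"text": text, "visible_at": 0, "pop_receipt": None})

    def get_message(self, timeout):
        for m in self._messages:
            if m["visible_at"] <= self._clock:
                m["visible_at"] = self._clock + timeout  # invisible, NOT deleted
                m["pop_receipt"] = str(uuid.uuid4())
                return m["text"], m["pop_receipt"]
        return None

    def delete_message(self, pop_receipt):
        self._messages = [m for m in self._messages if m["pop_receipt"] != pop_receipt]

    def tick(self, n=1):
        self._clock += n

q = Queue()
q.put_message("work item 1")
text, receipt = q.get_message(timeout=30)
assert q.get_message(timeout=30) is None              # invisible while being processed
q.tick(31)                                            # consumer crashed; timeout elapsed
assert q.get_message(timeout=30)[0] == "work item 1"  # message reappears: at-least-once
```

This is why processing must be idempotent: a crashed consumer means the same message will be delivered again.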

Truncated Exponential Back Off Polling

• Consider a backoff polling approach
• Each empty poll increases the interval by 2x, up to a maximum
• A successful poll sets the interval back to 1
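The polling policy above reduces to one small function. A sketch (the function name and the 1-second/60-second bounds are illustrative assumptions):

```python
def next_interval(current, got_message, initial=1, maximum=60):
    # Truncated exponential backoff for queue polling:
    # empty poll -> double the sleep interval (capped at `maximum`),
    # successful poll -> reset to `initial`.
    if got_message:
        return initial
    return min(current * 2, maximum)

interval = 1
history = []
for got in [False, False, False, False, False, False, True]:
    interval = next_interval(interval, got)
    history.append(interval)
assert history == [2, 4, 8, 16, 32, 60, 1]  # truncated at 60, reset on success
```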

Removing Poison Messages

Scenario: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30-second visibility timeout.

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count now 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1 (dequeue count now 3)
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1: Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent
  • No need to deal with failures
• Do not rely on order
  • Invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages
  • Enforce a threshold on a message's dequeue count
• Messages > 8 KB?
  • Use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use message count to scale
  • Dynamically increase/reduce workers
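The poison-message rule from the recap — enforce a threshold on the dequeue count — fits in a small consumer loop. A sketch under stated assumptions: the threshold of 3 and the `process`/`handler` names are hypothetical, and the queue service is stubbed as a list of messages that already carry their `DequeueCount`:

```python
MAX_DEQUEUE = 3  # threshold; beyond this the message is considered poison

def process(queue_messages, handler):
    # Consumer sketch: each message carries a DequeueCount maintained by the
    # queue service. A message that keeps reappearing keeps crashing its handler,
    # so divert it instead of retrying forever.
    poison, done = [], []
    for msg in queue_messages:
        if msg["DequeueCount"] > MAX_DEQUEUE:
            poison.append(msg["text"])     # delete + divert for offline inspection
            continue
        try:
            handler(msg["text"])
            done.append(msg["text"])       # DeleteMessage on success
        except Exception:
            pass                           # message becomes visible again later
    return done, poison

def handler(text):
    if text == "malformed":
        raise ValueError("cannot parse")

msgs = [{"text": "ok-1", "DequeueCount": 1},
        {"text": "malformed", "DequeueCount": 4},   # has already failed repeatedly
        {"text": "ok-2", "DequeueCount": 2}]
done, poison = process(msgs, handler)
assert done == ["ok-1", "ok-2"] and poison == ["malformed"]
```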

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
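The data-parallelism idea (the slide's .NET 4 Task Parallel Library example) translates directly to a pool-based map. This sketch uses Python's `concurrent.futures` rather than the TPL, purely to illustrate the pattern; `crush` is a hypothetical stand-in task:

```python
from concurrent.futures import ThreadPoolExecutor

def crush(image_name):
    # Stand-in for a per-item task such as recompressing one PNG.
    return f"{image_name}.crushed"

images = [f"tile-{i}.png" for i in range(8)]

# Data parallelism: the same operation applied across a collection of items,
# with the pool deciding how many run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    crushed = list(pool.map(crush, images))

assert crushed == [f"tile-{i}.png.crushed" for i in range(8)]
```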

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network IO-intensive, storage IO-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage IO-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of idling VMs

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
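To see why step 1 pays off, compress a typical repetitive HTML payload with the standard-library `gzip` module (the sample HTML is illustrative):

```python
import gzip

# Markup is highly repetitive, so it compresses dramatically; browsers
# decompress on the fly when the response carries "Content-Encoding: gzip".
html = (b"<html><body>"
        + b"<p>The quick brown fox jumps over the lazy dog.</p>" * 200
        + b"</body></html>")

compressed = gzip.compress(html)

assert gzip.decompress(compressed) == html   # lossless round trip
assert len(compressed) < len(html) // 10     # >10x smaller for this payload
print(f"{len(html)} bytes -> {len(compressed)} bytes")
```

Fewer bytes over the wire means lower bandwidth charges and faster page loads, at the cost of a little CPU per response.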

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
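The query-segmentation pattern above is a plain split/join. A minimal sketch, with `blast_task` as a hypothetical stand-in for running NCBI-BLAST on one partition (the function names are assumptions, not AzureBLAST's actual code):

```python
def split_input(sequences, partition_size):
    # Query segmentation: split the input sequences into partitions;
    # each partition is an independent BLAST task (pleasingly parallel).
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running BLAST on one partition against the full database.
    return [(seq, f"hits-for-{seq}") for seq in partition]

def merge(results):
    # Join step: concatenate per-partition results in partition order.
    return [hit for partition_result in results for hit in partition_result]

queries = [f"seq{i}" for i in range(10)]
partitions = split_input(queries, partition_size=4)
results = merge(blast_task(p) for p in partitions)
assert len(partitions) == 3
assert [q for q, _ in results] == queries  # nothing lost or reordered
```

In the real system the per-partition tasks are dispatched through a queue to worker roles; the micro-benchmarks below are what drove the choice of partition size.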

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting time in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine; job registry kept in an Azure Table; a global dispatch queue feeds the workers
• Worker roles: execute the splitting task, the BLAST tasks, and the merging task
• Database updating Role: keeps the NCBI databases current
• Azure Blob storage: BLAST databases, temporary data, etc.

AzureBLAST Job Portal

• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

Demonstration

R. palustris as a platform for H2 production
(Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups — this is an update domain:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and then the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.
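The Penman-Monteith formula is a direct arithmetic evaluation once the inputs are known. A sketch, term-by-term from the equation above; the numeric inputs are illustrative placeholders, not field measurements:

```python
def penman_monteith(delta, R_n, rho_a, c_p, dq, g_a, g_s, lam_v, gamma=66.0):
    # ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
    return (delta * R_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lam_v)

# Illustrative (not field-measured) values in the slide's units:
et = penman_monteith(delta=145.0,   # Δ  (Pa K-1)
                     R_n=400.0,     # Rn (W m-2)
                     rho_a=1.2,     # ρa (kg m-3)
                     c_p=1004.0,    # cp (J kg-1 K-1)
                     dq=1000.0,     # δq (Pa)
                     g_a=0.02,      # ga (m s-1)
                     g_s=0.01,      # gs (m s-1)
                     lam_v=2260.0)  # λv (J/g)
assert et > 0.0
```

In MODISAzure this scalar computation is the easy part; the hard part is reducing terabytes of imagery and sensor data down to consistent per-cell inputs.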

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Pipeline flow: scientists submit requests through the AzureMODIS Service web role portal (request queue) → data collection stage (download queue; source imagery download sites, source metadata) → reprojection stage (reprojection queue) → derivation reduction stage (reduction 1 queue) → analysis reduction stage (reduction 2 queue) → scientific results download.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the web role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated worker role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

Flow: a <PipelineStage> request arrives at the MODISAzure Service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a worker role (GenericWorker)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

Flow: the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker (worker role) instances dequeue the tasks and read/write <Input>Data Storage.

Example Pipeline Stage: Reprojection Service

A reprojection request flows through the Service Monitor (worker role), which persists ReprojectionJobStatus via the job queue, parses and persists ReprojectionTaskStatus, and dispatches to the task queue; GenericWorker (worker role) instances execute the tasks against reprojection data storage and swath source data storage.

• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (requests enter via the AzureMODIS Service web role portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Two Types of Blobs Under the Hood

• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob

• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Blocks

• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg uploaded as numbered blocks 1–8, committed in order into the final blob.)
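The block-commit flow above can be sketched with a small in-memory model. This is an illustrative stand-in, not the real client library: `BlockBlobStore`, `put_block`, and `put_block_list` mimic the semantics of the corresponding REST operations under toy assumptions.

```python
import base64

class BlockBlobStore:
    """Toy model of block-blob semantics: stage blocks, then commit a list."""
    def __init__(self):
        self.uncommitted = {}   # (blob, block_id) -> bytes; GC'd if never committed
        self.blobs = {}         # blob_name -> committed bytes

    def put_block(self, blob_name, block_id, data):
        # Blocks are staged independently and in any order.
        self.uncommitted[(blob_name, block_id)] = data

    def put_block_list(self, blob_name, block_ids):
        # The commit order of the block list, not the upload order,
        # determines the final blob contents.
        self.blobs[blob_name] = b"".join(
            self.uncommitted[(blob_name, bid)] for bid in block_ids)

def upload_in_blocks(store, blob_name, payload, block_size=4):
    """Split a payload into fixed-size blocks, stage them, then commit."""
    block_ids = []
    for i in range(0, len(payload), block_size):
        # Block IDs are opaque and must be of uniform length; base64 is typical.
        bid = base64.b64encode(f"block-{i:08d}".encode()).decode()
        store.put_block(blob_name, bid, payload[i:i + block_size])
        block_ids.append(bid)
    store.put_block_list(blob_name, block_ids)
    return block_ids
```

Because the commit is a separate step, blocks can be re-uploaded or reordered before `put_block_list` finalizes the blob.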


Pages

• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – Creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – Returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – Unmounts the drive and frees up the drive letter
• Snapshot Drive – Allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – Provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

• Account: MovieData
  • Table "Movies" – entities: Star Wars, Star Trek, Fan Boys
  • Table "Customers" – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational

Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)

The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

System load balancing:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey
  • Entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name
  • Every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg

• Messages – Queue name
  • All messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Example: a Movies table with PartitionKey = Category and RowKey = Title.

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Initially a single server holds the whole range:
• Server A: Table = Movies [Min – Max]

Under load, the system splits the partition range across servers:
• Server A: Table = Movies [Min – Comedy)
• Server B: Table = Movies [Comedy – Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token when any of these limits is hit:
• Maximum of 1000 rows in a response
• The end of a partition range boundary is reached
• Maximum of 5 seconds to execute the query
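The client-side loop this implies can be sketched as follows. `query_segment` is a hypothetical callable standing in for one REST round trip; it is not a real SDK function.

```python
def query_all(query_segment):
    """Drain a segmented table query by following continuation tokens.

    `query_segment(token)` (an assumed stand-in for one query round trip)
    returns (rows, next_token): rows holds at most 1000 entities, and
    next_token is None once the result set is exhausted.
    """
    rows, token = query_segment(None)
    results = list(rows)
    while token is not None:
        # A token can arrive even with an empty page of rows,
        # e.g. when the query stopped at a partition range boundary.
        rows, token = query_segment(token)
        results.extend(rows)
    return results
```

The key point from the slide: the loop condition is the token, not the row count — an empty page does not mean the query is done.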

Tables Recap

• Select a PartitionKey and RowKey that help scale
  • Distribute load by using a hash, etc., as a prefix
  • Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
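The "hash as a prefix" advice can be sketched in a few lines. The bucket count and key format here are illustrative choices, not anything prescribed by the storage service.

```python
import hashlib

def scaled_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix a natural key with a stable hash bucket.

    Lexicographically adjacent keys (e.g. timestamps in an append-only
    pattern) would all land in the last partition range; a hash prefix
    spreads them across `buckets` ranges instead.
    """
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: range queries over the natural key now need one query per bucket, which is why the recap also says to parallelize queries.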

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
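The work ticket pattern mentioned above can be sketched with in-memory stand-ins: the dict and `queue.Queue` below are illustrative substitutes for blob storage and an Azure queue, and the ticket format is an assumption.

```python
import json
import queue
import uuid

blob_store = {}              # stand-in for blob storage
work_queue = queue.Queue()   # stand-in for a queue with an 8 KB message limit

def submit_job(payload: bytes) -> str:
    """Producer: park the large payload in a blob, enqueue a small ticket.

    Only a reference crosses the queue, so the payload can be far larger
    than the 8 KB message limit."""
    blob_name = f"jobs/{uuid.uuid4()}"
    blob_store[blob_name] = payload
    work_queue.put(json.dumps({"blob": blob_name}))  # tiny work ticket
    return blob_name

def process_one() -> int:
    """Consumer: dequeue a ticket, fetch the real data from blob storage."""
    ticket = json.loads(work_queue.get())
    data = blob_store[ticket["blob"]]
    return len(data)  # placeholder for real processing
```

This is also why roles stay decoupled: producer and consumer share only the queue and blob namespace, never direct references to each other.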

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to place messages (Msg 1–4) on the queue; a Worker Role calls GetMessage with a visibility timeout to retrieve a message, and RemoveMessage to delete it once processing succeeds.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

• Consider a backoff polling approach
• Each empty poll increases the interval by 2x
• A successful poll resets the interval back to 1
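The backoff policy above can be sketched directly. `get_message` is a hypothetical callable returning a message or None; a real worker would sleep for the interval between polls, which this sketch only records.

```python
def poll_with_backoff(get_message, base=1.0, cap=60.0, max_polls=100):
    """Truncated exponential backoff polling.

    Each empty poll doubles the sleep interval up to `cap` (the
    truncation); a successful poll resets it to `base`. Returns the
    sequence of (interval, message) pairs for illustration; a real
    worker would time.sleep(interval) before each poll."""
    interval = base
    history = []
    for _ in range(max_polls):
        msg = get_message()
        history.append((interval, msg))
        if msg is None:
            interval = min(interval * 2, cap)  # empty poll: back off
        else:
            interval = base                    # got work: poll eagerly again
    return history
```

The cap matters: without it a long quiet period would leave the worker polling so rarely that new work sits unprocessed for minutes.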

Removing Poison Messages

The three diagrams walk one scenario forward, with producers P1, P2 and consumers C1, C2; each message carries a DequeueCount.

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (DequeueCount = 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1 (DequeueCount = 3)
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1: Delete(Q, msg 1)
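The dequeue-count threshold from the walkthrough can be sketched as follows; the `Message` class and dead-letter list are toy stand-ins, and the threshold of 2 mirrors step 12 above.

```python
MAX_DEQUEUE = 2   # from the walkthrough: DequeueCount > 2 means poison

class Message:
    """Toy message; Azure queues track a per-message dequeue count."""
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def handle(msg, process, dead_letter):
    """Process one dequeued message, diverting poison messages.

    `process` may raise (simulating a crashing consumer). After enough
    failed attempts the message is deleted from the queue and parked in
    `dead_letter` for diagnosis, instead of being retried forever."""
    msg.dequeue_count += 1
    if msg.dequeue_count > MAX_DEQUEUE:
        dead_letter.append(msg)   # DeleteMessage + keep a copy for inspection
        return "poisoned"
    try:
        process(msg.body)
        return "done"             # DeleteMessage would go here
    except Exception:
        return "retry"            # message becomes visible again after timeout
```

Without this guard, a single malformed message would be redelivered forever, crashing every worker that touches it.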

Queues Recap

• Make message processing idempotent
  • No need to deal with failures
• Do not rely on order
  • Invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages
  • Enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use message count to scale
  • Dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• A common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
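As a language-neutral illustration of the data-parallelism idea (the slide names the .NET Task Parallel Library; this sketch uses Python's stdlib pool instead, with a made-up `score` workload):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def score(chunk):
    """Placeholder for per-item work (assumed workload, not from the deck)."""
    return sum(x * x for x in chunk)

def parallel_score(chunks, workers=None):
    """Data parallelism: apply the same function to independent chunks.

    The pool is sized to the core count by default, mirroring the
    guidance above that active workers should not exceed cores."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, chunks))
```

Task parallelism would instead submit *different* functions to the same pool; the pool-sizing concern is identical.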

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Diagram: uncompressed content passes through Gzip, minified JavaScript, minified CSS, and minified images to become compressed content.)
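The "gzip all output content" step can be sketched with the standard library; the content-negotiation check on `Accept-Encoding` is the usual web convention, shown here as a minimal helper rather than any particular framework's API.

```python
import gzip

def maybe_gzip(body: bytes, accept_encoding: str):
    """Gzip a response body when the client advertises support.

    Returns (payload, content_encoding): content_encoding is "gzip"
    when compression was applied, else None. Browsers that send
    "Accept-Encoding: gzip" decompress transparently on the fly."""
    if "gzip" in accept_encoding.lower():
        return gzip.compress(body), "gzip"
    return body, None
```

For typical repetitive HTML/JSON the payload shrinks substantially, which is where the bandwidth (and storage, if cached compressed) savings come from.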

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple Split/Join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task.

• Leverage the multi-core capability of one instance
  • Argument "-a" of NCBI-BLAST
  • 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partition → load imbalance
  • Small partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: use test runs to profile, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small → repeated computation
  • Too large → unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instance sizes
  • Primarily due to the memory capability
• Task size/instance size vs. cost
  • The extra-large instance generated the best and the most economical throughput
  • It fully utilizes the resource
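The query-segmentation split/join can be sketched generically; the functions below are illustrative (not AzureBLAST's actual code), with the 100-sequences-per-partition figure taken from the micro-benchmarks above.

```python
def partition_queries(sequences, per_partition=100):
    """Split step of the query-segmentation pattern.

    Each fixed-size partition becomes one independent BLAST task;
    100 sequences per partition was the benchmarked sweet spot."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def merge_results(partial_results):
    """Join step: concatenate per-partition hit lists in task order."""
    return [hit for part in partial_results for hit in part]
```

Too few partitions risks load imbalance; too many amplifies per-task startup and transfer overhead — exactly the granularity trade-off the task-flow slide describes.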

AzureBLAST

(Diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to pools of Worker instances. An Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. A Database Updating Role keeps the NCBI databases current. Tasks follow the split/join pattern: one splitting task, many BLAST tasks, and a merging task.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the job registry table:
• Fault tolerance – avoid in-memory state

(Diagram: the Job Portal fronts the Web Portal and Web Service; job registration flows to the Job Scheduler and Scaling Engine, which record jobs in the Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

(Diagram: per-deployment instance allocations of 50 and 62 instances across the datacenters.)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, the real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

  3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
  3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
  3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
  3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
  3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
  3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

  3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
  3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
  3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Data Center: 30,976 tasks were completed, and then the job was killed.
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the Fault Domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stomata (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
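The Penman-Monteith formula transcribes directly into code. The function below is a term-by-term transcription of the slide's equation; any input values used with it are illustrative, not data from the project.

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2257.0):
    """Penman-Monteith evapotranspiration:

        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

    Units as on the slide: Δ, γ in Pa/K; Rn in W/m²; ρa in kg/m³;
    cp in J/(kg·K); δq in Pa; ga, gs in m/s; λv in J/g (2257 J/g is
    a standard value for water, used here as an assumed default)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The hard part in MODISAzure is not this arithmetic but supplying Rn, ga, and gs per pixel, which is exactly the big data reduction the pipeline performs.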

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

• Data collection (map) stage
  • Downloads requested input tiles from NASA FTP sites
  • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
• Reprojection (map) stage
  • Converts source tile(s) to intermediate-result sinusoidal tiles
  • Simple nearest-neighbor or spline algorithms
• Derivation reduction stage
  • First stage visible to the scientist
  • Computes ET in our initial use
• Analysis reduction stage
  • Optional second stage visible to the scientist
  • Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Data Collection Stage, which pulls from source imagery download sites via a Download Queue; the Reprojection Queue, Reduction 1 Queue, and Reduction 2 Queue drive the Reprojection, Derivation Reduction, and Analysis Reduction stages; scientific results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues on the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus, then dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker (Worker Role) instances dequeue from the <PipelineStage> Task Queue filled by the Service Monitor (Worker Role), and read/write <Input> Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request arrives at the Job Queue. The Service Monitor (Worker Role) persists ReprojectionJobStatus – each entity specifies a single reprojection job request – and parses it into ReprojectionTaskStatus entities – each specifying a single reprojection task, i.e., a single tile – then dispatches to the Task Queue for GenericWorker (Worker Role) instances. The ScanTimeList table is queried for the list of satellite scan times that cover a target tile; the SwathGranuleMeta table is queried for geo-metadata (e.g., boundaries) for each swath tile. Workers read Swath Source Data Storage and write Reprojection Data Storage.)

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (as laid out on the pipeline diagram):

  Stage                | Data                              | Compute                     | Cost
  Data collection      | 400-500 GB, 60K files, 10 MB/sec  | 11 hours, <10 workers       | $50 upload, $450 storage
  Reprojection         | 400 GB, 45K files                 | 3500 hours, 20-100 workers  | $420 CPU, $60 download
  Derivation reduction | 5-7 GB, 55K files                 | 1800 hours, 20-100 workers  | $216 CPU, $1 download, $6 storage
  Analysis reduction   | <10 GB, ~1K files                 | 1800 hours, 20-100 workers  | $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site

http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope

http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

Page 38: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Blocks
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: "Big.mpg" uploaded as blocks 1 6 8 3 5 4 7 2 in arbitrary order, then committed as Big.mpg)

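The block semantics above can be sketched in a few lines of pure Python. This is a toy model, not the Azure SDK; `BlockBlob`, `put_block`, and `put_block_list` are hypothetical names that echo the REST operations:

```python
class BlockBlob:
    """Toy model of Azure block-blob semantics: upload blocks in any
    order, then commit a block list to materialize the blob."""

    def __init__(self):
        self.uncommitted = {}  # block id -> bytes
        self.committed = []    # ordered ids, chosen at commit time

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit order is chosen at commit time, not upload time.
        missing = [b for b in block_ids if b not in self.uncommitted]
        if missing:
            raise KeyError("unknown blocks: %r" % missing)
        self.committed = list(block_ids)

    def content(self):
        return b"".join(self.uncommitted[b] for b in self.committed)


blob = BlockBlob()
# Blocks may arrive in any order...
for block_id, chunk in [("b2", b" world"), ("b0", b"he"), ("b1", b"llo")]:
    blob.put_block(block_id, chunk)
# ...the committed block list determines the final blob.
blob.put_block_list(["b0", "b1", "b2"])
print(blob.content())
```

Inserting, updating, or removing a block amounts to committing a new block list that reuses the existing block IDs.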

Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size

BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
  http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

(Diagram: an Account "MovieData" contains Tables; Table "Movies" holds entities Star Wars, Star Trek, Fan Boys; Table "Customers" holds entities Brian H Prince, Jason Argonaut, Bill Gates)

Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale.

System load balances:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single partition limits have been reached

Partition Keys In Each Abstraction

• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId)   RowKey (RowKind)        Name           CreditCardNumber      OrderTotal
  1                           Customer-John Smith     John Smith     xxxx-xxxx-xxxx-xxxx
  1                           Order – 1                                                    $3512
  2                           Customer-Bill Johnson   Bill Johnson   xxxx-xxxx-xxxx-xxxx
  2                           Order – 3                                                    $1000

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name   Blob Name
  image            annarborbighouse.jpg
  image            foxboroughgillette.jpg
  video            annarborbighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

  Queue      Message
  jobs       Message 1
  jobs       Message 2
  workflow   Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3)

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

A single server initially serves the entire table:

Server A: Table = Movies [Min – Max]

  PartitionKey (Category)   RowKey (Title)             Timestamp   ReleaseDate
  Action                    Fast & Furious             …           2009
  Action                    The Bourne Ultimatum       …           2007
  …                         …                          …           …
  Animation                 Open Season 2              …           2009
  Animation                 The Ant Bully              …           2006
  …                         …                          …           …
  Comedy                    Office Space               …           1999
  …                         …                          …           …
  SciFi                     X-Men Origins: Wolverine   …           2009
  …                         …                          …           …
  War                       Defiance                   …           2008

As traffic grows, the partition range is split across servers:

Server A: Table = Movies [Min – Comedy)

  PartitionKey (Category)   RowKey (Title)             Timestamp   ReleaseDate
  Action                    Fast & Furious             …           2009
  Action                    The Bourne Ultimatum       …           2007
  …                         …                          …           …
  Animation                 Open Season 2              …           2009
  Animation                 The Ant Bully              …           2006

Server B: Table = Movies [Comedy – Max]

  PartitionKey (Category)   RowKey (Title)             Timestamp   ReleaseDate
  Comedy                    Office Space               …           1999
  …                         …                          …           …
  SciFi                     X-Men Origins: Wolverine   …           2009
  …                         …                          …           …
  War                       Defiance                   …           2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned when any of the following applies:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
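Handling continuation tokens is just a drain loop. A minimal sketch, with `fetch_page` standing in for a real table query (the Azure storage client exposes tokens differently; names here are hypothetical):

```python
def query_all(fetch_page):
    """Drain a paged query: keep calling while a continuation
    token is returned (responses cap out at 1000 rows)."""
    rows, token = [], None
    while True:
        page, token = fetch_page(token)
        rows.extend(page)
        if token is None:
            return rows


# Toy server: returns at most 1000 rows per call plus a token.
data = list(range(2500))

def fetch_page(token):
    start = token or 0
    page = data[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(data) else None
    return page, nxt


assert query_all(fetch_page) == data
```

A query that stops after the first response silently drops everything past the first page or partition boundary, which is why the slide says "seriously".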

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns – distribute by using a hash etc. as a prefix
• Always handle continuation tokens – expect continuation tokens for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server Busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
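The "distribute by using a hash as a prefix" tip can be sketched as a tiny key helper (a hypothetical illustration, not part of any Azure library): prefixing a naturally monotonic key, such as a timestamp, with a stable hash bucket spreads inserts across partitions instead of hammering the last one.

```python
import hashlib


def scaled_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix a natural key with a stable hash bucket so that
    'append only' keys (e.g. timestamps) spread across partitions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "%02d-%s" % (bucket, natural_key)


# Consecutive timestamps land in different partitions, yet the
# mapping is deterministic, so point lookups still work.
keys = [scaled_partition_key("2010-12-%02dT10:00" % d) for d in range(1, 8)]
```

The trade-off: range scans over the natural key now require one query per bucket, which is the same "parallelize queries" advice from the key-selection slide.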

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to enqueue Msg 1–4; a Worker Role calls GetMessage with a visibility timeout, processes the message, then calls RemoveMessage)

PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
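The visibility-timeout behavior behind GetMessage can be modeled in a few lines of Python. This is a toy simulation with a manual clock; `VisibilityQueue` and its methods are hypothetical names, not an Azure API:

```python
import heapq


class VisibilityQueue:
    """Toy model of GetMessage(visibility_timeout): a dequeued message
    becomes invisible for `timeout` ticks; if never deleted, it
    reappears, which is why delivery is at-least-once."""

    def __init__(self):
        self.heap = []   # (time message becomes visible, message)
        self.clock = 0   # manual clock, advanced by the caller

    def put(self, msg):
        heapq.heappush(self.heap, (self.clock, msg))

    def get(self, timeout):
        if not self.heap or self.heap[0][0] > self.clock:
            return None                      # nothing visible yet
        _, msg = heapq.heappop(self.heap)
        # Schedule automatic reappearance unless delete() is called.
        heapq.heappush(self.heap, (self.clock + timeout, msg))
        return msg

    def delete(self, msg):
        self.heap = [(t, m) for t, m in self.heap if m != msg]
        heapq.heapify(self.heap)


q = VisibilityQueue()
q.put("msg-1")
assert q.get(timeout=30) == "msg-1"   # msg-1 is now invisible
assert q.get(timeout=30) is None      # nothing else to hand out
q.clock = 30                          # visibility timeout expires
assert q.get(timeout=30) == "msg-1"   # msg-1 reappears
```

Calling `delete` with the pop receipt before the timeout expires is what turns at-least-once into effectively-once for a healthy consumer.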

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1

(Diagram: consumers C1 and C2 polling an empty queue at lengthening intervals)
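The truncated exponential back-off rule above fits in one function. A minimal sketch (the base interval and cap are illustrative assumptions):

```python
def next_interval(current, got_message, base=1.0, cap=60.0):
    """Truncated exponential back-off polling: an empty poll doubles
    the sleep interval (up to a cap); a successful poll resets it."""
    if got_message:
        return base
    return min(current * 2, cap)


# Simulate a run: three empty polls, one hit, one empty poll.
interval, history = 1.0, []
for got_message in [False, False, False, True, False]:
    interval = next_interval(interval, got_message)
    history.append(interval)
# history grows 2, 4, 8, snaps back to 1 on success, then 2 again
```

In a real worker loop the interval would feed a `time.sleep` between GetMessage calls, which is what keeps idle pollers from burning transactions against an empty queue.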

Removing Poison Messages

(Diagram: producers P1, P2 and consumers C1, C2 on a queue; each message carries a dequeue count)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (continued)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (continued)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
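The dequeue-count threshold from step 12 can be sketched as a small consumer loop (a toy model of the pattern, not the Azure queue API; names are hypothetical):

```python
import collections


class ToyQueue:
    """Toy queue that tracks per-message dequeue counts; visibility
    timeouts are abstracted away by simply re-enqueueing."""

    def __init__(self, items):
        self.items = collections.deque(items)
        self.dequeue_count = collections.Counter()

    def get(self):
        msg = self.items.popleft()
        self.dequeue_count[msg] += 1
        return msg

    def requeue(self, msg):
        self.items.append(msg)   # stands in for timeout expiry


MAX_DEQUEUE = 2

def consume(queue, handler, poison_bin):
    while queue.items:
        msg = queue.get()
        if queue.dequeue_count[msg] > MAX_DEQUEUE:
            poison_bin.append(msg)   # sideline the poison message
            continue
        try:
            handler(msg)             # success: message is gone
        except Exception:
            queue.requeue(msg)       # failure: it becomes visible again


poison = []
q = ToyQueue(["ok-1", "bad", "ok-2"])

def handler(msg):
    if msg == "bad":
        raise ValueError(msg)

consume(q, handler, poison)
```

Without the threshold, "bad" would cycle forever; with it, the message is pulled aside (in practice into a dead-letter table or blob) after two failed attempts.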

Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order processing
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Use a blob to store message data, with a reference in the message – for messages > 8 KB; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
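The deck points at the .NET 4 Task Parallel Library; the same two patterns look like this in Python's standard `concurrent.futures` (an analogous sketch, not the TPL itself):

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: apply the same function to many inputs in parallel.
def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, range(8)))   # order is preserved

# Task parallelism: run distinct tasks concurrently and join results.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, range(100))
    biggest = pool.submit(max, range(100))
    results = (total.result(), biggest.result())
```

Either way the pool, not the application code, decides how many workers run at once, which is the "let the scheduler pick the thread count" point from the slide.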

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places.
Sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
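Point 1 is easy to see with Python's standard `gzip` module on a page of repetitive markup (sizes here are illustrative, not a benchmark):

```python
import gzip

# Repetitive HTML, as typical page output tends to be.
html = b"<html><body>" + b"<p>hello azure</p>" * 200 + b"</body></html>"

compressed = gzip.compress(html)

# Markup compresses dramatically, and the round trip is lossless;
# the browser does the decompression on the fly.
ratio = len(compressed) / len(html)
restored = gzip.decompress(compressed)
```

The CPU spent compressing is exactly the "trade compute costs for storage size" item: a few milliseconds per response in exchange for a fraction of the bandwidth bill.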

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
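The query-segmentation step ("split the input sequences") amounts to chunking FASTA records into independent partitions. A plain-Python sketch (`split_fasta` is a hypothetical helper; AzureBLAST's actual splitter may differ):

```python
def split_fasta(text, seqs_per_partition):
    """Query segmentation: group FASTA records into fixed-size
    partitions that can be BLASTed independently and merged later."""
    records, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return [records[i:i + seqs_per_partition]
            for i in range(0, len(records), seqs_per_partition)]


fasta = ">s1\nMKV\n>s2\nGTA\n>s3\nACD\n>s4\nWYF\n>s5\nLLM\n"
parts = split_fasta(fasta, 2)   # 5 sequences -> partitions of 2, 2, 1
```

Because every sequence is compared independently against the database, each partition becomes one work-ticket message, and the merge step is a simple concatenation of per-partition outputs.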

AzureBLAST Task-Flow
A simple Split/Join pattern: a Splitting task fans out to parallel BLAST tasks, followed by a Merging task.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition → load imbalance
• Small partition → unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource

AzureBLAST

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, persisting jobs in the Job Registry (Azure Table) and dispatching a Splitting task, parallel BLAST tasks, and a Merging task through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data, refreshed by a Database updating Role)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID.

The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal fronts the Web Portal / Web Service; job registration feeds the Job Scheduler and Scaling Engine, with jobs persisted in the Job Registry)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each will be submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Diagram: per-deployment instance counts of 50 and 62 across the 8 deployments)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working-instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record sequence looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~30 mins; ~6 nodes in one group

Surviving Storage Failures

West Europe Data Center: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish Proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / (λv·(Δ + γ·(1 + ga/gs)))

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

• Lots of inputs; big data reduction
• Some of the inputs are not so simple
• Estimating resistance/conductivity across a catchment can be tricky
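The Penman-Monteith formula as defined above computes directly once the inputs are in hand; that is the per-pixel kernel of the reduction stage. A sketch with illustrative input values only (not a calibrated run; `penman_monteith` is a hypothetical helper):

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """ET = (Delta*Rn + rho_a*c_p*dq*g_a)
           / (lambda_v * (Delta + gamma*(1 + g_a/g_s))).
    Symbols and units as defined on the slide (lambda_v in J/g)."""
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = lambda_v * (delta + gamma * (1.0 + g_a / g_s))
    return numerator / denominator


# Plausible mid-latitude daytime values, purely for illustration.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2,
                     c_p=1005.0, dq=1000.0, g_a=0.02, g_s=0.01)
```

The arithmetic is trivial; the hard part, as the slide says, is estimating the conductivities ga and gs across a catchment from imagery and sensor data.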

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; work flows through the Download, Reprojection, Reduction 1, and Reduction 2 queues across the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages, drawing on source imagery download sites and source metadata, with scientific results available for download)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read from <Input>Data Storage)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters via the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances work against Reprojection Data Storage and Swath Source Data Storage)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data Collection Stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
Derivation Reduction Stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; each Put Page write may be up to 4 MB in size
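The 512-byte alignment rule is easy to get wrong. A minimal Python sketch (illustrative only, not the Azure SDK) of expanding a byte range and padding a payload to page boundaries:

```python
PAGE_SIZE = 512  # page blob writes must align to 512-byte page boundaries

def align_range(offset: int, length: int):
    """Expand a byte range so it starts and ends on 512-byte page boundaries."""
    start = (offset // PAGE_SIZE) * PAGE_SIZE
    end = -(-(offset + length) // PAGE_SIZE) * PAGE_SIZE  # ceil to next boundary
    return start, end - start

def pad_payload(data: bytes) -> bytes:
    """Zero-pad a payload so its length is a multiple of 512, as Put Page requires."""
    remainder = len(data) % PAGE_SIZE
    return data if remainder == 0 else data + b"\x00" * (PAGE_SIZE - remainder)
```

A 700-byte write starting at offset 100, for example, must be expanded to cover bytes 0 through 1023.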

BLOB Leases
• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists even when not mounted, since it is stored as a Page Blob

Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive

BLOB Guidance
• Manage connection strings/keys in cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table "Movies": Star Wars, Star Trek, Fan Boys
  Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables
• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
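A sketch of what the three mandatory properties look like on an entity, using a plain Python dict as a stand-in for the .NET/REST entity representation (property names are from the slide; the movie fields are made-up illustration):

```python
REQUIRED_PROPERTIES = {"PartitionKey", "RowKey", "Timestamp"}

def missing_required(entity: dict) -> list:
    """Return the required Azure Table properties absent from an entity."""
    return sorted(REQUIRED_PROPERTIES - entity.keys())

movie = {
    "PartitionKey": "Action",                 # groups entities served together
    "RowKey": "The Bourne Ultimatum",         # unique within the partition
    "Timestamp": "2010-12-07T00:00:00Z",      # maintained by the service
    "ReleaseDate": 2007,                      # schema can vary per entity
}
```

Note that beyond the three required properties, each entity in the same table may carry a different set of columns.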

Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)

Every data object has a partition key
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

System load balancing and "Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• "Server Busy" can also mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
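The recommended exponential backoff on '503 Server Busy' can be sketched as follows (a hedged illustration, not the Storage Client Library's built-in retry policy; `Busy` stands in for the 503 response):

```python
def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Truncated exponential backoff schedule for retrying 'Server Busy'."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

class Busy(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response."""

def retry_with_backoff(operation, sleep, max_retries=5):
    """Run operation(); on Busy, sleep with exponentially growing delays."""
    for delay in backoff_delays(max_retries):
        try:
            return operation()
        except Busy:
            sleep(delay)  # wait before the next attempt; delay doubles each time
    return operation()    # final attempt propagates Busy if still failing
```

Passing `time.sleep` as `sleep` gives a real retry loop; the cap keeps repeated failures from growing the wait without bound.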

Partitions and Partition Ranges

Initially the entire table is one range on one server:

Server A – Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

As traffic grows, the system splits the range across servers:

Server A – Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
Action                  | Fast & Furious       | …         | 2009
Action                  | The Bourne Ultimatum | …         | 2007
…                       | …                    | …         | …
Animation               | Open Season 2        | …         | 2009
Animation               | The Ant Bully        | …         | 2006

Server B – Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query returns a continuation token when it hits:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
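Handling continuation tokens is just a loop. A minimal sketch, where the hypothetical `execute_segment(token)` callable stands in for one segmented table query returning `(rows, next_token)`:

```python
def query_all(execute_segment):
    """Drain a segmented table query by following continuation tokens.

    execute_segment(token) returns (rows, next_token); next_token is None
    when the server has no more results -- mirroring how a table query
    returns at most 1000 rows per response.
    """
    rows, token = [], None
    while True:
        segment, token = execute_segment(token)
        rows.extend(segment)
        if token is None:
            return rows
```

Code that only looks at the first response silently truncates results; the loop above is the pattern the slide is warning you to always apply.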

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a backoff strategy for retries: "Server Busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Decoupling via queues can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role Puts messages (Msg 1 … Msg 4) onto the Queue; Worker Roles call GetMessage with a visibility timeout and, once processing completes, RemoveMessage]

PutMessage request:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage request:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
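The rule above (double on empty, reset on success, truncated at a maximum) can be sketched as:

```python
def next_poll_interval(current, got_message, initial=1.0, maximum=60.0):
    """Truncated exponential back-off for queue polling: double the wait
    after an empty poll (capped at a maximum), reset on success."""
    return initial if got_message else min(maximum, current * 2)

# Simulate a consumer seeing five empty polls, then a message:
intervals, current = [], 1.0
for got in [False, False, False, False, False, True]:
    current = next_poll_interval(current, got)
    intervals.append(current)
```

This keeps idle workers from hammering the queue with GetMessage calls (each of which is a billable transaction) while staying responsive once traffic returns.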

Removing Poison Messages

[Diagram: producers P1 and P2 feed a queue; consumers C1 and C2 dequeue with a 30-second visibility timeout. Each message carries a dequeue count.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1: DeleteMessage(Q, msg 1)
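The dequeue-count threshold from the walkthrough can be simulated with a toy in-memory queue (illustrative only: `ToyQueue` is not a real Azure type and it ignores visibility timeouts, modeling an expired timeout as an immediate requeue):

```python
import collections

class ToyQueue:
    """Tiny in-memory stand-in for a queue that tracks DequeueCount."""
    def __init__(self, messages):
        self._pending = collections.deque(messages)
        self.dequeue_count = collections.Counter()

    def empty(self):
        return not self._pending

    def get_message(self):
        msg = self._pending.popleft()
        self.dequeue_count[msg] += 1
        return msg

    def requeue(self, msg):
        # Models the message reappearing after its visibility timeout expires.
        self._pending.append(msg)

def drain(queue, handler, poison_threshold=3, poison_bin=None):
    """Work loop that dead-letters messages dequeued more than the threshold."""
    while not queue.empty():
        msg = queue.get_message()
        if queue.dequeue_count[msg] > poison_threshold:
            if poison_bin is not None:
                poison_bin.append(msg)  # set aside instead of retrying forever
            continue
        try:
            handler(msg)
        except Exception:
            queue.requeue(msg)  # crashed consumer: message becomes visible again
```

Without the threshold, a message whose processing always crashes the consumer would circulate forever.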

Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Use DequeueCount to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using much CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
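The deck's concrete suggestion is the .NET 4 Task Parallel Library; the same data-parallel idea sketched in Python with a standard-library thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, workers=4):
    """Data parallelism: apply func to every item using a pool of workers,
    analogous to a TPL Parallel.ForEach over the items."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

`pool.map` preserves input order, so the result lines up with the items even though the work ran concurrently.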

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
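A quick illustration of why gzipping output pays off, using Python's standard gzip module on a repetitive HTML-like payload (the payload itself is made up):

```python
import gzip

def gzip_savings(payload: bytes) -> float:
    """Fraction of bytes saved by gzip-compressing a payload before serving it."""
    compressed = gzip.compress(payload, compresslevel=9)
    return 1 - len(compressed) / len(payload)

# Markup repeats heavily, so it compresses extremely well:
page = b"<div class='row'>hello azure</div>\n" * 1000
savings = gzip_savings(page)
```

For text content like this, well over 90% of the bytes never cross the wire, which cuts both bandwidth billing and time-to-render.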

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into many parallel BLAST tasks, followed by a merging task.

Leverage multi-core within one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
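The split/join pattern reduces to two small functions. A toy sketch, where an uppercase transform stands in for running NCBI-BLAST on a partition:

```python
def split(sequences, partition_size):
    """Splitting task: break the input sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge(partial_results):
    """Merging task: join the per-partition result lists back in order."""
    return [hit for partial in partial_results for hit in partial]

# Each partition would be queued as one BLAST task; here we fake the worker:
sequences = [f"seq{i}" for i in range(10)]
partitions = split(sequences, 4)  # sizes 4, 4, 2
results = merge([[s.upper() for s in p] for p in partitions])
```

The `partition_size` parameter is exactly the task-granularity knob the slide discusses: larger partitions risk load imbalance, smaller ones multiply per-task overhead.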

Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instance sizes
• Primarily due to the memory capacity

Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

[Diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and a Scaling Engine, dispatching work through a global dispatch queue to pools of Worker instances. An Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data, kept current by a database-updating role. Tasks follow the split/join flow: a splitting task, many parallel BLAST tasks, then a merging task.]

AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: the Job Portal's Web Portal and Web Service handle job registration, feeding the Job Scheduler and Scaling Engine, with state persisted in the Job Registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually

[Diagram: per-deployment instance counts of 50–62 extra-large VMs]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins of disruption
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
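A direct transcription of the Penman-Monteith formula above; the default γ follows the slide, while the λv default and the inputs in the usage example are illustrative values, not field data:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET, per the slide's formula.

    delta: d(saturation specific humidity)/dT (Pa/K); r_n: net radiation;
    rho_a: dry air density; c_p: specific heat of air; dq: vapor pressure
    deficit; g_a, g_s: air and stomatal conductivities; gamma: psychrometric
    constant; lambda_v: latent heat of vaporization (assumed default).
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

For example, holding everything else fixed, increasing net radiation Rn increases ET, as the formula's numerator suggests.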

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue; a Download Queue drives the Data Collection Stage against source imagery download sites; a Reprojection Queue drives the Reprojection Stage; Reduction 1 and Reduction 2 Queues drive the Derivation and Analysis Reduction Stages; scientists download the scientific results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> Task Queue, read <Input> data storage, and persist <PipelineStage>TaskStatus]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, from which Generic Workers (Worker Roles) pull work against swath source data and reprojection data storage]

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage                | Data & Compute                                           | Cost
Data Collection      | 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers  | $50 upload, $450 storage
Reprojection         | 400 GB, 45K files, 3500 hours, 20–100 workers            | $420 CPU, $60 download
Derivation Reduction | 5–7 GB, 55K files, 1800 hours, 20–100 workers            | $216 CPU, $1 download, $6 storage
Analysis Reduction   | <10 GB, ~1K files, 1800 hours, 20–100 workers            | $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

BLOB Leases

• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
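Because leasing is REST-only, a client has to assemble the HTTP call itself. A minimal sketch of such a helper follows; the `comp=lease` query string and the `x-ms-lease-action` / `x-ms-lease-id` header names reflect the storage REST API of this era, so treat the exact details as assumptions to verify against the current documentation:

```python
def build_lease_request(account, container, blob, action, lease_id=None):
    """Build the method, URL, and headers for a blob lease operation.

    action: one of 'acquire', 'renew', 'release', 'break'.
    Acquire returns a new lease ID in the response; renew and release
    must echo the lease ID back, which this helper enforces.
    """
    if action not in ("acquire", "renew", "release", "break"):
        raise ValueError("unknown lease action: %s" % action)
    url = "http://%s.blob.core.windows.net/%s/%s?comp=lease" % (
        account, container, blob)
    headers = {"x-ms-lease-action": action}
    if action in ("renew", "release"):
        if lease_id is None:
            raise ValueError("%s requires the lease ID" % action)
        headers["x-ms-lease-id"] = lease_id
    return "PUT", url, headers
```

The same lease ID must also accompany any write against the leased blob while the lock is held.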

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
• Table "Movies" – entities: Star Wars, Star Trek, Fan Boys
• Table "Customers" – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST – with any platform or language

Is not relational. It cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Run server-side aggregates – no server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
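The three required properties plus per-entity variation can be sketched in a few lines; this is an illustration of the data shape only, not the storage client API (`make_entity` and `is_valid_entity` are hypothetical helper names):

```python
import datetime

REQUIRED_PROPERTIES = ("PartitionKey", "RowKey", "Timestamp")

def make_entity(partition_key, row_key, **properties):
    """Build a table entity as a plain dict.

    Every entity carries the three system properties; any other
    properties may vary from entity to entity in the same table.
    """
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.datetime.utcnow().isoformat(),
    }
    entity.update(properties)
    return entity

def is_valid_entity(entity):
    """An entity is storable only if all required properties are present."""
    return all(key in entity for key in REQUIRED_PROPERTIES)
```

Two entities in the same table can carry completely different user-defined properties, as long as each has the three system properties.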

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning
Understanding partitioning is key to understanding performance
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Partitioning controls entity locality

The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

On "Server Busy":
• Use exponential backoff
• "Server Busy" means the system is load balancing to meet your traffic needs, or single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

Server 1: P1, P2, …, Pn    Server 2: P1, P2, …, Pn    Server 3: P1, P2, …, Pn

Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Example Movies table:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Partitions and Partition Ranges

Initially a single server holds the whole range:
Server A – Table = Movies [Min – Max]

After the system splits the range to balance load:
Server A – Table = Movies [Min – Comedy): the Action and Animation partitions
Server B – Table = Movies [Comedy – Max]: the Comedy, SciFi, and War partitions

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query returns a continuation token at:
• a maximum of 1000 rows in a response
• the end of a partition range boundary
• a maximum of 5 seconds of query execution

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale: distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
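The advice to avoid append-only key patterns by distributing with a hash prefix can be sketched as follows. This is a minimal illustration under assumptions of my own: the two-character hex prefix length and the MD5 choice are arbitrary, not anything the deck prescribes:

```python
import hashlib

def distributed_partition_key(natural_key, prefix_len=2):
    """Prefix a naturally ordered key (e.g. a timestamp) with a short
    hash so that sequential inserts spread across many partition
    ranges instead of always appending to the last one."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return "%s-%s" % (digest[:prefix_len], natural_key)
```

Consecutive timestamps then sort under unrelated prefixes, so no single partition absorbs all the inserts; range queries over the natural key must fan out across the prefixes, which is the price of the distribution.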

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling aids scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle
A Web Role calls PutMessage to enqueue; Worker Roles call GetMessage (with a visibility timeout) to dequeue and RemoveMessage once processing succeeds.

PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1. The interval is capped at a maximum, hence "truncated".
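The doubling-and-reset schedule above is easy to make concrete. A small sketch (the cap of 60 is an assumed maximum, not a value from the deck):

```python
def truncated_backoff_intervals(poll_results, base=1, cap=60):
    """Compute the successive polling intervals for a queue consumer.

    poll_results: booleans, True when GetMessage returned a message.
    Each empty poll doubles the interval (truncated at cap); a
    successful poll resets it to base.
    """
    intervals = []
    interval = base
    for got_message in poll_results:
        intervals.append(interval)
        if got_message:
            interval = base                    # success: poll eagerly again
        else:
            interval = min(interval * 2, cap)  # empty: back off, capped
    return intervals
```

Seven empty polls in a row yield intervals 1, 2, 4, 8, 16, 32, 60; a hit anywhere in between drops the consumer straight back to the base interval.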

Removing Poison Messages

Producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30-second visibility timeout:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after its dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after its dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• A common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

Uncompressed content → (Gzip, minify JavaScript, minify CSS, minify images) → compressed content
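The gzip step above is a one-liner in most stacks; a minimal sketch of the compress/decompress round trip (the decompression side is what a browser does on the fly when it sees Content-Encoding: gzip):

```python
import gzip

def gzip_bytes(payload: bytes) -> bytes:
    """Gzip-compress a response body before sending it over the wire."""
    return gzip.compress(payload)

def gunzip_bytes(payload: bytes) -> bytes:
    """Reverse of gzip_bytes; browsers do this transparently."""
    return gzip.decompress(payload)
```

For typical repetitive HTML, the compressed body is a small fraction of the original, which is exactly the bandwidth saving the slide is after.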

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile inside and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large data volumes:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • split the input sequences
  • query the partitions in parallel
  • merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • batch job management
  • task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out to many BLAST tasks, and a merging task joins the results.

Leverage the multiple cores of one instance:
• the "–a" argument of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Too-large partitions: load imbalance
• Too-small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
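The query-segmentation step behind these numbers is just fixed-size chunking of the input sequences; each chunk becomes one BLAST task. A minimal sketch (100 per partition is the benchmark-derived value from the slide; `partition_sequences` is a hypothetical helper name):

```python
def partition_sequences(sequences, partition_size=100):
    """Split the input query sequences into fixed-size partitions.

    100 sequences per partition was the best choice in the
    micro-benchmarks; each partition is dispatched as one BLAST task,
    and a final merging task joins the per-partition results.
    """
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]
```

The last partition is allowed to be short; task granularity then trades load balance against the per-task NCBI-BLAST startup and data-transfer overheads discussed above.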

AzureBLAST

Architecture:
• Web Role: web portal and web service; job registration
• Job Management Role: job scheduler and scaling engine, feeding a global dispatch queue
• Worker roles: carry out the splitting task, the BLAST tasks (…), and the merging task
• Azure Table: job registry
• Azure Blob: NCBI databases, BLAST databases, temporary data, etc.
• Database updating Role

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

An accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• against all NCBI non-redundant proteins: completed in 30 min
• against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in a group – this is an update domain:
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
where
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
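As a worked illustration of the Penman-Monteith formula above, the per-pixel arithmetic is a single expression; the default γ matches the slide's ≈66 Pa/K, while the default λv is an assumed illustrative value, and no unit bookkeeping beyond the slide's definitions is attempted:

```python
def evapotranspiration(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET, term for term as in the slide.

    delta: rate of change of saturation specific humidity with air
    temperature (Pa/K); r_n: net radiation (W/m^2); rho_a: dry air
    density (kg/m^3); c_p: specific heat capacity of air (J/(kg K));
    dq: vapor pressure deficit (Pa); g_a, g_s: conductivities of air
    and plant stoma (m/s); gamma: psychrometric constant (~66 Pa/K);
    lambda_v: latent heat of vaporization (J/g, assumed default).
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The pipeline's hard part is not this arithmetic but estimating ga and gs across a catchment, which is what the imagery, sensor, and field-data inputs feed.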

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Scientists submit requests through the AzureMODIS Service Web Role portal; a Request Queue feeds the data collection stage (which pulls source imagery from download sites via a Download Queue), the Reprojection, Reduction 1, and Reduction 2 queues drive the later stages against the source metadata, and scientific results are downloaded at the end.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

Flow: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor from the <PipelineStage> Task Queue
• Retries failed tasks 3 times
• Maintains all task status
• Reads from <Input>Data Storage

Example Pipeline Stage: Reprojection Service

A Reprojection Request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), then dispatches tasks from the Job Queue to the Task Queue for the GenericWorkers (Worker Roles).

Storage tables consulted along the way:
• SwathGranuleMeta – query this table to get geo-metadata (e.g. boundaries) for each swath tile
• ScanTimeList – query this table to get the list of satellite scan times that cover a target tile

Workers read from Swath Source Data Storage and write to Reprojection Data Storage.

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage by stage (requests enter via the AzureMODIS Service Web Role portal):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account: MovieData
  Table "Movies" – entities: Star Wars, Star Trek, Fan Boys
  Table "Customers" – entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity

Tables store entities. Entity schema can vary in the same table.

Windows Azure Tables

• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
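Apart from the three required properties, an entity's schema is free-form and can differ between entities in the same table. A minimal sketch of that data model (plain Python dicts, not any Azure SDK; names are illustrative):

```python
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **properties):
    """Build a table entity as a dict. PartitionKey, RowKey and Timestamp
    are required on every entity; the rest of the schema may vary."""
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entity.update(properties)
    return entity

# Two entities with different schemas can live in the same table:
movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
customer = make_entity("1", "Customer-John Smith", Name="John Smith")
```

PartitionKey + RowKey together form the unique key of the entity, which is why the partitioning discussion below matters for key design.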

Windows Azure Queues

• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

On "Server Busy"
• Use exponential backoff
• The system load balances to meet your traffic needs
• It may mean single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
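These targets translate directly into capacity planning arithmetic. A small sketch (the constants are the circa-2010 figures from this slide; the functions are illustrative, not from any SDK):

```python
import math

# Published scalability targets from the slide (circa 2010).
PARTITION_TPS = 500          # single queue/table partition
ACCOUNT_CAPACITY_TB = 100    # single storage account

def partitions_needed(target_tps):
    """Minimum number of partitions required to sustain a request rate."""
    return math.ceil(target_tps / PARTITION_TPS)

def accounts_needed(total_tb):
    """Minimum number of storage accounts required for a dataset size."""
    return math.ceil(total_tb / ACCOUNT_CAPACITY_TB)

# e.g. 3,000 requests/sec needs spreading over at least 6 partitions,
# and 230 TB of data needs at least 3 storage accounts.
```

The point of the sketch: scale-out past these limits is the application's job, via partition key and storage account design.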

Partitions and Partition Ranges

Initially, Server A serves the whole table, Movies [Min - Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After a split, the partition ranges are served by different servers:

Server A – Table = Movies [Min - Comedy):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B – Table = Movies [Comedy - Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query response may stop short and return a continuation token when any of these is hit:
• Maximum of 1000 rows in a response
• The end of a partition range boundary
• Maximum of 5 seconds to execute the query
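Because a token can arrive at any of those points, even on small result sets, client code has to loop until no token is returned. A minimal sketch of that loop; `query_page` stands in for whatever call issues one paged request (a hypothetical callable, not a real SDK method):

```python
def query_all(query_page):
    """Drain a paged table query.

    `query_page(token)` returns (rows, next_token). The service returns
    at most 1000 rows per response, and may return a continuation token
    at a partition range boundary or after 5 seconds of execution even
    with fewer rows -- so always loop until the token is None."""
    token = None
    while True:
        rows, token = query_page(token)
        for row in rows:
            yield row
        if token is None:
            return
```

The common bug this guards against is treating the first response as the full result set.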

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• On server busy: the system load balances partitions to meet traffic needs, or load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue; Worker Roles call GetMessage (with a visibility timeout) to receive a message, and RemoveMessage to delete it once processed]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue with intervals growing toward a cap of 60]
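The two rules above fit in a few lines. A minimal sketch of the interval calculation (the cap of 60 matches the diagram; all names are illustrative):

```python
def next_poll_interval(current, empty, minimum=1.0, maximum=60.0):
    """Truncated exponential back off for queue polling:
    each empty poll doubles the interval up to a cap (truncation);
    a successful poll resets the interval to the minimum."""
    if empty:
        return min(current * 2, maximum)
    return minimum

# Empty polls walk the interval 1 -> 2 -> 4 -> ... -> 60 (capped);
# the first poll that returns a message snaps it back to 1.
```

This keeps idle workers from hammering the queue (each GetMessage is a billed transaction) while staying responsive once traffic returns.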

Removing Poison Messages

[Diagram: producers P1, P2 and consumers C1, C2 on a queue; each message is shown with its dequeue count]

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

12

2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed

1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1

2

6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue

30

13

12

13
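The scenario above is the poison-message pattern: a message that repeatedly crashes its consumer would cycle forever unless its dequeue count is checked. A self-contained sketch with a tiny in-memory stand-in for the queue (none of this is Azure SDK code; the threshold of 2 matches step 12 above):

```python
class InMemoryQueue:
    """Tiny stand-in for a cloud queue that tracks per-message dequeue counts."""
    def __init__(self, payloads):
        self.messages = [{"body": p, "dequeue_count": 0} for p in payloads]

    def get(self):
        if not self.messages:
            return None
        msg = self.messages[0]       # message becomes "invisible" for a while;
        msg["dequeue_count"] += 1    # the service bumps its dequeue count
        return msg

    def delete(self, msg):
        self.messages.remove(msg)

MAX_DEQUEUE = 2  # threshold from the slide: DequeueCount > 2 means poison

def drain(queue, process, poison):
    """Process messages; remove ones whose dequeue count crosses the threshold."""
    while (msg := queue.get()) is not None:
        if msg["dequeue_count"] > MAX_DEQUEUE:
            poison.append(msg["body"])   # set aside for later inspection
            queue.delete(msg)
            continue
        try:
            process(msg["body"])
            queue.delete(msg)            # delete only after success
        except Exception:
            pass                         # message becomes visible again later
```

Deleting only after successful processing is what gives at-least-once delivery; the dequeue-count check is what stops at-least-once from becoming forever.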

Queues Recap

• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up the CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
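The deck's examples are .NET, but the data-parallel pattern itself is language-neutral: partition the input, size the worker pool to the core count, and combine the partial results. A minimal illustrative sketch in Python (names and chunking policy are assumptions, not from the deck):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def process_chunk(chunk):
    # Stand-in for the per-chunk work a real role instance would do.
    return sum(chunk)

def parallel_sum(data, workers=None):
    """Data parallelism: split the input into chunks, process them
    concurrently, and combine the results. The pool is sized to the
    core count rather than oversubscribing the instance."""
    workers = workers or os.cpu_count() or 1
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))
```

The same shape (split, map, reduce) reappears below in AzureBLAST's query segmentation, just with worker roles instead of threads.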

Finding Good Code Neighbors

• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

[Diagram: Uncompressed Content → Gzip, minify JavaScript, minify CSS, minify images → Compressed Content]
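Gzipping output is a one-liner in most stacks. A small illustrative sketch using Python's standard `gzip` module (the page content is made up for the example):

```python
import gzip

def gzip_response(body: bytes) -> bytes:
    """Compress an HTTP response body with gzip before sending it over
    the wire; browsers advertising 'Accept-Encoding: gzip' decompress
    it on the fly."""
    return gzip.compress(body)

# Typical generated HTML is highly repetitive and compresses dramatically,
# cutting both bandwidth and storage costs:
page = b"<html>" + b"<li>row</li>" * 1000 + b"</html>"
compressed = gzip_response(page)
```

In a real service you would also set `Content-Encoding: gzip` on the response and skip compression for clients that do not advertise support.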

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
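Query segmentation is the simpler of the two parallelization routes: partition the input sequences, BLAST each partition independently, and concatenate the results. A minimal sketch (the default of 100 sequences per partition echoes the micro-benchmark finding later in the deck; the function is illustrative):

```python
def segment_queries(sequences, partition_size=100):
    """Query segmentation for a pleasingly parallel BLAST run:
    split the input sequences into fixed-size partitions that can be
    queried independently; merging is a simple concatenation of the
    per-partition outputs."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]
```

Database segmentation (the mpiBLAST route) is harder precisely because its result reduction is not a simple concatenation.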

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow

A simple Split/Join pattern: Splitting task → BLAST task, BLAST task, BLAST task, … → Merging task

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST (2)

[Architecture diagram: a Web Role hosts the web portal and web service; job registration feeds a Job Management Role containing the job scheduler and scaling engine, which dispatches work through a global dispatch queue to worker instances. An Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases. Task flow per job: Splitting task → BLAST tasks (in parallel) → Merging task]

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: Job Portal (web portal + web service) → job registration → job scheduler and scaling engine, backed by the job registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

[Diagram: instance counts per deployment: 50, 62, 62, 62, 62, 62, 50, 62]

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
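Spotting those anomalies mechanically means pairing each "Executing" record with its "done" record; any task that starts but never completes is a candidate failure. A small sketch of that check (the regular expressions match the log format shown above; the function name is illustrative):

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Return the IDs of tasks that have a start record but no
    completion record -- the signature of a failed or lost task."""
    started, finished = set(), set()
    for line in log_lines:
        if (m := START.search(line)):
            started.add(m.group(1))
        if (m := DONE.search(line)):
            finished.add(m.group(1))
    return started - finished
```

Run over the snippet above, this flags task 251774, which started at 8:22 and never logged completion.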

Surviving System Upgrades

North Europe data center: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes — this is an update domain

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed, and then the job was killed
• 35 nodes experienced blob writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" — Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
  ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
  Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m⁻²)
  cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
  ρa = dry air density (kg m⁻³)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s⁻¹)
  gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
  γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
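As a computation, the Penman-Monteith formula is a straightforward per-pixel arithmetic reduction over the gridded inputs; the hard part is producing the inputs. An illustrative transcription of the equation (the sample defaults for γ and λv come from the slide and the standard latent heat of vaporization of water; treat units as the caller's responsibility):

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith evapotranspiration:
    ET = (Delta*Rn + rho_a*c_p*dq*g_a) / ((Delta + gamma*(1 + g_a/g_s)) * lambda_v)

    gamma ~ 66 Pa/K (psychrometric constant); lambda_v ~ 2260 J/g
    (latent heat of vaporization). Inputs must use consistent units."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

In the MODISAzure pipeline this function would be applied independently at every tile pixel, which is exactly why the derivation stage is pleasingly parallel.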

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS Service Web Role Portal feeds a request queue; a download queue pulls source imagery from download sites using source metadata; reprojection, reduction 1, and reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages; scientists download the scientific results]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read from <Input>Data Storage]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters via the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), and dispatches to the Task Queue; GenericWorker (Worker Role) instances execute tasks against Reprojection Data Storage. The ScanTimeList table is queried to get the list of satellite scan times that cover a target tile; the SwathGranuleMeta table is queried to get geo-metadata (e.g., boundaries) for each swath tile in Swath Source Data Storage]

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Stage                | Data                            | Compute                      | Cost
Data Collection      | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers        | $50 upload, $450 storage
Reprojection         | 400 GB, 45K files               | 3500 hours, 20-100 workers   | $420 cpu, $60 download
Derivation Reduction | 5-7 GB, 55K files               | 1800 hours, 20-100 workers   | $216 cpu, $1 download, $6 storage
Analysis Reduction   | <10 GB, ~1K files               | 1800 hours, 20-100 workers   | $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD.
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance.
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using.
• Get Mounted Drives – returns the list of mounted drives: the drive letter and Page Blob URL for each mounted drive.
• Unmount Drive – unmounts the drive and frees up the drive letter.
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob).
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive.

BLOB Guidance

• Manage connection strings/keys in .cscfg.
• Do not share keys; wrap access with a service.
• Have a strategy for accounts and containers.
• You can assign a custom domain to your storage account.
• There is no method to detect container existence; call FetchAttributes() and handle the error if it doesn't exist.

Table Structure

Account: MovieData
  Table Name: Movies — entities: Star Wars, Star Trek, Fan Boys
  Table Name: Customers — entities: Brian H. Prince, Jason Argonaut, Bill Gates

The hierarchy is Account → Table → Entity. Tables store entities; entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage.
• Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows.
• Highly available & durable: data is replicated several times.
• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST – with any platform or language.

Is not relational. Cannot:
• Create foreign key relationships between tables.
• Perform server-side joins between tables.
• Create custom indexes on the tables.
• No server-side Count(), for example.

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery.
• Simple asynchronous work dispatch.
• Programming semantics ensure that a message can be processed at least once.
• Access is provided via REST.

Storage Partitioning

Understanding partitioning is key to understanding performance.

• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key.
• A partition can be served by a single server.
• The system load balances partitions based on traffic pattern.
• The partition key controls entity locality.

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in.
• It can take a couple of seconds for a partition to become available on a different server.

"Server Busy":
• Use exponential backoff on "Server Busy".
• The system load balances to meet your traffic needs.
• Or: single-partition limits have been reached.

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name. Every blob and its snapshots are in a single partition.

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name. All messages for a single queue belong to the same partition.

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas.
• Replicas are created as needed.
• A write operation is not complete until it has been written to all three replicas.
• Reads are only load balanced to replicas in sync.

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.]

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
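The "503 Server Busy" advice above can be sketched as a small retry wrapper. This is a minimal illustration, not the storage SDK: `ServerBusyError` and `with_backoff` are hypothetical names standing in for real 503 detection in a client library.

```python
import random
import time

class ServerBusyError(Exception):
    """Stands in for an HTTP 503 'Server Busy' response from the storage service."""

def with_backoff(operation, max_retries=6, base_delay=0.5, max_delay=30.0):
    """Run `operation`, retrying with truncated exponential backoff plus
    jitter whenever the (simulated) service answers 503 Server Busy."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise                                            # give up after max_retries
            delay = min(base_delay * (2 ** attempt), max_delay)  # 0.5, 1, 2, ... capped
            time.sleep(delay * random.uniform(0.5, 1.0))         # jitter avoids lockstep retries
```

The jitter factor keeps many clients that were throttled at the same moment from retrying in lockstep and re-triggering the throttle.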

Partitions and Partition Ranges

Server A – Table = Movies [Min – Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After the system splits the partition range, Server A serves Table = Movies [Min – Comedy) and Server B serves Table = Movies [Comedy – Max]:

Server A – Movies [Min – Comedy):
PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
Action                  | Fast & Furious       | …         | 2009
Action                  | The Bourne Ultimatum | …         | 2007
…                       | …                    | …         | …
Animation               | Open Season 2        | …         | 2009
Animation               | The Ant Bully        | …         | 2006

Server B – Movies [Comedy – Max]:
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible.
• Hot partitions can be load balanced.
• PartitionKey is critical for scalability.

Query efficiency & speed:
• Avoid frequent large scans.
• Parallelize queries.
• Point queries are most efficient.

Entity group transactions:
• Transactions across a single partition.
• Transaction semantics & reduced round trips.

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query may stop early and return a continuation token when:
• A maximum of 1,000 rows is reached in a response
• The end of a partition range boundary is reached
• The query takes more than 5 seconds to execute
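A paging loop that honors continuation tokens can be sketched as follows. `query_segment` is a hypothetical stand-in for a client's segmented-query call; the shape (entities plus an opaque token) mirrors the behavior the slide describes.

```python
def query_all(query_segment, table, filter_expr):
    """Drain a segmented table query. `query_segment` is a stand-in for a
    client call returning (entities, continuation_token); a non-None token
    means the server stopped early (1,000-row cap, partition boundary, or
    the 5-second execution limit) and the query must be re-issued with it."""
    rows, token = [], None
    while True:
        page, token = query_segment(table, filter_expr, token)
        rows.extend(page)
        if token is None:        # no token: the result set is complete
            return rows
```

The point of the loop: an empty page with a token is still not the end of the results, so termination must key off the token, never off page size.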

Tables Recap

• Efficient for frequently used queries; supports batch transactions; distributes load.
• Select PartitionKey and RowKey that help scale — distribute by using a hash etc. as a prefix, and avoid "append only" patterns.
• Always handle continuation tokens — expect them for range queries.
• "OR" predicates are not optimized — execute the queries that form the "OR" predicates as separate queries.
• Implement a back-off strategy for retries on "Server Busy" — either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits.
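The "hash prefix" advice can be sketched as below. `spread_key` is an illustrative helper, not part of any SDK: it spreads a monotonically increasing natural key (the classic "append only" timestamp pattern) across a fixed number of partition buckets.

```python
import hashlib

def spread_key(natural_key, buckets=16):
    """Prefix a naturally increasing key (e.g. a timestamp) with a stable
    hash bucket so new entities land on `buckets` different partitions
    instead of always hitting the last ('append only') partition."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{int(digest, 16) % buckets:02d}_{natural_key}"
```

The trade-off: range queries over the natural key now require one query per bucket prefix, which is exactly the "parallelize queries" guidance from the key-selection slide.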

WCF Data Services

• Use a new context for each logical operation.
• AddObject/AttachTo can throw an exception if the entity is already being tracked.
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException.

Queues: Their Unique Role in Building Reliable, Scalable Applications

• You want roles that work closely together but are not bound together; tight coupling leads to brittleness. Decoupling can aid in scaling and performance.
• A queue can hold an unlimited number of messages.
• Messages must be serializable as XML and are limited to 8 KB in size.
• Commonly used with the work ticket pattern.
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage (with a visibility timeout) and then RemoveMessage once processing succeeds.]

Put a message:

POST http://myaccount.queue.core.windows.net/myqueue/messages

Get a message (response):

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

Delete the processed message, presenting the pop receipt:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a truncation ceiling (e.g. 60 seconds).
• A successful poll sets the interval back to 1.

[Diagram: consumers C1 and C2 polling the queue as their intervals grow and reset.]
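The polling policy above can be sketched in a few lines. A minimal illustration: the generator yields the wait before each poll, given whether each poll returned a message.

```python
def poll_intervals(poll_results, floor=1.0, ceiling=60.0):
    """Yield the wait (seconds) before each poll. Every empty poll doubles
    the interval, truncated at `ceiling`; a successful poll resets it to
    `floor` - truncated exponential back off."""
    interval = floor
    for got_message in poll_results:
        yield interval
        # Empty poll: double (capped). Successful poll: reset to the floor.
        interval = floor if got_message else min(interval * 2, ceiling)
```

In a real worker the loop body would call GetMessage and sleep for the yielded interval; the policy keeps idle queues cheap (few transactions) while staying responsive once traffic resumes.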

Removing Poison Messages

[Diagram: producers P1, P2 feed queue Q; consumers C1, C2 process it.]

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Each retrieved message is invisible to other consumers for the 30-second visibility timeout; its dequeue count is now 1.

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count is now 2)

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after its dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after its dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) — the poison message is removed instead of being retried forever
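The three sequences above boil down to a dequeue-count threshold. Here is a minimal in-memory sketch: `drain`, `MAX_DEQUEUE`, and the `deque` stand-in are illustrative only — a real worker would call GetMessage/DeleteMessage and read the message's DequeueCount property.

```python
import collections

MAX_DEQUEUE = 3

def drain(queue, handler, dead_letters):
    """Process an in-memory stand-in for an Azure queue. A message whose
    dequeue count exceeds MAX_DEQUEUE is quarantined (poison message)
    instead of being retried forever."""
    dequeue_count = collections.Counter()
    while queue:
        msg = queue.popleft()
        dequeue_count[msg] += 1
        if dequeue_count[msg] > MAX_DEQUEUE:
            dead_letters.append(msg)   # threshold exceeded: delete/quarantine (steps 12-13)
            continue
        try:
            handler(msg)               # success; Azure would DeleteMessage here
        except Exception:
            queue.append(msg)          # consumer crash: message becomes visible again
```

Quarantining to a dead-letter store (rather than plain deletion) keeps the bad message around for later diagnosis.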

Queues Recap

• Make message processing idempotent — then there is no need to deal with failures.
• Do not rely on message order — invisible messages result in out-of-order delivery.
• Use the dequeue count to remove poison messages — enforce a threshold on a message's dequeue count.
• For messages > 8 KB, use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs.
• Use the message count to scale — dynamically increase/reduce workers.

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs.
• Fundamental choice: fewer, larger VMs vs. many smaller instances.
• If you scale better than linearly across cores, larger VMs could save you money.
• It is pretty rare to see linear scaling across 8 cores.
• More instances may provide better uptime and reliability (more failures are needed to take your service down).
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you.

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows.
• 1 role instance ≠ one specific task for your code.
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up its CPU.
• Balance using up the CPU against keeping free capacity for times of need.
• There are multiple ways to use your CPU to the fullest.

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency.
• May not be ideal if the number of active processes exceeds the number of cores.
• Use multithreading aggressively.
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads.
• In .NET 4, use the Task Parallel Library for both data parallelism and task parallelism.

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive.
• Find code that is intensive with different resources to live together.
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled).
• Spinning VMs up and down automatically is good at large scale.
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running.
• Being too aggressive in spinning down VMs can result in poor user experience.
• Trade off the risk of failure or poor user experience from not having excess capacity against the cost of idling VMs: performance vs. cost.

Storage Costs

• Understand an application's storage profile and how storage billing works.
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction. Service choice can make a big cost difference based on your app profile.
• Caching and compressing help a lot with storage costs.

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile.
• Saving bandwidth costs often leads to savings in other places: sending fewer things over the wire often means getting fewer things from storage, and sending fewer things means your VM has time to do other tasks.
• All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content.
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.
2. Trade off compute costs for storage size.
3. Minimize image sizes.
   • Use Portable Network Graphics (PNGs).
   • Crush your PNGs.
   • Strip needless metadata.
   • Make all PNGs palette PNGs.

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.
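The "gzip all output" step can be sketched as below. A minimal illustration using only the standard library; the size threshold and `maybe_compress` name are assumptions, since tiny payloads can actually grow after compression.

```python
import gzip

def maybe_compress(body, min_size=256):
    """Gzip an HTTP response body when it pays off. Returns (payload,
    content_encoding); tiny or incompressible payloads are sent unchanged,
    since gzip can enlarge them."""
    if len(body) >= min_size:
        compressed = gzip.compress(body)
        if len(compressed) < len(body):      # only pay the CPU if bytes shrink
            return compressed, "gzip"
    return body, None
```

This is the compute-for-bandwidth trade the slide describes: a little CPU per response in exchange for smaller storage and transfer bills.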

Best Practices Summary

• Doing "less" is the key to saving costs.
• Measure everything.
• Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics.
• Identifies similarity between bio-sequences.

Computationally intensive:
• Large number of pairwise alignment operations.
• A BLAST run can take 700–1000 CPU hours.
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months.

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel.
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing.

Large volume of data:
• A normal BLAST database can be as large as 10 GB.
• With 100 nodes, the peak storage bandwidth could reach 1 TB.
• The output of BLAST is usually 10–100x larger than the input.

AzureBLAST

• A parallel BLAST engine on Azure.
• Query-segmentation, data-parallel pattern: split the input sequences, query partitions in parallel, merge results together when done.
• Follows the general suggested application model: Web Role + Queue + Worker.
• With special considerations: batch job management; task parallelism on an elastic cloud.

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, and a merging task joins the results when all are done.

Leverage the multiple cores of one instance:
• Use argument "-a" of NCBI-BLAST: 1/2/4/8 for small, medium, large, and extra-large instance sizes.

Task granularity:
• Large partitions → load imbalance; small partitions → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead).
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time.
• Too small → repeated computation; too large → an unnecessarily long wait in case of instance failure.
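The query-segmentation step of the split/join pattern can be sketched as follows. A minimal illustration — `split_queries` is a hypothetical helper, and the 100-sequences-per-partition default reflects the sweet spot reported in the micro-benchmarks on the next slide.

```python
def split_queries(fasta_text, per_partition=100):
    """Cut a FASTA input into partitions of `per_partition` query
    sequences; each partition becomes one BLAST task in the split/join
    task flow."""
    seqs, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">") and current:   # a '>' header starts a new sequence
            seqs.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        seqs.append("\n".join(current))
    # Group whole sequences into fixed-size partitions (the BLAST tasks).
    return ["\n".join(seqs[i:i + per_partition])
            for i in range(0, len(seqs), per_partition)]
```

Splitting on sequence boundaries (not byte offsets) matters: a partition must contain whole query records for NCBI-BLAST to process it independently.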

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect.
• 100 sequences per partition is the best choice.

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to the memory capability.

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput, fully utilizing the resource.

AzureBLAST Architecture

[Diagram: a Web Role (web portal + web service) handles job registration into a job registry (Azure Table); a Job Management Role (job scheduler + scaling engine) expands each job into a splitting task, parallel BLAST tasks, and a merging task, dispatched through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database Updating Role refreshes the databases.]

AzureBLAST Job Portal

• An ASP.NET program hosted by a web role instance: submit jobs; track a job's status and logs.
• Authentication/authorization based on Live ID.
• The accepted job is stored in the job registry table — for fault tolerance, avoid in-memory state.

[Diagram: the job portal's web portal and web service feed job registration into the job scheduler, job registry, and scaling engine.]

Demonstration

R. palustris as a Platform for H2 Production

Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min.
• Against ~5,000 proteins from another strain: completed in less than 30 sec.

AzureBLAST significantly saved computing time.

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences.

"All against all" query:
• The database is also the input query.
• The protein database is large (4.2 GB), with 9,865,668 sequences in total to be queried.
• Theoretically, 100 billion sequence comparisons.

Performance estimation:
• Based on sampling runs on one extra-large Azure instance, the experiment would require 3,216,731 minutes (6.1 years) on one desktop.

Experiments at this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) across four datacenters — US (2), Western Europe, and Northern Europe.
• 8 deployments of AzureBLAST, each with its own co-located storage service.
• Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution, and each segment consists of smaller partitions.
• When loads were imbalanced, the load was redistributed manually.

[Diagram: VM allocation per deployment — 50, 62, 62, 62, 62, 62, 50, 62.]

End Result

• Total size of the output result is ~230 GB; the number of total hits is 1,764,579,487.
• Started on March 25th; the last task completed on April 8th (10 days of compute). Based on our estimates, real working instance time should be 6–8 days.
• Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should look like this:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
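The check described above — find tasks that started but never logged completion — can be sketched with two regular expressions. `audit` is an illustrative helper, assuming the log format shown on this slide.

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def audit(log_lines):
    """Pair 'Executing' records with their 'done' records. Any task id
    that started but never logged completion points at a problem
    (instance failure, upgrade, storage error)."""
    started, finished = set(), set()
    for line in log_lines:
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))    # completion record
        else:
            m = START.search(line)
            if m:
                started.add(m.group(1)) # start record
    return sorted(started - finished)   # tasks with no matching 'done'
```

Run over the full log archive, this is exactly the analysis that surfaced the update-domain and fault-domain behavior on the next two slides.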

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group out for ~30 mins — this is an update domain.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction.
• Some of the inputs are not so simple.
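The Penman-Monteith formula above transcribes directly into code. A sketch for illustration: the γ ≈ 66 Pa K⁻¹ default comes from the slide, while the λv default of ~2450 J/g is an assumed typical value (latent heat of vaporization near 20 °C), not something the slide specifies.

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET, term for term from the slide:
    ET = (Delta*Rn + rho_a*c_p*dq*g_a) / ((Delta + gamma*(1 + g_a/g_s)) * lambda_v)
    Symbols and units follow the slide's definitions; lambda_v in J/g is
    an assumed typical value."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

The per-pixel arithmetic is trivial; what makes the computation expensive is evaluating it over 5 TB of imagery after estimating ga and gs for every cell, which is exactly the data-reduction pipeline the next slides describe.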

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites.
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles.
• Simple nearest-neighbor or spline algorithms.

Derivation reduction stage:
• First stage visible to the scientist.
• Computes ET in our initial use.

Analysis reduction stage:
• Optional second stage visible to the scientist.
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors.

[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage (download queue, source imagery download sites), which feeds the reprojection stage (reprojection queue), the derivation reduction stage (reduction 1 queue), and the analysis reduction stage (reduction 2 queue); source metadata is maintained alongside, and scientific results are available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue.
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks — recoverable units of work — and persists the execution status of all jobs and tasks in Tables.

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor.
• Retries failed tasks 3 times.
• Maintains all task status.

[Diagram: the Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read/write <Input>Data Storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request flows through the Job Queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus — each entity specifies a single reprojection job request — and parses & persists ReprojectionTaskStatus — each entity specifies a single reprojection task, i.e., a single tile — before dispatching to the Task Queue for GenericWorker instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile; swath source data and reprojection data live in storage.]

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times.
• Storage costs are driven by data scale and the 6-month project duration.
• Small with respect to the people costs, even at graduate-student rates.

Stage               | Data                               | Compute                     | Cost
Data collection     | 400-500 GB, 60K files, 10 MB/sec   | 11 hours, <10 workers       | $50 upload, $450 storage
Reprojection        | 400 GB, 45K files                  | 3,500 hours, 20-100 workers | $420 cpu, $60 download
Derivation reduction| 5-7 GB, 55K files                  | 1,800 hours, 20-100 workers | $216 cpu, $1 download, $6 storage
Analysis reduction  | <10 GB, ~1K files                  | 1,800 hours, 20-100 workers | $216 cpu, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault-tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premise compute.
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

Resources: Cloud Research Community Site

http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope

http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 43: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

BLOB Guidance

• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist

Table Structure

Account MovieData

Star WarsStar TrekFan Boys

Table Name Movies

Brian H PrinceJason ArgonautBill Gates

Table Name Customers

Account

Table

Entity

Tables store entities. Entity schema can vary within the same table.

Windows Azure Tables

• Provides structured storage
  • Massively scalable tables: billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message is processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

Partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

System load balancing
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• "Server Busy" means the limits of a single partition have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
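The retry behavior recommended above can be sketched in a few lines. This is a minimal, library-agnostic Python illustration (not the Azure Storage Client Library); `ServerBusyError` is a hypothetical stand-in for an HTTP 503 response:

```python
import time

class ServerBusyError(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response from the storage service."""

def with_backoff(operation, max_retries=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry `operation`, doubling the wait after each 503, truncated at `max_delay`."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            sleep(min(base_delay * (2 ** attempt), max_delay))
```

Injecting `sleep` keeps the sketch testable; a real client would also add jitter so synchronized clients don't retry in lockstep.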

Partitions and Partition Ranges

Initially a single server can serve the entire table:

Server A – Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After load balancing, the partition range is split across servers:

Server A – Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B – Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• The PartitionKey is critical for scalability

Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & fewer round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
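The consequence of the three limits above is that any range query must loop until no token comes back. A hedged, client-library-agnostic sketch; `execute_segmented` is a hypothetical callable standing in for whatever segmented-query call your library exposes:

```python
def query_all(execute_segmented):
    """Drain a segmented table query: keep issuing requests until the service
    stops returning a continuation token.

    `execute_segmented(token)` must return (rows, next_token), where
    next_token is None on the final segment."""
    rows, token = [], None
    while True:
        page, token = execute_segmented(token)
        rows.extend(page)
        if token is None:
            return rows
```

Stopping after the first response is the classic bug: a query that happens to cross a partition range boundary silently returns a partial result.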

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• "Server busy": the system load balances partitions to meet traffic needs, or the load on a single partition has exceeded the limits
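The hash-prefix idea above can be made concrete. This is an illustrative sketch only; the function name, MD5 choice, and 16-bucket count are assumptions, not part of any Azure API:

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Prepend a stable hash-derived bucket to an otherwise monotonically
    increasing key (timestamp, sequence number) so writes spread across
    `buckets` partitions instead of hammering the 'append only' tail."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "%02d_%s" % (bucket, natural_key)
```

The trade-off: range queries over the natural key now need one query per bucket (fanned out in parallel), which is exactly the "parallelize queries" advice above.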

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1…4 to the queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage to delete them once processed.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1
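The two rules above fit in one function. A minimal sketch; the `floor` and `ceiling` values (1 s and 60 s) are illustrative defaults, not prescribed by the service:

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off polling: an empty poll doubles the
    interval (capped at `ceiling`); a successful poll resets it to `floor`."""
    if got_message:
        return floor
    return min(current * 2.0, ceiling)
```

A worker loop would call this after every GetMessage and sleep for the returned interval, so an idle queue costs few transactions while a busy queue is drained at full speed.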

Removing Poison Messages

Producers (P1, P2) enqueue; consumers (C1, C2) dequeue with a 30 s visibility timeout:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
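Step 12 of the walkthrough is the crux: the dequeue count is the signal that a message is poison. A hedged sketch of that check, with the callbacks (`process`, `dead_letter`, `delete`) as hypothetical stand-ins for real queue operations:

```python
MAX_DEQUEUE_COUNT = 2  # mirrors the 'DequeueCount > 2' check in the walkthrough

def handle_message(message, process, dead_letter, delete):
    """If a message keeps reappearing (dequeue count over the threshold),
    treat it as poison: set it aside and delete it instead of retrying."""
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        dead_letter(message)  # e.g. copy to a blob or a 'poison' queue for inspection
        delete(message)
        return False
    process(message)
    delete(message)
    return True
```

Without this guard, one malformed message can crash every worker that picks it up, forever.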

Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
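The ">8 KB" bullet above is the work-ticket pattern in miniature. An illustrative sketch, with `put_blob`/`put_message` as hypothetical callables standing in for real storage operations:

```python
import json
import uuid

QUEUE_MESSAGE_LIMIT = 8 * 1024  # the 8 KB message-size limit cited above

def enqueue_work_ticket(payload, put_blob, put_message):
    """Small payloads travel inline; anything over the limit is parked in a
    blob and only a reference (the 'work ticket') rides on the queue.

    Returns the blob name so the caller can garbage-collect it after the
    message is processed (orphaned blobs otherwise accumulate)."""
    if len(payload) <= QUEUE_MESSAGE_LIMIT:
        put_message(json.dumps({"inline": payload}))
        return None
    blob_name = "ticket-" + str(uuid.uuid4())
    put_blob(blob_name, payload)
    put_message(json.dumps({"blob": blob_name}))
    return blob_name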

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrencybull Spin up additional processes each with a specific task or as a

unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighborsbull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-

and memory-intensive they may be a good neighbor for storage IO-intensive code

MemoryIntensive

CPUIntensive

Network IO Intensive Storage IO Intensive

Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not

over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 44: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Table Structure

Account MovieData

Star WarsStar TrekFan Boys

Table Name Movies

Brian H PrinceJason ArgonautBill Gates

Table Name Customers

Account

Table

Entity

Tables store entities Entity schema can vary in the same table

Windows Azure Tables

bull Provides Structured Storagebull Massively Scalable Tablesbull Billions of entities (rows) and TBs of

databull Can use thousands of servers as traffic

grows

bull Highly Available amp Durablebull Data is replicated several times

bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language

Is not relationalCan Not-bull Create foreign key relationships between tablesbull Perform server side joins between tablesbull Create custom indexes on the tablesbull No server side Count() for example

All entities must have the following propertiesbull Timestampbull PartitionKeybull RowKey

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message deliverybull Simple asynchronous work

dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

Partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $3512
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $1000

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3.)

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.

Partitions and Partition Ranges

Initially the whole table is one range served by one server:

Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After load balancing, the range is split across servers:

Server A: Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Key Selection: Things to Consider

Scalability
• PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced

Query Efficiency & Speed
• Point queries are most efficient
• Avoid frequent large scans
• Parallelize queries

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token can be returned:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
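A paging loop that always follows the token is the safe default. The sketch below simulates a table service that returns at most 1000 rows per response plus a continuation token; the function names and the integer token are illustrative stand-ins, not the real Azure Storage client API.

```python
# Sketch: always follow continuation tokens when scanning a table.
# `query_page` simulates one service response (<= 1000 rows + token);
# `query_all` drains the query by looping until no token comes back.

PAGE_LIMIT = 1000

def query_page(rows, continuation=0):
    """Return up to PAGE_LIMIT rows starting at `continuation`,
    plus the next continuation token (None when exhausted)."""
    page = rows[continuation:continuation + PAGE_LIMIT]
    next_token = continuation + len(page)
    if next_token >= len(rows):
        next_token = None
    return page, next_token

def query_all(rows):
    """Drain a range query by looping until no token is returned."""
    results, token = [], 0
    while token is not None:
        page, token = query_page(rows, token)
        results.extend(page)
    return results
```

A client that stops after the first response silently drops everything past row 1000, which is exactly the bug the slide warns about.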

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
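The "hash as prefix" advice can be sketched concretely. Assuming illustrative names (this is not an Azure API), prefixing the PartitionKey with a hash bucket spreads sequential, append-only inserts such as timestamped rows across several partitions:

```python
# Sketch: avoid "append only" hot partitions by prefixing the
# PartitionKey with a hash bucket. NUM_BUCKETS and the key format
# are illustrative choices, not part of any storage API.

import hashlib

NUM_BUCKETS = 16

def partition_key(row_id: str) -> str:
    """Derive a bucketed PartitionKey from a logical row id."""
    digest = hashlib.md5(row_id.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}-{row_id}"
```

The trade-off: a full scan now requires NUM_BUCKETS range queries (one per prefix), which is why the recap also says to parallelize queries.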

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

A Web Role calls PutMessage to add messages to the queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve a message and RemoveMessage to delete it once processing completes.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x, up to a maximum
• A successful poll sets the interval back to 1
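The back-off rule is small enough to state as a function. A minimal sketch, with illustrative minimum and maximum intervals in seconds:

```python
# Sketch of truncated exponential back-off polling: each empty poll
# doubles the sleep interval up to a cap; a successful poll resets it
# to the minimum. MIN/MAX values are illustrative.

MIN_INTERVAL = 1
MAX_INTERVAL = 64

def next_interval(current, got_message):
    """Compute the next polling interval from the current one."""
    if got_message:
        return MIN_INTERVAL                 # reset on success
    return min(current * 2, MAX_INTERVAL)   # truncate the growth
```

The cap matters: without it, a long-idle worker could back off so far that new work sits unnoticed for a long time.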

Removing Poison Messages

Scenario 1 – normal processing (producers P1, P2; consumers C1, C2):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 – a consumer crashes, and the message reappears:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 – a poison message keeps crashing consumers, so its dequeue count grows:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
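The threshold check from scenario 3 fits in a few lines. A minimal sketch in which a plain dict stands in for a queue message and the threshold is an illustrative choice, not part of any real queue API:

```python
# Sketch: enforce a dequeue-count threshold to remove poison messages.
# A real queue service reports a dequeue count per message; here a
# dict field plays that role.

MAX_DEQUEUE = 3

def handle(message, process, dead_letter):
    """Process a message unless it has been dequeued too often."""
    if message["dequeue_count"] > MAX_DEQUEUE:
        dead_letter.append(message)   # quarantine the poison message
        return "dead-lettered"
    process(message)
    return "processed"
```

Quarantining to a dead-letter store (rather than silently deleting) keeps the bad payload available for later diagnosis.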

Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • Pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
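The data-parallel idea the deck attributes to the .NET Task Parallel Library is language-independent. A minimal Python analogue, using the stdlib thread pool and sizing it to the core count in line with the "don't exceed the number of cores" advice above:

```python
# Sketch: data parallelism over a worker pool sized to the machine.
# This mirrors the TPL's Parallel/PLINQ style in Python, purely as
# an illustration of the pattern.

import os
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items):
    """Apply `func` to every item using one worker per core."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```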

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs

Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.)
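The bandwidth win from gzip is easy to measure. A small sketch using Python's stdlib gzip module; the HTML payload is a made-up repetitive example, which is close to the best case for gzip:

```python
# Sketch: measure how much gzip shrinks a text payload.
# Repetitive markup (tables, lists) compresses extremely well,
# which is why gzipping all output content pays off.

import gzip

def gzip_ratio(payload: bytes) -> float:
    """Return compressed size as a fraction of the original size."""
    return len(gzip.compress(payload)) / len(payload)
```

On highly repetitive markup the ratio can drop below a few percent; real pages compress less dramatically but still substantially.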

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
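Query segmentation is the simplest of these parallelization strategies: split the input sequences into fixed-size partitions and BLAST each independently. A minimal sketch (the partition size of 100 echoes the choice AzureBLAST's micro-benchmarks later identify as best):

```python
# Sketch of query segmentation: split an input sequence list into
# fixed-size partitions, each of which can be searched independently
# (pleasingly parallel). Partition size is the tuning knob.

def split_sequences(sequences, partition_size=100):
    """Yield successive partitions of at most `partition_size` sequences."""
    for start in range(0, len(sequences), partition_size):
        yield sequences[start:start + partition_size]
```

Merging is then just concatenating per-partition hit lists, which is what makes this pattern so cloud-friendly.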

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, whose outputs are combined by a merging task.

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long waiting time in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST Architecture

(Diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to pools of Worker instances; a Database Updating Role refreshes the NCBI databases. An Azure Table holds the Job Registry; Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc. The task flow is the same splitting task → BLAST tasks → merging task pattern.)

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state

(Diagram: the Web Portal and Web Service feed job registration; the Job Scheduler and Scaling Engine operate against the Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• A total of 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Figure: VM counts per deployment, roughly 50 or 62 each across the 8 deployments.)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
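Finding the anomaly mechanically means pairing "Executing" lines with their matching "done" lines. A minimal sketch over the log format shown in the samples:

```python
# Sketch: scan worker logs for "Executing the task N" lines that are
# never followed by a matching "Execution of task N is done" line --
# the signature of a failed or lost task.

import re

EXEC = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Return the ids of tasks that started but never finished."""
    started, finished = set(), set()
    for line in log_lines:
        if m := EXEC.search(line):
            started.add(m.group(1))
        elif m := DONE.search(line):
            finished.add(m.group(1))
    return started - finished
```

Run against the second sample above, this flags task 251774: it was picked up but never reported done, so the worker presumably lost it during the gap.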

Surviving System Upgrades

North Europe Data Center: a total of 34,256 tasks processed.

All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
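The Penman-Monteith formula itself is a one-liner once the inputs are in hand; the hard part, as the slide notes, is estimating them across a catchment. A direct transcription, with purely illustrative input magnitudes (not a validated parameterization):

```python
# Sketch: direct transcription of the Penman-Monteith equation.
# ET = (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lambda_v)
# Default gamma ~= 66 Pa/K as in the slide; lambda_v ~= 2450 J/g.

def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lambda_v=2450.0):
    """Evapotranspiration from the Penman-Monteith terms."""
    numerator = delta * rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator
```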

ET Synthesizes Imagery, Sensors, Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue; the Data Collection Stage pulls source imagery from download sites via a Download Queue; the Reprojection Stage reads a Reprojection Queue and Source Metadata; the Derivation and Analysis Reduction Stages read the Reduction 1 and Reduction 2 Queues; scientists download the science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage> Task Queue and read/write <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request is persisted as ReprojectionJobStatus and queued on the Job Queue; the Service Monitor (Worker Role) parses and persists ReprojectionTaskStatus entities and dispatches to the Task Queue, which GenericWorker (Worker Role) instances service against Reprojection Data Storage and Swath Source Data Storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (data volume, file count, duration, workers, cost):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November 2010 Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 45: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Windows Azure Tables

bull Provides Structured Storagebull Massively Scalable Tablesbull Billions of entities (rows) and TBs of

databull Can use thousands of servers as traffic

grows

bull Highly Available amp Durablebull Data is replicated several times

bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language

Is not relationalCan Not-bull Create foreign key relationships between tablesbull Perform server side joins between tablesbull Create custom indexes on the tablesbull No server side Count() for example

All entities must have the following propertiesbull Timestampbull PartitionKeybull RowKey

Windows Azure Queues

bull Queue are performance efficient highly available and provide reliable message deliverybull Simple asynchronous work

dispatch

bull Programming semantics ensure that a message can be processed at least once

bull Access is provided via REST

Storage PartitioningUnderstanding partitioning is key to understanding

performance

bull Different for each data type (blobs entities queues)Every data object has a

partition key

bull A partition can be served by a single serverbull System load balances partitions based on traffic patternbull Controls entity locality

Partition key is unit of scale

bull Load balancing can take a few minutes to kick inbull Can take a couple of seconds for partition to be available on a

different serverSystem load balances

bull Use exponential backoff on ldquoServer Busyrdquobull Our system load balances to meet your traffic needsbull Single partition limits have been reached

Server Busy

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partitionEntities ndash TableName +

PartitionKeyPartitionKey (CustomerId) RowKey

(RowKind)Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partitionBlobs ndash Container name +

Blob name

bull All messages for a single queue belong to the same partitionMessages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Replication Guarantee

bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has

written to all three replicasbull Reads are only load balanced to replicas in

syncServer 1 Server 2 Server 3

P1

P2

Pn

P1

P2

Pn

P1

P2

Pn

Scalability TargetsStorage Account

bull Capacity ndash Up to 100 TBsbull Transactions ndash Up to a few thousand requests per secondbull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

After the partition splits:

Server A: Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
Action                  | Fast & Furious       | …         | 2009
Action                  | The Bourne Ultimatum | …         | 2007
…                       | …                    | …         | …
Animation               | Open Season 2        | …         | 2009
Animation               | The Ant Bully        | …         | 2006

Server B: Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
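A loop that honors these rules looks like the following sketch; `query_page` is a hypothetical callable standing in for the Table service's paged query:

```python
def query_all(query_page):
    """Drain a paged table query.

    `query_page` takes a continuation token (or None) and returns
    (rows, next_token). The service may return fewer than 1,000 rows
    and still hand back a token -- partition boundary, 5-second
    execution limit -- so the only safe pattern is: loop until the
    token comes back empty, even when a page is empty.
    """
    token = None
    while True:
        rows, token = query_page(token)
        yield from rows
        if token is None:
            break
```

Note the middle page in the usage below is empty yet carries a token; treating an empty page as "done" would silently drop results.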

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale; avoid "append only" patterns – distribute by using a hash etc. as a prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server Busy" means the load on a single partition has exceeded the limits, and partitions are being load balanced to meet traffic needs
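The hash-prefix trick for "append only" keys can be sketched as follows (illustrative; the bucket count and key format are assumptions, not part of any Azure SDK):

```python
import hashlib

def spread_key(natural_key, buckets=16):
    """Prefix a naturally 'append only' key with a stable hash bucket.

    Timestamp-ordered PartitionKeys pile all new writes onto one
    partition server; a short, deterministic hash prefix spreads the
    load across `buckets` partitions while a reader who knows the
    natural key can still recompute the full key for a point query.
    """
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: range scans over the natural order now require one query per bucket, which is why the recap pairs this with "parallelize queries".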

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together: tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role adds messages (Msg 1…Msg 4) to a queue with PutMessage; Worker Roles retrieve them with GetMessage (with a visibility timeout) and remove them with RemoveMessage after processing.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos-Queue-Service/1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
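The doubling/reset rule above can be sketched as a small generator (Python, illustrative only):

```python
def poll_intervals(results, start=1, cap=60):
    """Yield the sleep interval (in seconds) before each queue poll.

    `results` is an iterable of booleans: True if the poll returned a
    message, False if the queue was empty. Empty polls double the
    interval (truncated at `cap`); a hit resets it to `start`.
    """
    interval = start
    for got_message in results:
        yield interval
        interval = start if got_message else min(cap, interval * 2)
```

For example, three empty polls followed by a hit produce intervals 1, 2, 4, 8, then reset to 1.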


Removing Poison Messages

[Diagram: producers P1, P2 feed queue Q; consumers C1, C2 drain it. Event sequence:]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1)

Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
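The dequeue-count threshold from the recap can be sketched as follows; the queue clients here are hypothetical stand-ins, not the real storage SDK:

```python
POISON_THRESHOLD = 3

def process_next(queue, handler, poison_queue):
    """Pop one message, guarding against poison messages.

    `queue` and `poison_queue` are assumed clients exposing
    get_message()/delete_message()/put_message(); messages expose a
    `dequeue_count` the way Azure queue messages do. A message seen
    more than POISON_THRESHOLD times is assumed to crash its
    consumers, so it is parked for inspection instead of being
    retried forever.
    """
    msg = queue.get_message(visibility_timeout=30)
    if msg is None:
        return
    if msg.dequeue_count > POISON_THRESHOLD:
        poison_queue.put_message(msg.content)  # park it for inspection
        queue.delete_message(msg)
        return
    handler(msg.content)       # idempotent by design (see recap)
    queue.delete_message(msg)  # only after successful processing
```

Deleting only after the handler succeeds is what makes the visibility timeout safe: a crash mid-handler just makes the message reappear.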

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – larger, fewer VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• Pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures needed to take your service down)

• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU

• Balance using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
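The deck's example is .NET 4's Task Parallel Library; a rough Python analogue of the data-parallel case, with the pool sized to the instance's core count (1/2/4/8 for small through extra-large VMs), might look like:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def process_all(items, work, workers=None):
    """Data-parallel sketch of 'use every core you paid for'.

    Fans `work` out over `items` using a pool sized to the VM's core
    count by default. Illustrative only -- for CPU-bound pure-Python
    work a ProcessPoolExecutor would dodge the GIL; threads fit
    I/O-bound tasks like storage calls.
    """
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, items))  # preserves input order
```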

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs
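A queue-depth-driven instance count, with a floor and ceiling reflecting the trade-offs above, might be sketched as follows (all thresholds here are assumptions for illustration):

```python
def target_workers(queue_depth, msgs_per_worker_min=60,
                   min_workers=2, max_workers=20):
    """Pick an instance count from observed queue depth.

    Keeps a floor of capacity (VMs take minutes to boot, and at ~$3 a
    day an extra instance is cheap insurance against a poor user
    experience) and a ceiling so a traffic spike cannot run up the
    bill unbounded.
    """
    wanted = -(-queue_depth // msgs_per_worker_min)  # ceiling division
    return max(min_workers, min(max_workers, wanted))
```

A scaling engine would poll the queue length every few minutes and adjust the deployment's instance count toward this target.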


Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing – they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places. Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
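Point 1 can be sketched with Python's standard gzip module; `maybe_gzip` is an illustrative helper, not part of any Azure SDK:

```python
import gzip

def maybe_gzip(body, accept_encoding):
    """Gzip a response body when the client advertises support.

    Returns (payload, extra_headers). Trading a little CPU for a much
    smaller payload is usually a win: bandwidth is billed per byte,
    while the cycles inside the VM are already paid for.
    """
    if "gzip" in accept_encoding:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}
```

Text-heavy payloads (HTML, JSON, JavaScript) typically shrink by 70-90%; already-compressed formats like PNG or JPEG gain nothing and can be skipped.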

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance
• Argument "-a" of NCBI BLAST: 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI BLAST startup, data transfer)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

[Diagram: splitting task → BLAST task × n → merging task]
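The split step of this Split/Join pattern can be sketched as follows; `split_fasta` is an illustrative helper (the deck does not show AzureBLAST's actual splitter), with the partition size defaulting to the 100 sequences the micro-benchmarks later recommend:

```python
def split_fasta(text, seqs_per_partition=100):
    """Split a FASTA query file into partitions for parallel BLAST.

    Each partition becomes one work-ticket message on the dispatch
    queue; a merge step concatenates the per-partition results once
    every partition reports done.
    """
    seqs, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            seqs.append("\n".join(current))  # flush previous sequence
            current = []
        current.append(line)
    if current:
        seqs.append("\n".join(current))
    return ["\n".join(seqs[i:i + seqs_per_partition])
            for i in range(0, len(seqs), seqs_per_partition)]
```

Splitting on record boundaries (the `>` header lines) rather than byte offsets keeps every partition a valid BLAST input on its own.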

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resources

AzureBLAST

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, keeping the job registry in an Azure Table; tasks are dispatched through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data, refreshed by a database-updating role. Task flow: splitting task → BLAST tasks → merging task.]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: the web portal and web service hand job registrations to the job scheduler, job registry, and scaling engine]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (42 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins, ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed, then the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue; the data collection stage pulls source imagery from download sites via a Download Queue; the Reprojection, Reduction 1, and Reduction 2 Queues drive the reprojection, derivation reduction, and analysis reduction stages; scientists download the science results. Source metadata is kept alongside.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a GenericWorker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue tasks and read <Input> Data Storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue; GenericWorker (Worker Role) instances execute tasks against Reprojection Data Storage and Swath Source Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage (via the AzureMODIS Service Web Role Portal):

Stage                | Data       | Files | Hours             | Workers | Cost
Data collection      | 400-500 GB | 60K   | 11 (at 10 MB/sec) | <10     | $50 upload, $450 storage
Reprojection         | 400 GB     | 45K   | 3,500             | 20-100  | $420 CPU, $60 download
Derivation reduction | 5-7 GB     | 55K   | 1,800             | 20-100  | $216 CPU, $1 download, $6 storage
Analysis reduction   | <10 GB     | ~1K   | 1,800             | 20-100  | $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit – November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

bull Every blob and its snapshots are in a single partitionBlobs ndash Container name +

Blob name

bull All messages for a single queue belong to the same partitionMessages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Replication Guarantee

bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has

written to all three replicasbull Reads are only load balanced to replicas in

syncServer 1 Server 2 Server 3

P1

P2

Pn

P1

P2

Pn

P1

P2

Pn

Scalability TargetsStorage Account

bull Capacity ndash Up to 100 TBsbull Transactions ndash Up to a few thousand requests per secondbull Bandwidth ndash Up to a few hundred megabytes per second

Single QueueTable Partition

bull Up to 500 transactions per second

To go above these numbers partition between multiple storage accounts and partitions

When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff

Single Blob Partition

bull Throughput up to 60 MBs

PartitionKey(Category)

RowKey(Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

PartitionKey(Category)

RowKey(Title)

Timestamp ReleaseDate

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

PartitionKey(Category)

RowKey(Title)

Timestamp ReleaseDate

Action Fast amp Furious hellip 2009

Action The Bourne Ultimatum hellip 2007

hellip hellip hellip hellip

Animation Open Season 2 hellip 2009

Animation The Ant Bully hellip 2006

hellip hellip hellip hellip

Comedy Office Space hellip 1999

hellip hellip hellip hellip

SciFi X-Men Origins Wolverine hellip 2009

hellip hellip hellip hellip

War Defiance hellip 2008

Partitions and Partition Ranges

Server BTable = Movies[Comedy - Max]

Server ATable = Movies[Min - Comedy)

Server ATable = Movies

[Min - Max]

Key Selection Things to Consider

bullDistribute load as much as possiblebullHot partitions can be load balancedbullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scansbull Parallelize queriesbull Point queries are most efficient

bullTransactions across a single partitionbullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recapbull Efficient for frequently used queriesbull Supports batch transactionsbull Distributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handlecontinuation tokens

ldquoORrdquo predicates are not optimized

Implement back-offstrategy for retries

bull Distribute by using a hash etc as prefix

bull Expect continuation tokens for range queries

bull Execute the queries that form the ldquoORrdquo predicates as separate queries

bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits

WCF Data Services

bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked

bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not

bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance

bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout)RemoveMessage

Msg 2Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1.
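The rule above can be sketched as a tiny interval function (a sketch; the 64-second cap is an illustrative choice, not from the slide):

```python
# Truncated exponential back-off polling: each empty poll doubles the
# interval up to a cap; a successful poll resets it to 1 second.

def next_interval(current, got_message, cap=64.0):
    if got_message:
        return 1.0                 # reset on success
    return min(current * 2, cap)   # double on an empty poll, up to the cap

# Seven empty polls in a row: 1 -> 2 -> 4 -> ... truncated at 64
interval = 1.0
history = []
for _ in range(7):
    interval = next_interval(interval, got_message=False)
    history.append(interval)
```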

Removing Poison Messages

Producers (P1, P2) put messages onto the queue; consumers (C1, C2) take them:
1. GetMessage(Q, 30 s) → msg 1 (C1)
2. GetMessage(Q, 30 s) → msg 2 (C2)

Removing Poison Messages (2)

1. GetMessage(Q, 30 s) → msg 1 (C1)
2. GetMessage(Q, 30 s) → msg 2 (C2)
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. GetMessage(Q, 30 s) → msg 1 (C2)

Removing Poison Messages (3)

1. Dequeue(Q, 30 sec) → msg 1 (C1)
2. Dequeue(Q, 30 sec) → msg 2 (C2)
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 sec) → msg 1 (C2)
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1 (C1)
12. DequeueCount > 2
13. Delete(Q, msg 1)
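Step 12 is the poison-message check: once a message has been dequeued more than a threshold number of times without being deleted, assume processing it always fails and discard it instead. A minimal sketch (the dead-letter list and threshold value are illustrative assumptions):

```python
# Using the dequeue count to discard poison messages: after more than
# POISON_THRESHOLD dequeues without a successful delete, park the message
# instead of processing it again.

POISON_THRESHOLD = 2

def handle(message, dequeue_count, dead_letter):
    """Return True if the message was processed, False if discarded."""
    if dequeue_count > POISON_THRESHOLD:
        dead_letter.append(message)   # park it for offline inspection
        return False
    # ... normal message processing would happen here ...
    return True

dead = []
ok = handle("msg 1", dequeue_count=1, dead_letter=dead)
discarded = handle("msg 1", dequeue_count=3, dead_letter=dead)
```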

Queues Recap

• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
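The "use message count to scale" advice reduces to a simple sizing function: pick a worker count proportional to the queue backlog, within bounds. A hedged sketch (the per-worker backlog figure and bounds are illustrative, not from the slides):

```python
# Scale workers from the queue length: roughly one worker per
# `msgs_per_worker` backlog messages, clamped to [min_workers, max_workers].

def target_workers(queue_length, msgs_per_worker=100,
                   min_workers=1, max_workers=20):
    needed = -(-queue_length // msgs_per_worker)   # ceiling division
    return max(min_workers, min(needed, max_workers))
```

In practice the queue length would come from the queue service's approximate message count; here it is just an integer input.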

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
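The trade-off can be made concrete with a little arithmetic. Assuming flat per-core pricing (an assumption; check current rates), the choice between one 8-core VM and eight 1-core VMs hinges entirely on how close to linear your workload scales. A sketch:

```python
# Sketch of the "larger, fewer VMs vs. many smaller" trade-off, assuming
# flat per-core hourly pricing (illustrative figure, not a quoted rate).

def throughput_per_dollar(cores, scaling_efficiency,
                          price_per_core_hour=0.12, unit_throughput=1.0):
    """Relative work per dollar for one VM with `cores` cores.

    scaling_efficiency: fraction of linear speedup achieved per extra core
    (1.0 = perfectly linear, which is rare across 8 cores).
    """
    speedup = 1 + (cores - 1) * scaling_efficiency
    cost = cores * price_per_core_hour
    return unit_throughput * speedup / cost

small = throughput_per_dollar(cores=1, scaling_efficiency=1.0)
xl_linear = throughput_per_dollar(cores=8, scaling_efficiency=1.0)
xl_sublinear = throughput_per_dollar(cores=8, scaling_efficiency=0.6)
```

With perfect scaling the two options cost the same per unit of work; with sublinear scaling, the smaller instances win on cost, which is the slide's point about measuring before choosing.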

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
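Data parallelism means applying the same operation to many items at once. The deck's example is the .NET Task Parallel Library; the same idea sketched with Python's thread pool as a stand-in:

```python
# Data parallelism with a thread pool: the same per-item function is mapped
# over a collection, with the pool scheduling work across workers.

from concurrent.futures import ThreadPoolExecutor

def process(item):
    return item * item     # placeholder for real per-item work

items = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, items))   # results keep input order
```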

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places. Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
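The compute-for-storage trade in step 2 is easy to see directly: text-heavy output compresses dramatically under gzip. A quick sketch with Python's standard library:

```python
# Gzip trades a little CPU for much smaller storage and bandwidth bills;
# repetitive HTML like this compresses extremely well.

import gzip

page = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"
compressed = gzip.compress(page)
ratio = len(compressed) / len(page)   # well under 10% for this input
```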

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks that run in parallel, followed by a merging task.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• Set to 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of the visibility timeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure
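The split/join task flow above can be sketched with plain functions: split the input sequences into fixed-size partitions, run each partition independently, then merge. The `blast_task` body is a placeholder, not NCBI-BLAST:

```python
# Split/join sketch of AzureBLAST's query segmentation: fixed-size
# partitions are processed independently, then merged.

def split(sequences, partition_size=100):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # placeholder for running NCBI-BLAST over one partition of queries
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    return [hit for partition_hits in results for hit in partition_hits]

sequences = [f"seq{i}" for i in range(250)]
partitions = split(sequences)                    # 100 / 100 / 50
hits = merge(blast_task(p) for p in partitions)
```

The 100-sequence partition size matches the micro-benchmark finding reported below.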

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

Architecture: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, feeding many worker roles through a global dispatch queue; an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a separate role handles database updating. Each job follows the split/join task flow: a splitting task, parallel BLAST tasks, and a merging task.

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory states

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

End Result

• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins.
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins.
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins.

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins.
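The failure signature is a task that logs "Executing" but never logs "done". A sketch of the log scan (the parsing assumes the cleaned line format shown above):

```python
# Scan worker logs for tasks that started but never reported completion.

logs = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins.",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...",
]

started, finished = set(), set()
for line in logs:
    parts = line.split()
    if "Executing the task" in line:
        started.add(parts[6].rstrip("."))   # task id after "Executing the task"
    elif "is done" in line:
        finished.add(parts[6])              # task id after "Execution of task"

incomplete = started - finished             # tasks that likely failed
```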

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in a group. This is an update domain: ~30 mins, ~6 nodes in one group.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed and the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / (λv·(Δ + γ·(1 + ga/gs)))

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J g-1)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
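The Penman-Monteith equation above written as a function. The sample inputs are purely illustrative, and λv is taken in J kg⁻¹ (the slide's variable list uses J g⁻¹, i.e. a factor of 1000):

```python
# Penman-Monteith: ET = (Δ·Rn + ρa·cp·δq·ga) / (λv·(Δ + γ·(1 + ga/gs)))
# gamma defaults to ~66 Pa/K per the slide; lambda_v here is in J/kg.

def penman_monteith(delta, R_n, rho_a, c_p, delta_q, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e6):
    numerator = delta * R_n + rho_a * c_p * delta_q * g_a
    denominator = lambda_v * (delta + gamma * (1 + g_a / g_s))
    return numerator / denominator

# Illustrative values only (not from the slides)
et = penman_monteith(delta=145.0, R_n=400.0, rho_a=1.2, c_p=1005.0,
                     delta_q=1000.0, g_a=0.02, g_s=0.01)
```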

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Scientists submit requests through the AzureMODIS Service Web Role portal. A request queue feeds the data collection stage (which pulls from the source imagery download sites and source metadata), followed by the reprojection queue, the reduction 1 and reduction 2 queues, and finally a download queue for the scientific results.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

Flow: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor from the <PipelineStage> Task Queue
• Retries failed tasks 3 times
• Maintains all task status
• Reads and writes the <Input> Data Storage

Example Pipeline Stage: Reprojection Service

A Reprojection Request enters the Job Queue. The Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances pull work.

Reprojection data storage:
• ScanTimeList – query this table to get the list of satellite scan times that cover a target tile
• SwathGranuleMeta – query this table to get geo-metadata (e.g. boundaries) for each swath tile
• Swath source data storage holds the input imagery

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3,500 hours, 20–100 workers – $420 CPU, $60 download
Derivation reduction stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

(All stages are driven through the AzureMODIS Service Web Role portal.)

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Windows Azure Queues

• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST

Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

Server busy:
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• Single-partition limits have been reached
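The exponential backoff on "Server Busy" can be sketched as a small retry wrapper. This is a hedged sketch (the error type, delays, and injectable `sleep` are illustrative choices, not a storage SDK API):

```python
# Exponential back-off on "Server Busy" (HTTP 503): retry the operation
# with a doubling delay between attempts.

import time

class ServerBusyError(Exception):
    pass

def with_backoff(op, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return op()
        except ServerBusyError:
            if attempt == max_attempts - 1:
                raise
            sleep(delay)
            delay *= 2          # exponential growth between retries

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ServerBusyError()   # busy twice, then succeed
    return "ok"

delays = []
result = with_backoff(flaky, sleep=delays.append)   # record delays, no real sleep
```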

Partition Keys In Each Abstraction

bull Entities w same PartitionKey value served from same partitionEntities ndash TableName +

PartitionKeyPartitionKey (CustomerId) RowKey

(RowKind)Name CreditCardNumber OrderTotal

1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx

1 Order ndash 1 $3512

2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx

2 Order ndash 3 $1000

bull Every blob and its snapshots are in a single partitionBlobs ndash Container name +

Blob name

bull All messages for a single queue belong to the same partitionMessages ndash Queue Name

Container Name Blob Name

image annarborbighousejpg

image foxboroughgillettejpg

video annarborbighousejpg

Queue Message

jobs Message1

jobs Message2

workflow Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

(Server 1, Server 2, and Server 3 each hold replicas of partitions P1, P2, ..., Pn.)

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
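Partitioning between multiple storage accounts is usually done by hashing a key into an account choice, echoing the earlier "distribute by using a hash as prefix" tip. A sketch (account names are illustrative):

```python
# Spread load across several storage accounts by hashing the partition key.

import hashlib

ACCOUNTS = ["acct0", "acct1", "acct2", "acct3"]   # hypothetical account names

def pick_account(partition_key):
    """A stable hash of the key selects one of N storage accounts."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return ACCOUNTS[int(digest, 16) % len(ACCOUNTS)]

placement = {k: pick_account(k) for k in ("user-1", "user-2", "user-3")}
```

Because the hash is stable, the same key always maps to the same account, so reads find the data that writes placed.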

Partitions and Partition Ranges

Table = Movies, with PartitionKey (Category) and RowKey (Title):

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | ... | 2009
Action | The Bourne Ultimatum | ... | 2007
... | ... | ... | ...
Animation | Open Season 2 | ... | 2009
Animation | The Ant Bully | ... | 2006
... | ... | ... | ...
Comedy | Office Space | ... | 1999
... | ... | ... | ...
SciFi | X-Men Origins: Wolverine | ... | 2009
... | ... | ... | ...
War | Defiance | ... | 2008

One server can serve the whole range: Server A, Table = Movies [Min – Max]. Under load the range is split across servers: Server A, Table = Movies [Min – Comedy); Server B, Table = Movies [Comedy – Max].

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query can return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
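Because any of those conditions can cut a result set short, clients must loop until no token is returned. A simulated paged query showing the loop shape (the token here is just an offset; the real token is opaque):

```python
# Continuation-token loop: keep querying until no token comes back.

def query_page(rows, token=None, page_size=1000):
    """Return (page, next_token); next_token is None on the last page."""
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

rows = list(range(2500))
results, token = [], None
while True:
    page, token = query_page(rows, token)
    results.extend(page)
    if token is None:       # no continuation token: the result set is complete
        break
```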

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale.

Avoid "append only" patterns.

Always handle continuation tokens.

ldquoORrdquo predicates are not optimized

Implement back-offstrategy for retries

bull Distribute by using a hash etc as prefix

bull Expect continuation tokens for range queries

bull Execute the queries that form the ldquoORrdquo predicates as separate queries

bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits

WCF Data Services

bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked

bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not

bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance

bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout)RemoveMessage

Msg 2Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

60

21

11

C1

C2

Removing Poison Messages

11

21

340

Producers Consumers

P2

P1

30

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

11

21

10

20

61

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

11

21

2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 15 C1 crashed

11

21

6 msg1 visible 30 s after Dequeue30

12

11

12

62

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

12

2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed

1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1

2

6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue

30

13

12

13

Queues Recap

bullNo need to deal with failuresMake messageprocessing idempotent

bull Invisible messages result in out of orderDo not rely on order

bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages

bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to storemessage data with

reference in message

Use message countto scale

bullNo need to deal with failures

bull Invisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• This may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
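The same data-parallel idea can be sketched outside .NET; here Python's standard `concurrent.futures` stands in for the Task Parallel Library:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, workers=4):
    """Data parallelism: apply `func` to every item using a bounded pool,
    keeping the number of active workers near the core count.
    Results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

For CPU-bound .NET code the analogous call is the TPL's parallel loop; the point is the bounded worker pool, not the specific library.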

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
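A sketch of the flat-fee vs. per-transaction comparison; the prices below are made-up placeholders, not actual SQL Azure or Table storage rates:

```python
def monthly_storage_cost(flat_fee, per_txn_price, transactions):
    """Compare a flat-fee service against per-transaction billing.
    All prices are illustrative placeholders."""
    return {"flat": flat_fee, "per_txn": per_txn_price * transactions}

# A low-traffic app: per-transaction billing wins easily.
costs = monthly_storage_cost(9.99, 0.0001, 50_000)
```

The break-even point is simply `flat_fee / per_txn_price` transactions per month, which is why knowing your app's transaction profile matters.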

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Saving bandwidth costs often leads to savings in other places: sending fewer things over the wire often means getting fewer things from storage, and sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Pipeline: uncompressed content → Gzip → minify JavaScript → minify CSS → minify images → compressed content
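Step 1 can be sketched with any standard gzip implementation; here Python's stdlib illustrates how well repetitive markup compresses:

```python
import gzip

def compress_response(body: bytes) -> bytes:
    """Gzip an HTTP response body; modern browsers decompress it
    on the fly when the Content-Encoding: gzip header is set."""
    return gzip.compress(body)

html = b"<html>" + b"<li>item</li>" * 1000 + b"</html>"
small = compress_response(html)
# Repetitive markup compresses dramatically, cutting bandwidth charges.
```

The CPU spent compressing is the "trade compute for size" point in step 2.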

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow
A simple split/join pattern.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long wait in case of instance failure

[Diagram: a splitting task fans out to BLAST tasks, which feed a merging task.]
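The split/join pattern itself is tiny; a hedged sketch in plain Python (the default partition size follows the 100-sequences-per-partition finding from the micro-benchmarks):

```python
def split_input(sequences, partition_size=100):
    """Query segmentation: cut the input into partitions that worker
    roles can BLAST independently (pleasingly parallel)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    """Join step: concatenate per-partition hit lists once all
    partitions have been processed."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

In AzureBLAST the partitions travel through the queue as work tickets; here they are just lists.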

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

AzureBLAST

[Diagram: the AzureBLAST architecture. A Web Role exposes the web portal and web service for job registration. A Job Management Role runs the job scheduler, scaling engine, and database-updating role; a global dispatch queue feeds the worker roles, which execute the split/join task flow (splitting task → BLAST tasks → merging task). An Azure Table holds the job registry; Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc.]

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

An accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state

[Diagram: the job portal's web portal and web service feed job registration; the job scheduler, scaling engine, and job registry manage accepted jobs.]

Demonstration

R. palustris as a Platform for H2 Production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.
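The arithmetic behind that serial-runtime estimate:

```python
MINUTES_PER_YEAR = 60 * 24 * 365          # 525,600

def minutes_to_years(minutes):
    """Convert a serial-runtime estimate in minutes into years."""
    return minutes / MINUTES_PER_YEAR

years = minutes_to_years(3_216_731)       # the sampled estimate
# Roughly 6.1 years on a single desktop -- infeasible without the cloud.
```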

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When loads were imbalanced, the load was redistributed manually

[Map: distribution of extra-large instances across the eight deployments, 50–62 each.]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• We looked into the log data to analyze what took place

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
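Pairing "Executing" records with their "done" records surfaces the failed tasks; a minimal sketch (the regexes assume the simplified log format shown above):

```python
import re

def unfinished_tasks(log_lines):
    """Return ids of tasks that logged 'Executing the task N' but never
    logged a matching 'Execution of task N is done' record."""
    started, done = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return started - done
```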

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups – this is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed and the job was killed.

35 nodes experienced blob-writing failures at the same time – a reasonable guess is that the fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
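The Penman-Monteith formula transcribes directly into code. The default values for γ and λv below are typical-value assumptions for illustration; supply measured inputs in practice:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith (1964) evapotranspiration:
        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
    gamma defaults to ~66 Pa/K; lambda_v to ~2450 J/g (assumed typical
    latent heat of vaporization). Units follow the definitions above."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The catchment-scale difficulty is not the formula but estimating ga and gs everywhere, which is what the imagery and sensor inputs below feed.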

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service web role portal. A request queue feeds the data collection stage, which pulls source imagery from download sites using source metadata; the reprojection queue, reduction 1 queue, reduction 2 queue, and download queue connect the reprojection, derivation reduction, and analysis reduction stages; scientists download the science results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the web role front door:
  • Receives all user requests
  • Queues each request to the appropriate download, reprojection, or reduction job queue

• The Service Monitor is a dedicated worker role:
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> request flows to the MODISAzure Service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Generic Worker (worker role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor dispatches to the <PipelineStage> task queue; Generic Workers dequeue tasks and read/write <Input>Data storage.]

Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request flows through the job queue to the Service Monitor, which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile), then dispatches to the task queue consumed by Generic Workers. Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) of each swath tile and the ScanTimeList table for the satellite scan times that cover a target tile, reading swath source data storage and writing reprojection data storage.]

Costs for 1 US Year of ET Computation

• Computational costs are driven by the data scale and the need to run the reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3,500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1,800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1,800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Storage Partitioning
Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• It is different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• The partition key controls entity locality

The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• It means the limits of a single partition have been reached

Partition Keys In Each Abstraction

Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition:

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

Blobs – Container name + Blob name; every blob and its snapshots are in a single partition:

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Queues – Queue name; all messages for a single queue belong to the same partition:

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1

Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

[Diagram: partitions P1, P2, ..., Pn replicated across Server 1, Server 2, and Server 3.]

Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.

Partitions and Partition Ranges

A table of movies keyed by PartitionKey (Category) and RowKey (Title):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

Initially one server holds the entire range:
Server A: Table = Movies [Min - Max]

Under load, the system splits the table into partition ranges across servers:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• The PartitionKey is critical for scalability

Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions work across a single partition
• Transaction semantics reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A continuation token is returned when:
• A response reaches the maximum of 1,000 rows
• The query reaches the end of a partition range boundary
• The query reaches the maximum of 5 seconds of execution
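Handling continuation tokens is just a drain loop. In this sketch, `execute_query` is a hypothetical stand-in that returns a page of rows plus the next token (None when the query is done):

```python
def query_all(execute_query):
    """Drain a paged table query: keep re-issuing the query with the
    returned continuation token until the server stops returning one."""
    rows, token = [], None
    while True:
        page, token = execute_query(token)
        rows.extend(page)
        if token is None:          # no continuation token -> done
            return rows
```

Code that reads only the first response silently drops rows whenever any of the three conditions above triggers.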

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help you scale:
• Avoid "append only" patterns – distribute by using a hash, etc., as a prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy" – the system load-balances partitions to meet traffic needs, and the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
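The hash-prefix trick for append-only keys can be sketched as follows (a hypothetical helper, not part of the storage client library):

```python
import hashlib

def spread_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix an append-only key (e.g. a timestamp) with a stable hash
    bucket so writes spread across partitions instead of hammering the
    last partition in the range."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

Range queries then fan out over the bucket prefixes, which is the price paid for write distribution.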

Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
• Tight coupling leads to brittleness; decoupling aids scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Messages are limited to 8 KB in size
• Commonly used with the work-ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a web role calls PutMessage to add messages (Msg 1-4) to the queue; worker roles call GetMessage (with a visibility timeout) to retrieve them and RemoveMessage after successful processing.]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
• The interval is truncated at a maximum (e.g., 60)
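The back-off schedule can be expressed as a small pure function (the cap of 60 is an assumed truncation point for illustration):

```python
def backoff_intervals(poll_results, base=1, cap=60):
    """Return the polling interval to use after each poll.
    poll_results: booleans, True when the poll returned a message.
    Empty polls double the interval (truncated at `cap`);
    a successful poll resets it to `base`."""
    interval, out = base, []
    for got_message in poll_results:
        if got_message:
            interval = base                    # success: reset to 1
        else:
            interval = min(interval * 2, cap)  # empty: double, truncated
        out.append(interval)
    return out
```

This keeps idle workers from burning storage transactions on an empty queue while staying responsive once traffic returns.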

Removing Poison Messages

Scenario 1 – normal processing (producers P1, P2; consumers C1, C2):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 – a consumer crashes:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 – a poison message:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
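Step 12's poison-message check reduces to a threshold on the dequeue count. A minimal sketch, with the message represented as a plain dict and the threshold matching the DequeueCount > 2 rule above:

```python
def triage(message, max_dequeue=2):
    """Decide what to do with a dequeued message based on its dequeue
    count. A message seen more than `max_dequeue` times is assumed
    poisonous and is deleted instead of being processed again."""
    if message["dequeue_count"] > max_dequeue:
        return "delete"        # poison: remove so it stops blocking workers
    return "process"           # normal: handle it, then delete on success
```

In practice the deleted payload would be logged or parked in a dead-letter store for inspection rather than silently discarded.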

Queues Recap
• No need to deal with failures → make message processing idempotent
• Invisible messages result in out-of-order delivery → do not rely on order
• Enforce a threshold on a message's dequeue count → use the dequeue count to remove poison messages
• Messages > 8 KB → use a blob to store the message data, with a reference in the message (batch messages; garbage-collect orphaned blobs)
• Dynamically increase/reduce workers → use the message count to scale

Use blob to storemessage data with

reference in message

Use message countto scale

bullNo need to deal with failures

bull Invisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrencybull Spin up additional processes each with a specific task or as a

unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighborsbull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-

and memory-intensive they may be a good neighbor for storage IO-intensive code

MemoryIntensive

CPUIntensive

Network IO Intensive Storage IO Intensive

Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not

over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With special considerations
• Batch job management
• Task parallelism on an elastic Cloud

Wei Lu, Jared Jackson, and Roger Barga, AzureBlast: A Case Study of Developing Science Applications on the Cloud, in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow

A simple Split/Join pattern: Splitting task → BLAST tasks (in parallel) → Merging task

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST

Architecture diagram: a Web Role hosts the Web Portal, Web Service, job registration, Job Scheduler, and Scaling Engine; a global dispatch queue feeds the Worker instances; a Job Management Role and a Database Updating Role run alongside. Azure Table holds the Job Registry; Azure Blob holds the NCBI databases, BLAST databases, temporary data, etc. Task flow: Splitting task → BLAST tasks → Merging task.

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states

Diagram: Job Portal → Web Portal / Web Service → job registration → Job Scheduler / Scaling Engine → Job Registry

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

Map diagram: per-deployment instance counts (deployments of 50 and 62 instances spread across the four datacenters)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed, and the job was killed

35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.
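As a sanity check of the formula, here is a direct transcription into code (Python; the sample input values and the default λv are assumptions chosen only to exercise the arithmetic, not from the deck):

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2260.0):
    # Penman-Monteith (1964):
    #   ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
    # gamma ≈ 66 Pa/K; lambda_v ≈ 2260 J/g (assumed default).
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

# Hypothetical mid-day values over a vegetated surface.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2,
                     cp=1005.0, dq=1200.0, ga=0.02, gs=0.01)
assert et > 0.0
```

The per-pixel arithmetic is trivial; the hard part, as the slide notes, is estimating the conductivities ga and gs across a catchment, which is what drives the big data reduction.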

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue, Download Queue, Reprojection Queue, and Reduction 1 / Reduction 2 Queues drive the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; source imagery comes from download sites, source metadata is kept alongside, and scientific results are downloaded at the end.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks: recoverable units of work
• Execution status of all jobs and tasks persisted in Tables

Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → parse & persist <PipelineStage>TaskStatus → dispatch → <PipelineStage> Task Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Diagram: the Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read <Input>Data Storage.

Example Pipeline Stage: Reprojection Service

Diagram: a Reprojection Request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), then dispatches to the Task Queue for GenericWorker (Worker Role) instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile; tile data is read from Swath Source Data Storage and written to Reprojection Data Storage.

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage by stage (data volume, compute, cost):
• Data Collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload + $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU + $60 download
• Derivation Reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU + $1 download + $6 storage
• Analysis Reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU + $2 download + $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Partition Keys In Each Abstraction

• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

• Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition.

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages: Queue Name. All messages for a single queue belong to the same partition.

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync

Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3.

Scalability Targets

Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff

Partitions and Partition Ranges

A Movies table keyed by PartitionKey (Category) and RowKey (Title):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially Server A serves the whole table, Movies[Min - Max]. As the partition range grows, it is split across servers: Server A serves Movies[Min - Comedy) and Server B serves Movies[Comedy - Max].

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity Group Transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query response can return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query
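A sketch of a client loop that keeps following continuation tokens until the query is drained. This is illustrative, not SDK code: `fetch_page` is a hypothetical function standing in for one Table query round trip that returns (rows, next_token).

```python
def query_all(fetch_page):
    # fetch_page(token) -> (rows, next_token); a hypothetical
    # stand-in for one Azure Table query round trip. The service
    # may return a continuation token at 1,000 rows, at a
    # partition range boundary, or after 5 seconds of execution.
    rows, token = [], None
    while True:
        page, token = fetch_page(token)
        rows.extend(page)
        if token is None:      # no continuation token: done
            return rows

# Simulated service that pages 2,500 rows in chunks of 1,000.
data = list(range(2500))
def fake_fetch(token):
    start = token or 0
    page = data[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(data) else None
    return page, nxt

assert query_all(fake_fetch) == data
```

The point of the loop: a response without a token is the only signal that a query is complete; an empty page with a token is still not the end.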

Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns: distribute by using a hash, etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "server busy": load balance partitions to meet traffic needs; the load on a single partition has exceeded the limits

WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern

• Why not simply use a table?
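The work ticket pattern keeps queue messages small: the message carries only a reference (a "ticket") to the real payload, which lives in blob storage. A minimal sketch with in-memory stand-ins for the blob store and queue (all names are hypothetical):

```python
import json
import uuid

blob_store = {}   # stand-in for Azure Blob storage
queue = []        # stand-in for an Azure Queue

def submit_work(payload: bytes):
    # Store the (possibly > 8 KB) payload in a blob, then enqueue
    # a small work ticket that references it.
    blob_name = f"work/{uuid.uuid4()}"
    blob_store[blob_name] = payload
    ticket = json.dumps({"blob": blob_name})
    assert len(ticket) < 8 * 1024   # ticket fits the 8 KB limit
    queue.append(ticket)

def process_next():
    ticket = json.loads(queue.pop(0))
    # Deleting the blob here also garbage-collects it once the
    # work is done, so no orphaned blobs accumulate.
    return blob_store.pop(ticket["blob"])

submit_work(b"x" * 100_000)        # 100 KB payload, tiny message
assert process_next() == b"x" * 100_000
```

This is also the recommended workaround in the Queues Recap later in the deck for messages larger than 8 KB.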

Queue Terminology

Message Lifecycle

Diagram: a Web Role puts messages onto the Queue (PutMessage); Worker Roles retrieve them with GetMessage (with a timeout, which makes the message invisible) and, once processing succeeds, delete them with RemoveMessage.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1

Diagram: producers and consumers polling the queue at increasing intervals.
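The polling rule above can be sketched directly (Python; the cap of 60 seconds is an assumption for illustration, not from the deck):

```python
def next_interval(interval, got_message, base=1.0, cap=60.0):
    # Truncated exponential back-off polling:
    # empty poll  -> double the interval, truncated at `cap`;
    # successful  -> reset the interval to `base`.
    return base if got_message else min(interval * 2, cap)

# Eight empty polls starting from 1 s: 2, 4, 8, 16, 32, 60, 60, 60.
i = 1.0
for _ in range(8):
    i = next_interval(i, got_message=False)
assert i == 60.0                               # truncated at the cap
assert next_interval(i, got_message=True) == 1.0  # reset on success
```

The cap matters: without truncation an idle queue would eventually be polled so rarely that new work sits unnoticed for a long time.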

Removing Poison Messages

Scenario 1 - normal processing by two consumers (C1, C2) of messages from two producers (P1, P2):
1. C1: GetMessage(Q, 30 s) → msg 1 (invisible for 30 s)
2. C2: GetMessage(Q, 30 s) → msg 2 (invisible for 30 s)

Removing Poison Messages (2)

Scenario 2 - a consumer crashes:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (its dequeue count is now 2)

Removing Poison Messages (3)

Scenario 3 - a poison message keeps crashing its consumers:
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
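The scenario above motivates the standard defense: check a message's dequeue count and delete (or set aside) the message once it crosses a threshold. A sketch with a hypothetical message object carrying a `dequeue_count` attribute:

```python
from dataclasses import dataclass

MAX_DEQUEUE_COUNT = 3   # threshold; an illustrative choice

@dataclass
class Message:           # hypothetical queue message
    body: str
    dequeue_count: int

def handle(msg, process, dead_letter):
    # A message seen too many times has likely crashed its
    # consumers before: remove it instead of retrying forever.
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        dead_letter(msg)
        return "poisoned"
    process(msg)
    return "processed"

poisoned = []
assert handle(Message("bad", 4), process=lambda m: None,
              dead_letter=poisoned.append) == "poisoned"
assert handle(Message("ok", 1), process=lambda m: None,
              dead_letter=poisoned.append) == "processed"
assert len(poisoned) == 1
```

Setting the poison message aside (rather than silently deleting it) preserves it for later inspection, which is usually what you want in a science pipeline.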

Queues Recap

• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Use a blob to store message data, with a reference in the message: for messages > 8 KB; batch messages; garbage-collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs - files and large objects
• Drives - NTFS APIs for migrating applications
• Tables - massively scalable structured storage
• Queues - reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
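Data parallelism in the sense above can be sketched with a thread pool (Python's concurrent.futures here, as a stand-in for the .NET Task Parallel Library; the workload is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def process(item: int) -> int:
    # Stand-in for a CPU- or I/O-bound unit of work.
    return item * item

items = list(range(100))

# Data parallelism: the same operation mapped over a collection,
# with the pool deciding how many workers run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process, items))

assert results == [i * i for i in range(100)]
```

Matching the worker count to the instance's core count follows the slide's caution about active processes (or threads) exceeding the number of cores.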

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs

Performance ↔ Cost

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience


tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]

Scalability Targets

Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions

When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
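The exponential-backoff advice above can be sketched as a small retry wrapper; this is an illustrative Python version (the real Storage Client Library has its own retry policies), with ServerBusyError standing in for a 503 response:

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for a '503 Server Busy' response from the storage service."""

def with_backoff(operation, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Retry `operation`, doubling the delay after each 503."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Delay grows 2x per attempt; jitter spreads out competing clients.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

The sleep function is injectable so the policy can be tested without real delays.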

Partitions and Partition Ranges

Example Movies table (PartitionKey = Category, RowKey = Title):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially one server holds the whole table:
Server A: Table = Movies [Min - Max]

As load grows, the table is split by partition-key range across servers:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A query response returns a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• After a maximum of 5 seconds of query execution
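"Always handle continuation tokens" boils down to a drain loop. A minimal sketch, with query_page as a hypothetical stand-in for one table-query round trip that returns a page of rows plus the next token (None when done):

```python
def query_all(query_page):
    """Drain a paged table query by following continuation tokens.

    `query_page(token)` is assumed to return (rows, next_token), with
    next_token == None on the final page.
    """
    rows, token = query_page(None)
    results = list(rows)
    while token is not None:   # never assume a single call returned everything
        page, token = query_page(token)
        results.extend(page)
    return results
```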

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns
• Distribute by using a hash etc. as a prefix

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• "Server busy" means the load on a single partition has exceeded the limits
• Load balance partitions to meet traffic needs

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
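The "distribute by using a hash as prefix" tip for append-only key spaces can be sketched as follows; the bucket-count choice and key format are illustrative, not a prescribed scheme:

```python
import hashlib

def scaled_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix a hash bucket so monotonically growing keys (dates, counters)
    spread across partitions instead of hammering one hot partition."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

Range queries then fan out over the bucket prefixes, trading one scan for `buckets` parallel scans.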

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Decoupling can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to a queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed]

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
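The get/process/delete lifecycle, including the visibility timeout that makes a message reappear if its consumer crashes, can be simulated in a few lines. This is a toy in-memory model of the semantics, not the real queue client:

```python
import time
import uuid

class SimpleQueue:
    """Toy queue: fetched messages become invisible for a visibility
    timeout; they reappear unless delete_message() is called in time."""

    def __init__(self):
        self._messages = {}          # message id -> [body, visible_at]

    def put_message(self, body):
        self._messages[str(uuid.uuid4())] = [body, 0.0]

    def get_message(self, visibility_timeout=30, now=None):
        now = time.time() if now is None else now
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:                       # currently visible
                entry[1] = now + visibility_timeout   # hide it
                return msg_id, entry[0]
        return None

    def delete_message(self, msg_id):
        self._messages.pop(msg_id, None)
```

The `now` parameter exists only so the timeout behaviour can be exercised without waiting.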

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1

Removing Poison Messages

[Diagram: producers P1 and P2 feed a queue consumed by C1 and C2; each message carries a dequeue count]

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
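The dequeue-count threshold from step 12 can be sketched as a worker loop; the queue and handler interfaces here are hypothetical simplifications of the pattern, not the Azure API:

```python
MAX_DEQUEUE_COUNT = 3   # the threshold is a policy choice, not a fixed API value

def process_queue(queue, handler, dead_letter):
    """Work loop that diverts poison messages after repeated failures.

    `queue.get()` is assumed to return (message, dequeue_count) or None;
    dequeue_count mirrors the DequeueCount described above.
    """
    while True:
        item = queue.get()
        if item is None:
            break
        message, dequeue_count = item
        if dequeue_count > MAX_DEQUEUE_COUNT:
            dead_letter.append(message)   # park it for offline inspection
            queue.delete(message)
            continue
        try:
            handler(message)
            queue.delete(message)
        except Exception:
            pass  # leave it; it reappears after the visibility timeout
```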

Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
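The "messages > 8 KB" remedy above (store the payload in a blob and enqueue only a reference) can be sketched as follows; queue_put and blob_put are hypothetical stand-ins for the storage client calls:

```python
import json
import uuid

MAX_INLINE = 8 * 1024   # queue message size limit from the slide (8 KB)

def enqueue_large(queue_put, blob_put, payload: bytes):
    """Work-ticket helper: oversized payloads go to a blob, and the queue
    carries only a small reference message. Returns the blob name used,
    or None if the payload fit inline."""
    if len(payload) <= MAX_INLINE:
        queue_put(payload)
        return None
    blob_name = str(uuid.uuid4())
    blob_put(blob_name, payload)
    queue_put(json.dumps({"blob_ref": blob_name}).encode("utf-8"))
    return blob_name
```

The consumer deletes the blob after processing, and a periodic sweep garbage-collects orphans whose queue message was lost.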

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU against keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
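The data-parallelism idea above, in the spirit of the Task Parallel Library's Parallel.ForEach, can be sketched in Python with a worker pool sized to the instance's core count (the helper name and default worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, workers=8):
    """Fan independent work items out across a pool of workers and
    collect the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```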

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure/poor user experience from not having excess capacity against the cost of idling VMs

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content becomes compressed content via Gzip, minified JavaScript, minified CSS, and minified images]
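The gzip tip above is one line in most stacks; a minimal round-trip sketch using Python's standard library:

```python
import gzip

def gzip_payload(text: str) -> bytes:
    """Compress a response body before storing or serving it."""
    return gzip.compress(text.encode("utf-8"), compresslevel=9)

def gunzip_payload(blob: bytes) -> str:
    """Inverse transform, as a browser or client would apply on the fly."""
    return gzip.decompress(blob).decode("utf-8")
```

For repetitive text (HTML, JSON, logs) the size reduction directly cuts both bandwidth and storage charges.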

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
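The query-segmentation split/join pattern can be sketched in a few lines; the partition size of 100 follows the micro-benchmark result reported later in the deck, and the function names are illustrative:

```python
def split_sequences(sequences, partition_size=100):
    """Split step: chunk the input sequences into partitions that
    worker instances can BLAST independently."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    """Join step: concatenate the per-partition hit lists once all
    workers have finished."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```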

AzureBLAST Task-Flow

A simple split/join pattern

Leverage the multiple cores of one instance
• Argument '-a' of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions cause load imbalance
• Small partitions cause unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long waiting period in case of instance failure

[Diagram: a splitting task fans out into parallel BLAST tasks, whose outputs a merging task joins]

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine against a job registry in Azure Tables; worker instances take work from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases. A splitting task fans out into BLAST tasks that a merging task joins.]

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states

[Diagram: job portal components – web portal, web service, job registration, job scheduler, scaling engine, job registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

[Diagram: deployment map with per-deployment instance counts of 50, 62, 62, 62, 62, 62, 50, and 62]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

(task 251774 was started but never reported done)

Surviving System Upgrades

North Europe Data Center: in total, 34,256 tasks processed

All 62 compute nodes lost tasks and then came back in a group. This is an update domain:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks completed, and the job was killed

35 nodes experienced blob-writing failures at the same time

A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
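The Penman-Monteith formula above translates directly into code; this is an illustrative sketch (the default values for γ and λv are plugged in only for convenience, and units should be checked against the variable list before real use):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).

    delta: Pa/K, r_n: W/m^2, rho_a: kg/m^3, c_p: J/(kg K),
    dq: Pa, g_a and g_s: m/s, gamma: Pa/K, lambda_v: J/g.
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```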

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service web role portal; requests flow through the request, download, reprojection, reduction 1, and reduction 2 queues, drawing source imagery from download sites and source metadata through the data collection, reprojection, derivation reduction, and analysis reduction stages, ending in science results available for download]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker)
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage> Task Queue dispatched by the Service Monitor and read from <Input>Data Storage]

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue consumed by GenericWorker (Worker Role) instances, which read Reprojection Data Storage and Swath Source Data Storage]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

[Pipeline cost annotations:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage]

Total: $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

Page 51: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
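The '503 Server Busy' advice above can be sketched as a retry loop. This is a minimal illustration, not Azure SDK code; `ServerBusyError` is a hypothetical stand-in for the HTTP 503 response:

```python
import random
import time

class ServerBusyError(Exception):
    """Hypothetical stand-in for an HTTP '503 Server Busy' response."""

def with_backoff(op, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Retry `op` with exponential backoff when the service is busy."""
    for attempt in range(max_retries):
        try:
            return op()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise                      # give up after max_retries
            # Double the delay each attempt, with jitter so many
            # workers do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The `sleep` parameter is injectable so the schedule can be tested without actually waiting.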

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

When the table grows, its partition range is split across servers:

Server A: Table = Movies [Min - Comedy)

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006

Server B: Table = Movies [Comedy - Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously

A query may return a continuation token when it hits:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
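Handling continuation tokens means draining pages until the service stops returning a token. A minimal sketch, where `query_page` is a hypothetical page-fetch callable (returns `(rows, next_token)`, with `next_token` of `None` on the last page), not a real storage API:

```python
def query_all(query_page):
    """Drain a paged table query, following continuation tokens."""
    rows, token = query_page(None)
    results = list(rows)
    while token is not None:              # keep going until the service
        rows, token = query_page(token)   # stops handing back a token
        results.extend(rows)
    return results
```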

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns – distribute by using a hash etc. as a prefix
• Always handle continuation tokens – expect continuation tokens for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – server busy means load on a single partition has exceeded the limits; load balance partitions to meet traffic needs
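The hash-prefix advice above can be sketched as follows. This is an illustrative helper, not an Azure SDK function: a stable bucket number derived from the natural key is prepended so that append-heavy key spaces (e.g. timestamp-ordered keys) spread across partitions instead of hammering one hot partition:

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Spread an append-heavy key space across `buckets` partitions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets    # stable for a given key
    return f"{bucket:02d}-{natural_key}"
```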

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness; decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
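The work ticket pattern mentioned above can be sketched as follows — a minimal illustration where `queue` and `blob_store` are stand-ins (a list and a dict), not Azure SDK objects. Because a queue message is limited to 8 KB, the large payload goes to blob storage and the queue carries only a small "ticket" referencing it:

```python
import json
import uuid

def enqueue_work_ticket(queue, blob_store, payload):
    """Work ticket pattern: keep the queue message tiny."""
    blob_name = f"work/{uuid.uuid4()}"
    blob_store[blob_name] = payload                # big data to blobs
    queue.append(json.dumps({"blob": blob_name}))  # small ref to queue
    return blob_name

def process_next(queue, blob_store):
    """Dequeue a ticket and fetch its payload from blob storage."""
    ticket = json.loads(queue.pop(0))
    return blob_store[ticket["blob"]]
```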

Queue Terminology

Message Lifecycle

(Figure: a Web Role calls PutMessage to place messages (Msg 1 … Msg 4) on a queue; Worker Roles call GetMessage with a visibility timeout to receive a message, then RemoveMessage to delete it once processed.)

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
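The polling schedule above can be sketched as a loop — an illustration only, with `get_message` as a hypothetical poll callable that returns a message or `None`. It returns the sequence of intervals so the schedule is easy to inspect:

```python
def poll_with_backoff(get_message, handle, max_interval=60, rounds=10):
    """Truncated exponential back-off polling sketch."""
    interval = 1
    intervals = []
    for _ in range(rounds):
        msg = get_message()
        if msg is None:
            interval = min(interval * 2, max_interval)  # empty poll: back off
        else:
            handle(msg)
            interval = 1                                # success: reset
        intervals.append(interval)
        # a real worker would sleep(interval) here
    return intervals
```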

Removing Poison Messages

(Figures: producers P1 and P2, and consumers C1 and C2, working against a queue; each message is annotated with its dequeue count.)

The walkthrough, across the three slides:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – msg 1 is removed as a poison message
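The dequeue-count safeguard walked through above can be sketched as follows. This is illustrative, not Azure SDK code: `queue` is a list of dicts, and a message whose processing keeps failing is diverted to `poison_sink` once its dequeue count exceeds the threshold, instead of being retried forever:

```python
def drain_queue(queue, handle, max_dequeue=3, poison_sink=None):
    """Remove poison messages by dequeue-count threshold (sketch)."""
    poison_sink = poison_sink if poison_sink is not None else []
    while queue:
        msg = queue.pop(0)
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > max_dequeue:
            poison_sink.append(msg)        # park it for inspection
            continue
        try:
            handle(msg["body"])            # "DeleteMessage" on success
        except Exception:
            queue.append(msg)              # becomes visible again
    return poison_sink
```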

Queues Recap
• No need to deal with failures – make message processing idempotent
• Invisible messages result in out-of-order delivery – do not rely on order
• Enforce a threshold on a message's dequeue count – use the dequeue count to remove poison messages
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers – use the message count to scale

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
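The deck recommends the .NET Task Parallel Library; the analogous data-parallel idea, sketched here with Python's standard thread pool, is one VM running many concurrent units of work rather than many under-utilized roles:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, workers=4):
    """Data-parallel sketch: fan work out across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))   # preserves input order
```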

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Figure: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.)
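The compute-for-storage tradeoff is easy to measure before committing to it. A small sketch using Python's standard gzip module; the payload here is illustrative:

```python
import gzip

def gzip_payload(text):
    """Return (original_size, compressed_size) in bytes for a text payload."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)   # spends CPU to shrink storage/bandwidth
    return len(raw), len(packed)
```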

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
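The query-segmentation idea above can be sketched as a simple input splitter. This is a simplified illustration (real FASTA handling has more edge cases), where each partition would become one parallel BLAST task:

```python
def segment_fasta(fasta_text, per_partition=100):
    """Split FASTA-formatted input into partitions of sequences."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):           # header starts a new record
            if current:
                sequences.append("\n".join(current))
            current = [line]
        elif line.strip():
            current.append(line)
    if current:
        sequences.append("\n".join(current))
    return ["\n".join(sequences[i:i + per_partition])
            for i in range(0, len(sequences), per_partition)]
```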

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple Split/Join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task.

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource

AzureBLAST (2)

(Architecture figure: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine against a Job Registry in Azure Tables; a global dispatch queue feeds the worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data, refreshed by a database-updating role. Tasks follow the split/join pattern: a splitting task, many BLAST tasks, then a merging task.)

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 cores: 475 extra-large VMs (8 cores per VM), across four datacenters – US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record is a matched "Executing"/"is done" pair; otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

An abnormal trace (no completion record for task 251774):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
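The kind of log analysis described above — spotting "Executing" records with no matching "is done" record — can be sketched against the sample record format:

```python
import re

def unmatched_tasks(log_text):
    """Return task ids that started but never logged completion."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        start = re.search(r"Executing the task (\d+)", line)
        done = re.search(r"Execution of task (\d+) is done", line)
        if start:
            started.add(start.group(1))
        if done:
            finished.add(done.group(1))
    return sorted(started - finished)   # likely lost to a failure
```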

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group down for ~30 mins – this is an update domain at work.

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks completed before the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish Proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = [Δ·Rn + ρa·cp·(δq)·ga] / [(Δ + γ·(1 + ga/gs))·λv]

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
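The Penman-Monteith equation above is straightforward to evaluate once the inputs are in hand. A sketch, with the default γ ≈ 66 Pa/K taken from the slide's variable list and λv ≈ 2450 J/g assumed here as a typical value for illustration:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET = [Δ·Rn + ρa·cp·(δq)·ga] / [(Δ + γ·(1 + ga/gs))·λv]"""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```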

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US year = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline figure: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage, which pulls from source imagery download sites via a download queue; the reprojection stage and the derivation and analysis reduction stages are fed by the reprojection queue and the reduction 1/2 queues, alongside source metadata; science results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Figure: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a GenericWorker (Worker Role)
  • Dequeues tasks created by the Service Monitor from the <PipelineStage> task queue
  • Retries failed tasks 3 times
  • Maintains all task status
  • Reads and writes <Input> data storage

Example Pipeline Stage: Reprojection Service

(Figure: a reprojection request flows through the job queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue for GenericWorker (Worker Role) instances; tasks point to the ScanTimeList and SwathGranuleMeta tables, the reprojection data storage, and the swath source data storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get the geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage-by-stage (requests enter via the AzureMODIS Service Web Role Portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Further resources:
• Azure in Action, Manning Press
• Programming Windows Azure, O'Reilly Press
• Bing: Channel 9 Windows Azure
• Bing: Windows Azure Platform Training Kit – November Update
• http://research.microsoft.com/azure
• xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 52: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

When the partition range is split across two servers:

Server A: Table = Movies [Min - Comedy)

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006

Server B: Table = Movies [Comedy - Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008
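The range-partition lookup sketched above can be illustrated with a few lines of code: given the sorted lower bounds of each partition range, find which server owns a given PartitionKey. The server names and boundaries are the hypothetical ones from the Movies example, not a real Azure API:

```python
import bisect

# Hypothetical ranges after a split: Server A owns [Min, "Comedy"),
# Server B owns ["Comedy", Max]. Each entry is (lower bound, owner).
ranges = [("", "Server A"), ("Comedy", "Server B")]
bounds = [lo for lo, _ in ranges]

def server_for(partition_key: str) -> str:
    # bisect_right finds the last range whose lower bound <= partition_key
    i = bisect.bisect_right(bounds, partition_key) - 1
    return ranges[i][1]

print(server_for("Action"))   # before the "Comedy" boundary -> Server A
print(server_for("Comedy"))   # at the boundary -> Server B
print(server_for("War"))      # after the boundary -> Server B
```

Keys sort lexicographically, which is why a hot key range can be re-split simply by inserting a new boundary.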

Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously

A continuation token is returned at any of these limits:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
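Since any of the three limits above can interrupt a query, a robust client always loops until no token comes back. A minimal sketch of that loop; `query_page` here is a stand-in for whatever single-round-trip call your client makes (the real Table service carries the token in the `x-ms-continuation-NextPartitionKey` / `x-ms-continuation-NextRowKey` response headers):

```python
def query_all(query_page):
    """Drain a table query by always following continuation tokens.

    `query_page` takes the continuation token from the previous response
    (or None for the first request) and returns (rows, next_token).
    """
    token = None
    while True:
        rows, token = query_page(token)
        for row in rows:
            yield row
        if token is None:   # no continuation token -> query is complete
            return

# Fake paged source: 2500 rows served 1000 at a time, like the service limit.
data = list(range(2500))

def fake_page(token):
    start = token or 0
    chunk = data[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(data) else None
    return chunk, nxt

assert list(query_all(fake_page)) == data
```

The point of the fake source is that forgetting the loop silently truncates results at 1000 rows, which is the most common Table-query bug.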

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy: load balance partitions to meet traffic needs; load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance

• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern

• Why not simply use a table?

Queue Terminology

Message Lifecycle

A Web Role calls PutMessage to add messages (Msg 1, Msg 2, Msg 3, Msg 4) to the Queue. A Worker Role calls GetMessage with a visibility timeout to retrieve the next message, processes it, and then calls RemoveMessage to delete it.

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1
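The two back-off rules above fit in a few lines. A minimal sketch; the 1-second floor and 60-second cap are illustrative choices, not values mandated by the queue service:

```python
MIN_INTERVAL = 1    # seconds; assumed floor for this sketch
MAX_INTERVAL = 60   # seconds; assumed truncation cap

def next_interval(current, got_message):
    """Truncated exponential back-off: an empty poll doubles the wait
    (capped at MAX_INTERVAL); a successful poll resets it to MIN_INTERVAL."""
    if got_message:
        return MIN_INTERVAL
    return min(current * 2, MAX_INTERVAL)

# Six empty polls followed by one successful poll:
interval, history = MIN_INTERVAL, []
for got in [False, False, False, False, False, False, True]:
    interval = next_interval(interval, got)
    history.append(interval)
print(history)  # [2, 4, 8, 16, 32, 60, 1]
```

Truncation matters on the billing side too: every poll is a storage transaction, so idle queues polled at a fixed 1-second interval cost real money.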

Removing Poison Messages

Producers P1 and P2 put messages; consumers C1 and C2 get them.

Normal consumption:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Consumer crash — the visibility timeout recovers the message:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after the dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Poison message — removed using the dequeue count:
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after the dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after the dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)

Queues Recap

• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
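The idempotency and dequeue-count rules above combine into one worker loop. A sketch under stated assumptions: `queue.get()` / `queue.delete()` and the message's `.dequeue_count` attribute are stand-ins mirroring the service's GetMessage/DeleteMessage and DequeueCount, not the real SDK surface:

```python
MAX_DEQUEUE_COUNT = 3   # hypothetical poison threshold

def process_one(queue, handle, dead_letter):
    """One worker iteration: dequeue, drop poison messages, process.

    `handle` must be idempotent: a crash after processing but before
    delete means the message will be processed again.
    """
    msg = queue.get(visibility_timeout=30)
    if msg is None:
        return                      # empty poll: caller should back off
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        dead_letter(msg)            # keep it for diagnosis, stop retrying
        queue.delete(msg)
        return
    handle(msg.body)                # may run more than once
    queue.delete(msg)               # delete only after successful processing
```

Deleting the poison message only after dead-lettering it preserves the evidence while stopping the crash loop.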

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
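The data-parallelism idea above (the deck's example is .NET's Task Parallel Library) can be sketched in a language-neutral way with a worker pool sized to the instance's core count, so active workers do not exceed the cores available:

```python
from concurrent.futures import ThreadPoolExecutor
import os

# Data parallelism: apply the same function to many independent items.
def work(item):
    return item * item

items = list(range(100))

# Size the pool to the core count so we saturate the VM without
# oversubscribing it (the slide's "active processes vs. cores" caution).
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(work, items))

assert results == [i * i for i in range(100)]
```

For CPU-bound work in Python specifically, a process pool would be the analogous choice; the sizing principle is the same.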

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network IO-intensive, storage IO-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage IO-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places.

Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Diagram: uncompressed content passes through Gzip, minify JavaScript, minify CSS, and minify images to become compressed content.)
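Point 1 above is a one-liner in most stacks. A small sketch using the standard-library gzip module; the sample HTML body is made up, chosen to be repetitive the way real markup is:

```python
import gzip

# Gzip a response body before sending; a browser advertising
# Accept-Encoding: gzip decompresses it on the fly.
body = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"
compressed = gzip.compress(body)

print(len(body), "->", len(compressed))
assert gzip.decompress(compressed) == body       # lossless round trip
assert len(compressed) < len(body) // 10         # big saving on repetitive HTML
```

The same trade appears in point 2: you spend a little CPU on each request to shrink what you store and what crosses the wire.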

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large data volumes:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern:
  • split the input sequences
  • query partitions in parallel
  • merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • batch job management
  • task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure

(Task-flow diagram: a splitting task fans out into BLAST tasks that run in parallel, followed by a merging task.)
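The split/join task-flow above can be sketched concretely. This is an illustrative stand-in, not AzureBLAST's actual code; the function names are made up, and the 100-sequences-per-partition default echoes the micro-benchmark result reported on the next slide:

```python
def split_fasta(text, seqs_per_partition=100):
    """Query segmentation: split FASTA input into partitions of N sequences."""
    seqs, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            seqs.append("\n".join(current))   # flush the previous record
            current = []
        current.append(line)
    if current:
        seqs.append("\n".join(current))
    return ["\n".join(seqs[i:i + seqs_per_partition])
            for i in range(0, len(seqs), seqs_per_partition)]

def merge_results(partial_results):
    """Join step: concatenate per-partition BLAST outputs in order."""
    return "".join(partial_results)

# 250 toy sequences -> 3 partitions (100 + 100 + 50)
fasta = "".join(f">seq{i}\nACGT\n" for i in range(250))
parts = split_fasta(fasta, 100)
print(len(parts))  # 3
```

Each partition becomes one work-ticket message on the dispatch queue; the merge runs once all partitions have reported results.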

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST (architecture)

A Web Role hosts the web portal and web service for job registration. A Job Management Role runs the job scheduler and scaling engine, persisting the job registry in Azure Tables. The scheduler places tasks on a global dispatch queue, from which a pool of worker instances pull BLAST tasks; a database updating role refreshes the NCBI databases. Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. The task flow is the split/join pattern: a splitting task fans out BLAST tasks, and a merging task combines the results.

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal's web portal and web service feed job registration; the job scheduler and scaling engine operate against the job registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances arose, the load was redistributed manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in a group. This is an update domain: ~30 mins, ~6 nodes per group.

Surviving Storage Failures

West Europe Data Center: 30,976 tasks were completed before the job was killed. 35 nodes experienced blob-writing failures at the same time.

A reasonable guess: the Fault Domain is working.

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish Proverb

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
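The Penman-Monteith equation above is straightforward to evaluate once the conductivities are known; the hard part, as the slide notes, is estimating them across a catchment. A sketch of the per-pixel arithmetic; the defaults (γ ≈ 66 Pa/K, λv ≈ 2260 J/g for water) follow the slide's definitions, and the sample inputs are illustrative, not field-calibrated:

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2260.0):
    """ET per the slide's formulation:
    ET = (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lambda_v)
    Units as defined on the slide; gamma ~ 66 Pa/K, lambda_v ~ 2260 J/g.
    """
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

# Illustrative inputs only (rough mid-latitude daytime magnitudes):
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2,
                     cp=1005.0, dq=1000.0, ga=0.02, gs=0.01)
print(et)
```

In the MODISAzure pipeline this arithmetic runs once per reprojected pixel, which is why the reduction stages dominate the compute hours.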

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal's Request Queue. The Download Queue feeds the Data Collection Stage, which pulls from source imagery download sites and records source metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; scientific results are then available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage> Task Queue dispatched by the Service Monitor and read the <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue consumed by GenericWorker (Worker Role) instances, which read the Reprojection Data Storage and Swath Source Data Storage.)

• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data Collection Stage (400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers): $50 upload, $450 storage
Reprojection Stage (400 GB, 45K files, 3500 hours, 20-100 workers): $420 cpu, $60 download
Derivation Reduction Stage (5-7 GB, 55K files, 1800 hours, 20-100 workers): $216 cpu, $1 download, $6 storage
Analysis Reduction Stage (<10 GB, ~1K files, 1800 hours, 20-100 workers): $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press. Programming Windows Azure, O'Reilly Press. Bing: Channel 9 Windows Azure. Bing: Windows Azure Platform Training Kit – November Update. http://research.microsoft.com/azure — xcgngage@microsoft.com

Page 53: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Key Selection Things to Consider

bullDistribute load as much as possiblebullHot partitions can be load balancedbullPartitionKey is critical for scalability

See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information

bull Avoid frequent large scansbull Parallelize queriesbull Point queries are most efficient

bullTransactions across a single partitionbullTransaction semantics amp Reduce round trips

Scalability

Query Efficiency amp Speed

Entity group transactions

Expect Continuation Tokens ndash Seriously

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 1000 rows in a response

At the end of partition range boundary

Maximum of 5 seconds to execute the query

Tables Recapbull Efficient for frequently used queriesbull Supports batch transactionsbull Distributes load

Select PartitionKey and RowKey that help scale

Avoid ldquoAppend onlyrdquo patterns

Always Handlecontinuation tokens

ldquoORrdquo predicates are not optimized

Implement back-offstrategy for retries

bull Distribute by using a hash etc as prefix

bull Expect continuation tokens for range queries

bull Execute the queries that form the ldquoORrdquo predicates as separate queries

bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits

WCF Data Services

bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked

bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException

QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not

bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance

bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern

bull Why not simply use a table

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout)RemoveMessage

Msg 2Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

60

21

11

C1

C2

Removing Poison Messages

11

21

340

Producers Consumers

P2

P1

30

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

11

21

10

20

61

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

11

21

2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 15 C1 crashed

11

21

6 msg1 visible 30 s after Dequeue30

12

11

12

62

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

12

2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed

1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1

2

6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue

30

13

12

13

Queues Recap

bullNo need to deal with failuresMake messageprocessing idempotent

bull Invisible messages result in out of orderDo not rely on order

bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages

bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to storemessage data with

reference in message

Use message countto scale

bullNo need to deal with failures

bull Invisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows.
• 1 role instance != one specific task for your code.
• You're paying for the entire VM, so why not use it?

• A common mistake – splitting code into multiple roles, each not using much CPU.

• Balance using up the CPU vs. having free capacity in times of need. There are multiple ways to use your CPU to the fullest:

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency.
• This may not be ideal if the number of active processes exceeds the number of cores.

• Use multithreading aggressively.
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads.
• In .NET 4, use the Task Parallel Library:
  • Data parallelism
  • Task parallelism

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive.
• Find code that is intensive in different resources to live together.
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code.

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled).

• Spinning VMs up and down automatically is good at large scale.

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running.

• Being too aggressive in spinning down VMs can result in poor user experience.

• Trade off the risk of failure or poor user experience from lacking excess capacity against the cost of idling VMs.

Performance vs. Cost
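The "use message count to scale" idea from the queues recap can be combined with this trade-off in a simple sizing rule. All names and thresholds below are hypothetical; the non-zero floor preserves headroom so VM boot latency does not hurt users:

```python
import math

def desired_workers(queue_length, msgs_per_worker_per_min=10,
                    target_drain_min=5, min_workers=2, max_workers=50):
    """Hypothetical sizing rule: enough workers to drain the backlog in
    target_drain_min minutes, clamped to [min_workers, max_workers].
    min_workers > 0 keeps excess capacity so slow VM spin-up (minutes)
    doesn't translate into poor user experience."""
    need = math.ceil(queue_length / (msgs_per_worker_per_min * target_drain_min))
    return max(min_workers, min(max_workers, need))
```

The max cap bounds cost when the backlog spikes; the min floor bounds risk when it is empty.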

Storage Costs

• Understand the application's storage profile and how storage billing works.

• Make service choices based on your app's profile – e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction.

• Service choice can make a big cost difference based on your app's profile.

• Caching and compressing help a lot with storage costs.

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places.

Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content.
   • All modern browsers can decompress on the fly.
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms.

2. Trade off compute costs for storage size.

3. Minimize image sizes.
   • Use Portable Network Graphics (PNGs).
   • Crush your PNGs.
   • Strip needless metadata.
   • Make all PNGs palette PNGs.

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.
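Step 1 above, sketched with Python's standard `gzip` module (the HTML payload is a made-up example; a real web role would set `Content-Encoding: gzip` on the response):

```python
import gzip

# Repetitive markup compresses extremely well, which is typical of HTML/JS/CSS.
html = b"<html><body>" + b"hello azure " * 500 + b"</body></html>"

compressed = gzip.compress(html)   # bytes you would serve over the wire
restored = gzip.decompress(compressed)

ratio = len(compressed) / len(html)  # fraction of original size actually sent
```

Fewer bytes over the wire reduce both bandwidth billing and time-to-render.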

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics.
• Identifies similarity between bio-sequences.

Computationally intensive:
• Large number of pairwise alignment operations.
• A BLAST run can take 700–1000 CPU hours.
• Sequence databases are growing exponentially – GenBank doubled in size in about 15 months.

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel.
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing.

Large-volume data:
• A normal BLAST database can be as large as 10 GB.
• With 100 nodes, the peak storage load could reach 1 TB.
• The output of BLAST is usually 10–100x larger than the input.

AzureBLAST

• A parallel BLAST engine on Azure.

• Query-segmentation data-parallel pattern:
  • Split the input sequences.
  • Query partitions in parallel.
  • Merge results together when done.

• Follows the general suggested application model: Web Role + Queue + Worker.

• With special considerations, including batch job management and task parallelism on an elastic cloud.

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
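A minimal local sketch of the query-segmentation pattern. The real system distributes this across a Web Role, queue, and worker instances; here `mock_blast` is an invented stand-in for running NCBI-BLAST on one partition:

```python
from concurrent.futures import ThreadPoolExecutor

def mock_blast(partition):
    # Stand-in for an NCBI-BLAST run over one partition of input sequences.
    return [(seq, len(seq)) for seq in partition]

def split(sequences, size):
    # Split the input sequences into fixed-size partitions.
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

def run_query(sequences, partition_size=2, workers=4):
    parts = split(sequences, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(mock_blast, parts)  # query partitions in parallel
    merged = []
    for r in results:                          # merge results when done
        merged.extend(r)
    return merged
```

Because `map` preserves partition order, the merged output matches what a single sequential run over the whole input would produce.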

AzureBLAST Task-Flow: a simple split/join pattern.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST: 1/2/4/8 for the small, medium, large, and extra-large instance sizes.

Task granularity:
• Too large a partition → load imbalance.
• Too small a partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead).
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead.

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task's run time.
• Too small → repeated computation.
• Too large → an unnecessarily long wait if the instance fails.

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

…

Merging Task

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect.
• 100 sequences per partition is the best choice.

Instance size vs. performance:
• Super-linear speedup with larger worker instances, primarily due to their memory capacity.

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput by fully utilizing the resource.

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

…

Scaling Engine

(BLAST databases, temporary data, etc.)

Job Registry, NCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

…

Merging Task

AzureBLAST Job Portal: an ASP.NET program hosted by a web role instance.
• Submit jobs.
• Track a job's status and logs.

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table.
• Fault tolerance: avoid in-memory state.

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min.
• Against ~5,000 proteins from another strain: completed in less than 30 sec.

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences.

The "all against all" query:
• The database is also the input query.
• The protein database is large (4.2 GB).
• In total, 9,865,668 sequences to be queried.
• Theoretically, 100 billion sequence comparisons.

Performance estimation:
• Based on sampling runs on one extra-large Azure instance, the experiment would require 3,216,731 minutes (6.1 years) on one desktop.

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), Western Europe, and North Europe.

• 8 deployments of AzureBLAST, each with its own co-located storage service.

• Divide the 10 million sequences into multiple segments:
  • Each segment is submitted to one deployment as one job for execution.
  • Each segment consists of smaller partitions.

• When the load imbalances, redistribute it manually.

(Diagram: deployments of 50–62 extra-large VMs each.)

End Result
• The total size of the output is ~230 GB.
• The number of total hits is 1,764,579,487.
• Started March 25th; the last task completed April 8th (10 days of compute). Based on our estimates, the real working-instance time should be 6–8 days.
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like this – otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

An anomalous trace – task 251774 was started but never reported done:

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
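Spotting such anomalies programmatically is straightforward: flag tasks that were started but never reported done. A small sketch (log excerpt abbreviated from the slide; the log format is assumed from the sample lines):

```python
import re

LOG = """3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins"""

def find_incomplete(log):
    """Return IDs of tasks that started but never logged 'done'
    (e.g., the node failed or was taken down by an upgrade)."""
    started, done = set(), set()
    for line in log.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return sorted(started - done)
```

Incomplete tasks cluster in time when an update domain or fault domain is involved, which is exactly what the next two slides show.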

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes; each group was out for ~30 mins. This is an update domain.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: a fault domain was at work.

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J g-1)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction.
• Some of the inputs are not so simple.

ET = (Δ·Rn + ρa·cp·δq·ga) / (λv·(Δ + γ·(1 + ga/gs)))

Penman-Monteith (1964)
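The Penman-Monteith formula above can be evaluated directly. A minimal sketch; the sample inputs below are illustrative values, and keeping units consistent with the definitions above is the caller's responsibility:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET, term-for-term from the slide's formula.

    delta: Pa K-1, r_n: W m-2, rho_a: kg m-3, c_p: J kg-1 K-1,
    dq: Pa, g_a and g_s: m s-1, gamma: Pa K-1, lambda_v: J g-1.
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = lambda_v * (delta + gamma * (1.0 + g_a / g_s))
    return numerator / denominator
```

As expected physically, ET increases with net radiation Rn and decreases as stomatal conductance gs closes (ga/gs grows).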

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration – evaporation through plant membranes.

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites.
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles.
• Simple nearest-neighbor or spline algorithms.

Derivation reduction stage:
• The first stage visible to the scientist.
• Computes ET in our initial use.

Analysis reduction stage:
• An optional second stage visible to the scientist.
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors.

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door:
  • Receives all user requests.
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue.

• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work.
  • The execution status of all jobs and tasks is persisted in Tables.

<PipelineStage> Request

…

<PipelineStage>JobStatus – Persist

<PipelineStage> Job Queue

MODISAzure Service (Web Role)

Service Monitor (Worker Role)

Parse & Persist <PipelineStage>TaskStatus

…

Dispatch <PipelineStage> Task Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role.

Service Monitor (Worker Role)

Parse & Persist <PipelineStage>TaskStatus

GenericWorker (Worker Role)

Dispatch <PipelineStage> Task Queue

<Input> Data Storage

• Dequeues tasks created by the Service Monitor.
• Retries failed tasks 3 times.
• Maintains all task status.

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatus – Persist

Parse & Persist ReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMeta

Reprojection Data Storage

Each entity specifies a single reprojection job request.

Each entity specifies a single reprojection task (i.e., a single tile).

Query this table to get the geo-metadata (e.g., boundaries) for each swath tile.

Query this table to get the list of satellite scan times that cover a target tile.

Swath Source Data Storage

Costs for 1 US Year of ET Computation

• Computational costs are driven by the data scale and the need to run reduction multiple times.

• Storage costs are driven by the data scale and the 6-month project duration.

• Both are small with respect to the people costs, even at graduate-student rates.

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

Per-stage figures from the diagram:

• 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• 400 GB, 45K files, 3500 hours, 20–100 workers; $420 CPU, $60 download
• 5–7 GB, 55K files, 1800 hours, 20–100 workers; $216 CPU, $1 download, $6 storage
• <10 GB, ~1K files, 1800 hours, 20–100 workers; $216 CPU, $2 download, $9 storage

AzureMODIS Service Web Role Portal

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems.

• Equally important, they can increase participation in research by providing needed resources to users and communities without ready access.

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.

• Clouds provide valuable fault-tolerance and scalability abstractions.

• Clouds act as an amplifier for familiar client tools and on-premises compute.

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

Resources: Cloud Research Community Site – http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope – http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope – http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Queues Recap

bullNo need to deal with failuresMake messageprocessing idempotent

bull Invisible messages result in out of orderDo not rely on order

bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages

bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to storemessage data with

reference in message

Use message countto scale

bullNo need to deal with failures

bull Invisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrencybull Spin up additional processes each with a specific task or as a

unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighborsbull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-

and memory-intensive they may be a good neighbor for storage IO-intensive code

MemoryIntensive

CPUIntensive

Network IO Intensive Storage IO Intensive

Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not

over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6~8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
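The failure pattern above (an "Executing" record with no matching "is done" record, like task 251774) can be detected mechanically. A minimal sketch in Python, assuming the log format shown above; `find_incomplete_tasks` is an illustrative helper, not part of the AzureBLAST code:

```python
import re

EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def find_incomplete_tasks(log_lines):
    """Return IDs of tasks that were started but never reported done."""
    started, finished = set(), set()
    for line in log_lines:
        m = EXEC_RE.search(line)
        if m:
            started.add(m.group(1))
        m = DONE_RE.search(line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the excerpt above, this flags task 251774, which has no completion record.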

Surviving System Upgrades

North Europe Data Center: in total 34,256 tasks processed

All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed and the job was killed

35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.
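Read directly off the symbol definitions above, the Penman-Monteith calculation is a one-liner. This is a sketch for illustration, not MODISAzure's actual code, and the unit bookkeeping is simplified (λv is taken in J/g, as on the slide):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith evapotranspiration.

    delta    : slope of saturation specific humidity vs. temperature (Pa/K)
    r_n      : net radiation (W/m^2)
    rho_a    : dry air density (kg/m^3)
    c_p      : specific heat capacity of air (J/(kg K))
    dq       : vapor pressure deficit (Pa)
    g_a, g_s : aerodynamic and stomatal conductivities (m/s)
    gamma    : psychrometric constant (~66 Pa/K)
    lambda_v : latent heat of vaporization (J/g)
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

As the formula predicts, ET rises when stomatal conductivity gs rises (the γ·(1 + ga/gs) term in the denominator shrinks).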

ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: the AzureMODIS Service Web Role Portal feeds a Request Queue, Download Queue, Reprojection Queue, and Reduction 1/Reduction 2 Queues; imagery flows from the Source Imagery Download Sites through the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages, with Source Metadata persisted along the way; scientists download the science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue, from which GenericWorker (Worker Role) instances dequeue tasks and read <Input>Data Storage.)
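The GenericWorker's dequeue/execute/retry loop can be sketched generically. This illustrates the pattern only, not the MODISAzure implementation; `task_queue`, `execute`, and `status_table` are stand-ins for the Azure queue, the stage-specific work, and the status Table:

```python
import queue

MAX_ATTEMPTS = 3  # the slide's "retries failed tasks 3 times"

def run_worker(task_queue, execute, status_table):
    """Generic worker loop: dequeue a task, execute it, retry on
    failure up to MAX_ATTEMPTS, and persist the outcome in a
    dict-like status store."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break  # no more work queued
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                execute(task)
                status_table[task] = "done"
                break
            except Exception:
                status_table[task] = f"failed (attempt {attempt})"
```

A task that fails transiently is retried; after the final attempt its last "failed" status remains visible for the Service Monitor to inspect.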

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses & persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances process tasks against Reprojection Data Storage and Swath Source Data Storage, guided by the SwathGranuleMeta and ScanTimeList tables.)

• Each Job Queue entity specifies a single reprojection job request
• Each Task Queue entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6 month project duration
• Small with respect to the people costs, even at graduate student rates

(Per-stage figures from the pipeline diagram:)
Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 cpu, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 cpu, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 55: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect continuation tokens for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" may mean partitions are being load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
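The "distribute by using a hash as a prefix" tip can be illustrated with a small helper. This is a hypothetical sketch, not an Azure API; `N_BUCKETS` and `make_partition_key` are names invented for the example:

```python
import hashlib

N_BUCKETS = 16  # number of partition-key prefixes to spread load over

def make_partition_key(natural_key):
    """Prefix a natural key (e.g. a date) with a stable hash bucket so
    that lexicographically adjacent keys land on different partitions,
    avoiding the append-only hot-partition pattern."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % N_BUCKETS
    return f"{bucket:02d}_{natural_key}"
```

The trade-off is exactly the one the slide warns about: range queries must now fan out across all N_BUCKETS prefixes (handling a continuation token per prefix).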

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1–4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) to dequeue Msg 1 and Msg 2, then RemoveMessage to delete them once processed.)

PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back-Off Polling

Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1
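The two rules above can be sketched as a small loop. This assumes a `get_message()` callable that returns None on an empty poll; the names are illustrative, not the Storage Client Library API:

```python
import time

def poll_with_backoff(get_message, handle, min_interval=1.0,
                      max_interval=60.0, max_idle_polls=10):
    """Poll a queue, doubling the sleep interval after each empty
    poll (truncated at max_interval) and resetting it to the
    minimum after a successful poll."""
    interval = min_interval
    idle = 0
    while idle < max_idle_polls:
        msg = get_message()
        if msg is None:
            idle += 1
            time.sleep(interval)
            interval = min(interval * 2, max_interval)  # back off, truncated
        else:
            idle = 0
            interval = min_interval  # success resets the interval
            handle(msg)
```

Truncation matters: without the `max_interval` cap, a long idle stretch would leave the worker unresponsive once traffic returns.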

Removing Poison Messages

(Producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them. Each message carries a dequeue count.)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count now 2)

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
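Steps 12–13 above, generalized: use the dequeue count to divert poison messages instead of recycling them forever. A sketch with hypothetical callbacks, where `get_message` is assumed to return a `(message, dequeue_count)` pair or None, mirroring the DequeueCount shown in the scenario:

```python
MAX_DEQUEUE_COUNT = 3  # threshold before a message is declared poison

def process_queue(get_message, handle, delete_message, dead_letter):
    """Drain a queue; messages whose dequeue count exceeds the
    threshold are parked and deleted instead of being retried."""
    while True:
        item = get_message()
        if item is None:
            return
        msg, dequeue_count = item
        if dequeue_count > MAX_DEQUEUE_COUNT:
            dead_letter(msg)     # park the poison message for inspection
            delete_message(msg)  # remove it so it stops recycling
            continue
        handle(msg)
        delete_message(msg)
```

Parking the message (rather than silently dropping it) preserves the evidence needed to debug why it repeatedly crashed its consumers.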

Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
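The data-parallel idea behind the last bullet looks like this in Python. The deck's actual recommendation is the .NET 4 Task Parallel Library; `concurrent.futures` is only an analogue, and `parallel_map` is an illustrative helper:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def parallel_map(func, items, workers=None):
    """Data parallelism: apply func to every item using a worker
    pool sized to the machine (roughly analogous to TPL's
    Parallel.ForEach in .NET 4). Results keep input order."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

As the slide warns, sizing the pool past the core count (for CPU-bound work) buys nothing; for I/O-bound work oversubscription is usually fine.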

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Diagram: Uncompressed Content → Gzip, minify JavaScript, minify CSS, minify images → Compressed Content)
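Point 1 can be demonstrated with the standard library; `gzip_bytes` is an illustrative helper, and its `level` knob is exactly the compute-for-size trade-off of point 2:

```python
import gzip

def gzip_bytes(payload: bytes, level: int = 6) -> bytes:
    """Gzip a response body. Higher level = smaller output
    but more CPU spent compressing."""
    return gzip.compress(payload, compresslevel=level)

# Repetitive text/HTML/JSON compresses many-fold:
body = b"<html>" + b"<li>repetitive markup</li>" * 500 + b"</html>"
compressed = gzip_bytes(body)
```

In a real web role this would be applied to response bodies when the client sends `Accept-Encoding: gzip`, with the `Content-Encoding: gzip` header set on the reply.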

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow

A simple Split/Join pattern: a Splitting task fans out into BLAST tasks, whose outputs a Merging task joins.

Leverage the multi-core of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large and extra-large instance sizes

Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads (NCBI-BLAST overhead, data transferring overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting in case of instance failure
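The Split/Join pattern above can be sketched in a few lines. Purely illustrative: `blast_partition` stands in for a real NCBI-BLAST invocation, and a local thread pool stands in for the queue-fed worker instances; 100 sequences per partition is the default because the micro-benchmarks below found it to be the sweet spot:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size=100):
    """Splitting task: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for one BLAST task; a real worker would shell out
    to NCBI-BLAST (with -a set to the instance's core count)."""
    return [f"hit:{seq}" for seq in partition]

def run_split_join(sequences, partition_size=100):
    """Split / query partitions in parallel / merge when done."""
    merged = []
    with ThreadPoolExecutor() as pool:
        for result in pool.map(blast_partition,
                               split(sequences, partition_size)):
            merged.extend(result)  # merging task
    return merged
```

Because `pool.map` yields results in input order, the merge step here trivially preserves sequence order; the real service instead tracks tasks through the dispatch queue and job registry.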

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press | Programming Windows Azure, O'Reilly Press | Bing: Channel 9 Windows Azure | Bing: Windows Azure Platform Training Kit - November 2010 Update | http://research.microsoft.com/azure | xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 56: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness; loose coupling aids scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
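The work ticket pattern mentioned above can be sketched in a few lines. This is a minimal, hedged Python sketch, not the Azure SDK: `blob_store` (a dict) and `queue` (a list) are stand-ins for Blob and Queue storage, and `put_work`/`get_work` are hypothetical names. The point is that the large payload goes into a blob, and only a small "ticket" referencing it goes on the queue, keeping messages under the 8 KB limit.

```python
import uuid

# In-memory stand-ins for Azure Blob and Queue storage (illustration only).
blob_store = {}   # blob name -> payload bytes
queue = []        # small "work ticket" messages only

def put_work(payload: bytes) -> str:
    """Store the (possibly large) payload in a blob, enqueue a small ticket."""
    blob_name = str(uuid.uuid4())
    blob_store[blob_name] = payload
    queue.append({"ticket": blob_name})   # well under the 8 KB message limit
    return blob_name

def get_work() -> bytes:
    """Dequeue a ticket and resolve it to the real payload."""
    msg = queue.pop(0)
    return blob_store[msg["ticket"]]
```

A worker that finishes processing would then delete both the message and the blob, garbage-collecting any orphaned blobs separately.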

Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1-4 to the queue; Worker Roles call GetMessage with a visibility timeout to dequeue Msg 1 and Msg 2, then RemoveMessage once processing succeeds.)

Enqueue:

    POST http://myaccount.queue.core.windows.net/myqueue/messages

Dequeue response:

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml
    Date: Tue, 09 Dec 2008 21:04:30 GMT
    Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

    <?xml version="1.0" encoding="utf-8"?>
    <QueueMessagesList>
      <QueueMessage>
        <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
        <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
        <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
        <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
        <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
        <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
      </QueueMessage>
    </QueueMessagesList>

Delete:

    DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
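The Put/Get/Delete lifecycle in the traces above can be modeled compactly. This Python sketch simulates the semantics (visibility timeout, pop receipt) with an in-memory queue rather than calling the real REST endpoints; `SimQueue` and its method names are illustrative stand-ins for the service operations.

```python
import time

class SimQueue:
    """Toy model of the queue message lifecycle: GetMessage hides a message
    for `timeout` seconds instead of removing it; DeleteMessage requires the
    pop receipt returned by the most recent GetMessage."""
    def __init__(self):
        self._msgs = []          # list of (id, body)
        self._invisible = {}     # id -> timestamp when visible again
        self._receipts = {}      # id -> latest pop receipt
        self._next_id = 0

    def put_message(self, body):
        self._msgs.append((self._next_id, body))
        self._next_id += 1

    def get_message(self, timeout=30.0, now=None):
        now = time.time() if now is None else now
        for mid, body in self._msgs:
            if self._invisible.get(mid, 0.0) <= now:
                self._invisible[mid] = now + timeout   # hide, don't delete
                receipt = f"pop-{mid}-{now}"
                self._receipts[mid] = receipt
                return mid, body, receipt
        return None   # queue empty or everything invisible

    def delete_message(self, mid, receipt):
        if self._receipts.get(mid) != receipt:
            raise ValueError("stale pop receipt")
        self._msgs = [(i, b) for i, b in self._msgs if i != mid]
```

If a worker crashes before deleting, the message simply reappears after the timeout, which is exactly the behavior the poison-message slides below rely on.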

Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.

(Diagram: consumers C1 and C2 polling the queue at intervals that grow 1, 2, ... up to a cap of 60.)
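The back-off rule above fits in one function. A minimal Python sketch, assuming an illustrative base interval of 1 second and a cap of 60:

```python
def next_interval(current, got_message, base=1.0, cap=60.0):
    """Truncated exponential back-off for queue polling:
    reset to the base interval on a successful poll,
    double on an empty poll, and never exceed the cap."""
    if got_message:
        return base
    return min(current * 2.0, cap)
```

A polling loop would sleep for `next_interval(...)` between GetMessage calls, so idle workers stop hammering the queue service while busy queues are drained at full speed.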

Removing Poison Messages

Producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)

1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
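The timeline above ends with the message being deleted once its dequeue count passes a threshold. A hedged Python sketch of that policy (`handle`, `DEQUEUE_LIMIT`, and the dead-letter list are illustrative names, not service APIs):

```python
DEQUEUE_LIMIT = 3   # illustrative threshold on a message's dequeue count

def handle(msg, dead_letter, process):
    """Park a message in a dead-letter list once its dequeue count passes
    the threshold, instead of retrying a poison message forever."""
    msg["dequeue_count"] = msg.get("dequeue_count", 0) + 1
    if msg["dequeue_count"] > DEQUEUE_LIMIT:
        dead_letter.append(msg)   # keep the poison message for inspection
        return "dead-lettered"
    try:
        process(msg["body"])
        return "done"
    except Exception:
        return "retry"            # message becomes visible again later
```

Dead-lettering rather than silently deleting preserves the failing input so the bug that makes the message "poison" can be diagnosed.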

Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers

Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs - files and large objects
• Drives - NTFS APIs for migrating applications
• Tables - massively scalable structured storage
• Queues - reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
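The deck recommends the .NET Task Parallel Library; as an analogous illustration of the data-parallelism idea in Python, here is a small sketch using a thread pool sized to the core count. `transform` is a hypothetical per-item function standing in for real work.

```python
from concurrent.futures import ThreadPoolExecutor
import os

def transform(x):
    # Stand-in for real per-item work (parsing, alignment, rendering, ...).
    return x * x

def parallel_map(items, workers=None):
    """Data parallelism: apply the same function to many items concurrently."""
    workers = workers or (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, items))
```

As the slide notes, oversubscribing (more busy workers than cores) for CPU-bound work usually hurts rather than helps; sizing the pool to the core count is the safe default.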

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
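Combining this with the queues recap ("use message count to scale"), a simple autoscaling heuristic can be sketched as follows. All parameters here are illustrative assumptions: the per-instance drain rate, the floor (kept above one for availability), and the ceiling (for cost control).

```python
import math

def target_instances(queue_length, msgs_per_instance=100, min_n=2, max_n=20):
    """Pick a worker count from the queue backlog: enough instances to drain
    the queue at the assumed per-instance rate, clamped to a floor and a
    ceiling. The rate and bounds are illustrative, not recommendations."""
    wanted = math.ceil(queue_length / msgs_per_instance) if queue_length else 0
    return max(min_n, min(max_n, wanted))
```

Because instances take minutes to boot, a real controller would also dampen changes (e.g., only scale down after several consecutive low readings) to avoid the poor-user-experience failure mode the slide warns about.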

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Chart: uncompressed vs. compressed content sizes after Gzip and JavaScript/CSS/image minification.)
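The "trade compute for storage/bandwidth" point can be demonstrated with the standard library; this sketch gzips a response body before it is stored or sent:

```python
import gzip

def compress(payload: bytes) -> bytes:
    """Gzip a response body before sending; browsers decompress on the fly."""
    return gzip.compress(payload)

def decompress(blob: bytes) -> bytes:
    return gzip.decompress(blob)
```

For typical markup, which is highly repetitive, the compressed size is a small fraction of the original, so the CPU spent compressing is usually repaid many times over in bandwidth and storage charges.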

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large-volume data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query the partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
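The query-segmentation split/join pattern described above can be sketched without any BLAST-specific machinery. In this Python sketch `run_partition` is a hypothetical stand-in for invoking NCBI-BLAST on one partition; the split and merge logic is the actual pattern.

```python
def split_queries(sequences, partition_size=100):
    """Split the input query sequences into fixed-size partitions
    (the micro-benchmarks below found ~100 sequences per partition best)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_partition(partition):
    # Hypothetical stand-in for running NCBI-BLAST on one partition.
    return [f"hits-for-{seq}" for seq in partition]

def merge(partition_results):
    """Join: concatenate per-partition results in partition order."""
    return [hit for part in partition_results for hit in part]
```

In the real system each partition becomes a queue message consumed by a worker role, and the merging task runs once all partitions report done.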

AzureBLAST Task-Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of an instance failure

(Diagram: a splitting task fans out to BLAST tasks, which a merging task joins.)

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST (2)

(Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine against a Job Registry in an Azure Table; worker roles pull BLAST tasks from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases; a splitting task fans out to BLAST tasks that a merging task joins.)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal's Web Portal and Web Service feed job registration to the Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (42 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually

(Map: per-deployment instance counts of 50 and 62 across the datacenters.)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like this:

    3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
    3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
    3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
    3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
    3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
    3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

    3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
    3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
    3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
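Finding the "something is wrong" cases in millions of such records is mechanical: pair each "Executing the task N" with its "Execution of task N is done". A Python sketch (the record format is assumed from the sample lines above):

```python
import re

EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def incomplete_tasks(log_lines):
    """Return task ids that were started but never reported done,
    e.g. tasks lost to a node failure or a system upgrade."""
    started, finished = set(), set()
    for line in log_lines:
        m = EXEC_RE.search(line)
        if m:
            started.add(m.group(1))
        m = DONE_RE.search(line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Grouping the incomplete tasks by node id and timestamp is what surfaced the update-domain and fault-domain patterns described on the next slides.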

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 mins: this is an update domain.

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed and the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

    ET = (Δ·R_n + ρ_a·c_p·(δq)·g_a) / ((Δ + γ·(1 + g_a/g_s)) · λ_v)

where:
• ET = water volume evapotranspired (m^3 s^-1 m^-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
• λ_v = latent heat of vaporization (J/g)
• R_n = net radiation (W m^-2)
• c_p = specific heat capacity of air (J kg^-1 K^-1)
• ρ_a = dry air density (kg m^-3)
• δq = vapor pressure deficit (Pa)
• g_a = conductivity of air (inverse of r_a) (m s^-1)
• g_s = conductivity of plant stoma air (inverse of r_s) (m s^-1)
• γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs, big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.
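The Penman-Monteith formula above translates directly into code. This is an illustrative sketch only: the default γ ≈ 66 Pa/K comes from the slide, while the default λ_v (a typical latent heat of vaporization, here in J/kg) is an assumption, and no unit handling is attempted.

```python
def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lam_v=2.45e6):
    """ET = (delta*Rn + rho_a*cp*(dq)*ga) / ((delta + gamma*(1 + ga/gs)) * lam_v)

    delta: d(saturation specific humidity)/dT (Pa/K)
    rn: net radiation (W/m^2); rho_a: dry air density (kg/m^3)
    cp: specific heat of air (J/(kg K)); dq: vapor pressure deficit (Pa)
    ga, gs: conductivities of air and stomata (m/s)
    gamma: psychrometric constant (Pa/K); lam_v: latent heat (assumed J/kg)
    """
    return (delta * rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1.0 + ga / gs)) * lam_v)
```

The per-pixel arithmetic is trivial; as the slide says, the hard part is estimating inputs like g_a and g_s across a catchment, which is what the big data reduction is for.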

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; work flows through the Request, Download, Reprojection, Reduction 1, and Reduction 2 Queues, drawing on Source Metadata and the Source Imagery Download Sites, with science results available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Generic Worker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: Generic Workers pull from the <PipelineStage> Task Queue dispatched by the Service Monitor and read/write the <Input>Data Storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters via the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue for Generic Workers, which read Swath Source Data Storage and write Reprojection Data Storage.)

• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 57: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Queue Terminology

Message Lifecycle

Queue

Msg 1

Msg 2

Msg 3

Msg 4

Worker Role

Worker Role

PutMessage

Web Role

GetMessage (Timeout)RemoveMessage

Msg 2Msg 1

Worker Role

Msg 2

POST httpmyaccountqueuecorewindowsnetmyqueuemessages

HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20

ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt

DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw

Truncated Exponential Back Off Polling

Consider a backoff polling approach Each empty poll

increases interval by 2x

A successful sets the interval back to 1

60

21

11

C1

C2

Removing Poison Messages

11

21

340

Producers Consumers

P2

P1

30

2 GetMessage(Q 30 s) msg 2

1 GetMessage(Q 30 s) msg 1

11

21

10

20

61

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

11

21

2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 15 C1 crashed

11

21

6 msg1 visible 30 s after Dequeue30

12

11

12

62

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

12

2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed

1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1

2

6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue

30

13

12

13

Queues Recap

bullNo need to deal with failuresMake messageprocessing idempotent

bull Invisible messages result in out of orderDo not rely on order

bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages

bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to storemessage data with

reference in message

Use message countto scale

bullNo need to deal with failures

bull Invisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrencybull Spin up additional processes each with a specific task or as a

unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive

• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)

• Spinning VMs up and down automatically is good at large scale

• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

• Being too aggressive in spinning down VMs can result in poor user experience

• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)
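One common way to navigate that trade-off is to derive the worker count from queue depth, with a floor of spare capacity so a burst does not immediately hurt users. A hedged sketch (illustrative rule of thumb, not an Azure API):

```python
# Pick a worker count from queue depth, bounded by a safety floor
# (spare capacity for bursts) and a cost ceiling.
def target_workers(queue_length, msgs_per_worker_per_min=60,
                   min_workers=2, max_workers=100):
    # Enough workers to drain the backlog in about a minute...
    needed = -(-queue_length // msgs_per_worker_per_min)  # ceiling division
    # ...but never below the floor or above the ceiling.
    return max(min_workers, min(max_workers, needed))

# An empty queue keeps a small reserve; a huge backlog is capped by cost.
```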

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage.

Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs

(Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.)
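The compute-for-storage trade-off in step 2 is easy to see with the standard library: repetitive markup compresses dramatically, cutting both storage and bandwidth at the price of a little CPU.

```python
# Gzip-compress repetitive text output before storing or sending it.
import gzip

page = b"<html><body>" + b"<div class='row'>hello</div>" * 500 + b"</body></html>"
compressed = gzip.compress(page)

# Highly repetitive markup shrinks to a small fraction of its size,
# and decompression round-trips exactly.
assert len(compressed) < len(page) // 10
assert gzip.decompress(compressed) == page
```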

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern: split the input sequences, query partitions in parallel, merge results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow
A simple split/join pattern.

Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
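The split/join pattern itself is small enough to sketch (illustrative Python; `run_blast` is a stand-in for a real NCBI-BLAST invocation, and the partition size of 100 sequences follows the micro-benchmark result on the next slide):

```python
# Query-segmentation split/join: split input sequences into fixed-size
# partitions, "query" each partition independently, merge the results.
def split(sequences, partition_size=100):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_blast(partition):
    # Stand-in for a real BLAST invocation: one result per sequence.
    return [f"hit:{seq}" for seq in partition]

def split_join(sequences):
    partitions = split(sequences)         # splitting task
    results = map(run_blast, partitions)  # BLAST tasks (parallelizable)
    return [r for part in results for r in part]  # merging task
```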

(Diagram: a splitting task fans out to parallel BLAST tasks, which feed a merging task.)

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

AzureBLAST

(Architecture diagram: a Web Role hosts the web portal, web service, and job registration; a Job Management Role runs the job scheduler, scaling engine, and global dispatch queue; worker roles execute the splitting, BLAST, and merging tasks; an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the NCBI databases current.)

AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored into the job registry table:
• Fault tolerance; avoid in-memory states

(Diagram: the job portal's web service handles job registration into the job registry; the job scheduler and scaling engine consume from it.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences in total to be queried

• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances

• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe

• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service

• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions

• When load imbalances, redistribute the load manually

(Diagram: per-deployment instance counts of 50 and 62.)

End Result
• Total size of the output result is ~230 GB

• The number of total hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774

3/31/2010 9:50 RD00155D3611B0 Executing the task 251895

3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
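The analysis described here amounts to pairing each "Executing the task N" line with its "Execution of task N is done" line; tasks with no completion line (like task 251774 in the excerpt) are the anomalies. A small sketch:

```python
# Flag tasks that were started but never logged a completion.
import re

def find_incomplete_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
# find_incomplete_tasks(log) -> ["251774"]
```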

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups (~6 nodes per group, offline for ~30 mins). This is an update domain.

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time.

A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
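The Penman-Monteith formula transcribes directly into code. The input values below are made up but physically plausible, purely to exercise the formula:

```python
# Penman-Monteith: ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs, lambda_v, gamma=66.0):
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

# Illustrative inputs (not from the talk): Δ=145 Pa/K, Rn=400 W/m^2,
# ρa=1.2 kg/m^3, cp=1005 J/(kg·K), δq=1000 Pa, ga=0.02 m/s, gs=0.01 m/s.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, cp=1005.0,
                     dq=1000.0, ga=0.02, gs=0.01, lambda_v=2450.0)
```

As a sanity check, ET should be positive and should increase with net radiation, all else equal.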

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
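The four stages chain together through queues: each stage consumes requests from its queue and enqueues work for the next. A hedged sketch (the names and record shapes here are illustrative, not the actual MODISAzure code):

```python
# Queue-driven pipeline: download -> reprojection -> reduction 1 -> reduction 2.
from collections import deque

download_q, reprojection_q, reduction1_q, reduction2_q = (deque() for _ in range(4))

def run_stage(in_q, transform, out_q=None):
    """Drain in_q, applying transform; forward to out_q or collect final results."""
    results = []
    while in_q:
        item = transform(in_q.popleft())
        (out_q.append(item) if out_q is not None else results.append(item))
    return results

download_q.append({"tile": "h08v05", "day": 91})                     # user request
run_stage(download_q, lambda r: {**r, "downloaded": True}, reprojection_q)
run_stage(reprojection_q, lambda r: {**r, "projection": "sinusoidal"}, reduction1_q)
run_stage(reduction1_q, lambda r: {**r, "ET": "computed"}, reduction2_q)
final = run_stage(reduction2_q, lambda r: {**r, "artifact": "map"})  # science result
```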

(Diagram: scientists submit requests through the AzureMODIS Service web role portal; work flows through the download, reprojection, reduction 1, and reduction 2 queues across the data collection, reprojection, derivation reduction, and analysis reduction stages; source imagery comes from download sites, with source metadata tracked, and science results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue

• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks (recoverable units of work)
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request flows to the MODISAzure Service web role, which persists <PipelineStage> job status and enqueues to the <PipelineStage> job queue; the Service Monitor worker role parses and persists <PipelineStage> task status and dispatches to the <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker worker roles pull from the <PipelineStage> task queue dispatched by the Service Monitor and read/write <Input> data storage.)

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request reaches the Service Monitor worker role, which persists ReprojectionJobStatus — each entity specifies a single reprojection job request — and ReprojectionTaskStatus — each entity specifies a single reprojection task, i.e., a single tile — then dispatches to the task queue consumed by GenericWorkers. The ScanTimeList table is queried for the list of satellite scan times that cover a target tile; the SwathGranuleMeta table is queried for geo-metadata (e.g., boundaries) for each swath tile. Reprojection data and swath source data live in storage.)

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction multiple times

• Storage costs are driven by data scale and the 6-month project duration

• Small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures (reading the cost diagram):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload + $450 storage
• Reprojection: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU + $60 download
• Derivation reduction: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU + $1 download + $6 storage
• Analysis reduction: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU + $2 download + $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit - November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll doubles the polling interval, and a successful poll sets the interval back to 1.
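As a sketch, the polling policy above can be written as a pure function (the function name, the base interval, and the cap are illustrative assumptions, not from the deck):

```python
def backoff_intervals(poll_results, base=1.0, cap=64.0):
    """Truncated exponential back-off: after each empty poll (None) we
    would sleep `interval` and then double it, up to `cap`; a successful
    poll resets the interval to `base`. Returns the sleeps taken."""
    interval = base
    sleeps = []
    for msg in poll_results:
        if msg is None:                        # empty poll
            sleeps.append(interval)
            interval = min(interval * 2, cap)  # truncate the growth
        else:                                  # got a message
            interval = base                    # reset to the shortest wait
    return sleeps
```

For example, three empty polls in a row produce waits of 1, 2, and 4 seconds, while a hit anywhere drops the next wait back to 1.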

Removing Poison Messages (1 of 3)

Producers (P1, P2) enqueue messages; consumers (C1, C2) dequeue them:

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2 of 3)

1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3 of 3)

1. Dequeue(Q, 30 s) → msg 1
2. Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
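The dequeue-count rule in the walkthrough above can be sketched as follows. `FakeQueue` is a toy in-memory stand-in for a cloud queue (not the Azure storage API, and visibility timeouts are omitted), and the threshold mirrors the slide's "DequeueCount > 2" check:

```python
MAX_DEQUEUE_COUNT = 2  # mirror the slide's "DequeueCount > 2" rule

class FakeQueue:
    """Toy in-memory queue; each get increments the message's dequeue count."""
    def __init__(self, payloads):
        self.msgs = [{"id": i, "body": b, "dequeue_count": 0}
                     for i, b in enumerate(payloads)]

    def get_message(self):
        if not self.msgs:
            return None
        msg = self.msgs[0]
        msg["dequeue_count"] += 1
        return msg

    def delete_message(self, msg):
        self.msgs = [m for m in self.msgs if m["id"] != msg["id"]]

def drain(queue, handle, dead_letters):
    """Process until empty, parking poison messages instead of
    retrying them forever."""
    while (msg := queue.get_message()) is not None:
        if msg["dequeue_count"] > MAX_DEQUEUE_COUNT:
            dead_letters.append(msg)   # park for offline inspection
            queue.delete_message(msg)
            continue
        try:
            handle(msg)
        except Exception:
            continue                   # failed: it will be dequeued again
        queue.delete_message(msg)      # success: remove for good
```

A message whose handler keeps failing is deleted (and dead-lettered) after the threshold is exceeded, so it cannot block the queue forever.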

Queues Recap

• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
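"Use the message count to scale" might look like this in code; the drain-rate and clamp values are made-up knobs, not recommended settings:

```python
import math

def target_workers(queue_length, msgs_per_worker_per_min=60,
                   min_workers=1, max_workers=20):
    """Size the worker pool from the queue backlog: enough workers to
    drain the queue in about a minute, clamped to a sane band."""
    needed = math.ceil(queue_length / msgs_per_worker_per_min)
    return max(min_workers, min(max_workers, needed))
```

A scaling loop would poll the queue length periodically and adjust the deployment toward this target.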

Windows Azure Storage Takeaways

Data abstractions to build your applications:

• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
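The break-even claim above can be checked with a small cost model; all prices, throughputs, and efficiency factors below are invented for illustration and are not real Azure rates:

```python
SMALL_RATE = 0.12  # invented $/hour for a 1-core VM
XL_RATE = 0.96     # invented $/hour for an 8-core VM (8x the small price)

def job_cost(work_units, per_core_throughput, cores, scaling_efficiency, rate):
    """Cost to finish a fixed job on one VM whose effective throughput is
    per-core throughput x cores x scaling efficiency."""
    hours = work_units / (per_core_throughput * cores * scaling_efficiency)
    return hours * rate

# Same 1000-unit job: 8 single-core VMs vs. one 8-core VM.
eight_small = 8 * job_cost(1000 / 8, 10, 1, 1.0, SMALL_RATE)
one_xl_sublinear = job_cost(1000, 10, 8, 0.8, XL_RATE)    # typical case
one_xl_superlinear = job_cost(1000, 10, 8, 1.2, XL_RATE)  # rare, e.g. cache wins
```

With sub-linear scaling the 8-core VM costs more than eight small ones; with super-linear scaling it costs less; hence "experiment and measure".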

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
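A data-parallelism sketch in the spirit of the Task Parallel Library bullet, using Python's concurrent.futures rather than .NET (the per-item workload is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def expensive(n):
    """Placeholder per-item work unit."""
    return n * n

def process_all(items, workers=4):
    """Data parallelism: the same operation applied to every item,
    scheduled across a fixed pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(expensive, items))  # order-preserving
```

The pool keeps a bounded number of workers busy regardless of how many items there are, which matches the "unit of concurrency" idea above.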

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
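The flat-fee vs. per-transaction point reduces to simple arithmetic; both prices below are placeholders, not the 2010 rate card:

```python
FLAT_FEE = 99.99           # placeholder monthly fee for a flat-rate service
PER_TRANSACTION = 0.00001  # placeholder per-transaction charge

def cheaper_option(monthly_transactions):
    """Pick the cheaper billing model for a given transaction profile."""
    per_txn_cost = monthly_transactions * PER_TRANSACTION
    return "per-transaction" if per_txn_cost < FLAT_FEE else "flat-fee"
```

A low-traffic app comes out ahead on per-transaction billing; a high-traffic app crosses over to the flat fee, which is why knowing your profile matters.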

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → compressed content: Gzip, minify JavaScript, minify CSS, minify images
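A quick stdlib illustration of the Gzip trade; the sample payload is made up, and real savings depend on how repetitive your content is:

```python
import gzip

# Repetitive markup, typical of generated HTML.
html = b"<html><body>" + b"<p>Hello, cloud!</p>" * 200 + b"</body></html>"

compressed = gzip.compress(html)            # pay a little CPU here...
assert len(compressed) < len(html) // 10    # ...to move far fewer bytes
assert gzip.decompress(compressed) == html  # and the round trip is lossless
```

The CPU spent compressing is billed once; the bytes saved are saved on every download, which is why compression pays off for popular content.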

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences, query partitions in parallel, merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
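The query-segmentation pattern can be sketched with a stand-in for the BLAST executable; `fake_blast` and the partition size are illustrative, since the real system dispatches NCBI-BLAST tasks over Azure queues rather than a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    """Splitting task: cut the input query set into partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def fake_blast(partition):
    """Stand-in for one BLAST task: 'align' each sequence to its reverse."""
    return [(seq, seq[::-1]) for seq in partition]

def blast_all(sequences, partition_size=100, workers=4):
    """Split the input, query partitions in parallel, merge when done."""
    partitions = split(sequences, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(fake_blast, partitions)    # parallel query tasks
    return [hit for part in partials for hit in part]  # merging task
```

The merge step is trivial here because query segmentation (unlike database segmentation) needs no special result reduction.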

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, and a merging task joins the results when all are done.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST start-up overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waits in case of instance failure

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST (2)

• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine, feeding a global dispatch queue; the job registry is kept in Azure Tables
• Worker roles: pools of workers that execute the split/join task flow (splitting task → BLAST tasks → merging task)
• Database-updating role: keeps the NCBI databases current
• Azure Blob storage: holds the BLAST databases, temporary data, etc.

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table (fault tolerance: avoid in-memory state).

Components: job portal, web service, job registration, job scheduler, scaling engine, job registry.

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE); Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances: 475 extra-large VMs (8 cores per VM) in four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances appeared, redistributed the load manually

(Per-deployment instance counts: 50, 62, 62, 62, 62, 62, 50, 62)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
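The "look into the log data" step amounts to pairing start records with finish records; here is a sketch (the log format is reconstructed from the slide, and the regexes are assumptions):

```python
import re

SAMPLE_LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """Tasks that started but never logged completion: candidates for
    failures, upgrades, or lost instances."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        if m := re.search(r"Executing the task (\d+)", line):
            started.add(m.group(1))
        if m := re.search(r"Execution of task (\d+) is done", line):
            finished.add(m.group(1))
    return started - finished
```

Running this over the per-node logs is how incidents like the upgrade and storage failures below show up.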

Surviving System Upgrades

North Europe data center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and the job was killed.

35 nodes experienced blob-writing failures at the same time.

A reasonable guess: the Fault Domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
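The Penman-Monteith relation above can be transcribed directly; the default γ comes from the deck, while λv's default (expressed in J/kg rather than J/g) and the sample inputs are illustrative values, not data from the project:

```python
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2.45e6):
    """ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)

    delta: Pa/K; Rn: W/m^2; rho_a: kg/m^3; c_p: J/(kg*K);
    dq: Pa; g_a, g_s: m/s; gamma: Pa/K; lambda_v: J/kg (illustrative default).
    """
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The formula itself is cheap; the pipeline's cost is in assembling defensible values of ga and gs for every cell of the catchment.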

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Pipeline flow: scientists submit requests through the AzureMODIS Service Web Role Portal (request queue); the data collection stage (download queue) pulls source imagery from the download sites and records source metadata; the reprojection stage (reprojection queue), derivation reduction stage (Reduction 1 queue), and analysis reduction stage (Reduction 2 queue) follow; science results are then available for scientific-results download.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks (recoverable units of work), and the execution status of all jobs and tasks is persisted in Tables

Flow: <PipelineStage> Request → MODISAzure Service (Web Role) → persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → parse & persist <PipelineStage> TaskStatus → dispatch → <PipelineStage> Task Queue

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role), which:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Flow: <PipelineStage> Task Queue → dispatch → GenericWorker (Worker Role) → <Input> Data Storage; the Service Monitor parses & persists <PipelineStage> TaskStatus

Example Pipeline Stage: Reprojection Service

Flow: Reprojection Request → Job Queue → Service Monitor (Worker Role) → persist ReprojectionJobStatus; parse & persist ReprojectionTaskStatus → dispatch → Task Queue → GenericWorker (Worker Role) → Reprojection Data Storage

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile in Swath Source Data Storage
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Both are small with respect to the people costs, even at graduate-student rates

Per-stage breakdown (via the AzureMODIS Service Web Role Portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services that support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com


Removing Poison Messages

(Slides 60–62 share one diagram: producers P1/P2 and consumers C1/C2 around a queue, with annotations giving each message's id and dequeue count. The step sequences are:)

Slide 60 – normal dequeue:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Slide 61 – consumer crash and message reappearance:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Slide 62 – detecting the poison message via dequeue count:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)

Queues Recap

• Make message processing idempotent → no need to deal with failures
• Do not rely on order → invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages → enforce a threshold on a message's dequeue count
• Messages > 8 KB → use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale → dynamically increase/reduce workers
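Two of the recap's rules — a dequeue-count threshold for poison messages, and truncated exponential back-off for polling — can be sketched in Python. The names, thresholds, and list-based queue here are illustrative stand-ins, not the Azure Storage client API:

```python
MAX_DEQUEUE_COUNT = 3  # threshold before a message is declared poison


def process(message):
    """Stand-in for real work; raises for messages marked as bad."""
    if message.get("bad"):
        raise ValueError("cannot process")


def handle(message, queue, dead_letters):
    """Dequeue-count rule: after too many failed dequeues, remove the
    poison message instead of retrying it forever."""
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        dead_letters.append(message)   # park it for offline inspection
        queue.remove(message)
        return "poisoned"
    try:
        process(message)               # must be idempotent -- it may run twice
    except Exception:
        return "retry"                 # message becomes visible again after the timeout
    queue.remove(message)
    return "done"


def next_backoff(current, cap=6.4):
    """Truncated exponential back-off polling: double the sleep after each
    empty poll, up to a cap; reset to the minimum once a message arrives."""
    return min(current * 2, cap)
```

Because a crashed consumer's message reappears (slide 61 above), `process` may legitimately run twice on the same message — hence the idempotency requirement.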

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
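The sizing trade-off can be made concrete with a little arithmetic. The rates, throughputs, and scaling efficiency below are hypothetical; only the shape of the calculation matters:

```python
import math


def hourly_cost(demand, per_core_throughput, cores, efficiency, rate_per_core_hour):
    """Cost to meet `demand` with VMs of one size. An n-core VM delivers
    n * per_core_throughput * efficiency, so sub-linear scaling
    (efficiency < 1) makes big VMs less attractive. All inputs hypothetical."""
    per_vm = cores * per_core_throughput * efficiency
    vms = math.ceil(demand / per_vm)          # round up to whole instances
    return vms * cores * rate_per_core_hour
```

With, say, 75% scaling efficiency on 8 cores, many single-core instances come out cheaper for the same throughput — which is why the slide says to experiment and measure rather than assume.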

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
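The deck points at the .NET Task Parallel Library; the data-parallel half of the idea can be sketched in Python with a worker pool (the `checksum` work function is an invented stand-in for real per-item work):

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(block):
    """Stand-in for per-item work (hashing, parsing, alignment, ...)."""
    return sum(block) % 251


def map_parallel(blocks, workers=4):
    """Data parallelism: apply the same function to many inputs at once,
    keeping the cores of a single role instance busy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, blocks))
```

Task parallelism is the same pool with *different* functions submitted via `pool.submit`; either way the goal from the previous slide holds — one role instance is not limited to one task.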

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
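Combining this slide with the earlier "use the message count to scale" queue tip, a scaling controller can be sketched as a simple sizing rule. Every threshold below is invented for illustration:

```python
def target_workers(queue_length, msgs_per_worker_per_min=60,
                   min_workers=2, max_workers=20):
    """Pick a worker-instance count from the queue backlog: enough
    instances to drain the queue in about a minute, clamped to a floor
    (for availability) and a ceiling (to cap cost)."""
    needed = -(-queue_length // msgs_per_worker_per_min)  # ceiling division
    return max(min_workers, min(needed, max_workers))
```

The floor guards the user experience when load spikes; the ceiling guards the bill — exactly the performance-vs.-cost trade-off above. Since VMs take minutes to start, a real controller would also damp changes rather than react to every poll.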

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Saving bandwidth costs often leads to savings in other places:
• Sending fewer things over the wire often means getting fewer things from storage
• Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip / minify JavaScript / minify CSS / minify images → compressed content)
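A minimal sketch of tip 1 in Python rather than the ASP.NET pipeline the deck assumes: compress the response body only when the client's Accept-Encoding header advertises gzip support.

```python
import gzip


def maybe_gzip(payload: bytes, accept_encoding: str):
    """Gzip the response body when the client can decompress it.
    Returns (body, content_encoding_header_or_None). A sketch of the
    'gzip all output content' tip, not a full HTTP implementation."""
    if "gzip" in accept_encoding.lower():
        return gzip.compress(payload), "gzip"
    return payload, None
```

This is where the compute-for-storage/bandwidth trade-off from tip 2 shows up directly: a little CPU per response buys fewer bytes on the wire and in storage.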

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out to many BLAST tasks, followed by a merging task.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• Set to 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition → load imbalance
• Small partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long wait in case of instance failure
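The split/join pattern can be sketched as follows. `blast_task` is an invented stand-in for the real NCBI-BLAST invocation, and `partition_size` is the granularity knob discussed above:

```python
def split(sequences, partition_size):
    """Splitting task: break the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]


def blast_task(partition):
    """Stand-in for one BLAST worker: here it just returns sequence
    lengths instead of alignment hits."""
    return [len(seq) for seq in partition]


def merge(partial_results):
    """Merging task: join the per-partition results back together."""
    return [hit for part in partial_results for hit in part]


def split_join(sequences, partition_size=2):
    """Run the whole split -> (parallel) query -> join pipeline serially."""
    return merge(blast_task(p) for p in split(sequences, partition_size))
```

In the real system each partition becomes a queue message consumed by a worker role, and the micro-benchmarks below drive the choice of `partition_size`.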

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST (architecture diagram)

• Web Role: Web Portal, Web Service, Job registration
• Job Management Role: Job Scheduler, Scaling Engine, global dispatch queue
• Worker roles: pool of workers pulling tasks (splitting task → BLAST tasks … → merging task)
• Database updating Role
• Azure Table: Job Registry
• Azure Blob: NCBI databases (BLAST databases, temporary data, etc.)

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID.

The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory states

(Diagram: Job Portal → Web Portal / Web Service → Job registration → Job Scheduler / Scaling Engine → Job Registry)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Diagram: per-deployment instance counts, 50–62 VMs each)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be a matched "Executing"/"done" pair:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
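The pairing check described above can be sketched in Python: find task ids with an "Executing" record but no matching "done" record. The regexes assume the log format shown and are otherwise an invented helper:

```python
import re

EXEC = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")


def unfinished_tasks(log_lines):
    """Return the ids of tasks that were started but never finished
    (e.g. the instance was rebooted mid-task)."""
    started, finished = set(), set()
    for line in log_lines:
        if m := EXEC.search(line):
            started.add(m.group(1))
        elif m := DONE.search(line):
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the sample above, this flags task 251774 — the gap that leads into the update-domain and fault-domain observations on the next slides.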

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes — this is an update domain at work.

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed. 35 nodes experienced blob-writing failures at the same time — a reasonable guess is that this is the fault domain working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Penman–Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.
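As a sketch, the Penman–Monteith equation translates directly into code. The default constants below (γ ≈ 66 Pa K-1 per the slide; λv ≈ 2260 J/g for water) are illustrative, and keeping the input units consistent is the caller's responsibility:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v).

    delta: d(saturation specific humidity)/dT (Pa/K), r_n: net radiation,
    rho_a: dry air density, c_p: specific heat of air, dq: vapor pressure
    deficit (Pa), g_a / g_s: air and stomatal conductivities (m/s)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The hard part in MODISAzure is not this arithmetic but producing the inputs — estimating ga and gs across a catchment is exactly the "not so simple" data reduction the slide warns about.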

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: Scientists → AzureMODIS Service Web Role Portal → Request Queue; Download, Reprojection, Reduction 1, and Reduction 2 Queues feed the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; source imagery download sites feed the pipeline; scientific results are downloaded at the end. Source metadata is kept alongside.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

(Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) persists <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus → dispatches to <PipelineStage> Task Queue)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read from <Input>Data Storage)

Example Pipeline Stage: Reprojection Service

(Diagram: Reprojection Request → Service Monitor (Worker Role) persists ReprojectionJobStatus → Job Queue → parse & persist ReprojectionTaskStatus → dispatch → Task Queue → GenericWorker (Worker Role) instances → Reprojection Data Storage and Swath Source Data Storage; task entities point into the ScanTimeList and SwathGranuleMeta tables.)

• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
Derivation reduction stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 61: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

61

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

11

21

2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1

1 GetMessage(Q 30 s) msg 15 C1 crashed

11

21

6 msg1 visible 30 s after Dequeue30

12

11

12

62

C1

C2

Removing Poison Messages

340

Producers Consumers

P2

P1

12

2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed

1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1

2

6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue

30

13

12

13

Queues Recap

bullNo need to deal with failuresMake messageprocessing idempotent

bull Invisible messages result in out of orderDo not rely on order

bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages

bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs

bullDynamically increasereduce workers

Use blob to storemessage data with

reference in message

Use message countto scale

bullNo need to deal with failures

bull Invisible messages result in out of order

bullEnforce threshold on messagersquos dequeue count

bullDynamically increasereduce workers

Windows Azure Storage TakeawaysData abstractions to build your applications

Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at

httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet

Best Practices

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrencybull Spin up additional processes each with a specific task or as a

unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighborsbull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-

and memory-intensive they may be a good neighbor for storage IO-intensive code

MemoryIntensive

CPUIntensive

Network IO Intensive Storage IO Intensive

Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not

over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
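The pattern above, an "Executing" line with no matching "is done" line, can be detected mechanically. A minimal sketch (not the actual analysis scripts; the log format is assumed from the samples above):

```python
import re

def find_incomplete_tasks(log_lines):
    """Return task IDs that were started but never reported done."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
print(find_incomplete_tasks(log))  # task 251774 started but never finished
```

Grouping the survivors by node and timestamp is what surfaces the update-domain and fault-domain patterns discussed on the next two slides.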

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups; this is an update domain in action (~30 mins, ~6 nodes in one group).

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed.
35 nodes experienced blob-writing failures at the same time; a reasonable guess is that a fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
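Once the inputs are in hand, the formula itself is a one-liner. A sketch with illustrative (not field-measured) values, using λv in J/g as in the definitions above:

```python
def penman_monteith_et(delta, Rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lam_v=2450.0):
    """Penman-Monteith: ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).

    Units follow the slide's definitions; lam_v is in J/g, so the result
    is a mass flux of water per unit area and time.
    """
    return (delta * Rn + rho_a * cp * dq * ga) / \
           ((delta + gamma * (1.0 + ga / gs)) * lam_v)

# Illustrative values only: a warm, sunny catchment cell
et = penman_monteith_et(delta=145.0, Rn=500.0, rho_a=1.2, cp=1005.0,
                        dq=1000.0, ga=0.02, gs=0.01)
```

Note the role of gs: opening the stomata (larger gs) shrinks the γ·(1 + ga/gs) term in the denominator, so ET rises, which is exactly why the conductivity estimates are the tricky part.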

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; the Request, Download, Reprojection, Reduction 1, and Reduction 2 Queues connect the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction Stages; the Data Collection Stage pulls from the Source Imagery Download Sites; Source Metadata is maintained alongside; scientists download the science results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a GenericWorker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read from <Input>Data Storage.]
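The GenericWorker's dequeue-and-retry behavior can be sketched as follows. This is an illustrative stand-in, not the MODISAzure source: a plain deque plays the role of the Azure task queue, and an explicit counter stands in for the queue's dequeue count.

```python
from collections import deque

MAX_ATTEMPTS = 3

def run_worker(tasks, process):
    """Dequeue tasks, retry failures up to MAX_ATTEMPTS, track final status."""
    queue = deque((task, 0) for task in tasks)   # (task, attempts so far)
    status = {}
    while queue:
        task, attempts = queue.popleft()
        try:
            process(task)
            status[task] = "done"
        except Exception:
            if attempts + 1 < MAX_ATTEMPTS:
                queue.append((task, attempts + 1))  # becomes visible again for retry
            else:
                status[task] = "failed"             # give up after 3 attempts
    return status

# Usage: task "bad" always fails, the others succeed
def process(task):
    if task == "bad":
        raise RuntimeError("tile download failed")

print(run_worker(["t1", "bad", "t2"], process))
# {'t1': 'done', 't2': 'done', 'bad': 'failed'}
```

In the real service the status dict corresponds to the per-task rows persisted in Azure Tables, which is what lets a job survive worker restarts.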

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile); tasks are dispatched through the Job and Task Queues to GenericWorker (Worker Role) instances, which read Reprojection Data Storage and Swath Source Data Storage.]

• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage                | Data                     | Files | Hours | Workers | Cost
Data Collection      | 400–500 GB (10 MB/sec)   | 60K   | 11    | <10     | $50 upload + $450 storage
Reprojection         | 400 GB                   | 45K   | 3500  | 20–100  | $420 cpu + $60 download
Derivation Reduction | 5–7 GB                   | 55K   | 1800  | 20–100  | $216 cpu + $1 download + $6 storage
Analysis Reduction   | <10 GB                   | ~1K   | 1800  | 20–100  | $216 cpu + $2 download + $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable, Fault-Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Removing Poison Messages

[Diagram: producers P1 and P2 feed queue Q; consumers C1 and C2 dequeue messages msg 1 and msg 2]

1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30s after Dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30s after Dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)

Queues Recap

• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
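The "messages > 8 KB" advice, store the payload in blob storage and enqueue only a reference, can be sketched like this. In-memory dicts stand in for the blob and queue services; the names and the cleanup strategy are illustrative:

```python
import uuid

BLOBS, QUEUE = {}, []          # stand-ins for blob storage and a queue
MAX_INLINE = 8 * 1024          # 8 KB queue message size limit

def enqueue(payload: bytes):
    if len(payload) <= MAX_INLINE:
        QUEUE.append({"inline": payload})
    else:
        blob_id = str(uuid.uuid4())
        BLOBS[blob_id] = payload              # large body goes to the blob store
        QUEUE.append({"blob_ref": blob_id})   # the queue carries only a reference

def dequeue() -> bytes:
    msg = QUEUE.pop(0)
    if "inline" in msg:
        return msg["inline"]
    return BLOBS.pop(msg["blob_ref"])  # delete the blob so it is not orphaned

enqueue(b"small job spec")
enqueue(b"x" * 100_000)    # 100 KB payload travels via the blob store
```

A real consumer would delete the blob only after the message itself is deleted, and periodically garbage-collect blobs whose messages have expired, per the recap above.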

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
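Once you have measured your workload's scaling factor, the comparison is simple arithmetic. The hourly prices below are illustrative placeholders, not a statement of Azure pricing:

```python
def cheaper_vm(compute_hours_1core, speedup_8core,
               price_small=0.12, price_xl=0.96):
    """Compare many 1-core small VMs vs. fewer 8-core XL VMs.

    speedup_8core is the measured speedup of the workload on 8 cores;
    linear scaling would be exactly 8.0.
    """
    cost_small = compute_hours_1core * price_small
    cost_xl = (compute_hours_1core / speedup_8core) * price_xl
    return "extra-large" if cost_xl < cost_small else "small"

# Super-linear scaling (e.g. the whole database fits in XL memory) favors XL
assert cheaper_vm(1000, speedup_8core=10.0) == "extra-large"
# Sub-linear scaling, the common case, favors many small instances
assert cheaper_vm(1000, speedup_8core=6.0) == "small"
```

With an 8:1 price ratio the break-even point is exactly linear scaling, which is why the AzureBLAST micro-benchmarks later in the deck, showing super-linear speedup on extra-large instances, made XL the economical choice there.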

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code up into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
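In Python terms (the deck assumes .NET's Task Parallel Library), data parallelism over a worker pool looks like the sketch below; the `score` function is a placeholder for per-item work, and you would swap in `ProcessPoolExecutor` for CPU-bound tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def score(seq: str) -> int:
    """Stand-in for independent per-item work, e.g. aligning one sequence."""
    return sum(ord(c) for c in seq)

def parallel_scores(seqs, workers=4):
    # Data parallelism: the same function mapped over independent inputs;
    # map() preserves input order even though tasks finish out of order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, seqs))

seqs = ["ACGT", "GGCA", "TTAA"]
assert parallel_scores(seqs) == [score(s) for s in seqs]
```

Task parallelism is the same API with *different* functions submitted via `pool.submit`, matching the data-vs-task distinction in the bullets above.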

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs

[Trade-off spectrum: Performance ↔ Cost]
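One concrete way to avoid over-scaling is to derive the worker count from the queue backlog, as the storage recap's "use message count to scale" advice suggests. The drain-time target, processing rate, and bounds below are illustrative assumptions, not Azure guidance:

```python
import math

def target_workers(queue_length, msgs_per_worker_min=10.0,
                   drain_minutes=5.0, min_workers=2, max_workers=100):
    """Pick a worker count that drains the backlog in ~drain_minutes,
    clamped so we neither scale to zero nor over-scale."""
    needed = math.ceil(queue_length / (msgs_per_worker_min * drain_minutes))
    return max(min_workers, min(max_workers, needed))

print(target_workers(1000))  # 20 workers to drain 1000 messages in ~5 min
```

Keeping `min_workers` above zero preserves responsiveness while VMs spin up; `max_workers` caps the cost of a sudden backlog spike.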

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content]
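The compute-for-bandwidth trade in step 1 is easy to see directly with the standard gzip module (the sample payload below is illustrative):

```python
import gzip

# Repetitive markup, like real HTML output
html = b"<div class='row'>hello cloud</div>" * 500

packed = gzip.compress(html)
ratio = len(packed) / len(html)

assert gzip.decompress(packed) == html   # lossless round-trip
assert ratio < 0.1                       # repetitive content compresses well
```

The few milliseconds of CPU spent compressing are paid for many times over in bandwidth and storage charges, which is exactly the trade-off the slide recommends making.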

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic Cloud

Wei Lu, Jared Jackson and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple Split/Join pattern:
Splitting task → BLAST task, BLAST task, BLAST task, … → Merging task

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
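The split/join pattern above, partition the input, query partitions in parallel, merge when done, can be sketched like this. The 100-sequences-per-partition figure comes from the micro-benchmarks on the next slide; the `blast` function is a placeholder for invoking NCBI-BLAST on a partition:

```python
from concurrent.futures import ThreadPoolExecutor

PARTITION_SIZE = 100   # sequences per partition, per the micro-benchmarks

def split(sequences, size=PARTITION_SIZE):
    """Splitting task: fixed-size partitions of the input queries."""
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

def blast(partition):
    """Placeholder for running NCBI-BLAST on one partition of queries."""
    return [f"{seq}:hit" for seq in partition]

def run_job(sequences):
    partitions = split(sequences)
    with ThreadPoolExecutor() as pool:       # BLAST tasks run in parallel
        results = pool.map(blast, partitions)
    # Merging task: concatenate per-partition results in input order
    return [hit for part in results for hit in part]

hits = run_job([f"seq{i}" for i in range(250)])   # 250 queries -> 3 partitions
```

In AzureBLAST the partitions travel through an Azure queue to worker instances rather than a local thread pool, but the split/process/merge structure is the same.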

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST (2)

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, splitting each job into BLAST tasks (splitting task → BLAST tasks → merging task) and dispatching them through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database updating Role keeps the databases current.]

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages can result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• For messages > 8 KB, use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
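A minimal sketch of the first three points – idempotent handling, delete-only-after-success, and a dequeue-count threshold for poison messages. The `ToyQueue` below is an in-memory stand-in for a cloud queue, and the threshold value is an illustrative assumption, not an Azure API:

```python
import collections

MAX_DEQUEUE = 3  # threshold on a message's dequeue count (illustrative)

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

class ToyQueue:
    """In-memory stand-in for a cloud queue (no real visibility timeout:
    a dequeued message simply stays queued until explicitly deleted)."""
    def __init__(self):
        self._items = collections.deque()

    def put(self, body):
        self._items.append(Message(body))

    def get(self):
        if not self._items:
            return None
        msg = self._items.popleft()
        msg.dequeue_count += 1
        self._items.append(msg)  # will be re-delivered unless deleted
        return msg

    def delete(self, msg):
        self._items.remove(msg)

def process(queue, handler, poison_sink):
    """One pump of the worker loop; returns False when the queue is empty."""
    msg = queue.get()
    if msg is None:
        return False
    if msg.dequeue_count > MAX_DEQUEUE:
        poison_sink.append(msg.body)  # give up: park the poison message
        queue.delete(msg)
        return True
    try:
        handler(msg.body)   # handler must be idempotent: it may run twice
        queue.delete(msg)   # delete only after success
    except Exception:
        pass                # message stays queued and will be retried
    return True
```

In the real service a message that is dequeued but not deleted reappears after its visibility timeout; the stand-in models that by leaving the message in the queue until it is explicitly deleted.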

Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you

Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
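The deck's concrete advice targets .NET 4's Task Parallel Library. The same data-parallel shape can be sketched with Python's standard library; `score` is a placeholder workload:

```python
from concurrent.futures import ThreadPoolExecutor

def score(seq):
    """Placeholder workload applied to each item."""
    return sum(ord(c) for c in seq) % 97

def score_all(seqs, workers=4):
    # Data parallelism: one operation mapped over many items concurrently.
    # Threads fit I/O-bound work; for CPU-bound work use processes
    # (the GIL serializes pure-Python compute on threads).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, seqs))
```

`Executor.map` preserves input order, so results line up with the input list regardless of which worker finished first.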

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)
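One way to act on "monitor and scale appropriately" is to derive a worker count from queue depth, as the Queues Recap suggests. The drain target, per-worker throughput figure, and bounds below are illustrative assumptions:

```python
def target_workers(queue_length, msgs_per_worker_per_min=60,
                   min_workers=2, max_workers=50):
    """Pick a worker count so the backlog drains in roughly one minute,
    clamped to fixed bounds. All numbers are illustrative assumptions."""
    needed = -(-queue_length // msgs_per_worker_per_min)  # ceiling division
    return max(min_workers, min(max_workers, needed))
```

Clamping to a minimum keeps spare capacity for spikes; clamping to a maximum caps the ~$3/day-per-VM cost of over-scaling.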

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference depending on your app profile
• Caching and compressing help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.

Sending fewer things over the wire often means getting fewer things from storage – saving bandwidth costs often leads to savings in other places.

Sending fewer things also means your VM has time to do other tasks.

All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed vs. compressed content – gzip, minified JavaScript, minified CSS, minified images.)
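Tip 1 (gzip all output content) is nearly a one-liner with Python's standard library; the sample payload here is made up:

```python
import gzip

def gzip_bytes(payload: bytes) -> bytes:
    # mtime=0 makes output deterministic, handy when caching compressed blobs
    return gzip.compress(payload, compresslevel=9, mtime=0)

html = b"<html>" + b"<p>hello cloud</p>" * 500 + b"</html>"
packed = gzip_bytes(html)
```

Repetitive markup compresses extremely well, which saves both bandwidth and blob storage.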

Best Practices Summary

Doing 'less' is the key to saving costs.

Measure everything.

Know your application profile in and out.

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A single BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially – GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large-volume data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern.

Leverage the multiple cores of one instance:
• The "–a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition → load imbalance
• Too small a partition → unnecessary overheads (NCBI-BLAST startup, data transfer)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → an unnecessarily long wait in case of instance failure

(Diagram: splitting task → BLAST tasks in parallel → merging task.)
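The split/join pattern can be sketched as follows; `blast_partition` is a stand-in for invoking NCBI-BLAST, and the default partition size follows the deck's ~100-sequences-per-partition finding:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size=100):
    """Splitting task: fixed-size partitions of the input sequences."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    # Stand-in for running NCBI-BLAST over one partition of input sequences.
    return ["hit:" + seq for seq in partition]

def run_job(sequences, workers=4):
    partitions = split(sequences)                               # split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(blast_partition, partitions))  # parallel BLAST tasks
    return [hit for part in per_partition for hit in part]      # join / merge
```

Because `map` preserves partition order, the merge step is a simple concatenation even though the BLAST tasks finish in arbitrary order.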

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instance sizes
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST

(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, dispatching work through a global dispatch queue to worker instances; an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the NCBI databases current. Workers execute the splitting task, the parallel BLAST tasks, and the merging task.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization is based on Live ID.

The accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state

(Diagram: the web portal and web service handle job registration; the job scheduler and scaling engine operate over the job registry.)

Demonstration

R. palustris as a platform for H2 production

Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment

Discovering homologs:
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually

(Diagram: instance counts per deployment – 50 or 62 extra-large instances each.)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
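The analysis described above amounts to pairing "Executing" records with their "done" records and flagging tasks that never completed. A sketch, assuming the log format shown:

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Return task ids that logged a start but never a completion."""
    started, finished = set(), set()
    for line in log_lines:
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
            continue
        m = START.search(line)
        if m:
            started.add(m.group(1))
    return sorted(started - finished)
```

On the abnormal excerpt above, this flags task 251774, whose completion record is missing.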

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.

(Chart: all 62 compute nodes lost tasks and then came back in groups of ~6 nodes, roughly 30 minutes apart – each group is an update domain.)

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time.

A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Penman–Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stomata (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere, by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.
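As a sanity check, the Penman–Monteith formula above transcribes directly to code; the sample values in the test are illustrative, not from the deck:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s, lam_v, gamma=66.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
    Inputs in the SI units listed above; γ defaults to ~66 Pa/K."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lam_v
    return numerator / denominator
```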

ET Synthesizes Imagery Sensors Models and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year.

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage, which pulls from source imagery download sites and source metadata; the reprojection queue, reduction 1 queue, and reduction 2 queue drive the reprojection, derivation reduction, and analysis reduction stages; a download queue delivers scientific results back to scientists.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
  • Persists the execution status of all jobs and tasks in Tables

(Diagram: a <PipelineStage> request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (the GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue tasks, read from <Input>Data Storage, and persist <PipelineStage>TaskStatus.)
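The GenericWorker's dequeue-execute-retry loop can be sketched as follows; the in-memory queue and status dictionary stand in for the Azure queue and status Tables, and the attempt accounting is an illustrative simplification of "retries failed tasks 3 times":

```python
MAX_ATTEMPTS = 3  # total attempt budget per task (illustrative)

def run_worker(task_queue, status_table, do_task):
    """Generic-worker loop: dequeue a task, execute it, persist status,
    and re-queue failures until the attempt budget is spent."""
    while task_queue:
        task_id, attempts = task_queue.pop(0)
        try:
            do_task(task_id)
            status_table[task_id] = "done"
        except Exception:
            if attempts + 1 < MAX_ATTEMPTS:
                task_queue.append((task_id, attempts + 1))  # retry later
                status_table[task_id] = "retrying"
            else:
                status_table[task_id] = "failed"
```

Persisting a status per attempt is what makes the tasks "recoverable units of work": a fresh worker can tell which tasks are done, retrying, or dead.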

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request arrives at the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances process tasks against Reprojection Data Storage and Swath Source Data Storage.)

• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table for the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction stages multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage data volumes, compute, and costs:
• Data collection stage: 400–500 GB, 60K files; 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files; 3,500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5–7 GB, 55K files; 1,800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files; 1,800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site

http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope

http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit – November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com


bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research – Roger Barga, Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components: Fabric Controller
  • Key Components: Fabric Controller (2)
  • Key Components: Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable, Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection: Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure: Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery, Sensors, Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage: Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources: Cloud Research Community Site
  • Resources: AzureScope
  • Resources: AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 65: Windows Azure for Research – Roger Barga, Architect, Cloud Computing Futures, MSR

Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs

• Fundamental choice – fewer, larger VMs vs. many smaller instances

• If you scale better than linearly across cores, larger VMs could save you money

• It is pretty rare to see linear scaling across 8 cores

• More instances may provide better uptime and reliability (more failures are needed to take your service down)

• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
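The size-vs-count trade-off above can be made concrete with a little arithmetic. The sketch below compares the cost of finishing a fixed workload on 1-core vs. 8-core VMs under different scaling efficiencies; the hourly rates and the efficiency model are illustrative assumptions, not real Azure prices.

```python
# Hypothetical hourly rates, for illustration only (real Azure
# pricing varies; these numbers are assumptions).
SMALL_RATE = 0.12   # $/hour for a 1-core VM
XL_RATE = 0.96      # $/hour for an 8-core VM (8x the small rate here)

def cost_to_finish(work_core_hours, cores_per_vm, rate_per_vm, efficiency):
    """Cost of completing a fixed workload on VMs of a given size.

    `efficiency` is the fraction of linear speedup actually achieved
    across the VM's cores (1.0 = perfectly linear scaling).
    """
    effective_cores = 1 + (cores_per_vm - 1) * efficiency
    hours = work_core_hours / effective_cores
    return hours * rate_per_vm

work = 800  # core-hours of work to do

# With sub-linear scaling, many small VMs beat one big one...
cost_small = cost_to_finish(work, 1, SMALL_RATE, 1.0)
cost_xl_sublinear = cost_to_finish(work, 8, XL_RATE, 0.6)
# ...but with (rare) linear scaling the large VM costs the same.
cost_xl_linear = cost_to_finish(work, 8, XL_RATE, 1.0)
```

With these assumed rates the small VMs cost $96 while the sub-linearly scaling extra-large VM costs roughly $148 for the same work, which is exactly why the slide says to measure rather than guess.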

Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU

• Balance using up the CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
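The same data-parallel idea can be sketched outside .NET as well. Here is a minimal Python equivalent that sizes a worker pool to the VM's core count; the `score_sequence` function is a hypothetical stand-in for a real unit of work such as one alignment.

```python
from concurrent.futures import ThreadPoolExecutor
import os

def score_sequence(seq: str) -> int:
    # Stand-in for a real unit of work (e.g., one BLAST alignment);
    # counting residues keeps the example self-contained.
    return len(seq)

def process_all(sequences):
    # Data parallelism: size the pool to the core count so the VM
    # stays busy without oversubscription.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_sequence, sequences))

results = process_all(["MKTAYIAKQR", "GAVLILVA", "MST"])
```

For genuinely CPU-bound work a process pool (one process per core) plays the role that the Task Parallel Library's data-parallel loops play in .NET.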

Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be good neighbors for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)

Storage Costs

• Understand an application's storage profile and how storage billing works

• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile

• Caching and compressing: they help a lot with storage costs
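The flat-fee vs. per-transaction choice comes down to a break-even calculation. The sketch below shows the shape of that comparison; the two prices are made-up placeholders, not the actual SQL Azure or Windows Azure Tables rates.

```python
# Illustrative break-even between a flat-fee store and a
# per-transaction store (prices are assumptions, not real rates).
FLAT_FEE_PER_MONTH = 9.99        # $/month, flat
PER_TRANSACTION = 0.01 / 100     # $ per storage transaction

def monthly_cost_flat(_transactions: int) -> float:
    return FLAT_FEE_PER_MONTH

def monthly_cost_per_txn(transactions: int) -> float:
    return transactions * PER_TRANSACTION

def cheaper_option(transactions: int) -> str:
    flat = monthly_cost_flat(transactions)
    per = monthly_cost_per_txn(transactions)
    return "flat" if flat < per else "per-transaction"
```

With these placeholder prices, a low-traffic app is cheaper on the per-transaction service, while a transaction-heavy app crosses the break-even point and favors the flat fee; plugging in your own measured transaction volume is the point of "know your app profile".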

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile

• Sending fewer things over the wire often means getting fewer things from storage

• Saving bandwidth costs often leads to savings in other places

• Sending fewer things means your VM has time to do other tasks

• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → Gzip / minify JavaScript / minify CSS / minify images → compressed content
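Point 1 is a one-liner in most languages. A small Python sketch of gzip-compressing response content before it goes over the wire (the sample payload is illustrative):

```python
import gzip

# Typical web-role output (HTML, JSON, JavaScript) is repetitive
# text, which is exactly what gzip compresses well.
payload = b"<html><body>" + b"hello azure " * 200 + b"</body></html>"

compressed = gzip.compress(payload, compresslevel=6)
restored = gzip.decompress(compressed)

# Fraction of the original size actually sent over the wire.
ratio = len(compressed) / len(payload)
```

In a real web role you would set the `Content-Encoding: gzip` response header and let the browser decompress on the fly.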

Best Practices Summary

• Doing 'less' is the key to saving costs

• Measure everything

• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); this needs special result-reduction processing

Large-volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• A parallel BLAST engine on Azure

• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge the results together when done

• Follows the general suggested application model: Web Role + Queue + Worker

• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
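The query-segmentation pattern can be sketched in a few lines: partition the input, process each partition independently, merge. The partitioning and merge below are real; `blast_partition` is a hypothetical stand-in for invoking NCBI BLAST on one partition.

```python
def split_fasta(records, partition_size):
    """Group (header, sequence) records into fixed-size partitions."""
    return [records[i:i + partition_size]
            for i in range(0, len(records), partition_size)]

def blast_partition(partition):
    # Placeholder for running NCBI BLAST over one partition;
    # here we just "score" each query by its length.
    return [(header, len(seq)) for header, seq in partition]

def merge(results_per_partition):
    # Join step: concatenate per-partition results in order.
    merged = []
    for part in results_per_partition:
        merged.extend(part)
    return merged

records = [("q1", "MKTAYI"), ("q2", "GAVL"), ("q3", "MSTNP"), ("q4", "AA")]
parts = split_fasta(records, 2)            # two partitions of two queries
hits = merge(blast_partition(p) for p in parts)
```

In AzureBLAST each partition becomes a queue message processed by a worker instance, so the `blast_partition` calls run in parallel across the pool.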

AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into BLAST tasks, and a merging task joins their results.

• Leverage the multiple cores of one instance
  • The "-a" argument of NCBI-BLAST: 1/2/4/8 for the small, medium, large, and extra-large instance sizes

• Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: use test runs to profile, and set the size to mitigate the overhead

• Value of the visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: an unnecessarily long wait in case of instance failure
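The visibilityTimeout trade-off above can be seen in a toy queue model: after a worker dequeues a message it stays invisible for the timeout; if the worker does not delete it in time, it reappears and another worker repeats the work. The queue below is a simulation, not the Azure Queue API.

```python
# Toy model of a queue message's visibility timeout.
class Message:
    def __init__(self, body):
        self.body = body
        self.invisible_until = 0.0

class ToyQueue:
    def __init__(self):
        self.messages = []

    def put(self, body):
        self.messages.append(Message(body))

    def get(self, visibility_timeout, now):
        # Return the first visible message and hide it until
        # now + visibility_timeout (no deletion happens here).
        for msg in self.messages:
            if msg.invisible_until <= now:
                msg.invisible_until = now + visibility_timeout
                return msg
        return None

    def delete(self, msg):
        self.messages.remove(msg)

q = ToyQueue()
q.put("blast partition 42")
m1 = q.get(visibility_timeout=30, now=0.0)   # worker A takes the task
m2 = q.get(visibility_timeout=30, now=10.0)  # still invisible: nothing
m3 = q.get(visibility_timeout=30, now=45.0)  # timeout elapsed: redelivered
```

The redelivery at `now=45.0` is the "too small: repeated computation" case; if worker A had died, the same mechanism is what rescues the task, which is why the timeout should track the estimated task run time.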

Micro-Benchmarks Inform Design

• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice

• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to the memory capacity

• Task size / instance size vs. cost
  • The extra-large instance generated the best and most economical throughput
  • Fully utilizes the resource

AzureBLAST (Architecture)

[Architecture diagram: a Web Role hosts the web portal, web service, job registration, and job scheduler; a global dispatch queue feeds the worker instances; a Job Management Role contains the scaling engine and tracks jobs in the Azure Table job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the databases current. Task flow: splitting task → BLAST tasks → merging task.]

AzureBLAST Job Portal

• An ASP.NET program hosted by a web-role instance
  • Submit jobs
  • Track a job's status and logs

• Authentication/authorization based on Live ID

• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE) and Sam Phattarasukol (Harwood Lab, UW)

• BLASTed ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec

• AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

• "All against all" query
  • The database is also the input query
  • The protein database is large (4.2 GB)
  • In total, 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons

• Performance estimation
  • Based on sampling runs on one extra-large Azure instance
  • The experiment would require 3,216,731 minutes (6.1 years) on one desktop

• Experiments at this scale are usually infeasible for most scientists

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe

• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service

• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions

• When the load imbalances, redistribute it manually

[Map figure: per-deployment node counts of 50–62 across the four datacenters]

End Result
• Total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
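The analysis described above amounts to pairing "Executing" records with their "is done" records and flagging tasks that never finished. A minimal sketch, using the anomalous log excerpt from the slide:

```python
import re

# Reconstructed sample of the anomalous log: task 251774 starts
# but never completes on this node.
LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """Return task ids that were started but never reported done."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

missing = unfinished_tasks(LOG)
```

Run over all ~34K task records per datacenter, this kind of pairing is what exposed the update-domain and fault-domain behavior described next.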

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups – this is an update domain in action
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
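A worked instance of the Penman-Monteith formula above makes the data reduction concrete; the input values below are made-up but physically plausible numbers, used only to show the arithmetic.

```python
def penman_monteith(delta, R_n, rho_a, c_p, delta_q, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)."""
    numerator = delta * R_n + rho_a * c_p * delta_q * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs (units as in the variable key above).
et = penman_monteith(
    delta=145.0,     # Pa/K
    R_n=400.0,       # W/m^2
    rho_a=1.2,       # kg/m^3
    c_p=1004.0,      # J/(kg K)
    delta_q=1000.0,  # Pa
    g_a=0.02,        # m/s
    g_s=0.01,        # m/s
)
```

In MODISAzure this single-point calculation is repeated per pixel per day, with Δ, Rn, δq, ga and gs themselves derived from the imagery, sensor, and climate inputs listed on the next slide, which is where the "big data reduction" comes from.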

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

• Data collection (map) stage
  • Downloads requested input tiles from NASA FTP sites
  • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

• Reprojection (map) stage
  • Converts source tile(s) to intermediate-result sinusoidal tiles
  • Simple nearest-neighbor or spline algorithms

• Derivation reduction stage
  • First stage visible to the scientist
  • Computes ET in our initial use

• Analysis reduction stage
  • Optional second stage visible to the scientist
  • Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists interact with the AzureMODIS Service Web Role Portal; requests flow through the Request Queue, Download Queue, Reprojection Queue, and Reduction 1/2 Queues across the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; source imagery comes from external download sites, source metadata is tracked, and scientific results are available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role (GenericWorker)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read from <Input>Data Storage.]
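The GenericWorker's dequeue-and-retry behavior can be sketched as a simple loop: a failed task goes back on the queue until it either succeeds or exhausts its 3 attempts. The queue, task bodies, and failure modes below are simulated stand-ins.

```python
from collections import deque

MAX_RETRIES = 3

def run_worker(task_queue, execute, status):
    # Drain the queue, retrying failed tasks up to MAX_RETRIES times
    # and recording a final status for each task id.
    while task_queue:
        task = task_queue.popleft()
        try:
            execute(task)
            status[task["id"]] = "done"
        except Exception:
            task["attempts"] = task.get("attempts", 0) + 1
            if task["attempts"] < MAX_RETRIES:
                task_queue.append(task)   # put it back for another try
            else:
                status[task["id"]] = "failed"

def execute(task):
    # tile-18 models a permanent failure; tile-17 models a transient
    # one that succeeds on its third attempt.
    if task["id"] == "tile-18":
        raise RuntimeError("permanent failure")
    if task.get("attempts", 0) < 2:
        raise RuntimeError("transient blob error")

status = {}
queue = deque([{"id": "tile-17"}, {"id": "tile-18"}])
run_worker(queue, execute, status)
```

In the real system the retry count and task status live in Azure Tables rather than in memory, so any worker instance can pick up where a failed one left off.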

Example Pipeline Stage: Reprojection Service

• Each entity in the Job Queue specifies a single reprojection job request
• Each entity in the Task Queue specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

[Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches tasks to GenericWorker (Worker Role) instances; tasks point into Reprojection Data Storage and Swath Source Data Storage.]

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures (stage-to-cost pairing reconstructed from the original pipeline diagram):
• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload + $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20–100 workers: $420 CPU + $60 download
• Derivation reduction stage: 5–7 GB, 55K files, 1,800 hours, 20–100 workers: $216 CPU + $1 download + $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20–100 workers: $216 CPU + $2 download + $9 storage

Total: $1,420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• They provide valuable fault-tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

Picking the Right VM Size

bull Having the correct VM size can make a big difference in costs

bull Fundamental choice ndash larger fewer VMs vs many smaller instances

bull If you scale better than linear across cores larger VMs could save you money

bull Pretty rare to see linear scaling across 8 cores

bull More instances may provide better uptime and reliability (more failures needed to take your service down)

bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it

bull Common mistake ndash split up code into multiple roles each not using up CPU

bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest

Exploiting Concurrencybull Spin up additional processes each with a specific task or as a

unit of concurrency

bull May not be ideal if number of active processes exceeds number of cores

bull Use multithreading aggressively

bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads

bull In NET 4 use the Task Parallel Library

bull Data parallelism

bull Task parallelism

Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
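A toy comparison of the two billing models, using assumed 2010-era placeholder prices (a flat $9.99/month SQL Azure fee; $0.01 per 10,000 table transactions plus $0.15/GB stored; verify against current rate cards before relying on these):

```python
def monthly_cost_sql_azure(flat_fee=9.99):
    # Flat monthly fee, independent of request volume (placeholder price).
    return flat_fee

def monthly_cost_azure_tables(transactions, gb_stored=1.0,
                              price_per_10k_tx=0.01, price_per_gb=0.15):
    # Pay per transaction plus stored capacity (placeholder prices).
    return transactions / 10_000 * price_per_10k_tx + gb_stored * price_per_gb

chatty = monthly_cost_azure_tables(100_000_000)  # 100M transactions/month
quiet = monthly_cost_azure_tables(100_000)       # 100K transactions/month
# The chatty app is cheaper on the flat-fee service; the quiet one on Tables.
```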

Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content passes through Gzip, minify JavaScript, minify CSS, and minify images steps to become compressed content.)
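Gzipping responses takes only a few lines. A sketch using Python's standard library, showing the size win on repetitive markup:

```python
import gzip

def compress_response(body: bytes) -> bytes:
    # Gzip the response body; browsers advertising
    # Accept-Encoding: gzip decompress it on the fly.
    return gzip.compress(body, compresslevel=6)

html = b"<html>" + b"<div>row</div>" * 1000 + b"</html>"
packed = compress_response(html)
# Repetitive markup shrinks dramatically, cutting bandwidth charges.
```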

Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST); needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations, including:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.

AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many parallel BLAST tasks, followed by a merging task.

Leverage the multi-core of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
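The split/join pattern itself fits in a few lines. A hypothetical sketch in which `blast_task` stubs the real NCBI-BLAST invocation on a worker:

```python
def split(sequences, partition_size=100):
    # Splitting task: partition the input sequences
    # (the micro-benchmarks suggest ~100 sequences per partition).
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running NCBI-BLAST on one partition on a worker.
    return ["hit:" + seq for seq in partition]

def merge(partial_results):
    # Merging task: join the per-partition results.
    return [hit for partial in partial_results for hit in partial]

sequences = ["seq%d" % i for i in range(250)]
partitions = split(sequences)
hits = merge(blast_task(p) for p in partitions)
```

In the real system each partition becomes a queue message, so the `blast_task` calls run on different worker instances rather than in a loop.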

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• It fully utilizes the resource

AzureBLAST

(Diagram) A Web Role exposes the web portal and web service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to the worker instances, which execute the splitting task, BLAST tasks, and merging task. An Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. A database-updating role keeps the NCBI databases current.

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored into the Job Registry table:
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal sits in the Web Role beside the web service and job registration, feeding the Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.
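A quick sanity check of the desktop estimate:

```python
# The sampling run put the single-desktop time at 3,216,731 minutes;
# converting to years confirms the figure quoted above.
total_minutes = 3_216_731
years = total_minutes / (60 * 24 * 365)  # minutes per year = 525,600
```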

Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Diagram: deployments of 50-62 extra-large VMs each across the four datacenters.)

End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6-8 days
  • Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be a matched Executing / done pair:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
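Finding such orphaned tasks is a simple pairing exercise. A sketch (the sample text is a hypothetical excerpt modeled on the failure example above):

```python
import re

LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def incomplete_tasks(log_text):
    # A healthy task leaves a matched Executing/done pair; anything
    # started but never finished points at a failure, upgrade, or restart.
    started = set(re.findall(r"Executing the task (\d+)", log_text))
    finished = set(re.findall(r"Execution of task (\d+) is done", log_text))
    return started - finished
```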

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in a group; this is an update domain:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time; a reasonable guess is that the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

    ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
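The formula transcribes directly into code. A sketch using the variable definitions above (not the MODISAzure production implementation; the sample inputs are made-up but plausible mid-day values):

```python
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    # Direct transcription of the Penman-Monteith relation,
    # with gamma in Pa/K and lambda_v (latent heat) in J/g.
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# ET should rise with net radiation, all else equal.
et_low = penman_monteith_et(delta=145, Rn=400, rho_a=1.2, c_p=1005,
                            dq=1000, g_a=0.02, g_s=0.01)
et_high = penman_monteith_et(delta=145, Rn=500, rho_a=1.2, c_p=1005,
                             dq=1000, g_a=0.02, g_s=0.01)
```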

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds the download queue for the data collection stage, and the reprojection, reduction 1, and reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages, with source metadata tracked and science results available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, i.e. recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue tasks and read/write the <Input> Data Storage.)
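The dequeue-and-retry behavior can be sketched as follows. A hypothetical toy stand-in for the Azure queue and status table (real queue messages expose a dequeue count in much the same way):

```python
MAX_RETRIES = 3  # matches "retries failed tasks 3 times"

def process_queue(tasks, run_task, status):
    # tasks: list of (task_id, dequeue_count) pairs, a toy stand-in
    # for the task queue; status: task_id -> state, standing in for
    # the persisted TaskStatus table.
    while tasks:
        task_id, dequeue_count = tasks.pop(0)
        if dequeue_count > MAX_RETRIES:
            status[task_id] = "failed"  # poison task: stop retrying
            continue
        try:
            run_task(task_id)
            status[task_id] = "done"
        except Exception:
            # Failed task reappears on the queue with a higher count.
            tasks.append((task_id, dequeue_count + 1))
```

Capping retries is what keeps one bad tile from wedging a worker forever.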

Example Pipeline Stage: Reprojection Service

(Diagram) A Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), and dispatches to the Task Queue for GenericWorker (Worker Role) instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile, query the ScanTimeList table for the list of satellite scan times that cover a target tile, and read Swath Source Data Storage to produce Reprojection Data Storage.

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers ($50 upload, $450 storage)
• Reprojection: 400 GB, 45K files, 3,500 hours, 20-100 workers ($420 CPU, $60 download)
• Derivation reduction: 5-7 GB, 55K files, 1,800 hours, 20-100 workers ($216 CPU, $1 download, $6 storage)
• Analysis reduction: <10 GB, ~1K files, 1,800 hours, 20-100 workers ($216 CPU, $2 download, $9 storage)

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November 2010 Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com


basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
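The two flavors the slide names for the Task Parallel Library can be sketched in any language; a minimal illustration (the helper names `square`, `total`, `largest` are ours, not from the deck):

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: apply the same operation to every element of a collection.
def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, range(8)))    # [0, 1, 4, 9, 16, 25, 36, 49]

# Task parallelism: run distinct, unrelated tasks concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, range(100))          # one task
    largest = pool.submit(max, [3, 1, 4, 1, 5])   # a different task
    results = (total.result(), largest.result())  # (4950, 5)
```

Either way, the pool (like the TPL scheduler) decides how many worker threads actually run, so the process count stays matched to the cores.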

Finding Good Code Neighbors
• Typically code is intensive in one or more of these resources: memory, CPU, network I/O, storage I/O
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)

Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs

Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience

Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
(Figure: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
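The compute-for-storage trade in step 2 is a one-liner in most stacks; a small sketch with a made-up response body:

```python
import gzip

# Hypothetical response body: repetitive HTML compresses very well.
html = b"<html>" + b"<p>hello world</p>" * 200 + b"</html>"

compressed = gzip.compress(html, compresslevel=6)
restored = gzip.decompress(compressed)

assert restored == html              # lossless round-trip
assert len(compressed) < len(html)   # fewer bytes over the wire and in storage
```

Every byte saved here is paid for once in CPU but saved on every storage transaction and every download.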

Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
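The split / query-in-parallel / merge pattern above can be sketched compactly; `blast_partition` is a stand-in for the real NCBI-BLAST invocation, and the names are ours:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    """Split the input query sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for running NCBI-BLAST over one partition of queries."""
    return ["hit:" + seq for seq in partition]

def run_query_segmented(sequences, partition_size=100, workers=4):
    partitions = split(sequences, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(blast_partition, partitions)  # query partitions in parallel
    merged = []
    for result in partial:  # merge results together when done
        merged.extend(result)
    return merged
```

In AzureBLAST the partitions travel through a queue to worker-role instances rather than an in-process pool, but the dataflow is the same.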

AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions cause load imbalance; small partitions cause unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation; too large: an unnecessarily long wait in case of instance failure
(Figure: a splitting task fans out to BLAST tasks that run in parallel, followed by a merging task.)
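The visibilityTimeout trade-off is easiest to see in a toy model of the queue semantics; this is our own simplified sketch, not the Azure Queue API:

```python
import heapq

class VisibilityQueue:
    """Toy model of a queue with a visibility timeout: a dequeued message
    becomes invisible for `timeout` time units, then reappears if the
    worker never deleted it (e.g., because the instance failed)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.visible = []     # messages ready to be dequeued
        self.invisible = []   # heap of (reappear_at, message)

    def put(self, message):
        self.visible.append(message)

    def get(self, now):
        # Messages whose visibility timeout has expired become visible again.
        while self.invisible and self.invisible[0][0] <= now:
            self.visible.append(heapq.heappop(self.invisible)[1])
        if not self.visible:
            return None
        message = self.visible.pop(0)
        heapq.heappush(self.invisible, (now + self.timeout, message))
        return message

    def delete(self, message):
        # A worker deletes the message once the task has completed.
        self.invisible = [(t, m) for t, m in self.invisible if m != message]
        heapq.heapify(self.invisible)
```

With `timeout=30`, a task dequeued at t=0 reappears at t=30 unless deleted: set the timeout below the real run time and a healthy worker's task is re-executed by someone else; set it far above and a failed instance's task waits that long before another worker can pick it up.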

Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger-size worker instances, primarily due to the memory capacity
Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput and fully utilizes the resources

AzureBLAST (architecture)
(Figure: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, with job state in an Azure Table job registry; worker instances pull BLAST tasks from a global dispatch queue, following the splitting → parallel BLAST tasks → merging flow; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role refreshes the NCBI databases.)

AzureBLAST Job Portal
ASP.NET program hosted by a web-role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state
(Figure: the Web Portal and Web Service pass job registrations to the Job Scheduler, alongside the Job Portal, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)
Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
An "all against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists

Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
(Figure: per-deployment instance counts of 50 and 62.)

End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working-instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
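Finding the "something is wrong" cases in such a log reduces to pairing up Executing/done lines; a small sketch over the example records above (the pattern strings assume the log format shown, nothing more):

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """Tasks that logged 'Executing' but never logged a matching 'done'."""
    started = set(re.findall(r"Executing the task (\d+)", log_text))
    finished = set(re.findall(r"Execution of task (\d+) is done", log_text))
    return started - finished
```

Here `unfinished_tasks(LOG)` flags task 251774, the one whose instance disappeared before completing.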

Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins, ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

• Lots of inputs; big data reduction
• Some of the inputs are not so simple: estimating resistance/conductivity across a catchment can be tricky
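The formula itself is a direct translation into code; the input values in the test below are made-up plug-in numbers, not field data:

```python
def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lambda_v=2260.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv),
    with the units listed above; gamma defaults to the slide's
    ~66 Pa/K and lambda_v to ~2260 J/g for water."""
    numerator = delta * rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator
```

The per-pixel arithmetic is trivial; the hard part, as the slide notes, is producing defensible inputs (especially ga and gs) for every pixel of a catchment, which is what the reduction stages exist to do.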

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science-analysis artifacts such as maps, tables, and virtual sensors
(Figure: scientists submit requests through the AzureMODIS Service web-role portal; a request queue feeds the download, reprojection, and reduction 1/2 queues that drive the data collection, reprojection, derivation reduction, and analysis reduction stages; inputs come from source imagery download sites and source metadata, and scientific results are available for download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks (recoverable units of work)
  • Execution status of all jobs and tasks is persisted in Tables
(Figure: a <PipelineStage> request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)

MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
• The GenericWorker (Worker Role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
(Figure: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker instances dequeue the tasks and read from <Input>Data Storage.)
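The GenericWorker's retry policy is simple to sketch; this is our illustration of a 3-attempt loop (the deck does not show the actual code, and `flaky_task` is a hypothetical stand-in):

```python
def run_with_retries(task, max_attempts=3):
    """Run a task; on failure, retry up to max_attempts total attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the failure for poison handling

# Hypothetical task that fails twice, then succeeds on the third attempt.
attempts = []
def flaky_task():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "done"
```

Bounding the retries matters: a task that fails deterministically must eventually be surfaced (the poison-message case) rather than re-queued forever.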

Example Pipeline Stage: Reprojection Service
(Figure: a reprojection request is persisted as ReprojectionJobStatus and enqueued to the job queue; the Service Monitor (Worker Role) parses it, persists ReprojectionTaskStatus, and dispatches tasks to the task queue; GenericWorker instances read swath source data storage and write reprojection data storage.)
• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per stage (reading the pipeline figure in stage order):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload + $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU + $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU + $1 download + $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU + $2 download + $9 storage
Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

Finding Good Code Neighborsbull Typically code falls into one or more of these categories

bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-

and memory-intensive they may be a good neighbor for storage IO-intensive code

MemoryIntensive

CPUIntensive

Network IO Intensive Storage IO Intensive

Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not

over-scaled)

bull Spinning VMs up and down automatically is good at large scale

bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running

bull Being too aggressive in spinning down VMs can result in poor user experience

bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs

Performance Cost

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution, and each segment consists of smaller partitions
• When the load imbalances, redistribute it manually

[Figure: extra-large VM allocation per deployment]

End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
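A few lines of scripting are enough to mine such logs for anomalies. The sketch below uses a simplified "timestamp node message" format modeled on the records above (the embedded log text is sample data); tasks that start but never report completion are flagged:

```python
import re

LOG = """\
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
3/31/2010 11:12 RD00155D3611B0 Executing the task 251900
"""

started, finished = set(), set()
for line in LOG.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done", line)
    if m:
        finished.add(m.group(1))

# Tasks with a start record but no completion record are suspect
# (node failure, system upgrade, or storage error).
incomplete = started - finished
print(incomplete)  # {'251900'}
```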

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in a group: this is an update domain (~30 mins, ~6 nodes in one group).

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.
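As a sanity check, the formula transcribes directly into code (a sketch; variable names follow the notation above, and the defaults for γ and λv are the slide's ≈66 Pa/K and a nominal 2450 J/g):

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Evapotranspiration via the Penman-Monteith equation.

    delta    : slope of saturation specific humidity vs. temperature (Pa/K)
    r_n      : net radiation (W/m^2)
    rho_a    : dry air density (kg/m^3)
    c_p      : specific heat capacity of air (J/(kg K))
    dq       : vapor pressure deficit (Pa)
    g_a, g_s : aerodynamic / stomatal conductivity (m/s)
    gamma    : psychrometric constant (~66 Pa/K)
    lambda_v : latent heat of vaporization (J/g)
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The tricky part, as the slide notes, is not the arithmetic but estimating ga and gs across a whole catchment.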

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; requests flow through the request, download, reprojection, and two reduction queues; data moves from the source imagery download sites through the data collection, reprojection, derivation reduction, and analysis reduction stages (with source metadata alongside) to science results, which scientists download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks: recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances pull tasks from the queue and read from <Input>Data Storage.]
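The dispatch-and-retry behavior can be sketched as a plain polling loop (hypothetical helper names; the real service would read an Azure queue and persist status to an Azure Table rather than a dict):

```python
MAX_RETRIES = 3

def worker_loop(task_queue, task_status, run_task):
    """Generic worker: dequeue tasks, run them, and track retries.

    task_queue  : list used as a FIFO of (task_id, payload) entries
    task_status : dict mapping task_id -> {'state': ..., 'retries': n}
    run_task    : callable executing one task; raises on failure
    """
    while task_queue:
        task_id, payload = task_queue.pop(0)
        status = task_status.setdefault(task_id, {"state": "queued", "retries": 0})
        try:
            run_task(payload)
            status["state"] = "done"
        except Exception:
            status["retries"] += 1
            if status["retries"] < MAX_RETRIES:
                task_queue.append((task_id, payload))  # requeue for another attempt
            else:
                status["state"] = "failed"             # give up after repeated failures
```

With MAX_RETRIES = 3 a task is attempted at most three times before being marked failed, keeping a transient fault from stalling the pipeline while still surfacing persistent failures.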

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus to the Job Queue (each entity specifies a single reprojection job request) and parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile), dispatching to the Task Queue consumed by GenericWorker (Worker Role) instances, which read Reprojection Data Storage and Swath Source Data Storage.]

• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

[Diagram: the processing pipeline annotated with per-stage volumes and costs]

Data collection stage (400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers): $50 upload, $450 storage
Reprojection stage (400 GB, 45K files, 3,500 hours, 20-100 workers): $420 CPU, $60 download
Derivation reduction stage (5-7 GB, 55K files, 1,800 hours, 20-100 workers): $216 CPU, $1 download, $6 storage
Analysis reduction stage (<10 GB, ~1K files, 1,800 hours, 20-100 workers): $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit, November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research, Roger Barga, Architect
  • The Million Server Datacenter
  • HPC and Clouds: Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds: Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components: Fabric Controller
  • Key Components: Fabric Controller (2)
  • Key Components: Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components: Compute Web Roles
  • Key Components: Compute Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable, Fault Tolerant Applications
  • Key Components: Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce: The Fabric
  • Slide 33
  • Durable Storage, At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection: Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure: Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery, Sensors, Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage: Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources: Cloud Research Community Site
  • Resources: AzureScope
  • Resources: AzureScope (2)
  • Demonstration (2)
  • Slide 104

Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
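One way to encode that trade-off is a hysteresis rule: scale up eagerly on backlog, scale down only after a sustained quiet period (a sketch; all thresholds here are made up for illustration):

```python
def desired_workers(current, queue_len, idle_minutes,
                    tasks_per_worker=10, min_idle_before_scale_down=15,
                    min_workers=2, max_workers=100):
    """Pick a target worker count from queue backlog (illustrative thresholds).

    Scale up quickly when the backlog grows; scale down only after a
    sustained idle period, since VMs take minutes to boot and shedding
    capacity too eagerly hurts users more than a few idle VMs cost.
    """
    needed = -(-queue_len // tasks_per_worker)     # ceiling division
    if needed > current:
        target = needed                            # scale up immediately
    elif idle_minutes >= min_idle_before_scale_down:
        target = needed                            # scale down after sustained idle
    else:
        target = current                           # hold steady (hysteresis)
    return max(min_workers, min(max_workers, target))
```

A floor of a couple of always-on workers keeps response snappy when load returns, at a small, predictable cost.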

Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
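The flat-fee vs. per-transaction difference is easy to see with a toy cost model (illustrative prices only, not actual Azure rates):

```python
def table_storage_cost(transactions, per_txn_price):
    """Per-transaction billing: cost scales with how chatty the app is."""
    return transactions * per_txn_price

def flat_fee_cost(monthly_fee):
    """Flat-fee billing: cost is independent of request volume."""
    return monthly_fee

# Hypothetical prices: $0.01 per 10k transactions vs. a $10/month flat fee.
per_txn = 0.01 / 10_000
chatty = table_storage_cost(50_000_000, per_txn)   # 50M transactions/month
quiet = table_storage_cost(1_000_000, per_txn)     # 1M transactions/month

# A chatty app can exceed a flat fee; a quiet one undercuts it.
print(chatty, quiet, flat_fee_cost(10.0))
```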

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.

Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]
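The effect of point 1 is easy to demonstrate with the standard library (a sketch; a real service would negotiate Content-Encoding with the client rather than compress by hand):

```python
import gzip

# Repetitive markup, typical of generated HTML.
html = (b"<html><body>"
        + b"<p>The quick brown fox jumps over the lazy dog.</p>" * 200
        + b"</body></html>")

compressed = gzip.compress(html)

# Repetitive content compresses dramatically; the browser inflates it
# transparently when the response carries 'Content-Encoding: gzip'.
ratio = len(compressed) / len(html)
print(f"{len(html)} -> {len(compressed)} bytes ({ratio:.1%})")
```

Fewer bytes on the wire means lower bandwidth charges and faster page loads, at the cost of a little CPU on each request.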

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large-volume data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
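Query segmentation is straightforward because FASTA records are independent; a minimal splitter might look like this (a sketch; the default of 100 sequences per partition echoes the value the micro-benchmark slide identifies as best):

```python
def split_fasta(text, seqs_per_partition=100):
    """Split FASTA-formatted input into fixed-size partitions.

    Query segmentation: each partition becomes an independent BLAST task
    whose results can simply be concatenated at the merge step, since
    every query record is self-contained.
    """
    # A FASTA record starts with '>'; keep the header and its sequence together.
    records, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))

    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]
```

Database segmentation (the mpiBLAST approach) is harder precisely because partial hit lists must be re-ranked and merged, not just concatenated.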

AzureBLAST

• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences, query partitions in parallel, and merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow

A simple Split/Join pattern.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

[Diagram: a splitting task fans out to parallel BLAST tasks, followed by a merging task.]
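The "too small" failure mode is easy to see with a toy queue that mimics visibility-timeout semantics (a sketch, not the real Azure storage API; time is a simple integer tick):

```python
import heapq

class VisibilityQueue:
    """Toy queue with Azure-style visibility timeouts.

    A dequeued message is hidden for `timeout` ticks; if the consumer
    does not delete it in time, it reappears and is handed out again.
    """
    def __init__(self):
        self.visible = []      # messages ready to dequeue
        self.invisible = []    # heap of (reappear_time, message)
        self.clock = 0

    def put(self, msg):
        self.visible.append(msg)

    def get(self, timeout):
        self._reappear()
        if not self.visible:
            return None
        msg = self.visible.pop(0)
        heapq.heappush(self.invisible, (self.clock + timeout, msg))
        return msg

    def delete(self, msg):
        # Consumer finished in time: remove the hidden copy for good.
        self.invisible = [(t, m) for t, m in self.invisible if m != msg]
        heapq.heapify(self.invisible)

    def tick(self, n=1):
        self.clock += n
        self._reappear()

    def _reappear(self):
        while self.invisible and self.invisible[0][0] <= self.clock:
            _, msg = heapq.heappop(self.invisible)
            self.visible.append(msg)
```

If the timeout (5 ticks below) is shorter than the task's actual run time (8 ticks), the message reappears and a second worker repeats the computation; a very large timeout instead delays recovery when the instance holding the message dies.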

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource

AzureBLAST (2)

[Architecture diagram: a Web Role hosts the web portal, web service, job registration, and job scheduler; jobs flow through a global dispatch queue to worker instances; a Job Management Role with a scaling engine tracks state in the Azure Table job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data, refreshed by a database-updating role.]

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 71: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Storage Costs

bullUnderstand an applicationrsquos storage profile and how storage billing works

bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per

transaction

bull Service choice can make a big cost difference based on your app profile

bull Caching and compressing They help a lot with storage costs

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large data volumes
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010
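
The query-segmentation pattern can be sketched as split, parallel query, merge. Here `fake_blast` is a hypothetical stand-in for invoking NCBI-BLAST on one partition; the real AzureBLAST dispatches partitions through queues to worker roles rather than a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of query segmentation: split the input sequences into
# partitions, "query" each partition in parallel, then merge the results.

def split(sequences, partition_size):
    # Cut the input into fixed-size partitions (last one may be short).
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def fake_blast(partition):
    # Placeholder for the real per-partition BLAST invocation.
    return [f"hit:{seq}" for seq in partition]

def run(sequences, partition_size=100):
    partitions = split(sequences, partition_size)
    with ThreadPoolExecutor() as pool:
        results = pool.map(fake_blast, partitions)  # preserves input order
    # Merge step: concatenate per-partition results in order.
    return [hit for part in results for hit in part]
```

Because each partition is independent, this is pleasingly parallel; the only coordination point is the final merge.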

AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST startup overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

[Task-flow diagram: a Splitting task fans out to many parallel BLAST tasks, whose outputs feed a Merging Task]
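
The visibilityTimeout tradeoff can be seen in a toy model of queue semantics (simplified from how Windows Azure Queues behave): a dequeued message becomes invisible for the timeout period, and if the worker does not delete it in time it reappears for another worker, repeating the computation:

```python
# Toy in-memory queue modeling visibility-timeout semantics. Time is passed
# in explicitly (`now`) so the behavior is easy to follow; this is a sketch,
# not the Azure Queue API.

class ToyQueue:
    def __init__(self):
        self._messages = {}    # message id -> payload
        self._visible_at = {}  # message id -> earliest time it is visible
        self._next_id = 0

    def put(self, payload):
        self._messages[self._next_id] = payload
        self._visible_at[self._next_id] = 0.0
        self._next_id += 1

    def get(self, visibility_timeout, now):
        # Return the first visible message and hide it until the timeout expires.
        for mid in sorted(self._messages):
            if self._visible_at[mid] <= now:
                self._visible_at[mid] = now + visibility_timeout
                return mid, self._messages[mid]
        return None  # nothing visible right now

    def delete(self, mid):
        # A worker deletes the message only after finishing the task.
        self._messages.pop(mid, None)
        self._visible_at.pop(mid, None)
```

If the timeout underestimates the task run time, the message reappears while the first worker is still computing (wasted duplicate work); if it overestimates, a crashed worker's task stays hidden far longer than necessary.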

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry, Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role refreshes the NCBI databases. Within a job, a Splitting task fans out to parallel BLAST tasks, whose outputs feed a Merging Task]

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: the Web Portal and Web Service handle job registration; the Job Scheduler and Scaling Engine process jobs recorded in the Job Registry]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)

BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists
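
A quick arithmetic check of the serial-runtime estimate, using the slide's figure of 3,216,731 minutes:

```python
# Convert the estimated serial compute time into years: roughly 6.1 years
# on a single desktop, which is why the experiment needs the cloud.
minutes = 3_216_731
years = minutes / (60 * 24 * 365)
print(f"{years:.1f} years")
```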

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalanced, the load was redistributed manually

[Deployment map: VM counts per deployment — 50, 62, 62, 62, 62, 62, 50, 62]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
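
The anomaly detection the slide describes amounts to pairing each "Executing" line with its "done" line; a task with no completion record on the node indicates a failure. A sketch against the log format shown above:

```python
import re

# Pair "Executing the task N" lines with "Execution of task N is done" lines;
# any task id that started but never completed is suspect.
START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def find_incomplete(lines):
    started, completed = set(), set()
    for line in lines:
        if m := START.search(line):
            started.add(m.group(1))
        elif m := DONE.search(line):
            completed.add(m.group(1))
    return started - completed
```

Running this per compute node over the full log stream surfaces exactly the gaps discussed next: tasks dropped during system upgrades and storage failures.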

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed

35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain at work

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

    ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
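
A direct transcription of the Penman-Monteith formula, with argument names following the variable list above. This sketch does no unit checking; supplying physically consistent inputs is the caller's responsibility:

```python
def penman_monteith(delta, R_n, rho_a, c_p, dq, g_a, g_s, lam_v, gamma=66.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

    gamma defaults to the psychrometric constant of ~66 Pa/K.
    """
    numerator = delta * R_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lam_v
    return numerator / denominator
```

In MODISAzure this computation is trivial per pixel; the hard part, as the slides note, is assembling the conductivity and radiation inputs across a catchment.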

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal into a Request Queue; the Data Collection Stage pulls source imagery from download sites via a Download Queue; the Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage are fed by the Reprojection, Reduction 1, and Reduction 2 Queues, drawing on Source Metadata; science results are made available for download]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role (GenericWorker)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read/write <Input> Data Storage]
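
The GenericWorker's retry behavior can be sketched as a small wrapper. Here `task_fn` is a hypothetical stand-in for the real dequeued unit of work:

```python
# Sketch of the retry policy described above: attempt a task up to 3 times
# before surfacing the failure for the task-status tables to record.
def run_with_retries(task_fn, max_retries=3):
    last_error = None
    for _attempt in range(max_retries):
        try:
            return task_fn()
        except Exception as err:  # real code would catch specific exceptions
            last_error = err
    raise RuntimeError(f"task failed after {max_retries} attempts") from last_error
```

Because tasks are recoverable units of work, a bounded retry absorbs transient faults (e.g., a blob-write hiccup) while still letting persistent failures surface in the status tables.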

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request flows through the Job Queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request), then parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile) and dispatches to the Task Queue for GenericWorker (Worker Role) instances. Workers query the ScanTimeList table to get the list of satellite scan times that cover a target tile, and the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; Reprojection Data Storage and Swath Source Data Storage hold the tiles]

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage                  Data and work                                    Cost
Data Collection        400-500 GB, 60K files, 10 MB/sec, 11 hours,      $50 upload, $450 storage
                       <10 workers
Reprojection           400 GB, 45K files, 3500 hours, 20-100 workers    $420 cpu, $60 download
Derivation Reduction   5-7 GB, 55K files, 1800 hours, 20-100 workers    $216 cpu, $1 download, $6 storage
Analysis Reduction     <10 GB, ~1K files, 1800 hours, 20-100 workers    $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 72: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web apprsquos billing profile

Sending fewer things over the wire often means getting fewer things from storage

Saving bandwidth costs often lead to savings inother places

Sending fewer things means your VM has time to do other tasks

All of these tips have the side benefit of improving your web apprsquos performance and user experience

Compressing Content

1Gzip all output content

bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better

compression and freedom from patented algorithms

2Tradeoff compute costs for storage size

3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs

Uncompressed Content

Compressed Content

GzipMinify JavaScript

Minify CCSMinify Images

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 73: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed vs. compressed content — gzip and minify JavaScript, minify CSS, minify images)
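The gzip guidance above can be sketched in a few lines. This is a minimal illustration (the helper name `compress_for_storage` is ours; a real deployment would also serve the blob with `Content-Encoding: gzip` so browsers decompress on the fly):

```python
import gzip

def compress_for_storage(payload: str) -> bytes:
    """Gzip-compress text output before uploading it to blob storage.

    Trades a little compute time for smaller storage and bandwidth bills.
    """
    return gzip.compress(payload.encode("utf-8"))

# Repetitive generated output (HTML, CSV, logs) compresses very well.
html = "<html>" + "<p>result row</p>" * 1000 + "</html>"
blob = compress_for_storage(html)
assert len(blob) < len(html)                           # far smaller on the wire
assert gzip.decompress(blob).decode("utf-8") == html   # lossless round trip
```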

Best Practices Summary

Doing 'less' is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases growing exponentially: GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large data volumes
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage load could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
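The query-segmentation split/join pattern can be sketched as follows. `blast_partition` is a stand-in for invoking NCBI-BLAST on one partition, not the actual AzureBLAST code; in the real system the partitions travel through a queue to worker roles rather than a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size=100):
    """Splitting task: cut the input sequences into fixed-size partitions.
    (100 sequences per partition was the best choice in the micro-benchmarks.)"""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    # Placeholder for running NCBI-BLAST over one partition; here we just
    # tag each sequence so the merge step can be demonstrated.
    return [f"hit:{seq}" for seq in partition]

def run_query_segmented(sequences):
    """Split/join: query partitions in parallel, merge results when done."""
    partitions = split(sequences)
    with ThreadPoolExecutor() as pool:
        results = pool.map(blast_partition, partitions)  # parallel BLAST tasks
    merged = []
    for r in results:
        merged.extend(r)                                 # merging task
    return merged
```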

AzureBLAST Task-Flow: a simple split/join pattern

Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure

(Diagram: splitting task → BLAST tasks in parallel → merging task)
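To see why the visibilityTimeout estimate matters, here is a toy in-memory model of queue visibility semantics (our own illustration, not the Azure Queue API). A dequeued message stays invisible for the timeout; if the worker does not delete it in time, the message reappears and the task is computed again:

```python
import heapq

class VisibilityQueue:
    """Toy model of a cloud queue's visibilityTimeout semantics."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.visible = []        # messages ready to be dequeued
        self.invisible = []      # heap of (reappear_time, message)
        self.clock = 0           # each get() is one tick (a poll)

    def put(self, msg):
        self.visible.append(msg)

    def get(self):
        self.clock += 1
        # Messages whose timeout expired become visible again.
        while self.invisible and self.invisible[0][0] <= self.clock:
            self.visible.append(heapq.heappop(self.invisible)[1])
        if not self.visible:
            return None
        msg = self.visible.pop(0)
        heapq.heappush(self.invisible, (self.clock + self.visibility_timeout, msg))
        return msg

    def delete(self, msg):
        # A worker deletes the message only after finishing the task.
        self.invisible = [(t, m) for t, m in self.invisible if m != msg]
        heapq.heapify(self.invisible)

q = VisibilityQueue(visibility_timeout=2)
q.put("blast-task-1")
q.get()        # worker dequeues the task...
q.get()        # ...is still running, nothing visible yet
q.get()        # timeout too small: the same task is delivered again
```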

Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource

AzureBLAST

(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, dispatching work through a global dispatch queue to worker instances; an Azure Table holds the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the NCBI databases current. Work flows through a splitting task, parallel BLAST tasks, and a merging task.)

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track each job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

(Diagram: the web portal and web service feed job registration; the job scheduler and scaling engine operate against the job registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Map figure: per-deployment instance counts — 50, 62, 62, 62, 62, 62, 50, 62)

End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
  • Based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record pair marks a task start and its completion; otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 6:14  RD00155D3611B0 Executing the task 251523
3/31/2010 6:25  RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0 Executing the task 251553
3/31/2010 6:44  RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0 Executing the task 251600
3/31/2010 7:02  RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

3/31/2010 8:22  RD00155D3611B0 Executing the task 251774
3/31/2010 9:50  RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Note that task 251774 is started but never reported done.
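Finding the anomalous records is a simple scan for tasks that start but never finish. The log lines and regexes below follow the format shown above; the helper name is ours:

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def incomplete_tasks(log_text):
    """Return task ids that were started but never reported done."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

# Task 251774 started at 8:22 but has no completion record.
assert incomplete_tasks(LOG) == ["251774"]
```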

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each group out for ~30 minutes. This is an update domain at work.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed before the job was killed.
35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." — Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ  = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stomata (inverse of rs) (m s⁻¹)
γ  = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
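A direct transcription of the Penman-Monteith formula above into Python. The argument names mirror the slide's symbols; any sample input values are illustrative assumptions, not numbers from the deck, and the output scale follows whatever units the inputs use:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith (1964) evapotranspiration.

    delta    : d(saturation specific humidity)/d(air temperature) (Pa/K)
    r_n      : net radiation (W/m^2)
    rho_a    : dry air density (kg/m^3)
    c_p      : specific heat capacity of air (J/(kg K))
    dq       : vapor pressure deficit (Pa)
    g_a      : conductivity of air (m/s)
    g_s      : conductivity of plant stomata (m/s)
    gamma    : psychrometric constant (~66 Pa/K)
    lambda_v : latent heat of vaporization (J/g)
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

Note the stomatal-conductivity term: shrinking g_s inflates the denominator, so more closed stomata mean less evapotranspiration, which matches the physical intuition.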

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds the download, reprojection, and reduction 1/2 queues across the four stages; source imagery comes from download sites, source metadata is recorded, and scientific results are downloaded as science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure: Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks — recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)

MODISAzure: Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker)
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker (Worker Role) instances dequeue tasks and read/write <Input>Data Storage.)
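The GenericWorker's "retry failed tasks 3 times" policy can be sketched as a simple loop. This is our own illustration, not the MODISAzure code; the names and the status strings are assumptions:

```python
def process_with_retries(tasks, handler, max_retries=3):
    """Process each task, retrying a failed task up to max_retries times,
    and record final status the way a task-status table would."""
    status = {}
    for task in tasks:
        for attempt in range(1, max_retries + 1):
            try:
                handler(task)
                status[task] = "Succeeded"
                break
            except Exception:
                # Stays "Failed" if every retry is exhausted.
                status[task] = "Failed"
    return status
```

A transient failure (e.g., a blob write hiccup) succeeds on a later attempt; a persistent one is marked Failed after three tries instead of blocking the queue.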

Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus — each entity specifies a single reprojection job request — and ReprojectionTaskStatus — each entity specifies a single reprojection task, i.e., a single tile — then dispatches through the job and task queues to GenericWorker (Worker Role) instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, reading from Swath Source Data Storage and writing Reprojection Data Storage.)

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reduction stages multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage                  Data         Files   Compute               Workers   Cost
Data collection        400-500 GB   60K     10 MB/sec, 11 hours   <10       $50 upload, $450 storage
Reprojection           400 GB       45K     3500 hours            20-100    $420 cpu, $60 download
Derivation reduction   5-7 GB       55K     1800 hours            20-100    $216 cpu, $1 download, $6 storage
Analysis reduction     <10 GB       ~1K     1800 hours            20-100    $216 cpu, $2 download, $9 storage

Total: $1420

(The figures annotate the pipeline diagram from the earlier slide: scientists, the AzureMODIS Service web role portal, the request/download/reprojection/reduction queues, source imagery download sites, and source metadata.)

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 74: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Best Practices Summary

Doing lsquolessrsquo is the key to saving costs

Measure everything

Know your application profile in and out

Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences

Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel

bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing

Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach

to 1TB

bull The output of BLAST is usually 10-100x larger than the input

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research, Roger Barga, Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components: Fabric Controller
  • Key Components: Fabric Controller (2)
  • Key Components: Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable, Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage, At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection: Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure: Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery, Sensors, Models and Field Data
  • MODISAzure: Four Stage Image Processing Pipeline
  • MODISAzure: Architectural Big Picture (1/2)
  • MODISAzure: Architectural Big Picture (2/2)
  • Example Pipeline Stage: Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources: Cloud Research Community Site
  • Resources: AzureScope
  • Resources: AzureScope (2)
  • Demonstration (2)
  • Slide 104
Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A single BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
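As a rough illustration of query segmentation (this is not the AzureBLAST code; the helper name and partition size are hypothetical), splitting a FASTA input into fixed-size partitions might look like:

```python
def split_fasta(text, seqs_per_partition=100):
    """Split FASTA-formatted text into partitions of N sequences each."""
    # Each sequence starts with a '>' header line.
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    # Group records into partitions that become independent BLAST tasks.
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]

fasta = ">seq1\nMKV\n>seq2\nGHT\n>seq3\nALA"
parts = split_fasta(fasta, seqs_per_partition=2)
# 3 sequences with partition size 2 -> 2 partitions
```

Each partition can then be queued as an independent task, which is what makes the querying step pleasingly parallel.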

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
• split the input sequences
• query partitions in parallel
• merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic Cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance:
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
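The visibilityTimeout trade-off can be illustrated with a minimal in-memory queue that mimics Azure Queue semantics (an illustrative simulation, not the Azure storage API; class and method names are invented):

```python
import time

class VisibilityQueue:
    """Toy queue with Azure-style visibility-timeout semantics."""
    def __init__(self):
        self._messages = []  # each entry: [payload, invisible_until]

    def put(self, payload):
        self._messages.append([payload, 0.0])

    def get(self, visibility_timeout):
        # Return the first visible message and hide it for the timeout.
        now = time.monotonic()
        for msg in self._messages:
            if msg[1] <= now:
                msg[1] = now + visibility_timeout
                return msg[0]
        return None

    def delete(self, payload):
        self._messages = [m for m in self._messages if m[0] != payload]

q = VisibilityQueue()
q.put("blast-task-42")
first = q.get(visibility_timeout=0.05)        # worker A dequeues the task
hidden = q.get(visibility_timeout=0.05)       # worker B sees nothing yet
time.sleep(0.06)                              # worker A fails to delete in time
redelivered = q.get(visibility_timeout=0.05)  # task becomes visible again
```

If the timeout is shorter than the task's real run time, the message reappears and the work is repeated; if much longer, a crashed instance leaves the task invisible for the whole period.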

[Task-flow diagram: a Splitting task fans out into many parallel BLAST tasks, whose outputs are combined by a Merging task.]

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources

AzureBLAST

[Architecture diagram: a Web Role exposes the Web Portal and a Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, persisting the Job Registry in an Azure Table; a global dispatch queue feeds the worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database Updating Role keeps the databases current. Tasks follow the Split/Join flow: a Splitting task, many parallel BLAST tasks, then a Merging task.]

AzureBLAST Job Portal

ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory states

[Diagram: the Job Portal's Web Portal and Web Service feed job registration, the Job Scheduler, the Scaling Engine, and the Job Registry.]

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, ~100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.
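The sampling-based estimate reduces to simple arithmetic; in this sketch the per-query time is hypothetical, back-solved from the slide's totals:

```python
def desktop_runtime(total_queries, minutes_per_query):
    """Extrapolate a sampled per-query time to the full workload."""
    total_minutes = total_queries * minutes_per_query
    years = total_minutes / (60 * 24 * 365)  # minutes in a (non-leap) year
    return total_minutes, years

# ~0.326 min/query is implied by 3,216,731 minutes for 9,865,668 queries.
minutes, years = desktop_runtime(9_865_668, 3_216_731 / 9_865_668)
```

The conversion confirms the slide's figure: roughly 6.1 years of single-desktop time.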

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually


End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
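A minimal parser for spotting tasks that started but never finished might look like this (a sketch assuming the simplified log format shown above; the function name is invented):

```python
import re

def find_unfinished_tasks(log_lines):
    """Return task IDs with an 'Executing' record but no matching 'done' record."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
missing = find_unfinished_tasks(log)  # task 251774 never completed
```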

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total.

All 62 compute nodes lost tasks and then came back in groups; this is an update domain:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed and the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
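Under the definitions above, the Penman-Monteith calculation itself is a one-liner. This sketch is not the MODISAzure code, and the input values are made up for illustration only:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith: ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).

    gamma defaults to ~66 Pa/K per the slide; lambda_v (J/g) is an assumed
    typical value for the latent heat of vaporization.
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative (not physically calibrated) inputs:
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2,
                        c_p=1005.0, dq=800.0, g_a=0.02, g_s=0.01)
```

The hard part in practice is not this formula but estimating ga and gs across a catchment, which is what the imagery pipeline below feeds.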

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
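The four stages above can be sketched as queue-chained steps. This toy version uses in-memory queues in place of Azure queues, and the stage functions are placeholders, not the real image processing:

```python
from collections import deque

def download(tile):      return f"raw:{tile}"     # data collection stage
def reproject(raw):      return f"sin:{raw}"      # reprojection stage
def derive_et(sin_tile): return f"et:{sin_tile}"  # derivation reduction stage
def analyze(et_tile):    return f"map:{et_tile}"  # analysis reduction stage

def run_pipeline(requests):
    """Push each request through the four stages via intermediate queues."""
    stages = [download, reproject, derive_et, analyze]
    queue = deque(requests)
    for stage in stages:
        queue = deque(stage(item) for item in queue)
    return list(queue)

results = run_pipeline(["h08v05", "h09v05"])
```

In the real system each intermediate queue is an Azure queue, so every stage can scale its worker pool independently.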

[Pipeline diagram: the AzureMODIS Service Web Role Portal takes requests from scientists onto a Request Queue; a Download Queue feeds the Data Collection Stage, which pulls from the source imagery download sites and records source metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation and Analysis Reduction Stages; scientific results are then available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure: Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks (recoverable units of work)
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure: Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue, from which GenericWorker (Worker Role) instances pull tasks and read/write <Input>Data Storage.]
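The GenericWorker's dequeue-and-retry loop can be sketched as follows (an illustrative skeleton, not the MODISAzure source; only the 3-retry limit comes from the slide):

```python
from collections import deque

MAX_RETRIES = 3  # per the slide: failed tasks are retried 3 times

def run_worker(tasks, execute):
    """Dequeue tasks; retry each failing task up to MAX_RETRIES times."""
    queue = deque((task, 0) for task in tasks)  # (task, retries_so_far)
    status = {}
    while queue:
        task, retries = queue.popleft()
        try:
            execute(task)
            status[task] = "done"
        except Exception:
            if retries < MAX_RETRIES:
                queue.append((task, retries + 1))  # re-enqueue for retry
            else:
                status[task] = "failed"
    return status

fail_counts = {"tile-b": 0}
def execute(task):
    # 'tile-b' fails twice then succeeds; 'tile-c' always fails.
    if task == "tile-b" and fail_counts["tile-b"] < 2:
        fail_counts["tile-b"] += 1
        raise RuntimeError("transient blob failure")
    if task == "tile-c":
        raise RuntimeError("permanent failure")

result = run_worker(["tile-a", "tile-b", "tile-c"], execute)
```

Bounded retries let the worker absorb transient storage or upgrade failures (like those in the log analysis earlier) without looping forever on a poisoned task.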

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters via the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances pull tasks that point into Reprojection Data Storage and Swath Source Data Storage.]

• Each Job Queue entity specifies a single reprojection job request
• Each Task Queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage, data volume, compute, and cost:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input

AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • split the input sequences
  • query partitions in parallel
  • merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
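The query-segmentation pattern above can be sketched in a few lines. This is a minimal local illustration, not the AzureBLAST code: `blast_partition` is a hypothetical stand-in for a real NCBI-BLAST invocation, and partitions run sequentially here where Azure would run them on parallel worker roles.

```python
def split(sequences, partition_size):
    """Splitting task: break the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Placeholder for running BLAST over one partition of queries."""
    return [f"hit:{seq}" for seq in partition]

def merge(per_partition_results):
    """Merging task: concatenate per-partition outputs."""
    return [hit for hits in per_partition_results for hit in hits]

sequences = [f"seq{i}" for i in range(10)]
partitions = split(sequences, 3)
# On Azure, each partition becomes one queued task picked up by a worker.
results = [blast_partition(p) for p in partitions]
print(merge(results))
```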

AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance:
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

(Diagram: a splitting task fans out into BLAST tasks …, which a merging task joins.)
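The visibilityTimeout trade-off above can be made concrete with a toy in-memory queue that mimics Azure-queue visibility semantics: a dequeued message becomes invisible for the timeout, and reappears for other workers if it is not deleted in time (e.g., the instance died). This is a simulation sketch, not the Azure Storage client API.

```python
import time
import heapq

class VisibilityQueue:
    """In-memory queue with Azure-style visibility-timeout semantics."""

    def __init__(self):
        self._heap = []  # entries: (visible_at, message)

    def put(self, message):
        heapq.heappush(self._heap, (0.0, message))

    def get(self, visibility_timeout):
        """Return the next visible message, hiding it for the timeout."""
        now = time.monotonic()
        if self._heap and self._heap[0][0] <= now:
            _, message = heapq.heappop(self._heap)
            # Not deleted yet: reappears after the timeout elapses.
            heapq.heappush(self._heap, (now + visibility_timeout, message))
            return message
        return None

    def delete(self, message):
        """Task completed: remove the message permanently."""
        self._heap = [(t, m) for t, m in self._heap if m != message]
        heapq.heapify(self._heap)

q = VisibilityQueue()
q.put("blast-task-1")
task = q.get(visibility_timeout=30.0)      # a worker takes the task
assert q.get(visibility_timeout=30.0) is None  # invisible to other workers
q.delete(task)                             # done before the timeout expired
```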

Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
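The "test runs to profile" advice amounts to picking the partition size with the best measured throughput. A tiny sketch, with made-up timing numbers chosen only to illustrate the overhead-vs-imbalance trade-off:

```python
# Hypothetical profiling results: partition size -> seconds per partition.
samples = {
    10: 4.0,     # overhead-dominated: BLAST startup cost per tiny task
    100: 25.0,   # warm cache, overhead well amortized
    1000: 400.0, # large tasks risk load imbalance and slow per-sequence time
}

# Throughput in sequences per second for each candidate size.
throughput = {size: size / secs for size, secs in samples.items()}
best = max(throughput, key=throughput.get)
print(best)  # → 100
```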

AzureBLAST (architecture)

(Diagram: a Web Role hosts the Web Portal and Web Service; job registrations flow to a Job Management Role containing the Job Scheduler and Scaling Engine; tasks are dispatched through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases plus BLAST databases, temporary data, etc.; a Database Updating Role refreshes the databases. A splitting task fans out into BLAST tasks …, joined by a merging task.)

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID.

The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory states

(Diagram: the Job Portal's Web Portal and Web Service feed job registrations to the Job Scheduler, Scaling Engine, and Job Registry.)

Demonstration

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.
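The estimate above is a simple unit conversion; a one-line check of the arithmetic:

```python
# Convert the sampled total of 3,216,731 single-desktop minutes into years.
TOTAL_MINUTES = 3_216_731
MINUTES_PER_YEAR = 60 * 24 * 365.25

print(f"{TOTAL_MINUTES / MINUTES_PER_YEAR:.1f} years")  # → 6.1 years
```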

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalance appears, redistribute the load manually

(Chart of instances per deployment: 50, 62, 62, 62, 62, 62, 50, 62.)

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working-instance time should be 6-8 days
• Look into the log data to analyze what took place…

Understanding Azure by analyzing logs

A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
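The log analysis described above can be sketched as a pairing exercise: match each "Executing the task N" record with its "Execution of task N is done" record, and flag tasks that never completed (the sample lines are from the excerpt above):

```python
import re

log = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = set(), set()
for line in log.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done", line)
    if m:
        finished.add(m.group(1))

# Tasks that were dequeued but never reported completion.
print(sorted(started - finished))  # → ['251774']
```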

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins per group; ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and the job was killed.
• 35 nodes experienced blob-writing failure at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.

Penman–Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
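The Penman–Monteith formula transcribes directly into code. A sketch; the sample numbers in the call are illustrative only, not validated meteorological inputs (the defaults assume γ ≈ 66 Pa K⁻¹ from the slide and λv ≈ 2260 J/g for water):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative call: net radiation 400 W/m², modest conductivities.
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2,
                        c_p=1005.0, dq=1000.0, g_a=0.02, g_s=0.01)
assert et > 0.0
```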

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
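The reprojection stage mentions simple nearest-neighbor resampling. A toy pure-Python sketch: for each target cell, pick the nearest source cell. The plain index scaling here is a stand-in for a real sinusoidal reprojection transform:

```python
def nearest_neighbor_resample(src, out_rows, out_cols):
    """Resample a 2D grid by nearest-neighbor index mapping."""
    src_rows, src_cols = len(src), len(src[0])
    out = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Map the target cell back to the closest source cell.
            sr = min(src_rows - 1, round(r * src_rows / out_rows))
            sc = min(src_cols - 1, round(c * src_cols / out_cols))
            row.append(src[sr][sc])
        out.append(row)
    return out

tile = [[1, 2], [3, 4]]
print(nearest_neighbor_resample(tile, 4, 4))
# → [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```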

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal's request queue; the data collection stage pulls source imagery from download sites via the download queue and records source metadata; the reprojection queue, reduction 1 queue, and reduction 2 queue feed the reprojection, derivation reduction, and analysis reduction stages; scientists download the science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read/write <Input>Data Storage.)
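The "retries failed tasks 3 times" behavior can be sketched as a worker loop that counts attempts per task and sets a task aside as a poison message after its third failed attempt. A hedged illustration; the task list and failure pattern are invented:

```python
MAX_ATTEMPTS = 3

def run_tasks(tasks, handler):
    """Process tasks, retrying failures up to MAX_ATTEMPTS total tries."""
    attempts = {}   # task id -> number of times dequeued
    poison = []     # tasks quarantined after repeated failure
    queue = list(tasks)
    while queue:
        task = queue.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            handler(task)
        except Exception:
            if attempts[task] >= MAX_ATTEMPTS:
                poison.append(task)   # stop retrying; set aside for inspection
            else:
                queue.append(task)    # make visible again for another attempt
    return poison

def flaky(task):
    if task == "bad":
        raise RuntimeError("persistent failure")

print(run_tasks(["ok-1", "bad", "ok-2"], flaky))  # → ['bad']
```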

Example Pipeline Stage: Reprojection Service

• Each entity in the Job Queue specifies a single reprojection job request
• Each entity in the Task Queue specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances consume tasks, reading Swath Source Data Storage and writing Reprojection Data Storage; task entities point to the ScanTimeList and SwathGranuleMeta tables.)

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction stages multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (data volume, effort, cost):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 78: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

AzureBLAST

bull Parallel BLAST engine on Azure

bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done

bull Follows the general suggested application model bull Web Role + Queue + Worker

bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific

Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010

AzureBLAST Task-FlowA simple SplitJoin pattern

Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size

Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead

Best Practice test runs to profiling and set size to mitigate the overhead

Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
AzureBLAST Task-Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST sets the thread count
• use 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes

Task granularity:
• large partitions: load imbalance
• small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)

Best practice: use test runs to profile, then set the partition size to mitigate the overhead.

Value of visibilityTimeout for each BLAST task:
• essentially an estimate of the task run time
• too small: repeated computation
• too large: an unnecessarily long wait before retry when an instance fails

[Diagram: a splitting task fans the input out into many parallel BLAST tasks, and a merging task joins their results.]
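The split/join pattern and the visibilityTimeout trade-off above can be sketched with a toy in-memory queue; `TaskQueue`, `split`, and the payloads are illustrative stand-ins, not the AzureBLAST API.

```python
import time

class TaskQueue:
    """Toy queue with Azure-style visibility-timeout semantics."""
    def __init__(self):
        self.messages = []  # each entry: [visible_at, payload]

    def put(self, payload):
        self.messages.append([0.0, payload])

    def get(self, visibility_timeout):
        now = time.monotonic()
        for msg in self.messages:
            if msg[0] <= now:
                # Hide the message instead of removing it; if the worker
                # dies before delete(), it reappears after the timeout.
                msg[0] = now + visibility_timeout
                return msg
        return None

    def delete(self, msg):
        self.messages.remove(msg)

def split(sequences, partition_size):
    """Splitting task: fan a big input out into BLAST-task partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

queue = TaskQueue()
for part in split(list(range(10)), 4):
    queue.put(part)

results = []
# visibility_timeout should roughly match the task run time:
# too small -> another worker repeats the computation,
# too large -> a crashed worker's task waits needlessly long.
while (msg := queue.get(visibility_timeout=60.0)) is not None:
    results.append(len(msg[1]))   # stand-in for running NCBI-BLAST
    queue.delete(msg)             # a merging task would collect outputs here

print(results)  # [4, 4, 2]
```

The key design point mirrored here is that a dequeue hides rather than deletes a message, so a failed worker's task is automatically retried once the visibility timeout expires.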

Micro-Benchmarks Inform Design

Task size vs. performance:
• benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• super-linear speedup with larger worker instances, primarily due to memory capacity

Task size/instance size vs. cost:
• the extra-large instance produced the best and most economical throughput and fully utilized the resource

AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine and dispatches work through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role keeps the NCBI databases current.]

AzureBLAST Job Portal: an ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table (fault tolerance: avoid in-memory state)

Demonstration

R. palustris as a platform for H2 production (Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5000 proteins (700K sequences):
• against all NCBI non-redundant proteins: completed in 30 min
• against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• the database is also the input query
• the protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• theoretically, 100 billion sequence comparisons

Performance estimation:
• based on sample runs on one extra-large Azure instance
• would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When the load was unbalanced, redistributed it manually
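The two-level division above, segments per deployment and then fixed-size partitions per BLAST task, can be sketched as follows; the helper names and the 1,000-ID sample input are illustrative (the real run used ~10 million sequences and 100-sequence partitions).

```python
def chunk(items, n):
    """Split items into n nearly equal contiguous chunks."""
    k, r = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        size = k + (1 if i < r else 0)
        out.append(items[start:start + size])
        start += size
    return out

def plan_jobs(sequence_ids, deployments=8, partition_size=100):
    """One segment per deployment, then fixed-size partitions per BLAST task."""
    plan = {}
    for dep, segment in enumerate(chunk(sequence_ids, deployments)):
        plan[dep] = [segment[i:i + partition_size]
                     for i in range(0, len(segment), partition_size)]
    return plan

# 1,000 stand-in sequence IDs instead of the real ~10 million
plan = plan_jobs(list(range(1000)))
print(len(plan), len(plan[0]), len(plan[0][0]))  # 8 deployments, 2 partitions, 100 seqs
```

Keeping partitions at the profiled sweet spot (100 sequences) while segments absorb deployment-level imbalance is what makes the manual redistribution step tractable.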

[Figure: extra-large VM count per deployment: 50 or 62 instances in each of the 8 deployments.]

End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6-8 days
• We looked into the log data to analyze what took place…

Understanding Azure by Analyzing Logs

A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
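A small parser in the spirit of this analysis: pair each "Executing" record with its "done" record and flag tasks that never completed. The log format follows the records shown above; the regex details are assumptions about that format.

```python
import re

def find_incomplete(log_lines):
    """Return task IDs whose 'Executing' record has no matching 'done' record."""
    started, finished = [], set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.append(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return [t for t in started if t not in finished]

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
print(find_incomplete(log))  # ['251774'] -- this task never completed
```

Running this over per-node logs is how events like the update-domain and fault-domain incidents below surface as clusters of incomplete tasks.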

Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups (~6 nodes in one group, ~30 mins): this is an update domain.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed. 35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, i.e., evaporation through plant membranes, by plants.

Penman–Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / (λv·(Δ + γ·(1 + ga/gs)))

where ET = water volume evapotranspired (m3 s-1 m-2); Δ = rate of change of saturation specific humidity with air temperature (Pa K-1); λv = latent heat of vaporization (J/g); Rn = net radiation (W m-2); cp = specific heat capacity of air (J kg-1 K-1); ρa = dry air density (kg m-3); δq = vapor pressure deficit (Pa); ga = conductivity of air, the inverse of ra (m s-1); gs = conductivity of plant stoma air, the inverse of rs (m s-1); γ = psychrometric constant (γ ≈ 66 Pa K-1).

Estimating resistance/conductivity across a catchment can be tricky:
• lots of inputs: a big data reduction
• some of the inputs are not so simple
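The Penman–Monteith formula above transcribes directly into code. Variable names follow the symbol list; the default λv (2260 J/g, the standard latent heat of vaporization of water) and the sample inputs are illustrative values chosen only to exercise the function, not field data.

```python
def penman_monteith(delta, R_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / (λv·(Δ + γ·(1 + ga/gs))).

    gamma: psychrometric constant (Pa/K); lambda_v: latent heat of
    vaporization (J/g); remaining units as in the symbol list above.
    """
    numerator = delta * R_n + rho_a * c_p * dq * g_a
    denominator = lambda_v * (delta + gamma * (1.0 + g_a / g_s))
    return numerator / denominator

# Illustrative (not field-calibrated) inputs:
et = penman_monteith(delta=145.0, R_n=400.0, rho_a=1.2,
                     c_p=1005.0, dq=1000.0, g_a=0.02, g_s=0.01)
print(et)
```

The pipeline's hard part is not this arithmetic but producing ga and gs per pixel, which is exactly the "inputs are not so simple" point above.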

ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• downloads requested input tiles from NASA FTP sites
• includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• converts source tile(s) to intermediate-result sinusoidal tiles
• simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• first stage visible to the scientist
• computes ET in our initial use

Analysis reduction stage:
• optional second stage visible to the scientist
• enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service web role portal; requests flow through the Request, Download, Reprojection, Reduction 1, and Reduction 2 queues, driving the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; the Data Collection stage pulls from source metadata and the source imagery download sites, and scientific results (science results) are available for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the web role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue.
• The Service Monitor is a dedicated worker role: it parses all job requests into tasks (recoverable units of work) and persists the execution status of all jobs and tasks in Tables.

[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses the job, persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> task queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (worker role), which:
• dequeues tasks created by the Service Monitor
• retries failed tasks 3 times
• maintains all task status

[Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker instances dequeue the tasks and read and write <Input> data storage.]
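The GenericWorker behavior described above — dequeue, execute, retry up to 3 times, record status — can be sketched like this; the in-memory list and dict stand in for the Azure task queue and status tables, and `flaky_handler` is a made-up task that fails transiently.

```python
MAX_RETRIES = 3

def run_worker(task_queue, handler, task_status):
    """Drain the task queue; retry each failing task up to MAX_RETRIES times."""
    while task_queue:
        task = task_queue.pop(0)
        record = task_status.setdefault(task["id"], {"attempts": 0})
        record["attempts"] += 1
        try:
            handler(task)
            record["state"] = "Done"            # would persist to a status table
        except Exception:
            if record["attempts"] < MAX_RETRIES:
                task_queue.append(task)          # re-enqueue for another try
            else:
                record["state"] = "Failed"       # give up after 3 attempts

def flaky_handler(task):
    # Simulated transient failure: task 2 fails on its first two attempts.
    if task["id"] == 2 and status[2]["attempts"] < 3:
        raise RuntimeError("transient blob write failure")

status = {}
queue = [{"id": 1}, {"id": 2}]
run_worker(queue, flaky_handler, status)
print(status)
```

Persisting the attempt count alongside the task status is what lets a restarted worker distinguish a recoverable task from a poison one.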

Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request flows into the job queue, where each entity specifies a single reprojection job request; the Service Monitor (worker role) persists ReprojectionJobStatus, parses the job, persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile), and dispatches to the task queue for GenericWorker instances. Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, query the ScanTimeList table for the list of satellite scan times that cover a target tile, and read swath source data storage to produce reprojection data storage.]

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420
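The per-stage CPU figures are consistent with a flat per-instance-hour compute rate; the $0.12/hour value below is an assumption based on Windows Azure's pay-as-you-go pricing at the time, not a number stated on the slide.

```python
RATE_PER_INSTANCE_HOUR = 0.12  # assumed 2010 Windows Azure compute rate, USD

def cpu_cost(instance_hours):
    """Compute charge: instance-hours times the hourly rate."""
    return instance_hours * RATE_PER_INSTANCE_HOUR

print(cpu_cost(3500))  # reprojection stage
print(cpu_cost(1800))  # each reduction stage
```

Under that assumption, 3500 hours yields the $420 reprojection figure and 1800 hours yields the $216 figure for each reduction stage.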

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 80: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best

choice

Instance size vs Performancebull Super-linear speedup with larger size

worker instancesbull Primarily due to the memory capability

Task SizeInstance Size vs Costbull Extra-large instance generated the best

and the most economical throughputbull Fully utilize the resource

AzureBLAST

Web Portal

Web Service

Job registration

Job Scheduler

WorkerWorker

WorkerWorker

WorkerWorker

Global dispatch

queue

Web Role

Azure Table

Job Management Role

Azure Blob

Database updating Role

helliphellip

Scaling Engine

Blast databases temporary data etc)

Job RegistryNCBI databases

BLAST task

Splitting task

BLAST task

BLAST task

BLAST task

hellip

Merging Task

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 81: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

AzureBLAST and AzureBLAST Task-Flow (architecture diagram)

A Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to Worker instances. Azure Tables hold the job registry, Azure Blobs hold the NCBI/BLAST databases and temporary data, and a database-updating role keeps the NCBI databases current. In the task flow, a splitting task fans a job out into many BLAST tasks, and a merging task combines their results.

AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• Each accepted job is stored in the job registry table for fault tolerance (no in-memory state)

(Diagram: the Job Portal's Web Portal and Web Service feed job registration; the Job Scheduler and Scaling Engine consume the Job Registry.)

Demonstration

R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution, and each segment consists of smaller partitions
• When loads became imbalanced, work was redistributed manually
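The segment/partition split described above can be sketched in a few lines of Python. This is a hypothetical illustration (the deck does not show the actual AzureBLAST splitter): one contiguous segment per deployment, each chopped into fixed-size partitions that become individual BLAST tasks.

```python
# Hypothetical sketch of the segment/partition split: one segment per
# deployment, each segment divided into fixed-size partitions (tasks).

def split(seq_ids, n_segments, partition_size):
    """Contiguous split: n_segments segments, each a list of partitions."""
    seg_len = -(-len(seq_ids) // n_segments)  # ceiling division
    segments = [seq_ids[i:i + seg_len]
                for i in range(0, len(seq_ids), seg_len)]
    return [
        [seg[j:j + partition_size] for j in range(0, len(seg), partition_size)]
        for seg in segments
    ]

jobs = split(list(range(100)), n_segments=4, partition_size=10)
# 4 segments of 25 IDs each; every segment holds partitions of at most 10
```

Manual load redistribution then amounts to moving whole partitions between deployments.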

(Instances per deployment: 50, 62, 62, 62, 62, 62, 50, 62.)

End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should be a paired "Executing"/"done" entry:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each outage lasting ~30 mins; this is an update domain at work.

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks completed before the job was killed. 35 nodes experienced blob-writing failures at the same time; a reasonable guess is that a fault domain was involved.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, the inverse of ra (m s⁻¹)
gs = conductivity of plant stomata, the inverse of rs (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere: by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
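The Penman-Monteith formula above evaluates directly once its inputs are in hand. This is a minimal sketch, not the MODISAzure implementation; the sample values are illustrative, not from the deck.

```python
# Minimal evaluation of the Penman-Monteith equation as defined above.
# Argument names mirror the symbols; this is not the MODISAzure code.

def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0,       # psychrometric constant (Pa/K)
                    lambda_v=2450.0): # latent heat of vaporization (J/g)
    """ET = (Delta*Rn + rho_a*cp*dq*ga) / ((Delta + gamma*(1 + ga/gs)) * lambda_v)"""
    return (delta * Rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1.0 + ga / gs)) * lambda_v
    )

# Illustrative inputs: ET grows with net radiation Rn, all else equal.
low = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, cp=1005.0,
                      dq=1000.0, ga=0.02, gs=0.01)
high = penman_monteith(delta=145.0, Rn=500.0, rho_a=1.2, cp=1005.0,
                       dq=1000.0, ga=0.02, gs=0.01)
assert high > low > 0
```

The hard part in practice is not this arithmetic but estimating ga and gs across a catchment, which is exactly the data-reduction burden the pipeline addresses.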

ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal and its Request Queue; the Data Collection Stage pulls imagery from source imagery download sites via the Download Queue; the Reprojection Stage (Reprojection Queue), Derivation Reduction Stage (Reduction 1 Queue), and Analysis Reduction Stage (Reduction 2 Queue) follow, consulting source metadata; science results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door:
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks, the recoverable units of work
• Persists the execution status of all jobs and tasks in Tables

(Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue tasks and read/write <Input>Data Storage.)
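The GenericWorker's dequeue-and-retry policy can be sketched with an in-memory queue. Names here are hypothetical; a real worker would use an Azure queue's visibility timeout and dequeue count rather than an explicit attempts counter.

```python
# In-memory sketch of the retry policy: failed tasks are re-enqueued and
# retried up to 3 times, then marked failed (the poison-message case).
from collections import deque

MAX_RETRIES = 3

def run_worker(tasks, execute):
    queue = deque((task, 0) for task in tasks)
    status = {}
    while queue:
        task, attempts = queue.popleft()
        try:
            execute(task)
            status[task] = "done"
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                queue.append((task, attempts + 1))  # re-enqueue for retry
            else:
                status[task] = "failed"  # give up after 3 attempts
    return status

flaky = {"t2": 2}  # t2 fails twice, then succeeds
def execute(task):
    if flaky.get(task, 0) > 0:
        flaky[task] -= 1
        raise RuntimeError("transient failure")

print(run_worker(["t1", "t2"], execute))  # {'t1': 'done', 't2': 'done'}
```

Persisting task status outside the worker (in Tables, as the slide says) is what makes the retries recoverable across node restarts.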

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request flows into the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker instances pick up tasks that point into Reprojection Data Storage and Swath Source Data Storage.)

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research (Roger Barga, Architect)
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components: Fabric Controller
  • Key Components: Fabric Controller (2)
  • Key Components: Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute: Web Roles
  • Key Components – Compute: Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable, Fault Tolerant Applications
  • Key Components – Compute: VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage, At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection: Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure: Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery, Sensors, Models and Field Data
  • MODISAzure: Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage: Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources: Cloud Research Community Site
  • Resources: AzureScope
  • Resources: AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 82: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs

AuthenticationAuthorization based on Live ID

The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory

states

Web Portal

Web Service

Job registration

Job Scheduler

Job Portal

Scaling Engine

Job Registry

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 83: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Demonstration

R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW

Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less

than 30 sec

AzureBLAST significantly saved computing timehellip

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

[Pipeline diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; the Request Queue feeds a Download Queue (Data Collection Stage, drawing on Source Imagery Download Sites and Source Metadata), a Reprojection Queue (Reprojection Stage), and Reduction 1 and Reduction 2 Queues (Derivation Reduction and Analysis Reduction Stages); science results are returned via Scientific Results Download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) instances ↔ <Input>Data Storage]
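The GenericWorker behavior described here (dequeue, execute, retry up to 3 times, record status) can be sketched as a simple queue consumer. This is an illustrative stand-in, not the actual MODISAzure code: the task dictionary shape, the in-memory status table, and the handler are all assumptions.

```python
import queue

MAX_RETRIES = 3  # failed tasks are retried 3 times, as on the slide


def generic_worker(task_queue, handler, task_status):
    """Drain task_queue, running handler on each task's payload.

    On failure the task is re-enqueued until it has been retried
    MAX_RETRIES times; task_status records the outcome per task id.
    """
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break  # a real worker would keep polling (with back-off)
        try:
            handler(task["payload"])
            task_status[task["id"]] = "Done"
        except Exception:
            task["retries"] = task.get("retries", 0) + 1
            if task["retries"] <= MAX_RETRIES:
                task_status[task["id"]] = "Retrying"
                task_queue.put(task)  # re-enqueue for another attempt
            else:
                task_status[task["id"]] = "Failed"
```

A transiently failing task ends up "Done" after a retry or two; a task that keeps failing is marked "Failed" after exhausting its retries, which mirrors the poison-message handling discussed earlier in the deck.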

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances execute tasks against Reprojection Data Storage and Swath Source Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
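Task generation for this stage amounts to two metadata lookups per target tile. The sketch below uses plain dictionaries as stand-ins for the ScanTimeList and SwathGranuleMeta tables; the tile ids, scan times, and boundary fields are made-up examples, not the real schema.

```python
# Hypothetical stand-ins for the two metadata tables on the slide.
SCAN_TIME_LIST = {  # target tile -> satellite scan times covering it
    "h08v05": ["2009-01-01T10:30", "2009-01-01T12:10"],
}
SWATH_GRANULE_META = {  # scan time -> swath tile geo-metadata (boundaries)
    "2009-01-01T10:30": {"west": -130.0, "east": -110.0},
    "2009-01-01T12:10": {"west": -125.0, "east": -105.0},
}


def reprojection_tasks(target_tile):
    """One reprojection task per scan time covering the target tile,
    carrying the geo-metadata needed to cut the source swath."""
    tasks = []
    for scan_time in SCAN_TIME_LIST.get(target_tile, []):
        meta = SWATH_GRANULE_META[scan_time]
        tasks.append({"tile": target_tile,
                      "scan_time": scan_time,
                      "bounds": meta})
    return tasks
```

Each returned dictionary corresponds to one ReprojectionTaskStatus entity: a single tile, reprojected from the swaths whose scan times cover it.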

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Per-stage figures (from the slide's pipeline diagram):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420

Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 84: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…

All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load is imbalanced, redistribute it manually
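The segment/partition split described above can be sketched as follows. The partition size is an illustrative assumption (the slides do not give one), and sequences are represented by their indices rather than actual FASTA records.

```python
def split_sequences(n_sequences, n_deployments=8, partition_size=2000):
    """Divide sequence indices into one segment per deployment,
    then chop each segment into fixed-size partitions.

    Returns a list of n_deployments segments; each segment is a
    list of index ranges (partitions) no longer than partition_size.
    """
    base, extra = divmod(n_sequences, n_deployments)
    segments, start = [], 0
    for d in range(n_deployments):
        size = base + (1 if d < extra else 0)  # spread the remainder
        seg = range(start, start + size)
        partitions = [seg[i:i + partition_size]
                      for i in range(0, len(seg), partition_size)]
        segments.append(partitions)
        start += size
    return segments
```

Each segment maps to one AzureBLAST deployment as a job, and each partition is a unit small enough to redistribute by hand when a deployment falls behind.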

[Figure: map of VM allocation, 50-62 instances per deployment across the four datacenters]

End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
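The pairing check implied by these records can be sketched as a scan for "Executing"/"is done" pairs, flagging task ids that never complete. The record format is simplified from the slide's examples, and this is an illustrative analysis script rather than the tooling actually used.

```python
import re


def unfinished_tasks(log_lines):
    """Return task ids that were started but never reported done."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run against the abnormal records above, this flags task 251774: it was started at 8:22 but the node moved on to 251895 without ever logging its completion.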

Surviving System Upgrades

North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in a group; this is an update domain
• ~30 mins; ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks completed and the job was killed
• 35 nodes experienced blob writing failure at the same time
• A reasonable guess: the fault domain is working

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 85: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences

ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried

bull Theoretically 100 billion sequence comparisons

Performance estimationbull Based on the sampling-running on one extra-large Azure

instancebull Would require 3216731 minutes (61 years) on one desktop

This scale of experiments usually are infeasible to most scientists

Our Approachbull Allocated a total of ~4000 instances

bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe

bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service

bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions

bull When load imbalances redistribute the load manually

50

6262 62

6262

5062

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 86: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Our Approach

• Allocated a total of ~4,000 instances

• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe

• 8 deployments of AzureBLAST; each deployment has its own co-located storage service

• Divided the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution; each segment consists of smaller partitions

• When the load becomes imbalanced, redistribute it manually
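The divide-and-distribute scheme above can be sketched as follows. This is an illustrative reconstruction, not AzureBLAST's actual code; the deployment count and partition size are parameters taken from the slide's description:

```python
# Sketch of the segmentation scheme described above: split a large sequence
# set into one segment per deployment, then into smaller partitions that
# become the unit of work inside a deployment. Illustrative only.

def segment_sequences(seq_ids, n_deployments, partition_size):
    """Return {deployment_index: [partition, ...]}, where each partition
    is a list of sequence ids."""
    segments = {}
    # Contiguous split: one segment per deployment (ceil division).
    per_segment = (len(seq_ids) + n_deployments - 1) // n_deployments
    for d in range(n_deployments):
        segment = seq_ids[d * per_segment:(d + 1) * per_segment]
        # Each segment is further divided into smaller partitions.
        partitions = [segment[i:i + partition_size]
                      for i in range(0, len(segment), partition_size)]
        segments[d] = partitions
    return segments

ids = list(range(100))  # stand-in for the 10 million sequence ids
plan = segment_sequences(ids, n_deployments=8, partition_size=5)
```

With this shape, a load imbalance can be corrected by moving whole partitions between deployments, which matches the manual redistribution the slide mentions.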


End Result

• Total size of the output result is ~230 GB

• The total number of hits is 1,764,579,487

• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…


Understanding Azure by analyzing logs

A normal log record pair looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
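The failure pattern above, an "Executing" record with no matching "done" record, can be detected mechanically. A minimal sketch; the record format follows the log lines shown above, and the parsing details are assumptions:

```python
import re

# Detect tasks that started ("Executing the task N") but never produced a
# matching completion record ("Execution of task N is done").
EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        m = EXEC_RE.search(line)
        if m:
            started.add(m.group(1))
        m = DONE_RE.search(line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
# Task 251774 has no completion record, so something went wrong with it.
```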

Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in groups: this is an update domain at work (~30 mins per group, ~6 nodes in one group).

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is at work.

MODISAzure Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)    Penman-Monteith (1964)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air, the inverse of ra (m s⁻¹)
• gs = conductivity of plant stoma air, the inverse of rs (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:

• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
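The Penman-Monteith formula above transcribes directly into code. A sketch, with parameter names following the symbol definitions on the slide; the default λv ≈ 2450 J/g (latent heat of vaporization near 20 °C) and the sample inputs in the test are illustrative assumptions:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith (1964):

        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

    delta : Δ, slope of saturation specific humidity vs. temperature (Pa/K)
    r_n   : net radiation (W/m²)
    rho_a : dry air density (kg/m³)
    c_p   : specific heat capacity of air (J/(kg·K))
    dq    : vapor pressure deficit (Pa)
    g_a   : aerodynamic conductivity (m/s)
    g_s   : stomatal conductivity (m/s)
    gamma : psychrometric constant (≈ 66 Pa/K, per the slide)
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

In the MODISAzure pipeline this evaluation is the easy part; as the slide notes, the hard part is reducing the many input datasets down to per-catchment conductivities.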

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors

(Pipeline diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue and Download Queue feed the Data Collection Stage, which pulls from Source Imagery Download Sites and Source Metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages, which produce science results for Scientific Results Download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker roles dequeue tasks and read/write <Input>Data Storage.)
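The dequeue-and-retry behavior described above can be sketched with a plain in-memory queue. The real GenericWorker runs against Azure queues, where a failed message simply reappears after its visibility timeout, so this is a shape-of-the-logic illustration only; all names here are hypothetical:

```python
from collections import deque

MAX_RETRIES = 3  # matches the "retries failed tasks 3 times" rule above

def run_worker(task_queue, execute, status_table):
    """Dequeue tasks, execute them, retry failures up to MAX_RETRIES,
    and record the final status of every task."""
    while task_queue:
        task_id, attempts = task_queue.popleft()
        try:
            execute(task_id)
            status_table[task_id] = "done"
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                # Re-enqueue for another attempt (a real Azure queue would
                # make the message visible again after a timeout instead).
                task_queue.append((task_id, attempts + 1))
                status_table[task_id] = "retrying"
            else:
                status_table[task_id] = "failed"

# Usage: task 2 always fails (e.g., a persistent blob-write error);
# task 1 succeeds on the first attempt.
status = {}
queue = deque([(1, 0), (2, 0)])

def execute(task_id):
    if task_id == 2:
        raise RuntimeError("blob write failure")

run_worker(queue, execute, status)
```

Persisting status on every attempt is what makes the Tables-based job/task bookkeeping on the previous slide possible: a stuck or failed task is visible to the Service Monitor without inspecting any worker.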

Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request arrives and is persisted as ReprojectionJobStatus via the Job Queue; the Service Monitor (Worker Role) parses and persists ReprojectionTaskStatus and dispatches to the Task Queue; GenericWorker roles execute tasks against Swath Source Data Storage and write Reprojection Data Storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
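The two metadata lookups described above amount to simple keyed queries. Sketched here over plain dictionaries, since the actual Azure Table schemas are not shown on the slide; the tile names, property names, and sample values are all invented for illustration:

```python
# Illustrative stand-ins for the SwathGranuleMeta and ScanTimeList tables.
# Real entities live in Azure Table storage; keys and properties are assumed.
swath_granule_meta = {
    ("h08v05", "2010-03-01T10:30"): {"bounds": (-120.0, 30.0, -110.0, 40.0)},
    ("h08v05", "2010-03-01T12:10"): {"bounds": (-121.0, 31.0, -111.0, 41.0)},
}
scan_time_list = {
    "h08v05": ["2010-03-01T10:30", "2010-03-01T12:10"],
}

def scan_times_for_tile(tile):
    """Which satellite scan times cover this target tile?"""
    return scan_time_list.get(tile, [])

def geo_metadata(tile, scan_time):
    """Geo-metadata (e.g., boundaries) for one swath tile at one scan time."""
    return swath_granule_meta.get((tile, scan_time))

times = scan_times_for_tile("h08v05")
```

A reprojection task would iterate over `times`, fetch the boundaries for each contributing swath, and resample them into the target sinusoidal tile.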

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage volumes and costs (pairing follows the pipeline figure):
Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems

• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access

• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today

• Clouds provide valuable fault tolerance and scalability abstractions

• Clouds act as an amplifier for familiar client tools and on-premises compute

• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 87: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

End Resultbull Total size of the output result is ~230GB

bull The number of total hits is 1764579487

bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip

50

6262 62

6262

5062

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 88: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Understanding Azure by analyzing logs

A normal log record should be

Otherwise something is wrong (eg task failed to complete)

3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins

3312010 822 RD00155D3611B0 Executing the task 251774

3312010 950 RD00155D3611B0 Executing the task 251895

3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins

Surviving System Upgrades

North Europe Data Center totally 34256 tasks processed

All 62 compute nodes lost tasks and then came back in a group This is an

Update domain

~30 mins

~ 6 nodes in one group

35 Nodes experience blob writing failure at same time

Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed

A reasonable guess the Fault Domain is working

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds – Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable, Scalable Applications
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Windows Azure for Research – Roger Barga, Architect, Cloud Computing Futures, MSR

Surviving System Upgrades

North Europe data center: 34,256 tasks processed in total. All 62 compute nodes lost tasks and then came back in groups of ~6 nodes, spaced ~30 minutes apart; each such group is an update domain.

Surviving Storage Failures

West Europe data center: 30,976 tasks were completed before the job was killed. 35 nodes experienced blob-write failures at the same time; a reasonable guess is that a fault domain was at work.

MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / [λv·(Δ + γ·(1 + ga/gs))]

where
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stomata (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

• Lots of inputs: big data reduction
• Some of the inputs are not so simple; estimating resistance/conductivity across a catchment can be tricky
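The Penman-Monteith form above can be evaluated directly once the inputs are in hand. The sketch below is a minimal Python implementation; the numeric inputs are plausible placeholder values chosen purely for illustration, not values from the MODISAzure pipeline.

```python
def penman_monteith(delta, R_n, rho_a, c_p, delta_q, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET as written on the slide.

    delta    : rate of change of saturation specific humidity (Pa/K)
    R_n      : net radiation (W/m^2)
    rho_a    : dry air density (kg/m^3)
    c_p      : specific heat capacity of air (J/(kg K))
    delta_q  : vapor pressure deficit (Pa)
    g_a, g_s : conductivities of air and plant stomata (m/s)
    gamma    : psychrometric constant (Pa/K)
    lambda_v : latent heat of vaporization (J/g)
    """
    numerator = delta * R_n + rho_a * c_p * delta_q * g_a
    denominator = lambda_v * (delta + gamma * (1.0 + g_a / g_s))
    return numerator / denominator

# Illustrative mid-latitude daytime values (placeholders only).
et = penman_monteith(delta=145.0, R_n=400.0, rho_a=1.2, c_p=1005.0,
                     delta_q=1000.0, g_a=0.02, g_s=0.01)
print(f"ET ~ {et:.4f}")  # ET ~ 0.0977
```

Note how the stomatal conductivity gs enters only through the ga/gs ratio in the denominator: a poorly estimated catchment conductivity shifts ET directly, which is why the slide flags it as tricky.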

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

• Data collection (map) stage: downloads requested input tiles from NASA FTP sites; includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile.
• Reprojection (map) stage: converts source tile(s) to intermediate-result sinusoidal tiles, using simple nearest-neighbor or spline algorithms.
• Derivation reduction stage: the first stage visible to the scientist; computes ET in our initial use.
• Analysis reduction stage: an optional second stage visible to the scientist; enables production of science analysis artifacts such as maps, tables, and virtual sensors.
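The four stages above are two map steps followed by two reductions. As a toy sketch of that flow, with all function bodies hypothetical stubs standing in for the real stage implementations:

```python
# Toy end-to-end flow of the four pipeline stages; every function is a
# hypothetical stub, not MODISAzure code.
def collect(tile):
    """Data collection (map): fetch one requested source tile."""
    return {"tile": tile, "raw": True}

def reproject(src):
    """Reprojection (map): source tile -> intermediate sinusoidal tile."""
    return {"tile": src["tile"], "sinusoidal": True}

def derive_et(tiles):
    """Derivation reduction: many tiles -> one ET product."""
    return {"product": "ET", "n_tiles": len(tiles)}

def analyze(et_product):
    """Analysis reduction: ET product -> science artifact (map, table, ...)."""
    return {"artifact": "map", "source": et_product["product"]}

requested = ["h08v05", "h09v05"]  # hypothetical tile names
result = analyze(derive_et([reproject(collect(t)) for t in requested]))
print(result)  # {'artifact': 'map', 'source': 'ET'}
```

The map stages are embarrassingly parallel per tile, while each reduction sees all tiles at once; that split is what lets the real pipeline fan the map work out across many workers.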

[Pipeline diagram: the AzureMODIS Service Web Role Portal feeds the Request, Download, Reprojection, Reduction 1, and Reduction 2 Queues; Source Imagery Download Sites supply the Data Collection Stage, followed by the Reprojection, Derivation Reduction, and Analysis Reduction Stages; Source Metadata and Science results stores sit alongside, and scientists retrieve output via Scientific Results Download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction Job Queue.
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks (recoverable units of work), and the execution status of all jobs and tasks is persisted in Tables.

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; a pool of GenericWorker (Worker Role) instances dequeues tasks and reads <Input> Data Storage.]
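The GenericWorker's dequeue-and-retry behavior can be sketched with an in-memory queue. This is an illustrative pattern under assumptions, not the MODISAzure code; the cap of 3 reflects the slide's "retries failed tasks 3 times", read here as 3 attempts total, and the task and handler names are hypothetical.

```python
import queue

MAX_ATTEMPTS = 3  # cap from the slide, read as 3 attempts total

def run_generic_worker(task_queue, handler, status):
    """Drain the task queue, re-enqueueing failed tasks until attempts run out."""
    while True:
        try:
            task_id, attempts = task_queue.get_nowait()
        except queue.Empty:
            break
        try:
            handler(task_id)
            status[task_id] = "Succeeded"
        except Exception:
            if attempts + 1 < MAX_ATTEMPTS:
                task_queue.put((task_id, attempts + 1))  # back on the queue for retry
            else:
                status[task_id] = "Failed"  # poison task: stop retrying

def flaky_handler(task_id):
    """Hypothetical stage handler: task 't2' always fails."""
    if task_id == "t2":
        raise RuntimeError("simulated task failure")

q = queue.Queue()
for t in ("t1", "t2"):
    q.put((t, 0))
status = {}
run_generic_worker(q, flaky_handler, status)
print(status)  # {'t1': 'Succeeded', 't2': 'Failed'}
```

Carrying the attempt count on the message, as Azure Queues do with a dequeue count, is what lets any worker in the pool decide when a task has become poison without shared coordination.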

Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances consume the tasks. Task entities point to the SwathGranuleMeta and ScanTimeList tables and to Reprojection Data Storage.]

• ReprojectionJobStatus: each entity specifies a single reprojection job request.
• ReprojectionTaskStatus: each entity specifies a single reprojection task (i.e., a single tile).
• SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile.
• ScanTimeList: query this table to get the list of satellite scan times that cover a target tile.
• Swath Source Data Storage
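The two table lookups above amount to simple filters over entity collections. The sketch below imitates them with plain Python dicts; the entity shapes and tile names are hypothetical stand-ins for the SwathGranuleMeta and ScanTimeList tables, not their real schemas.

```python
# Hypothetical in-memory stand-ins for the two Azure Tables named above.
swath_granule_meta = [
    {"tile": "h08v05", "bounds": (-120.0, 30.0, -110.0, 40.0)},
    {"tile": "h09v05", "bounds": (-110.0, 30.0, -100.0, 40.0)},
]
scan_time_list = [
    {"tile": "h08v05", "scan_time": "2009-07-01T18:05"},
    {"tile": "h08v05", "scan_time": "2009-07-01T19:45"},
    {"tile": "h09v05", "scan_time": "2009-07-01T18:05"},
]

def geo_metadata(tile):
    """SwathGranuleMeta lookup: boundaries for one swath tile."""
    return next(e["bounds"] for e in swath_granule_meta if e["tile"] == tile)

def scan_times_covering(tile):
    """ScanTimeList lookup: satellite scan times covering a target tile."""
    return [e["scan_time"] for e in scan_time_list if e["tile"] == tile]

print(geo_metadata("h08v05"))         # (-120.0, 30.0, -110.0, 40.0)
print(scan_times_covering("h08v05"))  # ['2009-07-01T18:05', '2009-07-01T19:45']
```

In the real service these filters would be table queries keyed by tile, so choosing the tile as (part of) the partition key keeps each lookup within a single partition.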

Costs for 1 US Year ET Computation

• Computational costs are driven by data scale and the need to run the reductions multiple times.
• Storage costs are driven by data scale and the 6-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

Per-stage data volumes, effort, and cost (reading the pipeline diagram left to right):
• Data collection stage: 400-500 GB, 60K files; 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files; 3,500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files; 1,800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files; 1,800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1,420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems.
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access.
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today.
• Clouds provide valuable fault-tolerance and scalability abstractions.
• Clouds act as an amplifier for familiar client tools and on-premises compute.
• Cloud services to support research provide considerable leverage, for both individual researchers and entire communities of researchers.

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing "Channel 9 Windows Azure"; Bing "Windows Azure Platform Training Kit – November Update"; http://research.microsoft.com/azure; xcgngage@microsoft.com

Page 91: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

MODISAzure Computing Evapotranspiration (ET) in the Cloud

You never miss the water till the well has run dryIrish Proverb

Computing Evapotranspiration (ET)

ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)

Estimating resistanceconductivity across a catchment can be tricky

bull Lots of inputs big data reductionbull Some of the inputs are not so simple

119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds – Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components – Compute Web Roles
  • Key Components – Compute Worker Roles
  • Suggested Application Model: Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components – Compute VM Roles
  • Slide 24
  • 'Grokking' the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce – The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues: Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R. palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (1/2)
  • MODISAzure Architectural Big Picture (2/2)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104

Computing Evapotranspiration (ET)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:

• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.
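Given inputs in the units defined above, the Penman-Monteith relation is a single arithmetic expression; a minimal sketch (illustrative function and parameter names, not from the original deck):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith evapotranspiration.

    delta    : d(saturation specific humidity)/dT  (Pa/K)
    r_n      : net radiation (W/m^2)
    rho_a    : dry air density (kg/m^3)
    c_p      : specific heat capacity of air (J/(kg K))
    dq       : vapor pressure deficit (Pa)
    g_a, g_s : conductivity of air / plant stoma (m/s)
    gamma    : psychrometric constant (Pa/K), ~66
    lambda_v : latent heat of vaporization (J/g)
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The hard part in MODISAzure is not this formula but assembling its inputs from the imagery, sensor, and field datasets listed on the next slide.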

ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year

MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
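The four stages compose as two maps followed by two reductions; a toy sketch of that shape (the stage callables and their signatures are hypothetical, standing in for the real queue-driven services):

```python
def run_pipeline(tile_requests, download, reproject, derive_et, analyze):
    """Four-stage pipeline shape: collect (map) -> reproject (map)
    -> derive (reduce) -> analyze (reduce)."""
    source_tiles = [download(r) for r in tile_requests]   # data collection stage
    sinusoidal = [reproject(t) for t in source_tiles]     # reprojection stage
    et_result = derive_et(sinusoidal)                     # derivation reduction stage
    return analyze(et_result)                             # analysis reduction stage
```

In the real service each arrow is a queue and each stage a pool of workers, so stages scale independently rather than running in one loop.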

[Pipeline diagram: Scientists → AzureMODIS Service Web Role Portal → Request Queue; Download Queue and Source Metadata → Data Collection Stage ← Source Imagery Download Sites; Reprojection Queue → Reprojection Stage; Reduction 1 Queue → Derivation Reduction Stage; Reduction 2 Queue → Analysis Reduction Stage → science results → Scientific Results Download]

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]
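The Service Monitor's parse step fans one job out into recoverable per-tile task units, persisting status for both; a sketch under assumed job/task shapes (dicts and a list stand in for Azure Tables and Queues):

```python
def parse_job_into_tasks(job_id, tiles, job_status, task_status, task_queue):
    """Fan a <PipelineStage> job out into one recoverable task per tile.

    job_status / task_status : dict stand-ins for Azure Table status rows
    task_queue               : list stand-in for the <PipelineStage> Task Queue
    """
    job_status[job_id] = "parsed"
    for i, tile in enumerate(tiles):
        task_id = f"{job_id}-{i}"          # hypothetical task-id scheme
        task_status[task_id] = "queued"    # persist status before dispatch
        task_queue.append((task_id, tile)) # dispatch to the task queue
    return len(tiles)
```

Because each task is an independent unit of work with its own persisted status, a failed tile can be retried without rerunning the whole job.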

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role, which:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input> Data Storage]
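The GenericWorker's dequeue-and-retry loop can be sketched as follows (an in-memory deque stands in for the Azure Queue; the task shape and `MAX_RETRIES` handling are illustrative assumptions, not the deck's actual code):

```python
from collections import deque

MAX_RETRIES = 3  # the pipeline retries failed tasks 3 times

def run_worker(task_queue, execute, task_status):
    """Drain the task queue, retrying each failed task up to MAX_RETRIES.

    task_queue  : deque of (task_id, payload) tuples
    execute     : callable(payload) that raises on failure
    task_status : dict recording final status per task_id
    """
    attempts = {}
    while task_queue:
        task_id, payload = task_queue.popleft()
        try:
            execute(payload)
            task_status[task_id] = "succeeded"
        except Exception:
            attempts[task_id] = attempts.get(task_id, 0) + 1
            if attempts[task_id] <= MAX_RETRIES:
                task_queue.append((task_id, payload))  # requeue for retry
            else:
                task_status[task_id] = "failed"  # poison task: give up
```

The real Azure Queue tracks a per-message dequeue count for the same purpose, so poison messages can be shunted aside instead of looping forever.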

Example Pipeline Stage: Reprojection Service

[Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role), which reads Swath Source Data Storage plus the ScanTimeList and SwathGranuleMeta tables and writes Reprojection Data Storage]

• Each Job Queue entity specifies a single reprojection job request
• Each Task Queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile

Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (from the pipeline diagram: Scientists → AzureMODIS Service Web Role Portal → Request Queue → stages → Scientific Results Download):

• Data Collection stage: 400-500 GB, 60K files, 10 MB/sec; 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files; 3500 hours, 20-100 workers; $420 cpu, $60 download
• Derivation Reduction stage: 5-7 GB, 55K files; 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
• Analysis Reduction stage: <10 GB, ~1K files; 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 93: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

ET Synthesizes Imagery Sensors Models and Field Data

NASA MODIS imagery source

archives5 TB (600K files)

FLUXNET curated sensor dataset

(30GB 960 files)

FLUXNET curated field dataset2 KB (1 file)

NCEPNCAR ~100MB (4K files)

Vegetative clumping~5MB (1file)

Climate classification~1MB (1file)

20 US year = 1 global year

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 94: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input

tiles from NASA ftp sitesbull Includes geospatial lookup for

non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stagebull Converts source tile(s) to

intermediate result sinusoidal tiles

bull Simple nearest neighbor or spline algorithms

Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use

Analysis reduction stagebull Optional second stage visible

to scientistbull Enables production of science

analysis artifacts such as maps tables virtual sensors

Reduction 1 Queue

Source Metadata

AzureMODIS Service Web Role Portal

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Science results

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 95: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

MODISAzure Architectural Big Picture (12)

bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate

Download Reprojection or Reduction Job Queue

bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks

ndash recoverable units of work bull Execution status of all jobs and

tasks persisted in Tables

ltPipelineStagegt Request

hellipltPipelineStagegtJobStatus

PersistltPipelineStagegtJob Queue

MODISAzure Service(Web Role)

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

hellip

DispatchltPipelineStagegtTask Queue

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
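A minimal sketch of the retry policy in the bullets above, assuming a per-task dequeue count like the one Azure Queue messages carry; the function and variable names are invented for illustration:

```python
# Sketch of the GenericWorker retry policy: a task that keeps failing is
# retried 3 times after its first attempt, then marked failed ("poison" task).
MAX_RETRIES = 3

def run_task(task, execute, dequeue_counts, status):
    """Process one dequeued task; caller re-enqueues it when None is returned."""
    tid = task["task_id"]
    dequeue_counts[tid] = dequeue_counts.get(tid, 0) + 1
    try:
        execute(task)
        status[tid] = "Succeeded"
        return True
    except Exception:
        if dequeue_counts[tid] > MAX_RETRIES:
            status[tid] = "Failed"   # give up rather than loop forever
            return False
        status[tid] = "Retrying"
        return None                  # signal: put the message back on the queue
```

Persisting the status dictionary to a Table, as the slide describes, is what lets the pipeline survive worker restarts mid-task.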

Example Pipeline Stage: Reprojection Service

[Pipeline diagram: Reprojection Requests enter the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, then parses & persists ReprojectionTaskStatus and dispatches to the Task Queue; GenericWorker (Worker Role) instances take tasks that point to Reprojection Data Storage, the ScanTimeList and SwathGranuleMeta tables, and Swath Source Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
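The two table queries called out above can be sketched with plain Python lists standing in for the Azure Tables; the entity shapes and all sample values (tile names, bounds, scan times) are invented placeholders:

```python
# Stand-ins for the SwathGranuleMeta and ScanTimeList tables; in Azure Tables
# these lookups would be PartitionKey/RowKey filtered queries.
swath_granule_meta = [
    {"tile": "h08v05", "bounds": (-130.5, 30.0, -117.5, 40.0)},
    {"tile": "h09v05", "bounds": (-117.5, 30.0, -104.4, 40.0)},
]
scan_time_list = [
    {"tile": "h08v05", "scan_time": "2009-06-01T18:05"},
    {"tile": "h08v05", "scan_time": "2009-06-01T19:45"},
    {"tile": "h09v05", "scan_time": "2009-06-01T18:05"},
]

def geo_metadata(tile):
    """Boundaries for each swath tile (SwathGranuleMeta query)."""
    return [e["bounds"] for e in swath_granule_meta if e["tile"] == tile]

def covering_scan_times(tile):
    """Satellite scan times that cover a target tile (ScanTimeList query)."""
    return [e["scan_time"] for e in scan_time_list if e["tile"] == tile]
```

A reprojection task for one output tile would combine both queries: fetch the tile's boundaries, then pull source swaths for every covering scan time.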

Costs for 1 US-Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

[Pipeline diagram: Source Imagery Download Sites feed the Request and Download Queues into the Data Collection Stage; the Reprojection Queue drives the Reprojection Stage, and the Reduction 1 and Reduction 2 Queues drive the Derivation Reduction and Analysis Reduction Stages; Source Metadata and the Scientific Results Download reach Scientists through the AzureMODIS Service Web Role Portal.]

• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage

Total: $1420

Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage, for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com

Demonstration

Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 96: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

MODISAzure Architectural Big Picture (22)

All work actually done by a Worker Role

Service Monitor (Worker Role)

Parse amp PersistltPipelineStagegtTaskStatus

GenericWorker (Worker Role)

hellip

hellip

DispatchltPipelineStagegtTask Queue

hellip

ltInputgtData Storage

bull Dequeues tasks created by the Service Monitor

bull Retries failed tasks 3 timesbull Maintains all task status

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 97: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Example Pipeline Stage Reprojection Service

Reprojection Requesthellip

Service Monitor (Worker Role)

ReprojectionJobStatusPersist

Parse amp PersistReprojectionTaskStatus

GenericWorker (Worker Role)

hellip

Job Queue

hellip

Dispatch

Task Queue

Points to

hellip

ScanTimeList

SwathGranuleMetaReprojection Data

Storage

Each entity specifies a single reprojection job request

Each entity specifies a single reprojection task (ie a single

tile)

Query this table to get geo-metadata (eg boundaries)

for each swath tile

Query this table to get the list of satellite scan times that

cover a target tile

Swath Source Data Storage

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 98: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Costs for 1 US Year ET Computation

bull Computational costs driven by data scale and need to run reduction multiple times

bull Storage costs driven by data scale and 6 month project duration

bull Small with respect to the people costs even at graduate student rates

Reduction 1 Queue

Source Metadata

Request Queue

Scientific Results Download

Data Collection Stage

Source Imagery Download Sites

Reprojection Queue

Reduction 2 Queue

DownloadQueue

Scientists

Analysis Reduction StageDerivation Reduction Stage Reprojection Stage

400-500 GB60K files10 MBsec11 hourslt10 workers

$50 upload$450 storage

400 GB45K files3500 hours20-100 workers

5-7 GB55K files1800 hours20-100 workers

lt10 GB~1K files1800 hours20-100 workers

$420 cpu$60 download

$216 cpu$1 download$6 storage

$216 cpu$2 download$9 storage

AzureMODIS Service Web Role Portal

Total $1420

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 99: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have

the potential to be important to both large and small scale science problems

bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access

bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today

bull Provide valuable fault tolerance and scalability abstractions

bull Clouds as amplifier for familiar client tools and on premise compute

bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating

basic performance for compute and storage services

bull Benchmarks for reference algorithms

bull Best Practice tipsbull Code Samples

Email us with questions at xcgngagemicrosoftcom

Demonstration

Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom

  • Windows Azure for Research Roger Barga Architect
  • The Million Server Datacenter
  • HPC and Clouds ndash Select Comparisons
  • HPC Node Architecture
  • HPC Interconnects
  • Modern Data Center Network
  • HPC Storage Systems
  • HPC and Clouds ndash Select Comparisons (2)
  • Slide 9
  • Slide 10
  • Application Model Comparison
  • Application Model Comparison (2)
  • Key Components
  • Key Components Fabric Controller
  • Key Components Fabric Controller (2)
  • Key Components Fabric Controller (3)
  • Creating a New Project
  • Windows Azure Compute
  • Key Components ndash Compute Web Roles
  • Key Components ndash Compute Worker Roles
  • Suggested Application Model Using queues for reliable messaging
  • Scalable Fault Tolerant Applications
  • Key Components ndash Compute VM Roles
  • Slide 24
  • lsquoGrokkingrsquo the service model
  • Automated Service Management
  • Service Definition
  • Service Configuration
  • GUI
  • Deploying to the cloud
  • Service Management API
  • The Secret Sauce ndash The Fabric
  • Slide 33
  • Durable Storage At Massive Scale
  • Blob Features and Functions
  • Containers
  • Two Types of Blobs Under the Hood
  • Blocks
  • Pages
  • BLOB Leases
  • Windows Azure Drive
  • Windows Azure Drive API
  • BLOB Guidance
  • Table Structure
  • Windows Azure Tables
  • Is not relational
  • Windows Azure Queues
  • Storage Partitioning
  • Partition Keys In Each Abstraction
  • Replication Guarantee
  • Scalability Targets
  • Partitions and Partition Ranges
  • Key Selection Things to Consider
  • Slide 54
  • Tables Recap
  • Queues Their Unique Role in Building Reliable Scalable Applica
  • Queue Terminology
  • Message Lifecycle
  • Truncated Exponential Back Off Polling
  • Removing Poison Messages
  • Removing Poison Messages (2)
  • Removing Poison Messages (3)
  • Queues Recap
  • Windows Azure Storage Takeaways
  • Slide 65
  • Picking the Right VM Size
  • Using Your VM to the Maximum
  • Exploiting Concurrency
  • Finding Good Code Neighbors
  • Scaling Appropriately
  • Storage Costs
  • Saving Bandwidth Costs
  • Compressing Content
  • Best Practices Summary
  • Cloud Computing for eScience Applications
  • NCBI BLAST
  • Opportunities for Cloud Computing
  • AzureBLAST
  • AzureBLAST Task-Flow
  • Micro-Benchmarks Inform Design
  • AzureBLAST (2)
  • AzureBLAST Job Portal
  • Demonstration
  • R palustris as a platform for H2 production
  • All-Against-All Experiment
  • Our Approach
  • End Result
  • Understanding Azure by analyzing logs
  • Surviving System Upgrades
  • Surviving Storage Failures
  • MODISAzure Computing Evapotranspiration (ET) in the Cloud
  • Computing Evapotranspiration (ET)
  • ET Synthesizes Imagery Sensors Models and Field Data
  • MODISAzure Four Stage Image Processing Pipeline
  • MODISAzure Architectural Big Picture (12)
  • MODISAzure Architectural Big Picture (22)
  • Example Pipeline Stage Reprojection Service
  • Costs for 1 US Year ET Computation
  • Observations and Experience
  • Resources Cloud Research Community Site
  • Resources AzureScope
  • Resources AzureScope (2)
  • Demonstration (2)
  • Slide 104
Page 100: Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for

developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers

Email us with questions at xcgngagemicrosoftcom

Resources: AzureScope — http://azurescope.cloudapp.net

  • Simple benchmarks illustrating basic performance for compute and storage services
  • Benchmarks for reference algorithms
  • Best-practice tips
  • Code samples

Email us with questions at xcgngage@microsoft.com


Demonstration

Further resources: Azure in Action (Manning Press); Programming Windows Azure (O'Reilly Press); Bing: "Channel 9 Windows Azure"; Bing: "Windows Azure Platform Training Kit – November 2010 Update"; http://research.microsoft.com/azure; xcgngage@microsoft.com
