
CIS 602-01: Scalable Data Analysis

Cloud Computing Dr. David Koop

D. Koop, CIS 602-01, Fall 2017

Data Science Tasks

TASKS (major involvement only)

• Basic exploratory data analysis: 69%
• Communicating findings to business decision-makers: 58%
• Data cleaning: 53%
• Creating visualizations: 49%
• Identifying business problems to be solved with analytics: 47%
• Feature extraction: 43%
• Collaborating on code projects (reading/editing others' code, using git): 32%
• Teaching/training others: 31%
• Planning large software projects or data systems: 30%
• Developing dashboards: 30%
• ETL: 29%
• Developing products that depend on real-time data analytics: 19%
• Using dashboards and spreadsheets (made by others) to make decisions: 19%
• Developing hardware (or working on software projects that require expert knowledge of hardware): 5%
• Communicating with people outside your company: 28%
• Setting up / maintaining data platforms: 24%
• Developing data analytics software: 20%
• Implementing models/algorithms into production: 36%
• Organizing and guiding team projects: 39%
• Developing prototype models: 43%
• Conducting data analysis to answer research questions: 61%

[O'Reilly]

http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp

Programming Languages

SHARE OF RESPONDENTS

• SQL: 70%
• R: 57%
• Python: 54%
• Bash: 24%
• Java: 18%
• JavaScript: 17%
• Visual Basic / VBA: 13%
• C++: 9%
• Scala: 8%
• C#: 8%
• C: 8%
• SAS: 5%
• Perl: 5%
• Ruby: 3%
• Go: 1%
• Octave: 2%
• Matlab: 9%

[Figure: salary median and IQR (US dollars) by programming language, axis from 0 to 200K]

[O'Reilly]


Spreadsheet & Business Intelligence Tools

SPREADSHEETS, BI, REPORTING: SHARE OF RESPONDENTS

• Excel: 69%
• PowerPivot: 10%
• Power BI: 8%
• QlikView: 7%
• BusinessObjects: 6%
• Cognos: 6%
• Oracle BI: 5%
• Spotfire: 4%
• Adobe Analytics: 3%
• MicroStrategy: 3%
• Alteryx: 2%
• Jaspersoft: 1%
• Datameer: 1%
• Pentaho: 3%

[Figure: salary median and IQR (US dollars) by spreadsheet, BI, and reporting tool, axis from 30K to 150K]

[O'Reilly]


Trends in Data Analysis Tool Use


[KDNuggets]

http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html

Python Tools

• matplotlib:
  - Python's "base" visualization library
  - Originally mimicked the functions in MATLAB
  - Integrated with pandas .plot calls
  - Jupyter Notebook: %matplotlib inline or notebook
  - seaborn adds extras on top of matplotlib
• scikit-learn:
  - Machine learning library
  - Fit and predict using models
  - Classification and regression
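The fit-and-predict pattern listed for scikit-learn can be illustrated with a minimal sketch. The toy data and variable names are invented for illustration; only the estimator classes and their fit/predict methods come from scikit-learn.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature per sample.
X = [[0.0], [1.0], [2.0], [3.0]]

# Regression: the targets follow y = 2x + 1 exactly.
y_reg = [1.0, 3.0, 5.0, 7.0]
regressor = LinearRegression()
regressor.fit(X, y_reg)
predicted_value = regressor.predict([[4.0]])[0]

# Classification: the two smallest samples are class 0, the rest class 1.
y_cls = [0, 0, 1, 1]
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X, y_cls)
predicted_labels = list(classifier.predict([[0.5], [2.5]]))

print(predicted_value, predicted_labels)
```

Both estimators share the same two calls, which is why models can usually be swapped without changing the surrounding code.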


Reading Response

• Due Thursday (10/12) before class
• Read Reiss et al.
• Short Summary + Critique
• Questions:
  - Do you agree with the conclusions?
  - What explanations could there be for the results?
  - Does the paper suggest improvements in cloud computing based on workload analysis?

http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/googletrace-socc2012.pdf

Assignment 2

• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment2.html
• New York City Trees
  - 680,000+ trees
  - Use WebGL for visualization
  - Use a Python bridge (mapboxgl)
  - Use the fork!
  - Smaller data versions available
  - Keep using pandas
  - Label subproblems and answers
  - Due Thursday, October 19

http://github.com/dakoop/mapboxgl-jupyter

Scaling Up

D. Koop, CIS 602-02, Fall 2015

PC → Server → Cluster → Data center → Network of data centers

[Haeberlen and Ives, 2015]


Scale Problems

1. Difficult to dimension
  - Load can vary considerably
  - Waste resources or lose customers
2. Expensive
  - Hardware costs
  - Personnel costs
  - Maintenance costs
3. Difficult to scale
  - Scaling up (new machines, new buildings)
  - Scaling down (energy, fixed costs)
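The dimensioning trade-off can be made concrete with a small sketch. The hourly load figures and the function name are hypothetical; the point is only that a fixed capacity either idles during quiet hours or turns requests away at the peak.

```python
def dimensioning_cost(hourly_load, capacity):
    """Return (wasted, unserved) request totals for a fixed capacity.

    wasted   = capacity that sat idle (resources paid for but unused)
    unserved = demand above capacity (customers turned away)
    """
    wasted = sum(max(capacity - load, 0) for load in hourly_load)
    unserved = sum(max(load - capacity, 0) for load in hourly_load)
    return wasted, unserved


# Hypothetical load (requests per hour) with one sharp peak.
load = [100, 120, 150, 900, 200, 110]

# Provision for the peak: nothing is dropped, but most capacity idles.
peak_result = dimensioning_cost(load, capacity=900)

# Provision near the typical load: little waste, but the peak is lost.
typical_result = dimensioning_cost(load, capacity=200)

print(peak_result, typical_result)
```

With these made-up numbers, sizing for the peak leaves 3820 request-slots idle, while sizing for typical load drops 700 requests during the peak hour.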



Power Plant to Cloud Analogy

[Figure: a power source directly connected to its user, contrasted with a power source that reaches the customer through a network and a meter]

Cloud Computing: State-of-the-art and Research Challenges

Q. Zhang, L. Cheng, and R. Boutaba

http://rboutaba.cs.uwaterloo.ca/Papers/Journals/2010/Qi10.pdf

Cloud Computing Features

• No up-front investment
• Lowering operating cost
• Highly scalable
• Easy access
• Reducing business risks and maintenance expenses

Everything as a Service

• Software as a service (SaaS) [Restaurant]
• Platform as a service (PaaS) [Take-out food]
• Infrastructure as a service (IaaS) [Grocery]

[Figure: hardware, middleware, and application layers, showing which are handled by the cloud provider and which by the user]

Cloud Computing Architecture


J Internet Serv Appl (2010) 1: 7–18 9

virtualized resources for high-level applications. A virtualized server is commonly called a virtual machine (VM). Virtualization forms the foundation of cloud computing, as it provides the capability of pooling computing resources from clusters of servers and dynamically assigning or reassigning virtual resources to applications on-demand.

Autonomic Computing: Originally coined by IBM in 2001, autonomic computing aims at building computing systems capable of self-management, i.e. reacting to internal and external observations without human intervention. The goal of autonomic computing is to overcome the management complexity of today’s computer systems. Although cloud computing exhibits certain autonomic features such as automatic resource provisioning, its objective is to lower the resource cost rather than to reduce system complexity.

In summary, cloud computing leverages virtualization technology to achieve the goal of providing computing resources as a utility. It shares certain aspects with grid computing and autonomic computing but differs from them in other aspects. Therefore, it offers unique benefits and imposes distinctive challenges to meet its requirements.

3 Cloud computing architecture

This section describes the architectural, business and various operation models of cloud computing.

3.1 A layered model of cloud computing

Generally speaking, the architecture of a cloud computing environment can be divided into 4 layers: the hardware/datacenter layer, the infrastructure layer, the platform layer and the application layer, as shown in Fig. 1. We describe each of them in detail:

The hardware layer: This layer is responsible for managing the physical resources of the cloud, including physical servers, routers, switches, power and cooling systems. In practice, the hardware layer is typically implemented in data centers. A data center usually contains thousands of servers that are organized in racks and interconnected through switches, routers or other fabrics. Typical issues at hardware layer include hardware configuration, fault-tolerance, traffic management, power and cooling resource management.

The infrastructure layer: Also known as the virtualization layer, the infrastructure layer creates a pool of storage and computing resources by partitioning the physical resources using virtualization technologies such as Xen [55], KVM [30] and VMware [52]. The infrastructure layer is an essential component of cloud computing, since many key features, such as dynamic resource assignment, are only made available through virtualization technologies.

The platform layer: Built on top of the infrastructure layer, the platform layer consists of operating systems and application frameworks. The purpose of the platform layer is to minimize the burden of deploying applications directly into VM containers. For example, Google App Engine operates at the platform layer to provide API support for implementing storage, database and business logic of typical web applications.

The application layer: At the highest level of the hierarchy, the application layer consists of the actual cloud applications. Different from traditional applications, cloud applications can leverage the automatic-scaling feature to achieve better performance, availability and lower operating cost.

Compared to traditional service hosting environments such as dedicated server farms, the architecture of cloud computing is more modular. Each layer is loosely coupled with the layers above and below, allowing each layer to evolve separately. This is similar to the design of the OSI model for network protocols. The architectural modularity allows cloud computing to support a wide range of application requirements while reducing management and maintenance overhead.

Fig. 1 Cloud computing architecture

[Zhang et al., 2010]

3.2 Business model

Cloud computing employs a service-driven business model. In other words, hardware and platform-level resources are provided as services on an on-demand basis. Conceptually, every layer of the architecture described in the previous section can be implemented as a service to the layer above. Conversely, every layer can be perceived as a customer of the layer below. However, in practice, clouds offer services that can be grouped into three categories: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS).

1. Infrastructure as a Service: IaaS refers to on-demand provisioning of infrastructural resources, usually in terms of VMs. The cloud owner who offers IaaS is called an IaaS provider. Examples of IaaS providers include Amazon EC2 [2], GoGrid [15] and Flexiscale [18].

2. Platform as a Service: PaaS refers to providing platform layer resources, including operating system support and software development frameworks. Examples of PaaS providers include Google App Engine [20], Microsoft Windows Azure [53] and Force.com [41].

3. Software as a Service: SaaS refers to providing on-demand applications over the Internet. Examples of SaaS providers include Salesforce.com [41], Rackspace [17] and SAP Business ByDesign [44].

The business model of cloud computing is depicted by Fig. 2. According to the layered architecture of cloud computing, it is entirely possible that a PaaS provider runs its cloud on top of an IaaS provider’s cloud. However, in the current practice, IaaS and PaaS providers are often parts of the same organization (e.g., Google and Salesforce). This is why PaaS and IaaS providers are often called the infrastructure providers or cloud providers [5].
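The relationship between the four layers and the three service categories can be sketched as a small data structure. The mapping of each category to a layer is an interpretation of the definitions above, and the function and variable names are illustrative rather than taken from the paper.

```python
# The four layers of the layered model, ordered from bottom to top.
LAYERS = ["hardware", "infrastructure", "platform", "application"]

# Layer that each service category exposes to its customers
# (an interpretation of the definitions above).
SERVICE_LAYER = {
    "IaaS": "infrastructure",  # VMs on demand
    "PaaS": "platform",        # OS support and development frameworks
    "SaaS": "application",     # on-demand applications
}


def provider_managed(service):
    """Layers run by the provider: the exposed layer and all below it."""
    top = LAYERS.index(SERVICE_LAYER[service])
    return LAYERS[: top + 1]


def customer_managed(service):
    """Layers left to the customer: everything above the exposed layer."""
    top = LAYERS.index(SERVICE_LAYER[service])
    return LAYERS[top + 1:]


for name in ("IaaS", "PaaS", "SaaS"):
    print(name, provider_managed(name), customer_managed(name))
```

Read this way, an IaaS customer still has to supply the platform and application layers, while a SaaS customer supplies none.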

Fig. 2 Business model of cloud computing

3.3 Types of clouds

There are many issues to consider when moving an enterprise application to the cloud environment. For example, some service providers are mostly interested in lowering operation cost, while others may prefer high reliability and security. Accordingly, there are different types of clouds, each with its own benefits and drawbacks:

Public clouds: A cloud in which service providers offer their resources as services to the general public. Public clouds offer several key benefits to service providers, including no initial capital investment on infrastructure and shifting of risks to infrastructure providers. However, public clouds lack fine-grained control over data, network and security settings, which hampers their effectiveness in many business scenarios.

Private clouds: Also known as internal clouds, private clouds are designed for exclusive use by a single organization. A private cloud may be built and managed by the organization or by external providers. A private cloud offers the highest degree of control over performance, reliability and security. However, they are often criticized for being similar to traditional proprietary server farms and do not provide benefits such as no up-front capital costs.

Hybrid clouds: A hybrid cloud is a combination of public and private cloud models that tries to address the limitations of each approach. In a hybrid cloud, part of the service infrastructure runs in private clouds while the remaining part runs in public clouds. Hybrid clouds offer more flexibility than both public and private clouds. Specifically, they provide tighter control and security over application data compared to public clouds, while still facilitating on-demand service expansion and contraction. On the down side, designing a hybrid cloud requires carefully determining the best split between public and private cloud components.

Virtual Private Cloud: An alternative solution to addressing the limitations of both public and private clouds is called Virtual Private Cloud (VPC). A VPC is essentially a platform running on top of public clouds. The main difference is that a VPC leverages virtual private network (VPN) technology that allows service providers to design their own topology and security settings such as firewall rules. VPC is essentially a more holistic design since it not only virtualizes servers and applications, but also the underlying communication network as well. Additionally, for most companies, VPC provides seamless transition from a proprietary service infrastructure to a cloud-based infrastructure, owing to the virtualized network layer.

For most service providers, selecting the right cloud model is dependent on the business scenario. For example, computation-intensive scientific applications are best deployed on public clouds for cost-effectiveness. Arguably, certain types of clouds will be more popular than others.
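Choosing the split in a hybrid cloud can be sketched as a simple placement rule. The rule below (workloads with sensitive data go to the private cloud, everything else to the public cloud) and the workload names are assumptions made for illustration; a real decision would weigh cost, performance, and compliance in far more detail.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    sensitive: bool  # handles confidential data


def place(workload):
    """Toy placement rule for a hybrid cloud (illustrative only)."""
    if workload.sensitive:
        return "private"  # tighter control over data and security
    return "public"       # on-demand capacity, no up-front capital


# Hypothetical workloads.
workloads = [
    Workload("accounting", sensitive=True),
    Workload("simulation", sensitive=False),
    Workload("web-frontend", sensitive=False),
]
placement = {w.name: place(w) for w in workloads}
print(placement)
```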

Business Model

• IaaS and PaaS often part of the same organization
• Businesses leveraging subscription models to support cloud services


Types of Clouds

• Public:
  - Everyone can use
  - Lack fine-grained control over data, network, security settings
• Private (aka internal):
  - May be homegrown or external
  - Offer more control
  - Require up-front capital
• Hybrid:
  - Keep parts secure (e.g. accounting, finances)
  - Other pieces can leverage public cloud resources
  - Have to make determination of what goes where

Data Center Network Infrastructure



Fig. 3 Basic layered design of data center network infrastructure

10 Gbps links. The aggregation layer usually provides important functions, such as domain service, location service, server load balancing, and more. The core layer provides connectivity to multiple aggregation switches and provides a resilient routed fabric with no single point of failure. The core routers manage traffic into and out of the data center.

A popular practice is to leverage commodity Ethernet switches and routers to build the network infrastructure. In different business solutions, the layered network infrastructure can be elaborated to meet specific business challenges. Basically, the design of a data center network architecture should meet the following objectives [1, 21–23, 35]:

Uniform high capacity: The maximum rate of a server-to-server traffic flow should be limited only by the available capacity on the network-interface cards of the sending and receiving servers, and assigning servers to a service should be independent of the network topology. It should be possible for an arbitrary host in the data center to communicate with any other host in the network at the full bandwidth of its local network interface.

Free VM migration: Virtualization allows the entire VM state to be transmitted across the network to migrate a VM from one physical machine to another. A cloud computing hosting service may migrate VMs for statistical multiplexing or dynamically changing communication patterns to achieve high bandwidth for tightly coupled hosts or to achieve variable heat distribution and power availability in the data center. The communication topology should be designed so as to support rapid virtual machine migration.

Resiliency: Failures will be common at scale. The network infrastructure must be fault-tolerant against various types of server failures, link outages, or server-rack failures. Existing unicast and multicast communications should not be affected to the extent allowed by the underlying physical connectivity.

Scalability: The network infrastructure must be able to scale to a large number of servers and allow for incremental expansion.

Backward compatibility: The network infrastructure should be backward compatible with switches and routers running Ethernet and IP. Because existing data centers have commonly leveraged commodity Ethernet and IP based devices, they should also be used in the new architecture without major modifications.

Another area of rapid innovation in the industry is the design and deployment of shipping-container based, modular data center (MDC). In an MDC, normally up to a few thousands of servers, are interconnected via switches to form the network infrastructure. Highly interactive applications, which are sensitive to response time, are suitable for geo-diverse MDC placed close to major population areas. The MDC also helps with redundancy because not all areas are likely to lose power, experience an earthquake, or suffer riots at the same time. Rather than the three-layered approach discussed above, Guo et al. [22, 23] proposed server-centric, recursively defined network structures of MDC.

5.1.2 Distributed file system over clouds

Google File System (GFS) [19] is a proprietary distributed file system developed by Google and specially designed to provide efficient, reliable access to data using large clusters of commodity servers. Files are divided into chunks of 64 megabytes, and are usually appended to or read and only extremely rarely overwritten or shrunk. Compared with traditional file systems, GFS is designed and optimized to run on data centers to provide extremely high data throughputs, low latency and survive individual server failures.

Inspired by GFS, the open source Hadoop Distributed File System (HDFS) [24] stores large files across multiple machines. It achieves reliability by replicating the data across multiple servers. Similarly to GFS, data is stored on multiple geo-diverse nodes. The file system is built from a cluster of data nodes, each of which serves blocks of data over the network using a block protocol specific to HDFS. Data is also provided over HTTP, allowing access to all content from a web browser or other types of clients. Data nodes can talk to each other to rebalance data distribution, to move copies around, and to keep the replication of data high.
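The chunking and replication ideas can be illustrated with a little arithmetic. The 64 MB chunk size comes from the GFS description above; the replication factor of 3, the node names, and the round-robin placement are assumptions for this sketch and do not reproduce the actual GFS or HDFS placement policies.

```python
import math

CHUNK_SIZE_MB = 64  # chunk size stated for GFS


def chunk_count(file_size_mb):
    """Number of fixed-size chunks needed to hold a file."""
    return math.ceil(file_size_mb / CHUNK_SIZE_MB)


def place_replicas(num_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for chunk in range(num_chunks):
        placement[chunk] = [
            nodes[(chunk + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical 1000 MB file on a five-node cluster.
chunks = chunk_count(1000)
layout = place_replicas(chunks, ["n0", "n1", "n2", "n3", "n4"])
print(chunks, layout[0], layout[4])
```

Because every chunk lives on several distinct nodes, losing any single node in this toy layout leaves at least two copies of each chunk available.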

5.1.3 Distributed application framework over clouds

HTTP-based applications usually conform to some web application framework such as Java EE. In modern data center environments, clusters of servers are also used for computation and data-intensive jobs such as financial trend analysis, or film animation.

MapReduce [16] is a software framework introduced by Google to support distributed computing on large data sets


Virtualization

• Hypervisor controls access to resources
• Flexibility for provider
• Secure and isolated
• Performance may be hard to predict

[Figure: virtual machines for Alice, Bob, Charlie, and Daniel sharing one physical machine through a virtual machine monitor]


Cloud Comparison



Table 1 A comparison of representative commercial products

Cloud Provider | Amazon EC2 | Windows Azure | Google App Engine
Classes of Utility Computing | Infrastructure service | Platform service | Platform service
Target Applications | General-purpose applications | General-purpose Windows applications | Traditional web applications with supported framework
Computation | OS Level on a Xen Virtual Machine | Microsoft Common Language Runtime (CLR) VM; Predefined roles of app. instances | Predefined web application frameworks
Storage | Elastic Block Store; Amazon Simple Storage Service (S3); Amazon SimpleDB | Azure storage service and SQL Data Services | BigTable and MegaStore
Auto Scaling | Automatically changing the number of instances based on parameters that users specify | Automatic scaling based on application roles and a configuration file specified by users | Automatic Scaling which is transparent to users

The .NET Services facilitate the creation of distributed applications. The Access Control component provides a cloud-based implementation of single identity verification across applications and companies. The Service Bus helps an application expose web services endpoints that can be accessed by other applications, whether on-premises or in the cloud. Each exposed endpoint is assigned a URI, which clients can use to locate and access a service.

All of the physical resources, VMs and applications in the data center are monitored by software called the fabric controller. With each application, the users upload a configuration file that provides an XML-based description of what the application needs. Based on this file, the fabric controller decides where new applications should run, choosing physical servers to optimize hardware utilization.

5.2.3 Google App Engine

Google App Engine [20] is a platform for traditional web applications in Google-managed data centers. Currently, the supported programming languages are Python and Java. Web frameworks that run on the Google App Engine include Django, CherryPy, Pylons, and web2py, as well as a custom Google-written web application framework similar to JSP or ASP.NET. Google handles deploying code to a cluster, monitoring, failover, and launching application instances as necessary. Current APIs support features such as storing and retrieving data from a BigTable [10] non-relational database, making HTTP requests and caching. Developers have read-only access to the filesystem on App Engine.

Table 1 summarizes the three examples of popular cloud offerings in terms of the classes of utility computing, target types of application, and more importantly their models of computation, storage and auto-scaling. Apparently, these cloud offerings are based on different levels of abstraction and management of the resources. Users can choose one type or combinations of several types of cloud offerings to satisfy specific business requirements.

6 Research challenges

Although cloud computing has been widely adopted by the industry, the research on cloud computing is still at an early stage. Many existing issues have not been fully addressed, while new challenges keep emerging from industry applications. In this section, we summarize some of the challenging research issues in cloud computing.

6.1 Automated service provisioning

One of the key features of cloud computing is the capability of acquiring and releasing resources on-demand. The objective of a service provider in this case is to allocate and de-allocate resources from the cloud to satisfy its service level objectives (SLOs), while minimizing its operational cost. However, it is not obvious how a service provider can achieve this objective. In particular, it is not easy to determine how to map SLOs such as QoS requirements to low-level resource requirement such as CPU and memory requirements. Furthermore, to achieve high agility and respond to rapid demand fluctuations such as in flash crowd effect, the resource provisioning decisions must be made online.
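One simple way to picture online provisioning is a rule that sizes the instance pool from the current demand. The per-instance capacity, the bounds, and the demand trace are hypothetical; this is a sketch of a capacity-based rule, not the method of any provider or of the approaches surveyed in the paper.

```python
import math


def required_instances(demand_rps, per_instance_rps, min_instances=1,
                       max_instances=20):
    """Instances needed so that demand fits within total capacity."""
    needed = math.ceil(demand_rps / per_instance_rps)
    # Clamp to the configured bounds so the pool never disappears
    # entirely and never grows past the budgeted maximum.
    return max(min_instances, min(needed, max_instances))


# Hypothetical demand trace (requests per second) with a flash crowd.
trace = [40, 90, 1250, 300, 0]
plan = [required_instances(d, per_instance_rps=100) for d in trace]
print(plan)
```

The hard part the text points to is hidden in `per_instance_rps`: mapping a service level objective to a reliable per-instance capacity is exactly what is difficult in practice.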

Automated service provisioning is not a new problem. Dynamic resource provisioning for Internet applications has been studied extensively in the past [47, 57]. These approaches typically involve: (1) Constructing an application performance model that predicts the number of application instances required to handle demand at each particular level,

Cloud Challenges and Opportunities

• Availability: What happens to my business if there is an outage in the cloud?
• Data lock-in: How do I move my data from one cloud to another?
• Data confidentiality and auditability: How do I make sure that the cloud doesn't leak my confidential data?
• Data transfer bottlenecks: How do I copy large amounts of data from/to the cloud?
• Performance unpredictability: VMs sharing the same disk
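The data transfer bottleneck is easy to quantify with back-of-the-envelope arithmetic. The data size and link speed below are made-up inputs, and the calculation ignores protocol overhead and assumes the link runs at its full nominal rate.

```python
def transfer_days(size_terabytes, link_megabits_per_s):
    """Days needed to move data over a link at its full nominal rate."""
    size_bits = size_terabytes * 1e12 * 8  # decimal terabytes to bits
    seconds = size_bits / (link_megabits_per_s * 1e6)
    return seconds / 86_400  # seconds per day


# Hypothetical case: 10 TB over a 100 Mbit/s connection.
days = transfer_days(10, 100)
print(round(days, 1))
```

Under these assumptions the copy takes a little over nine days, which is why bulk data movement is listed as a challenge in its own right.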



Cloud Challenges and Opportunities

• Scalable storage: Cloud model (short-term usage, no up-front cost, infinite capacity on demand) does not fit persistent storage well
• Bugs in large distributed systems: Many errors cannot be reproduced in smaller configs
• Scaling quickly: Dealing with boot times and idle power
• Reputation fate sharing: One customer's bad behavior can affect the reputation of others using the same cloud (e.g. spam, FBI raids)
• Software licensing: Licenses tied to computers, how to scale?

Cloud Comparer


https://ilyas-it83.github.io/CloudComparer/

Cloud Adoption


Cloud Strategy


Cloud Tools


Cloud Challenges


Cloud Initiatives


Public Cloud Adoption


Public Cloud Change


Private Cloud Adoption


Wasted Cloud Spending


Amazon Web Services 101

I. Massingham


https://www.slideshare.net/IanMassingham/aws-101-introduction-to-aws

