
UPTEC F 14032

Examensarbete 30 hp
Juni 2014

High Performance Optimization on Cloud for a Metal Process Model

Adam Saxén

Teknisk-naturvetenskaplig fakultet, UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web page: http://www.teknat.uu.se/student

    Abstract

High Performance Optimization on Cloud for a Metal Process Model

    Adam Saxén

The Amazon Elastic Compute Cloud (EC2) is a service providing on-demand compute capacity to the public. In this thesis a scientific software, performing global optimization on a metal process model, is implemented in parallel using MATLAB and provisioned as a service from Amazon EC2.

The thesis is divided into two parts. The first part concerns improving the serial software, analyzing different optimization methods, and implementing a parallel version; the second part is about evaluating the parallel performance of the software, both on different computer resources in Amazon EC2 and on a local cluster. It is shown that parallel performance of the software in Amazon EC2 is similar to, and for some provisioned resources even surpasses, the local cluster. Factors affecting the performance of the global optimization methods are found and related to network communication and virtualization of hardware, where the method MultiStart has the best parallel performance. Finally, the runtime for a large optimization problem was successfully reduced from 5 hours (serial) to a few minutes (parallel) when run on Amazon EC2, at a total cost of just 25-30 USD.

ISSN: 1401-5757, UPTEC F14 032
Examiner: Tomas Nyberg
Subject reader: Andreas Hellander
Supervisor: Kateryna Mishchenko

Contents

1 Introduction
  1.1 Purpose
  1.2 Scope
  1.3 Background
  1.4 Definitions
2 Problem description
  2.1 Hot Rolling
3 Optimization theory
  3.1 Definition
  3.2 Optimality conditions
  3.3 Global optimization
  3.4 MATLAB Optimization
    3.4.1 Local optimization method fmincon
    3.4.2 Global Optimization method MultiStart
    3.4.3 Global Optimization method GlobalSearch
    3.4.4 Global Optimization method Patternsearch
4 High Performance Computing theory
  4.1 Parallelization
    4.1.1 Parallel performance metrics
    4.1.2 Performance limitations
      4.1.2.1 Parallel delay
    4.1.3 MATLAB Parallelism
      4.1.3.1 fmincon and GlobalSearch parallelism
      4.1.3.2 MultiStart parallelism
      4.1.3.3 Patternsearch Parallelism
  4.2 Cloud Computing
    4.2.1 Infrastructure
5 The Software - Analysis
  5.1 The model and its performance
  5.2 Code improvements
    5.2.1 Memory efficiency
    5.2.2 Redundant computations
    5.2.3 MEX-Files
  5.3 Parameter study
    5.3.1 Discretization of the temperature field
    5.3.2 Convergence study
    5.3.3 Conclusions
  5.4 Software Parallelism
    5.4.1 The workload of the software
    5.4.2 Implementation of parallelism
      5.4.2.1 fmincon and GlobalSearch
      5.4.2.2 MultiStart
      5.4.2.3 Patternsearch
      5.4.2.4 Performance analysis
    5.4.3 Parallel limitations
      5.4.3.1 Parallel Overhead
      5.4.3.2 Granularity
      5.4.3.3 Load balancing
  5.5 Summary of analysis
6 Computer resources
7 Cloud Assessment - Method and Results
  7.1 Cloud Computing through Mathwork's CloudCenter
    7.1.1 Optimization method comparison
    7.1.2 Maximized number of workers
  7.2 Comparison Cloud instances
  7.3 Cloud bursting
8 Cloud Assessment - Discussion
  8.1 Cloud Performance
  8.2 The ADM as a Service
    8.2.1 Cloud considerations
    8.2.2 Hybrid solutions
  8.3 Cost analysis
9 Conclusions
  9.1 Future research
    9.1.1 Heterogeneity within Amazon EC2
    9.1.2 Replacing Mathwork's CloudCenter
A Study of data storage and security on cloud
  A.1 Introduction
    A.1.1 Environment
    A.1.2 Placement group
  A.2 Data Management
    A.2.1 Storage types
      A.2.1.1 Amazon Elastic Block Storage (EBS)
      A.2.1.2 Amazon Instance Store
      A.2.1.3 Amazon Simple Storage Service (S3)
  A.3 Security
    A.3.1 Introduction
    A.3.2 Categories
      A.3.2.1 Network Security
      A.3.2.2 Interfaces
      A.3.2.3 Data Security
      A.3.2.4 Virtualization
      A.3.2.5 Governance
      A.3.2.6 Compliance
      A.3.2.7 Legal issues
    A.3.3 Amazon Virtual Private Cloud (Amazon VPC)
      A.3.3.1 The infrastructure
      A.3.3.2 Amazon Direct Connect
      A.3.3.3 Dedicated Instances
B Parameter study
  B.1 Discretization - method and results
  B.2 Convergence study - method and results
    B.2.1 Parameter study - Additional plots

1 Introduction

1.1 Purpose

The purpose of this report is to investigate how a scientific program can be parallelized and operated through a cloud service, in particular extending and analysing a global optimization software developed by ABB, to perform parallel optimization on the Amazon Elastic Compute Cloud (EC2). The software in question optimizes a metal working process called Hot Rolling.

    The general method will be as follows:

• Analyze the software in preparation for implementing an efficient parallel version of it.

• Implement and test the parallel software on both a traditional cluster at ABB and a public Amazon EC2 virtual cluster.

• Evaluate the parallel performance of the software and assess its use as a cloud service.

    1.2 Scope

The report covers a short introduction to Hot Rolling; some optimization theory; high performance computing theory; the definition and infrastructure of cloud computing; a thorough analysis of the software; benchmarks of parallel implementations of the software; and an assessment of the software as a service provisioned through the Cloud.

The software is implemented in MATLAB utilizing MATLAB's Global Optimization Toolbox (GADS), Parallel Computing Toolbox (PCT) and Distributed Computing Server (MDCS). The optimization problem constitutes a multi-physical model in 33 dimensions with a set of non-linear inequality and equality constraints.

This report is limited to the public cloud Amazon EC2¹, which is one of many commercial clouds today. The choice of programming language is limited to MATLAB, mainly because the initial software was developed in MATLAB, but also because of the ease of adding parallelism through existing toolboxes.

    1.3 Background

Many scientific applications require immense computing resources, well beyond the capability of regular PCs; one example is MAPAS[1], a tool for prediction of membrane-contacting protein surfaces. Users constantly strive for better computing alternatives to further shrink the computational time-frame. Today's solutions include parallel computing, cluster computing, and the more elusive cloud computing. The concept of parallelism is simple: by dividing and distributing computations over many cores and computers, computation times can be reduced from weeks to hours. The implementation of these solutions is, however, not straightforward; knowledge of programming and experience in basic system administration are required to fully utilize the power of parallel computing[5].

¹ https://aws.amazon.com/ec2/

This report concerns cloud computing for three reasons, where the first is partly addressed here. Scientists and companies often lack the knowledge to fully take advantage of the parallel capabilities of a distributed system. Cloud computing offers service models called Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS)[13], that could lower the threshold of utilizing these capabilities. Hence the cloud environment is a promising computing resource for scientists and companies.

The second reason is from a business perspective. A customer investing in e.g. an optimization method for their industrial process would need to acquire computing resources. For a parallel application this could require a distributed system of on-site computers, like a cluster. This is associated with high upfront costs, fluctuating utilization and maintenance costs. The cloud, as an alternative to an on-site cluster, offers on-demand, remote computing resources. The customer only pays for the computing power when used, saving both time and up-front costs[2].

The last reason concerns acceptable HPC performance in the cloud. Previous research[3] assessed High Performance Optimization on the cloud as a feasible and comparable choice to traditional clusters. This report will partly be a continuation of [3], investigating another optimization problem and broadening the perspective of a cloud application.

Cloud Computing infrastructures have been investigated for their appropriateness as scientific computing platforms. A study[4], released 2011, is one such investigation and its key findings are noteworthy. One key finding shows that applications with low communication and I/O are well suited for the cloud. For example, there were severe slowdowns (7x) for a tightly coupled code (PARATEC) on the Amazon EC2 instances, when compared to a Magellan bare-metal (non-virtualized) computing grid. Further findings state that Cloud Computing still requires significant programming and system administration support and that the cloud environment introduces new security issues; all important issues that will be considered in this report.


1.4 Definitions

    HPC High Performance Computing

    AWS Amazon Web Services

    Amazon EC2 Amazon Elastic Compute Cloud

    AMI Amazon Machine Image

    Amazon S3 Amazon Simple Storage Service

    IaaS Infrastructure as a service

    PaaS Platform as a service

    SaaS Software as a service

    GADS Global Optimization Toolbox

    VM Virtual Machine

    OS Operating System

    std standard deviation

Gbps Gigabit per second

KKT Karush-Kuhn-Tucker

    MS MultiStart

    GS GlobalSearch

    PS Patternsearch

    ADM Adaptive Dimension Model

    Weak Scaling Increasing number of cores while keeping the work per core constant.

    Strong Scaling Increasing number of cores while keeping total work constant.


2 Problem description

This report consists of two parts. The first is about parallelizing the software, and the second part concerns the use of cloud services for running the parallel software.

The first part: The ABB Hot Rolling software is by itself a time-consuming process, requiring the computation of costly objective functions and constraint functions. However, depending on the purpose of the optimization, the execution time could radically increase, for example when requiring global optimization or when computing multiple objective functions. Parallelism reduces the execution time to a reasonable time-frame, by dividing the computational work into smaller parts that run simultaneously. Implementing a parallel version of the software requires a performance analysis and a basic understanding of the software. From this information parallelism can effectively be implemented in the software, where effectiveness is measured in performance metrics like execution time, speedup and efficiency.

The second part: Moving the parallelized software to the Amazon Elastic Compute Cloud (EC2) offers desired features like scalability and on-demand resources, but also raises questions regarding computing performance, storage and security. The problem consists of investigating how the parallel software performs on a virtual cluster in the EC2 cloud compared to a physical cluster at ABB. Lastly, cloud services offer a variety of computer resources tailored for specific software and cost requirements. Finding a suitable resource specification for the software is also a part of the problem.

2.1 Hot Rolling

The industrial process called rolling reshapes metal into desired proportions. It is done by feeding metal through a Rolling Mill consisting of stands that press the metal using a set of rolls (Fig. 1).

Figure 1: Illustration of the Hot Rolling process. A Hot Rod and Wire Rolling mill, with vertical and horizontal stands, reshapes the metal.


The Rolling process can be divided into two categories, namely hot rolling and cold rolling. When metal is rolled at a temperature above the recrystallization temperature the process is referred to as hot rolling; when rolled below the recrystallization temperature it is called cold rolling. This is important since metal properties such as strength and ductility are affected by the microstructure of the metal, which in turn is affected by the rolling temperature[6].

There are many types of Rolling, such as flat, ring, wire and rod rolling. The variant determines what kind of Mill is required (roll types, number of stands, etc.). The model considered in this report simulates a 10 stand hot rod & wire rolling mill. The metal, referred to as a billet, passes through three stages:

1. Furnace: increases the billet's temperature, enabling plastic deformation

2. Passes through stands which reshape the billet to requirements, in terms of geometry and metal properties

3. Cooling: decreases temperature to fixate the new metal profile and metal properties

The billet's cross section area is carefully reduced with every pass of a stand containing rolls with a specific groove. Usually a combination of vertical stands and horizontal stands is used to obtain the desired geometry, grooving the billet in two dimensions with both oval and circular grooves (Fig. 1).

The simulation model is described by a non-linear multi-physics problem. The reshaping of the metal depends on many factors such as tension, rolling speed, initial temperature, friction, the type of metal, etc. Naturally, simplifications and reasonable discretizations of the process are used to give a sensible computation time. One simplification is to approximate the oval cross section of the billet with a rectangular cross section of the same area.

An important assumption for simulating these types of Rolling processes is that the mass flow and volume stay constant during deformation, i.e. the mass flow into the stand equals the mass flow out of the stand [7, p.27]. The simplified rectangular cross section is calculated through the spread b_i and thickness h_i of the metal (Fig. 2).

Considering that the density of the metal is constant, the conservation of mass flow can be stated as

ṁ_i = v_i b_i h_i        (1)

i.e. the velocity multiplied by the width and height of a billet section i.

There are various challenges when rolling metal, associated with the properties of the final product or the whole rolling process. Hence, optimization methods for e.g. minimizing the grain size of the finished product, or minimizing the power per production speed of the mill, are of interest.


Figure 2: Simplified rolling, where sub-indices describe properties before (1) and after (2) the stand. Index i indicates the disk.

    3 Optimization theory

    3.1 Definition

In a mathematical context optimization is about finding the maximum or minimum of a function. Given a vector of variables x, objective function f(x) and constraints g_i(x), an optimization problem can be defined as

min_x f(x)
subject to  g_i(x) = 0,  i ∈ ε        (2)
            g_i(x) ≤ 0,  i ∈ I

where I and ε are the index sets for inequality and equality constraints, respectively. Bounds on x are a special type of linear inequality constraints and are included in g_i, i ∈ I [8].

Some further definitions are needed when discussing optimization problems.

Definition 1: A point x is said to be a feasible point if it fulfills all constraints g_i.

Definition 2: The set of all feasible points forms the feasible region, denoted D.

Definition 3: A feasible point, denoted x∗, is called a local optimizer if f(x∗) ≤ f(x) holds for all x in a feasible region confined by |x − x∗| ≤ δ, δ > 0. The pair (x∗, f(x∗)), of the optimizer and the locally optimal objective function value, is referred to as the local optimum.


A simple example of two inequality constraints g_1, g_2, limiting the feasible region is presented in figure 3. The contour illustrates the value of the objective function f(x). In this example the minimum is the point in the feasible region closest to the contour center.

Figure 3: Representation of the feasible region restricted by the constraints.

When solving an optimization problem there exists no general solution strategy. Instead there are methods tailored to specific optimization problems. Common for all methods is that they are iterative, meaning that they begin with an initial guess and create improved estimates (iterates) until certain termination criteria are met.

    3.2 Optimality conditions

An optimal point x is defined according to def. 3. To determine if a point fulfills these criteria, intuitively, all points in a surrounding neighborhood of x should be examined, to see if all of them have a higher function value. In this section optimality conditions are presented for smooth functions which, when fulfilled, state that a point is a local optimizer x∗ of the objective function.

Definition 4: A function f is smooth if it belongs to the differentiability class C², meaning that the function is twice continuously differentiable.

For unconstrained optimization problems the two optimality conditions are:

First Order Necessary Condition: If x∗ is a local minimizer and f is continuously differentiable in an open neighborhood of x∗, then ∇f(x∗) = 0.

Second Order Necessary Condition: If x∗ is a local minimizer, and f and ∇²f are continuous in an open neighborhood of x∗, then ∇f(x∗) = 0 and ∇²f(x∗) is positive semi-definite.


Constrained optimization problems require more sophisticated optimality conditions, defined through Lagrange multipliers. Lagrange multipliers are variables λ_i that combine the function f and the constraints g_i into a Lagrangian function L:

L(x, λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x)        (3)

The concept of Lagrange multipliers is based on the intuition that f(x) cannot be increasing when stepping in the neighborhood of points where g_i = 0; this is true when ∇f and ∇g are parallel.

To define general optimality conditions the Karush-Kuhn-Tucker (KKT) conditions are used [9]:

    First Order Necessary Condition:

∇f(x) + Σ_{i ∈ ε ∪ I} λ_i ∇g_i(x) = 0        (4)

g_i(x) = 0, i ∈ ε;   g_i(x) ≤ 0, i ∈ I        (5)

λ_i ≥ 0, i ∈ I        (6)

λ_i g_i(x) = 0, i ∈ I        (7)

The KKT conditions state that the derivative of the Lagrangian function L (eq. 3), with respect to x, is equal to 0. The Lagrange multipliers for the inequality constraints need to be ≥ 0, since the gradients of f(x∗) and g_i(x∗) have opposite directions at x∗. The complementary slackness condition (eq. 7) is required so that, at a solution, each inequality constraint is either active (g_i(x) = 0) or has a zero multiplier; inactive constraints then do not contribute to eq. 4.

Second Order Necessary Condition: Let Z(x∗) be a basis for the null space of ∇g(x∗)^T. Then the second order necessary condition states that

Z(x∗)^T ∇²_xx L(x∗, λ∗) Z(x∗)

is positive semidefinite.

    3.3 Global optimization

Finding the global minimum is equivalent to finding the smallest minimum among all local minima (Fig. 4).

A global optimum is defined as follows:

Definition 5: A global optimizer is defined analogously to a local optimizer (def. 3), but where the feasible region is equal to the entire feasible set D.

The process of finding the global optimum is simple if the optimization problem is convex: then the local optimum is equal to the global optimum. However, if the problem is complex (non-linear and non-convex), finding the global optimum will be more difficult, and proving that the found optimum is the global optimum is not as straightforward.

Definition (convexity): A set of points S is said to be convex if the line segment between any two points (x, y) lies entirely within S:

∀ x, y ∈ S, ∀ α ∈ [0, 1] : αx + (1 − α)y ∈ S

A function f is convex if its domain is a convex set and the following property holds:

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀ α ∈ [0, 1]

    Figure 4: Representation of local vs global minimum.

    3.4 MATLAB Optimization

In MATLAB there are two toolboxes that handle local and global optimization: the Optimization Toolbox and the Global Optimization Toolbox (GADS), respectively. They contain a variety of methods for solving a range of different optimization problems. For the non-linear, non-convex and constrained hot rolling problem a selection of methods is presented in this section.

3.4.1 Local optimization method fmincon

The fmincon method finds the local minimum of a nonlinear constrained optimization problem by specifying an initial point from which the method will iteratively converge to the closest local optimum. The initial point is the starting guess from where the method will step (iterate) to a better estimate. This is the only method from the Optimization Toolbox suitable for the Hot Rolling problem. For more information regarding the other methods available through the toolbox see [10].

Fmincon contains a collection of algorithms that govern how a local minimum is reached. The algorithms are Active-set (AS), SQP (sequential quadratic programming), Interior-point (IP) and Trust-Region-Reflective. Common for all four is the use of gradients and Hessians as a part of the optimization. Trust-Region-Reflective, as opposed to the other three algorithms, requires an analytic gradient ∇f; this is not available for the Hot Rolling problem and the algorithm is therefore not considered.

Active-set and SQP are similar in the sense that they both solve Quadratic Programming (QP) sub-problems. Quadratic Programming concerns optimizing a quadratic function:

q(d) = (1/2) d^T H d + c^T d

The function q(d) is a quadratic approximation of the change of the Lagrangian function L (eq. 3), where H is the Hessian of L and c the gradient. Minimizing the quadratic function yields the change d of the optimization variable x. Both algorithms iteratively solve a series of sub-problems, where every iteration consists of computing Hessians and gradients, and applying line-search methods in order to find appropriate search directions towards the optimum.

The Interior-point algorithm attempts to solve the constrained minimization problem by adding a logarithmic term to the objective function, called the barrier function. Slack variables are introduced to handle inequality constraints and will together with the barrier function keep the points in every iteration within the feasible region.
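A minimal sketch of how fmincon could be called for this kind of problem is shown below; the objective and constraint function handles, the bound vectors lb and ub, and the initial point x_init are placeholders, and option names may differ slightly between MATLAB releases:

opts = optimoptions('fmincon', 'Algorithm', 'sqp');   % or 'active-set', 'interior-point'
[x, fval] = fmincon(@objectiveFunction, x_init, ...   % no linear constraints in this sketch
    [], [], [], [], lb, ub, @constraintFunction, opts);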

    3.4.2 Global Optimization method MultiStart

MultiStart is a method in the Global Optimization Toolbox. As the name implies, MultiStart generates multiple initial points that then are evaluated by a local optimization method. This is done by generating initial points stochastically, running the local method for each initial point, and selecting the smallest minimum among the found local minima. The choice of local method will be fmincon, but there are several documented alternatives [11].
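A hedged sketch of how MultiStart could be set up with fmincon as the local solver is given below; createOptimProblem and run are documented toolbox functions, while the problem data (bounds, function handles, number of start points) are placeholders:

problem = createOptimProblem('fmincon', ...
    'objective', @objectiveFunction, 'x0', x_init, ...
    'lb', lb, 'ub', ub, 'nonlcon', @constraintFunction);
ms = MultiStart;
[xBest, fvalBest] = run(ms, problem, 50);   % evaluate 50 stochastically generated initial points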

    3.4.3 Global Optimization method GlobalSearch

GlobalSearch, like MultiStart, utilizes a local optimization method in the process of finding the global optimum. However, GlobalSearch does not run the local method for all initial points. Instead it uses a more sophisticated technique where initial points are filtered before running the local method fmincon. This is done by scoring candidate points based on their objective function value and on to what degree they violate constraints. Points with a score lower than a certain threshold, and that are not in the vicinity of points already assessed by fmincon, will be run by fmincon. GlobalSearch will therefore cover a larger search space than MultiStart when run in serial. For a detailed description of how GlobalSearch finds the global minimum see [11].
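Assuming the same problem structure as in the MultiStart sketch above, GlobalSearch could be invoked as follows:

gs = GlobalSearch;
[xBest, fvalBest] = run(gs, problem);   % GlobalSearch generates and filters its own candidate points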

    3.4.4 Global Optimization method Patternsearch

Patternsearch is the name of a direct search method in MATLAB that attempts to find the global minimum without the use of gradients or Hessians. It does this by directly computing objective function and constraint values of points in a pattern.

The process is iterative, where every iteration consists of a pattern surrounding the current smallest point, x∗. There are several points in the pattern, which will be computed and compared to x∗. If a point with a lower objective function value is found, it will become the new x∗ in the next iteration and the pattern will expand. If no point with a smaller value is found, the pattern will contract with the same x∗ passed to the next iteration. The contraction and expansion properties are the reason why a global minimum can be found. Patternsearch supports several different poll-algorithms that determine how the pattern should be created. For example, a pattern can be chosen to evaluate points in every component direction.
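A sketch of a patternsearch call with the poll method used later in this report (Sect. 4.1.3.3) could look as follows; the problem data are placeholders and the option names follow the psoptimset documentation of the 2013/2014 releases:

opts = psoptimset('PollMethod', 'GPSPositiveBasis2N');
[x, fval] = patternsearch(@objectiveFunction, x_init, ...
    [], [], [], [], lb, ub, @constraintFunction, opts);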

    4 High Performance Computing theory

    4.1 Parallelization

Parallelizing a serial software will allow the user to obtain the same results in less time, or run larger problems in the same amount of time. Achieving this requires an analysis of the program to find portions of the work that can be done simultaneously.

To what degree a program can be parallelized sets an upper bound on the obtainable execution time. For example, suppose a program takes 100 s to run in serial and that 90 s of those 100 s can be done in parallel. Then, regardless of how efficient the parallelism is, it will always take 10 s to e.g. initialize the program. To understand the benefits and limits of parallelization a quantified measurement of parallel performance is required.

    4.1.1 Parallel performance metrics

Speedup S_p is a relative metric defined as the ratio between the execution time when the program is run in serial, T_s, and the execution time when run in parallel, T_p, where p is the number of cores.

S_p = T_s / T_p        (8)

For example, a program takes 153 s to run on a single core computer and 14 s on a 12 core system. The speedup is then 153/14 = 10.9, or 10.9 times faster than the serial program.

Efficiency λ is another metric that is closely linked to speedup. Instead of measuring how much faster a parallel version is compared to a serial version, efficiency measures the utilization degree of the computer resources (cores).

λ_p = S_p / p        (9)


The metric is expressed as speedup divided by the number of cores, where 1 means that 100 % of the computer resources are used.
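As a small worked example in MATLAB, using the numbers above, both metrics can be computed directly:

Ts = 153;           % measured serial execution time [s]
Tp = 14;            % measured parallel execution time on p cores [s]
p  = 12;
Sp     = Ts / Tp;   % speedup, eq. (8): roughly 10.9
lambda = Sp / p;    % efficiency, eq. (9): roughly 0.91, i.e. 91 % utilization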

    4.1.2 Performance limitations

There always exists an upper limit on parallel performance. Using the speedup metric this limit is described by Amdahl's law [12].

Theorem 1. Amdahl's law: The maximum speedup S, using p cores, is bounded above through the serial fraction of the program, denoted f:

S_p ≤ T_s / T_p = T_s / (T_s (f + (1 − f)/p)) = 1 / (f + (1 − f)/p)        (10)

The theorem states that the serial program can be divided into a serial fraction f and a parallel fraction (1 − f)/p. This implies that the serial fraction of an application determines the maximum attainable speedup.

For example, imagine a program where the fraction between serial and parallel work is divided roughly 25 % / 75 % (Fig. 5).

Figure 5: The serial fraction of the software imposes an upper limit on the parallel performance.

Then according to Amdahl's law the maximum speedup S_max will be equal to 4 when p grows large, since:

S_max = lim_{p→∞} 1 / (f + (1 − f)/p) = 1/f = 4,   f = 25 % = 1/4

This could be a program that uses 25 % of the execution time to initialize data before running parallel computations.
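The limit can be illustrated numerically with a short MATLAB script (the 25 % serial fraction is taken from the example above):

f  = 0.25;                         % serial fraction of the program
p  = [1 2 4 8 16 64 1024];         % number of cores
Sp = 1 ./ (f + (1 - f) ./ p);      % Amdahl's law, eq. (10)
% Sp approaches 1/f = 4 as p grows large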

    4.1.2.1 Parallel delay

Many parallel applications tend to fall short of the analytical speedup limit. Delays are introduced when dividing and managing the work in parallel. This could be communication delays between cores and job scheduler, or delays due to uneven workload distribution among cores. This makes it hard to predict how well a program will perform when parallelized. The following list presents a grouping of possible reasons for a delay:


• Communication: Parallel cores need to receive work and return results to the client. The communication delay is proportional to the size of the sent data and the network bandwidth. Decreasing communication delays, by removing unnecessary data traffic from the software, will improve parallel performance.

• Load balancing: If a set of cores receive one task each that take different amounts of time to compute, evidently the core with the largest task will limit the total execution time of the software. Improving load balancing by introducing more tasks per worker would even out the workload, keeping a higher number of cores utilized while running the software. Higher utilization of the computing resources equates to better parallel efficiency.

• General overhead: Delays caused by dividing the serial work into parallel work on a single core, and by various other initialization procedures, are not present in a serial program. Also, running an application on a shared cluster could result in a race for computer resources, where delays occur when waiting for other jobs to finish.

    4.1.3 MATLAB Parallelism

The Parallel Computing Toolbox (PCT) contains all necessary functions for implementing parallelism in MATLAB, and the Distributed Computing Server (MDCS) extends the parallelism to several computers in a cluster or on a computing cloud. In MATLAB the concept of a worker is used, which is a separate MATLAB instance running on a single core. The worker has its own workspace and variables. There are many ways to make use of multiple workers; for example by calling parfor, batch or spmd within a matlabpool environment, or launching jobs to a job scheduler through the createJob and createTask commands.

Every parallel section of a script needs to have access to a pool of workers, called a matlabpool. The pool is created by calling matlabpool:

matlabpool('open',4);
% Execution of parallel code
matlabpool('close');

This will open a pool with 4 workers, available for parallel computations. After the computations are done the pool resource is closed. The PCT allows MATLAB developers to run up to 12 workers. If more workers are required MDCS is necessary.

The most straightforward way in MATLAB to parallelize code is to replace for-loops with parfor-loops. parfor is a parallel for-loop that automatically slices the iterations and distributes them dynamically to available workers.


matlabpool('open',8);
parfor i=1:10
    result(i) = fmincon(@objectiveFunction, x_initPt(i), ...);
end
matlabpool('close');

The above code example distributes 10 runs of the local optimization method fmincon to 8 available workers. The workers are assigned iterations dynamically through the built-in job scheduler, improving load balancing between the workers.

SPMD (single program multiple data) is another option for performing parallel computations. Like parfor it requires a matlabpool, but will not divide the work within the SPMD statement dynamically to the workers.

matlabpool('open',2);
spmd
    % labindex is the worker id
    if labindex == 1
        result = fmincon(@objectiveFunction1, x_init, ...);
    elseif labindex == 2
        result = fmincon(@objectiveFunction2, x_init, ...);
    end
end
matlabpool('close');
% Composite variable that contains result(1) and result(2)
result(:);

Instead, all code within a SPMD statement will be executed on every worker. The labindex variable can be used to manually divide data sets or tasks between workers using their indices (labindex). The above script shows how workers can evaluate one objective function each and store the results from the computation in the variable result.

Batch is a command for offloading scripts with parallel computations from the client MATLAB session. The command does not require a matlabpool to be called in advance; however, batch can start up a pool on the worker receiving the script.

% clientScript.m
job = batch('parallelScript','Pool',8);   % non-blocking
% Do other work...
% Require the parallel results before continuing.
wait(job);                                % blocking command
load(job,'res');                          % fetch variable computed in the script
delete(job);

% parallelScript.m
parfor i=1:12
    res(i,:) = fmincon(@objectiveFunction, x_init(i,:), ...);
end


This approach will reduce the number of available workers by 1, since one worker will execute the script sent by the client session.

Noteworthy is that MATLAB does not support nested parallelism, which means that fmincon, run within e.g. a parfor-loop, cannot parallelize its internal gradient computations. For information on how to implement parallelization with fmincon see [10].

    4.1.3.1 fmincon and GlobalSearch parallelism

The local optimization method fmincon supports parallelization of the gradient computations:

∇f(x) ≈ [ (f(x + Δ_1 e_1) − f(x)) / Δ_1 ,  (f(x + Δ_2 e_2) − f(x)) / Δ_2 ,  · · · ,  (f(x + Δ_n e_n) − f(x)) / Δ_n ]

where f is the objective function, e_i is the unit vector for component i, and Δ_i is the step size in e_i. Each difference quotient is assigned to a worker.

The components of the gradient are distributed to a set of workers that perform the computations simultaneously. This type of parallelization is effective when the time for evaluating the objective and constraint functions is considerably larger than the inherent distribution time, where distribution refers to communicating input data and returning results from workers in every iteration of the main loop of fmincon.

GlobalSearch has no native parallelism due to its iteration dependency; the solver evaluates new candidate points based on the previous iteration's fmincon results (Sect. 3.4.3). However, there is still the possibility of using fmincon's parallelism to improve the execution time of GlobalSearch. This will parallelize the parts of GlobalSearch containing fmincon, but will keep the serial evaluation of candidate points.
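A hedged sketch of enabling this, which GlobalSearch then inherits through its fmincon sub-runs, is given below; the 'UseParallel' option is documented for fmincon, but its exact value (logical true or the string 'always') depends on the MATLAB release, and the problem data are placeholders:

matlabpool('open', 8);                      % workers used for the gradient components
opts = optimoptions('fmincon', 'Algorithm', 'sqp', 'UseParallel', true);
problem = createOptimProblem('fmincon', ...
    'objective', @objectiveFunction, 'x0', x_init, ...
    'lb', lb, 'ub', ub, 'nonlcon', @constraintFunction, 'options', opts);
[xBest, fvalBest] = run(GlobalSearch, problem);
matlabpool('close');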


4.1.3.2 MultiStart parallelism

For MultiStart the number of initial points constitutes the majority of the application's workload. Several fmincon runs, with different initial points, can be computed independently. This makes MultiStart well suited for parallelization. MultiStart uses a parfor-loop that contains a call to fmincon. The iteration count is equal to the number of initial points to be evaluated. When workers are available the parfor will dynamically distribute new initial points to free workers.

Distributing points dynamically helps to load balance the work between workers. Computing local optima from initial points will result in varying execution times. Without dynamic work balancing, the workers that finish early would simply wait until the rest are done. How many points and which points a worker will receive will vary between different runs, and is not known in advance.

4.1.3.3 Patternsearch Parallelism

Direct Search methods, like Patternsearch in MATLAB, can utilize parallelism for computing the objective and constraint function for every point in the pattern. There are several patterns available, where the pattern used in this report is called GPSPositiveBasis2N; the points in the pattern are located in the positive and negative component directions of the vector x.

Patternsearch also contains a parfor-loop. All points within the mesh are dynamically distributed to all workers in every iteration (Fig. 6).

    Figure 6: Patternsearch evaluates points in the current estimate’s vicinity.


4.2 Cloud Computing

Cloud Computing is a term widely used in media and is often associated with services available to the public, like Dropbox, Google Drive or Microsoft 365. A proper definition of Cloud Computing is given by the National Institute of Standards and Technology (NIST)[13].

Cloud definition: Cloud computing is a model for highly accessible, on-demand computer resources (e.g. networks, storage, servers and services) that can be provisioned and released by a customer with minimal effort. The definition contains 5 essential characteristics:

1. On-demand self-service. A user can single-handedly provision computer resources without requiring human interaction with the service provider.

2. Broad network access. Services are available over the network and are supported by a wide range of platforms (e.g. laptops, mobile phones, thin clients and workstations).

3. Resource pooling. Computing resources are viewed as a pool serving several users using a multi-tenant model. Physical and virtual resources are dynamically provisioned according to the user's demand, without the user necessarily knowing the exact location of the resources.

4. Rapid elasticity. Services can elastically be provisioned and released, sometimes automatically, to meet increasing or decreasing capacity demands.

5. Measured service. Used resources can be monitored, providing transparency for customer and provider.

The model revolves around the concept of services, where the customer pays for computer resources when utilized (service), and not for the actual hardware (goods); this differentiates a Cloud infrastructure from e.g. a local cluster [14, p.4-5]. There are three service models associated with cloud computing: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).

Infrastructure as a service (IaaS): IaaS enables the customer to provision virtual machines (VMs), storage and network to configure as they wish. For example, a cluster could be created in the cloud by grouping virtual machines, where the service would be the utilization of the cluster. Providers are e.g. Amazon EC2 and Rackspace.

Platform as a service (PaaS): PaaS offers the customer a platform for running their own applications. Here the service consists of delivering a software environment (operating system, libraries, etc.) in which the customers can modify and run their own application. An example would be Google App Engine.

Software as a service (SaaS): SaaS is the highest abstraction between hardware and customer. Here the service consists of delivering an application. A typical SaaS application would be a mail client like Gmail, providing computer resources indirectly through the software.

    4.2.1 Infrastructure

Cloud infrastructure revolves around distributing computer resources without dedicating physical hardware to a single tenant, a technique called Virtualization. This is possible by creating a virtual layer above the physical resources. The customer is provisioned Virtual Machines (VMs) that have access to a specified amount of computer resources [14, p.7]. VMs can be configured to be grouped or simply run stand-alone, as seen in figure 7.

Figure 7: Cloud services offer computer resources through Virtual Machines that have access to physical resources like storage (HD), computing power, etc.

The infrastructure gives rise to the characteristics of Cloud computing, addressed earlier. But there are also inherited problems. Heterogeneous resources, meaning that the underlying hardware will most likely vary between VMs, could have an impact on performance. Providers initially beginning with a homogeneous infrastructure will eventually get a heterogeneous system, caused by gradual upgrades of hardware, as was the case for instances (VMs) in Amazon EC2 in 2012 [15]. This could potentially affect the performance of parallel applications, especially for those utilizing several VMs in a cluster.

There are more issues concerning the use of cloud services that fall into the data security and data management categories. Moving a scientific application outside the closed environment of a company and into the Cloud potentially exposes confidential models and data to the public. There is also the issue of losing data due to hardware failures, legislation, hacking attempts, etc. In Appendix A a thorough study, conducted as a part of this work, assesses storage and security concerns for both Cloud Computing in general, and for Amazon EC2 in particular.

In summary, Cloud Computing aims to offer computing resources as a utility available at any time, anywhere, removing the need for a customer to set up and manage hardware. Providers offer highly scalable, provisioned resources that can quickly cope with varying workloads. The different service models allow a wider utilization of the resources, ranging from individual PC users storing data on Dropbox, to companies running grid-dependent scientific HPC applications.

    5 The Software - Analysis

The hot rolling software in this report is called the Adaptive Dimension Model (ADM) and is developed at ABB. Prior to this report the application could perform global optimization on a set of objective functions using the global optimization solvers MultiStart, GlobalSearch and Patternsearch, as described in [16].

In this section a thorough analysis of the ADM is performed. This includes identifying and understanding performance bottlenecks and analyzing the optimization problem (convergence and discretization). Fully understanding the computational nature of the software is vital before implementing parallelism (Sect. 4.1).

5.1 The model and its performance

The ADM is categorized as a constrained, non-linear and non-convex optimization problem, with a 33 dimensional x vector and a set of objective functions f(x). The optimization vector x governs the tensions on the billet (between stands), the gap sizes (between rolls), etc. Table 1 specifies all the 33 dimensions of x.

    Table 1: The physical meaning of the components in x.

    Component Description

    1 Entry speed of billet

    2-12 Inter-pass tension on metal

    13-22 Gap size between rolls

    23-32 Velocity of rolls

    33 Billet temperature

There are several objective functions that describe different quantities of the model. Two objective functions will be of special interest in this report, namely objective functions 71 and 63, from here on denoted obj. 71 and obj. 63. For certain parallelism implementations of the ADM more objective functions will be used. Table 2 describes the objective functions.


Table 2: List of objective functions used.

    Objective Fun. Description

    Obj. 71 Grain size of metal in the end product

    Obj. 67 Exit temperature of metal

    Obj. 64 Rolling Power targets

    Obj. 63 Specific power (power/production speed)

At its core the ADM computes temperature distributions and deformations of a billet passing through the stands of the rolling process. Analyzing which functions contribute the most to the software's execution time is important when dividing tasks in parallel. MATLAB has a built-in profiling tool tailored for this task. Profiling tools measure the execution time of all functions and child-functions of a program[17].
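A minimal profiling session could look like the sketch below, where a single local solve stands in for a full ADM run; the bound vectors and the constraint function are placeholders:

profile on                                    % start collecting timing data
[x, fval] = fmincon(@objectiveFunction, x_init, ...
    [], [], [], [], lb, ub, @constraintFunction);
profile viewer                                % interactive report: time and call count per function
stats = profile('info');                      % the same data as a struct, for scripted analysis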

The profiling results for the ADM indicate that the majority of the execution time is spent in functions computing the temperature distribution. Figure 8 is a screenshot from the profiler clearly identifying three computationally heavy functions: JE_extract_NEW, JE_termo_NEW and linspace.

Figure 8: Profiling results of the ADM. Three functions constitute the majority of the total execution time.

The function JE_extract_NEW() prepares data structures for JE_termo_NEW(), which then performs the temperature computations. Linspace is a part of both functions and is used to create linearly spaced vectors and matrices. As seen in figure 8, linspace is called 1.4 million times, stressing the memory of the system.


5.2 Code improvements

Improving the code will naturally result in shorter execution times, but it will also determine the efficiency of the parallel program. The serial part of the program will limit the obtainable parallel performance (Theor. 1), so making the serial part as fast as possible is preferred. In this section three significant improvements to the ADM are applied: improving memory efficiency, removing redundant computations, and MEXing critical functions.

5.2.1 Memory efficiency

Memory can roughly be divided into cache (on-chip memory) and RAM (physical memory). The cache is the faster of the two and requests data from the RAM. The cache performs best when contiguous blocks of data are loaded from the RAM, completely filling it. However, when data does not fit into the cache, or if data is spread out in RAM, the utilization of the cache will drop and limit the performance.

Code suffering from memory bottlenecks can be improved in several ways. MATLAB is designed to store data in columns, meaning that a column-vector's elements are stored contiguously in memory, while a row-vector's elements are spread out in memory. There are also vectorized operations that efficiently load and compute entire vectors at once[18]. A simple demonstration of non-vectorized code versus vectorized code:

% Non-vectorized
for i=1:1e06
    data(i) = sin(x(i))*B(i);
end

% Vectorized - the .* flags for component-wise (.) multiplication (*)
data = sin(x).*B;

The execution time of the above code was 1 second for the vectorized code and 50 seconds for the non-vectorized code. Hence the vectorized code was 50x faster. This clearly illustrates how important efficient memory utilization is.

    5.2.2 Redundant computations

When solving the ADM optimization problem in MATLAB, the computationally heavy objective function and constraint function are called in succession multiple times. Both functions compute the rolling model, which constitutes a significant part of their execution time. Since both functions are called for every new input vector x, the rolling model is computed twice for the same x. Using global variables in MATLAB, the redundant computations can be avoided by sharing the model data between the functions.

A variable in MATLAB, defined as global, is available in every function. This is useful when the amount of data is large and constant between calls, or if the data is needed later in the code. Applying this concept to the ADM effectively reduces the execution time by roughly 50 % regardless of optimization method (Fig. 9).

Figure 9: Removing redundant computations with global variables cuts the execution time of the ADM by 50 %. Here the local method fmincon confirms this.

Another benefit of global variables is that less time is spent on communicating data. This is important when working with parallel computations, where workers require data that is supplied by the host. Severe communication delays quickly develop if there is a lot of data being sent back and forth. Sending initial data once, and storing it with global variables, reduces communication time.
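A hedged sketch of the sharing pattern is shown below; the helper names rollingModel, grainSize and processConstraints are placeholders and not the ADM's actual functions (each function would live in its own .m file):

function f = objFun(x)
    global modelData lastX
    if isempty(lastX) || ~isequal(x, lastX)
        modelData = rollingModel(x);    % expensive model evaluation, done once per new x
        lastX = x;
    end
    f = grainSize(modelData);           % objective value from the stored model results
end

function [c, ceq] = conFun(x)
    global modelData lastX
    if isempty(lastX) || ~isequal(x, lastX)
        modelData = rollingModel(x);    % recompute only if x has changed
        lastX = x;
    end
    [c, ceq] = processConstraints(modelData);
end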

Improving memory efficiency of the ADM and removing redundant computations resulted in an execution time nearly 4 times faster than the original code. The profiling results shown in figure 10 are greatly reduced compared to figure 8.

Figure 10: Profiling results of the improved ADM. The top 5 most time consuming functions, where the self-time is listed in the last column.

5.2.3 MEX-Files

Finally, there is the possibility to create MEX (MATLAB Executable) files for computationally demanding functions of a program. To MEX means replacing a MATLAB function with a compiled C/C++ or FORTRAN version of the same function [19]. Other languages like C can offer a speedup of the software, especially when parts of the code cannot be vectorized.

The ADM contains AD_termo_prep_NEW, which calls both JE_termo_NEW and JE_extract_NEW. Parts of these functions cannot be vectorized, for example loops where computations require data from previous iterations. This makes AD_termo_prep_NEW a good candidate to be MEXed. The C language and the C compiler will improve the non-vectorized parts of the code. The compiler's optimization flags can do automatic inlining of functions and loop unrolling to improve the utilization of memory and CPU. MEXing is done through MATLAB and uses the compilation flags by default. For more information regarding C-compiler optimization, see [20].
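As a hedged sketch, building the MEX replacement amounts to a single call in MATLAB; the C source file name is an assumption based on the function name above:

mex AD_termo_prep_NEW.c   % compiles a MEX binary that MATLAB then calls instead of the .m file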

The final profiling of the improved ADM, now also with MEX-files, is seen in figure 11. Compared with the original code the execution time is reduced roughly 13 times.

Figure 11: Final profiling of the ADM with MEX, showing the three top time consuming functions.

5.3 Parameter study

This study contains two parts. The first part concerns discretization of the temperature field and how it affects the found global minimum and execution time of the ADM. The second part consists of investigating how different initial points converge towards a single minimum or multiple minima. This gives insight into how different initial points affect the execution time of the model and what to expect when evaluating many points.

    5.3.1 Discretization of the temperature field

Choosing the appropriate degree of discretization is important for the accuracy and execution time of the software. The ADM discretizes the temperature by dividing the billet into two zones, called the deformation and interpass zones. Every zone consists of a set of disks that represent a part of the complete billet. A disk is modelled as a circle with polar coordinates. fdJ is a discretization parameter that divides the radial distance into fdJ computing points, as seen in figure 12.

Figure 12: The discretization of the temperature field is constructed by dividing the billet into two zones, with a number of disks each.

The second important discretization parameter is called fdM and sets how many time steps are performed. So in essence fdJ and fdM constitute a computation grid of size (fdJ, fdM) for computing the temperature field.

Performing parameter sweeps of fdJ and fdM when running the local optimization method fmincon, for the same feasible initial point x, will give information about the accuracy of the solution and the execution time. Results are presented in Appendix B.1.

    5.3.2 Convergence study

Stressing the ADM by running several random initial points gives interesting information about what the execution times will be for MultiStart, GlobalSearch or Patternsearch. This can be investigated by creating a large set of initial points that are randomly distributed between the bounds of x.
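A sketch of such a stress test is given below; the bound vectors lb and ub, the number of points and the constraint function are placeholders, and the try/catch reflects that some randomly generated points fail with non-physical results (Sect. 5.3.3):

nPoints = 100;
x0   = repmat(lb, nPoints, 1) + rand(nPoints, 33) .* repmat(ub - lb, nPoints, 1);
fval = NaN(nPoints, 1);
for k = 1:nPoints
    try
        [~, fval(k)] = fmincon(@objectiveFunction, x0(k,:), ...
            [], [], [], [], lb, ub, @constraintFunction);
    catch
        % a failed (non-physical) run leaves fval(k) as NaN
    end
end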

The input variable x is a 33 dimensional vector limited by an upper bound and a lower bound. The vector will converge toward the minimum in a finite number of iterations. Figure 13 illustrates the bounds on x and the iterative process for a single run of fmincon with obj. 63.

The green line in figure 13 illustrates the initial point sent to fmincon. By generating new initial points at random within the defined bounds, the model will be stressed. Results show how many iterations are required on average to reach a local minimum, which local minima exist, and the distribution of which points reach which values. The summarized conclusions are stated below. For detailed results see Appendix B.2.


Figure 13: The components of x are plotted for every iteration of fmincon. The upper and lower bounds are dotted and dashed, respectively.

    5.3.3 Conclusions

The following lists conclude the study of discretization and convergence, describing highlights and discoveries that are important for the parallelization of the ADM. The conclusions are based on the results in Appendices B.1 and B.2.

Conclusions from the discretization study:

• The discretization characteristics of the ADM are known and satisfactory values for fdJ, fdM and the number of disks have been chosen for the parallel tests. Values: fdJ = 11, fdM = 6, the number of disks in the deformation zone is 2, and the number of disks in the interpass zone is 60. Increasing the discretization parameters beyond the satisfactory values will result in a small increase in accuracy and a large increase in execution time.

• Optimization algorithms (SQP, Active-set, Interior-point) result in different execution times of fmincon, where Active-set is the fastest. Also, Interior-point is removed from further studies because of poor accuracy in finding the minimum and long execution times.

• Objective functions take different time to compute. Obj. 71 requires fewer iterations to find the minimum in comparison to obj. 63; the temperature field calculations are equally hard to compute, but obj. 63 requires more evaluations.

Conclusions from the convergence study:

• An initial point will fail roughly 15-20 % of the time. Randomly distributed initial points, within bounds, fail when non-physical results are encountered while optimizing.

• The choice of algorithm affects the obtained solutions. SQP has a higher probability than Active-set of finding the global minimum; this is true for both obj. 71 and obj. 63. The difference in the probability of finding the global minimum is significant for obj. 63 (AS: 29 % compared to SQP: 85 %). Also, SQP is more accurate in finding the global minimum, in comparison to Active-set.


• Different initial points result in different solution times. This is mainly due to the number of iterations required for each initial point. For certain initial points, obj. 71 spikes in the number of iterations required to halt fmincon, and the obtained solution is then usually far from the global minimum. Obj. 63 has no 'spikes' but has a higher average number of iterations.

• Only one local minimum was found for each objective function. This could call the use of global optimization into question. However, global optimization is still important since there exist multiple 'problem points', i.e. points that partly converge to another objective value, or that fail altogether. Also, this conclusion covers only two of the several objective functions available within the ADM; another objective function could have several local minima.

• The occurrence of 'problem points' will greatly affect the workload of all global solvers. These points greatly extend the execution time of the software, making it harder to load-balance when parallelizing.

    5.4 Software Parallelism

In this section the parallelism of the software is implemented and tested on ABB's cluster, Leo (Sect. 6). The results will help analyze the obtained parallel performance of the software, and serve as an important reference when comparing with results obtained on the Amazon EC2 cloud (Sect. 6).

    5.4.1 The workload of the software

The workload of a program is an important quantity when discussing parallelism. It is correlated to execution time in the sense that a larger workload means a longer execution time. Depending on which optimization method is chosen, the composition of the workload can vary greatly (Fig. 14).

Figure 14: The workload for the complete ADM software with respect to different optimization methods.

Regardless of which optimization method is used there will be a constant initialization and evaluation part in the software. These parts are included in the serial fraction of the software (Fig. 5). The size of the workload depends on the choice of optimization method and on what parameters the method uses. For example, for multistart the number of initial points will affect the workload.


Workload composition

• fmincon: Finds a local minimum for a single initial point. The major part of the work is spent on computing the objective function and constraint function (Fig. 11) many times until convergence.

• MultiStart: Every initial point is run by fmincon, hence the workload is roughly the number of initial points multiplied by the execution time of one fmincon run. However, the convergence study (Sect. 5.3.2) shows that the execution time of a single fmincon run can vary significantly.

• GlobalSearch: The workload is a combination of searching for candidate points and running fmincon. The search strategy is stochastic, which makes the number of fmincon runs, and hence the workload, hard to predict. The size of the workload largely depends on the number of fmincon runs.

• Patternsearch: The workload consists of the number of objective and constraint function evaluations required for convergence. The number of evaluations is related to the choice of pattern and the search strategy.

    5.4.2 Implementation of parallelism

Where to introduce parallelism in the software depends on the optimization method and the number of objective functions that are of interest. In this section every method's parallelism is illustrated and tested on ABB's cluster Leo (Sect. 6).

A set of tests is constructed to measure how well the optimization methods scale with increasing parallel resources. Since the parallel performance is affected by the composition of the workload, and the workload depends on the optimization characteristics, all tests include different objective functions and algorithms (Sect. 5.3.3). All tests are performed 5 times, averaged, and presented in this section using parallel performance metrics (Sect. 4.1.1); a sketch of how these metrics are computed is shown below.
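As a reminder of the metrics, a minimal sketch of how speedup and efficiency are obtained from averaged timings (runADM is a hypothetical wrapper around one complete test run):

    % Speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p,
    % averaged over 5 repeated runs per worker count.
    nWorkers = [1 2 4 8 12 16 24];
    T = zeros(5, numel(nWorkers));
    for r = 1:5
        for p = 1:numel(nWorkers)
            T(r, p) = runADM(nWorkers(p));   % placeholder: returns execution time in seconds
        end
    end
    Tavg       = mean(T, 1);
    speedup    = Tavg(1) ./ Tavg;
    efficiency = speedup ./ nWorkers;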

    5.4.2.1 fmincon and GlobalSearch

fmincon, and hence also GlobalSearch, supports parallelization of the gradient computations (Sect. 4.1.3.1). This means that parts of fmincon's workload are divided among several workers (Fig. 15).
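A minimal sketch of enabling parallel gradient estimation; the exact UseParallel value (true or 'always') depends on the MATLAB release, and objFun, nonlcon, x0, lb and ub are placeholders for the ADM functions and bounds:

    matlabpool('open', 12);                % parpool in newer releases
    opts = optimoptions('fmincon', 'Algorithm', 'sqp', 'UseParallel', true);
    % Finite-difference gradient components are now evaluated on the workers.
    [xopt, fval] = fmincon(objFun, x0, [], [], [], [], lb, ub, nonlcon, opts);
    matlabpool('close');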

Figure 15: fmincon and globalsearch can distribute the computations of gradient components to several workers, while converging toward a minimum.


Figure 16: Results from running fmincon for the same feasible initial point x, while varying test parameters: algorithm, obj. fun. and number of workers.

Figure 17: Results for GlobalSearch running 500 candidate points while varying test parameters: algorithm, obj. fun. and number of workers.

    5.4.2.2 MultiStart

MultiStart distributes the initial points, and hence a large part of the software workload, dynamically to the workers (Sect. 4.1.3.2). This allows several fmincon runs to execute simultaneously (Fig. 18).
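A minimal sketch of a parallel MultiStart run, assuming the placeholder handles objFun and nonlcon and the bounds lb, ub:

    problem = createOptimProblem('fmincon', 'objective', objFun, 'x0', x0, ...
        'lb', lb, 'ub', ub, 'nonlcon', nonlcon, ...
        'options', optimoptions('fmincon', 'Algorithm', 'sqp'));

    ms = MultiStart('UseParallel', true, 'Display', 'iter');
    matlabpool('open', 12);                                    % parpool in newer releases
    [xbest, fbest, ~, ~, solutions] = run(ms, problem, 24);    % 24 random initial points
    matlabpool('close');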


Figure 18: Parallelism by running several fmincon at once.

    Figure 19: Results when running MultiStart with 24 initial points.

    5.4.2.3 Patternsearch

Patternsearch iterates towards a global minimum by evaluating patterns with points in every positive and negative component direction. At every point, the computation of the objective and constraint functions is an independent task; hence the mesh evaluation can be divided among the workers (Fig. 20). For more information on patternsearch parallelism see section 4.1.3.3. A minimal sketch of a parallel patternsearch call is shown below.
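Parallel polling requires a complete poll of the pattern in every iteration (option names follow the Global Optimization Toolbox of that era; objFun, nonlcon, x0, lb and ub are placeholders):

    psopts = psoptimset('UseParallel', true, ...
        'CompletePoll', 'on', ...        % evaluate the whole pattern so points can be distributed
        'Vectorized', 'off', ...
        'MaxIter', 100);                 % the tests below are limited to 100 iterations
    [xopt, fval] = patternsearch(objFun, x0, [], [], [], [], lb, ub, nonlcon, psopts);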

    5.4.2.4 Performance analysis

The parallel performance of the software varies significantly for different optimization methods and algorithms, where the amount of parallel work in relation to the serial work is important. By measuring the serial execution time of the ADM (Fig. 14) and the execution time of the optimization method, a naïve upper bound for the speedup can be calculated (Eq. 10).
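For example, the bound can be evaluated directly from the measured serial fraction; a sketch for MultiStart with 11 workers, using the numbers listed in Table 3:

    % Naive speedup bound: S_max = 1 / ((1 - f) + f/N),
    % where f is the fraction of the runtime spent inside the method.
    f = 0.984;                        % MultiStart accounts for 98.4 % of the runtime
    N = 11;                           % number of workers
    S_max = 1 / ((1 - f) + f / N);    % ~9.5x, as listed in Table 3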

In table 3 the obtained speedup for each method is compared with its theoretical maximum speedup.


Figure 20: One-level parallelism of Patternsearch.

    Figure 21: Results when running Patternsearch, limited to 100 iterations.

Table 3: Actual speedup compared with the theoretical speedup limit when using 11 workers. Data is based on Figures 16, 17, 19 and 21 (obj. 71 and SQP).

    Method % of runtime Theor. Speedup Act. Speedup Efficiency

    fmincon 69.6 % 2.7x 1.9x 17.3 %

    multistart 98.4 % 9.5x 6.1x 55.5 %

    globalsearch 98.9 % 9.9x 3.9x 35.5 %

    patternsearch 99.4 % 10.4x 5.8x 52.7 %

The listed percentages represent the fraction of the total execution time spent in the optimization method. Since the methods implement parallelism differently, and contain serial parts, the theoretical speedup is called naïve and reflects the maximum speedup if the method could be completely parallelized.

The methods MultiStart and Patternsearch scale better than the other methods with an increasing number of workers. Both methods constitute the major part of the total execution time of the software and therefore have large theoretical speedups. Also, these methods are parallelizable to a greater degree than fmincon and globalsearch, since their efficiency is higher. The combination of large workloads and efficient parallelism is the explanation behind the obtained results.

The optimization characteristics affect the parallel performance of the methods. It was concluded from the parameter study (Sect. 5.3.3) that obj. 63 requires more iterations than obj. 71 when running fmincon. This affects the execution time of the methods, but not the execution time of the remaining software. Hence, the serial fraction is smaller for obj. 63 than for obj. 71 when running fmincon. The results in figures 16 and 19 show that the speedup is consistently larger for obj. 63 than for obj. 71, regardless of algorithm. Globalsearch shows the same tendencies, but its speedup is influenced by the algorithm to a large degree (Fig. 17); fmincon constitutes a smaller part of globalsearch's total workload.

The choice between SQP and Active-set (Sect. 3.4.1) affects the execution time of a single fmincon run. It has been shown in [3] that the parallel fraction of fmincon depends on the choice of algorithm: fmincon with SQP has a parallel fraction of 48.7 % and with Active-set 87.2 %. This means that Active-set should have a larger speedup than SQP when run in parallel. This holds true for multistart (Fig. 19), but not for a single run of fmincon (Fig. 16) or globalsearch (Fig. 17). The reason could be that SQP has, on average, longer execution times than Active-set.

    5.4.3 Parallel limitations

The theoretical limits presented in the previous subsection are, as already stated, naïve. There are several causes that lower the obtained speedup, such as communication delays, load balancing issues, etc. This section addresses some of them by stressing the ABB cluster's resources (network, CPUs, hard drives) with an increasing number of MATLAB workers.

    5.4.3.1 Parallel Overhead

Overhead caused by increased network communication, allocation of more nodes and the initialization of workers can severely degrade parallel performance (Sect. 4.1.2.1). A test demonstrating the sharing of cluster resources among several users on the ABB cluster is constructed; a sketch of the measurement is shown below. The results show that the start-up time for the matlabpool (allocation of parallel resources) worsens with an increasing number of provisioned workers (Fig. 22). Changing to a dedicated node removes the competition for resources (Fig. 23).
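A minimal sketch of the start-up measurement; matlabpool was the pool command in the MATLAB release used here (replaced by parpool in later releases), and the actual test also times the subsequent ADM run:

    workerCounts = [4 8 12 16 24 32 40];
    startupTime = zeros(size(workerCounts));
    for i = 1:numel(workerCounts)
        tic;
        matlabpool('open', workerCounts(i));   % allocate parallel resources on the cluster
        startupTime(i) = toc;
        matlabpool('close');
    end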


Figure 22: Shared node with a high number of users. Stacked bars illustrating execution time for different parts of the ADM.

Figure 23: Dedicated node. Stacked bars illustrating execution time for different parts of the ADM.

Network communication adds extra time to the overall execution time. The ratio between the time to send data and the worker's runtime needs to be low for efficient parallelization. Considering how the available methods use parallelism (Sect. 4.1.3.1), multistart is the method with the lowest ratio: the time spent sending data is insignificant compared to running a complete fmincon run on the worker. For the other methods, every communication results in only a couple of evaluations of the objective and constraint functions. When increasing the number of available workers to 40, network communication becomes dominant for fmincon, globalsearch and patternsearch (Fig. 24), but not for multistart (Fig. 25). This manifests as a decrease in speedup, which occurs in the range of 20-40 workers.


Figure 24: Methods fmincon, globalsearch and patternsearch for a larger number of workers.

Figure 25: Multistart for different numbers of initial points (pts) and for a larger number of workers.

    5.4.3.2 Granularity

A method's workload can be divided into a number of smaller, independent parts; this number is the granularity of the method. Parallel performance will not improve when using more workers than the method's granularity. For example, running multistart with 24 initial points limits the useful number of workers to 24. Ignoring the granularity of the method results in a speedup plateau where idle workers exist but do not contribute to the optimization. Figure 25 illustrates this situation when using 24 initial points. The methods have different granularities, presented in table 4.


Table 4: Granularity of the different methods. The optimization vector x contains 33 components.

    Method Granularity Description

    fmincon 33 1 component of x per worker (gradient comp.)

    multistart num. of init. pts 1 initial point per worker

    globalsearch 33 Same as fmincon

patternsearch 66 (2*33) Pattern consists of 33*2 points (GPSPositiveBasis2N)

A plateau will occur for all methods when the number of workers exceeds the granularity; tendencies of this are seen for fmincon and globalsearch in figure 24 (33 workers and above). A simple guard against over-provisioning is sketched below.
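A minimal sketch of such a guard, capping the requested pool size at the method's granularity so that no worker is provisioned only to sit idle (method, numInitPts and requestedWorkers are placeholders):

    granularity = struct('fmincon', 33, 'multistart', numInitPts, ...
                         'globalsearch', 33, 'patternsearch', 66);
    nWorkers = min(requestedWorkers, granularity.(method));   % e.g. method = 'patternsearch'
    matlabpool('open', nWorkers);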

    5.4.3.3 Load balancing

In all methods a parfor-loop attempts to load balance work to the workers through dynamic distribution. The aim is to keep the workers utilized as much as possible during the total execution of the software. Proper load balancing equates to parallel efficiency.

Increasing the number of workers for a method, while keeping the number of initial points per worker constant, is called weak scaling; dividing the parallel time by the serial time illustrates how efficiently the method scales. For multistart, a weak scaling test assigning 2, 4 or 6 initial points per worker shows the impact of load balancing (Fig. 26); a sketch of the test is given below.
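A minimal sketch of the weak-scaling test for MultiStart, where the problem size grows with the pool so that every worker is assigned k initial points. The ms and problem objects are set up as in the MultiStart sketch above, and T1 is assumed to be the measured serial reference time for k points on one worker:

    k = 4;                                     % initial points per worker (2, 4 or 6 in the tests)
    workerCounts = [4 8 12 16 24];
    Tn = zeros(size(workerCounts));
    for i = 1:numel(workerCounts)
        matlabpool('open', workerCounts(i));
        tic;
        run(ms, problem, k * workerCounts(i)); % k initial points per worker
        Tn(i) = toc;
        matlabpool('close');
    end
    weakEfficiency = T1 ./ Tn;                 % drops below 1 as load imbalance grows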

Figure 26: Running Multistart where the number of initial points per worker is constant. An increase on the y-axis represents a drop in efficiency (in %).

It is known from the parameter study (Sect. 5.3.3) that obj. 71 for certain points can exhibit fmincon runs with high iteration counts. This, intuitively, should affect the load balancing, since certain workers will receive considerably higher workloads than the rest. This is confirmed in figure 26, where several troublesome points worsen the load balancing with obj. 71. Obj. 63 has a more even average iteration count per fmincon run and hence better efficiency.

Increasing the ratio between the number of initial points to be solved and the total number of workers allows the parfor-loop to utilize the workers for a larger fraction of the parallel execution time.

    5.5 Summary of analysis

The optimization methods exhibit different parallel performance, and MultiStart is the recommended method. Its good performance is due to the small ratio of serial to parallel parts, and to the fact that its granularity equals the number of initial points. The obtained performance is also affected by which objective function is optimized and which algorithm is used. For example, obj. 71 contains 'spikes' (App. B.2), while obj. 63 does not. Weak scaling (Fig. 26) illustrates that efficiency is worse for obj. 71 than for obj. 63, but improves if more initial points are used.

For all methods only one local minimum was found per objective function. This could call the use of global methods into question. However, the occurrence of initial points causing 'spikes' and other 'problem points' could prevent a local method from finding the local minimum, and a global method would mitigate that risk.

    6 Computer resources

All computer resources used in this report are presented in table 5. The Passmark2 score is a linear grading system for the performance of different CPUs and is included to relate performance between hardware.

    Table 5: Specifications for all computer resources used throughout this report.

Name             CPU                                                         RAM      Network   Cost
Laptop           Intel Core i5-2450M, 2 CPUs @ 2.50GHz (Passmark: 3,443)     6 GB     —         —
ABB Cluster      Intel Xeon E5-2670, 16 CPUs @ 2.6GHz (Passmark: 12,867)     32 GB    1 Gbps    —
Cloud (cc2.8x)   Intel Xeon E5-2670, 16 CPUs @ 2.6GHz (Passmark: 12,867)     60 GB    10 Gbps   2.3 $/h
Cloud (c3.8x)    Intel Xeon E5-2680 v2, 16 CPUs @ 2.8GHz (Passmark: 16,799)  60.5 GB  10 Gbps   1.9 $/h
Cloud (r3.8x)    Intel Xeon E5-2670 v2, 16 CPUs @ 2.5GHz (Passmark: 14,638)  244 GB   10 Gbps   3.1 $/h

2http://www.cpubenchmark.net/cpu_test_info.html


7 Cloud Assessment - Method and Results

In this section the parallel performance of the software is evaluated on the Amazon Elastic Compute Cloud (EC2). The different optimization methods are stressed and analyzed. The aim is to compare the parallel performance of the software on the cloud cluster and on the ABB cluster.

The Amazon cloud infrastructure is built on virtualization techniques, where virtual machines, called Instances, are provisioned and interconnected into a virtual cluster. There are several instance types, designed for different hardware requirements, a couple of which will be tested and analyzed from a cost-performance perspective. The pre-study of the Amazon EC2 infrastructure is available in Appendix A, and the complete specifications of the computer resources used can be found in section 6.

    7.1 Cloud Computing through Mathwork’s CloudCenter

Access to the Amazon EC2 infrastructure is, in this report, primarily done using Mathwork's CloudCenter3; a web service that automatically provisions instances, configures virtual clusters, and installs MATLAB on EC2. The service is limited to the instance types cc2.8xlarge and cg1.4xlarge, and a maximum of 256 workers. The cc2.8xlarge is one of Amazon's compute instance types with hardware specifications similar to the ABB cluster. The same range of tests performed on the ABB cluster is also performed through Mathwork's CloudCenter.
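From the client MATLAB session, the cluster created by CloudCenter is used like any other Parallel Computing Toolbox cluster profile. A minimal sketch, where the profile name 'CloudCenterCluster' and the entry point runADM are placeholders:

    c = parcluster('CloudCenterCluster');                 % profile created by CloudCenter
    job = batch(c, @runADM, 1, {63, 'multistart'}, ...    % optimize obj. 63 with MultiStart
                'Pool', 31);                              % 31 pool workers + 1 task worker = 32
    wait(job);
    results = fetchOutputs(job);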

    7.1.1 Optimization method comparison

Figure 27: cc2 instances compared to the ABB cluster Leo for all optimization methods. Left plot: speedup; right plot: difference in runtime.

    3http://www.mathworks.se/discovery/matlab-ec2.html


The methods scale similarly on the cloud and on the ABB cluster, but a significant decrease in speedup develops for Leo when using 24 workers or more (Fig. 27). The execution time of all methods is in general higher when run on the cloud, regardless of algorithm, objective function or problem size (Fig. 27, 28).

Figure 28: Radar plot comparing the execution time (s) (on the radial axis) of the cloud (blue) and Leo (red) for 1 worker.

    7.1.2 Maximized number of workers

Increasing the number of workers eventually flattens the speedup curve for all methods (Fig. 29, 30).

    Figure 29: Scaling from 1 to 256 workers for the methods: fmincon, GS, PS.


Figure 30: Scaling from 1 to 256 workers for Multistart (24 pts, 48 pts, 96 pts).

7.2 Comparison of Cloud instances

There are two additional instance types suited for running the ADM. The first instance type, c3.8xlarge, has a faster processor and costs roughly 20 % less per hour compared to cc2.8xlarge. The second instance type, r3.8xlarge, has four times the memory and costs 35 % more than the cc2.8xlarge. The two instance types are chosen since they are designed for applications with different hardware requirements: computing power and memory, respectively. Since Mathwork's CloudCenter is limited to the cc2.8xlarge, custom scripts were created to replace the functionality of CloudCenter.

Running fmincon and MultiStart for different numbers of workers, the execution time is compared between the instance types and the ABB cluster. The results show that the c3.8xlarge is comparable to, and sometimes even faster than, the ABB cluster, while the r3.8xlarge gives no significant improvement from its increased memory capacity (Fig. 31).

    Figure 31: Fmincon and MS. Different instance types and number of workers.


7.3 Cloud bursting

The ability to quickly offload large workloads from the private computer to a cluster is called bursting. Cloud bursting is a type of deployment model where a hybrid cloud infrastructure, i.e. a private network with access to public cloud services, is designed to cope with short peaks in computing capacity by offloading workload to the public cloud4.

The ADM, with its many objective functions, can rapidly spike in its demand for computing capacity, for example when the global optima of four objective functions are to be found using MultiStart with 64 initial points each. Bursting this type of heavy task would benefit a laptop user. Test results from bursting to different clusters show significant time improvements compared to running on a regular laptop (Tab. 6).

Figure 32: Two-level parallelism of the ADM software.

From a parallel perspective, the most efficient approach when solving a problem is to strive for as high a granularity as possible. Dividing the ADM into two parallel levels, where the objective functions are run simultaneously and every MultiStart optimization has access to its own pool of workers, achieves this (Fig. 32); a sketch of this job structure is given below.
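A minimal sketch of the two-level structure: one batch job per objective function, each with its own worker pool running MultiStart. The profile name, the runADM entry point and the list of objective-function IDs (only 63 and 71 appear in this report; the others are illustrative) are placeholders:

    c = parcluster('CloudCenterCluster');
    objectives = [63 71 75 80];                      % one job per objective function
    for i = 1:numel(objectives)
        jobs(i) = batch(c, @runADM, 1, {objectives(i), 64}, ...   % 64 initial points each
                        'Pool', 31);                 % 4 jobs x 32 workers = 128 workers in total
    end
    for i = 1:numel(objectives)
        wait(jobs(i));
        out{i} = fetchOutputs(jobs(i));
    end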

Table 6: Running the ADM using 128 workers. Workload: 4 obj. functions, one MultiStart run each with 64 initial points; a total of 256 fmincon runs.

    Runtime ABB Cluster cc2.8xlarge c3.8xlarge Laptop (serial)

    Submit job 5 s 7 s 6 s -

    Wait 130 s 170 s 140 s 5 h

    Receive results 1 s 12 s 12 s -

    Total 136 s 189 s 158 s 5 h

    4http://searchcloudcomputing.techtarget.com/definition/cloud-bursting


8 Cloud Assessment - Discussion

    8.1 Cloud Performance

The ADM has the lowest execution times when run on the ABB cluster, regardless of method or optimization characteristics (Fig. 27, 28). The processor specifications of the cloud cluster (cc2) and the ABB cluster are identical (Sect. 6). This indicates that virtualization5 could be the cause, since this is the largest difference between the architectures. Communicating through the virtual layer accumulates latencies that eventually become significant. Expressed in relative terms, all methods show that running the ADM on the ABB cluster is 18 % faster (std 3 %) than running it on the cloud (cc2.8xlarge). Note that virtualization is suggested as a probable cause and not as proven fact.

Considering the performance of the other instance types (c3, r3), the loss in speed due to virtualization can be compensated through better hardware. The c3.8xlarge instance has a more powerful processor than the ABB cluster, reducing the ABB cluster's advantage in execution time from 18 % to 3 % (std 3 %) (Fig. 31). For the r3.8xlarge instance the corresponding number is 11 % (std 3 %). This indicates that the ADM is not limited by the cluster's memory capacity, but rather by the computing power of the CPU.

The scaling characteristics of the ADM are similar for all cluster architectures when using 24 workers or fewer. For more workers, all methods except MultiStart transition from having the shortest execution times on the ABB cluster to having shorter execution times on the cloud (Fig. 27). This is due to increased network communication when using many workers, which especially affects methods where communication is continuous throughout the optimization process (fmincon, globalsearch and patternsearch).

The ABB cluster operates on a 1 Gbps (gigabit per second) network, while all tested cloud instances use a 10 Gbps network. Measuring the bandwidth allocation of parallel fmincon and patternsearch shows that every worker takes roughly 48 Mbps of bandwidth.6 Hence, saturating the network bandwidth requires ∼20 or more workers on the ABB cluster and ∼200 workers for the AWS cloud instances. A saturated network significantly affects the parallel performance of the software, as seen in figure 27, since workers will wait until data is received before computing.

Increasing the number of workers to the maximum of 256 clearly identifies where the speedup decreases (network saturation) and eventually flattens (Fig. 29, 30). The speedup flattens because of the granularity of the method, which results in idle workers that do not allocate network bandwidth. A tentative observation of the graphs shows that the solvers' granularity limits (Subsec. 5.4.3.2) correlate with the number of workers at which the speedup flattens. Finer measurements are needed to fully confirm this.

5On Cloud: adding a virtual layer between hardware and software
6Measured through NetHogs version 0.8.0 while running the ADM with the chosen method


Bursting large workloads to the cloud is a viable and time-efficient option. Running the ADM for 4 objective functions, with MultiStart and 64 initial points per objective function, took 5 hours of computation time on a typical laptop. Bursting to the cloud reduced this time to ∼3 minutes. The total time consists of sending the work (submit job), waiting for the job to complete (computations), and returning the results. Comparing the cloud and the ABB cluster, the time differs in the wait time and the receive time (Tab. 6). The difference in wait time is expected, considering the virtualization layer. The time to receive data also seems reasonable: sending 1-2 MB of data over the private network (ABB cluster) is many times faster than from the public cloud.

    8.2 The ADM as a Service

The parallel software has been created according to the service model Software as a Service (SaaS, Sect. 4.2), in which the user provisions the software from the cloud when needed. The ADM software is accessed from two places in Amazon: Amazon S3 when the cloud cluster is offline, and an EBS storage device when the cluster is online. This choice of data management adds redundancy and security by storing the model encrypted and redundantly in Amazon S3 while offline.

In order to provision the model as a service, a MATLAB installation is required on the client computer. The client software is a single MATLAB function that creates jobs with parameters instructing the ADM software which objective function to optimize and what method to use. Through Mathwork's CloudCenter, a seamless link supplies the client with the service from Amazon EC2. The complete concept is illustrated in figure 33; a sketch of the client function is given below.
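A minimal sketch of such a client function; the profile name, the runADM entry point and the attached-file name are placeholders, not the actual ABB implementation:

    function job = admService(objFunID, method, nInitPts, nWorkers)
    % Submit one ADM optimization job to the provisioned cloud cluster.
        c = parcluster('CloudCenterCluster');
        job = batch(c, @runADM, 1, {objFunID, method, nInitPts}, ...
                    'Pool', nWorkers - 1, ...        % one worker runs the task itself
                    'AttachedFiles', {'ADM'});       % ship the model files with the job
    end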

    Figure 33: Conceptual illustration of the ADM as a service.


8.2.1 Cloud considerations

Moving the ADM from the private network to the cloud is necessary in order to provide the software as a service. The relocation can potentially expose company secrets to the general public. This is always a concern with cloud computing and should be carefully considered before choosing cloud services over local computer resources.

In the literature study on cloud security (App. A.3) made for this report, several security concerns related to the cloud were found. These are important to assess when considering a cloud solution and are listed briefly below.

    Security concerns:

    • Network Security (spoofing, sniffing, firewalls, security config.)

    • Interfaces (API, User interfaces, administration)

    • Data Security (Encryption, redundancy, disposal)

    • Virtualization (Isolation, hypervisor, data leakage)

    • Governance (Lack of user data control, lock-in)

    • Compliance (SLA, Loss of service, Audit)

    • Legal issues (Subpoena laws, provider privilege)

Amazon is one of the largest cloud service providers at present and constantly improves the security of its services. Many of the listed security concerns have direct solutions within Amazon Web Services (AWS). For example, network security is handled through sophisticated firewalls, called Security Groups, and access control is regulated through IAM (the Identity and Access Management console). A thorough walk-through of common security concerns and how to mitigate them in AWS can be found in Appendix A.3.

AWS offers sufficient redundant storage for the software. The overall storage services are perceived as comparable to, and even more secure than, company solutions. For example, redundant storage across multiple locations in Amazon S3 guarantees 99.999999999 % data durability with 99.99 % availability7, whereas a company's private network is most likely a single isolated location, vulnerable to e.g. power outages.

Data management for the ADM, through CloudCenter, is elegantly constructed. When the cloud cluster is offline all data is moved to an encrypted image (snapshot) in Amazon S3, while the data in Amazon EC2 is destroyed together with the cluster. This improves redundancy by using S3 and improves security by only decrypting the software when it is provisioned by the user. Note that this solution is specific to Mathwork's CloudCenter, whereas Amazon supports several other storage solutions. For more information regarding data management in AWS see Appendix A.2.

    7https://aws.amazon.com/s3/faqs/


8.2.2 Hybrid solutions

The ADM service model allows for the use of a hybrid cluster infrastructure, i.e. a composition of two or more clusters available to the user. The client software can effortlessly and simultaneously deploy jobs to the Amazon EC2 cloud, through the ABB firewall, and to the ABB cluster on the private network. The solution is possible through the design of the client software, which is not specific to the ADM software. Hence, this solution works for arbitrary parallel MATLAB software.

The hybrid infrastructure can be used to burst workloads requiring many workers to the cloud. This is useful considering that the ABB cluster is shared among many employees at ABB, where the desired computing power may not be available. For example, the bursting example presented in section 7.3 (Tab. 6) would require 128 workers. The ABB cluster contains 512 cores, so running the example would allocate 25 % of the available computing resources, which is not acceptable.

    8.3 Cost analysis

The cost of provisioning computer resources from Amazon EC2 depends on several choices related to e.g. storage type, computing power, network traffic, etc. As for other cloud services, the computer resources are billed by the hour or by the amount occupied ($/GB of storage).

The largest cost is associated with the provisioning of instances, where the choice of instance type is important. The three instance types tested for performance (Fig. 31) vary in their cost per hour (Fig. 34). It is clear that the c3.8xlarge is the best choice in terms of both cost and computing performance for the ADM.

Figure 34: Cost for different instance types compared to the obtained execution time when running the ADM with Multistart, 32 workers and 30 initial points.


9 Conclusions

This report demonstrates the use of the ADM as a service, supplying global optimization in parallel from Amazon EC2. The parallel performance on Amazon EC2 is comparable to a similar on-site cluster. Two performance-related factors were identified that set the two cluster architectures apart: network bandwidth, to the disadvantage of the on-site cluster (ABB), and virtualization, to the disadvantage of the cloud cluster. In particular, the on-site cluster executed the ADM 18 % (± 3 %) faster than the cloud cluster (cc2.8xlarge), regardless of method, for fewer than 20 workers.

From a cost-performance perspective, the c3.8xlarge instance was found to have the best performance and the lowest cost per hour for running the ADM software.8

Multistart is the most efficient method when run in parallel. The analysis and parallelization of the ADM revealed hard limits on the speedup for all optimization methods. The most important limit is related to the ratio between serial and parallel workload, where MultiStart has the lowest ratio. Another limitation is based on the granularity of the methods: the increase in speedup becomes zero when the number of workers surpasses the granularity of the method. MultiStart is the only method with dynamic granularity, where the granularity equals the number of initial points. The other methods are limited by the dimensionality of x. Finally, latencies from network communication become significant and cause a decrease in speedup for an increasing number of workers. The methods fmincon, globalsearch, and patternsearch require continuous network communication throughout the optimization process and thus suffer from significant decreases in speedup. Multistart only sends