Enterprise applications in the cloud - are providers ready?

Enterprise applications in the Cloud:

Are service providers ready?

Leonid Grinshpan, Oracle Corporation (www.oracle.com)

Subject

Managing the performance of enterprise applications is hard. Managing and optimizing

the performance of enterprise applications on shared virtualized infrastructure (i.e.

cloud computing) is even harder. This article outlines the specifics of capacity planning

and performance management of EAs deployed in the cloud.

Forrester research [www.teamquest.com/pdfs/.../forrester-key-cloud-virtual-computing.pdf]

indicates that 98% of interviewed executives in North America and in Europe believe

that the main challenges of virtualized and cloud environments have their root in

capacity and performance management.

Those findings apply directly to enterprise applications (EAs), which feature high
complexity, multiplatform deployments, and strict service quality requirements.
As an Oracle consultant, the author has witnessed numerous confirmations of
Forrester’s findings while working with a variety of customers on sizing and tuning
Oracle’s EAs.

The following is a real-life story. A large international bank deployed a financial EA
with a spiky user workload in a private cloud. The spikes occurred during each
financial reporting period because of the high rate of financial consolidations. The IT
department did not have the tools to measure transaction times and was not aware that
consolidations took unacceptably long during workload peaks. IT monitored hardware
resource utilization around the clock; it did not exceed 70% on any of the servers, so
IT was under the impression that the EA worked as expected. That impression
evaporated as soon as the first user complaints were logged. IT reexamined all
collected hardware performance counters, found no indication of a hardware resource
shortage, and decided to turn to EA experts.

Analysis of the consolidation transaction indicated that it spent most of its time in
the OLAP (online analytical processing) database. To become active there, a transaction
has to acquire database connections. Monitoring the EA under peak load revealed a
shortage of database connections, which limited the number of consolidations the OLAP
server could process concurrently. The finding made clear that the unacceptable
increase in consolidation time was due to long waits for available database
connections. Increasing the number of database connections noticeably improved
consolidation time but produced an unwanted effect: the database server was running at
almost 100% of its total CPU capacity. Fortunately, IT was proficient in detecting and
treating that kind of problem. By raising the number of CPUs on the database virtual
machine from 24 to 32, IT delivered the expected consolidation time and brought CPU
utilization back to a normal level. Unfortunately, because of the overcommitment of
CPUs, other applications in the cloud started to experience performance issues, but
that is the beginning of another real-life story. The IT department learned a few
lessons: monitor transaction time, become skilled at detecting and fixing software
bottlenecks, and forecast the consequences of changes aimed at performance
improvement.

Failure is not an option when launching an EA into the cloud: failure equates to a
disruption of a company’s operations with all the accompanying fiscal and public
relations consequences. EAs are critical vehicles carrying out day-to-day business
functions; they must perform as expected at every point of the production cycle and
efficiently process workloads that fluctuate within broad limits.

EAs can be implemented in different ways and target diverse business tasks. In this
article we consider EAs that are deployed inside corporations and used by their
employees; that is, they are not retail or customer-serving apps. We also define an EA
as a complex, unified object consisting of hardware infrastructure, business-oriented
software, and operating systems.

A nascent trend in EA deployment is relocation to the cloud. The beauty of cloud
computing from an IT perspective is that, along with the EA itself, the headache of EA
management migrates from the company’s IT to a cloud provider. The latter becomes
responsible for meeting the Service Level Agreement (SLA), a task significantly
complicated by the great expectations of cloud customers.

Cloud providers face two major challenges: allocating appropriate resources to an EA
(capacity planning) and maintaining an acceptable SLA (performance management). What
significantly complicates both tasks is that any cloud is inherently a collection of
resources shared among a number of applications, unlike the traditional IT environment,
where servers and appliances are mostly dedicated to applications. The cloud’s shared
resources are under dynamically changing demands from a variety of users working with
diverse EAs that have only one characteristic in common: mission-critical importance
for their corporations.

This article outlines the specifics of capacity planning and performance management of

EAs deployed in the cloud. We use queuing models of EAs to emulate and
analyze performance-related events on the cloud’s shared platforms.

The methodological foundation for this study can be found in the author’s book [Leonid
Grinshpan. Solving Enterprise Application Performance Puzzles: Queuing Models to the
Rescue, Wiley-IEEE Press; available in bookstores and from Web booksellers since
January 2012].

The article affirms that, to host EAs successfully, a cloud provider (in addition to
perfectly executing traditional system management duties) has to be capable of
efficiently carrying out:

- Monitoring and characterizing of EA workload.

- Proactive evaluation of EA capacity as well as transaction times compliance with

SLA.

- Monitoring of business transactions.

- Identification and fixing of software bottlenecks.

EA workload characterization

EA workload characterization includes three components:

- List of business transactions.

- Number of executions of each transaction requested by one user during one hour
(transaction rate).

- Number of users requesting each transaction.

Only one of the three components is not prone to frequent fluctuations: the list of
application transactions. It reflects application functionality, which tends to change
slowly, usually through small additions or removals with new software releases. The
other two components tend to be highly volatile. They normally feature daily, weekly,
monthly, and yearly fluctuations that usually exhibit repeatable patterns.

Any cloud hosting several EAs has to service a number of diversified dynamic

workloads and manage to process them according to SLAs. This is possible only if a

cloud provider is equipped with the tools to monitor all three components of workload

characterization and can collect and correctly interpret workload data necessary for

cloud capacity planning.
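The three components of workload characterization map naturally onto a small data structure. The sketch below is only an illustration: the class name, the transaction names, and all numbers are hypothetical and are not taken from any monitored system. It shows how the demand requested by users follows from the number of users and the transaction rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionWorkload:
    """One row of a workload characterization (hypothetical structure)."""
    name: str                    # entry in the list of business transactions
    users: int                   # number of users requesting the transaction
    rate_per_user_per_hour: int  # transaction rate: executions per user per hour

    def requested_per_hour(self) -> int:
        # Demand generated by all users of this transaction in one hour.
        return self.users * self.rate_per_user_per_hour


def total_requested_per_hour(workload: list[TransactionWorkload]) -> int:
    # Total hourly demand across all transactions in the workload.
    return sum(row.requested_per_hour() for row in workload)


# Hypothetical peak-period workload (illustrative numbers only).
peak_workload = [
    TransactionWorkload("Financial consolidation", users=40, rate_per_user_per_hour=2),
    TransactionWorkload("Interactive report", users=300, rate_per_user_per_hour=10),
]

if __name__ == "__main__":
    for row in peak_workload:
        print(f"{row.name}: {row.requested_per_hour()} requests/hour")
    print(f"Total: {total_requested_per_hour(peak_workload)} requests/hour")
```

Snapshots of such a structure taken at different times of the day, month, and year would expose the repeatable fluctuation patterns that capacity planning has to account for.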

Planning for capacity

A cloud with permanently changing workloads can deliver expected services only by

systematically implementing capacity planning. The highest degree of accuracy in

cloud capacity planning can be achieved by using queuing network models of EAs.

Queuing models are capable of factoring in cloud architecture, processing times on

different servers, the parameters of hardware as well as user workload. Models also can

assess the effects and limitations of software parameters like the number of threads,

connections to system resources, etc. Queuing models take into account the
fundamental behavior of any system servicing users: user requests wait in queues
whenever the service rate is lower than the rate of incoming requests. Wait time in
any queue contributes to transaction time. The ability of queuing models to assess it
for different workloads and system architectures enables the cloud provider to
estimate the needed capacity as well as compliance with the SLA.
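How queue wait time grows as the arrival rate approaches service capacity can be illustrated with the textbook M/M/c formula (Erlang C). This is a simplified sketch under stated assumptions: a single open queue with Poisson arrivals and exponential service times, not the multi-node network models analyzed in this article and not the algorithm of any particular solver. The 8 servers and the 1 sec service time are illustrative values.

```python
import math


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to wait in an M/M/c queue."""
    if offered_load >= servers:
        return 1.0  # the queue is unstable: every request waits
    below = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    top = (offered_load ** servers / math.factorial(servers)
           * servers / (servers - offered_load))
    return top / (below + top)


def mean_wait(servers: int, arrival_rate: float, service_time: float) -> float:
    """Mean time in queue (sec), excluding the service time itself."""
    offered_load = arrival_rate * service_time
    if offered_load >= servers:
        return math.inf
    return erlang_c(servers, offered_load) * service_time / (servers - offered_load)


if __name__ == "__main__":
    cpus, service_time = 8, 1.0  # assumed: 8 CPUs, 1 sec of CPU work per request
    for arrival_rate in (2.0, 4.0, 6.0, 7.0, 7.6, 7.9):
        utilization = arrival_rate * service_time / cpus
        wait = mean_wait(cpus, arrival_rate, service_time)
        print(f"utilization {utilization:5.1%}  mean wait {wait:8.3f} sec")
```

The wait stays negligible at moderate utilization and grows sharply as utilization approaches 100%, which is the qualitative behavior a capacity planner has to anticipate.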

A cloud infrastructure must be able to quickly reallocate system resources as workload
changes. We demonstrate how queuing models help determine the capacity needed to
satisfy particular workloads and predict usage cost for the cloud’s customers.

Let’s consider the EA queuing model in Figure 1. It represents a classical three-tiered
EA with Web, Application, and Database servers. Each server corresponds to a model
node whose number of processing units equals the number of CPUs in the server. The
users and the network are modeled by dedicated nodes. We assume that all servers
(whether physical or virtual) are allocated by a cloud provider to our EA and that
each one has 8 CPUs.

Figure 1 Model 1 of a three-tiered enterprise application

The workload for Model 1 is presented in Table 1. For simplicity it has only one

transaction named “Interactive transaction”; each user initiates an interactive

transaction ten times per hour. We have analyzed the model for 100, 200, 300, and 400

users.

Table 1
Workload for Model 1

Transaction name          Number of users          Number of transaction executions per user per hour
Interactive transaction   1, 100, 200, 300, 400    10

The models in this article were analyzed using TeamQuest solver
[http://teamquest.com/products/model/index.htm]. Model 1 predicts exponential
degradation of response time starting from 300 users (Figure 2). It also estimates
that for up to 250 users transaction time will stay under the 10 sec required by the
SLA (the vertical line on the chart).

Figure 2 Transaction response time and system throughput
[Chart: number of users (1, 100, 200, 300, 400) on the horizontal axis; series are transaction response time (sec) and system throughput (trans/sec); points A and B are marked.]

System throughput (measured in the number of transactions per second or per hour)
grows linearly until it reaches a breaking point at 300 users, where its growth slows
down (point A on the chart). At point A system throughput is 0.8 trans/sec * 3600 sec =
2880 trans/hour (for convenient representation on the chart we have scaled the system
throughput line 100 times). At point B (where transaction time is still in line with
the SLA), system throughput is 0.65 trans/sec * 3600 sec = 2340 trans/hour.

System throughput is a must-have parameter for cost estimation when the cloud’s
price policy requires customers to pay per transaction. Usage cost is calculated by
the formula:

Application Usage Cost = Cost of one transaction * System throughput
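The formula and the throughput conversions quoted for points A and B can be checked with a few lines of code. In this sketch the throughput values (0.8 and 0.65 trans/sec) come from the text, while the price of one transaction is a hypothetical figure chosen only to make the arithmetic concrete.

```python
import math

SECONDS_PER_HOUR = 3600


def throughput_per_hour(trans_per_sec: float) -> float:
    """Convert system throughput from trans/sec to trans/hour."""
    return trans_per_sec * SECONDS_PER_HOUR


def application_usage_cost(cost_per_transaction: float, throughput: float) -> float:
    """Application Usage Cost = Cost of one transaction * System throughput."""
    return cost_per_transaction * throughput


COST_PER_TRANSACTION = 0.01  # hypothetical price, in arbitrary currency units

point_a = throughput_per_hour(0.8)   # about 2880 trans/hour (growth slows down)
point_b = throughput_per_hour(0.65)  # about 2340 trans/hour (still SLA-compliant)

if __name__ == "__main__":
    for label, throughput in (("A", point_a), ("B", point_b)):
        cost = application_usage_cost(COST_PER_TRANSACTION, throughput)
        print(f"Point {label}: {throughput:.0f} trans/hour, usage cost {cost:.2f} per hour")
```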

Per Figure 2, throughput increases as the number of users grows. As our system can
support no more than 250 users without violating the SLA, we consider the throughput
the system supports for 250 users to be the SLA-compliant maximum throughput. To
receive the highest revenue, the cloud provider has to monitor workload and
dynamically allocate to the EA a volume of resources that keeps the system at the
SLA-compliant maximum throughput level. In Model 1, at that level the Database server
is efficiently utilized (Figure 3), the transaction time is in line with the SLA, and
the cloud provider has the highest return on investment. If the workload is fewer than
250 users, the Database server is underutilized and the cloud provider can reallocate
it to another EA.

Figure 3 Utilization of system servers
[Chart: number of users (1, 100, 200, 300, 400) on the horizontal axis; utilization percentage (0 to 100) on the vertical axis; series are Database server, Application server, and Web server.]

Price policy can be based not only on system throughput but also on hardware
utilization or on the specifications of allocated hardware (number of CPUs, their
speed, memory size, etc.). Whatever the price policy, queuing models provide data for
scientific estimates of the provider’s return on investment as well as the customer’s
cost.

Monitoring business transactions

The most important EA performance indicator is transaction response time. It is
specified in the SLA, so one would expect it to be under IT monitoring and control
24/7. Ironically, that is often not the case; many IT departments are instead
preoccupied with hardware capacity monitoring and do not even have appropriate
instruments and policies for business transaction monitoring. In the relentless pursuit
of IT optimization they strive to get the most out of hardware servers and appliances,
often compromising transaction time. Figures 2 and 3 show how such a policy can
jeopardize EA performance: beyond 250 users, hardware utilization and system
throughput keep going up, but the price we pay is the exponential degradation of
transaction time.
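Measuring business transaction time does not require heavy tooling. The sketch below is one possible minimal approach, not a description of any monitoring product: the class name and its methods are hypothetical, and the 10 sec threshold is the SLA value used in the modeling example of this article. It times each execution of a named transaction and reports the executions that exceeded the SLA.

```python
import time
from contextlib import contextmanager


class TransactionMonitor:
    """Records business transaction times and flags SLA violations."""

    def __init__(self, sla_seconds: float, clock=time.perf_counter):
        self.sla_seconds = sla_seconds
        self.clock = clock  # injectable so the monitor can be tested without waiting
        self.samples: dict[str, list[float]] = {}

    @contextmanager
    def measure(self, name: str):
        """Time one execution of the named transaction, even if it fails."""
        start = self.clock()
        try:
            yield
        finally:
            self.samples.setdefault(name, []).append(self.clock() - start)

    def violations(self, name: str) -> list[float]:
        """Durations of the named transaction that exceeded the SLA."""
        return [d for d in self.samples.get(name, []) if d > self.sla_seconds]


if __name__ == "__main__":
    monitor = TransactionMonitor(sla_seconds=10.0)
    with monitor.measure("Interactive transaction"):
        time.sleep(0.05)  # stand-in for the real business transaction
    print(monitor.samples)
    print("SLA violations:", monitor.violations("Interactive transaction"))
```

Recording the duration in a `finally` clause keeps failed transactions in the statistics, which matters because slow transactions are often the ones that time out.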

Software bottlenecks

Transferring an EA to the cloud puts the cloud provider in charge of detecting and
troubleshooting all kinds of performance bottlenecks. EAs may suffer from two distinct
groups of bottlenecks: hardware and software. The first group is familiar to any IT
department and cloud provider; the remediation prescriptions can be found in textbooks
on performance. Usually, once a hardware bottleneck is identified, it can be fixed by
either vertical or horizontal scaling.

Things are much trickier with software bottlenecks. Software bottlenecks are caused by
settings of EA tuning parameters that limit the application’s ability to use available
hardware capacity to satisfy the workload. Examples of tuning parameters: Java Virtual
Machine heap size, number of Web server connections, number of database connections,
number of software threads used to execute a particular EA function, etc. Bottlenecks
of this group can be fixed by changing the values of tuning parameters; their
identification requires knowledge of EA functionality, which might be terra incognita
for a cloud provider. The inability of a cloud provider to detect and fix EA software
bottlenecks is equivalent to the provider’s failure to deliver service acceptable to
cloud customers. Queuing Model 2 demonstrates the impact of software bottlenecks on
EA performance.

Model 2 has the same topology as Model 1 presented in Figure 1, the same number of
CPUs for each server, and the same workload described in Table 1. The difference
between the two models is that Model 2 emulates database connection pooling. A
transaction has to acquire two database connections before it can be processed by the
database; after processing is completed, the connections are returned to the pool and
can be allocated to another transaction. If the pool has no idle connections, the
transaction waits until they become available, which increases its response time.
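A back-of-the-envelope bound shows why a small pool can cap the database regardless of its CPU count. The sketch below is not the queuing model itself. It relies on two assumptions that are not stated in the article: each transaction inside the database keeps at most one CPU busy, and it holds its connections for a fixed time (the 5 sec used here is hypothetical). The pool sizes and CPU counts are illustrative inputs.

```python
def max_concurrent_transactions(pool_size: int, connections_per_transaction: int) -> int:
    """How many transactions can hold connections at the same time."""
    return pool_size // connections_per_transaction


def throughput_ceiling(pool_size: int, connections_per_transaction: int,
                       hold_time_sec: float) -> float:
    """Upper bound on database throughput (trans/sec) by Little's law."""
    concurrent = max_concurrent_transactions(pool_size, connections_per_transaction)
    return concurrent / hold_time_sec


def cpu_utilization_ceiling(pool_size: int, connections_per_transaction: int,
                            cpus: int) -> float:
    """Upper bound on CPU utilization, assuming one busy CPU per transaction."""
    concurrent = max_concurrent_transactions(pool_size, connections_per_transaction)
    return min(1.0, concurrent / cpus)


if __name__ == "__main__":
    hold_time_sec = 5.0  # hypothetical time a transaction holds its connections
    for pool_size, cpus in ((10, 8), (10, 16), (30, 8), (30, 16)):
        print(f"pool={pool_size:2d} cpus={cpus:2d} "
              f"concurrent<={max_concurrent_transactions(pool_size, 2):2d} "
              f"throughput<={throughput_ceiling(pool_size, 2, hold_time_sec):.1f}/sec "
              f"utilization<={cpu_utilization_ceiling(pool_size, 2, cpus):.1%}")
```

Under these assumptions, adding CPUs leaves the throughput ceiling unchanged and only lowers the utilization ceiling; raising the pool size is what lifts the cap.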

We start the model analysis with a pool size of ten connections. Model 2 forecasts
that transaction time starts to degrade exponentially at 250 users, but the reason for
the degradation is not overutilization of system servers: the most loaded is the
Database server, and it is running at only up to 50% of its capacity for 300 users
(Figure 4).

Figure 4 Utilization of system servers and transaction time

(Database server has 8 CPUs, connection pool has 10 connections)

Increasing the Database server capacity by an additional 8 CPUs brings its total number

to 16 but does not fix the bottleneck; it reduces Database server utilization to 25% but

almost doubles transaction time (Figure 5).

[Chart: number of users (1, 100, 200, 300) on the horizontal axis; series are transaction time (sec), Database server utilization (%), Application server utilization (%), and Web server utilization (%).]



This EA behavior is explained by the insufficient size of the database connection
pool, which prevents the EA from using the database server until idle connections
become available. Let’s increase the pool size to 30 for a Database server with 8
CPUs. Model 2 predicts that the software bottleneck will be fixed (Figure 6) and
transaction time for 300 users will be down to 15 sec.







Evidently, a more powerful Database server is needed because the one with eight CPUs
is 90% utilized for 300 users. We solved Model 2 for a Database server with 16 CPUs
and a connection pool with 30 connections (Figure 7).



This architecture eliminated both software and hardware bottlenecks, delivered
transaction time that is consistent across all numbers of users, and brought
utilization of the Database server under 35%.

Takeaways from the article

A successful launch of EAs into the cloud depends on a cloud provider’s ability to
perform the functions below, which extend the traditional system management framework.

Monitoring and characterizing EA workload

The only unvarying feature of an EA’s workload is its dynamism. A changing workload
requires changes in hardware capacity to keep EA transaction times in line with the
SLA. The estimates of hardware capacity and the values of software tuning parameters
needed to satisfy the SLA are based on EA modeling; input data for the models are
collected by workload monitoring.

Proactive evaluation of EA capacity and transaction time compliance with SLA using queuing models

Queuing models factor in cloud architecture, processing times on different servers,
hardware specifications, and user workload. They also assess the performance
implications of software parameters like the number of threads, connections to system
resources, etc. In addition, models provide data needed to assess the service cost to
a cloud’s customers and the provider’s revenue. Models convey an estimate of the
SLA-compliant maximum throughput, a parameter the cloud provider has to know in order
to receive the highest revenue. When the EA is fine-tuned and delivers the
SLA-compliant maximum throughput, three goals are achieved: 1) the cloud’s hardware
capacity is efficiently utilized; 2) transaction time is in line with the SLA; 3) the
cloud provider has the highest return on investment (assuming the customer pays per
executed transaction).

Monitoring business transactions

Hardware performance counters have always been observed by IT, and they continue to
be on the dashboards of cloud providers. Unfortunately, transaction time monitoring is
predominantly neglected by system management applications, despite the fact that it
represents the most important SLA requirement. As we have demonstrated using models,
software bottlenecks lead to underutilization of hardware; a provider that watches
only hardware utilization might come to the wrong conclusion that the system has
sufficient capacity and can process an even more intense workload. By measuring
business transaction times, the provider knows when the SLA is violated and can
immediately start looking for hardware and software bottlenecks.

Identification and fixing software bottlenecks

Queuing models analyzed in this article have shown that software bottlenecks block EA
access to available hardware resources, increasing transaction time and keeping
hardware utilization low. If a software bottleneck is not identified and transaction
time is not monitored, a cloud provider can be under the impression that the system
works well, because hardware utilization does not exceed the predetermined critical
levels that trigger an alarm. Trying to fix software bottlenecks by increasing
hardware capacity brings hardware utilization even lower; such a solution also
increases transaction times as more transactions compete for a limited number of
software threads or database connections. The only efficient corrective action is to
change the values of the appropriate software parameters, which requires a cloud
provider to be familiar with EA functionality. When a cloud provider is in charge of a
number of EAs, this holds true for all of them.
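The reasoning above can be condensed into a simple triage rule that combines transaction time with hardware utilization. The sketch below is only an illustration of that rule: the function name, the returned labels, and the 80% alarm level are hypothetical, and a real diagnosis would still require knowledge of the EA's tuning parameters.

```python
def diagnose(transaction_time_sec: float, sla_sec: float,
             utilization: dict[str, float], alarm_level: float = 0.8) -> str:
    """Suggest where to look first when a transaction time is measured.

    utilization maps a server name to its utilization as a fraction (0.0 to 1.0).
    alarm_level is a hypothetical critical utilization level.
    """
    if transaction_time_sec <= sla_sec:
        return "within SLA"
    busiest_server = max(utilization, key=utilization.get)
    if utilization[busiest_server] >= alarm_level:
        return f"suspect hardware bottleneck on {busiest_server}"
    # The SLA is violated while the hardware looks idle: the pattern that
    # hardware-only monitoring cannot see.
    return "suspect software bottleneck (tuning parameters)"


if __name__ == "__main__":
    servers = {"Database": 0.50, "Application": 0.30, "Web": 0.10}
    print(diagnose(45.0, 10.0, servers))
    print(diagnose(15.0, 10.0, {"Database": 0.90, "Application": 0.30, "Web": 0.10}))
    print(diagnose(6.0, 10.0, servers))
```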

The cloud provider has to evaluate its ability to deal with the challenges noted in
this article when offering services to customers. For their part, customers have to be
vigilant and perform their own due diligence to ensure that EAs are handed over to
providers capable of successfully maintaining them in the cloud.

About the author

For the last fifteen years, as an Oracle consultant, the author has been engaged
hands-on in performance tuning and sizing of enterprise applications for various
corporations (Dell, Citibank, Verizon, Clorox, Bank of America, AT&T, Best Buy, Aetna,
Halliburton, Pfizer, AstraZeneca, Starbucks, etc.).
