Batch Processing With J2EE
Chris Adkin
28th December 2008, Last Updated 13th May 2009

J2EE Batch Processing

Presentation on J2EE batch process design, tuning and performance.


Page 2: J2EE Batch Processing

2

Introduction

For the last two years I have worked on a project testing the performance and scalability of batch processes using a J2EE application server.

This presentation summarises my findings and conclusions based upon this work.

Page 3: J2EE Batch Processing

3

Introduction

There is scarce information:-

on batch processing using J2EE in the public domain.

on the end to end tuning of J2EE architectures which use Oracle for persistence.

There is a lack of information within the DBA community on performance tuning with respect to J2EE that goes beyond JDBC usage.

Most J2EE material only goes as far down to the database as persistence frameworks and JDBC.

The available information is not as "joined up" as it could be.

Hopefully, this presentation may fill some of these gaps and bridge the divide between J2EE and database tuning.

Page 4: J2EE Batch Processing

4

Design Considerations

Page 5: J2EE Batch Processing

5

Design and Architecture Considerations

Use third party tools and frameworks:- Spring Batch, Quartz

J2EE application server extensions:- IBM WebSphere Compute Grid

Write your own infrastructure; devx has a good example.

Page 6: J2EE Batch Processing

6

Considerations For Available Infrastructures

Quartz
Not a full blown batch infrastructure and execution engine, just a scheduler.

Spring Batch
Version 2.0 was not available at the time of inception of my project.
Version 1.0 is only designed to run on one JVM and was written for JSE 1.4.
Earlier versions of Spring can compromise the transaction integrity of the application server; refer to this article.

Page 7: J2EE Batch Processing

7

A Word On Frameworks

Leverage frameworks wherever you can to reduce design, coding and testing effort.

However, a batch environment requires more than just the use of a framework.

Other factors to consider are:-

Quality of service

High availability

GUI based batch and job control

Scalability, clustering, grid / caching solution usage

Job control languages

Page 8: J2EE Batch Processing

8

Considerations For Available Infrastructures

WebSphere Compute Grid

IBM has a long track record in both the J2EE and batch processing worlds.

"a complete out-of-the-box solution for building and deploying Java-based batch applications on all platforms supported by WebSphere" according to this article.

Integrates the tightest with WebSphere out of all the available options, but also ties you into WebSphere.

Requires WebSphere Network Deployment as a pre-requisite.

Not just a batch job processing infrastructure but a grid as well.

Comes with full tooling for developing batch jobs.

Page 9: J2EE Batch Processing

9

Off The Shelf "Batch Containers"

WebSphere Compute Grid is essentially a batch container with Eclipse based tooling support.

If batch processing forms a significant part of your IT infrastructure requirements, seriously consider using an off the shelf "batch container"; after all, most people would not write their own:-

EJB container
Web container
SIP container
Portlet container
Spring framework

Do not write your own "batch container" on a whim.

Page 10: J2EE Batch Processing

10

Infrastructure Considerations

Workload partitioning and scalability

Can the workload be subdivided for distribution amongst worker threads and nodes in a J2EE application server cluster?

Does the infrastructure scale across JVM threads? A grid? J2EE application servers in a cluster? Multiple JVMs via JMS and associated queuing technologies?

Page 11: J2EE Batch Processing

11

Infrastructure Considerations

Job traceability

Does the framework give visibility of each stage of processing that a job is at?

Can the level of logging / tracing / auditing be changed for individual batch jobs, and how fine grained is this?

Exception handling

Does the framework allow for this?

Page 12: J2EE Batch Processing

12

Infrastructure Considerations

Resource consumption management

Control over CPU utilisation.

Extensibility

Do you have to get your hands dirty with maintaining the framework, or can you just 'drop' your business logic into it?

Is the framework flexible in handling the delivery of jobs from different sources? JMS, web services? etc.

Is the framework flexible in integrating with different end points?

Page 13: J2EE Batch Processing

13

Infrastructure Considerations

Scheduling and event notification

Does the framework provide a scheduling mechanism, or can it easily hook into third party scheduler products? In particular the more popular schedulers such as BMC Control-M or Tivoli Maestro?

Does the framework provide hooks into a pager and / or email event notification system?

Page 14: J2EE Batch Processing

14

Infrastructure Considerations

Resilience

If a job or batch fails, will it bring the whole application server down?

If a batch fails, does it roll back and leave the application in a consistent state?

Can batches be re-started without any special steps having to be performed?

Page 15: J2EE Batch Processing

15

Batch Environment Components

Batch execution environment

The actual batch run time environment.

Batch 'container' software to provide the services for a batch to run.

Scheduling

Does the environment provide this, or hooks into third party schedulers?

The application itself

Page 16: J2EE Batch Processing

16

What Does J2EE Provide For A Batch Environment

Pooling for the efficient management of resources.

Access to logging frameworks, Apache log4j, Java Util Logging (JUL).

Rich integration infrastructure via:-

J2EE Connector Architecture and JDBC
Java Messaging Services
Web Services
Web Service based publish / subscribe style event processing via WS-Notification
Session Initiation Protocol (SIP)
Service Component Architecture (provided in WebSphere 7 via a feature pack).

Page 17: J2EE Batch Processing

17

What Does J2EE Provide For A Batch Environment

Asynchronous processing via message driven beans.

Transaction support via JTS and an API via JTA.

Scalability across multiple Java Virtual Machines

Most J2EE application server vendors offer clustered solutions.

Scalability across multiple Java threads

Threading is not supported in the EJB container by definition of the J2EE standard; however, it can be simulated using a client JVM or asynchronous beans.

Security via JAAS.

Page 18: J2EE Batch Processing

18

Clustering

To achieve scale out clustering might be required at some stage.

Some people confuse distributed object architectures with clustering; the two are not the same.

Make your architecture and design cluster friendly via the points on the following slides.

Page 19: J2EE Batch Processing

19

Clustering

Bean types and work load balancing

Calls to instances of stateless beans can be balanced across a cluster.

Only calls to stateful bean home objects can be balanced across a cluster.

Calls to entity beans are directed to the node on which the transaction they are associated with is still active.

Page 20: J2EE Batch Processing

20

Clustering

Scale proof the architecture by using cluster friendly caching features, e.g the DistributedMap.

Use WebSphere "prefer local" workload management policies. Beans will prefer using co-located beans when making method calls so as to minimise remote method calls.

Avoid application properties files associated with specific nodes; LDAP and the use of JNDI stores are popular solutions for resolving this issue.

Only beans with serializable classes will be able to use WebSphere failover capabilities.

Page 21: J2EE Batch Processing

21

ORM Considerations

Many frameworks are available: iBATIS, TopLink, Spring, Hibernate and IBM pureQuery.

The Java Persistence API lessens the need for such frameworks.

Few frameworks utilise the Oracle array interface.

Use of a framework can vastly reduce the amount of code required to be written.

A "half way house" is to use a JDBC wrapper.

Page 22: J2EE Batch Processing

22

ORM Considerations

Questions to ask when choosing an ORM:-

Can custom SQL be used?

Can SQL be hinted?

Does it have caching capabilities?

Does it allow stored procedures to be called? Both PL/SQL and Java.

Does it allow for any batch / bulk operations to be performed? e.g. wrappers for the JDBC batching API.

Page 23: J2EE Batch Processing

23

ORM Considerations

A hybrid approach can be adopted, for example:-

Read only entity beans for accessing standing data; these have been highly optimised of late, as per this article.

JDBC for leveraging database vendor specific features; the Spring JDBC template takes a lot of the effort out of writing JDBC code.

Hibernate for most of the 'simpler' database access. There are many real world projects that have used both Hibernate and JDBC successfully for this.

Page 24: J2EE Batch Processing

24

Caching Considerations

What is the percentage split between read and write activity against data stored in the database?

Read intensive: caching needs to be seriously considered.

Write intensive: consider stored procedures and leveraging bulk operations as much as possible. Oracle and DB2 both support Java stored procedures; leverage the skills of J2EE developers within the database!

Whatever you do, frequently accessed standing data should always be cached.

Page 25: J2EE Batch Processing

25

Caching Considerations

Jobs processed within the same batch may not reuse the same data.

Consecutive batch processes that follow on from one another might reuse the same data.

In the worst case scenario the only caching benefits that will be realised are:-

Write behind when persisting changes.

Caching of standing, fixed domain or configuration data.

Page 26: J2EE Batch Processing

26

Caching Considerations

Is a custom caching design going to be used?

"Scale proof" this using Network Deployment friendly memory structures such as DistributedMap.

The simplest solution is to cache within the same heap as the application server JVM.

More advanced solutions include grids with distribution and replication capabilities.
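As an illustration of the simplest in-heap option, a get-or-load cache can be sketched over a ConcurrentHashMap; the class and method names here are invented for this example, not taken from the project:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal in-heap cache: entries live in the application server JVM's own heap.
// In a WebSphere Network Deployment cluster, a DistributedMap would play this
// role instead, so entries can be shared and invalidated across members.
class StandingDataCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a JDBC lookup of a standing data table

    StandingDataCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Loads through to the database on first access, serves from heap thereafter.
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }
}
```

A call such as `cache.get("GB")` would hit the loader once and be served from the heap on every subsequent call.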

Page 27: J2EE Batch Processing

27

Caching Considerations

Is an off the shelf caching solution to be used? Two broad categories:-

Object caches, which can often cache whole object graphs.

Relational caches, e.g. solidDB and Oracle TimesTen.

Object caches such as Oracle Coherence and WebSphere eXtreme Scale also have grid like processing capabilities.

These are intrusive technologies that need to be factored into development.

Some ORMs can use a backing cache; Hibernate and JPA, for example, can both use DistributedMaps and eXtreme Scale.

Page 28: J2EE Batch Processing

28

Caching Considerations

Choosing between a relational and an object caching solution depends upon:-

Does the integration layer expect objects or relational data?

Are you retro fitting the cache to an existing piece of software? Again, if the software expects relational data, the use of a relational solution will incur the least effort.

Objects talking to objects is faster and more scalable than objects talking to relational data.

A common use case is to use an object cache to 'front' a relational database.

Page 29: J2EE Batch Processing

29

Logging Considerations

The attention that needs to go into this part of the infrastructure goes far beyond the usage of a non-blocking logging framework.

The main things to consider are:-

Managing the volume of logging generated; if you are not careful you can end up with an I/O bound application server!

Job thread traceability

Page 30: J2EE Batch Processing

30

Logging Considerations

Job thread traceability

You will invariably end up with multiple container managed threads executing the same batch workload simultaneously for throughput purposes.

The ability to follow the execution steps of units of work through the application log files should be a basic operational requirement of the software.

Page 31: J2EE Batch Processing

31

Logging Considerations

The log4j properties file allows thread numbers to appear in log messages without any code changes.

However, during the life cycle of a batch, it is highly likely that threads will be reused across more than one unit of work.

Most batch oriented applications will associate a unique identifier with each discrete unit of work, based on a database sequence.

Therefore, always include this unit of work sequence number in all log messages where possible.

IBM developerWorks has a useful article on logging in multi threaded applications.
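The unit of work identifier idea can be sketched with JUL (which features on a later slide): hold the current sequence number in a ThreadLocal and prepend it to every message via a custom Formatter. The classes below are illustrative, not the project's code:

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Tracks the unit of work a thread is currently processing; threads are reused
// across units of work, so the value would be reset at each job boundary.
class UnitOfWork {
    static final ThreadLocal<Long> ID = new ThreadLocal<>();
}

// Prepends thread id and unit of work sequence number to each log message, so
// a single job can be followed through an interleaved multi threaded log.
class BatchLogFormatter extends Formatter {
    @Override
    public String format(LogRecord r) {
        Long uow = UnitOfWork.ID.get();
        return String.format("[thread %d] [uow %s] %s%n",
                r.getThreadID(), uow == null ? "-" : uow, r.getMessage());
    }
}
```

The same pattern is available in log4j via its mapped diagnostic context (MDC).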

Page 32: J2EE Batch Processing

32

Logging Considerations

A full discussion on logging frameworks is beyond the scope of this material.

Log4j and java util logging (JUL) have their own unique strengths and weaknesses.

However, JUL allows the logging level to be configured on a per class basis via both the WebSphere admin console and wsadmin.

This may be useful in reducing the amount of logging the application generates by enabling debug logging to be turned on for specific classes of interest.
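As a hedged illustration, the same per-class control is available programmatically through the JUL API; the class name passed in below is hypothetical:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Turn on debug (FINE) logging for one class of interest while the rest of the
// application stays at its default level; in WebSphere the same effect is
// achieved from the admin console or wsadmin without touching code.
class LogLevelDemo {
    static Logger debugOneClass(String className) {
        Logger target = Logger.getLogger(className);
        target.setLevel(Level.FINE);    // this class now logs debug detail
        Handler h = new ConsoleHandler();
        h.setLevel(Level.FINE);         // the handler must also pass FINE records
        target.addHandler(h);
        return target;
    }
}
```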

Page 33: J2EE Batch Processing

33

Design Challenges

Resource Utilisation

Using a database for persistence incurs performance penalties:-

Network round trips

Latency in data retrieval and modification

Object Relational impedance mismatch, the "Vietnam of Computer Science".

Page 34: J2EE Batch Processing

34

Design Challenges

Resource Utilisation

Well designed and written batch processes may saturate CPU capacity on the application server:-

Good for throughput.

Spare CPU capacity may be required to run multiple batches simultaneously in "catch up" scenarios.

Not so good for any other non-batch activities using the environment.

Sustained spikes in J2EE application server CPU utilisation whilst batch processes are running, and low CPU activity at other times.

Page 35: J2EE Batch Processing

35

Design Challenges

ORM (Object Relational Mapping) frameworks

There are a multitude of ORM frameworks on the market.

ORM frameworks abstract away the underlying database.

Hibernate includes a batching API allowing something akin to JDBC statement batching to be achieved.

A framework may not necessarily allow vendor specific database features to be leveraged.

J2EE persistence has come a long way with Java EE 5 in the form of the Java Persistence API, both in terms of functionality and performance.

Page 36: J2EE Batch Processing

36

Design Challenges

Raw JDBC

Statement batching support from JDBC 2.0 onwards.

Fetch size is configurable:-

At statement level within the source code.
Via the data source custom properties using defaultRowPrefetch.

Will result in more "hand rolled" code than that required with an ORM framework; the Spring JDBC template can help here.

Provides access to vendor specific performance related features such as the "Oracle array interface" and binary large object manipulation.

Requires more skill on the part of the Java programmer in terms of having SQL and database knowledge.

Development team might require a DBA.
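A minimal sketch of JDBC 2.0 statement batching, assuming a hypothetical jobs(id, status) table that is not from the project; the roundTrips helper shows the arithmetic behind the saving:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class JdbcBatchSketch {
    // Sends updates to the database in groups: one network round trip per
    // executeBatch() call instead of one per row.
    static void markProcessed(Connection con, long[] jobIds, int batchSize) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE jobs SET status = 'DONE' WHERE id = ?")) {
            int pending = 0;
            for (long id : jobIds) {
                ps.setLong(1, id);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();  // one round trip for the whole group
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();      // flush the final partial batch
            }
        }
    }

    // Round trips needed to ship `rows` rows in groups of `batchSize`:
    // the ceiling of rows / batchSize.
    static int roundTrips(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }
}
```

For 150,000 jobs, a batch size of 500 turns 150,000 round trips into 300. Fetch size is the read-side equivalent: `stmt.setFetchSize(n)`, or the defaultRowPrefetch data source property.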

Page 37: J2EE Batch Processing

37

Hand Written SQL

Most database vendors will support advanced language features such as MERGE statements, sub query factoring, inline views and inline functions. Fewer statements is better for performance and scalability.

An ORM framework might not necessarily allow these features to be leveraged.

JPA gives the best of both worlds; custom SQL can be used via named queries.

Page 38: J2EE Batch Processing

38

Design Challenges

SQLJ

Is essentially a JDBC wrapper; SQLJ calls are translated into JDBC calls by a pre-processor.

Can achieve similar results to JDBC with less coding.

Support for statement batching.

SQLJ syntax can be checked at compile time.

Does not support the Oracle array interface.

An IBM SQLJ reference. Oracle SQLJ examples.

Page 39: J2EE Batch Processing

39

Design Challenges

Can the Oracle array interface be leveraged?

Despite all the choices available, only raw JDBC provides access to the Oracle array interface.

There may come a point in scaling your architecture when the Oracle array interface needs to be used, in order to:-

Minimize network round trips
Minimize parsing
Leverage bulk operations within the database

For large volumes of data this is faster than statements that process one row at a time by orders of magnitude.

Refer to these figures.

Page 40: J2EE Batch Processing

40

ORM Summary

One of the challenges of designing the software architecture is how to construct the layers and tiers.

In my opinion equal thought should go into where to carry out the processing.

For some scenarios, carrying out the processing within a database may be the best option.

There will be other situations in which carrying out the processing within the application server is a “No brainer”.

Do not be precious about trying to do everything in the database or everything in the application server.

Page 41: J2EE Batch Processing

41

Design Challenges

Can Oracle 11g client side caching be used?

An extension of the technology that allows results to be cached in the server shared pool, but on the "client side".

Requires the use of the thick JDBC driver.

Can vastly reduce network round trips, data access latency and CPU utilisation on the database server.

An excerpt from the 360 degree programming blog:-

"Running the Nile benchmark [3] with Client Result Cache enabled and simulating up to 3000 users results in:
Up to 6.5 times less server CPU usage
15-22% response time improvement
7% improvement in mid-tier CPU usage"

Page 42: J2EE Batch Processing

42

To Batch Or Not To Batch

When real time asynchronous processing is applicable

Processing needs to take place as soon as the source data arrives, which does not all come at the same time.

When the processing window is too small to process all the jobs in one batch and when the jobs arrive continuously throughout the day.

Jobs are delivered asynchronously.

Page 43: J2EE Batch Processing

43

To Batch Or Not To Batch

When a batch environment is applicable

If the jobs processed are delivered in batches, this will to a degree enforce batch type processing.

When files need to be generated for delivery to another organisation.

If migrating from a non-J2EE legacy batch environment to J2EE, stick to batch in the first iterations of development, rather than jump to J2EE and an event processing architecture in one "quantum leap".

Page 44: J2EE Batch Processing

44

A “Third Way” Hybrid Environment

A real world example of where this is in operation

Most retailers aggregate sales information from their point of sale (POS) systems for processing at the head office.

Larger retailers tender so many transactions that processing them within a single batch window is not practical.

Therefore, for some retailers, information from the POS systems is continuously trickled to the head office and then batched up for processing when a certain number of files have been received.

Page 45: J2EE Batch Processing

45

Our Batch Process Design

J2EE tier

WebSphere launch client to instigate batch processes.

Client uses Java threads to fire off multiple requests at the application server and hence 'simulate' threading within the application server. Something similar could be achieved with message driven beans.

A batch session bean to process arrays of jobs within a loop inside the WebSphere application server.

Stateless session beans.

Each job is processed within its own container managed transaction.

Application configurable max threads per batch process and max jobs per thread.
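The threading controls described above can be sketched with a plain ExecutorService standing in for the launch client's threads; the class and method names are illustrative, not the project's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class BatchLaunchSketch {
    // Splits the job list into chunks of at most jobsPerThread, mirroring the
    // "max jobs per thread" control.
    static List<List<Long>> partition(List<Long> jobs, int jobsPerThread) {
        List<List<Long>> chunks = new ArrayList<>();
        for (int i = 0; i < jobs.size(); i += jobsPerThread) {
            chunks.add(jobs.subList(i, Math.min(i + jobsPerThread, jobs.size())));
        }
        return chunks;
    }

    // Each chunk would become one call into the batch session bean; the bean
    // then loops over the chunk, one container managed transaction per job.
    static void run(List<Long> jobs, int maxThreads, int jobsPerThread,
                    Consumer<List<Long>> beanCall) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (List<Long> chunk : partition(jobs, jobsPerThread)) {
            pool.submit(() -> beanCall.accept(chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

In the real design the `beanCall` would be a remote invocation on the stateless batch session bean rather than an in-process lambda.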

Page 46: J2EE Batch Processing

46

Our Batch Process Design

Persistence (Oracle) tier

Raw JDBC and the Oracle thin driver.

Some use of JDBC statement batching.

Oracle 10g release 2 for the database.

Limited use of stored procedures.

J2EE tier data caching limited to standing data:-

Data cached in XML within the application server.

When a standing data table is accessed for the first time it is cached.

All subsequent retrievals are via XPath.
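The XML-plus-XPath lookup can be sketched with the JDK's built-in DOM and XPath support; the document shape and country data below are invented for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlStandingDataSketch {
    // On first access the standing data table would be read over JDBC and
    // serialised to XML like this; thereafter all lookups are XPath queries
    // against the cached Document.
    static Document cacheTable() throws Exception {
        String xml = "<countries>"
                   + "<country code='GB'><name>United Kingdom</name></country>"
                   + "<country code='FR'><name>France</name></country>"
                   + "</countries>";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // A lookup against the in-heap cached document, no database round trip.
    static String lookupName(Document cached, String code) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate("/countries/country[@code='" + code + "']/name", cached);
    }
}
```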

Page 47: J2EE Batch Processing

47

Our Batch Process Design

Not a true batch implementation as such.

Web GUI, web service(s) and hand held units can be and are used whilst 'batch' processes run.

'Batch' in the context that large numbers of jobs are processed together within specific time windows.

All batch control is via the WebSphere launch client; no GUI based job control.

Page 48: J2EE Batch Processing

48

Performance Monitoring and Tuning “Tool Kit”

Application Server and Client JVM

Verbose garbage collection output
WebSphere Performance Monitoring Infrastructure (PMI)
WebSphere performance advisor
Java thread dumps
JProfiler Java profiler

Oracle Database

10g performance infrastructure, advisors, ADDM, time model etc.

Operating System Tools

prstat, sar, vmstat, iostat etc.

Veritas volume management monitoring tools

vxstat

Page 49: J2EE Batch Processing

49

Performance Monitoring and Tuning “Tool Kit”

Available IBM WebSphere tools not used on the project:-

IBM Support Assistant plugins, namely the thread analyzer and verbose garbage collection output analyzer.

ITCAM for Response Time Tracking (uses ARM, see below).

ITCAM for WebSphere.

WebSphere Application Response Metrics (ARM).

Available Sun tools not used on the project:-

jstat
jconsole

Page 50: J2EE Batch Processing

50

Batch Architecture Deployment Diagram

Launch Client

WebSphere 6.1 Application Server

EJB Container

Domain Layer

Data Access Layer

Utility Services (batch manager, logging, exception handling, standing data cache etc)

Domain Interface Layer

Data Access Interface Layer

Oracle Database Server

The launch client calls the application server over RMI; the data access layer calls the Oracle database server over JDBC.

Page 51: J2EE Batch Processing

51

Software Architecture

Classical horizontally layered architecture

Apache Struts 1 => out of the can MVC framework.

Business logic tier implemented using stateless beans, session façade, business delegate and service locator patterns.

Data access layer written using stateless beans, raw JDBC and the data transfer object pattern.

Utility layer providing logging, exception handling, service locator, EJB home caching, standing data cache and parameters and controls functionality.

Page 52: J2EE Batch Processing

52

Software Architecture

Vertical layering also

Functional areas divided into vertical slices that go through both the business logic / domain layers and the data access / integration layer.

Loose coupling of vertical slices via 'manager' beans, the session façade design pattern and coarse interfaces.

Page 53: J2EE Batch Processing

53

Software Performance Features

Domain / business logic layer

Cached standing data
EJB home caching (service locator design pattern)
Use of session façade pattern with coarse interfaces
All beans are stateless

IBM consider this to be a best practice.

Invocations of methods on stateless session bean instances can be load balanced across clusters, unlike invocations on stateful bean instances.

The J2EE community regards stateless beans as better than stateful beans for performance.

Page 54: J2EE Batch Processing

54

Software Performance Features

Data Access Layer

Use of Data Transfer Objects.

JDBC connection pooling; min and max settings on the JDBC pool set to the same value to prevent connection storms.

JDBC statement batching used in places.

JDBC prepared and callable statements used so as not to abuse the Oracle database shared pool.

Soft parsing may still be an issue, but can be reduced slightly by using session_cached_cursors.

General design

Batch process threading for scale out.

Page 55: J2EE Batch Processing

55

Batch Design Sequence Diagram

Participants: Batch Client, J2EE Container, Database

1: Start the batch process
2: Create a batch record with start time
3: Get no. of threads and no. of jobs per thread parameters
4: Retrieve the parameters
5: Returns
6: Get the list of SPRs / jobs to be processed
7: Retrieve the SPRs / job ids
8: Returns a list of SPRs / job ids
9: Create no. of threads and pass the 'job list' as parameter
10: Each thread makes a call to a bean method, passing the 'job list' as parameter
11: Loop through each SPR / job id within the 'job list' to process them
12: On completion, each thread ends here
13: Update the batch record with status and end time

Page 56: J2EE Batch Processing

56

Where Does The Source Data For Our Batch Processes Originate?

Flat files delivered via ftp

Web Services

A third party off the shelf package via JNI

Hand Held Units using J2ME

Page 57: J2EE Batch Processing

57

Design Critique

Page 58: J2EE Batch Processing

58

Pros

Design can scale out via threads.

Design can scale out across multiple JVMs.

Design is simple and clean.

Because of the online usage, the row by row processing simplifies the design.

Complex code might be required to allow for both batch array processing and on line usage.

Page 59: J2EE Batch Processing

59

Pros

If a single job fails the whole batch does not need to be rolled back.

CPU usage of batch can be controlled by changing the number of threads.

Provides a framework for the batch infrastructure.

Page 60: J2EE Batch Processing

60

Cons

Inefficiencies by design when accessing the database

Limited opportunities for leveraging the JDBC batching API and the Oracle array interface.

Design is prone to a lot of 'chatter' between the application and database servers.

Large soft parse overhead.

Page 61: J2EE Batch Processing

61

Cons

HHU job retrieval may be more conducive to an event processing architecture than a batch architecture:-

Better for more even CPU utilisation.

We have to maintain the infrastructure code as well as the business logic / domain code.

Is there a better way of simulating threading that could reduce the role of the launch client, message driven beans perhaps? i.e. limiting the role of the launch client in batch processing will be better for performance and scalability.

Page 62: J2EE Batch Processing

62

Network Round Trip Overheads

Database utilisation – network round trip overhead

From "Designing Applications For Performance And Scalability":-

“When more than one row is being sent between the client and the server, performance can be greatly enhanced by batching these rows together in a single network roundtrip rather than having each row sent in individual network roundtrips. This is in particular useful for INSERT and SELECT statements, which frequently process multiple rows and the feature is commonly known as the array interface.”

There is minimal scope for leveraging the array interface (and also the JDBC batching API) using our design.

Page 63: J2EE Batch Processing

63

Parsing Overheads

Best J2EE programming practice dictates that resources should be released as soon as they are no longer required.

All cached prepared statement objects are discarded when the associated connection is released.

This could be coded around, but would lead to code that is both convoluted and prone to statement cache leaks.

Page 64: J2EE Batch Processing

64

Parsing Overheads

The Statement API is more efficient than the PreparedStatement JDBC API for the first execution of a statement.

Subsequent executions of a prepared statement are more efficient and more scalable.

Using the Statement API would be less resource intensive on the application server but more resource intensive on the database.

Page 65: J2EE Batch Processing

65

Parsing Overheads

Should the prepared statement cache size be set to zero?

No point in bearing the overheads associated with cached statement object creation.

Will also create unnecessary pressure on the JVM heap.

Page 66: J2EE Batch Processing

66

Parsing Overheads

Why is parsing such a concern?

Oracle's Tom Kyte and the Oracle Real World Performance group stress that the importance of parsing and efficient cursor use cannot be overstated when it comes to the scalability of applications that use Oracle.

This is not a problem unique to Oracle; WebSphere and DB2 material advocates the use of static SQL for the very same reason of avoiding parsing.

Page 67: J2EE Batch Processing

67

Parsing Overheads

Database utilisation – soft parse overhead

The "Designing Applications For Performance And Scalability" Oracle white paper categorises the type of SQL usage in our design as:-

"Category 2 – continued soft parsing. The second category of application is coded such that the hard parse is replaced by a soft parse. The application will do this by specifying the SQL statement using a bind variable at run-time including the actual value . . . Continued . . .

Page 68: J2EE Batch Processing

68

Parsing Overheads

Database utilisation – soft parse overhead

The application code will now look somewhat similar to:

loop
  cursor cur;
  number eno := <some value>;
  parse(cur, "select * from emp where empno=:x");
  bind(cur, ":x", eno);
  execute(cur);
  fetch(cur);
  close(cur);
end loop;"

Refer to "Soft things can hurt"!

Page 69: J2EE Batch Processing

69

Parsing Overhead

The Oracle Automatic Database Diagnostic Monitor (ADDM) reports on the performance impact of continuous soft parsing:-

FINDING 3: 13% impact (211 seconds)
-----------------------------------
Soft parsing of SQL statements was consuming significant database time.

RECOMMENDATION 1: Application Analysis, 13% benefit (211 seconds)
ACTION: Investigate application logic to keep open the frequently used cursors. Note that cursors are closed by both cursor close calls and session disconnects.

Page 70: J2EE Batch Processing

70

Parsing Overhead

"Category 3" processing as per the white paper is more efficient and what we should really be striving for, as per the pseudocode below:-

"cursor cur;
number eno;
parse(cur, "select * from emp where empno=:x");
loop
  eno := <some value>;
  bind(cur, ":x", eno);
  execute(cur);
  fetch(cur);
end loop;
close(cur);"
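The difference between the two categories can be made concrete with a small simulation; the Cursor class below is a counting stand-in, not a real database interface:

```java
// Stand-in cursor that counts parse and execute calls, to contrast the two
// white paper categories; nothing here talks to a real database.
class Cursor {
    int parses, executes;
    void parse(String sql) { parses++; }
    void bind(String var, long value) { /* bind only, no parse */ }
    void execute() { executes++; }
}

class ParseCategories {
    // Category 2: parse (softly) once per iteration, as our design does.
    static Cursor category2(long[] ids) {
        Cursor cur = new Cursor();
        for (long id : ids) {
            cur.parse("select * from emp where empno=:x");
            cur.bind(":x", id);
            cur.execute();
        }
        return cur;
    }

    // Category 3: parse once, then bind and execute inside the loop.
    static Cursor category3(long[] ids) {
        Cursor cur = new Cursor();
        cur.parse("select * from emp where empno=:x");
        for (long id : ids) {
            cur.bind(":x", id);
            cur.execute();
        }
        return cur;
    }
}
```

In JDBC terms, category 3 corresponds to creating one PreparedStatement and re-binding it inside the loop rather than preparing it on every iteration.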

Page 71: J2EE Batch Processing

71

Testing Environment

Page 72: J2EE Batch Processing

72

Monitoring And Tuning The Software

Lots of things to monitor and tune:- Client JVM Server JVM Application server Object Request Broker EJB container JDBC connection pool usage and statement cache Application code Database usage and resource utilisation Application server resource utilisation, mainly CPU Network between the application server and database server Number of threads per batch job Number of jobs per thread

Page 73: J2EE Batch Processing

73

Testing Environment

Performance targets based on actual run times of batch processes from the legacy environment.

In testing, 200% of the equivalent legacy workload was used, and the database was artificially 'aged' to give it the appearance of containing two years' worth of data.

Oracle 10g database flashback used to reproduce tests.

A large full table scan was used to clear out the Oracle db cache and the cache on the storage array, to prevent results from being skewed when repeating the same test again after making a performance optimization.

Page 74: J2EE Batch Processing

74

Test Work Load

Apart from the processing of flat files, most batch runs process between 120,000 and 180,000 jobs.

Few references will be made to this in the presentation:-

What we refer to as a 'job' will have little meaning to other people unless they are using the same application.

However, there is a consensus that a 'job' is something that requires a discrete set of actions to be performed against it in order to be processed.

Page 75: J2EE Batch Processing

75

Hardware and Software Platforms

IBM WebSphere application server 6.1 base edition, 32 bit.

Oracle Enterprise Edition 10.2.0.4.0 (10g release 2).

Solaris 10.

1 x 4 CPU (single core) Fujitsu Siemens Prime Power 450 with 32GB RAM to host the database.

1 x 4 CPU (single core) Fujitsu Siemens Prime Power 450 with 32GB RAM to host the application server.

100Mb Ethernet network.

EMC CX3-20F storage array for the database, accessed via fibre channel.

Page 76: J2EE Batch Processing

76

Hardware and Software Platforms

EMC CX3-20F storage array for the database, accessed via fibre channel, with:-
- Two Intel Xeon based storage processors
- Two trays of disks, with 15 disks per tray
- 1 GB cache

Page 77: J2EE Batch Processing

77

EMC CX3-20F Configuration

Despite being ‘Batch’ oriented, from a database perspective the ratio of logical reads to block changes is 92%.

Some people dislike RAID 5; we, however, think it is perfectly suitable for read intensive workloads:-
- i.e. spread the database files across as many disks as possible.
- Some disks will be lost to EMC vault disk usage.

RAID 1 was used for the redo logs and archived redo log files.

Cache on the array was split 50/50 between read and write usage, as per EMC recommended best practice.

The size of the database in terms of application segments was approximately 25 GB, not that large really.

Page 78: J2EE Batch Processing

78

Database Statistics

A classical approach to ascertaining application scalability is to look at resource consumption, latching in particular.

Refer to Tom Kyte’s runstats package.

The main problems with this were:-
- Flashing the database back between tests would result in the loss of any resource consumption data loaded into a table.
- This information could be written to a file, but this would mean expending effort in developing such a tool.

Fortunately, Oracle 10g provides an out of the box solution to this in the form of the db time model . . .

Page 79: J2EE Batch Processing

79

Database Statistics

What is db time?
- A statistic that comes with the 10g performance management infrastructure.
- The sum total of time spent in non-idle database calls by foreground processes across all sessions.
- !!! Not to be confused with “wall clock time” !!!
- Provides a single high level metric for monitoring database utilisation: higher db time = higher database utilisation.
- Makes tuning ‘simply’ a matter of reducing db time.
- Refer to this presentation from the architect at Oracle who invented this.

Page 80: J2EE Batch Processing

80

Monitoring And Tuning The Software

So as not to be drowned by statistics, the following high level statistics were chosen for monitoring purposes:-
- Oracle CPU usage
- Oracle database time
- Average database load (active sessions)
- WebSphere application server CPU usage

Page 81: J2EE Batch Processing

81

Database Statistics

Database load is a 10g statistic that usually accompanies db time, but what is this?
- Active sessions as reported by the 10g Automatic Database Diagnostic Monitor.
- It is calculated as db time / wall clock time.
- Higher average database load = greater database utilisation.
- High database utilisation = good throughput from the application server.
- Low database utilisation = some bottleneck in the application server is throttling throughput through to the database.
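As a small illustration (the class and method names are hypothetical), average database load is simply the ratio of the db time delta to the wall clock time over the same interval:

```java
public class DbLoad {

    // Average active sessions = change in db time / elapsed wall clock time,
    // with both inputs in the same unit (e.g. seconds).
    public static double averageActiveSessions(double dbTimeDelta, double wallClockDelta) {
        if (wallClockDelta <= 0) {
            throw new IllegalArgumentException("wall clock delta must be positive");
        }
        return dbTimeDelta / wallClockDelta;
    }
}
```

For example, 600 seconds of db time accumulated over a 60 second interval equates to an average load of 10 active sessions.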

Page 82: J2EE Batch Processing

82

How The db time Model Should Help

If, to begin with, the CPU usage on the application server is high and the db time expended in the database is low, this would imply some sort of bottleneck within the application server tier.

If a bottleneck is addressed in the application server and db time goes up, methods for reducing the db time should be looked at.

Page 83: J2EE Batch Processing

83

Identifying Performance Bottlenecks

How do we know where the bottleneck is?:-
- The Tivoli Performance Viewer EJB Summary report is a good place to start.
- In the example screen shot on the next slide, the total time expended by the batch manager session bean can be compared to the sum total time expended by the dbaccess module beans.
- Separate beans for accessing the database not only separate the integration layer access from the business logic, but help with performance tuning.

Page 84: J2EE Batch Processing

84

Identifying Performance Bottlenecks

How do we know where the bottleneck is?:-
- Separate beans for accessing the database not only separate the integration layer access from the business logic, but help with performance tuning.
- Some people prefer a data access tier as opposed to a layer, i.e. tiers can reside in their own demilitarized zones for the best possible security, as favoured by banks and financial institutions.
- Others argue that this is an anti-pattern and POJOs should be used.
- We found there was little difference in performance between local method calls and remote method calls when pass by reference was enabled on the object request broker.

Page 85: J2EE Batch Processing

85

Identifying Bottlenecks

Page 86: J2EE Batch Processing

86

Identifying Bottlenecks

From the screen shot on the previous slide (ScheduleManager is not associated with the batch processes):-
- batch manager bean time = 429,276,448
- time spent in dbaccess beans = 1,737,440
- dbaccess time as a % of the total = 0.40%
- The bottleneck might be on the application server !!!

There is also an EJB method summary report for drilling down further.

Page 87: J2EE Batch Processing

87

The ‘Carrot’ Model

Documents the thread usage in a J2EE application server’s generic components:-
- HTTP Server
- Web Container
- EJB Container (driven by the number of active ORB threads)
- JDBC Connection Pool
- Database

Page 88: J2EE Batch Processing

88

The ‘Carrot’ Model

Typically, utilisation should be high towards the ‘front’ of the application server (HTTP server) and gradually dwindle off towards the JDBC and JCA connection pools at the end; hence the ‘carrot’ analogy.

The exception is when the application architecture is similar to the Microsoft Pet Store .Net versus J2EE benchmark, i.e. there is little business logic outside the database.

Page 89: J2EE Batch Processing

89

The ‘Carrot’ Model

In summary, most of the load on the software stack will be carried by the J2EE application server.

Measuring the CPU on both the J2EE application server and the Oracle database server will show how well the ‘Carrot’ model applies to our architecture and design.

Page 90: J2EE Batch Processing

90

The ‘Carrot’ Model

[Chart: J2EE Component Utilisation — threads used (0 to 200) per component: HTTP Server, Web Container, ORB threads, JDBC Connection Pool, Database Sessions]

Page 91: J2EE Batch Processing

91

Software Configuration Base Line

Page 92: J2EE Batch Processing

92

Oracle Initialisation Parameters

commit_write                   BATCH, NOWAIT
cursor_sharing                 SIMILAR
cursor_space_for_time          TRUE
db_block_size                  8192
db_flashback_retention_target  999999
log_archive_max_processes      4
open_cursors                   65535
optimizer_index_cost_adj       100
optimizer_dynamic_sampling     1
optimizer_index_caching        0
pga_aggregate_target           4294967296
processes                      500
query_rewrite_enabled          TRUE
session_cached_cursors         100
sga_max_size                   5368709120
sga_target                     4697620480
statistics_level               TYPICAL
undo_management                AUTO
undo_retention                 691200
undo_tablespace                UNDO
workarea_size_policy           AUTO

Page 93: J2EE Batch Processing

93

WebSphere Configuration

Server JVM: -server -Xms2000m -Xmx2500m

Client JVM: -client -Xms200m -Xmx500m

JDBC Connection Pool:-
- Min connections 100
- Max connections 100

ORB configuration:-
- Min threads 100
- Max threads 100
- JNI reader thread pool set to 100
- Fragment size set to 3000

Page 94: J2EE Batch Processing

94

Application Configuration

Threads per batch process 100

Jobs per thread 100

Log4j logging level INFO

Page 95: J2EE Batch Processing

95

Notes On Oracle Parameter Settings

Cursor management has a major impact on the scalability of applications that use Oracle.

With this in mind, cursor_sharing, session_cached_cursors and cursor_space_for_time have all been explicitly set.

“Designing applications for performance and scalability” has some salient points regarding these parameters which will be covered in the next few slides.

Page 96: J2EE Batch Processing

96

Notes On Oracle Parameter Settings

A separate JTS transaction per job results in heavy usage of the Oracle log buffer and its associated synchronization mechanisms.
- The redo allocation latch is a unique point of serialisation within the database engine, therefore the log buffer needs to be used with care.
- Asynchronous and batched commit writes were introduced for this purpose; they help to prevent log file sync waits.

Page 97: J2EE Batch Processing

97

Tuning

Page 98: J2EE Batch Processing

98

Disclaimer

Tuning efforts of different projects will yield different results from those detailed here due to differences in the:-
- Software stack component versions, e.g. using Oracle 10.1 and not 10.2, WebSphere 6.0 or 7.0 and not 6.1, 64 bit WebSphere and not 32 bit
- Software stack component vendors, e.g. you may be using WebLogic or JBoss, and DB2 instead of Oracle
- J2EE application server and database server topology
- J2EE and database initialisation parameters
- Application architecture, design and coding
- Server hardware
- Data
- Etc . . .

Page 99: J2EE Batch Processing

99

Disclaimer

Despite all the reasons as to why your results might vary from those presented, the technical precepts behind what has been done should hold true for more than just the application tested here.

Page 100: J2EE Batch Processing

100

A Note On The Results

The tuning efforts were mainly focussed on tuning the software stack from an environment perspective.

In practice there were a lot more ‘tweaks’ made than those presented here; the optimisations have been distilled down to those which made the greatest impact.

Despite this, the biggest performance and scalability gains often come from:-
- The architecture
- The design
- The coding practices used

Page 101: J2EE Batch Processing

101

A Note On The Results

The next set of findings relate to the most ubiquitous type of batch process in our software.

This is a batch process that:-
- retrieves a list of jobs from the database
- partitions the jobs into ‘chunks’
- invokes beans in the application server via child threads, with these ‘chunks’ attached as objects
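The steps above can be sketched as follows. This is a minimal illustration, not the actual application's code: the class name, chunk sizes and the stand-in processChunk method are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchRunner {

    // Partition the retrieved job list into fixed-size chunks.
    public static <T> List<List<T>> partition(List<T> jobs, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < jobs.size(); i += chunkSize) {
            chunks.add(jobs.subList(i, Math.min(i + chunkSize, jobs.size())));
        }
        return chunks;
    }

    // Dispatch each chunk to a worker thread; in the real application the
    // worker would invoke a session bean with the chunk attached.
    public static void run(List<Long> jobIds, int threads, int jobsPerThread)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<Long> chunk : partition(jobIds, jobsPerThread)) {
            pool.submit(() -> processChunk(chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processChunk(List<Long> chunk) {
        // Stand-in for the EJB invocation that processes one chunk of jobs.
    }
}
```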

Page 102: J2EE Batch Processing

102

Finding 1: pass by copy overhead

Symptom:-
- db time, database load and CPU utilisation on the database server were all low
- CPU utilisation on the application server was at 100%

Root cause:-
- database access beans invoked by remote method calls

Action:-
- set pass by reference to ‘On’ on the Object Request Broker

Result:-
- Elapsed time 01:19:11 -> 00:41:58
- WebSphere CPU utilisation 96% -> 66%
- Db time / avg sessions 23470 / 4.1 -> 40071 / 14.5

Page 103: J2EE Batch Processing

103

Finding 2: threading

Symptom:-
- high db time and database load
- high CPU time attributed to the com.ibm.ws.util.ThreadPool$Worker.run method (visible via a Java profiler)

Root cause:-
- batch process threading set too high: 100 threads for 4 CPU boxes !!!

Page 104: J2EE Batch Processing

104

Finding 2: threading

Action:-
- lower the number of threads; the optimum was between 16 and 32 depending on the individual batch process

Result (threads 100 -> 32):-
- Elapsed run time 00:41:58 -> 00:36:45
- Db time / avg sessions 40071 / 14.5 -> 21961 / 8.9
- WebSphere CPU utilisation 66% -> 73%

Page 105: J2EE Batch Processing

105

Finding 3: db file sequential read overhead

Symptom:-
- “db file sequential read” event = 73.6% of total call time

Root cause:-
- job by job processing = heavy index range scanning

Action:-
- compress the most heavily used indexes

Result:-
- Elapsed run time 00:36:45 -> 00:36:38
- Db time / avg sessions 21961 / 8.9 -> 9354 / 3.6
- WebSphere CPU utilisation 73% -> 74%

Page 106: J2EE Batch Processing

106

Finding 4: Physical read intensive objects

Symptom:-
- ADDM advised that there were physical read intensive objects

Root cause:-
- With a batch process the same data is rarely read twice, except for standing / lookup data

Action:-
- ‘pin’ hot objects into a ‘keep’ pool configured in the db cache

Result:-
- Elapsed run time 00:36:38 -> 00:26:36
- Db time / avg sessions 9354 / 3.6 -> 4105 / 2.3
- WebSphere CPU utilisation 74% -> 87%

Page 107: J2EE Batch Processing

107

Finding 5: Server JVM heap configuration and ergonomics

Symptom:-
- major garbage collections take place once a minute

Root cause:-
- heap incorrectly configured

Action:-
- tune JVM parameters

Result:-
- Elapsed run time 00:26:36 -> 00:25:01
- Db time / avg sessions 4105 / 2.3 -> 3598 / 2.4
- WebSphere CPU utilisation 87% -> 86%

Page 108: J2EE Batch Processing

108

Finding 5: Server JVM heap configuration and ergonomics

The most effective JVM parameter settings were found to be those used by IBM in a WebSphere 6.1 benchmark on Solaris submitted to SPEC.

This resulted in one major garbage collection every 10 minutes.

Minimum heap size = 2880 MB
Maximum heap size = 2880 MB

initialHeapSize="2880" maximumHeapSize="2880" verboseModeGarbageCollection="true"

-server -Xmn780m -Xss128k -XX:-ScavengeBeforeFullGC -XX:+UseParallelGC -XX:ParallelGCThreads=24 -XX:PermSize=128m -XX:MaxTenuringThreshold=16 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC

Page 109: J2EE Batch Processing

109

Finding 5: Server JVM heap configuration and ergonomics

Usage of the JVM configuration from the IBM benchmark came after a lot of testing and experimentation via trial and error.

The Sun JVM tuning material supports this approach.

The heap is probably oversized for our requirements, but as a “first cut” at getting the configuration correct it is not a bad start.

Page 110: J2EE Batch Processing

110

Finding 6: Client JVM heap configuration and ergonomics

Symptom:-
- major garbage collections take place more than once a minute

Root cause:-
- heap incorrectly configured

Action:-
- tune JVM parameters

Result:-
- Elapsed run time 00:25:01 -> 00:24:20
- Db time / avg sessions 3598 / 2.4 -> 3704 / 2.5
- WebSphere CPU utilisation 86% -> 86%

Page 111: J2EE Batch Processing

111

Finding 6: Client JVM heap configuration and ergonomics

Client JVM configuration

JVM Options: -server -Xms600m -Xmx600m -XX:+UseMPSS -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=2 -Xss128k -Dcom.ibm.CORBA.FragmentSize=3000 -Dsun.rmi.dgc.client.gcInterval=4200000 -Dsun.rmi.dgc.server.gcInterval=4200000

Server diagnostic trace turned off.

Page 112: J2EE Batch Processing

112

Finding 7: Database Block Size

Symptom:-
- Significant latching around the db cache

Root cause:-
- Block size too small

Action:-
- Increase the block size from 8K to 16K
- larger block size = fewer index leaf blocks = fewer index branch blocks = smaller indexes = less physical and logical IO; less logical IO = less latching

Result:-
- Elapsed run time 00:24:20 -> 00:21:25
- Db time / avg sessions 3704 / 2.5 -> 2623 / 2
- WebSphere CPU utilisation 86% -> 93%

Page 113: J2EE Batch Processing

113

Finding 8: JVM aggressive optimizations

Symptom:-
- No symptom as such; load still predominantly on the application server

Root cause:-
- N/A

Action:-
- Further experimentation with the server JVM options resulted in aggressive optimizations (-XX:+AggressiveOpts) being used

Result:-
- Elapsed run time 00:21:25 -> 00:18:36
- Db time / avg sessions 2623 / 2 -> 2516 / 2.1
- WebSphere CPU utilisation 93% -> 85%

Page 114: J2EE Batch Processing

114

Finding 8: JVM aggressive optimizations

-XX:+AggressiveOpts had to be used with -XX:+UnlockDiagnosticVMOptions -XX:-EliminateZeroing, otherwise the application server would not start up !!!

The following excerpt from the Java Tuning White Paper, describing biased locking (one of the optimisations AggressiveOpts enables), should be heeded:-

“Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.”

Page 115: J2EE Batch Processing

115

A Note On The Results

The other type of batch process in our software involved reading from and writing to files, after the contents of the files / database tables had been validated against standing data.

This type of batch process was highly ‘Chatty’ by design.

Page 116: J2EE Batch Processing

116

Tuning Finding: ‘Chatty’ Batch Process Design

Symptom:-
- Low CPU usage on the WebSphere server
- Low CPU usage on the database server

Root cause:-
- An Oracle stored procedure was called to validate each field of the records being read and written: performance death by network round trips !!!!!!!!!

Action:-
- Modify the code to perform validation using pure Java code against standing data cached within the application server

Results:-
- See next slide
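A minimal sketch of the replacement approach, with hypothetical names: the standing data is loaded once into an in-memory set inside the application server, so each field check becomes a local lookup instead of a stored procedure call over the network.

```java
import java.util.Set;

public class FieldValidator {

    private final Set<String> validCodes;

    // The standing data is loaded once (e.g. at start of run) and cached
    // in the application server, replacing a stored-procedure call per field.
    public FieldValidator(Set<String> validCodes) {
        this.validCodes = validCodes;
    }

    // A purely in-JVM check: no network round trip per field.
    public boolean isValid(String fieldValue) {
        return fieldValue != null && validCodes.contains(fieldValue.trim());
    }
}
```

The trade-off is keeping the cached standing data consistent with the database; for reference data that changes rarely, reloading it per batch run is usually sufficient.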

Page 117: J2EE Batch Processing

117

Tuning Finding: ‘Chatty’ Batch Process Design

Finding: excessive calls to Oracle stored procedures

Results:-

Validation Method | Lines In File | Threads | Run Time (mm:ss) | % Improvement Over PL/SQL | WebSphere CPU | Oracle CPU
PL/SQL            | 15000         | 8       | 02:18            | NA                        | 68            | 60
Java              | 15000         | 8       | 01:31            | 34%                       | 77            | 68
Java              | 15000         | 4       | 01:48            | 24%                       | 51            | 56

Page 118: J2EE Batch Processing

118

Other Findings

With some batch processes, “cursor: pin S” wait events were observed; these accounted for up to 7.2% of total call time.

Investigating this led me to the fact that from 10.2.0.3.0 onwards the library cache pin has been replaced by a mutex.

In 11g even more of what were library cache latches have been replaced with mutexes.

This is notable because one of the ways of comparing the scalability of different tuning efforts is to measure and compare latching activity.

Page 119: J2EE Batch Processing

119

Tuning Results Summary

Page 120: J2EE Batch Processing

120

Types Of Batch Processes

The following graphs capture the following statistics for a typical batch process that has had all the tuning recommendations applied:-
- the average percentage CPU usage
- db time
- elapsed time

Page 121: J2EE Batch Processing

121

Batch Elapsed Time

[Chart: batch elapsed time in seconds (0 to 1200) against thread counts of 4, 8, 16 and 32]

Page 122: J2EE Batch Processing

122

Batch DB Time

[Chart: batch db time (0 to 5000) against thread counts of 4, 8, 16 and 32]

Page 123: J2EE Batch Processing

123

Server % CPU Utilisation / Thread Count

[Chart: % CPU utilisation (0 to 90) against thread counts of 4, 8, 16 and 32, plotting % database CPU usage and % application server CPU usage]

Page 124: J2EE Batch Processing

124

Critique Of Tools Used

Page 125: J2EE Batch Processing

125

Critique Of Tools Used

Oracle 10g db time model:-
- This worked very well for measuring database utilisation.
- It does not, however, give any indication of how heavy utilisation is compared to the total capacity that the database tier can provide.
- Both the Oracle diagnostics and tuning packs need to be licensed in order to use the tools that accompany the time model, namely the ADDM and the workload repository. These extra options are not cheap.
- The “ASH Masters” scripts provide a low cost alternative to the 10g performance infrastructure.

Page 126: J2EE Batch Processing

126

Critique Of Tools Used

JProfiler (Java profiler) provides detailed information on:-
- Heap usage
- Thread lock monitor usage
- CPU usage at method, class, package and bean level
- JDBC usage
- CPU profiling with drill downs all the way to JDBC calls
- JNDI lookup activity

It worked well for:-
- highlighting the RMI pass by copy overhead
- diagnosing an issue early on whereby a ‘singleton’ object was being created thousands of times, resulting in excessive CPU and heap usage

Page 127: J2EE Batch Processing

127

Critique Of Tools Used

JProfiler was used on the grounds that:-
- It was extremely easy to configure
- It attached to the JVM of WebSphere 6.1
- Other products were more suited to JSE program profiling
- Some profilers could not attach to the WebSphere JVM, or could, but not that of version 6.1
- Other profilers came with unwieldy proprietary IDEs that we did not require

It had a 100% performance overhead on the application server and should therefore not be used on production environments.

kill -3 can be used to generate thread dumps (the “poor man’s profiler” according to some); this is much less intrusive than using a full blown Java profiler.

Page 128: J2EE Batch Processing

128

Critique Of Tools Used

Tivoli Performance Monitoring Infrastructure (PMI):-
- Comes with a number of summary reports, of which the EJB report was particularly useful.
- If too many data points are graphed, the PMI viewer can become painfully slow.
- Turning some data points on can have a major impact on performance.
- One project member used the WebSphere PerfServlet to query PMI statistics and graph them using Big Brother and round robin graphing.

Page 129: J2EE Batch Processing

129

Critique Of Tools Used

WebSphere performance advisor:-
- The only useful information it provided was regarding turning off the diagnostic trace service.
- It relies on PMI data points being turned on in order to generate ‘Useful’ advice.
- Turning some data points on can have a detrimental effect on performance, to reiterate what was mentioned on earlier slides.
- It is perhaps more useful when running WebSphere with the IBM JVM, as this is more tightly integrated into the performance monitoring infrastructure than the Sun JVM.

Page 130: J2EE Batch Processing

130

Conclusions

Page 131: J2EE Batch Processing

131

Bottlenecks In Distributed Object Architectures

This alludes to Martin Fowler’s “First Law of Distributed Object Design”: don’t distribute your objects.

If remote interfaces are used and beans are deployed to a WebSphere application server in a single node configuration, the pass by copy overhead is still considerable.

Page 132: J2EE Batch Processing

132

Bottlenecks In Distributed Object Architectures

WebSphere application server provides a “quick win” for this situation in the form of the object request broker pass by reference setting.

!!!! CAUTION !!!! This should not be used when calling beans assume that the objects they pass cannot have been altered by the invoked beans.

For scale out, prefer shared nothing architectures, as per this article from Sun.

WebSphere Network Deployment uses a shared nothing architecture.
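The pass by reference caution above can be made concrete with a small hypothetical sketch: under pass by reference the invoked "bean" mutates the caller's actual object, whereas a remote pass by copy call would have left the caller's copy untouched. All names here are illustrative.

```java
public class PassByReferenceDemo {

    static class Order {
        String status = "NEW";
    }

    // With pass by reference enabled, the "bean" receives the caller's
    // actual object, so this mutation is visible to the caller.
    static void process(Order order) {
        order.status = "PROCESSED";
    }

    public static String statusAfterLocalCall() {
        Order order = new Order();
        process(order);
        // Under pass by copy (remote semantics) this would still be "NEW".
        return order.status;
    }
}
```

Code written against remote semantics may silently break when the setting is flipped, which is why the caution matters.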

Page 133: J2EE Batch Processing

133

Tuning Multi Tiered Applications

When multiple layers and tiers are involved, an all encompassing approach needs to be taken to tuning the software stack:-
- Tuning the database in isolation may not result in the performance and scalability goals being met.
- Tuning the J2EE application in isolation may not result in the performance and scalability goals being met.
- Refer to “Why you can’t see your real performance problems” by Cary Millsap.

Page 134: J2EE Batch Processing

134

Tuning Multi Tiered Applications

Bottlenecks need to be identified and targeted wherever they exist in the application stack.

A prime example of this is that the impact of database tuning would have been negligible had the pass by copy bottleneck not been addressed.

Page 135: J2EE Batch Processing

135

Threading

A given hardware platform can only support a finite number of threads.

There will be a “sweet spot” at which a given number of threads will give the best throughput for a given application on a given software stack.

Past a certain threshold, the time spent on context switching, thread synchronization and waiting on contention within the database, will result in diminishing returns from upping the thread count.
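One way to locate that sweet spot empirically is to time the same fixed workload at several thread counts. The sketch below is illustrative (the class name is hypothetical, and the no-op job stands in for a real batch run):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SweetSpotFinder {

    // Run the same fixed workload at each candidate thread count and
    // record the elapsed time in milliseconds for each.
    public static Map<Integer, Long> timeAtThreadCounts(Runnable job, int jobs, int[] threadCounts) {
        Map<Integer, Long> results = new LinkedHashMap<>();
        for (int threads : threadCounts) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();
            for (int i = 0; i < jobs; i++) {
                pool.submit(job);
            }
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            results.put(threads, (System.nanoTime() - start) / 1_000_000);
        }
        return results;
    }
}
```

Plotting the recorded times against the thread counts gives the kind of elapsed-time curve shown earlier, with the sweet spot at the minimum.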

Page 136: J2EE Batch Processing

136

Avoid ‘Chatty’ Designs

‘Chatty’ ??? Yes, designs that result in excessive chatter between certain components.
- This can be particularly bad when there is a network involved.
- “Designing and Coding Applications for Performance and Scalability” by IBM recommends putting processing closest to the resource that requires it (section 2.5.9).

Page 137: J2EE Batch Processing

137

Avoid ‘Chatty’ Designs

A subtly different angle on this is that ‘Chatty’ designs should be avoided:-
- Specifically, avoid designs that incur frequent network round trips between the database and the application server.
- Tuning finding 3 supports this.

Page 138: J2EE Batch Processing

138

Avoid ‘Chatty’ Designs

Low CPU consumption on both the application server and the database server could be a sign of ‘Chatty’ software:-
- i.e. excessive calls to the database, making network round trips the bottleneck.
- Perform processing exclusively within the application server where possible, but not when there are database features available specifically for carrying that work out.

Page 139: J2EE Batch Processing

139

Avoid ‘Chatty’ Designs

Operations that involve significant bulk data manipulation should be done in the database.

Always look to minimise network round trips by leveraging:-
- Stored procedures
- Array interfaces, both in Oracle and the JDBC API
- Tuning the JDBC fetch size
- Inline views
- Merge statements
- Subquery factoring
- SQL statement consolidation
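Two of these techniques, JDBC statement batching (the array interface) and fetch size tuning, can be sketched as below. The table name is illustrative, and the Connection would come from the application server's pool.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class RoundTripSavers {

    // Batch many updates into one round trip instead of one per row.
    public static void updateStatuses(Connection con, List<Long> jobIds, String status)
            throws SQLException {
        String sql = "UPDATE jobs SET status = ? WHERE job_id = ?"; // illustrative table
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (Long id : jobIds) {
                ps.setString(1, status);
                ps.setLong(2, id);
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip for the whole batch
        }
    }

    // Raise the fetch size so a result set is pulled in fewer round trips;
    // the Oracle JDBC default is only 10 rows per fetch.
    public static int countRows(Connection con, String sql, int fetchSize) throws SQLException {
        int rows = 0;
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setFetchSize(fetchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;
                }
            }
        }
        return rows;
    }

    // Round trips needed to drain a result set = ceil(rows / fetchSize).
    public static int fetchRoundTrips(int rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize;
    }
}
```

For a 150,000 row result set, the default fetch size of 10 costs 15,000 round trips; raising it to 500 cuts that to 300.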

Page 140: J2EE Batch Processing

140

Avoid ‘Chatty’ Designs

‘Chatty-ness’ can be a problem within the application server also:-
- There are two vertical layers of domain (business) logic within the application which are invariably called together.
- These could be consolidated into one vertical slice, with the benefits of code path length reduction and allowing for SQL statement consolidation.
- This has not been addressed to date, as all of our performance goals have been achieved without having to carry this work out.

Page 141: J2EE Batch Processing

141

JVM Tuning

The Java Virtual Machine is a platform in its own right, therefore it deserves a certain amount of attention when it comes to tuning.

When using the Sun JVM, use the appropriate garbage collection ‘ergonomics’ for your application.

As per some of Sun’s tuning material, there can be an element of trial and error in JVM tuning.

Use verbose garbage collection output to minimise major garbage collections.

Look at what tuning experts have done on your platform in the past to get ideas; www.spec.org is not a bad place to look, as per the example used in this material.

Page 142: J2EE Batch Processing

142

Row by Row Processing Scalability and Performance

There was great concern over the row by row access to the persistence layer.
- However, a bottleneck is only an issue if it prevents performance goals from being achieved.
- It would be interesting to find the level of application server throughput required to make the database become the bottleneck.
- This would require more application server instances, i.e. WebSphere Network Deployment.

Page 143: J2EE Batch Processing

143

Is The Database The Bottleneck ?

db time does not help in terms of measuring resource usage and time spent in the database in relation to the total available capacity.

However, as we have gone from 40,071s of db time down to 2,516s, there appears to be ample capacity within the database tier.

Page 144: J2EE Batch Processing

144

Is The Database The Bottleneck ?

Parsing was raised as a concern; the % Non-Parse CPU figure in the “Automatic Workload Repository” excerpt on the next slide should dispel this.

This report was captured whilst running a typical batch process with all the tuning changes applied and 32 threads.

The “Parse CPU to Parse Elapsd” ratio is not optimal, however as parse CPU makes up only a small proportion of total CPU (% Non-Parse CPU is 94.13), this is not a major concern.

Page 145: J2EE Batch Processing

145

Is The Database The Bottleneck ?

Buffer Nowait %:              99.99    Redo NoWait %:      100.00
Buffer Hit %:                 99.33    In-memory Sort %:   100.00
Library Hit %:                99.99    Soft Parse %:        99.99
Execute to Parse %:           91.14    Latch Hit %:         99.91
Parse CPU to Parse Elapsd %:  24.76    % Non-Parse CPU:     94.13

Page 146: J2EE Batch Processing

146

There Is Always A Bottleneck

In all applications there are always performance and scalability bottlenecks:-
- A J2EE application server will usually be bound by CPU capacity and memory access latency from a pure resource usage point of view.
- A relational database will usually be constrained by physical and logical IO.
- In the J2EE world, where a database is used for persistence, tuning will involve moving the bottleneck between the application server and the database.

Page 147: J2EE Batch Processing

147

Useful Resources

IBM resources:-
- Designing and Coding Applications For Performance and Scalability in WebSphere Application Server
- WebSphere Application Server V6 Performance and Scalability Handbook
- IBM WebSphere Application Server V6.1 on the Solaris 10 Operating System

Page 148: J2EE Batch Processing

148

Useful Resources

IBM WebSphere Compute Grid resources:-
- WebSphere Extended Deployment Compute Grid
- Executing Batch Programs In Parallel With WebSphere Extended Deployment Compute Grid
- Compute Grid Run Time
- Compute Grid Applications
- Swiss Re Use Of Compute Grid
- Compute Grid Discussion Forum

Links provided courtesy of Snehal Antani of IBM.

Page 149: J2EE Batch Processing

149

Useful Resources

Sun Resources:-
- Albert Leigh’s Blog
- Dileep Kumar’s Blog
- Scaling Your J2EE Applications Part 1
- Scaling Your J2EE Applications Part 2
- Java Tuning White Paper
- J2SE and J2EE Performance Best Practices, Tips And Techniques

Page 150: J2EE Batch Processing

150

Useful Resources

Oracle Resources:-
- Oracle Real World Performance Blog
- 360 Degree DB Programming Blog
- Oracle Technology Network JDBC Resources
- Designing Applications For Performance And Scalability - An Oracle White Paper
- Best Practices For Developing Performant Applications

Page 151: J2EE Batch Processing

151

Useful Resources

Other resources:-
- Standard Performance Evaluation Corporation (SPEC) jAppServer 2004 Results
- JProfiler