Modern Batch Directions
Enterprise Architecture by
Akiva Marks
Chief Architect – Architecture / SOA / Cloud / Integration Group
Guardian Information Systems – a Malam Team Company
[email protected] – phone +972-52-313-4184
http://MakingSOAwork.blogspot.com
TABLE OF CONTENTS
1. Overview
   1.1. Limitations of Batch
   1.2. Where Did My Batch Go?
2. What is (was) Big Batch?
3. Eliminating Batch
   3.1. Real Time Processing
   3.2. Business Event Driven Processing
   3.3. Decoupled Processing via Messaging
   3.4. Just Another Input Channel via Parallelization
   3.5. Summary
4. Modern Batch Alternatives
   4.1. Roll Your Own (Code Your Own Solution)
   4.2. BPA Controlled
   4.3. ETL – Extract, Transform, Load
   4.4. CDC or BAM as Real Time ETL Alternatives
   4.5. A Modern Batch Risk
5. Batch and Services, SOA, SOAP and/or Messaging
   5.1. The Bulk Service Pattern
6. Big Batch Tools for Java
   6.1. Compute Grid (IBM)
   6.2. Spring Batch
   6.3. J2EE & JSR 352
7. A Modern Batch Strategy
   7.1. Eliminate
   7.2. Roll Your Own via Queuing
   7.3. Roll Your Own via BPA
   7.4. BAM + ESB for Near Real Time Data Distribution
   7.5. Summary
INCLUDED ILLUSTRATIONS
Figure 1 - Example Big Batch Control Features
Figure 2 - Large Batch Job subdivided into parallel segments
Figure 3 - Arcati Mainframe User Survey 2013
Figure 4 - from IBM
Figure 5 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 6 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 7 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
1. Overview
A Mainframe-using customer recently approached me with a problem: "We're upgrading
our application and technology environment, please help us select the tool set to which to
move our batch jobs." It sounded like a straightforward need: just find a new technology
set in which to build new batch jobs and into which to move their old batch jobs. But as I
researched the problem in more depth, I found the technology approach in this area has
moved significantly, making an answer much more complicated.
Batch was a primary processing pattern for the majority of 2nd generation enterprise
applications, and the driving requirement for enterprise computer hardware and operating
environments from around 1960 through 1990. To meet this need, IBM together with
other vendors created a complete ecosystem of hardware, software and database tools
optimizing performance of this pattern.
These tools were fully effective at developing, controlling, managing and monitoring
complex batch jobs. This didn’t mean batch jobs were without problems. Large
dependency chains are always vulnerable to any step or link having a problem. Even
with the most sophisticated batch monitoring and support tools in place, operator
intervention (to stop and restart, recover, or re-run a job) was a frequent experience in
every batch environment.
Batch was an effective method of processing large quantities of data with limited and
costly computing resources. As the paradigms of data processing changed (such as the
driving business need for real time data) and the resource-cost basis of the computing
resources changed, the limitations of the batch pattern became more significant and the
alternatives more viable.
1.1. Limitations of Batch
The batch processing pattern presents 3 inherent limitations:
1. Any problems with a batch job step often meant failure of the job as well as
requiring operator intervention to recover. Almost every IT enterprise
running batch jobs had “night operators”, or at least on call operators, to
intervene when necessary – usually at least once a week if not more.
2. Batch job resource requirements often scaled linearly: if in 1980 it
took 2 hours to process 100,000 customers, and by 1990 we had 1,000,000
customers, processing then took 20 hours. Since batch jobs usually
had to fit within windows of available resources (time frames when online
processing was shut down to free up resources for batch processing),
usually an 8-12 hour overnight slot or a weekend, increasing volume
often put the enterprise into a resource squeeze. (Though IBM would
always be happy to deliver the latest Mainframe and latest database, at huge
cost, to get you a 25% increase in handling capacity from the next
generation of hardware.)
3. Batch, by nature, is not real time, so any data processed via batch
patterns left a system that was not up to date. In some cases this was not very
important, but in other cases the business and the customers came to expect
real time updates. For example, it used to be that when you made a bank
deposit, you received a note that "deposits may not be applied for 48 hours".
So even if you deposited money in an ATM (a Caspomat in Israel),
you would not see the update to your balance until the next day or later.
With the creation of customer web sites showing account balances, having
balances that didn't include updates from certain sources due to IT
processing delays became unacceptable.
While updates to the supporting technologies over time provided a variety of
workarounds and updated techniques to reduce the impact of these problems, they
remain fundamental problems of batch processing.
1.2. Where Did My Batch Go?
As we moved forward to client/server applications, web based applications, and
service oriented applications, we've heard less and less about batch. Few new
systems being developed today use the 2nd generation batch processing pattern.
Even rarer is the programmer aged 35 or younger who has any idea
about batch. The verdict is in: "Big Batch" is dead.
But there are still plenty of newer applications and processing instances where large
amounts of data are processed and updated. A batch programmer looking at these
processes would instantly identify many batch-like techniques in use, though in more
targeted and limited implementations – or, more to the point, as an integrated part of
the application rather than as a separately managed and controlled processing pattern
of its own.
Examples of batch like operations in modern applications include:
- ETL processes (extract, transform, load), particularly ETL extracts that are
then loaded into a data warehouse.
- Billing cycles: generating customer bills and updating customer accounts to
their billed status.
- Updating groups of data across systems.
In recent examples of Batch Processing described by an IBM architect, he wrote:
Batch processing is execution of series of "jobs" that is suitable for non-interactive, bulk-oriented and long-running tasks. Typical examples are end-of-month bank statement generation, end-of-day jobs such as interest calculation, and ETL (extract-transform-load) in a data warehouse. These tasks are typically data or computationally intensive, execute sequentially or in parallel, and may be initiated through various invocation models, including ad-hoc, scheduled, and on-demand.
Yet that description no longer describes a chain of large, bulk oriented, long
running tasks. Rather, it shows that modern bulk processing has devolved into
single-task operations (even if that task handles a large quantity of the particular
transactions or operations).
Some big batch techniques remain and have become an integrated part of the
modern application environment. The need for bulk processing has not completely
disappeared. But in many cases the need has been eliminated…
2. What is (was) Big Batch?
This section is written for the younger technologists. If you have developed on the
Mainframe or developed Big Batch jobs in the past, you may skip or skim this section.
The vast majority of today’s developers have never worked on a mainframe or in a 2nd
generation language (such as COBOL). To relate to where batch is going, first we must
present a brief technical overview of what batch was. And though I use the word “was”,
almost every large IT enterprise has an older generation of applications, and many of the
oldest continue to run Big Batch. If there’s a Mainframe in the data center, there’s
probably some Big Batch running every night.
In general, Big Batch is a series of batch jobs, each job consisting of job steps, each step
being the execution of bulk processing of an extract, transform, load, or transaction
processing step. A scheduler starts the jobs at specified times or upon receipt of input of
specified type. A controller executes each job step, and a supervisor monitors the results
of each step.
Figure 1 - Example Big Batch Control Features
A single “Big Batch” production job may have 1 – 100 or more job steps. Each job step
comprises a script (JCL – Job Control Language) that may run 1 or many program
modules or utility operations. Such a job step script could itself be 1 – 500 commands.
The Big Batch execution cycle tries to maximize the resources available for the bulk
processing (shutting down online processing to free up memory and CPU for the batch
cycle). Further, due to the quantities of data involved, it has to be very careful to manage
the I/O.
* How 100ms of I/O overhead can turn into 27 hours of processing time…

The average developer is very familiar with database transactions but not with what's happening behind the scenes and its resource cost. Multiple database change operations (insert/create, update, delete) can be performed (across any number of tables in the same database schema/data set) and grouped into a "transaction". If any of the operations fail, the transaction is discarded, or in database terminology "rolled back", meaning the full group of operations is never applied. If all the operations succeed, the transaction is permanently applied as a group, or in database terminology "committed". (As an aside, attempting to build a transaction that spans multiple databases or multiple applications is called a "two phase commit" and is considered technically challenging and operationally risky.)

For a relational database (DB/2, Oracle, Microsoft SQL Server, MySQL, etc.) to create a transaction, a special technique is activated for tracking the changes without actually saving them. Rather than apply the changes to the physical tables, upon beginning a transaction the database builds a "transaction log entry" (this may differ slightly by vendor; Oracle, for example, builds a "log buffer" and "data buffer" set for the transaction). As each operation is executed, the new state and old state are saved in the log(s). Upon an attempted commit, the new state is moved from the log to the actual tables; if there is a failure, the old state is restored; if it succeeds, the old state and log entries are discarded. It's a surprising amount of actual work and data movement for even seemingly small changes, but it is key functionality within a relational database, and all this log and live-table interaction carries significant overhead.

Even if that overhead (with today's highly optimized relational databases) comes out to no more than 100ms per transaction, processing 1,000,000 updates comes out to roughly 27 hours of overhead just for the commits (not including the time of the actual processing). [The time calculation assumes those 1,000,000 updates are applied sequentially.] Therefore bulk processing of database updates to a relational database commonly groups 10-1,000 updates/inserts into a single "transaction", resulting in 1 commit per 10-1,000 updates/inserts. This is done not because the updates/inserts are business related (the normal reason for a transaction), but to reduce the number of database commits and their overhead (which provides no functional value for a single insert but can't be avoided).
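To make the arithmetic concrete, here is a minimal sketch of commit grouping, shown in Python with SQLite purely for illustration; the table, the `bulk_update` helper, and the batch size of 100 (an arbitrary pick from the 10-1,000 range above) are all hypothetical:

```python
# Sketch: grouping unrelated updates into shared commits to amortize overhead.
import sqlite3

def bulk_update(conn, rows, batch_size=100):
    """Apply updates in groups, issuing one commit per batch_size rows."""
    commits = 0
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        for cust_id, balance in rows[i:i + batch_size]:
            cur.execute("UPDATE customers SET balance = ? WHERE id = ?",
                        (balance, cust_id))
        conn.commit()          # one commit covers the whole group
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO customers VALUES (?, 0)", [(i,) for i in range(1000)])
conn.commit()

rows = [(i, 100.0) for i in range(1000)]
print(bulk_update(conn, rows))  # 1,000 updates cost only 10 commits
```

At 100ms per commit, those 10 commits cost one second of overhead instead of the 100 seconds that 1,000 individual commits would have cost.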
Grouping of operations into a single database transaction is not due to business need or
association of the operations, rather it’s simply to reduce the overhead of the database
commit process (an operation that can’t be avoided or deactivated). Unfortunately when
grouping a large number of unrelated operations into a single database transaction, the
failure of any one operation will fail the transaction – causing a business impact between
operations where no business connection exists. Managing this impact becomes a
major part of Big Batch. Job segments that have had failures need to be recovered, have
the problem repaired (which may be as straightforward as removing the single
problematic update/insert), and then resubmitted. Low level operational control of job
segments is a standard part of Big Batch control environments for exactly this reason.
I/O and processing management is a major portion of traditional batch jobs, and part of
what adds to their complexity. Large input files are commonly sorted prior to processing,
allowing processing to optimize database access on the primary indexes and allowing the
database to pre-fetch and buffer with predictive algorithms. Sometimes data sets are
queried from the database, sorted by a needed access pattern and written as a sequential
file to speed access.
Output of one job step is frequently required input to another job step, and sometimes
different jobs run in parallel and produce required input for each other. A web
of dependencies, both internal and external, is frequently created. The result is
complicated and highly dependent on inputs arriving at the right time. Because
input/output sets are usually sent as files, yesterday's input arriving (perhaps due to a
previous job's failure and re-run) was not an infrequent occurrence.
Figure 2 - Large Batch Job subdivided into parallel segments
Because sorting and ordering the input sets was usually part of the job steps, the
controlling steps could split the input sets and execute segments in parallel to maximize
resource utilization. This could continue until the CPU capacity, memory capacity,
database capacity or IO capacity was hit, though in practice the average batch job was
rarely split into more than 5 segments.
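The sort-split-merge pattern above can be sketched as follows; this is an illustrative Python stand-in (the record layout, the `process_segment` step, and the default of 5 segments are hypothetical), not a model of any particular mainframe controller:

```python
# Sketch: a controller sorts the input set, splits it into segments, runs the
# segments in parallel, then merges the segment results.
from concurrent.futures import ThreadPoolExecutor

def process_segment(segment):
    # Stand-in for a job step's per-record processing.
    return sum(record["amount"] for record in segment)

def run_parallel(records, segments=5):
    records = sorted(records, key=lambda r: r["key"])  # sort step precedes the split
    size = (len(records) + segments - 1) // segments   # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=segments) as pool:
        partials = list(pool.map(process_segment, chunks))
    return sum(partials)  # merge step combines the segment results

records = [{"key": i, "amount": 1.0} for i in range(1000)]
print(run_parallel(records))  # same total as a single sequential pass
```

As the text notes, the useful number of segments is capped by CPU, memory, database, and I/O capacity, which is why `segments` is a tunable parameter rather than "as many as possible".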
Programs that "did batch" were almost always exclusively written for batch. The batch
approach meant different models of input and output handling, and possibly different
models of data access, and different approaches to transaction management. Even the
database could contain optimizations: special indexes for batch access, even physical
table structures laid out on the disk for optimized bulk access.
Big Batch worked, and is still working at many organizations. Yet the amount of
enterprise data that still lives on the mainframe has fallen to 40% or less for the majority
of enterprises with a mainframe. So Big Batch and the mainframe structures that enable
it continue to decrease, from 80% of organizations in 2000, to 50% in 2005, to perhaps
30% today.
Figure 3 - Arcati Mainframe User Survey 2013
3. Eliminating Batch
When looking to eliminate batch processes, here are some of the common approaches for
doing so. Most organizations will use a combination of these and others, depending on
the exact need.
3.1. Real Time Processing
IT processes that were, in older systems, locked up in the application code are often
exposed today via a Web Service or Messaging based interface (or both). This
makes connecting systems and feeding an application data in real time easier
than building and managing a batch process. It also spreads the work load
for the process across the day.
* Hardware Capacity and Batch vs. Real Time

One of the problems of managing capacity at the hardware level is variable workloads – batch being a particularly heavy workload. If we purchase capacity to handle the burst load, which in batch environments is usually the batch load, that extra capacity sits idle throughout most of the day. By spreading the work load over the day, we maintain a more constant load and can operate at a higher level of constant capacity utilization, reducing our hardware requirements and associated cost.
The primary reason for moving to real time processing (or near real time) is to have
an up-to-date data view across our application environment.
* Real Time or Near Real Time?

Real time technically means immediate processing across the connected environment or applications, or more specifically relates to hardware level programming and responding to electrical signals. Near Real Time means a system may send an asynchronous event or message to another system, which applies the update "very soon" – usually within a few seconds. In most business scenarios Near Real Time is perfectly adequate. Actual Real Time, or synchronized transactions and updates, may be necessary for certain financial business processes, such as stock trading, or infrastructure (such as telecommunications) operations. But Actual Real Time incurs significant extra implementation costs and should only be used when absolutely necessary. Within this document we'll use the general term "Real Time", but usually mean "Near Real Time".
Real Time processing eliminates much if not a majority of batch jobs in most
organizations, if systems on both sides are being or have previously been
updated to modern generation applications or been modified to expose
transactions or processes via modern interface methods.
* CICS / Mainframe Web Services?

IBM and other vendors have provided tools to web-service-enable older 2nd generation program modules, so even some of the oldest applications can be moved to Real Time. CICS and Enterprise COBOL have been web service enabled, able to expose and consume web services (with some XML limitations). Similarly, Software AG has web service enabled their Natural language environment, and even DB/2 on the Mainframe can expose Stored Procedures as Web Services. Further, a variety of other vendors and tools are available to bridge web services and messaging into various 2nd generation Mainframe languages and environments.

However, the ability to do so easily depends on the architecture model of the code within the older software. If the processing level or transaction level is separated from the user interface level, represented as a separate callable module or library, then it can almost instantly be exposed (changing the older system from user interaction to a transaction engine). However, if the code logic is tightly wrapped together with screens and/or the interface method, then the effort to separate the transaction or process logic from the interaction logic may exceed the cost of rebuilding the abilities in a modern environment.
We sometimes get into the humorous yet incredibly inefficient situation where one
side of a process will update their system, allowing for real time, but the other side
will not. The updated side must then use a bridging technique (such as using an
ESB, enterprise service bus, or messaging system to save real time transactions
throughout the day and then send them in a batch later) to maintain the existing
batch process. Some years later the other side is updated, but they ALSO use a
bridging technique to “maintain compatibility” with the existing interface rather
than re-engineer a working process (or not even knowing the other side is using a
bridging technique). We end up with two real time capable systems connecting and
operating in a batch pattern while maintaining bridging code. The bridging code,
serving no real need, is maintained at extra expense and actively reduces data
freshness between the systems. This is particularly likely when such processes
are between companies.
3.2. Business Event Driven Processing
Our operating systems, browser environments and mobile systems all operate in an
event driven mode. Every Windows programmer is well familiar with responding
to GUI events and system events, and every JavaScript programmer is familiar
with browser events such as OnLoad or OnClick.
Business Event Driven Processing extends this idea to business events. Events
such as NewEmployee, CustomerSale, or ProductShipped can be listened for and
subscribed to. So rather than the Billing System having to wait until the evening to
be pushed a list of sales made today, it can subscribe to "CustomerSale" events.
This further allows processing to be decoupled. In the example above, the Billing
System could process the CustomerSale event in real time, in the background at a
lower priority, in the evening to reduce impact on users, or at the end of the month
as part of its bill generation cycle. (The last option would mean its view of the
customer state would be out of date, but perhaps it's not the system used to view
customer state – so there is no need to keep it constantly up to date.)
This pattern allows the systems to communicate, push needed data, yet process on
their own individual application cycle. Where a near-real-time result is needed, the
needing application can subscribe and process immediately upon receipt of an
event. Since subscribing and processing are the responsibility of the subscriber (the
system needing the data or processing the transaction), the provider (the system
sharing the event and associated data) is completely decoupled from the subscribing
systems.
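A minimal in-process sketch of this publish/subscribe decoupling, reusing the CustomerSale event from the Billing System example; the `EventBus` class here is a hypothetical stand-in for real publish/subscribe middleware, not a product API:

```python
# Sketch: providers publish business events; subscribers register handlers.
# The provider neither knows nor cares who (if anyone) is listening.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every registered subscriber.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
billed = []

# The Billing System subscribes instead of waiting for an evening batch feed.
bus.subscribe("CustomerSale", lambda sale: billed.append(sale["customer"]))

bus.publish("CustomerSale", {"customer": "ACME", "amount": 250.0})
bus.publish("ProductShipped", {"order": 17})   # no subscriber: silently ignored
print(billed)  # ['ACME']
```

In a real deployment each handler would run on the subscriber's own schedule (immediately, in background, or on its billing cycle); here delivery is synchronous only to keep the sketch small.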
Because the systems are physically decoupled, it is important for the IT
development process to include steps to maintain a catalog of events (type and
content), providers of the events, and subscribers of the events. This becomes the
only way to determine change impact.
From a historical batch perspective, one could say each subscriber is responsible for
its batch step and its cycle. Success or failure to process is its own local
responsibility to manage without any impact on the providing system or on any
other consuming systems. In other words, even each sub-step of what would have
been a full batch job is isolated, with no impact beyond the system involved in the
particular sub-step.
3.3. Decoupled Processing via Messaging
Decoupled Processing via Messaging is similar to Business Event Driven
Processing, but rather than consuming systems subscribing to events (a fully
decoupled pull model), the producing system sends specific messages to one or
more designated receiving/consuming systems (a loosely coupled push model).
* Messaging and Communication Patterns

Messaging refers to an asynchronous communication pattern, where a sender targets a message at a queue, and the receiver or consumer reads from the queue – at its convenience.

Messaging is usually considered a reliable or guaranteed delivery approach, with messages being persistent until consumed. IBM's Websphere MQ is the top market solution for such a reliable asynchronous messaging infrastructure, though TIBCO offers several very viable alternatives. On a smaller scale, Microsoft offers MSMQ as well as a messaging infrastructure in their Biztalk integration server environment.
With Messaging, each system requiring data, updates, or transaction activation has
the operation sent as a message. The sending and receiving of the message is an
asynchronous decoupled process, meaning the sending system writes the message
to a target queue and their task is complete. The queue transactionally
acknowledges receipt of the message, at which point the sending system's
responsibility is complete. The receiving system that will process the message
either reads the queue at its convenience (according to its availability and
processing schedule) or monitors it, activating processing upon message receipt
(usually monitoring with multiple threads, allowing parallel processing of multiple
messages simultaneously).
The reading and processing is also transactional, the message being locked at the
start of the processing and deleted upon successful completion (or unlocked upon
processing failure and rollback).
Therefore, neither side of the process is dependent upon the other side to operate or
complete, nor does either side stop and wait (for receipt, for processing, for success
or fail) for the other side.
Message queues themselves may be monitored and activate alerts should messages
be aging (not being picked up) or growing too many in the queue (not being
processed fast enough).
It is possible for multiple systems to be a provider of a particular message.
However, with the use of a queue, only one system may be the consumer since once
a message is read and processed, it is deleted from the queue. Of course that
"system" may be a server cluster, with all active members of the cluster listening to
and consuming messages from the same physical queue. The point is that two
separate applications cannot process the same message from the same queue. If
there is such a need, in this pattern you would have the sending application send the
message multiple times, once to each queue for each consuming application.
One final point on messaging. The sending system may bundle multiple transaction
or operation requests in a single message AND the receiving system may choose to
read and process multiple messages in a single operation or transaction. A message
may contain as much data or as many transactions as is appropriate. For example,
if 10 different updates for a customer were being sent, it would be very appropriate
to bundle all 10 updates for the same customer into one message. Similarly, the
receiving system (the system reading the messages from the queue) may determine
that, to maximize processing efficiency and minimize database overhead, it will read
10 messages from the queue and process them in a single transaction. When sending
or receiving large quantities of transactions, both patterns may be appropriate and
should be considered.
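The receiving side's batch read can be sketched as follows; Python's in-memory `queue.Queue` stands in here for a real messaging product such as Websphere MQ, and the batch size of 10 follows the example above:

```python
# Sketch: a consumer drains up to N messages from a queue and processes them
# as one unit of work, reducing per-message (and per-commit) overhead.
import queue

def drain(q, max_batch=10):
    """Read up to max_batch messages without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(25):
    q.put({"customer": i % 5, "update": i})   # the sender posts 25 messages

batches = []
while True:
    batch = drain(q)
    if not batch:
        break
    batches.append(batch)   # each batch would be one processing transaction

print([len(b) for b in batches])  # 25 messages consumed as [10, 10, 5]
```

A production consumer would also acknowledge or roll back each batch transactionally, as described above; that bookkeeping is omitted from the sketch.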
3.4. Just Another Input Channel via Parallelization
What the remainder of batch often becomes is ‘just another input channel.’ The
point of this approach is to use the same online code set for bulk processing,
handling the bulk load via parallelization.
Parallelization means taking the group of transactions to be processed and feeding
them to a pool of threads that may be deployed across a cluster of application
instances. Each thread processes its given transaction via the same code, object,
or service as an online process.
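The "same code set" idea can be sketched as follows, assuming a hypothetical `apply_payment` service function; the online channel calls it once per user action, while the bulk channel feeds it through a thread pool:

```python
# Sketch: one service function serves both the online and the bulk channel.
from concurrent.futures import ThreadPoolExecutor

def apply_payment(txn):
    # The single code path: one transaction in, one result out.
    return {"account": txn["account"], "status": "applied"}

def online_channel(txn):
    # Online users invoke the code one transaction at a time.
    return apply_payment(txn)

def bulk_channel(txns, workers=8):
    # Bulk input is just another channel: the same code, fed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(apply_payment, txns))

print(online_channel({"account": 1})["status"])       # applied
results = bulk_channel([{"account": i} for i in range(100)])
print(len(results))  # 100
```

Because both channels exercise identical logic, there is no separate batch program to keep in sync with the online code – which is exactly the appeal, and also the source of the risk discussed in Section 4.5.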
Distributing the load across a cluster can be technically difficult. Messaging is a
frequent approach to do so, though in particularly complicated models or those with
very large numbers of such tasks or processes it is appropriate to consider modern
batch tools (see Section 6 for a discussion of these tools).
Just Another Input Channel is a very viable option for the remainder of bulk
processing needs after the previous elimination methods are used, but does carry
risk. See Section 4.5, A Modern Batch Risk for more information.
3.5. Summary
These are just some of the popular architecture patterns for eliminating batch
processing. They are dependent on both sides of the processing chain having
flexible interface options or having available bridging technologies. Fortunately
such options are now available in almost all circumstances, even for older systems
and technologies – though in some cases programmers or managers of older
systems may not be aware that new options have become available or feel
uncomfortable working “with the new stuff”.
We should emphasize that "new" options are not new. As an example, CICS
(IBM Mainframe environment for transaction management) gained web service
enabling in 2007. Prior to it being added to CICS, both Mercator and MQ offered a
functional way of exposing Z/OS (mainframe operating system) program modules
as external interfaces at least back to the year 2000 if not before.
4. Modern Batch Alternatives
With much of processing moving to real time, event driven, messaging oriented and other
similar patterns, a major portion of batch processing has been eliminated. Yet as we
stated early in this document, batch processing has not disappeared. Bulk data processing
has, in some ways, increased in volume. Data warehouses are constantly taking batch
loads of data from other systems. Business Intelligence and Big Data are driven by
processing huge volumes of data, and those volumes are (generally) not moving real
time. So what is happening?
4.1. Roll Your Own (Code Your Own Solution)
Roll Your Own is an American expression, originally about hand-rolled cigarettes,
meaning make your own. In consulting with numerous enterprise architects around the
world (with the U.S. heavily represented), I found this is the majority approach.
With more limited bulk processing requirements, a more limited approach is
perfectly adequate for most situations. In other words, rather than trying to
implement a large managed processing framework (see Section 6 – Big Batch
Tools for Java, for discussion of such frameworks), use of a locally coded thread
pool, a message queue or even an input table will meet the needs.
One of the side effects of such an approach is there is no central or consistent
approach for bulk processing. Rather, there may be one approach for handling a
few files, another for parallelizing certain types of transaction groups, another for
extracting and loading the data warehouse.
Yet this is exactly the point: the requirements are no longer met with a large generalized approach, but targeted with the right tool and the right technique to meet the now narrower circumstances.
For example, large regular data transfers from one database to another are today
commonly done with an ETL tool (such as Informatica). No modern developer
would manually code a large set of queries, write out sequential files, write
programs to transform their format, then write programs to manage the loading of
those files into the target data warehouse database. Since ETL tools contain all of
these abilities and perform them in a self-controlled environment, as well as
offering a visual scripting ability to design the steps, the development and
maintenance cycles are significantly decreased. Further, the runtime environment
is optimized for its narrow function set, and therefore may perform significantly
faster than manually developed code to perform the same function.
* Who can optimize better, you or the vendor?
Narrowly focused vendor products (or open source products) – databases, ETL tools, ESBs, messaging platforms, etc. – have been focused on their IT and technical problems for years or even decades. Assuming their approach is one that solves your particular IT problem, their environment will usually be both more feature rich and more optimized than what you can develop. How can we say this? Tool vendors often apply 50-500 developers to their product over years, even decades; a product with 100-1,000 man-years of work invested is the norm. Further, vendors are not just working out feature sets for their product but bringing in top algorithm designers, software architects and computer scientists to create new processes for maximizing performance and features.
BUT, note that this assumes their approach meets your need. A classic example of this is a relational database versus a NoSQL graph database. Even the best optimized relational database may provide poor performance and a poor development process (due to an unnecessarily complex relational data model) if the requirement calls for a data set with an extremely large number of relationships, particularly bi-directional relationships. [Note that the latest announced releases of relational databases now include graph features to overcome this limitation and retain their role as the primary data storage tool.]
Of course this is a generalization. While most business situations are good fits for traditional vendor tools, more specific or narrow problems may not be, and unique approaches may offer the particular IT organization a competitive advantage. A good example of this was Amazon.com taking a unique approach to IT hardware resource allocation, creating an on-demand system that they were eventually able to productize and sell, now known as Amazon Web Services (Amazon EC2 and S3 – elastic computing and storage – and more).
We find most organizations coding a small controller for handling bulk volume,
queuing or staging the transactions (in a reliable auto-recoverable model),
processing them through parallelization, and scheduling the operation through
standard scheduling tools (such as BMC’s Control-M). Different needs or models
within the environments will have separate controllers and staging methods, or
different instances of similar methods. Generally the volume of different needs is
not so great that any type of generalized or larger control approach is needed.
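As a concrete illustration of the pattern just described – a small controller, staged transactions, parallel processing – here is a minimal Java sketch. The class and method names (BulkController, processRecord) are illustrative, not taken from any product, and the in-memory list stands in for what would in practice be a queue or staging table fed by the scheduler.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal roll-your-own bulk controller: stage work, process it in parallel.
class BulkController {
    private final ExecutorService pool;

    BulkController(int workers) {
        this.pool = Executors.newFixedThreadPool(workers);
    }

    // Processes every staged record through the worker pool;
    // returns the count handled. One-shot: the pool is drained and closed.
    int processAll(List<String> stagedRecords) {
        AtomicInteger handled = new AtomicInteger();
        for (String record : stagedRecords) {
            pool.submit(() -> {
                processRecord(record);      // real business logic goes here
                handled.incrementAndGet();
            });
        }
        pool.shutdown();                    // accept no new work, drain the queue
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    private void processRecord(String record) {
        // placeholder: update, transform, or forward the record
    }
}
```

In a real deployment the scheduler (Control-M or similar) would simply launch this program against the staged input.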
4.2. BPA Controlled
BPA = Business Process Automation. Business Process Automation generally means a scripted controlling process (rather than a coded process) that tracks its own status and state (along with mechanisms for querying and reporting upon them).
BPA is a surprisingly good fit for bulk processes that have more than one step. Rather than having to code a controller and manage state transitions, the BPA tool does this as its native functionality. Further, for those processes that have requirements for monitoring / viewing of status, again the BPA tool provides some level of this natively.
This is a good option if a BPM or BPA tool already exists within the environment;
otherwise it’s only appropriate when sufficient need exists to make it worthwhile to
add another tool.
4.3. ETL – Extract, Transform, Load
ETL (extract, transform and load) tools have taken over a space that was formerly a major portion of Big Batch…
- Manually code a large set of queries.
- Write out sequential files.
- Write programs to transform their format.
- Write programs to load the transformed files into the target data warehouse
database.
Today the ETL tool and its environment provide a relatively fast way to develop
and deploy such processes. They are optimized and efficient, and offer additional
abilities such as complex data transformations and data cleansing.
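The four manually coded steps that ETL tools replaced can be pictured as one small pipeline. A toy Java sketch, with in-memory lists standing in for the source and target databases (the class name and transformation rules are illustrative only):

```java
import java.util.List;
import java.util.stream.Collectors;

// Miniature ETL: extract rows, transform their format, load the target.
// Lists stand in for the source and target databases.
class MiniEtl {
    static List<String> run(List<String> sourceRows) {
        return sourceRows.stream()
                .filter(row -> !row.isBlank())          // extract: skip empty rows
                .map(row -> row.trim().toUpperCase())   // transform: normalize format
                .collect(Collectors.toList());          // load: into the target
    }
}
```

A real ETL tool adds what this sketch lacks: restartability, bulk-load optimization, data cleansing rules, and a visual design environment.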
Interestingly, they can also interface to and use other protocols besides direct
database access, such as web services. Prior to using web services as a data source
for an ETL process, please see Section 5 - Batch and Services, SOA, SOAP and/or
Messaging.
In general, ETL tools are very good at what they do, popular, and found in most large IT organizations. However, they suffer from the classic batch problem of not being real time, making them inadequate when frequent updates or a near real time view is necessary.
4.4. CDC or BAM as Real Time ETL Alternatives
CDC = Change Data Capture, a set of tools that monitor a database in real time, transferring updates as they occur out to another destination database. As with ETL, format transformation and cleansing may occur along the way; the difference from ETL is that the change capture and transfer happen immediately, as each change occurs.
BAM = Business Activity Monitoring. It is a set of tools that plugs in to
integration points, such as an ESB or Messaging, extracting data as it moves
through those environments. The data may be aggregated for real-time presentation
on monitoring screens, for generating real time business alerts based on content or
volume, or sent on to activate other services or processes.
CDC and BAM are the modern tools for real time activity monitoring and updating.
With these tools one can monitor real time activity at the application/integration
level or at the database level, reacting (triggering) on what has changed or how
much it has changed. Changes can be sent on to other databases, on to the data
warehouse for (near) real time updates, other applications, or trigger transactions,
processes, or business alerts.
* What’s a Business Alert? IT people are used to monitoring servers, systems, and applications. Tools such as BMC Patrol will monitor your database, verifying it has sufficient memory and CPU resources, that it’s not being flooded with activity beyond its capacity, and that it has sufficient storage for the ongoing demand. Similar tools are available to monitor the various servers, the operating systems, middleware environments (MQ, ESB), and the application environments (.Net or Websphere, for example).
Business Activity Monitors perform similar monitoring, but are designed to look at the content of the activity. They may monitor that no transaction exceeds $100,000, for example. Or that if new sales exceed a certain volume, management is alerted. Such tools are monitoring the content of the data traffic against pre-set business parameters, and building real time monitoring screens and/or sending business alerts on the basis of those parameters.
It IS technically possible to use some of the system level monitoring tools to monitor business parameters. But system level monitoring tools do not typically present the information in ways business users find helpful, nor do they alert in ways useful to the business user. (The average business user doesn’t have much interest in an SNMP trap, a common type of monitoring alert.) BAM tools fill this gap and have shown themselves to be of high business value to businesses that operate and adjust their business operations in semi-real time (businesses such as airlines and credit card processors).
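Business alert rules of the kind just described – flag any transaction over $100,000, alert when sales volume crosses a threshold – amount to a simple content monitor. A Java sketch; the thresholds and class names are illustrative, not from any BAM product:

```java
import java.util.ArrayList;
import java.util.List;

// Toy business-activity monitor: checks each passing transaction against
// preset business rules and collects the resulting alerts.
class BusinessMonitor {
    private final List<String> alerts = new ArrayList<>();
    private double salesTotal = 0;

    void observe(double amount) {
        if (amount > 100_000) {               // per-transaction business limit
            alerts.add("Transaction exceeds $100,000: " + amount);
        }
        salesTotal += amount;
        if (salesTotal > 1_000_000) {         // cumulative sales volume threshold
            alerts.add("Sales volume passed $1,000,000");
            salesTotal = 0;                   // reset after alerting
        }
    }

    List<String> alerts() { return alerts; }
}
```

A real BAM tool taps these observations off the ESB or messaging layer and routes the alerts to dashboards or notification channels rather than a list.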
CDC and BAM are an excellent choice to bring ETL style bulk operations
into a near real time model.
4.5. A Modern Batch Risk
Modern server environments have increased the number of CPUs and the number
of cores per CPU.
* What’s the difference between a CPU and a Core? Modern general purpose CPUs comprise a variety of parts within the physical computer chip: a processing unit (ALU + CU + registers), buffers, on-chip memory caches, I/O controllers, and more. CPU designers, in attempting to offer more power, determined that putting multiple processing units on one chip while sharing the supporting components resulted in a single physical chip that effectively acts as multiple CPUs. The shared components add only slight overhead, and actually provide value, as some tasks (and their associated input) may be shared among multiple processing units.
Today "CPU" refers to the physical chip connected to a board, and "cores" to the number of processing units within the chip. On a practical basis, each core of today is the equivalent of a CPU of 10 years ago. Technically this means a CPU with 4 cores (a quad-core CPU) can run 4 tasks in parallel; in practice a much larger number of parallel tasks are switched in and out of the processing units, so any CPU can effectively handle tens or hundreds of tasks in parallel. Regardless, each core means another processing engine available to handle tasks. More cores don’t make a computer faster (that’s the speed rating of the CPU), they make a computer able to do more in parallel.
Because of this increase in CPUs and cores, combined with a cluster of servers for processing, it is very reasonable to run hundreds of parallel threads to handle bulk processing. However, clustering at the database level is much more challenging (and expensive) and is rarely done. Therefore, increasing parallelization increases the load upon the database and can easily overwhelm a database environment.
Overwhelming the database can initially mean exceeding the database server’s ability to handle all the parallel requests – does the server have enough CPUs / cores and memory? But even if the database server is fully equipped with enough capacity, the load may still overwhelm the total I/O capacity of the server: it simply may not have enough bandwidth to the physical disks, and/or the physical disks may not be able to respond fast enough. While it is possible to use faster disks, RAID disk arrays, an SSD buffer layer or even large local disk caches, these solutions quickly become costly and will hit an absolute limit that is very challenging to overcome.
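One common defense is to cap database concurrency independently of the processing thread count, so hundreds of workers never translate into hundreds of simultaneous database sessions. A sketch using a counting semaphore; the class name and cap are illustrative (in practice a connection pool's size limit serves the same role):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Caps concurrent database work so a large worker pool
// cannot overwhelm the database tier.
class DbThrottle {
    private final Semaphore permits;

    DbThrottle(int maxConcurrentDbCalls) {
        this.permits = new Semaphore(maxConcurrentDbCalls);
    }

    // Runs dbWork while holding a permit; blocks when the cap is reached.
    <T> T withDb(Supplier<T> dbWork) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for a DB permit", e);
        }
        try {
            return dbWork.get();
        } finally {
            permits.release();
        }
    }
}
```

With this in place, adding worker threads increases queuing at the throttle rather than load on the database.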
5. Batch and Services, SOA, SOAP and/or Messaging
In Section 2 of this document we discussed Big Batch and a required supporting database pattern of grouping updates or inserts together into a single database commit. This grouping of operations into a single database transaction is not due to business need or any association among the operations; rather, it simply reduces the overhead of the database commit process (an operation that can’t be avoided or deactivated).
In that section (see the insert titled “How 100ms of I/O overhead can turn into 27 hours of processing time…”) we described how database commits turn into significant processing overhead, requiring a special batch transaction grouping pattern to minimize it.
We face a similar problem when activating services or creating/consuming messages in
bulk. Each operation requires:
- Creation of a service or message header.
- Creation of the service request or message content, usually involving
transformation of the data to an XML, JSON or other specialty format.
- Activating a communication connection/session.
- Transfer of the request or message.
- Waiting for an acknowledgement (asynchronous) or response (synchronous).
Depending on whether the connection is local or remote, or an optimized high speed connection versus regular communications between servers, activating a service has an average minimum overhead of roughly 50ms. Using the same example as in Section 2, if we are sequentially processing (or sending) 1,000,000 requests (or messages), we create about 14 hours of processing overhead just for the service communications and setup.
* Beat the overhead with a cluster and parallelization? You say you’re not doing it sequentially, and you have a cluster of 4 servers, each running a pool of 10 listeners? Doing so reduces the impact to only about 21 minutes of communication overhead (in our 1,000,000 transaction example). However, we’ve only moved the problem: we are now hitting our database with 40 parallel updates/inserts and commits, paying a double overhead impact (communication overhead plus commit overhead – see Section 2), and must significantly increase our database server capacity or overwhelm its processing capacity. Setting up a cluster is sometimes appropriate and occasionally the only option, especially when there is a mismatch between the processing models of the systems involved. But the expense and overhead can often be avoided, or at least reduced, with the right service processing pattern – the bulk service pattern.
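The arithmetic behind both numbers – the roughly 14 hours above and the roughly 21 minutes in the sidebar – is worth making explicit (50ms per activation is the assumed average from the text):

```java
// Reproduces the overhead arithmetic from the text: 1,000,000 calls
// at 50 ms each, sequentially and across 4 servers x 10 listeners.
class OverheadMath {
    static final long REQUESTS = 1_000_000L;
    static final long OVERHEAD_MS = 50;

    // 1,000,000 x 50 ms = 50,000 s, about 13.9 hours
    static double sequentialHours() {
        return REQUESTS * OVERHEAD_MS / 1000.0 / 3600.0;
    }

    // 40 parallel workers: 50,000,000 ms / 40 = 1,250 s, about 20.8 minutes
    static double clusteredMinutes(int servers, int listenersPerServer) {
        long parallel = (long) servers * listenersPerServer;
        return REQUESTS * OVERHEAD_MS / (double) parallel / 1000.0 / 60.0;
    }
}
```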
5.1. The Bulk Service Pattern
We commonly take a single operation and expose it, creating a single-request or single-transaction service. This leads us to think of services in a single-operation pattern. Yet there is no reason a SOAP service or a Message can’t contain multiple operation requests – multiple XML documents or JSON data sets in the same Message content or SOAP request – meaning one connect-and-communicate operation carries multiple service runs or business operations.
For example, an UpdateCustomerAddress service could contain:
<UpdateCustomerAddresses>
  <UpdateRequest ID="1">
    <CustomerAddressUpdate>
      <data>…</data>
    </CustomerAddressUpdate>
  </UpdateRequest>
  <UpdateRequest ID="2">
    <CustomerAddressUpdate>
      <data>…</data>
    </CustomerAddressUpdate>
  </UpdateRequest>
</UpdateCustomerAddresses>
The receiving/processing service could loop and process each request, or hand them
off to a pool of parallel threads to process a number in parallel, and/or open a
database transaction to group the operation under one database commit. A reply, if
necessary, would identify the results by ID number of the request. Note the ID
number only has to be unique in context of the particular request (so each request
can start at ID #1 and count upwards within the request).
This pattern allows us to bulk the communications together, reducing the
communications overhead by the number of operations bundled together in a single
service request or message.
The practical point is that any service can easily be built to handle one or more operation requests in a single SOAP call or Message. This small adjustment will
allow the service to be reused for small to medium bulk operations. (Sending 10
requests instead of 1 in one activation and letting the service processor loop
requires no redesign.) If larger numbers need to be sent through (such as 100 or
1,000), changes to manage database transaction grouping and/or creating a pool of
processing threads may be appropriate.
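On the receiving side, handling a bundle rather than a single request is a small change: loop over the bundled requests and key the results by the caller-assigned per-request ID, as described above. A hedged Java sketch; the UpdateRequest record and service class are illustrative types, not a real service API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a bulk-service handler: one call carries many requests;
// results are keyed by the caller-assigned per-request ID.
class BulkAddressService {
    record UpdateRequest(int id, String customer, String newAddress) {}

    Map<Integer, String> handle(List<UpdateRequest> bundle) {
        Map<Integer, String> results = new LinkedHashMap<>();
        for (UpdateRequest req : bundle) {
            // a single database transaction could wrap this whole loop,
            // grouping all updates under one commit
            results.put(req.id(), applyUpdate(req) ? "OK" : "FAILED");
        }
        return results;
    }

    // Stand-in for the real address update; rejects blank addresses.
    private boolean applyUpdate(UpdateRequest req) {
        return req.newAddress() != null && !req.newAddress().isBlank();
    }
}
```

For larger bundles, the loop body could instead be submitted to a worker pool, as discussed in the text.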
6. Big Batch Tools for Java
While writing this document my current project focus is Java oriented, so when looking
for Big Batch solutions my view was Java focused. The mainframe vendors and big
application vendors have been looking towards Java as a primary server side
development environment as well, so generally if you’re looking to follow traditional Big
Batch patterns but modernize with current tools, Java is most likely the way to go. (This
is not to say that Microsoft doesn’t offer completely effective bulk processing
approaches, but they tend to be less Big Batch traditional in their approach.)
The first key thing to say about the Java Big Batch Tools is they are rarely used. This is
because the Big Batch tools follow the nature of Big Batch and are:
- Process heavy.
- Complicated frameworks.
- A unique pattern within the application (making code reuse difficult between batch and online).
- A unique monitoring and operational framework.
The result is these tools are only used by projects with a very heavy Big Batch pattern
requirement. They have a significant learning curve.
6.1. Compute Grid (IBM)
IBM Websphere Extended Deployment Compute Grid is an extension now included
as part of the IBM Websphere Application Server J2EE container environment. It
provides a complete Mainframe batch replacement environment that will run (and
can distribute workload) across Windows Servers, Linux or Unix Servers, on the
IBM Mainframe Z/OS environment, any of these in virtualized form, and any of
these in a mixed combination.
The ability to distribute a “job” across a mixed server environment (a “grid”) makes
Compute Grid a promising solution for intensive computations that can be divided
into steps or bulk processing where the transactions can be divided across the
environment. The downside is a particularly complex framework and, in cases
where a database is required, the database remains a bottleneck no matter how far
the workload is divided.
Figure 4 - from IBM
Compute Grid has 2 common roles:
A. A full big batch replacement environment for moving big batch jobs, with full
big batch control and management features, from a mainframe processing
environment onto Linux / Unix / Windows servers. This allows re-hosting of
re-developed (but not re-architected) big batch jobs off the mainframe.
B. A workload balancing environment, clustering large bulk processing jobs
across a set of Websphere Application Servers running on Linux, Unix, and/or
Windows servers. It's worth mentioning that it can also run as part of
Websphere on the Mainframe, clustering batch job load across the Mainframe
to Linux/Unix/Windows – a nice way of maximizing mainframe utilization without having to buy additional expensive mainframe CPUs if that capacity is exceeded during the burst load requirements of a batch cycle.
6.2. Spring Batch
Spring Batch is part of the Spring Framework that "provides reusable functions that
are essential in processing large volumes of records, including logging/tracing,
transaction management, job processing statistics, job restart, skip, and resource
management. It also provides more advanced technical services and features that
will enable extremely high-volume and high performance batch jobs through
optimization and partitioning techniques."
Spring Batch is not an executable environment, but rather a Java library framework that manages execution within standard Java environments. If the development environment is Spring based, then Spring Batch is likely a good choice when Big Batch is required. And it's open source (free).
Figure 5 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 6 – From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 7 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
6.3. J2EE & JSR 352
JSR 352 is a Java Specification Request submitted by IBM to the Java Community
Process. It defines a set of Java language abilities, implemented as J2EE
extensions, to cover the needs of batch jobs being coded in Java. The spec was
finalized in early 2013 and can be expected to appear in J2EE containers in late
2013 and 2014.
The spec defines a new Java package, javax.batch, which is designed to offer basic Big Batch control functions from the Java container, including library classes for job operation, job and step context, and chunk-style reader/processor/writer processing.
7. A Modern Batch Strategy
Here is a recommended modern batch strategy for situations where a full new environment is being built.
7.1. Eliminate.
The vast majority of batch needs should be re-architected for event driven processing and decoupled processing whenever possible. An event driven model is more than just an approach to eliminating batch; it is an approach to creating a software model whose components maintain real time views as much as possible. See Section 3 for more details on these approaches.
7.2. Roll Your Own via Queuing.
Utilize assured delivery mechanisms via tools such as IBM MQ or TIBCO messaging to queue bulk volume processing, and then reliably process the bulk via parallelization and a processing cluster (see Sections 3.3 and 5.1 for more details on these methods).
Because we have queued the bulk, there is no possibility of job or transaction loss, though the processing component must be designed to handle and redirect business failures to a failure handling queue and process. This requires IT or the business to have a process to monitor and correct such business problems as they occur.
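The queue-and-redirect behavior described above can be sketched as follows. The failure condition (an item beginning with "bad") is a stand-in for a real business rule violation, and the class names are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Worker pattern from the strategy above: process items from the main
// queue, redirecting business failures to a failure queue for review.
class QueueWorker {
    final BlockingQueue<String> work = new LinkedBlockingQueue<>();
    final BlockingQueue<String> failures = new LinkedBlockingQueue<>();

    // Drains the work queue; failed items are redirected, never lost.
    void drain() {
        String item;
        while ((item = work.poll()) != null) {
            try {
                process(item);
            } catch (RuntimeException businessFailure) {
                failures.add(item);   // queued for IT/business correction
            }
        }
    }

    // Stand-in business logic: items beginning with "bad" violate a rule.
    void process(String item) {
        if (item.startsWith("bad")) {
            throw new RuntimeException("business rule violation: " + item);
        }
    }
}
```

In production the two queues would be durable MQ or TIBCO queues, so the items survive a crash between arrival and processing.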
7.3. Roll Your Own via BPA.
More complicated update or transaction processes may have multiple coordinated steps, one step that kicks off another, or even a human verification step in the middle. Business Process Automation (BPA) can act as the controller for such a multi-step process: a parallelized cluster processes each step and then emits an event/alert upon completion, to which the controller reacts by beginning the next step.
7.4. BAM + ESB for Near Real Time Data Distribution
BAM, Business Activity Monitoring tools, offer an interesting alternative to ETL – providing a real-time ETL process. Rather than building resource intensive ETL processes for data distribution, we create near real time data distribution processes using the BAM tool where appropriate, and the ESB where appropriate (and sometimes both in combination).
This will allow updating of the data warehouse in near real time, provide near real
time business monitoring, and can be used to create processes to distribute data
between applications (if such a requirement exists).
7.5. Summary
Few modern developers are familiar with the Java big batch tools (discussed in Section 6), and relatively few sites use them. The heavy overhead of these tools is beyond the need of most organizations – an overhead that includes training a specialized sub-team of developers to use them, as well as the strong possibility that they will require a specialized coding style somewhat incompatible with the online code set.
I generally recommend moving batch processes to other processing models for
elimination.