Modern Batch Directions
Enterprise Architecture by
Akiva Marks
Chief Architect – Architecture / SOA / Cloud / Integration Group
Guardian Information Systems – a Malam Team Company
[email protected] – phone +972-52-313-4184
http://MakingSOAwork.blogspot.com
TABLE OF CONTENTS
1. Overview
   1.1. Limitations of Batch
   1.2. Where Did My Batch Go?
2. What is (was) Big Batch?
3. Eliminating Batch
   3.1. Real Time Processing
   3.2. Business Event Driven Processing
   3.3. Decoupled Processing via Messaging
   3.4. Just Another Input Channel via Parallelization
   3.5. Summary
4. Modern Batch Alternatives
   4.1. Roll Your Own (Code Your Own Solution)
   4.2. BPA Controlled
   4.3. ETL – Extract, Transform, Load
   4.4. CDC or BAM as Real Time ETL Alternatives
   4.5. A Modern Batch Risk
5. Batch and Services, SOA, SOAP and/or Messaging
   5.1. The Bulk Service Pattern
6. Big Batch Tools for Java
   6.1. Compute Grid (IBM)
   6.2. Spring Batch
   6.3. J2EE & JSR 352
7. A Modern Batch Strategy
   7.1. Eliminate
   7.2. Roll Your Own via Queuing
   7.3. Roll Your Own via BPA
   7.4. BAM + ESB for Near Real Time Data Distribution
   7.5. Summary
INCLUDED ILLUSTRATIONS
Figure 1 - Example Big Batch Control Features
Figure 2 - Large Batch Job subdivided into parallel segments
Figure 3 - Arcati Mainframe User Survey 2013
Figure 4 - from IBM
Figure 5 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 6 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 7 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
1. Overview
A Mainframe-using customer recently approached me with a problem: "We're upgrading
our application and technology environment, please help us select the tool set to which to
move our batch jobs." It sounded like a straightforward need: just find a new technology
set in which to build new batch jobs and into which to move their old batch jobs. But as I
researched the problem in more depth, I found the technology approach in this area has
moved significantly, making an answer much more complicated.
Batch was a primary processing pattern for the majority of 2nd generation enterprise
applications, and the driving requirement for enterprise computer hardware and operating
environments from around 1960 through 1990. To meet this need, IBM together with
other vendors created a complete ecosystem of hardware, software and database tools
optimizing performance of this pattern.
These tools were fully effective at developing, controlling, managing and monitoring
complex batch jobs. This didn’t mean batch jobs were without problems. Large
dependency chains are always vulnerable to any step or link having a problem. Even
with the most sophisticated batch monitoring and support tools in place, operator
intervention (to stop and restart, recover, or re-run a job) was a frequent experience in
every batch environment.
Batch was an effective method of processing large quantities of data with limited and
costly computing resources. As the paradigms of data processing changed (such as the
driving business need for real time data) and the resource-cost basis of the computing
resources changed, the limitations of the batch pattern became more significant and the
alternatives more viable.
1.1. Limitations of Batch
The batch processing pattern presents 3 inherent limitations:
1. Any problems with a batch job step often meant failure of the job as well as
requiring operator intervention to recover. Almost every IT enterprise
running batch jobs had “night operators”, or at least on call operators, to
intervene when necessary – usually at least once a week if not more.
2. Batch job resource requirements often scaled linearly: if in 1980 it
took 2 hours to process 100,000 customers, and by 1990 we had 1,000,000
customers, processing then took 20 hours. Since batch jobs usually
had to fit within windows of available resources (time frames when online
processing was shut down to free up resources for batch processing),
usually an 8-12 hour overnight slot or a weekend, increasing volume
often put the enterprise into a resource squeeze. (Though IBM would
always be happy to deliver the latest Mainframe and latest database, at huge
cost, to get you a 25% increase in handling capacity from the next
generation of hardware.)
3. Batch, by nature, is not real time, so any data processed via batch
patterns left a system that was not up to date. In some cases this was not very
important, but in other cases the business and the customers came to expect
real time updates. For example, it used to be that when you made a bank
deposit, you received a note that "deposits may not be applied for 48 hours".
So even if you deposited money in an ATM (a Caspomat in Israel),
you would not see the update to your balance until the next day or later.
With the creation of customer web sites showing account balances, having
balances that didn't include updates from certain sources due to IT
processing delays became unacceptable.
While updates to the supporting technologies over time provided a variety of
workarounds and updated techniques to reduce the impact of these problems, they
remain fundamental problems of batch processing.
1.2. Where Did My Batch Go?
As we moved forward to client/server applications, web based applications, and
service oriented applications, we've heard less and less about batch. Few new
systems being developed today use the 2nd generation batch processing pattern.
Even rarer is the programmer aged 35 or younger who has any idea
about batch. The verdict is in: "Big Batch" is dead.
But there are still plenty of newer applications and processing instances where large
amounts of data are processed and updated. A batch programmer looking at these
processes would instantly identify many batch-like techniques in use, though in more
targeted and limited implementations – or, more to the point, as an integrated part of
the application rather than as a separately managed and controlled processing pattern
of its own.
Examples of batch like operations in modern applications include:
- ETL processes (extract, transform, load), particularly ETL extracts that are
then loaded into a data warehouse.
- Billing cycles: generating customer bills and updating customer accounts to
their billed status.
- Updating groups of data across systems.
In recent examples of Batch Processing described by an IBM architect, he wrote:
Batch processing is execution of series of "jobs" that is suitable for non-interactive, bulk-oriented and long-running tasks. Typical examples are end-of-month bank statement generation, end-of-day jobs such as interest calculation, and ETL (extract-transform-load) in a data warehouse. These tasks are typically data or computationally intensive, execute sequentially or in parallel, and may be initiated through various invocation models, including ad-hoc, scheduled, and on-demand.
Yet that description no longer describes a chain of large, bulk oriented, long
running tasks. Rather, it shows that modern bulk processing has devolved into
single-task operations (even if that task handles a large quantity of the particular
transactions or operations).
Some big batch techniques remain and have become an integrated part of the
modern application environment. The need for bulk processing has not completely
disappeared. But in many cases the need has been eliminated…
2. What is (was) Big Batch?
This section is written for the younger technologists. If you have developed on the
Mainframe or developed Big Batch jobs in the past, you may skip or skim this section.
The vast majority of today’s developers have never worked on a mainframe or in a 2nd
generation language (such as COBOL). To relate to where batch is going, first we must
present a brief technical overview of what batch was. And though I use the word “was”,
almost every large IT enterprise has an older generation of applications, and many of the
oldest continue to run Big Batch. If there’s a Mainframe in the data center, there’s
probably some Big Batch running every night.
In general, Big Batch is a series of batch jobs, each job consisting of job steps, each step
being the execution of bulk processing of an extract, transform, load, or transaction
processing step. A scheduler starts the jobs at specified times or upon receipt of input of
specified type. A controller executes each job step, and a supervisor monitors the results
of each step.
Figure 1 - Example Big Batch Control Features
A single “Big Batch” production job may have 1 – 100 or more job steps. Each job step
comprises a script (JCL – Job Control Language) that may run 1 or many program
modules or utility operations. Such a job step script could itself be 1 – 500 commands.
The Big Batch execution cycle tries to maximize the resources available for the bulk
processing (shutting down online processing to free up memory and CPU for the batch
cycle). Further, due to the quantities of data involved, it has to be very careful to manage
the I/O.
* How 100ms of I/O overhead can turn into 27 hours of processing time…

The average developer is very familiar with database transactions but not with what's happening behind the scenes and its resource cost. Multiple database change operations (insert/create, update, delete) can be performed (across any number of tables in the same database schema/data set) and grouped into a "transaction". If any of the operations fail, the transaction is discarded, or in database terminology "rolled back", meaning the full group of operations is never applied. If all the operations succeed, the transaction is permanently applied as a group, or in database terminology "committed". (As an aside, attempting to build a transaction that spans multiple databases or multiple applications is called a "two phase commit" and is considered technically challenging and operationally risky.)

For a relational database (DB/2, Oracle, Microsoft SQL Server, MySQL, etc.) to create a transaction, a special technique is activated for tracking the changes without actually saving them. Rather than apply the changes to the physical tables, upon beginning a transaction the database builds a "transaction log entry" (this may differ slightly by vendor; Oracle, for example, builds a "log buffer" and "data buffer" set for the transaction). As each operation is executed, the new state and old state are saved in the log(s). Upon an attempted commit, the new state is moved from the log to the actual tables; if there is a failure, the old state is restored; if it succeeds, the old state and log entries are discarded. It's a surprising amount of actual work and data movement for even seemingly small changes, but it is key functionality within a relational database, and all this log and live-table interaction carries significant overhead.

Even if that overhead (with today's highly optimized relational databases) comes out to no more than 100ms per transaction, processing 1,000,000 updates comes out to roughly 27 hours of overhead just for the commits (not including the time of the actual processing). [The time calculation assumes those 1,000,000 updates are applied sequentially.] Therefore bulk processing of database updates to a relational database commonly groups 10-1,000 updates/inserts into a single "transaction", resulting in 1 commit per 10-1,000 updates/inserts. This is done not because the updates/inserts are business related (the normal reason for a transaction), but to reduce the number of database commits and their overhead (which provides no functional value for a single insert but can't be avoided).
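To make the arithmetic concrete, here is a minimal sketch of commit grouping, shown in Python with SQLite purely for illustration; the table, the `bulk_update` helper, and the batch size of 100 (an arbitrary pick from the 10-1,000 range above) are all hypothetical:

```python
# Sketch: grouping unrelated updates into shared commits to amortize overhead.
import sqlite3

def bulk_update(conn, rows, batch_size=100):
    """Apply updates in groups, issuing one commit per batch_size rows."""
    commits = 0
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        for cust_id, balance in rows[i:i + batch_size]:
            cur.execute("UPDATE customers SET balance = ? WHERE id = ?",
                        (balance, cust_id))
        conn.commit()          # one commit covers the whole group
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO customers VALUES (?, 0)", [(i,) for i in range(1000)])
conn.commit()

rows = [(i, 100.0) for i in range(1000)]
print(bulk_update(conn, rows))  # 1,000 updates cost only 10 commits
```

At 100ms per commit, those 10 commits cost one second of overhead instead of the 100 seconds that 1,000 individual commits would have cost.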
Grouping of operations into a single database transaction is not due to business need or
association of the operations, rather it’s simply to reduce the overhead of the database
commit process (an operation that can’t be avoided or deactivated). Unfortunately when
grouping a large number of unrelated operations into a single database transaction, the
failure of any one operation will fail the transaction – causing a business impact between
operations where no business connection exists. Managing this impact becomes a
major part of Big Batch. Job segments that have had failures need to be recovered, have
the problem repaired (which may be as straightforward as removing the single
problematic update/insert), and then resubmitted. Low level operational control of job
segments is a standard part of Big Batch control environments for exactly this reason.
I/O and processing management is a major portion of traditional batch jobs, and part of
what adds to their complexity. Large input files are commonly sorted prior to processing,
allowing processing to optimize database access on the primary indexes and allowing the
database to pre-fetch and buffer with predictive algorithms. Sometimes data sets are
queried from the database, sorted by a needed access pattern and written as a sequential
file to speed access.
Output of one job step is frequently required input to another job step, and sometimes
different jobs run in parallel and produce required input for each other. A web
of dependencies, both internal and external, is frequently created. The result is
complicated and highly dependent on inputs arriving at the right time. Because
input/output sets are usually sent as files, yesterday's input arriving (perhaps due to a
previous job's failure and re-run) was not an infrequent occurrence.
Figure 2 - Large Batch Job subdivided into parallel segments
Because sorting and ordering the input sets was usually part of the job steps, the
controlling steps could split the input sets and execute segments in parallel to maximize
resource utilization. This could continue until the CPU capacity, memory capacity,
database capacity or IO capacity was hit, though in practice the average batch job was
rarely split into more than 5 segments.
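The sort-split-merge pattern above can be sketched as follows; this is an illustrative Python stand-in (the record layout, the `process_segment` step, and the default of 5 segments are hypothetical), not a model of any particular mainframe controller:

```python
# Sketch: a controller sorts the input set, splits it into segments, runs the
# segments in parallel, then merges the segment results.
from concurrent.futures import ThreadPoolExecutor

def process_segment(segment):
    # Stand-in for a job step's per-record processing.
    return sum(record["amount"] for record in segment)

def run_parallel(records, segments=5):
    records = sorted(records, key=lambda r: r["key"])  # sort step precedes the split
    size = (len(records) + segments - 1) // segments   # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=segments) as pool:
        partials = list(pool.map(process_segment, chunks))
    return sum(partials)  # merge step combines the segment results

records = [{"key": i, "amount": 1.0} for i in range(1000)]
print(run_parallel(records))  # same total as a single sequential pass
```

As the text notes, the useful number of segments is capped by CPU, memory, database, and I/O capacity, which is why `segments` is a tunable parameter rather than "as many as possible".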
Programs that "did batch" were almost always exclusively written for batch. The batch
approach meant different models of input and output handling, and possibly different
models of data access, and different approaches to transaction management. Even the
database could contain optimizations: special indexes for batch access, even physical
table structures laid out on the disk for optimized bulk access.
Big Batch worked, and is still working at many organizations. Yet the amount of
enterprise data that still lives on the mainframe has fallen to 40% or less for the majority
of enterprises with a mainframe. So Big Batch and the mainframe structures that enable
it continue to decrease, from 80% of organizations in 2000, to 50% in 2005, to perhaps
30% today.
Figure 3 - Arcati Mainframe User Survey 2013
3. Eliminating Batch
When looking to eliminate batch processes, here are some of the common approaches for
doing so. Most organizations will use a combination of these and others, depending on
the exact need.
3.1. Real Time Processing
IT processes that were, in older systems, locked up in the application code are often
exposed today via a Web Service or Messaging based interface (or both). This
makes connecting systems and feeding an application data in real time easier
than building and managing a batch process. It also spreads the work load
for the process across the day.
* Hardware Capacity and Batch vs. Real Time

One of the problems of managing capacity at the hardware level is variable workloads – batch being a particularly heavy workload. If we purchase capacity to handle the burst load, which in batch environments is usually the batch load, that extra capacity sits idle throughout most of the day. By spreading the work load over the day, we maintain a more constant load and can operate at a higher level of constant capacity utilization, reducing our hardware requirements and associated cost.
The primary reason for moving to real time processing (or near real time) is to have
an up-to-date data view across our application environment.
* Real Time or Near Real Time?

Real time technically means immediate processing across the connected environment or applications, or more specifically relates to hardware level programming and responding to electrical signals. Near Real Time means a system may send an asynchronous event or message to another system, which applies the update "very soon" – usually within a few seconds. In most business scenarios Near Real Time is perfectly adequate. Actual Real Time, or synchronized transactions and updates, may be necessary for certain financial business processes, such as stock trading, or infrastructure (such as telecommunications) operations. But Actual Real Time incurs significant extra implementation costs and should only be used when absolutely necessary. Within this document we'll use the general term "Real Time", but usually mean "Near Real Time".
Real Time processing eliminates much if not a majority of batch jobs in most
organizations, if systems on both sides are being or have previously been
updated to modern generation applications or been modified to expose
transactions or processes via modern interface methods.
* CICS / Mainframe Web Services?

IBM and other vendors have provided tools to web-service-enable older 2nd generation program modules, so even some of the oldest applications can be moved to Real Time. CICS and Enterprise COBOL have been web service enabled, able to expose and consume web services (with some XML limitations). Similarly, Software AG has web service enabled their Natural language environment, and even DB/2 on the Mainframe can expose Stored Procedures as Web Services. Further, a variety of other vendors and tools are available to bridge web services and messaging into various 2nd generation Mainframe languages and environments.

However, the ability to do so easily depends on the architecture model of the code within the older software. If the processing level or transaction level is separated from the user interface level, represented as a separate callable module or library, then it can almost instantly be exposed (changing the older system from user interaction to a transaction engine). However, if the code logic is tightly wrapped together with screens and/or the interface method, then the effort to separate the transaction or process logic from the interaction logic may exceed the cost of rebuilding the abilities in a modern environment.
We sometimes get into the humorous yet incredibly inefficient situation where one
side of a process will update their system, allowing for real time, but the other side
will not. The updated side must then use a bridging technique (such as using an
ESB, enterprise service bus, or messaging system to save real time transactions
throughout the day and then send them in a batch later) to maintain the existing
batch process. Some years later the other side is updated, but they ALSO use a
bridging technique to “maintain compatibility” with the existing interface rather
than re-engineer a working process (or not even knowing the other side is using a
bridging technique). We end up with two real time capable systems connecting and
operating in a batch pattern while maintaining bridging code. The bridging code,
serving no real need, is maintained at extra expense and actively reduces data
freshness between the systems. This is particularly likely when such processes
are between companies.
3.2. Business Event Driven Processing
Our operating systems, browser environments and mobile systems all operate in an
event driven mode. Every Windows programmer is well familiar with responding
to GUI events and system events, and every JavaScript programmer is familiar
with browser events such as OnLoad or OnClick.
Business Event Driven Processing extends this idea to business events. Events
such as NewEmployee, CustomerSale, or ProductShipped can be listened for and
subscribed to. So rather than the Billing System having to wait until the evening to
be pushed a list of sales made today, it can subscribe to "CustomerSale" events.
This further allows processing to be decoupled. In the example above, the Billing
System could process the CustomerSale event in real time, in the background at a
lower priority, in the evening to reduce impact on users, or at the end of the month
as part of its bill generation cycle. (The last option would mean its view of the
customer state would be out of date, but perhaps it's not the system used to view
customer state – so there is no need to keep it constantly up to date.)
This pattern allows the systems to communicate, push needed data, yet process on
their own individual application cycle. Where a near-real-time result is needed, the
needing application can subscribe and process immediately upon receipt of an
event. Since subscribing and processing are the responsibility of the subscriber (the
system needing the data or processing the transaction), the provider (the system
sharing the event and associated data) is completely decoupled from the subscribing
systems.
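A minimal in-process sketch of this publish/subscribe decoupling, reusing the CustomerSale event from the Billing System example; the `EventBus` class here is a hypothetical stand-in for real publish/subscribe middleware, not a product API:

```python
# Sketch: providers publish business events; subscribers register handlers.
# The provider neither knows nor cares who (if anyone) is listening.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every registered subscriber.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
billed = []

# The Billing System subscribes instead of waiting for an evening batch feed.
bus.subscribe("CustomerSale", lambda sale: billed.append(sale["customer"]))

bus.publish("CustomerSale", {"customer": "ACME", "amount": 250.0})
bus.publish("ProductShipped", {"order": 17})   # no subscriber: silently ignored
print(billed)  # ['ACME']
```

In a real deployment each handler would run on the subscriber's own schedule (immediately, in background, or on its billing cycle); here delivery is synchronous only to keep the sketch small.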
Because the systems are physically decoupled, it is important for the IT
development process to include steps to maintain a catalog of events (type and
content), providers of the events, and subscribers of the events. This becomes the
only way to determine change impact.
From a historical batch perspective, one could say each subscriber is responsible for
its batch step and its cycle. Success or failure to process is its own local
responsibility to manage without any impact on the providing system or on any
other consuming systems. In other words, even each sub-step of what would have
been a full batch job is isolated, with no impact beyond the system involved in the
particular sub-step.
3.3. Decoupled Processing via Messaging
Decoupled Processing via Messaging is similar to Business Event Driven
Processing, but rather than consuming systems subscribing to events (a fully
decoupled pull model), the producing system sends specific messages to one or
more designated receiving/consuming systems (a loosely coupled push model).
* Messaging and Communication Patterns

Messaging refers to an asynchronous communication pattern, where a sender targets a message at a queue, and the receiver or consumer reads from the queue – at its convenience.

Messaging is usually considered a reliable or guaranteed delivery approach, with messages being persistent until consumed. IBM's Websphere MQ is the top market solution for such a reliable asynchronous messaging infrastructure, though TIBCO offers several very viable alternatives. On a smaller scale, Microsoft offers MSMQ as well as a messaging infrastructure in their Biztalk integration server environment.
With Messaging, each system requiring data, updates, or transaction activation has
the operation sent as a message. The sending and receiving of the message is an
asynchronous decoupled process, meaning the sending system writes the message
to a target queue and their task is complete. The queue transactionally
acknowledges receipt of the message, at which point the sending system's
responsibility is complete. The receiving system that will process the message
either reads the queue at its convenience (according to its availability and
processing schedule) or monitors it, activating processing upon message receipt
(usually monitoring with multiple threads, allowing parallel processing of multiple
messages simultaneously).
The reading and processing is also transactional, the message being locked at the
start of the processing and deleted upon successful completion (or unlocked upon
processing failure and rollback).
Therefore, neither side of the process is dependent upon the other side to operate or
complete, nor does either side stop and wait (for receipt, for processing, for success
or fail) for the other side.
Message queues themselves may be monitored and activate alerts should messages
be aging (not being picked up) or growing too many in the queue (not being
processed fast enough).
It is possible for multiple systems to be a provider of a particular message.
However, with the use of a queue, only one system may be the consumer since once
a message is read and processed, it is deleted from the queue. Of course that
"system" may be a server cluster, with all active members of the cluster listening to
and consuming messages from the same physical queue. The point is that two
separate applications cannot process the same message from the same queue. If
there is such a need, in this pattern you would have the sending application send the
message multiple times, once to each queue for each consuming application.
One final point on messaging. The sending system may bundle multiple transaction
or operation requests in a single message AND the receiving system may choose to
read and process multiple messages in a single operation or transaction. A message
may contain as much data or as many transactions as is appropriate. For example,
if 10 different updates for a customer were being sent, it would be very appropriate
to bundle all 10 updates for the same customer into one message. Similarly, the
receiving system (the system reading the messages from the queue) may determine
that, to maximize processing efficiency and minimize database overhead, it will read
10 messages from the queue and process them in a single transaction. When sending
or receiving large quantities of transactions, both patterns may be appropriate and
should be considered.
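The receiving side's batch read can be sketched as follows; Python's in-memory `queue.Queue` stands in here for a real messaging product such as Websphere MQ, and the batch size of 10 follows the example above:

```python
# Sketch: a consumer drains up to N messages from a queue and processes them
# as one unit of work, reducing per-message (and per-commit) overhead.
import queue

def drain(q, max_batch=10):
    """Read up to max_batch messages without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(25):
    q.put({"customer": i % 5, "update": i})   # the sender posts 25 messages

batches = []
while True:
    batch = drain(q)
    if not batch:
        break
    batches.append(batch)   # each batch would be one processing transaction

print([len(b) for b in batches])  # 25 messages consumed as [10, 10, 5]
```

A production consumer would also acknowledge or roll back each batch transactionally, as described above; that bookkeeping is omitted from the sketch.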
3.4. Just Another Input Channel via Parallelization
What the remainder of batch often becomes is ‘just another input channel.’ The
point of this approach is to use the same online code set for bulk processing,
handling the bulk load via parallelization.
Parallelization means taking the group of transactions to be processed and feeding
them to a pool of threads that may be deployed across a cluster of application
instances. Each thread processes its given transaction via the same code, object,
or service as an online process.
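The "same code set" idea can be sketched as follows, assuming a hypothetical `apply_payment` service function; the online channel calls it once per user action, while the bulk channel feeds it through a thread pool:

```python
# Sketch: one service function serves both the online and the bulk channel.
from concurrent.futures import ThreadPoolExecutor

def apply_payment(txn):
    # The single code path: one transaction in, one result out.
    return {"account": txn["account"], "status": "applied"}

def online_channel(txn):
    # Online users invoke the code one transaction at a time.
    return apply_payment(txn)

def bulk_channel(txns, workers=8):
    # Bulk input is just another channel: the same code, fed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(apply_payment, txns))

print(online_channel({"account": 1})["status"])       # applied
results = bulk_channel([{"account": i} for i in range(100)])
print(len(results))  # 100
```

Because both channels exercise identical logic, there is no separate batch program to keep in sync with the online code – which is exactly the appeal, and also the source of the risk discussed in Section 4.5.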
Distributing the load across a cluster can be technically difficult. Messaging is a
frequent approach to do so, though in particularly complicated models or those with
very large numbers of such tasks or processes it is appropriate to consider modern
batch tools (see Section 6 for a discussion of these tools).
Just Another Input Channel is a very viable option for the remainder of bulk
processing needs after the previous elimination methods are used, but does carry
risk. See Section 4.5, A Modern Batch Risk for more information.
3.5. Summary
These are just some of the popular architecture patterns for eliminating batch
processing. They are dependent on both sides of the processing chain having
flexible interface options or having available bridging technologies. Fortunately
such options are now available in almost all circumstances, even for older systems
and technologies – though in some cases programmers or managers of older
systems may not be aware that new options have become available or feel
uncomfortable working “with the new stuff”.
We should emphasize that "new" options are not new. As an example, CICS
(IBM Mainframe environment for transaction management) gained web service
enabling in 2007. Prior to it being added to CICS, both Mercator and MQ offered a
functional way of exposing Z/OS (mainframe operating system) program modules
as external interfaces at least back to the year 2000 if not before.
4. Modern Batch Alternatives
With much of processing moving to real time, event driven, messaging oriented and other
similar patterns, a major portion of batch processing has been eliminated. Yet as we
stated early in this document, batch processing has not disappeared. Bulk data processing
has, in some ways, increased in volume. Data warehouses are constantly taking batch
loads of data from other systems. Business Intelligence and Big Data are driven by
processing huge volumes of data, and those volumes are (generally) not moving real
time. So what is happening?
4.1. Roll Your Own (Code Your Own Solution)
Roll Your Own is an American expression, originally about hand-rolled cigarettes,
meaning make your own. In consulting with numerous enterprise architects around the
world (with the U.S. heavily represented), I found this is the majority approach.
With more limited bulk processing requirements, a more limited approach is
perfectly adequate for most situations. In other words, rather than trying to
implement a large managed processing framework (see Section 6 – Big Batch
Tools for Java, for discussion of such frameworks), use of a locally coded thread
pool, a message queue or even an input table will meet the needs.
One of the side effects of such an approach is there is no central or consistent
approach for bulk processing. Rather, there may be one approach for handling a
few files, another for parallelizing certain types of transaction groups, another for
extracting and loading the data warehouse.
Yet this is exactly the point: the requirements are no longer met with a large generalized approach, but targeted with the right tool and the right technique to meet the now narrower circumstances.
For example, large regular data transfers from one database to another are today
commonly done with an ETL tool (such as Informatica). No modern developer
would manually code a large set of queries, write out sequential files, write
programs to transform their format, then write programs to manage the loading of
those files into the target data warehouse database. Since ETL tools contain all of
these abilities and perform them in a self-controlled environment, as well as
offering a visual scripting ability to design the steps, the development and
maintenance cycles are significantly decreased. Further, the runtime environment
is optimized for its narrow function set, and therefore may perform significantly
faster than manually developed code to perform the same function.
* Who can optimize better, you or the vendor?
Narrowly focused vendor products (or open source products) – databases, ETL tools, ESBs, messaging platforms, etc. – have been focused on their IT and technical problems for years or even decades. Assuming their approach is one that solves your particular IT problem, their environment will usually be both more feature rich and more optimized than what you can develop. How can we say this? Tool vendors often apply 50-500 developers to their product over years, even decades; a product with 100-1,000 man-years of work invested is the norm. Further, vendors are not just working out feature sets for their product but bringing in top algorithm designers, software architects and computer scientists to create new processes for maximizing performance and features.
BUT, note that this assumes their approach meets your need. A classic example of this is a relational database versus a NoSQL graph database. Even the best optimized relational database may provide poor performance and a poor development process (due to an unnecessarily complex relational data model) if the requirement calls for a data set with an extremely large number of relationships, particularly bi-directional relationships. [Note that the latest announced releases of relational databases now include graph features to overcome this limitation and retain their role as the primary data storage tool.]
Of course this is a generalization. While most business situations are good fits for traditional vendor tools, more specific or narrow problems may not be, and unique approaches may offer the particular IT organization a competitive advantage. A good example of this was Amazon.com taking a unique approach to IT hardware resource allocation, creating an on-demand system that they were eventually able to productize and sell, now known as Amazon Web Services (Amazon EC2 and S3 – elastic computing and storage – and more).
We find most organizations coding a small controller for handling bulk volume,
queuing or staging the transactions (in a reliable auto-recoverable model),
processing them through parallelization, and scheduling the operation through
standard scheduling tools (such as BMC’s Control-M). Different needs or models
within the environments will have separate controllers and staging methods, or
different instances of similar methods. Generally the volume of different needs is
not so great that any type of generalized or larger control approach is needed.
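As a concrete illustration of the pattern just described – a small controller, staged transactions, parallel processing – here is a minimal Java sketch. The class and method names (BulkController, processRecord) are illustrative, not taken from any product, and the in-memory list stands in for what would in practice be a queue or staging table fed by the scheduler.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal roll-your-own bulk controller: stage work, process it in parallel.
class BulkController {
    private final ExecutorService pool;

    BulkController(int workers) {
        this.pool = Executors.newFixedThreadPool(workers);
    }

    // Processes every staged record through the worker pool;
    // returns the count handled. One-shot: the pool is drained and closed.
    int processAll(List<String> stagedRecords) {
        AtomicInteger handled = new AtomicInteger();
        for (String record : stagedRecords) {
            pool.submit(() -> {
                processRecord(record);      // real business logic goes here
                handled.incrementAndGet();
            });
        }
        pool.shutdown();                    // accept no new work, drain the queue
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    private void processRecord(String record) {
        // placeholder: update, transform, or forward the record
    }
}
```

In a real deployment the scheduler (Control-M or similar) would simply launch this program against the staged input.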
4.2. BPA Controlled
BPA = Business Process Automation. Business Process Automation generally means a scripted controlling process (rather than a coded process) that tracks its own status and state (along with mechanisms for querying and reporting upon them).
BPA is a surprisingly good fit for bulk processes that have more than one step. Rather than having to code a controller and manage state transitions, the BPA tool does this as its native functionality. Further, for those processes that have requirements for monitoring / viewing of status, again the BPA tool provides some level of this natively.
This is a good option if a BPM or BPA tool already exists within the environment;
otherwise it’s only appropriate when sufficient need exists to make it worthwhile to
add another tool.
4.3. ETL – Extract, Transform, Load
ETL (extract, transform and load) tools have taken over a space that was formerly a major portion of Big Batch…
- Manually code a large set of queries.
- Write out sequential files.
- Write programs to transform their format.
- Write programs to load the transformed files into the target data warehouse
database.
Today the ETL tool and its environment provide a relatively fast way to develop
and deploy such processes. They are optimized and efficient, and offer additional
abilities such as complex data transformations and data cleansing.
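The four manually coded steps that ETL tools replaced can be pictured as one small pipeline. A toy Java sketch, with in-memory lists standing in for the source and target databases (the class name and transformation rules are illustrative only):

```java
import java.util.List;
import java.util.stream.Collectors;

// Miniature ETL: extract rows, transform their format, load the target.
// Lists stand in for the source and target databases.
class MiniEtl {
    static List<String> run(List<String> sourceRows) {
        return sourceRows.stream()
                .filter(row -> !row.isBlank())          // extract: skip empty rows
                .map(row -> row.trim().toUpperCase())   // transform: normalize format
                .collect(Collectors.toList());          // load: into the target
    }
}
```

A real ETL tool adds what this sketch lacks: restartability, bulk-load optimization, data cleansing rules, and a visual design environment.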
Interestingly, they can also interface to and use other protocols besides direct
database access, such as web services. Prior to using web services as a data source
for an ETL process, please see Section 5 - Batch and Services, SOA, SOAP and/or
Messaging.
In general, ETL tools are very good at what they do, popular, and found in most large IT organizations. However, they suffer from the classic batch problem of not being real time, making them inadequate when frequent updates or a near real time view is necessary.
4.4. CDC or BAM as Real Time ETL Alternatives
CDC = Change Data Capture, a set of tools that monitor a database in real time, transferring updates as they occur out to another destination database. As with ETL, format transformation and cleansing may occur along the way; the difference from ETL is that the change capture and transfer happen immediately, as each change occurs.
BAM = Business Activity Monitoring. It is a set of tools that plugs in to
integration points, such as an ESB or Messaging, extracting data as it moves
through those environments. The data may be aggregated for real-time presentation
on monitoring screens, for generating real time business alerts based on content or
volume, or sent on to activate other services or processes.
CDC and BAM are the modern tools for real time activity monitoring and updating.
With these tools one can monitor real time activity at the application/integration
level or at the database level, reacting (triggering) on what has changed or how
much it has changed. Changes can be sent on to other databases, on to the data
warehouse for (near) real time updates, other applications, or trigger transactions,
processes, or business alerts.
* What’s a Business Alert? IT people are used to monitoring servers, systems, and applications. Tools such as BMC Patrol will monitor your database, verifying it has sufficient memory and CPU resources, that it’s not being flooded with activity beyond its capacity, and that it has sufficient storage for the ongoing demand. Similar tools are available to monitor the various servers, the operating systems, middleware environments (MQ, ESB), and the application environments (.Net or Websphere, for example).
Business Activity Monitors perform similar monitoring, but are designed to look at the content of the activity. They may monitor that no transaction exceeds $100,000, for example. Or that if new sales exceed a certain volume, management is alerted. Such tools are monitoring the content of the data traffic against pre-set business parameters, and building real time monitoring screens and/or sending business alerts on the basis of those parameters.
It IS technically possible to use some of the system level monitoring tools to monitor business parameters. But system level monitoring tools do not typically present the information in ways business users find helpful, nor do they alert in ways useful to the business user. (The average business user doesn’t have much interest in an SNMP trap, a common type of monitoring alert.) BAM tools fill this gap and have shown themselves to be of high business value to businesses that operate and adjust their business operations in semi-real time (businesses such as airlines and credit card processors).
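Business alert rules of the kind just described – flag any transaction over $100,000, alert when sales volume crosses a threshold – amount to a simple content monitor. A Java sketch; the thresholds and class names are illustrative, not from any BAM product:

```java
import java.util.ArrayList;
import java.util.List;

// Toy business-activity monitor: checks each passing transaction against
// preset business rules and collects the resulting alerts.
class BusinessMonitor {
    private final List<String> alerts = new ArrayList<>();
    private double salesTotal = 0;

    void observe(double amount) {
        if (amount > 100_000) {               // per-transaction business limit
            alerts.add("Transaction exceeds $100,000: " + amount);
        }
        salesTotal += amount;
        if (salesTotal > 1_000_000) {         // cumulative sales volume threshold
            alerts.add("Sales volume passed $1,000,000");
            salesTotal = 0;                   // reset after alerting
        }
    }

    List<String> alerts() { return alerts; }
}
```

A real BAM tool taps these observations off the ESB or messaging layer and routes the alerts to dashboards or notification channels rather than a list.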
CDC and BAM are an excellent choice to bring ETL style bulk operations
into a near real time model.
4.5. A Modern Batch Risk
Modern server environments have increased the number of CPUs and the number
of cores per CPU.
* What’s the difference between a CPU and a Core? Modern general purpose CPUs comprise a variety of parts within the physical computer chip: a processing unit (ALU + CU + registers), buffers, on-chip memory caches, I/O controllers, and more. CPU designers, in attempting to offer more power, determined that putting multiple processing units on one chip while sharing the supporting components resulted in a single physical chip that effectively acts as multiple CPUs. The shared components add only slight overhead, and actually provide value, as some tasks (and their associated input) may be shared among multiple processing units.
Today "CPU" refers to the physical chip connected to a board, and "cores" to the number of processing units within the chip. On a practical basis, each core of today is the equivalent of a CPU of 10 years ago. Technically this means a CPU with 4 cores (a quad-core CPU) can run 4 tasks in parallel; in practice a much larger number of parallel tasks are switched in and out of the processing units, so any CPU can effectively handle tens or hundreds of tasks in parallel. Regardless, each core means another processing engine available to handle tasks. More cores don’t make a computer faster (that’s the speed rating of the CPU), they make a computer able to do more in parallel.
Because of this increase in CPUs and cores, combined with a cluster of servers for processing, it is very reasonable to run hundreds of parallel threads to handle bulk processing. However, clustering at the database level is much more challenging (and expensive) and is rarely done. Therefore, increasing parallelization increases the load upon the database and can easily overwhelm a database environment.
Overwhelming the database can initially mean exceeding the database server’s ability to handle all the parallel requests – does the server have enough CPUs / cores and memory? But even if the database server is fully equipped with enough capacity, the load may still overwhelm the total I/O capacity of the server: it simply may not have enough bandwidth to the physical disks, and/or the physical disks may not be able to respond fast enough. While it is possible to use faster disks, RAID disk arrays, an SSD buffer layer or even large local disk caches, these solutions quickly become costly and will hit an absolute limit that is very challenging to overcome.
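One common defense is to cap database concurrency independently of the processing thread count, so hundreds of workers never translate into hundreds of simultaneous database sessions. A sketch using a counting semaphore; the class name and cap are illustrative (in practice a connection pool's size limit serves the same role):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Caps concurrent database work so a large worker pool
// cannot overwhelm the database tier.
class DbThrottle {
    private final Semaphore permits;

    DbThrottle(int maxConcurrentDbCalls) {
        this.permits = new Semaphore(maxConcurrentDbCalls);
    }

    // Runs dbWork while holding a permit; blocks when the cap is reached.
    <T> T withDb(Supplier<T> dbWork) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for a DB permit", e);
        }
        try {
            return dbWork.get();
        } finally {
            permits.release();
        }
    }
}
```

With this in place, adding worker threads increases queuing at the throttle rather than load on the database.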
5. Batch and Services, SOA, SOAP and/or Messaging
In Section 2 of this document we discussed Big Batch and a required supporting database pattern of grouping updates or inserts together into a single database commit. This grouping of operations into a single database transaction is not due to business need or any association among the operations; rather, it simply reduces the overhead of the database commit process (an operation that can’t be avoided or deactivated).
In that section (see the insert titled “How 100ms of I/O overhead can turn into 27 hours of processing time…”) we described how database commits turn into significant processing overhead, requiring a special batch transaction grouping pattern to minimize it.
We face a similar problem when activating services or creating/consuming messages in
bulk. Each operation requires:
- Creation of a service or message header.
- Creation of the service request or message content, usually involving
transformation of the data to an XML, JSON or other specialty format.
- Activating a communication connection/session.
- Transfer of the request or message.
- Waiting for an acknowledgement (asynchronous) or response (synchronous).
Depending on whether the connection is local or remote, or an optimized high speed connection versus regular communications between servers, activating a service has an average minimum overhead of roughly 50ms. Using the same example as in Section 2, if we are sequentially processing (or sending) 1,000,000 requests (or messages), we create about 14 hours of processing overhead just for the service communications and setup.
* Beat the overhead with a cluster and parallelization? You say you’re not doing it sequentially, and you have a cluster of 4 servers, each running a pool of 10 listeners? Doing so reduces the impact to only about 21 minutes of communication overhead (in our 1,000,000 transaction example). However, we’ve only moved the problem: we are now hitting our database with 40 parallel updates/inserts and commits, paying a double overhead impact (communication overhead plus commit overhead – see Section 2), and must significantly increase our database server capacity or overwhelm its processing capacity. Setting up a cluster is sometimes appropriate and occasionally the only option, especially when there is a mismatch between the processing models of the systems involved. But the expense and overhead can often be avoided, or at least reduced, with the right service processing pattern – the bulk service pattern.
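The arithmetic behind both numbers – the roughly 14 hours above and the roughly 21 minutes in the sidebar – is worth making explicit (50ms per activation is the assumed average from the text):

```java
// Reproduces the overhead arithmetic from the text: 1,000,000 calls
// at 50 ms each, sequentially and across 4 servers x 10 listeners.
class OverheadMath {
    static final long REQUESTS = 1_000_000L;
    static final long OVERHEAD_MS = 50;

    // 1,000,000 x 50 ms = 50,000 s, about 13.9 hours
    static double sequentialHours() {
        return REQUESTS * OVERHEAD_MS / 1000.0 / 3600.0;
    }

    // 40 parallel workers: 50,000,000 ms / 40 = 1,250 s, about 20.8 minutes
    static double clusteredMinutes(int servers, int listenersPerServer) {
        long parallel = (long) servers * listenersPerServer;
        return REQUESTS * OVERHEAD_MS / (double) parallel / 1000.0 / 60.0;
    }
}
```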
5.1. The Bulk Service Pattern
We commonly take a single operation and expose it, creating a single-request or single-transaction service. This leads us to think of services in a single-operation pattern. Yet there is no reason a SOAP service or a Message can’t contain multiple operation requests – multiple XML documents or JSON data sets in the same Message content or SOAP request – meaning one connect-and-communicate operation carries multiple service runs or business operations.
For example, an UpdateCustomerAddress service could contain:
<UpdateCustomerAddresses>
  <UpdateRequest ID="1">
    <CustomerAddressUpdate>
      <data>…</data>
    </CustomerAddressUpdate>
  </UpdateRequest>
  <UpdateRequest ID="2">
    <CustomerAddressUpdate>
      <data>…</data>
    </CustomerAddressUpdate>
  </UpdateRequest>
</UpdateCustomerAddresses>
The receiving/processing service could loop and process each request, or hand them
off to a pool of parallel threads to process a number in parallel, and/or open a
database transaction to group the operation under one database commit. A reply, if
necessary, would identify the results by ID number of the request. Note the ID
number only has to be unique in context of the particular request (so each request
can start at ID #1 and count upwards within the request).
This pattern allows us to bulk the communications together, reducing the
communications overhead by the number of operations bundled together in a single
service request or message.
The practical point is that any service can easily be built to handle one or more operation requests in a single SOAP call or Message. This small adjustment will
allow the service to be reused for small to medium bulk operations. (Sending 10
requests instead of 1 in one activation and letting the service processor loop
requires no redesign.) If larger numbers need to be sent through (such as 100 or
1,000), changes to manage database transaction grouping and/or creating a pool of
processing threads may be appropriate.
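On the receiving side, handling a bundle rather than a single request is a small change: loop over the bundled requests and key the results by the caller-assigned per-request ID, as described above. A hedged Java sketch; the UpdateRequest record and service class are illustrative types, not a real service API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a bulk-service handler: one call carries many requests;
// results are keyed by the caller-assigned per-request ID.
class BulkAddressService {
    record UpdateRequest(int id, String customer, String newAddress) {}

    Map<Integer, String> handle(List<UpdateRequest> bundle) {
        Map<Integer, String> results = new LinkedHashMap<>();
        for (UpdateRequest req : bundle) {
            // a single database transaction could wrap this whole loop,
            // grouping all updates under one commit
            results.put(req.id(), applyUpdate(req) ? "OK" : "FAILED");
        }
        return results;
    }

    // Stand-in for the real address update; rejects blank addresses.
    private boolean applyUpdate(UpdateRequest req) {
        return req.newAddress() != null && !req.newAddress().isBlank();
    }
}
```

For larger bundles, the loop body could instead be submitted to a worker pool, as discussed in the text.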
6. Big Batch Tools for Java
While writing this document my current project focus is Java oriented, so when looking
for Big Batch solutions my view was Java focused. The mainframe vendors and big
application vendors have been looking towards Java as a primary server side
development environment as well, so generally if you’re looking to follow traditional Big
Batch patterns but modernize with current tools, Java is most likely the way to go. (This
is not to say that Microsoft doesn’t offer completely effective bulk processing
approaches, but they tend to be less Big Batch traditional in their approach.)
The first key thing to say about the Java Big Batch Tools is they are rarely used. This is
because the Big Batch tools follow the nature of Big Batch and are:
- Process heavy.
- Complicated frameworks.
- A unique pattern within the application (making code reuse difficult between batch and online).
- A unique monitoring and operational framework.
The result is these tools are only used by projects with a very heavy Big Batch pattern
requirement. They have a significant learning curve.
6.1. Compute Grid (IBM)
IBM Websphere Extended Deployment Compute Grid is an extension now included
as part of the IBM Websphere Application Server J2EE container environment. It
provides a complete Mainframe batch replacement environment that will run (and
can distribute workload) across Windows Servers, Linux or Unix Servers, on the
IBM Mainframe Z/OS environment, any of these in virtualized form, and any of
these in a mixed combination.
The ability to distribute a “job” across a mixed server environment (a “grid”) makes
Compute Grid a promising solution for intensive computations that can be divided
into steps or bulk processing where the transactions can be divided across the
environment. The downside is a particularly complex framework and, in cases
where a database is required, the database remains a bottleneck no matter how far
the workload is divided.
Figure 4 - from IBM
Compute Grid has 2 common roles:
A. A full big batch replacement environment for moving big batch jobs, with full
big batch control and management features, from a mainframe processing
environment onto Linux / Unix / Windows servers. This allows re-hosting of
re-developed (but not re-architected) big batch jobs off the mainframe.
B. A workload balancing environment, clustering large bulk processing jobs
across a set of Websphere Application Servers running on Linux, Unix, and/or
Windows servers. It's worth mentioning that it can also run as part of
Websphere on the Mainframe, clustering batch job load across the Mainframe
to Linux/Unix/Windows – a nice way of maximizing mainframe utilization without having to buy additional expensive mainframe CPUs if that capacity is exceeded during the burst load requirements of a batch cycle.
6.2. Spring Batch
Spring Batch is part of the Spring Framework that "provides reusable functions that
are essential in processing large volumes of records, including logging/tracing,
transaction management, job processing statistics, job restart, skip, and resource
management. It also provides more advanced technical services and features that
will enable extremely high-volume and high performance batch jobs through
optimization and partitioning techniques."
Spring Batch is not an executable environment, but rather a Java library framework that manages execution within standard Java environments. If the development environment is Spring based, then Spring Batch is likely a good choice when Big Batch is required. And it's open source (free).
Figure 5 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 6 – From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
Figure 7 - From SpringSource.org, http://static.springsource.org/spring-batch/trunk/spring-batch-core/
6.3. J2EE & JSR 352
JSR 352 is a Java Specification Request submitted by IBM to the Java Community
Process. It defines a set of Java language abilities, implemented as J2EE
extensions, to cover the needs of batch jobs being coded in Java. The spec was
finalized in early 2013 and can be expected to appear in J2EE containers in late
2013 and 2014.
The spec defines a new Java package, javax.batch, which is designed to offer basic Big Batch control functions from the Java container, including library classes for job operation, job and step context, and chunk-style reader/processor/writer processing.
7. A Modern Batch Strategy
Here is a recommended modern batch strategy for situations where a full new environment is being built.
7.1. Eliminate.
The vast majority of batch needs should be re-architected for event driven processing and decoupled processing whenever possible. An event driven model is more than just an approach to eliminating batch; it is an approach to creating a software model whose components maintain real time views as much as possible. See Section 3 for more details on these approaches.
7.2. Roll Your Own via Queuing.
Utilize assured delivery mechanisms via tools such as IBM MQ or TIBCO messaging to queue bulk volume processing, and then reliably process the bulk via parallelization and a processing cluster (see Sections 3.3 and 5.1 for more details on these methods).
Because we have queued the bulk, there is no possibility of job or transaction loss, though the processing component must be designed to handle and redirect business failures to a failure handling queue and process. This requires IT or the business to have a process to monitor and correct such business problems as they occur.
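The queue-and-redirect behavior described above can be sketched as follows. The failure condition (an item beginning with "bad") is a stand-in for a real business rule violation, and the class names are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Worker pattern from the strategy above: process items from the main
// queue, redirecting business failures to a failure queue for review.
class QueueWorker {
    final BlockingQueue<String> work = new LinkedBlockingQueue<>();
    final BlockingQueue<String> failures = new LinkedBlockingQueue<>();

    // Drains the work queue; failed items are redirected, never lost.
    void drain() {
        String item;
        while ((item = work.poll()) != null) {
            try {
                process(item);
            } catch (RuntimeException businessFailure) {
                failures.add(item);   // queued for IT/business correction
            }
        }
    }

    // Stand-in business logic: items beginning with "bad" violate a rule.
    void process(String item) {
        if (item.startsWith("bad")) {
            throw new RuntimeException("business rule violation: " + item);
        }
    }
}
```

In production the two queues would be durable MQ or TIBCO queues, so the items survive a crash between arrival and processing.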
7.3. Roll Your Own via BPA.
More complicated update or transaction processes may have multiple coordinated steps, one step that kicks off another, or even a human verification step in the middle. Business Process Automation (BPA) can act as the controller for such a multi-step process: a parallelized cluster processes each step and then emits an event/alert upon completion, to which the controller reacts by beginning the next step.
7.4. BAM + ESB for Near Real Time Data Distribution
BAM, Business Activity Monitoring tools, offer an interesting alternative to ETL – providing a real-time ETL process. Rather than building resource intensive ETL processes for data distribution, we create near real time data distribution processes using the BAM tool where appropriate, and the ESB where appropriate (and sometimes both in combination).
This will allow updating of the data warehouse in near real time, provide near real
time business monitoring, and can be used to create processes to distribute data
between applications (if such a requirement exists).
7.5. Summary
Few modern developers are familiar with the Java big batch tools (discussed in Section 6), and relatively few sites use them. The heavy overhead of these tools is beyond the need of most organizations – an overhead that includes training a specialized sub-team of developers to use them, as well as the strong possibility that they will require a specialized coding style somewhat incompatible with the online code set.
I generally recommend moving batch processes to other processing models for
elimination.