
DataStage How to Kick Start


Information Server 8.1 Introduction

Business Intelli Solutions Inc www.businessintelli.com

Course Outline

This course will explain the concepts of DataStage, its architecture, and how to apply it to a

'real life' scenario in a business case-study in which you'll solve business problems. We will

begin by looking at the big picture and discuss why businesses need ETL tools and where DataStage fits in the product set.

Once we've talked about the very basic architecture of DataStage, we'll investigate a

business case-study, learn about a company called Amalgamated Conglomerate Corporation

(ACC) - a fictitious holding company - and its business and technical needs. We'll then go

about solving this company's problems with DataStage in a Guided Tour Product Simulation.

In a practice environment, you'll become ACC’s Senior DataStage Developer. In this

capacity, you will assess an existing DataStage Job, build your own Job, modify a Job, and build a Sequence Job.

Using the DataStage clients, you'll log onto the DataStage server and look at a Job that was

built previously by the former DataStage Developer at ACC. You’ll then build your own Job

by importing metadata, building a Job design, compiling, running, troubleshooting, and

then fixing this Job. You’ll then modify a Job and finally build a special type of Job called a

Sequence Job.

Let’s get started!

This section begins by talking about the DataStage tool, what it does, and some of the

concepts that underpin its use. We’ll discuss how it fits in the Information Server suite of

products and how it can be purchased as a stand-alone product or in conjunction with other products in the suite. We’ll briefly cover DataStage’s architecture and its clients.


What Is DataStage?

InfoSphere DataStage is an ETL (Extract, Transform, and Load) tool that is part of the InfoSphere suite of products. It functions as a stand-alone product or in conjunction with other products in the suite. DataStage provides a visual UI, which you can use in a point-and-click fashion (rather than writing code line by line) to quickly build DataStage Jobs that perform extractions, transformations, and loads of data for use in data warehousing, system migrations, data integration projects, and data marts.

Sometimes DataStage is purchased with QualityStage, which performs data cleansing (we'll touch on it in this training, but be sure to take the "Introduction to QualityStage" FlexLearning module as well). When implementing DataStage as a stand-alone product, you will still need to install the InfoSphere Information Server components with it. Other components or products in the suite can be added at a later time.

The other products in the suite are QualityStage, Information Analyzer, Business Glossary, and Metadata Workbench. InfoSphere Information Server is a suite of tools that is used to manage your information needs. Several components come with Information Server. One of the main components is DataStage, which provides the ETL capability.

• IBM InfoSphere Metadata Workbench provides end-to-end metadata management, depicting the relationships between sources and consumers.

• IBM InfoSphere Change Data Capture provides a real-time change data capture (CDC) and replication solution across heterogeneous environments.

• IBM InfoSphere Change Data Capture for Oracle Replication provides real-time data distribution and a high availability/disaster recovery solution for Oracle environments.

• IBM InfoSphere Information Analyzer profiles and establishes an understanding of source systems and monitors data rules.

• IBM InfoSphere Business Glossary creates, manages, and searches metadata definitions.

• IBM InfoSphere QualityStage standardizes and matches information across heterogeneous sources.

• IBM InfoSphere DataStage extracts, transforms, and loads data between multiple sources and targets.

• IBM InfoSphere DataStage MVS Edition provides native data integration capabilities for the mainframe.


• IBM InfoSphere Federation Server defines integrated views across diverse and distributed information sources, including cost-based query optimization and integrated caching.

• IBM InfoSphere Information Services Director allows information access and integration processes to be published as reusable services in a service-oriented architecture.

• IBM InfoSphere Information Server FastTrack simplifies and streamlines communication between the business analyst and the developer by capturing business requirements and automatically translating them into DataStage ETL jobs.

• Connectivity Software provides efficient and cost-effective cross-platform, high-speed, real-time, batch, and change-only integration for your data sources.

This training will focus on DataStage, which gives you the ability to import, export, create, and manage metadata from a wide variety of sources for use within these DataStage ETL Jobs. After Jobs are created, they can be scheduled, run, and monitored, all within the DataStage environment.

DataStage provides a point-and-click user interface in which there is a Canvas. You drag-and-drop icons that represent objects (Stages and Links) from a Palette onto the Canvas to build Jobs.

For instance, you might drag and drop icons for a source, a transformation, and a target onto the Canvas. The data might then flow from a Source Stage via a Link to a Transformation Stage, and through another Link to a Database Stage, for example.

Stages and Links present a graphical environment that guides you, the developer, through the needed steps. You are presented with fill-in-the-blank-style boxes that enable you to quickly and easily configure data flows.

When connecting to a database, for example, you drag that database's Stage icon onto the Canvas; you don't have to worry about how the database's native code works to be able to leverage its high performance.

After configuring some particulars and doing a compile, the result is a working Job that is executable within the DataStage environment, as well as a Job design that provides a graphical view of how the data is flowing and being used.


Information Server Backbone

In order to understand the environment in which we will work as Developers, let's first very briefly look at the architecture and framework behind the InfoSphere Information Server and its various components (collectively known as InfoSphere). InfoSphere has various product modules within it, including the Information Services Director, the Business Glossary, Information Analyzer, DataStage, QualityStage, and Federation Server.

These product modules are all supported by common metadata access services and metadata analysis services. They all sit on top of the repository.

They all work with one another and can be used as an integrated product suite, or DataStage can be deployed as a stand-alone application.

Whichever component or components you have, you will need to install the basics of the Information Server itself, including the repository and the application layer.

From there, this 3-tiered architecture fulfills many information integration needs, allowing metadata and terminology collected with one product during one phase of a project to flow to other products throughout the enterprise, enabling common understandings of data and leveraging common discoveries.


Let’s just review the products in the suite again.

• DataStage, by itself, can pull and push metadata, table definitions, and information from and to various targets and sources of data. DataStage provides the components needed to get data from one place to another and manage it on an ongoing basis.

• The Information Services Director lets you turn your Jobs into real-time Web Services.

• The Business Glossary maps your business terms to your technical terms so that all the members of your organization can have common nomenclature during all business activities and stages of technical development, thus reducing ambiguity among varying audiences.

• The Information Analyzer allows you to analyze data in databases, files, and data

stores and gives you the knowledge to determine what types of Jobs need to be

developed. DataStage and QualityStage allow you to then manipulate the data and

cleanse it in-line while you're working with it.

• DataStage and QualityStage have been integrated into a single canvas so that all

your activities within a Job can flow from a traditional DataStage Stage to one of the

QualityStage Stages; thereby allowing you to, in the process of doing your other ETL

activities, cleanse your data inline. This tutorial will focus primarily on the ETL

activities with a little bit of knowledge on DataStage's QualityStage Stages that go

along with it.

• The Federation Server is a tool that lets you put together disparate and heterogeneous data stores through a single point of view for the organization's entire data world: databases still stored on mainframes, as well as on a variety of other computers, are made to look like one repository.

These product modules feed into DataStage, allowing you to create Jobs that move data and/or cleanse it as necessary in a highly efficient manner. Using DataStage, you will be able to build data warehouses and other new repositories for integrated forms and views of data for your entire organization. DataStage is also very useful in data migration activities, EDI, and other means of data communication. As mergers and acquisitions occur, for example, the underlying IT systems can, using DataStage, be brought together into a single trusted view or common unified view with which to make better business decisions and implementations. DataStage can help with ordering and receiving activities as well by putting data onto and/or pulling data from an EDI wire, effectively parsing it, and moving it into a data warehouse or to online operational systems so that you can have quicker access and more robust delivery of your business data.

DataStage also integrates with various other tools in addition to databases, such as MQ Series, which provides reliable delivery of messaging traffic. DataStage has both real-time and batch capabilities.

Now that we've talked about the overall Information Server's architecture and components, let's take a more detailed look at DataStage's components.


DataStage Architecture

DataStage has a 3-tiered architecture consisting of clients, dedicated engines, and a shared repository. These are all components within the InfoSphere Information Server, using its shared and common repository. The clients include an Administrator client, a Designer client, and a Director client.

The Administrator is used for setting up and managing individual DataStage Projects and each Project's related common project information. The Designer is used to develop and manage DataStage applications and all their related components, such as metadata. The Designer is where you will be doing most of your development. The Director allows you to actively monitor runtime activities, watch Jobs, schedule Jobs, and keep track of events once your Job has been developed, deployed, and is running.


Inside DataStage, there are two engines: the traditional Server engine (used for DataStage Server Jobs) and the Parallel engine (used for DataStage Parallel Jobs). The Server Jobs are typically single-threaded and have evolved over the years through the various versions of DataStage since its inception. The Parallel engine allows you to create Jobs that can, at runtime, dynamically increase or decrease their efficiency and optimize the use of the resources at hand within your hardware. The Parallel engine can dynamically change the way in which it performs its parallelism.

The Shared or 'Common' repository holds all of the information, such as Job designs, logs of the Job runs, and metadata, that you will need while developing Jobs. Now, let's take a look at the Administrator client's UI.


Administrator Client

We're looking at the Administrator's UI under the tab entitled 'General'. Here, you can enable job administration from within the Director client, enable runtime column propagation (RCP) for all of your Parallel Jobs, enable editing of internal Job references, enable sharing of metadata when importing from Connectors (the DataStage objects that automatically connect to various proprietary sources and databases with minimal configuration), and set auto-purge options for your log. (We'll look at what RCP is and how it works later in the tutorial.)


DataStage is a role-based tool. Usually, when it is installed, the Administrator assigns roles such as Developer or Operator. This means that the UI may have different options available to different people. The Administrator, developers, and operators may not be the same person. Thus, when you look at the actual product, your UI may be different from others'.

Under the Permissions tab, DataStage can assign functions to the individuals who will be using the tool. The person with the role of 'administrator' may have more options than does a 'developer'. There may be differing functions made available to different people. This can all be designated for each separate DataStage Project. You can thus determine which user gets what role for that particular Project.

Traces are available should you wish to put tracing on the engine. The Schedule tab allows

us to dictate which scheduler will be used. In the Windows environment, we would need to

designate who will be the authorized user to run Jobs from the scheduler.

The Mainframe tab allows us to set needed options when working with the mainframe

version of DataStage. This would allow us to create Jobs that could then be ported up to the

mainframe and then executed in their native form with the appropriate JCL.

Other tabs let us set the memory requirements on a per-Project basis. The Parallel tab gives

us options as to how the parallel operations will work with the parallel engine. Under the

Sequence tab, the Sequencer options are set. And under the Remote tab of the


Administrator client, we make settings for remote parallel Job deployment on the USS

system or remote Job deployment on a Grid.

DataStage Designer

The second of the three DataStage clients is the Designer. As an ETL Developer, this is where you'll spend a good deal of your time. Let's examine the main areas of the UI. In the upper-left (shown highlighted) is the Repository Browser area. It shows us the metadata objects and Jobs found within DataStage. Toward the lower-left is the Palette. The Palette contains all of the Stages (drag-and-droppable icons). You would select a Database source icon, for instance, and drag it to the right.

On the right is the Canvas. This is where you would drop the Stages that you have chosen to be part of your Job design. You can see that the Job in the graphic above is a Parallel Job and that it contains two Stages that have been joined together with a Link. Links can be added from the Palette directly, or you can just right-click on a Stage, drag, and then drop on the next Stage to create the Link dynamically. Now, in this particular example, data will flow from a Row Generator Stage to the Peek Stage.


The Designer is where you will create the kinds of working code that you need in order to

properly move your data from one source to another. There are over 100 Stages that come

with the product out of the box (At installation time there are approximately 60 base Stages

installed, an additional 40 that are optional, and 50-100 more based on service packs and

add-ons, and then you can build your own custom Stages in addition).

DataStage Director

The third DataStage client is the Director. The Director gives us a runtime view of the Job. Above, we can see an individual Job log. The messages and other events shown in this Job run are logged to this area and kept inside the Shared Repository. We can see the types of things that occur during a Job run, such as the Job starting, environment variables being set, the main activities of the program, and then eventually the successful completion of the Job. Should there be any warnings or runtime errors, they can be found and analyzed here to determine what went wrong in order to take corrective action. Color-coding shows that green messages are just informational; a yellow message is a warning and may not directly affect the running of the Job, but should probably be looked at. If you should ever see a red message in here, this is a fatal error, which must be resolved before the Job can run successfully.


DataStage Repository

In the Repository Browser, we have various folder icons that represent different categories within the DataStage Project. Some come automatically with the Project, and then you can add your own and name them. Organize them in a way suitable to your DataStage Project, providing for easy export and import. Standard folders include those for Jobs, Table Definitions, Rule Sets, and other DataStage objects. Tip: create a folder named after your initiative and then move all the appropriate objects for just that initiative under that one common location.

Table Definitions are also known as metadata or schemas. These terms are often used interchangeably. Each typically contains a collection of information that defines the metadata about an individual record or row within a table or file. A piece of metadata can describe column names, column lengths, or columns' data types, or it can describe fields within a file.
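To make this concrete, here is a rough sketch (an illustration, not taken from the course materials; the column names are hypothetical) of what such a Table Definition looks like when expressed in the Orchestrate-style schema notation used by the parallel engine:

    record(
        customer_id: int32;
        customer_name: string[max=40];
        balance: decimal[10,2];
        last_order: date;
    )

Each line names a column and gives its data type and length; together, these lines describe one row of the table or file.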

Other folders include one for Routines (sub-routines able to be called from within a Job).

Shared Containers are pieces of DataStage Jobs that can be pulled together, used, and re-

used as modular components. Stage Types (highlighted) is a master list of all the Stages

that are available to you in this Project. Stages can also be added later through various


options depending on what product(s) have already been installed and depending on what your administrator has made available to you. The Standardization Rules folder contains the out-of-the-box or default rules for QualityStage.

Transforms are re-usable, macro-like functions from the standard Server engine that can be called from your Jobs. WAVES Rules are also used by QualityStage, for its address verification modules. Match Specifications are used by QualityStage's Match Stages. Machine Profiles are used with IBM mainframes. IMS "view sets" are for working with legacy IMS types of Database Management Systems.

Steps to Create a DataStage Job

This section talks about the steps to building your first Job. First, you'll need to understand some basic concepts; then we'll talk about setting up an environment; next you'll connect to the sources; then we'll talk about how to import table definitions; and then we'll provide you with an understanding of the various types of Stages and how they are used. Then we'll talk about working with RCP, creating Parameter Sets, and understanding how to use the CLI, and, in the next lesson, you will put all of this knowledge to use and begin building Jobs in a case-study scenario. This section covers the following:

1. Understand Some Underpinnings
   1. Types of Jobs
   2. Design Elements of Parallel Jobs
   3. Pipeline Parallelism
   4. Partition Parallelism
   5. Three-Node Partitioning
   6. Job Design Versus Execution
   7. Architectural Setup Options
2. Setting up a New DataStage Environment
   1. Setting up the DataStage Engine for the First Time
   2. DataStage Administrator Projects Tab
   3. Environment Reporting Variables
   4. DataStage Administrator Permissions Tab
   5. Export Window
   6. Choose Stages; Passive Stages
   7. Connect to Databases
3. Import Table Definitions
4. Active Stages; The Basics
5. An Important Active Stage; The Transformer Stage
6. Advanced Concepts When Building Jobs; RCP and More
7. Pulling Jobs Together; Parameter Sets, CLI, etc.


Let's now shift our discussion to the process of developing Jobs in DataStage. What are the steps involved in developing Jobs? First, you define global and Project properties using the Administrator client that we mentioned earlier in the tutorial. This includes defining how you want the Job to be run. For instance, do you want Runtime Column Propagation, or are there any other common variables in your environment that you might be using throughout the Project? We'll discuss this in greater depth later in this tutorial.

Next, you go into the Designer client and import metadata into the repository for use, later,

in building your Jobs. Then, using the Designer tool, you actually build the Job and compile

it. The next step is to use the Director tool to run and monitor the Job (Jobs can also be run

from within the Designer but the Job log messages must be viewed from within the Director

client). So, as you’re testing your Jobs it can be a good idea to have both tools open. This

way, you can get very detailed information about your Job and all of the things that are

happening in it.


Administrator Client

Let's say that we want to create a Job that extracts data from our primary operational system.

We could go to the Administrator tool, set up our Jobs to be able to use RCP, and set the log auto-purge options for however often we want purges to run.

Still in the Administrator, we could then put in some common environment variables, for instance, the name of the database that is being accessed, the user ID, and the password.

These can be set there and encrypted in a common location so that everyone can use them without unnecessarily exposing security in your system.

Designer Client

To then create this kind of extract Job, we would next want to go into our Designer tool to first import metadata from the database itself for the table or tables which we’ll be pulling.

Also within the Designer we want to use Stages that will connect us to the database. We do this by dragging and dropping the appropriate Stage onto the Canvas.

Inside the Stage(s), we will use the variables (that we defined in our common environment

using the Administrator tool) to define the Job as we see it needs to be done. We do this by double-clicking the Stage in question and filling out the particulars that it requests.

We'll pull that data out and store it into either a local repository such as a flat file (by dragging a Sequential File Stage onto the Canvas and creating a Link between the two Stages) or store it into one of DataStage's custom Data Sets (a proprietary data store which keeps the data in a partitioned, ready-to-use, high-speed format so that it doesn't need to be parsed for re-use as it would otherwise need to be from a flat file).

And then we could, for instance, put it directly into another database (By dragging a Stage onto the Canvas at the end of that data flow).

The database Stages come pre-built to meet the needs of many proprietary databases and

satisfy their native connectivity requirements thus providing high-speed transfer capability

while eliminating the need to do a lot of the configuration.


Types of Jobs

Let's talk about the three types of Jobs that we would most likely be developing. These fit into the categories of Parallel Jobs, Job Sequences (which control other Jobs), and Server Jobs (the legacy Jobs from earlier versions of DataStage).

Parallel Jobs are executed by DataStage using the parallel engine. They have built-in functionality for what is called pipeline and partition parallelism. These are means of efficiency that can be set. For instance, you can configure your parallelism dynamically. We'll discuss these things in greater detail later in the tutorial. When Parallel Jobs are compiled, they are compiled into a DataStage-specific language called the Orchestrate Shell scripting language, or OSH. The OSH script executes various Operators.

Operators are pre-built functions relating to the Stages in our Job design. You can also create custom Operators using a toolkit and the C++ language. If you have a special data need, such as writing to a special device like one used for scanning and tracking control on delivery trucks, then you could write a custom module to put the data directly into that format without having to go through any other intermediate tools. The executable that the OSH script runs is built from C++ class instances. These are then monitored in the DataStage Director with the runtime Job monitoring process.
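As a rough illustration of what OSH looks like (a hand-written fragment for flavor only; the operator options shown are from memory and may differ by version, and the script the compiler actually generates for a real Job is considerably longer), the pipe symbol chains Operators together much like a UNIX shell pipeline:

    # hypothetical fragment: generate rows with one integer column and print them to the log
    generator -schema record(id: int32;)
    | peek

You would not normally write OSH by hand; the Designer generates it for you when the Job is compiled.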

Job Sequences, as we mentioned, are Jobs that control other Jobs. Master Sequence Jobs can, for example, kick off other Jobs and other activities, including command-line activities, other control logic, or looping that needs to go on in order to re-run a Job over and over again. The Job Sequencer can control all your Jobs at once without your having to schedule each individual Job with the scheduler. Although you still have the ability to call any/each Job individually, you can also group them together through the Job Sequencer and call them that way.

In addition, you can perform activities based on their failures and do that either specifically, when a local Job fails, or globally, when anything happens and an exception is thrown. A common command-line API is provided. Thus, you can embed this into third-party schedulers and can then execute Jobs in whatever fashion you choose.
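As a minimal sketch of that command-line API (the project name, job name, parameter, and install path here are hypothetical examples, and the exact option set varies slightly by release), the dsjob utility on the engine host can start a Job, pass it a parameter, wait for it to finish, and return the Job's status as the exit code, which is exactly what a third-party scheduler needs:

    dsjob -run -mode NORMAL -param TargetDB=SALES -wait -jobstatus dstage1 LoadCustomers

    # a scheduler could equally call it from a crontab entry, for example:
    # 30 1 * * * /opt/IBM/InformationServer/Server/DSEngine/bin/dsjob -run -wait -jobstatus dstage1 LoadCustomers

The scheduler then inspects the exit code to decide whether downstream Jobs should run.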

The third type of Job is the Server Job. Server Jobs are legacy DataStage Jobs that continue

to exist and are supported as DataStage applications but you may also find them useful for

their string-parsing capabilities. These are executed by the DataStage Server engine and

are compiled into a form of BASIC and then run against the Server engine. Runtime

monitoring is built into the DataStage engine and can also be viewed from the Director

client.

Server Jobs cannot be parallelized in the same way that the Parallel Jobs can be. However,

you can achieve parallel processing capabilities by either sequencing these Jobs with the Job

Sequencer to run at the same time, or using other options to gain efficiencies with their

performance. Legacy Server Jobs do not follow the same parallelization scheme that the

Parallel Jobs do. Therefore, they cannot easily have their parallelism changed dynamically at

runtime as the Parallel Jobs can.


Design Elements of Parallel Jobs

One of the nice things about having a visual medium like DataStage is that you can easily take the notions of Extract, Transform, and Load and represent them visually in a manner that is familiar to people. For instance, you can drag an Extract Stage out onto the visual Canvas and drop it on the left of the screen, flow through transformations in the middle of the Canvas, and place outputs or Load Stages on the right side.

DataStage lets you put elements such as Stages and Links in any fashion that you wish, using an orderly, flow-chart-style, left-to-right, top-to-bottom visual design. This flow makes future interpretation of your Jobs simple. Those who are new to DataStage are able to look at, see, and understand the activities and flows that are going on within the DataStage Job. This also means that other Developers' Jobs will be intuitively understandable to you.

When we start designing a Parallel Job, we should think of the basic elements, which are Stages and Links. Stages, as we mentioned, get implemented 'under the covers' as OSH Operators (the pre-built components that you don't have to worry about and that will execute based on your Job design on the graphical Canvas).

Next, we have Passive Stages, which represent the 'E' and the 'L' of ETL (Extract and Load). These allow you to read data and write data. For example, Passive Stages include things such as the Sequential File Stage, the Stage to read DB2, the Oracle Stage, and Peek Stages that allow you to debug and look at some of the Job's interim activity without having to land your data somewhere. Other Passive Stages might include the MQ Series Stage and various other third-party component Stages such as the SAP and Siebel Stages.


Processing or Active Stages are the 'T' of ETL (Transform). They do the transformation and other 'heavy lifting' work. These include the Transformer Stage, which allows you to transform data, manipulate column order, and do a variety of other things to your data to enrich it and potentially to validate and pre-cleanse it before applying some of the QualityStage Stages to it.

You will learn to rely heavily on the Transformer Stage. It provides a variety of functionality, including the ability to filter rows, as well as to individually modify each column of data that goes out. It allows you to propagate the data that goes out in a number of ways and, in some cases, merge two streams of data together, or pieces from two streams into one common stream.

These are also assisted by specific Stages that filter data, perform aggregation of the data (summations, counting, or finding first and last), generate data (such as generating rows or columns), split or merge data (using the ability to direct data down multiple paths simultaneously), or pull data together (either by joining it into one common, wider record or by funneling multiple streams of rows into the same data stream). Now that we've talked about some Stages, let's talk about the things that help the data flow from one Stage to the next.

Links are the ‘pipes’ through which the data moves from Stage to Stage. DataStage is a

link-based system. Although the Stages are visually predominant on the Canvas, in

actuality, all the action is happening on the links. When you open up a Stage by double-

clicking it, you get various dialogue boxes that refer to input and output (input to that Stage

and output from that Stage) in DataStage. These represent “Links” coming into and out of

the Stage (when viewed from the Canvas).

The settings that you define in these two areas within the Stage affect the Link coming in and the Link going out of the Stage. Your settings, including the metadata that you chose, are all kept with the Link. Therefore, when you modify Jobs, it's easy to move those pieces of data (all the metadata and accompanying information), because they stay with the Links, from one Stage type to another Stage without losing data or having to re-enter it.


Pipeline Parallelism

The purpose of parallelism is to create more efficiency and get more work out of your machine(s). This is done in two ways, the first of which is pipeline parallelism. Here, each of the Operators (corresponding to the various Stages that we see on our Canvas) sets itself up within its own processing space and passes data from one to the other rapidly, without having to be re-instantiated continually and without having to do all the work in a given Stage before starting the next portion of the pipelining process.

For example, this allows us to execute transformation, cleansing, and loading processes simultaneously. You can think of it as a conveyor belt moving the rows from process to process. This reduces disk usage for staging areas by not requiring you to land data as often. It uses your processors more efficiently and thus maximizes the investment in your hardware. There are some limits on the scalability, depending on how much CPU processing horsepower is available and how much memory is available to store things dynamically.
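A rough analogy from the UNIX shell (an analogy only, not how the parallel engine is implemented, and the program names are hypothetical): in a command pipeline such as the one below, all three programs run at the same time as separate processes, and each row flows to the next step as soon as it is produced rather than waiting for the previous step to finish the whole file.

    extract_orders | transform_orders | load_orders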


Partition Parallelism

The second form of parallelism is called partition parallelism. This is more of a 'divide and conquer' approach, in which we divide an incoming stream of data into subsets to be separately processed by an Operator. Each one of these subsets is called a partition. Each partition of data is processed by the same Operator. For instance, let's say that you want to do a filtering type of operation. Each partition will be filtered exactly the same way, but multiple filters will be set up in order to handle each of the partitions themselves.

This facilitates near-linear scalability of your hardware. That means that we could run 8 times faster on 8 processors, 24 times faster on 24 processors, and so on. This assumes that the data is evenly distributed and that there are no other factors, such as network latency or other external issues, limiting your data from getting through.
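Continuing the shell analogy (again purely illustrative, with hypothetical file names; GNU split is assumed), partition parallelism is like splitting the input into equal chunks and running the same operation on every chunk at once:

    split -n l/4 orders.txt part_             # divide the data into 4 partitions
    for f in part_*; do
        grep 'PRIORITY' "$f" > "$f.out" &     # the same filter runs on each partition concurrently
    done
    wait
    cat part_*.out > filtered_orders.txt      # collect the partitions back together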


Three-Node Partitioning

Data can be partitioned into multiple streams. The way in which we partition will depend upon how we need to process the data. The simplest of the methods is called Round Robin, in which we assign each incoming row to the next partition in turn in order to create a balanced workload. In other words, it is like a card dealer in Vegas dealing out the cards in the deck to every player around the table, thus dealing all the cards evenly to each player.

Other times, we need to segregate the data within the partitions based on some key values, whether they are numeric or alpha values. This is known as Hash partitioning (or, with numeric keys, you can use Modulus partitioning), and in this way you keep the related data together by these common values.

This is very important when we're doing things such as sorting the data or filtering it (removing duplicates, for example) on these key values. Should the related rows become separated (the key value fields start going into different partitions), then, when the operation occurs, it will not work fully correctly: some of the rows that should have been removed may sit in another partition and are thereby mistakenly 'kept alive', moving down another stream in the parallel operation. Then, at the other end of the partitioning process, when all the data is collected, we still have multiple copies where there shouldn't be. If we had put all the related rows on one partition, then only one of the duplicates would have survived, which was our desired outcome.
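To see why key-based partitioning matters, here is a tiny awk sketch (illustrative only, with a hypothetical file and key column) that routes each row to a partition based on a numeric key, the same idea as Modulus partitioning: every row with the same key value always lands in the same partition, so an operation such as duplicate removal sees all related rows together.

    # route each comma-separated row to one of 3 partitions by the key in column 1
    awk -F, '{ print $0 > ("part_" ($1 % 3) ".csv") }' orders.csv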


The kind of partitioning that you perform matters a lot!

One of the reasons why Round Robin is the default is that if you can evenly distribute the data, it will be processed much more quickly. If you have to partition by range or by some type of hash partitioning scheme, the quantities of data in each partition may become imbalanced, and one or two of the partitions may finish long before the others, thus reducing performance.

Job Design Versus Execution

Let's look at an example that will give you an idea of what happens when we design the Job and then what happens when we run it. At the top of the graphic, you can see that the data is flowing from two Oracle Stages and is brought together inside of a Merge operation. From there, we take the merged data, do an aggregation on it, and finally send the data out to a DB2 database.

In the background, let's say that we want to partition this data four ways (in DataStage terminology, this would be known as 4 nodes). Here's what we would see: for each of the Oracle Stages (toward the left of the data flow) and the other Stages in this design, we would create four instances to handle the four separate streams of data. In this kind of activity, when doing summation, it is extremely important that we partition or segregate our data, and that we partition it in a way that will work with our design.


Here, we want to partition it by the values on which we are going to do the summation. This way, we don't accidentally sum on two different things and lose parts of our rows. Let's say that we have 5 rows that need to come together for a sum: we want them all to be in the same partition, so that all the summation can be done together. Otherwise, we might get four separate values out of this phase (one from each of the four instances of the Aggregator Stage highlighted in the graphic), with perhaps only one of the partitions summing anything at all!

Architectural Setup Options

Now that we've talked about the considerations that we must be aware of as we do our

partitioning both in the design of our Job and then in the choice of partitioning strategy,

let's continue our discussion in a different direction. Note that we will discuss these partitioning strategies in greater detail later in the tutorial.

For now, though, let's talk about some architectural setup options.

This section of the tutorial will talk about some of the underpinnings of DataStage in general and specifically some of the architectural setups that we can use.


In the first setup, we have deployed all of the Information Server and DataStage components on a single machine. All of the components are hosted in one environment. This includes the WebSphere Domain, the Xmeta repository using DB2 (other databases, such as Oracle or SQL Server, can be used for the repository), and the DataStage Server itself. The clients can be installed on the same machine as well if it is a Windows machine. In other words, let's say that we are carrying a laptop around: it can contain all the co-located components (and it could still be connected to by a remote client). What are some reasons for putting all of these components on a single machine? Perhaps that is all the hardware that you have available, or perhaps you have very poor network performance and don't want the resulting latency of going across the network wire to the repository.


Now let's talk about putting the components on two machines. A typical two-machine setup would consist of the metadata repository and the Domain layer on Machine A and the DataStage Server on Machine B, with the ability to connect to Machine B remotely with the clients. One of the benefits of this separation (having two machines) is that the metadata repository on Machine A can be utilized (by a product such as Information Analyzer) concurrently without impacting the performance of DataStage as it runs on Machine B.


In a 3rd scenario, major components are separated onto different machines. The database

repository is on Machine A (Maybe because you are using a machine that already has your

database farms installed on it). The Metadata Server backbone is installed on Machine B so

that Information Analysis and Business Glossary activities don’t risk negatively impacting

any development on the DataStage Server: Several DataStage development teams can

keep working and utilizing their resources most effectively. Finally, you have the DataStage

Server on Machine C, connected to by remote clients.

Setting up a new DataStage environment.

In this last section, we talked about some different options. And, it should be said that you

must determine the optimal setup in your environment depending on your own conditions

and requirements.

In this next section of the tutorial, we are going to talk about DataStage's architecture 'under the covers' and some of the ways that we can set up the product. We'll also talk about setting up the DataStage engine itself for the first time using the Administrator client and see how to then set up DataStage Projects. Within a Project, we'll see how to import and export table definitions. Next we'll look at Jobs within Projects and talk about a number of important development objects such as the Passive DataStage Stages. For


example, we'll talk about the Data Set Stage, the Sequential File Stage, and various relational data Stages. We'll also talk about some Active Stages, which are important when we are trying to combine and/or manipulate data. These include the Lookup Stage, the Join Stage, the Merge Stage, and the Transformer Stage (at the heart of many Jobs).

Then we'll talk about some key concepts that you'll want to know about as you are developing Jobs, such as the use of Runtime Column Propagation (RCP), how to create Shared Containers for re-usability, and putting together Sequencers to organize all of your Jobs.

Then we'll talk about ways of implementing and deploying our DataStage application that can make things easier on you and the team. These might include creating Parameter Sets, using the various run options, and using the Command Line Interface (CLI), through which Jobs can be called from a third-party scheduler.

Of course, we'll discuss parallelism more in depth. Specifically, we'll talk about where we perform parallelism, which may be within Sequencers or at the Job level, and we will talk about strategies for different ways to apply parallelism. Within InfoSphere, we'll talk in more detail about the two forms of parallelism (pipeline parallelism and partition parallelism). We'll see how these things work and then what happens when the parallel Job gets compiled.

Finally, we'll discuss how to apply these concepts and scale our previously constructed DataStage application up or down depending on the resources that we have available to us.


Setting up the DataStage Engine for the First Time

When we're setting up a new DataStage engine, the first thing to do is to go into the Administrator client and make sure that some of the basics have been taken care of. When we log on, we are presented with the Attach to DataStage dialog box shown above. It shows the Domain to which we want to connect, with the corresponding port number of the application server; the user ID and password of the DataStage Administrator with which we will connect; and the DataStage Server name or its IP address.

Often, you will find that the Server’s name is the same as the beginning portion of the

Domain name (But this does not have to be the case – it depends on the architecture and

name of the computer and how it is set up in your company’s Domain Name Server

(DNS)).

DataStage Administrator Projects Tab

Within the Administrator client, we have a tab entitled Projects. We use this for each of the individual Projects that we want to set up within DataStage. The Command button allows you to issue native DataStage engine commands when necessary, such as the list.readu command, which shows you any locks that are still active in DataStage (as well as any other commands associated with the


DataStage engine or with the Universe environment, which was the predecessor to the DataStage engine).

Below that, you can see the Project pathname box. This shows us where the Project directory is located. This is extremely important, as you must know where the Project is located on the machine (it is located on the machine on which the DataStage Server has been installed). Let's take a look at the Properties button. It will show us the settings for the selected Project (datastage1 in the graphic).

The Properties button brings up the dialog box shown above. The various tabs across the top each have their own options. One thing that you'll probably want to do is to check the first option. This will enable job administration in the Director. Enabling job administration in the Director will allow developers to do things such as stopping their jobs, re-setting their jobs, clearing job information, and so on. One of the times that you may NOT want to do this is in a production environment. Otherwise, you do, typically, want to give people the ability to administer Jobs in the Director client as they see fit.


Next is the option to enable RCP for parallel Jobs. We'll cover this more in depth a little later in the tutorial, but just be aware that this is where you enable it for the datastage1 (or any given) Project. The 'Enable editing of internal references in jobs' option will help you when dealing with lineage and impact analysis (if you change Jobs or objects over time). The 'Share metadata when importing from Connectors' option, again, allows us to perform cross-application impact analysis.

The 'autopurge of the job log' option is very important. You always want to set up default purge options so that the logs don't fill up and consume large amounts of disk space on your machine (the machine that the DataStage Server is installed on, not necessarily the machine that houses your clients, such as the Administrator client).

You have your choice of purging them based on the number of previous runs to keep or on their age in days. One strategy is to keep the 5 previous runs in your development and test environments, which may go for long periods of time between runs of a particular Job. In your production environment, where you plan on running your Jobs on a daily basis, you may want to set it to 'every 30 days'.

Next is a button that allows you to protect a Project. It's also useful in production where you

don't want people to make modifications to Jobs. Since most people want to make changes

regularly during development work, it's not very useful for development environments. The

Environment button allows you to set global environment variables for your entire Project.

These will be Project-specific. These will be separate from your overall DataStage

Environment variables, which are global to the entire Server. Project-specific Environment

variables are only for this one particular Project.

The "Generate operational metadata" checkbox allows operational data, in particular the row counts, to be captured as the Jobs run, for later analysis and for understanding the data in its 'operational' form. Metadata is usually understood as descriptive (how big is a field, what type of data is it), but operational metadata tells us things like "How frequently has this Job been run?" and "How much volume has gone through it?". This can be important later as you determine your scalability and future considerations such as capacity planning.


When you click on the Environment button, it opens the Environment variables dialog box. Here, there are two panes in the window. On the left, we have our categories of environment variables, such as General, which includes the Parallel environment variables; the other category is User Defined. For each category, the corresponding pane on the right displays the name of each environment variable, its prompt, and then the value that will be used for the entire DataStage Project.


Environment Reporting Variables

A standard category holds environment variables for reporting, such as APT_DUMP_SCORE, APT_MSG_FILELINE, and APT_NO_JOBMON. By setting these here, you can have all the developers on the Project use them in a standardized way. However, each of these variables can also be set individually in each Job. That is to say, at runtime, you can have just one Job contain a different value from the other uses of that same environment variable throughout the rest of the Project.
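For reference, typical values for these reporting variables look like the following (set per Project in the Administrator, or overridden in an individual Job; the one-line descriptions are summaries, so check the product documentation for the exact behavior in your release):

    APT_DUMP_SCORE=True      # write the parallel 'score' (operators, nodes, data sets) to the job log
    APT_MSG_FILELINE=True    # add source file and line information to framework messages
    APT_NO_JOBMON=True       # disable runtime job monitoring (leave unset to keep monitoring on)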


DataStage Administrator Permissions Tab

The Permissions tab is where we set up users and groups for a variety of authorizations. We would add a user and then assign a product role. The "DataStage Operator to view the full log" checkbox is typically checked unless you have a particular reason not to, such as sensitive data or Operators being overwhelmed by too much information. Normally, we allow Operators to view the full log so that if they encounter problems during a DataStage Job, they can easily communicate that information to their second line of support when escalating a problem.

Once we create a new user, we can see in the User Role drop-down menu that there are several roles available for us to assign to that user or group. These consist of Operator, Super Operator, Developer, and Production Manager.

Let's continue our discussion by moving over to the Tracing tab. This allows us to perform engine traces. Typically, you will not use this unless you are working with IBM support while experiencing a difficult problem that cannot be solved any other way.


The Parallel tab contains some options. The checkbox at the top allows you to see, from within your Job properties box, the Orchestrate Shell (OSH) script that has been generated; it adds a tab entitled OSH. Further down, there are some advanced options for doing things such as setting format defaults for the various data types used in the parallel framework for this Project.


Under the tab entitled Sequence, we can add checkpoints so that Job Sequences are re-startable upon any failure. We can automatically handle activities that fail, log warnings after activities that finish with a status other than 'OK', and log report messages after each Job run.

Work on Projects and Jobs

Once we get our engine configured and set our basic Project properties, the next thing that we will want to be able to do is to work on a new Project. If we generate all the DataStage Jobs from scratch, we won't be re-using anything from previous Projects. Often, however, it makes sense to leverage work that we have already done.

We may want to import things from a previous Project. These include program objects such as Jobs, Sequences, and Shared Containers; metadata objects (for instance, Table Definitions) and Data Connections; things used globally at runtime, such as Parameter Sets; and other objects, such as Routines and Transforms, that we want to bring along and re-use from a previous Project.

Exporting and Importing Objects


Export Window

In order to make this happen, the first thing that we need to do is to export the objects from one environment. Then we will be able to import them. Here in the Export window, we can highlight certain objects and then add them to our export set.

One important thing to consider is the 'Job components to export' drop-down menu, shown highlighted. This allows you to export the job designs with their executables, where applicable. Let's say that we are promoting all of these objects to a production environment that does not have its own compiler: you will then need to have these pre-compiled Jobs ready to run when they reach their target. You can opt to exclude 'read only' items; this is a typical default.

Next, you would choose which file you would like to export it into. This is a text file that will

‘live’ on your client machine. All of the export functionality is done through client

components. Once we have identified all the objects that we want to export, we simply click

the Export button to begin the export.


We export either into an XML file or the native DSX file. Once this is accomplished, then we

can begin our import. For the import, we would use the DataStage Designer client, click its

import option and select the objects to import from the flat file into our new environment.
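The DSX file itself is just structured text. As a rough, illustrative sketch only (the job name is invented and the fields are abbreviated, so do not treat this as the exact format), an export file contains blocks along these lines:

   BEGIN HEADER
      ...export tool, date, and server information...
   END HEADER
   BEGIN DSJOB
      Identifier "svr_LoadSellingGroup"
      ...job design, parameters, and (optionally) compiled executables...
   END DSJOB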

Let's take a look at that dialog box.

This file that we are importing may serve as a backup in that it allows us to import object-

by-object. Our export is much like a database export insofar as it allows us to pull out one object from the export at a time. Should someone accidentally delete something (in the new environment), instead of losing everything, we only lose one object that can then be


restored without losing any of the other work that you’ve done in the meantime by

importing it from the export file.

On the Import screen, you will need to identify the export file. You then have a choice to ‘Import all’ or ‘Import selected’. At that time, you can choose to ‘Overwrite without a query’ if you know that everything you want is the most current. You can also choose to perform an ‘Impact analysis’ to see which objects are related to each other. Now that you have set things up and are ready to begin developing Jobs, let’s talk about some very important key components that we will use during our development.

Choose Stages

Now that we've talked about setting up our Project, let's now focus on building Jobs. In

particular, let’s talk about the DataStage Stages that we see on our Canvas. There are two classes of Stages: Passive Stages and Active Stages. First, let's discuss the Passive Stages.

Passive Stages

Passive Stages are typically used for ‘data in’ and ‘data out’ such as working with sequential

files, databases, and Data Sets (Data Sets are the DataStage proprietary high-speed data

format).

Active Stages, on the other hand, are the Stages that actually operate on the data: as rows come in, they immediately go out. For example, the Transformer Stage, the Filter Stage, and the Modify Stage pass data on as they receive it, unlike a Passive Stage, which only inputs data or only outputs data. The Active Stages manipulate the data. So the Passive Stages are the beginning or ending points of the data flow, while the Active Stages are continuous-flow Stages that go in the middle of the data flow.

Most notable of the Active type is the Transformer Stage, which has many functions. We'll get to this Stage a little later. First, let's cover the Passive Stages.


Types of File Data Let’s describe the types of file data that we’ll be dealing with as we begin to develop Jobs. Primarily this falls into three categories: sequential file data (of either fixed or variable length); Data Sets, which are the DataStage proprietary format for high-speed access; and complex flat file data, which is most notably used with older COBOL-type structures – these allow us to look at complicated data structures embedded inside of files, in proprietary formats, much the way that COBOL does.

How Sequential Data is Handled When we are using our Sequential File Stage, it is

important to know that DataStage has an import/export operator working on the data (Not

to be confused with importing and exporting objects, components, and metadata). This

Sequential File Stage’s import and export is about taking the data from its native format,

parsing it, and putting it into an internal structure (within DataStage) so that it can be used

in the Job and then, likewise, on the way out (when writing it), taking that internal structure

and writing it back out into a flat file format.


After the Job has been run, we could look in the log and see messages relating to things

such as ‘How many records were imported successfully’ and ‘How many were rejected’.

When the records get rejected, it is because they cannot be converted correctly during the import or the export. Later, we’ll see how we can push those rejected records out.


Normally, this Stage is going to execute in sequential mode, but you can use one of the features of the Sequential File Stage to read multiple files at the same time; when reading multiple files, it executes in parallel. If we set up multiple readers, we can also read chunks of a single file in parallel.

The Stage needs to ‘know’ how the file is divided into rows and it needs to ‘know’ what the record format is (e.g. how each row is divided into columns). Typically, a field of data within a record is either delimited or at a fixed position. We specify the column delimiter (such as a comma) and the record delimiter (such as the new-line character) so that the Stage can parse the file correctly.
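As a rough illustration of that record-format idea, the same information can be written down as an Orchestrate-style schema; the field names and options below are hypothetical and only sketch a comma-delimited file:

   record {final_delim=end, delim=',', quote=double}
   (
      CUSTOMER_ID: int32;
      CUSTOMER_NAME: string[max=40];
      ORDER_AMOUNT: nullable decimal[8,2];
   )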


Job Design Using Sequential Stages The graphic above shows how we can read data

from the Selling_Group_Mapping file, send it into our Copy Stage and load it into the target

flat file. However, if anything is not read correctly by the Selling_Group_Mapping link, it will

be sent down the dashed reject link entitled Source_Rejects and, in this case, read by a

Peek Stage. Over on the right, by the target, if anything isn’t written correctly, it will be

sent out to the Target_File_Target_Rejects link to the TargetRejects Peek Stage. To configure the Sequential File Stage, we would double-click it.


Sequential Source Columns Tab Then, within the Sequential File Stage, on the Output

tab, under the Columns sub-tab, we would configure our Input file column definitions. The

column metadata looks like the metadata that we have elsewhere. In other words all of the

Column Names, Keys, SQL Types, Lengths, Scales, etc. under the Columns sub-tab look like

all of the other metadata that we’ve seen for this file. Next, let's say that we clicked on the

sub-tab entitled 'Properties'.


Under the Properties tab, we are presented with some options. These options help us define

how we will read the file. For instance, there is the file’s name, its read method, and other

options such as whether or not the first line is a column header, whether or not we will reject unreadable rows, etc.
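As a sketch of how those properties might look for a comma-delimited source (the file path is hypothetical and the labels only approximate the GUI):

   Source:  File = /data/src/Selling_Group_Mapping.txt
            Read Method = Specific File(s)
   Options: First Line is Column Names = True
            Reject Mode = Output   (unreadable rows go down the reject link)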


Datasets The other primary file that we use is the Data Set. This kind of file stores data in

a binary format that is non-readable without a DataStage viewer. It preserves the

partitioning that we established during our Jobs. This way, the data contained in the Data

Set is already partitioned and ready to go into the next parallel Job in the proper format.

Data Sets are very important to our parallel operations in that they work within the parallel

framework and use all the same terminology and are therefore accessed more easily.


We use a Data Set as an intermediate point to land data that does not need to be

interchanged with any application outside of DataStage: It will only be used within

DataStage Jobs going from one job to the next (or just within a Job). It is not a useful form

of data storage for sending to other people as an FTP file, or for EDI, or for any of the other

normal forms of data interchange. It is a proprietary DataStage file format that can be used

within the InfoSphere world for intermediate staging of data where we do not want to use the database or a sequential file.


Data Set management Utility You may need to look at a Data Set to verify that it looks ‘right’ or to find specific values as a part of your testing. You can look at a Data Set using the Designer client: there is a Tools option that allows you to do Data Set management. It lets you first seek out the file that you want. It then opens a screen that shows you how the file is separated: it shows the number of partitions and nodes and how the data is balanced between them. And, if you would like, it will also display the data itself. Let’s look at this display.


Data and Schema Displayed A Data Set cannot be read like a typical sequential file with native tools such as the vi editor or the Notepad editor in Windows. When looking at the Data Viewer, we will see the data in the normal tabular format, just as you view all other DataStage data using the View command.
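Outside of the Designer client, the parallel engine also provides a command-line utility, orchadmin, for inspecting and deleting Data Sets on the engine tier. Subcommand names and options vary by version, so treat the lines below as an assumed sketch rather than exact syntax, and always remove a Data Set through the utility instead of deleting its descriptor file at the operating-system level:

   orchadmin describe /data/staging/selling_group.ds    (partitions, schema, segment files)
   orchadmin dump /data/staging/selling_group.ds        (print the records)
   orchadmin rm /data/staging/selling_group.ds          (delete the Data Set and its segments)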


Connect to Databases Now that we’ve talked about some of the basic Passive Stages for accessing file data, we will take a look at some of the basic Database Stages that allow us to connect to a variety of relational databases. Before we can access our databases, however, we need to know how to import relational metadata, so let's begin by talking a little bit about working with relational data.

To import relational data, we will use the Table Definition Import utility within the Designer

client – or you can use the orchdbutil. The orchdbutil is the preferred method to get correct

type conversions. However, in any situation, you need to verify that your type conversions

have come through correctly and will suit the downstream needs (They must suit the needs of subsequent Stages and other activities within DataStage).

You will need to work with Data Connection objects. Data Connection objects store all of our

database connection information into one single named object so that, when we go into our

other Stages in Jobs, we won’t need to re-enter every piece of detailed information, such as the user ID and password of the various databases that we’re connecting to.


Next we need to see what Stages are available to access the relational data. We just talked

briefly about Connector Stages. They provide parallel support, are the most functional, and provide a consistent GUI and behavior across all the relational data types.

Then, we have the Enterprise Stages. These are the legacy Stages from previous versions of

DataStage. They provide parallel support as well with the parallel extender set.

There are also things called Plug-in Stages: These are the oldest family of connectivity

Stages for databases and other relational sources. These were ported to the current version of Information Server in order to support DataStage Server Jobs and their functionality.

When you have one of these Stages, it gives you the ability to select the data you want. We

have the ability to use Select statements as we would in any database. With DataStage, you

get a feature called the SQL Builder utility that quickly builds up such statements allowing you to get the data that you desire.

When we’re writing the data, DataStage has a similar functionality that allows you to build INSERT, UPDATE, and DELETE statements also using the SQL Builder.
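The statements that the SQL Builder assembles are ordinary SQL. For example (the table and column names here are hypothetical):

   SELECT CUSTOMER_ID, CUSTOMER_NAME, DISTR_CHANNEL_DESC
   FROM   WAREHOUSE.CUSTOMER
   WHERE  DISTR_CHANNEL_DESC = 'RETAIL';

   UPDATE WAREHOUSE.CUSTOMER
   SET    DISTR_CHANNEL_DESC = 'ONLINE'
   WHERE  CUSTOMER_ID = 123;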


Import Table Definitions

The first thing that we want to do is to import our table definitions.

To import these table definitions, we can use either ODBC or the Orchestrate schema

definitions. Orchestrate schema imports are better because the data types tend to be more

accurate.

From within the Designer client, you simply click on Import>Table Definitions>Orchestrate

Schema Definitions. Alternatively, you could choose to import table definitions using the

ODBC option. Likewise you can use the Plug-In table definitions or other sources from which you can import metadata, including legacy information such as COBOL file definitions, which can later be used to pull that same data out of the corresponding sources.


Orchestrate Schema Import The Orchestrate Schema Import utility does require certain

bits of information including the database type, the name of the database from which you

are pulling it, the server on which it is hosted, username, and password.

ODBC Import When we’re using our ODBC Import, first we select the ODBC data source

name (DSN). This will need to have been set up for you by your System Administrator

before you use the screen shown above (Otherwise, all the entries will be blank). Many will

need a username and password although some may come pre-configured.


Connector Stage Types There are several Connector Stage types including ODBC, which

conforms to the ODBC 3.5 standard and is level 3 compliant and certified for use with

Oracle, DB2 UDB, SQL Server, and various other databases. This will also include the suite

of DataDirect drivers, which allow you to connect to these various data sources and give

you unlimited connections.

The next connector type is the DB2 UDB Stage. This is useful for DB2 versions 8.1 and 8.2.

There is also a connector for the WebSphere MQ to allow us to connect to MQ series queues.

It can be used with MQ 5.3 and 6.0 for client/server. And there is also WSMB 5.0. The last

Connector Stage type is for Teradata. This gives us fast access to Teradata databases.


Connector Stages Now that we’ve done some of the basic set up for our relational data, we want to be able to use it on our Canvas. You can see, in the graphic above, that an ODBC Connector has been dragged and dropped on the left. Next we would want to configure it so that we can pull our data from an ODBC source. Let’s take a look inside of the Stage.


Stage Editor Inside of the Connector Stage, you can see a Navigator panel, an area in

which we can see the link properties, and then below, we can see the various other

properties of the connection itself. Once we have chosen the connector for our ODBC

Connector Stage, we will then want to configure an SQL statement that will allow us to pull

data. We can use the SQL Builder tool highlighted on the right. Or we can manually enter

our own SQL. Alternatively, we can use SQL that is file-based.


Connector Stage Properties

When configuring the Connector Stage, we should have our connection information and

know what SQL we will be using. This may include any transaction information and session

management information that we entered (this would be our record count and commitment control, shown highlighted).

Additionally, we can enter any Before SQL commands or After SQL commands that may

need to occur such as dropping indexes, re-creating indexes, removing constraints, or various other directives to the database.
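For example, a hypothetical pair of Before SQL and After SQL commands (the index and table names are invented) might be:

   -- Before SQL: drop the index so the load runs faster
   DROP INDEX SALES.IX_ORDER_CUST;

   -- After SQL: re-create the index once the load has finished
   CREATE INDEX SALES.IX_ORDER_CUST ON SALES.ORDER_FACT (CUSTOMER_ID);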


Building a Query Using SQL Builder When you build a query using the SQL Builder

utility, there are certain things that you’ll need to know: Be sure that you are using the

proper table definition and be sure that the Locator tab information is specified fully and

correctly. You can also drag the table definition to the SQL Builder Canvas. And, you can

drag on the columns that you want to select.


Data Connection Data Connections are objects in DataStage that allow us to store all of

the characteristics and important information that we need to connect to our various data

sources. The Data Connection stores the database parameters and values as a named

object in the DataStage repository. It is associated with a Stage type. The property values

can be specified in a Job Stage of the given type by loading the Data Connection into that

Stage.


Creating a New Data Connection Object You can see the icon that represents the Data

Connection highlighted above.


Select the Stage Type Inside the Data Connection, we can select the Stage type that

we’re interested in using and that we want to associate this to. Then we specify any

parameters that are needed by that particular database. Obviously, different databases

have different required parameters, so the selection of the Stage type is important.


Loading the Data Connection Once we have built our Data Connection, then we will want

to load the Data Connection into the Stage that we have selected.

Active Stages

Lookup, Merge, Join Stages

Aside from Passive Stages, our other main classification of Stages is the Active Stage.

Some of the Active Stages that we’ll be looking at cover a number of functions, such as combining data, transforming data, filtering data, and modifying metadata, along with a number of other tasks for which there are corresponding Active Stages.


Data Combination First, let’s talk about data combination. For this, the three Stages that

we use are the Lookup Stage, the Merge Stage, and the Join Stage. These Stages combine

two or more input links. These Stages differ, mainly, in the way that they use memory.

They also differ in how they treat rows of data when there are unmatched key values. Some

of these Stages also have input requirements such as needing to sort the data or de-

duplicate it prior to its combination.

The Lookup Stage Some of the features of the Lookup Stage: it requires exactly one primary input link, it can have multiple reference links, and it can have only one primary output link (plus an optional reject link if that option is selected).


Lookup Failure Actions It is limited to one output link; however, with its lookup failure options, you can include a reject link. Other lookup failure options include continuing the row of data through the process (passing it along with a null value for what should have been looked up), dropping the row, or failing the Job (which aborts the Job).


The Lookup Stage can also return multiple matching rows, so you must be careful how you

use it. The Lookup Stage builds a Hash table in memory from the Lookup file(s). This data is

indexed by using a hash key, which gives it a high-speed lookup capability. You should

make sure that the Lookup data is small enough to fit into physical memory.



Lookup Types There are different types of Lookups depending on whether we want to do

an equality match, a caseless match, or a range lookup on the reference link.


Lookup Example Let’s see how a Lookup Stage might typically fit into a Job. We have a

primary input coming in from the left and the data flows into the Lookup Stage. Coming in

from the top is our Reference data. Notice that the reference link is dashed (a string of

broken lines). As you are dragging objects onto the Canvas and building the Job, always

draw the Primary link before the Reference link. Then coming from our Lookup Stage is the

output link going to the target. What would happen if we clicked on the Lookup Stage to see

how it's configured? Let's take a look.


Lookup Stage With an Equality Match Within the Lookup Stage, there is a multi-pane

window. Everything on the left represents the data coming into the Lookup Stage – both

Primary link and Reference link(s). Everything on the right is coming out of the Lookup

Stage. On the top of the screen is more of a graphical representation of each row. Toward

the bottom of the screen is a more metadata-oriented view in which you can see each of the

links and their respective characteristics. This same metaphor applies to most other Stages

as well.

Notice that the metadata references (shown highlighted lower left) look like the table

definitions that we used earlier to pull metadata into the DataStage Project.


Highlighted toward the top is the area where you connect columns from the input area (the Item input link) by dragging them down and dropping them onto the equivalent fields of the corresponding Reference link (the Warehouse Item in the lower blue area). In this case, we are using a single column to define an equality match: where a “Warehouse Item” is equal to an “Item”.


Next, we can move all the columns from the input area on the left to the output link area on

the right but replace this Item (highlight on the right box) description with the Warehouse

Item description (the one that we found in the Reference link). So, we are taking all of our

input columns and we are adding a new column to it based on what we looked up on the

Warehouse Item ID.


When you click on the icon with golden-chains (shown highlighted) it gives you a new dialog

screen.


Specifying Lookup Failure Actions This dialog box shows us the constraints that we can

use. Here we can determine what actions to take should our lookup fail. You can continue

the row, drop the row, fail the row, or reject it. Also, if you use a condition during your

lookup (highlight on fail under conditions not met), you can use these same options of

continue, drop, fail, or reject for if/when your condition is not met.


Join Stage The Join Stage involves the combination of data. There are four types of joins: inner, left outer, right outer, and full outer. The input links must be sorted before they come into a Join Stage. The Join Stage uses much less memory than the Lookup Stage: it is known as a ‘light-weight’ Stage. The data doesn't need to be indexed because it comes into the Stage already sorted. We must identify a left and a right link as they come in, and the Stage supports additional ‘intermediate’ links. As in an SQL join, in the Join Stage we use key columns to pull our data together.


Job with Join Stage Above is an example of a Job that uses the Join Stage. Notice the

green Join Stage icon in the middle. Similar to the Lookup Stage, here, we have a primary

input, a secondary input, and an output. But, in this case, notice that the right outer link

input is not a dashed line (Highlight on the upper link). Remember that the reference link

that we saw earlier did have a dashed line.


Join Stage Editor When we get inside the Join Stage, we will use the Properties editor.

Similar editors are found in other Stages. Here, we will use the attributes necessary to

create the join. We will specify the join type that we want and what kind of key we will be

joining on.
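As an illustrative sketch only (the key and link names are hypothetical, and the labels approximate the editor), the Join Stage settings might look like:

   Join Keys:     Key = CUSTOMER_ID
   Options:       Join Type = Left Outer
   Link Ordering: Left = lnk_Orders, Right = lnk_Customers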


The Merge Stage The Merge Stage is quite similar to the Join Stage. Much like the Join Stage, its input links must be sorted. Instead of a left and a right link, we have a 'Master' link and one or more Secondary links. It is a ‘light-weight’ Stage in that it uses little memory (this is because we expect the data to already be sorted before coming into the Stage, so there are no keys in memory [indexes to be used]). Unmatched Master rows can be kept or dropped, which gives the effect of a left-outer join. Unmatched Secondary rows can be captured in a Reject link and dispositioned accordingly.


When the data is not ‘pre-sorted’, the sorting can be done by using an explicit Sort Stage or

you can use an On-link sort. If you have an On-link sort, then you will see its icon on the

Link (The icon is shown in the lower-left of the graphic). The icon toward the right of each

link above is the partitioning icon (Partitioning it out, it looks like a fan, and then collecting

the partitioned data back up).


Coming out of the Merge Stage, we have a solid line for the main output and a dashed line

for the rejects. The latter will capture the incoming rows that were not matched in the

merge.


Merge Stage Properties Inside the Merge Stage, you must specify the key attributes by which you will pull the two sets of data together, what to do with the unmatched master records (keep or drop), and whether to warn on reject updates and on unmatched masters. This is another way to combine data, which will create more columns within one row and, in some cases, produce multiple rows.


The Funnel Stage The Funnel Stage provides a way to bring many rows from different

input links into one common output link. The important thing here is that all sources must

have identical metadata! If your links do not have this, then they must first go through

some kind of Transformer Stage or Modify Stage in order to make all of their metadata

match before coming into the Funnel Stage.


The Funnel Stage works in three modes. The Continuous mode is where records are

combined in no particular order. The first one coming into the Funnel is the first one put

out. The Sort mode is where we combine the input records in an order defined by key columns.

This produces a sorted output if all input links are sorted by the same key. The Sequence

mode outputs all of the records from the first input link and then outputs all from the

second input link, and so on (Based on the number of links that you have coming into your

Funnel Stage).


Funnel Stage Example Above, you can see an example of the Funnel Stage’s use.

Let’s continue our discussion on Active Stages. Here, we’ll briefly talk about the Sort Stage and the Aggregate Stage.

Obviously, the Sort Stage is used to sort data that requires sorting, as was the case with

the Join Stage and the Merge Stage. As mentioned, sorts can be done on the input link to a

given Stage. This on-link sort is configured within a given Stage on its Input link’s Partitioning tab. Just set the partitioning to anything other than Auto.


Alternatively, you can have a separate Sort Stage. One of the advantages to the Sort Stage

is that it gives you more options for controlling memory usage during the sort. The

advantage of on-link sorting (within the Stage) is that it is quick and easy and can often be done in conjunction with any partitioning that you’ll be doing.


Sorting Alternatives The Sort Stage is highlighted above. In this example, we have

sequential data coming into a Sort Stage before it is moved to a Remove_Duplicates Stage

and then sent out to a Data Set. Below, we have a different example in which a file is being sorted directly into a Data Set, but in this case the sort happens directly on the link (an on-link sort).


Our next Active Stage is the Aggregator Stage. The purpose of this Stage is to perform data

aggregations. When using it, you specify one or more key columns, which define the

aggregation units (or groups). Columns to be aggregated must be specified. Next, you

specify the aggregation functions. Some examples of functions are counting values (such as

nulls/non-nulls), summing values, or determining the maximum, the minimum, or ranges of

values that you are looking for. The grouping method (hash table or pre-sort) is often a

performance issue.

If your data is already sorted, then the hash table is not necessary and the pre-sort method can be used. If you don’t want to take the time to sort your data prior to aggregation, then using the default hash option will allow you to pull the data in without any sorting ahead of time.
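As an illustrative sketch of Aggregator Stage properties (the group and column names are hypothetical; labels approximate the editor):

   Grouping Keys: Group = CUSTOMER_ID
   Aggregations:  Aggregation Type = Calculation
                  Column for Calculation = ORDER_AMOUNT
                     Sum Output Column = TOTAL_AMOUNT
   Options:       Method = hash   (or sort, if the input is already sorted on the group key)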


Job with Aggregator Stage In this Job we can see the Aggregator Stage’s icon. Data

flows from a Data Set (at the top of the data flow) and is copied two ways. One way (to the

right) is sent to a Join Stage. Another way (straight down) is sent to an Aggregator Stage to

aggregate the row counts and then (flowing from the Aggregate to the Join) is passed back

into the same Join so that each row that came into this Job can also have, attached to it,

the total number of rows that are being counted. The Aggregator Stage uses a Greek Sigma

character as a symbol for summation.


Remove Duplicates Stage Our next Active Stage is the Remove Duplicate Stage. The

process of removing duplicates can be accomplished using the Sort Stage with the Unique

option. However, this leaves us with no choice as to which duplicate to keep. Use of the

Unique option provides for a Stable sort which always retains the first row in the group or a

Non-Stable sort in which it is indeterminate as to which row will be kept.

The alternative way to remove duplicates is to use the Remove Duplicates Stage. This tactic

gives you more sophisticated ways to remove duplicates. In particular, it lets you choose

whether you want to retain the first or the last of the duplicates in the group.


This sample Job shows us an example of the Remove Duplicate Stage and we can see what

the icon looks like. From the Sort Stage (toward the left of the graphic), the data comes into

a Copy Stage toward the bottom, and then runs the sorted data up into the Remove

Duplicate Stage. Then the data is sent out to a Data Set.

An Important Active Stage: The Transformer Stage One of the most important Active

Stages is the Transformer Stage. It provides for four things: Column Mappings, Derivations,

Constraints, and Expressions to be referenced.


It is one of the most powerful Stages. With column mappings, we can change the metadata,

its layout, and its content. Also, within the Transformer Stage are derivations. These are

written in BASIC-like code, and the final compiled output is generated C++ object code. There

are a number of out-of-the-box derivations that you can use allowing you to do things such

as: string manipulation, character detection, concatenation, and other activities. You can

also write your own custom derivations and save these in your Project for use in many

Jobs.

Other features of the Transformer include Constraints. The Constraint allows you to filter

data much like the Filter Stage does. You can direct data down different Output links and

process it differently or process different forms of the data.

The Transformer's expressions are what its constraints and derivations are built from. We can use the input columns that are coming into the Transformer to determine how we want to either derive or constrain our data. We can use Job parameters and functions (whether they are the native ones provided with DataStage or ones that you have custom-created). We can use system variables, constants, and Stage variables (local in scope to just the Stage, as opposed to being global for the whole Job), and we can also use external routines within our Transformer.


Job with a Transformer Stage The Transformer Stage has an icon that looks like a T

shape with an arrow. In the example Job above, we have the data coming into the

Transformer. There is a reject link coming out of the Transformer (performed with the

Otherwise option) and two other links coming out of the Transformer as well to populate

two different files.


Inside the Transformer Stage Inside a Transformer Stage we have a look-and-feel

similar to that of the Lookup Stage that we talked about earlier, in that we have an Input

link (highlighted on the left) and all of our Output links on the right. These two output links

(the two boxes on the right) give us a graphical view and specify the column names and

also give us places to enter any derivations. The top area (highlight on upper-most blue

box on the right) allows us to create Stage variables, which we can use to make local

calculations in the process of doing all of our other derivations.

Again, as with the Lookup Stage, the bottom half of the screen resembles the table

definitions that you will see inside of your repository (highlight on center lower half). This

gives you the very detailed information about the metadata on each column.


(Yellow highlight on the Golden-Chain button in the toolbar at the top) When you click the

Golden Chain button then this will bring you to the section that allows you to do

Constraints.


Defining a Constraint: you can type your constraint into the highlighted area, which also uses the BASIC-like syntax. Here you can create an expression that will put certain data into one link, into a different link, or into no link at all. When you right-click in any of the white space, DataStage brings up options for your Expression Builder (highlighted toward the left).


Alternatively, you can also use the ellipsis button to bring up the same menu. The example above shows an UpCase function and a Job parameter named Channel_Description. At the beginning of the expression, we are using an input column. We compare our input column to an upper-cased version of one of our Job parameters.
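Put together, a constraint along those lines might read as follows (the input link name lnk_in is hypothetical; Channel_Description is the Job parameter mentioned above):

   lnk_in.DISTR_CHANNEL_DESC = UpCase(Channel_Description)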


Defining a Derivation When we are defining or building a derivation, we also use the

BASIC-like code to create expressions within a particular cell. These derivations can occur

within the Stage Variable area or down within any one of the Links. As with the constraints,

a right-click or clicking the ellipsis button brings up the context-sensitive code menu to help

you build your expression.


If Then Else Derivation Using our Transformer Stage, we can also build up “If, then, else”

derivations in which we can put conditional logic that will affect the data on our output link.
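A hypothetical derivation of that form (the link and column names are invented for illustration):

   If lnk_in.ORDER_AMOUNT > 1000 Then "PRIORITY" Else "STANDARD"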

String Functions and Operators Some things included in the Transformer Stage’s

functionality are string functions and operators. The Transformer can use a Substring

operator, it can Upcase/Downcase, and/or it can find the length of strings. It provides a


variety of other string functions as well such as string substitution and finding positional

information.
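A few hypothetical derivations using these string functions (the column names are invented):

   UpCase(lnk_in.CUSTOMER_NAME)                              (convert to upper case)
   Len(Trim(lnk_in.CUSTOMER_NAME))                           (length of the trimmed string)
   lnk_in.PRODUCT_CODE[1,3]                                  (substring: first three characters)
   Trim(lnk_in.FIRST_NAME) : " " : Trim(lnk_in.LAST_NAME)    (concatenation with the ':' operator)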

Checking for NULLs Other out-of-the-box functionality of the Transformer includes

checking for NULLs. DataStage can identify whether or not there are NULLs, assist us in how

we want to handle them, and can do various testing, setting, or replacing NULLs as we see

fit.
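For example, typical NULL-handling derivations might look like this (again, the column names are hypothetical):

   If IsNull(lnk_in.DISCOUNT) Then 0 Else lnk_in.DISCOUNT
   NullToValue(lnk_in.REGION_CODE, "UNKNOWN")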


Transformer Functions Other functions of the Transformer include those for Date and

Time, logic, NULL handling, Numbers, Strings, and those for type-conversion.

Transformer Execution Order It is important to know that, when a Transformer is executed, there is a very specific order of execution. For each row, the first things executed are the derivations in the Stage Variables. The second things executed are the constraints for each link going out of the Transformer. Then, within each link, the column derivations are executed, with the earlier links processed before the later links.

The derivations in higher columns are executed before lower columns: everything has a top-down flow to it.

1. Derivations in Stage variables (from top to bottom)
2. Constraints (from top to bottom)
3. Within each Link, column derivations (from top to bottom)


Let’s look at a quick example of how this might work. At the beginning, the Transformer will execute all the Stage variables (from top to bottom) and then execute each constraint (from top to bottom). For each constraint that is met, the Transformer will fire off the corresponding output Link, and within that Link it executes all the column derivations from top to bottom, which means that, if we do something in a column above, we can use its results in the column below. TIP: You must be very, very careful if you ever re-arrange your metadata, because the order of execution could affect your results within a Transformer.
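A small hypothetical illustration of that ordering, in which a Stage variable is evaluated first and the output columns then reuse its result (all names invented):

   Stage variable:   svNetAmount = lnk_in.ORDER_AMOUNT - lnk_in.DISCOUNT
   Column derivations (top to bottom):
      NET_AMOUNT = svNetAmount
      TAX_AMOUNT = svNetAmount * 0.2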


Transformer Reject Links You can also have Reject links coming out of your Transformer.

Reject links differentiate themselves from other links by how they are designated inside the

constraint. Let’s quickly see how this is done.


Off to the right, we have the “Otherwise Log” option or heading. Checking the respective

checkbox creates a Reject link as necessary.


Our “Otherwise” link is drawn as a straight line (such as the Missing Data link shown above), whereas a “Reject” link has a dashed line (such as the link all the way to the left).

Advanced Concepts When Building Jobs

We’ve discussed the key objects in DataStage and we have set up DataStage. We’ve been

able to use Passive Stages to pull data in from various files or relational database sources and we’ve talked about various Active Stages that let us manipulate our data.

Now, we need to cover a few other important concepts that will help us in our development of DataStage Jobs.

• Runtime Column Propagation (RCP)

• Shared Containers

• One of the most important concepts is that of Runtime Column Propagation (RCP).

When RCP is turned on, the columns of data can flow through a Stage without being

explicitly defined in the Stage. We’ll see some examples of this later.

• With RCP enabled, the target columns in a Stage need not have any columns

explicitly mapped to them: No column mapping is enforced at design-time.

• With RCP, the input columns are mapped to unmapped columns by name. This is all

done ‘behind the scenes’ for you.


• Let’s talk about how the implicit columns get into a Job. The implicit columns are read from sequential files associated with schemas, or from tables when reading from relational databases using a “select” statement, or they can be explicitly defined as an output column in a Stage that is earlier in the data flow of our Job.

• The main benefits of RCP are twofold. First, it gives us greater Job flexibility. This

way, a Job can process input with different layouts. We don’t have to have a

separate Job for multiple record types that are just slightly different.

• Second, RCP gives us the ability to create more re-usability within components, such

as the Shared Containers (There are many ways to create re-usable bits of

DataStage ETL code). This way you can create a component of logic and apply it to a

single named column while all the other columns will flow through untouched.

• When we want to enable RCP, it must be enabled at the Project-level. If it’s not

enabled there, then you won’t find the option in your Job. At the Job level, we can

enable it for the entire Job or just for individual Stages. Within each Stage, we look

at the Output Column tab in order to decide whether or not we want to use RCP.

• You must enable RCP for the entire Project if you intend to use it. And, if you do,

then you can also choose to have each newly created Link use RCP by default or not. This

default setting can be overridden at the Job-level or at the Link-level within each Stage.


Here’s an example of where we set up and enable RCP just within a Stage (Highlight on

checkbox of Enabling RCP at Stage Level slide). When you see this checkmark, you know

that columns coming out of this Stage will be RCP-enabled.

When RCP is Disabled Let’s say that we had four columns coming into this Stage on the input link shown above on the left, and four columns going out (on the right). If we only drew derivations to two of the columns (the upper two on the right), the two lower columns would remain in a RED state (incomplete derivations) and this Job would not compile. However, let’s see what we can do with RCP.


When RCP is enabled we can see that the output link’s columns (lower right) do not have

anything in the red state. By name matching, DataStage will assign the two columns on the

input link (on the left) to two columns going out on the output link (lower right)

automatically.

Or, we could leave these two columns completely off (the ones in the lower right of the previous graphic) and they would be carried invisibly through subsequent Stages, thereby allowing you to run many Jobs whose row types all have the upper two columns explicitly but also carry the lower two (or more) columns implicitly.

In other words, we can run this Job one time with a record that has four columns, another time with a record that has eight columns, and another time with six columns.

The only thing that they all have to share is that the very first two columns (highlight on top two columns) must be the same and must be explicit.


Runtime Column Propagation (RCP) In Detail

Because of the importance of concepts including Runtime Column Propagation (RCP), the use of schemas, the use of Shared Containers (the chunks of DataStage code that can be re-used), and other things that give you greater flexibility and help create greater re-usability within your DataStage Jobs, it is worth talking about them in greater detail. RCP can take years in the field to fully understand.

Even people who have been in the field for years can have difficulty grasping both the power and the pitfalls of RCP, so let’s review and continue our discussion about it.

RCP, when it is turned on, will allow columns of data to flow through a Stage without being

explicitly defined in the Stage. Target columns in the Stage do not need to have any

columns explicitly mapped to them. Therefore, no column mapping is enforced at design-time when RCP is turned on. Input columns are mapped to unmapped columns by name.

RCP is how implicit columns get into a Job. We can define their metadata by using a schema

file in conjunction with a Sequential File Stage. As you’ll recall, a schema file is another way

to define our metadata much like a table definition. Using this schema, we can work with a

Sequential File Stage or the Modify Stage to take unknown data and give it an explicit metadata tag.

We can also pull in our metadata implicitly when we read it from a database using our

“select” statement.

We can also define it by explicitly defining it as an output column in an earlier Stage in our data flow (Using a previous Stage in the Job design).

The main benefits of RCP are Job flexibility so that we can process input with different or

varying layouts. RCP also allows us to create re-usability. It also works very well within DataStage’s Shared Containers (the user-defined chunks of DataStage ETL applications).

Let’s look at an example of RCP so that we can better understand it.


The graphic above summarizes the main points of RCP but by revisiting the example that we

looked at a few moments ago, we can gain a deeper understanding of RCP. So let’s review

our example in which RCP is NOT used.


We’re looking within a Transformer Stage at a typical example of columns when RCP has not

been enabled. Let’s review how it functions without RCP. We can see four columns on the

input link to the left. To the lower right, we can see four outgoing columns (on the output

link). But the two lower columns SPEC_HANDLING_CODE and DISTR_CHANNEL_DESC

aren’t connected to the input columns and don’t have any derivations. Therefore, these two

columns are left in a red color indicating that this is an incomplete derivation. Therefore,

this Job won’t compile.


When RCP is Enabled Now consider the exact same situation, but one in which RCP is enabled. With the same Transformer, the bottom two columns are no longer in a red state and, therefore, this Job will compile. When this Job runs, by name reference these two columns on the bottom left will be assigned to the two columns on the bottom right.

If we had additional explicitly named input link columns beyond the four highlighted to the

left, they would be carried invisibly along to the next Stage. And, they could be either used

or not used as necessary. But this would allow us to not require any kind of individual

column-checking at design-time. This has mixed benefits. It gives us a lot of flexibility but

can create headaches at runtime if you are not aware of data that is passing through your

Job. This can negatively impact downstream activities. Thus you must be very careful when

using the RCP.

Important things to know when using RCP are to understand what data is coming in, what

kind of metadata needs to get processed there, and to understand what data is going out of

the Stage. This applies in terms of the number of columns and the metadata type. If they


do not match, then what you are bringing in will, at some point in the processing, need to be modified to work with the final target metadata of the Job.

For instance, let’s say that you bring in one column of a numeric type of data but it needs to

go out as character data, then you’ll need to perform some type of conversion function on

it. For example if it comes in with one column length but needs to go out with a longer

column length, then you’ll need to handle it from within either a Transformer Stage or a

Modify Stage. Likewise, if you have a longer column coming in, you’ll need to make sure

that it is truncated properly to fit into the outgoing column space so that it doesn’t overflow and create a runtime failure.
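For example, those adjustments could be written as Transformer derivations along these lines (the column names are hypothetical, and DecimalToString is one of the standard type-conversion functions):

   DecimalToString(lnk_in.ORDER_AMOUNT)    (write a numeric column out as a string)
   lnk_in.LONG_DESCRIPTION[1,30]           (truncate a long string to fit a 30-character target)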

Business Use Case Scenario

Let’s talk about a good use of RCP in business terms. Let’s say that you have a large conglomerate company with five companies underneath it in a corporate hierarchy. The conglomerate wants to take data from all five of these companies’ operational systems and populate a data warehouse with Customer information. Each of the sub-systems has a different layout for its Customer data, but they all have one thing in common: each has its own Customer ID field. However, these Customer ID fields may overlap: the value of

‘123’ at Company A may represent a totally different Customer than the value of ‘123’ at

Company B even though they are the same value. The conglomerate has a single Customer

ID that can track each of them distinctly. But each of these sub-companies will not know

what their corporate conglomerate Customer ID is until it comes time to pull all of this data

into the warehouse. They don’t know this corporate conglomerate Customer ID because

their systems weren’t built knowing about any corporate conglomerate Customer ID. But,

the corporate conglomerate knows about it and can pull the customer data from all of the

sub-companies, add another column or piece of information to say ‘from which company it

came’, and then process the data through a DataStage Job which would then relate the sub-

system’s ID to the corporate ID via a cross-reference that is stored in the corporate data warehouse.

Two columns would be needed; the source system ID and the source system Customer ID.

The Job could then use this information to do a Lookup to a reference from the

conglomerate’s warehouse that would in turn find the corresponding corporate conglomerate Customer ID.

It would then bring this data together so that it could be written out with the new number

attached to it. And then that data could then be processed subsequently knowing to which corporate Customer ID this particular sub-system’s row belongs.

So, if we were to process five Customer records from the five companies separately, one

company might have 10 columns in their Customer table while another company has 20 columns in their Customer’s table.

We are only going to specifically identify 2 of those columns coming in and only specify 3 of those columns coming out of our Stage.

If we do this in a file-to-file example: we will read the first file coming in, process the

data, then write it out… even though we only see, in our Job, the 2 or 3 columns that we


have explicitly identified, when we write our output file, all of the columns that came from

the input file will be written in addition to the new conglomerate Customer ID that we did

the lookup on.

This is the case with RCP on. Thereby, we will only need one Job instead of having to write 5

separate Jobs, one for each of the specific data layouts coming from the sub-companies.

Shared Containers Shared Containers are encapsulated pieces of DataStage Job designs. They are components of Jobs that are stored as separate, named container objects, and they can then be re-used in various other Job designs. We can apply “stored Transformer business logic” to a variety of different situations. In the previous example we performed some logic that matched Customer information from the sub-companies with a corporate Customer ID. This logic could be encapsulated, stored, and then re-applied.


Creating a Shared Container

In the example above, the Developer has selected the Copy Stage and the Transformer Stage to form a Shared Container. Clicking (E)dit on the menu bar, then the (C)onstruct Container option, and then (S)hared completes the task. The same Job would then have a slightly different look. Let’s see what it would look like.


Using a Shared Container in a Job

Above, you can see the Shared Container icon where the selected Stages used to be. This same Shared Container can now be used in many other Jobs. Shared Containers can use RCP although they don’t have to do so. The combination of Shared Container and RCP is a very powerful tool for re-use. Let’s examine the concept of combining the two.

If we create a Shared Container in conjunction with RCP, then the metadata inside the Shared Container only needs to describe what is necessary for the functions that will occur in that Shared Container.

Recall for a moment the example that we were talking about with the conglomerate and the

five sub-companies.

Suppose that Shared Container is where we did our Lookup to find the corporate conglomerate number. All we really need to pass into the Shared Container explicitly are the 2 columns we know about; any other columns (regardless of how many there are – different sub-companies have different numbers of columns) are passed in as well. On the way out of the Transformer (which adds the conglomerate Customer ID), all the columns are passed back out together with the newly added corporate identifier that we looked up inside the Shared Container.


Mapping Input/Output Links to the Container

When mapping the input and output links to the Shared Container, you need to select the Shared Container link to which each input link will be mapped, along with the appropriate columns that you’ll need. By using RCP we only have to specify the few columns that really matter to us while we are inside the Shared Container. When we come back out of the Shared Container we can re-identify any of the columns as necessary, depending on what we want to do with them.

Shared Containers Continued

The main idea behind Shared Containers is that they let us create re-usable code that can be used in a variety of Jobs, maybe even in Jobs that weren’t intended to process Customer data at all. For instance, we might want to process Order data where we still need to look up the conglomerate’s Customer ID for each Order. We simply pass in the whole Order record but expose only the Customer number (the sub-system Customer identifier) and the source sub-system from which it came, so that the Shared Container can do the Lookup, produce the appropriate output including the conglomerate Customer ID, and tag that onto the Order. Then, as we process this Order into our data warehouse, we will know not only the source Customer ID that appeared on the billing invoice, and the other information from that particular sub-company, but we will also be able to tie the Order to our conglomerate database so that we can distinguish this customer from customers of other sub-companies.
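
As a rough analogy in Python, the reusable “container” is just a function that only touches the two exposed columns and leaves everything else in the record alone, so the same logic works for a Customer record or an Order record. The function name, field names, and sample values below are hypothetical and only mirror the idea.

    def add_conglomerate_id(record, xref):
        """Reusable lookup: works for Customer rows, Order rows, or anything else,
        as long as the two exposed columns are present."""
        key = (record["source_system_id"], record["source_customer_id"])
        record["conglomerate_customer_id"] = xref[key]
        return record

    xref = {("D", "77"): "9790001"}  # invented sample cross-reference entry
    order = {"source_system_id": "D", "source_customer_id": "77",
             "order_no": "A-1001", "amount": 250.00}
    print(add_conglomerate_id(order, xref))  # the other Order columns pass through untouched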


Again, Shared Containers do not necessarily have to be RCP-enabled but the combination of the two is a very powerful re-usability tool.

Job Control

Now that we’ve created the DataStage environment, have talked about how to pull in

metadata, and have seen how to build Jobs, the next topic answers the question “How do

we control all of these Jobs and process them in an orderly flow without having to do each one individually?”

The main element that we use for this is called a Job Sequence. A Job Sequence is a special type of Job: a master controlling Job that controls the execution of a set of subordinate Jobs. The subordinate Jobs may themselves be Sequence Jobs (in which case they are sub-Sequences), or they can be ordinary Jobs, whether Parallel or Server Jobs.


One of the things that a Job Sequencer does is to pass values to the subordinate Jobs’ parameters. It also controls the order of execution by using Links. These Links are different from our Job Links, though, in that no metadata is passed on them. A Sequence Link tells us what should be executed next and when it should be executed. The Sequence specifies the conditions under which the subordinate Jobs get executed, using a mechanism known as “Triggers”.

We can specify a complex flow of control using Loops and the All or Any options, and we can also do things such as ‘Wait for File’ before a Job starts. For example, let’s say that we were waiting for a file to be FTP’ed to our system; the Sequence can sit in a waiting mode until that file arrives. When it does arrive, the rest of the activities within that Sequence will be kicked off.
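
As an illustration of what the Wait For File Activity does (not of how DataStage implements it), a simple polling loop in Python might look like the sketch below; the path, polling interval, and timeout are made-up values.

    import os
    import time

    def wait_for_file(path, poll_seconds=30, timeout_seconds=3600):
        """Block until the file appears (e.g. after an FTP transfer), or give up."""
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            if os.path.exists(path):
                return True
            time.sleep(poll_seconds)
        return False

    # if wait_for_file("/data/incoming/customers_feed.dat"):
    #     ...kick off the rest of the Sequence activities...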


Job Sequencers also allow us to perform system activities. For example, you can use the e-mail option, which ties into your native mail system, to send an e-mail. They also allow us to execute system commands and executables (things we might normally execute from the Command Line). Any Command Line command can be executed from within the Sequencer. For example, you could send a message to your Tivoli Work Manager by writing a Command Line call that writes a message to the Tivoli Command Center so that operators can see what has happened.

The Sequencer can include restart checkpoints. Should there be some failure within your Sequence, you don’t have to go back and re-run all the Jobs; you can simply skip over the ones that worked and pick up and re-run the one that failed.

When you create a Job Sequencer, you open a new Job Sequence and specify whether or not it is re-startable. Then, on the Job Sequencer’s Canvas, you add Stages that will execute the Jobs, Stages that will execute system commands and other executables, and/or special-purpose Stages that perform functions such as looping or setting up local variables. Next, you add Links between Stages. This specifies the order in which the Jobs are executed. The order of sub-sequences, Jobs, looping, system commands, and e-mails is determined by the flow that you create on the Canvas.


You can specify triggers on the Links. For instance, you might create a condition so that, coming out of Activity A, one Link goes to Activity B if the Activity A Job finished successfully, but a different Link goes to Activity C if the Job errors out. The error Link might then go to an e-mail notification that sends an error message to someone’s console and stops the Sequencer at that point.
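
Conceptually, the trigger logic on the Links behaves like the conditional flow in the Python sketch below. This only mimics the idea; the activity names and notification address are invented for illustration.

    def run_sequence(run_job, send_email):
        """Mimic trigger Links: success goes to the next Activity, failure goes to e-mail."""
        status_a = run_job("Activity_A")
        if status_a == "OK":
            run_job("Activity_B")          # 'OK' trigger Link
        else:
            send_email("ops@example.com",  # 'Failed' trigger Link
                       "Activity_A failed; Sequence stopped")

    run_sequence(run_job=lambda name: "OK",
                 send_email=lambda to, msg: print(to, msg))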

Job Sequences allow us to specify particular error handling, such as the global ability to trap errors, and we can enable or disable restart checkpoints within the Sequence.

Now let's see some of the Stages that can be used within the Sequencer.


Here is a graphic of some of the Stages that can be used within the Sequencer. These include the EndLoop Activity Stage, the Exception Handler Stage, the Execute Command Stage (which runs native OS commands), the Job Activity Stage (used to execute a DataStage Job), and the Nested Condition Stage.

In addition there are the Notification Activity Stage (typically tied into DataStage’s e-mail function), the Routine Activity Stage (used to call DataStage Server routines), the Sequencer Stage (which allows us to bring together and coordinate a rendezvous point for multiple Links once we have started many Jobs in parallel), and the StartLoop Activity Stage.

There are also the Terminator Activity Stage (used to force a stop), the UserVariables Activity Stage (used to create local variables), and the Wait For File Activity Stage.


Sequencer Job Example

Here we can see an example of a Sequencer Job. At the top left, the Job Sequence starts with a Stage that has nothing coming into it; this Stage will therefore start executing immediately. Normally the Stage at the bottom would as well; however, the bottom Stage is an Exception Handler Stage and will only kick off if there is an exception at runtime.

In a different example, if there were many Stages independent of one another, they would normally all start at the same time within a typical Job Sequence.

In the top flow of the Job Sequence above, though, the first Stage waits for a file and then kicks off Job 1. If Job 1 finishes successfully, the flow follows the green Link to Job 2, and so on. The green color of a Link indicates that there is a trigger on it (this trigger basically says ‘if the Job was successful, go to Job 2’). If Job 3 completes successfully, the flow goes on to execute a command. The red Links indicate error activity. Should any of Jobs 1, 2, or 3 fail, the flow follows a red Link to a Sequencer Stage (which can be set to fire when all or any of its input Links complete); in this case, if any of the Links is followed to it, it sends a notification warning by email that there has been a Job failure.


Exception Handler Stage

There is one final note about our Sequencer Job: it is not a scheduler in itself; it just sequences the activities. There must be some external scheduler, whether that is the DataStage scheduler or a third-party enterprise scheduler, to start up the Sequencer Job, and only from there will it call all the other Jobs for you.

Think of it as the entry point into your application from whatever external source you intend to use, whether that source is simply kicking it off from a Command Line or something like a Tivoli Work Manager that executes the Job on a daily or monthly schedule, or on a conditional schedule that fires after some other activity has occurred and then, in turn, calls this Job.

In order to do that, we would need to learn to execute the Job from the Command Line. We

will talk a little about how the Command Line can be used in this manner as well as how it will help us deploy our applications effectively and efficiently.

Pulling Jobs Together

• Parameter Sets

• Running Jobs from the Command Line

• Other Director Functions

Parameter Sets

Let's start by talking about Parameter Sets. One of the things that will help us pull our Jobs together more effectively is the use of Parameter Sets. Parameters are needed for various functions within our Jobs and are extremely useful in giving Jobs flexibility and versatility.

A Parameter Set allows us to store a number of parameters in a single named object. Within it, one or more value files can be named and specified. A value file stores the values for each parameter within the Parameter Set, and these values can then be picked up at runtime.


Parameter Sets can be added to the Job’s parameter list, which is on the Parameters tab within the Job Properties. This makes things very convenient for us: as we develop our Jobs, we don’t have to enter a long list of parameters for every single Job, risking mis-keyed entries or omissions that would prevent our parameters from passing successfully from the Sequencer Job into our lower-level Jobs.

An example of a Parameter Set might be one that holds all the settings for a certain environment.

We might set up a Parameter Set entitled “environment” and have it contain parameters

such as the “Host Name”, the “Host Type”, the “Database Name”, the “Database Type”, the

“Usernames”, and the “Passwords” (Not to be confused with the DataStage Connection

Objects).

We could then have one set of values in a value file used specifically for development, another set for testing, and another set for production.

We could then move our Job from one environment to the next and simply select the appropriate values for that environment from the Parameter Set.
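
The idea can be illustrated outside DataStage with a small Python sketch: each environment has its own set of named values, and the appropriate set is picked up at run time. The dictionary below only mirrors the concept of a Parameter Set with value files; it is not the DataStage value-file format, and the host, database, and user names are placeholders.

    # One named set of values per environment (conceptual illustration only).
    environment_values = {
        "development": {"HostName": "dev-host",  "DatabaseName": "DEVDB",  "Username": "dev_user"},
        "test":        {"HostName": "test-host", "DatabaseName": "TESTDB", "Username": "test_user"},
        "production":  {"HostName": "prod-host", "DatabaseName": "PRODDB", "Username": "prod_user"},
    }

    def resolve_parameters(environment):
        """Pick up the parameter values for the requested environment at runtime."""
        return environment_values[environment]

    print(resolve_parameters("test"))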


Running Jobs from the Command Line

Now that we’ve learned the basics of how to sequence our Jobs and group our parameters, we have an application. Let’s say that we’ve built, developed, and tested a Job and are now ready to put it into production. The next question is “How do we execute it?”

Although a Job can be executed by hand using the DataStage Director client, this is not the typical way that applications are run. Instead, there is a set of Command Line options that allows us to start the Job from the native Command Line, either by typing the command ourselves or by having our scheduler call the Jobs.

The graphic above shows some of the particulars.


Other Director Functions

We mentioned that we can schedule Jobs using the DataStage Director, but more often than not, Jobs are scheduled by an external scheduler. To do this, we must run Jobs from the Command Line, as most schedulers use a Command Line execution.

The InfoSphere suite provides a utility for DataStage called dsjob. dsjob is a multi-functional Command Line interface (API) that allows us to perform a number of functions. One of the most common is to run a Job. By executing the dsjob command on your Command Line with various options after it, you’ll be able to accomplish a number of automated functions from a whole host of other tools (since the Command Line is the most common method for launching other activities).

The top command in the graphic above will run a Job. In it, we are passing options that include the number of rows that we want to process, the name of the Project that we want to run from, and the name of the Job.

dsjob also has other functions, such as giving us information about a Job’s run status, a summary of its log messages, and even Link information. The example above (in the second bullet point) displays a summary of all Job messages in the log.

You can use the -logsum or -logdetail options of the dsjob command to write your output to a file, which can then be archived or even used by another DataStage Job and loaded into a database along with all your log information (any database – it doesn’t have to be part of the DataStage repository). With DataStage version 8.1, you are able to store the logged information directly into the Xmeta repository or keep it locally in the DataStage engine.

You may choose the latter for performance reasons, so that you don’t tie up your Xmeta database with a lot of logging activity while the Job is running, which could affect your network should your Xmeta database be located remotely from your DataStage server. This option lets you write all the log information into the local DataStage engine’s repository while the Jobs run and then, only at the end, pull that data out and process it into your master repository (whether that is Xmeta or some other relational data source) so that you can keep it for long-term use. This way you avoid negatively impacting the performance of the application while it is running, since the transfer happens after the application has finished.

As mentioned earlier, Xmeta is the repository for all of DataStage.

dsjob’s function is documented in the “Parallel Job Advanced Developer’s Guide”.

Now, typically, when we use the dsjob command it will be encapsulated inside some kind of

shell script (whether on a Windows or UNIX type of system or whatever system DataStage

is running on) and we would want to make sure that all the options are correct before issuing the command.
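
As a rough sketch of how such a wrapper might look, the Python script below builds a dsjob -run command, checks the exit status, and then captures a -logsum report to a file. The project name, Job name, and parameter value are placeholders; the -run, -rows, and -logsum options are the ones mentioned in this section, -param is a standard dsjob option for passing a Job parameter, and the exact option list should be verified against your installation’s documentation before relying on it.

    import subprocess

    PROJECT = "AmalCorp"        # placeholder Project name
    JOB = "CompanyD_Feed"       # placeholder Job name

    def run_dsjob(args):
        """Run a dsjob command and return (exit code, captured output)."""
        result = subprocess.run(["dsjob"] + args, capture_output=True, text=True)
        return result.returncode, result.stdout + result.stderr

    # Run the Job, limiting the rows processed and passing one parameter (placeholder values).
    code, output = run_dsjob(["-run", "-rows", "100",
                              "-param", "SourceSystemID=4", PROJECT, JOB])
    if code != 0:
        raise SystemExit("dsjob -run failed with exit code %d: %s" % (code, output))

    # Pull a summary of the Job's log messages and keep it in a file for later loading.
    _, logsum = run_dsjob(["-logsum", PROJECT, JOB])
    with open(JOB + "_logsum.txt", "w") as log_file:
        log_file.write(logsum)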


Product Simulation

Welcome to the Guided Tour Product Simulation. Earlier in the tutorial, we discussed many of the important concepts involved in DataStage. In this interactive section, you will

‘use the product’, without having to perform any installations, to create DataStage Jobs,

navigate around the UI, and ‘learn by doing’.

Specifically, we will first introduce you to a business use-case scenario in which you will learn about a business entitled Amalgamated Conglomerate Corporation. Amalgamated Conglomerate Corporation (ACC) has hired you as their DataStage Developer! ACC is a large holding company that owns many subsidiary companies underneath it.

In our scenario, we’re going to talk about what happens to ACC over a period of a year as they acquire new companies. In this section you will do the following:

1. Understand the Business Problem and Case Scenario
2. Assess the Existing Architecture
3. Look at a Feed Job
4. Create a Feed Job
5. Modify the Consolidation Job
6. Create a Sequence Job


Let’s look at ACC’s corporate structure as it looks at the beginning of the year, on January 1st. As you can see in the graphic above, ACC has four subsidiary companies. These

subsidiary companies include Acme Universal Manufacturing (Company A), Big Box Stores

(Company B), Cracklin’ Communications (Company C) and Eco Research and Design

(Company E). All of these belong to ACC and information from all of them will need to be

pulled together from time to time.

Now some time has gone by, and on March 3rd the state of ACC has changed: your

employer has acquired a new company called Disks of the World (Company D). Disks of the

World will now need to be able to feed its information into ACC’s Data Warehouse. Let’s go

back to January 1 and see what the feeds looked like before Disks of the World (Company

D) came on board.


Each of the individual subsidiaries (Companies A, B, C, and E) feeds its data into a common customer Data Mart, but this is done in two steps. Let’s quickly talk about how this is done from a more technical standpoint. Above, we can see four feeds: one for each company and that company’s customers. The feeds are all turned into files; each FEED FILE has been generated by an individual DataStage DS FEED JOB. These files are later pulled together by a Consolidation Job (a DataStage Job) and run into the COMMON CUSTOMER DATA MART.

The reason we have several different feeds running to files is that each of these subsidiary companies puts out its data at a different interval, that is, at a different frequency. However, we only want to consolidate the data in “one fell swoop” (ACC’s business requirement is that we put all the feed files into the Data Mart at one time).

Acme or Company A puts out their data once per day. But Big Box or Company B has such a

large feed that they put it out once an hour. Cracklin’ Communications or Company C has a

large feed but not quite as large as Company B, so they do it four times per day. Eco

Research or Company E has such a small customer base that they only produce their feed one time per week.

ACC’s business requirement, though, is for us to perform a daily consolidation of these feeds into the Common Customer Data Mart. The consolidation will therefore pull in many files from Companies B and C, probably only one from Company A, and then, once per week, there will also be one in the heap from Company E.

Now let’s go ahead in time to March 3rd. You’ll recall that ACC has now acquired another company called Disks of the World. The DataStage Developer for ACC (that’s you) will now need to add a new DataStage feed for the customer data coming from Company D (Disks of

the World) into the Common Customer Data Mart. You will build a DataStage feed that

accommodates their need to produce their data multiple times per day during their sales-

cycle. This feed will be landed to the same set of files (Feed files at the center of the

graphic) that are then picked up by another DataStage Job entitled Consolidation DS Job

(which combines them into the mart). Finally, you will need to build a special type of Job

called a Sequencer Job. This type of Job is used in order to control the running of other

DataStage Jobs. In other words, we’ll see how all of the Jobs (all the DS FEED JOBs and the CONSOLIDATION JOB) can be brought together by using a Sequencer Job. This Sequencer

will call all the Jobs in the correct order. This way, we don’t have to schedule them

individually in our corporate-wide scheduling software. We will just schedule the one Sequencer.


In sum, as the DataStage Developer for ACC, you will need to do the following:

1. View the existing Acme DS FEED JOB to see how the previous DataStage Developer

at ACC did it

2. Create a new DataStage Job, shown highlighted above (Company D's DS FEED JOB)

3. Modify ACC’s DS CONSOLIDATION JOB to bring in this

new/extra feed (from Company D) into the data mart

4. Finally, later in this Guided Tour Product Simulation, you will also create a Job that

controls all of these Jobs!

5. There is a technical problem that has been identified and reported to ACC in the

past. You will need to understand the issue so that, as you develop a new DS FEED

JOB, you will help solve this problem: ACC will need to recognize any particular

customer as having come from a particular subsidiary:

6. One of the things that all of these DS Feed Jobs have in common is that the

DataStage Jobs will need to find a common customer ID for the corporation. Since

each of the subsidiary companies developed their own systems independently (of

ACC), the numbering systems that they use for their customers are different and

may well inappropriately overlap. For example, Company A has a Customer 123 that

represents Joe Smith but over at Company B, there is a Customer 123 that

represents Emmanuel Jones Enterprises. Even though these two share the same

number, they are not the same customer. If we put them into the Common

Customer Data Mart ‘as is’, then we would have ambiguity and not know which

customer we were dealing with.

7. The Common Customer Data Mart will have its own numbering system for these

customers and it will differentiate by using the Customer ID that comes from the

company (Joe Smith = Customer 123), along with an additional identifier, that

indicates which company the data came from (Company A is known to the Data Mart

as Source System ID = 1, for instance). These two fields when combined will create

a new field that is an alternate key and is now a unique identifier in the Common

Customer Data Mart for Joe. This number will now be able to distinguish any

customer from any subsidiary.

8. Our DataStage “Feed” Jobs will need to be able to find out if their customer is

already known to the Common Customer Data Mart.

9. If the customer is not known, then we will hand it off to ACC's Data Mart team to add

it in a separate process (a ‘Black Box’ process Job that we will not see in this

tutorial) which will then place the new customer or insert it for the first time into the

Data Mart.

10. However, if the customer is already known to the Mart, then the Data Mart must find the unique ID and add it to the data (for example, Customer_123SourceSysID_1 is added to an ADDRESS update for Joe Smith). This way, when it comes time to load the Common Customer Data Mart (with the Consolidation DS Job), we won’t have ambiguity in our data. We’ll be able to load it correctly and distinctly, and Joe’s address is updated.

11. Let’s talk about how the Consolidation DataStage Job would do this.

12. ACC's Business Requirements

13. Anytime that ACC runs one of these Jobs, the Feed Jobs designate from which source system (company) they are coming. In other words, each Feed Job designates its subsidiary company uniquely: for example, “Source system 1” or “coming from Company A” for Acme, “Source system 2” or “coming from Company B” for Big Box Stores, and so forth. When the data is processed, the source system number will be sent into the Job (as a parameter). Then the Job must perform a lookup. For this we will use both the source system ID and the Customer ID from the source system to do a lookup into a table from the Common Customer Data Mart to see if this compound key already exists. This is called a “reverse lookup”. In other words, the Job uses those two columns to find the true key for the Common Customer Data Mart, which should then be returned into the Job so that that column (the true key) can be carried along with all the other native source columns in the row as it passes through the Job.

14. Additionally, we need to code the Job so that it will convert all of the source system-specific metadata from the source system’s metadata layout into the target

metadata layout. Coming into the Job, the metadata appears as it would in the

source system (Acme’s layout of the customer) but, on the way out, we should see

the same type of data but captured in columns and fields which conform to the

Common Customer Data Mart’s metadata structure.

15. The Common Customer ID is a unique key within the Common Customer Data Mart.

This key has no meaning to each individual source system. So the only way that the

source system can find this unique key is through this feed process. The feed process

enriches the data by looking at the Common Customer Data Mart and finding this

Common Customer ID to help differentiate all of our different customers across all of

our subsidiary companies. If the customer is new, then it will be sent to another

process for insertion into the Customer Data Mart. This Insertion Job is not discussed

in this tutorial and is considered a “Black Box” process used by Amalgamated

Conglomerate Corporation to add new customers.

16. In summary, the unique Customer ID in the Common Customer Data Mart has

compounded the Source System Identifier with a source-specific Customer ID into a

field that, for example, might contain a value like “9788342”.

17. This field is unique so that no customer overlaps with another.


Look At Acme DS Feed Job

Now that we know the business requirements, the architecture, and the development strategy, let's look at the DS Feed Job for the Acme subsidiary (Company A) that was built previously by another DataStage Developer at ACC (highlighted above). After we've seen it and how it works, and after ACC makes its acquisition in March, we will move on to the section of the tutorial in which you will build a similar DS Feed Job for Disks of the World (Company D).

In order to look at the DataStage Feed Job for Acme Universal Manufacturing, we'll need to bring up the DataStage Designer client. In this Job, we will see how the data is pulled from Acme’s operational database and put into a file after looking up the appropriate Common Customer ID from the Data Mart. Then, we’ll look at the Consolidation

Job. There, we will see how all of the individual Feed files are brought together prior to

being loaded into the Common Customer Data Mart. We will need the highlighted client to look at, create, and modify DataStage Jobs.


Our Login screen comes up. Here, we’ll need to enter username/password information. At

the top of the screen, you can see the domain in which we are working. At the bottom of

the screen, we need to make sure that we are working with the correct DataStage Project; in this case it is the AmalCorp Project (Amalgamated Conglomerate Corporation Project). Then, with a password filled in for you (for purposes of this tutorial), enter a username of student and click the OK button. (The NEXT button below has been disabled for the remainder of the Guided Tour Product Simulation.)

Our Designer client is now connecting to the DataStage Server via authentication and

authorization from the Server’s Security layer. A screen will then pop up within which we

can work on our Job. First, we are asked if we want to create a Job and if so, what type. For right now, since we will look at an existing Job, just click the Cancel button.

From here, please develop as per the spec.