Course Contents
• Module 01: Introduction
• Module 02: Deployment
• Module 03: Administering DataStage
• Module 04: DataStage Designer
• Module 05: Creating Parallel jobs
• Module 06: Accessing Sequential data
• Module 07: Platform Architecture
• Module 08: Combining Data
• Module 09: Sorting and aggregating data
• Module 10: Transforming Data
• Module 11: Repository Functions
• Module 12: Working with Relational data
• Module 13: Metadata in Parallel Framework
• Module 14: Job Control
2/16/2010 Training Material - Sample Copy
Course Objectives
• DataStage Clients and Server
• Setting up the parallel environment
• Importing metadata
• Building DataStage jobs
• Loading metadata into job stages
• Accessing Sequential data
• Accessing Relational data
• Introducing the Parallel framework architecture
• Transforming data
• Sorting and aggregating data
• Merging data
• Configuration files
IBM WebSphere DataStage 8.1
____________________________Module 1: Introduction
Unit Objectives
After completing this unit, you should be able to:
• List and describe the uses of DataStage
• List and describe DataStage clients
• Describe the DataStage workflow
• List and compare different types of DataStage jobs
• Describe two types of parallelism exhibited by DataStage parallel jobs
What is IBM WebSphere DataStage?
• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Schedule, run, and monitor jobs all within DataStage
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
IBM Information Server
• Suite of applications, including DataStage, that:
- Share a common repository
– DB2, by default
- Share a common set of application services and functionality
– Provided by Metadata Server components hosted by an application server
- IBM WebSphere Application Server
– Provided services include:
• Security
• Repository
• Logging and reporting
• Metadata management
• Managed using web console clients
- Administration console
- Reporting console
Information Server Backbone
• Product applications hosted on the WebSphere backbone: Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, Federation Server
• Metadata Access Services and Metadata Analysis Services
• Metadata Server
• Information Server Console
What’s the Same
• DataStage Designer and Director work very much the same
- What you could do before you can still do
- Minor changes to menus and GUI
• Same job types are supported
- Parallel jobs
- Server jobs
- Mainframe jobs
- Job sequences
• All stages that existed in DataStage 7.5x are still supported
• Previous DataStage functionality is still supported
- Export DataStage components (dsx)
• Now occurs in designer
- Import Metadata
• Sequential files
• Database
• COBOL file definitions
• Job compile, execution, and run-time log are the same
What’s New or Different? Part I
• QualityStage is now embedded within DataStage Designer
• DataStage/QualityStage is now hosted by the Metadata Server
- There is a layer of administration, logging, reporting, and security that occurs at the Metadata Server level (outside of DataStage)
• Managed by the administration and reporting consoles
- The repository is now managed by the Metadata Server and its services
• No longer a UniVerse database
• Repository model has been completely redesigned
- Installation and deployment are now more flexible (complex)
• More deployment options
• Metadata Server, repository, and DataStage engines can be on different machines and platforms
What’s New or Different? Part II
• DataStage Manager is gone
– All Manager functionality has been moved into Designer
• DataStage permissions are implemented differently
• GUI: new icons, menu arrangements, etc.
• New stages
• Slowly Changing Dimension (SCD) stage
• Surrogate key management
• Connector stages
• Stage enhancements
• Lookup stage now supports range lookups
• Complex Flat File (CFF) stage now supports multiple record formats
• SQL builders accessible in connector stages
What’s New or Different? Part III
• DataStage repository enhancements
– Flexible folder organization
– Repository search
– Enhanced DataStage components export
– Job and table definition difference reports
– Impact analysis
• New objects
– Parameter sets: named sets of parameters
– Database connectors: named sets of database connection property values
• New utilities
– Performance analyzer
– Resource Estimator
Components of DataStage 8.1
____________________________Module 1: Introduction
Client Logon
• Information Server console address
• Information Server administrator ID
Information Server Administration Console
• Administration console
• Reporting console
DataStage Architecture
• Clients
• Engines: parallel engine and server engine
• Shared repository
Steps to Develop job in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
• Designer Repository View lists the existing jobs
• Build job in Designer
• Compile job in Designer
• Run and monitor job in Director
-- Jobs can also be run in Designer, but job log messages cannot be viewed in Designer
DataStage Project Repository
• User-added folders
• Standard Jobs folder
• Standard Table Definitions folder
Types of DataStage Jobs
• Parallel jobs
– Executed under control of DataStage Server runtime environment
– Built-in functionality for Pipeline and Partitioning Parallelism
– Compiled into OSH (Orchestrate shell language)
– OSH executes Operators
• Executable C++ class instances
– Runtime monitoring in DataStage Director
• Job Sequences (Batch jobs, Controlling jobs)
– Master Server jobs that kick-off jobs and other activities
– Can kick-off Server or Parallel jobs
– Runtime monitoring in DataStage Director
• Server jobs (Requires Server Edition license)
– Executed by the DataStage Server Edition
– Compiled into Basic (interpreted pseudo-code)
– Runtime monitoring in DataStage Director
• Mainframe jobs (Requires Mainframe Edition license)
– Compiled into COBOL
– Executed on the mainframe, outside of DataStage
Design Elements of Parallel Jobs
• Stages: Implemented as OSH operators (pre-built components)
• Types of Stages
• Passive stages (E and L of ETL)
– Read data
– Write data
– E.g., Sequential File, Oracle, Peek stages
• Processor (active) stages (T of ETL)
– Transform data
– Filter data
– Aggregate data
– Generate data
– Split / Merge data
– E.g., Transformer, Aggregator, Join, Sort stages
• Links
– “Pipes” through which the data moves from stage to stage
Quiz – True or False?
• DataStage Designer is used to build and compile your ETL jobs
• Manager is used to execute your jobs after you build them
• Director is used to execute your jobs after you build them
• Administrator is used to set global and project properties
Deployment of DataStage 8.1
____________________________Module 2: Deployment
Unit Objectives
After completing this unit, you should be able to:
• Identify the components of Information Server that need to be installed
• Describe what a deployment domain consists of
• Describe different domain deployment options
• Describe the installation process
• Start the information server
What Gets Deployed
An Information Server domain, consisting of the following:
• Metadata Server, hosted by an IBM WebSphere Application Server instance
• One or more DataStage servers
– DataStage server includes both the parallel and server engines
• One DB2 UDB instance containing the repository database
• Information server clients
– Administration console
– Reporting console
– DataStage clients
• Administrator
• Designer
• Director
• Additional information server applications
– Information analyzer
– Business glossary
– Rational Data Architect
– Information Services Director
– Federation Server
Deployment: Everything on One Machine
• Here we have a single domain, with the hosted applications (Metadata Server backbone, DB2 instance with the repository, DataStage server) all on one machine
• Additional client workstations can connect to this machine using TCP/IP
Deployment: DataStage on a Separate Machine
• Here the domain is split between two machines:
- DataStage server
- Metadata Server and DB2 repository
Metadata Server and DB2 on Separate Machines
• Here the domain is split between three machines:
– DataStage server
– Metadata Server
– DB2 repository
Installation Configuration Layers
• Configuration layers include:
– Client
• DataStage and information server clients
– Engine
• DataStage and other application engines
– Domain
• Metadata server and hosted metadata server components
• Installed products domain components
– Repository
• Repository database server and database
– Documentation
• Selected layers are installed on the machine local to the installation
• Already existing components can be configured and used
– E.g., DB2, WebSphere Application Server
Information Server Start-up
• Start the Metadata Server
– From the Windows Start menu, click “Start the server” after selecting the profile to be used (e.g., default)
– From the command line, open the profile bin directory and enter: startServer server1
> server1 is the default name of the application server instance hosting the Metadata Server
• Start the ASB agent
– From the Windows Start menu, click “Start the agent” after selecting the Information Server folder
– Only required if DataStage and the Metadata Server are on different machines
• To begin work in DataStage, double-click a DataStage client icon
• To begin work in the Administration and Reporting consoles, double-click the Web Console for Information Server icon
Starting the Metadata Server Backbone
• Application Server > Profiles folder > Profile > Start the Server
• Startup command
Testing the Installation
• You can partially test the installation by logging on to the Information Server Web console
Checkpoint
1. What application components make up a domain?
2. Can a domain contain multiple DataStage servers?
3. Do the DB2 instance and the repository database need to be on the same machine as the application server?
4. Suppose DataStage is on a separate machine from the application server. What two components need to be running before you log onto DataStage?
Checkpoint Solutions
1. Metadata Server hosted by the application server. One or more DataStage servers. One DB2 UDB instance containing the suite repository database
2. Yes. The DataStage servers must be on separate machines. They can be on different platforms, e.g., one server running on Windows and another running on Linux
3. No. The DB2 instance with the repository can reside on a separate machine/platform from the application server
4. The application server and the ASB agent
IBM WebSphere DataStage 8.1
___________________________Module 3: Administering DataStage
Unit objectives
After completing this unit, you should be able to:
• Open the administrative console
• Create new user and groups
• Assign suite roles and product roles to users and groups
• Give user DataStage credentials
• Log on to DataStage Administrator
• Add a DataStage user on the permissions tab and specify the user’s role
• Specify DataStage global and project defaults
• List and describe important environment variables
Information Server Administration Console
• Web application for administering Information Server
• Used for:
– Domain management
– Session management
– Management of users and groups
– Logging management
– Scheduling management
Opening the Administrator Web Console
• Information Server console address
• Information Server administrator ID
User and Group Management
• Suite authorizations can be provided to users or groups
– Users that are members of a group acquire the authorizations of the group
• Authorizations are provided in the form of roles
– Two types of roles:
• Suite roles: apply to the suite
• Suite component roles: apply to a specific product or component of Information Server, e.g., DataStage
• Suite roles
– Administrator
• Performs user and group management tasks
• Includes all the privileges of the Suite User role
– User
• Creates views of scheduled tasks and logged messages
• Creates and runs reports
User and Group Management (cont.)
• Suite component roles
– DataStage
• DataStage User
– Permissions are assigned within DataStage
> Developer, Operator, Super Operator, Production Manager
• DataStage Administrator
– Full permission to work in DataStage Administrator, Designer, and Director
– And so on for all products in the suite
Creating a DataStage User ID
• Administration console > Users > Create new user
Assigning DataStage Roles
• Users > User ID
• Assign Suite role
• Assign DataStage User role
DataStage Credential Mapping
• Users given the DataStage Administrator or DataStage User product roles in the suite administration console do not automatically receive DataStage credentials
– Users with the DataStage Administrator role need to be mapped to a valid user on the DataStage server machine
• These DataStage users must have file access permission to the DataStage engine/project files or Administrator rights on the operating system
– Users with the DataStage User role need to be mapped to a valid user on the DataStage server machine and need additional DataStage-assigned permissions (Developer, Operator, …)
Module Objectives
• Setting project properties in Administrator
• Defining Environment Variables
• Importing / Exporting DataStage objects in Manager
• Importing Table Definitions defining sources and targets in Manager
Logging on to Administrator
• Host name, port number of application server
• DataStage administrator ID and password
• Name or IP address of DataStage server machine
Setting Project Properties
• Project Properties
• Projects can be created and deleted in Administrator
• Each project is associated with a directory on the DataStage Server
• Project properties, defaults, and environment variables are specified in Administrator
• Can be overridden at the job level
Setting Project Properties
• To set project properties, log onto Administrator, select your project, and then click “Properties”
• Click to specify project properties
• Server projects
• Link to Information Server administration console
Project Properties General Tab
• Enable Runtime Column Propagation
• Specify auto-purge
• Environment variable settings
Environment Variables
• User-defined variables
• Parallel job variables
Environment Reporting Variables
• Display score
• Display OSH
• Display record counts
Adding Users and Groups
• Available users/groups must have a DataStage product User role
• Add DataStage users
Specify DataStage Role
• Added DataStage user
• Select DataStage role
Checkpoint
1. Authorizations can be assigned to what two items?
2. What two types of authorization roles can be assigned to a user or group?
3. In addition to suite authorization to log onto DataStage, what else does a DataStage developer require to work in DataStage?
Checkpoint Solutions
1. Users and groups. Members of a group acquire the authorizations of the group
2. Suite roles and Product roles
3. Must be mapped to a user with DataStage credentials
IBM WebSphere DataStage 8.1
___________________________Module 4: DataStage Designer
Unit Objectives
After completing this unit, you should be able to:
• Log onto DataStage
• Navigate around DataStage Designer
• Import and export DataStage objects to a file
• Import a table definition for a sequential file
Logging on to DataStage Designer
• Host name, port number of application server
• DataStage server machine / project
Designer Work Area
• Repository
• Menus, toolbar
• Parallel canvas
• Palette
Importing and Exporting DataStage Objects
• What Is Metadata?
Repository Window
• Default Jobs folder
• Default Table Definitions folder
• Search for objects in the project
• Project
Repository Contents
• Metadata
• Describing sources and targets: table definitions
• Describing inputs/outputs from external routines
• Describing inputs and outputs to BuildOp and CustomOp stages
• DataStage objects
• Jobs
• Routines
• Compiled jobs/objects
• Stages
Import and Export
• Any object in Repository can be exported to a file
• Can export whole projects
• Use for backup
• Sometimes used for version control
• Can be used to move DataStage objects from one project to another
• Use to share DataStage jobs and projects with other developers
Export Procedure
• In Repository Export Window, click “Export>DataStage Components”
• Select DataStage objects for export
• Specify type of export:
– DSX: default format
– XML: enables processing of the export file by XML applications, e.g., for generating reports
• Specify file path on client machine
Export Window
• Selected objects
• Default Jobs folder
• Click to select objects from the repository
• Begin export
• Export type
Import Procedure
• In Repository, click “Import>DataStage Components”
– Or “Import>DataStage Components (XML)” if you are importing an XML-format export file
• Select DataStage objects for import
Import Options
• Import all objects in the file
• Display list to select from
Importing Table definitions
• Table definitions describe the format and columns of files and tables
– Import format and column definitions from sequential files
– Import relational table column definitions
– Import COBOL file definitions, and many other things
• Table definitions can be loaded into job stages
• Table definitions can be used to define Routine and Stage interfaces
Sequential File Import Procedure
• In Designer, click Import>Table Definitions>Sequential File Definitions
• Select directory containing sequential file and then the file
• Examine format and column definitions and edit as necessary
Sequential Import Window
• Select repository folder
• Select directory containing files
• Select file
• Start import
Specify Format
• Edit columns
• Delimiter
• Select if first row has column names
Edit Column Names and Types
• Double-click to define extended properties
Extended Properties Window
• Available properties
• Property categories
Table Definition General Tab
• Source type
• Stored table definition
Quiz - True or False?
• You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.
• The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
IBM WebSphere DataStage 8.1
___________________________Module 5: Creating Parallel Jobs
Unit Objectives
After completing this unit, you should be able to:
• Design a simple Parallel job in Designer
• Define a job parameter
• Compile your job
• Run your job in Director
• View the job log
Creating Parallel Jobs
• What Is a Parallel Job?
– Executable DataStage program
– Created in DataStage Designer
– Can use components from Manager Repository
– Built using a graphical user interface
– Compiles into Orchestrate shell language (OSH) and object code (from generated C++)
Job Development Overview
• Import metadata defining sources and targets
-- Can be done within Designer or Manager
• In Designer, add stages defining data extractions and loads
• Add processing stages to define data transformations
• Add links defining the flow of data from sources to targets
• Compile the job
• In Director, validate, run, and monitor your job
-- Can also run the job in Designer
-- Can only view the job log in Director
Designer Toolbar
• Provides quick access to the main functions of Designer
Adding Stages and Links
• Drag stages from the Tools Palette to the diagram
-- Can also be dragged from the Stage Type branch to the diagram
• Draw links from source to target stage
-- Right-click over the source stage
-- Release the mouse button over the target stage
Job Creation Example Sequence
• Brief walkthrough of procedure
• Assumes table definition of source already exists in the repository
Drag Stages and Links From Palette
• Row Generator
• Peek
• Compile
• Job properties
Renaming Links and Stages
• Click on a stage or link to rename it
• Meaningful names have many benefits:
– Documentation
– Clarity
– Fewer development errors
Row Generator Stage
• Produces mock data for specified columns
• No input link; single output link
• On Properties tab, specify number of rows
• On Columns tab, load or specify column definitions
- Click Edit Row over a column to specify the values to be generated for that column
- A number of algorithms for generating values are available depending on the data type
• Algorithms for Integer type
- Random: seed, limit
- Cycle: initial value, increment
• Algorithms for string type: cycle, alphabet
• Algorithms for date type: random, cycle
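The cycle and random algorithms above can be sketched in Python. This is an illustration of the generation logic only, not DataStage's implementation; the function names are ours:

```python
import random

def cycle_int(initial, increment, n):
    """Cycle algorithm for Integer columns: start at `initial`, add `increment` per row."""
    return [initial + i * increment for i in range(n)]

def random_int(seed, limit, n):
    """Random algorithm for Integer columns: seeded values in [0, limit)."""
    rng = random.Random(seed)          # same seed reproduces the same rows
    return [rng.randrange(limit) for _ in range(n)]

def cycle_string(values, n):
    """Cycle algorithm for string columns: repeat a fixed list of values."""
    return [values[i % len(values)] for i in range(n)]
```

Because the random algorithm is seeded, two runs with the same seed produce identical mock data, which is convenient for repeatable tests.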
Inside the Row Generator Stage
• Properties tab
• Property
• Set property value
Columns Tab
• Select table definition
• Load a table definition
• View data
Extended Properties
• Specified properties and their values
• Additional properties to add
Peek Stage
• Displays field values
- Displayed in the job log or sent to an output link
- Skip records option
- Can control number of records to be displayed
- Shows data in each partition, labeled 0, 1, 2, …
• Useful stub stage for iterative job development
- Develop job to a stopping point and check the data
Job Parameters
• Defined in the Job Properties window
• Make the job more flexible
• Parameters can be:
- Used in directory and file names
- Used to specify property values
- Used in constraints and derivations
• Parameter values are determined at run time
• When used for directory and file names and property values, surround with pound signs (#)
- E.g., #NumRows#
• Job parameters can reference DataStage and system environment variables
- Prefixed by $, e.g., $APT_CONFIG_FILE
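The run-time #Name# substitution can be modeled in a few lines of Python. This is only a sketch of the expansion idea; the parameter names in the example are invented:

```python
import re

def expand_params(text, params):
    """Replace each #Name# reference with the parameter's run-time value."""
    def sub(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("undefined job parameter: " + name)
        return str(params[name])
    return re.sub(r"#(\w+)#", sub, text)
```

Usage: `expand_params("/data/#Project#/rows_#NumRows#.txt", {"Project": "dx444", "NumRows": 10})` yields a concrete path, the same way a parameterized file name in a stage resolves when the job runs.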
Using a Job Parameter in a Stage
• Job parameter surrounded with pound signs
Adding Job Documentation
• Job Properties
- Short and long descriptions
- Shows in Manager
• Annotation stage
- Added from the Tools Palette
- Display formatted text descriptions on diagram
Error or Success Messages
• Highlight stage with error
• Click for more info
DataStage Director
• Use to run and schedule jobs
• View runtime messages
• Can invoke from DataStage Designer
- Tools > Run Director
Run Options
• Stop after number of warnings
• Stop after number of rows
Job Log View
• Click the open book icon to view log messages
• Peek messages
Other Director Options
• Schedule job to run on a particular date/time
• Clear job log of messages
• Set job log purging conditions
• Set Director options
- Row limits
- Abort after x warnings
Running Jobs From Command Line
• dsjob -run -param numrows=10 dx444 GenDataJob
- Runs a job
- Use -run to run the job
- Use -param to specify parameters
- In this example, dx444 is the name of the project
- In this example, GenDataJob is the name of the job
• dsjob -logsum dx444 GenDataJob
- Displays a job’s messages in the log
• Documented in “Parallel Job Advanced Developer’s Guide”
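When scripting job runs, the dsjob argument list can be assembled programmatically before handing it to something like subprocess.run. A minimal sketch using only the flags shown on this slide:

```python
def dsjob_run_cmd(project, job, params=None):
    """Assemble the `dsjob -run` command line: one `-param name=value`
    pair per job parameter, then the project and job names."""
    cmd = ["dsjob", "-run"]
    for name, value in (params or {}).items():
        cmd += ["-param", "%s=%s" % (name, value)]
    cmd += [project, job]
    return cmd
```

`dsjob_run_cmd("dx444", "GenDataJob", {"numrows": 10})` reproduces the example command above as an argument list, which avoids shell-quoting issues when parameter values contain spaces.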
Checkpoint
1. Which stage can be used to display output data in the job log?
2. Which stage is used for documenting your job on the job Canvas?
3. What command is used to run jobs from the operating system command line?
Checkpoint Solutions
1. Peek stage
2. Annotation stage
3. dsjob -run
IBM WebSphere DataStage 8.1
____________________________Module 6: Accessing Sequential Data
Module Objectives
After completing this module, you should be able to:
• Understand the stages for accessing different kinds of sequential data
• Sequential File stage
• Complex Flat File stage
• Create jobs that read from and write to sequential files
• Create reject links
• Work with NULLs in the Sequential File stage
• Read from multiple files using file patterns
• Use multiple readers
Types of Sequential Data Stages
• Sequential File
- Fixed or variable length
• Data Set
• Complex Flat File
How Sequential Data Is Handled
• Import and export operators are generated
– Stages get translated into operators during the compile
• Import operators convert data from the external format, as described by the table definition, to the framework's internal format
– Internally, the format of data is described by schemas
• Export operators reverse the process
• Messages in the job log use the “import” / “export” terminology
– E.g., “100 records imported successfully; 2 rejected”
– E.g., “100 records exported successfully; 0 rejected”
– Records get rejected when they cannot be converted correctly during the import or export
Using the Sequential File Stage
• Both import and export of general files (text, binary) are performed by the sequential file stage
Features of Sequential File Stage
• Normally executes in sequential mode
• Executes in parallel when reading multiple files
• Can use multiple readers within a node
- Reads chunks of a single file in parallel
• The stage needs to be told:
- How the file is divided into rows (record format)
- How a row is divided into columns (column format)
Sequential File Stage Rules
• One input link
• One stream output link
• Optionally, one reject link
– Will reject any records not matching metadata in the column definitions
– Example: You specify three columns separated by commas, but the row that’s read had no commas in it
Reject Link
• Reject mode =
– Continue: continue reading records
– Fail: abort job
– Output: send down the reject link
• In a source stage
– All records not matching the metadata (column definitions) are rejected
• In a target stage
– All records that fail to be written for any reason are rejected
• Rejected records consist of one column, datatype = raw
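The source-stage reject behavior can be sketched in Python: rows that do not split into the expected number of columns go down the reject link, are dropped, or abort the job, depending on the reject mode. A simplified model (real rejects are single raw-column records, and type-conversion failures also cause rejects):

```python
def read_with_rejects(lines, ncols, delimiter=",", reject_mode="output"):
    """Split each row per the column definitions. Rows that do not match
    the metadata are sent to the reject output ('output'), dropped
    ('continue'), or abort the read ('fail')."""
    good, rejects = [], []
    for line in lines:
        fields = line.split(delimiter)
        if len(fields) == ncols:
            good.append(fields)
        elif reject_mode == "fail":
            raise ValueError("row does not match column definitions: " + line)
        elif reject_mode == "output":
            rejects.append(line)   # kept as one raw value, like a reject link
        # 'continue': the record is dropped and reading continues
    return good, rejects
```

This mirrors the slide's example: if you specify three comma-separated columns but a row has no commas, that row cannot match the metadata and is rejected.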
Working with NULLs
• Internally, NULL is represented by a special value outside the range of any existing, legitimate values
• If NULL is written to a non-nullable column, the job will abort
• Columns can be specified as nullable
- NULLs can be written to nullable columns
• You must “handle” NULLs written to non-nullable columns in a Sequential File stage
- You need to tell DataStage what value to write to the file
- Unhandled rows are rejected
• In a Sequential File source stage, you can specify values you want DataStage to convert to NULLs
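A Python sketch of the two conversions described above, using the string "NULL" as the handled value (the actual value is whatever you configure in the stage; the function names are ours):

```python
def import_column(value, null_value="NULL"):
    """On read: convert the designated string to an internal NULL (None)."""
    return None if value == null_value else value

def export_column(value, nullable, null_value="NULL"):
    """On write: a NULL in a nullable column becomes the handled string;
    a NULL in a non-nullable column cannot be written (here: raises)."""
    if value is None:
        if not nullable:
            raise ValueError("NULL written to non-nullable column")
        return null_value
    return value
```

The raise on the non-nullable path corresponds to the unhandled-row case: without a handled value, the row is rejected (or the job aborts).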
Data Set
• Binary data file
• Preserves partitioning
– Component dataset files are written on each partition
• Suffixed by .ds
• Referred to by a header file
• Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)
• Represents persistent data
• Key to good performance in sets of linked jobs
• No import/export conversions are needed
• No repartitioning needed
• Accessed using the Data Set stage
• Implemented with two types of components:
– Descriptor file: contains metadata and data location, but NOT the data itself
– Data file(s): contain the data
• Multiple files, one per partition (node)
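The descriptor-plus-data-files layout can be mimicked in Python. This is only an analogy: real .ds descriptors use a proprietary format, and the JSON descriptor and round-robin split here are our own invention for illustration:

```python
import json
import os
import tempfile

def write_dataset(rows, nparts, directory):
    """Write one data file per partition plus a descriptor file that
    records where the partition data files live (not the data itself)."""
    part_files = []
    for p in range(nparts):
        path = os.path.join(directory, "part%d.data" % p)
        with open(path, "w") as f:
            for row in rows[p::nparts]:       # round-robin rows across partitions
                f.write(row + "\n")
        part_files.append(path)
    descriptor = os.path.join(directory, "example.ds")
    with open(descriptor, "w") as f:
        json.dump({"partitions": part_files}, f)
    return descriptor
```

Because each partition's rows are already in their own file, a downstream job can open the descriptor and process partitions independently with no import conversion and no repartitioning, which is the performance point the slide makes.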
File Set
• Use to read and write to filesets
• Files suffixed by .fs
• Files are similar to a dataset
– Partitioned
– Implemented with header file and data files
• How filesets differ from datasets
– Data files are text files
• Hence readable by external applications
– Datasets have a proprietary data format which may change in future DataStage versions
Checkpoint
1. List three types of file data.
2. What makes datasets perform better than other types of files in parallel jobs?
3. What is the difference between a data set and a file set?
Checkpoint solutions
1. Sequential, dataset, complex flat files
2. They are partitioned and they store data in the native parallel format
3. Both are partitioned. Data sets store data in a binary format not readable by user applications. File sets are readable.
IBM WebSphere DataStage 8.1
____________________________Module 7: Platform Architecture
Module Objectives
After completing this module, you should be able to:
• Describe parallel processing architecture
• Describe pipeline parallelism
• Describe partition parallelism
• List and describe partitioning and collecting algorithms
• Describe configuration files
• Describe the parallel job compilation process
• Explain OSH
• Explain the Score
Pipeline Parallelism
• Transform, clean, and load processes execute simultaneously
– Like a conveyor belt moving rows from process to process
– Start a downstream process while the upstream process is running
• Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
• Still has limits on scalability
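The conveyor-belt idea can be sketched with Python generators (an illustrative analogy only; DataStage implements pipelining with operating-system processes, and the stage names below are invented):

```python
# Each "stage" pulls rows from the stage upstream, so extract,
# transform, and load are all in flight at once -- no staging area
# is written between them.
def extract():
    for i in range(5):                      # stand-in for reading a source
        yield {"id": i, "amount": i * 10}

def transform(rows):
    for row in rows:                        # starts before extract() finishes
        row["amount"] *= 2
        yield row

def load(rows):
    return [row["amount"] for row in rows]  # stand-in for writing a target

print(load(transform(extract())))           # [0, 20, 40, 60, 80]
```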
Partition Parallelism
• Divide the incoming stream of data into subsets to be separately processed by an operation
– Subsets are called partitions (nodes)
• Each partition of data is processed by the same operation
– E.g., if the operation is Filter, each partition will be filtered in exactly the same way
• Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed
Three-Node Partitioning
• Here the data is partitioned into three partitions
• The operation is performed on each partition of data separately and in parallel
• If the data is evenly distributed, the data will be processed three times faster
EE Combines Partitioning and Pipelining
Within EE, pipelining, partitioning, and repartitioning are automatic. The job developer only identifies:
• Sequential vs. Parallel operations (by stage)
• Method of data partitioning
• Configuration file (which identifies resources)
• Advanced stage options (buffer tuning, operator combining, etc.)
Configuration File
• The configuration file separates configuration (hardware / software) from job design
– Specified per job at runtime by $APT_CONFIG_FILE
– Change hardware and resources without changing job design
• Defines the number of nodes (logical processing units) with their resources (need not match physical CPUs)
– Dataset, Scratch, Buffer disk (file systems)
– Optional resources (Database, SAS, etc.)
– Advanced resource optimizations
– “Pools” (named subsets of nodes)
• Multiple configuration files can be used at runtime
– Optimizes overall throughput and matches job characteristics to overall hardware resources
– Allows runtime constraints on resource usage on a per-job basis
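As an illustration, a two-node configuration file follows the pattern below (the hostname and directory paths are placeholders; the exact resources depend on your installation):

```
{
  node "node1" {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets/node1" {pools ""}
    resource scratchdisk "/data/scratch/node1" {pools ""}
  }
  node "node2" {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets/node2" {pools ""}
    resource scratchdisk "/data/scratch/node2" {pools ""}
  }
}
```

Pointing $APT_CONFIG_FILE at a file with more or fewer node entries changes the degree of parallelism without touching the job design.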
Partitioning and Collecting
• Partitioning breaks incoming rows into sets (partitions) of rows
• Each partition of rows is processed separately by the stage/operator
– If the hardware and configuration file support parallel processing, partitions of rows will be processed in parallel
• Collecting returns partitioned data back to a single stream
• Partitioning / Collecting occurs on stage Input links
• Partitioning / Collecting is implemented automatically
– Based on the stage and stage properties
– How the data is partitioned / collected can be specified
Partitioning / Collecting Algorithms
• Partitioning algorithms include:
– Round robin
– Hash: Determine partition based on key value
• Requires key specification
– Entire: Send all rows down all partitions
– Same: Preserve the same partitioning
– Auto: Let DataStage choose the algorithm
• Collecting algorithms include:
– Round robin
– Sort Merge
• Read in by key
• Presumes data is sorted by the key in each partition
• Builds a single sorted stream based on the key
– Ordered
• Read all records from the first partition, then the second, …
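Two of these algorithms can be sketched in Python (illustrative only, not DataStage internals): round robin deals rows across partitions in turn, and sort-merge collecting rebuilds one key-ordered stream from partitions that are each already sorted on the key.

```python
from heapq import merge

def round_robin(rows, n_partitions):
    """Deal rows across partitions like cards."""
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts

def sort_merge_collect(partitions):
    """Presumes each partition is already sorted; builds one sorted stream."""
    return list(merge(*partitions))

parts = round_robin([1, 2, 3, 4, 5, 6, 7], 3)
print(parts)                                          # [[1, 4, 7], [2, 5], [3, 6]]
print(sort_merge_collect([sorted(p) for p in parts])) # [1, 2, 3, 4, 5, 6, 7]
```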
Keyless vs. Keyed Partitioning Algorithms
• Keyless: Rows are distributed independently of data values
– Round Robin
– Entire
– Same
• Keyed: Rows are distributed based on values in the specified key
– Hash: Partition based on key
• Example: Key is State. All “CA” rows go into the same partition; all “MA” rows go in the same partition. Two rows of the same state never go into different partitions.
– Modulus: Partition based on the modulus of the key divided by the number of partitions. Key is a numeric type.
• Example: Key is OrderNumber (numeric type). Rows with the same order number will all go into the same partition.
– DB2: Matches DB2 EEE partitioning
Round Robin and Random Partitioning
• Keyless partitioning methods
• Rows are evenly distributed across partitions
– Good for initial import of data if no other partitioning is needed
– Useful for redistributing data
• Fairly low overhead
• Round Robin assigns rows to partitions like dealing cards
– Row/Partition assignment will be the same for a given $APT_CONFIG_FILE
• Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs
ENTIRE Partitioning
• Each partition gets a complete copy of the data
• Useful for distributing lookup and reference data
• May have a performance impact in an MPP / clustered environment
• On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating the ENTIRE reference data
• On MPP platforms, each server uses shared memory for a single local copy
• ENTIRE is the default partitioning for Lookup reference links with “Auto” partitioning
• On SMP platforms, it is good practice to set this explicitly on the Normal Lookup reference link(s)
HASH Partitioning
• Keyed partitioning method
• Rows are distributed according to the values in the key columns
• Guarantees that rows with the same key values go into the same partition
• Needed to prevent matching rows from “hiding” in other partitions
– E.g., Join, Merge, Remove Duplicates
• Partition distribution is relatively equal if the data across the source key columns is evenly distributed
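The guarantee can be illustrated with a few lines of Python (a sketch, not the actual DataStage hash function): every row with the same key value lands in the same partition.

```python
def hash_partition(rows, key, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        # same key value -> same hash -> same partition, every time
        parts[hash(row[key]) % n_partitions].append(row)
    return parts

rows = [{"state": s} for s in ("CA", "MA", "CA", "NY", "MA")]
parts = hash_partition(rows, "state", 2)
# All "CA" rows share one partition; likewise "MA" and "NY"
for p in parts:
    print(p)
```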
Modulus Partitioning
• Keyed partitioning method
– Rows are distributed according to the values in one integer key column
• Uses modulus
– partition = MOD(key_value, # partitions)
• Faster than HASH
• Guarantees that rows with identical key values go in the same partition
• Partition size is relatively equal if the data within the key column is evenly distributed
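The rule reduces to one line of arithmetic, sketched here in Python with invented order numbers:

```python
def modulus_partition(key_value, n_partitions):
    # partition number = key value modulo the number of partitions
    return key_value % n_partitions

assignments = {n: modulus_partition(n, 3) for n in (100, 101, 102, 103, 104)}
print(assignments)   # {100: 1, 101: 2, 102: 0, 103: 1, 104: 2}
```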
Auto Partitioning
• DataStage inserts partition components as necessary to ensure correct results
– Before any stage with “Auto” partitioning
– Generally chooses ROUND ROBIN or SAME
– Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)
– Inserts ENTIRE on Normal (not Sparse) Lookup reference links
• Not always appropriate for MPP/clusters
• Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed
– DataStage has no visibility into Transformer logic
– Hash is required before Sort and Aggregator stages
– DataStage sometimes inserts un-needed partitioning; check the log
Partitioning Requirements for Related Records
• Misplaced records
– Using the Aggregator stage to sum customer sales by customer number
– If there are 25 customers, 25 records should be output
– But suppose records with the same customer numbers are spread across partitions
– This would produce more than 25 groups (records)
– Solution: Use the hash partitioning algorithm
• Partition imbalances
– If all the records are going down only one of the nodes, then the job is in effect running sequentially
IBM WebSphere DataStage 8.1
____________________________Module 8: Combining Data
Module Objectives
After completing this module, you should be able to:
• Combine data using the Lookup stage
• Define range lookups
• Combine data using Merge stage
• Combine data using the Join stage
• Combine data using the Funnel stage
Combining Data
Ways to combine data:
• Horizontally:
– Multiple input links
– One output link made of columns from different input links
– Stages: Join, Lookup, Merge
• Vertically:
– One input link, one output link combining groups of related records into a single record
– Stages: Aggregator, Remove Duplicates
• Funneling: Multiple input streams funneled into a single output stream
– Funnel stage
Lookup, Merge, Join Stages
• These stages combine two or more input links
– Data is combined by designated “key” column(s)
• These stages differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Not all Links are Created Equal
• DataStage distinguishes between:
– The primary input (Framework Port 0)
– Secondary inputs, in some cases called “reference” inputs (other Framework Ports)
• Conventions vary by stage
• Tip: Check the “Link Ordering” tab to make sure the intended primary is listed first
Lookup Features
• One stream input link (source)
• Multiple reference links (lookup files)
• One output link
• Optional reject link
– Only one per Lookup stage, regardless of the number of reference links
• Lookup failure options
– Continue, Drop, Fail, Reject
• Can return multiple matching rows
• Hash tables are built in memory from the lookup files
– Indexed by key
– Should be small enough to fit into physical memory
Lookup Types
• Equality match
– Match exactly values in the lookup key column of the reference link to selected values in the source row
– Return the row or rows (if multiple matches are to be returned) that match
• Caseless match
– Like an equality match except that it’s caseless
– E.g., “abc” matches “AbC”
• Range on the reference link
– Two columns on the reference link define the range
– A match occurs when a selected value in the source row is within the range
• Range on the source link
– Two columns on the source link define the range
– A match occurs when a selected value in the reference link is within the range
The Lookup Stage
• Uses one or more key columns as an index into a table
– The table usually contains other values associated with each key
• The lookup table is created in memory before any lookup source rows are processed
Lookup Failure Actions
• If the lookup fails to find a matching key column, one of these actions can be taken:
– Fail: The Lookup stage reports an error and the job fails immediately. This is the default.
– Drop: The input row with the failed lookup(s) is dropped.
– Continue: The input row is transferred to the output, together with the successful table entries. The failed table entries are not transferred, resulting in either default output values or null output values.
– Reject: The input row with the failed lookup(s) is transferred to a second output link, the “reject” link.
• There is no option to capture unused table entries
– Compare with the Join and Merge stages
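The four actions can be sketched with an in-memory dict standing in for the lookup table (an illustration only; the column names and the do_lookup helper are invented, not DataStage internals):

```python
def do_lookup(source_rows, table, key, on_failure="continue"):
    output, rejects = [], []
    for row in source_rows:
        match = table.get(row[key])
        if match is not None:
            output.append({**row, **match})       # successful lookup
        elif on_failure == "fail":
            raise RuntimeError(f"lookup failed for {row[key]!r}")
        elif on_failure == "drop":
            continue                              # row silently discarded
        elif on_failure == "reject":
            rejects.append(row)                   # sent down the reject link
        else:                                     # "continue": null output values
            output.append({**row, "desc": None})
    return output, rejects

table = {1: {"desc": "widget"}}
out, rej = do_lookup([{"id": 1}, {"id": 2}], table, "id", on_failure="reject")
print(out)   # [{'id': 1, 'desc': 'widget'}]
print(rej)   # [{'id': 2}]
```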
Lookup Stage Behavior
• We shall first use the simplest case, optimal input:
– Two input links: “Source” as primary, “Lookup” as secondary
– Sorted on the key column (here “Citizen”)
– Without duplicates on the key
The Lookup Stage
• Lookup Tables should be small enough to fit into physical memory
• On an MPP, you should partition the lookup tables using the Entire partitioning method, or partition them by the same hash key as the source link
– Entire results in multiple copies (one for each partition)
• On an SMP, choose Entire or accept the default (which is Entire)
– Entire does not result in multiple copies, because memory is shared
The Join Stage
• Four types:
– Inner
– Left outer
– Right outer
– Full outer
• 2 or more sorted input links, 1 output link
– "left" on primary input, "right" on secondary input
– Pre-sort make joins "lightweight": few rows need to be in RAM
• Follow the RDBMS-style relational model
– Cross-products in case of duplicates
– Matching entries are reusable for multiple matches
– Non-matching entries can be captured (Left, Right, Full)
• No fail/reject option for missed matches
Join Stage Editor
• Link order
– Immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins
• Multiple key columns allowed
• One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Join Stage Behavior
• We shall first use the simplest case, optimal input:
– Two input links: “left” as primary, “right” as secondary
– Sorted on the key column (here “Citizen”)
– Without duplicates on the key
Inner Join
• Transfers rows from both data sets whose key columns contain equal values to the output link
• Treats both inputs symmetrically
Output of inner join on key Citizen
Left Outer Join
• Transfers all values from the left link and transfers values from the right link only where key columns match.
Left Outer Join
• Check the Link Ordering tab to make sure the intended primary is listed first
Right Outer Join
• Transfers all values from the right link and transfers values from the left link only where key columns match
Full Outer Join
• Transfers rows from both data sets, whose key columns contain equal values, to the output link
• Also transfers rows whose key columns contain unequal values, from both input links, to the output link
• Treats both inputs symmetrically
• Creates new columns, with new column names!
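The four variants can be compared side by side with tiny in-memory tables (keys and values invented for the sketch; Python dicts stand in for the input links):

```python
left  = {1: "Anna", 2: "Ben"}        # key -> value from the left link
right = {2: "Berlin", 3: "Cairo"}    # key -> value from the right link

inner       = {k: (left[k], right[k]) for k in left.keys() & right.keys()}
left_outer  = {k: (left[k], right.get(k)) for k in left}
right_outer = {k: (left.get(k), right[k]) for k in right}
full_outer  = {k: (left.get(k), right.get(k))
               for k in left.keys() | right.keys()}

print(inner)        # {2: ('Ben', 'Berlin')}
print(full_outer)   # keys 1, 2, 3; unmatched sides filled with None
```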
Merge Stage
• Similar to Join stage
• Input links must be sorted
– Master link and one or more secondary links
– Master must be duplicate-free
• Light-weight
– Little memory required, because of the sort requirement
• Unmatched master rows can be kept or dropped
• Unmatched secondary links can be captured in a reject link
The Merge Stage
• Allows composite keys
• Multiple update links
• Matched update rows are consumed
• Unmatched updates in input port n can be captured in output port n
• Lightweight
What is a Funnel Stage?
• A processing stage that combines data from multiple input links to a single output link
• Useful to combine data from several identical data sources into a single large dataset
• Operates in three modes
– Continuous
– Sort Funnel
– Sequence
Three Funnel modes
• Continuous:
– Combines the records of the input links in no guaranteed order
– It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.
– Does not attempt to impose any order on the data it is processing
• Sort Funnel:
– Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sorting keys
• Sequence:
– Copies all records from the first input link to the output link, then all the records from the second input link, and so on
Sort Funnel Method
• Produces a sorted output (assuming input links are all sorted on the key)
• Data from all input links must be sorted on the same key column
• Typically data from all input links are hash partitioned before they are sorted
– Selecting the “Auto” partition type under the Input → Partitioning tab defaults to this
– Hash partitioning guarantees that all the records with the same key column values are located in the same partition and are processed on the same node
• Allows for multiple key columns
– 1 primary key column, n secondary key columns
– The Funnel stage first examines the primary key in each input record
– For multiple records with the same primary key value, it will then examine secondary keys to determine the order of records it will output
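Sort Funnel behaviour resembles a k-way merge; a sketch with Python's heapq.merge (the column name is illustrative, and this is an analogy rather than DataStage code):

```python
from heapq import merge

# Each input link is already sorted on the key...
link1 = [{"id": 1}, {"id": 4}, {"id": 7}]
link2 = [{"id": 2}, {"id": 3}, {"id": 9}]

# ...so the funnel can interleave them into one key-ordered output
funneled = list(merge(link1, link2, key=lambda r: r["id"]))
print([r["id"] for r in funneled])   # [1, 2, 3, 4, 7, 9]
```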
Check Point
1. Name three stages that horizontally join data.
2. Which stage uses the least amount of memory? Join or Lookup?
3. Which stage requires that the input data is sorted? Join or Lookup?
Check Point Solution
1. Lookup, Merge, Join
2. Lookup
3. Join
IBM WebSphere DataStage 8.1
____________________________Module 9: Sorting and Aggregating Data
Module Objectives
After completing this module, you should be able to:
• Sort data using in-stage sorts and Sort stage
• Combine data using Aggregator stage
• Remove duplicates using the Remove Duplicates stage
Sorting Data
Uses:
• Some stages require sorted input
– Join and Merge stages require sorted input
• Some stages use less memory with sorted input
– E.g., Aggregator
Sorts can be done:
• Within stages
– On the input link Partitioning tab, set partitioning to anything other than Auto
• In a separate Sort stage
– Makes the sort more visible on the diagram
– Has more options
Sort Keys
• Add one or more keys
• Specify a sort mode for each key
– Sort: Sort by this key
– Don’t sort (previously sorted): Assume the data has already been sorted by this key; continue sorting by any secondary keys
• Specify sort order: ascending / descending
• Specify case sensitive or not
Sort Options
• Sort Utility
– DataStage – the default
– Unix: Don’t use; slower than the DataStage sort utility
• Stable
• Allow duplicates
• Memory usage
– Sorting takes advantage of the available memory for increased performance
– Uses disk if necessary
– Increasing the amount of memory can improve performance
• Create key change column
– Adds a column with a value of 1 / 0
– 1 indicates that the key value has changed
– 0 means that the key value hasn’t changed
– Useful for processing groups of rows in a Transformer
Partitioning vs. Sorting Keys
Partitioning keys are often different than sorting keys:
• Keyed partitioning (e.g., Hash) is used to group related records into the same partition
• Sort keys are used to establish order within each partition
For example:
• Partition on HouseHoldID; sort on HouseHoldID, Entry Date
– Partitioning on HouseHoldID ensures that the same ID will not be spread across multiple partitions
– Sorting orders the records with the same ID by entry date
• Useful for deciding which of a group of duplicate records with the same ID should be retained
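The HouseHoldID example can be sketched in Python (column names invented): partition on the ID alone, then sort each partition on ID plus entry date, so the latest record per ID is easy to keep.

```python
rows = [
    {"hh_id": 7, "entry_date": "2009-01-01"},
    {"hh_id": 7, "entry_date": "2009-06-30"},
    {"hh_id": 8, "entry_date": "2009-03-15"},
]

n = 2
parts = [[] for _ in range(n)]
for row in rows:
    parts[hash(row["hh_id"]) % n].append(row)   # partition key: hh_id only

latest = {}
for p in parts:
    p.sort(key=lambda r: (r["hh_id"], r["entry_date"]))  # sort keys: id, date
    for r in p:
        latest[r["hh_id"]] = r     # last row in sorted order wins
print(latest[7]["entry_date"])     # 2009-06-30
```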
Aggregator Stage
Purpose: Perform data aggregations
Specify:
• Zero or more key columns that define the aggregation units (or groups)
• Columns to be aggregated
• Aggregation functions, which include among many others:
– Count (nulls / no nulls)
– Sum
– Max / Min / Range
• The grouping method (hash table or presort) is a performance issue
Aggregator Types
• Count rows– Count rows in each group
– Put result in a specified output column
• Calculation– Select column
– Put result of calculation in a specified output column
• Calculations include:– Sum
– Count
– Min, max
– Mean
– Missing value count
– Non-missing value count
– Percent coefficient of variation
Grouping Methods
Hash (default)
• Calculations are made for all groups and stored in memory– Hash table structure (hence the name)
• Results are written out after all input has been processed
• Input does not need to be sorted
• Useful when the number of unique groups is small– Running tally for each group’s aggregations needs to fit into memory
Sort
• Requires the input data to be sorted by grouping keys– Does not perform the sort! Expects the sort
• Only a single aggregation group is kept in memory– When a new group is seen, the current group is written out
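The trade-off between the two methods can be sketched in Python (illustrative data; a dict plays the hash table, and itertools.groupby plays the sorted scan):

```python
from itertools import groupby

rows = [("CA", 10), ("MA", 5), ("CA", 7), ("MA", 1)]

# Hash method: one running tally per group, input order irrelevant
totals = {}
for state, amount in rows:
    totals[state] = totals.get(state, 0) + amount
print(totals)                       # {'CA': 17, 'MA': 6}

# Sort method: expects presorted input, holds one group at a time
sorted_rows = sorted(rows)          # the stage expects this sort upstream
sort_totals = {state: sum(a for _, a in grp)
               for state, grp in groupby(sorted_rows, key=lambda r: r[0])}
print(sort_totals)                  # {'CA': 17, 'MA': 6}
```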
Removing Duplicates
• Can be done by Sort stage
Use unique option
– No choice on which to keep
– Stable sort always retains the first row in the group
– Non-stable sort is indeterminate
OR
• Remove Duplicates stage– Has more sophisticated ways to remove duplicates
Can choose to retain first or last
Check Point
1. What stage is used to perform calculations of column values grouped in specified ways?
2. In what two ways can sorts be performed?
3. What is a stable sort?
Check Point Solution
1. Aggregator Stage
2. Using the Sort stage, and using in-stage sorts
3. Stable sort preserves the order of non-key values
IBM WebSphere DataStage 8.1
____________________________Module 10 : Transforming Data
Module Objectives
After completing this module, you should be able to:
• Use the Transformer stage in parallel jobs
• Define constraints
• Define derivations
• Use stage variables
• Create a parameter set and use its parameters in constraints and derivations
Transformer Stage
• Column mappings
• Derivations
– Written in Basic
– Final compiled code is C++ generated object code
• Constraints
– Filter data
– Direct data down different output links
• For different processing or storage
• Expressions for constraints and derivations can reference:
– Input columns
– Job parameters
– Functions
– System variables and constants
– Stage variables
– External routines
IF THEN ELSE Derivation
• Use IF THEN ELSE to conditionally derive a value
• Format:
– IF <condition> THEN <expression1> ELSE <expression2>
– If the condition evaluates to true, the result of expression1 will be copied to the target column or stage variable
– If the condition evaluates to false, the result of expression2 will be copied to the target column or stage variable
• Example:
– Suppose the source column is named In.OrderID and the target column is named Out.OrderID
– Replace In.OrderID values of 3000 by 4000
– IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
String Functions and Operators
• Substring operator
– Format: <string>[loc, length]
– Example:
• Suppose In.Description contains the string “Orange Juice”
• In.Description[8,5] → “Juice”
• UpCase(<string>) / DownCase(<string>)
– Example: UpCase(In.Description) → “ORANGE JUICE”
• Len(<string>)
– Example: Len(In.Description) → 12
Checking for NULLs
• Nulls can be introduced into the data flow from lookups– Mismatches (lookup failures) can produce nulls
• Can be handled in constraints, derivations, stage variables, or a combination of these
• NULL functions– Testing for NULL
• IsNull (<column>)
• IsNotNull (<column>)
– Replace NULL with a value:
• NullToValue(<column>, <value>)
– Set to NULL: SetNull()
• Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
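The semantics of these functions can be mirrored in Python with None standing in for NULL (the names follow the slide; these are sketches, not the real DataStage implementations):

```python
def is_null(v):           return v is None
def is_not_null(v):       return v is not None
def null_to_value(v, d):  return d if v is None else v
def set_null():           return None

col = None
print(is_null(col))            # True
print(null_to_value(col, 0))   # 0
col = 5
print(set_null() if col == 5 else col)   # None -- IF col = 5 THEN SetNull() ELSE col
```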
Transformer Functions
• Date & Time
• Logical
• Null Handling
• Number
• String
• Type Conversion
Transformer Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
Transformer Stage Variables
• Derivations execute in order from top to bottom
– Later stage variables can reference earlier stage variables
– Earlier stage variables can reference later stage variables
• Such references return the value derived from the previous row that came into the Transformer
• Multi-purpose
– Counters
– Store values from previous rows to make comparisons
– Store derived values to be used in multiple target field derivations
– Can be used to control execution of constraints
Parameter Sets
• Store a collection of parameters in a named object
• One or more value files can be named and specified
– A value file stores values for specified parameters
– Values are picked up at run time
• Parameter sets can be added to the job parameters specified on the parameter tab of the job properties
Check point
• What occurs first? Derivations or Constraints?
• Can stage variables be referenced in constraints?
• Where should you test for nulls with in a transformer? Stage variable derivations or Output column derivations?
Check point solution
• Constraints
• Yes
• Stage variable derivations. Reference the stage variable in Output column derivations
IBM WebSphere DataStage 8.1
____________________________Module 11 : Repository Functions
Module Objectives
After completing this module, you should be able to:
• Perform a simple Find
• Perform an Advanced Find
• Perform Impact Analysis
• Compare the difference between two Table Definitions
• Compare the differences between two jobs
Advanced Find Filtering Options
• Type: Type of object
– Job, Table Definition, etc.
• Creation: Range of dates
– E.g.: Up to a week ago
• Last modification: Range of dates
– E.g.: Up to a week ago
• Where used: Objects that use specified objects
– E.g.: A job that uses a specified table definition
• Dependencies of: Objects that are dependencies of other objects
– E.g.: A table definition that is referenced in a specified job
• Options
– Case sensitivity
– Search within the last result set
Performing Impact Analysis
• Find where table definitions are used– Right click over a stage or table definition– Select ‘Find where Table definition used’ or– Select ‘Find where Table definition used(deep)’
• Deep includes additional object types– Displays a list of objects using table definition
• Find Object Dependencies– Select ‘Find Dependencies’ or– Select ‘Find Dependencies(deep)’– Display list of objects dependent on the one selected
• Graphical Functionality– Display the dependency path– Collapse selected object– Move the graphical object– “Birds-eye” view
Finding the Difference between Two Jobs
• Example: Job1 is saved as Job2. Changes are made to job 2. What changes have been made?
– Here job1 may be a production job. Job2 is a copy of production job after enhancements or other changes have been made to it
Comparing Table Definitions
• Same procedure as when comparing jobs
Check Point
• You can compare the differences between what two kinds of objects?
• What “Wild card” characters can be used in a Find?
• You have a job whose name begins with “abc”. You can’t remember the rest of the name or where the job is located. What would be the fastest way to export the job to a file?
• Name three filters you can use in an Advanced Find?
Check Point Solution
• Jobs. Table Definitions.
• Asterisk(*). It stands for any zero or more characters.
• Do a Find for objects matching “abc*”. Filter by type Job. Locate the job in the result set, right-click it, and then click Export.
• Type of object, creation date range, last modified date range, where used, dependencies of, and other options including case sensitivity and searching within the last result set.
IBM WebSphere DataStage 8.1
____________________________Module 12: Working with Relational Data
Module Objectives
After completing this module, you should be able to:
• Import table definitions for relational tables
• Create Data connections
• Use Connector stages in a job
• Use SQL Builder to define SQL SELECT statements
• Use SQL Builder to define SQL INSERT and UPDATE statements
• Use DB2 Enterprise stage
Working with Relational Data
• Importing relational data
– Import using ODBC or orchdbutil
• orchdbutil is preferred, in order to get correct type conversions
• Data Connection objects
– Store database connection information in a named object
• Stages available to access relational data
– Connector stages
• Parallel support
• Most functionality
• Consistent GUI and functionality across all relational types
– Enterprise stages
• Parallel support
– Plug-in stages
• Functionality ported from DataStage Server jobs
• Selecting data
– Build SELECT statements using SQL Builder
• Writing data
– Build INSERT, UPDATE, DELETE statements using SQL Builder
Importing Table Definitions
• Can import using ODBC or Orchestrate schema definitions
– Orchestrate schema imports are better because the datatypes are more accurate
• Import -> Table Definitions -> Orchestrate Schema Definitions
• Import -> Table Definitions -> ODBC Table Definitions
IBM WebSphere DataStage 8.1
____________________________Module 13: Metadata in the Parallel Framework
Module Objectives
After completing this module, you should be able to:
• Explain Schemas
• Create Schemas
• Explain Runtime Column Propagation (RCP)
• Turn RCP ON and OFF
• Build a job that reads data from a sequential file using a Schema
• Build a shared Container
Schema
• Alternative way of specifying column definitions and record formats
  – Similar to a Table Definition
• Written in a plain text file
• Can be imported as a Table Definition
• Can be created from a Table Definition
• Can be used in place of a Table Definition in a Sequential File stage
  – Requires RCP
  – Schema file path can be parameterized
    • Enables a single job to process files with different column definitions
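As an illustration, a schema file for a comma-delimited file might look like the following (hypothetical column names; the exact record properties depend on the file being described):

```
record
  {final_delim=end, delim=',', quote=double}
(
  CustId: int32;
  CustName: string[max=30];
  Balance: nullable decimal[10,2];
)
```

Because the job reads the column definitions from this file at runtime, pointing the parameterized schema file path at a different file lets the same job process a different layout.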
Creating a Schema
• Using a text editor
  – Follow correct syntax for definitions
  – Not recommended
• Import from an existing DataSet or FileSet
  – Import->Table Definitions->Orchestrate Schema Definitions
  – Select the check box for a file with a .fs or .ds extension
• Import from a database table
• Create from a Table Definition
  – Click Parallel on the Layout tab
Runtime Column Propagation (RCP)
• When RCP is turned on:
– Columns of data can flow through a stage without being explicitly defined in the stage
– Target columns in a stage need not have any columns explicitly mapped to them
• No column mapping enforcement at design time
– Input columns are mapped to unmapped columns by name
• How implicit columns get into a job
– Read a file using a schema in a Sequential File stage
– Read a database table using “Select *”
– Explicitly define as an output column in a stage earlier in the flow
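The by-name mapping behavior can be sketched outside DataStage in a few lines (hypothetical rows as dictionaries; this is an illustration of the rule, not the actual engine):

```python
def map_columns(row, defined_columns, rcp_enabled):
    """With RCP off, only explicitly defined columns reach the output;
    with RCP on, undefined input columns also flow through by name."""
    # Explicitly defined columns are mapped by name
    out = {c: row[c] for c in defined_columns if c in row}
    if rcp_enabled:
        # Implicit columns propagate through the stage untouched
        for c, v in row.items():
            out.setdefault(c, v)
    return out
```

For example, a row `{"id": 1, "name": "a", "extra": 9}` mapped against defined columns `["id", "name"]` drops `extra` with RCP off, but carries it through with RCP on.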
Runtime Column Propagation (RCP)
• Benefits of RCP
  – Job flexibility
    • Job can process input with different layouts
  – Ability to create reusable components in shared containers
    • Component logic can apply to a single named column
    • All other columns flow through untouched
Enabling Runtime Column Propagation (RCP)
• Project level
  – DataStage Administrator, Parallel tab
• Job level
  – Job Properties, General tab
• Stage level
  – Link Output Column tab
• Settings at a lower level override settings at a higher level
  – E.g.: disable at project level, but enable for a given job
  – E.g.: enable at job level, but disable for a given stage
Enabling RCP at Stage Level
• Sequential File stage
  – Output Columns tab
• Transformer
  – Open Stage Properties
  – Stage Properties, Output tab
When RCP is Disabled
• DataStage Designer enforces stage input-to-output column mapping
When RCP is Enabled
• DataStage does not enforce mapping rules
• Runtime error if no incoming columns match unmapped target column names
Shared Containers
• Encapsulate job design components into a shared container
• Provide reusable job design components
• Example
  – Apply stored Transformer business logic
Creating a Shared Container
• Select Stages from an existing job
• Click Edit -> Construct Container -> Shared
Check Point
• What are the two benefits of RCP?
• What can you use to encapsulate stages and links in a job to make them reusable?
Check Point Solution
• Job flexibility and the ability to create reusable components
• Shared Containers
IBM WebSphere DataStage 8.1
____________________________Module 14: Job Control
Module Objectives
After completing this module, you should be able to:
• Use the DataStage Job Sequencer to build a job that controls a sequence of jobs
• Use Sequencer links and stages to control the sequence of a set of jobs
• Use Sequencer triggers and stages to control the conditions under which jobs run
• Pass information in job parameters from the master controlling job to the controlled jobs
• Define user variables
• Enable restart
• Handle errors and exceptions
What is a Job Sequence?
• A master controlling job that controls the execution of a set of subordinate jobs
• Passes values to subordinate job parameters
• Controls the order of execution (links)
• Specifies the conditions under which the subordinate jobs get executed (triggers)
• Specifies complex flow of control
  – Loops
  – All/Some
  – Wait for file
• Performs system activities
  – Email
  – Execute system commands and executables
• Can include restart checkpoints
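The control flow a sequence provides can be sketched in ordinary code (the `run_job` function below is a hypothetical stand-in for a Job Activity stage, not the DataStage API):

```python
def run_job(name, **params):
    """Stand-in for a Job Activity stage: run a job with the given
    parameters and return its finishing status. Here it simply
    simulates success for any job (hypothetical)."""
    print(f"running {name} with {params}")
    return "OK"

def sequence(load_date):
    """Master controlling 'job': the order of the calls plays the role
    of links, the if-tests play the role of triggers, and the except
    block plays the role of an Exception Handler stage."""
    try:
        if run_job("ExtractJob", date=load_date) != "OK":
            raise RuntimeError("ExtractJob failed")
        if run_job("TransformJob", date=load_date) != "OK":
            raise RuntimeError("TransformJob failed")
        return run_job("LoadJob", date=load_date)
    except RuntimeError as err:
        # Exception handling: e.g., send a notification email
        print(f"exception handler: {err}")
        return "FAILED"
```

Parameter passing from the master job to the controlled jobs corresponds to the keyword arguments handed to each `run_job` call.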
Basics for Creating a new Job Sequence
• Open a new job sequence
  – Specify whether it is restartable
• Add stages
  – Stages to execute jobs
  – Stages to execute system commands and executables
  – Special-purpose stages
• Add links
• Specify the order in which jobs are to be executed
• Specify triggers
  – Triggers specify the condition under which control passes across a link
• Specify error handling
• Enable/disable restart checkpoints
Job Sequencer Stages
• Run stages
  – Job Activity: run a job
  – Execute Command: run a system command
  – Notification Activity: send an email
• Flow-control stages
  – Sequencer: go if All/Some
  – Wait For File: go when file exists / does not exist
  – Start Loop / End Loop
  – Nested Condition: go if condition satisfied
• Error handling
  – Exception Handler
  – Terminator
• Variables
  – User Variables
Loop Stages
(Screenshot: a Start Loop stage supplies counter values; the counter value is passed to the activity, and a reference link returns control to the start of the loop)
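In plain code terms, a Start Loop/End Loop pair with a numeric counter behaves roughly like the following (hypothetical activity; the real loop runs whatever stages sit between Start Loop and End Loop):

```python
def run_loop(start, stop, step):
    """Mimic Start Loop/End Loop with a numeric counter: each iteration
    passes the current counter value to the activity inside the loop."""
    results = []
    for counter in range(start, stop + 1, step):
        # The looped activity receives the counter value as a parameter
        results.append(f"processed partition {counter}")
    return results
```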
Handling Activities that Fail
(Screenshot: when an activity fails, control passes to the Exception Handler stage)
Exception Handler Stage
(Screenshot: control goes here if any activity fails)
Disable Checkpoint at a Stage
(Screenshot: don't checkpoint this activity)
Check Point
• Which stage is used to run jobs in a job sequence?
• Does the Exception Handler stage support an input link?
Check Point Solution
• Job Activity Stage
• No, control is automatically passed to the stage when an exception occurs
IBM WebSphere DataStage 8.1
____________________________Special Topics 1: Complex Flat File Stage
Module Objectives
After completing this module, you should be able to:
• Import table definitions from a COBOL copybook
• Design a job that extracts data from a COBOL file containing multiple record types
• Specify in a Complex Flat File (CFF) stage the column layouts of each record type
• Specify in a CFF stage how to identify when a record of a specific type is read
• Select in a CFF stage which columns from the different record types are to be output from the stage
Complex Flat File Stage
• Processes data in a COBOL file
  – File is described by a COBOL file description
  – File can contain multiple record types
• COBOL copybooks with multiple record formats can be imported as COBOL File Definitions
  – Each format is stored as a separate DataStage table definition
• Columns can be loaded for each record type
• On the Record ID tab, you specify how to identify each type of record
• Columns from any or all record types can be selected for output
  – This allows columns of data from multiple records of different types to be combined into a single output record
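The record-type dispatch the CFF stage performs can be sketched outside DataStage (toy one-character type codes and layouts, echoing the '1'/'2'/'3' CLIENT/POLICY/COVERAGE example later in this module; field names and widths are hypothetical):

```python
# Toy layouts: record type code -> list of (column name, width)
LAYOUTS = {
    "1": [("client_id", 4), ("client_name", 10)],   # CLIENT record
    "2": [("policy_no", 6), ("client_id", 4)],      # POLICY record
    "3": [("coverage", 8)],                         # COVERAGE record
}

def parse_record(line):
    """The first byte identifies the record type (the 'record ID'
    condition); the matching layout is then used to slice the
    fixed-width fields out of the rest of the line."""
    rtype, rest = line[0], line[1:]
    row, pos = {"type": rtype}, 0
    for name, width in LAYOUTS[rtype]:
        row[name] = rest[pos:pos + width].strip()
        pos += width
    return row
```

Because each type has its own layout, a single file can mix records of different physical lengths, just as the CFF stage allows.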
Sample COBOL Copybook
CLIENT record format
POLICY record format
COVERAGE record format
Importing a COBOL File Definition
(Screenshot: specify the Level 01 column position and select the Level 01 items to import)
Example Data File with Multiple Formats
Record Type = ‘1’ CLIENT record
Record Type = ‘2’ POLICY record
Record Type = ‘3’ COVERAGE record
Records Tab
(Screenshot callouts: load columns for a record type; set a record type as master; the active record type; add another record type)
Record ID Tab
Condition that identifies the type of record
View Data
CLIENT columns, POLICY columns, COVERAGE columns
Processing Multi-Format Records
(Screenshot: a stage variable in the Transformer, with derivations that identify which type of record is coming into the Transformer)
Checkpoint
1. What types of files contain the metadata that is typically loaded into the CFF stage?
2. Does the CFF stage support variable-length records?
3. How does DataStage know which type of record it is reading from a file containing records of different formats?
4. What does selecting a record type as master accomplish?
5. How many record types can be designated as master?
Checkpoint solutions
1. COBOL copybooks or COBOL file definitions.
2. Yes, it can read files containing multiple record formats, each of a different physical length.
3. On the Record ID tab, you define constraints that identify the record type. These must reference fields common to all record formats.
4. When a master record is read, all outputs columns are emptied before the master record contents are written.
5. Only one.
IBM WebSphere DataStage 8.1
____________________________Special Topics 2: Slowly Changing Dimensions
Unit Objectives
After completing this unit, you should be able to:
• Design a job that creates a surrogate key state file
• Design a job that updates a surrogate key state file from a dimension table
• Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions
Surrogate Key Generator Stage
• Use to create and update the surrogate key state file
• Surrogate key state file
– One file per dimension table
– Stores the last used surrogate key integer for the dimension table
– Binary file
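Conceptually, the state file is just a persisted "last used key" high-water mark. A rough sketch (plain text file instead of DataStage's binary format; the function name is hypothetical):

```python
import os

def next_surrogate_keys(state_path, count):
    """Read the last used key from the state file, hand out the next
    `count` surrogate keys, and persist the new high-water mark.
    Sketch only: the real state file is binary and managed by the
    Surrogate Key Generator stage, one file per dimension table."""
    last = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            last = int(f.read().strip() or 0)
    keys = list(range(last + 1, last + 1 + count))
    with open(state_path, "w") as f:
        f.write(str(keys[-1]))   # remember the last key handed out
    return keys
```

Because the high-water mark survives between runs, successive jobs keep generating unique keys for the same dimension table.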
Example Job to Create Surrogate State Files
(Screenshot: one Surrogate Key Generator stage creates the state file for the Store dimension table, another for the Product dimension table)
Editing the Surrogate Key Generator Stage
(Screenshot: the path to the state file; the option to create the state file)
Specifying the Update Information
(Screenshot: the table column containing surrogate key values; the option to update the state file)
Slowly Changing Dimension Stage
• Used for processing a star schema
• Performs a lookup into a star schema dimension table
  – Multiple SCD stages can be daisy-chained to process multiple dimension tables
• Inserts new rows into the dimension table as required
• Updates existing rows in the dimension table as required
  – Type 1 fields of a matching row are overwritten
  – Type 2 fields of a matching row are retained as history rows
    • A new record with the new field value is added to the dimension table and made the current record
• Generally used in conjunction with the Surrogate Key Generator stage
  – Creates a surrogate key state file that retains a list of the previously used surrogate keys
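The Type 1 versus Type 2 behavior can be illustrated with a small sketch (dimension rows as dictionaries, hypothetical field names; not the DataStage implementation):

```python
def apply_change(history, incoming, type1_cols, type2_cols, new_sk, eff_date):
    """`history` is the list of dimension rows for one business key,
    newest last. Type 1 columns are overwritten in place on the current
    row; a change in any Type 2 column expires the current row and
    appends a new current row with a fresh surrogate key."""
    current = history[-1]
    for c in type1_cols:                      # Type 1: simple overwrite
        current[c] = incoming[c]
    if any(current[c] != incoming[c] for c in type2_cols):
        current["current"] = "N"              # retire the old row...
        current["expire_date"] = eff_date
        new_row = dict(incoming)
        new_row.update(sk=new_sk, current="Y",
                       effective_date=eff_date, expire_date=None)
        history.append(new_row)               # ...and add the new current row
    return history
```

The `current`, `effective_date`, and `expire_date` fields are exactly the three extra fields Type 2 handling requires, as the checkpoint at the end of this topic notes.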
Example Slowly Changing Dimension Job
(Screenshot: one SCD stage checks for matching StoreDim rows and performs Type 1 and Type 2 updates to the StoreDim table; a second SCD stage checks for matching Product rows and performs Type 1 and Type 2 updates to the Product table)
Working in the SCD Stage
• Five "Fast Path" pages to edit
• Select the output link
  – This is the link coming out of the SCD stage that is not used to update the dimension table
• Specify the purpose codes
  – Fields to match by
    • Business key fields and the source fields to match to them
  – Surrogate key
  – Type 1 fields
  – Type 2 fields
  – Current indicator for Type 2
  – Effective Date, Expire Date for Type 2
• Surrogate key management
  – Location of the state file
• Dimension update specification
• Output mappings
Specifying the Purpose Codes
Lookup key mapping
Type 1 field
Type 2 field
Surrogate key
Fields used for Type 2 handling
Surrogate Key Management
(Screenshot: the path to the state file, the initial surrogate key value, and the number of values to retrieve at one time)
Dimension Update Specification
(Screenshot: the function used to retrieve the next surrogate key value, the value that means "current", and the function used to calculate the history date range)
Checkpoint
1. How many Slowly Changing Dimension stages are needed to process a star schema with 4 dimension tables?
2. How many surrogate key state files are needed to process a star schema with 4 dimension tables?
3. What's the difference between a Type 1 and a Type 2 dimension field attribute?
4. What additional fields are needed for handling a Type 2 Slowly Changing Dimension field attribute?
Checkpoint solutions
1. Four SCD stages are needed, one for each dimension table. Each SCD stage does a lookup and update to its table.
2. Four surrogate key state files are needed, one for each dimension table. A separate state file is used for each.
3. Type 1 is a simple update. The value in the dimension record field is overwritten with the new value. Type 2 retains the value in a history record. A new record is created with the current value.
4. Three additional fields are needed: the current indicator flags whether a given record contains the current Type 2 value or an earlier value; the Effective Date and Expire Date fields specify when the given record is applicable.
IBM WebSphere DataStage 8.1
____________________________Special Topics 3: Installation Run-Through
Module Objectives
After completing this module, you should be able to:
• Run through the installation process
• Start the Information Server
Information Server User IDs
• Database owner: e.g., xmeta
• DB2 instance owner: e.g., db2admin
• Application server ID: e.g., appserv
• Information Server administrator: e.g., admin
• Be sure all user IDs have passwords that have not expired
Testing the Installation
• You can partially test the installation by logging on to the Information Server Web console
Checkpoint
1. List the five installation layers that can be installed.
Checkpoint solutions
• Client, Engine, Domain, Repository, and Documentation.
IBM WebSphere DataStage 8.1
____________________________Special Topics 4: Solution Development Jobs
Module Objectives
After completing this module, you should be able to:
• List and describe the Warehouse jobs
• Understand the stages and techniques used in the Warehouse jobs
Solution Development Jobs
• Series of 4 jobs extracted from production jobs
• Use a variety of stages in interesting, realistic configurations– Sort, Aggregator stages
– Join, Lookup stages
– Peek, Filter stages
– Modify stage
– Oracle stage
• Contain useful techniques– Use of Peeks
– Datasets used to “connect” jobs
– Use of project environment variables in job parameters
– Fork Joins
– Lookups for auditing