
DataStage Tricks & Tips



DataStage Designer Tips & Tricks

Jim Tsimis Advanced Technical Support

Mike Carney Advanced Consulting Group

Michael Ruland Field Engineering

Steven Totman Product Manager Connectivity


Agenda

• Designer Session
  – General Debug Tips & Tricks
  – Handling Complex Flat Files
  – Joy of the Command Line
  – Transaction Handling Tips & Tricks
    • Managing transactions in Server, Enterprise Edition, Enterprise MVS Edition and RTI
  – Re-usability Tips & Tricks
    • Shared Containers, Templates, Pre-configured stages, Runtime column propagation
  – Performance Tuning


General Debug Tips & Tricks


Server - Generating Test Data

Stage variables are always executed and drive the transformer stage.

Output rows are generated until the constraint is false; at that point the job stops.

Notice that there are no input links.

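The same idea can be sketched outside DataStage on any UNIX system: a loop with no input, whose termination condition plays the role of the transformer constraint. The file name and column layout below are illustrative.

```shell
# Rough analog of an input-less transformer generating test rows:
# the loop condition (rownum <= 5) acts as the constraint; when it
# becomes false, row generation stops.
awk 'BEGIN {
  for (rownum = 1; rownum <= 5; rownum++)
    printf "%d,CUSTOMER_%d\n", rownum, rownum
}' > testdata.csv
cat testdata.csv
```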


Enterprise Edition – Generating Test Data



Building test data from live data

Head stage: selects the first N records from each partition of an input data set and copies the selected records to an output data set.

Tail stage: selects the last N records from each partition of an input data set and copies the selected records to an output data set.

Sample stage: samples an input data set. It operates in two modes: in Percent mode it extracts rows selected by a random number generator and writes a given percentage of them to each output data set; in Period mode it extracts every Nth row from each partition, where N is the period you supply.

External Filter stage: allows you to specify a UNIX command that acts as a filter on the data you are processing. An example would be to use the stage to grep a data set for a certain string or pattern and discard records that do not contain a match.

Sequential File stage: FILTER OPTION - use this to specify that the data is passed through a filter program before being written to a file or files on output or before being placed in a dataset on input.

Filter stage: transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links.
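The behavior of several of these stages can be sketched with ordinary UNIX commands, which is also a quick way to prototype a test-data extract before building the job. File names and the sample size are illustrative.

```shell
# A flat file of 100 "live" records to sample from.
seq 1 100 > live_data.txt

head -n 10 live_data.txt > first10.txt          # Head stage: first N records
tail -n 10 live_data.txt > last10.txt           # Tail stage: last N records
awk 'NR % 10 == 0' live_data.txt > every10.txt  # Sample stage, period mode (N = 10)
grep '7' live_data.txt > matched.txt            # External Filter stage: grep as the filter
```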


Handling Complex Flat Files


Server – Decoding Multi-formatted Files

Input column definitions (3 columns).

The selected complex column is decoded into individual columns.


Enterprise Edition – Decoding Multi-formatted Files

Indicate the columns to import.

Map the columns to their destination.


Enterprise Edition – Taming the import

od -x -A x …

Print field option
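When an import misbehaves, dumping the first bytes of the source file often reveals the problem (unexpected delimiters, packed fields, record-length drift). The slide's `od -x -A x` prints two-byte hex words with hex offsets; the `-t x1` variant below shows single bytes, so its output does not depend on machine byte order. The sample file is illustrative.

```shell
# Create a tiny sample file and inspect its raw bytes with hex offsets.
printf 'hello' > sample.bin
od -A x -t x1 sample.bin
# The five bytes appear as: 68 65 6c 6c 6f
```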


Working with Schemas

Converting Copybooks to Schemas


Enterprise Edition – Working with Complex Files

Make Vector stage: combines specified columns of an input data record into a vector of columns of the same type.

Make Subrecord stage: combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors.

Split Vector stage: promotes the elements of a fixed-length vector to a set of similarly named top-level columns.

Split Subrecord stage: separates an input subrecord field into a set of top-level vector columns.

Combine Records stage: combines records in which particular key-column values are identical into vectors of subrecords.

Promote Subrecord stage: promotes the columns of an input subrecord to top-level columns. It can also promote the columns in vectors of subrecords, in which case it acts as the inverse of the Combine Records stage.

Column Import stage: imports data from a single column and outputs it to one or more columns.

Column Export stage: exports data from a number of columns of different data types into a single column of data type string or binary.


Enterprise Edition - Complex Structures

A subrecord is a nested data structure. The column with type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. The following diagram gives an example of a subrecord structure.

A vector is a one-dimensional array of any type except tagged. Elements of a vector are of the same type and are numbered from 0. A vector can be of fixed or variable length: for fixed-length vectors the length is explicitly stated; for variable-length ones a property defines a link field which gives the length at run time.

[Diagram: subrecords are built and flattened with the Make/Promote operations; vectors with the Make/Split operations.]


Enterprise Edition – Combining Vectors and Subrecords

These stages can be combined to support very complex data structures, such as vectors nested within subrecords.


Joy of the Command Line


The Joy of the Command Line

• What is dsjob?

• Utility to backup all the jobs in a project

• Utility to take BMPs from the command line

• dsjob exposed as a web service
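dsjob is the command-line interface to the DataStage server engine. A few representative invocations are sketched below; the project and job names are illustrative, and the guard lets the sketch run harmlessly on machines where the engine is not installed.

```shell
# dsjob normally lives in the engine's bin directory (e.g. $DSHOME/bin).
if command -v dsjob >/dev/null 2>&1; then
  dsjob -lprojects                       # list all projects on this server
  dsjob -ljobs myproject                 # list the jobs in one project
  dsjob -run -jobstatus myproject myjob  # run a job and wait for its status
  dsjob -logsum myproject myjob          # summarize the job's log
else
  echo "dsjob not on PATH - showing usage only"
fi
```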


Automatically Backing up Projects

@echo off
rem This batch script is used to backup all the projects on a DataStage server. It
rem must be run from a DataStage client machine and the parameters below should be
rem modified to fit your environment. Use of parameters was avoided to simplify backup and
rem allow the command to be customized to a particular environment.
rem
rem Based on design by Manoli Krinos
rem Modified by M Ruland to allow iteration through a complete server set of projects
rem *****************************************************
rem Replace the following variables prior to running
rem *****************************************************
rem Host is server name
rem User is username to use to attach to DataStage
rem PW is password to use to attach to DataStage
rem BackupDir is the directory to place the backed up project in (don't forget final /)
rem DsxCmd is directory of the export command on client
rem DsxCmd1 is the dsjob command to retrieve the project list
rem TempProjFile is temp file to store project names
rem DSLog is the name of the log file accumulated during the backup
rem *****************************************************
Set Host=yourhosthere
Set User=yourusername
Set PW=yourpassword
Set BackupDir=E:\Data\AutoBackup\UserConference\
SET DsxCmd=E:\Progra~1\Ascential\DataStage7Beta\dscmdexport.exe
SET DsxCmd1=C:\Ascential\DataStage\Engine\bin\dsjob.exe
Set TempProjFile=c:\temp\ProjectList.txt
Set DSLog=DataStageDumpLog
rem ******************************************************

rem -------------------------------------------------------------------------
rem Get the current Date
rem -------------------------------------------------------------------------
FOR /f "tokens=2-4 delims=/ " %%a in ('DATE/T') do SET DsxDate=%%c%%a%%b

rem -------------------------------------------------------------------------
rem Get the current Time
rem -------------------------------------------------------------------------
FOR /f "tokens=1* delims=:" %%a in ('ECHO.^|TIME^|FINDSTR "[0-9]"') do (SET DsxTime=%%b)

rem -------------------------------------------------------------------------
rem Set delimiters so that current time can be broken down into components,
rem then execute FOR loop to parse the DsxTime variable into Hr/Min/Sec/Hun.
rem -------------------------------------------------------------------------
SET delim1=%DsxTime:~3,1%
SET delim2=%DsxTime:~9,1%
FOR /f "tokens=1-4 delims=%delim1%%delim2% " %%a in ('echo %DsxTime%') do (
    set DsxHr=%%a
    set DsxMin=%%b
    set DsxSec=%%c
    set DsxHun=%%d
)
ECHO *** Backing up server %Host% == please be patient
%DsxCmd1% -server %Host% -user %User% -password %PW% -lprojects > %TempProjFile%

echo AutoProjectBackup run on %DsxDate%%DsxHr%%DsxMin%%DsxSec% with the following parameters > %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo Host=%Host%, User=%User%, BackupDir=%BackupDir%, DsxCmd=%DsxCmd%, DsxCmd1=%DsxCmd1% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo TempProjFile=%TempProjFile%, DSLog=%DSLog% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo Following Projects found on %Host% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
type %TempProjFile% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log

rem *************************
rem **  Begin backup loop  **
rem *************************
for /F "tokens=1" %%i in (%TempProjFile%) do (
    ECHO The current Project is %%i >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    ECHO Backing up Project %%i
    rem ---------------------------------------------------------------------
    rem Issue message to screen (stdio) that the export is starting.
    rem ---------------------------------------------------------------------
    ECHO Exporting Project=%%i on Host=%Host% into File=%BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx ... >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    %DsxCmd% /H=%Host% /U=%User% /P=%PW% %%i %BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    IF NOT %ERRORLEVEL%==0 GOTO BADEXPORT
    ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    ECHO *** Completed Export for Project: %%i on Host: %Host% to File: %BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    ECHO ************************************************************************** >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    ECHO ************************************************************************** >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
    ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
)
ECHO *** Export completed successfully for projects:
type %TempProjFile%
GOTO EXITPT

rem -------------------------------------------------------------------------
:BADEXPORT
ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
ECHO *** ERROR: Failed to Export Project: %%i on Host: %Host% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log

:EXITPT
ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
del %TempProjFile%

This script is available from your account team and backs up all projects on an identified server.

Available on Ascential Developers Net when it launches in December.


Automated Diagramming

"E:\Program Files\Ascential\DataStage\dsdesign.exe" /h=YourHost /u=UserID /p=***** YourProject YourJobName /saveasbmp=e:\Diagrams\YourProject\JobDesigns\YourJobName.bmp

A script is available from Ascential that will obtain all the jobs within a selected project and create a BMP diagram for each job in a selected folder. This can be an effective way to create files that MetaStage can later use to present a graphical representation of the DataStage job design in an HTML or XML report.


Job Control using Web Services

Job Control Web Service in C# .NET

Job Control Web Service in HTML using Web Service Behavior

Job Control Web Service in Office XP documents

Job Control Web Service in VBScript


Unit Of Work Tips & Tricks


Server – Unit of Work

Used to specify whether to continue or to roll back if a link is skipped because the constraint on it is not satisfied.

Used to specify whether to continue or to roll back on failure of the SQL statement.

Unit of Work processing is currently available in the Oracle and ODBC stages.


DataStage Enterprise Edition MVS Unit Of Work Support

The DataStage Enterprise Edition MVS (XE/390) Business Rule Stage provides the ease of graphical construction through drag-and-drop facilities as well as the ability to customize the processing rules to meet specific demands.


Enterprise Edition - Unit of Work

[Diagram: a unit of work flows from a Source Queue through transactions T1-T4 into Work Queue(s) and the BankAccount, TaxAccount, and StockAccount targets; failed messages are routed to a Reject Queue, and an end-of-unit (EOU) marker closes each unit of work.]


Enterprise Edition – Unit of Work

• Enterprise Edition Framework enhancement (end of unit)
  – Causes records to flush through the flow
  – Stages complete their work, but stay running for the next unit of work
• MQ-READ unit-of-work solution
  – Utilizes (end of unit)
  – MQ Series transaction manager (XA/Open two-phase commit)
  – Coordinated transactions between MQSeries, Oracle, and other RDBMSs
  – Guarantees no loss of data
  – Automatic checkpoint/restart
  – Highly scalable, near real time
• Only available through Ascential Professional Services


RTI Enabled Jobs

• Job acts like a web service via the Real Time Integration Server (RTI)

• Calls to the service are treated as a unit of work

• Multi-row units of work are supported

Service arguments are complex arrays.


Re-usability Tips & Tricks


Reuse – Job Templates

Creates an XML template that can be used as a “starter job”. Facilities exist to allow consulting to further customize the template such that “token” values can be replaced during job creation.


Copy Paste From Job to Job

You can also now copy and paste directly into a shared container on both the Enterprise Edition and Server canvases.


Pre-configured Stages

• Implemented through shared containers
• Configure the stage with parameters
  – Create an empty job with parameters
• When developing a new job:
  – Start with the empty job with parameters
  – Drag/drop pre-configured stages while holding the CTRL key
  – Minimal configuration required


Enterprise Edition - Reusable Transformers

• Implemented through shared containers
• Runtime Column Propagation
  – Turn it on for the transformer stage
  – Specify minimal input columns: only the columns used in a derivation (not copy-throughs)
  – Specify minimal output columns: new columns or columns that require a derivation expression (complex vs. column name)
  – The Enterprise Edition Framework automatically propagates all other columns


Performance Tuning in Parallel Environments


Performance and Tuning

Many factors affect the performance of an application:

• Any given system can be tuned to favor one application so much that it actually degrades the performance of other applications. This phenomenon is exacerbated as parallel capabilities are introduced into the system.
• RDBMS configuration and performance
• Memory vs. system working-set size
• CPUs vs. system load
• Data input/output throughput rates
• Amdahl’s law (an application is gated by its slowest component)
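Amdahl's law can be made concrete: if a fraction p of a job's work parallelizes across n CPUs, the overall speedup is 1 / ((1 - p) + p / n). A quick check with illustrative numbers shows why the serial component gates the application:

```shell
# Even with 90% of the work parallelized across 8 CPUs, the remaining
# serial 10% caps the overall speedup well below 8x.
awk 'BEGIN { p = 0.9; n = 8; printf "%.2f\n", 1 / ((1 - p) + p / n) }'
# prints 4.71
```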


Performance and Tuning

Best Practices:

– Establish baselines (especially with I/O); use a Copy stage with no output
– Avoid using only one flow for tuning/performance testing; prototyping can be a powerful tool
– Work in increments: change one thing at a time
– Evaluate data skew: repartition to balance the data flow
– Isolate and solve: determine which stage is causing a problem
– Distribute file systems (if possible) to eliminate bottlenecks
– Do NOT involve the RDBMS in initial testing (see above)
– Understand and evaluate the tuning knobs available


Performance and Tuning

• Establishing a baseline:
  – Set up at least 3 configurations: sequential; max parallel; ½ max parallel
  – Use real data if possible, else use the table definition
  – Create or generate a dataset 2-3 times available RAM (limit the test to 10-15 mins)
  – Using the sequential configuration file:
    • Read the dataset to a copy (copy -f)
    • Rerun and watch for caching
    • Add a write to a dataset
    • Run a read/sort/copy test (use a relatively random key for the sort)
  – Using the ½ max parallel configuration file:
    • Create a non-skewed dataset
    • Rerun the tests above
    • “Tune” the configuration to obtain a linear application speed-up
  – Review the entire I/O system
  – Review the configuration file to spread I/O activity
  – Using the max parallel configuration:
    • Create a non-skewed dataset
    • Rerun the tests above
    • Stress the system, looking for areas of contention
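The I/O-baseline step can be sketched with dd, writing a test file and reading it back to get rough sequential throughput numbers. The 1 MB size keeps the sketch quick; as the slide notes, a real baseline should use a file 2-3 times available RAM so the re-read is not served from cache. File names are illustrative.

```shell
# Write a 1 MB test file, then read it back; dd reports throughput on
# stderr, captured here in the *_stats.txt files.
dd if=/dev/zero of=baseline.dat bs=1024 count=1024 2> write_stats.txt
dd if=baseline.dat of=/dev/null bs=1024 2> read_stats.txt
ls -l baseline.dat
```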


Performance and Tuning

• Buffering (Enterprise Edition and Server)
  – A facility added behind the scenes to optimize and regulate data flow. Its primary purpose is to match the rate data is produced upstream with the rate it is consumed downstream (see next slide).
• Partitioning/Sorting (Enterprise Edition)
  – Operations added behind the scenes so the developer does not need to worry about them, while assuring that the flow operates correctly (APT_NO_PART_INSERTION & APT_NO_SORT_INSERTION)
• Operator Combination (Enterprise Edition)
  – Operators combined behind the scenes to improve performance (APT_DISABLE_COMBINATION)


Performance and Tuning

Controlling the Buffers in DataStage Enterprise Edition

• APT_BUFFER_MAXIMUM_TIMEOUT – set to 1 for pre-V7
  – Controls the speed of data flow after buffering
• APT_BUFFER_MAXIMUM_MEMORY – default is 3 MB
  – Increase for large memory configurations to avoid buffering to disk
• APT_BUFFER_DISK_WRITE_INCREMENT – default is 1 MB
  – Increase to create larger bursts of I/O during buffering to disk
• APT_BUFFER_FREE_RUN – default is N * APT_BUFFER_MAXIMUM_MEMORY
  – Increase to reduce data flow impedance for large memory configurations

Controlling the Buffers in DataStage Server

• Set BUFFERSIZE and TIMEOUT for intra/inter-partitioning – default is 128K
  – Set for the project in the Administrator, or in job properties for a particular job
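These knobs are environment variables read by the parallel engine at job startup, so raising them for a large-memory host is just a matter of exporting them before the run. The values below are illustrative, not recommendations.

```shell
# Sketch: raise the EE buffer knobs before launching a job.
# Defaults per the slide: 3 MB maximum buffer memory, 1 MB write increment.
export APT_BUFFER_MAXIMUM_MEMORY=$((32 * 1024 * 1024))       # bytes
export APT_BUFFER_DISK_WRITE_INCREMENT=$((8 * 1024 * 1024))  # bytes
export APT_BUFFER_FREE_RUN=2
echo "$APT_BUFFER_MAXIMUM_MEMORY $APT_BUFFER_DISK_WRITE_INCREMENT"
```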


Performance and Tuning

Evaluating performance with Enterprise Edition

• APT_DUMP_SCORE – used to understand the details of a data flow
• APT_PM_PLAYER_TIMING – used to understand the CPU characteristics of a data flow
• APT_RECORD_COUNTS – used to check for data skew across data partitions

Evaluating performance with Server

• Performance statistics – enabled in the “Tracing” panel of the “Job run options” dialog presented when a server job is run (Director or Designer)


Performance Tuning

The Configuration File

• Tells DataStage how to exploit the underlying computer hardware. For any given system there is no single ideal configuration file, since jobs vary widely in how they behave on that system.
• General hints (assumes an SMP environment):
  – Avoid using the disks that are used for “landing” input and output data for scratch and resource disk
  – Do not use NFS or other remotely mounted disks for scratch disk
  – Understand the file systems underneath the mount points being used by the configuration file
  – Separate the I/O between nodes as much as possible to provide the maximum I/O bandwidth
  – Run your application using various configurations to understand its complexion during volume testing before moving to production
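A minimal two-node configuration following these hints might look like the sketch below. The host name and directory paths are illustrative; the point is that each node names its own resource and scratch areas so I/O can be spread across spindles.

```
{
        node "node1" {
                fastname "etl-host"
                pools ""
                resource disk "/ds/node1/resource" {pools ""}
                resource scratchdisk "/ds/node1/scratch" {pools ""}
        }
        node "node2" {
                fastname "etl-host"
                pools ""
                resource disk "/ds/node2/resource" {pools ""}
                resource scratchdisk "/ds/node2/scratch" {pools ""}
        }
}
```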


Ascential Developer Net

Launches in December


DataStage Operator Tips & Tricks tomorrow:

Agenda

Tuesday 9:15am

• Operator Tips & Tricks Session
  – Upgrades & Installs

– Version Control

– Production Automation

– Running in a High Availability environment


[email protected]@[email protected]@ascentialsoftware.com

Please let us know if you have any comments or suggestions regarding this material.

E388 8195 92A2 4086 9699 4081 A3A3 8595 8489 9587 40A3 8889 A240 A285 A2A2 8996 9540 !

EOD/EOT