Datastage Performance Guide

7/24/2019 Datastage Performance Guide

1/108

DataStage Performance Tuning

Performance Tuning - BasicsBasicsParallelism Parallelism in DataStage Jobs should be optimized ratherthan maximized. The degree of parallelism of a DataStage Job isdetermined by the number of nodes that is dened in the Congurationile! for example! four"node! eight #node etc. $ conguration le %ith alarger number of nodes %ill generate a larger number of processes and%ill in turn add to the processing o&erheads as compared to aconguration le %ith a smaller number of nodes. Therefore! %hilechoosing the conguration le one must %eigh the benets of increasedparallelism against the losses in processing e'ciency (increasedprocessing o&erheads and slo% start up time).*deally ! if the amount ofdata to be processed is small ! conguration les %ith less number of

nodes should be used %hile if data &olume is more ! conguration les%ith larger number of nodes should be used.

Partioning :Proper partitioning of data is another aspect of DataStage Job design!%hich signicantly impro&es o&erall +ob performance. Partitioning shouldbe set in such a %ay so as to ha&e balanced data ,o% i.e. nearly e-ualpartitioning of data should occur and data se% should be minimized.

Memory :*n DataStage Jobs %here high &olume of data is processed! &irtual memory

settings for the +ob should be optimised. Jobs often abort in cases %here asingle looup has multiple reference lins. This happens due to lo% tempmemory space. *n such +obs /$PT0B12304$5*4140424637!/$PT0468*T630S*92 and /$PT0468*T630T*42 should be set tosu'ciently large &alues.

Performance Analysis of Various stages inDataStag

Sequential File Stage -

The se-uential le Stage is a le Stage. *t is the most common *:6 Stageused in a DataStage Job. *t is used to read data from or %rite data to oneor more ,at iles. *t can ha&e only one input lin or one 6utput lin .*t canalso ha&e one re+ect lin. ;hile handling huge &olumes of data! this Stagecan itself become one of the ma+or bottlenecs as reading and %ritingfrom this Stage is slo%.Se-uential les should be used in follo%ingconditions;hen %e are reading a ,at le (xed %idth or delimited) from18*5 en&ironment %hich is TPed from some external systems;hen some18*5 operations has to be done on the le Don


2/108

node can be increased (default &alue is one).

Data Set Stage :The Data Set is a le Stage! %hich allo%s reading data from or %ritingdata to a dataset. This Stage can ha&e a single input lin or single 6utput

lin. *t does not support a re+ect lin. *t can be congured to operate inse-uential mode or parallel mode. DataStage parallel extender +obs useDataset to store data being operated on in a persistent form.Datasets areoperating system les %hich by con&ention has the su'x .dsDatasets aremuch faster compared to se-uential les.Data is spread across multiplenodes and is referred by a control le.Datasets are not 18*5 les and no18*5 operation can be performed on them.1sage of Dataset results in agood performance in a set of lined +obs.They help in achie&ing end"to"end parallelism by %riting data in partitioned form and maintaining thesort order.

Lookup Stage $ =oo up Stage is an $cti&e Stage. *t is used to perform a looup on anyparallel +ob Stage that can output data. The looup Stage can ha&e areference lin! single input lin! single output lin and single re+ectlin.=oo up Stage is faster %hen the data &olume is less.*t can ha&emultiple reference lins (if it is a sparse looup it can ha&e only onereference lin)The optional re+ect lin carries source records that do notha&e a corresponding input looup tables.=ooup Stage and type of looupshould be chosen depending on the functionality and &olume ofdata.Sparse looup type should be chosen only if primary input data

&olume is small.*f the reference data &olume is more! usage of =ooupStage should be a&oided as all reference data is pulled in to local memory

!oin Stage :Join Stage performs a +oin operation on t%o or more datasets input to the+oin Stage and produces one output dataset. *t can ha&e multiple inputlins and one 6utput lin.There can be > types of +oin operations *nner

Join! =eft:3ight outer Join! ull outer +oin. Join should be used %hen thedata &olume is high. *t is a good alternati&e to the looup stage andshould be used %hen handling huge &olumes of data.Join uses the pagingmethod for the data matching.

Merge Stage :The 4erge Stage is an acti&e Stage. *t can ha&e multiple input lins! asingle output lin! and it supports as many re+ect lins as input lins. The4erge Stage taes sorted input. *t combines a sorted master data set %ithone or more sorted update data sets. The columns from the records in themaster and update data sets are merged so that the output recordcontains all the columns from the master record plus any additionalcolumns from each update record. $ master record and an update recordare merged only if both of them ha&e the same &alues for the merge ey

column(s) that you specify. 4erge ey columns are one or more columnsthat exist in both the master and update records. 4erge eys can be more


3/108

than one column. or a 4erge Stage to %or properly master dataset andupdate dataset should contain uni-ue records. 4erge Stage is generallyused to combine datasets or les.

Sort Stage :

The Sort Stage is an acti&e Stage. The Sort Stage is used to sort inputdataset either in $scending or Descending order. The Sort Stage o?ers a&ariety of options of retaining rst or last records %hen remo&ingduplicate records! Stable sorting! can specify the algorithm used forsorting to impro&e performance! etc. 2&en though data can be sorted on alin! Sort Stage is used %hen the data to be sorted is huge.;hen %e sortdata on lin ( sort : uni-ue option) once the data size is beyond the xedmemory limit ! *:6 to dis taes place! %hich incurs an o&erhead.

Therefore! if the &olume of data is large explicit sort stage should be usedinstead of sort on lin.Sort Stage gi&es an option on increasing the bu?ermemory used for sorting this %ould mean lo%er *:6 and betterperformance.

Transformer Stage@The Transformer Stage is an acti&e Stage! %hich can ha&e a single inputlin and multiple output lins. *t is a &ery robust Stage %ith lot of inbuiltfunctionality. Transformer Stage al%ays generates C"code! %hich is thencompiled to a parallel component. So the o&erheads for using atransformer Stage are high. Therefore! in any +ob! it is imperati&e that theuse of a transformer is ept to a minimum and instead other Stages areused! such as@Copy Stage can be used for mapping input lins %ith

multiple output lins %ithout any transformations. ilter Stage can be usedfor ltering out data based on certain criteria. S%itch Stage can be used tomap single input lin %ith multiple output lins based on the &alue of aselector eld. *t is also ad&isable to reduce the number of transformers ina Job by combining the logic into a single transformer rather than ha&ingmultiple transformers .

Funnel Stage unnel Stage is used to combine multiple inputs into a single outputstream. But presence of a unnel Stage reduces the performance of a +ob.*t %ould increase the time taen by +ob by >A (obser&ations). ;hen a

unnel Stage is to be used in a large +ob it is better to isolate itself to one+ob. ;rite the output to Datasets and funnel them in ne% +ob. unnelStage should be run in continuous mode! %ithout hindrance.

"#erall !o$ Design :;hile designing DataStage Jobs care should be taen that a single +ob isnot o&erloaded %ith Stages. 2ach extra Stage put in a Job corresponds tolesser number of resources a&ailable for e&ery Stage! %hich directlya?ects the Jobs Performance. *f possible! big +obs ha&ing large number ofStages should be logically split into smaller units. $lso if a particular Stage

has been identied to be taing lot of time in a +ob! lie a transformerStage ha&ing complex functionality %ith a lot of Stage &ariables and


4/108

transformations! then the design of +obs could be done in such a %ay thatthis Stage is put in a separate +ob all together (more resources for thetransformer StageEEE). $lso %hile designing +obs! care must be taen thatunnecessary column propagation is not done. Columns! %hich are notneeded in the +ob ,o%! should not be propagated from one Stage to

another and from one +ob to the next. $s far as possible! 3CP (3untimeColumn Propagation) should be disabled in the +obs. Sorting in a +obshould be taen care try to minimise number sorts in a +ob. Design a +ob insuch a %ay as to combine operations around same sort eys! if possiblemaintain same hash eys. 4ost often neglected option is don


5/108

You may get many errors in datastage while compiling the jobs or running the jobs.

Some of the errors are as follows

a)Source file not found.If you are trying to read the file, which was not there with that name.

b)Some times you may get Fatal Errors.

c) Data type mismatches.his will occur when data type mismaches occurs in the jobs.

d) Field Si!e errors.

e) "eta data "ismach

f) Data type si!e between source and target different

g) #olumn "ismatch

i) $ricess time out.If ser%er is busy. his error will come some time.

Some of the errors in detail:

ds0Trailer03ec@ ;hen checing operator@ ;hen binding output schema &ariable Kout3ecK@;hen binding output interface eld KTrailerDetail3ecCountK to eldKTrailerDetail3ecCountK@ *mplicit con&ersion from source type KustringK to result typeKstringLmaxMHFFNK@ Possible truncation of &ariable length ustring %hen con&erting to

string using codepage *S6"OOF"I.

Solution@* resol&ed changing the extended col under meta data of thetransformer to unicode

;hen checing operator@ $ se-uential operator cannot preser&e thepartitioningof the parallel data set on input port A.

Solution@* resol&ed by changing the preser&e partioning to QclearQ undertransformer properties

Syntax error@ 2rror in KgroupK operator@ 2rror in output redirection@ 2rror in outputparameters@ 2rror in modify adapter@ 2rror in binding@ Could not nd type@ KsubrecK! line>F

Solution@*ts the issue of le&el number of those columns %hich %ere being added in transformer.Their le&el number %as blan and the columns that %ere being taen from c? le had it as AH.$dded the le&el number and +ob %ored.

Out_Trailer: When checking operator: When binding output schema variable "outRec": When binding output interface field

"STDC_TR!R_RC_C#T" to field "STDC_TR!R_RC_C#T": $mplicit conversion from source t%pe "dfloat" to result

t%pe "decimal&'()(*": +ossible range,precision limitation-

C_Trailer: When checking operator: When binding output interface field "Data" to field "Data": $mplicit conversion from


6/108

source t%pe "string" to result t%pe "string&ma./0((*": +ossible truncation of variable length string-

$mplicit conversion from source t%pe "dfloat" to result t%pe "decimal&'()(*": +ossible range,precision limitation-

Solution: 1sed to transformer function2D3loatToDecimal2- s target field is Decimal- 4% default the output from aggregator

output is double) getting the above b% using above function able to resolve the 5arning-

When binding output schema variable "outputData": When binding output interface field "RecordCount" to field

"RecordCount": $mplicit conversion from source t%pe "string&ma./600*" to result t%pe "int'7": Converting string to number-

Problem(Abstract)

&obs that process a large amount of data in a column can abort with this error'

the record is too big to fit in a bloc( the length re*uested is' ++++, the ma+ bloc( length is' ++++.

Resolving the problem

o fi+ this error you need to increase the bloc( si!e to accommodate the record si!e'

'- !og into Designer and open the 8ob-

6- Open the 8ob properties99 parameters99add environment variable and select:+T_D31!T_TR#S+ORT_4!OC;_S$ou can set this up to 607?4 but %ou reall% shouldn2t need to go over '?4-#OT: value is in ;4

3or e.ample to set the value to '?4:

+T_D31!T_TR#S+ORT_4!OC;_S$


7/108

. S51E-I0E1G13/$41DIGE,7' run0ocallyH) did not reach E3F on its input data set ?.

S30' arning will be disappeared by regenerating S5 File.

8. hile connecting to Datastage client, there is no response, and while restarting websphere ser%ices,following errors occurred

rootJpoluloro?7 binKL .


8/108

Sometime e%en after mapping the fields this error can be occurred and one of the reason could be that the %iewadapter has not lin(ed the input and output fields. 9ence in this case the re*uired field mapping should bedropped and recreated.

&ust to gi%e an insight on this, the %iew adapter is an operator which is responsible for mapping the input andoutput fields. 9ence DataStage creates an instance of -$Piew-dapter which translate the components of theoperator input interface schema to matching components of the interface schema. So if the interface schema isnot ha%ing the same columns as operator input interface schema then this error will be reported.

7)hen we use same partitioning in datastage transformer stage we get the following warning in C.@.= %ersion.

F#$????8 = inputtfm' Input dataset ? has a partitioning method other than entire specifieddisabling memory sharing.

his is (nown issue and you can safely demote that warning into informational by adding this warning to $rojectspecific message handler.

=) arning' - se*uential operator cannot preser%e the partitioning of input data set on input port ?

1esolution' #lear the preser%e partition flag before Se*uential file stages.

)DataStage parallel job fails with for(H) failed, 1esource temporarily una%ailable

3n ai+ e+ecute following command to chec( ma+uproc setting and increase it if you plan to run multiple jobs atthe same time.

lsattr ME Ml sys? V grep ma+uprocma+uproc 7?=8 "a+imum number of $13#ESSES allowed per user rue

8)FI$?????? -ggstg' hen chec(ing operator' hen binding input interface field:#/S-#241; to field :#/S-#241;' Implicit con%ersion from source type :string@K; to result type:dfloat;' #on%erting string to number.

1esolution' use the "odify stage e+plicitly con%ert the data type before sending to aggregator stage.

@)arning' - user defined sort operator does not satisfy the re*uirements.

1esolution'chec( the order of sorting columns and ma(e sure use the same order when use join stage after sortto joing two inputs.

O)F"?????? = Stgtfmheader,7' #on%ersion error calling con%ersion routinetimestampfromstring data may ha%e been lost

F"?????? 7 +fm&ournals,7' #on%ersion error calling con%ersion routine decimalfromstring datamay ha%e been lost

1esolution'chec( for the correct date format or decimal format and also null %alues in the date or decimal fieldsbefore passing to datastage StringoDate, DateoString,DecimaloString or StringoDecimal functions.

C)3S3???77B = &oinsort' hen chec(ing operator' Data claims to already be sorted on thespecified (eys the WsortedT option can be used to confirm this. Data will be resorted as necessary. $erformancemay impro%e if this sort is remo%ed from the flow

1esolution' Sort the data before sending to join stage and chec( for the order of sorting (eys and join (eys andma(e sure both are in the same order.

N)F31?????? = 7 &oin3uter' hen chec(ing operator' Dropping component :#/S241;

because of a prior component with the same name.

1esolution'If you are using join,diff,merge or comp stages ma(e sure both lin(s ha%e the differnt column names


9/108

other than (ey columns

B)FI$????== 7 ocioraclesource' hen chec(ing operator' hen binding output interface field:"E"4E12-"E; to field :"E"4E12-"E;' #on%erting a nullable source to a nonMnullable result

1esolution'If you are reading from oracle database or in any processing stage where incoming column is definedas nullable and if you define metadata in datastage as nonMnullable then you will get abo%e issue.if you want to

con%ert a nullable field to non nullable ma(e sure you apply a%ailable null functions in datastage or in the e+tract*uery.

D--S-GE #3""32 E1131S


10/108

1un :ps Mef V grep dsapisla%e; to chec( if any dsapisla%e processes e+ist. If so, (ill them.

1un :netstat Ma V grep dsprc; to see if any processes ha%e soc(ets that are ES-40IS9ED, FI2-I, or#03SE-I. hese will pre%ent the dsprcd from starting. he soc(ets with status FI2-I or #03SE-Iwill e%entually time out and disappear, allowing you to restart DataStage.

hen 1estart DSEngine. Hif abo%e doesnTt wor() 2eeds to reboot the system.

C. o sa%e Datastage logs in notepad or readable format

S30' a)


11/108

$s Xef V grep 0ogging-gentSoc(etImpl H31)

$S Xef V grep -gent Hto chec( the process id of the abo%e)

=. arning' - se*uential operator cannot preser%e the partitioning of input data set on input port ?

S30' #lear the preser%e partition flag before Se*uential file stages.

. arning' - user defined sort operator does not satisfy the re*uirements.

S30' #hec( the order of sorting columns and ma(e sure use the same order when use join stage after sort tojoing two inputs.

8. #on%ersion error calling con%ersion routine timestampfromstring data may ha%e been lost. +fm&ournals,7'#on%ersion error calling con%ersion routine decimalfromstring data may ha%e been lost

S30' chec( for the correct date format or decimal format and also null %alues in the date or decimal fieldsbefore passing to datastage StringoDate, DateoString,DecimaloString or StringoDecimal functions.

@. o display all the jobs in command line

S30'

cd


12/108

then try to restart it.

7?. hile attempting to compile job, :failed to in%o(e Gen1unime using $hantom process helper;

1#'


13/108

If any processes are in closewait, then wait for some time, those processes

will not be %isible.

Stop the Datastage ser%ices.

cd ^DS93"E

.


14/108

T%pes of Datasets in Datastage:

ccording to the scheme above) there are the t5o groups of Datasets 9 persistent and

virtual-

The first t%pe) persistent Datasets are marked 5ith *.dse.tensions) 5hile for second t%pe)

virtual datasets *.ve.tension is reserved- $t2s important to mention) that no -v files might be

visible in the 1ni. file s%stem) as long as the% e.ist onl% virtuall%) 5hile inhabiting R?

memor%- .tesion -v itself is characteristic strictl% for OSE 9 the Orchestrate language of

scriptingF-

3urther differences are much more significant- +rimaril%) persistent Datasets are being

stored in Unix files using internal Datastage EE format ) 5hile virtual Datasets are never

stored on disk9 the% do e.ist 5ithin links) and in format) but in R? memor%- 3inall%)

persistent Datasets are readable and rewriteable with the DataSet Stage) and virtual

Datasets - might be passed through in memory-

ccurateness demands mentioning the third t%pe of Datasets 9 the%2re called filesets) 5hile

their storage places are diverse 1ni. files and the%2re human9readable- 3ilesets are generall%

marked 5ith the *.fse.tension-

There are a fe5 features specific for Datasets that have to be mentioned to complete the

Datasets polic%- 3irstl%) as a single Dataset contains multiple records) it is obvious that all of

them must undergo the same processes and modifications- $n a 5ord) all of them must go

through the same successive stage-

Secondl%) it should be e.pected that different Datasets usuall% have different schemas)

therefore the% cannot be treated commonl%-

?ore accuratel%) t%picall% Orchestrate Datasets are a single9file eGuivalent for the 5hole

seGuence of records- With datasets) the% might be sho5n as a one ob8ect- Datasets themselves

help ignoring the fact that demanded data reall% are compounded of multiple and diverse files

remaining spread across different sectors of processors and disks of parallel computers- long

that) comple.it% of programming might be significantl% reduced) as sho5n in the e.ample

belo5-


15/108

Datasets parallel architecture in Datastage:

+rimar% multiple files 9 sho5n on the left side of the scheme 9 have been bracketed together)5hat resulted in five nodes- While using datasets) all the files) all the nodes might be boiled

do5n to the onl% one) single Dataset 9 sho5n on the right side of the scheme- Thereupon) %ou

might program onl% one file and get results on all the input files- That significantl% shorten

time needed for modif%ing the 5hole group of separate files) and reduce the possibilit% of

engendering accidental errors- What are its measurable profitsH ?ainl%) significantl%

increasing speed of applications basing on large data volumes-

ll in all) is it 5orth95hileH Surel%- speciall%) 5hile talking about technological advance and

succeeding data9dependence- The% do coerce using larger and larger volumes of data) and 9 as

a conseGuence 9 s%stems able to rapid cooperate on them became a necessit%-

Datastage Senarios and solutions

3ield mapping using Transformer stage:

ReGuirement:

field 5ill be right 8ustified Iero filled) Take last 'A characters

Solution:

Right"((((((((((":Trim!nk_Jfm_Trans-linkF)'AF

Scenario ':

We have t5o datasets 5ith @ cols each 5ith different names- We should create a dataset 5ith

@ cols = from one dataset and one col 5ith the record count of one dataset-

We can use aggregator 5ith a dumm% column and get the count from one dataset and do a

look up from other dataset and map it to the = rd dataset

Something similar to the belo5 design:


16/108

Scenario 6:

3ollo5ing is the e.isting 8ob design- 4ut reGuirement got changed to: Eead and trailer

datasets should populate even if detail records is not present in the source file- 4elo5

8ob don2t do that 8ob-

Eence changed the above 8ob to this follo5ing reGuirement:

1sed ro5 generator 5ith a cop% stage- Kiven default value IeroF for col countF coming in

from ro5 generator- $f no detail records it 5ill pick the record count from ro5 generator-

We have a source 5hich is a seGuential file 5ith header and footer- Eo5 to remove the header

and footer 5hile reading this file using seGuential file stage of DatastageH

Sol:Type command in putty@ sed QIdR/dQ le0namene%0le0name (type

this in +ob before +ob subroutine then use ne% le in se- stage)

$3 $ EL SO1RC !$; CO!' 4 #D TRKT !$; CO!' CO!6 ' 6 4'-

EOW TO CE$L TE$S O1T+1T 1S$#K STK LR$4! $# TR#S3OR?R

STKH

$f ke%Change /' Then ' lse stagevaraibleM'

Suppose that @ 8ob control b% the seGuencer like 8ob ') 8ob 6) 8ob =) 8ob @ Fif 8ob ' have

'()((( ro5 )after run the 8ob onl% 0((( data has been loaded in target table remaining are notloaded and %our 8ob going to be aborted then-- Eo5 can short out the problem-Suppose 8ob


17/108

seGuencer s%nchronies or control @ 8ob but 8ob ' have problem) in this condition should go

director and check it 5hat t%pe of problem sho5ing either data t%pe problem) 5arning

massage) 8ob fail or 8ob aborted) $f 8ob fail means data t%pe problem or missing column action

-So u should go Run 5indo5 9Click9 Tracing9+erformance or $n %our target table

9general 9 action9 select this option here t5o option

iF On 3ail 99 commit ) ContinueiiF On Skip 99 Commit) Continue-

3irst u check ho5 man% data alread% load after then select on skip option then continue and

5hat remaining position data not loaded then select On 3ail ) Continue ------ gain Run the 8ob

defiantl% u get successful massage

Nuestion: $ 5ant to process = files in seGuentiall% one b% one ho5 can i do that- 5hile

processing the files it should fetch files automaticall% -

ns:$f the metadata for all the files r same then create a 8ob having file name as parameter

then use same 8ob in routine and call the 8ob 5ith different file name---or u can create

seGuencer to use the 8ob--

+arameteriIe the file name-4uild the 8ob using that parameter

4uild 8ob seGuencer 5hich 5ill call this 8ob and 5ill accept the parameter for file name-

Write a 1#$J shell script 5hich 5ill call the 8ob seGuencer three times b% passing different

file each time-

R: What Eappens if RC+ is disable H

$n such case Osh has to perform $mport and e.port ever% time 5hen the 8ob runs and the

processing time 8ob is also increased---

Runtime column propagation RC+F: $f RC+ is enabled for an% 8ob and specificall% for those

stages 5hose output connects to the shared container input then meta data 5ill be propagated

at run time so there is no need to map it at design time-

$f RC+ is disabled for the 8ob in such case OSE has to perform $mport and e.port ever% time

5hen the 8ob runs and the processing time 8ob is also increased-

Then %ou have to manuall% enter all the column description in each stage-RC+9 Runtime

column propagation

Nuestion:

Source: Target

no name no name

' a)b ' a6 c)d 6 b

= e)f = c

source has 6 fields like

CO?+#> !OCT$O#

$4? E>D

TCS 4#

$4? CEEC! E>D


18/108

TCS CE

$4? 4#

EC! 4#

EC! CE

!$; TE$S-------

TE# TE O1T+1T !OO;S !$; TE$S----

Compan% loc count

TCS E>D =

4#

CE

$4? E>D =

4#

CEEC! E>D =

4#

CE

6Finput is like this:

no)char

')a

6)b

=)a

@)b

0)a

7)a

B)b

A)a

4ut the output is in this form 5ith ro5 numbering of Duplicate occurence

output:

no)char)Count

"'")"a")"'"

"7")"a")"6"

"0")"a")"="

"A")"a")"@"

"=")"a")"0"

"6")"b")"'"

"B")"b")"6"

"@")"b")"="

=F$nput is like this:

file''(


19/108

6(

'(

'(

6(

=(

Output is like:

file6 file=duplicatesF

'( '(

6( '(

=( 6(

@F$nput is like:

file'

'(

6(

'('(

6(

=(

Output is like ?ultiple occurrences in one file and single occurrences in one file:

file6 file=

'( =(

'(

'(

6(

6(

0F$nput is like this:

file'

'(

6(

'(

'(

6(

=(

Output is like:file6 file=

'( =(

6(

7F$nput is like this:

file'

'

6

=

@

0

7B


20/108

A

'(

Output is like:

file6oddF file=evenF' 6

= @

0 7

B A

'(

BFEo5 to calculate SumsalF) vgsalF) ?insalF) ?a.salF 5ith out using ggregator stage--

AFEo5 to find out 3irst sal) !ast sal in each dept 5ith out using aggregator stage

FEo5 man% 5a%s are there to perform remove duplicates function 5ith out using Remove

duplicate stage--

Scenario:

source has 6 fields like

CO?+#> !OCT$O#

$4? E>D

TCS 4#

$4? CE

EC! E>D

TCS CE

$4? 4#

EC! 4#

EC! CE

!$; TE$S-------

TE# TE O1T+1T !OO;S !$; TE$S----

Compan% loc count

TCS E>D =

4#

CE

$4? E>D =

4#

CE

EC! E>D =

4#

CE

Solution:


21/108


22/108

3ou can get it by using Sort stage Transformer stage RemoveDuplicates Transformer tgt

-$

4irst Rea an 'oa t%e ata into your source file# 4or (ample Sequential 4ile &

!n in Sort stage select $ey c%ange column = True # To 5enerate 5roup is&

5o to Transformer stage

Create one stage variable.

3ou can o t%is by rig%t clic$ in stage variable go to properties an name it as your6is% # 4or eample temp&

an in epression 6rite as belo6

if $eyc%ange column =1 t%en column name else temp:)*):column name

T%is column name is t%e one you 6ant in t%e require column 6it% elimite

commas.

-n remove uplicates stage $ey is col1 an set option uplicates retain to> 'ast.in transformer rop col/ an efine / columns li$e col0*col/*col2in col1 erivation give 4iel#"nputColumn*7*7*1& anin col1 erivation give 4iel#"nputColumn*7*7*0& an

in col1 erivation give 4iel#"nputColumn*7*7*/&

Scenario:10&Consier t%e follo6ing employees ata as source8employee9i* salary

1* 1

0* 0/* /

2* ;

Create a


23/108

T%e scenario is t%at*

to t%e out put 2 > t%e recors 6%ic% are common to bot% 1 an 0 s%oul go.

to t%e output / > t%e recors 6%ic% are only in 1 but not in 0 s%oul go

to t%e output ; > t%e recors 6%ic% are only in 0 but not in 1 s%oul go.sltn:src1>copy1>>output91#only left table&

oin#inner type&> ouput91src0>copy0>>output9/#only rig%t table&

Consier t%e follo6ing employees ata as source8employee9i* salary1* 10* 0/* /

2* ;

Scenario:

Create a Transformer#! ne6 Column on bot% t%e output lin$s an assign avalue as 1 &> 1& !ggregator #Do group by

using t%at ne6 column&0&loo$up


24/108

seq>trans>seq

create one stage variable as delimiter..an put erivation on stage as DS'in$2.sno : 7*7 : DS'in$2.sname : 7*7 :DS'in$2.mar$1 : 7*7 :DS'in$2.mar$0 : 7*7 : DS'in$2.mar$/

an o mapping an create one more column count as integer type.

an put erivation on count column as Count#delimter* 7*7&

scenario:sname total9vo6els9count!llen 0Scott 1ar 1

ner Transformer Stage Description:

total9Eo6els9Count=Count#DS'in$/.last9name*7a7&FCount#DS'in$/.last9name*7e7&

FCount#DS'in$/.last9name*7i7&FCount#DS'in$/.last9name*7o7&FCount#DS'in$/.last9name*7u7&.

Scenario:

1&-n aily 6e r getting some %uge files ata so all files metaata is same 6e %ave to

loa in to target table %o6 6e can loa8se 4ile ,attern in sequential file

0& -ne column %aving 1 recors at run time 6e %ave to sen ;t% an @t% recor to

target at run time %o6 6e can sen8Can get t%roug%*by using G"H comman in sequential file filter option

Io6 can 6e get 1A mont%s ate ata in transformer stage8se transformer stage after input seq file an try t%is one as constraint intransformer stage :

DaysSince4romDate#CurrentDate#&* DS'in$/.ate91A&J=;2A -RDaysSince4romDate#CurrentDate#&* DS'in$/.ate91A&J=;2@

6%ere ate91A column is t%e column %aving t%at ate 6%ic% nees to be less or

equal to 1A mont%s an ;2A is no. of ays for 1A mont%s an for leap year it is

;2@#t%ese numbers you nee to c%ec$&.

%at is ifferences bet6een 4orce Compile an Compile 8

Diff b/w Compile and Validate?Compile option only c%ec$s for all manatory requirements li$e lin$ requirements* stage

options an all. ut it 6ill not c%ec$ if t%e atabase connections are vali.Ealiate is equivalent to Running a


25/108

How to FInd Out Duplicate Values Using Transformer?

You can capture the duplicate records based on (eys using ransformer stage %ariables.

7. Sort and partition the input data of the transformer on the (eyHs) which defines the duplicate.

=. Define two stage %ariables, lets say StgPar$re%5ey#olHdata type same as 5ey#ol) and StgPar#ntr as Integer

with default %alue ?

where 5ey#ol is your input column which defines the duplicate.

E+pression for StgPar#ntrH7st stg %arMM maintain order)'

If DS0in(nn.5ey#ol A StgPar$re%5ey#ol hen StgPar#ntr Z 7 Else 7

E+pression for StgPar$re%5ey#olH=nd stg %ar)'

DS0in(nn.5ey#ol

. 2ow in constrain, if you filter rows where StgPar#ntr A 7 will gi%e you the uni*ue records and if you filter

StgPar#ntr R 7 will gi%e you duplicate records.

?% source is !ike

Sr_no) #ame

'()a

'()b

6()c

=()d

=()e

@()f

?% target Should !ike:

Target ':Onl% uniGue means 5hich records r onl% onceF

6()c

@()f

Target 6:Records 5hich r having more than ' timeF

'()a

'()b

=()d=()e

Eo5 to do this in DataStage----

use aggregator and transformer stages

source99aggregator99transformat99target

perform count in aggregator) and take t5o op links in trasformer) filter data count' for one llinkand put count/' for second link-


26/108

Senario!

in m% i,p source i have # no-of records

$n output i have = targets

i 5ant o,p like 'st rec goes to 'st targt and

6nd rec goes to 6nd target and

=rd rec goes to =rd target again

@th rec goes to 'st taget ------------ like this

do this ""5ithout using partition techniGues "" remember it-

source999trans9999target

in trans use conditions on constraints

modempno)=F/'

modempno)=F/6

modempno)=F/(

Senario!

im having i,p as

col

a_b_c

._3_$D_KE_$3

5e hav to mak it as

col' col 6 col=

a b c

. f i

de gh if

Transformer

create = columns 5ith derivation

col' 3ieldcol)2_2)'F

col6 3ieldcol)2_2)6F

col= 3ieldcol)2_2)=F

3ield function divides the column based on the delimeter)

if the data in the col is like )4)Cthen


27/108

3ieldcol)2)2)'F gives

3ieldcol)2)2)6F gives 4

3ieldcol)2)2)=F gives C

Unix Shell Sripting

Script for pass5ordless sftp uni.:

cd ^targetdirectorysftp sftpuserJ^sftpmachine QQE3Fcd sourcedirectory

binget sourcefilenamee+itE3F

DataSet in DataStage

$nside a $nfoSphere DataStage parallel 8ob) data is moved around in data sets- These carr%

meta data 5ith them) both column definitions and information about the onfigurationthat

5as in effect 5hen the data set 5as created- $f for e.ample) %ou have a stage 5hich limits

e.ecution to a subset of available nodes) and the data set 5as created b% a stage using allnodes) $nfoSphere DataStage can detect that the data 5ill need repartitioning-

$f reGuired) data sets can be landed as persistent data sets) represented b% a Data Set stage

-This is the most efficient 5a% of moving data bet5een linked 8obs- +ersistent data sets are

stored in a series of files linked b% a control file note that %ou should not attempt to

manipulate these files using 1#$J tools such as R? or ?L- l5a%s use the tools provided

5ith $nfoSphere DataStageF-

there are the t5o groups of Datasets 9 persistent and virtual-

The first t%pe) persistent Datasets are marked 5ith *.dse.tensions) 5hile for second t%pe)virtual datasets *.ve.tension is reserved- $t2s important to mention) that no -v files might be

visible in the 1ni. file s%stem) as long as the% e.ist onl% virtuall%) 5hile inhabiting R?

memor%- .tesion -v itself is characteristic strictl% for OSE 9 the Orchestrate language of

scriptingF-

3urther differences are much more significant- +rimaril%) persistent Datasets are being

stored in Unix files using internal Datastage EE format ) 5hile virtual Datasets are never

stored on disk9 the% do e.ist 5ithin links) and in format) but in R? memor%- 3inall%)

persistent Datasets are readable and rewriteable with the DataSet Stage) and virtual

Datasets - might be passed through in memory-


28/108

data set comprises a descriptor file and a number of other files that are added as the data set

gro5s- These files are stored on multiple disks in %our s%stem- data set is organiIed in

terms of partitionsand segments.

ach partition of a data set is stored on a single processing node- ach data segment contains

all the records 5ritten b% a single 8ob- So a segment can contain files from man% partitions)and a partition has files from man% segments-

3irstl%) as a single Dataset contains multiple records) it is obvious that all of them must

undergo the same processes and modifications- $n a 5ord) all of them must go through the

same successive stage-

Secondl%) it should be e.pected that different Datasets usuall% have different schemas)

therefore the% cannot be treated commonl%-

lias names of Datasets are

'F Orchestrate 3ile6F Operating S%stem file

nd Dataset is multiple files- The% are

aF Descriptor 3ile

bF Data 3ile

cF Control file

dF Eeader 3iles

$n Descriptor 3ile) 5e can see the Schema details and address of data-

$n Data 3ile) 5e can see the data in #ative format-

nd Control and Eeader files reside in Operating S%stem-

Starting a Dataset "anager

#hoose $ools Data Set "anagement% a &rowse 'iles dialog box appears!

'- #avigate to the director% containing the data set %ou 5ant to manage- 4% convention)

data set files have the suffi. -ds-


29/108

6- Select the data set %ou 5ant to manage and click O;- The Data Set Lie5er appears-

3rom here %ou can cop% or delete the chosen data set- >ou can also vie5 its schema

column definitionsF or the data it contains-

(. S#D $ype )@- Slowly hanging dimension $ype )is a model 5here the 5hole histor% is stored inthe database- n additional dimension record is created and the segmenting bet5een

the old record values and the ne5 currentF value is eas% to e.tract and the histor% is

clear-

The fields 2effective date2 and 2current indicator2 are ver% often used in that dimension

and the fact table usuall% stores dimension ke% and version number-

. S#D ) implementation in Datastage7- The 8ob described and depicted belo5 sho5s ho5 to implement SCD T%pe 6 in

Datastage- $t is one of man% possible designs 5hich can implement this dimension-

3or this e.ample) 5e 5ill use a table 5ith customers data it2s name is

D_C1STO?R_SCD6F 5hich has the follo5ing structure and data:

D0C1ST6423 dimension table before loading

%&ST'(D

%&ST' %&ST' %&ST' %&ST' )*%' )*%' )*%'

+AM*,)"&P

'(DTP*'

(D%"&+T)

'(DV*)S(

"+*FFDT

%&))*+T'(+D

D3B61$

DreamBaset 2= S P= I

AI"IA"HAAG 7

*T(MAA.

*TLtoolsinfo B( % F( /

01-21-0223

$44$A a+atso D S CD I

H"A"HAAG 7

*C*=$AirstPactonic D C *T I

HF"A"HAAG 7

3D55$H rasir 2= C SU I

H>"A"HAAG 7

V$46P$

Vanpa=TD. D C 1S I

HI"A"HAAG 7

VV46P$

VVelectroni

2= S 31 I I"A"

7


30/108

cs HAAG

V=4*$GVlasithlini D S P= I

I"A"HAAG 7

V=4P2$ Vlobiteleco TC S * I

IF"

A"HAAG 7

V68D;$F

Voli$irlines B8 S VB I

I>"A"HAAG 7

The most important facts and stages of the C1ST_SCD6 8ob processing:

The dimension table 5ith customers is refreshed dail% and one of the data sources is a te.t

file- 3or the purpose of this e.ample the C1ST_$D/T$?0 differs from the one stored inthe database and it is the onl% record 5ith changed data- $t has the follo5ing structure and

data:

SCD 6 9 Customers file e.tract:


31/108

There is a hashed file Eash_#e5CustF 5hich handles a lookup of the ne5 data coming

from the te.t file-

T(('_!ookups transformer does a lookup into a hashed file and maps ne5 and old

values to separate columns-SCD 6 lookup transformer

T((6_Check_Discrepacies_e.ist transformer compares old and ne5 values of records and

passes through onl% records that differ-

SCD 6 check discrepancies transformer


32/108

T((= transformer handles the 1+DT and $#SRT actions of a record- The old record is

updated 5ith current indictator flag set to no and the ne5 record is inserted 5ith current

indictator flag set to %es) increased record version b% ' and the current date-

SCD 6 insert9update record transformer


33/108


34/108

V$46P$

Vanpa=TD. D C 1S I

HI"A"HAAG 7

VV46P$

VVelectroni

cs 2= S 31 I

I"A"

HAAG 7

V=4*$GVlasithlini D S P= I

I"A"HAAG 7

V=4P2$

Vlobiteleco TC S * I

IF"A"HAAG 7

V68D;$F

Voli$irlines B8 S VB I

I>"A"HAAG 7

I4" I2F3S$9E1ED--S-GE $E1F31"-2#E/2I2G' 3PE1PIE 3F 4ES

$1-#I#ESI213D/#I32

Data integration processes are &ery time and resource consuming. The

amount of data and the size of the datasets are constantly gro%ing but

data and information are still expected to be deli&ered on"time.

Performance is therefore a ey element in the success of a Business

*ntelligence W Data ;arehousing pro+ect and in order to guarantee the

agreed le&el of ser&ice! management of data %arehouse performance andperformance tuning ha&e to tae a full role during the data %arehouse and

2T= de&elopment process.

Tuning! ho%e&er! is not al%ays straightfor%ard. $ chain is only as strong

as its %eaest lin. *n this context! there are &e crucial domains that

re-uire attention %hen tuning an *B4 *nfosphere DataStage en&ironment @

System *nfrastructure

8et%or

Database


35/108


36/108

*n order to resol&e any performance issues it is essential to ha&e an

understanding of the data ,o% %ithin the +obs. To help understand a +ob

,o%! a score dump should be taen. This can be done by setting the

$PT0D14P0SC632 en&ironment &ariable to true prior to running the +ob.

;hen enabled! the score dump produces a report %hich sho%s the

operators! processes and data sets in the +ob and contains information

about @

;here and ho% data %as repartitioned.

;hether *B4 *nfoSphere DataStage has inserted extra operators in

the ,o%.

The degree of parallelism each operator has run %ith! and on %hich

nodes.

;here data %as bu?ered.

The score dump information is included in the +ob log %hen a +ob is run

and is particularly useful in sho%ing %here *B4 *nfoSphere DataStage is

inserting additional components:actions in the +ob ,o%! in particular extra

data partitioning and sorting operators as they can both be detrimental to

performance. $ score dump %ill help to detect super,uous operators and

amend the +ob design to remo&e them.

/sage of 1esource Estimation

Predicting hard%are resources needed to run DataStage +obs in order to

meet processing time re-uirements can sometimes be more of an art than

a science.

;ith ne% sophisticated analytical information and deep understanding of

the parallel frame%or! *B4 has added 3esource 2stimation to DataStage

(and ualityStage) O.x. This can be used to determine the needed system

re-uirements or to analyze if the current infrastructure can support the

+obs that ha&e been created.

;ithin a +ob design! a ne% toolbar option is a&ailable called 3esource

2stimation.


37/108

This option opens a dialog called 3esource 2stimation. The 3esource

2stimation is based on a modelization of the +ob. There are t%o types of

models that can be created@

Static. The static model does not actually run the +ob to create the

model. CP1 utilization cannot be estimated! but dis space can. The

record size is al%ays xed. The best case scenario is considered

%hen the input data is propagated. The %orst case scenario is

considered %hen computing record size.

Dynamic. The 3esource 2stimation tool actually runs the +ob %ith a

sample of the data. Both CP1 and dis space are estimated. This is

a more predictable %ay to produce estimates.

3esource 2stimation is used to pro+ect the resources re-uired to execute

the +ob based on &arying data &olumes for each input data source.


38/108

$ pro+ection is then executed using the model selected. The results sho%

the total CP1 needed! dis space re-uirements! scratch space

re-uirements! and other rele&ant information.

Di?erent pro+ections can be run %ith di?erent data &olumes and each can

be sa&ed. Vraphical charts are also a&ailable for analysis! %hich allo% the

user to drill into each stage and each partition. $ report can be generated

or printed %ith the estimations.


39/108

This feature %ill greatly assist de&elopers in estimating the time and

machine resources needed for +ob execution. This ind of analysis can

help %hen analyzing the performance of a +ob! but *B4 DataStage also

o?ers another possibility to analyze +ob performance.

/sage of $erformance -nalysis

*solating +ob performance bottlenecs during a +ob execution or e&en

seeing %hat else %as being performed on the machine during the +ob run

can be extremely di'cult. *B4 *nfosphere DataStage O.x adds a ne%

capability called Performance $nalysis.

*t is enabled through a +ob property on the execution tab %hich collects

data at +ob execution time. ( 8ote@ by default! this option is disabled ) .

6nce enabled and %ith a +ob open! a ne% toolbar option! called

Performance $nalysis! is made a&ailable .

This option opens a ne% dialog called Performance $nalysis. The rst

screen ass the user %hich +ob instance to perform the analysis on.

Detailed charts are then a&ailable for that specic +ob run including@

Job timeline

3ecord Throughput

CP1 1tilization

Job Timing Job 4emory 1tilization

Physical 4achine 1tilization (sho%s %hat else is happening o&erall

on the machine! not +ust the DataStage acti&ity).

2ach partition


40/108

$ report can be generated for each chart.

1sing the information in these charts! a de&eloper can for instance

pinpoint performance bottlenecs and re"design the +ob to impro&eperformance.

*n addition to instance performance! o&erall machine statistics are

a&ailable. ;hen a +ob is running! information about the machine is also

collected and is a&ailable in the Performance $nalysis tool including@

6&erall CP1 1tilization

4emory 1tilization

Dis 1tilizationDe&elopers can also correlate statistics bet%een the machine information

and the +ob performance. iltering capabilities exist to only display specic

stages.

The information collected and sho%n in the Performance $nalysis tool can

easily be analyzed to identify possible bottlenecs. These bottlenecs are

usually situated in the general +ob design! %hich %ill be described in the

follo%ing chapter.

'!%!"$) *O( D!SI'% (!ST "$&TI&!S

The ability to process large &olumes of data in a short period of time

depends on all aspects of the ,o% and the en&ironment being optimized

for maximum throughput and performance. Performance tuning and

optimization are iterati&e processes that begin %ith +ob design and unit

tests! proceed through integration and &olume testing! and continue

throughout the production life cycle of the application. Xere are someperformance pointers@


41/108

#olumns and type con%ersions

3emo&e unneeded columns as early as possible %ithin the +ob ,o%. 2&ery

additional unused column re-uires additional bu?er memory! %hich can

impact performance and mae each ro% transfer from one stage to the

next more expensi&e. *f possible! %hen reading from databases! use a

select list to read only the columns re-uired! rather than the entire table.

$&oid propagation of unnecessary metadata bet%een the stages. 1se the

4odify stage and drop the metadata. The 4odify stage %ill drop the

metadata only %hen explicitly specied using the D36P clause.

So only columns that are really needed in the +ob should be used and the

columns should be dropped from the moment they are not neededanymore. The 6SX0P3*8T0SCX24$S en&ironment &ariable can be set to

&erify that runtime schemas match the +ob design column denitions.

;hen using stage &ariables on a Transformer stage! ensure that their data

types match the expected result types. $&oid that DataStage needs to

perform unnecessary type con&ersions as it %ill use time and resources

for these con&ersions.

ransformer stages

*t is best practice to a&oid ha&ing multiple stages %here the functionality

could be incorporated into a single stage! and use other stage types to

perform simple transformation operations. Try to balance load on

Transformers by sharing the transformations across existing Transformers.

This %ill ensure a smooth ,o% of data.

;hen type casting! renaming of columns or addition of ne% columns is

re-uired! use Copy or 4odify Stages to achie&e this. The Copy stage! for

example! should be used instead of a Transformer for simple operations

including @

Job Design placeholder bet%een stages

3enaming Columns

Dropping Columns

*mplicit (default) Type Con&ersions

$ de&eloper should try to minimize the stage &ariables in a Transformer

stage because the performance of a +ob decreases as stage &ariables are


42/108

added in a Transformer stage. The number of stage &ariables should be

limited as much as possible.

$lso if a particular stage has been identied as one that taes a lot of time

in a +ob! lie a Transformer stage ha&ing complex functionality %ith a lot ofstage &ariables and transformations! then the design of +obs could be

done in such a %ay that this stage is put in a separate +ob all together

(more resources for the Transformer Stage).

;hile designing *B4 DataStage Jobs! care should be taen that a single

+ob is not o&erloaded %ith stages. 2ach extra stage put into a +ob

corresponds to less resources being a&ailable for e&ery stage! %hich

directly a?ects the +ob performance. *f possible! complex +obs ha&ing alarge number of stages should be logically split into smaller units.

Sorting

$ sort done on a database is usually a lot faster than a sort done in

DataStage. So # if possible # try to already do the sorting %hen reading

data from the database instead of using a Sort stage or sorting on the

input lin. This could also mean a big performance gain in the +ob!

although it is not al%ays possible to a&oid needing a Sort stage in +obs.

Careful +ob design can impro&e the performance of sort operations! both in

standalone Sort stages and in on"lin sorts specied in other stage types!

%hen not being able to mae use of the database sorting po%er.

*f data has already been partitioned and sorted on a set of ey columns!

specify the Ydon


43/108

of the Sort stage. The default setting is HA 4B per partition. 8ote that sort

memory usage can only be specied for standalone Sort stages! it cannot

be changed for inline (on a lin) sorts.

Se*uential files

;hile handling huge &olumes of data! the Se-uential ile stage can itself

become one of the ma+or bottlenecs as reading and %riting from this

stage is slo%. Certainly do not use se-uential les for intermediate storage

bet%een +obs. *t causes performance o&erhead! as it needs to do data

con&ersion before %riting and reading from a le. 3ather Dataset stages

should be used for intermediate storage bet%een di?erent +obs.

Datasets are ey to good performance in a set of lined +obs. They help in

achie&ing end"to"end parallelism by %riting data in partitioned form and

maintaining the sort order. 8o repartitioning or import:export con&ersions

are needed.

*n order to ha&e faster reading from the Se-uential ile stage the number

of readers per node can be increased (default &alue is one). This means!

for example! that a single le can be partitioned as it is read (e&en though

the stage is constrained to running se-uentially on the conductor mode).

This is an optional property and only applies to les containing xed"

length records. But this pro&ides a %ay of partitioning data contained in a

single le. 2ach node reads a single le! but the le can be di&ided

according to the number of readers per node! and %ritten to separate

partitions. This method can result in better *:6 performance on an S4P

(Symmetric 4ulti Processing) system.

*t can also be specied that single les can be read by multiple nodes.

This is also an optional property and only applies to les containing xed"


44/108

length records. Set this option to 7es to allo% indi&idual les to be read

by se&eral nodes. This can impro&e performance on cluster systems.

*B4 DataStage no%s the number of nodes a&ailable! and using the xed

length record size! and the actual size of the le to be read! allocates to

the reader on each node a separate region %ithin the le to process. The

regions %ill be of roughly e-ual size.

The options 3ead rom 4ultiple 8odes and 8umber of 3eaders Per

8ode are mutually exclusi&e.

1untime #olumn $ropagation

$lso %hile designing +obs! care must be taen that unnecessary column

propagation is not done. Columns! %hich are not needed in the +ob ,o%!

should not be propagated from one stage to another and from one +ob to

the next. $s much as possible! 3CP (3untime Column Propagation) should

be disabled in the +obs.

&oin, 0oo(up or "erge

6ne of the most important mistaes that de&elopers mae is to not ha&e&olumetric analyses done before deciding to use Join! =ooup or 4erge

stages.

*B4 DataStage does not no% ho% large the data set is! so it cannot mae

an informed choice %hether to combine data using a Join stage or a

=ooup stage. Xere is ho% to decide %hich one to use Z

There are t%o data sets being combined. 6ne is the primary or dri&ing

data set! sometimes called the left of the +oin. The other dataset are thereference data set or the right of the +oin.


45/108

*n all cases! the size of the reference data sets is a concern. *f these tae

up a large amount of memory relati&e to the physical 3$4 memory size of

the computer DataStage is running on! then a =ooup stage might crash

because the reference datasets might not t in 3$4 along %ith e&erything

else that has to be in 3$4. This results in a &ery slo% performance since

each looup operation can! and typically %ill! cause a page fault and an

*:6 operation.

So! if the reference datasets are big enough to cause trouble! use a +oin. $

+oin does a high"speed sort on the dri&ing and reference datasets. This can

in&ol&e *:6 if the data is big enough! but the *:6 is all highly optimized and

se-uential. $fter the sort is o&er! the +oin processing is &ery fast and ne&er

in&ol&es paging or other *:6.

Databases

The best choice is to use Connector stages if a&ailable for the database.

The next best choice is the 2nterprise database stages as these gi&e

maximum parallel performance and features %hen compared to [plug"ine 4a#e successfully implemente5 A,,)*,AT("+ usingT)A+SF")M*) Stage

L""P(+, (+ T)A+SF")M*) STA,* DATASTA,* E9. : *AMPL* /


51/108

Looping in a Transformer 7as ne7 feature t4at 7as intro5uce5 inDataStage E9.9 +o7 7e 5onGt 4a#e to 5esign comple logic toperform looping functions on input 5ata9

)at4er t4an going on a$out 74at is ne7 an5 74at

#aria$les or functions 4a#e $een a55e5 to ac4ie#e Loopingin DataStage; 7e 7ill look at a couple of eamples t4at7ill eplain all t4e ne7 functionalities in a simple an5 easy7ay9

*ample /:9ow to con%ert a single row into multiple rows +o7 you can argue t4at t4is is possi$le using a pi#otstage9 But for t4e sake of t4is article lets try 5oing t4isusing a TransformerH

Belo7 is a screens4ot of our input 5ata

%ity State +ame/ +ame0 +ame?

xy VX Sam Dean ;inchester

>e are going to rea5 t4e a$o#e 5ata from a sequential 8lean5 transform it to look like t4is

%ity State +ame

xy VX Sam

xy VX Dean

xy VX ;inchester

So lets get to t4e 6o$ 5esign

Step /: )ea5 t4e input 5ata

Step 0: Logic for Looping in Transformer


52/108

(n t4e a56acent image you can see a ne7 $o calle5 Loop%on5ition9 T4is 74ere 7e are going to control t4e loop#aria$les9

Belo7 is t4e screens4ot 74en 7e epan5 t4e Loop%on5ition $o

T4e Loop >4ile constraint is use5 to implement afunctionality similar to @>C(L* statement inprogramming9 So; similar to a 74ile statement nee5 to4a#e a con5ition to i5entify 4o7 many times t4e loop issuppose5 to $e eecute59

To ac4ie#e t4is I(T*)AT("+ system #aria$le 7asintro5uce59 (n our eample 7e nee5 to loop t4e 5ata ?times to get t4e column 5ata onto su$sequent ro7s9

So lets 4a#e I(T*)AT("+ JK?


53/108

+o7 create a ne7 Loop #aria$le 7it4 t4e name Loop+ame

T4e 5eri#ation for t4is loop #aria$le s4oul5 $e

If @ITERATION=1 Then DSLink2.Name1 Else If @ITERATION=2 Then DSLink2.Name2

Else DSLink2.Name3

Belo7 is a screens4ot illustrating t4e same

+o7 all 7e 4a#e to 5o is map t4is Loop #aria$le Loop+ame

to our output column +ame

Lets map the output to a sequential fle stage and see ithe output is a desired.


54/108

After running t4e 6o$; 7e 5i5 a #ie7 5ata on t4e outputstage an5 4ere is t4e 5ata as 5esire59

Making some t7eaks to t4e a$o#e 5esign 7e canimplement t4ings like

A55ing ne7 ro7s to eisting ro7s

Splitting 5ata in a single column to multiple ro7s an5many more suc4 stu99

$"$))!) D!(U''I%' I%

D$T$ST$'! +,-

(BM 4as a55e5 a ne7 5e$ug feature for DataStage fromt4e E9. #ersion9 (t pro#i5es $asic 5e$ugging features fortesting5e$ugging t4e 6o$ 5esign9

T4is feature is use5 from t4e 5esigner client9 So lets 6ump

into creating a simple 6o$ an5 start 5e$ugging itH


55/108

As you can see t4e a$o#e 6o$ 5esign is #ery simpleN ( 4a#ea ro7 generator stage t4at 7ill generate /2 recor5s fort4e column @num$er

An5 in t4e transformer (G#e 4ar5-co5e5 a column calle5@name 7it4 t4e #alue OAB%D*F,G

T4e 8nal stage is t4e peek stage t4at 7e all 4a#e $eenusing for 5e$ugging our 6o$ 5esign all t4ese years9+o7 t4at 7e un5erstan5 t4e 6o$ 5esing; 7eGll start5e$ugging it9

A55 $reakpoints to links

)ig4t click on t4e link 74ere you 7is4 to c4eck5e$ug

t4e 5ata an5 t4e follo7ing 5rop 5o7n menu appears


56/108

Select t4e option; @Toggle Breakpoint

"nce selecte5 a small icon appears on t4e link ass4o7n in t4e $elo7 screen s4ot

>e can also e5it t4e Breakpoint; to 5o so 7e 4a#e torig4t click on t4e link again an5 select @*5itBreakpoints9 Follo7ing 74ic4 t4e follo7ing popup7ill $e 5isplaye5

By 5efault t4e $reakpoint 7ill $e acti#ate5 for e#eryro79 (f t4e nee5 arises; 7e can c4ange t4is to any

#alue an5 t4e 5e$ugger 7ill pause t4e o7 of recor5sas an5 74en it reac4es t4at num$er*: (f 7e set t4e #alue for @*#ery + )o7s to .; t4e5e$ugger 7ill 5isplay e#ery recor5 t4at is a multipleof .; i9e9; .t4 recor5s; /2t4 recor5; /.t4 recor5 an soon99

For t4e sake of t4is eample 7eGll c4ange t4e #alue to0


57/108

>e are 5one setting up t4e $reakpoints for t4e 5e$ugger;so lets compile our 6o$ no79 "nce compile5; click on 5e$ugmenu option an5 select @,"9 "r simply press F.

T4e follo7ing popup appears an5 7e are aske5 to run our6o$9 ,o a4ea5 an5 run it -

Screens4ot of 6o$ running in 5e$ug mo5e


58/108

+o7 our 6o$ is running in 5e$ug mo5e an5 as soon as t4e0n5 recor5 passes t4roug4 t4e link


59/108


60/108

Select t4e @)un to *n5 option on t4e 5e$ugger 7in5o7an5 t4e 6o$ 7ill 8nis4 an5 7ait for t4e net user input9

%lose t4e 7in5o7 an5 go to t4e menu option @5e$ug9 (nt4at select @%lear All Breakpoints9 T4is 7ill remo#e all$reakpoints t4at you 4a#e set in your 6o$9


61/108

D$T$ST$'! &O##O%!""O"S./$"%I%'S $%D SO)UTIO%S

0 1/9 >4ile connecting @)emote Desktop; Terminal ser#er4as $een ecee5e5 maimum num$er of allo7e5connections

S"L: (n %omman5 Prompt; type mstsc #: ip a55ress ofser#er a5min ") mstsc #: ipa55ress console09 SRL02.0/+9 *rror occurre5 processing a con5itional

compilation 5irecti#e near string9 )eason co5eKrc9Follo7ing link 4as issue 5escription:4ttp:pic954e9i$m9cominfocenter5$0lu7#1rin5e96sptopicK0Fcom9i$m95$09lu79messages9sql95oc0F5oc0Fmsql02.0/n94tml?9 SU')*TA(L*)',)"&P'B)D(,*;/: runLocallyarning 7ill $e 5isappeare5 $y regenerating SUFile9

9 >4ile connecting to Datastage client; t4ere is noresponse; an5 74ile restarting 7e$sp4ere ser#ices;follo7ing errors occurre5rootIpoluloro2/ $inWX 9stopSer#er9s4 ser#er/ -user7asa5min -pass7or5 >asa5min22EADM&2//3(: Tool information is $eing logge5 in 8le opti$m>e$Sp4ereAppSer#erpro8les5efaultlogsser#er/stopSer#er9logADM&2/0E(: Starting tool 7it4 t4e 5efault pro8leADM&?/22(: )ea5ing con8guration for ser#er: ser#er/ADM&2///*: Program eiting 7it4 error:

6a#a9management9!M)untime*ception:ADM+2200*: Access is 5enie5 for t4e stop operation onSer#er MBean $ecause of insuYcient or emptycre5entials9ADM&//?*: Verify t4at username an5 pass7or5information is on t4e comman5 line


62/108

ADM&20//(: *rror 5etails may $e seen in t4e 8le: opti$m>e$Sp4ereAppSer#erpro8les5efaultlogsser#er/stopSer#er9logS"L: >asa5min an5 Meta pass7or5s nee5s to $e reset

an5 comman5s are $elo799 rootIpoluloro2/ $inWX c5opti$m(nformationSer#erASBSer#er$inrootIpoluloro2/ $inWX 9AppSer#erA5min9s4 -7as -user7asa5min-pass7or5 >asa5min22E(nfo >AS instance +o5e:poluloro2/Ser#er:ser#er/up5ate5 7it4 ne7 user information(nfo Meta5ataSer#er 5aemon script up5ate5 7it4 ne7 userinformation

rootIpoluloro2/ $inWX 9AppSer#erA5min9s4 -7as -usermeta -pass7or5 meta22E(nfo >AS instance +o5e:poluloro2/Ser#er:ser#er/up5ate5 7it4 ne7 user information(nfo Meta5ataSer#er 5aemon script up5ate5 7it4 ne7 userinformation.9 @T4e speci8e5 8el5 5oesnGt eist in #ie7 a5apte5sc4emaS"L: Most of t4e time @T4e speci8e5 8el5: 5oes

not eist in t4e #ie7 a5apte5 sc4ema occurre5 74en 7emisse5 a 8el5 to map9 *#ery stage 4as got an output ta$ ifuse5 in t4e $et7een of t4e 6o$9 Make sure you 4a#emappe5 e#ery single 8el5 require5 for t4e net stage9Sometime e#en after mapping t4e 8el5s t4is error can $eoccurre5 an5 one of t4e reason coul5 $e t4at t4e #ie7a5apter 4as not linke5 t4e input an5 output 8el5s9 Cencein t4is case t4e require5 8el5 mapping s4oul5 $e 5roppe5an5 recreate59

!ust to gi#e an insig4t on t4is; t4e #ie7 a5apter is anoperator 74ic4 is responsi$le for mapping t4e input an5output 8el5s9 Cence DataStage creates an instance ofAPT'Vie7A5apter 74ic4 translate t4e components of t4eoperator input interface sc4ema to matc4ing componentsof t4e interface sc4ema9 So if t4e interface sc4ema is not4a#ing t4e same columns as operator input interfacesc4ema t4en t4is error 7ill $e reporte59DATASTA,* %"MM"+ *))")S>A)+(+,S A+D S"L&T("+S 0

/9 +o 6o$s or logs s4o7ing in (BM DataStage Director %lient;4o7e#er 6o$s are still accessi$le from t4e Designer %lient9


63/108

S"L: SyncPro6ect cm5 t4at is installe5 7it4 DataStageE9. can $e run to analyQe an5 reco#er pro6ects

SyncPro6ect -(SFile islogin -pro6ect 5stage? 5stage. Fi

09 %ASC"&T'DTL: (n#ali5 property #alue%onnectionData$ase e$sp4ere ser#ices9

39 T4e connection 7as refuse5 or t4e )P% 5aemon is notrunning


64/108

)%: T4e 5sprc5 process must $e running in or5er to $ea$le to login to DataStage9

(f you restart DataStage; $ut t4e socket use5 $y t4e5srpc5 A(T; or %L"S*'>A(T9T4ese 7ill pre#ent t4e 5sprc5 from starting9 T4e sockets7it4 status F(+'>A(T or %L"S*'>A(T 7ill e#entually timeout an5 5isappear; allo7ing you to restart DataStage9

T4en )estart DS*ngine9


65/108


66/108

"utput link

T4e output link 4as t7o columns9 T4e 6o$name an5 one fort4e function names9 Per input ro7; t4ere are /2 ro7soutputte59 *#en if t4e 8el5s are +&LL9 Di#i5e t4e8el5names 7it4 spaces; not 7it4 any ot4er 5elimiter9

DATASTA,* PA)ALL*L )"&T(+*S ?: T" %C*%U (F A ,(V*+ DAT* (S VAL(D ") +"T

)equirement: To c4eck if a gi#en 5ate is #ali5 or not9

Datastage 4as a $uilt in function


67/108

For eample if 7e 4a#e to test for string 2/2//111 t4esynta 7ill $e

IsValid('DATE', StringToDate("01/01/1999","%mm/%dd/%yyyy"))

(f t4e 5ate is #ali5 5ate; it 7ill return one else Qero9 But; if

7e 4a#e t4e 5ate as @///111b t4oug4 t4e 5ate is #ali5 it7ill return 2; as 5ate an5 mont4 8el5s 4a#e only one 5igit9T4erefore; 7e 7ill create a Parallel )outine; 74ic4 7illtake 5ate in #arious format #iQ9 mm55yyyy; m55yyyy;mm5yyyy; mm55yy; m5yy etc an5 return it in a 8e5format of mm55yyyy; 74ic4 7e can use in t4e functionspeci8e5 a$o#e9

%%\\ %o5e:

T4e co5e 7ritten $elo7 takes a 5ate string in any of t4eformats mentione5 a$o#e an5 returns a 5ate string inmm55yyyy format


68/108

//handling the date part

if ((fst-str)==1)

{

strcpy(date,"0");

strncat(date,str,1);

}

else if ((fst-str)==2)

{

strncpy(date,str,2);

}

else

return ("Invalid");

//handling the month part

if ((lst-fst-1)==1)

{

strcat(date,"0");

strncat(date,fst+1,1);

}

else if ((lst-fst-1)==2)

{

strncat(date,fst+1,2);


69/108

}

else

return ("Invalid");

//handling the year part

if ((len-(lst-str)-1)==2)

{

decade=atoi(lst+1);

if (decade>=50)

{

strcat(date,"19");

}

else

{

strcat(date,"20");

}

strncat(date,lst+1,2);

}

else if ((len-(lst-str)-1)==4)

{

strncat(date,lst+1,4);

}


70/108

else

return ("Invalid");

return(date);

}

%reate t4e 9o 8le


71/108

As 7e are passing a 5ate string as input 7e 4a#e a55e5 a

parameter 4ere9Sa#e an5 %lose it9

&sage in Datastage !o$

>e 4a#e create5 a simple 6o$ to 5emonstrate of t4eParallel )outine @5ateformat 74ic4 7e 4a#e 6ust create59

T4e 6o$ 4as t4ree stages:

/9SAMPL*'DAT*S'inp stage pro#i5es sample 5ates fortesting9


72/108

09MAST*)'T)A+SF")M*)'tra stage calls t4e parallelroutine9

?9"&T'l5r loa5s t4e 5ata into a ta$ 5elimite5 8le9The fgure shows the sample data in SAML!"#AT!S"inp

The fgure shows the properties o the

MAST!$"T$A%S&'$M!$"tra stage


73/108

Various transformation rules are clearly seen; t4e place74ere @5ateformat is calle5 is un5erline5 in re5 color

The fgure shows the data loaded in the output fle.


74/108

DAT*'(+: t4is column consists of all t4e recor5s from t4einput 8le as it is9

sample_dates_inp.DATE_IN

DAT*'VAL(D: t4is column s4o7 t4e #alue returne5 $y t4e

(S'VAL(D function 7it4out using t4e parallel routine@5ateformat create5 $y us9

IsValid('DATE', StringToDate(sample_dates_inp.DATE_IN,"%mm/%dd/%yyyy"))

M"D(F(*D'DAT*: t4is column s4o7s t4e 5ate returne5 $yroutine @5ateformat 74en DAT*'(+ is gi#en as t4e input9

dateformat(sample_dates_inp.DATE_IN)

M"D(F(*D'DAT*'VAL(D: t4is column s4o7 t4e #aluereturne5 $y t4e (S'VAL(D function AFT*) using t4e parallelroutine @5ateformat create5 $y us9

IsValid('DATE', StringToDate(dateformat(sample_dates_inp.DATE_IN),"%mm%dd

%yyyy"))

>e can see t4at all 5ate formats 4a#e $een i5enti8e5 as#ali5 5ates after using t4e routine9

Let us see one more usage of Datastage parallel routinesin my net post9

D$T$ST$'! &O##O%!""O"S./$"%I%'S $%D SO)UTIO%S

0 2/9 >4ile running 9+o5eAgents9s4 start comman5getting t4e follo7ing error: @LoggingAgent9s4 processstoppe5 unepecte5ly

S"L: nee5s to kill LoggingAgentSocket(mpl

Ps ef ^ grep LoggingAgentSocket(mpl arning: A user 5e8ne5 sort operator 5oes notsatisfy t4e requirements9


75/108

S"L: %4eck t4e or5er of sorting columns an5 make sureuse t4e same or5er 74en use 6oin stage after sort to 6oingt7o inputs99 %on#ersion error calling con#ersion routine

timestamp'from'string 5ata may 4a#e $een lost9fm!ournals;/: %on#ersion error calling con#ersion routine5ecimal'from'string 5ata may 4a#e $een lost

S"L: c4eck for t4e correct 5ate format or 5ecimal formatan5 also null #alues in t4e 5ate or 5ecimal 8el5s $eforepassing to 5atastage StringToDate;DateToString;DecimalToString or StringToDecimalfunctions9

.9 To 5isplay all t4e 6o$s in comman5 line

S"L:

cd /opt/ibm/InformationServer/Server/DSEngine/bin

./dsjob -ljobs

39 @*rror trying to query 5sa5mW9 T4ere mig4t $e anissue in 5ata$ase ser#er

S"L: %4eck M*TA connecti#ity95$0 connect to meta


76/108

S"L: )emo#e t4e locks an5 try to run DSJ20B$DP$3$4 ParamNameis not a parameter namein the +ob.

"] DSJ20B$D^$=12 *n&alid MaxNumber&alue.

"F DSJ20B$DT7P2 *nformation or e&ent type %asunrecognized.

"G DSJ20;368VJ6B Job for this JobHandle%as not startedfrom a call to DS)un!o$by the


77/108

CODE ERROR TOKEN DESCRIPTION

A DSJ208623363 8o *nfoSphere DataStage $P* error hasoccurred.

current process.

" DSJ20B$DST$V2 StageNamedoes not refer to a no%nstage in the +ob.

"O DSJ2086T*8ST$V2 *nternal engine error.

" DSJ20B$D=*8U LinkNamedoes not refer to a no%nlin for the stage in -uestion.

"IA DSJ20J6B=6CU2D The +ob is loced by another process.

"II DSJ20J6BD2=2T2D The +ob has been deleted.

"IH DSJ20B$D8$42 *n&alid pro+ect name.

"I> DSJ20B$DT*42 *n&alid StartTimeor EndTime&alue.

"I] DSJ20T*4261T The +ob appears not to ha&e startedafter %aiting for a reasonable length oftime. ($bout >A minutes.)

"IF DSJ20D2C37PT233 ailed to decrypt encrypted &alues.

"IG DSJ2086$CC2SS Cannot get &alues! default &alues ordesign default &alues for any +obexcept the current +ob.

" DSJ2032P23363 Veneral engine error.

"IAA DSJ2086T$D4*81S23 1ser is not an administrator.

"IAI DSJ20*S$D4*8$*=2D ailed to determine %hether user is anadministrator.

"IAH DSJ2032$DP36JP36P23T7 ailed to read property."IA> DSJ20;3*T2P36JP36P23T7 Property not supported.

"IA] DSJ20B$DP36P23T7 1nno%n property name.

"IAF DSJ20P36P86TS1PP63T2D 1nsupported property.

"IAG DSJ20B$DP36P^$=12 *n&alid &alue for this property.

"IA DSJ206SX^*S*B=2=$V ailed to get &alue for 6SX^isible.

"IAO DSJ20B$D28^^$38$42 *n&alid en&ironment &ariable name.


78/108



"IA DSJ20B$D28^^$3T7P2 *n&alid en&ironment &ariable type.

"IIA DSJ20B$D28^^$3P364PT 8o prompt supplied.

"III DSJ2032$D28^^$3D28S ailed to read en&ironment &ariabledenitions.

"IIH DSJ2032$D28^^$3^$=12S ailed to read en&ironment &ariable&alues.

"II> DSJ20;3*T228^^$3D28S ailed to %rite en&ironment &ariabledenitions.

"II] DSJ20;3*T228^^$3^$=12S ailed to %rite en&ironment &ariable&alues.

"IIF DSJ20D1P28^^$38$42 2n&ironment &ariable being addedalready exists.

"IIG DSJ20B$D28^^$3 2n&ironment &ariable does not exist.

"II DSJ2086T1S23D2*82D 2n&ironment &ariable is not user"dened and therefore cannot be

deleted.

"IIO DSJ20B$DB66=2$8^$=12 *n&alid &alue gi&en for a booleanen&ironment &ariable.

"II DSJ20B$D81423*C^$=12 *n&alid &alue gi&en for an integeren&ironment &ariable.

"IHA DSJ20B$D=*ST^$=12 *n&alid &alue gi&en for a listen&ironment &ariable.

"IHI DSJ20P586T*8ST$==2D 2n&ironment &ariable is specic to

parallel +obs %hich are not a&ailable.

"IHH DSJ20*SP$3$==2==*C28C2D ailed to determine if parallel +obs area&ailable.

"IH> DSJ2028C6D2$*=2D ailed to encode an encrypted &alue.

"IH] DSJ20D2=P36J$*=2D ailed to delete pro+ect denition.

"IHF DSJ20D2=P36J*=2S$*=2D ailed to delete pro+ect les.

"IHG DSJ20=*STSCX2D1=2$*=2D ailed to get list of scheduled +obs for

pro+ect.


79/108



"IH DSJ20C=2$3SCX2D1=2$*=2D ailed to clear scheduled +obs forpro+ect.

"IHO DSJ20B$DP36J8$42 *n&alid pro+ect name supplied.

"IH DSJ20V2TD2$1=TP$TX$*=2D ailed to determine default pro+ectdirectory.

"I>A DSJ20B$DP36J=6C$T*68 *n&alid path name supplied.

"I>I DSJ20*8 $=*DP36J2CT=6C$T*68 *n&alid path name supplied.

"I>H DSJ206P28$*=2D ailed to open 1 .$CC618T le.

"I>> DSJ2032$D1$*=2D ailed to loc pro+ect create locrecord.

"I>] DSJ20$DDP36J2CTB=6CU2D $nother user is adding a pro+ect.

"I>F DSJ20$DDP36J2CT$*=2D ailed to add pro+ect.

"I>G DSJ20=*C28S2P36J2CT$*=2D ailed to license pro+ect.

"I> DSJ2032=2$S2$*=2D ailed to release pro+ect create loc

record.

"I>O DSJ20D2=2T2P36J2CTB=6CU2D Pro+ect loced by another user.

"I> DSJ2086T$P36J2CT ailed to log to pro+ect.

"I]A DSJ20$CC618TP$TX$*=2D ailed to get account path.

"I]I DSJ20=6VT6$*=2D ailed to log to 1^ account.

"HAI DSJ2018U86;80J6B8$42 The supplied +ob name cannot befound in the pro+ect.

"IAAI

DSJ20864632 $ll e&ents matching the lter criteriaha&e been returned.

"IAAH

DSJ20B$DP36J2CT ProjectNameis not ano%n *nfoSphere DataStage pro+ect.

"IAA>

DSJ20860D$T$ST$V2 *nfoSphere DataStage is not installedon the system.

"

IAA]

DSJ206P28$*= The attempt to open the +ob failed #

perhaps it has not been compiled.


80/108



"IAAF

DSJ20860424637 ailed to allocate dynamic memory.

"IAAG

DSJ20S23^23023363 $n unexpected or unno%n erroroccurred in the engine.

"IAA

DSJ2086T0$^$*=$B=2 The re-uested information %as notfound.

"IAAO

DSJ20B$D0^23S*68 The engine does not support this&ersion of the *nfoSphere

DataStage $P*."IAA

DSJ20*8C64P$T*B=20S23^23 The engine &ersion is incompatible%ith this &ersion of the *nfoSphereDataStage $P*.

AP( error co5es in numerical or5er

ERROR NUMBER DESCRIPTION

>IHI The *nfoSphere DataStage license has expired.

>I>] The *nfoSphere DataStage user limit has been reached.

OAAII *ncorrect system name or in&alid user name or pass%ordpro&ided.

OAAI Pass%ord has expired.

AP( %ommunication Layer *rror %o5es

I(# "!D(OO3 &O"%!"

Belo7 are a list of re5$ooks t4at are a#aila$le on t4e (BMsite for topics relate5 to Datastage9 (ts pretty useful fort4e person 74o 7ants to kno7 more9 %4eck it out

(nformation Ser#er (nstallation an5 %on8guration ,ui5e

T4is (BM re5$ook pu$lication pro#i5es suggestions; 4intsan5 tips; 5irections; installation steps; c4ecklists of


81/108

prerequisites; an5 con8guration information collecte5from a num$er of (nformation Ser#er eperts9 (t isinten5e5 to minimiQe t4e time require5 to successfullyinstall an5 con8gure (nformation Ser#er9

4ttp:7779re5$ooks9i$m9coma$stractsre5p.1394tml"pen

Preparing for DB0 +ear-)ealtime Business (ntelligence

T4is (BM re5$ook is primarily inten5e5 for use $y (BM%lients an5 (BM Business Partners9 T4e purpose of t4ere5$ook is to 5iscuss t4e topic of preparing for near-realtime $usiness intelligence


82/108

DataStaged an5 relate5 tec4nologies using a typical retailin5ustry scenario9 (t is aime5 at (T arc4itects; (nformationManagement specialists; an5 (nformation (ntegrationspecialists responsi$le for 5e#eloping (BM (nformation

Ser#er on a Microsoftd >in5o7sd 022? platform94ttp:7779re5$ooks9i$m9coma$stractssg0.394tml"pen

Deploying a ,ri5 Solution 7it4 t4e (BM (nfoSp4ere(nformation Ser#er

T4is (BMd )e5$ooksd pu$lication 5ocuments t4emigration of an (BM (nfoSp4ere (nformation Ser#erVersion E929/ parallel frame7ork implementation on a )e5

Cat *nterprise Linud


83/108

G!%!$AL tab o a arallel $outine

Description of #arious 8el5s in t4e ,eneral ta$:

)outine name: T4e name of t4e routine as it 7ill appear int4e DataStage )epository9

%ategory: T4e name of t4e category 74ere t4e routine isstore5 in t4e )epository9 (f you 5o not enter a name in t4is8el5 74en creating a routine; t4e routine is create5 un5ert4e main )outines $ranc49

*ternal su$routine name: T4e % function t4at t4is routineis calling


84/108

A$G(M!%TS tab o a arallel $outine

&n5er t4e arguments ta$ enter t4e names of all t4eparameters t4at t4e function takes as input 7it4 t4eircorrespon5ing types9


85/108

T4is series of posts eplains in 5etail t4e process ofcreating a routine 7it4 t4e 4elp of eamples9But $eforeco5ing anyt4ing; please c4eck out t4e post compilerspeci8cation9

#oding in #


86/108

t4is co5e 74en calle5 from DataStage can return t4estring9

After making t4e c4anges t4e a$o#e co5e 7ill look liket4is:

#include

char * hello ()

{

char * s="hello world"

return s;

}

+o7 you 7ill 4a#e to create an o$6 8le for t4e a$o#e co5et4at can $e 5one $y using t4e comman5

/usr/vacpp/bin/xlC_r -O -c -qspill=32704

(n t4is case it 7ill $e

/usr/vacpp/bin/xlC_r -O -c -qspill=32704 hellworld.cpp

(f t4e co5e is correct t4e comman5 7ill create a 8lename5 4ello7orl59o in t4e same fol5er9

(n net postin t4is series; 7e 7ill create a DataStageroutine an5 link it to t4e o$6ect co5e t4at 7e 5e#elope59

&O#I)!" S!&IFI&$TIO%I% D$T$ST$'!

"ne of t4e pre-requisites to 7riting Parallel rotines using%%\\ is to c4eck for t4e compiler speci8cation9 T4is postgui5es in seeing t4e %%\\ compiler a#aila$le in ourinstallation9 To 8n5 compiler; 7e nee5 to login intoA5ministrator an5 go to t4e pat4 $elo7
http://dwbitutorials.wordpress.com/2013/03/12/datastage-parallel-routines-2/http://dwbitutorials.wordpress.com/2013/03/12/datastage-parallel-routines-2/


87/108

DataStage Z A5ministrator Z Properties Z *n#ironmentZ Parallel Z %ompiler

Step /: Log in to Data Stage A5ministration9

Step 0: %lick on t4e Pro6ects ta$9

Step ?: %lick on t4e Properties ta$


88/108

Step : %lick on *n#ironment9

Step .: Select t4e compiler option un5er Parallel ta$9


89/108

Step 3: %ompiler speci8cation of 5atastage is present4ere9

.cpp


90/108

(n %ase "f ,+& compiler 7e compile a cpp co5e like t4is

)*ompiler %ame+ ) *ompile option+ ) cpp &ile name+

e.g. g++ -o test.cpp

%ompiling comman5 for 5atastage routine 7ill $e:-,usr,-acpp,bin,l*"r /' /c /qspill012345 test

T4is comman5 7ill create an o$6ect 8le test9o

6#isclaimer7 / The *ompiler name and options areinstallation dependent.8

D$T$ST$'! )!$"%I%'S56:"!US$(I)IT7 I% D$T$ST$'!

Belo7 are some of t4e 7ays t4roug4 74ic4 reusa$ility can$e ac4ie#e5 in DataStage9

Multiple (nstance !o$s9

Parallel S4are5 %ontainer

After-6o$ )outines9

Multiple-(nstance !o$s:,enerally; in 5ata 7are4ousing; t4ere 7oul5 $escenariosrequirement to loa5 t4e maintenance ta$les ina55ition to t4e ta$les in 74ic4 7e loa5 t4e actual 5ata9T4ese maintenance ta$les typically 4a#e 5ata like

/9%ount of recor5s loa5e5 in t4e actual target ta$les909Batc4 Loa5 +um$erDay on 74ic4 t4e loa5 occurre5

?9T4e last processe5 sequence i5 for a particular loa5T4is information can also $e use5 for 5ata count#ali5ation; email noti8cation to users9 So; for loa5ingt4ese maintenance ta$les; it is al7ays suggeste5 to5esign a single multiple-instance 6o$ instea5 of 5ierent

6o$s 7it4 t4e same logic9 (n t4e !o$ Properties ta$; c4eckt4e option @Allo7 Multiple (nstance9 T4e 6o$ acti#ity74ic4 attac4 a multiple-instance 6o$ s4o7s an a55itionalta$ @(n#ocation (5 9So 7e can 4a#e 5ierent 6o$sequences triggering t4e same 6o$ 7it4 5ierentin#ocation i5s9 +ote t4at a 6o$ triggere5 $y 5ierent


91/108

sequences 7it4 same in#ocation i5 7ill a$ort eit4er one or$ot4 of t4e 6o$ sequences9

Parallel S4are5 %ontainer:

A s4are5 container 7oul5 allo7 a common logic to $es4are5 across multiple 6o$s9 *na$le )%P at t4e stageprece5ing t4e S4are5 container stage9For t4e purpose ofs4aring across #arious 6o$s; 7e 7ill not propagate all t4e

column meta5ata from t4e 6o$ into t4e stages present ins4are5 container9 Also ;ensure t4at 6o$s are re-compile574en t4e s4are5 container is c4ange59

(n our pro6ect; 7e 4a#e use5 s4are5 container forimplementing t4e logic to generate unique sequence i5 foreac4 recor5 t4at is loa5e5 into t4e target ta$le9 >e 7ill5iscuss 5ierent 7ays to generate sequence i5 e mig4t 4a#e a scenario 74ere in 7e s4oul5nGt 4a#e anyof t4e input recor5s to $e re6ecte5 $y any of t4e stages int4e 6o$9 So 7e 5esign a 6o$ 74ic4 4a#e re6ect links for5ierent stages in t4e 6o$ an5 t4en co5e a common after-

6o$ routine 74ic4 counts t4e num$er of recor5s in t4e

re6ect links of t4e 6o$ an5 a$orts t4e 6o$ 74en t4e countecee5s a pre-5e8ne5 limit9


92/108

T4is routine can $e parameterise5 for stage an5 linknames an5 can t4en $e re-use5 for 5ierent 6o$s9

D$T$ST$'! )!$"%I%'S51: US$'!

OF $T8O)D8(OU%D!D8)!%'THor &ariable"length elds! the parallel engine al%ays allocates thespace e-ui&alent to their maximum specied lengths. This isbecause the engine can easily determine the eld boundaries%ithout ha&ing to loo for delimiters! %hich %ill ease the CP1processing. Xo%e&er! the *:6 operations %ill be costlier. Therefore!if there are many &alues! %hich are much shorter than theirmaximum length! then there is lot of unused space! %hich mo&es

around bet%een the operators! datasets and xed"%idth les. Thisis a huge performance hit for all stages especially sort stage!%hich consumes a lot of scratch space! bu?er memory! CP1resources. This is e&en more dangerous if %e ha&e many of suchcolumns.

This is common for elds lie address! %here lengths allocated aremuch higher than the a&erage length of an address. 6ne %ay topre&ent this scenario is by setting the en&ironmental&ariableAPT'"LD'B"&+D*D'L*+,TC9

Disa5#antages:Some parallel datasets generated %ith*nfoSphere\ DataStage\ .A.I and later releases re-uire moredis space %hen the columns are of type ^arChar %hen comparedto .A. This is due to changes added for performanceimpro&ements for bounded length ^arChars in .A.I.

Set $PT06=D0B618D2D0=28VTX to any &alue to re&ert to pre".A.I storage beha&ior %hen using bounded length &archars.Setting this &ariable can ha&

Documents

Datastage Performance Guide