Datastage Performance Guide

  • Upload
    joshi

  • View
    244

  • Download
    1

Embed Size (px)

Citation preview

  • 7/24/2019 Datastage Performance Guide

    1/108

    DataStage Performance Tuning

    Performance Tuning - BasicsBasicsParallelism Parallelism in DataStage Jobs should be optimized ratherthan maximized. The degree of parallelism of a DataStage Job isdetermined by the number of nodes that is dened in the Congurationile! for example! four"node! eight #node etc. $ conguration le %ith alarger number of nodes %ill generate a larger number of processes and%ill in turn add to the processing o&erheads as compared to aconguration le %ith a smaller number of nodes. Therefore! %hilechoosing the conguration le one must %eigh the benets of increasedparallelism against the losses in processing e'ciency (increasedprocessing o&erheads and slo% start up time).*deally ! if the amount ofdata to be processed is small ! conguration les %ith less number of

    nodes should be used %hile if data &olume is more ! conguration les%ith larger number of nodes should be used.

    Partioning :Proper partitioning of data is another aspect of DataStage Job design!%hich signicantly impro&es o&erall +ob performance. Partitioning shouldbe set in such a %ay so as to ha&e balanced data ,o% i.e. nearly e-ualpartitioning of data should occur and data se% should be minimized.

    Memory :*n DataStage Jobs %here high &olume of data is processed! &irtual memory

    settings for the +ob should be optimised. Jobs often abort in cases %here asingle looup has multiple reference lins. This happens due to lo% tempmemory space. *n such +obs /$PT0B12304$5*4140424637!/$PT0468*T630S*92 and /$PT0468*T630T*42 should be set tosu'ciently large &alues.

    Performance Analysis of Various stages inDataStag

    Sequential File Stage -

    The se-uential le Stage is a le Stage. *t is the most common *:6 Stageused in a DataStage Job. *t is used to read data from or %rite data to oneor more ,at iles. *t can ha&e only one input lin or one 6utput lin .*t canalso ha&e one re+ect lin. ;hile handling huge &olumes of data! this Stagecan itself become one of the ma+or bottlenecs as reading and %ritingfrom this Stage is slo%.Se-uential les should be used in follo%ingconditions;hen %e are reading a ,at le (xed %idth or delimited) from18*5 en&ironment %hich is TPed from some external systems;hen some18*5 operations has to be done on the le Don

  • 7/24/2019 Datastage Performance Guide

    2/108

    node can be increased (default &alue is one).

    Data Set Stage :The Data Set is a le Stage! %hich allo%s reading data from or %ritingdata to a dataset. This Stage can ha&e a single input lin or single 6utput

    lin. *t does not support a re+ect lin. *t can be congured to operate inse-uential mode or parallel mode. DataStage parallel extender +obs useDataset to store data being operated on in a persistent form.Datasets areoperating system les %hich by con&ention has the su'x .dsDatasets aremuch faster compared to se-uential les.Data is spread across multiplenodes and is referred by a control le.Datasets are not 18*5 les and no18*5 operation can be performed on them.1sage of Dataset results in agood performance in a set of lined +obs.They help in achie&ing end"to"end parallelism by %riting data in partitioned form and maintaining thesort order.

    Lookup Stage $ =oo up Stage is an $cti&e Stage. *t is used to perform a looup on anyparallel +ob Stage that can output data. The looup Stage can ha&e areference lin! single input lin! single output lin and single re+ectlin.=oo up Stage is faster %hen the data &olume is less.*t can ha&emultiple reference lins (if it is a sparse looup it can ha&e only onereference lin)The optional re+ect lin carries source records that do notha&e a corresponding input looup tables.=ooup Stage and type of looupshould be chosen depending on the functionality and &olume ofdata.Sparse looup type should be chosen only if primary input data

    &olume is small.*f the reference data &olume is more! usage of =ooupStage should be a&oided as all reference data is pulled in to local memory

    !oin Stage :Join Stage performs a +oin operation on t%o or more datasets input to the+oin Stage and produces one output dataset. *t can ha&e multiple inputlins and one 6utput lin.There can be > types of +oin operations *nner

    Join! =eft:3ight outer Join! ull outer +oin. Join should be used %hen thedata &olume is high. *t is a good alternati&e to the looup stage andshould be used %hen handling huge &olumes of data.Join uses the pagingmethod for the data matching.

    Merge Stage :The 4erge Stage is an acti&e Stage. *t can ha&e multiple input lins! asingle output lin! and it supports as many re+ect lins as input lins. The4erge Stage taes sorted input. *t combines a sorted master data set %ithone or more sorted update data sets. The columns from the records in themaster and update data sets are merged so that the output recordcontains all the columns from the master record plus any additionalcolumns from each update record. $ master record and an update recordare merged only if both of them ha&e the same &alues for the merge ey

    column(s) that you specify. 4erge ey columns are one or more columnsthat exist in both the master and update records. 4erge eys can be more

  • 7/24/2019 Datastage Performance Guide

    3/108

    than one column. or a 4erge Stage to %or properly master dataset andupdate dataset should contain uni-ue records. 4erge Stage is generallyused to combine datasets or les.

    Sort Stage :

    The Sort Stage is an acti&e Stage. The Sort Stage is used to sort inputdataset either in $scending or Descending order. The Sort Stage o?ers a&ariety of options of retaining rst or last records %hen remo&ingduplicate records! Stable sorting! can specify the algorithm used forsorting to impro&e performance! etc. 2&en though data can be sorted on alin! Sort Stage is used %hen the data to be sorted is huge.;hen %e sortdata on lin ( sort : uni-ue option) once the data size is beyond the xedmemory limit ! *:6 to dis taes place! %hich incurs an o&erhead.

    Therefore! if the &olume of data is large explicit sort stage should be usedinstead of sort on lin.Sort Stage gi&es an option on increasing the bu?ermemory used for sorting this %ould mean lo%er *:6 and betterperformance.

    Transformer Stage@The Transformer Stage is an acti&e Stage! %hich can ha&e a single inputlin and multiple output lins. *t is a &ery robust Stage %ith lot of inbuiltfunctionality. Transformer Stage al%ays generates C"code! %hich is thencompiled to a parallel component. So the o&erheads for using atransformer Stage are high. Therefore! in any +ob! it is imperati&e that theuse of a transformer is ept to a minimum and instead other Stages areused! such as@Copy Stage can be used for mapping input lins %ith

    multiple output lins %ithout any transformations. ilter Stage can be usedfor ltering out data based on certain criteria. S%itch Stage can be used tomap single input lin %ith multiple output lins based on the &alue of aselector eld. *t is also ad&isable to reduce the number of transformers ina Job by combining the logic into a single transformer rather than ha&ingmultiple transformers .

    Funnel Stage unnel Stage is used to combine multiple inputs into a single outputstream. But presence of a unnel Stage reduces the performance of a +ob.*t %ould increase the time taen by +ob by >A (obser&ations). ;hen a

    unnel Stage is to be used in a large +ob it is better to isolate itself to one+ob. ;rite the output to Datasets and funnel them in ne% +ob. unnelStage should be run in continuous mode! %ithout hindrance.

    "#erall !o$ Design :;hile designing DataStage Jobs care should be taen that a single +ob isnot o&erloaded %ith Stages. 2ach extra Stage put in a Job corresponds tolesser number of resources a&ailable for e&ery Stage! %hich directlya?ects the Jobs Performance. *f possible! big +obs ha&ing large number ofStages should be logically split into smaller units. $lso if a particular Stage

    has been identied to be taing lot of time in a +ob! lie a transformerStage ha&ing complex functionality %ith a lot of Stage &ariables and

  • 7/24/2019 Datastage Performance Guide

    4/108

    transformations! then the design of +obs could be done in such a %ay thatthis Stage is put in a separate +ob all together (more resources for thetransformer StageEEE). $lso %hile designing +obs! care must be taen thatunnecessary column propagation is not done. Columns! %hich are notneeded in the +ob ,o%! should not be propagated from one Stage to

    another and from one +ob to the next. $s far as possible! 3CP (3untimeColumn Propagation) should be disabled in the +obs. Sorting in a +obshould be taen care try to minimise number sorts in a +ob. Design a +ob insuch a %ay as to combine operations around same sort eys! if possiblemaintain same hash eys. 4ost often neglected option is don

  • 7/24/2019 Datastage Performance Guide

    5/108

    You may get many errors in datastage while compiling the jobs or running the jobs.

    Some of the errors are as follows

    a)Source file not found.If you are trying to read the file, which was not there with that name.

    b)Some times you may get Fatal Errors.

    c) Data type mismatches.his will occur when data type mismaches occurs in the jobs.

    d) Field Si!e errors.

    e) "eta data "ismach

    f) Data type si!e between source and target different

    g) #olumn "ismatch

    i) $ricess time out.If ser%er is busy. his error will come some time.

    Some of the errors in detail:

    ds0Trailer03ec@ ;hen checing operator@ ;hen binding output schema &ariable Kout3ecK@;hen binding output interface eld KTrailerDetail3ecCountK to eldKTrailerDetail3ecCountK@ *mplicit con&ersion from source type KustringK to result typeKstringLmaxMHFFNK@ Possible truncation of &ariable length ustring %hen con&erting to

    string using codepage *S6"OOF"I.

    Solution@* resol&ed changing the extended col under meta data of thetransformer to unicode

    ;hen checing operator@ $ se-uential operator cannot preser&e thepartitioningof the parallel data set on input port A.

    Solution@* resol&ed by changing the preser&e partioning to QclearQ undertransformer properties

    Syntax error@ 2rror in KgroupK operator@ 2rror in output redirection@ 2rror in outputparameters@ 2rror in modify adapter@ 2rror in binding@ Could not nd type@ KsubrecK! line>F

    Solution@*ts the issue of le&el number of those columns %hich %ere being added in transformer.Their le&el number %as blan and the columns that %ere being taen from c? le had it as AH.$dded the le&el number and +ob %ored.

    Out_Trailer: When checking operator: When binding output schema variable "outRec": When binding output interface field

    "STDC_TR!R_RC_C#T" to field "STDC_TR!R_RC_C#T": $mplicit conversion from source t%pe "dfloat" to result

    t%pe "decimal&'()(*": +ossible range,precision limitation-

    C_Trailer: When checking operator: When binding output interface field "Data" to field "Data": $mplicit conversion from

  • 7/24/2019 Datastage Performance Guide

    6/108

    source t%pe "string" to result t%pe "string&ma./0((*": +ossible truncation of variable length string-

    $mplicit conversion from source t%pe "dfloat" to result t%pe "decimal&'()(*": +ossible range,precision limitation-

    Solution: 1sed to transformer function2D3loatToDecimal2- s target field is Decimal- 4% default the output from aggregator

    output is double) getting the above b% using above function able to resolve the 5arning-

    When binding output schema variable "outputData": When binding output interface field "RecordCount" to field

    "RecordCount": $mplicit conversion from source t%pe "string&ma./600*" to result t%pe "int'7": Converting string to number-

    Problem(Abstract)

    &obs that process a large amount of data in a column can abort with this error'

    the record is too big to fit in a bloc( the length re*uested is' ++++, the ma+ bloc( length is' ++++.

    Resolving the problem

    o fi+ this error you need to increase the bloc( si!e to accommodate the record si!e'

    '- !og into Designer and open the 8ob-

    6- Open the 8ob properties99 parameters99add environment variable and select:+T_D31!T_TR#S+ORT_4!OC;_S$ou can set this up to 607?4 but %ou reall% shouldn2t need to go over '?4-#OT: value is in ;4

    3or e.ample to set the value to '?4:

    +T_D31!T_TR#S+ORT_4!OC;_S$

  • 7/24/2019 Datastage Performance Guide

    7/108

    . S51E-I0E1G13/$41DIGE,7' run0ocallyH) did not reach E3F on its input data set ?.

    S30' arning will be disappeared by regenerating S5 File.

    8. hile connecting to Datastage client, there is no response, and while restarting websphere ser%ices,following errors occurred

    rootJpoluloro?7 binKL .

  • 7/24/2019 Datastage Performance Guide

    8/108

    Sometime e%en after mapping the fields this error can be occurred and one of the reason could be that the %iewadapter has not lin(ed the input and output fields. 9ence in this case the re*uired field mapping should bedropped and recreated.

    &ust to gi%e an insight on this, the %iew adapter is an operator which is responsible for mapping the input andoutput fields. 9ence DataStage creates an instance of -$Piew-dapter which translate the components of theoperator input interface schema to matching components of the interface schema. So if the interface schema isnot ha%ing the same columns as operator input interface schema then this error will be reported.

    7)hen we use same partitioning in datastage transformer stage we get the following warning in C.@.= %ersion.

    F#$????8 = inputtfm' Input dataset ? has a partitioning method other than entire specifieddisabling memory sharing.

    his is (nown issue and you can safely demote that warning into informational by adding this warning to $rojectspecific message handler.

    =) arning' - se*uential operator cannot preser%e the partitioning of input data set on input port ?

    1esolution' #lear the preser%e partition flag before Se*uential file stages.

    )DataStage parallel job fails with for(H) failed, 1esource temporarily una%ailable

    3n ai+ e+ecute following command to chec( ma+uproc setting and increase it if you plan to run multiple jobs atthe same time.

    lsattr ME Ml sys? V grep ma+uprocma+uproc 7?=8 "a+imum number of $13#ESSES allowed per user rue

    8)FI$?????? -ggstg' hen chec(ing operator' hen binding input interface field:#/S-#241; to field :#/S-#241;' Implicit con%ersion from source type :string@K; to result type:dfloat;' #on%erting string to number.

    1esolution' use the "odify stage e+plicitly con%ert the data type before sending to aggregator stage.

    @)arning' - user defined sort operator does not satisfy the re*uirements.

    1esolution'chec( the order of sorting columns and ma(e sure use the same order when use join stage after sortto joing two inputs.

    O)F"?????? = Stgtfmheader,7' #on%ersion error calling con%ersion routinetimestampfromstring data may ha%e been lost

    F"?????? 7 +fm&ournals,7' #on%ersion error calling con%ersion routine decimalfromstring datamay ha%e been lost

    1esolution'chec( for the correct date format or decimal format and also null %alues in the date or decimal fieldsbefore passing to datastage StringoDate, DateoString,DecimaloString or StringoDecimal functions.

    C)3S3???77B = &oinsort' hen chec(ing operator' Data claims to already be sorted on thespecified (eys the WsortedT option can be used to confirm this. Data will be resorted as necessary. $erformancemay impro%e if this sort is remo%ed from the flow

    1esolution' Sort the data before sending to join stage and chec( for the order of sorting (eys and join (eys andma(e sure both are in the same order.

    N)F31?????? = 7 &oin3uter' hen chec(ing operator' Dropping component :#/S241;

    because of a prior component with the same name.

    1esolution'If you are using join,diff,merge or comp stages ma(e sure both lin(s ha%e the differnt column names

  • 7/24/2019 Datastage Performance Guide

    9/108

    other than (ey columns

    B)FI$????== 7 ocioraclesource' hen chec(ing operator' hen binding output interface field:"E"4E12-"E; to field :"E"4E12-"E;' #on%erting a nullable source to a nonMnullable result

    1esolution'If you are reading from oracle database or in any processing stage where incoming column is definedas nullable and if you define metadata in datastage as nonMnullable then you will get abo%e issue.if you want to

    con%ert a nullable field to non nullable ma(e sure you apply a%ailable null functions in datastage or in the e+tract*uery.

    D--S-GE #3""32 E1131S

  • 7/24/2019 Datastage Performance Guide

    10/108

    1un :ps Mef V grep dsapisla%e; to chec( if any dsapisla%e processes e+ist. If so, (ill them.

    1un :netstat Ma V grep dsprc; to see if any processes ha%e soc(ets that are ES-40IS9ED, FI2-I, or#03SE-I. hese will pre%ent the dsprcd from starting. he soc(ets with status FI2-I or #03SE-Iwill e%entually time out and disappear, allowing you to restart DataStage.

    hen 1estart DSEngine. Hif abo%e doesnTt wor() 2eeds to reboot the system.

    C. o sa%e Datastage logs in notepad or readable format

    S30' a)

  • 7/24/2019 Datastage Performance Guide

    11/108

    $s Xef V grep 0ogging-gentSoc(etImpl H31)

    $S Xef V grep -gent Hto chec( the process id of the abo%e)

    =. arning' - se*uential operator cannot preser%e the partitioning of input data set on input port ?

    S30' #lear the preser%e partition flag before Se*uential file stages.

    . arning' - user defined sort operator does not satisfy the re*uirements.

    S30' #hec( the order of sorting columns and ma(e sure use the same order when use join stage after sort tojoing two inputs.

    8. #on%ersion error calling con%ersion routine timestampfromstring data may ha%e been lost. +fm&ournals,7'#on%ersion error calling con%ersion routine decimalfromstring data may ha%e been lost

    S30' chec( for the correct date format or decimal format and also null %alues in the date or decimal fieldsbefore passing to datastage StringoDate, DateoString,DecimaloString or StringoDecimal functions.

    @. o display all the jobs in command line

    S30'

    cd

  • 7/24/2019 Datastage Performance Guide

    12/108

    then try to restart it.

    7?. hile attempting to compile job, :failed to in%o(e Gen1unime using $hantom process helper;

    1#'

  • 7/24/2019 Datastage Performance Guide

    13/108

    If any processes are in closewait, then wait for some time, those processes

    will not be %isible.

    Stop the Datastage ser%ices.

    cd ^DS93"E

    .

  • 7/24/2019 Datastage Performance Guide

    14/108

    T%pes of Datasets in Datastage:

    ccording to the scheme above) there are the t5o groups of Datasets 9 persistent and

    virtual-

    The first t%pe) persistent Datasets are marked 5ith *.dse.tensions) 5hile for second t%pe)

    virtual datasets *.ve.tension is reserved- $t2s important to mention) that no -v files might be

    visible in the 1ni. file s%stem) as long as the% e.ist onl% virtuall%) 5hile inhabiting R?

    memor%- .tesion -v itself is characteristic strictl% for OSE 9 the Orchestrate language of

    scriptingF-

    3urther differences are much more significant- +rimaril%) persistent Datasets are being

    stored in Unix files using internal Datastage EE format ) 5hile virtual Datasets are never

    stored on disk9 the% do e.ist 5ithin links) and in format) but in R? memor%- 3inall%)

    persistent Datasets are readable and rewriteable with the DataSet Stage) and virtual

    Datasets - might be passed through in memory-

    ccurateness demands mentioning the third t%pe of Datasets 9 the%2re called filesets) 5hile

    their storage places are diverse 1ni. files and the%2re human9readable- 3ilesets are generall%

    marked 5ith the *.fse.tension-

    There are a fe5 features specific for Datasets that have to be mentioned to complete the

    Datasets polic%- 3irstl%) as a single Dataset contains multiple records) it is obvious that all of

    them must undergo the same processes and modifications- $n a 5ord) all of them must go

    through the same successive stage-

    Secondl%) it should be e.pected that different Datasets usuall% have different schemas)

    therefore the% cannot be treated commonl%-

    ?ore accuratel%) t%picall% Orchestrate Datasets are a single9file eGuivalent for the 5hole

    seGuence of records- With datasets) the% might be sho5n as a one ob8ect- Datasets themselves

    help ignoring the fact that demanded data reall% are compounded of multiple and diverse files

    remaining spread across different sectors of processors and disks of parallel computers- long

    that) comple.it% of programming might be significantl% reduced) as sho5n in the e.ample

    belo5-

  • 7/24/2019 Datastage Performance Guide

    15/108

    Datasets parallel architecture in Datastage:

    +rimar% multiple files 9 sho5n on the left side of the scheme 9 have been bracketed together)5hat resulted in five nodes- While using datasets) all the files) all the nodes might be boiled

    do5n to the onl% one) single Dataset 9 sho5n on the right side of the scheme- Thereupon) %ou

    might program onl% one file and get results on all the input files- That significantl% shorten

    time needed for modif%ing the 5hole group of separate files) and reduce the possibilit% of

    engendering accidental errors- What are its measurable profitsH ?ainl%) significantl%

    increasing speed of applications basing on large data volumes-

    ll in all) is it 5orth95hileH Surel%- speciall%) 5hile talking about technological advance and

    succeeding data9dependence- The% do coerce using larger and larger volumes of data) and 9 as

    a conseGuence 9 s%stems able to rapid cooperate on them became a necessit%-

    Datastage Senarios and solutions

    3ield mapping using Transformer stage:

    ReGuirement:

    field 5ill be right 8ustified Iero filled) Take last 'A characters

    Solution:

    Right"((((((((((":Trim!nk_Jfm_Trans-linkF)'AF

    Scenario ':

    We have t5o datasets 5ith @ cols each 5ith different names- We should create a dataset 5ith

    @ cols = from one dataset and one col 5ith the record count of one dataset-

    We can use aggregator 5ith a dumm% column and get the count from one dataset and do a

    look up from other dataset and map it to the = rd dataset

    Something similar to the belo5 design:

  • 7/24/2019 Datastage Performance Guide

    16/108

    Scenario 6:

    3ollo5ing is the e.isting 8ob design- 4ut reGuirement got changed to: Eead and trailer

    datasets should populate even if detail records is not present in the source file- 4elo5

    8ob don2t do that 8ob-

    Eence changed the above 8ob to this follo5ing reGuirement:

    1sed ro5 generator 5ith a cop% stage- Kiven default value IeroF for col countF coming in

    from ro5 generator- $f no detail records it 5ill pick the record count from ro5 generator-

    We have a source 5hich is a seGuential file 5ith header and footer- Eo5 to remove the header

    and footer 5hile reading this file using seGuential file stage of DatastageH

    Sol:Type command in putty@ sed QIdR/dQ le0namene%0le0name (type

    this in +ob before +ob subroutine then use ne% le in se- stage)

    $3 $ EL SO1RC !$; CO!' 4 #D TRKT !$; CO!' CO!6 ' 6 4'-

    EOW TO CE$L TE$S O1T+1T 1S$#K STK LR$4! $# TR#S3OR?R

    STKH

    $f ke%Change /' Then ' lse stagevaraibleM'

    Suppose that @ 8ob control b% the seGuencer like 8ob ') 8ob 6) 8ob =) 8ob @ Fif 8ob ' have

    '()((( ro5 )after run the 8ob onl% 0((( data has been loaded in target table remaining are notloaded and %our 8ob going to be aborted then-- Eo5 can short out the problem-Suppose 8ob

  • 7/24/2019 Datastage Performance Guide

    17/108

    seGuencer s%nchronies or control @ 8ob but 8ob ' have problem) in this condition should go

    director and check it 5hat t%pe of problem sho5ing either data t%pe problem) 5arning

    massage) 8ob fail or 8ob aborted) $f 8ob fail means data t%pe problem or missing column action

    -So u should go Run 5indo5 9Click9 Tracing9+erformance or $n %our target table

    9general 9 action9 select this option here t5o option

    iF On 3ail 99 commit ) ContinueiiF On Skip 99 Commit) Continue-

    3irst u check ho5 man% data alread% load after then select on skip option then continue and

    5hat remaining position data not loaded then select On 3ail ) Continue ------ gain Run the 8ob

    defiantl% u get successful massage

    Nuestion: $ 5ant to process = files in seGuentiall% one b% one ho5 can i do that- 5hile

    processing the files it should fetch files automaticall% -

    ns:$f the metadata for all the files r same then create a 8ob having file name as parameter

    then use same 8ob in routine and call the 8ob 5ith different file name---or u can create

    seGuencer to use the 8ob--

    +arameteriIe the file name-4uild the 8ob using that parameter

    4uild 8ob seGuencer 5hich 5ill call this 8ob and 5ill accept the parameter for file name-

    Write a 1#$J shell script 5hich 5ill call the 8ob seGuencer three times b% passing different

    file each time-

    R: What Eappens if RC+ is disable H

    $n such case Osh has to perform $mport and e.port ever% time 5hen the 8ob runs and the

    processing time 8ob is also increased---

    Runtime column propagation RC+F: $f RC+ is enabled for an% 8ob and specificall% for those

    stages 5hose output connects to the shared container input then meta data 5ill be propagated

    at run time so there is no need to map it at design time-

    $f RC+ is disabled for the 8ob in such case OSE has to perform $mport and e.port ever% time

    5hen the 8ob runs and the processing time 8ob is also increased-

    Then %ou have to manuall% enter all the column description in each stage-RC+9 Runtime

    column propagation

    Nuestion:

    Source: Target

    no name no name

    ' a)b ' a6 c)d 6 b

    = e)f = c

    source has 6 fields like

    CO?+#> !OCT$O#

    $4? E>D

    TCS 4#

    $4? CEEC! E>D

  • 7/24/2019 Datastage Performance Guide

    18/108

    TCS CE

    $4? 4#

    EC! 4#

    EC! CE

    !$; TE$S-------

    TE# TE O1T+1T !OO;S !$; TE$S----

    Compan% loc count

    TCS E>D =

    4#

    CE

    $4? E>D =

    4#

    CEEC! E>D =

    4#

    CE

    6Finput is like this:

    no)char

    ')a

    6)b

    =)a

    @)b

    0)a

    7)a

    B)b

    A)a

    4ut the output is in this form 5ith ro5 numbering of Duplicate occurence

    output:

    no)char)Count

    "'")"a")"'"

    "7")"a")"6"

    "0")"a")"="

    "A")"a")"@"

    "=")"a")"0"

    "6")"b")"'"

    "B")"b")"6"

    "@")"b")"="

    =F$nput is like this:

    file''(

  • 7/24/2019 Datastage Performance Guide

    19/108

    6(

    '(

    '(

    6(

    =(

    Output is like:

    file6 file=duplicatesF

    '( '(

    6( '(

    =( 6(

    @F$nput is like:

    file'

    '(

    6(

    '('(

    6(

    =(

    Output is like ?ultiple occurrences in one file and single occurrences in one file:

    file6 file=

    '( =(

    '(

    '(

    6(

    6(

    0F$nput is like this:

    file'

    '(

    6(

    '(

    '(

    6(

    =(

    Output is like:file6 file=

    '( =(

    6(

    7F$nput is like this:

    file'

    '

    6

    =

    @

    0

    7B

  • 7/24/2019 Datastage Performance Guide

    20/108

    A

    '(

    Output is like:

    file6oddF file=evenF' 6

    = @

    0 7

    B A

    '(

    BFEo5 to calculate SumsalF) vgsalF) ?insalF) ?a.salF 5ith out using ggregator stage--

    AFEo5 to find out 3irst sal) !ast sal in each dept 5ith out using aggregator stage

    FEo5 man% 5a%s are there to perform remove duplicates function 5ith out using Remove

    duplicate stage--

    Scenario:

    source has 6 fields like

    CO?+#> !OCT$O#

    $4? E>D

    TCS 4#

    $4? CE

    EC! E>D

    TCS CE

    $4? 4#

    EC! 4#

    EC! CE

    !$; TE$S-------

    TE# TE O1T+1T !OO;S !$; TE$S----

    Compan% loc count

    TCS E>D =

    4#

    CE

    $4? E>D =

    4#

    CE

    EC! E>D =

    4#

    CE

    Solution:

  • 7/24/2019 Datastage Performance Guide

    21/108

  • 7/24/2019 Datastage Performance Guide

    22/108

    3ou can get it by using Sort stage Transformer stage RemoveDuplicates Transformer tgt

    -$

    4irst Rea an 'oa t%e ata into your source file# 4or (ample Sequential 4ile &

    !n in Sort stage select $ey c%ange column = True # To 5enerate 5roup is&

    5o to Transformer stage

    Create one stage variable.

    3ou can o t%is by rig%t clic$ in stage variable go to properties an name it as your6is% # 4or eample temp&

    an in epression 6rite as belo6

    if $eyc%ange column =1 t%en column name else temp:)*):column name

    T%is column name is t%e one you 6ant in t%e require column 6it% elimite

    commas.

    -n remove uplicates stage $ey is col1 an set option uplicates retain to> 'ast.in transformer rop col/ an efine / columns li$e col0*col/*col2in col1 erivation give 4iel#"nputColumn*7*7*1& anin col1 erivation give 4iel#"nputColumn*7*7*0& an

    in col1 erivation give 4iel#"nputColumn*7*7*/&

    Scenario:10&Consier t%e follo6ing employees ata as source8employee9i* salary

    1* 1

    0* 0/* /

    2* ;

    Create a

  • 7/24/2019 Datastage Performance Guide

    23/108

    T%e scenario is t%at*

    to t%e out put 2 > t%e recors 6%ic% are common to bot% 1 an 0 s%oul go.

    to t%e output / > t%e recors 6%ic% are only in 1 but not in 0 s%oul go

    to t%e output ; > t%e recors 6%ic% are only in 0 but not in 1 s%oul go.sltn:src1>copy1>>output91#only left table&

    oin#inner type&> ouput91src0>copy0>>output9/#only rig%t table&

    Consier t%e follo6ing employees ata as source8employee9i* salary1* 10* 0/* /

    2* ;

    Scenario:

    Create a Transformer#! ne6 Column on bot% t%e output lin$s an assign avalue as 1 &> 1& !ggregator #Do group by

    using t%at ne6 column&0&loo$up

  • 7/24/2019 Datastage Performance Guide

    24/108

    seq>trans>seq

    create one stage variable as delimiter..an put erivation on stage as DS'in$2.sno : 7*7 : DS'in$2.sname : 7*7 :DS'in$2.mar$1 : 7*7 :DS'in$2.mar$0 : 7*7 : DS'in$2.mar$/

    an o mapping an create one more column count as integer type.

    an put erivation on count column as Count#delimter* 7*7&

    scenario:sname total9vo6els9count!llen 0Scott 1ar 1

    ner Transformer Stage Description:

    total9Eo6els9Count=Count#DS'in$/.last9name*7a7&FCount#DS'in$/.last9name*7e7&

    FCount#DS'in$/.last9name*7i7&FCount#DS'in$/.last9name*7o7&FCount#DS'in$/.last9name*7u7&.

    Scenario:

    1&-n aily 6e r getting some %uge files ata so all files metaata is same 6e %ave to

    loa in to target table %o6 6e can loa8se 4ile ,attern in sequential file

    0& -ne column %aving 1 recors at run time 6e %ave to sen ;t% an @t% recor to

    target at run time %o6 6e can sen8Can get t%roug%*by using G"H comman in sequential file filter option

    Io6 can 6e get 1A mont%s ate ata in transformer stage8se transformer stage after input seq file an try t%is one as constraint intransformer stage :

    DaysSince4romDate#CurrentDate#&* DS'in$/.ate91A&J=;2A -RDaysSince4romDate#CurrentDate#&* DS'in$/.ate91A&J=;2@

    6%ere ate91A column is t%e column %aving t%at ate 6%ic% nees to be less or

    equal to 1A mont%s an ;2A is no. of ays for 1A mont%s an for leap year it is

    ;2@#t%ese numbers you nee to c%ec$&.

    %at is ifferences bet6een 4orce Compile an Compile 8

    Diff b/w Compile and Validate?Compile option only c%ec$s for all manatory requirements li$e lin$ requirements* stage

    options an all. ut it 6ill not c%ec$ if t%e atabase connections are vali.Ealiate is equivalent to Running a

  • 7/24/2019 Datastage Performance Guide

    25/108

    How to FInd Out Duplicate Values Using Transformer?

    You can capture the duplicate records based on (eys using ransformer stage %ariables.

    7. Sort and partition the input data of the transformer on the (eyHs) which defines the duplicate.

    =. Define two stage %ariables, lets say StgPar$re%5ey#olHdata type same as 5ey#ol) and StgPar#ntr as Integer

    with default %alue ?

    where 5ey#ol is your input column which defines the duplicate.

    E+pression for StgPar#ntrH7st stg %arMM maintain order)'

    If DS0in(nn.5ey#ol A StgPar$re%5ey#ol hen StgPar#ntr Z 7 Else 7

    E+pression for StgPar$re%5ey#olH=nd stg %ar)'

    DS0in(nn.5ey#ol

    . 2ow in constrain, if you filter rows where StgPar#ntr A 7 will gi%e you the uni*ue records and if you filter

    StgPar#ntr R 7 will gi%e you duplicate records.

    ?% source is !ike

    Sr_no) #ame

    '()a

    '()b

    6()c

    =()d

    =()e

    @()f

    ?% target Should !ike:

    Target ':Onl% uniGue means 5hich records r onl% onceF

    6()c

    @()f

    Target 6:Records 5hich r having more than ' timeF

    '()a

    '()b

    =()d=()e

    Eo5 to do this in DataStage----

    use aggregator and transformer stages

    source99aggregator99transformat99target

    perform count in aggregator) and take t5o op links in trasformer) filter data count' for one llinkand put count/' for second link-

  • 7/24/2019 Datastage Performance Guide

    26/108

    Senario!

    in m% i,p source i have # no-of records

    $n output i have = targets

    i 5ant o,p like 'st rec goes to 'st targt and

    6nd rec goes to 6nd target and

    =rd rec goes to =rd target again

    @th rec goes to 'st taget ------------ like this

    do this ""5ithout using partition techniGues "" remember it-

    source999trans9999target

    in trans use conditions on constraints

    modempno)=F/'

    modempno)=F/6

    modempno)=F/(

    Senario!

    im having i,p as

    col

    a_b_c

    ._3_$D_KE_$3

    5e hav to mak it as

    col' col 6 col=

    a b c

    . f i

    de gh if

    Transformer

    create = columns 5ith derivation

    col' 3ieldcol)2_2)'F

    col6 3ieldcol)2_2)6F

    col= 3ieldcol)2_2)=F

    3ield function divides the column based on the delimeter)

    if the data in the col is like )4)Cthen

  • 7/24/2019 Datastage Performance Guide

    27/108

    3ieldcol)2)2)'F gives

    3ieldcol)2)2)6F gives 4

    3ieldcol)2)2)=F gives C

    Unix Shell Sripting

    Script for pass5ordless sftp uni.:

    cd ^targetdirectorysftp sftpuserJ^sftpmachine QQE3Fcd sourcedirectory

    binget sourcefilenamee+itE3F

    DataSet in DataStage

    $nside a $nfoSphere DataStage parallel 8ob) data is moved around in data sets- These carr%

    meta data 5ith them) both column definitions and information about the onfigurationthat

    5as in effect 5hen the data set 5as created- $f for e.ample) %ou have a stage 5hich limits

    e.ecution to a subset of available nodes) and the data set 5as created b% a stage using allnodes) $nfoSphere DataStage can detect that the data 5ill need repartitioning-

    $f reGuired) data sets can be landed as persistent data sets) represented b% a Data Set stage

    -This is the most efficient 5a% of moving data bet5een linked 8obs- +ersistent data sets are

    stored in a series of files linked b% a control file note that %ou should not attempt to

    manipulate these files using 1#$J tools such as R? or ?L- l5a%s use the tools provided

    5ith $nfoSphere DataStageF-

    there are the t5o groups of Datasets 9 persistent and virtual-

    The first t%pe) persistent Datasets are marked 5ith *.dse.tensions) 5hile for second t%pe)virtual datasets *.ve.tension is reserved- $t2s important to mention) that no -v files might be

    visible in the 1ni. file s%stem) as long as the% e.ist onl% virtuall%) 5hile inhabiting R?

    memor%- .tesion -v itself is characteristic strictl% for OSE 9 the Orchestrate language of

    scriptingF-

    3urther differences are much more significant- +rimaril%) persistent Datasets are being

    stored in Unix files using internal Datastage EE format ) 5hile virtual Datasets are never

    stored on disk9 the% do e.ist 5ithin links) and in format) but in R? memor%- 3inall%)

    persistent Datasets are readable and rewriteable with the DataSet Stage) and virtual

    Datasets - might be passed through in memory-

  • 7/24/2019 Datastage Performance Guide

    28/108

    data set comprises a descriptor file and a number of other files that are added as the data set

    gro5s- These files are stored on multiple disks in %our s%stem- data set is organiIed in

    terms of partitionsand segments.

    ach partition of a data set is stored on a single processing node- ach data segment contains

    all the records 5ritten b% a single 8ob- So a segment can contain files from man% partitions)and a partition has files from man% segments-

    3irstl%) as a single Dataset contains multiple records) it is obvious that all of them must

    undergo the same processes and modifications- $n a 5ord) all of them must go through the

    same successive stage-

    Secondl%) it should be e.pected that different Datasets usuall% have different schemas)

    therefore the% cannot be treated commonl%-

    lias names of Datasets are

    'F Orchestrate 3ile6F Operating S%stem file

    nd Dataset is multiple files- The% are

    aF Descriptor 3ile

    bF Data 3ile

    cF Control file

    dF Eeader 3iles

    $n Descriptor 3ile) 5e can see the Schema details and address of data-

    $n Data 3ile) 5e can see the data in #ative format-

    nd Control and Eeader files reside in Operating S%stem-

    Starting a Dataset "anager

    #hoose $ools Data Set "anagement% a &rowse 'iles dialog box appears!

    '- #avigate to the director% containing the data set %ou 5ant to manage- 4% convention)

    data set files have the suffi. -ds-

  • 7/24/2019 Datastage Performance Guide

    29/108

    6- Select the data set %ou 5ant to manage and click O;- The Data Set Lie5er appears-

    3rom here %ou can cop% or delete the chosen data set- >ou can also vie5 its schema

    column definitionsF or the data it contains-

    (. S#D $ype )@- Slowly hanging dimension $ype )is a model 5here the 5hole histor% is stored inthe database- n additional dimension record is created and the segmenting bet5een

    the old record values and the ne5 currentF value is eas% to e.tract and the histor% is

    clear-

    The fields 2effective date2 and 2current indicator2 are ver% often used in that dimension

    and the fact table usuall% stores dimension ke% and version number-

    . S#D ) implementation in Datastage7- The 8ob described and depicted belo5 sho5s ho5 to implement SCD T%pe 6 in

    Datastage- $t is one of man% possible designs 5hich can implement this dimension-

    3or this e.ample) 5e 5ill use a table 5ith customers data it2s name is

    D_C1STO?R_SCD6F 5hich has the follo5ing structure and data:

    D0C1ST6423 dimension table before loading

    %&ST'(D

    %&ST' %&ST' %&ST' %&ST' )*%' )*%' )*%'

    +AM*,)"&P

    '(DTP*'

    (D%"&+T)

    '(DV*)S(

    "+*FFDT

    %&))*+T'(+D

    D3B61$

    DreamBaset 2= S P= I

    AI"IA"HAAG 7

    *T(MAA.

    *TLtoolsinfo B( % F( /

    01-21-0223

    $44$A a+atso D S CD I

    H"A"HAAG 7

    *C*=$AirstPactonic D C *T I

    HF"A"HAAG 7

    3D55$H rasir 2= C SU I

    H>"A"HAAG 7

    V$46P$

    Vanpa=TD. D C 1S I

    HI"A"HAAG 7

    VV46P$

    VVelectroni

    2= S 31 I I"A"

    7

  • 7/24/2019 Datastage Performance Guide

    30/108

    cs HAAG

    V=4*$GVlasithlini D S P= I

    I"A"HAAG 7

    V=4P2$ Vlobiteleco TC S * I

    IF"

    A"HAAG 7

    V68D;$F

    Voli$irlines B8 S VB I

    I>"A"HAAG 7

    The most important facts and stages of the C1ST_SCD6 8ob processing:

    The dimension table 5ith customers is refreshed dail% and one of the data sources is a te.t

    file- 3or the purpose of this e.ample the C1ST_$D/T$?0 differs from the one stored inthe database and it is the onl% record 5ith changed data- $t has the follo5ing structure and

    data:

    SCD 6 9 Customers file e.tract:

  • 7/24/2019 Datastage Performance Guide

    31/108

    There is a hashed file Eash_#e5CustF 5hich handles a lookup of the ne5 data coming

    from the te.t file-

    T(('_!ookups transformer does a lookup into a hashed file and maps ne5 and old

    values to separate columns-SCD 6 lookup transformer

    T((6_Check_Discrepacies_e.ist transformer compares old and ne5 values of records and

    passes through onl% records that differ-

    SCD 6 check discrepancies transformer

  • 7/24/2019 Datastage Performance Guide

    32/108

    T((= transformer handles the 1+DT and $#SRT actions of a record- The old record is

    updated 5ith current indictator flag set to no and the ne5 record is inserted 5ith current

    indictator flag set to %es) increased record version b% ' and the current date-

    SCD 6 insert9update record transformer

  • 7/24/2019 Datastage Performance Guide

    33/108

  • 7/24/2019 Datastage Performance Guide

    34/108

    V$46P$

    Vanpa=TD. D C 1S I

    HI"A"HAAG 7

    VV46P$

    VVelectroni

    cs 2= S 31 I

    I"A"

    HAAG 7

    V=4*$GVlasithlini D S P= I

    I"A"HAAG 7

    V=4P2$

    Vlobiteleco TC S * I

    IF"A"HAAG 7

    V68D;$F

    Voli$irlines B8 S VB I

    I>"A"HAAG 7

    I4" I2F3S$9E1ED--S-GE $E1F31"-2#E/2I2G' 3PE1PIE 3F 4ES

    $1-#I#ESI213D/#I32

    Data integration processes are &ery time and resource consuming. The

    amount of data and the size of the datasets are constantly gro%ing but

    data and information are still expected to be deli&ered on"time.

    Performance is therefore a ey element in the success of a Business

    *ntelligence W Data ;arehousing pro+ect and in order to guarantee the

    agreed le&el of ser&ice! management of data %arehouse performance andperformance tuning ha&e to tae a full role during the data %arehouse and

    2T= de&elopment process.

    Tuning! ho%e&er! is not al%ays straightfor%ard. $ chain is only as strong

    as its %eaest lin. *n this context! there are &e crucial domains that

    re-uire attention %hen tuning an *B4 *nfosphere DataStage en&ironment @

    System *nfrastructure

    8et%or

    Database

  • 7/24/2019 Datastage Performance Guide

    35/108

  • 7/24/2019 Datastage Performance Guide

    36/108

    *n order to resol&e any performance issues it is essential to ha&e an

    understanding of the data ,o% %ithin the +obs. To help understand a +ob

    ,o%! a score dump should be taen. This can be done by setting the

    $PT0D14P0SC632 en&ironment &ariable to true prior to running the +ob.

    ;hen enabled! the score dump produces a report %hich sho%s the

    operators! processes and data sets in the +ob and contains information

    about @

    ;here and ho% data %as repartitioned.

    ;hether *B4 *nfoSphere DataStage has inserted extra operators in

    the ,o%.

    The degree of parallelism each operator has run %ith! and on %hich

    nodes.

    ;here data %as bu?ered.

    The score dump information is included in the +ob log %hen a +ob is run

    and is particularly useful in sho%ing %here *B4 *nfoSphere DataStage is

    inserting additional components:actions in the +ob ,o%! in particular extra

    data partitioning and sorting operators as they can both be detrimental to

    performance. $ score dump %ill help to detect super,uous operators and

    amend the +ob design to remo&e them.

    /sage of 1esource Estimation

    Predicting hard%are resources needed to run DataStage +obs in order to

    meet processing time re-uirements can sometimes be more of an art than

    a science.

    ;ith ne% sophisticated analytical information and deep understanding of

    the parallel frame%or! *B4 has added 3esource 2stimation to DataStage

    (and ualityStage) O.x. This can be used to determine the needed system

    re-uirements or to analyze if the current infrastructure can support the

    +obs that ha&e been created.

    ;ithin a +ob design! a ne% toolbar option is a&ailable called 3esource

    2stimation.

  • 7/24/2019 Datastage Performance Guide

    37/108

    This option opens a dialog called 3esource 2stimation. The 3esource

    2stimation is based on a modelization of the +ob. There are t%o types of

    models that can be created@

    Static. The static model does not actually run the +ob to create the

    model. CP1 utilization cannot be estimated! but dis space can. The

    record size is al%ays xed. The best case scenario is considered

    %hen the input data is propagated. The %orst case scenario is

    considered %hen computing record size.

    Dynamic. The 3esource 2stimation tool actually runs the +ob %ith a

    sample of the data. Both CP1 and dis space are estimated. This is

    a more predictable %ay to produce estimates.

    3esource 2stimation is used to pro+ect the resources re-uired to execute

    the +ob based on &arying data &olumes for each input data source.

  • 7/24/2019 Datastage Performance Guide

    38/108

    $ pro+ection is then executed using the model selected. The results sho%

    the total CP1 needed! dis space re-uirements! scratch space

    re-uirements! and other rele&ant information.

    Di?erent pro+ections can be run %ith di?erent data &olumes and each can

    be sa&ed. Vraphical charts are also a&ailable for analysis! %hich allo% the

    user to drill into each stage and each partition. $ report can be generated

    or printed %ith the estimations.

  • 7/24/2019 Datastage Performance Guide

    39/108

    This feature %ill greatly assist de&elopers in estimating the time and

    machine resources needed for +ob execution. This ind of analysis can

    help %hen analyzing the performance of a +ob! but *B4 DataStage also

    o?ers another possibility to analyze +ob performance.

    /sage of $erformance -nalysis

    *solating +ob performance bottlenecs during a +ob execution or e&en

    seeing %hat else %as being performed on the machine during the +ob run

    can be extremely di'cult. *B4 *nfosphere DataStage O.x adds a ne%

    capability called Performance $nalysis.

    *t is enabled through a +ob property on the execution tab %hich collects

    data at +ob execution time. ( 8ote@ by default! this option is disabled ) .

    6nce enabled and %ith a +ob open! a ne% toolbar option! called

    Performance $nalysis! is made a&ailable .

    This option opens a ne% dialog called Performance $nalysis. The rst

    screen ass the user %hich +ob instance to perform the analysis on.

    Detailed charts are then a&ailable for that specic +ob run including@

    Job timeline

    3ecord Throughput

    CP1 1tilization

    Job Timing Job 4emory 1tilization

    Physical 4achine 1tilization (sho%s %hat else is happening o&erall

    on the machine! not +ust the DataStage acti&ity).

    2ach partition

  • 7/24/2019 Datastage Performance Guide

    40/108

    $ report can be generated for each chart.

    1sing the information in these charts! a de&eloper can for instance

    pinpoint performance bottlenecs and re"design the +ob to impro&eperformance.

    *n addition to instance performance! o&erall machine statistics are

    a&ailable. ;hen a +ob is running! information about the machine is also

    collected and is a&ailable in the Performance $nalysis tool including@

    6&erall CP1 1tilization

    4emory 1tilization

    Dis 1tilizationDe&elopers can also correlate statistics bet%een the machine information

    and the +ob performance. iltering capabilities exist to only display specic

    stages.

    The information collected and sho%n in the Performance $nalysis tool can

    easily be analyzed to identify possible bottlenecs. These bottlenecs are

    usually situated in the general +ob design! %hich %ill be described in the

    follo%ing chapter.

    '!%!"$) *O( D!SI'% (!ST "$&TI&!S

    The ability to process large &olumes of data in a short period of time

    depends on all aspects of the ,o% and the en&ironment being optimized

    for maximum throughput and performance. Performance tuning and

    optimization are iterati&e processes that begin %ith +ob design and unit

    tests! proceed through integration and &olume testing! and continue

    throughout the production life cycle of the application. Xere are someperformance pointers@

  • 7/24/2019 Datastage Performance Guide

    41/108

    #olumns and type con%ersions

    3emo&e unneeded columns as early as possible %ithin the +ob ,o%. 2&ery

    additional unused column re-uires additional bu?er memory! %hich can

    impact performance and mae each ro% transfer from one stage to the

    next more expensi&e. *f possible! %hen reading from databases! use a

    select list to read only the columns re-uired! rather than the entire table.

    $&oid propagation of unnecessary metadata bet%een the stages. 1se the

    4odify stage and drop the metadata. The 4odify stage %ill drop the

    metadata only %hen explicitly specied using the D36P clause.

    So only columns that are really needed in the +ob should be used and the

    columns should be dropped from the moment they are not neededanymore. The 6SX0P3*8T0SCX24$S en&ironment &ariable can be set to

    &erify that runtime schemas match the +ob design column denitions.

    ;hen using stage &ariables on a Transformer stage! ensure that their data

    types match the expected result types. $&oid that DataStage needs to

    perform unnecessary type con&ersions as it %ill use time and resources

    for these con&ersions.

    ransformer stages

    *t is best practice to a&oid ha&ing multiple stages %here the functionality

    could be incorporated into a single stage! and use other stage types to

    perform simple transformation operations. Try to balance load on

    Transformers by sharing the transformations across existing Transformers.

    This %ill ensure a smooth ,o% of data.

    ;hen type casting! renaming of columns or addition of ne% columns is

    re-uired! use Copy or 4odify Stages to achie&e this. The Copy stage! for

    example! should be used instead of a Transformer for simple operations

    including @

    Job Design placeholder bet%een stages

    3enaming Columns

    Dropping Columns

    *mplicit (default) Type Con&ersions

    $ de&eloper should try to minimize the stage &ariables in a Transformer

    stage because the performance of a +ob decreases as stage &ariables are

  • 7/24/2019 Datastage Performance Guide

    42/108

    added in a Transformer stage. The number of stage &ariables should be

    limited as much as possible.

    $lso if a particular stage has been identied as one that taes a lot of time

    in a +ob! lie a Transformer stage ha&ing complex functionality %ith a lot ofstage &ariables and transformations! then the design of +obs could be

    done in such a %ay that this stage is put in a separate +ob all together

    (more resources for the Transformer Stage).

    ;hile designing *B4 DataStage Jobs! care should be taen that a single

    +ob is not o&erloaded %ith stages. 2ach extra stage put into a +ob

    corresponds to less resources being a&ailable for e&ery stage! %hich

    directly a?ects the +ob performance. *f possible! complex +obs ha&ing alarge number of stages should be logically split into smaller units.

    Sorting

    $ sort done on a database is usually a lot faster than a sort done in

    DataStage. So # if possible # try to already do the sorting %hen reading

    data from the database instead of using a Sort stage or sorting on the

    input lin. This could also mean a big performance gain in the +ob!

    although it is not al%ays possible to a&oid needing a Sort stage in +obs.

    Careful +ob design can impro&e the performance of sort operations! both in

    standalone Sort stages and in on"lin sorts specied in other stage types!

    %hen not being able to mae use of the database sorting po%er.

    *f data has already been partitioned and sorted on a set of ey columns!

    specify the Ydon

  • 7/24/2019 Datastage Performance Guide

    43/108

    of the Sort stage. The default setting is HA 4B per partition. 8ote that sort

    memory usage can only be specied for standalone Sort stages! it cannot

    be changed for inline (on a lin) sorts.

    Se*uential files

    ;hile handling huge &olumes of data! the Se-uential ile stage can itself

    become one of the ma+or bottlenecs as reading and %riting from this

    stage is slo%. Certainly do not use se-uential les for intermediate storage

    bet%een +obs. *t causes performance o&erhead! as it needs to do data

    con&ersion before %riting and reading from a le. 3ather Dataset stages

    should be used for intermediate storage bet%een di?erent +obs.

    Datasets are ey to good performance in a set of lined +obs. They help in

    achie&ing end"to"end parallelism by %riting data in partitioned form and

    maintaining the sort order. 8o repartitioning or import:export con&ersions

    are needed.

    *n order to ha&e faster reading from the Se-uential ile stage the number

    of readers per node can be increased (default &alue is one). This means!

    for example! that a single le can be partitioned as it is read (e&en though

    the stage is constrained to running se-uentially on the conductor mode).

    This is an optional property and only applies to les containing xed"

    length records. But this pro&ides a %ay of partitioning data contained in a

    single le. 2ach node reads a single le! but the le can be di&ided

    according to the number of readers per node! and %ritten to separate

    partitions. This method can result in better *:6 performance on an S4P

    (Symmetric 4ulti Processing) system.

    *t can also be specied that single les can be read by multiple nodes.

    This is also an optional property and only applies to les containing xed"

  • 7/24/2019 Datastage Performance Guide

    44/108

    length records. Set this option to 7es to allo% indi&idual les to be read

    by se&eral nodes. This can impro&e performance on cluster systems.

    *B4 DataStage no%s the number of nodes a&ailable! and using the xed

    length record size! and the actual size of the le to be read! allocates to

    the reader on each node a separate region %ithin the le to process. The

    regions %ill be of roughly e-ual size.

    The options 3ead rom 4ultiple 8odes and 8umber of 3eaders Per

    8ode are mutually exclusi&e.

    1untime #olumn $ropagation

    $lso %hile designing +obs! care must be taen that unnecessary column

    propagation is not done. Columns! %hich are not needed in the +ob ,o%!

    should not be propagated from one stage to another and from one +ob to

    the next. $s much as possible! 3CP (3untime Column Propagation) should

    be disabled in the +obs.

    &oin, 0oo(up or "erge

    6ne of the most important mistaes that de&elopers mae is to not ha&e&olumetric analyses done before deciding to use Join! =ooup or 4erge

    stages.

    *B4 DataStage does not no% ho% large the data set is! so it cannot mae

    an informed choice %hether to combine data using a Join stage or a

    =ooup stage. Xere is ho% to decide %hich one to use Z

    There are t%o data sets being combined. 6ne is the primary or dri&ing

    data set! sometimes called the left of the +oin. The other dataset are thereference data set or the right of the +oin.

  • 7/24/2019 Datastage Performance Guide

    45/108

    *n all cases! the size of the reference data sets is a concern. *f these tae

    up a large amount of memory relati&e to the physical 3$4 memory size of

    the computer DataStage is running on! then a =ooup stage might crash

    because the reference datasets might not t in 3$4 along %ith e&erything

    else that has to be in 3$4. This results in a &ery slo% performance since

    each looup operation can! and typically %ill! cause a page fault and an

    *:6 operation.

    So! if the reference datasets are big enough to cause trouble! use a +oin. $

    +oin does a high"speed sort on the dri&ing and reference datasets. This can

    in&ol&e *:6 if the data is big enough! but the *:6 is all highly optimized and

    se-uential. $fter the sort is o&er! the +oin processing is &ery fast and ne&er

    in&ol&es paging or other *:6.

    Databases

    The best choice is to use Connector stages if a&ailable for the database.

    The next best choice is the 2nterprise database stages as these gi&e

    maximum parallel performance and features %hen compared to [plug"ine 4a#e successfully implemente5 A,,)*,AT("+ usingT)A+SF")M*) Stage

    L""P(+, (+ T)A+SF")M*) STA,* DATASTA,* E9. : *AMPL* /

  • 7/24/2019 Datastage Performance Guide

    51/108

    Looping in a Transformer 7as ne7 feature t4at 7as intro5uce5 inDataStage E9.9 +o7 7e 5onGt 4a#e to 5esign comple logic toperform looping functions on input 5ata9

    )at4er t4an going on a$out 74at is ne7 an5 74at

    #aria$les or functions 4a#e $een a55e5 to ac4ie#e Loopingin DataStage; 7e 7ill look at a couple of eamples t4at7ill eplain all t4e ne7 functionalities in a simple an5 easy7ay9

    *ample /:9ow to con%ert a single row into multiple rows +o7 you can argue t4at t4is is possi$le using a pi#otstage9 But for t4e sake of t4is article lets try 5oing t4isusing a TransformerH

    Belo7 is a screens4ot of our input 5ata

    %ity State +ame/ +ame0 +ame?

    xy VX Sam Dean ;inchester

    >e are going to rea5 t4e a$o#e 5ata from a sequential 8lean5 transform it to look like t4is

    %ity State +ame

    xy VX Sam

    xy VX Dean

    xy VX ;inchester

    So lets get to t4e 6o$ 5esign

    Step /: )ea5 t4e input 5ata

    Step 0: Logic for Looping in Transformer

  • 7/24/2019 Datastage Performance Guide

    52/108

    (n t4e a56acent image you can see a ne7 $o calle5 Loop%on5ition9 T4is 74ere 7e are going to control t4e loop#aria$les9

    Belo7 is t4e screens4ot 74en 7e epan5 t4e Loop%on5ition $o

    T4e Loop >4ile constraint is use5 to implement afunctionality similar to @>C(L* statement inprogramming9 So; similar to a 74ile statement nee5 to4a#e a con5ition to i5entify 4o7 many times t4e loop issuppose5 to $e eecute59

    To ac4ie#e t4is I(T*)AT("+ system #aria$le 7asintro5uce59 (n our eample 7e nee5 to loop t4e 5ata ?times to get t4e column 5ata onto su$sequent ro7s9

    So lets 4a#e I(T*)AT("+ JK?

  • 7/24/2019 Datastage Performance Guide

    53/108

    +o7 create a ne7 Loop #aria$le 7it4 t4e name Loop+ame

    T4e 5eri#ation for t4is loop #aria$le s4oul5 $e

    If @ITERATION=1 Then DSLink2.Name1 Else If @ITERATION=2 Then DSLink2.Name2

    Else DSLink2.Name3

    Belo7 is a screens4ot illustrating t4e same

    +o7 all 7e 4a#e to 5o is map t4is Loop #aria$le Loop+ame

    to our output column +ame

    Lets map the output to a sequential fle stage and see ithe output is a desired.

  • 7/24/2019 Datastage Performance Guide

    54/108

    After running t4e 6o$; 7e 5i5 a #ie7 5ata on t4e outputstage an5 4ere is t4e 5ata as 5esire59

    Making some t7eaks to t4e a$o#e 5esign 7e canimplement t4ings like

    A55ing ne7 ro7s to eisting ro7s

    Splitting 5ata in a single column to multiple ro7s an5many more suc4 stu99

    $"$))!) D!(U''I%' I%

    D$T$ST$'! +,-

    (BM 4as a55e5 a ne7 5e$ug feature for DataStage fromt4e E9. #ersion9 (t pro#i5es $asic 5e$ugging features fortesting5e$ugging t4e 6o$ 5esign9

    T4is feature is use5 from t4e 5esigner client9 So lets 6ump

    into creating a simple 6o$ an5 start 5e$ugging itH

  • 7/24/2019 Datastage Performance Guide

    55/108

    As you can see t4e a$o#e 6o$ 5esign is #ery simpleN ( 4a#ea ro7 generator stage t4at 7ill generate /2 recor5s fort4e column @num$er

    An5 in t4e transformer (G#e 4ar5-co5e5 a column calle5@name 7it4 t4e #alue OAB%D*F,G

    T4e 8nal stage is t4e peek stage t4at 7e all 4a#e $eenusing for 5e$ugging our 6o$ 5esign all t4ese years9+o7 t4at 7e un5erstan5 t4e 6o$ 5esing; 7eGll start5e$ugging it9

    A55 $reakpoints to links

    )ig4t click on t4e link 74ere you 7is4 to c4eck5e$ug

    t4e 5ata an5 t4e follo7ing 5rop 5o7n menu appears

  • 7/24/2019 Datastage Performance Guide

    56/108

    Select t4e option; @Toggle Breakpoint

    "nce selecte5 a small icon appears on t4e link ass4o7n in t4e $elo7 screen s4ot

    >e can also e5it t4e Breakpoint; to 5o so 7e 4a#e torig4t click on t4e link again an5 select @*5itBreakpoints9 Follo7ing 74ic4 t4e follo7ing popup7ill $e 5isplaye5

    By 5efault t4e $reakpoint 7ill $e acti#ate5 for e#eryro79 (f t4e nee5 arises; 7e can c4ange t4is to any

    #alue an5 t4e 5e$ugger 7ill pause t4e o7 of recor5sas an5 74en it reac4es t4at num$er*: (f 7e set t4e #alue for @*#ery + )o7s to .; t4e5e$ugger 7ill 5isplay e#ery recor5 t4at is a multipleof .; i9e9; .t4 recor5s; /2t4 recor5; /.t4 recor5 an soon99

    For t4e sake of t4is eample 7eGll c4ange t4e #alue to0

  • 7/24/2019 Datastage Performance Guide

    57/108

    >e are 5one setting up t4e $reakpoints for t4e 5e$ugger;so lets compile our 6o$ no79 "nce compile5; click on 5e$ugmenu option an5 select @,"9 "r simply press F.

    T4e follo7ing popup appears an5 7e are aske5 to run our6o$9 ,o a4ea5 an5 run it -

    Screens4ot of 6o$ running in 5e$ug mo5e

  • 7/24/2019 Datastage Performance Guide

    58/108

    +o7 our 6o$ is running in 5e$ug mo5e an5 as soon as t4e0n5 recor5 passes t4roug4 t4e link

  • 7/24/2019 Datastage Performance Guide

    59/108

  • 7/24/2019 Datastage Performance Guide

    60/108

    Select t4e @)un to *n5 option on t4e 5e$ugger 7in5o7an5 t4e 6o$ 7ill 8nis4 an5 7ait for t4e net user input9

    %lose t4e 7in5o7 an5 go to t4e menu option @5e$ug9 (nt4at select @%lear All Breakpoints9 T4is 7ill remo#e all$reakpoints t4at you 4a#e set in your 6o$9

  • 7/24/2019 Datastage Performance Guide

    61/108

    D$T$ST$'! &O##O%!""O"S./$"%I%'S $%D SO)UTIO%S

    0 1/9 >4ile connecting @)emote Desktop; Terminal ser#er4as $een ecee5e5 maimum num$er of allo7e5connections

    S"L: (n %omman5 Prompt; type mstsc #: ip a55ress ofser#er a5min ") mstsc #: ipa55ress console09 SRL02.0/+9 *rror occurre5 processing a con5itional

    compilation 5irecti#e near string9 )eason co5eKrc9Follo7ing link 4as issue 5escription:4ttp:pic954e9i$m9cominfocenter5$0lu7#1rin5e96sptopicK0Fcom9i$m95$09lu79messages9sql95oc0F5oc0Fmsql02.0/n94tml?9 SU')*TA(L*)',)"&P'B)D(,*;/: runLocallyarning 7ill $e 5isappeare5 $y regenerating SUFile9

    9 >4ile connecting to Datastage client; t4ere is noresponse; an5 74ile restarting 7e$sp4ere ser#ices;follo7ing errors occurre5rootIpoluloro2/ $inWX 9stopSer#er9s4 ser#er/ -user7asa5min -pass7or5 >asa5min22EADM&2//3(: Tool information is $eing logge5 in 8le opti$m>e$Sp4ereAppSer#erpro8les5efaultlogsser#er/stopSer#er9logADM&2/0E(: Starting tool 7it4 t4e 5efault pro8leADM&?/22(: )ea5ing con8guration for ser#er: ser#er/ADM&2///*: Program eiting 7it4 error:

    6a#a9management9!M)untime*ception:ADM+2200*: Access is 5enie5 for t4e stop operation onSer#er MBean $ecause of insuYcient or emptycre5entials9ADM&//?*: Verify t4at username an5 pass7or5information is on t4e comman5 line

  • 7/24/2019 Datastage Performance Guide

    62/108

    ADM&20//(: *rror 5etails may $e seen in t4e 8le: opti$m>e$Sp4ereAppSer#erpro8les5efaultlogsser#er/stopSer#er9logS"L: >asa5min an5 Meta pass7or5s nee5s to $e reset

    an5 comman5s are $elo799 rootIpoluloro2/ $inWX c5opti$m(nformationSer#erASBSer#er$inrootIpoluloro2/ $inWX 9AppSer#erA5min9s4 -7as -user7asa5min-pass7or5 >asa5min22E(nfo >AS instance +o5e:poluloro2/Ser#er:ser#er/up5ate5 7it4 ne7 user information(nfo Meta5ataSer#er 5aemon script up5ate5 7it4 ne7 userinformation

    rootIpoluloro2/ $inWX 9AppSer#erA5min9s4 -7as -usermeta -pass7or5 meta22E(nfo >AS instance +o5e:poluloro2/Ser#er:ser#er/up5ate5 7it4 ne7 user information(nfo Meta5ataSer#er 5aemon script up5ate5 7it4 ne7 userinformation.9 @T4e speci8e5 8el5 5oesnGt eist in #ie7 a5apte5sc4emaS"L: Most of t4e time @T4e speci8e5 8el5: 5oes

    not eist in t4e #ie7 a5apte5 sc4ema occurre5 74en 7emisse5 a 8el5 to map9 *#ery stage 4as got an output ta$ ifuse5 in t4e $et7een of t4e 6o$9 Make sure you 4a#emappe5 e#ery single 8el5 require5 for t4e net stage9Sometime e#en after mapping t4e 8el5s t4is error can $eoccurre5 an5 one of t4e reason coul5 $e t4at t4e #ie7a5apter 4as not linke5 t4e input an5 output 8el5s9 Cencein t4is case t4e require5 8el5 mapping s4oul5 $e 5roppe5an5 recreate59

    !ust to gi#e an insig4t on t4is; t4e #ie7 a5apter is anoperator 74ic4 is responsi$le for mapping t4e input an5output 8el5s9 Cence DataStage creates an instance ofAPT'Vie7A5apter 74ic4 translate t4e components of t4eoperator input interface sc4ema to matc4ing componentsof t4e interface sc4ema9 So if t4e interface sc4ema is not4a#ing t4e same columns as operator input interfacesc4ema t4en t4is error 7ill $e reporte59DATASTA,* %"MM"+ *))")S>A)+(+,S A+D S"L&T("+S 0

    /9 +o 6o$s or logs s4o7ing in (BM DataStage Director %lient;4o7e#er 6o$s are still accessi$le from t4e Designer %lient9

  • 7/24/2019 Datastage Performance Guide

    63/108

    S"L: SyncPro6ect cm5 t4at is installe5 7it4 DataStageE9. can $e run to analyQe an5 reco#er pro6ects

    SyncPro6ect -(SFile islogin -pro6ect 5stage? 5stage. Fi

    09 %ASC"&T'DTL: (n#ali5 property #alue%onnectionData$ase e$sp4ere ser#ices9

    39 T4e connection 7as refuse5 or t4e )P% 5aemon is notrunning

  • 7/24/2019 Datastage Performance Guide

    64/108

    )%: T4e 5sprc5 process must $e running in or5er to $ea$le to login to DataStage9

    (f you restart DataStage; $ut t4e socket use5 $y t4e5srpc5 A(T; or %L"S*'>A(T9T4ese 7ill pre#ent t4e 5sprc5 from starting9 T4e sockets7it4 status F(+'>A(T or %L"S*'>A(T 7ill e#entually timeout an5 5isappear; allo7ing you to restart DataStage9

    T4en )estart DS*ngine9

  • 7/24/2019 Datastage Performance Guide

    65/108

  • 7/24/2019 Datastage Performance Guide

    66/108

    "utput link

    T4e output link 4as t7o columns9 T4e 6o$name an5 one fort4e function names9 Per input ro7; t4ere are /2 ro7soutputte59 *#en if t4e 8el5s are +&LL9 Di#i5e t4e8el5names 7it4 spaces; not 7it4 any ot4er 5elimiter9

    DATASTA,* PA)ALL*L )"&T(+*S ?: T" %C*%U (F A ,(V*+ DAT* (S VAL(D ") +"T

    )equirement: To c4eck if a gi#en 5ate is #ali5 or not9

    Datastage 4as a $uilt in function

  • 7/24/2019 Datastage Performance Guide

    67/108

    For eample if 7e 4a#e to test for string 2/2//111 t4esynta 7ill $e

    IsValid('DATE', StringToDate("01/01/1999","%mm/%dd/%yyyy"))

    (f t4e 5ate is #ali5 5ate; it 7ill return one else Qero9 But; if

    7e 4a#e t4e 5ate as @///111b t4oug4 t4e 5ate is #ali5 it7ill return 2; as 5ate an5 mont4 8el5s 4a#e only one 5igit9T4erefore; 7e 7ill create a Parallel )outine; 74ic4 7illtake 5ate in #arious format #iQ9 mm55yyyy; m55yyyy;mm5yyyy; mm55yy; m5yy etc an5 return it in a 8e5format of mm55yyyy; 74ic4 7e can use in t4e functionspeci8e5 a$o#e9

    %%\\ %o5e:

    T4e co5e 7ritten $elo7 takes a 5ate string in any of t4eformats mentione5 a$o#e an5 returns a 5ate string inmm55yyyy format

  • 7/24/2019 Datastage Performance Guide

    68/108

    //handling the date part

    if ((fst-str)==1)

    {

    strcpy(date,"0");

    strncat(date,str,1);

    }

    else if ((fst-str)==2)

    {

    strncpy(date,str,2);

    }

    else

    return ("Invalid");

    //handling the month part

    if ((lst-fst-1)==1)

    {

    strcat(date,"0");

    strncat(date,fst+1,1);

    }

    else if ((lst-fst-1)==2)

    {

    strncat(date,fst+1,2);

  • 7/24/2019 Datastage Performance Guide

    69/108

    }

    else

    return ("Invalid");

    //handling the year part

    if ((len-(lst-str)-1)==2)

    {

    decade=atoi(lst+1);

    if (decade>=50)

    {

    strcat(date,"19");

    }

    else

    {

    strcat(date,"20");

    }

    strncat(date,lst+1,2);

    }

    else if ((len-(lst-str)-1)==4)

    {

    strncat(date,lst+1,4);

    }

  • 7/24/2019 Datastage Performance Guide

    70/108

    else

    return ("Invalid");

    return(date);

    }

    %reate t4e 9o 8le

  • 7/24/2019 Datastage Performance Guide

    71/108

    As 7e are passing a 5ate string as input 7e 4a#e a55e5 a

    parameter 4ere9Sa#e an5 %lose it9

    &sage in Datastage !o$

    >e 4a#e create5 a simple 6o$ to 5emonstrate of t4eParallel )outine @5ateformat 74ic4 7e 4a#e 6ust create59

    T4e 6o$ 4as t4ree stages:

    /9SAMPL*'DAT*S'inp stage pro#i5es sample 5ates fortesting9

  • 7/24/2019 Datastage Performance Guide

    72/108

    09MAST*)'T)A+SF")M*)'tra stage calls t4e parallelroutine9

    ?9"&T'l5r loa5s t4e 5ata into a ta$ 5elimite5 8le9The fgure shows the sample data in SAML!"#AT!S"inp

    The fgure shows the properties o the

    MAST!$"T$A%S&'$M!$"tra stage

  • 7/24/2019 Datastage Performance Guide

    73/108

    Various transformation rules are clearly seen; t4e place74ere @5ateformat is calle5 is un5erline5 in re5 color

    The fgure shows the data loaded in the output fle.

  • 7/24/2019 Datastage Performance Guide

    74/108

    DAT*'(+: t4is column consists of all t4e recor5s from t4einput 8le as it is9

    sample_dates_inp.DATE_IN

    DAT*'VAL(D: t4is column s4o7 t4e #alue returne5 $y t4e

    (S'VAL(D function 7it4out using t4e parallel routine@5ateformat create5 $y us9

    IsValid('DATE', StringToDate(sample_dates_inp.DATE_IN,"%mm/%dd/%yyyy"))

    M"D(F(*D'DAT*: t4is column s4o7s t4e 5ate returne5 $yroutine @5ateformat 74en DAT*'(+ is gi#en as t4e input9

    dateformat(sample_dates_inp.DATE_IN)

    M"D(F(*D'DAT*'VAL(D: t4is column s4o7 t4e #aluereturne5 $y t4e (S'VAL(D function AFT*) using t4e parallelroutine @5ateformat create5 $y us9

    IsValid('DATE', StringToDate(dateformat(sample_dates_inp.DATE_IN),"%mm%dd

    %yyyy"))

    >e can see t4at all 5ate formats 4a#e $een i5enti8e5 as#ali5 5ates after using t4e routine9

    Let us see one more usage of Datastage parallel routinesin my net post9

    D$T$ST$'! &O##O%!""O"S./$"%I%'S $%D SO)UTIO%S

    0 2/9 >4ile running 9+o5eAgents9s4 start comman5getting t4e follo7ing error: @LoggingAgent9s4 processstoppe5 unepecte5ly

    S"L: nee5s to kill LoggingAgentSocket(mpl

    Ps ef ^ grep LoggingAgentSocket(mpl arning: A user 5e8ne5 sort operator 5oes notsatisfy t4e requirements9

  • 7/24/2019 Datastage Performance Guide

    75/108

    S"L: %4eck t4e or5er of sorting columns an5 make sureuse t4e same or5er 74en use 6oin stage after sort to 6oingt7o inputs99 %on#ersion error calling con#ersion routine

    timestamp'from'string 5ata may 4a#e $een lost9fm!ournals;/: %on#ersion error calling con#ersion routine5ecimal'from'string 5ata may 4a#e $een lost

    S"L: c4eck for t4e correct 5ate format or 5ecimal formatan5 also null #alues in t4e 5ate or 5ecimal 8el5s $eforepassing to 5atastage StringToDate;DateToString;DecimalToString or StringToDecimalfunctions9

    .9 To 5isplay all t4e 6o$s in comman5 line

    S"L:

    cd /opt/ibm/InformationServer/Server/DSEngine/bin

    ./dsjob -ljobs

    39 @*rror trying to query 5sa5mW9 T4ere mig4t $e anissue in 5ata$ase ser#er

    S"L: %4eck M*TA connecti#ity95$0 connect to meta

  • 7/24/2019 Datastage Performance Guide

    76/108

    S"L: )emo#e t4e locks an5 try to run DSJ20B$DP$3$4 ParamNameis not a parameter namein the +ob.

    "] DSJ20B$D^$=12 *n&alid MaxNumber&alue.

    "F DSJ20B$DT7P2 *nformation or e&ent type %asunrecognized.

    "G DSJ20;368VJ6B Job for this JobHandle%as not startedfrom a call to DS)un!o$by the

  • 7/24/2019 Datastage Performance Guide

    77/108

    CODE ERROR TOKEN DESCRIPTION

    A DSJ208623363 8o *nfoSphere DataStage $P* error hasoccurred.

    current process.

    " DSJ20B$DST$V2 StageNamedoes not refer to a no%nstage in the +ob.

    "O DSJ2086T*8ST$V2 *nternal engine error.

    " DSJ20B$D=*8U LinkNamedoes not refer to a no%nlin for the stage in -uestion.

    "IA DSJ20J6B=6CU2D The +ob is loced by another process.

    "II DSJ20J6BD2=2T2D The +ob has been deleted.

    "IH DSJ20B$D8$42 *n&alid pro+ect name.

    "I> DSJ20B$DT*42 *n&alid StartTimeor EndTime&alue.

    "I] DSJ20T*4261T The +ob appears not to ha&e startedafter %aiting for a reasonable length oftime. ($bout >A minutes.)

    "IF DSJ20D2C37PT233 ailed to decrypt encrypted &alues.

    "IG DSJ2086$CC2SS Cannot get &alues! default &alues ordesign default &alues for any +obexcept the current +ob.

    " DSJ2032P23363 Veneral engine error.

    "IAA DSJ2086T$D4*81S23 1ser is not an administrator.

    "IAI DSJ20*S$D4*8$*=2D ailed to determine %hether user is anadministrator.

    "IAH DSJ2032$DP36JP36P23T7 ailed to read property."IA> DSJ20;3*T2P36JP36P23T7 Property not supported.

    "IA] DSJ20B$DP36P23T7 1nno%n property name.

    "IAF DSJ20P36P86TS1PP63T2D 1nsupported property.

    "IAG DSJ20B$DP36P^$=12 *n&alid &alue for this property.

    "IA DSJ206SX^*S*B=2=$V ailed to get &alue for 6SX^isible.

    "IAO DSJ20B$D28^^$38$42 *n&alid en&ironment &ariable name.

  • 7/24/2019 Datastage Performance Guide

    78/108

    CODE ERROR TOKEN DESCRIPTION

    A DSJ208623363 8o *nfoSphere DataStage $P* error hasoccurred.

    "IA DSJ20B$D28^^$3T7P2 *n&alid en&ironment &ariable type.

    "IIA DSJ20B$D28^^$3P364PT 8o prompt supplied.

    "III DSJ2032$D28^^$3D28S ailed to read en&ironment &ariabledenitions.

    "IIH DSJ2032$D28^^$3^$=12S ailed to read en&ironment &ariable&alues.

    "II> DSJ20;3*T228^^$3D28S ailed to %rite en&ironment &ariabledenitions.

    "II] DSJ20;3*T228^^$3^$=12S ailed to %rite en&ironment &ariable&alues.

    "IIF DSJ20D1P28^^$38$42 2n&ironment &ariable being addedalready exists.

    "IIG DSJ20B$D28^^$3 2n&ironment &ariable does not exist.

    "II DSJ2086T1S23D2*82D 2n&ironment &ariable is not user"dened and therefore cannot be

    deleted.

    "IIO DSJ20B$DB66=2$8^$=12 *n&alid &alue gi&en for a booleanen&ironment &ariable.

    "II DSJ20B$D81423*C^$=12 *n&alid &alue gi&en for an integeren&ironment &ariable.

    "IHA DSJ20B$D=*ST^$=12 *n&alid &alue gi&en for a listen&ironment &ariable.

    "IHI DSJ20P586T*8ST$==2D 2n&ironment &ariable is specic to

    parallel +obs %hich are not a&ailable.

    "IHH DSJ20*SP$3$==2==*C28C2D ailed to determine if parallel +obs area&ailable.

    "IH> DSJ2028C6D2$*=2D ailed to encode an encrypted &alue.

    "IH] DSJ20D2=P36J$*=2D ailed to delete pro+ect denition.

    "IHF DSJ20D2=P36J*=2S$*=2D ailed to delete pro+ect les.

    "IHG DSJ20=*STSCX2D1=2$*=2D ailed to get list of scheduled +obs for

    pro+ect.

  • 7/24/2019 Datastage Performance Guide

    79/108

    CODE ERROR TOKEN DESCRIPTION

    A DSJ208623363 8o *nfoSphere DataStage $P* error hasoccurred.

    "IH DSJ20C=2$3SCX2D1=2$*=2D ailed to clear scheduled +obs forpro+ect.

    "IHO DSJ20B$DP36J8$42 *n&alid pro+ect name supplied.

    "IH DSJ20V2TD2$1=TP$TX$*=2D ailed to determine default pro+ectdirectory.

    "I>A DSJ20B$DP36J=6C$T*68 *n&alid path name supplied.

    "I>I DSJ20*8 $=*DP36J2CT=6C$T*68 *n&alid path name supplied.

    "I>H DSJ206P28$*=2D ailed to open 1 .$CC618T le.

    "I>> DSJ2032$D1$*=2D ailed to loc pro+ect create locrecord.

    "I>] DSJ20$DDP36J2CTB=6CU2D $nother user is adding a pro+ect.

    "I>F DSJ20$DDP36J2CT$*=2D ailed to add pro+ect.

    "I>G DSJ20=*C28S2P36J2CT$*=2D ailed to license pro+ect.

    "I> DSJ2032=2$S2$*=2D ailed to release pro+ect create loc

    record.

    "I>O DSJ20D2=2T2P36J2CTB=6CU2D Pro+ect loced by another user.

    "I> DSJ2086T$P36J2CT ailed to log to pro+ect.

    "I]A DSJ20$CC618TP$TX$*=2D ailed to get account path.

    "I]I DSJ20=6VT6$*=2D ailed to log to 1^ account.

    "HAI DSJ2018U86;80J6B8$42 The supplied +ob name cannot befound in the pro+ect.

    "IAAI

    DSJ20864632 $ll e&ents matching the lter criteriaha&e been returned.

    "IAAH

    DSJ20B$DP36J2CT ProjectNameis not ano%n *nfoSphere DataStage pro+ect.

    "IAA>

    DSJ20860D$T$ST$V2 *nfoSphere DataStage is not installedon the system.

    "

    IAA]

    DSJ206P28$*= The attempt to open the +ob failed #

    perhaps it has not been compiled.

  • 7/24/2019 Datastage Performance Guide

    80/108

    CODE ERROR TOKEN DESCRIPTION

    A DSJ208623363 8o *nfoSphere DataStage $P* error hasoccurred.

    "IAAF

    DSJ20860424637 ailed to allocate dynamic memory.

    "IAAG

    DSJ20S23^23023363 $n unexpected or unno%n erroroccurred in the engine.

    "IAA

    DSJ2086T0$^$*=$B=2 The re-uested information %as notfound.

    "IAAO

    DSJ20B$D0^23S*68 The engine does not support this&ersion of the *nfoSphere

    DataStage $P*."IAA

    DSJ20*8C64P$T*B=20S23^23 The engine &ersion is incompatible%ith this &ersion of the *nfoSphereDataStage $P*.

    AP( error co5es in numerical or5er

    ERROR NUMBER DESCRIPTION

    >IHI The *nfoSphere DataStage license has expired.

    >I>] The *nfoSphere DataStage user limit has been reached.

    OAAII *ncorrect system name or in&alid user name or pass%ordpro&ided.

    OAAI Pass%ord has expired.

    AP( %ommunication Layer *rror %o5es

    I(# "!D(OO3 &O"%!"

    Belo7 are a list of re5$ooks t4at are a#aila$le on t4e (BMsite for topics relate5 to Datastage9 (ts pretty useful fort4e person 74o 7ants to kno7 more9 %4eck it out

    (nformation Ser#er (nstallation an5 %on8guration ,ui5e

    T4is (BM re5$ook pu$lication pro#i5es suggestions; 4intsan5 tips; 5irections; installation steps; c4ecklists of

  • 7/24/2019 Datastage Performance Guide

    81/108

    prerequisites; an5 con8guration information collecte5from a num$er of (nformation Ser#er eperts9 (t isinten5e5 to minimiQe t4e time require5 to successfullyinstall an5 con8gure (nformation Ser#er9

    4ttp:7779re5$ooks9i$m9coma$stractsre5p.1394tml"pen

    Preparing for DB0 +ear-)ealtime Business (ntelligence

    T4is (BM re5$ook is primarily inten5e5 for use $y (BM%lients an5 (BM Business Partners9 T4e purpose of t4ere5$ook is to 5iscuss t4e topic of preparing for near-realtime $usiness intelligence

  • 7/24/2019 Datastage Performance Guide

    82/108

    DataStaged an5 relate5 tec4nologies using a typical retailin5ustry scenario9 (t is aime5 at (T arc4itects; (nformationManagement specialists; an5 (nformation (ntegrationspecialists responsi$le for 5e#eloping (BM (nformation

    Ser#er on a Microsoftd >in5o7sd 022? platform94ttp:7779re5$ooks9i$m9coma$stractssg0.394tml"pen

    Deploying a ,ri5 Solution 7it4 t4e (BM (nfoSp4ere(nformation Ser#er

    T4is (BMd )e5$ooksd pu$lication 5ocuments t4emigration of an (BM (nfoSp4ere (nformation Ser#erVersion E929/ parallel frame7ork implementation on a )e5

    Cat *nterprise Linud

  • 7/24/2019 Datastage Performance Guide

    83/108

    G!%!$AL tab o a arallel $outine

    Description of #arious 8el5s in t4e ,eneral ta$:

    )outine name: T4e name of t4e routine as it 7ill appear int4e DataStage )epository9

    %ategory: T4e name of t4e category 74ere t4e routine isstore5 in t4e )epository9 (f you 5o not enter a name in t4is8el5 74en creating a routine; t4e routine is create5 un5ert4e main )outines $ranc49

    *ternal su$routine name: T4e % function t4at t4is routineis calling

  • 7/24/2019 Datastage Performance Guide

    84/108

    A$G(M!%TS tab o a arallel $outine

    &n5er t4e arguments ta$ enter t4e names of all t4eparameters t4at t4e function takes as input 7it4 t4eircorrespon5ing types9

  • 7/24/2019 Datastage Performance Guide

    85/108

    T4is series of posts eplains in 5etail t4e process ofcreating a routine 7it4 t4e 4elp of eamples9But $eforeco5ing anyt4ing; please c4eck out t4e post compilerspeci8cation9

    #oding in #

  • 7/24/2019 Datastage Performance Guide

    86/108

    t4is co5e 74en calle5 from DataStage can return t4estring9

    After making t4e c4anges t4e a$o#e co5e 7ill look liket4is:

    #include

    char * hello ()

    {

    char * s="hello world"

    return s;

    }

    +o7 you 7ill 4a#e to create an o$6 8le for t4e a$o#e co5et4at can $e 5one $y using t4e comman5

    /usr/vacpp/bin/xlC_r -O -c -qspill=32704

    (n t4is case it 7ill $e

    /usr/vacpp/bin/xlC_r -O -c -qspill=32704 hellworld.cpp

    (f t4e co5e is correct t4e comman5 7ill create a 8lename5 4ello7orl59o in t4e same fol5er9

    (n net postin t4is series; 7e 7ill create a DataStageroutine an5 link it to t4e o$6ect co5e t4at 7e 5e#elope59

    &O#I)!" S!&IFI&$TIO%I% D$T$ST$'!

    "ne of t4e pre-requisites to 7riting Parallel rotines using%%\\ is to c4eck for t4e compiler speci8cation9 T4is postgui5es in seeing t4e %%\\ compiler a#aila$le in ourinstallation9 To 8n5 compiler; 7e nee5 to login intoA5ministrator an5 go to t4e pat4 $elo7

    http://dwbitutorials.wordpress.com/2013/03/12/datastage-parallel-routines-2/http://dwbitutorials.wordpress.com/2013/03/12/datastage-parallel-routines-2/
  • 7/24/2019 Datastage Performance Guide

    87/108

    DataStage Z A5ministrator Z Properties Z *n#ironmentZ Parallel Z %ompiler

    Step /: Log in to Data Stage A5ministration9

    Step 0: %lick on t4e Pro6ects ta$9

    Step ?: %lick on t4e Properties ta$

  • 7/24/2019 Datastage Performance Guide

    88/108

    Step : %lick on *n#ironment9

    Step .: Select t4e compiler option un5er Parallel ta$9

  • 7/24/2019 Datastage Performance Guide

    89/108

    Step 3: %ompiler speci8cation of 5atastage is present4ere9

    .cpp

  • 7/24/2019 Datastage Performance Guide

    90/108

    (n %ase "f ,+& compiler 7e compile a cpp co5e like t4is

    )*ompiler %ame+ ) *ompile option+ ) cpp &ile name+

    e.g. g++ -o test.cpp

    %ompiling comman5 for 5atastage routine 7ill $e:-,usr,-acpp,bin,l*"r /' /c /qspill012345 test

    T4is comman5 7ill create an o$6ect 8le test9o

    6#isclaimer7 / The *ompiler name and options areinstallation dependent.8

    D$T$ST$'! )!$"%I%'S56:"!US$(I)IT7 I% D$T$ST$'!

    Belo7 are some of t4e 7ays t4roug4 74ic4 reusa$ility can$e ac4ie#e5 in DataStage9

    Multiple (nstance !o$s9

    Parallel S4are5 %ontainer

    After-6o$ )outines9

    Multiple-(nstance !o$s:,enerally; in 5ata 7are4ousing; t4ere 7oul5 $escenariosrequirement to loa5 t4e maintenance ta$les ina55ition to t4e ta$les in 74ic4 7e loa5 t4e actual 5ata9T4ese maintenance ta$les typically 4a#e 5ata like

    /9%ount of recor5s loa5e5 in t4e actual target ta$les909Batc4 Loa5 +um$erDay on 74ic4 t4e loa5 occurre5

    ?9T4e last processe5 sequence i5 for a particular loa5T4is information can also $e use5 for 5ata count#ali5ation; email noti8cation to users9 So; for loa5ingt4ese maintenance ta$les; it is al7ays suggeste5 to5esign a single multiple-instance 6o$ instea5 of 5ierent

    6o$s 7it4 t4e same logic9 (n t4e !o$ Properties ta$; c4eckt4e option @Allo7 Multiple (nstance9 T4e 6o$ acti#ity74ic4 attac4 a multiple-instance 6o$ s4o7s an a55itionalta$ @(n#ocation (5 9So 7e can 4a#e 5ierent 6o$sequences triggering t4e same 6o$ 7it4 5ierentin#ocation i5s9 +ote t4at a 6o$ triggere5 $y 5ierent

  • 7/24/2019 Datastage Performance Guide

    91/108

    sequences 7it4 same in#ocation i5 7ill a$ort eit4er one or$ot4 of t4e 6o$ sequences9

    Parallel S4are5 %ontainer:

    A s4are5 container 7oul5 allo7 a common logic to $es4are5 across multiple 6o$s9 *na$le )%P at t4e stageprece5ing t4e S4are5 container stage9For t4e purpose ofs4aring across #arious 6o$s; 7e 7ill not propagate all t4e

    column meta5ata from t4e 6o$ into t4e stages present ins4are5 container9 Also ;ensure t4at 6o$s are re-compile574en t4e s4are5 container is c4ange59

    (n our pro6ect; 7e 4a#e use5 s4are5 container forimplementing t4e logic to generate unique sequence i5 foreac4 recor5 t4at is loa5e5 into t4e target ta$le9 >e 7ill5iscuss 5ierent 7ays to generate sequence i5 e mig4t 4a#e a scenario 74ere in 7e s4oul5nGt 4a#e anyof t4e input recor5s to $e re6ecte5 $y any of t4e stages int4e 6o$9 So 7e 5esign a 6o$ 74ic4 4a#e re6ect links for5ierent stages in t4e 6o$ an5 t4en co5e a common after-

    6o$ routine 74ic4 counts t4e num$er of recor5s in t4e

    re6ect links of t4e 6o$ an5 a$orts t4e 6o$ 74en t4e countecee5s a pre-5e8ne5 limit9

  • 7/24/2019 Datastage Performance Guide

    92/108

    T4is routine can $e parameterise5 for stage an5 linknames an5 can t4en $e re-use5 for 5ierent 6o$s9

    D$T$ST$'! )!$"%I%'S51: US$'!

    OF $T8O)D8(OU%D!D8)!%'THor &ariable"length elds! the parallel engine al%ays allocates thespace e-ui&alent to their maximum specied lengths. This isbecause the engine can easily determine the eld boundaries%ithout ha&ing to loo for delimiters! %hich %ill ease the CP1processing. Xo%e&er! the *:6 operations %ill be costlier. Therefore!if there are many &alues! %hich are much shorter than theirmaximum length! then there is lot of unused space! %hich mo&es

    around bet%een the operators! datasets and xed"%idth les. Thisis a huge performance hit for all stages especially sort stage!%hich consumes a lot of scratch space! bu?er memory! CP1resources. This is e&en more dangerous if %e ha&e many of suchcolumns.

    This is common for elds lie address! %here lengths allocated aremuch higher than the a&erage length of an address. 6ne %ay topre&ent this scenario is by setting the en&ironmental&ariableAPT'"LD'B"&+D*D'L*+,TC9

    Disa5#antages:Some parallel datasets generated %ith*nfoSphere\ DataStage\ .A.I and later releases re-uire moredis space %hen the columns are of type ^arChar %hen comparedto .A. This is due to changes added for performanceimpro&ements for bounded length ^arChars in .A.I.

    Set $PT06=D0B618D2D0=28VTX to any &alue to re&ert to pre".A.I storage beha&ior %hen using bounded length &archars.Setting this &ariable can ha&