
Data Stage


1.

Q1: What is the difference between a File Set, a Data Set and a sequential file?

I was extracting some data into sequential files and the job failed with the error message "file full". I replaced the sequential file with a File Set, but the job has now been running for more than 45 minutes and does not stop even after multiple stop instructions have been sent to it. Any help or advice will be highly appreciated.

Ans :

A sequential file is an operating system file. Its size limit is determined by the operating system (and your ulimit setting on UNIX).

A File Set is a construction of one or more operating system files per processing node, each of which may be no more than 2GB. You can have up to 1000 files per processing node, disk space permitting, in a File Set. There exists a control file, whose name ends in ".fs", that reports where each of the files is.

A persistent Data Set is structured in exactly the same way as a File Set. The only differences are that a File Set stores data in human-readable form, while a Data Set stores data in internal form (binary numbers), and that the control file name suffix is ".ds" for a persistent Data Set.

A virtual Data Set has no visibility as files per processing node in the operating system, as it (the virtual Data Set) exists entirely in memory except for its control file, which has a ".v" suffix.

* A sequential file can only be accessed on one node. In general it can only be accessed sequentially by a single process, so the concept of parallelism is lost.

* A Data Set preserves partitioning. It stores data on the nodes, so when you read from a Data Set you do not have to re-partition your data.

* We cannot use the Unix cp or rm commands to copy or delete a Data Set, because DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind (a proper cleanup approach using orchadmin is sketched below).

* Unix command to find and clean up a whole set of old data set files (the example below compresses .ds descriptor files older than five days):

Code:

find /detld1/etl/ascential/ascential/DataStage/Projects/CTI_London/IAM/staging/ -name "*.ds" -mtime +5 -print -exec compress {} \;

The path should be changed to suit your environment.
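A hedged aside that goes beyond the original post: Parallel Extender installations include an orchadmin utility for operating on a data set as a whole (the descriptor plus all of its segment files). The exact subcommands and options vary by release, and the path below is only an illustration, so treat this as a sketch rather than exact syntax. APT_CONFIG_FILE must point at a valid configuration file for these commands to work.

Code:

# List what lies behind a data set descriptor (illustrative path)
orchadmin describe /staging/customer.ds

# Remove the descriptor AND every segment file on every node
orchadmin rm /staging/customer.ds

# Copy a data set without breaking it apart (do not use plain cp)
orchadmin copy /staging/customer.ds /staging/customer_backup.ds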

But regarding the File Set, is the maximum 1000 files per processing node or 10000 files per processing node?

2.


1) What is the difference between server jobs and parallel jobs?
2) Orchestrate vs DataStage Parallel Extender?
3) What are OConv() and IConv() functions and where are they used?
4) What is aggregate cache in the Aggregator transformation?
5) What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?
6) How do you rename all of the jobs to support your new file-naming conventions?
7) How do you merge two files in DS?
8) How did you handle an 'Aborted' sequencer?
9) What are the performance tunings you have done in your last project to increase the performance of slowly running jobs?
10) If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
11) How can you do an incremental load in DataStage?
12) What is a full load and an incremental (refresh) load?
13) What are the different types of Type 2 dimension mapping?

3.

1) DataStage architecture
a) It is a client-server architecture, with the client components being Administrator, Designer, Manager and Director, and the server components being the DS server, repository and package installer.
2) How do you create a project?
a) Through DataStage Administrator. Location: (C:\Ascential\DataStage\Projects\)
3) How many projects can you create at most?
a) It depends upon the licence keys.
4) How do you create users and give them permissions?
a) Through the Administrator. The permissions are: DataStage Operator, DataStage Production Manager, DataStage Developer, None (you should know the roles for each of them).
5) What are the permissions available in Administrator?
a) DataStage Operator: schedule and run the jobs. DataStage Production Manager: full access to DataStage. DataStage Developer: create and modify jobs, debug, run, schedule, import and export; can release locks but cannot create protected projects.
6) Is it possible for an operator to view the full log information?
a) Yes, but only if the administrator gives the operator the permission (i.e. by checking the box "DataStage operator can view full log").
7) Tell me the types of jobs (active or passive, also ODBC and plug-ins).
a) Server jobs, parallel jobs, mainframe jobs and job sequences.
8) How do you look up through a sequential file?
a) One cannot do a lookup through a sequential file.
9) What is a stage variable?
a) A stage variable is a variable which executes locally within the stage.
10) What does a constraint do?
a) A constraint is like a filter condition which is used to limit the records depending upon the business logic.
11) What does a derivation do?
a) A derivation is like an expression which is used to derive some value from the input columns and also modify the input columns according to the business needs (explain with an example).
12) Tell me the sequence of execution (stage variable, constraint, derivation).
a) Stage variables, then constraints, then derivations (explain with an example).
13) Why do you use a hash file?


a) Primarily used as a lookup.
14) Difference between a hash file and a sequential file?
a) A hash file is a file with one or more key-based fields (what about sequential files, even they can be key based, right? No, they cannot be used that way).
15) Name some types of sequential file.
a) Fixed-width and delimited files.
16) What is the size of your hash file?
a) By default the size of the hash file is 128 MB, which is specified in the Administrator under the Tunables tab, and it has a maximum value of 999 MB.
17) How do we calculate the size of our hash files?
a) One way is through the hash calculator, and also from some equations.
18) What is the hash algorithm?
a) It is a property to be set for dynamic hash files in order to determine the way in which the hash function is applied to incoming records and spreads the records into multiple groups (SEQ.NUM or GENERAL; read the Hashed File stage notes from page 25).
19) How many types of hash file are available?
a) Static, in which there are 18 types, whereas in dynamic we have type 30.
20) Which type of hash file have you used, and why?
a) Type 30 file (dynamic), because of incremental data loads.
21) How do you create a hash file?
a) CREATE.FILE <filename> DYNAMIC, or through DS Designer using the create file option.
22) How do you specify the hash file?
a) We need to define key columns while specifying the hash file, and there is no limit on the number of key columns (check this out).
23) Is it possible to view the records in a hash file through any editor? If yes, which editor?
a) Through the Data Browser in DataStage Designer.
24) What is the extension of a hash file?
a) .30 extension (do confirm).
25) Is it possible to create a hash file containing all the columns of a normal sequential file (without key columns)?
a) No.
26) Difference between a static and a dynamic hash file?
a) A dynamic hash file allocates space dynamically, whereas a static hash file does not grow beyond the specified size.
27) Tell me the different types of stages.
a) Active and passive stages.
28) Difference between an active stage and a passive stage?
a) Active stages are those which do some processing (Sort, Aggregator, Transformer, Pivot), whereas passive stages are those which do not do any processing themselves (Sequential File, ODBC).
29) Is it possible to check a constraint at active stages? If yes, how?
a) Yes, through a Transformer stage.
30) Where do you define the constraint?
a) In the Transformer (the constraint entry field on the output link).
31) What is a job parameter and where do you define it?
a) A job parameter is a parameter through which run-time details can be manipulated. It can be defined through the Designer at job level (explain with some examples).
32) What is an environment variable and where do you define it?


a) Environment variables can be defined in the Administrator (explain with some examples). Environment variables are like global variables which can be used across the project.
33) Difference between a job parameter, an environment variable and a stage variable?
a) An environment variable is one through which you can define project-wide defaults. A job parameter is one through which you can override any previous defaults, and it applies to the particular job. A stage variable is one which is executed locally within the active stage (explain with some examples).
34) While running a job, is it possible to control another job through a stage, without job control coding? If yes, how, and what are the stages that support it?
a) Before/after subroutines, and through job control as well (do explain).
35) Have you written job control? What is its use?
a) To control the running of jobs and the status of jobs, and the job logic can be implemented within it (explain with some examples).
36) How do you attach a job in job control?
a) DSAttachJob; it is a utility function (explain with some examples).
37) How do you set a job parameter in job control?
a) DSSetParam(JobHandle, ParameterName, Value); this allows you to set parameters (explain with some examples).
38) What is a routine?
a) Routines are pieces of code which can be executed before or after a job to trigger some activities.
39) Different types of routines?
a) Transformer routines and before/after routines (explain with some examples).
40) What is the use of a routine?
a) To trigger some activities.
41) Where are the routines stored?
a) Routines are stored in the Routines branch of the DataStage Repository.
42) How many windows are shown in DS Designer, and what are they?
a) Designer window, repository, palette.
43) What are the uses of the Transformer, Aggregator, Pivot and Sort stages?
a) The Transformer allows you to transform data in the stage, i.e. to modify incoming records, apply filter rules and transform data. The Aggregator allows you to do internal sorting and also to aggregate records based on groups. The Pivot allows you to change data from vertical to horizontal.
44) What is the use of the Merge stage?
a) Merge allows us to merge two sequential files into one or more output links. A Merge stage is a passive stage that can have no input links and one or more output links.
45) Is it possible to join more than two sequential files using a Merge stage? If not, is there any stage to solve this?

a) It is not possible to join more than two sequential files using the Merge stage, but it is possible through the Link Collector, provided the metadata for the sources is the same. What if we have to join two or more sequential files with different metadata?

46) Name all the join types.
a) Inner join, right outer, left outer and full outer join (there are 7 types; see the Merge stage).
47) How do you extract data from a database?
a) Through ODBC and OCI.
48) Name all the update actions.
a) There are 8 update actions (found on the input link of the ODBC stage).
49) In job control, which language is used to write the code?


a) BASIC

50) Is it possible to call a BASIC program which is written externally and use it in DS?
a) We can call a BASIC program, but I don't know how.
51) Is it possible to run a job in DS Designer? If yes, how?
a) Yes; in version 5.1 it is run through the debugger, and in version 7.1 it can be run directly.
52) Is it possible to do a lookup against a lookup hash file? If yes, how?
a) It is not possible to look up a lookup hash file.
53) What does the Director do?
a) Job locks, job resources, job reports in XML, scheduling, viewing logs and running the jobs.
54) Have you scheduled a job? How?
a) Yes, through DataStage Director.
55) A job is running, but I would like to stop it. What are the ways to stop the job?
a) Through the Director, and through Cleanup Resources in the Director.
56) What is the use of the log file?

a) To check on the execution of the job and to track down warnings and errors.
57) Describe Cleanup Resources and Clear Status File.
a) Cleanup Resources allows you to remove locks or kill the jobs. Clear Status File clears the last run status of the job and resets the job status to "Has been reset" (what are the implications of this?).
58) Situations wherein there is a need to clear the status file?
a)
59) Is it enabled in DS Director? If not, how do you enable it?
a) By default Cleanup Resources and Clear Status File are not enabled in the Director; you can enable them through the Administrator by checking "Enable job administration in Director" on the General tab.
60) Tell me the types of scheduling.
a) Today, tomorrow, every, next, daily.
61) How do you find the number of rows per second in DS Director?
a) In the Designer we can do it by choosing "View performance statistics", but in the Director it is through Tools > New Monitor.
62) How do you know the job status?
a) Through the Director, or via the dsjob command line (see the dsjob sketch after this list).
63) What is the difference between a warning and a fatal message?
a) Warnings do not abort the job, whereas fatal messages abort the job.
64) In the log file, what do control and info messages show?
a) They show the time when the execution of the job started, information about individual jobs and their status, and warnings and fatal errors.
65) What is a phantom error and how do you resolve it?
a) Ask Sakthi while doing routines.
66) What other errors have you faced?
a) See the error documentation.
67) What is the difference between running and validating a job?
a) Run executes the job, whereas validate checks for errors such as whether files exist, ODBC connections and the existence of intermediate hash files.
68) What is the use of DS Manager?
a) DS Manager is used to edit and manage the contents of the repository, for example to create or edit routines and table definitions, and to export and import jobs or the entire project.
69) How do you import and export the project?
a) Through DataStage Manager.
70) What is metadata, and where is it stored?
a) Metadata is the table definitions, stored in the repository.
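To complement the answer on checking job status above (question 62), here is a hedged command-line sketch using the dsjob client; option names can differ slightly between DataStage releases, and the project and job names are placeholders.

# Report the current status and last run details of a job
dsjob -jobinfo MyProject MyJob

# Summarise the log entries (info, warning, fatal) for the job
dsjob -logsum MyProject MyJob

# Show the full text of a single log entry, identified by its event id
dsjob -logdetail MyProject MyJob 25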


71) How do you write a routine?
a) We can write a routine by going to the Routines category in DS Manager and selecting the create routine option. Routines are written in the BASIC language.
72) What is the use of releasing a job?
a) Releasing a job is significant for cleaning up the resources of a job which is locked or idle.
73) What is the use of a table definition in Manager?
a) Table definitions depict the metadata of a table; we can use DS Manager to edit or create table definitions.
74) What is the difference between a local container and a shared container?
a) A local container can be used within the job itself and does not appear in the repository window, whereas shared containers are available throughout the project and appear in the repository window.
75) What are containers?
a) Containers are a collection of grouped stages and links which can be reused (shared container).
76) Difference between an Annotation and a Description Annotation?
a) Annotations are short or long descriptions; we can have multiple annotations in a job and they can be copied to other jobs as well. With a Description Annotation we can have only one per job, and it cannot be copied into other jobs.
77) What are the advantages of a Description Annotation?
a) The advantage of a Description Annotation is that it is automatically reflected in the Manager and the Director.
78) What are the various types of compile-time and run-time errors you have faced?
a)
79) Explain the "allow stage write cache" option for hash files and its implications.
a) It caches the hash file in memory; you should not use this option when reading from and writing to the same hash file.
80) What are the caching properties while creating hash files?
81) Where do you specify the size of your hash file?
a) In the Administrator under the Tunables tab; the default size is 128 MB and the maximum is 999 MB.

4.

Just thought of sharing a good collection of DataStage interview questions... it may help someone who is learning. Each question is a kind of discussion thread, so it definitely helps to understand the concept rather than just reading the answers.

http://www.geekinterview.com/Interview-Questions/Data-Warehouse/DataStage

http://www.geekinterview.com/

1. Dimension Modelling types along with their significance
Data Modelling is broadly classified into 2 types: a) E-R Diagrams (Entity-Relationships); b) Dimensional Modelling.

2. Dimensional modelling is again subdivided into 2 types.
a) Star Schema - simple and much faster; denormalized form. b) Snowflake Schema - complex, with more granularity; more normalized form.

3. Importance of the Surrogate Key in data warehousing?
A Surrogate Key is a primary key for a dimension table. Its most important property is that it is independent of the underlying database, i.e. a Surrogate Key is not affected by the changes going on with a databas...

4. Differentiate database data and data warehouse data?
Data in a database is a) detailed or transactional, b) both readable and writable, c) current.

5. What is the flow of loading data into fact and dimension tables?
Fact table - a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; consists of fields with numeric values. Dimension table - a table with a unique primary key...

6. Orchestrate vs DataStage Parallel Extender?
Orchestrate itself is an ETL tool with extensive parallel processing capabilities running on the UNIX platform. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the p...

7. Differentiate Primary Key and Partition Key?
A Primary Key is a combination of unique and not null. It can be a collection of key values, called a composite primary key. A Partition Key is just a part of the Primary Key. There are several methods of ...

8. How do you execute a DataStage job from the command line prompt?
Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname

9. What are Stage Variables, Derivations and Constants?
Stage Variable - an intermediate processing variable that retains its value during the read and doesn't pass the value into a target column.
Derivation - expression that specifies the value to be passed o...

10. What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.

5.

1) Can we use shared container as lookup in datastage server jobs?

2) If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

3) If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?

4) Does Enterprise Edition only add parallel processing for better performance? Are any stages/transformations different?

A) DataStage Standard Edition was previously called DataStage and DataStage Server Edition.
• DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential.
• DataStage Enterprise: server jobs, sequence jobs, parallel jobs. The Enterprise Edition offers parallel processing features for scalable high-volume solutions. Designed originally for Unix, it now supports Windows, Linux and Unix System Services on mainframes.
• DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs. MVS jobs are jobs designed using an alternative set of stages that are generated into Cobol/JCL code and are transferred to a mainframe to be compiled and run. Jobs are developed on a Unix or Windows server and transferred to the mainframe to be compiled and run.
The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation) but they tend to have different fields and options within that stage.

5) How can you implement Complex Jobs in datastage

A) What do you mean by complex jobs?

If you use more than 15 stages in a job, or 10 lookup tables in a job, then you can call it a complex job.

6) Can u join flat file and database

A) Yes, we can do it in an indirect way. First create a job which populates the data from the database into a sequential file, and name it Seq_First1. Take the flat file which you already have and use a Merge stage to join the two files. You have various join types in the Merge stage, like pure inner join, left outer join, right outer join etc.; you can use any one of these which suits your requirements.


7) What is troubleshooting in server jobs? What are the different kinds of errors encountered while running?


8) What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate unnecessary records even getting in before joins are made"?

A) It probably means that you can put the selection criteria in the WHERE clause, i.e. whatever data you need to filter, filter it out in the SQL rather than carrying it forward and then filtering it out.

9) What are DataStage multi-byte and single-byte file conversions? How do we use those conversions in DataStage?

10) What is difference between server jobs & parallel jobs

A) Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to other data sources as necessary.

Parallel jobs. These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.

11) What is merge? And how to use merge?

A) Merge is a stage that is available in both parallel and server jobs.

The merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel). Merge requires that the master table/dataset and the update table/dataset to be sorted. Merge is performed on a key field, and the key field is mandatory in the master and update dataset/table

The Merge stage is used to merge two flat files in server jobs.

12) How do we use the NLS function in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly.

A) Dear User, as per the manuals and documents, we have different levels of interfaces. Can you be more specific? Like Teradata

interface operators, DB2 interface operators, Oracle interface operators and SAS interface operators. Orchestrate National Language Support (NLS) makes it possible for you to process data in international languages using Unicode character sets. International Components for Unicode (ICU) libraries support NLS functionality in Orchestrate. Operators with NLS functionality include: the Teradata interface operators, the switch operator, the filter operator, the DB2 interface operators, the Oracle interface operators, the SAS interface operators, the transform operator, the modify operator, the import and export operators, and the generator operator. Should you need any further assistance please let me know; I shall share as much as I can. You can email me at [email protected] or [email protected]

By using NLS we can do the following:
- Process data in a wide range of languages
- Use local formats for dates, times and money
- Sort the data according to the local rules

If NLS is installed, various extra features appear in the product.


For server jobs, NLS is implemented in the DataStage Server engine. For parallel jobs, NLS is implemented using the ICU library.

13) What is APT_CONFIG in DataStage?

A) APT_CONFIG_FILE (not just APT_CONFIG) is the environment variable that points at the configuration file defining the nodes (the scratch area, temp area) for the specific project. DataStage understands the architecture of the system through this file. For example, the file contains node names, disk storage information, etc. APT_CONFIG is just an environment variable used to identify the *.apt file; don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.
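For reference, a minimal hedged shell line showing how the variable is usually pointed at a configuration file (the path is only an illustration):

# Point parallel jobs at a specific configuration (*.apt) file
export APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/default.apt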

14) What is merge and how it can be done plz explain with simple example taking 2 tables...

A) Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, Emp and Dept. If we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order and join the two tables.

15) What is Version Control?

A) i) Version Control stores different versions of DS jobs
ii) Runs different versions of the same job
iii) Reverts to a previous version of a job
iv) Views version histories

16) What are the Repository Tables in DataStage and what are they?

A) Dear User. A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. In DataStage I/O and Transfer, under the Interface tab, the Input, Output and Transfer pages each have 4 tabs, and the last one is Build; under that you can find the TABLE NAME. The DataStage client components are: Administrator - administers DataStage projects and conducts housekeeping on the server; Designer - creates DataStage jobs that are compiled into executable programs; Director - used to run and monitor the DataStage jobs; Manager - allows you to view and edit the contents of the repository. Should you need any further assistance please revert to this mail id [email protected] or [email protected]

17) Where does a Unix script in DataStage execute, on the client machine or on the server? And supposing it executes on the server, where will it execute?

A) Datastage jobs are executed in the server machines only. There is nothing that is stored in the client machine.


18) Defaults nodes for datastage parallel Edition

A) The default number of nodes is always one.

Actually the number of nodes depends on the number of processors in your system. If your system supports two processors, we will get two nodes by default.

19) What happens if RCP is disabled?

A) Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to the shared container input, then metadata will be propagated at run time, so there is no need to map it at design time.

If RCP is disabled for the job, then OSH has to perform an import and export every time the job runs, and the processing time of the job is also increased.

20) I want to process 3 files sequentially, one by one; how can I do that? While processing, it should fetch the files automatically.

21) Scenario-based question: suppose that 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and, after running the job, only 5,000 rows have been loaded into the target table, the remainder are not loaded and the job is aborted. How can you sort out the problem?

A) Suppose the job sequencer synchronizes or controls 4 jobs but job 1 has a problem. In this situation you should go to the Director and check what type of problem is being reported: a data type problem, a warning message, a job failure or a job abort. If the job fails it usually means a data type problem or a missing column action. So you should go to the Run window -> Tracing -> Performance, or in your target table -> General -> Action and select from the two options there: (i) On Fail - Commit / Continue, (ii) On Skip - Commit / Continue. First check how much data has already been loaded, then select the On Skip option and continue; for the remaining data that was not loaded, select On Fail / Continue. Run the job again and you should get a success message.
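A small command-line sketch to go with this scenario (an addition, not part of the original thread): once the cause of the abort has been investigated in the Director, the aborted job is normally reset and rerun, which the dsjob utility also exposes. Project and job names are placeholders and option spellings may vary by version.

# Check what state job 1 finished in
dsjob -jobinfo MyProject Job1

# Reset the aborted job so that it can be run again
dsjob -run -mode RESET MyProject Job1

# Re-run it normally and wait for the finishing status
dsjob -run -mode NORMAL -jobstatus MyProject Job1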

22) What is a batch program and how is it generated?

A) A batch program is a program generated at run time and maintained by DataStage itself, but you can easily change it on the basis of your requirements (extraction, transformation, loading). Batch programs are generated depending on the nature of your job, either a simple job or a sequencer job; you can see this program under the job control option.

23) What is difference between data stage and informatica?

A) Here are some very good articles on these differences, which help to give an idea. Basically it depends on what you are trying to accomplish.

What are the requirements for your ETL tool?

Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday?


If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday’s file into a table and do lookups?

If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do.

If you are small enough in your data sets, then either would probably be OK.

http://www.dmreview.com/article_sub.cfm?articleId=4306

24) Importance of the Surrogate Key in data warehousing?

A) A Surrogate Key is a primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. a Surrogate Key is not affected by the changes going on in the database. The concept of a surrogate comes into play when there is a slowly changing dimension in a table. In such a situation there is a need for a key by which we can identify the changes made in the dimensions. These slowly changing dimensions can be of three types, namely SCD1, SCD2 and SCD3. Surrogate keys are system-generated keys; mainly they are just a sequence of numbers, but they can be alphanumeric values as well. They are used in the concept of slowly changing dimensions in order to keep track of changes to the primary key.

25) What's the difference between DataStage developers and DataStage designers? What are the skills required for each?

A) A DataStage developer is the one who will code the jobs. A DataStage designer is the one who will design the job; I mean he will deal with the blueprints and he will design the jobs and the stages that are required in developing the code.

26) How do we automate DS jobs?

A) DS jobs can be automated by using shell scripts on a UNIX system. We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt. Then call this shell script from any of the schedulers available on the market. The second option is to schedule these jobs using DataStage Director.
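A minimal wrapper-script sketch for the shell-script route described above, assuming the standard dsjob client is on the PATH; the project name, job name and the exit-code convention should be checked against your own installation.

#!/bin/sh
# run_ds_job.sh - hypothetical wrapper: run a DataStage job and report success or failure
PROJECT="MyProject"
JOB="LoadCustomerDim"

# -jobstatus makes dsjob wait for the job and return its finishing status as the exit code
dsjob -run -jobstatus "$PROJECT" "$JOB"
RC=$?

# By convention 1 = finished OK and 2 = finished with warnings; anything else is treated as a failure
if [ "$RC" -eq 1 ] || [ "$RC" -eq 2 ]; then
    echo "$JOB completed (dsjob status $RC)"
    exit 0
else
    echo "$JOB failed (dsjob status $RC)" >&2
    exit 1
fi

A scheduler (cron, AutoSys, etc.) can then simply invoke this script and act on its exit code.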

27) What is DS Manager used for - did you use it?

A) The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository. DataStage Manager is also used for export and import; the main use of export and import is sharing jobs and projects from one project to another.

28) What are types of Hashed File?A) Hashed File is classified broadly into 2 types.

a) Static - Sub divided into 17 types based on Primary Key Pattern. b) Dynamic - sub divided into 2 types     i) Generic    ii) Specific.

The default hashed file is "Dynamic - Type Random 30 D".

29) How do you eliminate duplicate rows?


A) In SQL: delete from tablename where rowid not in (select max/min(rowid) from emp group by columnname). DataStage provides us with a Remove Duplicates stage in Enterprise Edition; using that stage we can eliminate the duplicates based on a key column. The duplicates can also be eliminated by loading the corresponding data into a hash file: specify the columns on which you want to de-duplicate as the keys of the hash file. So removal of duplicates can be done in two ways: 1. use the "Duplicate Data Removal" stage, or 2. use GROUP BY on all the columns used in the SELECT, and the duplicates will go away.

30) What about system variables?

A) DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE - The internal date when the program started. See the Date function.
@DAY - The day of the month extracted from the value in @DATE.
@FALSE - The compiler replaces the value with 0.
@FM - A field mark, Char(254).
@IM - An item mark, Char(255).
@INROWNUM - Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM - Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME - The user login name.
@MONTH - The current month extracted from the value in @DATE.
@NULL - The null value.
@NULL.STR - The internal representation of the null value, Char(128).
@PATH - The pathname of the current DataStage project.
@SCHEMA - The schema name of the current DataStage project.
@SM - A subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE - Status codes returned by system processes or commands.
@TIME - The internal time when the program started. See the Time function.
@TM - A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE - The compiler replaces the value with 1.


@USERNO - The user number.
@VM - A value mark (a delimiter used in UniVerse files), Char(253).
@WHO - The name of the current DataStage project directory.
@YEAR - The current year extracted from @DATE.
REJECTED - Can be used in the constraint expression of an output link of a Transformer stage. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

31) What is DS Designer used for - did you use it?

A) You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.

32) What is DS Administrator used for - did you use it?

A) The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales.

33) How to create batches in Datastage from command prompt

34) Dimensional modelling is again sub divided into 2 types.A) a)Star Schema - Simple & Much Faster. Denormalized form. b) Snowflake Schema - Complex with more Granularity. More normalized form.

35) How will you call an external function or subroutine from DataStage?

A) There is a DataStage option to call external programs: ExecSH.

36) How do you pass a filename as a parameter for a job?

A) During job development we can create a parameter 'FILE_NAME', and the value can be passed while running the job.

1. Go to DataStage Administrator->Projects->Properties->Environment->UserDefined. Here you can see a grid, where you can enter your parameter name and the corresponding the path of the file.

2. Go to the stage Tab of the job, select the NLS tab, click on the "Use Job Parameter" and select the parameter name which you have given in the above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.

37) How to handle Date convertions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

A) We use a) "Iconv" function - Internal Convertion. b) "Oconv" function - External Convertion.

Function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(Filedname,"D/MDY[2,2,4]"),"D-MDY[2,2,4]")


38) What is the difference between an operational data store (ODS) and a data warehouse?

A) A data warehouse is a decision-support database for organisational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.

An ODS (Operational Data Store) is an integrated collection of related information. It contains at most 90 days of information.

39) When should we use ODS?

40) How can we create Containers?

A) There are Two types of containers

1.Local Container

2.Shared Container

Local container is available for that particular Job only.

Where as Shared Containers can be used any where in the project.

Local container:

Step1:Select the stages required

Step2:Edit>ConstructContainer>Local

SharedContainer:

Step1:Select the stages required

Step2:Edit>ConstructContainer>Shared

Shared containers are stored in the SharedContainers branch of the Tree Structure

41) How can we improve the performance of DataStage jobs?

A) Performance and tuning of DS jobs:

1. Establish baselines
2. Avoid the use of only one flow for tuning/performance testing
3. Work in increments
4. Evaluate data skew
5. Isolate and solve
6. Distribute file systems to eliminate bottlenecks
7. Do not involve the RDBMS in initial testing
8. Understand and evaluate the tuning knobs available.

42) What are the Job parameters?

A) These Parameters are used to provide Administrative access and change run time values of the job.

EDIT>JOBPARAMETERS

In that Parameters Tab we can define the name,prompt,type,value

43) What is the difference between routine and transform and function

44) How can we implement Lookup in DataStage Server jobs?

A) By using hashed files you can implement the lookup in DataStage; hashed files store data based on a hashing algorithm and key values.

45) How can we join one Oracle source and a sequential file?

A) Join and Lookup are used to join an Oracle source and a sequential file.

46) What is iconv and oconv functions?

47) Difference between Hashfile and Sequential File?

A) A hash file stores the data based on a hashing algorithm and a key value. A sequential file is just a file with no key column. A hash file can be used as a reference for a lookup; a sequential file cannot.

48) How do you rename all of the jobs to support your new File-naming conventions?

49) Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic.

A) There is no TRUNCATE on ODBC stages. It is Clear table blah blah and that is a delete from statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have alter table permissions where Delete doesn't).

50) The above might raise another question: why do we have to load the dimension tables first, then the fact tables?


A) As we load the dimensional tables the keys (primary) are generated and these keys (primary) are Foreign keys in Fact tables.

51) How will you determine the sequence of jobs to load into data warehouse?

A) First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the Aggregator tables (if any).

52) What are the command line functions that import and export the DS jobs?

A) A. dsimport.exe- imports the DataStage components. B. dsexport.exe- exports the DataStage components.

53) What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

A) Use crontab utility along with dsexecute() function along with proper parameters passed.

"AUTOSYS": Thru autosys u can automate the job by invoking the shell script written to schedule the datastage jobs

54) What will you in a situation where somebody wants to send you a file and use that file as an input or reference and then run job.

A) A. Under Windows: use the 'Wait For File Activity' stage in a sequence and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer depending on the file.
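A hedged sketch of the UNIX polling approach just described (the trigger file, interval, project and job names are placeholders):

#!/bin/sh
# Wait for the trigger file to arrive, then start the DataStage job
TRIGGER=/incoming/customer_feed.dat

while [ ! -f "$TRIGGER" ]
do
    sleep 300    # check every five minutes
done

dsjob -run -jobstatus MyProject LoadCustomerFeed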

55) Read the String functions in DS

A) Functions like [] -> sub-string function and ':' -> concatenation operator

Syntax: string [ [ start, ] length ]string [ delimiter, instance, repeats ]

56) What are Sequencers?

A) Sequencers are job control programs that execute other jobs with preset Job parameters.

57) How did you handle an 'Aborted' sequencer?

A) In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again.


59) What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?

A)

1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using hash/sequential files for optimum performance and also for data recovery in case the job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the DB and reduced the use of DS Sort for better performance of jobs.
6. Removed the data not used from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the functionality required is not already in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
11. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate unnecessary records even getting in before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

A second list of tips:
1. Minimise the usage of the Transformer (instead use Copy, Modify, Filter, Row Generator).
2. Use SQL code while extracting the data.
3. Handle the nulls.
4. Minimise the warnings.
5. Reduce the number of lookups in a job design.
6. Use not more than 20 stages in a job.
7. Use an IPC stage between two passive stages; it reduces processing time.
8. Drop indexes before data loading and recreate them after loading data into the tables.
9. Generally we cannot avoid lookups if our requirements compel us to do lookups.


10. There is no hard limit on the number of stages like 20 or 30, but we can break the job into small jobs and then use Data Set stages to store the data.
11. The IPC stage is provided in server jobs, not in parallel jobs.
12. Check the write cache of the hash file. If the same hash file is used as a lookup as well as the target, disable this option.
13. If the hash file is used only for lookup, then enable "Preload to memory". This will improve the performance. Also, check the order of execution of the routines.
14. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
15. Use the "Preload to memory" option on the hash file output.
16. Use "Write to cache" on the hash file input.
17. Write into the error tables only after all the Transformer stages.
18. Reduce the width of the input record - remove the columns that you will not use.
19. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
20. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files. This will also minimize overflow on the hash file.
21. If possible, break the input into multiple threads and run multiple instances of the job (see the command-line sketch below).

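Point 21 above (splitting the input and running multiple instances) can be driven from the command line. A hedged sketch, assuming the job has been enabled for multiple instances and takes a hypothetical FileName parameter:

# Launch two invocations of a multi-instance job in parallel, one per input split
dsjob -run -param FileName=/data/in/part1.txt -jobstatus MyProject LoadFacts.inv1 &
dsjob -run -param FileName=/data/in/part2.txt -jobstatus MyProject LoadFacts.inv2 &

# Wait for both invocations to finish before continuing
wait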

60) How did you handle reject data?

A) Typically a reject link is defined and the rejected data is loaded back into the data warehouse. So a reject link has to be defined for every output link from which you wish to collect rejected data. Rejected data is typically bad data, like duplicates of primary keys or null rows where data is expected.

61) What are Routines and where/how are they written and have you written any routines before?

A) Routines

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them using the Routine dialog box. The following program components are classified as routines:

• Transform functions. These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions which are located in the Routines ➤ Examples ➤ Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box.

• Before/After subroutines. When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, which are located in the Routines ➤ Built-in ➤ Before/After branch in the Repository. You can also define your own before/after subroutines using the Routine dialog box.

• Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository. You specify the category when you create the routine. If NLS is enabled, you should be aware of any mapping requirements when using custom UniVerse functions. If a function uses data in a particular character set, it is your responsibility to map the data to and from Unicode.

• ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming components within DataStage. Such functions are made accessible to DataStage by importing them. This creates a wrapper that enables you to call the functions. After import, you can view and edit the BASIC wrapper using the Routine dialog box. By default, such functions are located in the Routines ➤ Class name branch in the Repository, but you can specify your own category when importing the functions.

When using the Expression Editor, all of these components appear under the DS Routines… command on the Suggest Operand menu. A special case of routine is the job control routine. Such a routine is used to set up a DataStage job that controls other DataStage jobs. Job control routines are specified in the Job control page on the Job Properties dialog box. Job control routines are not stored under the Routines branch in the Repository.

Transforms

Transforms are stored in the Transforms branch of the DataStage Repository, where you can create, view or edit them using the Transform dialog box. Transforms specify the type of data transformed, the type it is transformed into, and the expression that performs the transformation. DataStage is supplied with a number of built-in transforms (which you cannot edit). You can also define your own custom transforms, which are stored in the Repository and can be used by other DataStage jobs. When using the Expression Editor, the transforms appear under the DS Transform… command on the Suggest Operand menu.

Functions


Functions take arguments and return a value. The word "function" is applied to many components in DataStage:

• BASIC functions. These are one of the fundamental building blocks of the BASIC language. When using the Expression Editor, you can access the BASIC functions via the Function… command on the Suggest Operand menu.

• DataStage BASIC functions. These are special BASIC functions that are specific to DataStage. These are mostly used in job control routines. DataStage functions begin with DS to distinguish them from general BASIC functions. When using the Expression Editor, you can access the DataStage BASIC functions via the DS Functions… command on the Suggest Operand menu.

The following items, although called "functions", are classified as routines and are described under "Routines" above. When using the Expression Editor, they all appear under the DS Routines… command on the Suggest Operand menu.

• Transform functions
• Custom UniVerse functions
• ActiveX (OLE) functions

Expressions

An expression is an element of code that defines a value. The word "expression" is used both as a specific part of BASIC syntax, and to describe portions of code that you can enter when defining a job. Areas of DataStage where you can use such expressions are:

• Defining breakpoints in the debugger
• Defining column derivations, key expressions and constraints in Transformer stages
• Defining a custom transform

In each of these cases the DataStage Expression Editor guides you as to what programming elements you can insert into the expression.


62) What are OConv () and Iconv () functions and where are they used?

A) IConv() - Converts a string to an internal storage format OConv() - Converts an expression to an output format.

63) Explain the differences between Oracle8i/9i?

64) What are Static Hash files and Dynamic Hash files?

A) As the names itself suggest what they mean. In general we use Type-30 dynamic Hash files. The Data file has a default size of 2Gb and the overflow file is used if the data exceeds the 2GB size.

65) Have you ever involved in updating the DS versions like DS 5.X, if so tell us some the steps you have taken in doing so?

A) Yes. The following are some of the steps I have taken in doing so:
1) Definitely take a backup of the whole project(s) by exporting each project as a .dsx file.
2) See that you are using the same parent folder for the new version as well, so that your old jobs using hard-coded file paths still work.
3) After installing the new version, import the old project(s); you will have to compile them all again. You can use the 'Compile All' tool for this.
4) Make sure that all your DB DSNs are created with the same names as the old ones. This step is for moving DS from one machine to another.
5) In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.
6) Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade.

66) Did you Parameterize the job or hard-coded the values in the jobs?

A) Always parameterize the job. Either the values come from Job Properties or from a 'Parameter Manager' - a third-party tool. There is no way you should hard-code parameters in your jobs. The variables often parameterized in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked at.

67) Tell me the environment in your last projects

A) Give the OS of the Server and the OS of the Client of your recent most project

68) How do you catch bad rows from OCI stage?

69) Suppose if there are million records did you use OCI? if not then what stage do you prefer?


A) Using Orabulk

70) How do you pass the parameter to the job sequence if the job is running at night?

A) Two ways:
1. Set the default values of the parameters in the job sequencer and map these parameters to the job.
2. Run the job in the sequencer using the dsjob utility, where we can specify the value to be taken for each parameter.
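A hedged example of the second option (the project, sequence and parameter names are illustrative; -param can be repeated for each parameter the sequence expects):

# Run the sequence overnight with explicit parameter values and wait for its status
dsjob -run -param ProcessDate=2005-03-28 -param SourceDir=/incoming -jobstatus MyProject seqNightlyLoad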

71) What is the order of execution done internally in the Transformer, with the stage editor having input links on the left-hand side and output links on the right?

A) Stage variables, constraints and column derivation or expressions.

72) Differentiate Database data and Data warehouse data?

A) Data in a Database is a) Detailed or Transactional b) Both Readable and Writable. c) Current.

By Database, one means OLTP (On Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data.

73) Dimension Modelling types along with their significance

A) Data modelling is broadly classified into 2 types: a) E-R diagrams (Entity-Relationship) and b) Dimensional modelling.

Data modelling:
1) E-R diagrams
2) Dimensional modelling
   2.a) Logical modelling
   2.b) Physical modelling

74) What is the flow of loading data into fact & dimensional tables?

A) Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional table. Consists of fields with numeric values. Dimension table - Table with Unique Primary Key.

Load - Data should be first loaded into dimensional table. Based on the primary key values in dimensional table, the data should be loaded into Fact table.

75) Orchestrate Vs Datastage Parallel Extender?

A) Orchestrate itself is an ETL tool with extensive parallel processing capabilities that runs on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate parallel processing capabilities. Ascential then acquired Orchestrate (from Torrent Systems) and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.

76) Differentiate Primary Key and Partition Key?

A) A primary key is a combination of unique and not null. It can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key. There are several partitioning methods such as Hash, DB2, Random etc. When using hash partitioning we specify the partition key.

77) How do you execute datastage job from command line prompt?

A) Using "dsjob" command as follows. dsjob -run -jobstatus projectname jobname

78) What are Stage Variables, Derivations and Constants?

A) Stage Variable - An intermediate processing variable that retains its value during the read and does not pass the value into a target column.

Derivation - Expression that specifies value to be passed on to the target column.

Constraint - A condition that is either true or false and that controls the flow of data along a link.

79) What is the default cache size? How do you change the cache size if needed?

A) The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.

80) Compare and Contrast ODBC and Plug-In stages?

A) ODBC : a) Poor Performance. b) Can be used for Variety of Databases. c) Can handle Stored Procedures.

Plug-In: a) Good Performance. b) Database specific.(Only one database) c) Cannot handle Stored Procedures.

81) How to run a Shell Script within the scope of a Data stage job

A) By using "ExcecSH" command at Before/After job properties.

82) Types of Parallel Processing?

A) Parallel Processing is broadly classified into 2 types. a) SMP - Symmetrical Multi Processing. b) MPP - Massive Parallel Processing.


83) What does a Config File in parallel extender consist of?

A) Config file consists of the following. a) Number of Processes or Nodes. b) Actual Disk Storage Location.

84) Functionality of Link Partitioner and Link Collector?

A) Link Partitioner : It actually splits data into various partitions or data flows using various partition methods .

Link Collector : It collects the data coming from partitions, merges it into a single data flow and loads to target.

85) What is Modulus and Splitting in Dynamic Hashed File?

A) The modulus of a dynamic hashed file is the number of groups (buckets) it currently contains, and it changes as the volume of data changes. When the data grows beyond the split threshold, groups are split and the modulus increases (splitting); when data is removed, groups are merged and the modulus decreases.

86) Types of views in DataStage Director?

A) There are 3 types of views in DataStage Director: a) Job view - dates the jobs were compiled. b) Status view - the status of the job's last run. c) Log view - warning messages, event messages and program-generated messages.

87) What is ' insert for update ' in datastage

88) How can we pass parameters to job by using file.

A) You can do this by reading the parameters from a UNIX file in a shell script and then calling the execution of the DataStage job with those values; the job has the parameters defined, and they are passed in by the script.

6.

I'm evaluating DataStage from Ascential and PowerCenter from Informatica. They seem to have pretty similar functionality; can anyone tell me what the main differences between them are?

    Ask The Experts, published in DMReview.com, November 20, 2001


 

 By Chuck Kelley and Les Barbusinski and Joyce Bischoff

Q:

I'm evaluating DataStage from Ascential and PowerCenter from Informatica. They seem to have pretty similar functionality; can anyone tell me what the main differences between them are?

A:

Chuck Kelley’s Answer: You are right, they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday’s file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

Les Barbusinski’s Answer: Without getting into specifics, here are some differences you may want to explore with each vendor:

Does the tool use a relational or a proprietary database to store its meta data and scripts? If proprietary, why?

What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?

Can the tool’s meta data be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?

How well does each tool handle complex transformations, and how much external scripting is required?

What kinds of languages are supported for ETL script extensions?

Almost any ETL tool will look like any other on the surface. The trick is to find out which one will work best in your environment. The best way I’ve found to make this determination is to ascertain how successful each vendor’s clients have been using their product. Especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source systems, data volumes and transformation complexity.

Ask both vendors for a list of their customers with characteristics similar to your own that have used their ETL product for at least a year. Then interview each client (preferably several people at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the tool that have been encountered by that customer. Ultimately, ask each customer – if they had it all to do over again – whether or not they’d choose the same tool and why? You might be surprised at some of the answers.

Joyce Bischoff’s Answer: You should do a careful research job when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the detailed requirements. There are numerous ETL products on the market and it seems that you are looking at only two of them. If you are unfamiliar with the many products available, you may refer to www.tdan.com, the Data Administration Newsletter, for product lists. If you ask the vendors, they will certainly be able to tell you which of their product’s features are


stronger than the other product. Ask both vendors and compare the answers, which may or may not be totally accurate. After you are very familiar with the products, call their references and be sure to talk with technical people who are actually using the product. You will not want the vendor to have a representative present when you speak with someone at the reference site. It is also not a good idea to depend upon a high-level manager at the reference site for a reliable opinion of the product. Managers may paint a very rosy picture of any selected product so that they do not look like they selected an inferior product.

...............................................................................

7.

DataStage PX questions

1. What is the difference between server jobs & parallel jobs?
2. What is Orchestrate?
3. Orchestrate vs DataStage Parallel Extender?
4. What are the types of parallelism?
5. Is pipeline parallelism in PX the same as what Inter-process (IPC) does in Server?
6. What partitioning methods are available in PX?
7. What is re-partitioning? When will re-partitioning actually occur?
8. What are OConv() and IConv() functions and where are they used? Can we use these functions in PX?
9. What does a configuration file in Parallel Extender consist of?
10. What is the difference between a file set and a data set?
11. Lookup stage: is it persistent or non-persistent? (What is happening behind the scene?)
12. How can we maintain the partitioning in the Sort stage?
13. Where do we need partitioning (in processing or somewhere else)?
14. If we use SAME partitioning in the first stage, which partitioning method will it take?
15. What symbol do we get on the link when we use the round robin partitioning method?
16. If we check preserve partitioning in one stage and don't give any partitioning method (Auto) in the next stage, which partitioning method will it use?
17. Can we give node allocations, i.e. 4 nodes for one stage and 3 nodes for the next stage?
18. What is combinability / non-combinability?
19. What are schema files?
20. Why do we need datasets rather than sequential files?
21. Does the Lookup stage return multiple rows or a single row?
22. Why do we need the Sort stage when there is a sort-merge collecting method and a perform-sort option in a stage's advanced properties?
23. For the Surrogate Key Generator stage, where will the next value be stored?
24. In the Surrogate Key Generator stage, how is the number generated (based on nodes or based on rows)?
25. What is preserve partitioning in the Advanced tab?
26. What is the difference between stages and operators?
27. Why do we need Filter, Copy and Column Export stages instead of the Transformer stage?
28. Describe the types of Transformers used in DataStage PX for processing, and their uses.
29. What is the aggregate cache in aggregator transformation?
30. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
31. How do you rename all of the jobs to support your new file-naming conventions?
32. How do you merge two files in DS?
33. How did you handle an 'Aborted' sequencer?
34. What performance tunings did you do in your last project to increase the performance of slowly running jobs?
35. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
36. What is Full load & Incremental or Refresh load?
37. Describe cleanup resources and clear status file.
38. What is the Lookup stage? Can you define derivations in the Lookup stage output?
39. What is the Copy stage? When do you use it?
40. What is the Change Capture stage? Which execution mode would you use when it is used for comparison of data?
41. What is the Data Set stage?
42. How do you drop a dataset?
43. How do you eliminate duplicates in a dataset?
44. What is the Peek stage? When do you use it?
45. What are the different join options available in the Join stage?
46. What is the difference between the Lookup, Merge & Join stages?
47. What is RCP? How is it implemented?
48. What is the Row Generator? When do you use it?
49. How do you extract data from more than one heterogeneous source?
50. How can we pass parameters to a job by using a file?

8.

1) Lookup Stage: Is it persistent or non-persistent? (What is happening behind the scene?)

Ans: Look up stage is non-persistent

2) Is pipeline parallelism in PX the same as what the Inter-process (IPC) stage does in Server?

Ans: Yes and no. The IPC stage buffers data so that the next process (or the next stage in the same process) can pick it up.

Pipeline parallelism in parallel jobs is much more complete. Do you understand the relationship between stages and Orchestrate operators? Essentially each stage generates an operator. These (assuming that they don't combine into single processes) can form a pipeline so that, if you examine the generated OSH, it might have the form

Code:

Op1 < DataSet1 | op2 | op3 | op4 | op5 | op6 > DataSet2

Very slick, very fast.

3) How can we maintain the partitioning in Sort stage?

4) Where do we need partitioning (in processing or somewhere else)?

5) If we use SAME partitioning in the first stage, which partitioning method will it take?

6) What symbol do we get on the link when we use the round robin partitioning method?

7) If we check preserve partitioning in one stage and don't give any partitioning method in the next stage, which partitioning method will it use?

8) What is orchestrate?

Ans: Orchestrate was a product from Torrent before being bought by Ascential. Orchestrate provides the OSH framework, which has a UNIX command-line interface.

9) Can we give node allocations i.e. for one stage 4 nodes and for next stage 3 nodes?

10) What is combinability, non-combinability?

11) What are schema files?

12) Why we need datasets rather than sequential files?

Ans: A sequential file as a source or target needs to be repartitioned as it is (as the name suggests) a single sequential stream of data. A dataset can be saved across nodes using the partitioning method selected so it is always faster when used as a source or target.

13) Is look-up stage returns multi-rows or single rows?

14) Why we need sort stage other than sort-merge collective method and perform sort option in the stage advanced properties?

15) For surrogate key generator stage where will be the next value stored?

16) When actually re-partition will occur?


17) In the Transformer stage can we give constraints?
Ans: Yes, we can.

18) What is a constraint in the Advanced tab?

19) What is the diff between Range and Range Map partitioning?

9.

DataStage PX questions

51.What is difference between server jobs & parallel jobs?

--- Server generates DataStage BASIC, parallel generates Orchestrate shell script (osh) and C++, mainframe generates COBOL and JCL.

In server and mainframe you tend to do most of the work in Transformer stage. In parallel you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many more stage types for parallel than server or mainframe, and parallel stages correspond to Orchestrate operators.

Finally, of course, there's the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.

52.What is orchestrate?

--- Orchestrate is the old name of the underlying parallel execution engine. Ascential re-named the technology "Parallel Extender".

DataStage PX GUI generates OSH (Orchestrate Shell) scripts for the jobs you run. An OSH script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form, it is:

osh "op < in.ds > out.ds", where op is an Orchestrate operator, in.ds the input dataset, and out.ds the output dataset.

53.Orchestrate Vs DataStage Parallel Extender?

54.What are the types of Parallelism?

--- There are 2 types of parallel processing:

a. Pipeline parallelism –
It is the ability of a downstream stage to begin processing a row as soon as an upstream stage has finished processing that row (rather than processing one row completely through the job before beginning the next row). In parallel jobs it is managed automatically. For example, consider a job (source -> Transformer -> target) running on a system having three processors:

--- The source stage starts running on one processor, reads the data from the source and starts filling a pipeline with the read data.

--- Simultaneously, the Transformer stage starts running on another processor, processes the data in the pipeline and starts filling another pipeline.
--- At the same time, the target stage starts running on another processor and writes data to the target as soon as the data is available.

b. Partitioning Parallelism – Partitioning parallelism means that entire record set is partitioned into small sets and processed on different nodes. That is, several processors can run the same job simultaneously, each handling a separate subset of the total data.

For example if there are 100 records, then if there are 4 logical nodes then each node would process 25 records each. This enhances the speed at which loading takes place.

55. Is pipeline parallelism in PX the same as what Inter-process does in Server?
--- Yes. The IPC stage is a stage which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. As soon as data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read.

Note:- Link partitioner and Link collector stages can be used to achieve a certain degree of partitioning parallelism.

56.What are partitioning methods available in PX?

The partitioning methods available in PX are:

1. Auto:
--- It chooses the best partitioning method depending on: the mode of execution of the current stage and the preceding stage, and the number of nodes available in the configuration file.

2. Round robin:
--- Here, the first record goes to the first processing node, the second to the second processing node, and so on. This method is useful for resizing partitions of an input dataset that are not equal in size into approximately equal-sized partitions. DataStage uses 'Round robin' when it partitions the data initially.

3. Same:
--- It implements the same partitioning method as the one used by the preceding stage. The records stay on the same processing node; that is, data is not redistributed or repartitioned. Same is considered the fastest partitioning method.


Data Stage uses ‘Same’ when passing data between stages in a job.

4. Random:--- It distributes the records randomly across all processing nodes and guarantees that each processing node receives approximately equal-sized partitions.

5. Entire:--- It distributes the complete dataset as input to every instance of a stage on every processing node. It is mostly used with stages that create lookup tables for their input.

6. Hash:--- It distributes all the records with identical key values to the same processing node so as to ensure that related records are in the same partition. This does not necessarily mean that the partitions will be equal in size. --- When Hash Partitioning, hashing keys that create a large number of partitions should be selected. Reason: For example, if you hash partition a dataset based on a zip code field, where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.

7. Modulus:--- Partitioning is based on a key column modulo the number of partitions. The modulus partitioner assigns each record of an input dataset to a partition of its output dataset as determined by a specified key field in the input dataset.

8. Range:--- It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specific range. It guarantees that all records with same partitioning key values are assigned to the same partition.

Note: In order to use a Range partitioner, a range map has to be made using the ‘Write range map’ stage.

9. DB2:--- Partitions an input dataset in the same way that DB2 would partition it. For example, if this method is used to partition an input dataset containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.

57.What is Re-Partitioning? When actually re-partition will occur?

--- Re-Partitioning is the rearranging of data among the partitions. In a job, the Parallel-to-Parallel flow results in Re-Partitioning.


For example, consider the EMP data that is initially processed based on SAL, but now you want to process the data grouped by DEPTNO. Then you will need to Repartition to ensure that all the employees falling under the same DEPTNO are in the same group.

58. What are Iconv() and Oconv() functions and where are they used? Can we use these functions in PX?

--- 'Iconv()' converts a string to an internal storage format.
Syntax: Iconv(string, code[@VM code]...)
- string evaluates to the string to be converted.
- code indicates the conversion code, which specifies how the data needs to be formatted for output or internal storage. For example:
  MCA – extracts alphabetic characters from a field.
  MCN – extracts numeric characters from a field.
  MCL – converts uppercase letters to lowercase.

--- 'Oconv()' converts an expression to an output format.
Syntax: Oconv(expression, conversion[@VM conversion]...)
- expression is a string stored in internal format that needs to be converted to output format.
- conversion indicates the conversion code, which specifies how the string needs to be formatted, using the same codes as above (MCA, MCN, MCL, ...).

These functions can't be used directly in PX. The only stage which allows the usage of Iconv() and Oconv() in PX is the BASIC Transformer stage. It gives access to the functions supported by the DataStage Server engine.
Note: the BASIC Transformer can be used only on SMP systems, not on MPP or cluster systems.
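A few illustrative expressions, using only the conversion codes listed above plus the standard D (date) code; the results shown are typical and should be verified in a Server job or BASIC Transformer:

Code:

Oconv("Release 8", "MCA")                    -> "Release"   (alphabetic characters only)
Oconv("Release 8", "MCN")                    -> "8"         (numeric characters only)
Oconv("HELLO", "MCL")                        -> "hello"
Iconv("31 DEC 2005", "D")                    -> internal day number
Oconv(Iconv("31 DEC 2005", "D"), "D4/YMD")   -> "2005/12/31"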

59.What does a Configuration File in parallel extender consist of?

The Configuration File consists of all the processing nodes of a parallel system. It can be defined and edited using the DataStage Manager, and it describes every processing node that DataStage uses to run an application. When you run a job, DataStage first reads the configuration file to determine the available nodes.

When a system is modified by adding or removing processing nodes or by reconfiguring nodes, the DataStage jobs need not be altered or even recompiled; editing the configuration file alone will suffice.

The configuration file also gives control over the parallelization of a job during the development cycle. For example, by editing the configuration file, a job can first be run on a single processing node, then on two nodes, then four, and so on.
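As a sketch, a minimal two-node configuration file looks like the following; the host name and directory paths are placeholders and will differ per site, and the file actually used is the one pointed to by the APT_CONFIG_FILE environment variable:

Code:

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/ds/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/ds/node2" {pools ""}
  }
}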


60.What is difference between file set and data set?

Dataset: Datasets are operating system files, each referred to by a control file, which has the suffix .ds. PX jobs use datasets to manage data within a job. The data in a dataset is stored in internal format. A dataset consists of two parts:
- Descriptor file: it contains the metadata and the data location.
- Data files: they contain the data.
-- The Data Set stage is used to read data from or write data to a dataset. It allows you to store data in persistent form, which can then be used by other jobs.

Fileset: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file with the extension .fs. The data files and the file that lists them are called a 'fileset'. A fileset consists of two parts:
- Descriptor file: it contains the location of the raw data files and the metadata.
- Individual raw data files.
-- The File Set stage is used to read data from or write data to a fileset.

61. Lookup stage: is it persistent or non-persistent? (What is happening behind the scene?)
--- The Lookup stage is non-persistent; the lookup table is built in memory for the duration of the job run.

62.How can we maintain the partitioning in Sort stage?-- Partitioning in Sort stage can be maintained using the Partitioning method, ‘Same’. For example, assume you sort a dataset on a system with four processing nodes and store the results to a Dataset stage. The dataset will therefore have four partitions. You then use that dataset as input to a stage executing on a different number of nodes. DataStage automatically repartitions the dataset to spread it out to all the processing nodes. This destroys the sort order of the data.This can be avoided by specifying the Same Partitioning method so that the original partitions are preserved.

63.Where we need partitioning (In processing or some where)--- Partitioning is needed in processing. It means we need Partitioning where we have huge volumes of data to process.

64.If we use SAME partitioning in the first stage which partitioning method it will take?--- As Auto is the default partitioning method, it is taken as the partitioning method.

65.What is the symbol we will get when we are using round robin partitioning method?-- BOW TIE.


Given below is the list of icons that appear on a link, based on the mode of execution (parallel or sequential) of the current stage and the preceding stage, and the type of partitioning method:

Preceding stage      Icon               Current stage
Sequential mode  --  (FAN OUT)       -- Parallel mode    (indicates partitioning)
Parallel mode    --  (FAN IN)        -- Sequential mode  (indicates collecting)
Parallel mode    --  (BOX)           -- Parallel mode    (indicates Auto method)
Parallel mode    --  (BOW TIE)       -- Parallel mode    (indicates repartitioning)
Parallel mode    --  (PARALLEL LINES)-- Parallel mode    (indicates Same partitioning)

66. If we check preserve partitioning in one stage and don't give any partitioning method (Auto) in the next stage, which partitioning method will it use?

-- In this case, the partitioning method used by the preceding stage is used.
-- Preserve Partitioning indicates whether the stage wants to preserve the partitioning at the next stage of the job. The options in this tab are:
   Set – sets the preserve partitioning flag.
   Clear – clears the preserve partitioning flag.
   Propagate – sets the flag to Set or Clear depending on the option selected in the previous stage.

67. Can we give node allocations i.e. for one stage 4 nodes and for next stage 3 nodes?--- Generally all the processing nodes for a project are defined in the Configuration file. So, the node allocation is common for the project as a whole but not for individual stages in a job. It means that node allocation is project specific.

68. What is combinability / non-combinability?
--- Using combinability, DataStage combines the operators that underlie parallel stages so that they run in the same process. It lets the DataStage compiler potentially 'optimize' the number of processes used at runtime by combining operators. This saves a significant amount of data copying and preparation in passing data between operators. It has three options:
   Auto: use the default combination method.
   Combinable: combine all possible operators.
   Don't combine: never combine operators.
Usually this setting is left at its default so that DataStage can tune jobs for performance automatically.

69. What are schema files?
-- A 'schema file' is a plain text file in which the metadata for a stage is specified.
--- It is stored outside the DataStage Repository, in a document management system or a source code control system. In contrast, table definitions are stored in the DataStage Repository and can be loaded into stages as and when required.
--- A schema is an alternative way to specify column definitions for the data used by parallel jobs. By default most parallel job stages take their metadata from the Columns tab. For some stages you can specify a property that causes the stage to take its metadata from the specified schema file.
--- A schema consists of a record definition. The following is an example of a record schema:

record(
  name: string[];
  address: nullable string[];
  date: date[];
)

--- Import any table definition, load it into a parallel job, then choose "Show Schema". The schema defines the columns and their characteristics in a data set, and may also contain information about key columns.

70.Why we need datasets rather than sequential files?

--- A sequential file as the source or target needs to be repartitioned as it is (as the name suggests) a single sequential stream of data. A dataset can be saved across nodes using the partitioning method selected, so it is always faster when used as a source or target. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other DataStage jobs. Datasets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using datasets wisely can be key to good performance in a set of linked jobs.

21. Does the Lookup stage return multiple rows or single rows?
--- The Lookup stage returns the rows related to the key values. It can return multiple rows, depending on the keys you specify for the lookup in the stage.

22. Why do we need the Sort stage when there is a sort-merge collecting method and a perform-sort option in the stage's advanced properties?
--- The Sort stage is used to perform more complex sort operations, which are not possible using a stage's Advanced tab properties. Many stages have an optional sort function via the Partitioning tab. This means that if you are partitioning your data in a stage you can define the sort at the same time. The Sort stage is for use when you don't have any stage doing partitioning in your job but you still want to sort your data, or if you want to sort your data in descending order, or if you want to use one of the Sort stage options such as "Allow Duplicates" or "Stable Sort". If you are processing very large volumes and need to sort, you will find the Sort stage is more flexible than the partition-tab sort.

23. For surrogate key generator stage where will be the next value stored?

24. In the Surrogate Key Generator stage, how does it generate the number? (Based on nodes or based on rows?)
--- Key values are generated based on the nodes, and the input data partitions should be perfectly balanced across the nodes. This can be achieved using the round robin partitioning method when your starting point is sequential.

25. What is the preserve partitioning flag in the Advanced tab?

Page 38: Data Stage

--- It indicates whether the stage wants to preserve partitioning at the next stage of the job. There are three options: 1. Set, 2. Clear, 3. Propagate.

Set. Sets the preserve partitioning flag; this indicates to the next stage in the job that it should preserve existing partitioning if possible.

Clear. Clears the preserve partitioning flag; this indicates that this stage doesn't care which partitioning method the next stage uses.

Propagate. Sets the flag to Set or Clear depending on what the previous stage of the job has set (or, if that is set to Propagate, the stage before that, and so on, until a Set or Clear preserve-partitioning flag is encountered).

26. What is the difference between stages and operators?

--- Stages are the generic user interface through which we read from and write to files and databases, troubleshoot and develop jobs; they are also capable of processing data. The different types of stages are:

Database. These are stages that read or write data contained in a database. Examples of database stages are the Oracle Enterprise and DB2/UDB Enterprise stages.

Development/Debug. These are stages that help you when you are developing and troubleshooting parallel jobs. Examples are the Peek and Row Generator stages.

File. These are stages that read or write data contained in a file or set of files. Examples of file stages are the Sequential File and Data Set stages.

Processing. These are stages that perform some processing on the data that is passing through them. Examples of processing stages are the Aggregator and Transformer stages.

Real Time. These are the stages that allow Parallel jobs to be made available as RTI services. They comprise the RTI Source and RTI Target stages. These are part of the optional Web Services package.

Restructure. These are stages that deal with and manipulate data containing columns of complex data type. Examples are Make Sub record and Make Vector stages.

--- Operators are the basic functional units of an Orchestrate application. In the Orchestrate framework, each DataStage stage generates an Orchestrate operator directly.

27. Why we need filter, copy and column export stages instead of transformer stage?


--- In parallel jobs we have specific stage types for performing specialized tasks. Filter, copy, column export stages are operator stages. These operators are the basic functional units of an orchestrate application. The operators in your Orchestrate application pass data records from one operator to the next, in pipeline fashion. For example, the operators in an application step might start with an import operator, which reads data from a file and converts it to an Orchestrate data set. Subsequent operators in the sequence could perform various processing and analysis tasks. The processing power of Orchestrate derives largely from its ability to execute operators in parallel on multiple processing nodes. By default, Orchestrate operators execute on all processing nodes in your system. Orchestrate dynamically scales your application up or down in response to system configuration changes, without requiring you to modify your application. Thus using operator stages will increase the speed of data processing applications rather than using transformer stages.

28. Describe the types of Transformers used in DataStage PX for processing and uses?

- Transformer
- BASIC Transformer

Transformer: The Transformer stage is a processing stage. Transformer stages allow you to create transformations to apply to your data. These transformations can be simple or complex and can be applied to individual columns in your data. Transformations are specified using a set of functions. Transformer stages can have a single input and any number of outputs. They can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or expression evaluation failure.

Basic Transformer:- The BASIC Transformer stage is a also a processing stage. It is similar in appearance and function to the Transformer stage in parallel jobs. It gives access to BASIC transforms and functions (BASIC is the language supported by the datastage server engine and available in server jobs). BASIC Transformer stage can have a single input and any number of outputs.

29. What is aggregate cache in aggregator transformation?

--- Aggregate cache is the memory used for grouping operations by the aggregator stage.

30. What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run job?

--- Use wait for file activity stage between job activity stages in job sequencer.

31. How do you rename all of the jobs to support your new File-naming conventions?


--- Create a file with new and old names. Export the whole project as a dsx. Write a script, which can do a simple rename of the strings looking up the file. Then import the new dsx file and recompile all jobs. Be cautious that the name of the jobs has also been changed in your job control jobs or Sequencer jobs. So you have to make the necessary changes to these Sequencers.
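A rough sketch of the scripted rename step described above, assuming the old and new job names are kept as two columns in a file called job_renames.txt and the project was exported to project.dsx (all file names are illustrative). Note that a blind global substitution will also touch unrelated occurrences of the old string, so the rename list must use names that do not appear elsewhere:

Code:

#!/bin/sh
# Apply every "old new" pair in job_renames.txt to the exported project
cp project.dsx project_renamed.dsx
while read OLD NEW
do
  sed "s/$OLD/$NEW/g" project_renamed.dsx > tmp.dsx && mv tmp.dsx project_renamed.dsx
done < job_renames.txt
# project_renamed.dsx is then imported back and all jobs recompiled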

32. How do you merge two files in DS?

--- We can merge two files in 3 different ways. Either go for Merge stage, Join Stage or Lookup Stage All these merge or join occurs based on the key values. The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input of key columns.

33. How did you handle an 'Aborted' sequencer?

--- By using checkpoint information we can restart the sequence from the point of failure. If you enabled checkpoint information, reset the aborted job and run the sequence again.

34. What are Performance tunings you have done in your last project to increase the performance of slowly running jobs?

1. Use the Data Set stage instead of sequential files wherever necessary.
2. Use the Join stage instead of the Lookup stage when the data is huge.
3. Use operator stages like Remove Duplicates, Filter, Copy etc. instead of the Transformer stage.
4. Sort the data before sending it to the Change Capture stage or Remove Duplicates stage.
5. The key column should be hash partitioned and the data sorted before an aggregate operation.
6. Filter unwanted records at the beginning of the job flow itself.

35. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

--- It will result in false output even though job runs successfully. In aggregator key value should be hash partitioned so that identical key values will be in the same node, which gives the desired result and also eases grouping operation.

36. What is Full load & Incremental or Refresh load?

37. Describe cleanup resources and clear status file.
The Cleanup Resources command is used to:
• View and end job processes
• View and release the associated locks
The Cleanup Resources command is available in the Director under the Job menu.
The Clear Status File command resets the status records associated with all stages in that job.

38. What is the Lookup stage? Can you define derivations in the Lookup stage output?
The Lookup stage is used to perform lookup operations on a data set read into memory from any other parallel job stage that can output data. It can also perform lookups


directly in a DB2 or Oracle database, or in a lookup table contained in a Lookup File Set stage.

The most common use for a lookup is to map short codes in the input data set onto expanded information from a lookup table, which is then joined to the incoming data and output. Lookups can also be used for validation of a row: if there is no corresponding entry in a lookup table for the key's values, the row is rejected.

The Lookup stage can have a reference link, a single input link, a single output link, and a single rejects link. Depending upon the type and setting of the stage(s) providing the lookup information, it can have multiple reference links (where it is directly looking up a DB2 table or Oracle table, it can only have a single reference link). A lot of the setting up of a lookup operation takes place on the stage providing the lookup table.

The input link carries the data from the source data set and is known as the primary link. For each record of the source data set from the primary link, the Lookup stage performs a table lookup on each of the lookup tables attached by reference links. The table lookup is based on the values of a set of lookup key columns, one set for each table. The keys are defined on the Lookup stage. Lookup stages do not require data on the input link or reference links to be sorted.

Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records, where corresponding source and lookup records have the same value for the lookup key columns. The lookup key columns do not have to have the same names in the primary and the reference links. The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.

There are some special partitioning considerations for Lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method. Another way is to partition it in the same way as the input data (although this implies sorting of the data).

Yes, you can define derivations in the Lookup stage output.

39. What is the Copy stage? When do you use it?
The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop columns or change the column order. The Copy stage is useful when we want to make a backup copy of a data set on disk while performing an operation on another copy. A Copy stage with a single input and a single output needs Force set to TRUE. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.

40. What is the Change Capture stage? Which execution mode would you use when you use it for comparison of data?
The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set's table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit.
The preserve-partitioning flag is set on the change data set. The compare is based on a set of key columns; rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values: if two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other.


The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage. We can use both sequential and parallel modes of execution for the Change Capture stage.

41. What is the Data Set stage?
The Data Set stage is a file stage. It allows you to read data from or write data to a data set. It can be configured to execute in parallel or sequential mode. DataStage Parallel Extender jobs use data sets to manage data within a job. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Designer, Manager, or Director.

42. How do you drop a dataset?
There are two ways of dropping a data set: the first is by using the Data Set Management utility (GUI) located in the Manager, Director and Designer, and the second is by using the UNIX command-line utility orchadmin (see the sketch after question 45 below).

43. How do you eliminate duplicates in a dataset?
The simplest way to remove duplicates is by the use of the Remove Duplicates stage. The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

44. What is the Peek stage? When do you use it?
The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links. The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. Like the Head stage and the Tail stage, the Peek stage can be helpful for monitoring the progress of your application or to diagnose a bug in your application.

45. What are the different join options available in the Join stage?
There are four join options available in the Join stage:
Inner: Transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.
Left Outer: Transfers all values from the left data set, but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.
Right Outer: Transfers all values from the right data set, and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.
Full Outer: Transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)
The default is inner.
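For question 42, a command-line sketch of the orchadmin route (the dataset name is a placeholder; orchadmin reads the same configuration file as the jobs, so the DataStage environment and APT_CONFIG_FILE must be set, and exact sub-command names can vary slightly between versions):

Code:

orchadmin rm /data/ds/customers.ds

Unlike a plain rm of the .ds control file, this removes the descriptor file and the data files it points to on every node.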

46. What is the difference between the Lookup, Merge & Join stages?
These three stages combine two or more input links according to the values of user-designated "key" column(s). They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values


– Input requirements (sorted, de-duplicated)

-- The main difference between join and lookup is in the way they handle the data and the reject links. In join, no reject links are possible, so we cannot get the rejected records directly. Lookup provides a reject link. Also, lookup is used if the data being looked up can fit in the available temporary memory; if the volume of data is quite huge, then it is safer to go for join.
-- Join requires the input datasets to be key-partitioned and sorted. Lookup does not have this requirement.
-- Lookup allows reject links; Join does not. If the volume of data is too huge to fit into memory, you go for join and avoid lookup, as paging can occur when lookup is used.
-- The Merge stage allows us to capture failed lookups from each reference input separately. It also requires identically sorted and partitioned inputs and, if there is more than one reference input, de-duplicated reference inputs. In the case of the Merge stage, as a pre-processing step duplicates should be removed from the master dataset; if there is more than one update dataset then duplicates should be removed from the update datasets as well. This step is not required for the Join and Lookup stages.

47. What is RCP? How is it implemented?
DataStage is flexible about metadata. It can cope with the situation where metadata isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the metadata when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). This can be enabled for a project via the DataStage Administrator, and set for individual links via the Outputs page Columns tab for most stages, or in the Outputs page General tab for Transformer stages. You should always ensure that runtime column propagation is turned on.
RCP is implemented through a schema file. The schema file is a plain text file containing a record (or row) definition.

48. What is the Row Generator? When do you use it?
The Row Generator stage is a Development/Debug stage. It has no input links and a single output link. The Row Generator stage produces a set of mock data fitting the specified metadata. This is useful where we want to test our job but have no real data available to process. Row Generator is also useful when we want processing stages to execute at least once in the absence of data from the source.

49. How do you extract data from more than one heterogeneous source?
We can extract data from different sources (i.e. Oracle, DB2, sequential files etc.) in a job. After getting the data we can use the Join, Merge, Aggregator or Lookup stage to unify the incoming data.

50. How can we pass parameters to a job by using a file?
This can be done through a shell script, where we read the different parameters from the file and call the dsjob command to execute the job with those values; the job has the corresponding parameters defined.
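A minimal sketch of such a wrapper script, assuming the file params.txt holds one NAME=VALUE pair per line and that the project and job names shown are placeholders:

Code:

#!/bin/sh
# Build a -param argument for every NAME=VALUE line in the file,
# then start the job with dsjob.
PARAMS=""
while read LINE
do
  PARAMS="$PARAMS -param $LINE"
done < params.txt

dsjob -run $PARAMS -jobstatus MyProject LoadCustomers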
